Extract text from pdf in R and word Detection

Extract text from pdf in R, first we need to install pdftools package from cran.

Let’s install the pdftools package from cran.

install.packages("pdftools")

Load the package

library("pdftools")

The pdf file needs to save in local directory or get it from online. Here we are extracting one sample document from online.

Store the link in pdf.file variable.

pdf.file <- "https://file-examples-com.github.io/uploads/2017/10/file-sample_150kB.pdf"

Set the working directory

setwd("D:/RStudio/PDFEXTRACT/")

Let’s download the demo pdf file into the local directory

How to run R code in PyCharm? » R & PyCharm »

download.file(pdf.file, destfile = "sample.pdf", mode = "wb")

pdf_text() function, which returns a character vector of length equal to the number of pages in the file.

Extract text from pdf in R

Now we can extract the text from all pages.

pdf.text <- pdftools::pdf_text("sample.pdf")

Suppose if you want to display second page information then use below code,

cat(pdf.text[[2]]) 

Displayed only a few text here

In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam
est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat
et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis
tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque
scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam
lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.

Now if you want to extract a particular word from these pages, unlist the data and convert it into lower case letters

How to do t test statistical analysis in R, Assumptions and Inference

pdf.text<-unlist(pdf.text)
pdf.text<-tolower(pdf.text)

Suppose if we want to extract the page number details for the word contains “Suspendisse

library(stringr)
res<-data.frame(str_detect(pdf.text,"suspendisse"))
colnames(res)<-"Result"
res<-subset(res,res$Result==TRUE)
row.names(res)

Output

1] "2" "3"

The word “suspendisse” contains on pages number 2 and 3.

Conclusion

This article described text data extraction from pdf files and particular word detection from pdf data in R.

Data Analysis in R pdf tools & pdftk » Read, Merge, Split, Attach

Don’t forget to show your love, Subscribe the Newsletter and COMMENT below!
[newsletter_form type="minimal"]

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *

three + 19 =