Extract Text from PDF in R and Word Detection

To extract text from a PDF in R, first install the pdftools package from CRAN.

install.packages("pdftools")
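Reinstalling the package on every run is not necessary. A minimal sketch, assuming a hypothetical helper called ensure_package (it is not part of base R or pdftools), installs a package only when it is missing:

```r
# ensure_package() is a hypothetical helper name, not part of base R or pdftools.
# It installs a package from CRAN only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE when the package can be loaded after the check above
  requireNamespace(pkg, quietly = TRUE)
}
```

Calling ensure_package("pdftools") at the top of a script would then replace the plain install.packages() call.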

Load the package

library("pdftools")

The PDF file can either be saved in a local directory or fetched from the web. Here we use a sample document that is available online.

Store the link in the pdf.file variable.

pdf.file <- "https://file-examples-com.github.io/uploads/2017/10/file-sample_150kB.pdf"

Set the working directory (adjust the path to a folder that exists on your machine).

setwd("D:/RStudio/PDFEXTRACT/")

Now download the demo PDF file into the working directory.

download.file(pdf.file, destfile = "sample.pdf", mode = "wb")
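If the script is run more than once, the file does not need to be downloaded again. The sketch below assumes a hypothetical helper named fetch_pdf (not a pdftools function) and uses tempdir() only as a stand-in for your own folder; it also builds the path with file.path() rather than depending on setwd().

```r
# fetch_pdf() is a hypothetical helper, not part of pdftools.
# It downloads `url` to `dest` only when `dest` does not exist yet,
# and returns the local path either way.
fetch_pdf <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary PDF intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# tempdir() is a stand-in here; replace it with your own project folder.
dest <- file.path(tempdir(), "sample.pdf")

# Example call (not run here because it needs network access):
# local.pdf <- fetch_pdf(pdf.file, dest)
```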

The pdf_text() function returns a character vector whose length equals the number of pages in the file.

Extract text from pdf in R

Now we can extract the text from all pages.

pdf.text <- pdftools::pdf_text("sample.pdf")
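To check that the result really holds one element per page, the length of the vector can be compared with the page count reported by pdf_info(). The sketch below is only an illustration: it builds a small two-page PDF with base R graphics as a stand-in, so it does not depend on the downloaded sample.

```r
library(pdftools)

# Build a small two-page PDF with base R graphics as a stand-in file.
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
plot.new(); text(0.5, 0.5, "first page")
plot.new(); text(0.5, 0.5, "second page")
invisible(dev.off())

pages <- pdf_text(tmp)

# One element per page: both counts should agree
length(pages)
pdf_info(tmp)$pages
```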

To display the text of the second page, use the code below.

cat(pdf.text[[2]])

Only part of the output is shown here.

In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam
est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat
et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis
tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque
scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam
lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
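Each page comes back as a single string in which the lines are separated by "\n". If you prefer to work line by line, the page can be split with strsplit(). The sketch below uses a short made-up vector named pages as a stand-in for the output of pdf_text():

```r
# Stand-in for the output of pdf_text(): one string per page,
# with "\n" separating the lines inside each page.
pages <- c(
  "Lorem ipsum\ndolor sit amet",
  "In non mauris justo.\nDuis vehicula mi vel mi pretium"
)

# Split the second page into individual lines
page2.lines <- strsplit(pages[2], "\n", fixed = TRUE)[[1]]

# Show the first line only
head(page2.lines, 1)
```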

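To search for a particular word in these pages, unlist the data and convert it to lower case so that the search does not depend on capitalization. (pdf_text() already returns a character vector, so unlist() leaves it unchanged here.)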

pdf.text <- unlist(pdf.text)
pdf.text <- tolower(pdf.text)
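Lowercasing overwrites the original capitalization. If you would rather keep the text as extracted, one alternative is a case-insensitive match with base R's grepl(). This is a sketch on a made-up stand-in vector named pages, not on the sample document:

```r
# Stand-in pages with mixed capitalization
pages <- c("Lorem ipsum", "Suspendisse potenti", "nulla SUSPENDISSE")

# ignore.case = TRUE matches regardless of capitalization,
# so the text itself does not have to be changed
hits <- grepl("suspendisse", pages, ignore.case = TRUE)
hits
```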

Suppose we want the page numbers on which the word “Suspendisse” appears.

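library(stringr)
# TRUE for every page whose text contains the word
res <- data.frame(str_detect(pdf.text, "suspendisse"))
colnames(res) <- "Result"
# keep only the matching pages; the row names are the page numbers
res <- subset(res, Result == TRUE)
row.names(res)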

Output

[1] "2" "3"

The word “suspendisse” appears on pages 2 and 3.
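A shorter route to the same page numbers is which(), and str_count() can additionally report how often the word occurs on each page. The sketch below runs on a made-up stand-in vector named pages, so the counts are illustrative and do not come from the sample document:

```r
library(stringr)

# Stand-in for the lower-cased page text
pages <- c("lorem ipsum",
           "suspendisse potenti. suspendisse",
           "nulla suspendisse")

# Page numbers (as integers) where the word occurs at least once
hit.pages <- which(str_detect(pages, "suspendisse"))

# Number of occurrences on every page
hit.counts <- str_count(pages, "suspendisse")

data.frame(page = seq_along(pages), count = hit.counts)
```

Unlike row.names(), which() returns the page numbers as integers, so they can be used directly for indexing.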

Conclusion

This article described how to extract text from PDF files in R and how to detect a particular word in the extracted text.
