How to deal with text in R
By adhering to tidy data principles, data can be handled more efficiently and effectively, and processing text is no exception.
According to Hadley Wickham’s description (Wickham 2014), tidy data follows a particular structure:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
Thus, we define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
This one-token-per-row structure contrasts with the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix.
For tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, a sentence, or a paragraph.
The tidytext package provides the ability to tokenize by these commonly used units of text and to convert to a one-term-per-row format.
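As a minimal illustration of this one-token-per-row idea (using a made-up two-line snippet, not from the source), unnest_tokens() from tidytext performs the tokenization and returns one word per row:

```r
library(dplyr)
library(tidytext)

# A hypothetical data frame with one line of text per row
text_df <- tibble(line = 1:2,
                  text = c("It is a truth universally acknowledged,",
                           "that a single man in possession of a good fortune"))

# Tokenize: one word per row; words are lowercased and punctuation stripped
text_df %>%
  unnest_tokens(word, text)
```

Note that the line column is retained for each word, so information about where each token came from is not lost.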
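As a sketch of those other units, unnest_tokens() takes a token argument that selects the tokenizer (shown here on a made-up two-sentence string):

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = "Tidy text has one token per row. Tokens can be words or bigrams.")

# Bigrams: overlapping two-word sequences, one per row
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Sentences: one sentence per row
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")
```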
A standard set of “tidy” tools, such as dplyr (Wickham and Francois 2016), tidyr (Wickham 2016), ggplot2 (Wickham 2009), and broom (Robinson 2017), can be used to manipulate tidy data sets.
Users are able to switch between these packages with ease because the input and output are kept in tidy tables. We have found that many text analyses and explorations naturally make use of these handy tools.
However, the tidytext package does not require the user to keep text data in tidy form throughout an analysis. The package contains functions to tidy() objects from well-known text mining R packages such as tm (Feinerer, Hornik, and Meyer 2008) and quanteda (Benoit and Nulty 2016).
For more information, see the broom package (Robinson 2017, cited above). This enables a workflow in which dplyr and other tidy tools are used to import, filter, and process data.
The data is then cast into a document-term matrix for machine learning applications. The models can then be converted back into tidy form for interpretation and visualization with ggplot2.
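Casting back and forth can be sketched with tidytext's cast_dtm() and tidy() functions (a minimal illustration with made-up counts; cast_dtm() requires the tm package to be installed):

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts in tidy (one-term-per-document-per-row) form
word_counts <- tibble(document = c("a", "a", "b"),
                      term     = c("apple", "pear", "apple"),
                      count    = c(2L, 1L, 3L))

# Cast into a document-term matrix for machine learning packages
dtm <- word_counts %>% cast_dtm(document, term, count)

# tidy() converts the matrix back to a one-term-per-row table
tidy(dtm)
```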
Tidying the works of Jane Austen
library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
original_books
# A tibble: 73,422 × 4
   text                    book                linenumber chapter
   <chr>                   <fct>                    <int>   <int>
 1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
 2 ""                      Sense & Sensibility          2       0
 3 "by Jane Austen"        Sense & Sensibility          3       0
 4 ""                      Sense & Sensibility          4       0
 5 "(1811)"                Sense & Sensibility          5       0
 6 ""                      Sense & Sensibility          6       0
 7 ""                      Sense & Sensibility          7       0
 8 ""                      Sense & Sensibility          8       0
 9 ""                      Sense & Sensibility          9       0
10 "CHAPTER 1"             Sense & Sensibility         10       1
# ℹ 73,412 more rows
# ℹ Use `print(n = ...)` to see more rows
library(tidytext)

tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
# A tibble: 725,055 × 4
   book                linenumber chapter word
   <fct>                    <int>   <int> <chr>
 1 Sense & Sensibility          1       0 sense
 2 Sense & Sensibility          1       0 and
 3 Sense & Sensibility          1       0 sensibility
 4 Sense & Sensibility          3       0 by
 5 Sense & Sensibility          3       0 jane
 6 Sense & Sensibility          3       0 austen
 7 Sense & Sensibility          5       0 1811
 8 Sense & Sensibility         10       1 chapter
 9 Sense & Sensibility         10       1 1
10 Sense & Sensibility         13       1 the
# ℹ 725,045 more rows
# ℹ Use `print(n = ...)` to see more rows
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have done here, or filter() to use only one set of stop words if that makes more sense for a particular analysis.
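For example, to keep only the Snowball lexicon (a short sketch; tidy_books is the tokenized data frame from above):

```r
library(dplyr)
library(tidytext)

data(stop_words)

# The combined dataset tags each word with its source lexicon:
# "SMART", "snowball", or "onix"
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")

# Remove only the Snowball stop words from the tokenized text
# tidy_books %>% anti_join(snowball_stops, by = "word")
```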
Word frequencies
We can also use dplyr’s count() function to find the most frequently occurring words across all of the books.
tidy_books %>%
  count(word, sort = TRUE)
# A tibble: 13,914 × 2
   word       n
   <chr>  <int>
 1 miss    1855
 2 time    1337
 3 fanny    862
 4 dear     822
 5 lady     817
 6 sir      806
 7 day      797
 8 emma     787
 9 sister   727
10 house    699
# ℹ 13,904 more rows
# ℹ Use `print(n = ...)` to see more rows
Because we have been using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe the result directly to ggplot2, for example to visualize the most common words.
library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
library(gutenbergr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159))

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
tidy_hgwells %>%
  count(word, sort = TRUE)
#> # A tibble: 11,769 × 2
#>    word       n
#>    <chr>  <int>
#>  1 time     454
#>  2 people   302
#>  3 door     260
#>  4 heard    249
#>  5 black    232
#>  6 stood    229
#>  7 white    222
#>  8 hand     218
#>  9 kemp     213
#> 10 eyes     210
#> # … with 11,759 more rows