How to deal with text in R
By adhering to tidy data principles, data can be handled more efficiently and effectively, and processing text is no exception.
According to Hadley Wickham’s description (Wickham 2014), tidy data follows a particular structure:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
Thus, we define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
This one-token-per-row structure contrasts with the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix.
For tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, a sentence, or a paragraph.
The tidytext package provides the ability to tokenize by these commonly used units of text and to convert to a one-term-per-row format.
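As a minimal illustration of this one-token-per-row idea (using a made-up two-line snippet, not from the source), unnest_tokens() from tidytext performs the tokenization and returns one word per row:

```r
library(dplyr)
library(tidytext)

# A hypothetical data frame with one line of text per row
text_df <- tibble(line = 1:2,
                  text = c("It is a truth universally acknowledged,",
                           "that a single man in possession of a good fortune"))

# Tokenize: one word per row; words are lowercased and punctuation stripped
text_df %>%
  unnest_tokens(word, text)
```

Note that the line column is retained for each word, so information about where each token came from is not lost.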
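As a sketch of those other units, unnest_tokens() takes a token argument that selects the tokenizer (shown here on a made-up two-sentence string):

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = "Tidy text has one token per row. Tokens can be words or bigrams.")

# Bigrams: overlapping two-word sequences, one per row
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Sentences: one sentence per row
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")
```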
A standard set of “tidy” tools, such as dplyr (Wickham and Francois 2016), tidyr (Wickham 2016), ggplot2 (Wickham 2009), and broom (Robinson 2017), can be used to manipulate tidy data sets.
Users are able to switch between these packages with ease because the input and output are kept in tidy tables. We have found that many text analyses and explorations naturally make use of these handy tools.
However, the tidytext package does not require the user to keep text data in tidy form throughout an analysis. The package contains functions to tidy() objects from well-known text mining R packages such as tm (Feinerer, Hornik, and Meyer 2008) and quanteda (Benoit and Nulty 2016).
For more information, see the broom package (Robinson 2017, cited above). This enables a workflow in which dplyr and other tidy tools are used to import, filter, and process data.
The data is then cast into a document-term matrix for machine learning applications. The models can then be converted back into tidy form for interpretation and visualization with ggplot2.
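Casting back and forth can be sketched with tidytext's cast_dtm() and tidy() functions (a minimal illustration with made-up counts; cast_dtm() requires the tm package to be installed):

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts in tidy (one-term-per-document-per-row) form
word_counts <- tibble(document = c("a", "a", "b"),
                      term     = c("apple", "pear", "apple"),
                      count    = c(2L, 1L, 3L))

# Cast into a document-term matrix for machine learning packages
dtm <- word_counts %>% cast_dtm(document, term, count)

# tidy() converts the matrix back to a one-term-per-row table
tidy(dtm)
```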
Tidying the works of Jane Austen
library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
original_books
# A tibble: 73,422 × 4
   text                    book                linenumber chapter
   <chr>                   <fct>                    <int>   <int>
 1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
 2 ""                      Sense & Sensibility          2       0
 3 "by Jane Austen"        Sense & Sensibility          3       0
 4 ""                      Sense & Sensibility          4       0
 5 "(1811)"                Sense & Sensibility          5       0
 6 ""                      Sense & Sensibility          6       0
 7 ""                      Sense & Sensibility          7       0
 8 ""                      Sense & Sensibility          8       0
 9 ""                      Sense & Sensibility          9       0
10 "CHAPTER 1"             Sense & Sensibility         10       1
# ℹ 73,412 more rows
# ℹ Use `print(n = ...)` to see more rows
library(tidytext)

tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
# A tibble: 725,055 × 4
   book                linenumber chapter word
   <fct>                    <int>   <int> <chr>
 1 Sense & Sensibility          1       0 sense
 2 Sense & Sensibility          1       0 and
 3 Sense & Sensibility          1       0 sensibility
 4 Sense & Sensibility          3       0 by
 5 Sense & Sensibility          3       0 jane
 6 Sense & Sensibility          3       0 austen
 7 Sense & Sensibility          5       0 1811
 8 Sense & Sensibility         10       1 chapter
 9 Sense & Sensibility         10       1 1
10 Sense & Sensibility         13       1 the
# ℹ 725,045 more rows
# ℹ Use `print(n = ...)` to see more rows
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have done here, or filter() to use only one set of stop words if that makes more sense for a particular analysis.
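For example, to keep only the Snowball lexicon (a short sketch; tidy_books is the tokenized data frame from above):

```r
library(dplyr)
library(tidytext)

data(stop_words)

# The combined dataset tags each word with its source lexicon:
# "SMART", "snowball", or "onix"
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")

# Remove only the Snowball stop words from the tokenized text
# tidy_books %>% anti_join(snowball_stops, by = "word")
```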
Word frequencies
We can also use dplyr’s count() function to find the most frequently occurring words across all of the books.
tidy_books %>%
  count(word, sort = TRUE)
# A tibble: 13,914 × 2
   word       n
   <chr>  <int>
 1 miss    1855
 2 time    1337
 3 fanny    862
 4 dear     822
 5 lady     817
 6 sir      806
 7 day      797
 8 emma     787
 9 sister   727
10 house    699
# ℹ 13,904 more rows
# ℹ Use `print(n = ...)` to see more rows
Because we have been using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe the result directly to ggplot2, for example to visualize the most common words.
library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
library(gutenbergr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159))

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
tidy_hgwells %>%
  count(word, sort = TRUE)
#> # A tibble: 11,769 × 2
#>    word       n
#>    <chr>  <int>
#>  1 time     454
#>  2 people   302
#>  3 door     260
#>  4 heard    249
#>  5 black    232
#>  6 stood    229
#>  7 white    222
#>  8 hand     218
#>  9 kemp     213
#> 10 eyes     210
#> # … with 11,759 more rows