String Manipulation in R

In this article, we’ll show you how to manipulate strings in the R programming language using several base R functions and the “stringr” package.

To begin, we’ll read text from a file into R so that we have some data for demonstrating the string operations.

data<-readLines("D:/RStudio/Binning/TextData.txt")
head(data)
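If you don't have the author's file, the same step can be reproduced with a temporary file. This is a minimal sketch: the file contents and the names `tmp` and `sample_lines` are assumptions made for illustration, not part of the original data.

```r
# Write two sample lines to a temporary file (a stand-in for TextData.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Data science is an interdisciplinary field.",
             "Data science is related to data mining."), tmp)

# readLines returns a character vector with one element per line
sample_lines <- readLines(tmp)
length(sample_lines)
sample_lines[1]

# Remove the temporary file again
unlink(tmp)
```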

The “data” variable now holds a character vector with five elements, one for each of the five lines of the document.

The first of those lines is shown here.

[1] "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data." 

You can use the “nchar” function to count the number of characters in a string by passing the string in as an argument.

nchar(data[1])
[1] 362

Our vector’s first element is a 362-character string, as you can see.
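“nchar” is vectorised, so it also works on a whole character vector at once. The short sketch below uses a made-up vector called `words` (an assumption for illustration) to show that you get one count per element.

```r
# A small illustrative vector, including an empty string
words <- c("Data", "science", "")

# One count per element
counts <- nchar(words)
counts   # 4 7 0

# type = "bytes" counts bytes instead of characters; the two can differ
# for non-ASCII text
nchar("caf\u00e9", type = "chars")   # 4
nchar("caf\u00e9", type = "bytes")
```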

The “toupper” function can be used to convert all the characters in a string to upper case.

toupper(data[1])

You can see how that appears here.

[1] "DATA SCIENCE IS AN INTERDISCIPLINARY FIELD THAT USES SCIENTIFIC METHODS, PROCESSES, ALGORITHMS AND SYSTEMS TO EXTRACT KNOWLEDGE AND INSIGHTS FROM NOISY, STRUCTURED AND UNSTRUCTURED DATA,[1][2] AND APPLY KNOWLEDGE AND ACTIONABLE INSIGHTS FROM DATA ACROSS A BROAD RANGE OF APPLICATION DOMAINS. DATA SCIENCE IS RELATED TO DATA MINING, MACHINE LEARNING AND BIG DATA."

Similarly, you can use the “tolower” function if you’d like to change all the string’s characters to lower case.

tolower(data[1])
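Both case functions work element by element on a vector, which makes them handy for case-insensitive comparisons. A small sketch, using an illustrative vector `x` that is not part of the original data:

```r
x <- c("Data Science", "big data")

upper <- toupper(x)   # "DATA SCIENCE" "BIG DATA"
lower <- tolower(x)   # "data science" "big data"

# Normalise both sides before comparing to ignore case
same_word <- tolower("Data") == tolower("DATA")
same_word             # TRUE
```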

The “chartr” function can be used to replace a certain set of characters in a string.

chartr(" ","-",data[1])

The first argument is a string containing the characters that should be replaced. The second argument is a string containing the replacement characters, matched to the first by position.

The last argument is the string upon which the operation should be applied. You can see how the function replaced every space character with a hyphen in the output.

[1] "Data-science-is-an-interdisciplinary-field-that-uses-scientific-methods,-processes,-algorithms-and-systems-to-extract-knowledge-and-insights-from-noisy,-structured-and-unstructured-data,[1][2]-and-apply-knowledge-and-actionable-insights-from-data-across-a-broad-range-of-application-domains.-Data-science-is-related-to-data-mining,-machine-learning-and-big-data."
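“chartr” translates character by character, so it can map several characters in one call. The sketch below uses made-up input strings to show the positional mapping, and contrasts it with “gsub”, which replaces whole patterns instead.

```r
# Each character in the first argument maps to the character at the
# same position in the second: a -> x, b -> y, c -> z
translated <- chartr("abc", "xyz", "a cab back")
translated   # "x zxy yxzk"

# gsub replaces a whole pattern, and the replacement may differ in length
replaced <- gsub(" ", "--", "big data set", fixed = TRUE)
replaced     # "big--data--set"
```

Note that “chartr” raises an error if the first argument is longer than the second.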

The “strsplit” function allows you to split a string into pieces wherever a given expression matches.

The syntax looks like this.

list<-strsplit(data[1]," ")

The first argument is the string we want to split, and the second argument is the expression we want to split it on.

The space character is used to break up the string in this case. The result is a list, so we’ll need to use the “unlist” function to create a character vector.

list1<-unlist(list)
list1

Because each word in the original string was separated by a space character, you’ll note that the vector contains one element per word when you look at the output.

 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data."   
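Notice that punctuation stays attached to the words ("methods,", "data."). Because the split expression is treated as a regular expression by default, one way to drop it is to split on runs of spaces and commas. This is a sketch on a shortened, made-up sentence; the names `sentence`, `words` and `words_clean` are illustrative only.

```r
sentence <- "noisy, structured and unstructured data"

# strsplit returns a list with one element per input string
parts <- strsplit(sentence, " ")
words <- parts[[1]]   # same as unlist(parts) when there is one input string
words                 # "noisy," "structured" "and" "unstructured" "data"

# Split on one or more spaces or commas to drop the trailing comma
words_clean <- strsplit(sentence, "[ ,]+")[[1]]
words_clean           # "noisy" "structured" "and" "unstructured" "data"

# fixed = TRUE treats the separator literally, which matters for
# characters such as "." that are special in regular expressions
pieces <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
pieces                # "a" "b" "c"
```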

By feeding the “list1” vector we just produced into the “sort” function, we can sort it as well.

sorting<-sort(list1)

As a result, the elements will be sorted alphabetically.
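How upper- and lower-case words are ordered relative to each other depends on your locale. The sketch below, on a small made-up vector `w`, shows the default call alongside `method = "radix"`, which sorts in C-locale order and therefore gives the same result on every machine.

```r
w <- c("data", "Data", "and", "a")

# Default: the ordering of "data" vs "Data" depends on the current locale
sort(w)

# Radix sorting uses C-locale order, where capitals come first
radix_sorted <- sort(w, method = "radix")
radix_sorted   # "Data" "a" "and" "data"

# Reverse order
sort(w, decreasing = TRUE)
```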

The “paste” function can also be used to concatenate the elements of a character vector.

paste(sorting,collapse=" ")

The string value that will be used to separate the distinct elements is determined by the “collapse” option.

[1] "a across actionable algorithms an and and and and and and application apply big broad data data Data Data data,[1][2] data. domains. extract field from from insights insights interdisciplinary is is knowledge knowledge learning machine methods, mining, noisy, of processes, range related science science scientific structured systems that to to unstructured uses"

We’ll simply use a single space character to separate them in our situation. Our alphabetically sorted list is represented by a single string in this output.
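“paste” also has a “sep” argument, which is easy to confuse with “collapse”. As a rough illustration on a made-up vector `w`: “sep” goes between the arguments being pasted together, while “collapse” joins the resulting elements into one string.

```r
w <- c("big", "data")

# collapse joins the elements of one vector into a single string
joined <- paste(w, collapse = " ")
joined     # "big data"

# sep goes between arguments, element by element
suffixed <- paste(w, "set", sep = "_")
suffixed   # "big_set" "data_set"

# Both together: paste element-wise first, then collapse
both <- paste(w, "set", sep = "_", collapse = "; ")
both       # "big_set; data_set"
```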

The “substr” function can be used to isolate a specified portion of a string.

subs<-substr(data[1],start=3,stop=30)
subs

Pass the segment’s start and stop indices, and the function returns that contiguous section.

[1] "ta science is an interdiscip"

Depending on the indices you choose, a substring can begin or end with a space character, although this particular one does not.

We can get rid of such spaces by using the “trimws” function, which removes any whitespace from a string’s beginning and end.
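The article does not show a “trimws” call, so here is a minimal sketch. The input string and the indices are chosen for illustration so that the substring really does start and end with a space.

```r
# Characters 5 to 13 of this string include the surrounding spaces
padded <- substr("Data science is", start = 5, stop = 13)
padded                              # " science "

# Remove whitespace from both ends (the default)
trimmed <- trimws(padded)
trimmed                             # "science"

# The which argument limits trimming to one side
left_only <- trimws(padded, which = "left")
left_only                           # "science "
```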

You may also want to count backward from the end of the string to build a substring.

For example, you might want the last five characters. You can use the “str_sub” function from the “stringr” package for this.

library(stringr)
str_sub(data[1],-5,-1)

In this case, notice that the start and end arguments are both negative.

As a result, the start point is the fifth character counting back from the end of the string, and the endpoint is the last character.

[1] "data."

The output shows that the final five characters were successfully recovered.
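If “stringr” is not installed, a similar result can be obtained in base R by combining “nchar” and “substr”. This is a sketch only: `last_n` is a hypothetical helper written for this example, not a function from base R or “stringr”.

```r
# last_n is a hypothetical helper, not part of base R or stringr
last_n <- function(x, n) {
  len <- nchar(x)
  # Clamp the start index to 1 so short strings are returned whole
  substr(x, pmax(len - n + 1, 1), len)
}

tail5 <- last_n("machine learning and big data.", 5)
tail5                               # "data."

# Works element by element on a vector as well
tails <- last_n(c("mining", "R"), 3)
tails                               # "ing" "R"
```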

You should now be able to change the characters in a string, split a string into a vector, and retrieve specific substrings.
