You may come across messy text data in web scraping applications, analyses of writing samples, open-ended survey responses, and so on. Even if we are only interested in numbers, it is useful to be able to extract them from text.

Strings are sequences of characters.

"Suki"
## [1] "Suki"

R treats all character data (including strings) as the character type.

class("Suki")
## [1] "character"

Making Strings

We use quotes to construct strings. By using double quotes, we allow for the possibility that a single quote exists within the string.

"Suki's favorite ball"
## [1] "Suki's favorite ball"
"The dog said 'woof'!"
## [1] "The dog said 'woof'!"

Each space also counts as a character. The empty string "" is a valid string that contains zero characters.

Some characters are special, so we use “escape characters” (a backslash followed by another character, such as \n for a new line or \" for a double quote) to specify them in strings.
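
A few common escape sequences are sketched below. This is an illustrative selection with made-up strings, not a complete list. Note that print() shows the escaped form, while writeLines() shows the text as it would actually appear.

```r
# A newline (\n) and a tab (\t) inside a string
two_lines <- "Line one\nLine two"
tabbed    <- "name\tvalue"

# Escaping a double quote and a backslash
quoted    <- "She said \"woof\""
backslash <- "C:\\temp"

# print() shows the escapes; writeLines() shows the text itself
print(quoted)
writeLines(quoted)
writeLines(two_lines)

# Each escape sequence counts as a single character
nchar(tabbed)      # 10
nchar(backslash)   # 7
```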

Since strings (or character objects) are one of the atomic data types, like numeric or logical, they can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame. We can use nchar() to count the characters in a string. It is vectorized, so it returns one count per element of a character vector:

nchar("probably not the most efficient way to get a character count for an essay")
## [1] 73
nchar(c("Suki","Sokka","Fiddler","Daisy"))
## [1] 4 5 7 5
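
It is easy to confuse nchar() with length(). A minimal sketch of the difference, using an illustrative vector named pet_names:

```r
pet_names <- c("Suki", "Sokka", "Fiddler", "Daisy")

length(pet_names)   # number of elements in the vector: 4
nchar(pet_names)    # characters in each element: 4 5 7 5

length("Suki")      # a single string is a vector of length 1
nchar("")           # the empty string has 0 characters
```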

We can print a string using print().

print("cats")
## [1] "cats"

On Your Own

Construct a vector of three strings. Store this vector as mystring.

Substring Operations

A substring is one part of a larger string. To extract substrings, we use substr(). Indexing with [ ] or [[ ]] selects whole elements of a character vector, not characters inside a string, so we cannot use those operators to extract substrings.

mystr <- "Did you know that female praying mantises don't actually eat their mates in the wild?"
substr(mystr, start = 26, stop = 39)
## [1] "praying mantis"

We can also use substr() to replace characters within a string.

substr(mystr, 13, 13) <- "X"
mystr
## [1] "Did you knowXthat female praying mantises don't actually eat their mates in the wild?"
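
One detail worth knowing: assignment through substr() overwrites characters in place and never changes the length of the string. A small sketch with made-up strings x and y:

```r
x <- "abcdef"

# Replacement longer than the target span: the extra characters are dropped
substr(x, 1, 3) <- "ZZZZZ"
x   # "ZZZdef"

# Replacement shorter than the target span: only that many characters change
y <- "abcdef"
substr(y, 1, 3) <- "Q"
y   # "Qbcdef"
```

If the replacement should be allowed to change the length of the string, functions such as sub() are a better fit.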

The function substr() also works on vectors. It vectorizes over all its arguments:

pets <- c("Suki","Sokka","Fiddler","Daisy")
substr(pets, 1, 2)
## [1] "Su" "So" "Fi" "Da"
substr(pets, 12, 13)
## [1] "" "" "" ""
substr(pets, 5, 6)
## [1] ""   "a"  "le" "y"
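
The examples above vary only the strings. As a sketch of what vectorizing over the other arguments looks like, start and stop can be vectors too (the inputs here are made up for illustration):

```r
# A different start and stop for each element
letters6 <- rep("abcdef", 3)
substr(letters6, 1:3, 3:5)            # "abc" "bcd" "cde"

# Same start, different stop for each element
some_pets <- c("Suki", "Sokka", "Fiddler")
substr(some_pets, 1, c(1, 2, 3))      # "S" "So" "Fid"
```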

On Your Own

Extract the first 4 characters of all the entries in mystring.

Dividing Strings into Vectors

strsplit() divides a string according to key characters by splitting each element of the character vector x at appearances of the pattern split. It returns a list with one element per input string:

pets2 <- "Suki, Sokka, Fiddler, Daisy"
pets2
## [1] "Suki, Sokka, Fiddler, Daisy"
strsplit(pets2, ", ")
## [[1]]
## [1] "Suki"    "Sokka"   "Fiddler" "Daisy"
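
Because strsplit() returns a list, a vector of several strings gives several list elements. A minimal sketch with a made-up input vector pet_groups; fixed = TRUE tells strsplit() to treat the separator literally instead of as a regular expression:

```r
pet_groups <- c("Suki, Sokka", "Fiddler, Daisy, Momo")  # made-up input

pieces <- strsplit(pet_groups, ", ", fixed = TRUE)
pieces[[2]]        # "Fiddler" "Daisy" "Momo"
lengths(pieces)    # 2 3
unlist(pieces)     # one flat character vector of all five names

# Splitting on the empty string gives individual characters
strsplit("Suki", "")[[1]]   # "S" "u" "k" "i"
```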

Converting Objects into Strings

Explicitly converting one variable type to another is called casting. Notice in the output below that the number 7.2e12 is printed as supplied, but 7.2e5 is not. By default, R switches to scientific notation when a number is exceedingly large or exceedingly close to zero.

as.character(7.2)
## [1] "7.2"
as.character(7.2e12)
## [1] "7.2e+12"
as.character(c(7.2, 7.2e12))
## [1] "7.2"     "7.2e+12"
as.character(7.2e5)
## [1] "720000"
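
If the default choice of notation is not what we want, format() gives explicit control. A short sketch, using an illustrative variable big:

```r
big <- 7.2e12

as.character(big)                 # "7.2e+12"
format(big, scientific = FALSE)   # "7200000000000"
format(7.2e5, scientific = TRUE)  # "7.2e+05"

# Casting in the other direction, from string to number
as.numeric("7.2e12") == big       # TRUE
```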

The paste() Function

The paste() function is very flexible. With one vector argument, it works like as.character():

paste(1:10)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

With two or more vector arguments, it combines them element by element, recycling the shorter vectors as needed.

paste(pets, 1:10)
##  [1] "Suki 1"    "Sokka 2"   "Fiddler 3" "Daisy 4"   "Suki 5"    "Sokka 6"  
##  [7] "Fiddler 7" "Daisy 8"   "Suki 9"    "Sokka 10"

We can change the separator between pasted items:

paste(pets, 1:10, sep="_")
##  [1] "Suki_1"    "Sokka_2"   "Fiddler_3" "Daisy_4"   "Suki_5"    "Sokka_6"  
##  [7] "Fiddler_7" "Daisy_8"   "Suki_9"    "Sokka_10"
paste(pets, 1:10, sep="")
##  [1] "Suki1"    "Sokka2"   "Fiddler3" "Daisy4"   "Suki5"    "Sokka6"  
##  [7] "Fiddler7" "Daisy8"   "Suki9"    "Sokka10"

We can also condense multiple strings together using the collapse argument:

paste(pets2, " (", pets, ")", sep="", collapse="; ")
## [1] "Suki, Sokka, Fiddler, Daisy (Suki); Suki, Sokka, Fiddler, Daisy (Sokka); Suki, Sokka, Fiddler, Daisy (Fiddler); Suki, Sokka, Fiddler, Daisy (Daisy)"
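
A simpler use of collapse is to join one vector into a single string, which is roughly the reverse of strsplit(). A minimal sketch (one_string is an illustrative name):

```r
pets <- c("Suki", "Sokka", "Fiddler", "Daisy")

# Join all elements into one string
one_string <- paste(pets, collapse = ", ")
one_string                        # "Suki, Sokka, Fiddler, Daisy"

# Splitting on the same separator recovers the original vector
strsplit(one_string, ", ")[[1]]
```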

The paste function is super useful for automatically generating names!

filename <- paste0("savedFromR_setting", 2, ".csv")  # paste0() joins with no separator
write.csv(mtcars, filename)
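
When building file names, keep in mind that paste() inserts a space between items unless told otherwise. The sketch below compares the options; the "run" prefix and the "output" folder are hypothetical names chosen for illustration, and no files are written.

```r
paste("run", 2, ".csv")              # "run 2 .csv" (default sep = " ")
paste("run", 2, ".csv", sep = "")    # "run2.csv"
paste0("run", 2, ".csv")             # "run2.csv"

# Vector arguments generate several names at once
run_files <- paste0("run", 1:3, ".csv")
run_files                            # "run1.csv" "run2.csv" "run3.csv"

# file.path() joins folder and file names with the path separator
file.path("output", run_files)
```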

Example

Often we want to study or analyze a block of text.

theurl <- "https://lgpcappiello.github.io/teaching/stat128/theplot.txt"
theplot <- readLines(theurl, warn=FALSE)

# how many lines
length(theplot)
## [1] 11

Pattern Matching and Replacement

# ?grep

The functions documented under ?grep, including grep(), grepl(), sub(), and gsub(), are super useful for picking apart messy data!
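
As a brief sketch of the pattern matching functions in base R, here are a few calls on the pet names used earlier. The string "weight: 12.5 kg" is a made-up example of messy input.

```r
pets <- c("Suki", "Sokka", "Fiddler", "Daisy")

grep("^S", pets)                      # positions of matches: 1 2
grep("^S", pets, value = TRUE)        # the matches themselves: "Suki" "Sokka"
grepl("d", pets)                      # FALSE FALSE TRUE FALSE (case-sensitive)
grepl("d", pets, ignore.case = TRUE)  # FALSE FALSE TRUE TRUE

sub("k", "K", "Sokka")                # first match only: "SoKka"
gsub("k", "K", "Sokka")               # every match: "SoKKa"

# Pull a number out of a messy string by deleting everything else
weight <- as.numeric(gsub("[^0-9.]", "", "weight: 12.5 kg"))
weight                                # 12.5
```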

Word Count Tables

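Now let's break up the text by spaces. We do this in hopes that it will separate each word as an element. Note that strsplit() returns a list with one element per line of theplot, and [[1]] keeps the words from the first line.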

all.words <- strsplit(theplot, split=" ")[[1]]
head(all.words)
## [1] "Shrek"    "2"        "is"       "a"        "2004"     "American"

We can now tabulate how often each word appears using the table() function. Then we can sort the frequencies in order using sort().

wc <- table(all.words)
wc <- sort(wc, decreasing=TRUE)
wc[wc > 1]
## all.words
##          and           to        Shrek            a       Harold          the 
##            6            6            5            3            3            3 
##        their         with           by          Far        Fiona      Fiona's 
##            3            3            2            2            2            2 
##       Fiona,          her           is particularly         that         they 
##            2            2            2            2            2            2

Word Clouds

Note that punctuation and case can both cause the same word to be counted as several different entries (for example, “Fiona” and “Fiona,” in the table above). We may also want to remove super common words like “and” and “the”. The tm package can help with this.

# install.packages("tm")
library(tm)
## Warning: package 'tm' was built under R version 4.1.3
## Loading required package: NLP
# Create a corpus  
docs <- Corpus(VectorSource(theplot))
library(tidyverse)
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents

Now we create a data frame with each word in one column and its frequency in a second column.

dtm <- TermDocumentMatrix(docs)              # rows are terms, columns are documents
term_matrix <- as.matrix(dtm)                # named so it does not mask base::matrix()
words <- sort(rowSums(term_matrix), decreasing = TRUE)  # total count per term
df <- data.frame(word = names(words), freq = words)
# install.packages("wordcloud2")
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.1.3
wordcloud2(df)