You may come across messy text data in web scraping applications, analysis of writing samples, open-ended survey responses, etc. Even if we are only interested in numbers, it is useful to be able to extract them from text.
Strings are sequences of characters.
"Suki"
## [1] "Suki"
R treats all character data (including strings) as the character type.
class("Suki")
## [1] "character"
We use quotes to construct strings. By using double quotes, we allow for the possibility that a single quote exists within the string.
"Suki's favorite ball"
## [1] "Suki's favorite ball"
"The dog said 'woof'!"
## [1] "The dog said 'woof'!"
Each space is also a character, as is the empty string "".
Some characters are special, so we have "escape characters" to specify them in strings: \" (double quote), \t (tab), \n (newline), and \r (carriage return; use \n rather than \r when possible).
Since strings (or character objects) are one of the atomic data types, like numeric or logical, they can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame. We can use nchar() to get the length (character count) of a single string:
nchar("probably not the most efficient way to get a character count for an essay")
## [1] 73
nchar(c("Suki","Sokka","Fiddler","Daisy"))
## [1] 4 5 7 5
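Be careful not to confuse nchar() with length(): nchar() counts the characters in each string, while length() counts the elements of a vector. A quick sketch of the difference:

```r
pets <- c("Suki", "Sokka", "Fiddler", "Daisy")

length(pets)  # number of elements in the vector: 4
nchar(pets)   # characters in each element: 4 5 7 5
```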
We can print a string using print().
print("cats")
## [1] "cats"
Construct a vector of three strings. Store this vector as mystring.
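One possible answer (the specific strings here are just placeholders):

```r
mystring <- c("apple", "banana", "cherry")  # any three strings will do
mystring
```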
A substring is one part of a larger string. To extract substrings, we use substr(). A string is not a vector or a list, so we cannot use operations like [[ ]] or [ ] to extract substrings.
mystr <- "Did you know that female praying mantises don't actually eat their mates in the wild?"
substr(mystr, start = 26, stop = 39)
## [1] "praying mantis"
We can also use substr() to replace part of a string.
substr(mystr, 13, 13) = "X"
mystr
## [1] "Did you knowXthat female praying mantises don't actually eat their mates in the wild?"
The function substr() can also be used for vectors; it vectorizes over all its arguments:
pets <- c("Suki","Sokka","Fiddler","Daisy")
substr(pets, 1, 2)
## [1] "Su" "So" "Fi" "Da"
substr(pets, 12, 13)
## [1] "" "" "" ""
substr(pets, 5, 6)
## [1] "" "a" "le" "y"
Extract the first 4 characters of all the entries in mystring.
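A sketch of one solution, assuming mystring holds the three strings from the earlier exercise:

```r
mystring <- c("apple", "banana", "cherry")  # stand-in for the earlier exercise's vector
substr(mystring, 1, 4)                      # "appl" "bana" "cher"
```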
strsplit() divides a string according to key characters, splitting each element of the character vector x at appearances of the pattern split:
pets2 <- "Suki, Sokka, Fiddler, Daisy"
pets2
## [1] "Suki, Sokka, Fiddler, Daisy"
strsplit(pets2, ", ")
## [[1]]
## [1] "Suki" "Sokka" "Fiddler" "Daisy"
Explicitly converting one variable type to another is called casting. Notice that the number "7.2e12" is printed as supplied, but "7.2e5" is not. This is because if a number is exceedingly large, exceedingly small, or close to zero, then R will by default use scientific notation for that number. Note also that as.character() converts only its first argument, so the extra argument in the third example below is ignored.
as.character(7.2)
## [1] "7.2"
as.character(7.2e12)
## [1] "7.2e+12"
as.character(7.2, 7.2e12)
## [1] "7.2"
as.character(7.2e5)
## [1] "720000"
The paste() function is very flexible. With one vector argument, it works like as.character():
paste(1:10)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
With 2 or more vector arguments, it combines them with recycling.
paste(pets, 1:10)
## [1] "Suki 1" "Sokka 2" "Fiddler 3" "Daisy 4" "Suki 5" "Sokka 6"
## [7] "Fiddler 7" "Daisy 8" "Suki 9" "Sokka 10"
We can change the separator between pasted items:
paste(pets, 1:10, sep="_")
## [1] "Suki_1" "Sokka_2" "Fiddler_3" "Daisy_4" "Suki_5" "Sokka_6"
## [7] "Fiddler_7" "Daisy_8" "Suki_9" "Sokka_10"
paste(pets, 1:10, sep="")
## [1] "Suki1" "Sokka2" "Fiddler3" "Daisy4" "Suki5" "Sokka6"
## [7] "Fiddler7" "Daisy8" "Suki9" "Sokka10"
We can also condense multiple strings together using the collapse argument:
paste(pets2, " (", pets, ")", sep="", collapse="; ")
## [1] "Suki, Sokka, Fiddler, Daisy (Suki); Suki, Sokka, Fiddler, Daisy (Sokka); Suki, Sokka, Fiddler, Daisy (Fiddler); Suki, Sokka, Fiddler, Daisy (Daisy)"
The paste() function is super useful for automatically generating file names!
filename <- paste("savedFromR_setting", 2, ".csv", sep = "")  # sep = "" avoids spaces in the name
write.csv(mtcars, filename)
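The same idea vectorizes, so we can generate a whole family of file names at once (the setting numbers here are just placeholders):

```r
settings <- 1:3
filenames <- paste("savedFromR_setting", settings, ".csv", sep = "")  # sep = "" avoids spaces
filenames
# "savedFromR_setting1.csv" "savedFromR_setting2.csv" "savedFromR_setting3.csv"
```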
Often we want to study or analyze a block of text.
theurl <- "https://lgpcappiello.github.io/teaching/stat128/theplot.txt"
theplot <- readLines(theurl, warn=FALSE)
# how many lines
length(theplot)
## [1] 11
# see ?grep for pattern-matching functions such as grep(), grepl(), sub(), and gsub()
These functions are super useful for picking apart messy data!
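For instance, grepl() tests whether a pattern appears in each string, and gsub() replaces every match (both are documented under ?grep):

```r
pets <- c("Suki", "Sokka", "Fiddler", "Daisy")

grepl("dd", pets)     # which names contain "dd"? FALSE FALSE TRUE FALSE
gsub("a", "@", pets)  # replace every "a": "Suki" "Sokk@" "Fiddler" "D@isy"
```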
Now let's break up the data set by spaces. We do this in hopes that it will separate each word as an element.
all.words <- strsplit(theplot, split=" ")[[1]]  # [[1]] takes the words from the first line
head(all.words)
## [1] "Shrek" "2" "is" "a" "2004" "American"
We can now tabulate how often each word appears using the table() function. Then we can sort the frequencies in order using sort().
wc <- table(all.words)
wc <- sort(wc, decreasing=TRUE)
wc[wc > 1]
## all.words
## and to Shrek a Harold the
## 6 6 5 3 3 3
## their with by Far Fiona Fiona's
## 3 3 2 2 2 2
## Fiona, her is particularly that they
## 2 2 2 2 2 2
Note that punctuation and case can both cause words to appear that are not of interest to us. We may also want to remove super common words like "and" and "the". The tm library can help with this.
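For a sense of what tm is doing, the same cleanup can be sketched in base R with tolower() and gsub(); the two sample lines below are made up to stand in for theplot:

```r
sample_lines <- c("Shrek and Fiona travel to Far, Far Away.",
                  "Shrek meets the King and the Queen.")

clean <- tolower(sample_lines)           # fold case so "The" and "the" match
clean <- gsub("[[:punct:]]", "", clean)  # strip punctuation
words <- unlist(strsplit(clean, " "))    # split every line into words
sort(table(words), decreasing = TRUE)
```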
# install.packages("tm")
library(tm)
## Warning: package 'tm' was built under R version 4.1.3
## Loading required package: NLP
# Create a corpus
docs <- Corpus(VectorSource(theplot))
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
docs <- tm_map(docs, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
Now we create a data frame with each word in one column and its frequency in a second column.
dtm <- TermDocumentMatrix(docs)
dtm_matrix <- as.matrix(dtm)  # avoid masking the base matrix() function
words <- sort(rowSums(dtm_matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq = words)
# install.packages("wordcloud2")
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.1.3
wordcloud2(df)