# install.packages('tidyverse')
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The tidyverse is a collection of R packages including ggplot2, dplyr, tidyr, readr, purrr, tibble, and stringr. Calling library(tidyverse) attaches the eight core packages listed in the output above; there are approximately 20 packages in total.
The tidyverse philosophy is structured around “action” instead of “objects”.
The tidyverse is designed for, and widely used in, data science applications. It tends to be more beginner-friendly than base R, but (1) it can be a major deviation from base R and (2) it’s not always flexible enough to do everything we want it to do.
Tidyverse syntax is designed to be used with a “piping” operator (loaded in when we call library(tidyverse)).
The piping operator, %>%, “feeds” things from left to right.
vec <- 1:10
vec %>% mean() # "feed" vec into the mean function
## [1] 5.5
This is useful when we have a sequence of multiple operations and want to pipe the output of one into the next.
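As a small sketch of that idea (the particular chain of functions here is only an illustration, not part of the original notes), compare a nested base R call with the same computation written as a pipeline:

```r
library(tidyverse)

vec <- 1:10

# Nested base R call: has to be read from the inside out
nested <- round(sqrt(sum(vec)), 2)

# Same computation with the pipe: read from left to right
piped <- vec %>% sum() %>% sqrt() %>% round(2)

nested
piped
```

Both versions return the same value; the piped version simply lists the steps in the order they happen.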
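Convert the following base R code into a tidyverse approach using the piping operator. (Note that vs = 1 is a named argument, not the comparison vs == 1, so subset() keeps every row; the output below is the column means of all of mtcars, which is why the mean of vs is 0.4375 rather than 1.)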
colMeans(subset(mtcars, vs = 1))
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
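One possible answer is sketched below. It assumes the exercise intends to keep only the rows where vs equals 1, so it uses the comparison operator == inside filter(); under that assumption the numbers will differ from the output above.

```r
library(tidyverse)

# Assumption: the intended condition is vs == 1 (a comparison),
# so only the matching rows are kept before averaging.
vs_means <- mtcars %>%
  filter(vs == 1) %>%
  colMeans()

vs_means
```

Because colMeans() takes the data as its first argument, it can sit at the end of the pipeline without any extra syntax.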
A tibble is tidyverse’s answer to the data frame. You may have seen things stored in the R environment as tibbles before - they’re fairly widely used and function similarly to data frames.
Unlike data frames, tibbles are designed to print in a way that doesn’t fill your markdown document with 47 pages of output when you forget to comment out that print statement before compiling. By default only the first 10 rows are shown, much like calling the head function in base R.
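To see the difference on a familiar data set, here is a minimal sketch that converts the built-in mtcars data frame to a tibble (the object name cars_tbl is just an illustrative choice):

```r
library(tidyverse)

# mtcars ships with R as a plain data frame
class(mtcars)

# Convert it to a tibble; note that as_tibble() drops the row names
cars_tbl <- as_tibble(mtcars)
class(cars_tbl)

# Printing a tibble shows only the first rows plus the column types
cars_tbl

# print() accepts n when a different number of rows is wanted
print(cars_tbl, n = 3)
```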
class(diamonds)
## [1] "tbl_df" "tbl" "data.frame"
dim(diamonds)
## [1] 53940 10
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
The tidyverse is a gigantic collection of functions and objects, but these are a few of the main ones to help you get started.
Note that in tidyverse help files, argument names often start with a “.” (for example, .data). This is in contrast to many of the base R help files, where arguments are in all caps (for example, FUN).
select(): Select variables (columns) in a data frame.
filter(): Subset a data frame, retaining all rows that satisfy your conditions.
arrange(): Order the rows of a data frame by the values of selected columns.
rename(): Change the names of individual variables using new_name = old_name syntax.
mutate(): Add new variables and preserve existing ones.
group_by(): Take an existing tibble and convert it into a grouped tibble, where operations can then be performed “by group”.
summarize()/summarise(): Reduce each group to a single row, with one column per summary statistic.
Generally, the functions above have the following properties:
select()
Isolate particular columns:
diamonds %>% select(price, cut)
## # A tibble: 53,940 × 2
## price cut
## <int> <ord>
## 1 326 Ideal
## 2 326 Premium
## 3 327 Good
## 4 334 Premium
## 5 335 Good
## 6 336 Very Good
## 7 336 Very Good
## 8 337 Very Good
## 9 337 Fair
## 10 338 Very Good
## # … with 53,930 more rows
We can store this output with the usual assignment operator:
priceCut <- diamonds %>% select(price, cut)
priceCut
## # A tibble: 53,940 × 2
## price cut
## <int> <ord>
## 1 326 Ideal
## 2 326 Premium
## 3 327 Good
## 4 334 Premium
## 5 335 Good
## 6 336 Very Good
## 7 336 Very Good
## 8 337 Very Good
## 9 337 Fair
## 10 338 Very Good
## # … with 53,930 more rows
You can also use the operator “:”, and negative signs with the select() function. With the “name1:name2” operator we can select all columns between the column named “name1” and “name2”. With negative signs we can omit all variables that are preceded with a negative sign. These methods are typically not allowed in standard base R indexing when using names:
# Select all columns between (and including) price and cut; writing price:cut returns them in reverse order:
diamonds %>% select(price:cut)
## # A tibble: 53,940 × 6
## price table depth clarity color cut
## <int> <dbl> <dbl> <ord> <ord> <ord>
## 1 326 55 61.5 SI2 E Ideal
## 2 326 61 59.8 SI1 E Premium
## 3 327 65 56.9 VS1 E Good
## 4 334 58 62.4 VS2 I Premium
## 5 335 58 63.3 SI2 J Good
## 6 336 57 62.8 VVS2 J Very Good
## 7 336 57 62.3 VVS1 I Very Good
## 8 337 55 61.9 SI1 H Very Good
## 9 337 61 65.1 VS2 E Fair
## 10 338 61 59.4 VS1 H Very Good
## # … with 53,930 more rows
# Select all but price and cut
diamonds %>% select(-price, -cut)
## # A tibble: 53,940 × 8
## carat color clarity depth table x y z
## <dbl> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 E SI2 61.5 55 3.95 3.98 2.43
## 2 0.21 E SI1 59.8 61 3.89 3.84 2.31
## 3 0.23 E VS1 56.9 65 4.05 4.07 2.31
## 4 0.29 I VS2 62.4 58 4.2 4.23 2.63
## 5 0.31 J SI2 63.3 58 4.34 4.35 2.75
## 6 0.24 J VVS2 62.8 57 3.94 3.96 2.48
## 7 0.24 I VVS1 62.3 57 3.95 3.98 2.47
## 8 0.26 H SI1 61.9 55 4.07 4.11 2.53
## 9 0.22 E VS2 65.1 61 3.87 3.78 2.49
## 10 0.23 H VS1 59.4 61 4 4.05 2.39
## # … with 53,930 more rows
filter()
Isolate particular rows:
mean(diamonds$depth) # mean of depth variable
## [1] 61.7494
diamonds %>% filter(depth > mean(depth)) # all rows where depth > mean(depth)
## # A tibble: 28,909 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 7 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## 8 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 9 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 10 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## # … with 28,899 more rows
We can also filter on multiple conditions:
diamonds %>% filter(depth > mean(depth), cut == "Good", price > 350)
## # A tibble: 3,548 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
## 2 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
## 3 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
## 4 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
## 5 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
## 6 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
## 7 0.32 Good H SI2 63.1 56 403 4.34 4.37 2.75
## 8 0.32 Good H SI2 63.8 56 403 4.36 4.38 2.79
## 9 0.3 Good I SI1 63.2 55 405 4.25 4.29 2.7
## 10 0.3 Good H SI1 63.7 57 554 4.28 4.26 2.72
## # … with 3,538 more rows
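Conditions separated by commas are combined with a logical AND. The sketch below checks this against an explicit &, and shows that an OR has to be written out with |; the object names are illustrative only.

```r
library(tidyverse)

# Comma-separated conditions: a row must satisfy all of them
with_commas <- diamonds %>% filter(cut == "Good", price > 350)

# The same filter written with an explicit &
with_and <- diamonds %>% filter(cut == "Good" & price > 350)

# Check that both approaches keep exactly the same rows
identical(with_commas, with_and)

# OR must be written explicitly with |
good_or_cheap <- diamonds %>% filter(cut == "Good" | price < 350)
nrow(good_or_cheap)
```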
arrange()
The arrange() function works like sort() or order() in base R.
diamonds %>% arrange(price)
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
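By default arrange() sorts in increasing order. As a short sketch, dplyr's desc() helper reverses the order of a column, and several columns can be supplied to break ties:

```r
library(tidyverse)

# Most expensive diamonds first
by_price_desc <- diamonds %>% arrange(desc(price))

# Sort by cut (in factor-level order), then by decreasing price within each cut
by_cut_price <- diamonds %>% arrange(cut, desc(price))

by_price_desc
by_cut_price
```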
rename()
Allows us to rename variables (replaces colnames()).
Structure: NewName = OldName. If we want to rename the cut column to Cut (capitalized), this would look like:
diamonds %>% rename(Cut = cut) #new = old
## # A tibble: 53,940 × 10
## carat Cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
We can do this with as many columns at one time as we want.
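For instance, a minimal sketch renaming three columns in one call (the capitalized names are arbitrary choices for illustration):

```r
library(tidyverse)

# Each pair follows the new = old pattern
renamed <- diamonds %>%
  rename(Cut = cut, Price = price, Carat = carat)

names(renamed)
```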
mutate()
The mutate function can be used to add or change a variable.
diamonds %>% mutate(price = price/100) # change price units from dollars to hundreds of dollars
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 3.26 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 3.26 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 3.27 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 3.34 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 3.35 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 3.36 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 3.36 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 3.37 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 3.37 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 3.38 4 4.05 2.39
## # … with 53,930 more rows
We can change multiple columns at once and add new columns. Expressions are evaluated in order, so newVar below is computed from the rescaled price:
diamonds %>% mutate(price = price/100, newVar = 10*depth + price)
## # A tibble: 53,940 × 11
## carat cut color clarity depth table price x y z newVar
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 3.26 3.95 3.98 2.43 618.
## 2 0.21 Premium E SI1 59.8 61 3.26 3.89 3.84 2.31 601.
## 3 0.23 Good E VS1 56.9 65 3.27 4.05 4.07 2.31 572.
## 4 0.29 Premium I VS2 62.4 58 3.34 4.2 4.23 2.63 627.
## 5 0.31 Good J SI2 63.3 58 3.35 4.34 4.35 2.75 636.
## 6 0.24 Very Good J VVS2 62.8 57 3.36 3.94 3.96 2.48 631.
## 7 0.24 Very Good I VVS1 62.3 57 3.36 3.95 3.98 2.47 626.
## 8 0.26 Very Good H SI1 61.9 55 3.37 4.07 4.11 2.53 622.
## 9 0.22 Fair E VS2 65.1 61 3.37 3.87 3.78 2.49 654.
## 10 0.23 Very Good H VS1 59.4 61 3.38 4 4.05 2.39 597.
## # … with 53,930 more rows
The function transmute() is similar, but drops all other variables:
diamonds %>% transmute(price = price/100, newVar = 10*depth + price)
## # A tibble: 53,940 × 2
## price newVar
## <dbl> <dbl>
## 1 3.26 618.
## 2 3.26 601.
## 3 3.27 572.
## 4 3.34 627.
## 5 3.35 636.
## 6 3.36 631.
## 7 3.36 626.
## 8 3.37 622.
## 9 3.37 654.
## 10 3.38 597.
## # … with 53,930 more rows
group_by()
This is often used with the summarize function to group sets of observations together.
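On its own, group_by() does not change the data; it only records how the rows are grouped. A small sketch using dplyr's group_vars(), n_groups(), and ungroup() helpers makes this visible:

```r
library(tidyverse)

by_cut <- diamonds %>% group_by(cut)

dim(by_cut)         # same dimensions as diamonds
group_vars(by_cut)  # names of the grouping variables
n_groups(by_cut)    # number of groups

# ungroup() removes the grouping again
ungrouped <- by_cut %>% ungroup()
```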
summarize() / summarise()
Group by cut, then calculate the mean price for each cut:
diamonds %>% group_by(cut) %>% summarize(PriceMean = mean(price))
## # A tibble: 5 × 2
## cut PriceMean
## <ord> <dbl>
## 1 Fair 4359.
## 2 Good 3929.
## 3 Very Good 3982.
## 4 Premium 4584.
## 5 Ideal 3458.
We can also do this with multiple groups and summary statistics.
diamonds %>%
  group_by(cut, color) %>%
  summarize(PriceMean = mean(price),
            PriceMedian = median(price))
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 35 × 4
## # Groups: cut [5]
## cut color PriceMean PriceMedian
## <ord> <ord> <dbl> <dbl>
## 1 Fair D 4291. 3730
## 2 Fair E 3682. 2956
## 3 Fair F 3827. 3035
## 4 Fair G 4239. 3057
## 5 Fair H 5136. 3816
## 6 Fair I 4685. 3246
## 7 Fair J 4976. 3302
## 8 Good D 3405. 2728.
## 9 Good E 3424. 2420
## 10 Good F 3496. 2647
## # … with 25 more rows
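The message above appears because the result is still grouped by cut. As a sketch, the .groups argument it mentions can be set to "drop" to return an ungrouped tibble, which also silences the message:

```r
library(tidyverse)

price_summary <- diamonds %>%
  group_by(cut, color) %>%
  summarize(PriceMean = mean(price),
            PriceMedian = median(price),
            .groups = "drop")  # return an ungrouped result

price_summary
```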
Convert the following line of code to code that uses piping operators and tidyverse functions.
by(diamonds$price, diamonds$color, summary)
## diamonds$color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18693
## ------------------------------------------------------------
## diamonds$color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1739 3077 4003 18731
## ------------------------------------------------------------
## diamonds$color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2344 3725 4868 18791
## ------------------------------------------------------------
## diamonds$color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2242 3999 6048 18818
## ------------------------------------------------------------
## diamonds$color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4487 5980 18803
## ------------------------------------------------------------
## diamonds$color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5092 7202 18823
## ------------------------------------------------------------
## diamonds$color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
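One possible answer is sketched below. It is not the only way to do this; it reproduces the six statistics reported by summary() as separate columns, and the column names are illustrative choices.

```r
library(tidyverse)

price_by_color <- diamonds %>%
  group_by(color) %>%
  summarize(Min    = min(price),
            Q1     = quantile(price, 0.25),
            Median = median(price),
            Mean   = mean(price),
            Q3     = quantile(price, 0.75),
            Max    = max(price))

price_by_color
```

The result is a single tibble with one row per color, rather than the list-like printout produced by by().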
Explain in plain English what each of the following code chunks does. You may need to examine the help files for some of the functions.
diamonds %>%
  mutate(DxC = depth*carat) %>%
  group_by(cut) %>%
  summarise(AvgDxC = mean(DxC),
            AvgCut = mean(price))
## # A tibble: 5 × 3
## cut AvgDxC AvgCut
## <ord> <dbl> <dbl>
## 1 Fair 67.2 4359.
## 2 Good 52.9 3929.
## 3 Very Good 49.9 3982.
## 4 Premium 54.6 4584.
## 5 Ideal 43.4 3458.
diamonds %>%
  filter(cut == "Ideal") %>%
  select(cut, carat, depth, price)
## # A tibble: 21,551 × 4
## cut carat depth price
## <ord> <dbl> <dbl> <int>
## 1 Ideal 0.23 61.5 326
## 2 Ideal 0.23 62.8 340
## 3 Ideal 0.31 62.2 344
## 4 Ideal 0.3 62 348
## 5 Ideal 0.33 61.8 403
## 6 Ideal 0.33 61.2 403
## 7 Ideal 0.33 61.1 403
## 8 Ideal 0.23 61.9 404
## 9 Ideal 0.32 60.9 404
## 10 Ideal 0.3 61 405
## # … with 21,541 more rows
diamonds %>%
  filter(price > median(price)) %>%
  group_by(color) %>%
  summarize(mean_depth = mean(depth),
            min_depth = min(depth)) %>%
  arrange(mean_depth) %>%
  head(n = 10)
## # A tibble: 7 × 3
## color mean_depth min_depth
## <ord> <dbl> <dbl>
## 1 D 61.7 55.5
## 2 E 61.7 53.1
## 3 F 61.8 55.4
## 4 G 61.8 43
## 5 I 61.8 50.8
## 6 H 61.8 54.7
## 7 J 61.9 43