Tidyverse

# install.packages('tidyverse')
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

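The tidyverse is a collection of R packages including ggplot2, dplyr, tidyr, readr, purrr, tibble, and stringr. Calling library(tidyverse) attaches the eight core packages listed in the output above; the full collection is larger, with approximately 20 packages in total.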

The tidyverse philosophy is structured around “action” instead of “objects”.

Tidyverse is designed for and often used for data science applications. It tends to be more beginner-friendly than base R, but (1) it can be a major deviation from base R and (2) it’s not always flexible enough to do everything we want it to do.

Piping Operator

Tidyverse syntax is designed to be used with a “piping” operator (loaded in when we call the tidyverse library).

The piping operator, %>%, “feeds” the value on its left into the function on its right.

vec <- 1:10
vec %>% mean() # "feed" vec into the mean function

## [1] 5.5

This is useful when we have a sequence of multiple operations and want to pipe the output of one into the next.
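
As a small illustration of chaining, the sketch below compares a nested base R call with the same computation written as a pipeline. The vector x is made up for this example.

```r
library(dplyr)  # attaching dplyr (or the full tidyverse) makes %>% available

x <- c(1, 4, 9, 16)  # made-up example values

# Nested base R call: read from the inside out
nested <- sum(sqrt(x))

# Piped version: read from left to right
piped <- x %>% sqrt() %>% sum()

nested
piped
```

Both versions compute the same value; the piped form simply lists the steps in the order they happen.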

On Your Own

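Convert the following base R code into a tidyverse approach using the piping operator. (Be aware that vs = 1 is not a filtering condition: subset() silently ignores it, so the output below is the column means over all rows of mtcars. Keeping only the rows where vs equals 1 would require vs == 1.)

colMeans(subset(mtcars, vs = 1))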

##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Tibbles

A tibble is tidyverse’s answer to the data frame. You may have seen things stored in the R environment as tibbles before - they’re fairly widely used and function similarly to data frames.

Unlike data frames, tibbles are designed to print in a way that doesn’t fill your markdown document with 47 pages of output when you forget to comment out that print statement before compiling. (Similar to the head function in base R.)

class(diamonds)

## [1] "tbl_df"     "tbl"        "data.frame"

dim(diamonds)

## [1] 53940    10

diamonds

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
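
To see the relationship between data frames and tibbles, we can convert an ordinary data frame ourselves. This is a minimal sketch using the built-in mtcars data frame; the name mtcars_tbl is just an illustrative choice.

```r
library(tibble)  # also attached by library(tidyverse)

# mtcars is a built-in base R data frame
mtcars_tbl <- as_tibble(mtcars)

class(mtcars)      # a plain data frame
class(mtcars_tbl)  # a tibble, which is still a data frame underneath

mtcars_tbl         # prints only the first 10 rows
```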

Key Functions

Tidyverse is a gigantic collection of functions and objects, but these are a few of the main ones to help you get started.

Note that in tidyverse help files, argument names often start with a “.” (for example, .data). This is in contrast to many base R help files, where arguments are in all caps (for example, X and FUN).

select(): Selects variables (columns) in a data frame.
filter(): Subsets a data frame, retaining all rows that satisfy your conditions.
arrange(): Orders the rows of a data frame by the values of selected columns.
rename(): Changes the names of individual variables using new_name = old_name syntax.
mutate(): Adds new variables and preserves existing ones.
group_by(): Takes an existing tibble and converts it into a grouped tibble where operations can then be performed “by group”.
summarize()/summarise(): Collapses each group into a single row, with one column per summary statistic.

Generally, the functions above have the following properties:

The first argument is a data frame or a tibble.
The subsequent arguments are used to determine what to do with the data-frame/tibble in the first argument.
The returned value is a data frame or a tibble.
The inputted data-frames/tibbles should be well formatted to start off with. Each row should be an observation, and each column should be a variable.
When we refer to column names for the data frame or tibble in the first argument we do not need to use quotes around the column names.
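
The properties above can be sketched with a single function. The example below uses the built-in mtcars data frame and filter(); the names direct and piped are illustrative only.

```r
library(dplyr)

# Data frame as the first argument, then an unquoted column name
direct <- filter(mtcars, cyl == 4)

# The pipe supplies the data frame as the first argument for us
piped <- mtcars %>% filter(cyl == 4)

identical(direct, piped)  # the two calls are equivalent
is.data.frame(piped)      # the result is again a data frame
```

Because each function takes a data frame first and returns a data frame, the output of one can be piped straight into the next.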

`select()`

Isolate particular columns:

diamonds %>% select(price, cut)

## # A tibble: 53,940 × 2
##    price cut      
##    <int> <ord>    
##  1   326 Ideal    
##  2   326 Premium  
##  3   327 Good     
##  4   334 Premium  
##  5   335 Good     
##  6   336 Very Good
##  7   336 Very Good
##  8   337 Very Good
##  9   337 Fair     
## 10   338 Very Good
## # … with 53,930 more rows

We can store this output in a variable, just like any other R object:

priceCut <- diamonds %>% select(price, cut)
priceCut

## # A tibble: 53,940 × 2
##    price cut      
##    <int> <ord>    
##  1   326 Ideal    
##  2   326 Premium  
##  3   327 Good     
##  4   334 Premium  
##  5   335 Good     
##  6   336 Very Good
##  7   336 Very Good
##  8   337 Very Good
##  9   337 Fair     
## 10   338 Very Good
## # … with 53,930 more rows

You can also use the “:” operator and negative signs with the select() function. Writing name1:name2 selects every column from “name1” through “name2”, and a negative sign in front of a column name omits that column. Neither method is typically allowed in standard base R indexing by name:

# Select all columns from price back to cut (inclusive), returned in that order:
diamonds %>% select(price:cut)

## # A tibble: 53,940 × 6
##    price table depth clarity color cut      
##    <int> <dbl> <dbl> <ord>   <ord> <ord>    
##  1   326    55  61.5 SI2     E     Ideal    
##  2   326    61  59.8 SI1     E     Premium  
##  3   327    65  56.9 VS1     E     Good     
##  4   334    58  62.4 VS2     I     Premium  
##  5   335    58  63.3 SI2     J     Good     
##  6   336    57  62.8 VVS2    J     Very Good
##  7   336    57  62.3 VVS1    I     Very Good
##  8   337    55  61.9 SI1     H     Very Good
##  9   337    61  65.1 VS2     E     Fair     
## 10   338    61  59.4 VS1     H     Very Good
## # … with 53,930 more rows

# Select all but price and cut
diamonds %>% select(-price, -cut)

## # A tibble: 53,940 × 8
##    carat color clarity depth table     x     y     z
##    <dbl> <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  0.23 E     SI2      61.5    55  3.95  3.98  2.43
##  2  0.21 E     SI1      59.8    61  3.89  3.84  2.31
##  3  0.23 E     VS1      56.9    65  4.05  4.07  2.31
##  4  0.29 I     VS2      62.4    58  4.2   4.23  2.63
##  5  0.31 J     SI2      63.3    58  4.34  4.35  2.75
##  6  0.24 J     VVS2     62.8    57  3.94  3.96  2.48
##  7  0.24 I     VVS1     62.3    57  3.95  3.98  2.47
##  8  0.26 H     SI1      61.9    55  4.07  4.11  2.53
##  9  0.22 E     VS2      65.1    61  3.87  3.78  2.49
## 10  0.23 H     VS1      59.4    61  4     4.05  2.39
## # … with 53,930 more rows
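
To make the contrast with base R indexing concrete, here is a small sketch using the built-in mtcars data frame. The object names (base_try, base_drop, tidy_drop, range_names) are hypothetical and only serve the illustration.

```r
library(dplyr)

# Base R: putting a minus sign in front of a column name is an error
base_try <- tryCatch(mtcars[, -"mpg"], error = function(e) "error")
base_try

# A base R workaround: keep the columns whose name is not "mpg"
base_drop <- mtcars[, names(mtcars) != "mpg"]

# select(): the negative sign works directly on the column name
tidy_drop <- mtcars %>% select(-mpg)

identical(names(base_drop), names(tidy_drop))

# select(): a range of columns by name
range_names <- mtcars %>% select(mpg:hp) %>% names()
range_names
```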

`filter()`

Isolate particular rows:

mean(diamonds$depth) # mean of depth variable

## [1] 61.7494

diamonds %>% filter(depth > mean(depth)) # all rows where depth > mean(depth)

## # A tibble: 28,909 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  2  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  4  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  5  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  6  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
##  7  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
##  8  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
##  9  0.31 Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
## 10  0.3  Ideal     I     SI2      62      54   348  4.31  4.34  2.68
## # … with 28,899 more rows

We can also filter on multiple conditions:

diamonds %>% filter(depth > mean(depth), cut == "Good", price > 350)

## # A tibble: 3,548 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
##  2  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
##  3  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
##  4  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
##  5  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
##  6  0.26 Good  D     VS2      65.2    56   403  3.99  4.02  2.61
##  7  0.32 Good  H     SI2      63.1    56   403  4.34  4.37  2.75
##  8  0.32 Good  H     SI2      63.8    56   403  4.36  4.38  2.79
##  9  0.3  Good  I     SI1      63.2    55   405  4.25  4.29  2.7 
## 10  0.3  Good  H     SI1      63.7    57   554  4.28  4.26  2.72
## # … with 3,538 more rows
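
Conditions separated by commas must all hold, so the comma behaves like a logical “and”. The sketch below, which uses the built-in mtcars data frame with arbitrarily chosen thresholds, checks this and shows how to ask for “or” instead.

```r
library(dplyr)

# Comma-separated conditions: both must be TRUE
both_comma <- mtcars %>% filter(cyl == 4, mpg > 30)

# The same filter written with an explicit "and"
both_and <- mtcars %>% filter(cyl == 4 & mpg > 30)

identical(both_comma, both_and)

# Use | when either condition is enough
either <- mtcars %>% filter(cyl == 4 | mpg > 30)

nrow(both_comma)
nrow(either)
```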

`arrange()`

The arrange() function sorts the rows of a data frame, much like sort() or order() in base R. By default, rows are sorted in ascending order.

diamonds %>% arrange(price)

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
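
Since diamonds already appears sorted by price, the effect of arrange() is easier to see on a small unsorted table. The tibble toy below is made up for this sketch; desc() reverses the sort direction.

```r
library(dplyr)

# A made-up, unsorted example table
toy <- tibble(item = c("a", "b", "c"), price = c(30, 10, 20))

asc <- toy %>% arrange(price)         # ascending: b, c, a
dsc <- toy %>% arrange(desc(price))   # descending: a, c, b

asc
dsc

# Base R equivalent of the ascending sort
toy[order(toy$price), ]
```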

`rename()`

The rename() function allows us to rename variables (an alternative to assigning to colnames()).

Structure: NewName = OldName. If we want to rename the cut column to Cut (capitalized), this would look like:

diamonds %>% rename(Cut = cut) #new = old

## # A tibble: 53,940 × 10
##    carat Cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

We can rename as many columns at once as we want.
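
For instance, the following sketch renames two columns of the built-in mtcars data frame in one call. The new names are arbitrary choices for illustration.

```r
library(dplyr)

# Each pair follows the new = old pattern
renamed <- mtcars %>% rename(MilesPerGallon = mpg, Horsepower = hp)

names(renamed)
```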

`mutate()`

The mutate function can be used to add or change a variable.

diamonds %>% mutate(price = price/100) # change price units from dollars to hundreds of dollars

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55  3.26  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61  3.26  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65  3.27  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58  3.34  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58  3.35  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57  3.36  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57  3.36  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55  3.37  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61  3.37  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61  3.38  4     4.05  2.39
## # … with 53,930 more rows

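We can change multiple columns at once and add new columns. Note that newVar is computed from the already-rescaled price, since mutate() evaluates its arguments in order: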

diamonds %>% mutate(price = price/100, newVar = 10*depth + price)

## # A tibble: 53,940 × 11
##    carat cut       color clarity depth table price     x     y     z newVar
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55  3.26  3.95  3.98  2.43   618.
##  2  0.21 Premium   E     SI1      59.8    61  3.26  3.89  3.84  2.31   601.
##  3  0.23 Good      E     VS1      56.9    65  3.27  4.05  4.07  2.31   572.
##  4  0.29 Premium   I     VS2      62.4    58  3.34  4.2   4.23  2.63   627.
##  5  0.31 Good      J     SI2      63.3    58  3.35  4.34  4.35  2.75   636.
##  6  0.24 Very Good J     VVS2     62.8    57  3.36  3.94  3.96  2.48   631.
##  7  0.24 Very Good I     VVS1     62.3    57  3.36  3.95  3.98  2.47   626.
##  8  0.26 Very Good H     SI1      61.9    55  3.37  4.07  4.11  2.53   622.
##  9  0.22 Fair      E     VS2      65.1    61  3.37  3.87  3.78  2.49   654.
## 10  0.23 Very Good H     VS1      59.4    61  3.38  4     4.05  2.39   597.
## # … with 53,930 more rows
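
The order of evaluation inside mutate() is easier to verify by hand on a tiny table. The tibble toy below is made up for this sketch.

```r
library(dplyr)

# Made-up values that are easy to check by hand
toy <- tibble(price = c(200, 400), depth = c(60, 62))

result <- toy %>%
  mutate(price = price / 100,          # price becomes 2 and 4
         newVar = 10 * depth + price)  # uses the updated price: 602 and 624

result
```

If newVar had used the original price, the values would have been 800 and 1020 instead.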

The function transmute() is similar, but drops all other variables:

diamonds %>% transmute(price = price/100, newVar = 10*depth + price)

## # A tibble: 53,940 × 2
##    price newVar
##    <dbl>  <dbl>
##  1  3.26   618.
##  2  3.26   601.
##  3  3.27   572.
##  4  3.34   627.
##  5  3.35   636.
##  6  3.36   631.
##  7  3.36   626.
##  8  3.37   622.
##  9  3.37   654.
## 10  3.38   597.
## # … with 53,930 more rows

`group_by()`

The group_by() function is most often used together with summarize(): it groups sets of observations so that later operations are carried out within each group.
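
On its own, group_by() does not change the data; it only records the grouping. A minimal sketch with the built-in mtcars data frame (by_cyl is an illustrative name):

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

is_grouped_df(by_cyl)  # the result is a grouped tibble
group_vars(by_cyl)     # the grouping variable
n_groups(by_cyl)       # one group per distinct value of cyl
dim(by_cyl)            # same rows and columns as mtcars
```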

`summarize()` / `summarise()`

Group by cut, then calculate the mean price for each cut:

diamonds %>% group_by(cut) %>% summarize(PriceMean = mean(price))

## # A tibble: 5 × 2
##   cut       PriceMean
##   <ord>         <dbl>
## 1 Fair          4359.
## 2 Good          3929.
## 3 Very Good     3982.
## 4 Premium       4584.
## 5 Ideal         3458.

We can also do this with multiple groups and summary statistics.

diamonds %>%
  group_by(cut, color) %>%
  summarize(PriceMean = mean(price), 
            PriceMedian = median(price))

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.

## # A tibble: 35 × 4
## # Groups:   cut [5]
##    cut   color PriceMean PriceMedian
##    <ord> <ord>     <dbl>       <dbl>
##  1 Fair  D         4291.       3730 
##  2 Fair  E         3682.       2956 
##  3 Fair  F         3827.       3035 
##  4 Fair  G         4239.       3057 
##  5 Fair  H         5136.       3816 
##  6 Fair  I         4685.       3246 
##  7 Fair  J         4976.       3302 
##  8 Good  D         3405.       2728.
##  9 Good  E         3424.       2420 
## 10 Good  F         3496.       2647 
## # … with 25 more rows
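
The message above appears because, with more than one grouping variable, summarize() drops only the last one by default, so the result is still grouped. The sketch below uses the built-in mtcars data frame (grouping by cyl and am is an arbitrary choice) to show the default and the effect of .groups = "drop".

```r
library(dplyr)

# Default: the last grouping variable (am) is dropped, cyl remains
res_default <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean_mpg = mean(mpg))

group_vars(res_default)

# .groups = "drop" returns an ungrouped tibble and silences the message
res_drop <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

group_vars(res_drop)
```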

On Your Own

Convert the following line of code to code that uses piping operators and tidyverse functions.

by(diamonds$price, diamonds$color, summary)

## diamonds$color: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1838    3170    4214   18693 
## ------------------------------------------------------------ 
## diamonds$color: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     882    1739    3077    4003   18731 
## ------------------------------------------------------------ 
## diamonds$color: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     342     982    2344    3725    4868   18791 
## ------------------------------------------------------------ 
## diamonds$color: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     354     931    2242    3999    6048   18818 
## ------------------------------------------------------------ 
## diamonds$color: H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337     984    3460    4487    5980   18803 
## ------------------------------------------------------------ 
## diamonds$color: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    3730    5092    7202   18823 
## ------------------------------------------------------------ 
## diamonds$color: J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4234    5324    7695   18710

On Your Own

Explain in plain English what each of the following code chunks does. You may need to examine the help files for some of the functions.

diamonds %>%
  mutate(DxC = depth*carat) %>%
  group_by(cut) %>%
  summarise(AvgDxC = mean(DxC), 
            AvgCut = mean(price))

## # A tibble: 5 × 3
##   cut       AvgDxC AvgCut
##   <ord>      <dbl>  <dbl>
## 1 Fair        67.2  4359.
## 2 Good        52.9  3929.
## 3 Very Good   49.9  3982.
## 4 Premium     54.6  4584.
## 5 Ideal       43.4  3458.

diamonds %>%
  filter(cut == "Ideal") %>%
  select(cut, carat, depth, price)

## # A tibble: 21,551 × 4
##    cut   carat depth price
##    <ord> <dbl> <dbl> <int>
##  1 Ideal  0.23  61.5   326
##  2 Ideal  0.23  62.8   340
##  3 Ideal  0.31  62.2   344
##  4 Ideal  0.3   62     348
##  5 Ideal  0.33  61.8   403
##  6 Ideal  0.33  61.2   403
##  7 Ideal  0.33  61.1   403
##  8 Ideal  0.23  61.9   404
##  9 Ideal  0.32  60.9   404
## 10 Ideal  0.3   61     405
## # … with 21,541 more rows

diamonds %>%
  filter(price > median(price)) %>%
  group_by(color) %>%
  summarize(mean_depth = mean(depth),
            min_depth = min(depth)) %>%
  arrange(mean_depth) %>%
  head(n = 10)

## # A tibble: 7 × 3
##   color mean_depth min_depth
##   <ord>      <dbl>     <dbl>
## 1 D           61.7      55.5
## 2 E           61.7      53.1
## 3 F           61.8      55.4
## 4 G           61.8      43  
## 5 I           61.8      50.8
## 6 H           61.8      54.7
## 7 J           61.9      43

Lauren Cappiello