Pulling Data Into R

Let’s pull in the results for the survey we did on the first day of class. R will automatically create a data frame when we do this. You can pull in data directly from the web:

stdat <- read.csv("https://lgpcappiello.github.io/teaching/stat128/SurveyResponses.csv")

…or you can use the import dataset wizard in RStudio. Data import can get a little complicated, so if the data are stored on your computer, this is what I recommend. You can find this wizard in the Environment tab.
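For a file stored on your computer, the same read.csv call works with a file path in place of the URL. The sketch below is only an illustration: it writes a tiny made-up data frame to a temporary file and reads it back, so the object names (toy, path, toy2) and the columns are hypothetical and not part of the class survey.

```r
# Hypothetical toy data; not the class survey.
toy <- data.frame(name = c("A", "B", "C"), score = c(90, 85, 77))

# Write to a temporary CSV so the example runs on any machine.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read it back in; R creates a data frame automatically.
toy2 <- read.csv(path)
class(toy2)   # "data.frame"
dim(toy2)     # 3 rows, 2 columns
```

With a real file, path would simply be the location of the CSV on your computer.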

Examining Real Data

Usually, we don’t want to try to print out a lot of data right in the RMarkdown file or in the console. To avoid this, we can use the View command:

# View data in a separate window. 
# View(stdat)

We can quickly examine a matrix or data frame using the head function, which by default shows only the first six rows:

# head(stdat)
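A few related functions give similarly quick looks at a dataset. This sketch uses the built-in mtcars data rather than stdat, purely so that it runs on its own.

```r
# mtcars ships with R and is used here only for illustration.
head(mtcars, n = 3)   # first three rows instead of the default six
tail(mtcars, n = 3)   # last three rows
dim(mtcars)           # number of rows and columns
str(mtcars)           # compact overview of each variable's type
```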

We see pretty quickly that there are some issues here. Let’s use what we’ve learned to clean up this dataset. We’ll start with the variable names:

names(stdat)
##  [1] "Major"                                                                                                    
##  [2] "Gender"                                                                                                   
##  [3] "Your.year.in.school"                                                                                      
##  [4] "Your.age"                                                                                                 
##  [5] "Number.of.classes.you.are.taking.in.Fall.2022"                                                            
##  [6] "Number.of.credits.units.you.are.taking.in.Fall.2022"                                                      
##  [7] "Do.you.live.on.campus."                                                                                   
##  [8] "Are.you.working..on.or.off.campus..while.in.school."                                                      
##  [9] "What.is.your.target.grade.in.this.class....If.your.goal.is.just.to.pass..that.s.fine....please.select.C.."
## [10] "Have.you.taken.a.statistics.class.before."                                                                
## [11] "Your.height.in.inches."                                                                                   
## [12] "Your.shoe.size..please.indicate.men.s.or.women.s.sizing."                                                 
## [13] "Do.you.have.any.pets."                                                                                    
## [14] "Are.you.a.parent.or.primary.guardian.to.a.child.under.18."
names(stdat) <- c("major","gender","year","age","classes","units","livecampus","working",
                  "grade","statclass","height","shoe","pets","parent")
names(stdat) # much better!
##  [1] "major"      "gender"     "year"       "age"        "classes"   
##  [6] "units"      "livecampus" "working"    "grade"      "statclass" 
## [11] "height"     "shoe"       "pets"       "parent"
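Replacing every name at once works well here, but it depends on listing the names in exactly the right order. As a hedged alternative, a single column can be renamed by matching its current name. The toy data frame below is made up for illustration.

```r
# Hypothetical toy data frame with awkward column names.
toy <- data.frame(Your.age = c(21, 23), Major = c("Math", "CS"))

# Rename one column by matching its current name.
names(toy)[names(toy) == "Your.age"] <- "age"

# Lowercase every name in one step.
names(toy) <- tolower(names(toy))
names(toy)   # "age" "major"
```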

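Now let’s check out these data. The summary function summarizes each variable in the dataset: numeric variables get a minimum, quartiles, mean, and maximum, while character variables show only their length, class, and mode.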

summary(stdat)
##     major              gender              year                age       
##  Length:26          Length:26          Length:26          Min.   :19.00  
##  Class :character   Class :character   Class :character   1st Qu.:21.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :21.00  
##                                                           Mean   :23.19  
##                                                           3rd Qu.:24.75  
##                                                           Max.   :40.00  
##    classes             units            livecampus          working         
##  Length:26          Length:26          Length:26          Length:26         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     grade            statclass             height          shoe          
##  Length:26          Length:26          Min.   :60.00   Length:26         
##  Class :character   Class :character   1st Qu.:66.00   Class :character  
##  Mode  :character   Mode  :character   Median :68.70   Mode  :character  
##                                        Mean   :68.13                     
##                                        3rd Qu.:70.00                     
##                                        Max.   :76.00                     
##      pets              parent         
##  Length:26          Length:26         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
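For character variables, a table of counts is often more informative than summary. A minimal sketch with made-up responses (the vectors year and age here are hypothetical, not taken from stdat):

```r
# Hypothetical responses, made up for illustration.
year <- c("Senior", "Junior", "Senior", "Senior")
age  <- c(21, 20, 23, 40)

summary(age)    # min, quartiles, median, mean, max
summary(year)   # only length, class, and mode for character data
table(year)     # counts per category
```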

Data Manipulation and Cleaning

Sorting Data

We can sort a vector using the sort function, which defaults to smallest-to-largest.

sort(stdat$age) # default
##  [1] 19 20 20 21 21 21 21 21 21 21 21 21 21 21 23 23 23 24 24 25 25 25 26 27 28
## [26] 40
sort(stdat$age, decreasing = TRUE) # largest-to-smallest
##  [1] 40 28 27 26 25 25 25 24 24 23 23 23 21 21 21 21 21 21 21 21 21 21 21 20 20
## [26] 19

We can get the ordering of a variable using the order function. This returns the indices that would sort the variable, not the ranks of its values.

order(stdat$age)
##  [1] 20 14 23  1  3  4  6  8 12 13 16 18 22 26  5 11 21 15 25 10 17 19  9  7  2
## [26] 24

So the 20th entry of age is the smallest, the 14th entry is the next smallest, and so on.
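The order and rank functions are easy to mix up, so here is a small sketch with made-up values that shows both side by side.

```r
x <- c(30, 10, 20)   # made-up values

order(x)      # 2 3 1: position 2 holds the smallest value, then 3, then 1
rank(x)       # 3 1 2: the first value is the 3rd smallest, and so on
x[order(x)]   # 10 20 30, the same as sort(x)
```

Indexing a vector by its own order is what makes order useful for rearranging data.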

This can then be used to rearrange the data. To rearrange the data frame by the age variable, use the ordering as the row index:

ord1 <- stdat[order(stdat$age),]
head(ord1)
##                                           major gender   year age classes units
## 20                            Computer Science   Male  Junior  19       6    16
## 14                             Computer Science   male Senior  20       5    13
## 23                             Computer Science   Male Junior  20       7    19
## 1                     Computer science and math   male Senior  21       6    15
## 3  Mathematics (with an emphasis in Statistics) Female Senior  21       3     9
## 4                              Computer Science   Male Senior  21       6    16
##    livecampus                       working grade
## 20        Yes Yes, I am currently employed.     A
## 14         No Yes, I am currently employed.     A
## 23         No Yes, I am currently employed.     A
## 1          No Yes, I am currently employed.     A
## 3          No Yes, I am currently employed.     A
## 4          No No, and I am not planning to.     A
##                                                                       statclass
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 1  Yes - I have taken Introductory Statistics before at a college or university
## 3  Yes - I have taken Introductory Statistics before at a college or university
## 4  Yes - I have taken Introductory Statistics before at a college or university
##    height      shoe pets parent
## 20     69      10 M   No     No
## 14     73        10  Yes     No
## 23     70 11.5 Mens   No     No
## 1      66    9 mens   No     No
## 3      63 Women's 7  Yes     No
## 4      66         8   No     No

We can also sort based on one variable and then another. (For character data, R will sort by alphabetical order.)

ord2 <- stdat[order(stdat$age, stdat$major),]
head(ord2)
##                major gender   year age classes units livecampus
## 20 Computer Science   Male  Junior  19       6    16        Yes
## 14  Computer Science   male Senior  20       5    13         No
## 23  Computer Science   Male Junior  20       7    19         No
## 4   Computer Science   Male Senior  21       6    16         No
## 16  Computer Science   Male Senior  21       5    12         No
## 18  Computer Science   Male Senior  21       5    12         No
##                          working grade
## 20 Yes, I am currently employed.     A
## 14 Yes, I am currently employed.     A
## 23 Yes, I am currently employed.     A
## 4  No, and I am not planning to.     A
## 16 Yes, I am currently employed.     A
## 18 Yes, I am currently employed.     A
##                                                                       statclass
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 4  Yes - I have taken Introductory Statistics before at a college or university
## 16 Yes - I have taken Introductory Statistics before at a college or university
## 18 Yes - I have taken Introductory Statistics before at a college or university
##    height      shoe pets parent
## 20     69      10 M   No     No
## 14     73        10  Yes     No
## 23     70 11.5 Mens   No     No
## 4      66         8   No     No
## 16     67  7.5 mens   No     No
## 18     71   13 Mens  Yes     No
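Sometimes one key should run smallest-to-largest and another largest-to-smallest. One common approach for numeric keys, sketched here on a made-up data frame, is to negate the key that should be reversed.

```r
# Hypothetical toy data frame.
toy <- data.frame(name = c("A", "B", "C", "D"),
                  age  = c(21, 20, 21, 20),
                  ht   = c(66, 73, 70, 69))

# Sort by age (ascending), breaking ties by height (descending).
# Negating a numeric key reverses its order.
sorted <- toy[order(toy$age, -toy$ht), ]
sorted
```

Negation only works for numeric keys; character keys need a different approach.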

On Your Own

Use the order function to arrange the data by target grade (grade) and then height.

Reassignment

Now, let’s examine the number of classes you are all taking.

stdat$classes
##  [1] "6"                                       
##  [2] "5"                                       
##  [3] "3"                                       
##  [4] "6"                                       
##  [5] "4"                                       
##  [6] "5"                                       
##  [7] "5"                                       
##  [8] "6-May"                                   
##  [9] "3"                                       
## [10] "4"                                       
## [11] "4"                                       
## [12] "5"                                       
## [13] "3"                                       
## [14] "5"                                       
## [15] "3"                                       
## [16] "5"                                       
## [17] "4"                                       
## [18] "5"                                       
## [19] "4 at the moment, but possibly less (job)"
## [20] "6"                                       
## [21] "4"                                       
## [22] "5"                                       
## [23] "7"                                       
## [24] "4"                                       
## [25] "Currently 6 classes"                     
## [26] "5"
class(stdat$classes)
## [1] "character"

We probably want this to be a numeric variable. There are a few ways to work with text inside strings, including pattern-matching functions such as grepl, but for now we’ll use the simplest, direct approach.

Let’s imagine there’s too much data to do this completely manually. What happens if we try to convert this to a numeric variable?

as.numeric(stdat$classes)
## Warning: NAs introduced by coercion
##  [1]  6  5  3  6  4  5  5 NA  3  4  4  5  3  5  3  5  4  5 NA  6  4  5  7  4 NA
## [26]  5
class.num <- as.numeric(stdat$classes) # save this for now so we can use it later!
## Warning: NAs introduced by coercion

This isn’t what we want, but it does provide us with some useful information! If I can extract the places where R is giving an NA because of characters in the string, I can use that to examine only those entries. The which function takes a logical statement on an object and returns the indices at which the statement is true. We will also use the function is.na, which returns TRUE wherever there is missing data stored as NA.

is.na(class.num) 
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [25]  TRUE FALSE
which(is.na(class.num)) # gets the indices where class.num is missing
## [1]  8 19 25
ind <- which(is.na(class.num))
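The same pair of functions on a tiny made-up vector makes it easier to see what each step returns.

```r
x <- c(3, NA, 7, NA)   # made-up values

is.na(x)          # FALSE TRUE FALSE TRUE
which(is.na(x))   # 2 4
sum(is.na(x))     # 2, since each TRUE counts as 1
which(x > 5)      # 3; positions where the comparison is NA are skipped
```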

So now I know where there are issues. Let’s check those out in the original data (which we have been careful not to change so far!).

stdat$classes[ind]
## [1] "6-May"                                   
## [2] "4 at the moment, but possibly less (job)"
## [3] "Currently 6 classes"

Now what? Since we’ve narrowed this down to only three entries, we can fix this quickly and directly. (There are other ways to do this that don’t require us to look at the data so much, but this is a good exercise!)

Let’s assume that all of the numbers in these strings are the number of classes each person is taking. Then I want to replace these three entries with 6, 4, and 6, respectively. We can modify those entries directly in the vector.

stdat$classes[ind] <- c(6,4,6)
stdat$classes
##  [1] "6" "5" "3" "6" "4" "5" "5" "6" "3" "4" "4" "5" "3" "5" "3" "5" "4" "5" "4"
## [20] "6" "4" "5" "7" "4" "6" "5"
stdat$classes <- as.numeric(stdat$classes) # overwrite the classes vector with the numeric version
stdat$classes
##  [1] 6 5 3 6 4 5 5 6 3 4 4 5 3 5 3 5 4 5 4 6 4 5 7 4 6 5
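When there are too many problem entries to fix by hand, a regular expression is one possible alternative. The sketch below is not the approach used above. It rests on the same assumption (the first number in each string is the number of classes) and works on a made-up copy of the problem entries, not on stdat itself.

```r
# Copies of the three problem entries plus one clean entry.
x <- c("6-May", "4 at the moment, but possibly less (job)",
       "Currently 6 classes", "5")

# Drop any leading non-digits, keep the first run of digits, drop the rest.
first_num <- sub("^[^0-9]*([0-9]+).*$", "\\1", x)
as.numeric(first_num)   # 6 4 6 5
```

A string with no digits at all is left unchanged by sub, so it would still become NA (with a warning) in as.numeric and could be caught with is.na.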

On Your Own

Examine the units variable in stdat. Convert it to numeric, modifying any strings to be numbers. Comment on how you decided to convert the range into a number. (There’s no wrong answer here - sometimes we have to make a decision and we just do the best we can!)

Adding and Removing Variables

To remove something from the R environment, we use rm.

rm(class.num) # we don't need this anymore

We might want to add a variable if, for example, we wanted to do some kind of conversion. Let’s convert height (in inches) to centimeters and store it in the data frame as a new variable.

stdat$cmHt <- stdat$height*2.54
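Note that rm removes whole objects from the environment. To drop a single variable from a data frame, one common idiom is to assign NULL to it. A minimal sketch on a made-up data frame:

```r
# Hypothetical toy data frame.
toy <- data.frame(height = c(60, 70))

toy$cm <- toy$height * 2.54   # add a column
names(toy)                    # "height" "cm"

toy$cm <- NULL                # remove it again
names(toy)                    # "height"
```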

We can also add new columns and rows to a matrix using cbind and rbind, respectively.

m1 <- matrix(1:9, nrow=3, ncol=3)
m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
v1 <- 10:12
m1 <- cbind(m1, v1)
m1
##            v1
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
v2 <- 13:16
m1 <- rbind(m1, v2)
m1
##             v1
##     1  4  7 10
##     2  5  8 11
##     3  6  9 12
## v2 13 14 15 16
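The new column or row has to fit the current shape of the matrix, which is why v2 above has four entries and not three. A short sketch with a made-up matrix that tracks the dimensions:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
m <- cbind(m, c(7, 8))       # a new column needs 2 entries
dim(m)                       # 2 4
m <- rbind(m, 9:12)          # a new row now needs 4 entries
dim(m)                       # 3 4
```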

On Your Own

Create a new variable, unitRatio, in the stdat data frame. This variable should contain the ratio of units to classes (the average units per class for each person).

Selecting Based on a Condition

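We saw this a little bit when we used is.na to select based on the condition that there is missing data. More generally, a logical condition placed in the row position of the square brackets keeps only the rows where the condition is TRUE.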

stdat[stdat$age < 21,]
##                major gender   year age classes units livecampus
## 14  Computer Science   male Senior  20       5    13         No
## 20 Computer Science   Male  Junior  19       6    16        Yes
## 23  Computer Science   Male Junior  20       7    19         No
##                          working grade
## 14 Yes, I am currently employed.     A
## 20 Yes, I am currently employed.     A
## 23 Yes, I am currently employed.     A
##                                                                       statclass
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
##    height      shoe pets parent   cmHt
## 14     73        10  Yes     No 185.42
## 20     69      10 M   No     No 175.26
## 23     70 11.5 Mens   No     No 177.80
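Conditions can also be combined with & (and) and | (or). The sketch below uses a made-up data frame. The subset function is shown as one alternative way to write the same kind of selection.

```r
# Hypothetical toy data frame.
toy <- data.frame(age  = c(19, 25, 20, 40),
                  year = c("Junior", "Senior", "Senior", "Senior"))

both   <- toy[toy$age < 21 & toy$year == "Senior", ]   # both must hold
either <- toy[toy$age < 21 | toy$age > 30, ]           # either may hold
young  <- subset(toy, age < 21)                        # no need to repeat toy$

both
either
young
```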
stdat[stdat$major == "Computer Science",]
##               major gender   year age classes units livecampus
## 4  Computer Science   Male Senior  21       6    16         No
## 5  Computer Science   Male Senior  23       4    12         No
## 14 Computer Science   male Senior  20       5    13         No
## 16 Computer Science   Male Senior  21       5    12         No
## 17 Computer Science      M Senior  25       4    11         No
## 18 Computer Science   Male Senior  21       5    12         No
## 19 Computer Science   Male Senior  25       4    12         No
## 22 Computer Science   Male Senior  21       5    15         No
## 23 Computer Science   Male Junior  20       7    19         No
## 24 Computer Science   Male Senior  40       4    13         No
## 26 Computer Science   Male Senior  21       5    15         No
##                                            working grade
## 4                    No, and I am not planning to.     A
## 5                    Yes, I am currently employed.     B
## 14                   Yes, I am currently employed.     A
## 16                   Yes, I am currently employed.     A
## 17                   Yes, I am currently employed.     B
## 18                   Yes, I am currently employed.     A
## 19 Not currently, but I am planning to find a job.     B
## 22       No, but I plan to work during the summer.     A
## 23                   Yes, I am currently employed.     A
## 24                   Yes, I am currently employed.     A
## 26                   No, and I am not planning to.     A
##                                                                       statclass
## 4  Yes - I have taken Introductory Statistics before at a college or university
## 5                            Yes - a high school class other than AP Statistics
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 16 Yes - I have taken Introductory Statistics before at a college or university
## 17 Yes - I have taken Introductory Statistics before at a college or university
## 18 Yes - I have taken Introductory Statistics before at a college or university
## 19 Yes - I have taken Introductory Statistics before at a college or university
## 22 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 24                                                    Yes - AP or IB Statistics
## 26 Yes - I have taken Introductory Statistics before at a college or university
##    height      shoe pets parent   cmHt
## 4      66         8   No     No 167.64
## 5      69        9M   No     No 175.26
## 14     73        10  Yes     No 185.42
## 16     67  7.5 mens   No     No 170.18
## 17     68      11 M  Yes     No 172.72
## 18     71   13 Mens  Yes     No 180.34
## 19     70       10M   No     No 177.80
## 22     66  10 men's  Yes     No 167.64
## 23     70 11.5 Mens   No     No 177.80
## 24     70        12  Yes     No 177.80
## 26     65       7.5   No     No 165.10
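Exact matching with == is strict: a stray space or different capitalization makes a string fail the test. Row 20 prints as Computer Science in the earlier output yet is absent from the result above, which suggests trailing whitespace in that entry. This is a guess from the printed output and has not been checked against the data. A minimal sketch on made-up strings shows how trimws and tolower can help:

```r
# Made-up strings that look alike but are not identical.
major <- c("Computer Science", "Computer Science ", "computer science")

major == "Computer Science"                    # TRUE FALSE FALSE
trimws(major) == "Computer Science"            # TRUE TRUE FALSE
tolower(trimws(major)) == "computer science"   # TRUE TRUE TRUE
```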

On Your Own

Get all of the data for people whose year is “Junior”.