Let’s pull in the results for the survey we did on the first day of class. R will automatically create a data frame when we do this. You can pull in data directly from the web:
stdat <- read.csv("https://lgpcappiello.github.io/teaching/stat128/SurveyResponses.csv")
…or you can use the import dataset wizard in RStudio. Data import can get a little complicated, so if the data are stored on your computer, this is what I recommend. You can find this wizard in the Environment tab.
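To see what read.csv produces without needing a file or an internet connection, here is a minimal sketch that reads a tiny CSV from a string. The column names and values are made up for illustration; only read.csv and its text argument come from base R.

```r
# Hypothetical CSV contents, supplied inline so the example runs without a file.
csv_text <- "Major,Your age\nStatistics,21\nComputer Science,23"

# read.csv() returns a data frame; the `text` argument reads from a string
# instead of a file path or URL.
toy <- read.csv(text = csv_text)

class(toy)   # "data.frame"
names(toy)   # "Major" "Your.age" (the space became a dot)
str(toy)     # one line per column, with its type
```

The dot appears because read.csv, by default, converts column headers into valid R names. This is why long survey questions tend to come in as long, dot-filled variable names.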
Usually, we don’t want to try to print out a lot of data right in the RMarkdown file or in the console. To avoid this, we can use the View command:
# View data in a separate window.
# View(stdat)
We can quickly examine a matrix or data frame using the head function, which by default shows only the first six rows:
# head(stdat)
We see pretty quickly that there are some issues here. Let’s use what we’ve learned to clean up this dataset. We’ll start with the variable names:
names(stdat)
## [1] "Major"
## [2] "Gender"
## [3] "Your.year.in.school"
## [4] "Your.age"
## [5] "Number.of.classes.you.are.taking.in.Fall.2022"
## [6] "Number.of.credits.units.you.are.taking.in.Fall.2022"
## [7] "Do.you.live.on.campus."
## [8] "Are.you.working..on.or.off.campus..while.in.school."
## [9] "What.is.your.target.grade.in.this.class....If.your.goal.is.just.to.pass..that.s.fine....please.select.C.."
## [10] "Have.you.taken.a.statistics.class.before."
## [11] "Your.height.in.inches."
## [12] "Your.shoe.size..please.indicate.men.s.or.women.s.sizing."
## [13] "Do.you.have.any.pets."
## [14] "Are.you.a.parent.or.primary.guardian.to.a.child.under.18."
names(stdat) <- c("major","gender","year","age","classes","units","livecampus","working",
"grade","statclass","height","shoe","pets","parent")
names(stdat) # much better!
## [1] "major" "gender" "year" "age" "classes"
## [6] "units" "livecampus" "working" "grade" "statclass"
## [11] "height" "shoe" "pets" "parent"
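Replacing every name at once works well here, but sometimes only one or two names need fixing. The sketch below shows one way to rename a single column, either by matching its current name or by position. The toy data frame is hypothetical and is not part of the survey data.

```r
# A small made-up data frame with survey-style column names.
toy <- data.frame(Your.age = c(21, 23), Your.height.in.inches. = c(66, 70))

# Rename one column by matching its current name, leaving the others untouched.
names(toy)[names(toy) == "Your.age"] <- "age"

# Rename a column by position.
names(toy)[2] <- "height"

names(toy)   # "age" "height"
```

Matching by name is usually the safer choice, since it still works if the column order changes.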
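Now let’s check out these data. The summary function summarizes each variable in the dataset: numeric variables get a minimum, quartiles, mean, and maximum, while character variables only get a length, class, and mode.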
summary(stdat)
## major gender year age
## Length:26 Length:26 Length:26 Min. :19.00
## Class :character Class :character Class :character 1st Qu.:21.00
## Mode :character Mode :character Mode :character Median :21.00
## Mean :23.19
## 3rd Qu.:24.75
## Max. :40.00
## classes units livecampus working
## Length:26 Length:26 Length:26 Length:26
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## grade statclass height shoe
## Length:26 Length:26 Min. :60.00 Length:26
## Class :character Class :character 1st Qu.:66.00 Class :character
## Mode :character Mode :character Median :68.70 Mode :character
## Mean :68.13
## 3rd Qu.:70.00
## Max. :76.00
## pets parent
## Length:26 Length:26
## Class :character Class :character
## Mode :character Mode :character
##
##
##
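The summary of a character variable is not very informative. As a rough sketch (using a small made-up data frame, not the survey data), table or a conversion to factor gives counts of each distinct value instead:

```r
# Hypothetical data for illustration only.
toy <- data.frame(
  year = c("Senior", "Junior", "Senior", "Senior"),
  age = c(21, 19, 23, 40),
  stringsAsFactors = FALSE
)

summary(toy$age)     # numeric: min, quartiles, mean, max
summary(toy$year)    # character: only length, class, and mode

year_counts <- table(toy$year)             # counts of each distinct value
year_factor <- summary(factor(toy$year))   # a factor is summarized as counts per level
year_counts
year_factor
```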
We can sort a vector using the sort function, which defaults to smallest-to-largest.
sort(stdat$age) # default
## [1] 19 20 20 21 21 21 21 21 21 21 21 21 21 21 23 23 23 24 24 25 25 25 26 27 28
## [26] 40
sort(stdat$age, decreasing = TRUE) # largest-to-smallest
## [1] 40 28 27 26 25 25 25 24 24 23 23 23 21 21 21 21 21 21 21 21 21 21 21 20 20
## [26] 19
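Two details of sort are easy to miss: it returns a new vector rather than changing the original, and by default it drops missing values. A short sketch with made-up numbers:

```r
x <- c(21, NA, 19, 40)   # hypothetical values with one missing entry

sorted_default <- sort(x)                  # 19 21 40 (the NA is dropped)
sorted_keep_na <- sort(x, na.last = TRUE)  # 19 21 40 NA

sorted_default
sorted_keep_na
x   # unchanged: 21 NA 19 40
```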
We can get the ordering of a variable using the order function. This returns the indices that would put the variable in sorted order, not the ranks of its values.
order(stdat$age)
## [1] 20 14 23 1 3 4 6 8 12 13 16 18 22 26 5 11 21 15 25 10 17 19 9 7 2
## [26] 24
So the 20th entry is the smallest, the 14th entry is the next smallest, etc.
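It is easy to mix up order with rank, so here is a small sketch (with made-up values) that puts the two side by side. order answers "which element comes first, second, ...?", while rank answers "where does each element fall?".

```r
ages <- c(25, 19, 40, 21)   # hypothetical values

ord <- order(ages)   # 2 4 1 3: element 2 is smallest, then element 4, ...
rnk <- rank(ages)    # 3 1 4 2: the first element is the 3rd smallest, ...
ord
rnk

# Indexing by order() reproduces sort().
ages[ord]                          # 19 21 25 40
identical(ages[ord], sort(ages))   # TRUE
```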
These indices can then be used to rearrange the data. To rearrange the data frame based on the age variable:
ord1 <- stdat[order(stdat$age),]
head(ord1)
## major gender year age classes units
## 20 Computer Science Male Junior 19 6 16
## 14 Computer Science male Senior 20 5 13
## 23 Computer Science Male Junior 20 7 19
## 1 Computer science and math male Senior 21 6 15
## 3 Mathematics (with an emphasis in Statistics) Female Senior 21 3 9
## 4 Computer Science Male Senior 21 6 16
## livecampus working grade
## 20 Yes Yes, I am currently employed. A
## 14 No Yes, I am currently employed. A
## 23 No Yes, I am currently employed. A
## 1 No Yes, I am currently employed. A
## 3 No Yes, I am currently employed. A
## 4 No No, and I am not planning to. A
## statclass
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 1 Yes - I have taken Introductory Statistics before at a college or university
## 3 Yes - I have taken Introductory Statistics before at a college or university
## 4 Yes - I have taken Introductory Statistics before at a college or university
## height shoe pets parent
## 20 69 10 M No No
## 14 73 10 Yes No
## 23 70 11.5 Mens No No
## 1 66 9 mens No No
## 3 63 Women's 7 Yes No
## 4 66 8 No No
We can also sort based on one variable and then another. (For character data, R will sort by alphabetical order.)
ord2 <- stdat[order(stdat$age, stdat$major),]
head(ord2)
## major gender year age classes units livecampus
## 20 Computer Science Male Junior 19 6 16 Yes
## 14 Computer Science male Senior 20 5 13 No
## 23 Computer Science Male Junior 20 7 19 No
## 4 Computer Science Male Senior 21 6 16 No
## 16 Computer Science Male Senior 21 5 12 No
## 18 Computer Science Male Senior 21 5 12 No
## working grade
## 20 Yes, I am currently employed. A
## 14 Yes, I am currently employed. A
## 23 Yes, I am currently employed. A
## 4 No, and I am not planning to. A
## 16 Yes, I am currently employed. A
## 18 Yes, I am currently employed. A
## statclass
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 4 Yes - I have taken Introductory Statistics before at a college or university
## 16 Yes - I have taken Introductory Statistics before at a college or university
## 18 Yes - I have taken Introductory Statistics before at a college or university
## height shoe pets parent
## 20 69 10 M No No
## 14 73 10 Yes No
## 23 70 11.5 Mens No No
## 4 66 8 No No
## 16 67 7.5 mens No No
## 18 71 13 Mens Yes No
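As a minimal sketch of how the tie-breaking works, here is the same idea on a small made-up data frame. It also shows one common trick, negating a numeric variable, to sort that variable from largest to smallest while leaving the second variable alphabetical.

```r
# Hypothetical data for illustration only.
toy <- data.frame(
  major = c("Statistics", "Computer Science", "Mathematics", "Computer Science"),
  age = c(21, 23, 21, 21),
  stringsAsFactors = FALSE
)

# Sort by age, breaking ties alphabetically by major.
by_age_major <- toy[order(toy$age, toy$major), ]
by_age_major      # rows 4, 3, 1, 2

# Oldest first, then major: negating a numeric variable reverses its order.
oldest_first <- toy[order(-toy$age, toy$major), ]
oldest_first      # rows 2, 4, 3, 1
```

The row names on the left of the printed output are the original row numbers, which makes it easy to see how the rows moved.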
Use the order function to arrange the data by target grade (grade) and then height.
Now, let’s examine the number of classes you are all taking.
stdat$classes
## [1] "6"
## [2] "5"
## [3] "3"
## [4] "6"
## [5] "4"
## [6] "5"
## [7] "5"
## [8] "6-May"
## [9] "3"
## [10] "4"
## [11] "4"
## [12] "5"
## [13] "3"
## [14] "5"
## [15] "3"
## [16] "5"
## [17] "4"
## [18] "5"
## [19] "4 at the moment, but possibly less (job)"
## [20] "6"
## [21] "4"
## [22] "5"
## [23] "7"
## [24] "4"
## [25] "Currently 6 classes"
## [26] "5"
class(stdat$classes)
## [1] "character"
We probably want this to be a numeric variable. There are a few ways to find and pull things out of strings, including pattern-matching functions like grepl, but for now we’ll use the simplest, direct approach.
Let’s imagine there’s too much data to do this completely manually. What happens if we try to convert this to a numeric variable?
as.numeric(stdat$classes)
## Warning: NAs introduced by coercion
## [1] 6 5 3 6 4 5 5 NA 3 4 4 5 3 5 3 5 4 5 NA 6 4 5 7 4 NA
## [26] 5
class.num <- as.numeric(stdat$classes) # save this for now so we can use it later!
## Warning: NAs introduced by coercion
This isn’t what we want, but it does provide us with some useful information! If I can extract the places where R is giving an NA because of characters in the string, I can use that to examine only those entries. The which function takes a logical statement on an object and returns the indices at which the statement is true. We will also use the function is.na, which returns TRUE wherever there is missing data stored as NA.
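Here is a quick sketch of how is.na and which behave on a small made-up vector:

```r
x <- c(6, NA, 4, NA, 5)   # hypothetical values

is.na(x)                       # FALSE TRUE FALSE TRUE FALSE
missing_idx <- which(is.na(x)) # 2 4
n_missing <- sum(is.na(x))     # 2, since each TRUE counts as 1
big_idx <- which(x > 4)        # 1 5; which() skips the NA comparisons

missing_idx
n_missing
big_idx
```

The same pattern applies to the converted classes vector: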
is.na(class.num)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [25] TRUE FALSE
which(is.na(class.num)) # gets the indices where class.num is missing
## [1] 8 19 25
ind <- which(is.na(class.num))
So now I know where there are issues. Let’s check those out in the original data (which we have been careful not to change so far!).
stdat$classes[c(ind)]
## [1] "6-May"
## [2] "4 at the moment, but possibly less (job)"
## [3] "Currently 6 classes"
Now what? Since we’ve narrowed this down to only three entries, we can fix this quickly and directly. (There are other ways to do this that don’t require us to look at the data so much, but this is a good exercise!)
Let’s assume that all of the numbers in these strings are the number of classes each person is taking. Then I want to replace these three entries with 6, 4, and 6, respectively. We can modify those entries directly in the vector.
stdat$classes[c(ind)] <- c(6,4,6)
stdat$classes
## [1] "6" "5" "3" "6" "4" "5" "5" "6" "3" "4" "4" "5" "3" "5" "3" "5" "4" "5" "4"
## [20] "6" "4" "5" "7" "4" "6" "5"
stdat$classes <- as.numeric(stdat$classes) # overwrite the classes vector with the numeric version
stdat$classes
## [1] 6 5 3 6 4 5 5 6 3 4 4 5 3 5 3 5 4 5 4 6 4 5 7 4 6 5
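For larger datasets, fixing entries by hand does not scale. One possible alternative, sketched below under the same assumption (the first number in each string is the number of classes), is to pull out the first run of digits with base R's regexpr and regmatches. The example vector copies a few of the messy entries above; this is just one approach, not the only one.

```r
classes <- c("6", "6-May",
             "4 at the moment, but possibly less (job)",
             "Currently 6 classes")

# regexpr() locates the first run of digits in each string;
# regmatches() pulls out the matched text.
first_number <- regmatches(classes, regexpr("[0-9]+", classes))
classes_num <- as.numeric(first_number)
classes_num   # 6 6 4 6
```

One caution: regmatches drops any string that contains no digits at all, so the result can be shorter than the input. It is worth comparing lengths before writing the result back into a data frame.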
Examine the units variable in stdat. Convert it to numeric, modifying any strings to be numbers. Comment on how you decided to convert the range into a number. (There’s no wrong answer here - sometimes we have to make a decision and we just do the best we can!)
To remove something from the R environment, we use rm.
rm(class.num) # we don't need this anymore
We might want to add a variable if, for example, we wanted to do some kind of conversion. Let’s convert height (in inches) to centimeters and store it in the data frame as a new variable.
stdat$cmHt <- stdat$height*2.54
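The calculation is vectorized, so a single line converts every row at once. A minimal sketch on a made-up data frame (the mHt column is a hypothetical extra, added only to show the double-bracket form):

```r
toy <- data.frame(height = c(66, 70, 63))   # hypothetical heights in inches

toy$cmHt <- toy$height * 2.54    # applied to every row: 167.64 177.80 160.02
toy[["mHt"]] <- toy$cmHt / 100   # [[ ]] with a quoted name also creates a column
toy
```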
We can also add new columns and rows to a matrix using cbind and rbind, respectively.
m1 <- matrix(1:9, nrow=3, ncol=3)
m1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
v1 <- 10:12
m1 <- cbind(m1, v1)
m1
## v1
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
v2 <- 13:16
m1 <- rbind(m1, v2)
m1
## v1
## 1 4 7 10
## 2 5 8 11
## 3 6 9 12
## v2 13 14 15 16
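Notice that the new column needed one value per row, and the new row needed one value per column. Here is a small sketch of that bookkeeping on a made-up matrix, using dim to check the result. It also shows that a named argument labels the new column.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # 2 rows, 3 columns

m_wide <- cbind(m, c(7, 8))         # new column: one value per row (2)
m_tall <- rbind(m, c(10, 11, 12))   # new row: one value per column (3)

dim(m_wide)   # 2 4
dim(m_tall)   # 3 3

# A named argument labels the new column.
m_total <- cbind(m, total = rowSums(m))
m_total
```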
Create a new variable, unitRatio, in the stdat data frame. This variable should contain the ratio of units to classes (the average units per class for each person).
We can also select rows of a data frame based on a condition. We saw this a little bit when we used is.na to select based on the condition that there is missing data.
stdat[stdat$age < 21,]
## major gender year age classes units livecampus
## 14 Computer Science male Senior 20 5 13 No
## 20 Computer Science Male Junior 19 6 16 Yes
## 23 Computer Science Male Junior 20 7 19 No
## working grade
## 14 Yes, I am currently employed. A
## 20 Yes, I am currently employed. A
## 23 Yes, I am currently employed. A
## statclass
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 20 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## height shoe pets parent cmHt
## 14 73 10 Yes No 185.42
## 20 69 10 M No No 175.26
## 23 70 11.5 Mens No No 177.80
stdat[stdat$major == "Computer Science",]
## major gender year age classes units livecampus
## 4 Computer Science Male Senior 21 6 16 No
## 5 Computer Science Male Senior 23 4 12 No
## 14 Computer Science male Senior 20 5 13 No
## 16 Computer Science Male Senior 21 5 12 No
## 17 Computer Science M Senior 25 4 11 No
## 18 Computer Science Male Senior 21 5 12 No
## 19 Computer Science Male Senior 25 4 12 No
## 22 Computer Science Male Senior 21 5 15 No
## 23 Computer Science Male Junior 20 7 19 No
## 24 Computer Science Male Senior 40 4 13 No
## 26 Computer Science Male Senior 21 5 15 No
## working grade
## 4 No, and I am not planning to. A
## 5 Yes, I am currently employed. B
## 14 Yes, I am currently employed. A
## 16 Yes, I am currently employed. A
## 17 Yes, I am currently employed. B
## 18 Yes, I am currently employed. A
## 19 Not currently, but I am planning to find a job. B
## 22 No, but I plan to work during the summer. A
## 23 Yes, I am currently employed. A
## 24 Yes, I am currently employed. A
## 26 No, and I am not planning to. A
## statclass
## 4 Yes - I have taken Introductory Statistics before at a college or university
## 5 Yes - a high school class other than AP Statistics
## 14 Yes - I have taken Introductory Statistics before at a college or university
## 16 Yes - I have taken Introductory Statistics before at a college or university
## 17 Yes - I have taken Introductory Statistics before at a college or university
## 18 Yes - I have taken Introductory Statistics before at a college or university
## 19 Yes - I have taken Introductory Statistics before at a college or university
## 22 Yes - I have taken Introductory Statistics before at a college or university
## 23 Yes - I have taken Introductory Statistics before at a college or university
## 24 Yes - AP or IB Statistics
## 26 Yes - I have taken Introductory Statistics before at a college or university
## height shoe pets parent cmHt
## 4 66 8 No No 167.64
## 5 69 9M No No 175.26
## 14 73 10 Yes No 185.42
## 16 67 7.5 mens No No 170.18
## 17 68 11 M Yes No 172.72
## 18 71 13 Mens Yes No 180.34
## 19 70 10M No No 177.80
## 22 66 10 men's Yes No 167.64
## 23 70 11.5 Mens No No 177.80
## 24 70 12 Yes No 177.80
## 26 65 7.5 No No 165.10
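Conditions can also be combined. The sketch below uses a small made-up data frame to show "and" (&), "or" (|), and how to keep only some columns. The data and the variable names young_cs, extremes, and young_cols are hypothetical.

```r
# Hypothetical data for illustration only.
toy <- data.frame(
  major = c("Computer Science", "Statistics", "Computer Science", "Mathematics"),
  age = c(20, 21, 25, 19),
  stringsAsFactors = FALSE
)

# The condition is a logical vector with one TRUE/FALSE per row.
toy$age < 21   # TRUE FALSE FALSE TRUE

# & means "and", | means "or"; leaving the spot after the comma empty keeps all columns.
young_cs <- toy[toy$major == "Computer Science" & toy$age < 21, ]
extremes <- toy[toy$age < 20 | toy$age > 24, ]

# Name columns after the comma to keep only those.
young_cols <- toy[toy$age < 21, c("major", "age")]

young_cs     # row 1
extremes     # rows 3 and 4
young_cols   # rows 1 and 4
```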
Get all of the data for people whose year is “Junior”.