For this lab, we will work with a random sample of the 2000 Behavioral Risk Factor Surveillance System (BRFSS), an annual telephone survey of 350,000 people in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging health trends. The website (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
source("http://www.openintro.org/stat/data/cdc.R")
hist(cdc$height)
Let’s check out the help file to see how we can control the histogram.
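As a small sketch of how to get to that help file and see which arguments are available (the variable name hist_args is made up for illustration):

```r
# Open the help page for hist() when working interactively.
if (interactive()) {
  help("hist")   # same as typing ?hist at the console
}

# List the argument names of the default hist() method without opening help.
hist_args <- names(formals(getS3method("hist", "default")))
print(hist_args)
```

Arguments such as breaks, freq, main and xlab should appear in the printed list; the help page explains what each one does.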
Most base R plotting functions share many of the same arguments (all of which have reasonable defaults):
main: the main title that appears at the top of the plot
xlab and ylab: the x- and y-axis titles, respectively
xlim and ylim: the range of x- and y-values
axes: whether to draw the axes on the plot (usually defaults to TRUE)
# notice that I can break up the function call into multiple lines at commas
hist(cdc$height,
main = "Histogram of Heights, 2000 BRFSS Data",
xlab = "Height (in)")
What can we do that’s specific to histograms? Let’s go back to the help file.
hist(cdc$height, breaks=20, freq=TRUE,
main = "Histogram of Heights, 2000 BRFSS Data",
xlab = "Height (in)",
border = "purple", col = "pink")
Colors can be specified in several ways:
Color names: the colors() function lists the names R recognizes (or Google them).
Hex codes of the form #RRGGBB, where RR is for red, GG for green and BB for blue, and each value ranges from 00 to FF. Usually we get these from Google too.
rgb() allows us to specify the red, green and blue components with numbers between 0 and 1. This function returns the corresponding hex code.
Palette functions: rainbow(), heat.colors(), terrain.colors(), topo.colors() and cm.colors(). We pass in the number of colors that we want.
Create a histogram of weight. Give it an appropriate title and axis labels; make it have 15 breaks; and add some color.
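The different ways of specifying a color can be tried out directly at the console. This is a minimal sketch; the variable names red_hex, red_faint and pal are made up for illustration.

```r
# Built-in color names
head(colors())          # the first few names R recognizes
"pink" %in% colors()    # TRUE

# rgb() converts red/green/blue components (each between 0 and 1) to a hex code
red_hex <- rgb(1, 0, 0)
red_hex                 # "#FF0000"

# An optional fourth component sets the opacity (alpha)
red_faint <- rgb(1, 0, 0, 0.25)

# Palette functions return a vector with the requested number of colors
pal <- heat.colors(5)
barplot(rep(1, 5), col = pal, main = "heat.colors(5)", yaxt = "n")
```

Any of these values can be passed to col or border in a plotting call.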
hist(cdc$height[cdc$gender == "m"], breaks=20,
main = "Male and Female Heights, 2000 BRFSS Data",
xlab = "Height (in)",
# ylim = c(0,3000),
col=rgb(1,0,0,0.25))
hist(cdc$height[cdc$gender == "f"], breaks=20,
col=rgb(0,0,1,0.25),
add=TRUE) # add to existing plot
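When two histograms are overlaid, the second one can be clipped if its bars are taller than the first plot's y-axis, and the bins of the two may not line up. One way to handle both is sketched below with simulated heights (m_height and f_height are made-up vectors, not the cdc data): compute each histogram with plot = FALSE on shared breaks, then draw them with a common ylim.

```r
set.seed(1)
m_height <- rnorm(500, mean = 70, sd = 3)   # simulated, for illustration only
f_height <- rnorm(500, mean = 64, sd = 3)

# Shared one-inch bins that cover both groups
brks <- seq(floor(min(m_height, f_height)),
            ceiling(max(m_height, f_height)), by = 1)

# Compute the histograms without drawing them
h_m <- hist(m_height, breaks = brks, plot = FALSE)
h_f <- hist(f_height, breaks = brks, plot = FALSE)

# Use the tallest bar across both groups as the y-axis limit
y_max <- max(h_m$counts, h_f$counts)

plot(h_m, col = rgb(1, 0, 0, 0.25), ylim = c(0, y_max),
     main = "Simulated Heights by Group", xlab = "Height (in)")
plot(h_f, col = rgb(0, 0, 1, 0.25), add = TRUE)   # add to existing plot
```

The fourth argument to rgb() is the opacity, which is what lets the overlapping bars show through each other.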
boxplot(cdc$weight, main = "Boxplot of Weight", ylab = "Weight (lbs)")
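For boxplots, recall: the box spans the first to third quartiles (the IQR), the line inside the box marks the median, the whiskers reach to the most extreme observations within 1.5 × IQR of the box, and points beyond the whiskers are plotted individually as potential outliers.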
One can use boxplots to compare different groups using the ~ character. On the left side of ~ is the numeric variable, and on the right side of ~ is a grouping variable (character, logical, factor).
boxplot(cdc$weight ~ cdc$gender,
main = "",
xlab = "Gender",
ylab = "Weight (lbs)")
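The formula can also be written with bare column names by passing the data frame through the data argument. A minimal sketch on a simulated data frame (toy and its columns are made up for illustration, not the cdc data):

```r
set.seed(1)
toy <- data.frame(
  gender = rep(c("m", "f"), each = 50),
  weight = c(rnorm(50, mean = 190, sd = 30),   # simulated weights
             rnorm(50, mean = 150, sd = 30))
)

# numeric variable ~ grouping variable, with columns looked up in `toy`
bp <- boxplot(weight ~ gender, data = toy,
              main = "Simulated Weight by Gender",
              xlab = "Gender",
              ylab = "Weight (lbs)")

bp$names   # the group labels, in the order the boxes are drawn
bp$stats   # one column of five-number-summary values per box
```

boxplot() returns the statistics it drew invisibly, so storing the result is a convenient way to check the numbers behind the picture.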
Create side-by-side boxplots of wtdesire broken down by gender. Make sure to give your graph an appropriate title and axis labels.
We can create scatter plots using the generic plot function. (Generic meaning it will create different plots depending on the data input.) To create a scatterplot:
plot(cdc$height, cdc$weight) # plot(X, Y)
If we examine the help file, we can see a variety of options, many of which are the same ones we saw for histograms.
Create a scatter plot of weight versus wtdesire. Make sure to give your graph an appropriate title and axis labels.
Bar charts allow us to visualize categorical data (like pie charts but in a way that’s actually useful). As an argument, they take in tabled information about a categorical variable.
freq.tab <- table(cdc$genhlth)/nrow(cdc) # convert frequency table to proportions
barplot(freq.tab, main = "Barplot of General Health",
ylab = "Proportion",
col = "yellow")
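Dividing a table by the number of rows is one way to get proportions; prop.table() does the same thing. A minimal sketch on a small made-up vector (health is hypothetical, not the cdc data):

```r
# A made-up categorical variable, for illustration only
health <- c("good", "very good", "good", "fair",
            "excellent", "good", "very good", "poor")

counts <- table(health)        # frequency table
props  <- prop.table(counts)   # same as counts / length(health)

barplot(props, main = "Barplot of a Made-up Health Variable",
        ylab = "Proportion",
        col = "yellow")
```

For a character vector the bars appear in alphabetical order; converting the variable to a factor with explicit levels is one way to control the order.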
Create a barplot of gender. Make sure to give your graph an appropriate title and axis labels.
We can add a line to an existing plot, to draw attention to specific values or provide additional insight/context, using the function abline.
The abline function has several main uses:
A line with intercept a and slope b: abline(a,b)
The same line, given as a coefficient vector: abline(coef = c(a,b))
A vertical line at x: abline(v = x)
A horizontal line at y: abline(h = y)
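These forms of abline can be seen side by side on a small simulated data set. This is only a sketch; x, y, a and b are made up for illustration.

```r
set.seed(42)
a <- 2                       # intercept (made up)
b <- 3                       # slope (made up)
x <- runif(50, min = 0, max = 10)
y <- a + b * x + rnorm(50)

plot(x, y, main = "Forms of abline()")

abline(a, b, col = "red", lwd = 2)              # intercept and slope
abline(coef = c(a, b), col = "black", lty = 2)  # same line, as a vector
abline(v = mean(x), col = "blue")               # vertical line at mean of x
abline(h = mean(y), col = "darkgreen")          # horizontal line at mean of y
```

The dashed black line is drawn on top of the red one because both calls describe the same line.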
hist(cdc$weight, breaks=20, main="Distribution of Weight",
xlab="Weight",
border = "mediumpurple4",
col = "mediumpurple1")
abline(v=mean(cdc$weight),
col="mediumblue")
# Add a line on scatter plot
plot(cdc$height, cdc$weight,
xlab = "Height",
ylab = "Weight",
col = "darkblue")
# Add a thicker solid line
abline(h = median(cdc$weight), col = "red", lwd = 2)
# Add a dashed line
abline(v = median(cdc$height), col = "red", lty = 2)
The regression line that uses height to predict weight is \[\text{weight} = -192.74 + 5.40\times\text{height}\] Add this line to the scatterplot below.
plot(cdc$height, cdc$weight,
xlab = "Height",
ylab = "Weight",
col = "darkblue")
Note: as an aside, the command used to generate the coefficients for this regression line is:
reg1 <- lm(cdc$weight ~ cdc$height)
reg1$coef
## (Intercept) cdc$height
## -192.741582 5.394595
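To see what the rounded equation says in practice, it can be wrapped in a small helper and evaluated at a few heights. The function name predict_weight is made up for illustration.

```r
# Rounded regression equation from the text: weight = -192.74 + 5.40 * height
predict_weight <- function(height) {
  -192.74 + 5.40 * height
}

predict_weight(70)             # 185.26
predict_weight(c(60, 65, 70))  # one prediction per height
```

Each additional inch of height raises the predicted weight by 5.40, which is the slope of the line.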
Sometimes we want a line that connects two or more points, rather than a vertical, horizontal or sloped line that runs across the whole plot. To do this we can use the lines() function.
Add a line on scatter plot:
plot(cdc$height, cdc$weight,
xlab = "Height",
ylab = "Weight",
col = "darkblue")
# Add line connecting two points
lines(x = c(55, 84),
y = c(400, 200), col = "red", lwd = 2)
# Add line connecting a series of points:
lines(x = c(50, 60, 70, 80, 90),
y = c(100, 350, 225, 300, 425), col = "green", lwd = 2)
We can also add points to any base R graph using the points() function.
# Add points to a scatter plot
plot(cdc$height, cdc$weight,
xlab = "Height",
ylab = "Weight",
col = "darkblue")
# Add a single point at the mean height and mean weight
points(mean(cdc$height), mean(cdc$weight), col = "red", pch = 16)
# Sample random points to plot
set.seed(62)
random_index = sample(1:nrow(cdc), 20)
points(cdc$height[random_index], cdc$weight[random_index],
col = "yellow", pch = 8)
par Help File
We’ve seen, but haven’t discussed, a variety of other graphical arguments that can be used to adjust our plots.
lty: line type (dotted, dashed, etc.)
lwd: the line width (a positive number, defaults to 1)
pch: the point type (open circles, dots, squares, etc.)
The par help file discusses these options along with many others.
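A quick way to see what the numeric codes look like is to draw them. The sketch below plots the point types 0 through 25 and the line types 1 through 6; pch_values and lty_values are made-up names.

```r
# Point types: one symbol per pch code
pch_values <- 0:25
plot(pch_values, rep(1, length(pch_values)), pch = pch_values,
     yaxt = "n", xlab = "pch value", ylab = "",
     main = "Point types (pch)")

# Line types: one horizontal line per lty code
lty_values <- 1:6
plot(1, type = "n", xlim = c(0, 1), ylim = c(0.5, 6.5),
     xaxt = "n", xlab = "", ylab = "lty value",
     main = "Line types (lty)")
abline(h = lty_values, lty = lty_values, lwd = 2)
```

Here lwd = 2 makes each line twice the default width so the dash patterns are easier to tell apart.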
If necessary, we can also add a legend to a plot using the legend function.
hist(cdc$weight, breaks=20, main="Distribution of Weight",
xlab="Weight (lbs)",
border = "mediumpurple4",
col = "mediumpurple1")
abline(v=mean(cdc$weight),
col="mediumblue",
lty=2,
lwd=2)
legend("topright", # a position keyword; x and y coordinates also work
legend = c("Mean Weight"),
lty = 2,
col = "mediumblue",
lwd = 2)
Another example:
plot(cdc$height, cdc$weight,
xlab = "Height",
ylab = "Weight",
col = "darkblue")
# Make a solid line
abline(h = median(cdc$weight), col = "red", lwd = 2)
# Add a single point
points(mean(cdc$height), mean(cdc$weight), col = "red", pch = 16)
# Sample random points to plot
set.seed(62)
random_index = sample(1:nrow(cdc), 20)
points(cdc$height[random_index], cdc$weight[random_index],
col = "orange", pch = 8)
# Make a legend
legend("topleft",
legend = c("All Data Values",
"Mean",
"Sample Data",
"Median Weight"),
pch = c(1, 16, 8, NA),
col = c("darkblue", "red", "orange", "red"),
lty = c(NA, NA, NA, 1),
lwd = c(NA, NA, NA, 2))
In general, good plots communicate a clear message. The less someone has to know before they can understand your plot, the better!