We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
We see that there are quite a few variables in the data set, enough
to do a very in-depth analysis. For this lab, we’ll restrict our
attention to just two of the variables: the above ground living area of
the house in square feet (Gr.Liv.Area
) and the sale price
(SalePrice
).
area
and price
to represent
Gr.Liv.Area
and SalePrice
. The first one is
done for you below (You can add a line below it to do the second
variable):area <- ames$Gr.Liv.Area
Next, look at the distribution of area
in our
population of home sales by making a histogram. Also look at some of the
summary statistics of area
by using the
summary()
command.
Does this data seem normally distributed? Use the command
qqnorm()
on area
to confirm.
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.
samp1 <- sample(area, 50)
This command collects a simple random sample of size 50 from area,
which is assigned to samp1
. This is like going into the
City Assessor’s database and pulling up the files on 50 random home
sales. Working with these 50 files would be considerably simpler than
working with all 2930 home sales.
Describe the distribution of the sample (use a histogram to visualize it). Since this is a random sample, your distribution should look different than those of the people around you.
If we’re interested in estimating the average living area in
homes in Ames using the sample, our best single guess is the sample
mean. Find it below using the mean
function:
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
Take a second sample, also of size 50, and call it
samp2
. You can do this by using the exact code as above.
Each time your run the sample function, it gives you a different
sample!
How does the mean of samp2
compare with the mean of
samp1
?
Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean? Why?
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each. You can do this by running the code below:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means50
now has 5000 samples
obtained by repeatedly taking samples of size 50 from area
and taking the mean. Look at the distribution of the sample means in the
space below. How are the means distributed? Make a histogram to check
this.The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100 by running the code below.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
par(mfrow=c(3,1))
hist(sample_means10, main="Samples of Size 10", xlim=c(1000, 2400), xlab="Means", freq = F)
hist(sample_means50, main="Samples of Size 50", xlim=c(1000, 2400), xlab="Means", freq = F)
hist(sample_means100, main="Samples of Size 100", xlim=c(1000, 2400), xlab="Means", freq = F)
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
Take a random sample of size 50 from price
. Using
this sample, what is your best point estimate of the population
mean?
Since you have access to the population, simulate the sampling
distribution for price
by taking 5000 samples from the
population of size 50 and computing 5000 sample means. Use the code we
ran for area
to get started.
Make a histogram of the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the actual population mean by finding the mean of the price variable.