Descriptive Statistics

This project is due by Sunday, March 14, 2021 at 11:59 PM.

This project uses a dataset with 517 observations on forest fires in the northeast region of Portugal. The dataset is available through the UC Irvine Machine Learning Repository. To load the data into R, use the following command:

source("http://lgpcappiello.github.io/teaching/stat1/project1data.R")

The data are now available under the name forestfires.

The UC Irvine Machine Learning Repository provides the following information about the variables in this dataset. I have filled in some additional information in brackets [ ]. This kind of information is often provided separately from the data in what is called a “data dictionary”.

Citation: P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.

Part 1: Full Dataset

  1. Which variables are qualitative? Which are quantitative? If there are any variables you are unsure of, discuss your hesitations.

  2. Create a histogram of area. Describe its skew and modality.

  3. Create a boxplot of ISI. Are there any outliers? What does the boxplot suggest about the skew of this variable?

  4. Convert temp from degrees Celsius to degrees Fahrenheit. Store the result in a new object called forestfires$fahr. The equation for converting from Celsius to Fahrenheit is \[ \text{temperature in Fahrenheit}=1.8\times(\text{temperature in Celsius})+32 \]

  5. Obtain the mean, median, standard deviation, and interquartile range of fahr. Which measure of center do you recommend? Which measure of variability? Explain your recommendations.

Note that you can calculate \(\sqrt{x}\) in R using the command

sqrt(x)

  6. Obtain side-by-side boxplots of fahr by month. What do you observe about the centers and spreads of the monthly temperature recordings? Summarize your observations in 3-4 sentences.
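
The conversion formula and the summary measures named in this part can be tried out on a small example first. The sketch below is only an illustration: the data frame toy and its values are made up and are not taken from forestfires. It assumes base R only.

```r
# Toy data frame with made-up Celsius values (NOT the forestfires data).
toy <- data.frame(temp = c(0, 10, 20, 25, 100))

# Apply the conversion formula to every row and store it as a new column.
toy$fahr <- 1.8 * toy$temp + 32
toy

# Measures of center.
mean(toy$fahr)
median(toy$fahr)

# Measures of variability.
sd(toy$fahr)
IQR(toy$fahr)

# The standard deviation is the square root of the variance,
# so this gives the same value as sd(toy$fahr).
sqrt(var(toy$fahr))
```

As a sanity check on the formula, 0 degrees Celsius should come out as 32 degrees Fahrenheit and 100 degrees Celsius as 212 degrees Fahrenheit.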

Part 2: Burn Days Only

Next, use the following command to get only the observations where some of the forest burned (when burned area is greater than zero):

burns <- forestfires[forestfires$area > 0,]

These data are available under the name burns and include 270 observations. Answer the following questions using the burns data.
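
If the subsetting command above is unfamiliar, the following sketch shows how it works on a tiny example. The data frame toy and its values are made up for illustration and are not part of the project data.

```r
# Toy data frame with made-up values (NOT the real forestfires data).
toy <- data.frame(
  month = c("aug", "aug", "sep", "mar", "sep"),
  area  = c(0, 2.5, 0, 10.1, 0.4)
)

# toy$area > 0 is a logical vector with one TRUE/FALSE per row.
keep <- toy$area > 0
keep

# Rows where keep is TRUE are retained; leaving the slot after the
# comma empty keeps all columns.
toy_burns <- toy[keep, ]
toy_burns

# Number of rows that remain after subsetting.
nrow(toy_burns)
```

On the real data, nrow(burns) is a quick way to confirm that the subset has the 270 observations stated above.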

  1. Obtain a barplot and relative frequency distribution of month. What is the mode? Are the frequencies of months with burn days what you would expect based on your boxplots in (6)? Explain.

  2. Use a scatterplot to examine the relationship between RH and fahr. Describe the relationship between these two variables. If these data had been collected in Sacramento, do you think we would see a similar relationship? Explain your thoughts.

  3. Ask one additional question about these data that you can answer using descriptive statistics and/or data visualization. State clearly what question you are asking and which dataset you are interested in (forestfires or burns). Finally, use descriptive statistics to answer your question. Bonus points may be awarded for especially interesting questions or questions that arise as a result of doing additional research into what these variables represent.
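
For the relative frequency distribution and mode asked about in the first question of this part, one possible approach in base R is sketched below. The vector month here is a made-up toy example, not the month variable from burns.

```r
# Toy month labels, made up for illustration (NOT the burns data).
month <- c("aug", "sep", "aug", "mar", "sep", "aug")

# Frequency of each category.
counts <- table(month)
counts

# Relative frequencies: each count divided by the total number of observations.
rel_freq <- prop.table(counts)
rel_freq

# The mode is the category (or categories) with the highest count.
names(counts)[counts == max(counts)]
```

Passing the table of counts to barplot(), as in barplot(counts), draws the corresponding bar chart.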