This project is due by Sunday, March 14, 2021 at 11:59 PM.
This project will utilize a dataset with 517 observations on forest fires in the northeast region of Portugal. This dataset is available through the UC Irvine Machine Learning Repository. To enter the data in R, use the following command:
source("http://lgpcappiello.github.io/teaching/stat1/project1data.R")
The data are now available under the name forestfires.
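Once the script has run, it can help to confirm that the data frame looks as expected. The sketch below uses a small made-up data frame named toy (hypothetical, not part of the project data) so that it runs without an internet connection. With the real data, forestfires would take its place.

```r
# Hypothetical stand-in for forestfires: a tiny data frame with made-up values.
toy <- data.frame(
  month = c("mar", "aug", "aug", "sep"),
  temp  = c(8.2, 22.8, 25.1, 19.4),
  area  = c(0, 0, 1.5, 12.3)
)

dim(toy)    # number of rows (observations) and columns (variables)
names(toy)  # variable names
str(toy)    # how R stores each variable (character, numeric, ...)
head(toy)   # first few rows
```

How R stores a variable is only a starting point. Whether a variable is best treated as qualitative or quantitative still depends on what it measures.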
The UC Irvine Machine Learning Repository provides the following information about the variables in this dataset. I have filled in some additional information in brackets [ ]. This kind of information is often provided separately from the data in what is called a “data dictionary”.
X: x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y: y-axis spatial coordinate within the Montesinho park map: 2 to 9
month: month of the year: ‘jan’ to ‘dec’
day: day of the week: ‘mon’ to ‘sun’
FFMC: FFMC [Fine Fuel Moisture Code] index from the FWI [Fire Weather Index] system: 18.7 to 96.20
DMC: DMC [Duff Moisture Code] index from the FWI system: 1.1 to 291.3
DC: DC [Drought Code] index from the FWI system: 7.9 to 860.6
ISI: ISI [Initial Spread Index] from the FWI system: 0.0 to 56.10
temp: temperature in degrees Celsius: 2.2 to 33.30
RH: relative humidity in %: 15.0 to 100
wind: wind speed in km/h: 0.40 to 9.40
rain: outside rain in mm/m2: 0.0 to 6.4
area: the burned area of the forest (in ha): 0.00 to 1090.84
Citation: P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
Which variables are qualitative? Which are quantitative? If there are any variables you are unsure of, discuss your hesitations.
Create a histogram of area. Describe its skew and modality.
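As a rough illustration of the histogram command, the sketch below uses a short made-up vector x (hypothetical values, not the project data). The title and axis label are arbitrary choices.

```r
# Made-up, right-skewed values (hypothetical; not the project data).
x <- c(0, 0, 0, 1, 1, 2, 3, 5, 8, 40)

# Draw the histogram; main and xlab set the title and x-axis label.
hist(x, main = "Histogram of x", xlab = "x")

# hist() also returns the bin boundaries and counts, which can be
# inspected without drawing anything by setting plot = FALSE.
h <- hist(x, plot = FALSE)
h$breaks
h$counts
```

The same call should work on a column of a data frame, written as dataframe$column.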
Create a boxplot of ISI. Are there any outliers? What does the boxplot suggest about the skew of this variable?
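One possible way to draw a boxplot and list the points it flags is sketched below, using a made-up vector y (hypothetical values, not the project data).

```r
# Made-up values with one unusually large observation (hypothetical data).
y <- c(2, 4, 5, 5, 6, 7, 8, 9, 30)

# Draw a horizontal boxplot; points beyond the whiskers are drawn individually.
boxplot(y, horizontal = TRUE, main = "Boxplot of y")

# boxplot.stats() returns the values the boxplot flags: by default, those
# lying more than 1.5 times the length of the box beyond either end of the box.
out <- boxplot.stats(y)$out
out
```

In this made-up example the only flagged value is 30. A long whisker or flagged points on one side only can hint at skew in that direction.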
Convert temp in degrees Celsius to temperature in degrees Fahrenheit. Store it in a new variable called forestfires$fahr. The equation for converting from Celsius to Fahrenheit is \[
\text{temperature in Fahrenheit}=1.8\times(\text{temperature in Celsius})+32
\]
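To see the formula at work, the sketch below applies it to a small made-up data frame named toy (hypothetical, not the project data).

```r
# Hypothetical data frame with made-up Celsius readings (not the project data).
toy <- data.frame(temp = c(0, 10, 25, 100))

# Arithmetic in R is vectorized: the formula is applied to every row at once,
# and assigning to toy$fahr adds the result as a new column.
toy$fahr <- 1.8 * toy$temp + 32
toy
```

As a sanity check, 0 degrees Celsius should come out as 32 degrees Fahrenheit and 100 degrees Celsius as 212 degrees Fahrenheit.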
Obtain the mean, median, standard deviation, and interquartile range of fahr. Which measure of center do you recommend? Which measure of variability? Explain your recommendations.
Note that you can calculate \(\sqrt{x}\) in R using the command sqrt(x).
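The four summaries each have a built-in R function. The sketch below uses a made-up vector v (hypothetical values, not the project data) and also shows where sqrt() comes in: the standard deviation is the square root of the variance.

```r
# Made-up values with one very large observation (hypothetical data).
v <- c(1, 2, 2, 3, 4, 5, 50)

mean(v)    # arithmetic mean
median(v)  # middle value
sd(v)      # sample standard deviation
IQR(v)     # interquartile range (third quartile minus first quartile)

# The standard deviation is the square root of the sample variance.
sqrt(var(v))
```

In this made-up example the single large value pulls the mean well above the median, which is the kind of behavior worth weighing when recommending a measure of center.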
Create boxplots of fahr by month. What do you observe about the centers and spread of the monthly temperature recordings? Summarize your observations in 3-4 sentences.
Next, use the following command to get only the observations where some of the forest burned (when burned area is greater than zero):
burns <- forestfires[forestfires$area > 0,]
These data are available under the name burns and include 270 observations. Answer the following questions using the burns data.
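The subsetting command above relies on logical indexing. The sketch below breaks it into two steps on a made-up data frame named toy (hypothetical, not the project data); the names keep and toy_burns are illustrative only.

```r
# Hypothetical data frame with made-up burned areas (not the project data).
toy <- data.frame(id = 1:5, area = c(0, 2.5, 0, 0.4, 10))

# Step 1: the comparison returns one TRUE/FALSE value per row.
keep <- toy$area > 0
keep

# Step 2: df[rows, ] keeps the rows where the condition is TRUE;
# leaving the part after the comma empty keeps every column.
toy_burns <- toy[keep, ]
nrow(toy_burns)
```

Checking nrow() after subsetting is a quick way to confirm that the number of observations matches what is expected.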
Obtain a barplot and relative frequency distribution of month. What is the mode? Are the frequencies of months with burn days what you would expect based on your boxplots in (6)? Explain.
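One way to build a frequency table, a relative frequency distribution, and a barplot for a qualitative variable is sketched below, using a made-up vector m (hypothetical values, not the project data).

```r
# Made-up month labels (hypothetical; not the project data).
m <- c("aug", "aug", "sep", "mar", "aug", "sep")

# table() counts how often each value occurs.
counts <- table(m)
counts

# prop.table() divides each count by the total, giving relative frequencies.
rel_freq <- prop.table(counts)
rel_freq

# barplot() draws one bar per category from the table of counts.
barplot(counts, main = "Observations per month", ylab = "Frequency")

# The mode is the category (or categories) with the largest count.
mode_value <- names(counts)[counts == max(counts)]
mode_value
```

Note that table() orders character values alphabetically, so the bars will not appear in calendar order unless the variable is converted to a factor with its levels in that order.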
Use a scatterplot to examine the relationship between RH and fahr. Describe the relationship between these two variables. If we collected these data in Sacramento, do you think we would see a similar relationship? Explain your thoughts.
Ask one additional question about these data that you can answer using descriptive statistics and/or data visualization. State clearly what question you are asking and which dataset you are interested in (forestfires or burns). Finally, use descriptive statistics to answer your question. Bonus points may be awarded for especially interesting questions or questions that arise as a result of doing additional research into what these variables represent.