Missing data occur when the value of the response or of a covariate is not recorded for some observations.
- This is very common!
- We can remove those observations… but often that means losing a lot of data.
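As a rough way to see how much a complete-case analysis would cost, one can count the missing values per column and the rows that would survive deletion. The sketch below uses a small made-up data frame (`toy`) and a hypothetical helper, `missing_summary()`; neither comes from the original material.

```r
# Hypothetical helper: summarise how much data is missing in a data frame.
missing_summary <- function(df) {
  list(
    na_per_column = colSums(is.na(df)),       # missing values in each column
    n_rows        = nrow(df),                 # total number of observations
    n_complete    = sum(complete.cases(df))   # rows with no missing value
  )
}

# Made-up toy data, for illustration only.
toy <- data.frame(
  x = c(1, NA, 3, 4, 5),
  y = c(2, 4, NA, 8, 10),
  z = c(1, 1, 2, NA, 3)
)

s <- missing_summary(toy)
s$na_per_column   # one missing value in each of x, y and z
s$n_complete      # only 2 of the 5 rows are complete
```

In this toy example only 3 of the 15 cells are missing, yet deleting incomplete rows would discard 3 of the 5 observations.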
The chmiss data are the Chicago insurance data (chredlin) with some values removed at random:
library(faraway)  # chmiss and chredlin ship with the faraway package
data(chmiss)
head(chmiss, 10)
##       race fire theft  age involact income
## 60626 10.0  6.2    29 60.4       NA 11.744
## 60640 22.2  9.5    44 76.5      0.1  9.323
## 60613 19.6 10.5    36   NA      1.2  9.948
## 60657 17.3  7.7    37   NA      0.5 10.656
## 60614 24.5  8.6    53 81.4      0.7  9.730
## 60610 54.0 34.1    68 52.6      0.3  8.231
## 60611  4.9 11.0    75 42.6      0.0 21.480
## 60625  7.1  6.9    18 78.5      0.0 11.104
## 60618  5.3  7.3    31 90.1       NA 10.694
## 60647 21.5 15.1    NA 89.8      1.1  9.631
mod1 <- lm(involact ~ ., chmiss)  # not all software will run this regression!
summary(mod1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.116483 0.605761 -1.843 0.079475 .
race 0.010487 0.003128 3.352 0.003018 **
fire 0.043876 0.010319 4.252 0.000356 ***
theft -0.017220 0.005900 -2.918 0.008215 **
age 0.009377 0.003494 2.684 0.013904 *
income 0.068701 0.042156 1.630 0.118077
Residual standard error: 0.3382 on 21 degrees of freedom
(20 observations deleted due to missingness)
Multiple R-squared: 0.7911, Adjusted R-squared: 0.7414
F-statistic: 15.91 on 5 and 21 DF, p-value: 1.594e-06
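The line "(20 observations deleted due to missingness)" reflects R's default behaviour: `lm()` silently drops every row that contains a missing value (`na.action = na.omit`). A minimal sketch on made-up data (the data frame `toy` is an assumption for illustration) suggests that the default fit matches a fit on the complete cases, and that `na.action = na.fail` makes R refuse to run the regression, as some other software would.

```r
# Made-up toy data, for illustration only.
toy <- data.frame(
  y  = c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1),
  x1 = c(1, 2, NA, 4, 5, 6, 7, 8),
  x2 = c(3, 1, 4, NA, 5, 9, 2, 6)
)

fit_default  <- lm(y ~ ., data = toy)            # rows with an NA are dropped silently
fit_complete <- lm(y ~ ., data = na.omit(toy))   # explicit complete-case fit

nobs(fit_default)                # observations actually used in the fit
length(na.action(fit_default))   # observations deleted due to missingness
all.equal(coef(fit_default), coef(fit_complete))

# With na.fail, lm() stops with an error instead of dropping rows.
refused <- inherits(
  try(lm(y ~ ., data = toy, na.action = na.fail), silent = TRUE),
  "try-error"
)
refused
```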
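A simple fix is to fill in each missing value with the mean of its column:
cmeans <- apply(chmiss, 2, mean, na.rm = TRUE)  # column means, ignoring NAs
cmeans
##       race       fire      theft        age   involact     income
## 35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667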
mchm <- chmiss
for (i in c(1, 2, 3, 4, 6)) {  # column 5 (involact) is the outcome variable
  mchm[is.na(chmiss[, i]), i] <- cmeans[i]
}
mod2 <- lm(involact ~ ., mchm)
summary(mod2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.070802 0.509453 0.139 0.89020
race 0.007117 0.002706 2.631 0.01224 *
fire 0.028742 0.009385 3.062 0.00402 **
theft -0.003059 0.002746 -1.114 0.27224
age 0.006080 0.003208 1.895 0.06570 .
income -0.027092 0.031678 -0.855 0.39779
Residual standard error: 0.3841 on 38 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.682, Adjusted R-squared: 0.6401
F-statistic: 16.3 on 5 and 38 DF, p-value: 1.409e-08
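Only 3 observations are dropped now (those with a missing outcome, which is not imputed), but all five slope estimates in mod2 are closer to zero than in mod1. One likely contributor is that filling in the mean preserves a variable's mean while shrinking its spread. The sketch below illustrates this on made-up values; `impute_mean()` is a hypothetical helper, not part of the original code.

```r
# Hypothetical helper: replace the NAs in a numeric vector by the observed mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Made-up values, for illustration only.
x <- c(2, 4, NA, 8, NA, 10)
x_filled <- impute_mean(x)

x_filled                                  # NAs replaced by the observed mean, 6
mean(x_filled) == mean(x, na.rm = TRUE)   # the mean is preserved
var(x_filled) < var(x, na.rm = TRUE)      # but the variance shrinks
```

Under the same assumptions, the helper could be applied to the covariate columns of a data frame with `lapply()`, leaving the outcome column untouched.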
We can use regression methods to predict the missing values of the covariates:
chmiss[is.na(chmiss$fire), ]
##       race fire theft age involact income
## 60607 50.2   NA   147  83      0.9  7.459
## 60608 55.5   NA    29  79      1.5  8.177
fire.miss <- lm(fire ~ race + theft + age + income, chmiss)  # model for the covariate with missing values
predict(fire.miss, chmiss[is.na(chmiss$fire), ])
##    60607    60608
## 50.23688 15.40465
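The same two steps (fit on the rows where the covariate is observed, then predict for the rows where it is missing) can be wrapped in a small function. This is only a sketch: `impute_regression()` and the toy data are hypothetical, and it assumes the predictors themselves are observed in the rows being filled in.

```r
# Hypothetical helper: fill in the NAs of `target` using a linear regression
# of `target` on `predictors`, fitted to the rows where `target` is observed.
impute_regression <- function(df, target, predictors) {
  fit  <- lm(reformulate(predictors, response = target), data = df)
  miss <- is.na(df[[target]])
  # predict() returns NA for rows whose predictors are also missing,
  # so those entries simply stay NA.
  df[[target]][miss] <- predict(fit, newdata = df[miss, , drop = FALSE])
  df
}

# Made-up toy data in which y is an exact linear function of x1 and x2.
toy <- data.frame(
  x1 = 1:10,
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
)
toy$y <- 1 + 2 * toy$x1 - toy$x2
truth <- toy$y[c(3, 7)]
toy$y[c(3, 7)] <- NA

filled <- impute_regression(toy, "y", c("x1", "x2"))
all.equal(filled$y[c(3, 7)], truth)   # exact here only because y has no noise
```

With real data the imputed values carry prediction error, so they should not be read as if they had been observed.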
How do our predicted values compare to the actual values?
data(chredlin)  # the original data, with no values removed
chredlin$fire[is.na(chmiss$fire)]
## [1] 39.7 23.3
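Putting the printed numbers side by side, the regression predictions miss the true values by roughly 10.5 and 7.9, whereas the column mean of fire (11.42, from `cmeans`) would miss by roughly 28.3 and 11.9. The short calculation below simply redoes this arithmetic with values copied from the output above; the variable names are illustrative.

```r
# Values copied from the printed output.
actual    <- c("60607" = 39.7, "60608" = 23.3)           # true fire values (chredlin)
reg_pred  <- c("60607" = 50.23688, "60608" = 15.40465)   # regression predictions
mean_pred <- 11.4244444                                   # column mean of fire (cmeans)

reg_err  <- reg_pred - actual    # errors of regression imputation
mean_err <- mean_pred - actual   # errors of mean imputation

round(reg_err, 2)    # 10.54 and -7.90
round(mean_err, 2)   # -28.28 and -11.88

# Root-mean-square error of each method over these two cases.
rmse <- function(e) sqrt(mean(e^2))
c(regression = rmse(reg_err), mean = rmse(mean_err))
```

Regression imputation is closer for both observations here, though two cases are far too few to draw a general conclusion.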