Missing Data

Missing data occur when the values of some variables are not recorded for some observations.

  • This is very common!
  • We can remove those observations… but often that means losing a lot of data.
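
To see how much is lost, one can count the missing values per column and the rows that survive when incomplete cases are dropped. A minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical, not the data used later):

```r
# Hypothetical toy data: three variables, a few missing values.
toy <- data.frame(
  x1 = c(1.2, NA, 3.1, 4.0, 5.5, NA),
  x2 = c(10, 12, NA, 15, 11, 13),
  y  = c(0.5, 0.7, 0.9, NA, 1.4, 1.1)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Rows with no missing value in any column.
complete_rows <- complete.cases(toy)
n_complete <- sum(complete_rows)

print(na_per_column)
cat("Complete rows:", n_complete, "of", nrow(toy), "\n")
```

Here only 4 of the 18 values are missing, yet 4 of the 6 rows would be discarded.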

“Fixing” Missing Data

  • Remove observations with (any amount of) missing data.
  • Fill in or impute the missing values.
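  • Missing value correlation method: estimate means, variances, and correlations from whichever cases have the relevant variables observed.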
  • Maximum likelihood methods.

Chicago Insurance Data

Some values have been removed at random:

library(faraway) # provides the chmiss and chredlin data sets
data(chmiss)
head(chmiss, 10)
##       race fire theft  age involact income
## 60626 10.0  6.2    29 60.4       NA 11.744
## 60640 22.2  9.5    44 76.5      0.1  9.323
## 60613 19.6 10.5    36   NA      1.2  9.948
## 60657 17.3  7.7    37   NA      0.5 10.656
## 60614 24.5  8.6    53 81.4      0.7  9.730
## 60610 54.0 34.1    68 52.6      0.3  8.231
## 60611  4.9 11.0    75 42.6      0.0 21.480
## 60625  7.1  6.9    18 78.5      0.0 11.104
## 60618  5.3  7.3    31 90.1       NA 10.694
## 60647 21.5 15.1    NA 89.8      1.1  9.631

The Model

mod1 <- lm(involact ~ ., chmiss) # not all software will run this regression!
summary(mod1)

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.116483   0.605761  -1.843 0.079475 .
## race         0.010487   0.003128   3.352 0.003018 **
## fire         0.043876   0.010319   4.252 0.000356 ***
## theft       -0.017220   0.005900  -2.918 0.008215 **
## age          0.009377   0.003494   2.684 0.013904 *
## income       0.068701   0.042156   1.630 0.118077

The Model, Continued

## Residual standard error: 0.3382 on 21 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7414
## F-statistic: 15.91 on 5 and 21 DF, p-value: 1.594e-06

  • By default, R's lm() removes any case with at least one missing value, so this fit uses only 27 of the 47 observations.
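
The remark that not all software will run this regression can be reproduced within R, because the handling of missing values is set by the `na.action` argument of `lm()`. A sketch on made-up data (the data frame `toy` is hypothetical):

```r
# Hypothetical toy data with a missing predictor value and a missing response.
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8),
  y = c(1.1, 1.9, 3.2, NA, 5.1, 5.8, 7.2, 7.9)
)

# Default behaviour (na.omit): incomplete rows are dropped before fitting.
fit_omit <- lm(y ~ x, data = toy, na.action = na.omit)
n_used <- nobs(fit_omit)                    # rows actually used in the fit
dropped <- as.integer(na.action(fit_omit))  # indices of the dropped rows

# Stricter behaviour: na.fail refuses to fit when any value is missing.
fit_fail <- tryCatch(
  lm(y ~ x, data = toy, na.action = na.fail),
  error = function(e) NULL
)

cat("Rows used:", n_used, " Rows dropped:", dropped, "\n")
cat("na.fail produced a fit:", !is.null(fit_fail), "\n")
```

Checking `nobs()` after a fit is a quick way to notice how many cases were dropped.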

Mean Imputation

cmeans <- apply(chmiss, 2, mean, na.rm=TRUE)
cmeans
##       race       fire      theft        age   involact     income 
## 35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667
mchm <- chmiss
for(i in c(1, 2, 3, 4, 6)){ # 5 is the outcome variable
  mchm[is.na(chmiss[,i]),i] <- cmeans[i]
}
  • We don’t impute values for the outcome because this is the variable we’re trying to model.
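
The loop above can also be written as a small helper that fills each predictor's missing entries with that column's observed mean and leaves the outcome alone. A sketch, where `impute_mean()` and the toy data are hypothetical names introduced only for illustration:

```r
# Hypothetical helper: mean-impute every column except the outcome.
impute_mean <- function(df, outcome) {
  predictors <- setdiff(names(df), outcome)
  for (v in predictors) {
    missing <- is.na(df[[v]])
    df[[v]][missing] <- mean(df[[v]], na.rm = TRUE)
  }
  df
}

# Made-up data: x1 and x2 are predictors, y is the outcome.
toy <- data.frame(
  x1 = c(2, NA, 6, 8),
  x2 = c(NA, 1, 3, 5),
  y  = c(1.0, NA, 2.0, 3.0)
)

filled <- impute_mean(toy, outcome = "y")
print(filled)
```

Selecting columns by name rather than by position avoids having to remember which index holds the outcome.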

Refitting the Model

mod2 <- lm(involact ~ ., mchm)
summary(mod2)

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.070802   0.509453   0.139  0.89020
## race         0.007117   0.002706   2.631  0.01224 *
## fire         0.028742   0.009385   3.062  0.00402 **
## theft       -0.003059   0.002746  -1.114  0.27224
## age          0.006080   0.003208   1.895  0.06570 .
## income      -0.027092   0.031678  -0.855  0.39779

Refitting the Model, Continued

## Residual standard error: 0.3841 on 38 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared: 0.682, Adjusted R-squared: 0.6401
## F-statistic: 16.3 on 5 and 38 DF, p-value: 1.409e-08

  • Theft and age are no longer significant at the 5% level.
  • All regression coefficients are closer to zero than before (the income coefficient also changes sign).

Mean Imputation

  • Bias/variance tradeoff
    • Reduction in variance (more cases are used in the fit)
    • Increase in bias (coefficients shrink toward zero)
    • The added bias may not be worth the reduction in variance
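
The tradeoff can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general result: all data are simulated, the two predictors are given a correlation of about 0.8, half of the `x1` values are removed completely at random, and every variable name is made up for the example.

```r
set.seed(1)
n <- 2000

# Two correlated predictors (correlation about 0.8) and a response
# whose true coefficients are both 1. All values are simulated.
x2 <- rnorm(n)
x1 <- 0.8 * x2 + sqrt(1 - 0.8^2) * rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)

# Remove half of the x1 values completely at random.
x1_miss <- x1
x1_miss[sample(n, n / 2)] <- NA

# Mean imputation: replace missing x1 with the observed mean.
x1_imp <- ifelse(is.na(x1_miss), mean(x1_miss, na.rm = TRUE), x1_miss)

fit_full <- lm(y ~ x1 + x2)       # no missing data
fit_cc   <- lm(y ~ x1_miss + x2)  # complete cases only
fit_imp  <- lm(y ~ x1_imp + x2)   # after mean imputation

# Estimate and standard error for the x1 coefficient in each fit.
x1_row <- function(fit) coef(summary(fit))[2, c("Estimate", "Std. Error")]
results <- rbind(full = x1_row(fit_full),
                 complete_case = x1_row(fit_cc),
                 mean_imputed = x1_row(fit_imp))
print(round(results, 3))
```

In this setup the complete-case estimate stays near the true value of 1 with a larger standard error, while the mean-imputed estimate has a smaller standard error but is pulled well below 1.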

Using Regression Methods

We can use regression methods to predict the missing values of the covariates:

chmiss[is.na(chmiss$fire), ]
##       race fire theft age involact income
## 60607 50.2   NA   147  83      0.9  7.459
## 60608 55.5   NA    29  79      1.5  8.177
# Regress fire on the other predictors, then predict the missing fire values
fire.miss <- lm(fire ~ race + theft + age + income, chmiss)
predict(fire.miss, chmiss[is.na(chmiss$fire), ])
##    60607    60608 
## 50.23688 15.40465

Using Regression Methods

How do our predicted values compare to the actual values?

data(chredlin)
chredlin$fire[is.na(chmiss$fire)]
## [1] 39.7 23.3
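  • This method also introduces some bias toward zero in the regression coefficients.
  • It tends to reduce their variance.
  • It works better the more collinear the predictors are, because the other predictors then carry more information about the missing one.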
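
The role of collinearity can be checked on simulated data where the hidden values are known. A sketch under stated assumptions: `imputation_rmse()` is a hypothetical helper, the data are simulated with a chosen correlation `rho`, and values are hidden completely at random.

```r
set.seed(2)

# Hypothetical helper: simulate two predictors with correlation rho,
# hide some x1 values, impute them by regressing x1 on x2, and return
# the root mean squared error of the imputed values.
imputation_rmse <- function(rho, n = 1000, n_missing = 200) {
  x2 <- rnorm(n)
  x1 <- rho * x2 + sqrt(1 - rho^2) * rnorm(n)
  dat <- data.frame(x1 = x1, x2 = x2)

  hidden <- sample(n, n_missing)
  truth <- dat$x1[hidden]
  dat$x1[hidden] <- NA

  # lm() fits on the rows where x1 is observed; predict() fills the rest.
  fit <- lm(x1 ~ x2, data = dat)
  imputed <- predict(fit, newdata = dat[hidden, ])
  sqrt(mean((imputed - truth)^2))
}

rmse_weak   <- imputation_rmse(rho = 0.2)
rmse_strong <- imputation_rmse(rho = 0.9)
cat("RMSE, weakly correlated predictors:  ", round(rmse_weak, 2), "\n")
cat("RMSE, strongly correlated predictors:", round(rmse_strong, 2), "\n")
```

The imputed values should land closer to the hidden truth when the predictors are strongly correlated.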

More Complex Missing Data Problems

  • These methods assume data are missing at random, which is often not the case.
  • If a more substantial proportion of the data are missing:
    • Expectation-Maximization (EM) algorithms
    • Multiple Imputation
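
The idea behind multiple imputation can be sketched in base R: fill in the missing values several times with random draws, refit the model on each completed data set, and pool the results with Rubin's rules. This is a simplified illustration on simulated data with made-up variable names. It ignores the uncertainty in the imputation model's own parameters, which a full multiple imputation procedure would also account for.

```r
set.seed(3)
n <- 500
m <- 20  # number of imputed data sets

# Simulated data: x1 is partly missing, x2 and y are fully observed.
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n, sd = 0.7)
y  <- 1 + x1 + x2 + rnorm(n)
dat <- data.frame(y = y, x1 = x1, x2 = x2)
miss <- sample(n, 150)
dat$x1[miss] <- NA

# Imputation model for x1, fitted on the observed rows. It uses y as
# well as x2 so that imputed values keep their relationship with y.
imp_fit <- lm(x1 ~ x2 + y, data = dat)
imp_mean <- predict(imp_fit, newdata = dat[miss, ])
imp_sd <- sigma(imp_fit)

est <- numeric(m)  # x1 coefficient from each completed data set
se2 <- numeric(m)  # its squared standard error

for (j in seq_len(m)) {
  completed <- dat
  # Draw imputations: prediction plus random residual noise.
  completed$x1[miss] <- imp_mean + rnorm(length(miss), sd = imp_sd)
  fit <- lm(y ~ x1 + x2, data = completed)
  est[j] <- coef(fit)["x1"]
  se2[j] <- vcov(fit)["x1", "x1"]
}

# Rubin's rules: pool the estimate and combine within- and
# between-imputation variance.
pooled_est <- mean(est)
within_var <- mean(se2)
between_var <- var(est)
total_var <- within_var + (1 + 1 / m) * between_var

cat("Pooled x1 coefficient:", round(pooled_est, 3),
    " SE:", round(sqrt(total_var), 3), "\n")
```

The between-imputation term is what distinguishes this from single imputation: it inflates the standard error to reflect that the filled-in values are guesses.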