Missing Data

Missing data occur when some variables have no recorded value for some observations.

This is very common!
We can remove those observations… but often that means losing a lot of data.
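
As a rough illustration of how much data can be lost, the sketch below counts missing values in a small made-up data frame. The data frame `toy` and its columns are hypothetical and are not part of the Chicago data used later.

```r
# Hypothetical toy data (not chmiss): 6 cases, 3 variables.
toy <- data.frame(
  x1 = c(1.2, NA, 3.1, 4.0, NA, 6.3),
  x2 = c(10, 12, NA, 15, 11, 14),
  y  = c(0.5, 0.7, 0.9, NA, 1.1, 1.4)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing values at all.
n_complete <- sum(complete.cases(toy))

# Share of rows lost if every incomplete row is dropped.
share_lost <- 1 - n_complete / nrow(toy)

print(na_per_column)
print(n_complete)
print(share_lost)
```

Only 4 of the 18 cells are missing here, yet dropping incomplete rows discards 4 of the 6 cases, because the missing values are spread across different rows.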

“Fixing” Missing Data

Remove observations with (any amount of) missing data.
Fill in or impute the missing values.
Missing value correlation method.
Maximum likelihood methods.

Chicago Insurance Data

Some points have been removed at random:

library(faraway) # the chmiss data come from the faraway package
data(chmiss)
head(chmiss, 10)

##       race fire theft  age involact income
## 60626 10.0  6.2    29 60.4       NA 11.744
## 60640 22.2  9.5    44 76.5      0.1  9.323
## 60613 19.6 10.5    36   NA      1.2  9.948
## 60657 17.3  7.7    37   NA      0.5 10.656
## 60614 24.5  8.6    53 81.4      0.7  9.730
## 60610 54.0 34.1    68 52.6      0.3  8.231
## 60611  4.9 11.0    75 42.6      0.0 21.480
## 60625  7.1  6.9    18 78.5      0.0 11.104
## 60618  5.3  7.3    31 90.1       NA 10.694
## 60647 21.5 15.1    NA 89.8      1.1  9.631

The Model

mod1 <- lm(involact ~ ., chmiss) # not all software will run this regression!
summary(mod1)

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.116483   0.605761  -1.843 0.079475 .
## race         0.010487   0.003128   3.352 0.003018 **
## fire         0.043876   0.010319   4.252 0.000356 ***
## theft       -0.017220   0.005900  -2.918 0.008215 **
## age          0.009377   0.003494   2.684 0.013904 *
## income       0.068701   0.042156   1.630 0.118077

The Model, Continued

## Residual standard error: 0.3382 on 21 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7414
## F-statistic: 15.91 on 5 and 21 DF, p-value: 1.594e-06

By default, lm() in R removes any case with at least one missing value before fitting. Here only 27 of the 47 cases are used.
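
The default behaviour can be checked directly. The sketch below uses simulated data (the data frame `sim`, the seed, and the positions of the missing values are all arbitrary choices for illustration) and assumes R's default setting `options(na.action = "na.omit")`.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 30
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

# Remove a few values; rows 2, 5, 9 and 17 become incomplete.
sim$x1[c(2, 9)]  <- NA
sim$x2[c(9, 17)] <- NA
sim$y[5]         <- NA

fit <- lm(y ~ x1 + x2, data = sim)

n_used    <- nobs(fit)                 # cases actually used in the fit
n_dropped <- length(na.action(fit))    # cases removed because of NAs
n_cc      <- sum(complete.cases(sim))  # complete cases in the data

# With na.action = na.fail, lm() refuses to fit instead of dropping rows.
strict <- try(lm(y ~ x1 + x2, data = sim, na.action = na.fail), silent = TRUE)
failed <- inherits(strict, "try-error")

print(c(used = n_used, dropped = n_dropped, complete = n_cc))
print(failed)
```

The number of cases used matches the number of complete cases, so the fit is a complete-case analysis unless the missing values are handled beforehand.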

Mean Imputation

cmeans <- apply(chmiss, 2, mean, na.rm=TRUE)
cmeans

##       race       fire      theft        age   involact     income 
## 35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667

mchm <- chmiss
for(i in c(1, 2, 3, 4, 6)){ # 5 is the outcome variable
  mchm[is.na(chmiss[,i]),i] <- cmeans[i]
}

We don’t impute values for the outcome because this is the variable we’re trying to model.
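
The same idea can be written as a small reusable function. This is only a sketch: `impute_mean` is a hypothetical helper, not a function from base R or from any package, and `toy` is made-up data.

```r
# Hypothetical helper: replace NAs in the named columns with that column's
# observed mean. Columns not listed (such as the outcome) are left untouched.
impute_mean <- function(df, cols) {
  for (col in cols) {
    miss <- is.na(df[[col]])
    df[[col]][miss] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

# Made-up data: two predictors and an outcome, each with one missing value.
toy <- data.frame(
  x1 = c(1, NA, 3, 5),
  x2 = c(2, 4, NA, 6),
  y  = c(NA, 1, 2, 3)
)

# Impute the predictors only; y keeps its NA.
filled <- impute_mean(toy, cols = c("x1", "x2"))
print(filled)
```

Selecting columns by name rather than by position makes it harder to impute the outcome by accident if the column order changes.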

Refitting the Model

mod2 <- lm(involact ~ ., mchm)
summary(mod2)

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.070802   0.509453   0.139  0.89020
## race         0.007117   0.002706   2.631  0.01224 *
## fire         0.028742   0.009385   3.062  0.00402 **
## theft       -0.003059   0.002746  -1.114  0.27224
## age          0.006080   0.003208   1.895  0.06570 .
## income      -0.027092   0.031678  -0.855  0.39779

Refitting the Model, Continued

## Residual standard error: 0.3841 on 38 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared: 0.682, Adjusted R-squared: 0.6401
## F-statistic: 16.3 on 5 and 38 DF, p-value: 1.409e-08

Theft and age are no longer significant at the 5% level.
All regression coefficients are closer to zero (in magnitude) than before.

Mean Imputation

Bias/Variance Tradeoff
- Reduction in variance (more cases are used, so standard errors shrink)
- Increase in bias (coefficients pulled toward zero)
- May not be worth it

Using Regression Methods

We can use regression methods to predict the missing values of the covariates:

chmiss[is.na(chmiss$fire), ]

##       race fire theft age involact income
## 60607 50.2   NA   147  83      0.9  7.459
## 60608 55.5   NA    29  79      1.5  8.177

# Model fire using the other predictors; fitted on cases where all of them are observed
fire.miss <- lm(fire ~ race + theft + age + income, chmiss)
predict(fire.miss, chmiss[is.na(chmiss$fire), ])

##    60607    60608 
## 50.23688 15.40465

Using Regression Methods

How do our predicted values compare to the actual values?

data(chredlin) # the complete data, with no values removed
chredlin$fire[is.na(chmiss$fire)]

## [1] 39.7 23.3

The predictions (50.2 and 15.4) are in the right range but not close to the actual values (39.7 and 23.3).
This method also biases the coefficients toward zero.
It tends to reduce variance.
It works better the more collinear the predictors are.
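
To use the predictions, they have to be written back into the data. The sketch below shows one way this might be done on simulated data; the data frame `sim`, the variable names, the seed, and the rows set to missing are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 40
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- 0.8 * sim$x1 + rnorm(n, sd = 0.5)  # x2 is correlated with x1
sim$x3 <- rnorm(n)

# Remove two values of x2 (rows chosen arbitrarily), keeping the truth aside.
truth <- sim$x2[c(4, 21)]
sim$x2[c(4, 21)] <- NA

# Imputation model: x2 regressed on the other predictors.
# lm() fits this on the rows where x2 is observed.
imp_fit <- lm(x2 ~ x1 + x3, data = sim)

# Fill in only the missing entries; observed values are left as they are.
miss <- is.na(sim$x2)
filled <- sim
filled$x2[miss] <- predict(imp_fit, newdata = sim[miss, ])

print(cbind(imputed = filled$x2[miss], actual = truth))
```

If a row is also missing one of the predictors in the imputation model, predict() returns NA for it, so such rows would still need separate handling.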


More Complex Missing Data Problems