How do we estimate \(\beta\)?

Consider the simple case where \(y = \beta_0 + \beta_1x + \epsilon\).

\[\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x} \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2} \\ &= \frac{s_{xy}}{s_x^2} \\ &= r_{xy}\frac{s_y}{s_x} \end{aligned}\]

Notice how the sample correlation, covariance, variances, and coefficients are all related.
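We can check these identities numerically. A minimal sketch in R, using the three-point toy data from the plots below:

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)

cov(x, y) / var(x)          # slope as s_xy / s_x^2
cor(x, y) * sd(y) / sd(x)   # slope as r_xy * s_y / s_x
coef(lm(y ~ x))[2]          # slope from lm(): all three agree

mean(y) - (cov(x, y) / var(x)) * mean(x)  # intercept as ybar - b1 * xbar
coef(lm(y ~ x))[1]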

Plotting the data along with two candidate lines and the least squares fit:

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)
plot(x, y, pch = 20, cex = 4, xlim = c(0, 2), ylim = c(10, 20))
abline(12, 4, col = "blue", lwd = 2)  # candidate line: intercept 12, slope 4
abline(14, 4, col = "red", lwd = 2)   # candidate line: intercept 14, slope 4
abline(lm(y ~ x), lwd = 2)            # the least squares fit

Error

For a fitted line \(f\), the error at each observation is

\[e_i = y_i - f(x_i)\]

Our goal is to minimize the overall error across all \(n\) observations.

Why can’t we jump right in and minimize \(\sum_{i=1}^n e_i\)? Because the errors are signed: positive and negative errors cancel, so even a badly fitting line can have a total error of zero.
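To see the cancellation concretely, here is a minimal R sketch (the constant fit \(f(x) = \bar{y}\) is a deliberately bad choice): it ignores \(x\) entirely, yet its signed errors sum to zero.

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)
e <- y - mean(y)   # errors from the constant fit f(x) = ybar
sum(e)             # 0 (up to floating point): the signed errors cancel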

Goal: Minimize Error

One possibility: absolute error \(|y_i - f(x_i)|\)

Another possibility: squared error \((y_i - f(x_i))^2\)

Squared error is used far more often than absolute error.

Why do you think that is? A few reasons: squared error is differentiable everywhere (absolute error is not differentiable at 0), it penalizes large errors more heavily, and, as we’ll see, it yields a closed-form solution.
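Both criteria can be minimized numerically, and they generally give different lines. A sketch using optim() on the toy data (Nelder-Mead copes with the non-smooth absolute error well enough here; dedicated tools such as quantreg::rq exist for least absolute deviations):

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)

sae <- function(beta) sum(abs(y - (beta[1] + beta[2] * x)))
optim(c(0, 0), sae)$par   # least absolute deviations fit, approximately (12, 4)
coef(lm(y ~ x))           # least squares fit, (12.67, 4): a different line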

Least Squares: Minimizing Error

Least squares minimizes the overall squared vertical distance between the regression line and the observed \(y\) values.

  • That is, we minimize the sum of squared vertical distances between each point and the line (not the perpendicular distances), as sketched below.
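To visualize these vertical distances, here is a small R sketch that draws each residual as a dashed segment between the point and the fitted line:

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)
fit <- lm(y ~ x)
plot(x, y, pch = 20, cex = 4, xlim = c(0, 2), ylim = c(10, 20))
abline(fit, lwd = 2)
segments(x, fitted(fit), x, y, col = "red", lty = 2)  # the residuals e_i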

Least Squares

Let \(f(x_i) = \beta_0 + \beta_1x_i\). Minimize \[\sum_{i=1}^n (y_i - f(x_i))^2\] to find estimates for \(\beta_0\) and \(\beta_1\).
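One way to see the minimization at work (a sketch on the toy data): evaluate the criterion at the least squares coefficients, then perturb them in any direction and watch the sum of squares grow.

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)

sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)
b <- coef(lm(y ~ x))
sse(b)               # the minimized sum of squares
sse(b + c(0.5, 0))   # shifting the intercept increases it
sse(b + c(0, -0.2))  # so does tilting the slope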

Note: least squares is a convex optimization problem.

  • That is, every local minimum is a global minimum.
    • (We don’t need to do any kind of second derivative check; a short justification is sketched below.)
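Why is it convex? A sketch of the standard argument, using the matrix notation introduced in the next section: the criterion is a quadratic in \(\beta\) whose Hessian is positive semidefinite.

\[\begin{aligned} g(\beta) &= (y - X\beta)^T(y - X\beta) \\ \nabla g(\beta) &= -2X^T(y - X\beta) \\ \nabla^2 g(\beta) &= 2X^TX \end{aligned}\]

Since \(v^T(2X^TX)v = 2\|Xv\|^2 \ge 0\) for every \(v\), the Hessian is positive semidefinite and \(g\) is convex.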

The general case

More generally, we can write the model as \(y = X\beta + \epsilon\), where \(X\) is the design matrix, and express the criterion with matrices: \[\sum_{i=1}^n\epsilon_i^2 = \epsilon^T\epsilon = (y-X\beta)^T(y-X\beta)\]

Differentiating with respect to \(\beta\) and setting the result to 0, we find that \(\hat{\beta}\) satisfies the normal equations

\[X^TX\hat{\beta} = X^Ty\]

…and if \(X^TX\) is invertible,

\[\begin{aligned} \hat{\beta} &= (X^TX)^{-1}X^Ty \\ X\hat{\beta} &= X(X^TX)^{-1}X^Ty \\ \hat{y} &= Hy \\ \end{aligned}\]

where \(H = X(X^TX)^{-1}X^T\) is called the hat matrix, since it “puts the hat on” \(y\).
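A minimal R sketch of the matrix route, again on the toy data (the explicit inverse is for exposition only; numerically, solve(crossprod(X), crossprod(X, y)) or simply lm() is preferable):

x <- c(0.25, 1, 1.75)
y <- c(13, 18, 19)

X <- cbind(1, x)                         # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                                 # matches coef(lm(y ~ x))

H <- X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix
y_hat <- H %*% y                         # H puts the hat on y
all.equal(as.vector(y_hat), unname(fitted(lm(y ~ x))))  # TRUE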