
Linear Regression

Recall $Y = f(X) + \epsilon$. If we assume a linear model, then $Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$, where $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$ is a linear combination of the predictors $X_1, X_2, \ldots, X_p$. This linear relationship between $X$ and $Y$ is the linear regression model.

  • for computational purposes, we write $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ and $X = (1, X_1, X_2, \ldots, X_p)$, so that $Y = \beta X^T + \epsilon$
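
As a minimal sketch of this matrix form (the sizes and coefficient values below are made up for illustration), prepending a column of ones to the predictors lets a single matrix product handle the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                               # n observations, p predictors (hypothetical sizes)
X_raw = rng.normal(size=(n, p))             # predictors X_1, ..., X_p
beta = np.array([1.0, 2.0, -1.0, 0.5])      # (beta_0, beta_1, ..., beta_p), chosen for illustration
eps = rng.normal(scale=0.5, size=n)         # noise term epsilon

# Prepend a column of ones so the intercept beta_0 is handled by the same product.
X = np.column_stack([np.ones(n), X_raw])    # design matrix, shape (n, p+1)
y = X @ beta + eps                          # Y = X beta + epsilon, i.e. y_i = x_i^T beta + eps_i
```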

Recall our goal is to minimize the $[f(X) - \hat f(X)]^2$ part of the error to get a better model $\hat f$. To quantify this, we first define some terms and values (a short numerical sketch follows the list):

  • Residual Sum of Squares (RSS): we define the residual as $e_i = y_i - \hat y_i$, then $RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat y_i)^2$
    • we then want the least squares error, i.e. we minimize the RSS
  • Total Sum of Squares (TSS): $TSS = \sum_{i=1}^{n}(y_i - \bar y)^2$, where $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$
  • estimate the unknown variance $\sigma^2$ by $\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n (y_i - x_i^T\hat\beta)^2$
  • Here, $y_i$ is the observed value of $Y$ and $\hat y_i$ is the predicted value from $\hat Y$
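
A minimal numpy sketch of these quantities, using simulated data with arbitrary coefficient values (the OLS fit itself is covered further down in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit
y_hat = X @ beta_hat                              # predicted values

residuals = y - y_hat
RSS = np.sum(residuals**2)                        # residual sum of squares
TSS = np.sum((y - y.mean())**2)                   # total sum of squares
sigma2_hat = RSS / (n - p - 1)                    # estimate of sigma^2
```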

In the linear model, the $\epsilon_i$ are assumed to be uncorrelated; otherwise the estimated standard errors will not be close to the true standard errors.

$R^2$ is the proportion of the variation in the outcome that is predictable from the predictors, where $R^2 = 1 - \frac{RSS}{TSS}$. In simple linear regression (one predictor), $R^2 = \left(\frac{Cov(X, Y)}{\sigma_X \sigma_Y}\right)^2$, the squared correlation between $X$ and $Y$ (checked numerically after the list below).

  • $0 \le R^2 \le 1$.
  • $R^2$ close to 1 indicates that a large proportion of the variability in the response is explained by the predictors (the response is well interpreted by the predictors).
  • $R^2$ alone cannot determine how well the model fits; it only gives a sense of interpretation. A more complicated model may have an $R^2$ closer to 1 simply because it has more predictors.
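
A quick numerical check of $R^2 = 1 - RSS/TSS$ and, in the one-predictor case, its equality with the squared sample correlation (simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)   # simple linear regression: one predictor

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

RSS = np.sum((y - y_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
r2 = 1 - RSS / TSS

# With a single predictor, R^2 equals the squared sample correlation between x and y.
r2_corr = np.corrcoef(x, y)[0, 1] ** 2
print(r2, r2_corr)   # the two values agree up to floating-point error
```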

The $95\%$ confidence interval of $\beta_j$ is $[\hat\beta_j - 1.96\,SE(\hat\beta_j),\ \hat\beta_j + 1.96\,SE(\hat\beta_j)]$, where the standard error is $SE(\hat\beta_j) = \sqrt{\hat\sigma^2[(X^TX)^{-1}]_{jj}}$ and the value $1.96$ comes from the normal distribution (a sketch of this computation follows the next bullet).

  • The difference between a confidence interval and a prediction interval is that the CI is a range for $E[y|x]$ while the PI is a range for $y$ itself; that is, the standard error $SE$ is different in the two cases.
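
A sketch of both intervals in numpy under the assumptions above (simulated data; the query point `x0` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

se_beta = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # SE(beta_hat_j)
ci = np.column_stack([beta_hat - 1.96 * se_beta,
                      beta_hat + 1.96 * se_beta])       # 95% CI for each beta_j

# At a new point x0: the CI targets E[y|x0], the PI targets y itself (extra sigma^2 term).
x0 = np.array([1.0, 0.3, -0.7])                         # hypothetical query point (with leading 1)
se_mean = np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)       # SE for E[y|x0]  -> confidence interval
se_pred = np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0)) # SE for a new y at x0 -> prediction interval
```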

We can also use the $t$-statistic to check the significance of $\beta_j$ (hypothesis test of $\beta_j = 0$), where $t = \frac{\hat\beta_j}{SE(\hat\beta_j)}$

  • $t$ follows a $t$-distribution with $n-p-1$ degrees of freedom when $\beta_j = 0$
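
As a sketch, the $t$-statistics and two-sided $p$-values can be computed as follows (data simulated with $\beta_1 = 0$, so that test should usually fail to reject):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)    # beta_1 = 0 by construction

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
se_beta = np.sqrt(sigma2_hat * np.diag(XtX_inv))

t_stats = beta_hat / se_beta                              # t = beta_hat_j / SE(beta_hat_j)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)  # two-sided p-values
print(t_stats, p_values)
```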

We can also use the $F$-statistic to test a group of parameters at once, with $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \ldots = \beta_p = 0$.
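
A sketch of this partial $F$-test, comparing the RSS of the full model with that of the reduced model that drops the last $q$ predictors (simulated data; the last two true coefficients are set to zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p, q = 150, 4, 2                          # test whether the last q coefficients are all zero
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(X, y):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta_hat) ** 2)

rss_full = rss(X, y)                         # full model with all p predictors
rss_reduced = rss(X[:, : p + 1 - q], y)      # reduced model without the last q predictors

F = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))
p_value = stats.f.sf(F, q, n - p - 1)
print(F, p_value)
```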

After we get the final model, the prediction $\hat y = x^T\hat\beta$ at $X = x$ has the following properties (a Monte Carlo check follows the list):

  • Expectation: $\mathbb{E}[\hat y \mid X = x] = x^T\mathbb{E}[\hat\beta] = x^T\beta$
  • Variance: $Var[\hat y \mid X = x] = x^T Cov(\hat\beta)\,x = \sigma^2 x^T(X^TX)^{-1}x$
  • MSE: $\mathbb{E}[(y - \hat y)^2 \mid X = x] = Var[\epsilon] + Var[\hat y \mid X = x] = \sigma^2 + \sigma^2 x^T(X^TX)^{-1}x$
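
A small Monte Carlo sketch of the expectation and variance formulas, refitting $\hat\beta$ over many fresh noise draws at a fixed design (all values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 80, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # fixed design
beta = np.array([1.0, -0.5, 2.0])
x0 = np.array([1.0, 0.2, -0.4])                              # hypothetical query point
XtX_inv = np.linalg.inv(X.T @ X)

preds = []
for _ in range(5000):                                        # repeat over fresh noise draws
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = XtX_inv @ X.T @ y
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

print(preds.mean(), x0 @ beta)                               # E[y_hat | x0] = x0^T beta
print(preds.var(), sigma**2 * x0 @ XtX_inv @ x0)             # Var[y_hat | x0] = sigma^2 x0^T (X^T X)^{-1} x0
```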

In linear regression, Ordinary Least Squares (OLS) is an approach to find such an $\hat f$: we find $\hat\beta = \arg\min\limits_{\alpha\in\mathbb{R}^{p+1}} \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\alpha)^2 = \arg\min\limits_{\alpha\in\mathbb{R}^{p+1}} \frac{1}{n}\|y - X\alpha\|_2^2$ (the closed-form solution is sketched after the list below).

  • The loss function for OLS is $L(\beta, D_{train}) = RSS = \|y - X\beta\|_2^2$, and the penalty is $Pen(\beta) = 0$
  • $\hat\beta = (X^TX)^{-1}X^Ty$ is the solution of OLS when $X$ has full column rank
  • OLS has larger variance but no bias when the model is truly linear
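
A sketch of the closed-form solution next to a numerically preferred solver (simulated data; `np.linalg.lstsq` avoids explicitly inverting $X^TX$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Closed form (X^T X)^{-1} X^T y, valid when X has full column rank.
beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice a least-squares solver is preferred numerically over an explicit inverse.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_closed, beta_lstsq))   # True
```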

After we get $\hat\beta$, we first assume the error terms are independent with $\mathbb{E}[\epsilon] = 0$ and $Var(\epsilon_i) = \sigma^2$, and then check (a simulation check follows the list):

  • Unbiasedness: $\mathbb{E}[\hat\beta] = \beta$
  • The covariance matrix of $\hat\beta$ is $Cov(\hat\beta) = \sigma^2(X^TX)^{-1}$
  • The above two properties give the $\ell_2$ estimation error $\mathbb{E}[\|\hat\beta - \beta\|_2^2] = \sigma^2 Tr[(X^TX)^{-1}]$; in the special case $X^TX = nI_{p+1}$ this becomes $\mathbb{E}[\|\hat\beta - \beta\|_2^2] = \frac{\sigma^2(p+1)}{n}$
  • The $MSE$ of estimating $\beta$ increases as $p$ gets larger, which reflects the larger variance
  • If $p > n$, then the OLS estimator is not unique and its variance is infinite
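
A Monte Carlo sketch of unbiasedness, the covariance formula, and the trace form of the $\ell_2$ estimation error (fixed simulated design, arbitrary true coefficients):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 60, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # fixed design
beta = np.array([1.0, -1.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ y)
estimates = np.array(estimates)

print(estimates.mean(axis=0), beta)                  # E[beta_hat] = beta (unbiased)
print(np.cov(estimates.T))                           # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
err = np.mean(np.sum((estimates - beta) ** 2, axis=1))
print(err, sigma**2 * np.trace(XtX_inv))             # l2 estimation error matches the trace formula
```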

To reduce the work of finding a better model and to avoid the $p > n$ situation, we often want to remove predictors that are not significant. That is, we perform model selection to get the best linear model. Alternatively, we can use Lasso, Ridge, or elastic net regression to shrink the coefficients towards zero, which is a similar idea to removing those coefficients (an example follows the next bullet).

  • elastic net is the combination of lasso and ridge, where $\hat\beta^R_{\lambda} = \arg\min\limits_{\beta} RSS + \lambda[(1-\alpha)\|\beta\|_1 + \alpha\|\beta\|_2^2]$, $\alpha\in[0,1]$, but we don't discuss this further
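
A short sklearn sketch of the three penalized fits (simulated sparse data; note that sklearn's `alpha`/`l1_ratio` parameterization is scaled differently from the $\lambda$, $\alpha$ written above):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(8)
n, p = 100, 20
X = rng.normal(size=(n, p))                 # sklearn fits the intercept separately
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # only 3 predictors actually matter
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)          # l1 penalty: many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)          # l2 penalty: coefficients shrunk, not zeroed
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of the two penalties

print(np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```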

The estimate of $\beta$ is unbiased: in the special case $X^TX = nI_{p+1}$ above, $\hat\beta = \frac{1}{n}X^Ty$, so $E[\hat\beta] = E[\frac{1}{n}X^Ty] = E[\frac{1}{n}X^T(X\beta + \epsilon)] = \frac{1}{n}X^TX\beta + E[\frac{1}{n}X^T\epsilon] = \beta$. The same argument with $(X^TX)^{-1}X^T$ in place of $\frac{1}{n}X^T$ gives unbiasedness in general.

Hierarchy Principle: If we include an interaction term $X_1 \times X_2$ in the model, we should also include the main effects $X_1, X_2$, even if the $p$-values associated with their coefficients are not significant.
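
For illustration, the statsmodels formula interface makes it easy to follow this principle, since `x1 * x2` expands to the main effects plus the interaction (the data and coefficients below are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.2 * df.x1 + 0.1 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(scale=0.5, size=n)

# "x1 * x2" expands to x1 + x2 + x1:x2, so the main effects stay in the model
# alongside the interaction, as the hierarchy principle requires.
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.params)
```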