Recall $Y = f(X) + \epsilon$. If we assume a linear model, then $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$, where $f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ is a linear combination of the predictors $X_1, X_2, \dots, X_p$. This linear relationship between $X$ and $Y$ is the linear regression model.
For computational purposes, we write the vectors $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ and $X = (1, X_1, X_2, \dots, X_p)^T$, so that $Y = X^T \beta + \epsilon$.
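As a quick illustration of this matrix form, here is a minimal simulation sketch; the sample size, coefficient values, and noise level are arbitrary choices for illustration, not taken from the notes.

```python
# Sketch: simulating data from the linear model Y = X beta + eps,
# with a column of ones in X for the intercept.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                  # sample size and number of predictors (illustrative)
beta = np.array([1.0, 2.0, -1.0, 0.5])         # (beta_0, beta_1, ..., beta_p)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
eps = rng.normal(scale=1.0, size=n)            # noise with variance sigma^2 = 1
y = X @ beta + eps                             # Y = X beta + eps
```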
Recall that our goal is to minimize the $[f(X) - \hat f(X)]^2$ part of the error to get a better model $\hat f$. To quantify this, we first define some terms and values:
Residual Sum of Squares (RSS): we define the residual as $e_i = y_i - \hat y_i$, then $\mathrm{RSS} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2$.
We then want the least squares fit (i.e. minimize the RSS).
Total Sum of Squares (TSS): $\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar y)^2$, where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$.
We estimate the unknown variance $\sigma^2$ by $\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n (y_i - x_i^T \hat\beta)^2$.
Here, $y_i$ is the observed value of $Y$ and $\hat y_i$ is the predicted value from $\hat Y$.
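A small sketch of these quantities in code, assuming arrays `y` (observed values) and `y_hat` (fitted values) and the number of predictors `p` are already available:

```python
# Sketch: RSS, TSS, and the noise-variance estimate sigma_hat^2.
import numpy as np

def rss(y, y_hat):
    return np.sum((y - y_hat) ** 2)          # RSS = sum of squared residuals

def tss(y):
    return np.sum((y - np.mean(y)) ** 2)     # TSS = total sum of squares

def sigma2_hat(y, y_hat, p):
    n = len(y)
    return rss(y, y_hat) / (n - p - 1)       # unbiased estimate of sigma^2
```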
In the linear model, the $\epsilon_i$ are assumed to be uncorrelated; otherwise the estimated standard errors will not be close to the true standard errors.
$R^2$ is the proportion of the variation in the outcome that is predictable from the predictors, where $R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$; in simple linear regression this equals the squared correlation $\left(\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}\right)^2$.
$0 \le R^2 \le 1$.
$R^2$ close to 1 indicates that a large proportion of the variability in the response is explained by the predictors (i.e. the response is well interpreted by the predictors).
$R^2$ alone cannot determine how well the model fits; it only gives a sense of interpretability. A more complicated model may have an $R^2$ closer to 1.
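A sketch of the $R^2$ computation, with an illustrative simple-regression example (the data below are made up) showing that it matches the squared sample correlation:

```python
# Sketch: R^2 = 1 - RSS/TSS; in simple linear regression it equals Cor(x, y)^2.
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # least squares slope
b0 = y.mean() - b1 * x.mean()                         # least squares intercept
y_hat = b0 + b1 * x
print(r_squared(y, y_hat), np.corrcoef(x, y)[0, 1] ** 2)  # the two values agree
```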
The 95% confidence interval for $\beta_j$ is $[\hat\beta_j - 1.96 \cdot \mathrm{SE}(\hat\beta_j),\ \hat\beta_j + 1.96 \cdot \mathrm{SE}(\hat\beta_j)]$, where the standard error is $\mathrm{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 \left[(X^T X)^{-1}\right]_{jj}}$ and 1.96 is the 97.5% quantile of the normal distribution.
The difference between a confidence interval and a prediction interval is that the CI is a range for $E[y \mid x]$ while the PI is a range for $y$ itself; that is, the standard error SE is different in the two cases (the PI also accounts for the irreducible error $\epsilon$, so it is wider).
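A sketch of the two intervals at a new point, assuming a design matrix `X`, fitted coefficients `beta_hat`, and the variance estimate `sigma2_hat` are already computed:

```python
# Sketch: confidence interval for E[y | x0] vs prediction interval for y at x0.
import numpy as np

def interval(x0, X, beta_hat, sigma2_hat, kind="confidence", z=1.96):
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = x0 @ XtX_inv @ x0                # x0^T (X^T X)^{-1} x0
    if kind == "confidence":                    # range for E[y | x0]
        se = np.sqrt(sigma2_hat * leverage)
    else:                                       # "prediction": range for y itself
        se = np.sqrt(sigma2_hat * (1.0 + leverage))
    center = x0 @ beta_hat
    return center - z * se, center + z * se
```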
We can also use the $t$-statistic to check the significance of $\beta_j$ (hypothesis test of $\beta_j = 0$), where $t = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}$ follows a $t$-distribution with $n - p - 1$ degrees of freedom when $\beta_j = 0$.
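A sketch of per-coefficient inference (standard errors, $t$-statistics, two-sided p-values, and 95% CIs), assuming `X` (with intercept column), `y`, and `beta_hat` are available:

```python
# Sketch: standard error, t-statistic, p-value, and 95% CI for each coefficient.
import numpy as np
from scipy import stats

def coef_inference(X, y, beta_hat):
    n, k = X.shape                        # k = p + 1
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - k)      # RSS / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t = beta_hat / se
    p_vals = 2 * stats.t.sf(np.abs(t), df=n - k)   # two-sided p-values
    ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
    return se, t, p_vals, ci
```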
We can also use the $F$-statistic to test several parameters at once, with $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0$; comparing the reduced model (with RSS$_0$) to the full model gives $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n-p-1)}$.
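A sketch of this nested-model $F$-test, assuming `X_full` and `X_reduced` are design matrices where the reduced model drops the $q$ tested columns:

```python
# Sketch: F-test comparing a full model to a reduced model with q coefficients set to zero.
import numpy as np
from scipy import stats

def f_test(X_full, X_reduced, y):
    n, k_full = X_full.shape              # k_full = p + 1
    q = k_full - X_reduced.shape[1]       # number of coefficients tested
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss_full, rss_red = rss(X_full), rss(X_reduced)
    F = ((rss_red - rss_full) / q) / (rss_full / (n - k_full))
    p_value = stats.f.sf(F, q, n - k_full)
    return F, p_value
```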
After we obtain the final model, the prediction at $X = x$ is $\hat y = x^T \hat\beta$, whose properties follow from the properties of $\hat\beta$ described below.
In linear regression, Ordinary Least Squares (OLS) is an approach to finding such an $\hat f$: we need to find $\hat\beta$ where $\hat\beta = \arg\min_{\alpha \in \mathbb{R}^{p+1}} \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T \alpha)^2 = \arg\min_{\alpha \in \mathbb{R}^{p+1}} \frac{1}{n} \|y - X\alpha\|_2^2$.
The loss function for OLS is $L(\beta, D_{\mathrm{train}}) = \mathrm{RSS} = \|y - X\beta\|_2^2$, and the penalty is $\mathrm{Pen}(\beta) = 0$.
$\hat\beta = (X^T X)^{-1} X^T y$ is the solution of OLS when $X$ has full column rank.
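A minimal sketch of this closed-form solution; in practice a linear solver or `np.linalg.lstsq` is preferred over forming the explicit inverse:

```python
# Sketch: closed-form OLS, beta_hat = (X^T X)^{-1} X^T y.
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)   # assumes X has full column rank

# equivalently: beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```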
OLS has no bias when the model is truly linear, but it can have a larger variance than shrinkage methods such as ridge or lasso.
After we get $\hat\beta$, we assume the error terms are uncorrelated with $E[\epsilon_i] = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and then check:
Unbiasedness: $E[\hat\beta] = \beta$.
The covariance matrix of $\hat\beta$ is $\mathrm{Cov}(\hat\beta) = \sigma^2 (X^T X)^{-1}$.
The above two properties imply the $\ell_2$ estimation error $E[\|\hat\beta - \beta\|_2^2] = \sigma^2 \mathrm{Tr}[(X^T X)^{-1}]$; when $X^T X = n I_{p+1}$ (an orthonormal design), this gives $E[\|\hat\beta - \beta\|_2^2] = \frac{\sigma^2 (p+1)}{n}$.
The MSE of estimating $\beta$ increases as $p$ gets larger, which reflects the larger variance.
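A Monte Carlo sketch of the $\frac{\sigma^2 (p+1)}{n}$ result above, under the orthonormal-design assumption $X^T X = n I_{p+1}$ (the design below is constructed just to satisfy that assumption):

```python
# Sketch: check that E||beta_hat - beta||^2 is approximately sigma^2 (p+1) / n
# when X^T X = n I_{p+1}.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 200, 4, 1.0
Q, _ = np.linalg.qr(rng.normal(size=(n, p + 1)))
X = np.sqrt(n) * Q                       # now X^T X = n I_{p+1}
beta = rng.normal(size=p + 1)

errs = []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = X.T @ y / n               # OLS under the orthonormal design
    errs.append(np.sum((beta_hat - beta) ** 2))
print(np.mean(errs), sigma**2 * (p + 1) / n)   # the two numbers should be close
```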
If $p > n$, then the OLS estimator is not unique and its variance is infinite.
To reduce the work of finding a better model and to avoid the $p > n$ situation, we want to remove the predictors that are not significant. That is, we perform model selection to get the best linear model. Alternatively, we can use lasso, ridge, or elastic net regression to shrink the coefficients towards zero, which is a similar idea to removing those coefficients.
Elastic net is the combination of lasso and ridge, where $\hat\beta_\lambda = \arg\min_\beta \mathrm{RSS} + \lambda\left[(1-\alpha)\|\beta\|_1 + \alpha\|\beta\|_2^2\right]$, $\alpha \in [0, 1]$, but we do not discuss this further.
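A sketch of these shrinkage fits with scikit-learn on made-up data; note that sklearn parameterizes the elastic net penalty with `alpha` (overall strength) and `l1_ratio` (lasso share), which differs slightly from the $\lambda$/$\alpha$ notation above:

```python
# Sketch: ridge, lasso, and elastic net fits on simulated data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=100)   # only two predictors matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # lasso sets some coefficients exactly to zero
```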
Under the orthonormal design $X^T X = n I_{p+1}$ (so that $\hat\beta = \frac{1}{n} X^T y$), the estimate of $\beta$ is unbiased: $E[\hat\beta] = E\left[\frac{1}{n} X^T y\right] = E\left[\frac{1}{n} X^T (X\beta + \epsilon)\right] = \frac{1}{n} X^T X \beta + E\left[\frac{1}{n} X^T \epsilon\right] = \beta$, using $E[\epsilon] = 0$.
Hierarchy Principle: if we include an interaction term $X_1 \times X_2$ in the model, we should also include the main effects $X_1, X_2$, even if the $p$-values associated with their coefficients are not significant.
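A small sketch of what the hierarchy principle means for the design matrix; the variables below are illustrative, not from the notes:

```python
# Sketch: keep the main effects X1 and X2 whenever the interaction X1*X2 is included.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# design matrix: intercept, main effects X1 and X2, and the interaction X1*X2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
```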