Model Validation Without Test Data

There are two common approaches to model selection when we don't have a test set $D_{test}$:

  1. Avoid estimating the expected MSE; instead, adjust the training error to account for model complexity.
  2. Use data-splitting techniques to create a "test set".

Note that in both approaches a model has already been fitted on the training data, and every MSE is computed with that fitted model.

Avoid Estimating the Expected MSE

Let $L := L(\hat f)$ be the maximized value of the likelihood function for the fitted model $\hat f$, let $n$ be the sample size, and let $p$ be the number of predictors.

Mallows's $C_p$: $C_p = \frac{1}{n}(RSS + 2p\hat\sigma^2)$

  • $\hat\sigma^2$ is an estimate of $Var[\epsilon] = \sigma^2$
  • the lowest $C_p$ is best
  • applies only to linear models fitted by OLS in regression problems
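The $C_p$ formula can be sketched in a few lines of Python. The RSS values, the sample size, and the value of $\hat\sigma^2$ below are made-up numbers used only for illustration, and `mallows_cp` is a hypothetical helper name.

```python
def mallows_cp(rss, n, p, sigma2_hat):
    """C_p = (RSS + 2 * p * sigma2_hat) / n, as defined above."""
    return (rss + 2 * p * sigma2_hat) / n

# Hypothetical setting: three nested OLS fits on n = 100 observations.
n = 100
sigma2_hat = 1.0  # assumed estimate of Var[eps], e.g. taken from the largest model
rss_by_p = {1: 250.0, 2: 120.0, 3: 119.0}  # p -> RSS (made-up values)

scores = {p: mallows_cp(rss, n, p, sigma2_hat) for p, rss in rss_by_p.items()}
best_p = min(scores, key=scores.get)  # the lowest C_p is best
print(scores, best_p)
```

In this toy setting the third predictor lowers RSS by only 1, which does not cover the extra penalty of $2\hat\sigma^2 = 2$, so the two-predictor model has the lowest $C_p$.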

AIC: $AIC = -2\log L + 2p$

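  • In the linear model with $\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$ and $\sigma^2$ replaced by $\hat\sigma^2$, $AIC = \frac{n C_p}{\hat\sigma^2}$ plus a constant that is the same for every candidate model, so AIC and $C_p$ rank models identically (if AIC is rescaled by $\frac{1}{n}$ and the constant is dropped, this reads $AIC = \frac{C_p}{\hat\sigma^2}$)
  • the lowest AIC is best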
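The link between AIC and $C_p$ can be checked directly. The following is a sketch under the assumption (not stated explicitly in these notes) that $\sigma^2$ is treated as known and fixed at $\hat\sigma^2$, with $RSS$ the residual sum of squares of the OLS fit:

```latex
\begin{aligned}
\log L &= -\frac{n}{2}\log(2\pi\hat\sigma^2) - \frac{RSS}{2\hat\sigma^2} \\
AIC = -2\log L + 2p
  &= \frac{RSS + 2p\hat\sigma^2}{\hat\sigma^2} + n\log(2\pi\hat\sigma^2) \\
  &= \frac{n\,C_p}{\hat\sigma^2} + n\log(2\pi\hat\sigma^2)
\end{aligned}
```

Since $n$ and $\hat\sigma^2$ are the same for every candidate model, AIC is an increasing linear function of $C_p$ and both criteria select the same model. Dropping the additive constant and rescaling by $\frac{1}{n}$ gives the shorter form $\frac{C_p}{\hat\sigma^2}$.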

BIC: $BIC = -2\log L + (\log n)p$

  • since $\log n > 2$ whenever $n > 7$, BIC penalizes each additional predictor more heavily than AIC, so it tends to select smaller models
  • the lowest BIC is best
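To make the difference between the two penalties concrete, here is a minimal Python sketch. It assumes a Gaussian linear model with a known error variance, and the RSS values are made up for illustration; the function names are hypothetical helpers, not part of any library.

```python
import math

def gaussian_log_likelihood(rss, n, sigma2):
    """Maximized log-likelihood of a Gaussian linear model with known variance sigma2."""
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def aic(log_l, p):
    return -2 * log_l + 2 * p

def bic(log_l, p, n):
    return -2 * log_l + math.log(n) * p

n, sigma2 = 100, 1.0                        # assumed sample size and error variance
rss_by_p = {1: 250.0, 2: 120.0, 3: 117.0}   # p -> RSS (made-up values)

log_l = {p: gaussian_log_likelihood(rss, n, sigma2) for p, rss in rss_by_p.items()}
aic_scores = {p: aic(log_l[p], p) for p in rss_by_p}
bic_scores = {p: bic(log_l[p], p, n) for p in rss_by_p}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

Here the third predictor reduces RSS by 3, which exceeds the AIC penalty of 2 but not the BIC penalty of $\log 100 \approx 4.6$, so AIC keeps three predictors while BIC keeps two.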

Adjusted $R^2 := 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}$

  • the highest value is best
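A short Python sketch of the adjusted $R^2$ formula, compared with the ordinary $R^2 = 1 - RSS/TSS$. The numbers are made up for illustration and the function names are hypothetical.

```python
def r2(rss, tss):
    return 1 - rss / tss

def adjusted_r2(rss, tss, n, p):
    """1 - (RSS / (n - p - 1)) / (TSS / (n - 1)), as defined above."""
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))

n, tss = 50, 200.0  # assumed sample size and total sum of squares
# A fourth predictor that barely reduces RSS (made-up values).
small = {"p": 3, "rss": 50.0}
large = {"p": 4, "rss": 49.5}

for m in (small, large):
    print(m["p"], r2(m["rss"], tss), adjusted_r2(m["rss"], tss, n, m["p"]))
```

Plain $R^2$ never decreases when a predictor is added, whereas adjusted $R^2$ drops here because the small reduction in RSS does not offset the lost degree of freedom.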

Estimating the Expected MSE

Validation Set: one-time data splitting; the given dataset is split once into a training set and a validation set

  • highly unstable: the estimate depends heavily on which observations fall into each part
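The instability of a single split can be seen in a small experiment. This is a sketch in plain Python on synthetic data (an assumption: $y = 1 + 2x$ plus Gaussian noise), using a one-predictor least-squares fit; `fit_line`, `mse`, and `validation_mse` are hypothetical helper names.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mse(model, xs, ys):
    b0, b1 = model
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def validation_mse(xs, ys, seed, train_frac=0.5):
    """Split once at random, fit on the training part, score on the rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, val = idx[:cut], idx[cut:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in val], [ys[i] for i in val])

# Synthetic data (assumption): y = 1 + 2x + N(0, 3^2) noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [1 + 2 * x + rng.gauss(0, 3) for x in xs]

# Same data and same model; only the random split changes.
estimates = [validation_mse(xs, ys, seed) for seed in range(5)]
print(estimates)
```

Each entry of `estimates` is a validation-set estimate of the test MSE for the same model, so the spread between them comes entirely from the choice of split.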

Cross-validation: repeated data splitting

One form of cross-validation is Leave-One-Out Cross-Validation (LOOCV): each time, hold out a single observation $i$ as the validation set, train on the remaining $n - 1$ observations, and compute the validation MSE, denoted $MSE_i$. The LOOCV estimate of the test MSE is the average of these values: $CV_{(n)} = \frac{1}{n}\sum_{i = 1}^n MSE_i$

  • computationally expensive, but stable

Another form is k-Fold Cross-Validation: randomly divide the data into $k$ equal-sized groups, or folds. Select one fold as the validation set, train on the other $k - 1$ folds, and compute the validation MSE, denoted $MSE_i$. Repeat with each fold held out in turn; the final estimate of the test MSE is $CV_{(k)} = \frac{1}{k}\sum_{i = 1}^k MSE_i$
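Both procedures can be sketched with one function, since LOOCV is k-fold cross-validation with $k = n$. The example below is plain Python on synthetic data (an assumption: $y = 1 + 2x$ plus Gaussian noise) with a one-predictor least-squares fit; the helper names are hypothetical, and folds are allowed to differ in size by one observation when $k$ does not divide $n$.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mse(model, xs, ys):
    b0, b1 = model
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_cv(xs, ys, k, seed=0):
    """Average of the k validation MSEs, one per held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    fold_mse = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_mse.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(fold_mse) / k

# Synthetic data (assumption): y = 1 + 2x + N(0, 3^2) noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [1 + 2 * x + rng.gauss(0, 3) for x in xs]

cv_5 = k_fold_cv(xs, ys, k=5)          # 5 model fits
cv_loo = k_fold_cv(xs, ys, k=len(xs))  # k = n gives LOOCV: n model fits
print(cv_5, cv_loo)
```

The cost difference is visible in the number of fits: $k$ for k-fold versus $n$ for LOOCV. The LOOCV value does not depend on the random shuffle, because every observation is held out exactly once on its own.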

Pros and Cons

Data-splitting (validation set, cross-validation) approach:

  • gives a direct estimate of the test MSE
  • can be used in a wider range of model selection tasks
  • requires a relatively large sample size
  • offers few theoretical guarantees for the model selected by CV
  • when $Var[\epsilon]$ can be consistently estimated, the methods that avoid estimating the expected MSE are preferable
  • applicable to all supervised learning problems

AIC/BIC and similar approaches:

  • better suited to datasets with a limited sample size
  • applicable to any model for which a likelihood is specified