Model Validation without Test Data
There are two common approaches to model selection when we don't have test data:
- Avoid estimating the expected MSE by making an adjustment to the training error to account for the model complexity.
- Use data-splitting techniques to create a "test set" and estimate the expected MSE on it.
Note that in either approach a model is first fitted on training data, and every MSE is then calculated from that fitted model.
Avoid Estimating the Expected MSE
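Let $\hat{L}$ be the maximized value of the likelihood function for the fitted model, $n$ the number of observations, $d$ the number of predictors, $RSS$ the residual sum of squares, $TSS$ the total sum of squares, and $\hat{\sigma}^2$ an estimate of the error variance.
Mallows' $C_p$: $C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)$
- $C_p$ is an estimate of the test MSE
- the lower the better
- only for linear models fitted via OLS in regression problems
AIC: $AIC = -2\log\hat{L} + 2d$
- In the linear model with Gaussian errors, AIC equals $\frac{1}{\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)$ up to an additive constant when $\sigma^2$ is replaced by $\hat{\sigma}^2$, so it ranks models the same way as $C_p$
- the lower the better
BIC: $BIC = -2\log\hat{L} + d\log n$
- BIC penalizes the number of predictors more heavily than AIC ($\log n > 2$ once $n > 7$), so it tends to select smaller models
- the lower the better
Adjusted $R^2$: $R^2_{adj} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$
- the higher the better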
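A minimal sketch of how these criteria could be computed for an OLS fit. It assumes the standard definitions $C_p = (RSS + 2d\hat{\sigma}^2)/n$, $AIC = -2\log\hat{L} + 2d$, $BIC = -2\log\hat{L} + d\log n$ and adjusted $R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$, with a Gaussian likelihood whose variance is replaced by its MLE $RSS/n$. The function name `ols_criteria` and the parameter `sigma2_hat` are hypothetical, and $d$ counts predictors only (the intercept and variance shift every model by the same constant).

```python
import numpy as np

def ols_criteria(X, y, sigma2_hat):
    """Return RSS, C_p, AIC, BIC and adjusted R^2 for an OLS fit with intercept.

    X: (n, d) predictor matrix without the intercept column.
    sigma2_hat: error-variance estimate supplied by the caller
                (assumption: e.g. taken from the largest candidate model).
    """
    n, d = X.shape
    design = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # OLS coefficients
    rss = float(np.sum((y - design @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    # Maximized Gaussian log-likelihood, with sigma^2 set to its MLE rss / n.
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return {
        "rss": rss,
        "cp": (rss + 2 * d * sigma2_hat) / n,
        "aic": -2 * loglik + 2 * d,
        "bic": -2 * loglik + d * np.log(n),
        "adj_r2": 1 - (rss / (n - d - 1)) / (tss / (n - 1)),
    }
```

To compare candidate models, call the function once per candidate with the same `sigma2_hat` and pick the lowest $C_p$, AIC or BIC, or the highest adjusted $R^2$.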
Estimating the Expected MSE
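Validation Set: one-time data splitting; split the given dataset once into a training set and a validation set, fit on the former and calculate the MSE on the latter
- highly unstable: the estimate depends on which observations fall into each part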
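The instability can be seen by repeating the single split with different random seeds. The sketch below assumes an OLS model and synthetic data generated only for illustration; `validation_set_mse` and `train_frac` are hypothetical names.

```python
import numpy as np

def validation_set_mse(X, y, seed, train_frac=0.5):
    """Validation-set estimate of the test MSE from one random split."""
    n = len(y)
    perm = np.random.default_rng(seed).permutation(n)
    n_train = int(train_frac * n)
    train, val = perm[:n_train], perm[n_train:]
    # Fit OLS (with intercept) on the training part only.
    design = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(design, y[train], rcond=None)
    # Evaluate on the held-out validation part.
    pred = np.column_stack([np.ones(len(val)), X[val]]) @ beta
    return float(np.mean((y[val] - pred) ** 2))

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=100)

# One estimate per split; the spread across seeds is the instability.
estimates = [validation_set_mse(X, y, seed) for seed in range(5)]
print(estimates)
```

Each seed gives a different split and therefore a different estimate, even though the data and the model are unchanged.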
Cross-validation: multiple-time data splitting
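One form of cross-validation is Leave-One-Out Cross-Validation (LOOCV): each time, select a single observation $(x_i, y_i)$ as the validation set and use the remaining $n-1$ observations as the training set, then calculate the MSE, denoted $MSE_i$. The LOOCV estimate of the test MSE is the average of those, $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i$, where $MSE_i = (y_i - \hat{y}_i)^2$ and $\hat{y}_i$ is the prediction for $x_i$ from the model fitted without observation $i$
- computationally expensive ($n$ model fits), but stable: there is no randomness in the splits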
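A minimal LOOCV sketch, assuming an OLS model; the helper names `fit_ols`, `predict_ols` and `loocv_mse` are hypothetical. The loop refits the model $n$ times, which is where the computational cost comes from.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients (intercept first) for an (n, d) predictor matrix."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Predictions for an (m, d) predictor matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def loocv_mse(X, y):
    """Average of the n leave-one-out squared errors."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                 # every observation except i
        beta = fit_ols(X[mask], y[mask])         # refit without observation i
        pred = predict_ols(beta, X[i:i + 1])[0]  # predict the held-out point
        errors[i] = (y[i] - pred) ** 2
    return float(errors.mean())
```

Because every observation is held out exactly once, calling `loocv_mse` twice on the same data returns the same value.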
Another form is k-Fold Cross-Validation: randomly divide the data into $k$ equal-sized groups, or folds; select one of them as the validation set and the others as the training set, then calculate the MSE, denoted $MSE_i$. Repeat with each fold in turn as the validation set; the final estimate of the test MSE is $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i$
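A minimal k-fold sketch, again assuming an OLS model; `kfold_mse` and its parameters are hypothetical names. When $n$ is not divisible by $k$, `np.array_split` produces folds whose sizes differ by at most one, and the fold MSEs are still averaged with equal weight.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """k-fold cross-validation estimate of the test MSE for OLS."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # random, near-equal folds
    fold_mse = []
    for val_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False                # hold out the current fold
        design = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        beta, *_ = np.linalg.lstsq(design, y[train_mask], rcond=None)
        pred = np.column_stack([np.ones(len(val_idx)), X[val_idx]]) @ beta
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(fold_mse))                # average of the k fold MSEs
```

Setting `k` equal to the number of observations makes every fold a single point, which recovers LOOCV.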
Pros and Cons
Cross-validation approach:
- gives a direct estimate of the test MSE
- can be used in a wider range of model selection tasks
- requires a relatively large sample size
- difficult to give guarantees for the model selected by CV
- when $\sigma^2$ can be consistently estimated, the methods that avoid estimating the expected MSE can be used instead
- applicable to all supervised learning problems
AIC/BIC and similar criteria:
- better for datasets with a limited sample size
- suitable for any model with a specified likelihood