Cross-validation

Model selection

Test the model on data you haven't used for training.

Errors from different splits can be combined in different ways: min-max, average.

https://www.cs.cmu.edu/~schneide/tut5/node42.html

Wikipedia has good explanations: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

Hold-out cross-validation

video here

  1. Randomly split the training set S into two subsets: the training subset S_train (e.g. 70%) and the cross-validation subset S_CV (e.g. 30%).
  2. Train the model on S_train, and test it on S_CV.
  3. Pick the model (set of Hyperparameters) with the smallest error on S_CV.

To search for the best configuration of hyperparameters according to the validation error, there are several methods; popular ones include grid search, random search, and Bayesian optimization.

Sometimes, once the model complexity (and other Hyperparameters) has been picked, the model is retrained on the whole data set.
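A minimal sketch of this procedure in Python (using scikit-learn; the synthetic data, the Ridge model and the alpha grid are just illustrative choices, not part of this note):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# 1. Randomly split S into S_train (70%) and S_CV (30%)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

# 2./3. Train one model per hyperparameter value, keep the one with the smallest error on S_CV
best_alpha, best_err = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = mean_squared_error(y_cv, model.predict(X_cv))
    if err < best_err:
        best_alpha, best_err = alpha, err

# Optionally retrain the chosen model on the whole data set
final_model = Ridge(alpha=best_alpha).fit(X, y)
print(f"best alpha: {best_alpha}, CV error: {best_err:.4f}")
```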

Optimizing the number of epochs: see Overfitting and underfitting.

Test set

To predict the generalization error (see Learning theory) of the chosen best hyperparameters, we can't just look at their error on S_CV, as that set has been used by the algorithm to choose the final answer. As a result, the error on S_CV is a biased estimator of the generalization error. To have an unbiased estimator, we need a test set that is only used once the learning algorithm has finished completely. See video
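A sketch of how the test set sits outside the selection loop (scikit-learn again; the 60/20/20 split sizes, the Ridge model and the alpha grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

# Carve out the test set first, then split the rest into S_train / S_CV
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

# Hyperparameters are selected using the CV set only
best_alpha, best_cv_err = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    err = mean_squared_error(y_cv, Ridge(alpha=alpha).fit(X_train, y_train).predict(X_cv))
    if err < best_cv_err:
        best_alpha, best_cv_err = alpha, err

# The test set is touched exactly once, after all selection is done,
# so its error is an (approximately) unbiased estimate of the generalization error
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_err = mean_squared_error(y_test, final_model.predict(X_test))
print(f"CV error: {best_cv_err:.4f}, test error: {test_err:.4f}")
```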

k-fold cross-validation

video

  1. Split the data into k equal pieces. A common choice is k=10.
  2. For i from 1 to k: hold out the i-th piece for testing, and use the other k-1 pieces for training.
  3. Average the errors from the k iterations.

More computationally expensive than hold-out CV, since k models are trained instead of one.
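A sketch with scikit-learn's KFold (the Ridge model with a fixed alpha is just an illustrative stand-in for whatever model is being evaluated):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=2).split(X):
    # Hold out the i-th piece for testing, train on the other k-1 pieces
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# Average the errors from the k iterations
print(f"10-fold CV error: {np.mean(errors):.4f}")
```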

Leave-one-out cross-validation

k-fold CV with k = {number of training examples}, so in each iteration you leave exactly one example out.

Even more computationally expensive, but more accurate estimate of generalization error. Only done when the data is very scarce.
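Same idea with scikit-learn's LeaveOneOut, which is just KFold with k = n (Ridge again only as an illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))   # small data set, where LOOCV is affordable and most useful
y = X @ rng.normal(size=3) + 0.1 * rng.normal(size=30)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):   # n iterations, one held-out example each
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(f"LOOCV error estimate ({len(errors)} model fits): {np.mean(errors):.4f}")
```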

Cross-validation generalization error bounds

I have the feeling that one (not necessarily me xD) could maybe theoretically prove that something like cross-validation is an optimal way to estimate the generalization error of your algorithm (under some definitions of optimality). In a way that would be kinda annoying because CV is used all the time, so we won't get anything better. But, on the other hand it'd be pretty interesting and kinda useful to know... [insert thonk emoji]


Some thoughts from when I misunderstood the train/validation/test setup.

What I describe here is some sort of hyperlearning with two steps: we learn two sets of hyperparameters and use two different validation sets (which below I mistakenly call the validation and test sets...).

Suppose we have trained the model using a method such as hold-out CV, with some fixed learning Hyperparameters. If we want to learn the hyperparameters too, we can run this whole training procedure for several values of the hyperparameters.

However, to choose the hyperparameters that are best, we can't just look at their results on S_CV, as that set has been used by the algorithm to choose the final answer. As a result, the error on S_CV is a biased estimator of the generalization error (see Learning theory). To have an unbiased estimator, we need a test set that is only used once the learning algorithm has finished completely. We can compare results on the test set to choose the best set of hyperparameters. Note that once we have done that, the test error is no longer an unbiased estimate of the generalization error, as we have used it to output our very final answer; i.e. our final answer depended on it!! We'd need a hypertest set.