Generalization dynamics

cosmos 8th November 2018 at 7:53pm
Generalization, Statistical mechanics of neural networks

High-dimensional dynamics of generalization error in neural networks (see also Statistical mechanics of neural networks): Using random matrix theory and exact solutions in deep linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of the data and on the signal-to-noise ratio of the learning problem.
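A minimal sketch of this kind of setup (a shallow linear student trained by gradient descent on labels from a noisy linear teacher; all names and parameter values here are illustrative, not taken from the paper), tracking training and test error over learning time:

```python
import numpy as np

# Illustrative student-teacher linear setup (my own toy, not the paper's
# exact protocol): a teacher w_bar generates noisy labels, a linear student
# is trained by full-batch gradient descent from small-norm init, and we
# track training and generalization (test) error over time.

rng = np.random.default_rng(0)

N = 100        # number of input dimensions / weights
P = 100        # number of training examples (P ~ N is the critical regime)
snr = 5.0      # signal-to-noise ratio of the labels (assumed value)
lr = 0.05      # learning rate
steps = 2001

w_bar = rng.normal(size=N) / np.sqrt(N)              # teacher, ||w_bar|| ~ 1
X = rng.normal(size=(P, N))                          # training inputs
y = X @ w_bar + rng.normal(size=P) / np.sqrt(snr)    # noisy training labels

X_test = rng.normal(size=(5000, N))                  # clean test set
y_test = X_test @ w_bar

w = np.zeros(N)                                      # small-norm initialization
for t in range(steps):
    resid = X @ w - y
    if t % 200 == 0:
        train_err = np.mean(resid ** 2)
        test_err = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {t:5d}  train {train_err:.4f}  test {test_err:.4f}")
    w -= lr * X.T @ resid / P                        # gradient descent step
```

With P ≈ N and noisy labels, the test error first falls and then rises again if training continues, which is the overtraining behaviour the notes below refer to.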

We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks.

Overtraining is worst at intermediate network sizes. For both linear and nonlinear nets, catastrophic overtraining is a symptom of a model whose complexity is exactly matched to the size of the training set, and can be combated either by making the model smaller or larger.
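A quick toy check of this claim (same linear setup as above, my own sketch): gradient descent from zero init converges to the minimum-norm least-squares solution X⁺y, so we can evaluate the end-of-training test error directly while sweeping the number of parameters N at fixed training set size P.

```python
import numpy as np

# Illustration of "overtraining is worst when model complexity matches the
# training set size" in the linear toy: sweep N at fixed P and evaluate the
# solution gradient descent from zero init converges to (the pseudoinverse
# fit). Test error should peak around N ~ P.

rng = np.random.default_rng(3)
P, snr, n_test = 100, 5.0, 5000                      # assumed values

for N in [25, 50, 100, 200, 400]:
    w_bar = rng.normal(size=N) / np.sqrt(N)          # teacher, ||w_bar|| ~ 1
    X = rng.normal(size=(P, N))
    y = X @ w_bar + rng.normal(size=P) / np.sqrt(snr)
    w_end = np.linalg.pinv(X) @ y                    # GD-from-zero end point
    X_test = rng.normal(size=(n_test, N))
    test_err = np.mean((X_test @ (w_end - w_bar)) ** 2)
    print(f"N={N:4d} (N/P={N / P:.2f}): end-of-training test error = {test_err:.3f}")
```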

We then turn to non-linear neural networks, and show that making networks very large does not harm their generalization performance. On the contrary, it can in fact reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models:

  • first, there is a frozen subspace of the weights in which no learning occurs under gradient descent (cf. Singular learning theory, Sloppy models). These subspaces are the neutral spaces of the parameter–function GP map! See the sketch after this list.
  • and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations which protect against overtraining
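
A sketch of the frozen-subspace effect in the simplest case (a linear model with fewer training examples than weights; this is my own toy, not the paper's construction): the gradient always lies in the row space of the data, so the weight component orthogonal to it never moves.

```python
import numpy as np

# Frozen subspace in a linear model: with P < N training examples, the
# gradient X^T (Xw - y) lies in the row space of X, so the component of w
# in the orthogonal complement stays at its initial value forever.

rng = np.random.default_rng(1)

N, P = 50, 20                          # N weights, P < N training examples
X = rng.normal(size=(P, N))
y = rng.normal(size=P)

# Orthonormal bases of the data row space and of its orthogonal complement.
U, _, _ = np.linalg.svd(X.T, full_matrices=True)
row_space, frozen_space = U[:, :P], U[:, P:]

w0 = rng.normal(size=N)                # random (nonzero) initialization
w = w0.copy()

lr = 0.05
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y) / P    # plain gradient descent on squared loss

print("row-space component moved by :", np.linalg.norm(row_space.T @ (w - w0)))
print("frozen component moved by    :", np.linalg.norm(frozen_space.T @ (w - w0)))
```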

We derive a generalization error bound which incorporates the frozen-subspace and conditioning effects and qualitatively matches the behavior observed in simulation.

The optimal stopping time (for minimizing generalization error) in these systems grows with the signal-to-noise ratio (SNR), and we provide arguments that this growth is approximately logarithmic. When the quality of the data is higher (high SNR), we expect the algorithm to weigh the data more heavily and thus run gradient descent for longer.
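A single-mode caricature (my own simplification, not the paper's random-matrix calculation) already reproduces the logarithmic growth: if the gradient-descent estimate relaxes as w(t) = (1 − e^{−t}) ŵ toward the noisy least-squares solution ŵ = w̄ + ε, the expected generalization error is e^{−2t} w̄² + (1 − e^{−t})² σ², which is minimized at t* = log(1 + SNR).

```python
import numpy as np

# Numerical check of the single-mode caricature: the optimal stopping time
# of the error curve E(t) = exp(-2t) w_true^2 + (1 - exp(-t))^2 sigma^2
# should match log(1 + SNR), i.e. grow logarithmically with SNR.

t = np.linspace(0, 20, 200001)
for snr in [1, 10, 100, 1000]:
    w_true_sq, sigma_sq = float(snr), 1.0            # SNR = w_true^2 / sigma^2
    err = np.exp(-2 * t) * w_true_sq + (1 - np.exp(-t)) ** 2 * sigma_sq
    t_star = t[np.argmin(err)]
    print(f"SNR {snr:5d}: numerical t* = {t_star:.3f}, log(1+SNR) = {np.log1p(snr):.3f}")
```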

"while a large deep network can indeed fit random labels, gradient-trained DNNs initialized with small- norm weights learn simpler functions first and hence generalize well if there is a consistent rule to be learned."

Remarkably, even without early stopping, generalization can still be better in larger networks. To understand this, we derive an alternative bound on the Rademacher complexity of a two-layer non-linear neural network with fixed first-layer weights that incorporates the dynamics of gradient descent. We show that complexity is limited because of a frozen subspace in which no learning occurs, and overtraining is prevented by a larger gap in the eigenspectrum of the data in the hidden layer in overcomplete models.
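
A rough numerical illustration of both effects for a two-layer net with fixed random first-layer weights (a toy of my own, not the paper's Rademacher bound; widths and sizes are assumed values): the hidden-layer correlation matrix has rank at most P, so for width M > P the second-layer weights have an (M − P)-dimensional frozen subspace, and the learnable part of the spectrum becomes better conditioned as M grows.

```python
import numpy as np

# Two-layer net with fixed random first layer and ReLU hidden units: for P
# training samples and hidden width M, the hidden correlation matrix has
# rank at most min(P, M), so M > P leaves an M - P dimensional frozen
# subspace for the second-layer weights; and the nonzero eigenvalues pull
# away from zero (better conditioning) as the model becomes overcomplete.

rng = np.random.default_rng(2)
N, P = 30, 100                                   # input dim, training set size
X = rng.normal(size=(P, N))

for M in [100, 200, 800, 3200]:                  # hidden widths, M >= P
    W1 = rng.normal(size=(N, M)) / np.sqrt(N)    # fixed random first layer
    H = np.maximum(X @ W1, 0.0)                  # ReLU hidden activations, (P, M)
    # H H^T / M shares its nonzero eigenvalues with the M x M hidden
    # correlation matrix H^T H / M but is only P x P.
    eig = np.linalg.eigvalsh(H @ H.T / M)
    print(f"M={M:5d}: frozen dims for 2nd layer = {M - P:5d}, "
          f"lambda_min = {eig.min():.4f}, lambda_max = {eig.max():.2f}")
```

The smallest eigenvalue is close to zero when M ≈ P (the badly-conditioned, overtraining-prone regime) and grows as the network becomes more overcomplete.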