Generalization in deep learning

cosmos 22nd December 2019 at 1:41am
Deep learning theory Generalization

Deep learning theory

http://guillefix.me/nnbias

Exploring generalization in deep learning
Generalization in Deep Learning
Current state of generalization theory for deep learning (2018)
Slides on myths about generalization in deep learning

Generalization Error in Deep Learning

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
A Closer Look at Memorization in Deep Networks

See gingkoapp tree: Kolmogorov complexity and generalization in deep neural networks

→ Even for Kernel methods, classical learning theory doesn't predict what we observe (see here). Overparametrization in deep learning

Understanding deep learning requires rethinking generalization

Deep learning generalizes because the parameter-function map is biased towards simple functions

Other approaches

On the Spectral Bias of Neural Networks – lower-frequency functions (or components) are more robust! (as Wu et al. also find!). See A Fine-Grained Spectral Perspective on Neural Networks for a more theoretically grounded look at the spectral simplicity bias of neural nets.
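A minimal numpy sketch (mine, not from these papers) of the kind of experiment they describe: fit a 1-D target with one low- and one high-frequency component using a small tanh network and plain gradient descent, and track how much of each frequency is left in the residual. With these (arbitrary) hyperparameters you should typically see the low-frequency coefficient shrink first; the exact numbers depend on the learning rate and initialization.

```python
# Spectral-bias sketch (not from the papers above): train a small tanh MLP on
# y(x) = sin(2*pi*x) + 0.5*sin(2*pi*8*x) with full-batch gradient descent and
# track the Fourier coefficient of the residual at the two frequencies.
import numpy as np

rng = np.random.default_rng(0)
n, hidden, lr, steps = 256, 128, 0.01, 30000

x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
freqs = [1, 8]
y = np.sin(2 * np.pi * 1 * x) + 0.5 * np.sin(2 * np.pi * 8 * x)

# small random initialization
W1 = rng.normal(0, 1.0, (1, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)

def fourier_coeff(r, k):
    # projection of the residual r(x) onto sin(2*pi*k*x), up to normalization
    return 2.0 / n * float(np.sum(r.ravel() * np.sin(2 * np.pi * k * x.ravel())))

for t in range(steps + 1):
    a1 = np.tanh(x @ W1 + b1)            # hidden activations
    pred = a1 @ W2 + b2
    resid = pred - y
    if t % 5000 == 0:
        coeffs = {k: round(fourier_coeff(resid, k), 3) for k in freqs}
        print(f"step {t:6d}  residual coefficients {coeffs}")
    # backprop for the mean squared error
    dpred = 2.0 * resid / n
    dW2 = a1.T @ dpred; db2 = dpred.sum(0)
    da1 = dpred @ W2.T
    dz1 = da1 * (1.0 - a1 ** 2)
    dW1 = x.T @ dz1; db1 = dz1.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
```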

Understanding Generalization through Visualizations

Fisher-Rao Metric, Geometry, and Complexity of Neural Networks
A jamming transition from under- to over-parametrization affects loss landscape and generalization
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning

An analytic theory of generalization dynamics and transfer learning in deep linear networks

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Neural tangent kernel

Recently people are also using the Neural tangent kernel (paper) to understand generalization. See papers here: https://github.com/damaru2/ntk/blob/master/readme.md
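For reference, the empirical NTK of a scalar-output network f(x; θ) is just the Gram matrix of parameter gradients, and in the linearized/infinite-width regime gradient flow on the squared loss fits the data along the kernel's eigenvectors at rates set by its eigenvalues (which connects back to the spectral-bias story above):

```latex
% Empirical NTK of a scalar-output network f(x; \theta):
\Theta_\theta(x, x') \;=\; \big\langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta) \big\rangle .
% In the NTK regime (\Theta approximately constant during training), gradient flow
% on the squared loss gives linear dynamics for the residual on the training set X:
f_t(X) - y \;=\; e^{-\eta\, \Theta(X,X)\, t}\,\big(f_0(X) - y\big),
% so components of the residual aligned with large-eigenvalue directions of \Theta
% are fitted fastest.
```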

Flat minima

Flat Minima (Hochreiter & Schmidhuber, 1997)

On large-batch training for deep learning: Generalization gap and sharp minima.

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data – Sharpness + norm bounds (as an alternative to margin + norm bounds, but similar as margin is similar to sharpness; robustness of the training loss to changes in weights; see Exploring generalization in deep learning)

Sharp Minima Can Generalize For Deep Nets – maybe it's not sharp, but frequent! (Arrival of the frequent)

Entropy-SGD – Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors

Three Factors Influencing Minima in SGD – Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization

PAC-Bayesian learning approach

Deep learning generalizes because the parameter-function map is biased towards simple functions :)

(video) Karolina Dziugaite on Nonvacuous Generalization Bounds for Deep Neural Networks via PAC-Bayes
Data-dependent PAC-Bayes priors via differential privacy
Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors
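For reference, the PAC-Bayes bound these works build on (the Langford–Seeger/Maurer form; constants differ between versions): for any prior P chosen before seeing the m training examples, with probability at least 1 − δ, simultaneously for every posterior Q,

```latex
\mathrm{kl}\!\big(\hat{L}_S(Q)\,\big\|\,L_D(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{m},
% which by Pinsker's inequality gives the more familiar
L_D(Q) \;\le\; \hat{L}_S(Q) + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}} .
```

Dziugaite & Roy get non-vacuous numbers by directly optimizing a Gaussian posterior Q over weights against the right-hand side; the data-dependent-prior papers above are about making KL(Q‖P) small by letting P depend on the data in a controlled (e.g. differentially private) way.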

PAC-bayesian approach to spectrally-normalized margin bounds for neural networks.
Exploring Generalization in Deep Learning

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach

Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience

Norm/Margin-based generalization bound

See Understanding Machine Learning by Shai and Shai for SVM bounds; they are all based on Structural risk minimization and on bounding Rademacher complexity by bounding norms of the parameters (sometimes referred to as scale-sensitive bounds/analysis)
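The generic shape of these scale-sensitive margin bounds (constants vary between versions): with probability at least 1 − δ over m i.i.d. samples, for every f in the class F and margin γ > 0,

```latex
\Pr_{(x,y)\sim D}\big[\, y f(x) \le 0 \,\big]
\;\le\; \hat{R}_\gamma(f)
\;+\; \frac{2}{\gamma}\,\mathfrak{R}_m(\mathcal{F})
\;+\; \sqrt{\frac{\ln(1/\delta)}{2m}} ,
% \hat{R}_\gamma(f): fraction of training points with margin below \gamma
% \mathfrak{R}_m(\mathcal{F}): (empirical) Rademacher complexity of the class
```

The norm-based results below then bound the Rademacher complexity in terms of weight norms (products of layer norms, path norms, spectral norms, ...) rather than the number of parameters.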

For valid generalization the size of the weights is more important than the size of the network
The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.

Regularization Networks and Support Vector Machines
Rademacher and Gaussian complexities: Risk bounds and structural results (2002) – (pdf from Oxford course, Rademacher complexity of neural networks)

Norm-based capacity control in neural networks

In search of the real inductive bias: On the role of implicit regularization in deep learning.

Spectrally-normalized margin bounds for neural networks allow for the use of a variety of norms. Further work explores which choices of norm are better! (comments on the need for upper and lower bounds). They use Covering number bounds based on those in Neural Network Learning: Theoretical Foundations by Anthony and Bartlett, 1999, itself based on the fat-shattering analysis in Bartlett 1996.

PAC-bayesian approach to spectrally-normalized margin bounds for neural networks.

Implicit regularization in deep learning

On the Margin Theory of Feedforward Neural Networks

Exploring Generalization in Deep Learning They are very related to sharpness/robustness analysis. They look at how many of the ReLU activations are changed by small changes to the weights, to bound what they call the sharpness, which bounds the generalization error via a PAC-Bayes bound (on parameter space, of course..). Similar for some of the compression-based bounds, although in that case they use a Covering number argument rather than a PAC-Bayes one
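Schematically (up to constants), the sharpness-based argument splits the PAC-Bayes bound for a Gaussian perturbation u around the learned weights w into training loss, an "expected sharpness" term, and the KL (norm) term:

```latex
\mathbb{E}_{u}\big[L_D(f_{w+u})\big]
\;\le\;
\underbrace{\hat{L}_S(f_w)}_{\text{training loss}}
\;+\;
\underbrace{\mathbb{E}_{u}\big[\hat{L}_S(f_{w+u})\big] - \hat{L}_S(f_w)}_{\text{expected sharpness}}
\;+\;
\sqrt{\frac{\mathrm{KL}\big(\mathcal{N}(w,\sigma^2 I)\,\|\,P\big) + \ln\frac{2\sqrt{m}}{\delta}}{2m}} ,
```

so the bound is small only if the training loss is robust to weight perturbations at a noise scale σ for which the KL (weight-norm) term is also small.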

Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks (new margin bounds which work better)

Predicting the Generalization Gap in Deep Networks with Margin Distributions (openreview) (new)

Tighter norm-based bounds: https://arxiv.org/pdf/1902.00800.pdf

Lower bounds? Some here and here, but they are just lower bounds on Rademacher complexity I think, so lower bounds on the worst case over algorithms.

Compression-based generalization bound

See Understanding Machine Learning by Shai and Shai, Chap 30

https://www.offconvex.org/2018/02/17/generalization2/
Stronger generalization bounds for deep nets via a compression approach. See the stability analysis from the "Threshold logic and its applications" book.
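The generic form of these compression bounds is essentially an Occam / finite-hypothesis-class bound applied to the compressed network: if the trained net can be compressed (with the help of the margin γ) to r discretized parameters, each taking at most q values, then with probability at least 1 − δ

```latex
L_D\big(f_{\text{compressed}}\big)
\;\le\;
\hat{L}_\gamma\big(f_{\text{compressed}}\big)
\;+\;
O\!\left(\sqrt{\frac{r \log q + \log(1/\delta)}{m}}\right).
```

Note that, as stated, the bound is for the compressed network rather than the original one; getting a bound for the non-compressed network is what the "Compression based bound for non-compressed network" item below addresses.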

Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach (blog)

Compressibility and Generalization in Large-Scale Deep Learning

Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

Robustness, stability, sensitivity, continuity/smoothness

Robustness relative to changes in inputs (and to the training set), for example as measured by Lipschitz continuity. The bounds are usually not very good, as they suffer from an exponential dependence on dimension (Curse of dimensionality). I think the results are probably proved using Covering numbers, which bound the Rademacher complexity (via Massart's lemma).
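Concretely, the exponential dependence comes from having to cover the input domain: an ε-cover of [0,1]^d in the ℓ∞ metric already needs

```latex
N\big(\epsilon,\,[0,1]^d,\,\ell_\infty\big) \;=\; \Big\lceil \tfrac{1}{2\epsilon} \Big\rceil^{\,d}
```

points, and robustness-style bounds (e.g. Xu & Mannor below) pay for the number K of cells in such a partition through a term of order √(K/m), so the sample size m has to grow exponentially in d unless the data or the hypothesis class has much lower effective dimension.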

Distance-based classification with lipschitz functions

Robustness and generalization – however, the covering number of the input domain, and thus the capacity, can be exponential in the input dimension.

Generalization Error of Invariant Classifiers

Related to arguments for generalization used in Information bottleneck approaches.

Sensitivity and Generalization in Neural Networks: an Empirical Study

VC dimension bounds

Also Fat-shattering dimension (and Natarajan dimension?)

What Size Net Gives Valid Generalization?

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009. – Discrete mathematics of neural networks

Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks

The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.

Neural networks with quadratic VC dimension
Almost linear VC-dimension bounds for piecewise polynomial networks
Neural Nets with Superlinear VC-Dimension

VC Dimension of Neural Networks

Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

bounds for the computational power and learning complexity of analog neural nets


Generalization dynamics (also the counterintuitive interaction between generalization and noise; see discussion here)

High-dimensional dynamics of generalization error in neural networks (see also Statistical mechanics of neural networks): "Using random matrix theory and exact solutions in deep linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and signal-to-noise ratio of the learning problem."

The Dynamics of Learning: A Random Matrix Approach

An analytic theory of generalization dynamics and transfer learning in deep linear networks


Optimization algorithms and regularization

Implicit regularization in deep learning

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Regularization caused by the optimization algorithm: "we investigate the transformations under which the function computed by a network remains the same and therefore argue for complexity measures and optimization algorithms that have similar invariances. We find complexity measures that have similar invariances to neural networks and optimization algorithms that implicitly regularize them". Path-norm: a metric in parameter space that is closer to model distance? Sloppy systems?

Stochastic gradient descent

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization
The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects

Towards Understanding the Generalization Bias of Two Layer Convolutional Linear Classifiers with Gradient Descent

Three Factors Influencing Minima in SGD – Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization

Implicit Bias of Gradient Descent on Linear Convolutional Networks

Towards Understanding Regularization in Batch Normalization

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio
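The heuristic behind the learning-rate-to-batch-size ratio: the minibatch gradient equals the full gradient plus noise with covariance roughly C(θ)/B, so an SGD step can be written as

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta\,\nabla L(\theta_t) \;+\; \frac{\eta}{\sqrt{B}}\,\xi_t,
\qquad \mathrm{Cov}(\xi_t) \approx C(\theta_t),
```

i.e. the injected noise per step has covariance η²C/B; viewing this as an Euler discretization of an SDE with time step η, the effective diffusion ("temperature") scales like η/B, which is the ratio these papers find controlling the width of the minima SGD ends up in.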

Generalization of Linear neural networks

Matrix factorization as a two-layer linear network; increasing the number of hidden units in the linear network gives better generalization.
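A minimal numpy sketch of that setup (mine, not from the papers): matrix completion where the predictor is parameterized as the product UVᵀ of two full-dimensional factors (a two-layer linear network) trained by plain gradient descent from small initialization. Nothing in the loss asks for low rank; the implicit-regularization claim is that this setup nevertheless tends towards a low-rank / low-nuclear-norm solution (results depend on initialization scale and step size).

```python
# Matrix-completion sketch: parameterize the predictor as U @ V.T and run plain
# gradient descent from a small random initialization; no explicit rank penalty.
import numpy as np

rng = np.random.default_rng(0)
d, true_rank, obs_frac, lr, steps = 30, 2, 0.4, 0.02, 20000

# low-rank ground truth and a random observation mask
M = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
mask = rng.random((d, d)) < obs_frac

# over-parameterized factors (full dimension d, small initialization)
U = 1e-3 * rng.normal(size=(d, d))
V = 1e-3 * rng.normal(size=(d, d))

for t in range(steps):
    R = mask * (U @ V.T - M)       # residual on observed entries only
    gU = R @ V                     # dL/dU for L = 0.5 * ||R||_F^2
    gV = R.T @ U                   # dL/dV
    U -= lr * gU
    V -= lr * gV

X = U @ V.T
train_err = np.linalg.norm(mask * (X - M)) / np.linalg.norm(mask * M)
test_err = np.linalg.norm(~mask * (X - M)) / np.linalg.norm(~mask * M)
sv = np.linalg.svd(X, compute_uv=False)
print(f"observed rel. err {train_err:.3f}, unobserved rel. err {test_err:.3f}")
print("top singular values of the learned matrix:", np.round(sv[:5], 2))
```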

High-dimensional dynamics of generalization error in neural networks; Sompolinsky lecture

Implicit Regularization in Matrix Factorization

Implicit Bias of Gradient Descent on Linear Convolutional Networks

"For fully connected networks with single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent. Regardless of the depth of the network, the asymptotic classifier is always the hard margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression in the direct parameterization of β = w. In contrast, next we show that for convolutional networks we get very different biases. Let us first look at a 2–layer linear convolutional network, i.e., a network with single convolutional layer followed by a fully connected final layer."

Hmm, but for matrix factorization they did get differences? Is it only about rank? What about non-linearly-separable data? In any case, changing the way we parametrize can clearly have an effect in general.

"Results for matrix sensing in Gunasekar et al. [8] imply that for two layer fully connected networks with multiple outputs, the implicit bias is to a maximum margin solution with respect to the nuclear norm kβk?. This is already different from the implicit bias of a one-layer “network” (i.e. optimizing β directly), which would be in terms of the Frobenius norm kβkF . We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor β ∈ R C×D."

"it might be beneficial to continue optimizing even after the loss value L(β^(t)) itself becomes negligible."

"We can decompose the characterization of implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w?, and (b) what does this imply in term of the linear predictor β = P(w), i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?"

Generalization Error of Linear Neural Networks in Unidentifiable Cases. Only looks at one-hidden layer linear networks I think, and finds some cases in which overparametrization is bad?

Generalization Error of Linear Neural Networks in an Empirical Bayes Approach

Generalization in Deep Learning – "we provide a theorem (Theorem 1) stating that the hypothesis space of over-parameterized linear models can memorize any training data and decrease the training and test errors arbitrarily close to zero (including zero) with the norm of parameters being arbitrarily large, even when the parameters are arbitrarily far from the ground-truth parameters"
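A small illustration of that memorization claim (mine, not the paper's experiment): with many more parameters than samples, a linear model interpolates arbitrary (even random) labels, e.g. via the minimum-norm solution, and adding any null-space component keeps the training fit while making the parameter norm arbitrarily large.

```python
# Over-parameterized linear model (d >> n) memorizing random labels exactly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                          # many more parameters than samples
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)     # arbitrary / random labels

w_min = np.linalg.pinv(X) @ y           # minimum-norm interpolating solution
print("max train residual:", np.max(np.abs(X @ w_min - y)))   # numerically ~0

# Add any null-space component: training predictions are unchanged,
# but the parameter norm can be made arbitrarily large.
_, _, Vt = np.linalg.svd(X)
w_big = w_min + 1e6 * Vt[-1]            # Vt[-1] lies in the null space of X
print("still interpolates:", np.max(np.abs(X @ w_big - y)) < 1e-6,
      " norm:", np.linalg.norm(w_big))
```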

Generalization in Deep reinforcement learning

A Study on Overfitting in Deep Reinforcement Learning

Assessing Generalization in Deep Reinforcement Learning

On Inductive Biases in Deep Reinforcement Learning


The role of over-parametrization in generalization of neural networks


Generalization in deep learning (Bengio et al) – overall I didn't find it that interesting, except for a Rademacher complexity bound, which doesn't let them say much either, but oh well. Validation just tells us that we generalize; it doesn't tell us why. Just like PAC bounds don't tell us why the target function lies within our hypothesis class, for example.

A Modern Take on the Bias-Variance Tradeoff in Neural Networks

On Generalization and Regularization in Deep Learning

Unreasonable Effectiveness of Learning Neural Nets: Accessible States and Robust Ensembles

Data-Dependent Stability of Stochastic Gradient Descent

Train faster, generalize better: Stability of stochastic gradient descent vs. Train longer, generalize better: closing the generalization gap in large batch training of neural networks (lol)

What size network is good for generalization of a specific task of interest – "We show that for some tasks increasing network size leads to worse generalization. This is not surprising. The striking feature is that there exist other tasks for which increasing network size improves generalization. We give an explanation of this phenomenon in terms of the information entropy." I think what this paper is missing is the concept of universal complexity measures. You can see that tasks of "medium complexity" are the hardest to learn because their measure of complexity isn't very good. Even just entropy would be better (as highest entropy corresponds to medium complexity in their case).

related paper, pdf

Do Deep Nets Really Need to be Deep?

On the Depth of Deep Neural Networks: A Theoretical View

Diagnosing Convolutional Neural Networks using their Spectral Response – they observe that the best models are also the most sensitive to perturbations of their input?

How Many Samples are Needed to Learn a Convolutional Neural Network?

On Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics

PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments

Rademacher Complexity for Adversarially Robust Generalization

Learning Invariances using the Marginal Likelihood

To understand deep learning we need to understand kernel learning


Analytical learning theory


Franco's Generalization complexity

Generalization ability of Boolean functions implemented in feedforward neural networks


From Understanding deep learning requires rethinking generalization

"Moreover, we did not observe any overfitting: the generalization error does not degrade by reaching zero training error, or by using larger networks."

"what we really want is to minimize the variance of the net functions induced by weights near the actual weight vector."


Flat minima are good for helping fight catastrophic forgetting.


Simplicity bias in neural networks

See this gingko tree and this overleaf document (from my short project in summer 2017 with Ard Louis). See emails with Ard.