Exploring generalization in deep learning – Generalization in Deep Learning – Current state of generalization theory for deep learning (2018) – slides on myths about generalization in deep learning
Generalization Error in Deep Learning
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes – A Closer Look at Memorization in Deep Networks
See gingkoapp tree: Kolmogorov complexity and generalization in deep neural networks
–> Even for kernel methods, classical learning theory doesn't predict what we observe (see here). Overparametrization in deep learning
Understanding deep learning requires rethinking generalization
Deep learning generalizes because the parameter-function map is biased towards simple functions
On the Spectral Bias of Neural Networks – lower-frequency functions (or components) are more robust! (as Wu et al. also find). See A Fine-Grained Spectral Perspective on Neural Networks for a more theoretically grounded look at the spectral simplicity bias of neural nets.
Understanding Generalization through Visualizations
Fisher-Rao Metric, Geometry, and Complexity of Neural Networks – A jamming transition from under- to over-parametrization affects loss landscape and generalization – Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
An analytic theory of generalization dynamics and transfer learning in deep linear networks
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Recently people have also been using the neural tangent kernel (paper) to understand generalization. See papers here: https://github.com/damaru2/ntk/blob/master/readme.md
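For concreteness, a minimal sketch of the empirical NTK (my own toy example, not taken from any of these papers): for a one-hidden-layer tanh network f(x) = v^T tanh(W x), the kernel is the Gram matrix of parameter gradients, Theta(x, x') = <grad_theta f(x), grad_theta f(x')>.

```python
# Empirical NTK of a tiny one-hidden-layer tanh network (toy sizes, my choice).
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 200, 8                 # input dim, hidden width, number of inputs
W = rng.normal(size=(h, d)) / np.sqrt(d)
v = rng.normal(size=h) / np.sqrt(h)
X = rng.normal(size=(n, d))

def param_gradient(x):
    """Gradient of f(x) = v.tanh(Wx) with respect to (W, v), flattened."""
    act = np.tanh(W @ x)                             # hidden activations
    dv = act                                         # df/dv_j = tanh(w_j . x)
    dW = (v * (1 - act**2))[:, None] * x[None, :]    # df/dW_jk = v_j (1 - tanh^2) x_k
    return np.concatenate([dW.ravel(), dv])

grads = np.stack([param_gradient(x) for x in X])     # (n, num_params)
ntk = grads @ grads.T                                # empirical NTK Gram matrix
print(np.round(ntk, 2))
```

In the infinite-width limit this kernel concentrates and stays approximately fixed during training, which is what lets these papers carry kernel-style generalization analyses over to wide networks.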
On large-batch training for deep learning: Generalization gap and sharp minima.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data – Sharpness + norm bounds (an alternative to margin + norm bounds, but similar, since margin is analogous to sharpness; robustness of the training loss to changes in the weights; see Exploring generalization in deep learning)
Sharp Minima Can Generalize For Deep Nets – maybe it's not sharp, but frequent! (Arrival of the frequent)
Entropy-SGD – Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors
Three Factors Influencing Minima in SGD – Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization
Deep learning generalizes because the parameter-function map is biased towards simple functions :)
(video) Karolina Dziugaite on Nonvacuous Generalization Bounds for Deep Neural Networks via PAC-Bayes – Data-dependent PAC-Bayes priors via differential privacy – Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors
PAC-bayesian approach to spectrally-normalized margin bounds for neural networks. – Exploring Generalization in Deep Learning
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience
See Understanding Machine Learning by Shai and Shai for the SVM bounds; they are all based on structural risk minimization and on bounding the Rademacher complexity by bounding norms of the parameters (sometimes referred to as a scale-sensitive bound/analysis)
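As a reminder of what "scale-sensitive" means here, the prototypical bound from that book (stated from memory, so up to constants): for linear predictors $\mathcal{H} = \{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$ and a sample $S = (x_1,\dots,x_m)$ with $\|x_i\|_2 \le R$,

$$\widehat{\mathfrak{R}}_S(\mathcal{H}) \;\le\; \frac{B R}{\sqrt{m}},$$

independently of the input dimension; composed with a Lipschitz (margin) loss this gives the usual $O(BR/\sqrt{m})$ generalization gap for SVMs. The neural-network bounds below replace $B$ by products and sums of layer norms, which is why the number of parameters never appears.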
For valid generalization the size of the weights is more important than the size of the network – The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.
Regularization Networks and Support Vector Machines – Rademacher and Gaussian complexities: Risk bounds and structural results (2002) – Rademacher complexity of neural networks (pdf from an Oxford course)
Norm-based capacity control in neural networks
In search of the real inductive bias: On the role of implicit regularization in deep learning.
Spectrally-normalized margin bounds for neural networks allow for the use of a variety of norms. Further work explores which choices of norm are better! (comments on the need for upper and lower bounds) They use covering-number bounds based on those in Neural Network Learning: Theoretical Foundations by Anthony and Bartlett, 1999, itself based on the fat-shattering analysis in Bartlett 1996
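From memory, dropping constants, log factors and the "reference matrices": for an $L$-layer network with weight matrices $A_1,\dots,A_L$, inputs bounded by $\|x\|_2 \le B$, and margin $\gamma$ on the training set, the Bartlett–Foster–Telgarsky bound has the shape

$$\text{test error} \;\lesssim\; \widehat{L}_\gamma \;+\; \frac{B}{\gamma\sqrt{n}}\Big(\prod_{i=1}^{L}\|A_i\|_\sigma\Big)\Big(\sum_{i=1}^{L}\frac{\|A_i^\top\|_{2,1}^{2/3}}{\|A_i\|_\sigma^{2/3}}\Big)^{3/2},$$

where $\widehat{L}_\gamma$ is the margin training error: a product of spectral norms times a sum of $(2,1)$-to-spectral norm ratios. The PAC-Bayesian version below has the same flavour with Frobenius norms in place of the $(2,1)$-norms.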
PAC-bayesian approach to spectrally-normalized margin bounds for neural networks.
Implicit regularization in deep learning
On the Margin Theory of Feedforward Neural Networks
Exploring Generalization in Deep Learning – these bounds are very related to sharpness/robustness analysis. They look at how many of the ReLU activations are changed by small changes to the weights, to bound what they call the sharpness, which in turn bounds the generalization error via a PAC-Bayes bound (on parameter space, of course...). Similar for some of the compression-based bounds, although those use a covering-number argument rather than a PAC-Bayes one
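For reference, the PAC-Bayes template these sharpness arguments instantiate (one standard form; I may be off on the exact constants and log terms): for any prior $P$ fixed before seeing the data, with probability at least $1-\delta$ over an i.i.d. sample of size $n$, simultaneously for all posteriors $Q$,

$$L(Q) \;\le\; \widehat{L}(Q) + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.$$

Taking $Q$ to be the trained weights plus Gaussian noise makes $\widehat{L}(Q)$ roughly "training loss + sharpness", while for a Gaussian prior the KL term becomes a weight-norm term, which is exactly how the sharpness-plus-norm bounds above come out.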
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks (new margin bounds which work better)
Predicting the Generalization Gap in Deep Networks with Margin Distributions – openreview (new)
Tighter norm-based bounds: https://arxiv.org/pdf/1902.00800.pdf
Lower bounds? Some here and here, but I think they are just lower bounds on the Rademacher complexity, i.e. lower bounds on the worst case over algorithms.
See Understanding Machine Learning by Shai and Shai, Chap 30
https://www.offconvex.org/2018/02/17/generalization2/ – Stronger generalization bounds for deep nets via a compression approach. See the stability analysis from the book "Threshold logic and its applications"
Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach – blog
Compressibility and Generalization in Large-Scale Deep Learning
Robustness relative to changes in the inputs (and to the training set), for example as measured by Lipschitz continuity. The resulting bounds are usually not very good, as they suffer from an exponential dependence on the dimension (curse of dimensionality). I think the results are probably proved using covering numbers, which bound the Rademacher complexity (via Massart's lemma); the covering-number calculation behind this is sketched below.
Distance-based classification with lipschitz functions
Robustness and generalization – however, the covering number of the input domain, and thus the capacity, can be exponential in the input dimension
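Where the exponential dependence comes from (standard facts, stated from memory up to constants): covering the input domain $[0,1]^d$ at scale $\varepsilon$ already takes on the order of $(1/\varepsilon)^d$ balls, and the metric entropy of the class of $1$-Lipschitz functions on $[0,1]^d$ scales the same way,

$$\log N\big(\varepsilon,\ \mathrm{Lip}_1([0,1]^d),\ \|\cdot\|_\infty\big) \;\asymp\; (1/\varepsilon)^d,$$

so any argument that only uses Lipschitzness/robustness of the learned function, whether through Xu–Mannor-style robustness or through Dudley/Massart bounds on the Rademacher complexity, pays covering numbers (and hence sample complexities) exponential in the input dimension.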
Generalization Error of Invariant Classifiers
Related to the arguments for generalization used in information bottleneck approaches.
Sensitivity and Generalization in Neural Networks: an Empirical Study
Also Fat-shattering dimension (and Natarajan dimension?)
What Size Net Gives Valid Generalization?
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009. – Discrete Mathematics of Neural Networks
Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks
neural networks with quadratic vc dimension – almost linear vc-dimension bounds for piecewise polynomial networks – Neural Nets with Superlinear VC-Dimension
VC Dimension of Neural Networks
Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks
bounds for the computational power and learning complexity of analog neural nets
High-dimensional dynamics of generalization error in neural networks (see Statistical mechanics of neural networks also) – "Using random matrix theory and exact solutions in deep linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and signal to noise ratio of the learning problem." (a toy simulation in this spirit is sketched below, after this group of papers)
The Dynamics of Learning: A Random Matrix Approach
An analytic theory of generalization dynamics and transfer learning in deep linear networks
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
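A toy simulation in the spirit of the dynamics papers above (my own minimal setup, with invented sizes and learning rate; not the exact model of any of these papers): a two-layer linear network trained by full-batch gradient descent on noisy data from a linear teacher. The training error decreases monotonically, while the test error typically falls and then rises again as the weakly determined, noise-dominated directions of the data are fit last.

```python
# Toy student-teacher setup (assumed sizes/learning rate, not from the papers).
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 45, 0.5                    # samples, input dim, label-noise std
w_star = rng.normal(size=(d, 1)) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=(n, 1))
X_te = rng.normal(size=(2000, d))
y_te = X_te @ w_star + sigma * rng.normal(size=(2000, 1))

W1 = rng.normal(size=(d, d)) / np.sqrt(d)    # student: f(x) = x W1 W2 (deep *linear*)
W2 = rng.normal(size=(d, 1)) / np.sqrt(d)
lr = 0.05
for t in range(1, 30001):
    r = X @ W1 @ W2 - y                      # training residuals
    g1 = X.T @ r @ W2.T / n                  # dL/dW1 for L = (1/2n)||X W1 W2 - y||^2
    g2 = W1.T @ X.T @ r / n                  # dL/dW2
    W1, W2 = W1 - lr * g1, W2 - lr * g2
    if t in (30, 100, 300, 1000, 3000, 10000, 30000):
        train = float(np.mean(r ** 2))
        test = float(np.mean((X_te @ W1 @ W2 - y_te) ** 2))
        print(f"step {t:6d}   train MSE {train:.3f}   test MSE {test:.3f}")
```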
Regularization caused by the optimization algorithm. "we investigate the transformations under which the function computed by a network remains the same and therefore argue for complexity measures and optimization algorithms that have similar invariances. We find complexity measures that have similar invariances to neural networks and optimization algorithms that implicitly regularize them". Is the path-norm a metric in parameter space that is closer to a distance between models? Sloppy systems?
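For reference, the path norm referred to here (as defined in the Path-SGD / norm-based capacity control line of work, from memory) is, for a ReLU network with weights $w$,

$$\|w\|_{\text{path}} \;=\; \Big(\sum_{\text{paths } p:\ \text{input}\to\text{output}}\ \prod_{e \in p} w_e^2\Big)^{1/2},$$

a sum over all input-to-output paths of the product of squared weights along the path. It is invariant to the node-wise rescalings (multiply the incoming weights of a ReLU unit by $c$, divide its outgoing weights by $c$) that leave the computed function unchanged, which is exactly the kind of invariance the quote asks complexity measures to share with the network.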
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks – Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations – Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization – The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects
Three Factors Influencing Minima in SGD – Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization
Implicit Bias of Gradient Descent on Linear Convolutional Networks
Towards Understanding Regularization in Batch Normalization
Matrix factorization as a two-layer linear network: increasing the number of hidden units in the linear network gives better generalization
High-dimensional dynamics of generalization error in neural networks – Sompolinsky lecture
Implicit Regularization in Matrix Factorization
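A toy sketch in the spirit of the two matrix-factorization items above (my own minimal example, with made-up sizes; not the exact setting of either paper): gradient descent from a small initialization on a full-width factorization completes a rank-1 matrix from a subset of its entries. Even though the factors have enough parameters to fit the observed entries in many ways, GD lands on an approximately low-rank (low nuclear norm) solution that also gets the unobserved entries roughly right.

```python
# Matrix completion by gradient descent on X = U V^T with small init
# (toy example; sizes, observation density and learning rate are my own choices).
import numpy as np

rng = np.random.default_rng(0)
n = 5
a, b = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
M = a @ b.T                                  # ground-truth rank-1 matrix
mask = rng.random((n, n)) < 0.7              # ~70% of entries observed

U = 0.01 * rng.normal(size=(n, n))           # full-width factors,
V = 0.01 * rng.normal(size=(n, n))           # small initialization
lr = 0.05
for _ in range(20000):
    R = (U @ V.T - M) * mask                 # residual on observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

X = U @ V.T
print("max error on observed entries  :", float(np.abs((X - M) * mask).max()))
print("max error on unobserved entries:", float(np.abs((X - M) * ~mask).max()))
print("singular values of the solution:", np.round(np.linalg.svd(X, compute_uv=False), 3))
```

With a large initialization (or by parameterizing X directly), GD has no particular reason to prefer the low-rank completion, which is roughly the contrast Gunasekar et al. emphasize.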
Implicit Bias of Gradient Descent on Linear Convolutional Networks
"For fully connected networks with single output, Theorem 1 shows that there is no effect of depth on the implicit bias of gradient descent. Regardless of the depth of the network, the asymptotic classifier is always the hard margin support vector machine classifier, which is also the limit direction of gradient descent for linear logistic regression in the direct parameterization of β = w. In contrast, next we show that for convolutional networks we get very different biases. Let us first look at a 2–layer linear convolutional network, i.e., a network with single convolutional layer followed by a fully connected final layer."
Hmm, but for matrix factorization they did get differences? Is it only about rank? What about non-linearly-separable data? In any case, changing the way we parametrize can clearly have an effect in general
"Results for matrix sensing in Gunasekar et al. [8] imply that for two layer fully connected networks with multiple outputs, the implicit bias is to a maximum margin solution with respect to the nuclear norm kβk?. This is already different from the implicit bias of a one-layer “network” (i.e. optimizing β directly), which would be in terms of the Frobenius norm kβkF . We suspect that with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor β ∈ R C×D."
"it might be beneficial to continue optimizing even after the loss value L(β^{(t)}) itself becomes negligible."
"We can decompose the characterization of implicit bias of gradient descent on a parameterization P(w) into two parts: (a) what is the implicit bias of gradient descent in the space of parameters w?, and (b) what does this imply in term of the linear predictor β = P(w), i.e., how does the bias in parameter space translate to the linear predictor learned from the model class?"
Generalization Error of Linear Neural Networks in Unidentifiable Cases. I think it only looks at one-hidden-layer linear networks, and finds some cases in which overparametrization is bad?
Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
Generalization in Deep Learning – "we provide a theorem (Theorem 1) stating that the hypothesis space of over-parameterized linear models can memorize any training data and decrease the training and test errors arbitrarily close to zero (including zero) with the norm of parameters being arbitrarily large, even when the parameters are arbitrarily far from the ground-truth parameters"
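A quick toy version of the kind of statement in that theorem (my own construction, for illustration only): if the inputs only ever span a low-dimensional subspace of a high-dimensional parameterization, then there are parameters with arbitrarily large norm, arbitrarily far from the ground truth, that still achieve (near-)zero training and test error, so neither the parameter norm nor the distance to the ground truth can by itself explain generalization.

```python
# Large-norm, far-from-truth parameters with zero train AND test error
# (toy construction; dimensions are arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 100, 5, 50                  # ambient dim, true input-subspace dim, samples
B = rng.normal(size=(d, p))           # inputs only ever live in the row space of B
w_true = rng.normal(size=p)

def sample_inputs(m):
    return rng.normal(size=(m, d)) @ B

X_tr, X_te = sample_inputs(n), sample_inputs(n)
y_tr, y_te = X_tr @ w_true, X_te @ w_true

# Any direction orthogonal to the row space of B is invisible to the data.
q = np.linalg.svd(B)[2][-1]           # a unit vector orthogonal to all inputs
w_far = w_true + 1e4 * q              # huge norm, far from the ground truth

for name, w in [("ground truth  ", w_true), ("far/large-norm", w_far)]:
    print(name,
          " norm:", round(float(np.linalg.norm(w)), 1),
          " train err:", round(float(np.abs(X_tr @ w - y_tr).max()), 6),
          " test err:", round(float(np.abs(X_te @ w - y_te).max()), 6))
```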
A Study on Overfitting in Deep Reinforcement Learning
Assessing Generalization in Deep Reinforcement Learning
On Inductive Biases in Deep Reinforcement Learning
The role of over-parametrization in generalization of neural networks
Generalization in deep learning (Bengio et al) – in general I didn't find it that interesting, except for a Rademacher complexity bound, which doesn't let them say much either, but oh well. Validation just tells us that we generalize; it doesn't tell us why. Just like PAC bounds don't tell us why the target function lies within our hypothesis class, for example.
A Modern Take on the Bias-Variance Tradeoff in Neural Networks
On Generalization and Regularization in Deep Learning
Unreasonable Effectiveness of Learning Neural Nets: Accessible States and Robust Ensembles
Data-Dependent Stability of Stochastic Gradient Descent
Train faster, generalize better: Stability of stochastic gradient descent vs. Train longer, generalize better: closing the generalization gap in large batch training of neural networks (lol)
What size network is good for generalization of a specific task of interest – We show that for some tasks increasing network size leads to worse generalization. This is not surprising. The striking feature is that there exist other tasks for which increasing network size improves generalization. We give an explanation of this phenomenon in terms of the information entropy. I think what this paper is missing is the concept of universal complexity measures. You can see that tasks of “medium complexity” are the hardest to learn, because their measure of complexity isn’t very good. Even just entropy would be better (as highest entropy corresponds to medium complexity in their case)
Do Deep Nets Really Need to be Deep?
On the Depth of Deep Neural Networks: A Theoretical View
Diagnosing Convolutional Neural Networks using their Spectral Response – they observe that the best models are also the most sensitive to perturbations of their input... ??
How Many Samples are Needed to Learn a Convolutional Neural Network?
On Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics
PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments
Rademacher Complexity for Adversarially Robust Generalization
Learning Invariances using the Marginal Likelihood
To understand deep learning we need to understand kernel learning
Generalization ability of Boolean functions implemented in feedforward neural networks
From Understanding deep learning requires rethinking generalization
"Moreover, we did not observe any overfitting: the generalization error does not degrade by reaching zero training error, or by using larger networks."
"what we really want is to minimize the variance of the net functions induced by weights near the actual weight vector."
Flat minima are good for helping fight catastrophic forgetting.
See this gingko tree and this overleaf document (from my short project in summer 2017 with Ard Louis). See emails with Ard.