Deep learning theory

cosmos 30th November 2018 at 2:16am
Deep learning Learning theory

Neural network theory, Learning theory

Better understanding the shortcomings and potential of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust [2].

Nice lecture videos:

ICML 2018 tutorial

A Theoretical Framework for Deep Learning Networks

The mathematics of deep learning

Learning theory and neural networks gingko tree

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Understanding deep learning requires rethinking generalization

Why and when can deep - but not shallow - networks avoid the curse of dimensionality: a review, see also Expressivity of neural networks

Functions with a compositional structure can be approximated to the same degree of accuracy by both deep and shallow networks, but the deep networks need far fewer parameters than shallow networks with equivalent approximation accuracy.
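A rough way to see the size of the gap, using the orders of magnitude as I recall them from the review (constants dropped, so treat this as a sketch): a shallow net needs on the order of eps^(-n/m) units for an n-variable function of smoothness m, while a deep net that mirrors a binary-tree compositional structure needs about (n-1) * eps^(-2/m). The function names are made up for illustration.

```python
def shallow_units(n_inputs: int, smoothness: int, eps: float) -> float:
    """Order-of-magnitude unit count for a shallow net: eps ** (-n / m)."""
    return eps ** (-n_inputs / smoothness)


def deep_units(n_inputs: int, smoothness: int, eps: float) -> float:
    """Same for a binary-tree deep net: (n - 1) * eps ** (-2 / m)."""
    return (n_inputs - 1) * eps ** (-2.0 / smoothness)


# Compare the two scalings at fixed accuracy and smoothness.
for n in (2, 4, 8, 16):
    print(n, shallow_units(n, 2, 0.1), deep_units(n, 2, 0.1))
```

The shallow count grows exponentially in the input dimension n, while the deep count grows only linearly, which is the "avoiding the curse of dimensionality" claim in a nutshell.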

Lectures from Beg Rohu 2018 summer school

Generalization in deep learning

Deep learning generalizes because the parameter-function map is biased towards simple functions

Expressibility of neural networks

Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations

Exponential expressivity in deep neural networks through transient chaos


Bounds for the computational power and learning complexity of analog neural nets

On the Depth of Deep Neural Networks: A Theoretical View

Optimization for training deep models

Loss surface of neural networks

Information propagation in deep networks

Information bottleneck in deep learning

See gingkoapp tree

Information Theory of Deep Learning. Naftali Tishby

Deep Learning and the Information Bottleneck Principle

Opening the Black Box of Deep Neural Networks via Information

See more at Information bottleneck

Deep Information Propagation

Exponential expressivity in deep neural networks through transient chaos

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Gradients problems in deep learning

Vanishing gradients problem

Exploding gradients problem

Residual neural networks – Highway networks

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

On the Expressive Power of Deep Neural Networks

Optimization dynamics, generalization

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Backprop minimizes the free energy

Geometry, Optimization and Generalization in Multilayer Networks. He talks about the capacity of the network in a similar way to the theory of neural nets (see this video). I think it's the same as VC dimension actually. It's more or less proportional to the number of edges, apparently.
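If capacity really does scale roughly with the number of edges, then a first-pass capacity estimate is just a weight count. A minimal sketch for a fully connected feedforward net (biases ignored, and log factors or constants in the actual bound not modelled; `count_edges` is a hypothetical helper):

```python
def count_edges(layer_widths):
    """Number of weights (edges) in a fully connected feedforward net, biases ignored."""
    return sum(a * b for a, b in zip(layer_widths[:-1], layer_widths[1:]))


# Example: a 784-512-512-10 multilayer perceptron.
widths = [784, 512, 512, 10]
print(count_edges(widths))
```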

Time-bounded functions can be represented cheaply. Feedforward neural nets are the ultimate learning machines, but learning is hard.

The fact that we can learn so effectively is quite mysterious

Optimization and capacity control, which avoids Overfitting!

Relation to Matrix factorization. How good or bad the inductive bias is seems to be what drives the results. In real use cases of matrix factorization, matrices seem to have low trace norm more often than they have low rank. Given the right inductive bias, we can learn efficiently and generalize well.

A good inductive bias for matrix factorization seems to be trace norm rather than rank. If trace norm is what gives us the inductive bias, then decreasing the rank (fewer neurons) often doesn't actually help with generalization, since increasing the rank/number of hidden units can allow for lower trace norms.

We need to find the right measure of Complexity; for NNs, "path norm" seems best among those they tried.
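To make the rank vs trace norm distinction concrete, here is a small NumPy sketch (my own toy example, not from the talk). It checks two things: a full-rank matrix can have a much smaller trace norm than a rank-1 matrix, and the trace norm equals the cost 0.5 * (||U||_F^2 + ||V||_F^2) of the balanced factorization M = U V^T, which is why a norm penalty on the factors acts like a trace-norm penalty regardless of how many hidden units there are.

```python
import numpy as np


def trace_norm(M):
    """Trace (nuclear) norm: the sum of the singular values."""
    return np.linalg.svd(M, compute_uv=False).sum()


def balanced_factors(M):
    """Factor M = U @ V.T with the singular values split evenly between U and V."""
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s)
    return P * root, Qt.T * root


# Rank 1 but large trace norm, versus rank 20 but small trace norm.
rank_one = np.zeros((20, 20))
rank_one[0, 0] = 10.0
full_rank = 0.1 * np.eye(20)

for name, M in (("rank_one", rank_one), ("full_rank", full_rank)):
    U, V = balanced_factors(M)
    factor_cost = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
    print(name, np.linalg.matrix_rank(M), trace_norm(M), factor_cost)
```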
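For reference, a minimal sketch of the l2 path norm as I understand it: the square root of the sum, over all input-to-output paths, of the product of squared weights along the path. For a layered net this can be computed by pushing a vector of ones through the elementwise-squared weight matrices. The shapes and the `path_norm` helper are my own choices for illustration.

```python
import numpy as np


def path_norm(weights):
    """l2 path norm of a layered net; each W has shape (fan_out, fan_in)."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (W ** 2) @ v  # accumulate squared path products layer by layer
    return np.sqrt(v.sum())


rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))  # input (3) -> hidden (5)
W2 = rng.normal(size=(2, 5))  # hidden (5) -> output (2)

original = path_norm([W1, W2])
# Scaling one layer up and the next down leaves a ReLU net's function unchanged.
rescaled = path_norm([3.0 * W1, W2 / 3.0])
print(original, rescaled)
```

One nice property visible here is that the path norm does not change under the layer rescaling, unlike a plain sum of squared weights.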

Real vs random labels, behaviour is as expected with random data

But we didn't tell the network to minimize the path norm (complexity). So where is the regularization coming from? He thinks it's the Optimization algorithm that is biasing us towards simple global optima (is there work on this for convex optimization?), but couldn't it be a GP map-like Simplicity bias? He approaches it from Geometry. I think he is right that the algorithm plays a role in the biasing. But GP bias probably does too, or it could at least be exploited (as in Neuroevolution with indirect encodings).

In any case, the experiments where more hidden units give better generalization mean that more hidden units allow the complexity measure to be driven lower (so rank isn't the right complexity measure. See here).
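The simplest convex illustration I know of an optimizer supplying the regularization: in underdetermined least squares there are infinitely many zero-loss solutions, yet plain gradient descent started from zero converges to the minimum L2-norm one, even though the norm never appears in the objective. This is a toy linear example, not the experiment from the talk, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 50  # more unknowns than equations
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)


def gradient_descent(X, y, steps=5000):
    """Minimize 0.5 * ||X w - y||^2 from w = 0 with a fixed step size."""
    w = np.zeros(X.shape[1])
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


w_gd = gradient_descent(X, y)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution

# Another interpolating solution: add a component from the null space of X.
z = rng.normal(size=n_features)
w_other = w_min_norm + (z - np.linalg.pinv(X) @ (X @ z))

print(np.linalg.norm(w_gd - w_min_norm))
print(np.linalg.norm(w_gd), np.linalg.norm(w_other))
```

Both `w_gd` and `w_other` fit the data exactly, but gradient descent picks the smaller-norm one because its iterates never leave the row space of `X`.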

Mollifying Networks and smoothening landscapes

Deep learning, Spin glasses, Protein folding, Loss surfaces

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

why deep learning works -- spin glasses, protein folding, etc.

–>Why Deep Learning Works

Improving RBMs with physical chemistry

Energy landscapes of deep networks

A Correspondence Between Random Neural Networks and Statistical Field Theory

The primary argument about the funnel is that these learning systems are strongly correlated, and therefore not readily treated by mean field theory. Specifically, the classical idea was that strongly correlated random models lie in a different universality class. So they behave completely differently than a spin glass, and this gives rise to the convexity.

See discussion in comments here and here. Comments on why deep learning works: perspectives from theoretical chemistry

The Loss Surfaces of Multilayer Networks – loss surfaces paper. Uncorrelated inputs, and outputs?: Yeah, basically everything is random, so they can't show dependence of behavior on properties of the function being learned. They focus on random functions really.

See Statistical mechanics of neural networks

–>Theoretical neuroscience and deep learning theory

Renormalization group and Deep learning

Why Deep Learning Works II: the Renormalization Group

An exact mapping between the Variational Renormalization Group and Deep Learning


Supervised Learning with Quantum-Inspired Tensor Networks

Deep learning and the renormalization group

See also comments on Why does deep and cheap learning work so well?

Renormalization group can be seen through the lens of compression/Information bottleneck

Simplicity and deep learning

See also Simplicity and learning

Why does deep and cheap learning work so well?

Neuroscience and deep learning

–>Theoretical neuroscience and deep learning theory

–> Francois Chollet. Ablation studies are a good idea when we don't have much theory. For world models, apparently LSTMs even with random weights work well <<–

Integrating symbols into deep learning

More resources

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks

Deep Belief Networks Are Compact Universal Approximators

Scaling learning algorithms towards AI

Hierarchical model of natural images and the origin of scale invariance

Why does Deep Learning work?, Spin glass, Why Deep Learning Works III: a preview

Lagrangian Relaxation for MAP Estimation in Graphical Models

Models of object recognition

Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, video

Learning in Layered Neural Networks

Yann LeCun: "Deep Learning, Graphical Models, Energy-Based Models, Structured Prediction, Pt. 1"

Why does Deep Learning work? - A perspective from Group Theory

Connections to Signal processing:

As someone who is just entering the field of deep learning theory (coming from physics), I find this discussion very interesting. I commend your approach of respecting the truth about the universe. Basically, Feynman's principle of science "If it disagrees with experiment it is wrong. In that simple statement is the key to science. It does not make any difference how beautiful your guess is. It does not make any difference how smart you are, who made the guess, or what his name is – if it disagrees with experiment it is wrong. That is all there is to it."

When one goes deep into theory it's easy to lose track of that empirical grounding. Of course it's perfectly fine to do pure maths. But I do find many theorists who actually wanted to approach some real-world problem (like why deep learning works, or whatever), and ended up just proving a bunch of theorems which don't really say much about the real thing. I have to keep reminding myself to stay in touch with real applied ML, because I am interested in theory about the real thing!

In any case, I do like a lot one thing that Ali says. He puts emphasis on pedagogy, and the value of simple theorems and experiments for that. At the end of the day, the objective of a "proof" (even mathematicians say so) is to convince someone else of something. It's all about communication. Simple proofs, simple experiments, simple explanations are not very useful for engineering in a vacuum. But put those things amongst people, and they foster collaboration, understanding, etc. To me that is the biggest value of theory :)

Do Deep Nets Really Need to be Deep?

see my annotated version in Kami

The introduction is nice. It basically says "yes, in practice deep nets tend to work better than shallow ones, but why?":

You are given a training set with 1M labeled points. When you train a shallow neural net with one fully connected feed-forward hidden layer on this data you obtain 86% accuracy on test data. When you train a deeper neural net as in [1] consisting of a convolutional layer, pooling layer, and three fully connected feed-forward layers on the same data you obtain 91% accuracy on the same test set.

What is the source of this improvement? Is the 5% increase in accuracy of the deep net over the shallow net because: a) the deep net has more parameters; b) the deep net can learn more complex functions given the same number of parameters; c) the deep net has better inductive bias and thus learns more interesting/useful functions (e.g., because the deep net is deeper it learns hierarchical representations [5]); d) nets without convolution can’t easily learn what nets with convolution can learn; e) current learning algorithms and regularization methods work better with deep architectures than shallow architectures[8]; f) all or some of the above; g) none of the above?

They basically show that a) and b) are not the case, at least not often. They are basically arguing for reason e) in this paper, by showing that shallow nets can implement the same functions that deep nets learn (at least in the series of experiments and tasks they try); they just don't find them when trained with the algorithms we have.
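As far as I remember, the way they get a shallow net to implement the deep net's function is "mimic" training: the shallow student regresses on the deep teacher's logits (pre-softmax outputs) with a squared error, instead of training on the original labels. A minimal sketch of such a loss, with the function name and array shapes being my own assumptions:

```python
import numpy as np


def mimic_loss(student_logits, teacher_logits):
    """Mean over examples of half the squared distance between logit vectors.

    Both arrays have shape (n_examples, n_classes).
    """
    diff = student_logits - teacher_logits
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))


# Toy check: the student matches the teacher on the first example only.
teacher = np.array([[1.0, 2.0], [3.0, 4.0]])
student = np.array([[1.0, 2.0], [3.0, 6.0]])
print(mimic_loss(student, teacher))
```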

If deep nets really do have more simplicity bias (still need to run more experiments on that!), then it could be a reason for them working better in this way.