Better understanding the shortcomings and potential of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust.
Functions with a compositional structure can be approximated to the same degree of accuracy by deep and shallow networks, but the number of parameters is much smaller for the deep network than for a shallow network with equivalent approximation accuracy.
See gingkoapp tree
See more at Information bottleneck
Geometry, Optimization and Generalization in Multilayer Networks. He talks about the capacity of the network in a similar way to the theory of neural nets (see this video). I think it's the same as VC dimension actually. It's more or less proportional to the number of edges, apparently.
|Optimization and capacity control, which avoids Overfitting!|
Relation to Matrix factorization. Better or worse inductive biases seem to be what drives the differences in generalization. For matrix factorization in real use cases, it seems that matrices tend to have low trace norm more than they have low rank. Given the right inductive bias, we can learn efficiently and generalize well.
A good inductive bias for matrix factorization seems to be trace norm rather than rank. If trace norm is what gives us the inductive bias, then decreasing the rank (fewer neurons) often doesn't actually help with generalization, as increasing the rank/number of hidden units can allow for lower trace norms.
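To make the trace norm vs. rank point concrete, here is a minimal NumPy sketch (my own illustration, not taken from the talk). It relies on the standard identity that the trace norm of W equals the minimum of (||U||_F^2 + ||V||_F^2)/2 over all factorizations W = U V^T with any number of columns, so penalizing the factor norms while leaving the number of hidden units unconstrained amounts to trace-norm regularization. The function names and the random test matrix are hypothetical.

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def factorization_cost(U, V):
    """Half the sum of squared Frobenius norms of the two factors."""
    return 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)

def balanced_factorization(W):
    """Factors U, V with U @ V.T == W that attain the minimum cost."""
    A, s, Bt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    # Scale the columns of both singular-vector matrices by sqrt(singular value).
    return A * root, Bt.T * root

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))  # arbitrary matrix, for illustration only

U, V = balanced_factorization(W)

# Any other exact factorization of the same W costs at least as much.
M = np.eye(5) + 0.1 * rng.normal(size=(5, 5))  # well-conditioned mixing matrix
U2, V2 = U @ M, V @ np.linalg.inv(M).T

print("trace norm        :", trace_norm(W))
print("balanced cost     :", factorization_cost(U, V))
print("unbalanced cost   :", factorization_cost(U2, V2))
print("rank              :", np.linalg.matrix_rank(W))
```

The balanced cost matches the trace norm, while the rank says nothing about how large the factors have to be.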
Need to find right measure of Complexity, for NN "path norm" seems best from those they tried.
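For reference, a small sketch of how a path norm can be computed. I am assuming the l2 version for a bias-free feedforward net: the square root of the sum, over all input-to-output paths, of the product of squared weights along the path. The talk may use a different variant, and the layer sizes below are arbitrary. The sketch also checks the property that makes this measure attractive for ReLU nets: it is unchanged when the incoming weights of a hidden unit are scaled by c and its outgoing weights by 1/c, which leaves the network function unchanged but does change the plain weight norm.

```python
import numpy as np

def path_norm(weights):
    """l2 path norm of a bias-free feedforward net.

    `weights` is a list of matrices of shape (fan_out, fan_in), input layer first.
    Propagating a vector of ones through the squared weights sums the product
    of squared weights over every input-to-output path.
    """
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (W ** 2) @ v
    return np.sqrt(v.sum())

def frobenius_norm(weights):
    """Plain l2 norm of all weights, for comparison."""
    return np.sqrt(sum((W ** 2).sum() for W in weights))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))  # 4 hidden units -> 2 outputs

# Rescale hidden unit 0: incoming weights times c, outgoing weights divided by c.
c = 10.0
W1s, W2s = W1.copy(), W2.copy()
W1s[0, :] *= c
W2s[:, 0] /= c

print("path norm      :", path_norm([W1, W2]), path_norm([W1s, W2s]))
print("Frobenius norm :", frobenius_norm([W1, W2]), frobenius_norm([W1s, W2s]))
```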
Real vs random labels, behaviour is as expected with random data
But we didn't tell the network to minimize the path norm (complexity). So where is the regularization coming from? He thinks it's the Optimization algorithm that is biasing us towards simple global optima (work on this for convex opti?), but couldn't it be a GP map-like Simplicity bias? He approaches it from Geometry. I think he is right that the algorithm plays a role in the biasing. But GP bias probably does too, or it could at least be exploited (as in Neuroevolution with indirect encodings).
In any case, the experiments where more hidden units give better generalization mean that more hidden units allow the complexity measure to be driven lower (so rank isn't the right complexity measure. See here)
Mollifying Networks and smoothening landscapes
|The primary argument about the funnel is that these learning systems are strongly correlated, and therefore not readily treated by mean field theory. Specifically, the classical idea was that strongly correlated random models lie in a different universality class. So they behave completely differently than a spin glass, and this gives rise to the convexity.|
See discussion in comments here and here. Comments on why deep learning works: perspectives from theoretical chemistry
The Loss Surfaces of Multilayer Networks – loss surfaces paper. Uncorrelated inputs, and outputs? Yeah, basically everything is random, so they can't show how the behavior depends on properties of the function being learned. They really focus on random functions.
See also comments on Why does deep and cheap learning work so well?
Renormalization group can be seen through the lens of compression/Information bottleneck
See also Simplicity and learning
–> Francois Chollet. Ablation studies are a good idea when we don't have much theory. For world models, apparently an LSTM even with random weights works well <<–
Representational Power of Restricted Boltzmann Machines and Deep Belief Networks – Deep Belief Networks Are Compact Universal Approximators – Scaling learning algorithms towards AI – Hierarchical model of natural images and the origin of scale invariance – Why does Deep Learning work?, Spin glass, Why Deep Learning Works III: a preview – Lagrangian Relaxation for MAP Estimation in Graphical Models – Models of object recognition – Deep neural networks are easily fooled: High confidence predictions for unrecognizable images video – https://scholar.google.co.uk/citations?user=iqDZ9WYAAAAJ – Accelerated Learning in Layered Neural Networks – Yann LeCun: "Deep Learning, Graphical Models, Energy-Based Models, Structured Prediction, Pt. 1" – Why does Deep Learning work? - A perspective from Group Theory
As someone who is just entering the field of deep learning theory (coming from physics), I find this discussion very interesting. I commend your approach of respecting the truth about the universe. Basically, Feynman's principle of science "If it disagrees with experiment it is wrong. In that simple statement is the key to science. It does not make any difference how beautiful your guess is. It does not make any difference how smart you are, who made the guess, or what his name is – if it disagrees with experiment it is wrong. That is all there is to it."
When one goes deep into theory it's easy to lose track of that empirical grounding. Of course it's perfectly fine to do pure maths. But I do find many theorists who actually wanted to approach some real-world problem (like why deep learning works, or whatever), and ended up just proving a bunch of theorems which don't really say much about the real thing. I have to keep reminding myself to stay in touch with real applied ML, because I am interested in theory about the real thing!
In any case, I do like a lot one thing that Ali says. He puts emphasis on pedagogy, and the value of simple theorems and experiments for that. At the end of the day, the objective of a "proof" (even mathematicians say so) is to convince someone else of something. It's all about communication. Simple proofs, simple experiments, simple explanations, are not very useful for engineering in a vacuum. But put those things amongst people, and they foster collaboration, understanding, etc. To me that is the biggest value of theory :)
see my annotated version in Kami
The introduction is nice. It basically says "yes, in practice deep nets tend to work better than shallow ones, but why?":
You are given a training set with 1M labeled points. When you train a shallow neural net with one fully connected feed-forward hidden layer on this data you obtain 86% accuracy on test data. When you train a deeper neural net as in  consisting of a convolutional layer, pooling layer, and three fully connected feed-forward layers on the same data you obtain 91% accuracy on the same test set.
What is the source of this improvement? Is the 5% increase in accuracy of the deep net over the shallow net because: a) the deep net has more parameters; b) the deep net can learn more complex functions given the same number of parameters; c) the deep net has better inductive bias and thus learns more interesting/useful functions (e.g., because the deep net is deeper it learns hierarchical representations); d) nets without convolution can’t easily learn what nets with convolution can learn; e) current learning algorithms and regularization methods work better with deep architectures than shallow architectures; f) all or some of the above; g) none of the above?
They basically show that a) and b) are not the case, at least not often. They are basically arguing for reason e) in this paper, by showing that shallow nets can implement the same functions that deep nets learn (at least in the series of experiments and tasks they try). However, shallow nets just don't find these functions when trained with the algorithms we have.
If deep nets really do have more simplicity bias (still need to run more experiments on that!), then it could be a reason for them working better in this way.