Loss surface of neural networks

cosmos 7th December 2018 at 11:35pm
Deep learning theory Loss surface

loss landscapes/surfaces in Deep learning/Machine learning

Perspective: Energy Landscapes for Machine Learning

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs – "The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant."
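
As a concrete picture of what such a "simple curve" can look like, the sketch below (not quoted from the paper, just the standard setup in this line of work) shows a quadratic Bézier curve between two trained solutions w_1, w_2 with a single trainable bend θ, fit by minimizing the expected loss at a uniformly sampled point on the curve:

```latex
% Sketch of one curve parameterization used for mode connectivity:
% a quadratic Bezier curve between trained solutions w_1 and w_2
% with a trainable "bend" theta.
\phi_\theta(t) = (1-t)^2\, w_1 + 2t(1-t)\,\theta + t^2\, w_2, \quad t \in [0,1],
\qquad
\min_\theta \; \mathbb{E}_{t \sim U[0,1]}\!\left[\, L\!\left(\phi_\theta(t)\right) \right]
```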

Geometry of energy landscapes and the optimizability of deep neural networks

Loss surface of neural networks video

Topology and Geometry of Half-Rectified Network Optimization

EMPIRICAL ANALYSIS OF THE HESSIAN OF OVERPARAMETRIZED NEURAL NETWORKS (in bigger nets, most directions are nearly flat)

the observations in Keskar et al. [2016]: it appears that the LB (large-batch) and SB (small-batch) solutions are separated by a barrier, with the latter generalizing better. Moreover, the line interpolations extending away from either endpoint appear to confirm the sharpness of the LB solution. However, we find that the line interpolation connecting the endpoints of Part I and Part II turns out to not contain any barriers (Figure 8). This suggests that while the small- and large-batch methods converge to different solutions, these solutions have been in the same basin all along. This raises the striking possibility that other seemingly different solutions may be similarly connected by a flat region to form a larger basin.

QUALITATIVELY CHARACTERIZING NEURAL NETWORK OPTIMIZATION PROBLEMS – Running the linear interpolation experiment on this problem, we find in Fig. 1 that the 1-D subspace spanning the initial parameters and final parameters is very well-behaved, and that SGD spends most of its time exploring the flat region at the bottom of the valley. See the figure below (visualization_SGD_trajectory_flat_basin).
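
A minimal sketch of the linear interpolation experiment (PyTorch assumed; `model`, `loss_fn`, `data_loader` and the two parameter lists are user-supplied, not from any of the papers). It evaluates the loss along the 1-D segment between two parameter vectors, e.g. initial vs. final weights here, or the LB vs. SB solutions discussed above; a bump along the segment indicates a barrier.

```python
import torch

def losses_along_segment(model, loss_fn, data_loader, theta_0, theta_1, n_points=25):
    """theta_0, theta_1: lists of tensors shaped like model.parameters()."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        with torch.no_grad():
            # Set the model to the interpolated point (1 - alpha)*theta_0 + alpha*theta_1.
            for p, w0, w1 in zip(model.parameters(), theta_0, theta_1):
                p.copy_((1.0 - alpha) * w0 + alpha * w1)
            # Average the loss over the whole data loader at this point.
            total, count = 0.0, 0
            for x, y in data_loader:
                total += loss_fn(model(x), y).item() * y.shape[0]
                count += y.shape[0]
        losses.append(total / count)
    return losses
```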

Essentially No Barriers in Neural Network Energy Landscape – Minima are not located in isolated finite-width valleys; rather, there are paths through parameter space along which the loss remains very close to its value at the minima. (<– these results are for ResNets and DenseNets, which are known to have smoother loss landscapes) – Using Mode Connectivity for Loss Landscape Analysis – These results are indicative of the connectedness of deep learning loss landscapes.

VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS

Deep Learning without Poor Local Minima – "1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers)." Note that although these bad saddles have no negative Hessian eigenvalue, they still have descent directions (by definition of a saddle); the descent is just of higher order (the decrease only shows up in higher-order derivatives).
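
A one-dimensional toy example (mine, not from the paper) of such a degenerate critical point: the Hessian has no negative eigenvalue, yet moving towards w < 0 decreases the function, with the decrease appearing only at third order.

```latex
% Degenerate critical point at w = 0: zero gradient, zero Hessian,
% but still not a local minimum.
f(w) = w^3, \qquad f'(0) = 0, \qquad f''(0) = 0, \qquad f'''(0) = 6 > 0
```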

AN EMPIRICAL ANALYSIS OF DEEP NETWORK LOSS SURFACES – The work of Dauphin et al. (2014) empirically investigated properties of the critical points of neural network loss functions and demonstrated that their critical points behave similarly to the critical points of random Gaussian error functions in high dimensional space. We will expose further evidence along this trajectory.

On the Quality of the Initial Basin in Overspecified Neural Networks

https://arxiv.org/abs/1611.06310


Stats 385 - Theories of Deep Learning - Joan Bruna - Lecture 8

The loss surface F(θ) of a given model can be expressed in terms of its level sets Ω_F(λ), which contain, for each energy level λ, all parameters yielding a loss smaller than or equal to λ. A first question we address concerns the topology of these level sets, i.e. under which conditions they are connected. Connected level sets imply that one can always find a descent direction at each energy level, and therefore that no poor local minima can exist.
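
In symbols (reconstructed from the wording above), the sublevel set at energy level λ is:

```latex
% All parameters achieving loss at most lambda.
\Omega_F(\lambda) = \{\, \theta : F(\theta) \le \lambda \,\}
```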

We then move to the half-rectified case and show that the topology is intrinsically different and clearly dependent on the interplay between data distribution and model architecture. Our main theoretical contribution is to prove that half-rectified single layer networks are asymptotically connected, and we provide explicit bounds that reveal the aforementioned interplay.

Beyond the question of whether the loss contains poor local minima or not, the immediate follow-up question that determines the convergence of algorithms in practice is the local conditioning of the loss surface. It is thus related not to the topology but to the shape or geometry of the level sets. As the energy level decays, one expects the level sets to exhibit more complex, irregular structures, which correspond to regions where F(θ) has small curvature. In order to verify this intuition, we introduce an efficient algorithm to estimate the geometric regularity of these level sets by approximating geodesics of each level set starting at two random boundary points. Our algorithm uses dynamic programming and can be efficiently deployed to study mid-scale CNN architectures on MNIST, CIFAR-10 and RNN models on Penn Treebank next-word prediction. Our empirical results show that these models have a nearly convex behavior up until their lowest test errors, with a single connected component that becomes more elongated as the energy decays.
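
The paragraph only sketches the path-finding idea; below is a much-simplified midpoint-refinement version of the same idea, not the authors' dynamic-programming algorithm. It recursively inserts midpoints between two low-loss endpoints and nudges each one until its loss drops back below the target level; `loss_at` is a hypothetical user-supplied function mapping a flat parameter vector to a scalar loss.

```python
import torch

def refine_path(theta_a, theta_b, loss_at, level, depth=3, steps=50, lr=1e-2):
    """theta_a, theta_b: flat parameter tensors with loss <= level.
    loss_at: differentiable map from a flat parameter tensor to a scalar loss."""
    if depth == 0:
        return [theta_a, theta_b]
    # Start from the straight-line midpoint and pull it down to the level set.
    mid = ((theta_a + theta_b) / 2).clone().requires_grad_(True)
    opt = torch.optim.SGD([mid], lr=lr)
    for _ in range(steps):
        loss = loss_at(mid)
        if loss.item() <= level:
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    mid = mid.detach()
    # Recurse on both halves so the whole polygonal path stays near the level set.
    left = refine_path(theta_a, mid, loss_at, level, depth - 1, steps, lr)
    right = refine_path(mid, theta_b, loss_at, level, depth - 1, steps, lr)
    return left[:-1] + right
```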

Poor local minima, topology of level sets

Deep linear networks – The Loss Surfaces of Multilayer Networks: overparametrization gives connected sublevel sets (Proposition 2.2); this can't work in general with nonlinearities. Proof ideas:

Case for quadratic and finitely kernelizable nonlinearities

The case for ReLUs: one can show a kind of asymptotic connectedness, where we only need to go outside the sublevel set by an amount of energy that decreases as we increase the number of neurons.


The fact that these results work independently of the loss function means that they must be a property of the parameter-function map, so that all functions are connected to each other given enough overparametrization/redundancy!

Relations to Kernel methods

Geometry of level sets

TOPOLOGY AND GEOMETRY OF HALF-RECTIFIED NETWORK OPTIMIZATION

Are the level sets round and convex, or more like "tentacles"/spaghetti? There is an algorithm to explore this.

Open questions

Others

Adding One Neuron Can Eliminate All Bad Local Minima

video

Regularity of level sets <> Difficulty of training <> Generalization <> Dependence on the target function

Mollifying Networks and smoothing landscapes

Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Levent Sagun, V. Ugur Güney, Gérard Ben Arous, and Yann LeCun. Explorations on high dimensional landscapes. ICLR 2015 Workshop Contribution, arXiv:1412.6615, 2014.

Levent Sagun, Léon Bottou, and Yann LeCun. Singularity of the hessian in deep learning. arXiv preprint arXiv:1611.07476, 2016.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of neural networks. arXiv preprint arXiv:1611.06310, 2016.

A jamming transition from under- to over-parametrization affects loss landscape and generalization