Exploding gradients problem

cosmos 18th March 2019 at 4:50pm
Deep learning theory

See also: Vanishing gradients problem

Video: The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions (openreview)

  1. For some types of nonlinearities/layers, the gradients w.r.t. the parameters at different layers grow with the number of layers: they increase as we go from the output towards the input (see the first sketch after this list, and the visualization of decision surfaces in parameter space for fixed input).
  2. To avoid a pseudo-random walk due to the decision surface seen from the lower layers, we have to choose a smaller learning rate as we add more layers (see here).
  3. For any network, we can apply the residual trick to convert it into an equivalent residual network. The size of the residuals is then found to be small when the learning rate is small.
  4. By using the residual trick and then doing a Taylor expansion (where we assume the residual terms are small), we can reduce the number of layers while still approximating the network well, giving a notion of effective depth (note: depending on how many layers we Taylor-expand over, we get a different number of layers in the approximate network; see the second sketch after this list). The effective depth is smaller when the residuals are smaller.
  5. Therefore the error is large: although nets with exploding gradients (and thus small residuals) can be approximated by nets of small effective depth, they still contain many layers of nonlinearities (with effectively fixed parameters, which is why these layers do not count towards the effective depth), and these, they argue, needlessly randomize the features.
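
A minimal numerical sketch of points 1-2 (my illustration, not the paper's code; the depth, width, σ_w = 2 initialization and the sum-of-outputs loss are arbitrary choices): in a plain deep tanh net initialized in the chaotic/exploding regime, the gradient norm w.r.t. the weights grows as we move from the output layer towards the input layer, typically by a couple of orders of magnitude at this depth.

```python
# Per-layer gradient norms in a plain tanh MLP (NumPy, manual backprop).
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 100
scale = 2.0 / np.sqrt(width)   # sigma_w = 2: deliberately supercritical for tanh

Ws = [scale * rng.standard_normal((width, width)) for _ in range(depth)]
x = rng.standard_normal(width)

# Forward pass, storing pre-activations and activations.
hs, pres = [x], []
for W in Ws:
    pre = W @ hs[-1]
    pres.append(pre)
    hs.append(np.tanh(pre))

# Backward pass for the (arbitrary) scalar loss L = sum of the last layer's activations.
delta = np.ones(width)                     # dL/dh_depth
grad_norms = []
for l in reversed(range(depth)):
    delta = delta * (1.0 - np.tanh(pres[l]) ** 2)               # through tanh
    grad_norms.append(np.linalg.norm(np.outer(delta, hs[l])))   # ||dL/dW_l||
    delta = Ws[l].T @ delta                                      # through the linear map

grad_norms = grad_norms[::-1]              # index 0 = layer closest to the input
print("||grad|| at first layer:", grad_norms[0])
print("||grad|| at last layer :", grad_norms[-1])
print("ratio first/last       :", grad_norms[0] / grad_norms[-1])
```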
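
A toy version of the residual trick and the Taylor expansion in points 3-4 (again my own sketch, not the paper's construction; the residual scale is an arbitrary stand-in for "small residuals"): write each layer as identity plus a residual, f_l(x) = x + r_l(x); expanding the composition to first order in the residuals collapses the whole stack to x + Σ_l r_l(x), i.e. to an effective depth of one, and this approximation is only good when the residuals are small.

```python
# Effective depth via the residual trick: compare the exact composition of
# (identity + residual) layers with its first-order expansion in the residuals.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 50, 20

def make_residuals(res_scale):
    """Residual branches r_l(x) = res_scale * tanh(W_l x); res_scale mimics how
    small a small learning rate keeps the residuals (hypothetical stand-in)."""
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    return [lambda x, W=W: res_scale * np.tanh(W @ x) for W in Ws]

x = rng.standard_normal(width)
for res_scale in (0.01, 0.5):
    rs = make_residuals(res_scale)

    # Exact forward pass through all layers: h <- h + r_l(h).
    h = x.copy()
    for r in rs:
        h = h + r(h)

    # First-order Taylor expansion in the residuals: all r_l evaluated at the
    # input, so the stack collapses to a single "effective" layer.
    approx = x + sum(r(x) for r in rs)

    rel_err = np.linalg.norm(h - approx) / np.linalg.norm(h)
    print(f"residual scale {res_scale}: relative error of depth-1 approximation = {rel_err:.4f}")
```

The depth-1 approximation is close when the residuals are small and visibly breaks down when they are not, which is the sense in which small residuals imply small effective depth.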

See summary

gingkotree

Other problems discussed in the video

Apart from exploding gradients, networks can fail by

  • pseudolinearity: as the variance of the activations decreases, certain nonlinearities effectively look more and more linear. This causes ineffective layers and low effective depth (and thus potentially high error). For some nonlinearities, ReLU being one example, the pseudolinearity effect only occurs if domain bias also occurs (see the sketch below).
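
A small sketch of pseudolinearity (my illustration; the input standard deviations and the 3σ bias shift are arbitrary): as the variance of the inputs shrinks, tanh becomes almost exactly a linear function of its input, while ReLU stays equally non-linear at every scale unless a bias (domain bias) pushes its inputs to one side of the kink.

```python
# How well is each nonlinearity approximated by a linear function of its input?
import numpy as np

rng = np.random.default_rng(0)

def linear_fit_rel_error(f, x):
    """Relative residual of the best least-squares fit a*x + b to f(x)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(A, f(x), rcond=None)
    return np.linalg.norm(f(x) - A @ coef) / np.linalg.norm(f(x))

for std in (1.0, 0.3, 0.1):
    x = std * rng.standard_normal(10_000)
    tanh_err = linear_fit_rel_error(np.tanh, x)
    relu_err = linear_fit_rel_error(lambda z: np.maximum(z, 0.0), x)
    # "Domain bias": shift the inputs so that almost all of them land on one
    # side of ReLU's kink (the 3*std offset is an arbitrary illustration).
    relu_biased_err = linear_fit_rel_error(lambda z: np.maximum(z + 3 * std, 0.0), x)
    print(f"std={std:4}: tanh {tanh_err:.4f}  ReLU {relu_err:.4f}  ReLU+bias {relu_biased_err:.4f}")
```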

–> Residual neural networks seem to alleviate these problems! (see also this video)

Exponential expressivity in deep neural networks through transient chaos

Deep Information Propagation

Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Information bottleneck

Mean field theory of neural networks

Neural network Gaussian processes!