Exploding gradients problem

cosmos 18th March 2019 at 4:50pm
Deep learning theory

See also: Vanishing gradients problem

Video: The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions (openreview)

  1. For some types of nonlinearities/layers, the gradients w.r.t. the parameters at different layers grow with the number of layers: they increase as we go from the output towards the input (see the first sketch after this list, and the visualization of decision surfaces in parameter space for fixed input).
  2. To avoid a pseudo-random walk due to the decision surface seen from the lower layers, we have to choose a smaller learning rate as we add more layers (see here).
  3. For any network, we can apply the residual trick to convert it into an equivalent residual network. The size of the residuals is then found to be small when the learning rate is small.
  4. By using the residual trick and then doing a Taylor expansion (where we assume the residual terms are small), we can reduce the number of layers while still approximating the network well, giving a notion of effective depth (note: depending on how many layers we Taylor-expand over, we get a different number of layers in the approximate network; see the second sketch after this list). The effective depth is smaller when the residuals are smaller.
  5. Therefore the error is large: although nets with exploding gradients (and thus small residuals) can be approximated by nets of small effective depth, they still contain many layers of nonlinearities (with effectively fixed parameters, which is why these layers do not count towards the effective depth), and these, they argue, needlessly randomize the features.
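
A minimal numerical sketch of points 1-2 (my illustration, not the paper's code; the depth, width, σ_w = 2 initialization and the sum-of-outputs loss are arbitrary choices): in a plain deep tanh net initialized in the chaotic/exploding regime, the gradient norm w.r.t. the weights grows as we move from the output layer towards the input layer, typically by a couple of orders of magnitude at this depth.

```python
# Per-layer gradient norms in a plain tanh MLP (NumPy, manual backprop).
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 100
scale = 2.0 / np.sqrt(width)   # sigma_w = 2: deliberately supercritical for tanh

Ws = [scale * rng.standard_normal((width, width)) for _ in range(depth)]
x = rng.standard_normal(width)

# Forward pass, storing pre-activations and activations.
hs, pres = [x], []
for W in Ws:
    pre = W @ hs[-1]
    pres.append(pre)
    hs.append(np.tanh(pre))

# Backward pass for the (arbitrary) scalar loss L = sum of the last layer's activations.
delta = np.ones(width)                     # dL/dh_depth
grad_norms = []
for l in reversed(range(depth)):
    delta = delta * (1.0 - np.tanh(pres[l]) ** 2)               # through tanh
    grad_norms.append(np.linalg.norm(np.outer(delta, hs[l])))   # ||dL/dW_l||
    delta = Ws[l].T @ delta                                      # through the linear map

grad_norms = grad_norms[::-1]              # index 0 = layer closest to the input
print("||grad|| at first layer:", grad_norms[0])
print("||grad|| at last layer :", grad_norms[-1])
print("ratio first/last       :", grad_norms[0] / grad_norms[-1])
```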
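
A toy version of the residual trick and the Taylor expansion in points 3-4 (again my own sketch, not the paper's construction; the residual scale is an arbitrary stand-in for "small residuals"): write each layer as identity plus a residual, f_l(x) = x + r_l(x); expanding the composition to first order in the residuals collapses the whole stack to x + Σ_l r_l(x), i.e. to an effective depth of one, and this approximation is only good when the residuals are small.

```python
# Effective depth via the residual trick: compare the exact composition of
# (identity + residual) layers with its first-order expansion in the residuals.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 50, 20

def make_residuals(res_scale):
    """Residual branches r_l(x) = res_scale * tanh(W_l x); res_scale mimics how
    small a small learning rate keeps the residuals (hypothetical stand-in)."""
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    return [lambda x, W=W: res_scale * np.tanh(W @ x) for W in Ws]

x = rng.standard_normal(width)
for res_scale in (0.01, 0.5):
    rs = make_residuals(res_scale)

    # Exact forward pass through all layers: h <- h + r_l(h).
    h = x.copy()
    for r in rs:
        h = h + r(h)

    # First-order Taylor expansion in the residuals: all r_l evaluated at the
    # input, so the stack collapses to a single "effective" layer.
    approx = x + sum(r(x) for r in rs)

    rel_err = np.linalg.norm(h - approx) / np.linalg.norm(h)
    print(f"residual scale {res_scale}: relative error of depth-1 approximation = {rel_err:.4f}")
```

The depth-1 approximation is close when the residuals are small and visibly breaks down when they are not, which is the sense in which small residuals imply small effective depth.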

See summary

gingkotree

Other problems discussed in the video

Apart from exploding gradients, networks can fail by

  • pseudolinearity: as the variance of the activations decreases, certain nonlinearities effectively look more and more linear. This causes ineffective layers and low effective depth (and thus potentially high error). For some nonlinearities, ReLU being one example, the pseudolinearity effect only occurs if domain bias also occurs (see the sketch below).
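
A small sketch of pseudolinearity (my illustration; the input standard deviations and the 3σ bias shift are arbitrary): as the variance of the inputs shrinks, tanh becomes almost exactly a linear function of its input, while ReLU stays equally non-linear at every scale unless a bias (domain bias) pushes its inputs to one side of the kink.

```python
# How well is each nonlinearity approximated by a linear function of its input?
import numpy as np

rng = np.random.default_rng(0)

def linear_fit_rel_error(f, x):
    """Relative residual of the best least-squares fit a*x + b to f(x)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(A, f(x), rcond=None)
    return np.linalg.norm(f(x) - A @ coef) / np.linalg.norm(f(x))

for std in (1.0, 0.3, 0.1):
    x = std * rng.standard_normal(10_000)
    tanh_err = linear_fit_rel_error(np.tanh, x)
    relu_err = linear_fit_rel_error(lambda z: np.maximum(z, 0.0), x)
    # "Domain bias": shift the inputs so that almost all of them land on one
    # side of ReLU's kink (the 3*std offset is an arbitrary illustration).
    relu_biased_err = linear_fit_rel_error(lambda z: np.maximum(z + 3 * std, 0.0), x)
    print(f"std={std:4}: tanh {tanh_err:.4f}  ReLU {relu_err:.4f}  ReLU+bias {relu_biased_err:.4f}")
```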

–> Residual neural networks seem to alleviate these problems! (see also this video)

Exponential expressivity in deep neural networks through transient chaos

Deep Information Propagation

Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Information bottleneck

Mean field theory of neural networks

Neural network Gaussian processes!