Neural tangent kernel

cosmos 13th July 2019 at 12:13am
Deep learning theory

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Definition

It is defined as the inner product of the Jacobians of the Parameter-function map, i.e. the kernel $\Theta(x, x') = \nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x')$, where $f_\theta(x)$ is the network output and the gradient is taken over all parameters.
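
For concreteness, here is a minimal sketch (not from the original note) of computing this empirical, finite-width kernel for a small MLP in JAX; the architecture and the helper names (`init_params`, `f`, `empirical_ntk`) are illustrative choices, not anything from the papers discussed here.

```python
import jax
import jax.numpy as jnp

def init_params(key, widths=(3, 64, 64, 1)):
    # NTK parametrization: weights are N(0, 1); the 1/sqrt(fan_in) factor lives in the forward pass.
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (n_in, n_out)))
    return params

def f(params, x):
    # Scalar-output MLP under the NTK parametrization (biases omitted for brevity).
    h = x
    for W in params[:-1]:
        h = jnp.tanh(h @ W / jnp.sqrt(W.shape[0]))
    return (h @ params[-1] / jnp.sqrt(params[-1].shape[0]))[..., 0]

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = J(x1) J(x2)^T, where J is the Jacobian of the parameter-function map.
    j1 = jax.jacobian(f)(params, x1)  # pytree of arrays, each of shape (batch1, *param_shape)
    j2 = jax.jacobian(f)(params, x2)
    flat1 = jnp.concatenate([a.reshape(x1.shape[0], -1) for a in jax.tree_util.tree_leaves(j1)], axis=1)
    flat2 = jnp.concatenate([a.reshape(x2.shape[0], -1) for a in jax.tree_util.tree_leaves(j2)], axis=1)
    return flat1 @ flat2.T  # (batch1, batch2) kernel matrix

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (5, 3))
print(empirical_ntk(init_params(key), x, x))
```

Re-running this with different PRNG keys and larger widths should show the kernel entries concentrating around fixed values, which is the convergence discussed next.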


In the infinite-width limit, the NTK at initialization converges in probability, over the random initialization, to a deterministic kernel.

This is similar to how the distribution over functions at initialization converges to a certain Gaussian process in the infinite-width limit.
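
Stated compactly (notation mine, following Jacot et al.): writing $\Theta^{(n)}$ for the NTK of a network with hidden widths $n$,

$$
\Theta^{(n)}(x, x') = \nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x') \;\xrightarrow{\;n \to \infty\;}\; \Theta_\infty(x, x') \quad \text{(in probability, over } \theta_0\text{)},
$$

where $\Theta_\infty$ is a deterministic kernel, just as $f_{\theta_0}$ itself converges in distribution to a Gaussian process with a deterministic covariance kernel.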

For a small learning rate and an "increasing" (scaled-up) Jacobian, the NTK/linearized/lazy dynamics describe the training dynamics, because the parameters barely change even though the function changes.

Different papers take "increasing" the Jacobian to mean different things, but all of them have the same effect described above. In Jacot et al. 2018, they take the limit of infinite width, which means that even though the individual weights change only infinitesimally, the total change in the function can be $O(1)$ (see the intuitive explanation in the section "Jacot et al." below). In Chizat et al. 2018, they increase the Jacobian by scaling the model output, but the effect is the same. Just as Jacot et al. need to choose the weight variances such that the network output doesn't blow up at initialization, Chizat et al. also need to assume that the network output doesn't blow up at initialization; in fact, several of their theorems assume the network output is initialized at $0$.
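
A compact way to state what "linearized/lazy dynamics" means here (notation mine): expand the network to first order around its initialization,

$$
f_\theta(x) \;\approx\; f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x) \cdot (\theta - \theta_0),
$$

so that, for squared loss under gradient flow with rate $\eta$ on training data $\{(x_i, y_i)\}_{i=1}^N$, the function values evolve by kernel gradient descent with the (approximately constant) NTK:

$$
\frac{d f_t(x)}{dt} \;=\; -\eta \sum_{i=1}^{N} \Theta(x, x_i)\,\big(f_t(x_i) - y_i\big).
$$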

Jacot et al.

Under the NTK parametrization, which they use, this limit implies that the corresponding learning rate for GD on the standard parametrization is $O(1/n)$ (where $n$ is the layer width); a short derivation is sketched below. So the parameters move less and less for a fixed training time $T$ in this limit, which is, intuitively, why the NTK stays (approximately) constant over this period of time, up until $T$.
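
A short sketch of that learning-rate correspondence (notation mine, ignoring biases): under the NTK parametrization the pre-activations are $h^{l+1} = \frac{1}{\sqrt{n_l}} W^{l+1} a^l$ with $W_{ij} \sim \mathcal{N}(0,1)$, so the standard-parametrization weights are $\hat{W} = W / \sqrt{n_l}$. A GD step of size $\eta$ on $W$ then moves $\hat{W}$ by

$$
\Delta \hat{W} \;=\; \frac{\Delta W}{\sqrt{n_l}} \;=\; -\frac{\eta}{\sqrt{n_l}} \frac{\partial L}{\partial W} \;=\; -\frac{\eta}{n_l} \frac{\partial L}{\partial \hat{W}},
$$

i.e. an $O(1)$ learning rate in the NTK parametrization corresponds to an $O(1/n)$ learning rate on the standard-parametrization weights.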

The interesting thing is that the function $f$ can change, as all the parameters "conspire" for it to change. Therefore it can potentially fit a function, and find a global minimum, while the parameters have barely moved at all.

I think the intuition for this "conspiracy" is that the total change in $f$ is given by a sum over all the parameters' individual contributions. The number of parameters grows like $n^2$. The gradient w.r.t. the last hidden layer's activations is $O(1/\sqrt{n})$; w.r.t. the second-to-last hidden layer's activations it is $O(\sqrt{n}\,(1/\sqrt{n})^2) = O(1/\sqrt{n})$, where the $\sqrt{n}$ comes from the variance of summing over all the activations in the last hidden layer. This means that the gradient w.r.t. a single weight, in the NTK parametrization, is $O((1/\sqrt{n})^2) = O(1/n)$. In GD, each weight changes by an amount of the same order as its gradient (assuming an $O(1)$ learning rate, which is what we assume for the NTK-parametrization learning rate), so each weight contributes a change to $f$ of $O(1/n^2)$. Therefore the total contribution from all the weights is $O(1)$. Note that the contributions all have the same sign, since each is essentially the gradient w.r.t. that weight, squared, so they add linearly (rather than growing only like the square root of the number of terms, as they would if they were randomly signed).
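
To make the counting explicit (a sketch for a single training input $x$ with loss $L(f(x))$, notation mine): a GD step changes each parameter by $\Delta\theta_i = -\eta\, L'(f)\, \partial f / \partial \theta_i$, so

$$
\Delta f(x) \;\approx\; \sum_i \frac{\partial f}{\partial \theta_i}\, \Delta\theta_i \;=\; -\eta\, L'(f) \sum_i \left(\frac{\partial f}{\partial \theta_i}\right)^2 \;=\; -\eta\, L'(f)\, \Theta(x, x),
$$

where each squared term is $O(1/n^2)$, there are $O(n^2)$ of them, and all are non-negative, so $\Theta(x,x) = O(1)$ even though every individual $\Delta\theta_i$ is only $O(1/n)$.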

Chizat et al

How come it can find a minimum arbitrarily close to the initialization?

Ah, I see: by the non-degenerate Jacobian assumption, you can find a local change in the parameters that fits $y^*$, and a large $\alpha$ is just needed so that this local change reaches the overall size/scale of $y^*$.
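
A back-of-the-envelope version of this (notation mine): with the scaled model $\alpha f(\theta)$ and output initialized at $0$, fitting targets $y^*$ with a local parameter change $\Delta\theta$ requires

$$
\alpha\, \nabla_\theta f(\theta_0)\, \Delta\theta \;\approx\; y^* \quad\Longrightarrow\quad \|\Delta\theta\| = O\!\left(\frac{\|y^*\|}{\alpha}\right),
$$

where the non-degenerate (full-rank) Jacobian is what guarantees such a $\Delta\theta$ exists; as $\alpha$ grows, the required parameter movement shrinks, so a minimum can be found arbitrarily close to the initialization.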


Recent Developments in Over-parametrized Neural Networks, Part II