Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Definition
The NTK is defined as the product of the Jacobian of the parameter-function map with its transpose, i.e. Θ(x, x′) = ∇_θ f_θ(x) · ∇_θ f_θ(x′).
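Concretely, for a finite network this is the empirical NTK. Below is a minimal sketch in JAX on a toy two-layer ReLU network in the NTK parametrization (the network, the sizes, and names like `f` / `empirical_ntk` are illustrative choices, not from the paper):

```python
import jax
import jax.numpy as jnp

# Toy two-layer network in the NTK parametrization: N(0,1) weights,
# explicit 1/sqrt(fan-in) factors in the forward pass.
def init_params(key, d_in=3, width=256):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (width,))}

def f(params, x):
    h = jnp.maximum(params["W1"] @ x / jnp.sqrt(x.shape[0]), 0.0)
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>, summed over all parameters.
    g1, g2 = jax.grad(f)(params, x1), jax.grad(f)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

params = init_params(jax.random.PRNGKey(0))
x1, x2 = jnp.array([1.0, 0.0, 0.0]), jnp.array([0.6, 0.8, 0.0])
print(empirical_ntk(params, x1, x2))
```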
In the infinite-width limit, the NTK converges in probability (over the random initialization) to a deterministic kernel.
This is analogous to how the distribution over functions at initialization converges to a Gaussian process (the NNGP) in the infinite-width limit.
For a small learning rate and an "increasing" Jacobian, the NTK/linearized/lazy dynamics describe training: the parameters barely change, even though the function changes.
Different papers take "increasing" the Jacobian to mean different things, but all with the same effect described above. Jacot et al. (2018) take the limit of infinite width, so that even though each individual weight changes infinitesimally, the total change in the function can be O(1) (see the intuitive explanation in the "Jacot et al." section below). Chizat et al. (2018) instead increase the Jacobian by scaling the model output, but the effect is the same. Just as Jacot et al. need to choose the weight variances so that the network output doesn't blow up at initialization, Chizat et al. also need to assume the output doesn't blow up at initialization; in fact, several of their theorems assume the output is initialized at 0.
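A sketch of the corresponding linearized ("lazy") model, reusing the toy `f` and `params` from the sketch above; in the NTK regime, training the full network is well approximated by training this model, which is linear in the parameters:

```python
import jax
import jax.tree_util as jtu

def f_lin(params, params0, x):
    # First-order Taylor expansion of f around params0:
    # f_lin(theta, x) = f(theta0, x) + <grad_theta f(theta0, x), theta - theta0>
    dparams = jtu.tree_map(lambda p, p0: p - p0, params, params0)
    f0, df = jax.jvp(lambda p: f(p, x), (params0,), (dparams,))
    return f0 + df
```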
Jacot et al.
Under the NTK parametrization, which they use, this limit implies that the corresponding learning rate for GD in the standard parametrization is O(1/√n) (where n is the layer width). So, for a fixed training time T, the parameters move less and less in this limit, which is, intuitively, why the NTK stays approximately constant up to time T.
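For reference, the NTK parametrization (as in Jacot et al., up to notation) makes the width scaling explicit in the forward pass rather than absorbing it into the weight variance:

$$\alpha^{(0)}(x) = x, \qquad \tilde\alpha^{(l+1)}(x) = \frac{1}{\sqrt{n_l}}\, W^{(l)} \alpha^{(l)}(x) + \beta\, b^{(l)}, \qquad \alpha^{(l+1)}(x) = \sigma\big(\tilde\alpha^{(l+1)}(x)\big),$$

with W^{(l)}_{ij}, b^{(l)}_i ~ N(0, 1) at initialization and β a fixed constant scaling the biases.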
The interesting thing is that the function f can still change, because all the parameters "conspire" to change it. The network can therefore fit the data and find a global minimum while the parameters have barely moved at all.
I think the intuition for this "conspiracy" is that the total change in f is a sum over all the parameters' individual contributions, and the number of parameters grows like n². The gradient w.r.t. the last hidden layer's activations is O(1/√n), and w.r.t. the second-to-last hidden layer's activations it is O(√n · (1/√n)²) = O(1/√n), where the √n comes from the variance of summing over all the activations in the last hidden layer.
This means that the gradient w.r.t. a weight, in the NTK parametrization, is O((1/√n)²) = O(1/n).
In GD, each weight changes by an amount of the same order as its gradient (assuming an O(1) learning rate, which is what we assume for the NTK-parametrization learning rate), so each weight contributes a change in f of O(1/n²). Since there are O(n²) weights, the total contribution is O(1). Note that the contributions all have the same sign, since each is essentially the square of the gradient w.r.t. that weight, so they add linearly (rather than growing only like the square root of the number of terms, as they would if they were randomly signed).
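A worked version of this counting argument, for a single training input x, loss L(f(x)), and one GD step of size η (a sketch consistent with the reasoning above, not a statement from the paper):

$$\Delta f(x) \;\approx\; \sum_p \frac{\partial f(x)}{\partial \theta_p}\,\Delta\theta_p \;=\; -\eta\,\frac{\partial L}{\partial f}\sum_p \left(\frac{\partial f(x)}{\partial \theta_p}\right)^{2} \;=\; -\eta\,\frac{\partial L}{\partial f}\,\Theta(x,x).$$

With O(n²) weights whose squared gradients are each O(1/n²), the sum Θ(x,x) is O(1), so f moves by O(1) per step even though each individual Δθ_p is only O(1/n).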
Chizat et al.
How come it can find a minimum arbitrarily close to the initialization?
Ah, I see: by the non-degenerate Jacobian assumption, you can find a local change in the parameters that fits y∗, and a large α is only needed so that this local change can reach the overall size/scale of y∗.
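Roughly, my reading of the scaling argument (a hedged sketch of Chizat et al.'s lazy-training reasoning, not their exact statement): for the scaled model αf(θ, x) with f(θ₀) ≈ 0, fitting the target y∗ to first order needs a parameter change of size

$$\|\Delta\theta\| \;\lesssim\; \frac{\|y^{*}\|}{\alpha\,\sigma_{\min}\big(Df(\theta_0)\big)} \;\longrightarrow\; 0 \quad \text{as } \alpha \to \infty,$$

where σ_min(Df(θ₀)) > 0 is exactly the non-degenerate Jacobian assumption. At the same time, the relative change of the Jacobian, ‖Df(θ₀ + Δθ) − Df(θ₀)‖ / ‖Df(θ₀)‖, is O(‖Δθ‖) for smooth enough f, so it also vanishes, and training stays in the linearized regime while fitting y∗ arbitrarily close to the initialization.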
- video. We are going to combine the idea of random features (training only the last layer) with the Polyak condition
- Random features. video. Ali Rahimi showed that for random features (whp, I guess), there exists a weighting of the features that gives a function close to any function expressible in the RKHS. Is this true for any given function whp, or whp true for all functions simultaneously? I guess the former. By "close" they prove it is within ∣∣f∣∣/√m of the target function, in RKHS norm, where m is the number of features
- Optimizing the last layer is a convex problem (see the random-features sketch after this list). If we initialize right, then optimizing the other weights doesn't hurt much (vid). What was the point of all of this polynomial stuff?
- To consider what happens when all the parameters move by roughly the same amount, we enter the regime of neural tangent kernels
- Neural tangent kernel depends strongly on initialization
- NTK is given by a sum of layer-wise kernels, whose weights depend on the initialization/learning rates
- Just like the NNGP kernel, the NTK for fully connected networks depends only on the norms of the inputs and the angle between them (and if the inputs are normalized, only on the angle), so they are dot-product kernels vid
- computation of the kernel (see the ReLU NTK recursion sketch after this list)
- GD~NTK needs initialization so that f_0 doesn't blow up
- Heuristic argument for why GD stays close to NTK kernel GD
- these are all worst-case results!
- kernel-norm-based sample complexity for learning polynomials (vid)
- learning (generalization) of teacher networks (and, more generally, of functions in the RKHS) (video): the NTK has been used to show learnability, and generalization bounds? Allen-Zhu et al. used other arguments
- don't get the last point about SGD... as in, is this for any loss? Hmm, I guess it's because it can never reach zero training error, as the training set keeps getting bigger; I think that's the point of that point
- Learning rate: it has to be very small! It has to scale like 1/m
- Limitations of these kernel methods
- For logistic loss with a small LR, it will first converge toward the NTK solution, but then, as GD continues for longer, a question is which max-margin solution it will converge to, starting from the NTK solution?
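As mentioned in the random-features bullets above, here is a minimal sketch of the "train only the last layer" idea: the first layer is frozen at random values and the readout is fit by ridge regression, a convex (here even closed-form) problem. The sizes, the ridge value, and the sine target are made up for illustration:

```python
import jax
import jax.numpy as jnp

def random_relu_features(key, X, m=512):
    # Frozen random first layer; only the readout on top of these features is trained.
    W = jax.random.normal(key, (X.shape[1], m)) / jnp.sqrt(X.shape[1])
    return jnp.maximum(X @ W, 0.0)

def fit_last_layer(Phi, y, ridge=1e-3):
    # Training the last layer is convex: here, closed-form ridge regression.
    return jnp.linalg.solve(Phi.T @ Phi + ridge * jnp.eye(Phi.shape[1]), Phi.T @ y)

key_w, key_x = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_x, (200, 5))
y = jnp.sin(X[:, 0])                 # toy target
Phi = random_relu_features(key_w, X)
w = fit_last_layer(Phi, y)
print(jnp.mean((Phi @ w - y) ** 2))  # training error of the convex fit
```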
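And, as mentioned in the "dot product kernel" / "computation of the kernel" bullets, a sketch of the analytic NTK recursion for a fully connected ReLU network with no biases, using the arc-cosine kernel formulas (Cho & Saul) and a c_σ = 2 normalization; conventions and constants differ between papers, so treat this as a sketch rather than the exact kernel of any one of them:

```python
import jax.numpy as jnp

def relu_ntk(x1, x2, n_layers=3):
    # NNGP kernel Sigma and NTK Theta of the input layer.
    s11, s22 = x1 @ x1, x2 @ x2   # diagonal entries; preserved by the c_sigma = 2 normalization
    s12 = x1 @ x2
    theta = s12
    for _ in range(n_layers):
        cos = jnp.clip(s12 / jnp.sqrt(s11 * s22), -1.0, 1.0)
        ang = jnp.arccos(cos)
        # Arc-cosine kernels for ReLU: c_sigma * E[relu'(u) relu'(v)] and c_sigma * E[relu(u) relu(v)].
        sigma_dot = (jnp.pi - ang) / jnp.pi
        s12 = jnp.sqrt(s11 * s22) * (jnp.sin(ang) + (jnp.pi - ang) * cos) / jnp.pi
        theta = s12 + sigma_dot * theta   # Theta^(l) = Sigma^(l) + Sigma_dot^(l) * Theta^(l-1)
    return theta

x1 = jnp.array([1.0, 0.0])
x2 = jnp.array([0.6, 0.8])
print(relu_ntk(x1, x2))
```

Note that the recursion depends on the inputs only through their norms and the angle between them, which is the dot-product-kernel point above.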