Neural network Gaussian process

cosmos 15th March 2019 at 8:05pm
Gaussian process

A Gaussian process that approximates the Prior over functions in a Bayesian neural network (Bayesian deep learning). The approximation becomes exact in the limit of infinitely wide layers (infinitely many neurons per hidden layer), under the assumption that the distribution over weights in the Bayesian neural network is i.i.d. (and often Gaussian).
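A quick way to see this numerically (my own minimal numpy sketch, not taken from any of the papers linked here): sample many finite one-hidden-layer ReLU networks with i.i.d. Gaussian weights, evaluate them on a few fixed inputs, and watch the empirical covariance of the outputs stabilize as the width grows (and the marginals become Gaussian). The function and parameter names (`sample_outputs`, `sigma_w2`, `sigma_b2`) are mine.

```python
import numpy as np

def sample_outputs(X, width, n_samples, sigma_w2=2.0, sigma_b2=0.1, seed=0):
    """Sample f(X) for random one-hidden-layer ReLU networks whose weights are
    i.i.d. Gaussian with variance sigma_w2 / (fan-in) and biases with variance sigma_b2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    outs = np.empty((n_samples, n))
    for s in range(n_samples):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / d), size=(d, width))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        v = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=width)
        c = rng.normal(0.0, np.sqrt(sigma_b2))
        h = np.maximum(X @ W + b, 0.0)       # hidden-layer ReLU activations
        outs[s] = h @ v + c                  # one scalar output per input
    return outs

X = np.random.default_rng(1).normal(size=(3, 5))   # 3 fixed inputs in R^5
for width in (10, 100, 1000, 10000):
    f = sample_outputs(X, width, n_samples=2000)
    # empirical covariance of (f(x_1), f(x_2), f(x_3)) stabilizes as width grows
    print(width, np.round(np.cov(f, rowvar=False), 3))
```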

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Signal propagation in neural networks

Neural tangent kernel

For a parametrized function $f(x;\theta)$, the neural tangent kernel is defined as

$$K_\theta(x,x') = \langle \nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta)\rangle$$

where the angle brackets denote the standard Inner product in $\mathbb{R}^p$, with $p$ the number of parameters.
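As a sanity check of the definition, here is a minimal numpy sketch (mine, not from the NTK paper) that writes out $\nabla_\theta f(x;\theta)$ by hand for a one-hidden-layer ReLU network and takes the inner product of the two parameter-gradient vectors in $\mathbb{R}^p$:

```python
import numpy as np

def empirical_ntk(x1, x2, W, v):
    """K_theta(x1, x2) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>
    for f(x) = v^T relu(W x) / sqrt(width), with parameters theta = (W, v)."""
    width = v.shape[0]

    def param_grad(x):
        pre = W @ x                                        # pre-activations, shape (width,)
        dW = np.outer(v * (pre > 0), x) / np.sqrt(width)   # df/dW (ReLU derivative is the indicator)
        dv = np.maximum(pre, 0.0) / np.sqrt(width)         # df/dv
        return np.concatenate([dW.ravel(), dv])            # flatten into a single vector in R^p

    return param_grad(x1) @ param_grad(x2)                 # standard inner product in R^p

rng = np.random.default_rng(0)
d, width = 4, 5000
W = rng.normal(0.0, 1.0, size=(width, d))                  # i.i.d. standard Gaussian parameters
v = rng.normal(0.0, 1.0, size=width)
x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x1, x2, W, v), empirical_ntk(x1, x1, W, v))
```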

When $f$ is a feedforward neural network with a certain distribution over parameters, then in the limit of infinite width $K_\theta$ converges to a deterministic limit $K_\infty$.

This convergence allows one to predict the evolution of $f$ under gradient descent on $\theta$. For example, if we apply gradient flow (gradient descent in the limit of infinitesimal step size) on a training set $\mathcal{X}$ with the loss function $\frac{1}{|\mathcal{X}|}\sum_{(x,y)\in \mathcal{X}}\frac{1}{2}(f(x)-y)^2$, for a scalar output ($\mathrm{codomain}(f) = \mathbb{R}$), Jacot et al. derived:

$$\frac{\partial f_t}{\partial t} = -\frac{1}{|\mathcal{X}|}K_{\theta_t}(\mathcal{X},\mathcal{X})(f_t-f^*)$$

where $f^*$ is the "ground truth" function mapping $f^*(x) = y$ for each $(x,y)\in \mathcal{X}$, and $f$ and $f^*$ are thought of as $|\mathcal{X}|$-dimensional vectors. Jacot et al. proved that, under suitable regularity conditions (e.g. a Lipschitz nonlinearity), with the training time $T$ fixed and the width $\to \infty$, $K_{\theta_t} \to K_\infty$ for all $0\leq t \leq T$. This means that

in the large-width regime, $f$ (in function space) evolves approximately according to a linear differential equation under gradient flow! Wow! See more of this here!
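To make the "linear ODE" point concrete, here is a small numpy sketch (my own illustration, assuming the kernel is frozen at its limit $K_\infty$): for a constant kernel the ODE above has the closed-form solution $f_t = f^* + e^{-tK/|\mathcal{X}|}(f_0 - f^*)$, which the snippet checks against naive Euler integration of the gradient-flow ODE. The matrix `K` here is just a random PSD stand-in for the limiting kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                        # |X|, the number of training points
A = rng.normal(size=(n, n))
K = A @ A.T                                  # random PSD stand-in for K_inf
f0 = rng.normal(size=n)                      # network outputs on X at initialization
f_star = rng.normal(size=n)                  # training targets y

def f_closed_form(t):
    # f_t = f* + exp(-t K / n) (f_0 - f*), via the eigendecomposition of K
    eigvals, Q = np.linalg.eigh(K)
    return f_star + Q @ (np.exp(-t * eigvals / n) * (Q.T @ (f0 - f_star)))

def f_euler(t, steps=100000):
    f, dt = f0.copy(), t / steps
    for _ in range(steps):
        f -= dt / n * K @ (f - f_star)       # gradient-flow ODE, small Euler steps
    return f

# small difference: just the Euler discretization error
print(np.max(np.abs(f_closed_form(2.0) - f_euler(2.0))))
```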

Turns out that the NTK is closely related to the Fisher information matrix: writing $J$ for the Jacobian of the network outputs with respect to the parameters, the NTK Gram matrix is $JJ^\top$ while the Fisher matrix (for a Gaussian likelihood / squared loss) is $J^\top J$, so the two share the same nonzero spectrum.
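A quick numerical check of that relationship (with a random matrix standing in for the Jacobian $J$):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(8, 20))        # stand-in Jacobian: 8 outputs, 20 parameters

ntk_gram = J @ J.T                  # |X| x |X| NTK Gram matrix
fisher   = J.T @ J                  # p x p Fisher / Gauss-Newton matrix (Gaussian likelihood)

print(np.round(np.linalg.eigvalsh(ntk_gram), 4))
print(np.round(np.linalg.eigvalsh(fisher)[-8:], 4))    # the nonzero eigenvalues coincide
```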


It may also approximate the prior over functions when training with SGD.

The kernel function of the Gaussian process depends on the choice of architecture and on the properties of the parameter distribution, in particular the weight variance $\sigma_w^2/n$ (where $n$ is the size of the input to the layer) and the bias variance $\sigma_b^2$. The kernel for fully connected ReLU networks has a well-known analytical form known as the arccosine kernel [ref], while for convolutional and residual networks it can be computed efficiently (see e.g. here).
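Here is a minimal sketch of the standard layer-to-layer recursion for the fully connected ReLU kernel (the arccosine-kernel recursion, as in e.g. Deep Neural Networks as Gaussian Processes); the function name and the particular values of $\sigma_w^2$ and $\sigma_b^2$ are mine.

```python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=1.6, sigma_b2=0.1):
    """NNGP kernel of a fully connected ReLU network with `depth` hidden layers,
    computed with the layer-wise arccosine-kernel recursion."""
    d_in = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d_in        # covariance of first-layer pre-activations
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        # E[relu(u) relu(v)] for (u, v) jointly Gaussian with covariance K, times sigma_w2, plus bias:
        K = sigma_b2 + sigma_w2 / (2 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return K

X = np.random.default_rng(0).normal(size=(5, 3))      # 5 inputs in R^3
print(np.round(nngp_relu_kernel(X, depth=3), 3))
```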

--> For a derivation and a more detailed introduction to the main results (for Fully connected networks)

For an even more rigorous and careful treatment, and further results, see: Deep Neural Networks as Gaussian Processes, Gaussian Process Behaviour in Wide Deep Neural Networks


Convolutional neural network Gaussian processes

It turns out that Convolutional neural networks (CNNs) with infinitely many filters per layer are also Gaussian processes, with a different and more complex kernel function (so that, in effect, CNNs "see" a different set of inputs as being "similar" to each other than fully connected nets do).
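To get a feel for how the kernel changes, here is a hedged toy sketch (my own construction, much simpler than the kernels in the papers below): a single 1-D convolution + ReLU layer with infinitely many channels, followed by a dense readout over the flattened activations. Because the readout weights at different spatial positions are independent, only same-position patch covariances contribute, so the kernel averages the ReLU arccosine expectation over aligned patches, i.e. it measures local patch similarity rather than whole-input similarity.

```python
import numpy as np

def relu_expect(c11, c12, c22):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[c11, c12], [c12, c22]]) (arccosine kernel)."""
    norm = np.sqrt(c11 * c22)
    theta = np.arccos(np.clip(c12 / norm, -1.0, 1.0))
    return norm / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def cnn_gp_kernel(x, xp, filter_size=3, sigma_w2=2.0, sigma_b2=0.1):
    """GP kernel of a toy architecture: one 1-D conv + ReLU layer (infinitely many
    channels), then a dense readout over the flattened activations.  Readout weights
    at different spatial positions are independent, so only same-position patch
    covariances contribute."""
    P = len(x) - filter_size + 1                       # number of valid conv positions
    total = 0.0
    for i in range(P):
        a, b = x[i:i + filter_size], xp[i:i + filter_size]
        c11 = sigma_b2 + sigma_w2 * (a @ a) / filter_size
        c12 = sigma_b2 + sigma_w2 * (a @ b) / filter_size
        c22 = sigma_b2 + sigma_w2 * (b @ b) / filter_size
        total += relu_expect(c11, c12, c22)
    return sigma_b2 + sigma_w2 * total / P

rng = np.random.default_rng(0)
x, xp = rng.normal(size=16), rng.normal(size=16)       # two 1-D input "signals"
print(cnn_gp_kernel(x, xp), cnn_gp_kernel(x, x))
```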

Deep Convolutional Networks as shallow Gaussian Processes

Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes

A Gaussian Process perspective on Convolutional Neural Networks


Applications to Generalization

Deep learning generalizes because the parameter-function map is biased towards simple functions


Mean field theory of neural networks

These papers study in more detail the properties of the kernel of neural networks, and even explore some ideas related to the robustness of the outputs of the neural network to changes in the weights (see in particular the SI of this paper: Exponential expressivity in deep neural networks through transient chaos). <-- Here is an idea: given their analysis of the effect of a change in the weights $w$, $\Delta w$, on the corresponding change in the outputs of the network, $\Delta f(x,w)$, for a given input point $x$, one can perhaps find a formula for the Hessian of the loss (which is a sum of functions of $f(x;w)$ over the inputs $x$ in the training set). If we are lucky, it may be possible to relate that formula to the formula for $P(f)$ given by the Gaussian process analysis (which is rather similar in nature, as the calculation of the kernel is what they explore in that paper, in terms of how the correlation between two points changes as you propagate through the layers; see the comments in Deep Neural Networks as Gaussian Processes, and the paper itself, for this to make more sense!)

See Neural network Gaussian process

A Correspondence Between Random Neural Networks and Statistical Field Theory

Deep Information Propagation

Exponential expressivity in deep neural networks through transient chaos

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

See more at Statistical mechanics of neural networks


"The results indicate that on this dataset (Delf yatch hydrodynamics dataset) the Bayesian deep network and theGaussian process do not make similar predictions. Of the two, the Bayesian neural networkachieves signi cantly better log likelihoods on average, indicating that a nite networkperforms better than its in nite analogue in this case." (https://arxiv.org/pdf/1804.11271.pdf)