A Gaussian process that approximates the prior over functions in a Bayesian neural network (Bayesian deep learning). The approximation becomes exact in the limit of infinitely wide layers (infinitely many neurons per hidden layer), under the assumption that the distribution over weights in the Bayesian neural network is i.i.d. (and often Gaussian).
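As a quick sanity check of this claim, here is a minimal sketch (my own, not from any of the references below) that samples many wide, randomly initialized one-hidden-layer ReLU networks in JAX and looks at the joint distribution of their outputs at two fixed input points; the width, variances and inputs are arbitrary choices. The empirical mean is close to zero and the empirical covariance approaches the corresponding GP kernel as the width grows.

```python
import jax
import jax.numpy as jnp

def random_net_output(key, x, width=2048, sigma_w=1.0, sigma_b=0.1):
    """Outputs of a random 1-hidden-layer ReLU net at inputs x (shape [n, d])."""
    d = x.shape[-1]
    k1, k2, k3, k4 = jax.random.split(key, 4)
    W1 = jax.random.normal(k1, (d, width)) * sigma_w / jnp.sqrt(d)
    b1 = jax.random.normal(k2, (width,)) * sigma_b
    W2 = jax.random.normal(k3, (width, 1)) * sigma_w / jnp.sqrt(width)
    b2 = jax.random.normal(k4, (1,)) * sigma_b
    h = jax.nn.relu(x @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)                      # shape [n]

x = jnp.array([[1.0, 0.0], [0.6, 0.8]])                   # two fixed input points
keys = jax.random.split(jax.random.PRNGKey(0), 5000)      # one key per sampled network
samples = jax.vmap(lambda k: random_net_output(k, x))(keys)   # shape [5000, 2]

# Empirical mean ~ 0; the empirical 2x2 covariance approaches the NNGP kernel
# evaluated at this pair of inputs as the width grows.
print(samples.mean(axis=0))
print(jnp.cov(samples, rowvar=False))
```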
For a parametrized function $f(x;\theta)$, the neural tangent kernel is defined as
$$\Theta(x, x'; \theta) = \left\langle \nabla_\theta f(x;\theta),\; \nabla_\theta f(x';\theta) \right\rangle,$$
where the angle brackets denote the standard inner product in $\mathbb{R}^P$, with $P$ the number of parameters.
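A minimal sketch of computing this empirical NTK with JAX autodiff, for a toy scalar-output MLP (the architecture and initialization below are arbitrary choices, just for illustration):

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(2, 64, 64, 1)):
    """Random weights and zero biases for a small fully connected net."""
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, k = jax.random.split(key)
        params.append((jax.random.normal(k, (din, dout)) / jnp.sqrt(din),
                       jnp.zeros(dout)))
    return params

def f(params, x):
    """Scalar output of the MLP at a single input x (shape [d])."""
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[0]

def ntk(params, x1, x2):
    """<grad_theta f(x1), grad_theta f(x2)>, summed over all parameter tensors."""
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

params = init_params(jax.random.PRNGKey(0))
x1, x2 = jnp.array([1.0, 0.0]), jnp.array([0.5, 0.5])
print(ntk(params, x1, x2))
```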
When $f(x;\theta)$ is a feedforward neural network with a suitable distribution over parameters (e.g. i.i.d. Gaussian weights), then in the limit of infinite width $\Theta(x, x'; \theta)$ converges to a deterministic limiting kernel $\Theta_\infty(x, x')$.
This convergence allows one to predict the evolution of $f(\cdot\,;\theta)$ under gradient descent on $\theta$. For example, if we apply gradient flow (gradient descent in the limit of infinitesimally small step size) on a training set $\mathcal{X} = \{x_1, \dots, x_n\}$ with the squared loss $L(\theta) = \tfrac{1}{2}\sum_{i=1}^n \bigl(f(x_i;\theta) - y_i\bigr)^2$, for codomain$(f) = \mathbb{R}$, Jacot et al. derived
$$\partial_t f_t(x) = -\sum_{i=1}^{n} \Theta_t(x, x_i)\,\bigl(f_t(x_i) - f^*(x_i)\bigr),$$
where $f^*$ is the "ground truth" function that maps each $x_i$ to $y_i$, and $f_t(\mathcal{X})$ and $f^*(\mathcal{X})$ are thought of as dimension-$n$ vectors, so that on the training set this reads $\partial_t f_t(\mathcal{X}) = -\Theta_t(\mathcal{X}, \mathcal{X})\,\bigl(f_t(\mathcal{X}) - f^*(\mathcal{X})\bigr)$. Jacot et al. proved that, under suitable conditions (in particular a Lipschitz, twice-differentiable nonlinearity), with training time $T$ fixed and width $\to \infty$, $\Theta_t \to \Theta_\infty$ uniformly over $t \le T$. This means that
in the large-width regime, $f_t$ (in function space) evolves approximately according to a linear differential equation under gradient flow! See more on this here.
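Here is a minimal sketch of what that linear ODE looks like on the training set, assuming the kernel is frozen at its limiting value: the closed-form solution is $f_t(\mathcal{X}) = f^*(\mathcal{X}) + e^{-\Theta t}\bigl(f_0(\mathcal{X}) - f^*(\mathcal{X})\bigr)$, evaluated below with a made-up $3 \times 3$ kernel matrix and made-up initial outputs and targets.

```python
import jax.numpy as jnp

def linearized_predictions(Theta, f0, fstar, t):
    """Closed-form solution of d/dt f_t = -Theta (f_t - f*) via eigendecomposition."""
    evals, evecs = jnp.linalg.eigh(Theta)       # Theta is symmetric PSD
    decay = jnp.exp(-evals * t)                 # e^{-lambda_i t} per eigenmode
    return fstar + evecs @ (decay * (evecs.T @ (f0 - fstar)))

Theta = jnp.array([[2.0, 0.5, 0.1],            # stand-in for the limiting NTK Gram matrix
                   [0.5, 1.5, 0.3],
                   [0.1, 0.3, 1.0]])
f0    = jnp.array([0.2, -0.1, 0.4])            # network outputs at initialization
fstar = jnp.array([1.0,  0.0, -1.0])           # training targets
for t in (0.0, 1.0, 10.0):
    print(t, linearized_predictions(Theta, f0, fstar, t))   # converges to fstar
```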
Turns out that the NTK is closely related to the empirical Fisher information matrix: writing $J$ for the Jacobian of the network outputs with respect to the parameters, the NTK Gram matrix is $J J^\top$ ($n \times n$), while the Fisher is proportional to $J^\top J$ ($P \times P$), so the two share the same nonzero eigenvalues.
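A tiny numerical illustration of this relation, using a random stand-in Jacobian rather than a real network:

```python
import jax
import jax.numpy as jnp

n, P = 5, 20                                            # number of training points, parameters
J = jax.random.normal(jax.random.PRNGKey(1), (n, P))    # stand-in Jacobian d f(x_i) / d theta

ntk_gram = J @ J.T                                      # n x n empirical NTK Gram matrix
fisher   = J.T @ J                                      # P x P, proportional to the Fisher (Gaussian likelihood)

ev_ntk    = jnp.sort(jnp.linalg.eigvalsh(ntk_gram))[::-1]
ev_fisher = jnp.sort(jnp.linalg.eigvalsh(fisher))[::-1]
print(ev_ntk)            # the n eigenvalues of the NTK Gram matrix...
print(ev_fisher[:n])     # ...equal the nonzero eigenvalues of J^T J
```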
It may also approximate the distribution over functions obtained when training with SGD.
The kernel function of the Gaussian process depends on the choice of architecture and on the properties of the parameter distribution, in particular the weight variance $\sigma_w^2 / N$ (where $N$ is the size of the input to the layer) and the bias variance $\sigma_b^2$. The kernel for fully connected ReLU networks has a well-known analytical form known as the arc-cosine kernel [ref], while for convolutional and residual networks it can be efficiently computed (see e.g. here).
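A minimal sketch of the resulting layer-by-layer kernel recursion for a fully connected ReLU network, using the arc-cosine formula for the ReLU expectation (the depth and the variances $\sigma_w^2$, $\sigma_b^2$ below are arbitrary choices):

```python
import jax.numpy as jnp

def nngp_relu_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """NNGP kernel K(x1, x2) for a fully connected ReLU network of the given depth."""
    d = x1.shape[-1]
    k11 = sigma_b2 + sigma_w2 * jnp.dot(x1, x1) / d
    k22 = sigma_b2 + sigma_w2 * jnp.dot(x2, x2) / d
    k12 = sigma_b2 + sigma_w2 * jnp.dot(x1, x2) / d
    for _ in range(depth):
        cos_theta = jnp.clip(k12 / jnp.sqrt(k11 * k22), -1.0, 1.0)
        theta = jnp.arccos(cos_theta)
        # E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]):
        cross = jnp.sqrt(k11 * k22) * (jnp.sin(theta) + (jnp.pi - theta) * cos_theta) / (2 * jnp.pi)
        k12 = sigma_b2 + sigma_w2 * cross
        k11 = sigma_b2 + sigma_w2 * k11 / 2      # E[relu(u)^2] = k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12

print(nngp_relu_kernel(jnp.array([1.0, 0.0]), jnp.array([0.6, 0.8])))
```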
–> For a derivation and a more detailed introduction to the main results (for fully connected networks)
For an even more rigorous and careful treatment, and further results, see: Deep Neural Networks as Gaussian Processes, Gaussian Process Behaviour in Wide Deep Neural Networks
It turns out that convolutional neural networks (CNNs) with infinitely many filters per layer are also Gaussian processes, with a different, and more complex, kernel function (so that, in effect, CNNs "see" different inputs as being "similar" to each other than fully connected nets do).
Deep Convolutional Networks as shallow Gaussian Processes
Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes
A Gaussian Process perspective on Convolutional Neural Networks
Deep learning generalizes because the parameter-function map is biased towards simple functions
These papers study in more detail the properties of the kernel of neural networks, and even explore some ideas related to the robustness of the outputs of the network to changes of the weights (see in particular the SI of this paper: Exponential expressivity in deep neural networks through transient chaos). <– Here is an idea: given their analysis of the effect of a change in the weights $W \to W + \delta W$ on the corresponding change $\delta f(x)$ in the outputs of the network, for some given input point $x$, one can perhaps find a formula for the Hessian of the loss (which is a sum of functions of $f(x_i)$ over the set of $x_i$ corresponding to the training set). If we are lucky, it may be possible to relate that formula with the formula for the kernel given by the Gaussian process analysis (which is rather similar in nature, as the calculation of the kernel is what they explore in that paper, in terms of how the correlation between two points changes as you propagate through the layers; see the comments in Deep Neural Networks as Gaussian Processes and the paper itself for this to make more sense!)
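To make the Hessian side of this idea concrete, here is a minimal sketch (with a made-up toy net and data) checking numerically that, for the squared loss, the Hessian splits into a Gauss-Newton term $J^\top J$, built from the same output Jacobian $J$ whose Gram matrix is the NTK, plus a residual-weighted curvature term:

```python
import jax
import jax.numpy as jnp

D_IN, H_DIM = 2, 4
P = D_IN * H_DIM + H_DIM                        # flat parameter count of the toy net

def f(theta, x):
    """Tiny 1-hidden-layer tanh net with a flat parameter vector theta."""
    W1 = theta[:D_IN * H_DIM].reshape(D_IN, H_DIM)
    w2 = theta[D_IN * H_DIM:]
    return jnp.tanh(x @ W1) @ w2

X = jnp.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])     # made-up training inputs
y = jnp.array([0.5, -0.5, 1.0])                         # made-up targets
theta = jax.random.normal(jax.random.PRNGKey(0), (P,)) * 0.5

loss = lambda th: 0.5 * jnp.sum((jax.vmap(lambda x: f(th, x))(X) - y) ** 2)

H = jax.hessian(loss)(theta)                            # full loss Hessian, P x P
J = jax.vmap(lambda x: jax.grad(f)(theta, x))(X)        # n x P Jacobian of outputs
residuals = jax.vmap(lambda x: f(theta, x))(X) - y
gauss_newton = J.T @ J                                  # same J whose Gram matrix J J^T is the NTK
curvature = sum(r * jax.hessian(f)(theta, x) for r, x in zip(residuals, X))

print(jnp.allclose(H, gauss_newton + curvature, atol=1e-4))   # True
```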
See Neural network Gaussian process
A Correspondence Between Random Neural Networks and Statistical Field Theory
Exponential expressivity in deep neural networks through transient chaos
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
See more at Statistical mechanics of neural networks
"The results indicate that on this dataset (Delf yatch hydrodynamics dataset) the Bayesian deep network and theGaussian process do not make similar predictions. Of the two, the Bayesian neural networkachieves signi cantly better log likelihoods on average, indicating that a nite networkperforms better than its in nite analogue in this case." (https://arxiv.org/pdf/1804.11271.pdf)