Gaussian process

cosmos 3rd April 2019 at 4:57pm
Generative supervised learning Probabilistic model

Gaussian processes are Probabilistic models, defined as a probability distribution over a set of random variables (i.e. a Stochastic process) where any finite subset of the Random variables in the process is jointly Gaussian distributed. This set of random variables is usually interpreted as the output values of a function on an input space, so we say that Gaussian processes define a distribution over functions, as we elaborate below.

Good quick intro. Visual introduction (distill) for a more thorough introduction. Also an intro here

A Gaussian process is thus a distribution over functions such that the values of the function at any finite set of points are jointly distributed according to a Multivariate Gaussian distribution, with a Covariance matrix given by a Kernel function, which is a function of two Inputs (this ensures consistency via what's called the marginalization property). This is also called a Gaussian random field.

In terms of equations, the values of the function at any finite set of $n$ inputs $(x_1,...,x_n)$ are jointly distributed according to a Gaussian distribution:

$$P_{\mathbf{\theta}\sim Q}\left(f_\mathbf{\theta}(x_1)=\tilde{y}_1,...,f_\mathbf{\theta}(x_n)=\tilde{y}_n\right) \propto \exp{\left(-\frac{1}{2}\mathbf{\tilde{y}}^T \mathbf{K}^{-1}\mathbf{\tilde{y}}\right)},$$

where $\mathbf{\tilde{y}}=(\tilde{y}_1,...,\tilde{y}_n)$. The entries of the covariance matrix $\mathbf{K}$ are given by the Kernel function $k$ as $K_{ij}=k(x_i,x_j)$.

Kernels encode how "similar" two points $x_i$ and $x_j$ in the input Domain of the distribution over functions are. What this means precisely is that if the kernel at these two points, $k(x_i,x_j)$, is high, then the function is more likely to have similar values at these two points. This allows one to encode a wide variety of prior knowledge/assumptions about the functions one is trying to learn, like Invariances/symmetries. Often, one chooses kernels that prefer smoothness, so that $y$s corresponding to $x$s which are close under some Metric (often the Euclidean metric) are more likely to be similar. To see more on the choice of kernels, see the discussion in the page on Reproducing kernel Hilbert spaces.
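As a minimal sketch of this (using numpy; the squared-exponential kernel and its hyperparameters are just illustrative choices), one can sample functions from a GP prior by building the covariance matrix $K_{ij}=k(x_i,x_j)$ on a grid of inputs and drawing from the corresponding multivariate Gaussian:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel: large when inputs are close,
    # so nearby inputs get strongly correlated function values.
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# A finite grid of inputs at which we look at the function
x = np.linspace(-5, 5, 100)

# Covariance matrix K_ij = k(x_i, x_j); small jitter for numerical stability
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

# The GP prior says the function values at these inputs are jointly Gaussian
# with (here) zero mean and covariance K, so sampling functions is just
# sampling from a multivariate Gaussian.
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Decreasing the lengthscale makes nearby inputs less correlated, so the sampled functions become wigglier; increasing it makes them smoother.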

Application in Generative supervised learning

Gaussian processes are usually used in Generative supervised learning. In brief, generative supervised learning works as follows: assume a certain model $p(\mathbf{y}|\mathbf{x})$, where the entries of $\mathbf{y}$ are the outputs corresponding to the inputs in $\mathbf{x}$. To learn a predictor from a set of data, we do the following: given observed outputs $y$ for some inputs $x$ as data, we can compute a Predictive distribution for the outputs $y$ corresponding to unobserved inputs $x$ (see Bayesian inference).

A Gaussian process model models $p(\mathbf{y}|\mathbf{x})$ as $p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{x})$, where $p(\mathbf{y}|\mathbf{f})$ is a Likelihood function connecting outputs to the values of a "latent function" $f$. This latent function is distributed according to a Gaussian process, as described above, which can now be interpreted as a prior over functions.
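For the common case of a Gaussian likelihood, the posterior predictive distribution is again Gaussian and can be computed in closed form. A minimal sketch of the standard GP regression equations (function and variable names, kernel choice and noise level are just illustrative):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, kernel=rbf_kernel, noise_var=0.1):
    # Gaussian likelihood: y = f(x) + eps, with eps ~ N(0, noise_var)
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = kernel(x_train, x_test)    # covariance between train and test inputs
    K_ss = kernel(x_test, x_test)    # covariance among test inputs

    # Conditioning the joint Gaussian on the observed y gives
    #   mean = K_s^T K^{-1} y,   cov = K_ss - K_s^T K^{-1} K_s
    # (with K including the noise term), computed stably via Cholesky.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov
```

The Cholesky solve costs $O(n^3)$ in the number of training points, which is what motivates the sparse approximations discussed further down.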

They are Bayesian Kernel methods

Gaussian process inference can be done efficiently for datasets of up to about 100,000 data points, with current techniques and computers.


They are equivalent to Bayesian Kernel ridge regression! (what they call the "weight-space view" in here)
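Concretely (a standard identity, stated here for reference): with a Gaussian likelihood of noise variance $\sigma^2$, the GP posterior predictive mean at a test point $x_*$ is

$$\bar{f}(x_*) = \mathbf{k}(x_*)^T(\mathbf{K} + \sigma^2 \mathbf{I})^{-1}\mathbf{y}, \qquad \mathbf{k}(x_*)_i = k(x_i, x_*),$$

which is exactly the kernel ridge regression estimator with regularization parameter $\lambda = \sigma^2$.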

See section 4.3 in Murphy's book (Machine learning - a probabilistic perspective) for the derivation of the fact that the marginal distribution of any subset of variables from a larger set of jointly Gaussian random variables is itself Gaussian. This is why the Gaussian process property (that the values at any finite set of points have a joint Gaussian distribution) corresponds to a Gaussian prior over functions (Gaussian random field; a field with a quadratic energy functional; see Path integral).
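In equations (a standard statement of the marginalization/consistency property; the block notation here is just for illustration): if

$$\begin{pmatrix}\mathbf{f}_A \\ \mathbf{f}_B\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B\end{pmatrix}, \begin{pmatrix}\mathbf{K}_{AA} & \mathbf{K}_{AB} \\ \mathbf{K}_{BA} & \mathbf{K}_{BB}\end{pmatrix}\right),$$

then marginalizing out $\mathbf{f}_B$ gives $\mathbf{f}_A \sim \mathcal{N}(\boldsymbol{\mu}_A, \mathbf{K}_{AA})$: one simply drops the rows and columns corresponding to the points not being considered, so the finite-dimensional distributions defined by a kernel are automatically consistent with each other.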


Relationships between Gaussian processes, Support Vector machines and Smoothing Splines. See also: Support vector machine, Splines.

Deep Neural Networks as Gaussian Processes – Extensions for CNNs

Gaussian Process Behaviour in Wide Deep Neural Networks

https://en.wikipedia.org/wiki/Gaussian_process


Approximate inference for Gaussian processes

Gaussian processes with non-Gaussian likelihood

Two major obstacles: non-Gaussianity of the posterior process and the size of the kernel matrix $K_0(x_i,x_j)$. A first obvious problem stems from the fact that the posterior process is usually non-Gaussian (except when the likelihood itself is Gaussian in the $f_x$). Hence, in many important cases its analytical form precludes an exact evaluation of the multidimensional integrals that occur in posterior averages. Nevertheless, various methods have been introduced to approximate these averages. A variety of such methods may be understood as approximations of the non-Gaussian posterior process by a Gaussian one (Jaakkola and Haussler 1999; Seeger 2000); for instance, in (Williams and Barber 1998) the posterior mean is replaced by the posterior maximum (MAP) and information about the fluctuations is derived by a quadratic expansion around this maximum.

Usually the observed labels $y$ are assumed to be either equal to the function modelled by the GP, or to have a Gaussian distribution around it (what's called a Gaussian likelihood – note that here the function $f$ plays the role of the parameters in Bayesian inference).

If one assumes a non-Gaussian likelihood, then the problem is no longer Analytically tractable.

There are several approximations which are used in that case.

The most common case is in Gaussian process classification. See here
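As a rough sketch of the kind of approximation used for classification, here is the Laplace approximation for binary GP classification with a logistic likelihood, which replaces the non-Gaussian posterior over the latent $\mathbf{f}$ by a Gaussian centred at its mode, found by Newton's method (following the standard iteration, cf. Rasmussen and Williams' GPML book, Algorithm 3.1; the function names and the fixed iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_gp_classification(K, y, n_iters=20):
    """Laplace approximation for binary GP classification.

    K : (n, n) kernel matrix, y : labels in {-1, +1}.
    Returns the MAP estimate of the latent function values f.
    """
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iters):
        pi = sigmoid(y * f)
        grad = y * (1 - pi)          # gradient of the log-likelihood
        W = pi * (1 - pi)            # negative Hessian of the log-likelihood (diagonal)
        # Newton step in a numerically convenient form (cf. GPML, Alg. 3.1)
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sqrt_W * np.linalg.solve(L.T, np.linalg.solve(L, sqrt_W * (K @ b)))
        f = K @ a
    return f
```

The mode $\hat{\mathbf{f}}$ together with the curvature $W$ at the mode defines the Gaussian approximation to the posterior that is then used for prediction.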

Further techniques

One approach is to partition the data set into separate groups [e.g. Snelson and Ghahramani, 2007, Urtasun and Darrell, 2008]. An alternative is to build a low rank approximation to the covariance matrix based around 'inducing variables' [see e.g. Csato and Opper, 2002, Seeger et al., 2003, Quinonero Candela and Rasmussen, 2005, Titsias, 2009]. These approaches lead to a Computational complexity of $O(nm^2)$ and storage demands of $O(nm)$, where $n$ is the number of data points and $m$ is a user-selected parameter governing the number of inducing variables.

In this paper, they then introduced Variational inference for Gaussian processes.
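To get a feel for the $O(nm^2)$ scaling, here is a minimal sketch of one of the simpler inducing-variable approximations (a subset-of-regressors / DTC-style predictive mean, not the variational method of that paper; the kernel callable, the choice of inducing inputs and all names are assumptions for illustration):

```python
import numpy as np

def inducing_point_predict(x_train, y_train, x_test, x_inducing, kernel, noise_var=0.1):
    # Subset-of-regressors / DTC-style predictive mean with m inducing inputs.
    K_mm = kernel(x_inducing, x_inducing)   # (m, m)
    K_mn = kernel(x_inducing, x_train)      # (m, n) -- O(nm) storage
    K_ms = kernel(x_inducing, x_test)       # (m, n_test)

    # mean = K_*m (noise_var * K_mm + K_mn K_nm)^{-1} K_mn y
    A = noise_var * K_mm + K_mn @ K_mn.T    # forming K_mn K_nm costs O(n m^2)
    mean = K_ms.T @ np.linalg.solve(A, K_mn @ y_train)
    return mean
```

The dominant costs are the $O(nm^2)$ matrix product and the $O(nm)$ cross-covariance, compared with $O(n^3)$ time and $O(n^2)$ memory for exact GP regression.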

Sparse On-Line Gaussian Processes


Theory of Gaussian processes

Gaussian processes where training data cover the whole input space (non-trivial because the $y$ values are still random samples according to the likelihood, and our task is to estimate the latent $f$ (which we can use to estimate future $y$))

paper

Gaussian process Kernels

See https://www.cs.toronto.edu/~duvenaud/cookbook/

Combination of kernels

See here and here

Remember that kernel functions with one of their arguments fixed are members of the reproducing kernel Hilbert space to which the functions supported by a particular Gaussian process belong.

Therefore adding kernels amounts to adding functions from these two spaces. That is why the resulting sample functions behave the way they do when combining kernels!
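A minimal sketch of what combining kernels does to samples from the prior (the particular kernels and hyperparameters are just illustrative choices):

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def periodic(x1, x2, period=2.0, lengthscale=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

x = np.linspace(0, 10, 200)
jitter = 1e-8 * np.eye(len(x))

# Sums and products of valid kernels are themselves valid kernels
K_sum = rbf(x, x) + periodic(x, x)    # additive: smooth component + periodic component
K_prod = rbf(x, x) * periodic(x, x)   # locally periodic structure

# Sampling from the corresponding GP priors shows the combined structure
f_sum = np.random.multivariate_normal(np.zeros(len(x)), K_sum + jitter)
f_prod = np.random.multivariate_normal(np.zeros(len(x)), K_prod + jitter)
```

The sum behaves like the sum of two independent functions, one drawn from each kernel's GP, while the product gives locally periodic functions whose shape varies smoothly, as illustrated in the kernel cookbook linked above.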

Automated statistician

Automatically chooses kernels, and does many other things: https://www.automaticstatistician.com/index/