Generative supervised learning

Generative supervised learning refers to a kind of Supervised learning task where the objective is to learn the function p(\text{input}|\text{output}), together with p(\text{output}), which can then be used to find p(\text{output}|\text{input}) using Bayes' theorem. See notes. See lecture video def. The method effectively builds a kind of Generative model of the training data, i.e., it models the probability distribution p((x,y)).
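
As a minimal (hypothetical) sketch of this idea, the snippet below fits a tiny generative classifier to binary features: it estimates p(y) from label frequencies and p(x|y) under a naive-Bayes factorization (the data and that factorization are assumptions made only for this example), and then applies Bayes' theorem to obtain p(y|x).

```python
import numpy as np

# Toy binary data: rows of X are inputs x with 0/1 features, y holds the class labels.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 0])

classes = np.unique(y)  # array([0, 1])

# p(y): class prior estimated from label frequencies (the phi parameters).
prior = np.array([(y == c).mean() for c in classes])

# p(x|y): modelled with a naive-Bayes assumption (features independent given the class);
# theta[c, j] = p(x_j = 1 | y = c), with Laplace smoothing to avoid zero probabilities.
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def posterior(x):
    """p(y|x) via Bayes' theorem: p(x|y) p(y) / sum_y' p(x|y') p(y')."""
    likelihood = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
    joint = likelihood * prior
    return joint / joint.sum()

print(posterior(np.array([1, 0, 1])))  # posterior over classes 0 and 1 for a new input
```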

slides

When applied to Classification, these types of models are called generative classifiers.

The difference between generative models and discriminative models that have the same conditional probability p(y|x), i.e. the same predictive model, is that they are trained differently and will learn different sets of parameters. The difference boils down to the generative model making extra assumptions (about the distribution of the data) which the discriminative model doesn't make. The performance of one versus the other will depend on how well or how badly these assumptions hold. An example pair for exploring these issues is Logistic regression and binary linear discriminant analysis.
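
To illustrate the comparison, one possible sketch is the following: it fits scikit-learn's LogisticRegression (discriminative) and LinearDiscriminantAnalysis (generative) on synthetic two-class Gaussian data where the LDA assumptions hold. Both induce a decision boundary of the same linear form, but the learned coefficients differ because they are obtained by optimizing different likelihoods. The synthetic data is an assumption of the example, not part of the note.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data where the LDA assumptions (Gaussian classes, shared covariance) hold.
n = 200
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Discriminative: fits p(y|x) directly (logistic sigmoid of a linear function of x).
disc = LogisticRegression().fit(X, y)

# Generative: fits p(x|y) and p(y); its induced p(y|x) is also a sigmoid of a
# linear function, but the coefficients come from class means and the shared covariance.
gen = LinearDiscriminantAnalysis().fit(X, y)

# Same functional form for the decision boundary, different learned parameters.
print("logistic regression:", disc.coef_, disc.intercept_)
print("LDA:                ", gen.coef_, gen.intercept_)
```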

Learning method

The learning method is similar to that described in Discriminative learning. However, the crucial difference is in how we model the joint likelihood

p((x,y)|\theta) = \prod_{i}p(x^{(i)}|y^{(i)};\theta)p(y^{(i)};\phi)

where \phi are additional parameters of the generative model.
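
Note that the log of this joint likelihood splits into a sum of terms depending only on \theta and terms depending only on \phi, so the two sets of parameters can be maximized separately. The sketch below illustrates this for a hypothetical 1-D example with Gaussian class-conditionals (an assumption made only for the illustration), where the maximum-likelihood \phi are the class frequencies and the maximum-likelihood \theta are the per-class means and variances.

```python
import numpy as np

# Hypothetical 1-D training data (x, y) with class labels 0 and 1.
x = np.array([0.9, 1.3, 1.1, -0.8, -1.2, -1.0, -0.7])
y = np.array([1,   1,   1,    0,    0,    0,    0])

classes = np.unique(y)  # array([0, 1])

# phi: class prior p(y); maximizing the joint likelihood in phi alone gives class frequencies.
phi = np.array([(y == c).mean() for c in classes])

# theta: parameters of p(x|y); each class-conditional is assumed Gaussian here,
# so the maximum-likelihood theta are just the per-class means and variances.
mu = np.array([x[y == c].mean() for c in classes])
var = np.array([x[y == c].var() for c in classes])

def log_joint_likelihood(x, y, phi, mu, var):
    """log prod_i p(x_i | y_i; theta) p(y_i; phi) -- it splits into a theta part
    and a phi part, which is why the two sets of parameters can be fit separately."""
    log_px_given_y = -0.5 * np.log(2 * np.pi * var[y]) - (x - mu[y]) ** 2 / (2 * var[y])
    log_py = np.log(phi[y])
    return log_px_given_y.sum() + log_py.sum()

print(log_joint_likelihood(x, y, phi, mu, var))
```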

Now to compare with the discriminative case, we can compute

p(y^{(i)}|x^{(i)};\theta, \phi) = \frac{p(x^{(i)}|y^{(i)};\theta) p(y^{(i)};\phi)}{p(x^{(i)};\phi,\theta)}

where p(x^{(i)};\phi,\theta) = \sum_{y\in M} p(x^{(i)}|y;\theta) p(y;\phi)

where M is the range of values that y can take.

Then, another way to write the joint likelihood is:

p((x,y)|\theta) = \prod_{i}p(y^{(i)}|x^{(i)};\theta, \phi)p(x^{(i)};\theta, \phi)

We can see that the generative model also indirectly models p(x), and if the y are considered as Latent variables, then the model can be used as a Generative model of the inputs. This is often the approach in Semi-supervised learning.
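
For instance, a fitted model can generate inputs by ancestral sampling: draw y from p(y;\phi) and then x from p(x|y;\theta). The sketch below does this with made-up Gaussian parameters in the spirit of the earlier example (the specific parameter values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameters of a fitted generative model (hypothetical values): class prior phi
# and per-class Gaussian mean/variance for p(x|y).
phi = np.array([0.6, 0.4])
mu = np.array([-1.0, 1.0])
var = np.array([0.3, 0.3])

def sample_inputs(n):
    """Ancestral sampling from p(x) = sum_y p(x|y) p(y): treat y as a latent
    variable, sample y ~ p(y), then x ~ p(x|y)."""
    ys = rng.choice(len(phi), size=n, p=phi)
    return rng.normal(mu[ys], np.sqrt(var[ys]))

print(sample_inputs(5))
```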

See more at Generative vs discriminative models

Methods

Training and prediction with missing data

Generative models can be used for prediction when some input features are missing, because the missing features can be marginalized out (or imputed from their distribution) using the model.
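
As a sketch of how this works in the naive-Bayes case (continuing the hypothetical Bernoulli example from above, with made-up parameter values), a missing feature is simply summed out of p(x|y); under the conditional-independence assumption its factor sums to 1, so it just drops out of the posterior computation.

```python
import numpy as np

# Fitted Bernoulli naive-Bayes parameters (hypothetical values):
# prior[c] = p(y=c), theta[c, j] = p(x_j = 1 | y = c).
prior = np.array([0.6, 0.4])
theta = np.array([[0.2, 0.5, 0.3],
                  [0.8, 0.4, 0.7]])

def posterior_with_missing(x):
    """p(y | observed features). Entries of x equal to None are missing and are
    marginalized out; with conditionally independent features this just drops
    the corresponding factor from p(x|y)."""
    joint = prior.copy()
    for j, xj in enumerate(x):
        if xj is None:
            continue  # summing over x_j: its factor sums to 1 and disappears
        joint = joint * np.where(xj == 1, theta[:, j], 1 - theta[:, j])
    return joint / joint.sum()

print(posterior_with_missing([1, None, 0]))  # second feature unobserved
```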

Can also train with missing data. See chapter 8 in Murphy's book

Can also train with unlabelled data, i.e. one can do Semi-supervised learning (see here, e.g.).
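
A minimal sketch of this, assuming a two-class 1-D Gaussian generative model and made-up data, is an EM loop in which labelled points keep their labels (hard assignments) while unlabelled points contribute their posterior responsibilities p(y|x); both then enter the M-step updates of \phi and \theta.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labelled data (x_l, y_l) and unlabelled inputs x_u (hypothetical 1-D example).
x_l = np.array([-1.1, -0.9, 1.0, 1.2])
y_l = np.array([0, 0, 1, 1])
x_u = rng.normal(loc=np.repeat([-1.0, 1.0], 50), scale=0.5)

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initialize the parameters from the labelled data alone.
phi = np.array([(y_l == c).mean() for c in (0, 1)])
mu = np.array([x_l[y_l == c].mean() for c in (0, 1)])
var = np.array([1.0, 1.0])

for _ in range(20):
    # E-step: responsibilities p(y|x) for unlabelled points; labelled points keep their labels.
    lik = np.stack([normal_pdf(x_u, mu[c], var[c]) * phi[c] for c in (0, 1)])
    resp_u = lik / lik.sum(axis=0)
    resp_l = np.eye(2)[y_l].T  # one-hot responsibilities for the labelled points

    # M-step: maximize the joint likelihood using both hard and soft assignments.
    resp = np.concatenate([resp_l, resp_u], axis=1)
    x_all = np.concatenate([x_l, x_u])
    nk = resp.sum(axis=1)
    phi = nk / nk.sum()
    mu = (resp * x_all).sum(axis=1) / nk
    var = (resp * (x_all - mu[:, None]) ** 2).sum(axis=1) / nk

print(phi, mu, var)
```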


There are some other benefits as well, such as more principled (and therefore potentially more accurate) uncertainty estimates, more types of inference than a simple discriminative model allows, and the fact that prior knowledge can be easier to incorporate into a generative model than into a discriminative one.