Discriminative learning

cosmos 4th November 2016 at 2:43pm
Supervised learning

A type of Supervised learning where one learns the function $p(\text{output}|\text{input})$. See notes
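
For example (an illustration not in the original note), logistic regression is a discriminative model of exactly this form: it specifies only the conditional

$$p(y=1|x;\theta) = \frac{1}{1+e^{-\theta^\top x}}$$

and says nothing about $p(x)$.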

Learning method

The learning itself is done by Maximum likelihood. By Bayes' theorem, the posterior over the parameters is:

$$p(\theta|(x,y)) = \frac{p((x,y)|\theta)\,p(\theta)}{p(x,y)}$$

where $\theta$ are the parameters of the model, $y$ are the outputs, and $x$ are the input variables. Our aim is to maximize this posterior w.r.t. $\theta$, and as the denominator doesn't depend on $\theta$, we can ignore it. We can also assume that $p(\theta)$, our prior, is uniform. Then, we want to maximize:

$$p((x,y)|\theta) = \prod_{i} p(y^{(i)}|x^{(i)};\theta)\, p(x^{(i)})$$

where we assumed that all the data points are independent. We have also assumed that our model only models $p(y|x)$, so that $\theta$ doesn't appear in $p(x)$. This is the main difference with Generative supervised learning. When maximizing the log-likelihood

$$\ell(\theta) = \log p((x,y)|\theta) = \sum_i \log p(y^{(i)}|x^{(i)};\theta) + \sum_i \log p(x^{(i)})$$

only the first term depends on $\theta$; the second is fixed (and thus ignored in the optimization procedure).
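
As a concrete sketch (my own illustration; the logistic-regression form of $p(y|x;\theta)$, the gradient-ascent loop, and the toy data are assumptions, not part of this note), maximizing the first term could look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(theta, X, y):
    """First term of l(theta): sum_i log p(y_i | x_i; theta) for a logistic-regression model."""
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_discriminative(X, y, lr=0.1, steps=500):
    """Maximize the conditional log-likelihood by gradient ascent; p(x) is never modelled."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ theta))  # gradient of sum_i log p(y_i | x_i; theta)
        theta += lr * grad
    return theta

# Toy usage: two-feature binary classification data (made up for illustration)
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.3]])
y = np.array([0, 1, 0, 1])
theta_hat = fit_discriminative(X, y)
print(theta_hat, conditional_log_likelihood(theta_hat, X, y))
```

Each gradient step uses only $p(y^{(i)}|x^{(i)};\theta)$; the inputs $x^{(i)}$ enter as fixed data, never as something the model assigns probability to.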

See also Generative vs discriminative models

https://www.wikiwand.com/en/Discriminative_model

Deterministic discriminative models

These have all the probability centered around one output, so they are better described as directly modelling $y(x)$, the output as a function of the input.
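
A minimal sketch (my own illustration, reusing the hypothetical logistic-regression model from above): a deterministic discriminative model returns a single output for each input, e.g. the most probable class under $p(y|x;\theta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def y_of_x(x, theta):
    """Deterministic discriminative model: all probability mass sits on one output,
    so the model is just a function y(x) (here, the most probable class under
    a logistic-regression p(y|x;theta))."""
    return int(sigmoid(theta @ x) >= 0.5)

theta = np.array([1.0, -2.0])               # hypothetical parameters
print(y_of_x(np.array([1.0, 0.1]), theta))  # a single deterministic output, 0 or 1
```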

Examples