A discriminative Markov network, used for Supervised learning tasks.
Neural networks [3.1] : Conditional random fields - motivation.
Motivation: sequence classification, where the random variables in the sequence aren't i.i.d.; in particular, they need not be independent of each other! More generally, CRFs are used when we want to model random variables that have dependencies (Graphical model).
Approach of conditional random fields (CRF): model the conditional distribution of the label sequence given the inputs directly, rather than the joint distribution, i.e. rather than modeling the Stochastic process generating the data (the generative approach). Notation:
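A sketch of the standard setup (the lecture's exact notation isn't reproduced in these notes, so these symbols are assumptions): input sequence $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_K)$, label sequence $\mathbf{y} = (y_1, \dots, y_K)$ with $y_k \in \{1, \dots, C\}$; the CRF models $p(\mathbf{y} \mid \mathbf{X})$.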
Neural networks [3.2] : Conditional random fields - linear chain CRF
Definition: the prediction for the current label depends on the input observed at that position and on the adjacent labels in the sequence.
Informal neural network representation
Neural networks [3.3] : Conditional random fields - context window. A context window refers to the set of inputs that affect a particular output label prediction.
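A minimal sketch of building a context-window input for position k (the window size `c` and the zero-padding at the sequence boundaries are assumptions, not necessarily the lecture's choices):

```python
import numpy as np

def context_window(X, k, c=2):
    """Concatenate the inputs at positions k-c .. k+c into one feature vector.

    X: (K, d) array of per-position input vectors; positions outside the
    sequence are zero-padded so every window has the same length.
    """
    K, d = X.shape
    window = np.zeros((2 * c + 1, d))
    for offset in range(-c, c + 1):
        j = k + offset
        if 0 <= j < K:
            window[offset + c] = X[j]
    return window.reshape(-1)   # feature vector of size (2c+1) * d
```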
Unary and pairwise log-factors
Terms that go into the exponential of the Softmax over label sequences; a term is unary or pairwise depending on whether it depends on one label or on two (adjacent) labels.
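Putting these together, a sketch of the standard linear chain form (the symbols $a_{y_k}(\mathbf{x}_k)$ for the unary and $w_{y_k, y_{k+1}}$ for the pairwise log-factors are assumptions, not necessarily the lecture's notation):

$$p(\mathbf{y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \exp\Big(\sum_{k=1}^{K} a_{y_k}(\mathbf{x}_k) + \sum_{k=1}^{K-1} w_{y_k,\, y_{k+1}}\Big), \qquad Z(\mathbf{X}) = \sum_{\mathbf{y}'} \exp\Big(\sum_{k} a_{y'_k}(\mathbf{x}_k) + \sum_{k} w_{y'_k,\, y'_{k+1}}\Big)$$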
Neural networks [3.4] : Conditional random fields - computing the partition function, gives rise to the $\alpha$ and $\beta$ vectors/tables. These give the partial sums in the partition function, involving all the terms to the left of a given position in the sequence for $\alpha$, and to the right for $\beta$. Thus they will be useful for things like computing Marginal probabilities.
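A sketch of the standard recursions, assuming the unary/pairwise notation above (indexing conventions may differ from the lecture):

$$\alpha_1(y) = \exp\big(a_y(\mathbf{x}_1)\big), \qquad \alpha_k(y) = \sum_{y'} \alpha_{k-1}(y')\, \exp\big(w_{y',y} + a_y(\mathbf{x}_k)\big)$$

$$\beta_K(y) = 1, \qquad \beta_k(y) = \sum_{y'} \exp\big(w_{y,y'} + a_{y'}(\mathbf{x}_{k+1})\big)\, \beta_{k+1}(y')$$

$$Z(\mathbf{X}) = \sum_{y} \alpha_K(y) = \sum_{y} \alpha_k(y)\, \beta_k(y) \quad \text{for any } k$$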
Computing the $\alpha$ and $\beta$ tables is known as the forward and backward pass of the forward-backward algorithm for CRFs, respectively. It has other names, such as Belief propagation (see Dynamical systems on networks) and the sum-product algorithm. A more numerically stable implementation works in log space; a further trick, subtracting the maximum before exponentiating (the log-sum-exp trick), makes it more stable still.
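A minimal log-space sketch with NumPy/SciPy (the array layout and the names `unary`/`pairwise` are assumptions, not the lecture's code):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward_log(unary, pairwise):
    """Log-space forward-backward for a linear chain CRF.

    unary:    (K, C) array of unary log-factors a_{y_k}(x_k)
    pairwise: (C, C) array of pairwise log-factors w_{y_k, y_{k+1}}
    Returns log_alpha, log_beta (both (K, C)) and log_Z.
    """
    K, C = unary.shape
    log_alpha = np.empty((K, C))
    log_beta = np.empty((K, C))

    # Forward pass: log_alpha[k, y] sums (in log space) over all prefixes ending in label y.
    log_alpha[0] = unary[0]
    for k in range(1, K):
        log_alpha[k] = unary[k] + logsumexp(log_alpha[k - 1][:, None] + pairwise, axis=0)

    # Backward pass: log_beta[k, y] sums over all suffixes that follow label y at position k.
    log_beta[K - 1] = 0.0
    for k in range(K - 2, -1, -1):
        log_beta[k] = logsumexp(pairwise + unary[k + 1][None, :] + log_beta[k + 1][None, :], axis=1)

    log_Z = logsumexp(log_alpha[K - 1])
    return log_alpha, log_beta, log_Z
```

With these tables, the log-marginal of a single label is `log_alpha[k, y] + log_beta[k, y] - log_Z`, which connects directly to the next section.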
Neural networks [3.5] : Conditional random fields - computing marginals. For instance, computing the probability of a single label at a given position in the sequence. Uses the $\alpha$ and $\beta$ values above.
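In terms of the $\alpha$, $\beta$ tables and $Z(\mathbf{X})$ above (same assumed notation):

$$p(y_k = y \mid \mathbf{X}) = \frac{\alpha_k(y)\, \beta_k(y)}{Z(\mathbf{X})}$$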
At each position $k$, pick the label with the highest marginal probability $p(y_k \mid \mathbf{X})$. It turns out that this choice is the one that minimizes the expected number of classification errors (generalization error) over the whole sequence, assuming the CRF is the true distribution! See proof (see also Bayesian statistics).
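A one-line sketch of the argument (assuming the CRF is the true conditional distribution): the expected number of per-position errors of a prediction $\hat{\mathbf{y}}$ decomposes as

$$\mathbb{E}_{\mathbf{y} \sim p(\cdot \mid \mathbf{X})}\Big[\sum_k 1_{[\hat y_k \ne y_k]}\Big] = \sum_k \big(1 - p(y_k = \hat y_k \mid \mathbf{X})\big),$$

so it is minimized by independently choosing $\hat y_k = \arg\max_y p(y_k = y \mid \mathbf{X})$ at each position.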
The other option, finding the most probable sequence as a whole, can be performed using the Viterbi decoding algorithm.
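A minimal Viterbi sketch in log space, reusing the assumed `unary`/`pairwise` arrays from above (max-product instead of sum-product):

```python
import numpy as np

def viterbi_decode(unary, pairwise):
    """Most probable label sequence for a linear chain CRF.

    unary: (K, C) unary log-factors, pairwise: (C, C) pairwise log-factors.
    Returns a length-K list of label indices.
    """
    K, C = unary.shape
    delta = np.empty((K, C))              # best score of any prefix ending in each label
    backptr = np.zeros((K, C), dtype=int)  # which previous label achieved that best score

    delta[0] = unary[0]
    for k in range(1, K):
        scores = delta[k - 1][:, None] + pairwise   # rows: previous label, cols: current label
        backptr[k] = scores.argmax(axis=0)
        delta[k] = unary[k] + scores.max(axis=0)

    # Trace back the best path from the last position.
    path = [int(delta[K - 1].argmax())]
    for k in range(K - 1, 0, -1):
        path.append(int(backptr[k][path[-1]]))
    return path[::-1]
```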
Neural networks [3.7] : Conditional random fields - factors, sufficient statistics and linear CRF
Factors, or compatibility functions, which use Sufficient statistics. A CRF is often represented as a product of factors.
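A hedged sketch of that product-of-factors view with a log-linear parameterization (the symbols $\phi$, $\psi$, $f$, $\mathbf{w}$ are assumptions, not necessarily the lecture's notation):

$$p(\mathbf{y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod_{k} \phi_k(y_k, \mathbf{X}) \prod_{k} \psi_k(y_k, y_{k+1}) = \frac{\exp\!\big(\mathbf{w}^\top f(\mathbf{X}, \mathbf{y})\big)}{Z(\mathbf{X})}$$

where $f(\mathbf{X}, \mathbf{y})$ collects the sufficient statistics (feature counts) of the sequence and $\mathbf{w}$ are the parameters, so the log-factors are linear in the parameters; this is the sense in which the model is a linear CRF.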