A discriminative Markov network, used for Supervised learning tasks.
Neural networks [3.1] : Conditional random fields - motivation.
Motivation: sequence classification, where the random variables in the sequence aren't i.i.d.; in particular, they need not be independent of each other! More generally, CRFs are used when we want to model random variables that have dependencies (Graphical model).
Approach of conditional random fields (CRF): model the conditional distribution of the label sequence given the inputs directly, rather than the joint distribution, i.e. rather than modeling the Stochastic process generating the data (the generative approach). Notation:
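A sketch of the standard setup (the lecture's exact notation isn't reproduced in these notes, so these symbols are assumptions): input sequence $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_K)$, label sequence $\mathbf{y} = (y_1, \dots, y_K)$ with $y_k \in \{1, \dots, C\}$; the CRF models $p(\mathbf{y} \mid \mathbf{X})$.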
Neural networks [3.2] : Conditional random fields - linear chain CRF
Definition: the prediction for the current label depends on the input observed at that position and on the adjacent labels in the sequence.
Informal neural network representation
Neural networks [3.3] : Conditional random fields - context window. A context window refers to the set of inputs that affect a particular output label prediction.
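A minimal sketch of building a context-window input for position k (the window size `c` and the zero-padding at the sequence boundaries are assumptions, not necessarily the lecture's choices):

```python
import numpy as np

def context_window(X, k, c=2):
    """Concatenate the inputs at positions k-c .. k+c into one feature vector.

    X: (K, d) array of per-position input vectors; positions outside the
    sequence are zero-padded so every window has the same length.
    """
    K, d = X.shape
    window = np.zeros((2 * c + 1, d))
    for offset in range(-c, c + 1):
        j = k + offset
        if 0 <= j < K:
            window[offset + c] = X[j]
    return window.reshape(-1)   # feature vector of size (2c+1) * d
```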
Unary and pairwise log-factors
Terms that go into the exponential of the Softmax over label sequences; a term is unary or pairwise depending on whether it depends on one label or on two (adjacent) labels.
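Putting these together, a sketch of the standard linear chain form (the symbols $a_{y_k}(\mathbf{x}_k)$ for the unary and $w_{y_k, y_{k+1}}$ for the pairwise log-factors are assumptions, not necessarily the lecture's notation):

$$p(\mathbf{y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \exp\Big(\sum_{k=1}^{K} a_{y_k}(\mathbf{x}_k) + \sum_{k=1}^{K-1} w_{y_k,\, y_{k+1}}\Big), \qquad Z(\mathbf{X}) = \sum_{\mathbf{y}'} \exp\Big(\sum_{k} a_{y'_k}(\mathbf{x}_k) + \sum_{k} w_{y'_k,\, y'_{k+1}}\Big)$$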
Neural networks [3.4] : Conditional random fields - computing the partition function, gives rise to the $\alpha$ and $\beta$ vectors/tables. These give the partial sums in the partition function, involving all the terms to the left of a given position in the sequence for $\alpha$, and to the right for $\beta$. Thus they will be useful for things like computing Marginal probabilities.
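A sketch of the standard recursions, assuming the unary/pairwise notation above (indexing conventions may differ from the lecture):

$$\alpha_1(y) = \exp\big(a_y(\mathbf{x}_1)\big), \qquad \alpha_k(y) = \sum_{y'} \alpha_{k-1}(y')\, \exp\big(w_{y',y} + a_y(\mathbf{x}_k)\big)$$

$$\beta_K(y) = 1, \qquad \beta_k(y) = \sum_{y'} \exp\big(w_{y,y'} + a_{y'}(\mathbf{x}_{k+1})\big)\, \beta_{k+1}(y')$$

$$Z(\mathbf{X}) = \sum_{y} \alpha_K(y) = \sum_{y} \alpha_k(y)\, \beta_k(y) \quad \text{for any } k$$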
Computing the $\alpha$ and $\beta$ tables is known as the forward and backward pass of the forward-backward algorithm for CRFs, respectively. It has other names, such as Belief propagation (see Dynamical systems on networks) and the sum-product algorithm. A more numerically stable implementation works in log space; a further trick, subtracting the maximum before exponentiating (the log-sum-exp trick), makes it more stable still.
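A minimal log-space sketch with NumPy/SciPy (the array layout and the names `unary`/`pairwise` are assumptions, not the lecture's code):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward_log(unary, pairwise):
    """Log-space forward-backward for a linear chain CRF.

    unary:    (K, C) array of unary log-factors a_{y_k}(x_k)
    pairwise: (C, C) array of pairwise log-factors w_{y_k, y_{k+1}}
    Returns log_alpha, log_beta (both (K, C)) and log_Z.
    """
    K, C = unary.shape
    log_alpha = np.empty((K, C))
    log_beta = np.empty((K, C))

    # Forward pass: log_alpha[k, y] sums (in log space) over all prefixes ending in label y.
    log_alpha[0] = unary[0]
    for k in range(1, K):
        log_alpha[k] = unary[k] + logsumexp(log_alpha[k - 1][:, None] + pairwise, axis=0)

    # Backward pass: log_beta[k, y] sums over all suffixes that follow label y at position k.
    log_beta[K - 1] = 0.0
    for k in range(K - 2, -1, -1):
        log_beta[k] = logsumexp(pairwise + unary[k + 1][None, :] + log_beta[k + 1][None, :], axis=1)

    log_Z = logsumexp(log_alpha[K - 1])
    return log_alpha, log_beta, log_Z
```

With these tables, the log-marginal of a single label is `log_alpha[k, y] + log_beta[k, y] - log_Z`, which connects directly to the next section.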
Neural networks [3.5] : Conditional random fields - computing marginals. For instance, computing the probability of a single label at a given position in the sequence. Uses the $\alpha$ and $\beta$ values above.
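In terms of the $\alpha$, $\beta$ tables and $Z(\mathbf{X})$ above (same assumed notation):

$$p(y_k = y \mid \mathbf{X}) = \frac{\alpha_k(y)\, \beta_k(y)}{Z(\mathbf{X})}$$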
At each position $k$, pick the label with the highest marginal probability $p(y_k \mid \mathbf{X})$. It turns out that this choice is the one that minimizes the expected number of classification errors (generalization error) over the whole sequence, assuming the CRF is the true distribution! See proof (see also Bayesian statistics).
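A one-line sketch of the argument (assuming the CRF is the true conditional distribution): the expected number of per-position errors of a prediction $\hat{\mathbf{y}}$ decomposes as

$$\mathbb{E}_{\mathbf{y} \sim p(\cdot \mid \mathbf{X})}\Big[\sum_k 1_{[\hat y_k \ne y_k]}\Big] = \sum_k \big(1 - p(y_k = \hat y_k \mid \mathbf{X})\big),$$

so it is minimized by independently choosing $\hat y_k = \arg\max_y p(y_k = y \mid \mathbf{X})$ at each position.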
The other option, finding the most probable sequence as a whole, can be performed using the Viterbi decoding algorithm.
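A minimal Viterbi sketch in log space, reusing the assumed `unary`/`pairwise` arrays from above (max-product instead of sum-product):

```python
import numpy as np

def viterbi_decode(unary, pairwise):
    """Most probable label sequence for a linear chain CRF.

    unary: (K, C) unary log-factors, pairwise: (C, C) pairwise log-factors.
    Returns a length-K list of label indices.
    """
    K, C = unary.shape
    delta = np.empty((K, C))              # best score of any prefix ending in each label
    backptr = np.zeros((K, C), dtype=int)  # which previous label achieved that best score

    delta[0] = unary[0]
    for k in range(1, K):
        scores = delta[k - 1][:, None] + pairwise   # rows: previous label, cols: current label
        backptr[k] = scores.argmax(axis=0)
        delta[k] = unary[k] + scores.max(axis=0)

    # Trace back the best path from the last position.
    path = [int(delta[K - 1].argmax())]
    for k in range(K - 1, 0, -1):
        path.append(int(backptr[k][path[-1]]))
    return path[::-1]
```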
Neural networks [3.7] : Conditional random fields - factors, sufficient statistics and linear CRF
Factors, or compatibility functions, which use Sufficient statistics. A CRF is often represented as a product of factors.
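A hedged sketch of that product-of-factors view with a log-linear parameterization (the symbols $\phi$, $\psi$, $f$, $\mathbf{w}$ are assumptions, not necessarily the lecture's notation):

$$p(\mathbf{y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod_{k} \phi_k(y_k, \mathbf{X}) \prod_{k} \psi_k(y_k, y_{k+1}) = \frac{\exp\!\big(\mathbf{w}^\top f(\mathbf{X}, \mathbf{y})\big)}{Z(\mathbf{X})}$$

where $f(\mathbf{X}, \mathbf{y})$ collects the sufficient statistics (feature counts) of the sequence and $\mathbf{w}$ are the parameters, so the log-factors are linear in the parameters; this is the sense in which the model is a linear CRF.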