Language model

cosmos 16th August 2019 at 11:11pm
Natural language processing Natural language understanding

A Probabilistic model for Language: a Probability distribution over strings (sequences of tokens) in a Language

https://venturebeat.com/2019/08/13/nvidia-trains-worlds-largest-transformer-based-language-model/amp/

Text generation. GPT2 https://mobile.twitter.com/Miles_Brundage/status/1115178488393154560

Tree-based models: https://www.microsoft.com/en-us/research/publication/brief-report-ordered-neurons-integrating-tree-structures-into-recurrent-neural-networks/?OCID=msr_project_bestpaper_iclr_tw

Uses of conditional/joint probability models

Conditional language models

Machine translation

Question answering

Dialogue


We use the Chain rule for joint probabilities (an exact decomposition) to expand the joint probability distribution over tokens:
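For a sequence of tokens $w_1, \dots, w_T$:

$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$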

Our objective function can be the Cross entropy (a way of measuring how close two probability distributions are) between the model and the empirically observed frequencies.

Perplexity: $2^{\text{cross entropy}}$
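A minimal sketch of how these two quantities relate, assuming a model exposed through a hypothetical `model_prob(context, token)` function returning $p(w_t \mid w_{<t})$ (not from these notes, just an illustration):

```python
import math

def evaluate(model_prob, tokens):
    """Average per-token cross entropy (in bits) and the corresponding perplexity.

    model_prob(context, token) is assumed to return p(token | context).
    """
    total_log2_prob = 0.0
    for t, token in enumerate(tokens):
        p = model_prob(tokens[:t], token)       # p(w_t | w_1, ..., w_{t-1})
        total_log2_prob += math.log2(p)
    cross_entropy = -total_log2_prob / len(tokens)  # bits per token
    perplexity = 2 ** cross_entropy
    return cross_entropy, perplexity
```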


Data

WikiText dataset.


Language models

n-gram models

An (n-1)th-order Markov chain, where each word's probability distribution depends on the previous n-1 words.

Maximum likelihood corresponds to empirical counts of the form $p(w_3|w_1,w_2)=\frac{\text{count}(w_1,w_2,w_3)}{\text{count}(w_1,w_2)}$ for 3-grams.
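A minimal sketch of this maximum-likelihood estimate for trigrams, assuming the corpus is already a flat list of tokens (function and variable names are illustrative):

```python
from collections import Counter

def trigram_mle(tokens):
    """Maximum-likelihood trigram probabilities p(w3 | w1, w2) from raw counts."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2, w3): count / bigram_counts[(w1, w2)]
        for (w1, w2, w3), count in trigram_counts.items()
    }
```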

Maximum likelihood is apparently a bad objective for language; instead one needs to choose a good prior and do MAP estimation.

Smoothing techniques

Many n-grams never occur in the training data, so they come out with probability 0. To improve on this, we can fall back to bi-grams (or other lower-order models) in those cases. This is the idea of back-off, which is a kind of smoothing technique (c.f. Laplace smoothing).

One very simple way is linear interpolation: the probability of the next word is a convex combination of the probabilities assigned by the uni-gram, bi-gram, tri-gram, etc. models (see the sketch below).
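A minimal sketch of linear interpolation over uni-, bi- and tri-gram estimates; the component models and the mixture weights `lambdas` are placeholders (in practice the weights are tuned on held-out data):

```python
def interpolated_prob(w3, w1, w2, unigram, bigram, trigram, lambdas=(0.1, 0.3, 0.6)):
    """Convex combination of uni-, bi- and tri-gram probabilities for p(w3 | w1, w2).

    unigram, bigram, trigram are dicts of MLE estimates; lambdas must sum to 1.
    """
    l1, l2, l3 = lambdas
    return (l1 * unigram.get(w3, 0.0)
            + l2 * bigram.get((w2, w3), 0.0)
            + l3 * trigram.get((w1, w2, w3), 0.0))
```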

Kneser-Ney

An empirical study of smoothing techniques for language modeling

We want our posterior distribution to agree with the real distribution in the real world. For instance, we want to recover Zipf's law (Power laws, long tails..). These long tails are what make good-old-fashioned-AI rule systems fail, as there's too much stuff to account for.

~ constant time algo.

Long n-grams: data is too sparse -> can't really capture long-term dependencies.

– N-grams can't capture correlations and other patterns they haven't seen, so they're bad at generalizing

Neural n-gram language models

These will be able to better capture correlations and to generalize, by better capturing semantics in the hidden layers.

Use an Artificial neural network to model the n-gram probability distribution: the input is the n-1 previous words, the output is a probability distribution over the next word (see the sketch below).
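A minimal sketch of such a feed-forward neural n-gram language model in PyTorch (in the spirit of Bengio et al., 2003); the class name and hyperparameters are illustrative, not from these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNGramLM(nn.Module):
    """Embed the context words, pass them through a hidden layer,
    and output a distribution over the next word."""

    def __init__(self, vocab_size, context_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                    # context: (batch, context_size) word ids
        e = self.embed(context).flatten(1)          # (batch, context_size * embed_dim)
        h = torch.tanh(self.hidden(e))
        return F.log_softmax(self.out(h), dim=-1)   # log p(next word | context)
```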

Recurrent neural networks

And also Augmented RNN.

Want to have memory going all the way back..
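One standard formulation: the hidden state summarizes the whole history, $h_t = f(W h_{t-1} + U x_t)$, and the next word is predicted as $p(w_{t+1} \mid w_{\le t}) = \mathrm{softmax}(V h_t)$.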

These models are harder to parallelize than the neural n-gram models, because there is dependence between the network's computations at different points in the sequence.

Truncated Backpropagation through time

Add breaks in the backpropagation, maybe between sentences.

However, we do still forward-propagate the hidden state across the breaks!
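A minimal PyTorch-style sketch of truncated backpropagation through time, assuming a hypothetical `rnn_lm` model returning (logits, hidden) and an iterable of token chunks; the key point is that the hidden state is carried forward but detached, so gradients stop at each chunk boundary:

```python
import torch

def train_truncated_bptt(rnn_lm, optimizer, loss_fn, chunks):
    """chunks: iterable of (inputs, targets) tensors, each one truncation window."""
    hidden = None
    for inputs, targets in chunks:
        optimizer.zero_grad()
        logits, hidden = rnn_lm(inputs, hidden)    # forward-propagate the state
        loss = loss_fn(logits.flatten(0, 1), targets.flatten())
        loss.backward()                            # gradients only flow within this chunk
        optimizer.step()
        hidden = hidden.detach()                   # cut the graph: no backprop into earlier chunks
                                                   # (for an LSTM, detach each tensor in the tuple)
    return hidden
```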

Skip-thought vectors

Attention-based models

See New advances in deep learning (BERT, Transformer, Attention)

https://blog.openai.com/better-language-models/#task6