A probabilistic model for language: a probability distribution over strings (sequences of tokens) in a language.
Text generation. GPT2 https://mobile.twitter.com/Miles_Brundage/status/1115178488393154560
We use the chain rule for joint probabilities (exact, no approximation) to expand the joint probability distribution over tokens.
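Written out, for a token sequence $x_1, \dots, x_T$ (standard identity):

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```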
Our objective function can be the cross-entropy (a way of measuring how close two probability distributions are) relative to the empirically observed frequencies.
Perplexity: the exponentiated per-token cross-entropy (lower is better).
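As formulas (standard definitions; $p_\theta$ is the model, evaluated on $T$ held-out tokens; with base-2 logs the perplexity is $2^{H}$):

```latex
H = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),
\qquad
\mathrm{PPL} = \exp(H)
```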
Data
WikiText dataset.
An nth-order Markov chain, where each word's probability distribution depends on the previous n words.
For 3-grams, maximum likelihood corresponds to empirical counts: p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
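A minimal counting sketch of the 3-gram MLE (Python; the toy corpus and function names are made up for illustration):

```python
# Maximum-likelihood 3-gram estimation from raw counts.
from collections import defaultdict

def train_trigram_mle(sentences):
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            w1, w2, w3 = padded[i - 2], padded[i - 1], padded[i]
            trigram_counts[(w1, w2, w3)] += 1
            bigram_counts[(w1, w2)] += 1

    # MLE: p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
    def prob(w1, w2, w3):
        if bigram_counts[(w1, w2)] == 0:
            return 0.0  # unseen context: MLE assigns nothing useful here
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

    return prob

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
p = train_trigram_mle(corpus)
print(p("the", "cat", "sat"))  # 0.5 under MLE
```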
Maximum likelihood is a bad objective for language, apparently; instead we need to choose a good prior and do MAP.
Smoothing techniques
Many n-grams come out with probability 0. To handle these cases we can fall back to bi-gram estimates. This is the idea of back-off, which is a kind of smoothing technique (c.f. Laplace smoothing).
One very simple approach is linear interpolation: the probability of the next word is a convex combination of the probabilities assigned by the unigram, bigram, trigram, etc. models.
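For trigrams, for example (the $\lambda$'s are nonnegative, sum to one, and are typically tuned on held-out data):

```latex
p_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1})
  = \lambda_3\, \hat p(w_i \mid w_{i-2}, w_{i-1})
  + \lambda_2\, \hat p(w_i \mid w_{i-1})
  + \lambda_1\, \hat p(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```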
Kneser-Ney
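For reference, the interpolated Kneser-Ney bigram estimate, with absolute discount $d$; the lower-order term uses "continuation" counts (how many distinct left contexts a word appears in) rather than raw frequencies:

```latex
\begin{aligned}
p_{\text{KN}}(w_i \mid w_{i-1})
  &= \frac{\max\!\big(c(w_{i-1} w_i) - d,\ 0\big)}{\sum_{w} c(w_{i-1} w)}
   + \lambda(w_{i-1})\, p_{\text{cont}}(w_i), \\
p_{\text{cont}}(w_i)
  &= \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}{\bigl|\{\, (w', w) : c(w' w) > 0 \,\}\bigr|},
\qquad
\lambda(w_{i-1}) = \frac{d}{\sum_{w} c(w_{i-1} w)}\,\bigl|\{\, w : c(w_{i-1} w) > 0 \,\}\bigr|
\end{aligned}
```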
An empirical study of smoothing techniques for language modeling (Chen & Goodman).
We want our posterior distribution to agree with the real distribution in the real world. For instance, we want to recover Zipf's law (power laws, long tails...). These long tails are what make good-old-fashioned-AI rule systems fail, as there's too much stuff to account for.
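Zipf's law, roughly: the frequency of the $r$-th most frequent word falls off as a power of its rank,

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```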
N-grams are ~ a constant-time algorithm (essentially table lookups).
Long n-grams: data is too sparse → can't really capture long-term dependencies.
– N-grams can't capture correlations and other patterns they haven't seen, so they are bad at generalizing.
Neural n-gram models: use an artificial neural network to model the n-gram probability distribution; the input is the n previous words, the output a probability distribution over the next word (see the sketch below).
These better capture correlations and generalize better, by capturing semantics in the hidden layers.
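A minimal sketch of such a neural n-gram model (in the spirit of Bengio et al.'s neural probabilistic language model), assuming PyTorch; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class NeuralNGram(nn.Module):
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(context_size * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),  # logits over the next word
        )

    def forward(self, context):        # context: (batch, context_size) word ids
        e = self.embed(context)        # (batch, context_size, embed_dim)
        e = e.flatten(start_dim=1)     # concatenate the n previous word embeddings
        return self.mlp(e)             # (batch, vocab_size) logits

# Toy usage: predict the next word id from the 3 previous word ids.
model = NeuralNGram(vocab_size=1000)
context = torch.randint(0, 1000, (8, 3))   # batch of 8 contexts
target = torch.randint(0, 1000, (8,))      # next-word ids
loss = nn.CrossEntropyLoss()(model(context), target)  # cross-entropy objective
loss.backward()
```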
We can also use RNNs (and augmented RNNs).
We want the model to have memory going all the way back in the sequence.
These models are harder to parallelize than the neural n-gram models, because there is dependence between the network's computations at different points in the sequence.
Truncated Backpropagation through time
Add breaks in the backpropagation, maybe between sentences.
However, we do forward-propagate the hidden state across those breaks (see the sketch below).
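A training-loop sketch of truncated BPTT, assuming PyTorch; the key step is detaching the hidden state at each chunk boundary (forward-propagate the state, truncate the gradient). All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, chunk_len = 1000, 64, 128, 32

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, 4096))  # one long dummy sequence
hidden = None
for start in range(0, tokens.size(1) - 1 - chunk_len, chunk_len):
    inputs = tokens[:, start : start + chunk_len]
    targets = tokens[:, start + 1 : start + chunk_len + 1]
    out, hidden = rnn(embed(inputs), hidden)
    loss = nn.CrossEntropyLoss()(head(out).flatten(0, 1), targets.flatten())
    opt.zero_grad()
    loss.backward()           # gradients only flow within this chunk
    opt.step()
    hidden = hidden.detach()  # keep the state, cut the backprop graph here
```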
Skip-thought vectors
See New advances in deep learning (BERT, Transformer, Attention).