Topic modelling

cosmos 13th January 2017 at 12:58pm
Machine learning

Latent Dirichlet allocation

The number of topics is given (fixed in advance).

Topic: a probability distribution over the terms of a fixed vocabulary.

Preprocessing: the R package tm. Remove stop words (commonly occurring, not useful). SnowballC (R package): stemming software.
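
The preprocessing steps above (stop word removal, then stemming) can be sketched in a few lines. This is only an illustration in plain Python, not the tm or SnowballC API: the stop word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real Snowball stemmer.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "on"}

def naive_stem(word):
    """Crude suffix stripping; an assumed stand-in for a real Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats are chasing the dogs in the garden"))
# → ['cat', 'chas', 'dog', 'garden']
```

The stems need not be real words ("chas"); what matters for topic modelling is that variants of the same word map to the same term.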

Algorithm: clustering algorithm.

Self-consistency: each word is assigned a topic based on the topics assigned to the other words.
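
One common way to realise this kind of self-consistency is collapsed Gibbs sampling. The note does not name an inference method, so treating it as Gibbs sampling is an assumption. In the sketch below, each word's topic is resampled from counts built from all the other words' current assignments. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy corpus are all hypothetical.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """docs: list of documents, each a list of integer word ids in [0, vocab_size)."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # words in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w has topic k
    topic_total = [0] * n_topics                               # total words with topic k

    # Start from a random topic assignment for every word.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's own assignment so only the other words count.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t: how common t is in this document, times
                # how often t is used for this word elsewhere in the corpus.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word

# Toy corpus over a 4-word vocabulary (word ids 0..3).
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

The count tables are the only state: after enough sweeps, `doc_topic` gives a rough per-document topic mix and `topic_word` a rough per-topic term distribution.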

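Generative model: each document has a probability distribution over topics (with a Dirichlet prior on its parameters). Each word is then drawn independently given that distribution: first a topic is drawn from the document's topic distribution, then the word is drawn from that topic's distribution over terms.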
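
A minimal sketch of this generative process for a single document, using only the standard library. The Dirichlet draw is built from normalised gamma variates. The names `sample_dirichlet` and `generate_document` and the toy `topic_term` matrix are hypothetical. The topic-term distributions are taken as given here rather than drawn from their own prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(topic_term, alpha, n_words, rng):
    """topic_term[k][w]: probability of term w under topic k (assumed given)."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    n_topics = len(topic_term)
    vocab_size = len(topic_term[0])
    words = []
    for _ in range(n_words):
        # Each word: pick a topic from theta, then a term from that topic.
        k = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(range(vocab_size), weights=topic_term[k])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
# Two toy topics over a 4-term vocabulary: topic 0 uses terms 0-1, topic 1 uses terms 2-3.
topic_term = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document(topic_term, alpha=[0.5, 0.5], n_words=10, rng=rng)
print(theta, words)
```

Small Dirichlet parameters (below 1) tend to give documents dominated by a few topics; larger values give more even mixtures.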