K-means algorithm

cosmos 16th December 2016 at 2:32pm
Clustering

Intro vid: K Means with Titanic Dataset - Practical Machine Learning Tutorial with Python p.36

Input is an unlabelled data set $x_1, x_2, ..., x_m \in \mathbb{R}^n$

  1. Initialize a set of points (centroids) $\mu_1, ..., \mu_k \in \mathbb{R}^n$ randomly
  2. Repeat until convergence:
    1. Set $c_i = \arg\min\limits_j \| x_i - \mu_j \|$. (Assign the point $x_i$ to the cluster whose centroid is closest to it.)
    2. μj=i=1m1{ci=j}xii=1m1{ci=j}\mu_j = \frac{\sum\limits_{i=1}^m 1\{c_i=j\}x_i}{\sum\limits_{i=1}^m 1\{c_i=j\}}. (Update the cluster centroids to be the mean of the points assigned to it)

K-means visualization: Visualizing K-means equilibria

K-means is guaranteed to converge. If we define the distortion function:

$J(c, \mu) = \sum\limits_{i=1}^m \| x_i - \mu_{c_i} \|^2$,

k-means is Coordinate descent on $J$: the assignment step minimizes $J$ over $c$ with $\mu$ held fixed, and the update step minimizes $J$ over $\mu$ with $c$ held fixed, so $J$ never increases. Since there are only finitely many possible assignments, the algorithm must converge.
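
A small helper, as a sketch, for computing $J$ so convergence can be monitored (names are assumed: `X` is the $(m, n)$ data array, `c` the assignments, `mu` the centroids):

```python
import numpy as np

def distortion(X, c, mu):
    """Distortion J(c, mu) = sum_i ||x_i - mu_{c_i}||^2."""
    # mu[c] picks, for each point, the centroid of the cluster it is assigned to
    return np.sum(np.linalg.norm(X - mu[c], axis=1) ** 2)
```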

Choosing the number of clusters is often done manually, but there are also automatic algorithms.
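
One common heuristic (my example, not one named in the note) is the elbow method: run K-means for a range of values of $k$ and look for a kink in the distortion curve. A sketch using scikit-learn, whose `inertia_` attribute is the final value of $J$:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # plot inertia vs. k and look for an "elbow"
```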

It can fall into local optima; to check for this, one can try several different random initializations and see whether any of them converges to a lower value of $J$.
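
A sketch of this restart strategy, reusing the `kmeans` and `distortion` functions from the sketches above and assuming `X` is the same data array (scikit-learn's `KMeans` does the same thing internally via its `n_init` parameter):

```python
import numpy as np

best_J, best = np.inf, None
for seed in range(20):                    # 20 random initializations
    c, mu = kmeans(X, k=3, rng=seed)      # kmeans sketch from above
    J = distortion(X, c, mu)
    if J < best_J:                        # keep the run with the lowest distortion
        best_J, best = J, (c, mu)
```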