K-means algorithm

cosmos 16th December 2016 at 2:32pm

The input is an unlabelled data set $x_1, x_2, ..., x_m \in \mathbb{R}^n$ and the number of clusters $k$.

Initialize $k$ cluster centroids $\mu_1, ..., \mu_k \in \mathbb{R}^n$ randomly.
Repeat until convergence:
1. For every $i$, set $c_i = \arg\min\limits_j || x_i - \mu_j ||$ . (Assign the point $x_i$ to the cluster whose centroid is closest to it.)
2. For every $j$, set $\mu_j = \frac{\sum\limits_{i=1}^m 1\{c_i=j\}x_i}{\sum\limits_{i=1}^m 1\{c_i=j\}}$ . (Move each cluster centroid to the mean of the points assigned to it.)
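The two steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the names `kmeans` and `sq_dist` are made up for this example, the centroids are initialized by sampling $k$ data points (one common way to "initialize randomly"), and a cluster that ends up empty simply keeps its old centroid, a case the update formula leaves undefined.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Assumption: initialise the centroids as k data points sampled without replacement.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # no assignment changed, so the centroids would not move either
        assignments = new_assignments
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Minimizing the squared distance in step 1 gives the same assignment as minimizing the distance itself, so the square root is skipped.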

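K-means is guaranteed to converge. To see why, define the distortion function:

$J(c, \mu) = \sum\limits_{i=1}^m || x_i - \mu_{c_i} || ^2$ .

K-means is coordinate descent on $J$: step 1 minimizes $J$ over $c$ with $\mu$ held fixed, and step 2 minimizes $J$ over $\mu$ with $c$ held fixed. So $J$ never increases from one step to the next, and since it is bounded below by zero, its value must converge.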
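As a quick sanity check, the distortion can be recorded after every assignment step and every centroid update, measuring each point against the centroid of the cluster it is assigned to. The sketch below is illustrative only: `distortion` and `distortion_trace` are hypothetical helper names, the data are random Gaussian points generated for the example, and the loop runs for a fixed number of iterations instead of testing for convergence.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion(points, c, mu):
    """J(c, mu): sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(x, mu[ci]) for x, ci in zip(points, c))

def distortion_trace(points, k, seed=0, n_iter=10):
    rng = random.Random(seed)
    mu = rng.sample(points, k)  # assumption: start from k randomly chosen data points
    trace = []
    for _ in range(n_iter):
        # Step 1: minimise J over the assignments c, with the centroids mu fixed.
        c = [min(range(k), key=lambda j: sq_dist(x, mu[j])) for x in points]
        trace.append(distortion(points, c, mu))
        # Step 2: minimise J over the centroids mu, with the assignments c fixed.
        for j in range(k):
            members = [x for x, ci in zip(points, c) if ci == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                mu[j] = tuple(sum(col) / len(members) for col in zip(*members))
        trace.append(distortion(points, c, mu))
    return trace

# Illustrative data: 60 random 2-D points.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
trace = distortion_trace(points, k=3)

# Each recorded value should be no larger than the one before it.
non_increasing = all(b <= a + 1e-9 for a, b in zip(trace, trace[1:]))
```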

The number of clusters $k$ is often chosen manually, but there are also algorithms that choose it automatically.

Because $J$ is non-convex, k-means can get stuck in a local optimum. To guard against this, one can run it several times with different random initializations and keep the clustering that reaches the lowest value of $J$.
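A small sketch of the restart idea, under the same assumptions as a basic k-means (centroids initialized from randomly sampled data points, empty clusters left where they are). The names `kmeans` and `best_of_restarts` are hypothetical. The four-point data set is chosen so that both a good and a bad solution exist: splitting left/right gives $J = 1$, while splitting top/bottom is a stable local optimum with $J = 100$.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One k-means run; returns (J, assignments, centroids)."""
    mu = rng.sample(points, k)  # assumption: initialise from k random data points
    c = None
    for _ in range(max_iter):
        new_c = [min(range(k), key=lambda j: sq_dist(x, mu[j])) for x in points]
        if new_c == c:
            break
        c = new_c
        for j in range(k):
            members = [x for x, ci in zip(points, c) if ci == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                mu[j] = tuple(sum(col) / len(members) for col in zip(*members))
    J = sum(sq_dist(x, mu[ci]) for x, ci in zip(points, c))
    return J, c, mu

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest distortion."""
    rng = random.Random(seed)  # shared generator, so every run gets a different start
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    best = min(runs, key=lambda run: run[0])
    return best, [run[0] for run in runs]

# Corners of a wide rectangle: which optimum a run reaches depends on its start.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
(best_J, best_c, best_mu), all_J = best_of_restarts(points, k=2)
```

Comparing `all_J` across runs shows which initializations ended in the worse optimum; `best_J` is the value one would keep.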