Intro vid – K Means with Titanic Dataset - Practical Machine Learning Tutorial with Python p.36
Input is an unlabelled data set x1,x2,...,xm∈Rn
- Initialize a set of points (centroids): μ1,...,μk∈Rn randomly
- Repeat until convergence:
- Set ci=argjmin∣∣xi−μj∣∣. (Assigning the point xi to the cluster with centroid closest to it.)
- μj=i=1∑m1{ci=j}i=1∑m1{ci=j}xi. (Update the cluster centroids to be the mean of the points assigned to it)
K-means visualization – Visualizing K-means equilibria
K-means is guaranteed to converge. If we define the distortion function:
J(c,μ)=i=1∑m∣∣xi−ci∣∣2,
k-means is Coordinate descent on J
Choosing the number of clusters is often done manually, but there are also automatic algorithms.
It can fall into local optima, and to check that, one can try different random initializations, and see if any converges to a lower value of J