Introduction

The k-means algorithm clusters n objects, described by their attribute vectors, into k partitions, with k < n. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, i.e., the squared error function:


$ V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2 $

where $S_i$ is the set of data vectors assigned to the i-th cluster and $\mu_i$ is the mean of the vectors in $S_i$.

In other words, the algorithm chooses clusters so that the average distance between the data vectors and the chosen means (the model parameters) is minimized. The means are then a good summary of how the data are distributed in the space.
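As a concrete illustration of this objective, the sketch below computes the squared-error function V for a given set of means and cluster assignments. It is a minimal example in Python with NumPy; the function name squared_error and the argument layout are choices made here, not taken from any particular library.

```python
import numpy as np

def squared_error(X, means, labels):
    """Total intra-cluster variance V: the sum of squared distances
    from each data vector to the mean of the cluster it is assigned to."""
    # X: (n, d) data vectors; means: (k, d) cluster means;
    # labels: (n,) index of the assigned cluster for each data vector.
    diffs = X - means[labels]          # difference between each vector and its mean
    return float(np.sum(diffs ** 2))   # sum of squared Euclidean distances
```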

Usually when we implement the k-means algorithm we use an iterative procedure (a code sketch follows the steps below):

1) Randomly initialize k means.

2) For every data vector, determine which mean it is "closest" to and record that assignment.

3) For each mean, recompute it as the average of the data vectors currently assigned to it, and update the mean accordingly.

4) Check whether the average distance (or the change in the means) is below some threshold. If yes, we are done! If not, go back to step 2.
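The following is a minimal sketch of this procedure in Python with NumPy. The function name kmeans and the parameters k, max_iters, and tol are chosen here for illustration; a production implementation would also use smarter initialization, run multiple restarts, and handle empty clusters more carefully.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=None):
    """Minimal k-means sketch. X is an (n, d) array of data vectors."""
    rng = np.random.default_rng(seed)

    # Step 1: randomly initialize k means by picking k distinct data vectors.
    means = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign every data vector to the closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # shape (n, k)
        labels = dists.argmin(axis=1)

        # Step 3: recompute each mean from the data vectors assigned to it.
        new_means = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])

        # Step 4: stop when the means barely move; otherwise repeat from step 2.
        converged = np.linalg.norm(new_means - means) < tol
        means = new_means
        if converged:
            break

    return means, labels

# Example usage with synthetic data:
# X = np.random.default_rng(0).normal(size=(200, 2))
# means, labels = kmeans(X, k=3)
```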


