Revision as of 14:26, 6 May 2014

Introduction to Clustering A slecture by CS student David Runyan

Introduction

In class, we covered the simple Bayesian classifier. This form of classification falls under a category known as supervised learning. What this means is that a set of labelled data data is used to to "train" the underlying model. However, it is not always possible to have such a data set, yet we may still wish to discover some form of underlying structure in an unlabelled data set. Such a task falls under the category of unsupervised learning.

Clustering is a form of unsupervised learning. "Clustering is the problem of identifying groups, or clusters of data points in multidimensional space". For example, consider the following data set:

Figure 1

Although this data set is not labelled, we can still clearly see three different groups, or clusters, of data points, as shown here:

Figure 2

In some ways, each cluster can be thought of a different "class" of data points. In this way, clustering can be seen as the unsupervised analog of the classification problem.

There is no "best" clustering algorithm. Any algorithm imposes some sort of structure on the data. If the structure imposed by the clustering algorithm matches the true structure of the data, then the actual clusters may be discovered. As a result, there are many different clustering methods. Here I will describe a popular and intuitive method known as K-means clustering.

K-means clustering

One way of thinking about a cluster is a set of points who's pairwise distances are small compared to the distance to the points of a different cluster. This is illustrated in the following figure:

Figure 3

In the figure above, the magenta line represents a general distance between two points of the same cluster and the red line represents a general distance between points of different clusters. Intuitively, we expect the red line to be significantly longer than the magenta line.

Back to ECE662, Spring 2014

@@ Line 26: / Line 26: @@
 <center>[[Image:runyan2.jpg|frame|none|alt=Alt text|<font size= 4> '''Figure 2''' </font size>]] </center> <br />
+In some ways, each cluster can be thought of a different "class" of data points. In this way, clustering can be seen as the unsupervised analog of the classification problem.
+There is no "best" clustering algorithm. Any algorithm imposes some sort of structure on the data. If the structure imposed by the clustering algorithm matches the true structure of the data, then the actual clusters may be discovered. As a result, there are many different clustering methods. Here I will describe a popular and intuitive method known as K-means clustering.
 ----
+<center>
+=K-means clustering=
+</center>
+One way of thinking about a cluster is a set of points who's pairwise distances are small compared to the distance to the points of a different cluster. This is illustrated in the following figure:
+<center>[[Image:runyan3.jpg|frame|none|alt=Alt text|<font size= 4> '''Figure 3''' </font size>]] </center> <br />
+In the figure above, the magenta line represents a general distance between two points of the same cluster and the red line represents a general distance between points of different clusters. Intuitively, we expect the red line to be significantly longer than the magenta line.
 ----
 ----
 [[2014_Spring_ECE_662_Boutin|Back to ECE662, Spring 2014]]

Difference between revisions of "SlectureDavidRunyanCS662Spring14" - Rhea

Revision as of 14:26, 6 May 2014

Introduction

K-means clustering

Alumni Liaison