Lecture 23 Lecture notes


Clustering Method, given the pairwise distances

Let $$ d $$ be the number of objects. Let $DIST_{ij}$ denote the distance between objects $$ X_i $$ and $$ X_j $$ . What "distance" means depends on the application. For example, when dealing with data from a psychological study, we may have to ask the psychologist what the "distance" between two given concepts is. These distances form the input to our clustering algorithm.

The distances must be

- symmetric between any two objects ( $DIST_{ij}=DIST_{ji}$ )

- non-negative ( $DIST_{ij} \geq 0$ )

- zero between an object and itself ( $DIST_{ii}=0$ )

- consistent with the triangle inequality ( $DIST_{ik} \leq DIST_{ij}+DIST_{jk}$ )
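The four constraints above can be checked mechanically before clustering. The following is a minimal sketch, assuming the distances are stored as a square list of lists; the function name `is_valid_distance_matrix` and the tolerance `tol` are illustrative choices, not part of the lecture.

```python
from itertools import product

def is_valid_distance_matrix(dist, tol=1e-12):
    """Check symmetry, non-negativity, zero diagonal and the triangle inequality."""
    d = len(dist)
    for i in range(d):
        if len(dist[i]) != d:          # the matrix must be square
            return False
        if abs(dist[i][i]) > tol:      # DIST_ii = 0
            return False
        for j in range(d):
            if dist[i][j] < 0:         # DIST_ij >= 0
                return False
            if abs(dist[i][j] - dist[j][i]) > tol:   # DIST_ij = DIST_ji
                return False
    # Triangle inequality: DIST_ik <= DIST_ij + DIST_jk for all i, j, k.
    for i, j, k in product(range(d), repeat=3):
        if dist[i][k] > dist[i][j] + dist[j][k] + tol:
            return False
    return True

good = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # 5 > 1 + 1 violates the triangle inequality
print(is_valid_distance_matrix(good), is_valid_distance_matrix(bad))
```

The triangle check is cubic in the number of objects, which is acceptable for small data sets but may need sampling for large ones.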

Table 1

Idea:

         If  $DIST_{ij}$  small =>  $$ X_i $$ ,  $$ X_j $$  in same cluster.

         If  $DIST_{ij}$  large =>  $$ X_i $$ ,  $$ X_j $$  in different clusters.

How do we define small or large? One option is to fix a threshold $$ t_0 $$ such that

          $$ t_0 $$  < "typical" distance between clusters, and

          $$ t_0 $$  > "typical" distance within clusters.

Consider the following distribution of objects. This is a very favorable situation, and almost any clustering method will work well.

Figure 1

A problem arises in the following situation:

Figure 2

Although we see two separate clusters with a thin distribution of objects in between, the thresholding approach described above identifies only one cluster, because the objects in between link the two groups through a chain of small distances.
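One way to read the threshold rule is sketched below, assuming that "in the same cluster" is applied transitively: any two objects closer than $$ t_0 $$ are merged, and merges chain together. The function name `threshold_clusters`, the union-find bookkeeping, and the one-dimensional example points are illustrative assumptions, not the lecture's own data.

```python
def threshold_clusters(dist, t0):
    """Group objects so that any pair with dist < t0 ends up in the same cluster."""
    d = len(dist)
    parent = list(range(d))            # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(d):
        for j in range(i + 1, d):
            if dist[i][j] < t0:        # "small" distance: merge the two clusters
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(d):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

def line_distances(points):
    """Pairwise distances for points on a line (hypothetical example data)."""
    return [[abs(p - q) for q in points] for p in points]

separated = line_distances([0, 1, 2, 10, 11, 12])   # two well-separated groups
bridged = line_distances(list(range(13)))           # same groups with a thin bridge
print(threshold_clusters(separated, 1.5))
print(threshold_clusters(bridged, 1.5))
```

With the separated points the rule recovers two groups; with the bridge every consecutive pair is closer than the threshold, so everything collapses into a single cluster.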

Graph Theory Clustering

Dataset: $\{x_1, x_2, \dots , x_d\}$ ; no feature vectors are given.

Given: the pairwise distances $$ dist(x_i , x_j) $$

Construct a graph:

Nodes represent the objects.
Edges are relations between objects.
Edge weights represent distances.

Definitions:

A complete graph on $$ d $$ nodes is a graph in which every pair of distinct nodes is joined by an edge; it therefore has $$ d(d-1)/2 $$ edges.

Example:

Number of nodes d = 4
Number of edges e = 6
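The edge count in this example can be confirmed by listing every unordered pair of nodes. This is a small sketch; the helper name `complete_graph_edges` and the 0-based node labels are assumptions made for illustration.

```python
from itertools import combinations

def complete_graph_edges(d):
    """Return all edges of the complete graph on nodes 0..d-1 as unordered pairs."""
    return list(combinations(range(d), 2))

edges = complete_graph_edges(4)
print(edges)
print(len(edges), 4 * (4 - 1) // 2)   # both counts agree with d(d-1)/2
```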

Figure 3

Figure 4

A subgraph $$ G' $$ of a graph $$ G=(V,E,f) $$ is a graph $$ (V',E',f') $$ such that $V'\subseteq V$ , $E'\subseteq E$ , and $f'$ is $f$ restricted to $$ E' $$ .

A path in a graph between $V_1,V_k \in V$ is an alternating sequence of vertices and edges ( $V_1 e_1 V_2 e_2 V_3 \dots V_{k-1} e_{k-1} V_k$ ) containing no repeated edges and no repeated vertices, in which $$ e_i $$ is incident to $$ V_i $$ and $V_{i+1}$ for each $i=1,2,\dots,k-1$ .

A graph is "connected" if a path exists between any two vertices in the graph.

A component is a maximal connected subgraph (i.e. it includes as many nodes as possible while remaining connected).
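The definitions of connectedness and components can be made concrete with a breadth-first search. The sketch below assumes an undirected graph given as a node count plus a list of edge pairs; the function name `components` and the example graph are hypothetical.

```python
from collections import deque

def components(num_nodes, edges):
    """Return the components of an undirected graph as sorted lists of nodes."""
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    result = []
    for start in range(num_nodes):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        comp = []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        result.append(sorted(comp))
    return result

parts = components(5, [(0, 1), (1, 2), (3, 4)])
print(parts)              # two components, so the graph is not connected
print(len(parts) == 1)    # a graph is connected exactly when it has one component
```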

A maximal complete subgraph of a graph $$ G $$ is a complete subgraph of $$ G $$ that is not a proper subgraph of any other complete subgraph of $$ G $$ .
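To illustrate the last definition, the sketch below enumerates maximal complete subgraphs with the basic Bron-Kerbosch recursion (without pivoting). This is a swapped-in standard technique, not something the lecture prescribes; the function name `maximal_cliques` and the example graph are assumptions.

```python
def maximal_cliques(num_nodes, edges):
    """List the node sets of all maximal complete subgraphs (basic Bron-Kerbosch)."""
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    cliques = []

    def expand(r, p, x):
        # r: current complete subgraph, p: candidates that can extend it,
        # x: nodes already tried, used to reject non-maximal results.
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(num_nodes)), set())
    return sorted(cliques)

# A triangle 0-1-2 with an extra edge 2-3 (hypothetical example).
found = maximal_cliques(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(found)
```

In this example the edge 0-1 is a complete subgraph but not a maximal one, because it is a proper subgraph of the triangle 0-1-2.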