This article gives a short introduction to clustering and then explains the k-means algorithm through a live demo built in Silverlight. The demo can be used to understand how the k-means algorithm works on user-defined data points. The full source code, in C# and Silverlight, is available for download below.
K-Means Demo in Silverlight
The K-means algorithm is illustrated in this demo. You can enter a new set of data points and test the resultant clusters. The source code of the demo is available for download below.
What is Machine Learning and Clustering
Machine learning is a scientific discipline concerned with systems that automatically learn from data in order to recognize complex patterns and make intelligent decisions. This computational learning can be supervised or unsupervised. Data mining is the process of extracting useful patterns from large volumes of data. Uncovering hidden patterns with data-mining techniques is very useful for businesses, scientists and governments.
Clustering is the process of organizing a set of items into subsets (called clusters) so that items in the same cluster are similar to each other. The similarity between items is defined by a function or formula, based on the context. For example, the Euclidean distance between two points acts as a similarity measure for a list of points/co-ordinates in space. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. The term clustering can also refer to automatic classification, numerical taxonomy, topological analysis, etc. For more information on clustering, see http://en.wikipedia.org/wiki/Cluster_analysis.
Data Structures for this Article
We illustrate the k-means algorithm using a set of points in 2-dimensional (2D) space. The following data-structure classes are used throughout the article. The Point class represents a point in 2D space, and the PointCollection class represents a set of points, i.e. a cluster.
public class Point
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }
}
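The clustering code shown later also calls AddPoint() and RemovePoint() on a cluster and expects the Centroid property to stay in sync with the cluster's contents. Those members are part of the downloadable source; the following is only a minimal sketch of what they might look like, assuming the centroid is the arithmetic mean of the points in the cluster:

public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }

    //add a point to the cluster and refresh the centroid
    public void AddPoint(Point point)
    {
        Add(point);
        UpdateCentroid();
    }

    //add a batch of points (used when the initial random groups are formed) and refresh the centroid
    public new void AddRange(IEnumerable<Point> points)
    {
        base.AddRange(points);
        UpdateCentroid();
    }

    //remove a point from the cluster, refresh the centroid and return the removed point
    public Point RemovePoint(Point point)
    {
        Remove(point);
        UpdateCentroid();
        return point;
    }

    //assumption: the centroid is the mean of the X and Y values of all points in the cluster
    private void UpdateCentroid()
    {
        if (Count == 0) return;

        double sumX = 0, sumY = 0;
        foreach (Point point in this)
        {
            sumX += point.X;
            sumY += point.Y;
        }
        Centroid = new Point { X = sumX / Count, Y = sumY / Count };
    }
}

Hiding AddRange() this way keeps the centroid valid even when a whole group of points is added at once, as happens when the initial random clusters are formed in DoKMeans() below.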
K-Means Algorithm
K-means is a simple clustering algorithm used to divide a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is computed in such a way that it is closely related (in terms of the similarity function) to all the objects in that cluster.
Since we know the number of clusters to be formed, the objects in the input list are initially divided into random groups, that is, each object is assigned to a random cluster. The algorithm then iteratively refines the groups by moving each object from its current group to the most relevant one, where relevance is defined by the similarity measure. Whenever an object is added to or removed from a cluster, the cluster's centroid is recalculated. Each iteration decreases (or at least never increases) the overall distance between the points and the centroids of their clusters. This iterative refinement continues until the clusters become stable, i.e. there is no further movement of objects between clusters. For more information on the k-means algorithm, see http://en.wikipedia.org/wiki/K-means_clustering. The k-means algorithm is also referred to as Lloyd's algorithm.
The k-means algorithm can be used to group any set of objects whose similarity can be measured numerically. For example, the records of a relational-database table can be divided into clusters based on any numerical field: customers or employees can be grouped by attributes such as age, income or date of joining. In such cases, the similarity measure has to be defined based on that attribute.
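For instance, to group customers by a single numeric attribute such as age, each record could be projected onto the Point structure used in this article. The Customer type and the mapping below are only illustrative assumptions, not part of the demo source:

//hypothetical record type, used only to illustrate the mapping
public class Customer
{
    public int Id { get; set; }
    public double Age { get; set; }
}

//project each customer onto a 1D point so that the same k-means code can be reused;
//the Euclidean distance between two such points is simply the difference in ages
public static PointCollection ToPoints(List<Customer> customers)
{
    PointCollection points = new PointCollection();
    foreach (Customer customer in customers)
    {
        points.Add(new Point { Id = customer.Id, X = customer.Age, Y = 0.0 });
    }
    return points;
}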
The following source-code implements the K-means algorithm, using the data-structures defined above.
public static List<PointCollection> DoKMeans(PointCollection points, int clusterCount)
{
    //divide the points into equal-sized random clusters
    List<PointCollection> allClusters = new List<PointCollection>();
    List<List<Point>> allGroups = ListUtility.SplitList<Point>(points, clusterCount);
    foreach (List<Point> group in allGroups)
    {
        PointCollection cluster = new PointCollection();
        cluster.AddRange(group);
        allClusters.Add(cluster);
    }

    //start k-means clustering
    int movements = 1;
    while (movements > 0)
    {
        movements = 0;

        foreach (PointCollection cluster in allClusters) //for all clusters
        {
            for (int pointIndex = 0; pointIndex < cluster.Count; pointIndex++) //for all points in each cluster
            {
                Point point = cluster[pointIndex];

                int nearestCluster = FindNearestCluster(allClusters, point);
                if (nearestCluster != allClusters.IndexOf(cluster)) //if the point is closer to another cluster
                {
                    if (cluster.Count > 1) //each cluster shall have at least one point
                    {
                        Point removedPoint = cluster.RemovePoint(point);
                        allClusters[nearestCluster].AddPoint(removedPoint);
                        movements += 1;
                        pointIndex -= 1; //stay at the same index: the next point has shifted into this slot
                    }
                }
            }
        }
    }

    return allClusters;
}
The SplitList() function defined in the ListUtility class is used to split a list of objects into a given number of roughly equal groups. This is explained in more detail in this article. The FindNearestCluster() function finds the cluster that is nearest (in terms of Euclidean distance) to the given point.
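FindNearestCluster() is part of the downloadable source; a minimal sketch, assuming the distance from a point to a cluster is measured to that cluster's centroid using the FindDistance() helper shown next, could look like this:

//returns the index (within allClusters) of the cluster whose centroid is closest to the given point
public static int FindNearestCluster(List<PointCollection> allClusters, Point point)
{
    int nearestClusterIndex = -1;
    double minimumDistance = double.MaxValue;

    for (int clusterIndex = 0; clusterIndex < allClusters.Count; clusterIndex++)
    {
        double distance = FindDistance(point, allClusters[clusterIndex].Centroid);
        if (distance < minimumDistance)
        {
            minimumDistance = distance;
            nearestClusterIndex = clusterIndex;
        }
    }

    return nearestClusterIndex;
}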
The following function computes the Euclidean distance between two points in 2D space.
public static double FindDistance(Point pt1, Point pt2)
{
    double x1 = pt1.X, y1 = pt1.Y;
    double x2 = pt2.X, y2 = pt2.Y;

    //find the euclidean distance
    double distance = Math.Sqrt(Math.Pow(x2 - x1, 2.0) + Math.Pow(y2 - y1, 2.0));
    return distance;
}
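To see how the pieces fit together outside the Silverlight UI, a small driver like the one below could be used. The sample points and the choice of k = 2 are arbitrary, ListUtility.SplitList() comes from the linked article, and the PointCollection sketch above is assumed so that the centroids stay up to date:

public static void RunKMeansExample()
{
    //a few sample points chosen arbitrarily for illustration
    PointCollection points = new PointCollection();
    points.AddRange(new List<Point>
    {
        new Point { Id = 1, X = 1.0, Y = 1.0 },
        new Point { Id = 2, X = 1.5, Y = 2.0 },
        new Point { Id = 3, X = 3.0, Y = 4.0 },
        new Point { Id = 4, X = 5.0, Y = 7.0 },
        new Point { Id = 5, X = 3.5, Y = 5.0 },
        new Point { Id = 6, X = 4.5, Y = 5.0 },
        new Point { Id = 7, X = 3.5, Y = 4.5 }
    });

    //divide the points into k = 2 clusters
    List<PointCollection> clusters = DoKMeans(points, 2);

    //print the size and centroid of each resulting cluster
    for (int i = 0; i < clusters.Count; i++)
    {
        Console.WriteLine("Cluster {0}: {1} points, centroid at ({2:F2}, {3:F2})",
            i + 1, clusters[i].Count, clusters[i].Centroid.X, clusters[i].Centroid.Y);
    }
}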
Downloads for this article
File | Language | Tools
KMeans-Demo-Source (40.44 kb) | C#, Silverlight 3 | Visual Studio 2008