New User?   Join Now     Login            Forgot Password?    
K-Means Algorithm  (115057 hits)
 Share this Article
 
Posted by Prabu Arumugam on Aug-03-2010
Languages: C#, Silverlight

This article gives a short introduction to clustering and then explains K-means algorithm in an efficient way using a live demo in Silverlight. The demo can be used to understand the working of k-means algorithm through user-defined data points. The full source code in C# and Silverlight is available for download below.

K-Means Demo in Silverlight

The K-means algorithm is illustrated in this demo. You can enter a new set of data points and test the resultant clusters. The source code of the demo is available for download below.

What is Machine Learning and Clustering

Machine learning is a scientific discipline used to automatically learn in order to understand complex patterns and make intelligent decisions based on data. This computational learning can be supervised or unsupervised. Data Mining is the process of extracting useful patterns from large volumes of data. Uncovering hidden patterns in data using data mining techniques will be very useful for businesses, scientists and governments.

Clustering is the process of organizing a set of items into subsets (called clusters) so that items in the same cluster are similar. The similarity between items can be defined by a function or a formula, based on the context. For example, the Euclidean distance between two points acts as a similarity function for list of points/co-ordinates in space. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. The term clustering can also refer to automatic classification, numerical taxonomy, topological analysis etc. For more information on Clustering, see http://en.wikipedia.org/wiki/Cluster_analysis.

Data Structures for this Article

We illustrate the k-means algorithm using a set of points in 2-dimensional (2D) space. The following data-structure classes are created. The Point class represents a point in 2D space. The PointCollection represents a set of points and/or cluster.

public class Point
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }
}

K-Means Algorithm

The K-Means is a simple clustering algorithm used to divide a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related (in terms of similarity function) to all objects of that cluster.

Since we know the number of clusters to be formed, the objects in the input list are initially divided into random groups, that is, each object is assigned to a random cluster. After this, the algorithm iteratively refines each group by moving objects from irrelevant group to relevant group. The relevance is defined by the similarity measure or function. Whenever a new object is added or removed from a cluster, its centroid is updated or recalculated. Each iteration is guaranteed to increase the similarility between all the points inside a cluster. This iterative refinement is continued until all the clusters become stable i.e. there is no futher movement of objects between clusters. For more information on k-means algorithm, see http://en.wikipedia.org/wiki/K-means_clustering. The k-means algorithm is also referred to as Lloyd's algorithm.

The K-means algorithm can be used for grouping any set of objects whose similarity measure can be defined numerically. For example, a set of records of a relational-database table can be divided into clusters based on any numerical field of the table. For example, the set of customers or employees can be divided based on their attributes/properties like age, income, date-of-join, etc. In such cases, the similarity measure has to be defined based on that attribue.

The following source-code implements the K-means algorithm, using the data-structures defined above.

public static List<PointCollection> DoKMeans(PointCollection points, int clusterCount)
{
    //divide points into equal clusters
    List<PointCollection> allClusters = new List<PointCollection>();
    List<List<Point>> allGroups = ListUtility.SplitList<Point>(points, clusterCount);
    foreach (List<Point> group in allGroups)
    {
        PointCollection cluster = new PointCollection();
        cluster.AddRange(group);
        allClusters.Add(cluster);
    }

    //start k-means clustering
    int movements = 1;
    while (movements > 0)
    {
        movements = 0;

        foreach (PointCollection cluster in allClusters) //for all clusters
        {
            for (int pointIndex = 0; pointIndex < cluster.Count; pointIndex++) //for all points in each cluster
            {
                Point point = cluster[pointIndex];

                int nearestCluster = FindNearestCluster(allClusters, point);
                if (nearestCluster != allClusters.IndexOf(cluster)) //if point has moved
                {
                    if (cluster.Count > 1) //each cluster shall have minimum one point
                    {
                        Point removedPoint = cluster.RemovePoint(point);
                        allClusters[nearestCluster].AddPoint(removedPoint);
                        movements += 1;
                    }
                }
            }
        }
    }

    return allClusters;
}

The SplitList() function defined in ListUtility class is used to split a list of objects into equal number of groups. This is explained in more detail in this article. The FindNearestCluster() function finds the cluster that is very nearest (in terms of euclidean distance) to the given point.

The following function finds the euclidean-distance between two points in 2D space.

public static double FindDistance(Point pt1, Point pt2)
{
    double x1 = pt1.X, y1 = pt1.Y;
    double x2 = pt2.X, y2 = pt2.Y;

    //find euclidean distance
    double distance = Math.Sqrt(Math.Pow(x2 - x1, 2.0) + Math.Pow(y2 - y1, 2.0));
    return (distance);
}

 Downloads for this article
File Language Tools
KMeans-Demo-Source  40.44 kb  (2682 downloads) C#, Silverlight 3 Visual Studio 2008

 Share this Article

 Comments on this Article
Comment by kokilasri on Dec-16-2015
nise
Comment by chandra on Dec-08-2015
please send this code
Comment by POT on Jan-09-2014
Great! It worked! Saved the hell out of shit of us :)
Comment by caoboi_93 on Dec-19-2013
how can i download this article?
Comment by kulandai samy on Apr-28-2013
hi frds, i have one doubt..how to detemine number of cluster for k-means..please help me..
Comment by jos on Apr-23-2013
Great article. By reading the source code I now understand the algorithm. This is math for dummies like me. What is this web site anyway, does it contain more of these quality articles?
Comment by subrata on Jun-09-2012
i want
Comment by khanmuh2 on Apr-24-2012
using System.Windows.Controls.DataVisualization; source code compile error. I cant seem to find the dll for this.
Comment by zddream on Apr-11-2012
good applet!begin learning
Comment by Jiangxin Wu on Oct-24-2011
Remarkable
Comment by Allen on Sep-17-2011
do you have any examples using 4 dimentions observational data? let's x1 = {x,y,z,w}
Comment by maidenz08 on Sep-06-2011
Awesome!
Comment by win on Aug-22-2011
using System.Windows.Controls.DataVisualization; not found please help me!
Comment by kawita on Mar-31-2011
this is very useful material for understanding the dataming
Comment by Gu Qingshui on Mar-11-2011
Very Good!
 Post your comment here
Your Name
Your Email
Comment
Post Comment

About      Terms      Contact