KMEANS
the K-Means Data Clustering Problem


KMEANS is a FORTRAN90 library, using double precision arithmetic, for the K-Means clustering problem.

In the K-Means problem, a set of N points X(I) in M-dimensions is given. The goal is to arrange these points into K clusters, with each cluster having a representative point Z(J), usually chosen as the centroid of the points in the cluster.

        Z(J) = Sum ( all X(I) in cluster J ) X(I) / 
               Sum ( all X(I) in cluster J ) 1.
      
The energy of cluster J is
        E(J) = Sum ( all X(I) in cluster J ) || X(I) - Z(J) ||^2
      

For a given set of clusters, the total energy is then simply the sum of the cluster energies E(J). The goal is to choose the clusters in such a way that the total energy is minimized. Usually, a point X(I) goes into the cluster with the closest representative point Z(J). So to define the clusters, it's enough simply to specify the locations of the cluster representatives.

This is actually a fairly hard problem. Most algorithms do reasonably well, but cannot guarantee that the best solution has been found. It is very common for algorithms to get stuck at a solution which is merely a "local minimum". For such a local minimum, every slight rearrangement of the solution makes the energy go up; however a major rearrangement would result in a big drop in energy.

A simple algorithm for the problem is known as the "H-Means algorithm". It alternates between two procedures:

These steps are repeated until no points are moved, or some other termination criterion is reached.

A more sophisticated algorithm, known as the "K-Means algorithm", takes advantage of the fact that it is possible to quickly determine the decrease in energy caused by moving a point from its current cluster to another. It repeats the following procedure:

This procedure is repeated until no points are moved, or some other termination criterion is reached.

The Weighted K-Means Problem

A natural extension of the K-Means problem allows us to include some more information, namely, a set of weights associated with the data points. These might represent a measure of importance, a frequency count, or some other information. The intent is that a point with a weight of 5.0 is twice as "important" as a point with a weight of 2.5, for instance. This gives rise to the "weighted" K-Means problem.

In the weighted K-Means problem, we are given a set of N points X(I) in M-dimensions, and a corresponding set of nonnegative weights W(I). The goal is to arrange the points into K clusters, with each cluster having a representative point Z(J), usually chosen as the weighted centroid of the points in the cluster:

        Z(J) = Sum ( all X(I) in cluster J ) W(I) * X(I) / 
               Sum ( all X(I) in cluster J ) W(I).
      
The weighted energy of cluster J is
        E(J) = Sum ( all X(I) in cluster J ) W(I) * || X(I) - Z(J) ||^2
      

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Related Data and Programs:

ASA058 is a FORTRAN90 library which implements the K-means algorithm of Sparks.

ASA136 is a FORTRAN77 file containing the original text of the Hartigan and Wong clustering algorithm ASA136.

CITIES is a library of FORTRAN90 routines for various problems associated with a set of "cities" on a map.

CITIES is a dataset directory which contains sets of data defining groups of cities.

LAU_NP is a FORTRAN90 library which contains heuristic algorithms for the K-center and K-median problems.

SPAETH is a FORTRAN90 library which can cluster data according to various principles.

SPAETH is a dataset directory which contains a set of test data.

SPAETH2 is a FORTRAN90 library which can cluster data according to various principles.

SPAETH2 is a dataset directory which contains a set of test data.

Reference:

  1. John Hartigan, Manchek Wong,
    Algorithm AS 136: A K-Means Clustering Algorithm,
    Applied Statistics,
    Volume 28, Number 1, 1979, pages 100-108.
  2. Wendy Martinez, Angel Martinez,
    Computational Statistics Handbook with MATLAB,
    Chapman and Hall / CRC, 2002.
  3. David Sparks,
    Algorithm AS 58: Euclidean Cluster Analysis,
    Applied Statistics,
    Volume 22, Number 1, 1973, pages 126-130.

Source Code:

Examples and Tests:

KMEANS_PRB is a sample problem which reads a data file and calls the various routines to cluster that data.

There are two files of data read by the sample code:

The sample program writes out a data file describing each clustering:

The PLOT_POINTS program can be used to create EPS images of the clusterings computed by the sample problem. These EPS images can then be converted to the PNG image format.

List of Routines:

You can go up one level to the FORTRAN90 source codes.


Last revised on 13 February 2008.