KMEANS
the K-Means Data Clustering Problem
KMEANS is a FORTRAN90 library, using double
precision arithmetic, for the K-Means clustering problem.
In the K-Means problem, a set of N points X(I) in M-dimensions is
given. The goal is to arrange these points into K clusters,
with each cluster having a representative point Z(J), usually
chosen as the centroid of the points in the cluster.
Z(J) = Sum ( all X(I) in cluster J ) X(I) /
Sum ( all X(I) in cluster J ) 1.
The energy of cluster J is
E(J) = Sum ( all X(I) in cluster J ) || X(I) - Z(J) ||^2
For a given set of clusters, the total energy is then simply
the sum of the cluster energies E(J). The goal is to choose the
clusters in such a way that the total energy is minimized.
Usually, a point X(I) goes into the cluster with the closest
representative point Z(J). So to define the clusters, it's
enough simply to specify the locations of the cluster representatives.
This is actually a fairly hard problem. Most algorithms do
reasonably well, but cannot guarantee that the best solution
has been found. It is very common for algorithms to get
stuck at a solution which is merely a "local minimum".
For such a local minimum, every slight rearrangement of
the solution makes the energy go up; however a major
rearrangement would result in a big drop in energy.
A simple algorithm for the problem is known as the "H-Means algorithm".
It alternates between two procedures:
-
Using the given cluster centers, assign each point to the
cluster with the nearest center;
-
Using the given cluster assignments, replace each cluster
center by the centroid or average of the points in the cluster.
These steps are repeated until no points are moved, or some
other termination criterion is reached.
A more sophisticated algorithm, known as the "K-Means algorithm",
takes advantage of the fact that it is possible to quickly determine
the decrease in energy caused by moving a point from its current cluster
to another. It repeats the following procedure:
-
For each point, move it to another cluster if that would lower
the energy. If you move a point, immediately update the
cluster centers of the two affected clusters.
This procedure is repeated until no points are moved, or some
other termination criterion is reached.
The Weighted K-Means Problem
A natural extension of the K-Means problem allows us to include
some more information, namely, a set of weights associated
with the data points. These might represent a measure of importance,
a frequency count, or some other information. The intent is that
a point with a weight of 5.0 is twice as "important" as a point with
a weight of 2.5, for instance. This gives rise to the "weighted"
K-Means problem.
In the weighted K-Means problem, we are given a set of N points
X(I) in M-dimensions, and a corresponding set of nonnegative weights
W(I). The goal is to arrange the points into K clusters,
with each cluster having a representative point Z(J), usually
chosen as the weighted centroid of the points in the cluster:
Z(J) = Sum ( all X(I) in cluster J ) W(I) * X(I) /
Sum ( all X(I) in cluster J ) W(I).
The weighted energy of cluster J is
E(J) = Sum ( all X(I) in cluster J ) W(I) * || X(I) - Z(J) ||^2
Licensing:
The computer code and data files described and made available on this web page
are distributed under
the GNU LGPL license.
Related Data and Programs:
ASA058
is a FORTRAN90 library which
implements the K-means algorithm of Sparks.
ASA136
is a FORTRAN77 file containing the original text of the Hartigan
and Wong clustering algorithm ASA136.
CITIES
is a library of
FORTRAN90 routines for various problems associated with a set of
"cities" on a map.
CITIES
is a dataset directory which contains sets of data defining
groups of cities.
LAU_NP
is a FORTRAN90 library which
contains heuristic algorithms for the K-center and K-median problems.
SPAETH
is a FORTRAN90 library which
can cluster data according to various principles.
SPAETH
is a dataset directory which contains a set of test data.
SPAETH2
is a FORTRAN90 library which
can cluster data according to various principles.
SPAETH2
is a dataset directory which contains a set of test data.
Reference:
-
John Hartigan, Manchek Wong,
Algorithm AS 136:
A K-Means Clustering Algorithm,
Applied Statistics,
Volume 28, Number 1, 1979, pages 100-108.
-
Wendy Martinez, Angel Martinez,
Computational Statistics Handbook with MATLAB,
Chapman and Hall / CRC, 2002.
-
David Sparks,
Algorithm AS 58:
Euclidean Cluster Analysis,
Applied Statistics,
Volume 22, Number 1, 1973, pages 126-130.
Source Code:
Examples and Tests:
KMEANS_PRB is a sample problem which reads a data file
and calls the various routines to cluster that data.
There are two files of data read by the sample code:
-
points_100.txt, 100 2D points,
used as a case study by the sample problem. (Can be
plotted using the
PLOT_POINTS
program).
-
ruspini.txt, 75 points in 2D,
with integer coordinates, and 0 < X < 120, 0 < Y < 160,
which might naturally be grouped into 4 sets.
The sample program writes out a data file describing each clustering:
-
test01_clusters.txt, the
100 point data set, clustered by test #1 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test02_clusters.txt, the
100 point data set, clustered by test #2 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test03_clusters.txt, the
100 point data set, clustered by test #3 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test04_clusters.txt, the
100 point data set, clustered by test #4 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test05_clusters.txt, the
100 point data set, clustered by test #5 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test06_clusters.txt, the
100 point data set, clustered by test #6 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test07_clusters.txt, the
100 point data set, clustered by test #7 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test08_clusters.txt, the
100 point data set, clustered by test #8 in the sample problem.
(Can be plotted using the
PLOT_POINTS
program.)
-
test09_clusters_equal.txt, the
100 point data set, clustered by test #9 in the sample problem,
assuming EQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test09_clusters_unequal.txt, the
100 point data set, clustered by test #9 in the sample problem,
assuming UNEQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test10_clusters_equal.txt, the
100 point data set, clustered by test #10 in the sample problem,
assuming EQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test10_clusters_unequal.txt, the
100 point data set, clustered by test #10 in the sample problem,
assuming UNEQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test11_clusters_equal.txt, the
100 point data set, clustered by test #11 in the sample problem,
assuming EQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test11_clusters_unequal.txt, the
100 point data set, clustered by test #11 in the sample problem,
assuming UNEQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test12_clusters_equal.txt, the
100 point data set, clustered by test #12 in the sample problem,
assuming EQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
-
test12_clusters_unequal.txt, the
100 point data set, clustered by test #12 in the sample problem,
assuming UNEQUAL weights.
(Can be plotted using the
PLOT_POINTS
program.)
The
PLOT_POINTS program can be used to create EPS images of the
clusterings computed by the sample problem. These EPS images can
then be converted to the
PNG image format.
-
plot_points_input.txt,
commands to
PLOT_POINTS to read the clustering files and create EPS
images.
-
eps_to_png.csh,
commands to create the EPS images and convert them to PNG format.
-
test01_clusters.png,
a PNG image of
the clusters.
-
test02_clusters.png,
a PNG image of
the clusters.
-
test03_clusters.png,
a PNG image of
the clusters.
-
test04_clusters.png,
a PNG image of
the clusters.
-
test05_clusters.png,
a PNG image of
the clusters.
-
test06_clusters.png,
a PNG image of
the clusters.
-
test07_clusters.png,
a PNG image of
the clusters.
-
test08_clusters.png,
a PNG image of
the clusters.
-
test09_clusters_equal.png,
a PNG image of
the clusters.
-
test09_clusters_unequal.png,
a PNG image of
the clusters.
-
test10_clusters_equal.png,
a PNG image of
the clusters.
-
test10_clusters_unequal.png,
a PNG image of
the clusters.
-
test11_clusters_equal.png,
a PNG image of
the clusters.
-
test11_clusters_unequal.png,
a PNG image of
the clusters.
-
test12_clusters_equal.png,
a PNG image of
the clusters.
-
test12_clusters_unequal.png,
a PNG image of
the clusters.
List of Routines:
-
CH_CAP capitalizes a single character.
-
CH_EQI is a case insensitive comparison of two characters for equality.
-
CH_TO_DIGIT returns the integer value of a base 10 digit.
-
CLUSTER_ENERGY_COMPUTE computes the energy of the clusters.
-
CLUSTER_INITIALIZE_1 initializes the clusters to data points.
-
CLUSTER_INITIALIZE_2 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_3 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_4 initializes the cluster centers to random values.
-
CLUSTER_INITIALIZE_5 initializes the cluster centers to random values.
-
CLUSTER_PRINT_SUMMARY prints a summary of data about a clustering.
-
CLUSTER_WRITE writes points and cluster centers to a file.
-
DTABLE_DATA_READ reads data from a double precision table file.
-
FILE_COLUMN_COUNT counts the number of columns in the first line of a file.
-
FILE_ROW_COUNT counts the number of row records in a file.
-
GET_UNIT returns a free FORTRAN unit number.
-
HMEANS_01 applies the H-Means algorithm.
-
HMEANS_02 applies the H-Means algorithm.
-
HMEANS_W_01 applies the weighted H-Means algorithm.
-
HMEANS_W_02 applies the weighted H-Means algorithm.
-
I4_UNIFORM returns a pseudorandom I4.
-
KMEANS_01 applies the K-Means algorithm.
-
KMEANS_02 applies the K-Means algorithm.
-
KMEANS_02_OPTRA carries out the optimal transfer stage.
-
KMEANS_02_QTRAN carries out the quick transfer stage.
-
KMEANS_03 applies the K-Means algorithm.
-
KMEANS_W_01 applies the weighted K-Means algorithm.
-
KMEANS_W_03 applies the weighted K-Means algorithm.
-
R4_UNIFORM_01 returns a unit pseudorandom R4.
-
R8_UNIFORM_01 returns a unit pseudorandom R8.
-
R8MAT_UNIFORM_01 returns a unit pseudorandom R8MAT.
-
R8VEC_UNIFORM_01 returns a unit pseudorandom R8VEC.
-
RANDOM_INITIALIZE initializes the FORTRAN 90 random number seed.
-
S_TO_R8 reads an R8 from a string.
-
S_TO_R8VEC reads an R8VEC from a string.
-
S_WORD_COUNT counts the number of "words" in a string.
-
TIMESTAMP prints the current YMDHMS date as a time stamp.
You can go up one level to
the FORTRAN90 source codes.
Last revised on 13 February 2008.