CVT_BASIS
Data Clustering by K-Means Techniques
CVT_BASIS is a FORTRAN90 program, using
double precision arithmetic, which computes good cluster centers
for a set of data.
The clustering process uses the K-Means algorithm, which can be
considered to be a discrete version of the CVT algorithm (Centroidal
Voronoi Tessellation).
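As a rough illustration of the K-Means idea, here is a minimal Python sketch. It is not taken from the CVT_BASIS source. The function names, the choice of starting centers (k distinct data points), and the stopping rule are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))


def kmeans(points, k, it_max=50, seed=0):
    """Alternate assignment and centroid steps until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(it_max):
        new_assignment = [nearest_center(p, centers) for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assignment


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    centers, assignment = kmeans(data, 2)
    print(centers, assignment)
```

Each pass assigns every vector to its nearest center and then moves each center to the mean of its vectors, which is the discrete counterpart of moving a CVT generator to the centroid of its Voronoi region.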
The data is a collection of vectors, with each vector stored in
a separate file. The files are presumed to have "sequential" names,
such as "fred01.txt", "fred02.txt", and so on. Each file must be a
TABLE file, that is,
a series of N lines, with M values on every line (comment lines
may be inserted as well).
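The layout described above can be parsed with a few lines of Python. This is only a sketch: the name read_table is hypothetical, and it assumes that comment lines in input files begin with "#" (as stated for the output files) and that blank lines may be skipped.

```python
def read_table(lines):
    """Parse TABLE-style text: one row of numbers per line."""
    rows = []
    for line in lines:
        text = line.strip()
        # Assumption: "#" marks a comment line; blank lines are ignored.
        if not text or text.startswith("#"):
            continue
        rows.append([float(value) for value in text.split()])
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("every data line must hold the same number of values")
    return rows


if __name__ == "__main__":
    sample = ["# two rows, three values each", "1.0 2.0 3.0", "4.0 5.0 6.0"]
    print(read_table(sample))
```

To read from disk, an open file object can be passed in place of the list, since iterating over a file yields its lines.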
The program is given the name of the first file in the sequence.
It reads the data from each file in the sequence, and carries out
the K-Means clustering process to determine K cluster centers.
It writes each of these cluster centers out to a separate file.
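One way to walk through a sequence of consecutively numbered files is sketched below in Python. The helper names are hypothetical, and the rule used here (increment the last run of digits in the name and stop at the first missing file) is an assumption for illustration; the program's own FILE_NAME_INC routine may behave differently, for example when the digits overflow.

```python
import re


def next_file_name(name):
    """Increment the last run of digits in name, keeping its width."""
    matches = list(re.finditer(r"\d+", name))
    if not matches:
        raise ValueError("file name contains no digits to increment")
    last = matches[-1]
    digits = last.group()
    bumped = str(int(digits) + 1).zfill(len(digits))
    return name[:last.start()] + bumped + name[last.end():]


def file_sequence(first_name, exists):
    """List consecutive names, starting at first_name, while exists(name) holds."""
    names = []
    name = first_name
    while exists(name):
        names.append(name)
        name = next_file_name(name)
    return names


if __name__ == "__main__":
    present = {"fred01.txt", "fred02.txt", "fred03.txt"}
    print(file_sequence("fred01.txt", present.__contains__))
```

For real files, os.path.exists can be supplied as the exists argument.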
The cluster centers will generally be "well spread out" in the space
spanned by the set of data. Such a set might be useful, for instance,
in determining a basis for a low-dimensional approximation of the
data.
INPUT: at run time, the user specifies:
-
uv0_file, the name of the first data file (the program
will assume all the files are numbered consecutively).
More than one family of solution files may be specified.
Enter "none" if there are no more families, or else the name of the
first file in the next family. Up to 10 separate families of
files are allowed.
-
cluster_lo, cluster_hi, the range of cluster sizes to check.
In most cases, you simply want to specify the same number
for both these values, namely, the requested basis size.
-
cluster_it_max, the number of different times you want to
try to cluster the data; I often use 15.
-
energy_it_max, the number of times you want to try to improve
a given clustering by swapping points from one cluster to another;
I often use 50 or 100.
-
comment, "Y" if initial comments may be included at the
beginning of the output files. These comments always start with
a "#" character in column 1.
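The way these inputs might interact can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the "energy" is taken to be the sum of squared distances from each vector to its cluster center, each clustering attempt starts from randomly chosen data vectors, and the improvement step is a plain assign-and-recenter pass rather than the point-swapping strategy the program uses. All function names are hypothetical.

```python
import random


def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            for p in points]


def energy(points, centers, labels):
    """Assumed energy: sum of squared distances to the assigned centers."""
    return sum(sqdist(p, centers[a]) for p, a in zip(points, labels))


def improve(points, centers, energy_it_max):
    """Apply at most energy_it_max assign-and-recenter passes."""
    labels = assign(points, centers)
    for _ in range(energy_it_max):
        for j in range(len(centers)):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


def sweep(points, cluster_lo, cluster_hi,
          cluster_it_max=15, energy_it_max=50, seed=0):
    """For each cluster count, keep the lowest energy over all attempts."""
    rng = random.Random(seed)
    best = {}
    for k in range(cluster_lo, cluster_hi + 1):
        for _ in range(cluster_it_max):
            centers = [list(p) for p in rng.sample(points, k)]
            centers, labels = improve(points, centers, energy_it_max)
            e = energy(points, centers, labels)
            if k not in best or e < best[k][0]:
                best[k] = (e, centers)
    return best


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    for k, (e, centers) in sweep(data, 1, 2).items():
        print(k, e, centers)
```

The default values 15 and 50 simply echo the values quoted above for cluster_it_max and energy_it_max. Setting cluster_lo equal to cluster_hi yields a single basis of the requested size.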
Related Data and Programs:
BURGERS
is a dataset directory which
contains solutions of the 1 dimensional Burgers equation;
CAVITY_FLOW
is a dataset directory which
contains solutions of a driven cavity flow in 2D;
CVT_BASIS_FLOW
is a FORTRAN90 program which
is similar to CVT_BASIS, but is specialized to handle
a particular family of fluid flow solutions.
INOUT_FLOW
is a dataset directory which
contains solutions for flow in and out of a chamber in 2D;
INOUT_FLOW2
is a dataset directory which
contains solutions for flow in and out of a chamber in 2D,
using a finer grid and more timesteps;
SVD_BASIS
is a FORTRAN90 program which
uses the singular value decomposition to extract representative
modes from a set of data vectors.
TABLE
is a file format used for the input and output files.
TCELL_FLOW
is a dataset directory which
contains solutions for flow through a T-cell in 2D.
Reference:
-
Franz Aurenhammer,
Voronoi diagrams -
a study of a fundamental geometric data structure,
ACM Computing Surveys,
Volume 23, Number 3, pages 345-405, September 1991,
../../pdf/aurenhammer.pdf
-
John Burkardt, Max Gunzburger, Hyung-Chun Lee,
Centroidal Voronoi Tessellation-Based Reduced-Order
Modelling of Complex Systems,
SIAM Journal on Scientific Computing,
Volume 28, Number 2, 2006, pages 459-484.
-
John Burkardt, Max Gunzburger, Janet Peterson and Rebecca Brannon,
User Manual and Supporting Information for Library of Codes
for Centroidal Voronoi Placement and Associated Zeroth,
First, and Second Moment Determination,
Sandia National Laboratories Technical Report SAND2002-0099,
February 2002,
../../publications/bgpb_2002.pdf
-
Qiang Du, Vance Faber, and Max Gunzburger,
Centroidal Voronoi Tessellations: Applications and Algorithms,
SIAM Review, Volume 41, 1999, pages 637-676.
-
Lili Ju, Qiang Du, and Max Gunzburger,
Probabilistic methods for centroidal Voronoi tessellations
and their parallel implementations,
Parallel Computing,
Volume 28, 2002, pages 1477-1500.
-
Wendy Martinez and Angel Martinez,
Computational Statistics Handbook with MATLAB,
Chapman and Hall / CRC, 2002.
Examples and Tests:
-
run 01, example seeking 2 clusters;
-
run 02, example seeking 4 clusters;
-
run 03, example seeking 8 clusters;
-
run 04, compute clusterings
of sizes 1 through 16, determine energies, and output size
versus energy data.
List of Routines:
-
MAIN is the main routine for the CVT_BASIS program.
-
ANALYSIS_RAW computes the energy for a range of number of clusters.
-
CH_CAP capitalizes a single character.
-
CH_EQI is a case insensitive comparison of two characters for equality.
-
CH_IS_DIGIT returns .TRUE. if a character is a decimal digit.
-
CH_TO_DIGIT returns the integer value of a base 10 digit.
-
CLUSTER_CENSUS computes and prints the population of each cluster.
-
CLUSTER_INITIALIZE_RAW initializes the cluster centers to random values.
-
CLUSTER_LIST prints out the assignments.
-
DATA_TO_GNUPLOT writes data to a file suitable for processing by GNUPLOT.
-
DIGIT_INC increments a decimal digit.
-
DIGIT_TO_CH returns the character representation of a decimal digit.
-
DTABLE_DATA_READ reads data from a double precision table file.
-
DTABLE_DATA_WRITE writes data to a double precision table file.
-
DTABLE_HEADER_READ reads the header from a double precision table file.
-
DTABLE_HEADER_WRITE writes the header to a double precision table file.
-
DTABLE_WRITE writes a double precision table file.
-
ENERGY_RAW computes the total energy of a given clustering.
-
FILE_COLUMN_COUNT counts the number of columns in the first line of a file.
-
FILE_EXIST reports whether a file exists.
-
FILE_NAME_INC generates the next filename in a series.
-
FILE_ROW_COUNT counts the number of row records in a file.
-
GET_UNIT returns a free FORTRAN unit number.
-
HMEANS_RAW seeks the minimal energy of a cluster of a given size.
-
I4_INPUT prints a prompt string and reads an integer from the user.
-
I4_RANGE_INPUT reads a pair of integers from the user, representing a range.
-
I4_UNIFORM returns a scaled pseudorandom I4.
-
ITABLE_DATA_READ reads data from an integer table file.
-
ITABLE_HEADER_READ reads the header from an integer table file.
-
I4VEC_PRINT prints an integer vector.
-
KMEANS_RAW tries to improve a partition of points.
-
NEAREST_CLUSTER_RAW finds the cluster nearest to a data point.
-
R8_UNIFORM_01 returns a unit pseudorandom R8.
-
R8VEC_NORM2 returns the 2-norm of a vector.
-
R8VEC_RANGE_INPUT reads two double precision vectors from the user, representing a range.
-
R8VEC_UNIT_EUCLIDEAN normalizes an N-vector in the Euclidean norm.
-
RANDOM_INITIALIZE initializes the FORTRAN 90 random number seed.
-
REFQBF evaluates a reference element quadratic basis function.
-
S_BLANK_DELETE removes blanks from a string, left justifying the remainder.
-
S_EQI is a case insensitive comparison of two strings for equality.
-
S_INPUT prints a prompt string and reads a string from the user.
-
S_REP_CH replaces all occurrences of one character by another.
-
S_TO_D reads a real ( kind = 8 ) number from a string.
-
S_TO_R8VEC reads a double precision vector from a string.
-
S_TO_I4 reads an I4 from a string.
-
S_TO_I4VEC reads an I4VEC from a string.
-
S_WORD_COUNT counts the number of "words" in a string.
-
TIMESTAMP prints the current YMDHMS date as a time stamp.
-
TIMESTRING writes the current YMDHMS date into a string.
Last revised on 12 November 2006.