kmeansFit#
Purpose#
Partitions data into k clusters, using the kmeans algorithm.
Format#
- mdl = kmeansFit(X, clusters[, ctl])#
- Parameters:
X_train (NxP matrix) – The training features.
clusters (Scalar) – The number of clusters, or a matrix containing the initial centroids.
ctl –
Optional input, an instance of a
kmeansControl
structure.ctl.initMethod
Scalar specifying the algorithm used to create the initial centroids. Options include:
0
kmeans++ (default).
1
parallel k-means++
2
\(k\) randomly-selected observations.
ctl.nStarts
Scalar, the number of times to run the kmeans algorithm with new starting centroids. Note: this input will be ignored if the clusters input is a starting centroid.
ctl.seed
Seed for the random number generator which creates the initial centroids. Note: this input will be ignored if the clusters input is a starting centroid.
ctl.tolerance
Scalar, the convergence tolerance for the kmeans algorithm.
ctl.maxIters
Scalar, the maximum number of iterations to allow each of the nStarts to run before forcing convergence.
- Returns:
mdl (struct) –
An instance of a
kmeansModel
structure.mdl.centroids
kxP matrix, containing the centroids with the lowest intra-cluster sum of squares.
mdl.assignments
Nx1 matrix, containing the centroid assignment for the corresponding observation of the input matrix.
mdl.clusterSS
Scalar, sum of squared differences between each observation and its assigned centroid.
mdl.elapsedIters
Scalar, the number of iterations taken by the start with the lowest clusterSS.
Examples#
new;
library gml;
// For repeatable sample
rndseed 234234;
// Get dataset with full name
fname = getGAUSSHome("pkgs/gml/examples/iris.csv");
// Load data
X = loadd(fname, ". -species");
// Split data into x_train and x_test
{ x_train, x_test } = splitData(X, 0.70);
// Number of clusters
n_clusters = 3;
// Declare kmeansModel struct
struct kmeansModel mdl;
// Fit kmeans model
mdl = kmeansFit(x_train , n_clusters);
The above code will print the following:
=================================================================
Model: K-Means Number clusters: 3
Number observations: 105 Number features: 4
Init method: K-means++ Number starts: 3
Tolerance: 0.0001
=================================================================
K-means fit performance statistics:
============================================================
Total sum of squares: 477.576
Between group sum of squares: 419.05229
Within group sum of squares: 58.52371
The ratio of BSS/TSS: 0.87745676
============================================================
Centroids:
====================================================================
SepalLength SepalWidth PetalLength PetalWidth
5.82381 2.70952 4.35 1.42143
5.00937 3.40625 1.475 0.25
6.8871 3.06129 5.73226 2.06129
====================================================================
References#
Parallel Kmeans++ initialization. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii. Scalable K-means++. Proceedings of the VLDB Endowment, 2012.
See also