kmeansFit ==================== Purpose ---------------------- Partitions data into k clusters, using the kmeans algorithm. Format ---------------------- .. function:: mdl = kmeansFit(X, clusters[, ctl]) :param X_train: The training features. :type X_train: NxP matrix :param clusters: The number of clusters, or a matrix containing the initial centroids. :type clusters: Scalar :param ctl: Optional input, an instance of a :class:`kmeansControl` structure. .. list-table:: :widths: auto * - ctl.initMethod - Scalar specifying the algorithm used to create the initial centroids. Options include: === =========================================== 0 kmeans++ (default). 1 parallel k-means++ 2 :math:`k` randomly-selected observations. === =========================================== * - ctl.nStarts - Scalar, the number of times to run the kmeans algorithm with new starting centroids. Note: this input will be ignored if the *clusters* input is a starting centroid. * - ctl.seed - Seed for the random number generator which creates the initial centroids. Note: this input will be ignored if the *clusters* input is a starting centroid. * - ctl.tolerance - Scalar, the convergence tolerance for the kmeans algorithm. * - ctl.maxIters - Scalar, the maximum number of iterations to allow each of the *nStarts* to run before forcing convergence. :return mdl: An instance of a :class:`kmeansModel` structure. .. csv-table:: :widths: auto "mdl.centroids","kxP matrix, containing the centroids with the lowest intra-cluster sum of squares." "mdl.assignments","Nx1 matrix, containing the centroid assignment for the corresponding observation of the input matrix." "mdl.clusterSS","Scalar, sum of squared differences between each observation and its assigned centroid." "mdl.elapsedIters","Scalar, the number of iterations taken by the *start* with the lowest *clusterSS*." :rtype mdl: struct Examples ------------ :: new; library gml; // For repeatable sample rndseed 234234; // Get dataset with full name fname = getGAUSSHome("pkgs/gml/examples/iris.csv"); // Load data X = loadd(fname, ". -species"); // Split data into x_train and x_test { x_train, x_test } = splitData(X, 0.70); // Number of clusters n_clusters = 3; // Declare kmeansModel struct struct kmeansModel mdl; // Fit kmeans model mdl = kmeansFit(x_train , n_clusters); The above code will print the following: :: ================================================================= Model: K-Means Number clusters: 3 Number observations: 105 Number features: 4 Init method: K-means++ Number starts: 3 Tolerance: 0.0001 ================================================================= K-means fit performance statistics: ============================================================ Total sum of squares: 477.576 Between group sum of squares: 419.05229 Within group sum of squares: 58.52371 The ratio of BSS/TSS: 0.87745676 ============================================================ Centroids: ==================================================================== SepalLength SepalWidth PetalLength PetalWidth 5.82381 2.70952 4.35 1.42143 5.00937 3.40625 1.475 0.25 6.8871 3.06129 5.73226 2.06129 ==================================================================== References ---------------- Parallel Kmeans++ initialization. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii. Scalable K-means++. Proceedings of the VLDB Endowment, 2012. .. seealso:: :func:`kmeansPredict`, :func:`kmeansControlCreate`