cleands.Clustering.kmeans module

k-means clustering models.

This module implements a basic version of k-means clustering (simple_k_means) and a multi-start variant (k_means). It also includes helper functions to evaluate the optimal number of clusters using the total within-group sum of squares (TWSS) and the “elbow method”.

Classes:
simple_k_means:

Basic k-means clustering algorithm with iterative centroid updates.

k_means:

Extension of simple_k_means that runs multiple random initializations (n_start) and selects the solution with the lowest TWSS.

Functions:
total_within_group_sum_of_squares_for_different_k:

Computes TWSS across different k values (1..k_max).

select_k:

Heuristic to choose optimal k using the elbow method based on TWSS ratios.

class cleands.Clustering.kmeans.simple_k_means(x, k, max_iters=100, seed=None)[source]

Bases: clustering_model

Basic k-means clustering.

Iteratively assigns points to the nearest cluster centroid and updates centroids until convergence or the maximum number of iterations is reached.

Parameters:
  • x (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • k (int) – Number of clusters.

  • max_iters (int, optional) – Maximum iterations. Defaults to 100.

  • seed (int | None, optional) – Random seed for reproducibility. Defaults to None.

Variables:
  • n_clusters (int) – Number of clusters.

  • iters (int) – Number of iterations performed.

  • _means (np.ndarray) – Cluster centroids of shape (k, n_features).
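
The iteration described above (assign each point to its nearest centroid, then recompute the centroids) can be sketched with NumPy. This is an illustrative sketch, not the cleands source: the function name lloyd_k_means, the initialization from k randomly chosen rows, the handling of empty clusters, and the convergence test are all assumptions.

```python
import numpy as np

def lloyd_k_means(x, k, max_iters=100, seed=None):
    """Illustrative k-means loop (hypothetical helper, not the cleands code)."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centroids are k distinct rows sampled from x.
    means = x[rng.choice(x.shape[0], size=k, replace=False)].astype(float)
    labels = np.zeros(x.shape[0], dtype=int)
    iters = 0
    for iters in range(1, max_iters + 1):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new_means = means.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its centroid
                new_means[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels, iters
```

The returned tuple mirrors the documented attributes: centroids of shape (k, n_features), one label per row of x, and the number of iterations performed.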

cluster(newx)[source]

Cluster new data based on learned centroids.

Parameters:

newx (np.ndarray) – New data matrix of shape (m, n_features).

Returns:

Cluster assignments of shape (m,).

Return type:

np.ndarray
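
Assigning new observations to learned centroids amounts to a nearest-centroid lookup. The sketch below assumes squared Euclidean distance; assign_to_nearest is a hypothetical stand-alone function used for illustration, not the cluster method itself.

```python
import numpy as np

def assign_to_nearest(newx, means):
    """Return, for each row of newx, the index of the closest centroid."""
    # Pairwise squared Euclidean distances, shape (m, k).
    dists = ((newx[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

means = np.array([[0.0, 0.0], [10.0, 10.0]])              # k = 2 centroids
newx = np.array([[1.0, -1.0], [9.0, 11.0], [4.0, 4.0]])   # m = 3 new points
labels = assign_to_nearest(newx, means)
print(labels)  # [0 1 0]
```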

class cleands.Clustering.kmeans.k_means(x, k, max_iters=100, seed=None, n_start=10)[source]

Bases: simple_k_means

Multi-start k-means clustering.

Runs multiple random initializations (n_start) of simple_k_means and selects the model with the lowest total within-group sum of squares (TWSS).

Parameters:
  • x (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • k (int) – Number of clusters.

  • max_iters (int, optional) – Maximum iterations for each run. Defaults to 100.

  • seed (int | None, optional) – Random seed. Defaults to None.

  • n_start (int, optional) – Number of random initializations. Defaults to 10.

Variables:
  • n_clusters (int) – Number of clusters.

  • iters (int) – Iterations used by the best model.

  • means (np.ndarray) – Centroids of the best solution.
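
The multi-start idea can be sketched as follows: run k-means from several random starts and keep the run with the smallest TWSS. The helper names single_run and best_of_n_starts are hypothetical, and how cleands derives a seed for each start is not stated in this documentation; here a single generator is simply shared across the starts.

```python
import numpy as np

def single_run(x, k, max_iters, rng):
    """One k-means run from a random start (illustrative helper)."""
    means = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        labels = ((x[:, None] - means[None]) ** 2).sum(axis=2).argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_means = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Final assignment and TWSS for the centroids that were kept.
    labels = ((x[:, None] - means[None]) ** 2).sum(axis=2).argmin(axis=1)
    twss = float(((x - means[labels]) ** 2).sum())
    return means, labels, twss

def best_of_n_starts(x, k, max_iters=100, seed=None, n_start=10):
    """Keep the run with the smallest TWSS across n_start random starts."""
    rng = np.random.default_rng(seed)
    runs = [single_run(x, k, max_iters, rng) for _ in range(n_start)]
    return min(runs, key=lambda run: run[2])
```

Because k-means only finds a local optimum, the best of several starts can never have a higher TWSS than the first start alone, which is the motivation for n_start.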

cleands.Clustering.kmeans.total_within_group_sum_of_squares_for_different_k(x, k_max=10, *args, **kwargs)[source]

Compute TWSS across different values of k.

Parameters:
  • x (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • k_max (int, optional) – Maximum number of clusters to evaluate. Defaults to 10.

  • *args – Passed to k_means.

  • **kwargs – Passed to k_means.

Returns:

TWSS values indexed by k (1..k_max).

Return type:

np.ndarray
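
For reference, TWSS is the sum of squared distances from each observation to its assigned centroid. The small worked example below uses hand-chosen labels and centroids rather than a fitted model; the function name total_within_group_sum_of_squares is an assumption made for illustration.

```python
import numpy as np

def total_within_group_sum_of_squares(x, labels, means):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((x - means[labels]) ** 2).sum())

x = np.array([[0.0], [2.0], [10.0], [14.0]])

# k = 1: every point belongs to one centroid at the overall mean (6.5).
twss_1 = total_within_group_sum_of_squares(
    x, np.zeros(4, dtype=int), np.array([[6.5]]))

# k = 2: the split {0, 2} and {10, 14} has centroids 1 and 12.
twss_2 = total_within_group_sum_of_squares(
    x, np.array([0, 0, 1, 1]), np.array([[1.0], [12.0]]))

print(twss_1, twss_2)  # 131.0 10.0
```

Evaluating this quantity for fitted models with k = 1..k_max yields the curve that the elbow method inspects.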

cleands.Clustering.kmeans.select_k(x, k_max=10, *args, **kwargs)[source]

Select the optimal number of clusters using the elbow method.

Uses the ratio of successive TWSS differences to detect the “elbow point.”

Parameters:
  • x (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • k_max (int, optional) – Maximum number of clusters to test. Defaults to 10.

  • *args – Passed to k_means.

  • **kwargs – Passed to k_means.

Returns:

Estimated optimal number of clusters.

Return type:

int
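
One way to turn "ratio of successive TWSS differences" into a rule is sketched below. The exact criterion used by select_k is not spelled out here, so this is only an assumed variant: elbow_k is a hypothetical function that picks the k after which the improvement from adding one more cluster shrinks the most. It assumes a strictly decreasing TWSS curve with at least three values.

```python
import numpy as np

def elbow_k(twss):
    """Pick k where the TWSS curve bends most sharply (one possible rule).

    twss[i] is the TWSS for k = i + 1, so twss covers k = 1..k_max.
    """
    twss = np.asarray(twss, dtype=float)
    # drops[i] is the TWSS reduction from k = i + 1 to k = i + 2.
    drops = twss[:-1] - twss[1:]
    # ratios[i] compares the gain from going beyond k = i + 2 with the
    # gain from reaching k = i + 2; a small ratio marks an elbow there.
    ratios = drops[1:] / drops[:-1]
    return int(np.argmin(ratios)) + 2

twss = [100.0, 40.0, 10.0, 8.0, 7.0, 6.5]
print(elbow_k(twss))  # 3
```

In this example the drops are 60, 30, 2, 1 and 0.5, so the sharpest slowdown occurs after k = 3.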

class cleands.Clustering.kmeans.kMeans(formula, data, *args, **kwargs)[source]

Bases: ClusteringModel

Convenience wrapper for k-means clustering.

The k-means algorithm partitions observations into a fixed number of clusters by minimizing the within-cluster sum of squares. This wrapper provides a formula/DataFrame interface to k_means.

Variables:

MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to k_means.

Parameters:
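  • formula (str) – Formula naming the feature columns, e.g. "~ x1 + x2".

  • data (DataFrame) – Data containing the columns referenced in formula.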

Example

>>> model = kMeans.from_formula("~ x1 + x2", data=df, k=3)
>>> clusters = model.cluster_from_df(df)
>>> model.means  # cluster centroids