cleands.Clustering.kmeans module

k-means clustering models.

This module implements a basic version of k-means clustering (simple_k_means) and a multi-start variant (k_means). It also includes helper functions for choosing the number of clusters from the total within-group sum of squares (TWSS) via the “elbow method”.

Classes:

simple_k_means – Basic k-means clustering algorithm with iterative centroid updates.
k_means – Extension of simple_k_means that runs multiple random initializations (n_start) and selects the solution with the lowest TWSS.
kMeans – Convenience wrapper that provides a formula/DataFrame interface to k_means.

Functions:

total_within_group_sum_of_squares_for_different_k – Computes TWSS across different k values (1..k_max).
select_k – Heuristic to choose optimal k using the elbow method based on TWSS ratios.

class cleands.Clustering.kmeans.simple_k_means(x, k, max_iters=100, seed=None)

Bases: clustering_model

Basic k-means clustering.

Iteratively assigns points to the nearest cluster centroid and updates centroids until convergence or the maximum number of iterations is reached.

Parameters:

x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of clusters.
max_iters (int, optional) – Maximum iterations. Defaults to 100.
seed (int | None, optional) – Random seed for reproducibility. Defaults to None.

Variables:

n_clusters (int) – Number of clusters.
iters (int) – Number of iterations performed.
_means (np.ndarray) – Cluster centroids of shape (k, n_features).

cluster(newx)

Cluster new data based on learned centroids.

Parameters:

newx (np.ndarray) – New data matrix of shape (m, n_features).

Returns:

Cluster assignments of shape (m,).

Return type:

np.ndarray
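
The assign-then-update loop described for simple_k_means, and the nearest-centroid rule that cluster(newx) applies, can be sketched in NumPy as follows. This is a minimal illustration, not the cleands source: the names lloyd_kmeans and assign are hypothetical, and the initialization (k distinct rows drawn at random), the empty-cluster handling, and the convergence test (centroids stop moving) are all assumptions.

```python
import numpy as np


def assign(x, means):
    """Label each row of x with the index of its nearest centroid."""
    # Squared Euclidean distances, shape (n_rows, k), via broadcasting.
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def lloyd_kmeans(x, k, max_iters=100, seed=None):
    """Hypothetical stand-in for one k-means run; returns (means, labels, iters)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data rows chosen at random.
    means = x[rng.choice(len(x), size=k, replace=False)]
    iters = 0
    for iters in range(1, max_iters + 1):
        labels = assign(x, means)
        new_means = means.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_means[j] = members.mean(axis=0)
        if np.allclose(new_means, means):  # centroids stopped moving
            break
        means = new_means
    return means, assign(x, means), iters
```

Under these assumptions, clustering new data reduces to assign(newx, means): no refitting takes place, and each new row simply receives the label of its closest learned centroid.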

class cleands.Clustering.kmeans.k_means(x, k, max_iters=100, seed=None, n_start=10)

Bases: simple_k_means

Multi-start k-means clustering.

Runs multiple random initializations (n_start) of simple_k_means and selects the model with the lowest total within-group sum of squares (TWSS).

Parameters:

x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of clusters.
max_iters (int, optional) – Maximum iterations for each run. Defaults to 100.
seed (int | None, optional) – Random seed. Defaults to None.
n_start (int, optional) – Number of random initializations. Defaults to 10.

Variables:

n_clusters (int) – Number of clusters.
iters (int) – Iterations used by the best model.
means (np.ndarray) – Centroids of the best solution.
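
The multi-start idea (run several random initializations, keep the run with the lowest TWSS) can be sketched as below. The names single_run and multi_start are hypothetical, and sharing a single random generator across all starts is an assumption; the library may seed its runs differently.

```python
import numpy as np


def single_run(x, k, max_iters, rng):
    """One k-means run from a random start; returns (twss, means, labels)."""
    # Assumption: initial centroids are k distinct data rows.
    means = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(max_iters):
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_means = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Final assignment and TWSS: squared distance of each row to its centroid.
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum(), means, d2.argmin(axis=1)


def multi_start(x, k, max_iters=100, seed=None, n_start=10):
    """Keep the best of n_start runs, judged by lowest TWSS."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)  # assumption: one generator feeds every start
    runs = [single_run(x, k, max_iters, rng) for _ in range(n_start)]
    return min(runs, key=lambda run: run[0])
```

Because a single run can stall in a local optimum, more starts can only lower (never raise) the best TWSS found for a given seed, at the cost of proportionally more computation.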

cleands.Clustering.kmeans.total_within_group_sum_of_squares_for_different_k(x, k_max=10, *args, **kwargs)

Compute TWSS across different values of k.

Parameters:

x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k_max (int, optional) – Maximum number of clusters to evaluate. Defaults to 10.
*args – Passed to k_means.
**kwargs – Passed to k_means.

Returns:

TWSS values indexed by k (1..k_max).

Return type:

np.ndarray
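
For a single partition, TWSS is the sum of squared distances from each observation to the mean of its cluster. The sketch below computes it for hand-specified labels, which stand in for the output of a fitted model; the helper name twss is hypothetical, and it assumes each centroid equals its cluster mean, as is the case once k-means has converged.

```python
import numpy as np


def twss(x, labels):
    """Total within-group sum of squares for one partition of x."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        members = x[labels == j]
        # Squared deviations of the cluster's rows from the cluster mean.
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(twss(x, [0, 0, 0, 0]))  # k = 1: 104.0
print(twss(x, [0, 0, 1, 1]))  # k = 2: 4.0
```

Evaluating this quantity for the best partition at each k from 1 to k_max yields the curve that the elbow method inspects.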

cleands.Clustering.kmeans.select_k(x, k_max=10, *args, **kwargs)

Select the optimal number of clusters using the elbow method.

Uses the ratio of successive TWSS differences to detect the “elbow point.”

Parameters:

x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k_max (int, optional) – Maximum number of clusters to test. Defaults to 10.
*args – Passed to k_means.
**kwargs – Passed to k_means.

Returns:

Estimated optimal number of clusters.

Return type:

int
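
One plausible reading of the "ratio of successive TWSS differences" rule is sketched below. The exact criterion used by select_k is not stated here, so the rule (pick the k where the drop into k is largest relative to the drop out of k) and the helper name elbow_k are assumptions made for illustration.

```python
import numpy as np


def elbow_k(twss):
    """Pick k from TWSS values where twss[0] corresponds to k = 1."""
    twss = np.asarray(twss, dtype=float)
    if twss.size < 3:
        raise ValueError("need TWSS for at least k = 1, 2, 3")
    # drops[i] is the improvement gained by moving from k = i + 1 to k = i + 2.
    drops = twss[:-1] - twss[1:]
    # Guard against division by zero when the curve is flat.
    eps = np.finfo(float).eps
    # ratios[i] compares the drop into k = i + 2 with the drop out of it.
    ratios = drops[:-1] / np.maximum(drops[1:], eps)
    return int(np.argmax(ratios)) + 2


print(elbow_k([104.0, 4.0, 3.0, 2.5, 2.2]))  # 2
```

Under this rule the candidates are k = 2 through k_max - 1, since a ratio needs a drop on both sides; a curve with no clear bend will still return some k, so the result is worth checking against a plot of the TWSS values.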

class cleands.Clustering.kmeans.kMeans(formula, data, *args, **kwargs)

Bases: ClusteringModel

Convenience wrapper for k-means clustering.

The k-means algorithm partitions observations into a fixed number of clusters by minimizing the within-cluster sum of squares. This wrapper provides a formula/DataFrame interface to k_means.

Variables:

MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to k_means.

Parameters:

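formula (str) – Model formula naming the feature columns, e.g. "~ x1 + x2".
data (DataFrame) – Data containing the columns referenced in formula.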

Example

>>> model = kMeans.from_formula("~ x1 + x2", data=df, k=3)
>>> clusters = model.cluster_from_df(df)
>>> model.means  # cluster centroids