cleands.Clustering.kmeans module
k-means clustering models.
This module implements a basic version of k-means clustering (simple_k_means) and a multi-start variant (k_means). It also includes helper functions to evaluate the optimal number of clusters using the total within-group sum of squares (TWSS) and the “elbow method”.
- Classes:
- simple_k_means:
Basic k-means clustering algorithm with iterative centroid updates.
- k_means:
Extension of simple_k_means that runs multiple random initializations (n_start) and selects the solution with the lowest TWSS.
- Functions:
- total_within_group_sum_of_squares_for_different_k:
Computes TWSS across different k values (1..k_max).
- select_k:
Heuristic to choose optimal k using the elbow method based on TWSS ratios.
- class cleands.Clustering.kmeans.simple_k_means(x, k, max_iters=100, seed=None)[source]
Bases: clustering_model
Basic k-means clustering.
Iteratively assigns points to the nearest cluster centroid and updates centroids until convergence or the maximum number of iterations is reached.
- Parameters:
x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of clusters.
max_iters (int, optional) – Maximum iterations. Defaults to 100.
seed (int | None, optional) – Random seed for reproducibility. Defaults to None.
- Variables:
n_clusters (int) – Number of clusters.
iters (int) – Number of iterations performed.
_means (np.ndarray) – Cluster centroids of shape (k, n_features).
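The assign-then-update loop described for simple_k_means can be sketched in plain NumPy. This is a minimal illustration, not the cleands source: the function name lloyd_k_means, the choice of k random observations as starting centroids, the stopping rule (assignments no longer change), and the handling of empty clusters are all assumptions.

```python
import numpy as np

def lloyd_k_means(x, k, max_iters=100, seed=None):
    """Illustrative k-means loop (hypothetical helper, not the cleands code)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct observations picked at random.
    means = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.full(len(x), -1)
    iters = 0
    for iters in range(1, max_iters + 1):
        # Squared Euclidean distance from every point to every centroid,
        # shape (n_samples, k).
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                means[j] = members.mean(axis=0)
    return means, labels, iters
```

The three returned values correspond loosely to the documented _means, the cluster assignments, and iters.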
- class cleands.Clustering.kmeans.k_means(x, k, max_iters=100, seed=None, n_start=10)[source]
Bases: simple_k_means
Multi-start k-means clustering.
Runs multiple random initializations (n_start) of simple_k_means and selects the model with the lowest total within-group sum of squares (TWSS).
- Parameters:
x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of clusters.
max_iters (int, optional) – Maximum iterations for each run. Defaults to 100.
seed (int | None, optional) – Random seed. Defaults to None.
n_start (int, optional) – Number of random initializations. Defaults to 10.
- Variables:
n_clusters (int) – Number of clusters.
iters (int) – Iterations used by the best model.
means (np.ndarray) – Centroids of the best solution.
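The multi-start idea can be illustrated as follows: repeat a single k-means run n_start times from different random starting centroids and keep the run with the smallest TWSS. The helper names _single_run and multi_start_k_means are hypothetical, and sharing one random generator across runs is an assumption about how the seed might be used, not a description of the cleands internals.

```python
import numpy as np

def _single_run(x, k, max_iters, rng):
    """One k-means run from a random start (hypothetical helper)."""
    means = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_means = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break  # centroids stopped moving
        means = new_means
    # Recompute assignments so labels and means are consistent on exit.
    labels = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    twss = float(((x - means[labels]) ** 2).sum())
    return means, labels, twss

def multi_start_k_means(x, k, max_iters=100, seed=None, n_start=10):
    """Keep the run with the lowest TWSS out of n_start random starts."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_start):
        run = _single_run(x, k, max_iters, rng)
        if best is None or run[2] < best[2]:
            best = run
    return best  # (means, labels, twss) of the best run
```

Because k-means only finds a local minimum of TWSS, more starts reduce the chance of reporting a poor solution, at a cost that grows linearly in n_start.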
- cleands.Clustering.kmeans.total_within_group_sum_of_squares_for_different_k(x, k_max=10, *args, **kwargs)[source]
Compute TWSS across different values of k.
- Parameters:
x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k_max (int, optional) – Maximum number of clusters to evaluate. Defaults to 10.
*args – Passed to k_means.
**kwargs – Passed to k_means.
- Returns:
TWSS values indexed by k (1..k_max).
- Return type:
np.ndarray
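For reference, TWSS for a given clustering is the sum, over all observations, of the squared Euclidean distance to the assigned centroid. The sketch below computes it from an assignment vector and a centroid matrix. The function name twss_from_assignment is hypothetical and is not part of the module.

```python
import numpy as np

def twss_from_assignment(x, labels, means):
    """Total within-group sum of squares for one clustering (hypothetical helper).

    x:      (n_samples, n_features) data matrix
    labels: (n_samples,) integer cluster index per observation
    means:  (k, n_features) centroid per cluster
    """
    # means[labels] lines up each observation with its own centroid.
    return float(((x - means[labels]) ** 2).sum())

x = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# k = 1: every point belongs to one cluster centred at the grand mean.
twss_k1 = twss_from_assignment(x, np.zeros(6, dtype=int), x.mean(axis=0, keepdims=True))

# k = 2: the two visually obvious groups, each centred at its own mean.
labels2 = np.array([0, 0, 0, 1, 1, 1])
means2 = np.array([x[:3].mean(axis=0), x[3:].mean(axis=0)])
twss_k2 = twss_from_assignment(x, labels2, means2)
```

With k = 1, TWSS equals the total sum of squares around the grand mean, and the sharp fall from twss_k1 to twss_k2 in this toy data is the kind of drop the elbow method looks for.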
- cleands.Clustering.kmeans.select_k(x, k_max=10, *args, **kwargs)[source]
Select the optimal number of clusters using the elbow method.
Uses the ratio of successive TWSS differences to detect the “elbow point.”
- Parameters:
x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k_max (int, optional) – Maximum number of clusters to test. Defaults to 10.
*args – Passed to k_means.
**kwargs – Passed to k_means.
- Returns:
Estimated optimal number of clusters.
- Return type:
int
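The exact rule select_k applies is not spelled out here. One plausible reading of "ratio of successive TWSS differences" is sketched below: compute the drop in TWSS gained by each additional cluster, then pick the k at which the drop just achieved is largest relative to the next one. The function name elbow_from_twss and the tie-breaking and zero-division guards are assumptions for illustration only.

```python
import numpy as np

def elbow_from_twss(twss):
    """Pick k at the elbow of a TWSS curve (hypothetical helper).

    twss[i] is assumed to hold the TWSS for k = i + 1.
    """
    twss = np.asarray(twss, dtype=float)
    if twss.size < 3:
        raise ValueError("need TWSS for at least k = 1, 2, 3")
    # drops[i]: improvement from moving from k = i + 1 to k = i + 2 clusters.
    drops = twss[:-1] - twss[1:]
    # ratios[i]: gain from reaching k = i + 2 divided by the gain from going
    # one cluster further; the floor avoids division by zero or negatives.
    ratios = drops[:-1] / np.maximum(drops[1:], 1e-12)
    return int(np.argmax(ratios)) + 2

# Drops are 60, 30, 2, 1, 0.5, so the ratios are 2, 15, 2, 2 and k = 3 wins.
best_k = elbow_from_twss([100.0, 40.0, 10.0, 8.0, 7.0, 6.5])
print(best_k)  # 3
```

A rule of this form can never return k = 1, and it is sensitive to noise in the TWSS values, so treating the result as a starting point for inspection is safer than treating it as a definitive answer.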
- class cleands.Clustering.kmeans.kMeans(formula, data, *args, **kwargs)[source]
Bases: ClusteringModel
Convenience wrapper for k-means clustering.
The k-means algorithm partitions observations into a fixed number of clusters by minimizing the within-cluster sum of squares. This wrapper provides a formula/DataFrame interface for k_means.
- Variables:
MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to k_means.
- Parameters:
formula (str)
data (DataFrame)
Example
>>> model = kMeans.from_formula("~ x1 + x2", data=df, k=3)
>>> clusters = model.cluster_from_df(df)
>>> model.means  # cluster centroids