cleands.DimensionReduction.pca module

pca.py

Implements dimension reduction models including Principal Components Analysis (PCA) and Canonical Correlation Analysis (CCA).

Classes:

principal_components_analysis:: Standard PCA for unsupervised dimension reduction, with centering and scaling options.
canonical_correlation_analysis:: Supervised dimension reduction that extracts maximally correlated projections of two datasets.

Functions:

select_k:: Eigenvalue ratio test to automatically select the number of components.

Notes

PCA is fit via Singular Value Decomposition (SVD).
CCA is fit via whitening of covariance matrices and SVD of the cross-covariance.

class cleands.DimensionReduction.pca.principal_components_analysis(x, k, *, center=True, scale=False)[source]

Bases: dimension_reduction_model

Principal Components Analysis (PCA).

Reduces dimensionality by projecting the data onto the top k principal components, the orthogonal directions along which the data vary most. Also provides explained-variance statistics.

Parameters:

x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of components to extract. Must be in [1, min(n_samples, n_features)].
center (bool, optional) – Whether to mean-center the data. Defaults to True.
scale (bool, optional) – Whether to scale variables to unit variance. Defaults to False.

Variables:

mean (np.ndarray) – Feature means (if centered).
scale (np.ndarray) – Feature scales (if scaled).
singular_values (np.ndarray) – Singular values from SVD.
rotation (np.ndarray) – Principal component loadings, shape (n_features, k).
components (np.ndarray) – Projected data (scores), shape (n_samples, k).
explained_variance (np.ndarray) – Variance explained by each component.
explained_variance_ratio (np.ndarray) – Proportion of variance explained.
k (int) – Number of components retained.

reduce(target)[source]

Project new data into the PCA space.

Parameters:: target (np.ndarray) – New data of shape (m, n_features).
Returns:: Reduced representation of shape (m, k).
Return type:: np.ndarray
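
As an illustration of the SVD-based fit and of projecting new data, here is a minimal NumPy sketch. It is not the cleands implementation: the function name pca_svd, the returned dictionary, and the use of the sample variance (division by n - 1) are assumptions made for this example.

```python
import numpy as np

def pca_svd(x, k, center=True, scale=False):
    """Illustrative PCA via SVD (hypothetical helper, not the cleands code)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    mean = x.mean(axis=0) if center else np.zeros(x.shape[1])
    # Assumption: scaling uses the sample standard deviation (ddof=1).
    std = x.std(axis=0, ddof=1) if scale else np.ones(x.shape[1])
    z = (x - mean) / std
    # Thin SVD: z = U diag(s) Vt, singular values in descending order.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    rotation = vt[:k].T             # loadings, shape (n_features, k)
    components = z @ rotation       # scores, shape (n_samples, k)
    variances = s**2 / (n - 1)      # variance along every component
    return {
        "mean": mean,
        "scale": std,
        "singular_values": s[:k],
        "rotation": rotation,
        "components": components,
        "explained_variance": variances[:k],
        "explained_variance_ratio": variances[:k] / variances.sum(),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
fit = pca_svd(x, k=2)

# Projecting new data reuses the stored mean, scale, and rotation.
x_new = rng.normal(size=(5, 4))
z_new = ((x_new - fit["mean"]) / fit["scale"]) @ fit["rotation"]
```

The new rows are centered and scaled with the training statistics rather than their own, so the reduced coordinates stay comparable to the fitted scores.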

cleands.DimensionReduction.pca.select_k(eigs, k_max=None, include_zero=True, allow_zero=False)[source]

Eigenvalue ratio test for choosing number of components.

Uses the eigenvalue ratio test of Ahn and Horenstein (2013) to select k. Optionally includes a “zeroth” eigenvalue based on the average.

Parameters:

eigs (np.ndarray) – Eigenvalues or variances in descending order.
k_max (int, optional) – Maximum number of components to consider. Defaults to all.
include_zero (bool, optional) – Include artificial λ₀ for ratio. Defaults to True.
allow_zero (bool, optional) – Allow selection of k=0. Defaults to False.

Returns:

Selected number of components.

Return type:

int
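
The core idea of the ratio test can be sketched as follows: pick the k at which the ratio of consecutive eigenvalues is largest. This is a simplified illustration only. The name eigenvalue_ratio_k is hypothetical, and the sketch deliberately omits the artificial λ₀ and the k=0 option, whose exact handling in select_k is not documented here.

```python
import numpy as np

def eigenvalue_ratio_k(eigs, k_max=None):
    """Simplified eigenvalue ratio rule (hypothetical helper, not select_k)."""
    eigs = np.asarray(eigs, dtype=float)
    # The ratio for k needs eigenvalue k + 1, so at most len(eigs) - 1 candidates.
    limit = len(eigs) - 1
    k_max = limit if k_max is None else min(k_max, limit)
    # ratios[j] compares eigenvalue j + 1 with eigenvalue j + 2 (1-based).
    ratios = eigs[:k_max] / eigs[1:k_max + 1]
    return int(np.argmax(ratios)) + 1

# Two dominant eigenvalues followed by a sharp drop.
eigs = np.array([10.0, 8.0, 0.5, 0.4, 0.3])
k_hat = eigenvalue_ratio_k(eigs)
```

Here the ratios are 1.25, 16, 1.25 and about 1.33, so the largest gap sits after the second eigenvalue and k_hat is 2.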

class cleands.DimensionReduction.pca.canonical_correlation_analysis(X, Y, k, *, center=True, scale=False, reg=1e-6)[source]

Bases: supervised_dimension_reduction_model

Canonical Correlation Analysis (CCA).

Finds linear projections of X and Y that maximize their correlation. Useful for studying relationships between two multivariate datasets.

Parameters:

X (np.ndarray) – Predictor data matrix of shape (n, p).
Y (np.ndarray) – Response data matrix of shape (n, q).
k (int) – Number of canonical variates to compute.
center (bool, optional) – Whether to mean-center X and Y. Defaults to True.
scale (bool, optional) – Whether to scale the variables in X and Y to unit variance. Defaults to False.
reg (float, optional) – Regularization term added to covariance diagonals. Defaults to 1e-6.

Variables:

mean_x (np.ndarray) – Mean of X features.
mean_y (np.ndarray) – Mean of Y features.
scale_x (np.ndarray) – Scaling factors for X.
scale_y (np.ndarray) – Scaling factors for Y.
canonical_correlations (np.ndarray) – Canonical correlations.
rotation_x (np.ndarray) – Canonical directions for X (p, k).
rotation_y (np.ndarray) – Canonical directions for Y (q, k).
components_x (np.ndarray) – Canonical variates for X (n, k).
components_y (np.ndarray) – Canonical variates for Y (n, k).
k (int) – Number of canonical variates retained.

reduce_X(x_new)[source]

Project new X data into the canonical variate space.

Parameters:: x_new (np.ndarray) – New X data, shape (m, p).
Returns:: Canonical variates of shape (m, k).
Return type:: np.ndarray

reduce_Y(y_new)[source]

Project new Y data into the canonical variate space.

Parameters:: y_new (np.ndarray) – New Y data, shape (m, q).
Returns:: Canonical variates of shape (m, k).
Return type:: np.ndarray
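
The whitening-plus-SVD approach mentioned in the module notes can be sketched in NumPy as below. This is an illustrative reconstruction under stated assumptions, not the cleands source: the helper names inv_sqrt and cca_whiten_svd, the symmetric inverse square root used for whitening, and the n - 1 covariance normalization are all choices made for this example.

```python
import numpy as np

def inv_sqrt(a):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v / np.sqrt(w)) @ v.T

def cca_whiten_svd(X, Y, k, reg=1e-6):
    """Illustrative CCA via whitening and SVD (hypothetical helper)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariances and the cross-covariance.
    sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    sxy = X.T @ Y / (n - 1)
    wx, wy = inv_sqrt(sxx), inv_sqrt(syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(wx @ sxy @ wy, full_matrices=False)
    rotation_x = wx @ u[:, :k]      # shape (p, k)
    rotation_y = wy @ vt[:k].T      # shape (q, k)
    return s[:k], rotation_x, rotation_y, X @ rotation_x, Y @ rotation_y

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.5 * rng.normal(size=(200, 2))
corrs, rot_x, rot_y, comp_x, comp_y = cca_whiten_svd(X, Y, k=2)

# New X data is projected with the training mean and the X rotation.
x_new = rng.normal(size=(5, 3))
z_new = (x_new - X.mean(axis=0)) @ rot_x
```

With a small reg, the sample correlation between the first pair of canonical variates closely matches the first singular value; larger values of reg shrink the reported correlations.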

class cleands.DimensionReduction.pca.PrincipalComponentsAnalysis(formula, data, *args, **kwargs)[source]

Bases: DimensionReductionModel

Convenience wrapper for principal components analysis (PCA).

PCA reduces the dimensionality of data by projecting it onto a set of orthogonal components that capture the maximum variance. This wrapper provides a formula/DataFrame interface to principal_components_analysis.

Variables:

MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to principal_components_analysis.

Parameters:

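formula (str) – Model formula naming the variables to include, for example "~ x1 + x2 + x3".
data (DataFrame) – DataFrame containing the variables named in the formula.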

Example

>>> model = PrincipalComponentsAnalysis.from_formula("~ x1 + x2 + x3", data=df)
>>> Z = model.reduce(df[["x1", "x2", "x3"]])   # reduced components
>>> model.glance  # variance explained and diagnostics