cleands.DimensionReduction.pca module

pca.py

Implements dimension reduction models including Principal Components Analysis (PCA) and Canonical Correlation Analysis (CCA).

Classes:
principal_components_analysis:

Standard PCA for unsupervised dimension reduction, with centering and scaling options.

canonical_correlation_analysis:

Supervised dimension reduction that extracts correlated projections between two datasets.

Functions:
select_k:

Eigenvalue ratio test to automatically select the number of components.

Notes

  • PCA is fit via Singular Value Decomposition (SVD).

  • CCA is fit via whitening of covariance matrices and SVD of the cross-covariance.

class cleands.DimensionReduction.pca.principal_components_analysis(x, k, *, center=True, scale=False)[source]

Bases: dimension_reduction_model

Principal Components Analysis (PCA).

Reduces dimensionality by projecting the data onto the top k principal components, the orthogonal directions along which the data has the greatest variance. Also reports the variance explained by each component.

Parameters:
  • x (np.ndarray) – Data matrix of shape (n_samples, n_features).

  • k (int) – Number of components to extract. Must be in [1, min(n_samples, n_features)].

  • center (bool, optional) – Whether to mean-center the data. Defaults to True.

  • scale (bool, optional) – Whether to scale variables to unit variance. Defaults to False.

Variables:
  • mean (np.ndarray) – Feature means (if centered).

  • scale (np.ndarray) – Feature scales (if scaled).

  • singular_values (np.ndarray) – Singular values from SVD.

  • rotation (np.ndarray) – Principal component loadings, shape (n_features, k).

  • components (np.ndarray) – Projected data (scores), shape (n_samples, k).

  • explained_variance (np.ndarray) – Variance explained by each component.

  • explained_variance_ratio (np.ndarray) – Proportion of variance explained.

  • k (int) – Number of components retained.
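As a rough illustration of how the attributes above relate to an SVD-based fit, the following NumPy sketch computes the same quantities. It is not the cleands implementation: the function name pca_svd_sketch, the returned dictionary, and the use of the n - 1 divisor for variances are all assumptions made for this example.

```python
import numpy as np

def pca_svd_sketch(x, k, center=True, scale=False):
    """Illustrative PCA via SVD (hypothetical helper, not part of cleands)."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    if not 1 <= k <= min(n, p):
        raise ValueError("k must be in [1, min(n_samples, n_features)]")
    # Column means and standard deviations; neutral values when disabled.
    mean = x.mean(axis=0) if center else np.zeros(p)
    std = x.std(axis=0, ddof=1) if scale else np.ones(p)  # assumes no constant columns
    z = (x - mean) / std
    # Thin SVD: z = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    rotation = vt[:k].T              # loadings, shape (p, k)
    components = z @ rotation        # scores, shape (n, k)
    variances = s**2 / (n - 1)       # variance along every component (assumed divisor)
    return {
        "mean": mean,
        "scale": std,
        "singular_values": s[:k],
        "rotation": rotation,
        "components": components,
        "explained_variance": variances[:k],
        "explained_variance_ratio": variances[:k] / variances.sum(),
    }
```

With centering enabled, the sample variance of each score column equals the corresponding entry of explained_variance, which is a quick sanity check on any PCA fit.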

reduce(target)[source]

Project new data into the PCA space.

Parameters:

target (np.ndarray) – New data of shape (m, n_features).

Returns:

Reduced representation of shape (m, k).

Return type:

np.ndarray
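Projecting new data typically means applying the centering and scaling learned from the training data and then multiplying by the loadings. The sketch below shows that step in plain NumPy; reduce_sketch and its arguments are hypothetical names for this example, and whether cleands applies exactly these steps is an assumption.

```python
import numpy as np

def reduce_sketch(target, mean, scale, rotation):
    """Project rows of `target` onto stored loadings (hypothetical helper)."""
    target = np.asarray(target, dtype=float)
    # Reuse the training mean and scale; never re-estimate them on new data.
    return ((target - mean) / scale) @ rotation

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
mean = train.mean(axis=0)
scale = np.ones(4)                       # no scaling in this example
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
rotation = vt[:2].T                      # keep k = 2 components

new_scores = reduce_sketch(rng.normal(size=(5, 4)), mean, scale, rotation)
print(new_scores.shape)                  # (5, 2)
```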

cleands.DimensionReduction.pca.select_k(eigs, k_max=None, include_zero=True, allow_zero=False)[source]

Eigenvalue ratio test for choosing number of components.

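Uses the eigenvalue ratio test of Ahn and Horenstein (2013): the selected k is the one that maximizes the ratio of consecutive eigenvalues, λ_k / λ_{k+1}. Optionally includes an artificial “zeroth” eigenvalue λ₀, based on the average, so that a ratio λ₀ / λ₁ can also be formed.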

Parameters:
  • eigs (np.ndarray) – Eigenvalues or variances in descending order.

  • k_max (int, optional) – Maximum number of components to consider. Defaults to None, which considers all components.

  • include_zero (bool, optional) – Include artificial λ₀ for ratio. Defaults to True.

  • allow_zero (bool, optional) – Allow selection of k=0. Defaults to False.

Returns:

Selected number of components.

Return type:

int
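A minimal sketch of an eigenvalue ratio rule may help make the selection concrete. The function eigenvalue_ratio_k below is hypothetical and deliberately simpler than select_k: it has no include_zero flag, and its artificial λ₀ is taken as sum(eigs) / log(len(eigs)), which is an assumption for this example and may not match what cleands computes.

```python
import numpy as np

def eigenvalue_ratio_k(eigs, k_max=None, allow_zero=False):
    """Pick k maximizing eigs[k-1] / eigs[k] (hypothetical helper, not select_k)."""
    eigs = np.asarray(eigs, dtype=float)
    if eigs.ndim != 1 or eigs.size < 2:
        raise ValueError("need a 1-D array with at least two eigenvalues")
    # A ratio needs a following eigenvalue, so k can be at most len(eigs) - 1.
    k_max = eigs.size - 1 if k_max is None else min(k_max, eigs.size - 1)
    if k_max < 1:
        raise ValueError("k_max must be at least 1")
    if np.any(eigs[: k_max + 1] <= 0):
        raise ValueError("eigenvalues used in the ratios must be positive")
    # ratios[j - 1] = lambda_j / lambda_{j+1} for j = 1..k_max
    ratios = eigs[:k_max] / eigs[1 : k_max + 1]
    best = int(np.argmax(ratios)) + 1
    if allow_zero:
        # Assumed artificial zeroth eigenvalue; cleands may define it differently.
        lam0 = eigs.sum() / np.log(eigs.size)
        if lam0 / eigs[0] > ratios.max():
            best = 0
    return best

print(eigenvalue_ratio_k([10.0, 9.0, 1.0, 0.9, 0.8]))  # 2
```

In the usage line the largest drop is between the second and third eigenvalues (9.0 / 1.0), so two components are selected.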

class cleands.DimensionReduction.pca.canonical_correlation_analysis(X, Y, k, *, center=True, scale=False, reg=1e-6)[source]

Bases: supervised_dimension_reduction_model

Canonical Correlation Analysis (CCA).

Finds linear projections of X and Y that maximize their correlation. Useful for studying relationships between two multivariate datasets.

Parameters:
  • X (np.ndarray) – Predictor data matrix of shape (n, p).

  • Y (np.ndarray) – Response data matrix of shape (n, q).

  • k (int) – Number of canonical variates to compute.

  • center (bool, optional) – Center data. Defaults to True.

  • scale (bool, optional) – Scale data to unit variance. Defaults to False.

  • reg (float, optional) – Regularization term added to covariance diagonals. Defaults to 1e-6.

Variables:
  • mean_x (np.ndarray) – Mean of X features.

  • mean_y (np.ndarray) – Mean of Y features.

  • scale_x (np.ndarray) – Scaling factors for X.

  • scale_y (np.ndarray) – Scaling factors for Y.

  • canonical_correlations (np.ndarray) – Canonical correlations.

  • rotation_x (np.ndarray) – Canonical directions for X, shape (p, k).

  • rotation_y (np.ndarray) – Canonical directions for Y, shape (q, k).

  • components_x (np.ndarray) – Canonical variates for X, shape (n, k).

  • components_y (np.ndarray) – Canonical variates for Y, shape (n, k).

  • k (int) – Number of canonical variates retained.
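The whitening-plus-SVD approach mentioned in the module notes can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the cleands code: cca_sketch and inv_sqrt_psd are hypothetical names, data are always centered and never scaled, covariances use the n - 1 divisor, and the whitening matrix is the symmetric inverse square root.

```python
import numpy as np

def inv_sqrt_psd(m):
    """Symmetric inverse square root of a positive definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v / np.sqrt(w)) @ v.T        # v @ diag(1 / sqrt(w)) @ v.T

def cca_sketch(X, Y, k, reg=1e-6):
    """Illustrative CCA via whitening and SVD (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    q = Y.shape[1]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariances: reg is added to the diagonals for stability.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(p)
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(q)
    Sxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt_psd(Sxx), inv_sqrt_psd(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    rotation_x = Wx @ u[:, :k]           # shape (p, k)
    rotation_y = Wy @ vt[:k].T           # shape (q, k)
    return s[:k], rotation_x, rotation_y, Xc @ rotation_x, Yc @ rotation_y
```

New observations would be projected by subtracting the training means and multiplying by rotation_x or rotation_y, mirroring what reduce_X and reduce_Y are documented to return.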

reduce_X(x_new)[source]

Project new X data into the canonical variate space.

Parameters:

x_new (np.ndarray) – New X data, shape (m, p).

Returns:

Canonical variates of shape (m, k).

Return type:

np.ndarray

reduce_Y(y_new)[source]

Project new Y data into the canonical variate space.

Parameters:

y_new (np.ndarray) – New Y data, shape (m, q).

Returns:

Canonical variates of shape (m, k).

Return type:

np.ndarray

class cleands.DimensionReduction.pca.PrincipalComponentsAnalysis(formula, data, *args, **kwargs)[source]

Bases: DimensionReductionModel

Convenience wrapper for principal components analysis (PCA).

PCA reduces the dimensionality of data by projecting it onto a set of orthogonal components that capture the maximum variance. This wrapper provides a formula/DataFrame interface to principal_components_analysis.

Variables:

MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to principal_components_analysis.

Parameters:
  • formula (str)

  • data (DataFrame)

Example

>>> model = PrincipalComponentsAnalysis.from_formula("~ x1 + x2 + x3", data=df)
>>> Z = model.reduce(df[["x1", "x2", "x3"]])   # reduced components
>>> model.glance  # variance explained and diagnostics