cleands.DimensionReduction.pca module
pca.py
Implements dimension reduction models including Principal Components Analysis (PCA) and Canonical Correlation Analysis (CCA).
- Classes:
- principal_components_analysis:
Standard PCA for unsupervised dimension reduction, with centering and scaling options.
- canonical_correlation_analysis:
Supervised dimension reduction that extracts correlated projections between two datasets.
- Functions:
- select_k:
Eigenvalue ratio test to automatically select the number of components.
Notes
PCA is fit via Singular Value Decomposition (SVD).
CCA is fit via whitening of covariance matrices and SVD of the cross-covariance.
- class cleands.DimensionReduction.pca.principal_components_analysis(x, k, *, center=True, scale=False)[source]
Bases:
dimension_reduction_model
Principal Components Analysis (PCA).
Reduces dimensionality by projecting the data onto the top k principal components, the orthogonal directions of greatest variance. Also reports explained-variance statistics.
- Parameters:
x (np.ndarray) – Data matrix of shape (n_samples, n_features).
k (int) – Number of components to extract. Must be in [1, min(n_samples, n_features)].
center (bool, optional) – Whether to mean-center the data. Defaults to True.
scale (bool, optional) – Whether to scale variables to unit variance. Defaults to False.
- Variables:
mean (np.ndarray) – Feature means (if centered).
scale (np.ndarray) – Feature scales (if scaled).
singular_values (np.ndarray) – Singular values from SVD.
rotation (np.ndarray) – Principal component loadings, shape (n_features, k).
components (np.ndarray) – Projected data (scores), shape (n_samples, k).
explained_variance (np.ndarray) – Variance explained by each component.
explained_variance_ratio (np.ndarray) – Proportion of variance explained.
k (int) – Number of components retained.
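The relationship between the SVD and the attributes listed above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the cleands implementation: the function name pca_svd_sketch is hypothetical, scaling is assumed to use the sample standard deviation (ddof=1), and variances are assumed to be normalized by n_samples - 1.

```python
import numpy as np

def pca_svd_sketch(x, k, center=True, scale=False):
    """Illustrative PCA via SVD (hypothetical helper, not the cleands code)."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    if not 1 <= k <= min(n, p):
        raise ValueError("k must be in [1, min(n_samples, n_features)]")
    mean = x.mean(axis=0) if center else np.zeros(p)
    # Assumption: unit-variance scaling uses the sample standard deviation.
    # A constant column would give a zero divisor; this sketch does not guard it.
    std = x.std(axis=0, ddof=1) if scale else np.ones(p)
    z = (x - mean) / std
    # Thin SVD: z = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    rotation = vt[:k].T              # loadings, shape (n_features, k)
    components = z @ rotation        # scores, shape (n_samples, k)
    variances = s**2 / (n - 1)       # variance along every component
    explained_variance = variances[:k]
    explained_variance_ratio = explained_variance / variances.sum()
    return rotation, components, explained_variance, explained_variance_ratio

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
rotation, components, ev, evr = pca_svd_sketch(x, k=2)
```

With centering enabled, the sample variance of each score column equals the corresponding entry of explained_variance, which is a quick consistency check on any PCA fit.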
- cleands.DimensionReduction.pca.select_k(eigs, k_max=None, include_zero=True, allow_zero=False)[source]
Eigenvalue ratio test for choosing number of components.
Uses the eigenvalue ratio test of Ahn and Horenstein (2013) to select k. Optionally includes a “zeroth” eigenvalue based on the average.
- Parameters:
eigs (np.ndarray) – Eigenvalues or variances in descending order.
k_max (int, optional) – Maximum number of components to consider. Defaults to all.
include_zero (bool, optional) – Include artificial λ₀ for ratio. Defaults to True.
allow_zero (bool, optional) – Allow selection of k=0. Defaults to False.
- Returns:
Selected number of components.
- Return type:
int
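The core of an eigenvalue ratio test can be sketched as follows: compute the ratio of each eigenvalue to the next one and pick the k where that ratio peaks. This is a minimal sketch, not the cleands select_k itself. The name eigenvalue_ratio_k is hypothetical, the eigenvalues are assumed to be strictly positive and already sorted in descending order, and the include_zero and allow_zero options are deliberately left out because this page does not define the artificial λ₀ precisely.

```python
import numpy as np

def eigenvalue_ratio_k(eigs, k_max=None):
    """Pick the k (1-based) that maximizes eigs[k-1] / eigs[k].

    Hypothetical helper for illustration; assumes positive, descending eigs.
    """
    eigs = np.asarray(eigs, dtype=float)
    if eigs.ndim != 1 or eigs.size < 2:
        raise ValueError("need a 1-D array with at least two eigenvalues")
    # The last eigenvalue has no successor, so k can be at most size - 1.
    limit = eigs.size - 1
    k_max = limit if k_max is None else min(k_max, limit)
    ratios = eigs[:k_max] / eigs[1:k_max + 1]
    return int(np.argmax(ratios)) + 1

eigs = np.array([9.0, 8.0, 1.0, 0.9, 0.8])
k = eigenvalue_ratio_k(eigs)          # largest gap is between 8.0 and 1.0
k_capped = eigenvalue_ratio_k(eigs, k_max=1)
```

In the example, the ratios are 1.125, 8.0, 1.11 and 1.125, so the sketch selects two components; capping k_max at 1 forces the choice to one.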
- class cleands.DimensionReduction.pca.canonical_correlation_analysis(X, Y, k, *, center=True, scale=False, reg=1e-6)[source]
Bases:
supervised_dimension_reduction_model
Canonical Correlation Analysis (CCA).
Finds linear projections of X and Y that maximize their correlation. Useful for studying relationships between two multivariate datasets.
- Parameters:
X (np.ndarray) – Predictor data matrix of shape (n, p).
Y (np.ndarray) – Response data matrix of shape (n, q).
k (int) – Number of canonical variates to compute.
center (bool, optional) – Center data. Defaults to True.
scale (bool, optional) – Scale data to unit variance. Defaults to False.
reg (float, optional) – Regularization term added to covariance diagonals. Defaults to 1e-6.
- Variables:
mean_x (np.ndarray) – Mean of X features.
mean_y (np.ndarray) – Mean of Y features.
scale_x (np.ndarray) – Scaling factors for X.
scale_y (np.ndarray) – Scaling factors for Y.
canonical_correlations (np.ndarray) – Canonical correlations.
rotation_x (np.ndarray) – Canonical directions for X (p, k).
rotation_y (np.ndarray) – Canonical directions for Y (q, k).
components_x (np.ndarray) – Canonical variates for X (n, k).
components_y (np.ndarray) – Canonical variates for Y (n, k).
k (int) – Number of canonical variates retained.
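The whitening-plus-SVD approach mentioned in the module notes can be sketched in NumPy as below. It is an illustration only: cca_sketch and inv_sqrt_psd are hypothetical names, the scale option is omitted, covariances are assumed to be normalized by n - 1, and reg is assumed to be added as reg times the identity to both covariance matrices.

```python
import numpy as np

def inv_sqrt_psd(a):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v / np.sqrt(w)) @ v.T     # V diag(w^-1/2) V^T

def cca_sketch(X, Y, k, reg=1e-6):
    """Illustrative CCA via whitening and SVD (not the cleands code)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariances and the cross-covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Wx = inv_sqrt_psd(Sxx)
    Wy = inv_sqrt_psd(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, in descending order.
    u, s, vt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    rotation_x = Wx @ u[:, :k]        # shape (p, k)
    rotation_y = Wy @ vt[:k].T        # shape (q, k)
    return s[:k], rotation_x, rotation_y, Xc @ rotation_x, Yc @ rotation_y

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))      # shared signal linking X and Y
X = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])
corr, rx, ry, cx, cy = cca_sketch(X, Y, k=2)
```

Because both datasets share one latent column, the first canonical correlation comes out close to 1 and the second is much smaller. The sample correlation between the first pair of canonical variates should match the first singular value up to the small effect of reg.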
- class cleands.DimensionReduction.pca.PrincipalComponentsAnalysis(formula, data, *args, **kwargs)[source]
Bases:
DimensionReductionModel
Convenience wrapper for principal components analysis (PCA).
PCA reduces the dimensionality of data by projecting it onto a set of orthogonal components that capture the maximum variance. This wrapper provides a formula/DataFrame interface for principal_components_analysis.
- Variables:
MODEL_TYPE (ClassVar[Type[cleands.base.supervised_model]]) – Underlying model type, fixed to principal_components_analysis.
- Parameters:
formula (str)
data (DataFrame)
Example
>>> model = PrincipalComponentsAnalysis.from_formula("~ x1 + x2 + x3", data=df)
>>> Z = model.reduce(df[["x1", "x2", "x3"]])  # reduced components
>>> model.glance  # variance explained and diagnostics