cleands.base module

Core base classes and abstract interfaces for the CleanDS statistical modeling framework.

This module defines the fundamental building blocks for supervised, unsupervised, prediction, classification, clustering, and distribution models. It establishes a consistent API across model families, providing shared properties such as fitted, residuals, mean_squared_error, aic, and bic.

Key features:
  • Abstract base classes for supervised and unsupervised models.

  • Prediction and classification models with common evaluation metrics.

  • Clustering model interface with mean/cluster management utilities.

  • Distribution model protocols with PDF and CDF support.

  • Likelihood-based model mixins for deviance, AIC, BIC, and log-likelihood.

  • Variance models for parameter uncertainty (standard errors, confidence intervals).

  • Supervised wrappers (SupervisedModel, PredictionModel, ClassificationModel) for integrating with formula notation and tidy outputs.

These base definitions are designed for extensibility: custom regression, classification, clustering, or distribution models should inherit from the appropriate abstract class to ensure interoperability within the CleanDS ecosystem.

class cleands.base.learning_model(x)[source]

Bases: ABC

Abstract base class for all learning models.

Parameters:

x (ndarray)

class cleands.base.supervised_model(x, y)[source]

Bases: learning_model, ABC

Base class for supervised learning models with features and labels.

Parameters:
  • x (ndarray)

  • y (ndarray)

class cleands.base.unsupervised_model(x)[source]

Bases: learning_model, ABC

Base class for unsupervised learning models.

Parameters:

x (ndarray)

class cleands.base.prediction_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised prediction models.

Parameters:
  • x (ndarray)

  • y (ndarray)

abstract predict(target)[source]

Predict outcomes for new input data.

Parameters:

target (np.ndarray) – Feature matrix for prediction.

Returns:

Predicted values.

Return type:

np.ndarray

property fitted: ndarray

Predictions for training data (self.x).

Type:

np.ndarray

property residuals: ndarray

Difference between observed and fitted values.

Type:

np.ndarray

property residual_sum_of_squares: float

Residual sum of squares (RSS).

Type:

float

property mean_squared_error: float

Mean squared error (MSE).

Type:

float

out_of_sample_mean_squared_error(x, y)[source]

Compute out-of-sample mean squared error.

Parameters:
  • x (np.ndarray) – Test feature matrix.

  • y (np.ndarray) – Test target vector.

Returns:

Out-of-sample MSE.

Return type:

float

property root_mean_squared_error: float

Root mean squared error (RMSE).

Type:

float

out_of_sample_root_mean_squared_error(x, y)[source]

Compute out-of-sample RMSE.

Parameters:
  • x (np.ndarray) – Test feature matrix.

  • y (np.ndarray) – Test target vector.

Returns:

Out-of-sample RMSE.

Return type:

float

property r_squared: float

Coefficient of determination (R²).

Type:

float

property adjusted_r_squared: float

Adjusted R² that accounts for model complexity.

Type:

float

property degrees_of_freedom: int

Degrees of freedom = n_obs - n_feat.

Type:

int

property residual_variance: float

Estimated residual variance.

Type:

float
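The metrics above all follow from predict(), which a small stand-in class can make concrete. The sketch below does not import cleands: PredictionSketch and LeastSquaresSketch are hypothetical names, and dividing the RSS by n_obs to obtain the MSE is an assumption about the library's convention rather than a statement of it.

```python
from abc import ABC, abstractmethod

import numpy as np


class PredictionSketch(ABC):
    """Hypothetical stand-in mirroring the documented prediction_model interface."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x, self.y = x, y
        self.n_obs, self.n_feat = x.shape

    @abstractmethod
    def predict(self, target: np.ndarray) -> np.ndarray:
        """Subclasses supply the actual prediction rule."""

    @property
    def fitted(self) -> np.ndarray:
        # Predictions for the training data.
        return self.predict(self.x)

    @property
    def residuals(self) -> np.ndarray:
        # Observed minus fitted values.
        return self.y - self.fitted

    @property
    def residual_sum_of_squares(self) -> float:
        return float(np.sum(self.residuals ** 2))

    @property
    def mean_squared_error(self) -> float:
        # Assumption: MSE = RSS / n_obs (no degrees-of-freedom correction).
        return self.residual_sum_of_squares / self.n_obs

    @property
    def r_squared(self) -> float:
        total = float(np.sum((self.y - self.y.mean()) ** 2))
        return 1.0 - self.residual_sum_of_squares / total

    def out_of_sample_mean_squared_error(self, x: np.ndarray, y: np.ndarray) -> float:
        return float(np.mean((y - self.predict(x)) ** 2))


class LeastSquaresSketch(PredictionSketch):
    """Ordinary least squares as one possible concrete subclass."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        super().__init__(x, y)
        self.params, *_ = np.linalg.lstsq(x, y, rcond=None)

    def predict(self, target: np.ndarray) -> np.ndarray:
        return target @ self.params


# Noise-free line y = 1 + 2 t, with an explicit intercept column.
x = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * np.arange(5.0)
model = LeastSquaresSketch(x, y)
```

Only predict() is abstract here, so a new estimator would inherit every metric once it implements that one method.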

class cleands.base.classification_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised classification models.

abstract predict_proba(target)[source]

Predict class probabilities.

Parameters:

target (np.ndarray) – Feature matrix.

Returns:

Class probability estimates.

Return type:

np.ndarray

property n_classes: int

Number of classes in the dataset.

Type:

int

classify(target)[source]

Predict class labels.

Parameters:

target (np.ndarray) – Feature matrix.

Returns:

Predicted class labels.

Return type:

np.ndarray

property fitted

Predicted class labels for training data.

Type:

np.ndarray

property accuracy

Training accuracy.

Type:

float

out_of_sample_accuracy(x, y)[source]

Compute out-of-sample accuracy.

Parameters:
  • x (np.ndarray) – Test features.

  • y (np.ndarray) – Test labels.

Returns:

Accuracy score.

Return type:

float

property misclassification_probability

Misclassification probability = 1 - accuracy.

Type:

float

out_of_sample_misclassification_probability(x, y)[source]

Compute out-of-sample misclassification probability.

Parameters:
  • x (np.ndarray) – Test features.

  • y (np.ndarray) – Test labels.

Returns:

Misclassification probability.

Return type:

float

property confusion_matrix

Confusion matrix for training data.

Type:

np.ndarray

out_of_sample_confusion_matrix(x, y)[source]

Compute confusion matrix for test data.

Parameters:
  • x (np.ndarray) – Test features.

  • y (np.ndarray) – Test labels.

Returns:

Confusion matrix.

Return type:

np.ndarray
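The classification metrics above can be illustrated from a probability matrix alone. This is a minimal sketch, not the library's code: classify_from_proba and confusion_table are hypothetical helpers, and the row/column layout of the confusion matrix (true class by row, predicted class by column) is an assumption.

```python
import numpy as np


def classify_from_proba(proba: np.ndarray) -> np.ndarray:
    """Pick the most probable class in each row."""
    return np.argmax(proba, axis=1)


def confusion_table(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> np.ndarray:
    """Count (true, predicted) pairs; rows are true classes (assumed layout)."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(table, (y_true, y_pred), 1)
    return table


# Toy output of a predict_proba-style method for four observations.
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
y_true = np.array([0, 1, 1, 1])

y_pred = classify_from_proba(proba)
table = confusion_table(y_true, y_pred, n_classes=2)
accuracy = float(np.mean(y_pred == y_true))
misclassification_probability = 1.0 - accuracy
```

Accuracy is the share of counts on the diagonal of the table, which is why the two quantities always agree.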

class cleands.base.clustering_model(x)[source]

Bases: unsupervised_model, ABC

Base class for clustering models.

Parameters:

x (ndarray)

abstract cluster(target)[source]

Assign cluster labels for given data.

Parameters:

target (np.ndarray) – Feature matrix.

Returns:

Cluster assignments.

Return type:

np.ndarray

property n_clusters: int

Number of clusters.

Type:

int

property means: ndarray

Cluster centroids.

Type:

np.ndarray

property groups: ndarray

Cluster assignments for training data.

Type:

np.ndarray

property within_group_sum_of_squares: ndarray

Within-group sum of squares per cluster.

Type:

np.ndarray

property total_within_group_sum_of_squares: float

Total within-group sum of squares across clusters.

Type:

float
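The within-group quantities can be computed directly from the data, the cluster assignments, and the centroids. The helper below is a hypothetical sketch under the assumption that each cluster's sum of squares is measured around its own centroid; it does not call into cleands.

```python
import numpy as np


def within_group_sum_of_squares(x: np.ndarray, groups: np.ndarray, means: np.ndarray) -> np.ndarray:
    """Sum of squared distances to the centroid, one entry per cluster."""
    return np.array([
        np.sum((x[groups == k] - means[k]) ** 2)
        for k in range(means.shape[0])
    ])


x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
groups = np.array([0, 0, 1, 1])

# Centroids are the per-cluster means of the training rows.
means = np.vstack([x[groups == k].mean(axis=0) for k in range(2)])

wgss = within_group_sum_of_squares(x, groups, means)
total_wgss = float(wgss.sum())
```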

class cleands.base.distribution_model(x)[source]

Bases: unsupervised_model, ABC

Base class for unsupervised parametric/nonparametric distributions.

Parameters:

x (ndarray)

pdf(target)[source]

Probability density (or mass) function evaluated at target.

Parameters:

target (np.ndarray) – Points at which to evaluate the pdf/pmf.

Returns:

Density (or probability) values with shape compatible with target.

Return type:

np.ndarray

cdf(target)[source]

Cumulative distribution function evaluated at target.

Parameters:

target (np.ndarray) – Points at which to evaluate the CDF.

Returns:

Cumulative probabilities with shape compatible with target.

Return type:

np.ndarray
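As an illustration of the pdf/cdf contract, the sketch below fits a normal distribution by its sample mean and standard deviation. NormalSketch is a hypothetical class, not part of cleands; it only shows outputs whose shape matches target.

```python
import math

import numpy as np


class NormalSketch:
    """Hypothetical distribution following the documented pdf/cdf contract."""

    def __init__(self, x):
        self.x = np.asarray(x, dtype=float)
        self.mu = float(self.x.mean())
        self.sigma = float(self.x.std())  # maximum-likelihood estimate (ddof=0)

    def pdf(self, target) -> np.ndarray:
        z = (np.asarray(target, dtype=float) - self.mu) / self.sigma
        return np.exp(-0.5 * z ** 2) / (self.sigma * math.sqrt(2.0 * math.pi))

    def cdf(self, target) -> np.ndarray:
        z = (np.asarray(target, dtype=float) - self.mu) / self.sigma
        erf = np.vectorize(math.erf)  # element-wise error function
        return 0.5 * (1.0 + erf(z / math.sqrt(2.0)))


dist = NormalSketch([1.0, 2.0, 3.0, 4.0, 5.0])
grid = np.array([1.0, 3.0, 5.0])
density = dist.pdf(grid)
cumulative = dist.cdf(grid)
```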

class cleands.base.dimension_reduction_model(x)[source]

Bases: unsupervised_model, ABC

Base class for unsupervised dimension reduction algorithms (e.g., PCA).

Parameters:

x (ndarray)

reduce(target)[source]

Project target into a lower-dimensional space.

Parameters:

target (np.ndarray) – Data matrix to reduce, shape (n_obs, n_feat).

Returns:

Reduced representation, shape (n_obs, k) where k <= n_feat.

Return type:

np.ndarray

out_of_sample_mean_squared_error(target)[source]

Reprojection MSE: squared reconstruction error per element.

This computes the MSE of projecting target onto the learned subspace and measuring the residual in the orthogonal complement.

Parameters:

target (np.ndarray) – Data matrix to evaluate, shape (n_obs, n_feat).

Returns:

Scalar MSE (float-like) computed as trace(T' M T) / T.size.

Return type:

np.ndarray

property mean_squared_error: ndarray

In-sample mean squared error for self.x.

Type:

np.ndarray

out_of_sample_root_mean_squared_error(target)[source]

Root mean squared reconstruction error for new data.

Parameters:

target (np.ndarray) – Data matrix to evaluate.

Returns:

Scalar RMSE.

Return type:

np.ndarray

property root_mean_squared_error: ndarray

In-sample root mean squared error for self.x.

Type:

np.ndarray
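The reprojection error can be sketched with a PCA-style projection built from an SVD. Everything below is an assumption made for illustration: the centering step, the names loadings and annihilator, and the choice M = I - V V'. With observations stored in rows, the trace is written here with the data matrix on the left; the library's own orientation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
center = x.mean(axis=0)

# The top-k right singular vectors of the centered data span the learned subspace.
k = 2
_, _, vt = np.linalg.svd(x - center, full_matrices=False)
loadings = vt[:k].T  # shape (n_feat, k), orthonormal columns


def reduce(target: np.ndarray) -> np.ndarray:
    """Scores of target in the k-dimensional subspace."""
    return (target - center) @ loadings


def reprojection_mse(target: np.ndarray) -> float:
    """Mean squared residual left after projecting onto the subspace."""
    t = target - center
    residual = t - t @ loadings @ loadings.T
    return float(np.sum(residual ** 2) / t.size)


# Equivalent trace form using the projector onto the orthogonal complement.
annihilator = np.eye(4) - loadings @ loadings.T
t = x - center
trace_form = float(np.trace(t @ annihilator @ t.T) / t.size)
```

The two computations agree because the annihilator is symmetric and idempotent, so the squared residual norm collapses to a single quadratic form.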

class cleands.base.supervised_dimension_reduction_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised dimension reduction (e.g., CCA).

Parameters:
  • x (ndarray)

  • y (ndarray)

reduce_X(x_new)[source]

Project x_new into the supervised lower-dimensional X-space.

Parameters:

x_new (np.ndarray) – Feature matrix, shape (n_obs, n_feat).

Returns:

Reduced X scores, shape (n_obs, kx).

Return type:

np.ndarray

reduce_Y(y_new)[source]

Project y_new (targets) into the supervised lower-dimensional Y-space.

Parameters:

y_new (np.ndarray) – Target vector or matrix, shape (n_obs, …) depending on model.

Returns:

Reduced Y scores, shape (n_obs, ky).

Return type:

np.ndarray

reduce(x_new=None, y_new=None)[source]

Reduce X, Y, or both, depending on provided inputs.

At least one of x_new or y_new must be provided.

Parameters:
  • x_new (Optional[np.ndarray]) – Feature matrix to reduce.

  • y_new (Optional[np.ndarray]) – Target vector/matrix to reduce.

Returns:

  • If both provided: (X_reduced, Y_reduced).

  • If only x_new provided: X_reduced.

  • If only y_new provided: Y_reduced.

Return type:

np.ndarray | tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If neither x_new nor y_new is provided.
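The dispatch rule described above can be sketched as a free function. reduce_dispatch, project_x, and project_y are hypothetical names standing in for reduce(), reduce_X(), and reduce_Y(); the toy projections simply keep the first column.

```python
from typing import Callable, Optional, Tuple, Union

import numpy as np

Reduced = Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
Projector = Callable[[np.ndarray], np.ndarray]


def reduce_dispatch(
    project_x: Projector,
    project_y: Projector,
    x_new: Optional[np.ndarray] = None,
    y_new: Optional[np.ndarray] = None,
) -> Reduced:
    """Reduce X, Y, or both, depending on which inputs are given."""
    if x_new is None and y_new is None:
        raise ValueError("Provide x_new, y_new, or both.")
    if y_new is None:
        return project_x(x_new)
    if x_new is None:
        return project_y(y_new)
    return project_x(x_new), project_y(y_new)


def project_x(a: np.ndarray) -> np.ndarray:
    return a[:, :1]  # toy projection: keep the first column


def project_y(a: np.ndarray) -> np.ndarray:
    return a[:, :1]


x_new = np.arange(6.0).reshape(3, 2)
y_new = np.arange(9.0).reshape(3, 3)

only_x = reduce_dispatch(project_x, project_y, x_new=x_new)
both = reduce_dispatch(project_x, project_y, x_new=x_new, y_new=y_new)

try:
    reduce_dispatch(project_x, project_y)
except ValueError as err:
    message = str(err)
```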

class cleands.base.likelihood_type(*args, **kwargs)[source]

Bases: Protocol

Structural protocol for objects exposing likelihood metrics.

property log_likelihood: float

Model log-likelihood.

Type:

float

property null_likelihood: float

Log-likelihood of the null/reference model.

Type:

float
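Because this is a structural protocol, an object conforms by having the right members, not by inheriting from anything. The sketch below shows the idea with typing.Protocol; LikelihoodLike and CoinFlips are hypothetical names, and marking the protocol runtime_checkable is an assumption made here so that isinstance can be demonstrated.

```python
import math
from typing import Protocol, runtime_checkable


@runtime_checkable
class LikelihoodLike(Protocol):
    """Hypothetical protocol with the same two members as likelihood_type."""

    @property
    def log_likelihood(self) -> float: ...

    @property
    def null_likelihood(self) -> float: ...


class CoinFlips:
    """Does not inherit from LikelihoodLike; it only has matching members."""

    def __init__(self, heads: int, tails: int):
        self.heads, self.tails = heads, tails

    @property
    def log_likelihood(self) -> float:
        p = self.heads / (self.heads + self.tails)  # fitted success probability
        return self.heads * math.log(p) + self.tails * math.log(1.0 - p)

    @property
    def null_likelihood(self) -> float:
        return (self.heads + self.tails) * math.log(0.5)  # fair-coin reference


def likelihood_ratio(model: LikelihoodLike) -> float:
    """Works for any object that satisfies the protocol."""
    return 2.0 * (model.log_likelihood - model.null_likelihood)


flips = CoinFlips(heads=7, tails=3)
ratio = likelihood_ratio(flips)
```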

class cleands.base.likelihood_model[source]

Bases: ABC

Mixin-like base for models that report likelihood-based criteria.

n_feat: int

n_obs: int

abstract property log_likelihood: float

Model log-likelihood under fitted parameters.

Type:

float

abstract property null_likelihood: float

Log-likelihood of the null/reference model.

Type:

float

property aic: float

Akaike Information Criterion (smaller is better).

Type:

float

property bic: float

Bayesian Information Criterion (smaller is better).

Type:

float

property deviance: float

Model deviance = 2*LL(model) - 2*LL(null).

Type:

float
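For reference, the three criteria can be written as plain functions using their textbook definitions. This is a sketch, not the library's implementation: in particular, which quantity the library uses as the parameter count (for example n_feat) is an assumption left to the caller here.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 LL."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 LL."""
    return math.log(n_obs) * n_params - 2.0 * log_likelihood


def deviance(log_likelihood: float, null_likelihood: float) -> float:
    """Deviance as documented above: 2 LL(model) - 2 LL(null)."""
    return 2.0 * log_likelihood - 2.0 * null_likelihood


example_aic = aic(-120.0, n_params=3)
example_bic = bic(-120.0, n_params=3, n_obs=100)
example_deviance = deviance(-120.0, null_likelihood=-150.0)
```

BIC penalizes each parameter by ln(n) instead of 2, so it is the stricter criterion whenever n_obs exceeds about 7.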

class cleands.base.parametric_distribution_model(x)[source]

Bases: distribution_model, likelihood_model, ABC

Base class for parametric distributions that expose likelihood metrics.

Parameters:

x (ndarray)

params: ndarray

x: ndarray

abstract pdf(target)[source]

Probability density (or mass) evaluated at target.

Parameters:

target (np.ndarray) – Points for evaluation.

Returns:

Density/probability values.

Return type:

np.ndarray

abstract cdf(target)[source]

Cumulative distribution evaluated at target.

Parameters:

target (np.ndarray) – Points for evaluation.

Returns:

Cumulative probabilities.

Return type:

np.ndarray

property log_likelihood: float

In-sample log-likelihood for self.x.

Type:

float

abstract out_of_sample_log_likelihood(target)[source]

Log-likelihood evaluated on arbitrary data.

Parameters:

target (np.ndarray) – Data on which to evaluate LL.

Returns:

Log-likelihood value.

Return type:

float

property null_likelihood: float

In-sample null log-likelihood for self.x.

Type:

float

abstract out_of_sample_null_likelihood(target)[source]

Null-model log-likelihood on arbitrary data.

Parameters:

target (np.ndarray) – Data on which to evaluate null LL.

Returns:

Null log-likelihood value.

Return type:

float

property deviance: ndarray

Deviance for self.x = 2*(LL - LL_null).

Type:

np.ndarray

out_of_sample_deviance(target)[source]

Deviance on arbitrary data.

Parameters:

target (np.ndarray) – Data on which to compute deviance.

Returns:

Deviance value(s).

Return type:

np.ndarray

class cleands.base.prediction_likelihood_model[source]

Bases: ABC

Base for prediction models that define likelihood via evaluate_lnL.

y: ndarray

n_obs: int

n_feat: int

abstract evaluate_lnL(pred)[source]

Evaluate log-likelihood given predictions pred.

Parameters:

pred (np.ndarray) – Predicted values or probabilities aligned with y.

Returns:

Log-likelihood value.

Return type:

float

abstract property fitted: ndarray

Model-fitted predictions on training data.

Returns:

Predictions aligned with y.

Return type:

np.ndarray

property log_likelihood: float

Log-likelihood at fitted values.

Type:

float

property null_likelihood: float

Log-likelihood of a mean-only/constant (null) predictor.

Type:

float

property aic: float

Akaike Information Criterion.

Type:

float

property bic: float

Bayesian Information Criterion.

Type:

float

property deviance: float

Deviance = 2*LL(model) - 2*LL(null).

Type:

float
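The relationship between evaluate_lnL, log_likelihood, and null_likelihood can be sketched with a Gaussian error model. The choice of a Gaussian likelihood with the variance profiled out is an assumption for illustration; concrete subclasses define their own evaluate_lnL.

```python
import numpy as np


def evaluate_lnL(y: np.ndarray, pred: np.ndarray) -> float:
    """Gaussian log-likelihood at the ML variance estimate (assumed error model)."""
    n = y.shape[0]
    sigma2 = np.mean((y - pred) ** 2)  # must be positive, i.e. not a perfect fit
    return float(-0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0))


y = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
fitted = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Log-likelihood at the fitted values versus a mean-only (null) predictor.
log_likelihood = evaluate_lnL(y, fitted)
null_likelihood = evaluate_lnL(y, np.full_like(y, y.mean()))
deviance = 2.0 * log_likelihood - 2.0 * null_likelihood
```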

class cleands.base.broom_model(*args, **kwargs)[source]

Bases: Protocol

Protocol for tidy/glance accessors (broom-like API).

property tidy: DataFrame

Per-parameter summary (estimates, SEs, tests, etc.).

Type:

pd.DataFrame

property glance: DataFrame

Model-level summary (fit statistics, diagnostics, etc.).

Type:

pd.DataFrame

class cleands.base.variance_model[source]

Bases: ABC

Mixin for models that expose variance-covariance and inferential stats.

abstract vcov_params()[source]

Variance-covariance matrix of parameter estimates.

Returns:

(p x p) covariance matrix for the first n_feat parameters.

Return type:

np.ndarray

property std_error

Standard errors for parameters (from vcov_params).

Type:

np.ndarray

property t_statistic

t-statistics = params / std_error.

Type:

np.ndarray

property p_value

Two-sided p-values under Student-t with df = n_obs - n_feat.

Type:

np.ndarray

conf_int(level=0.95)[source]

Confidence intervals for parameters.

Parameters:

level (float) – Coverage probability (default 0.95).

Returns:

2 x p array with lower/upper bounds by column.

Return type:

np.ndarray

property tidy

Tidy per-parameter table (no CIs).

Type:

pd.DataFrame

tidyci(level=0.95, ci=True)[source]

Tidy per-parameter table with optional confidence intervals.

Parameters:
  • level (float) – CI level (default 0.95).

  • ci (bool) – If True, include CI columns.

Returns:

Columns include variable, estimate, std.error, t.statistic, p.value, and optionally ci.lower, ci.upper.

Return type:

pd.DataFrame

property glance: DataFrame

Model-level summary table.

Type:

pd.DataFrame
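The inferential quantities derive from the covariance matrix in a few lines. The sketch below uses made-up numbers and a hypothetical conf_int_normal helper; it substitutes a standard-normal quantile for the Student-t quantile documented above, which is only a large-sample approximation.

```python
from statistics import NormalDist

import numpy as np

# Illustrative estimates and covariance matrix (not from any real fit).
params = np.array([1.5, -0.4])
vcov = np.array([[0.04, 0.01], [0.01, 0.09]])

std_error = np.sqrt(np.diag(vcov))   # square roots of the diagonal
t_statistic = params / std_error     # estimate divided by its standard error


def conf_int_normal(level: float = 0.95) -> np.ndarray:
    """2 x p array of bounds using a normal quantile (approximation)."""
    critical = NormalDist().inv_cdf(0.5 + level / 2.0)
    return np.vstack([
        params - critical * std_error,  # lower bounds
        params + critical * std_error,  # upper bounds
    ])


bounds = conf_int_normal(0.95)
```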

class cleands.base.SupervisedModel(formula, data, *args, **kwargs)[source]

Bases: ABC

Abstract base class for supervised models constructed from a formula.

Subclasses must set the class attribute MODEL_TYPE to a concrete supervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.

Variables:
  • formula (str) – Formula string used to specify the model.

  • x_vars (list[str]) – Names of predictor variables.

  • y_var (str) – Name of response variable.

  • data (pd.DataFrame) – Parsed DataFrame containing predictors and response.

  • model (supervised_model) – Fitted underlying model implementation.

Parameters:
  • formula (str)

  • data (DataFrame)

tidyci(level=0.95, ci=True)[source]

Return a tidy coefficient table with optional confidence intervals.

Parameters:
  • level (float, default=0.95) – Confidence level for intervals.

  • ci (bool, default=True) – Whether to include confidence intervals.

Returns:

Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.

Return type:

pd.DataFrame

property tidy: DataFrame

Return a tidy table of parameter estimates without confidence intervals.

Equivalent to calling tidyci() with ci=False.

Returns:

Table of parameter estimates.

Return type:

pd.DataFrame

property glance: DataFrame

Return model-level summary statistics.

Returns:

One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).

Return type:

pd.DataFrame
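To make the x_vars and y_var attributes concrete, the sketch below splits a simple additive formula into its parts. parse_formula is a hypothetical helper: the real parser is not shown in this reference, and formula features such as intercept control, interactions, or transformations are deliberately ignored.

```python
from typing import List, Tuple


def parse_formula(formula: str) -> Tuple[List[str], str]:
    """Split 'y ~ a + b' into predictor names and the response name."""
    response, _, predictors = formula.partition("~")
    y_var = response.strip()
    x_vars = [term.strip() for term in predictors.split("+") if term.strip()]
    if not y_var or not x_vars:
        raise ValueError(f"Cannot parse formula: {formula!r}")
    return x_vars, y_var


x_vars, y_var = parse_formula("mpg ~ hp + wt")
```

A wrapper would then select those columns from data and pass the resulting arrays to the class named by MODEL_TYPE.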

class cleands.base.PredictionModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised prediction models.

Extends SupervisedModel by adding a predict() method for generating predictions on new data.

Parameters:
  • formula (str)

  • data (DataFrame)

predict(new_data)[source]

Generate predictions on new data.

Parameters:

new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.

Returns:

Predictions indexed to new_data.index and named after the response variable (y_var).

Return type:

pd.Series

class cleands.base.ClassificationModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised classification models.

Extends SupervisedModel by adding a classify() method for generating classifications on new data.

Parameters:
  • formula (str)

  • data (DataFrame)

classify(new_data)[source]

Generate classifications on new data.

Parameters:

new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.

Returns:

Classifications indexed to new_data.index and named after the response variable (y_var).

Return type:

pd.Series

predict_proba(new_data)[source]

Predict probabilities for classes on new data.

Parameters:

new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.

Returns:

Predicted probabilities with shape (n_samples, n_classes). Columns are named class=0, class=1, ..., class=r corresponding to the r classes returned by the underlying model.

Return type:

pd.DataFrame

class cleands.base.UnsupervisedModel(formula, data, *args, **kwargs)[source]

Bases: ABC

Abstract base class for unsupervised models constructed from a formula.

Subclasses must set the class attribute MODEL_TYPE to a concrete unsupervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.

Variables:
  • formula (str) – Formula string used to specify the model.

  • x_vars (list[str]) – Names of predictor variables.

  • y_var (str) – Name of response variable.

  • data (pd.DataFrame) – Parsed DataFrame containing predictors and response.

  • model (unsupervised_model) – Fitted underlying model implementation.

Parameters:
  • formula (str)

  • data (DataFrame)

tidyci(level=0.95, ci=True)[source]

Return a tidy coefficient table with optional confidence intervals.

Parameters:
  • level (float, default=0.95) – Confidence level for intervals.

  • ci (bool, default=True) – Whether to include confidence intervals.

Returns:

Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.

Return type:

pd.DataFrame

property tidy: DataFrame

Return a tidy table of parameter estimates without confidence intervals.

Equivalent to calling tidyci() with ci=False.

Returns:

Table of parameter estimates.

Return type:

pd.DataFrame

property glance: DataFrame

Return model-level summary statistics.

Returns:

One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).

Return type:

pd.DataFrame

class cleands.base.ClusteringModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for clustering models.

Extends SupervisedModel by adding a cluster() method for assigning cluster labels to new data.

Parameters:
  • formula (str)

  • data (DataFrame)

cluster(new_data)[source]

Assign cluster labels to new data.

Parameters:

new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.

Returns:

Cluster assignments indexed to new_data.index and named after the response variable (y_var).

Return type:

pd.Series

class cleands.base.DimensionReductionModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised dimension-reduction models.

Extends SupervisedModel by adding a reduce() method for projecting new data into a lower-dimensional space.

Parameters:
  • formula (str)

  • data (DataFrame)

reduce(new_data)[source]

Project new data into the reduced space.

Parameters:

new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.

Returns:

Reduced representation with shape (n_samples, r). Columns are named z1, z2, ..., zr corresponding to the r components returned by the underlying model.

Return type:

pd.DataFrame