cleands.base module

Core base classes and abstract interfaces for the CleanDS statistical modeling framework.

This module defines the fundamental building blocks for supervised, unsupervised, prediction, classification, clustering, and distribution models. It establishes a consistent API across model families, providing shared properties such as fitted, residuals, mean_squared_error, aic, and bic.

Key features:

Abstract base classes for supervised and unsupervised models.
Prediction and classification models with common evaluation metrics.
Clustering model interface with mean/cluster management utilities.
Distribution model protocols with PDF and CDF support.
Likelihood-based model mixins for deviance, AIC, BIC, and log-likelihood.
Variance models for parameter uncertainty (standard errors, confidence intervals).
Supervised wrappers (SupervisedModel, PredictionModel, ClassificationModel) for integrating with formula notation and tidy outputs.

These base definitions are designed for extensibility: custom regression, classification, clustering, or distribution models should inherit from the appropriate abstract class to ensure interoperability within the CleanDS ecosystem.
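The inheritance pattern described above can be illustrated with a small standalone sketch. It does not import CleanDS: PredictionBase and MeanOnlyModel are hypothetical names, and the sketch only mirrors the documented idea that a subclass implements predict() while fitted and residuals come from the base class.

```python
from abc import ABC, abstractmethod


class PredictionBase(ABC):
    """Hypothetical stand-in for an abstract prediction base class."""

    def __init__(self, x, y):
        self.x = x  # training features, one row per observation
        self.y = y  # training targets

    @abstractmethod
    def predict(self, target):
        """Return one prediction per row of target."""

    @property
    def fitted(self):
        # Shared property: predictions for the training data.
        return self.predict(self.x)

    @property
    def residuals(self):
        # Shared property: observed minus fitted values.
        return [obs - fit for obs, fit in zip(self.y, self.fitted)]


class MeanOnlyModel(PredictionBase):
    """Toy subclass (assumption): always predicts the training mean of y."""

    def __init__(self, x, y):
        super().__init__(x, y)
        self.mean_ = sum(y) / len(y)

    def predict(self, target):
        return [self.mean_ for _ in target]


model = MeanOnlyModel(x=[[1.0], [2.0], [3.0]], y=[2.0, 4.0, 9.0])
print(model.fitted)     # [5.0, 5.0, 5.0]
print(model.residuals)  # [-3.0, -1.0, 4.0]
```

Only predict() is written in the subclass; the shared properties are inherited, which is the kind of interoperability the base classes are meant to provide.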

class cleands.base.learning_model(x)[source]

Bases: ABC

Abstract base class for all learning models.

Parameters:: x (ndarray)

class cleands.base.supervised_model(x, y)[source]

Bases: learning_model, ABC

Base class for supervised learning models with features and labels.

Parameters:

x (ndarray)
y (ndarray)

class cleands.base.unsupervised_model(x)[source]

Bases: learning_model, ABC

Base class for unsupervised learning models.

Parameters:: x (ndarray)

class cleands.base.prediction_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised prediction models.

Parameters:

x (ndarray)
y (ndarray)

abstract predict(target)[source]

Predict outcomes for new input data.

Parameters:: target (np.ndarray) – Feature matrix for prediction.
Returns:: Predicted values.
Return type:: np.ndarray

property fitted: ndarray

Predictions for training data (self.x).

Type:: np.ndarray

property residuals: ndarray

Difference between observed and fitted values.

Type:: np.ndarray

property residual_sum_of_squares: float

Residual sum of squares (RSS).

Type:: float

property mean_squared_error: float

Mean squared error (MSE).

Type:: float

out_of_sample_mean_squared_error(x, y)[source]

Compute out-of-sample mean squared error.

Parameters:

x (np.ndarray) – Test feature matrix.
y (np.ndarray) – Test target vector.

Returns:

Out-of-sample MSE.

Return type:

float

property root_mean_squared_error: float

Root mean squared error (RMSE).

Type:: float

out_of_sample_root_mean_squared_error(x, y)[source]

Compute out-of-sample RMSE.

Parameters:

x (np.ndarray) – Test feature matrix.
y (np.ndarray) – Test target vector.

Returns:

Out-of-sample RMSE.

Return type:

float

property r_squared: float

Coefficient of determination (R²).

Type:: float

property adjusted_r_squared: float

Adjusted R² that accounts for model complexity.

Type:: float

property degrees_of_freedom: int

Degrees of freedom = n_obs - n_feat.

Type:: int

property residual_variance: float

Estimated residual variance.

Type:: float
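The fit statistics listed for prediction_model can be sketched from their textbook definitions. This is an illustration under stated assumptions, not the library's code: regression_metrics is a hypothetical helper, n_feat is assumed to count every column including any constant, and the adjusted R² and residual variance are assumed to divide by n_obs - n_feat.

```python
import math


def regression_metrics(y, fitted, n_feat):
    """Textbook fit statistics from observed and fitted values (hypothetical helper)."""
    n_obs = len(y)
    residuals = [obs - fit for obs, fit in zip(y, fitted)]
    rss = sum(r * r for r in residuals)          # residual sum of squares
    mse = rss / n_obs                            # mean squared error
    y_bar = sum(y) / n_obs
    tss = sum((obs - y_bar) ** 2 for obs in y)   # total sum of squares
    r2 = 1.0 - rss / tss
    dof = n_obs - n_feat                         # degrees of freedom
    return {
        "residual_sum_of_squares": rss,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "r_squared": r2,
        # Assumption: penalty uses (n_obs - 1) / (n_obs - n_feat).
        "adjusted_r_squared": 1.0 - (1.0 - r2) * (n_obs - 1) / dof,
        "degrees_of_freedom": dof,
        "residual_variance": rss / dof,
    }


metrics = regression_metrics(
    y=[1.0, 2.0, 3.0, 4.0],
    fitted=[1.5, 1.5, 3.5, 3.5],
    n_feat=2,
)
for name, value in metrics.items():
    print(name, value)
```

The out-of-sample variants follow the same arithmetic with test x and y substituted for the training data.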

class cleands.base.classification_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised classification models.

abstract predict_proba(target)[source]

Predict class probabilities.

Parameters:: target (np.ndarray) – Feature matrix.
Returns:: Class probability estimates.
Return type:: np.ndarray

property n_classes: int

Number of classes in the dataset.

Type:: int

classify(target)[source]

Predict class labels.

Parameters:: target (np.ndarray) – Feature matrix.
Returns:: Predicted class labels.
Return type:: np.ndarray

property fitted

Predicted class labels for training data.

Type:: np.ndarray

property accuracy

Training accuracy.

Type:: float

out_of_sample_accuracy(x, y)[source]

Compute out-of-sample accuracy.

Parameters:

x (np.ndarray) – Test features.
y (np.ndarray) – Test labels.

Returns:

Accuracy score.

Return type:

float

property misclassification_probability

Misclassification probability = 1 - accuracy.

Type:: float

out_of_sample_misclassification_probability(x, y)[source]

Compute out-of-sample misclassification probability.

Parameters:

x (np.ndarray) – Test features.
y (np.ndarray) – Test labels.

Returns:

Misclassification probability.

Return type:

float

property confusion_matrix

Confusion matrix for training data.

Type:: np.ndarray

out_of_sample_confusion_matrix(x, y)[source]

Compute confusion matrix for test data.

Parameters:

x (np.ndarray) – Test features.
y (np.ndarray) – Test labels.

Returns:

Confusion matrix.

Return type:

np.ndarray
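A minimal sketch of the classification metrics above, written without CleanDS. The helper names are hypothetical, class labels are assumed to be integers 0..n_classes-1, and the confusion matrix is assumed to put actual classes on rows and predicted classes on columns; the library's orientation may differ.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted (assumption)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Share of observations whose predicted label equals the actual label."""
    hits = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return hits / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy(y_true, y_pred)
misclassification = 1.0 - acc  # matches the documented 1 - accuracy

print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(acc, misclassification)
```

The diagonal of the matrix sums to the number of correct predictions, so accuracy can also be read off as trace divided by the number of observations.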

class cleands.base.clustering_model(x)[source]

Bases: unsupervised_model, ABC

Base class for clustering models.

Parameters:: x (ndarray)

abstract cluster(target)[source]

Assign cluster labels for given data.

Parameters:: target (np.ndarray) – Feature matrix.
Returns:: Cluster assignments.
Return type:: np.ndarray

property n_clusters: int

Number of clusters.

Type:: int

property means: ndarray

Cluster centroids.

Type:: np.ndarray

property groups: ndarray

Cluster assignments for training data.

Type:: np.ndarray

property within_group_sum_of_squares: ndarray

Within-group sum of squares per cluster.

Type:: np.ndarray

property total_within_group_sum_of_squares: float

Total within-group sum of squares across clusters.

Type:: float
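The within-group sums of squares can be sketched as follows. This is a plain-Python illustration: within_group_sum_of_squares is a hypothetical helper, groups is assumed to hold integer cluster indices, and means is assumed to hold one centroid per cluster.

```python
def within_group_sum_of_squares(x, groups, means):
    """Squared Euclidean distance to the assigned centroid, summed per cluster."""
    wgss = [0.0] * len(means)
    for row, g in zip(x, groups):
        wgss[g] += sum((v - m) ** 2 for v, m in zip(row, means[g]))
    return wgss


x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
groups = [0, 0, 1, 1]                 # cluster index for each row of x
means = [[1.0, 0.0], [10.0, 11.0]]    # one centroid per cluster

wgss = within_group_sum_of_squares(x, groups, means)
total_wgss = sum(wgss)

print(wgss)        # [2.0, 2.0]
print(total_wgss)  # 4.0
```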

class cleands.base.distribution_model(x)[source]

Bases: unsupervised_model, ABC

Base class for unsupervised parametric/nonparametric distributions.

Parameters:: x (ndarray)

pdf(target)[source]

Probability density (or mass) function evaluated at target.

Parameters:: target (np.ndarray) – Points at which to evaluate the pdf/pmf.
Returns:: Density (or probability) values with shape compatible with target.
Return type:: np.ndarray

cdf(target)[source]

Cumulative distribution function evaluated at target.

Parameters:: target (np.ndarray) – Points at which to evaluate the CDF.
Returns:: Cumulative probabilities with shape compatible with target.
Return type:: np.ndarray

class cleands.base.dimension_reduction_model(x)[source]

Bases: unsupervised_model, ABC

Base class for unsupervised dimension reduction algorithms (e.g., PCA).

Parameters:: x (ndarray)

reduce(target)[source]

Project target into a lower-dimensional space.

Parameters:: target (np.ndarray) – Data matrix to reduce, shape (n_obs, n_feat).
Returns:: Reduced representation, shape (n_obs, k) where k <= n_feat.
Return type:: np.ndarray

out_of_sample_mean_squared_error(target)[source]

Reprojection MSE: squared reconstruction error per element.

This computes the MSE of projecting target onto the learned subspace and measuring the residual in the orthogonal complement.

Parameters:: target (np.ndarray) – Data matrix to evaluate, shape (n_obs, n_feat).
Returns:: Scalar MSE (float-like) computed as trace(T' M T) / T.size.
Return type:: np.ndarray

property mean_squared_error: ndarray

In-sample mean squared error for self.x.

Type:: np.ndarray

out_of_sample_root_mean_squared_error(target)[source]

Root mean squared reconstruction error for new data.

Parameters:: target (np.ndarray) – Data matrix to evaluate.
Returns:: Scalar RMSE.
Return type:: np.ndarray

property root_mean_squared_error: ndarray

In-sample root mean squared error for self.x.

Type:: np.ndarray
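The reprojection error can be illustrated with a small sketch. It assumes that M is the projector onto the orthogonal complement of the learned subspace (M = I - V V' for an orthonormal basis V) and that no centering is applied; both are assumptions, and reprojection_mse is a hypothetical helper. Under these assumptions trace(T' M T) equals the sum of squared reconstruction residuals.

```python
import math


def reprojection_mse(target, basis):
    """Mean squared reconstruction error per element.

    basis: orthonormal vectors spanning the learned subspace (assumption).
    """
    total = 0.0
    for row in target:
        # Coordinates of the row within the subspace.
        scores = [sum(r * b for r, b in zip(row, vec)) for vec in basis]
        # Reconstruction of the row from those coordinates.
        recon = [
            sum(s * vec[j] for s, vec in zip(scores, basis))
            for j in range(len(row))
        ]
        # Squared residual left in the orthogonal complement.
        total += sum((r - h) ** 2 for r, h in zip(row, recon))
    n_elements = len(target) * len(target[0])  # plays the role of T.size
    return total / n_elements


basis = [[1.0, 0.0]]  # keep only the first coordinate axis
target = [[3.0, 1.0], [-2.0, -1.0], [0.0, 2.0]]

mse = reprojection_mse(target, basis)
rmse = math.sqrt(mse)
print(mse, rmse)  # 1.0 1.0
```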

class cleands.base.supervised_dimension_reduction_model(x, y)[source]

Bases: supervised_model, ABC

Base class for supervised dimension reduction (e.g., CCA).

Parameters:

x (ndarray)
y (ndarray)

reduce_X(x_new)[source]

Project x_new into the supervised lower-dimensional X-space.

Parameters:: x_new (np.ndarray) – Feature matrix, shape (n_obs, n_feat).
Returns:: Reduced X scores, shape (n_obs, kx).
Return type:: np.ndarray

reduce_Y(y_new)[source]

Project y_new (targets) into the supervised lower-dimensional Y-space.

Parameters:: y_new (np.ndarray) – Target vector or matrix, shape (n_obs, …) depending on model.
Returns:: Reduced Y scores, shape (n_obs, ky).
Return type:: np.ndarray

reduce(x_new=None, y_new=None)[source]

Reduce X, Y, or both, depending on provided inputs.

At least one of x_new or y_new must be provided.

Parameters:

x_new (Optional[np.ndarray]) – Feature matrix to reduce.
y_new (Optional[np.ndarray]) – Target vector/matrix to reduce.

Returns:

If both provided: (X_reduced, Y_reduced).
If only x_new provided: X_reduced.
If only y_new provided: Y_reduced.

Return type:

np.ndarray | tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If neither x_new nor y_new is provided.
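The dispatch rules for reduce() can be sketched as a free function. This is an illustration only: reduce_dispatch and the two toy projections are hypothetical and stand in for reduce_X and reduce_Y.

```python
def reduce_dispatch(reduce_x, reduce_y, x_new=None, y_new=None):
    """Return X scores, Y scores, or both, depending on which inputs are given."""
    if x_new is None and y_new is None:
        raise ValueError("Provide x_new, y_new, or both.")
    if y_new is None:
        return reduce_x(x_new)
    if x_new is None:
        return reduce_y(y_new)
    return reduce_x(x_new), reduce_y(y_new)


# Toy projections (assumptions): keep the first column / negate the first column.
def first_column(rows):
    return [[row[0]] for row in rows]


def negated_first_column(rows):
    return [[-row[0]] for row in rows]


x_new = [[1.0, 2.0], [3.0, 4.0]]
y_new = [[5.0], [6.0]]

only_x = reduce_dispatch(first_column, negated_first_column, x_new=x_new)
only_y = reduce_dispatch(first_column, negated_first_column, y_new=y_new)
both = reduce_dispatch(first_column, negated_first_column, x_new=x_new, y_new=y_new)

print(only_x)  # [[1.0], [3.0]]
print(only_y)  # [[-5.0], [-6.0]]
print(both)    # ([[1.0], [3.0]], [[-5.0], [-6.0]])
```

Testing against None rather than truthiness matters here, because an empty or all-zero array is still a valid input.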

class cleands.base.likelihood_type(*args, **kwargs)[source]

Bases: Protocol

Structural protocol for objects exposing likelihood metrics.

property log_likelihood: float

Model log-likelihood.

Type:: float

property null_likelihood: float

Log-likelihood of the null/reference model.

Type:: float
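Because likelihood_type is a structural protocol, an object satisfies it by having the right attributes rather than by inheriting from it. A minimal sketch with typing.Protocol follows; LikelihoodLike, CoinModel, and NotAModel are hypothetical names, and the runtime_checkable decorator is added here only so the check can be shown with isinstance.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LikelihoodLike(Protocol):
    """Hypothetical mirror of a protocol with two likelihood properties."""

    @property
    def log_likelihood(self) -> float: ...

    @property
    def null_likelihood(self) -> float: ...


class CoinModel:
    """Unrelated class: no inheritance, but it has both attributes."""

    log_likelihood = -3.2
    null_likelihood = -4.1


class NotAModel:
    """Lacks the required attributes."""


print(isinstance(CoinModel(), LikelihoodLike))  # True
print(isinstance(NotAModel(), LikelihoodLike))  # False
```

The runtime check only verifies that the attributes exist; the float annotations are enforced by static type checkers, not at run time.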

class cleands.base.likelihood_model[source]

Bases: ABC

Mixin-like base for models that report likelihood-based criteria.

n_feat: int

n_obs: int

abstract property log_likelihood: float

Model log-likelihood under fitted parameters.

Type:: float

abstract property null_likelihood: float

Log-likelihood of the null/reference model.

Type:: float

property aic: float

Akaike Information Criterion (smaller is better).

Type:: float

property bic: float

Bayesian Information Criterion (smaller is better).

Type:: float

property deviance: float

Model deviance = 2*LL(model) - 2*LL(null).

Type:: float
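The three criteria can be written out from the two log-likelihoods. The sketch below uses the standard definitions AIC = 2k - 2LL and BIC = k ln(n) - 2LL together with the deviance formula stated above; taking k = n_feat and n = n_obs is an assumption about how the library counts parameters, and information_criteria is a hypothetical helper.

```python
import math


def information_criteria(log_likelihood, null_likelihood, n_obs, n_feat):
    """Likelihood-based criteria from standard formulas (hypothetical helper)."""
    return {
        # Assumption: the parameter count k equals n_feat.
        "aic": 2 * n_feat - 2 * log_likelihood,
        "bic": n_feat * math.log(n_obs) - 2 * log_likelihood,
        # Matches the documented deviance = 2*LL(model) - 2*LL(null).
        "deviance": 2 * log_likelihood - 2 * null_likelihood,
    }


ic = information_criteria(
    log_likelihood=-120.0,
    null_likelihood=-150.0,
    n_obs=100,
    n_feat=3,
)
print(ic["aic"])       # 246.0
print(ic["bic"])       # about 253.82
print(ic["deviance"])  # 60.0
```

BIC penalizes each parameter by ln(n_obs) instead of 2, so it favors smaller models than AIC once n_obs exceeds about 7.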

class cleands.base.parametric_distribution_model(x)[source]

Bases: distribution_model, likelihood_model, ABC

Base class for parametric distributions that expose likelihood metrics.

Parameters:: x (ndarray)

params: ndarray

x: ndarray

abstract pdf(target)[source]

Probability density (or mass) evaluated at target.

Parameters:: target (np.ndarray) – Points for evaluation.
Returns:: Density/probability values.
Return type:: np.ndarray

abstract cdf(target)[source]

Cumulative distribution evaluated at target.

Parameters:: target (np.ndarray) – Points for evaluation.
Returns:: Cumulative probabilities.
Return type:: np.ndarray

property log_likelihood: float

In-sample log-likelihood for self.x.

Type:: float

abstract out_of_sample_log_likelihood(target)[source]

Log-likelihood evaluated on arbitrary data.

Parameters:: target (np.ndarray) – Data on which to evaluate LL.
Returns:: Log-likelihood value.
Return type:: float

property null_likelihood: float

In-sample null log-likelihood for self.x.

Type:: float

abstract out_of_sample_null_likelihood(target)[source]

Null-model log-likelihood on arbitrary data.

Parameters:: target (np.ndarray) – Data on which to evaluate null LL.
Returns:: Null log-likelihood value.
Return type:: float

property deviance: ndarray

Deviance for self.x = 2*(LL - LL_null).

Type:: np.ndarray

out_of_sample_deviance(target)[source]

Deviance on arbitrary data.

Parameters:: target (np.ndarray) – Data on which to compute deviance.
Returns:: Deviance value(s).
Return type:: np.ndarray

class cleands.base.prediction_likelihood_model[source]

Bases: ABC

Base for prediction models that define likelihood via evaluate_lnL.

y: ndarray

n_obs: int

n_feat: int

abstract evaluate_lnL(pred)[source]

Evaluate log-likelihood given predictions pred.

Parameters:: pred (np.ndarray) – Predicted values or probabilities aligned with y.
Returns:: Log-likelihood value.
Return type:: float

abstract property fitted: ndarray

Model-fitted predictions on training data.

Returns:: Predictions aligned with y.
Return type:: np.ndarray

property log_likelihood: float

Log-likelihood at fitted values.

Type:: float

property null_likelihood: float

Log-likelihood of a mean-only/constant (null) predictor.

Type:: float

property aic: float

Akaike Information Criterion.

Type:: float

property bic: float

Bayesian Information Criterion.

Type:: float

property deviance: float

Deviance = 2*LL(model) - 2*LL(null).

Type:: float
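One way to picture evaluate_lnL is a Gaussian log-likelihood with the variance concentrated out. The sketch below is an assumption about what a subclass might implement, not the library's definition: gaussian_lnL is a hypothetical function, and the null model is taken to be the mean-only predictor described above.

```python
import math


def gaussian_lnL(y, pred):
    """Concentrated Gaussian log-likelihood of y given predictions pred.

    Uses the maximum-likelihood variance estimate rss / n, which must be > 0.
    """
    n = len(y)
    rss = sum((obs - p) ** 2 for obs, p in zip(y, pred))
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]

# Null predictor: the sample mean repeated for every observation.
y_bar = sum(y) / len(y)
null_pred = [y_bar] * len(y)

log_likelihood = gaussian_lnL(y, fitted)
null_likelihood = gaussian_lnL(y, null_pred)
deviance = 2 * log_likelihood - 2 * null_likelihood

print(log_likelihood, null_likelihood, deviance)
```

Under this Gaussian assumption the deviance reduces to n * ln(RSS_null / RSS_model), here 4 * ln(5).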

class cleands.base.broom_model(*args, **kwargs)[source]

Bases: Protocol

Protocol for tidy/glance accessors (broom-like API).

property tidy: DataFrame

Per-parameter summary (estimates, SEs, tests, etc.).

Type:: pd.DataFrame

property glance: DataFrame

Model-level summary (fit statistics, diagnostics, etc.).

Type:: pd.DataFrame

class cleands.base.variance_model[source]

Bases: ABC

Mixin for models that expose variance-covariance and inferential stats.

abstract vcov_params()[source]

Variance-covariance matrix of parameter estimates.

Returns:: (p x p) covariance matrix for the first n_feat parameters.
Return type:: np.ndarray

property std_error

Standard errors for parameters (from vcov_params).

Type:: np.ndarray

property t_statistic

t-statistics = params / std_error.

Type:: np.ndarray

property p_value

Two-sided p-values under Student-t with df = n_obs - n_feat.

Type:: np.ndarray

conf_int(level=0.95)[source]

Confidence intervals for parameters.

Parameters:: level (float) – Coverage probability (default 0.95).
Returns:: 2 x p array with lower/upper bounds by column.
Return type:: np.ndarray

property tidy

Tidy per-parameter table (no CIs).

Type:: pd.DataFrame

tidyci(level=0.95, ci=True)[source]

Tidy per-parameter table with optional confidence intervals.

Parameters:

level (float) – CI level (default 0.95).
ci (bool) – If True, include CI columns.

Returns:

Columns include variable, estimate, std.error, t.statistic, p.value, and optionally ci.lower, ci.upper.

Return type:

pd.DataFrame

property glance: DataFrame

Model-level summary table.

Type:: pd.DataFrame
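The chain from vcov_params to standard errors, t-statistics, and confidence intervals can be sketched with the standard library. Two things here are assumptions: inference_table is a hypothetical helper, and a normal quantile is swapped in for the Student-t quantile because the standard library has no t distribution. For small n_obs - n_feat the intervals below are therefore narrower than t-based ones.

```python
import math
from statistics import NormalDist


def inference_table(params, vcov, level=0.95):
    """Standard errors, t-statistics, and large-sample confidence bounds."""
    # Standard errors are the square roots of the covariance diagonal.
    std_error = [math.sqrt(vcov[i][i]) for i in range(len(params))]
    t_statistic = [b / se for b, se in zip(params, std_error)]
    # Normal approximation to the t critical value (assumption).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = [b - z * se for b, se in zip(params, std_error)]
    upper = [b + z * se for b, se in zip(params, std_error)]
    return std_error, t_statistic, [lower, upper]


params = [2.0, -1.0]
vcov = [[0.25, 0.0], [0.0, 0.04]]

std_error, t_statistic, bounds = inference_table(params, vcov)
print(std_error)    # approximately [0.5, 0.2]
print(t_statistic)  # approximately [4.0, -5.0]
print(bounds)       # row 0 = lower bounds, row 1 = upper bounds
```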

class cleands.base.SupervisedModel(formula, data, *args, **kwargs)[source]

Bases: ABC

Abstract base class for supervised models constructed from a formula.

Subclasses must set the class attribute MODEL_TYPE to a concrete supervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.

Variables:

formula (str) – Formula string used to specify the model.
x_vars (list[str]) – Names of predictor variables.
y_var (str) – Name of response variable.
data (pd.DataFrame) – Parsed DataFrame containing predictors and response.
model (supervised_model) – Fitted underlying model implementation.

Parameters:

formula (str)
data (DataFrame)

tidyci(level=0.95, ci=True)[source]

Return a tidy coefficient table with optional confidence intervals.

Parameters:

level (float, default=0.95) – Confidence level for intervals.
ci (bool, default=True) – Whether to include confidence intervals.

Returns:

Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.

Return type:

pd.DataFrame

property tidy: DataFrame

Return a tidy table of parameter estimates without confidence intervals.

Equivalent to calling tidyci() with ci=False.

Returns:: Table of parameter estimates.
Return type:: pd.DataFrame

property glance: DataFrame

Return model-level summary statistics.

Returns:: One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).
Return type:: pd.DataFrame

class cleands.base.PredictionModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised prediction models.

Extends SupervisedModel by adding a predict() method for generating predictions on new data.

Parameters:

formula (str)
data (DataFrame)

predict(new_data)[source]

Generate predictions on new data.

Parameters:: new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
Returns:: Predictions indexed to new_data.index and named after the response variable (y_var).
Return type:: pd.Series

class cleands.base.ClassificationModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised classification models.

Extends SupervisedModel by adding a classify() method for generating classifications on new data.

Parameters:

formula (str)
data (DataFrame)

classify(new_data)[source]

Generate classifications on new data.

Parameters:: new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
Returns:: Classifications indexed to new_data.index and named after the response variable (y_var).
Return type:: pd.Series

predict_proba(new_data)[source]

Predict probabilities for classes on new data.

Parameters:: new_data (pd.DataFrame) – DataFrame containing the predictor variables for the points to be assigned to classes.
Returns:: Predicted probabilities with shape (n_samples, n_classes). Columns are named class=0, class=1, ..., class=r corresponding to the r classes returned by the underlying model.
Return type:: pd.DataFrame

class cleands.base.UnsupervisedModel(formula, data, *args, **kwargs)[source]

Bases: ABC

Abstract base class for unsupervised models constructed from a formula.

Subclasses must set the class attribute MODEL_TYPE to a concrete unsupervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.

Variables:

formula (str) – Formula string used to specify the model.
x_vars (list[str]) – Names of predictor variables.
y_var (str) – Name of response variable.
data (pd.DataFrame) – Parsed DataFrame containing predictors and response.
model (unsupervised_model) – Fitted underlying model implementation.

Parameters:

formula (str)
data (DataFrame)

tidyci(level=0.95, ci=True)[source]

Return a tidy coefficient table with optional confidence intervals.

Parameters:

level (float, default=0.95) – Confidence level for intervals.
ci (bool, default=True) – Whether to include confidence intervals.

Returns:

Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.

Return type:

pd.DataFrame

property tidy: DataFrame

Return a tidy table of parameter estimates without confidence intervals.

Equivalent to calling tidyci() with ci=False.

Returns:: Table of parameter estimates.
Return type:: pd.DataFrame

property glance: DataFrame

Return model-level summary statistics.

Returns:: One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).
Return type:: pd.DataFrame

class cleands.base.ClusteringModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised clustering models.

Extends SupervisedModel by adding a cluster() method for assigning new data to clusters.

Parameters:

formula (str)
data (DataFrame)

cluster(new_data)[source]

Assign cluster labels to new data.

Parameters:: new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
Returns:: Predictions indexed to new_data.index and named after the response variable (y_var).
Return type:: pd.Series

class cleands.base.DimensionReductionModel(formula, data, *args, **kwargs)[source]

Bases: SupervisedModel, ABC

Concrete interface for supervised dimension-reduction models.

Extends SupervisedModel by adding a reduce() method for projecting new data into a lower-dimensional space.

Parameters:

formula (str)
data (DataFrame)

reduce(new_data)[source]

Project new data into the reduced space.

Parameters:: new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
Returns:: Reduced representation with shape (n_samples, r). Columns are named z1, z2, ..., zr corresponding to the r components returned by the underlying model.
Return type:: pd.DataFrame