cleands.base module
Core base classes and abstract interfaces for the CleanDS statistical modeling framework.
This module defines the fundamental building blocks for supervised, unsupervised, prediction, classification, clustering, and distribution models. It establishes a consistent API across model families, providing shared properties such as fitted, residuals, mean_squared_error, aic, and bic.
- Key features:
Abstract base classes for supervised and unsupervised models.
Prediction and classification models with common evaluation metrics.
Clustering model interface with mean/cluster management utilities.
Distribution model protocols with PDF and CDF support.
Likelihood-based model mixins for deviance, AIC, BIC, and log-likelihood.
Variance models for parameter uncertainty (standard errors, confidence intervals).
Supervised wrappers (SupervisedModel, PredictionModel, ClassificationModel) for integrating with formula notation and tidy outputs.
These base definitions are designed for extensibility: custom regression, classification, clustering, or distribution models should inherit from the appropriate abstract class to ensure interoperability within the CleanDS ecosystem.
- class cleands.base.learning_model(x)[source]
Bases:
ABC
Abstract base class for all learning models.
- Parameters:
x (ndarray)
- class cleands.base.supervised_model(x, y)[source]
Bases:
learning_model, ABC
Base class for supervised learning models with features and labels.
- Parameters:
x (ndarray)
y (ndarray)
- class cleands.base.unsupervised_model(x)[source]
Bases:
learning_model, ABC
Base class for unsupervised learning models.
- Parameters:
x (ndarray)
- class cleands.base.prediction_model(x, y)[source]
Bases:
supervised_model, ABC
Base class for supervised prediction models.
- Parameters:
x (ndarray)
y (ndarray)
- abstract predict(target)[source]
Predict outcomes for new input data.
- Parameters:
target (np.ndarray) – Feature matrix for prediction.
- Returns:
Predicted values.
- Return type:
np.ndarray
- property fitted: ndarray
Predictions for training data (self.x).
- Type:
np.ndarray
- property residuals: ndarray
Difference between observed and fitted values.
- Type:
np.ndarray
- property residual_sum_of_squares: float
Residual sum of squares (RSS).
- Type:
float
- property mean_squared_error: float
Mean squared error (MSE).
- Type:
float
- out_of_sample_mean_squared_error(x, y)[source]
Compute out-of-sample mean squared error.
- Parameters:
x (np.ndarray) – Test feature matrix.
y (np.ndarray) – Test target vector.
- Returns:
Out-of-sample MSE.
- Return type:
float
- property root_mean_squared_error: float
Root mean squared error (RMSE).
- Type:
float
- out_of_sample_root_mean_squared_error(x, y)[source]
Compute out-of-sample RMSE.
- Parameters:
x (np.ndarray) – Test feature matrix.
y (np.ndarray) – Test target vector.
- Returns:
Out-of-sample RMSE.
- Return type:
float
- property r_squared: float
Coefficient of determination (R²).
- Type:
float
- property adjusted_r_squared: float
Adjusted R² that accounts for model complexity.
- Type:
float
- property degrees_of_freedom: int
Degrees of freedom = n_obs - n_feat.
- Type:
int
- property residual_variance: float
Estimated residual variance.
- Type:
float
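The properties above can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: the toy data, the least-squares fit, and the exact normalizations (MSE divided by n_obs, residual variance divided by the degrees of freedom, the intercept-style adjusted R²) are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: an intercept column plus one feature.
x = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 3.0, 4.0])])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

# Least-squares coefficients stand in for any fitted prediction model.
beta, *_ = np.linalg.lstsq(x, y, rcond=None)


def predict(target):
    """Predict outcomes for a feature matrix (illustrative stand-in)."""
    return target @ beta


fitted = predict(x)                       # predictions for the training data
residuals = y - fitted                    # observed minus fitted
rss = float(residuals @ residuals)        # residual sum of squares
n_obs, n_feat = x.shape
mse = rss / n_obs                         # assumption: MSE divides by n_obs
rmse = mse ** 0.5
tss = float(((y - y.mean()) ** 2).sum())  # total sum of squares
r_squared = 1.0 - rss / tss
degrees_of_freedom = n_obs - n_feat       # as documented above
# Assumption: the usual adjustment when an intercept is among the features.
adjusted_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / degrees_of_freedom
# Assumption: residual variance uses the degrees-of-freedom denominator.
residual_variance = rss / degrees_of_freedom
```

The out-of-sample variants would apply the same formulas to `y_test - predict(x_test)` instead of the training residuals.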
- class cleands.base.classification_model(x, y)[source]
Bases:
supervised_model, ABC
Base class for supervised classification models.
- abstract predict_proba(target)[source]
Predict class probabilities.
- Parameters:
target (np.ndarray) – Feature matrix.
- Returns:
Class probability estimates.
- Return type:
np.ndarray
- property n_classes: int
Number of classes in the dataset.
- Type:
int
- classify(target)[source]
Predict class labels.
- Parameters:
target (np.ndarray) – Feature matrix.
- Returns:
Predicted class labels.
- Return type:
np.ndarray
- property fitted
Predicted class labels for training data.
- Type:
np.ndarray
- property accuracy
Training accuracy.
- Type:
float
- out_of_sample_accuracy(x, y)[source]
Compute out-of-sample accuracy.
- Parameters:
x (np.ndarray) – Test features.
y (np.ndarray) – Test labels.
- Returns:
Accuracy score.
- Return type:
float
- property misclassification_probability
Misclassification probability = 1 - accuracy.
- Type:
float
- out_of_sample_misclassification_probability(x, y)[source]
Compute out-of-sample misclassification probability.
- Parameters:
x (np.ndarray) – Test features.
y (np.ndarray) – Test labels.
- Returns:
Misclassification probability.
- Return type:
float
- property confusion_matrix
Confusion matrix for training data.
- Type:
np.ndarray
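The classification metrics above can be sketched with NumPy. The probabilities and labels are made up, and two conventions are assumptions rather than documented behavior: that classify picks the most probable class, and that confusion-matrix rows are observed labels with columns as predicted labels.

```python
import numpy as np

# Hypothetical class probabilities for 6 observations and 3 classes,
# shaped like the output of a predict_proba-style method.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
])
y = np.array([0, 1, 2, 1, 1, 2])  # observed labels

n_classes = proba.shape[1]
labels = proba.argmax(axis=1)     # assumption: classify = most probable class
accuracy = float((labels == y).mean())
misclassification_probability = 1.0 - accuracy

# Assumption: rows index observed classes, columns index predicted classes.
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y, labels), 1)
```

With this layout the diagonal counts correct classifications, so `confusion.trace() / len(y)` reproduces the accuracy.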
- class cleands.base.clustering_model(x)[source]
Bases:
unsupervised_model, ABC
Base class for clustering models.
- Parameters:
x (ndarray)
- abstract cluster(target)[source]
Assign cluster labels for given data.
- Parameters:
target (np.ndarray) – Feature matrix.
- Returns:
Cluster assignments.
- Return type:
np.ndarray
- property n_clusters: int
Number of clusters.
- Type:
int
- property means: ndarray
Cluster centroids.
- Type:
np.ndarray
- property groups: ndarray
Cluster assignments for training data.
- Type:
np.ndarray
- property within_group_sum_of_squares: ndarray
Within-group sum of squares per cluster.
- Type:
np.ndarray
- property total_within_group_sum_of_squares: float
Total within-group sum of squares across clusters.
- Type:
float
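As a rough illustration of the clustering quantities above, the following NumPy sketch computes centroids and within-group sums of squares from given assignments. The data, the fixed assignments, and the nearest-centroid rule in cluster are assumptions; a concrete model may define them differently.

```python
import numpy as np

# Hypothetical training data with known cluster assignments.
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0], [12.0, 10.0]])
groups = np.array([0, 0, 1, 1, 1])

n_clusters = int(groups.max()) + 1
# Centroid of each cluster, one row per cluster.
means = np.vstack([x[groups == k].mean(axis=0) for k in range(n_clusters)])
# Squared distance of each point to its own centroid, summed per cluster.
within_group_sum_of_squares = np.array(
    [((x[groups == k] - means[k]) ** 2).sum() for k in range(n_clusters)]
)
total_within_group_sum_of_squares = float(within_group_sum_of_squares.sum())


def cluster(target):
    """Assign each row to its nearest centroid (assumed Euclidean rule)."""
    dists = ((target[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```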
- class cleands.base.distribution_model(x)[source]
Bases:
unsupervised_model, ABC
Base class for unsupervised parametric/nonparametric distributions.
- Parameters:
x (ndarray)
- class cleands.base.dimension_reduction_model(x)[source]
Bases:
unsupervised_model, ABC
Base class for unsupervised dimension reduction algorithms (e.g., PCA).
- Parameters:
x (ndarray)
- reduce(target)[source]
Project target into a lower-dimensional space.
- Parameters:
target (np.ndarray) – Data matrix to reduce, shape (n_obs, n_feat).
- Returns:
Reduced representation, shape (n_obs, k) where k <= n_feat.
- Return type:
np.ndarray
- out_of_sample_mean_squared_error(target)[source]
Reprojection MSE: squared reconstruction error per element.
This computes the MSE of projecting target onto the learned subspace and measuring the residual in the orthogonal complement.
- Parameters:
target (np.ndarray) – Data matrix to evaluate, shape (n_obs, n_feat).
- Returns:
Scalar MSE (float-like) computed as trace(T’ M T) / T.size.
- Return type:
np.ndarray
- property mean_squared_error: ndarray
In-sample mean squared error for self.x.
- Type:
np.ndarray
- out_of_sample_root_mean_squared_error(target)[source]
Root mean squared reconstruction error for new data.
- Parameters:
target (np.ndarray) – Data matrix to evaluate.
- Returns:
Scalar RMSE.
- Return type:
np.ndarray
- property root_mean_squared_error: ndarray
In-sample root mean squared error for self.x.
- Type:
np.ndarray
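The reprojection error can be sketched with a PCA-style projection in NumPy. This is an illustration under stated assumptions, not the library's code: the data are centered beforehand, the subspace comes from an SVD, and M is taken to be the annihilator I - V Vᵀ acting on the feature axis, so the trace is written here as trace(T M Tᵀ). Whether the library applies M on the feature or the observation axis is not confirmed by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
x = x - x.mean(axis=0)              # assumption: data are centered first

k = 2                               # number of retained components
_, _, vt = np.linalg.svd(x, full_matrices=False)
v = vt[:k].T                        # (n_feat, k) orthonormal basis


def reduce(target):
    """Project target onto the k-dimensional subspace."""
    return target @ v


def out_of_sample_mean_squared_error(target):
    """Squared reconstruction error per element of target."""
    m = np.eye(target.shape[1]) - v @ v.T   # annihilator of the subspace
    return np.trace(target @ m @ target.T) / target.size


mean_squared_error = out_of_sample_mean_squared_error(x)
root_mean_squared_error = np.sqrt(mean_squared_error)
```

Because M is symmetric and idempotent, the trace equals the sum of squared entries of `target - reduce(target) @ v.T`, which is often the cheaper way to compute it.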
- class cleands.base.supervised_dimension_reduction_model(x, y)[source]
Bases:
supervised_model, ABC
Base class for supervised dimension reduction (e.g., CCA).
- Parameters:
x (ndarray)
y (ndarray)
- reduce_X(x_new)[source]
Project x_new into the supervised lower-dimensional X-space.
- Parameters:
x_new (np.ndarray) – Feature matrix, shape (n_obs, n_feat).
- Returns:
Reduced X scores, shape (n_obs, kx).
- Return type:
np.ndarray
- reduce_Y(y_new)[source]
Project y_new (targets) into the supervised lower-dimensional Y-space.
- Parameters:
y_new (np.ndarray) – Target vector or matrix, shape (n_obs, …) depending on model.
- Returns:
Reduced Y scores, shape (n_obs, ky).
- Return type:
np.ndarray
- reduce(x_new=None, y_new=None)[source]
Reduce X, Y, or both, depending on provided inputs.
At least one of x_new or y_new must be provided.
- Parameters:
x_new (Optional[np.ndarray]) – Feature matrix to reduce.
y_new (Optional[np.ndarray]) – Target vector/matrix to reduce.
- Returns:
If both provided: (X_reduced, Y_reduced).
If only x_new provided: X_reduced.
If only y_new provided: Y_reduced.
- Return type:
np.ndarray | tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If neither x_new nor y_new is provided.
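The dispatch rule for reduce (either input or both, with an error when neither is given) can be sketched as follows. The two projection helpers are hypothetical placeholders that simply keep the first column; only the branching logic mirrors the description above.

```python
import numpy as np


def reduce_X(x_new):
    """Placeholder X projection (hypothetical): keep the first column."""
    return x_new[:, :1]


def reduce_Y(y_new):
    """Placeholder Y projection (hypothetical): keep the first column."""
    return y_new[:, :1]


def reduce(x_new=None, y_new=None):
    """Reduce X, Y, or both, depending on which inputs are given."""
    if x_new is None and y_new is None:
        raise ValueError("Provide x_new, y_new, or both.")
    if y_new is None:
        return reduce_X(x_new)
    if x_new is None:
        return reduce_Y(y_new)
    return reduce_X(x_new), reduce_Y(y_new)


x_demo = np.arange(6.0).reshape(3, 2)
y_demo = np.arange(9.0).reshape(3, 3)
both = reduce(x_demo, y_demo)       # tuple (X_reduced, Y_reduced)
only_x = reduce(x_new=x_demo)       # single array
```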
- class cleands.base.likelihood_type(*args, **kwargs)[source]
Bases:
Protocol
Structural protocol for objects exposing likelihood metrics.
- property log_likelihood: float
Model log-likelihood.
- Type:
float
- property null_likelihood: float
Log-likelihood of the null/reference model.
- Type:
float
- class cleands.base.likelihood_model[source]
Bases:
ABC
Mixin-like base for models that report likelihood-based criteria.
- n_feat: int
- n_obs: int
- abstract property log_likelihood: float
Model log-likelihood under fitted parameters.
- Type:
float
- abstract property null_likelihood: float
Log-likelihood of the null/reference model.
- Type:
float
- property aic: float
Akaike Information Criterion (smaller is better).
- Type:
float
- property bic: float
Bayesian Information Criterion (smaller is better).
- Type:
float
- property deviance: float
Model deviance = 2*LL(model) - 2*LL(null).
- Type:
float
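For reference, the three criteria can be written out in a few lines. The deviance follows the formula stated above; the AIC and BIC use the textbook definitions with n_feat as the parameter count, which is an assumption about how this class counts parameters.

```python
import math


def aic(log_likelihood, n_feat):
    """Akaike Information Criterion: 2k - 2*LL (assumes k = n_feat)."""
    return 2.0 * n_feat - 2.0 * log_likelihood


def bic(log_likelihood, n_feat, n_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*LL (assumes k = n_feat)."""
    return math.log(n_obs) * n_feat - 2.0 * log_likelihood


def deviance(log_likelihood, null_likelihood):
    """Deviance as documented above: 2*LL(model) - 2*LL(null)."""
    return 2.0 * log_likelihood - 2.0 * null_likelihood
```

Both criteria penalize extra parameters; the BIC penalty grows with the sample size, so it favors smaller models than the AIC once n_obs exceeds about 7 (where ln(n) > 2).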
- class cleands.base.parametric_distribution_model(x)[source]
Bases:
distribution_model, likelihood_model, ABC
Base class for parametric distributions that expose likelihood metrics.
- Parameters:
x (ndarray)
- params: ndarray
- x: ndarray
- abstract pdf(target)[source]
Probability density (or mass) evaluated at target.
- Parameters:
target (np.ndarray) – Points for evaluation.
- Returns:
Density/probability values.
- Return type:
np.ndarray
- abstract cdf(target)[source]
Cumulative distribution evaluated at target.
- Parameters:
target (np.ndarray) – Points for evaluation.
- Returns:
Cumulative probabilities.
- Return type:
np.ndarray
- property log_likelihood: float
In-sample log-likelihood for self.x.
- Type:
float
- abstract out_of_sample_log_likelihood(target)[source]
Log-likelihood evaluated on arbitrary data.
- Parameters:
target (np.ndarray) – Data on which to evaluate LL.
- Returns:
Log-likelihood value.
- Return type:
float
- property null_likelihood: float
In-sample null log-likelihood for self.x.
- Type:
float
- abstract out_of_sample_null_likelihood(target)[source]
Null-model log-likelihood on arbitrary data.
- Parameters:
target (np.ndarray) – Data on which to evaluate null LL.
- Returns:
Null log-likelihood value.
- Return type:
float
- property deviance: ndarray
Deviance for self.x = 2*(LL - LL_null).
- Type:
np.ndarray
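A minimal sketch of the pdf/cdf/log-likelihood pattern, using an exponential distribution fitted by maximum likelihood. The choice of distribution and the idea that the in-sample log-likelihood is the out-of-sample function applied to the training data are assumptions for illustration; the null-model likelihood is left out because its definition is not given above.

```python
import numpy as np

# Hypothetical training sample.
x = np.array([0.5, 1.0, 1.5, 2.0])
rate = 1.0 / x.mean()              # MLE of the exponential rate
params = np.array([rate])


def pdf(target):
    """Exponential density evaluated at target (zero for negatives)."""
    return np.where(target >= 0, rate * np.exp(-rate * target), 0.0)


def cdf(target):
    """Exponential cumulative distribution evaluated at target."""
    return np.where(target >= 0, 1.0 - np.exp(-rate * target), 0.0)


def out_of_sample_log_likelihood(target):
    """Sum of log densities over arbitrary (non-negative) data."""
    return float(np.log(pdf(target)).sum())


# Assumption: the in-sample value reuses the out-of-sample function on x.
log_likelihood = out_of_sample_log_likelihood(x)
```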
- class cleands.base.prediction_likelihood_model[source]
Bases:
ABC
Base for prediction models that define likelihood via evaluate_lnL.
- y: ndarray
- n_obs: int
- n_feat: int
- abstract evaluate_lnL(pred)[source]
Evaluate log-likelihood given predictions pred.
- Parameters:
pred (np.ndarray) – Predicted values or probabilities aligned with y.
- Returns:
Log-likelihood value.
- Return type:
float
- abstract property fitted: ndarray
Model-fitted predictions on training data.
- Returns:
Predictions aligned with y.
- Return type:
np.ndarray
- property log_likelihood: float
Log-likelihood at fitted values.
- Type:
float
- property null_likelihood: float
Log-likelihood of a mean-only/constant (null) predictor.
- Type:
float
- property aic: float
Akaike Information Criterion.
- Type:
float
- property bic: float
Bayesian Information Criterion.
- Type:
float
- property deviance: float
Deviance = 2*LL(model) - 2*LL(null).
- Type:
float
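One way a subclass might implement evaluate_lnL is a Gaussian log-likelihood with the error variance concentrated out. The sketch below is an assumption about a typical implementation, not this library's code; the data and fitted values are made up, and the null predictor is taken to be the mean of y, as described above.

```python
import numpy as np

# Hypothetical observed responses and model predictions.
y = np.array([1.0, 2.0, 4.0, 5.0])
fitted = np.array([1.2, 1.9, 3.8, 5.1])


def evaluate_lnL(pred):
    """Gaussian log-likelihood with the variance set to its MLE."""
    n_obs = y.shape[0]
    sigma2 = float(((y - pred) ** 2).mean())
    return -0.5 * n_obs * (np.log(2.0 * np.pi * sigma2) + 1.0)


log_likelihood = evaluate_lnL(fitted)
# Null model: predict the sample mean for every observation.
null_likelihood = evaluate_lnL(np.full_like(y, y.mean()))
deviance = 2.0 * log_likelihood - 2.0 * null_likelihood
```

Under this Gaussian form the deviance reduces to n_obs times the log ratio of null to fitted residual variance.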
- class cleands.base.broom_model(*args, **kwargs)[source]
Bases:
Protocol
Protocol for tidy/glance accessors (broom-like API).
- property tidy: DataFrame
Per-parameter summary (estimates, SEs, tests, etc.).
- Type:
pd.DataFrame
- property glance: DataFrame
Model-level summary (fit statistics, diagnostics, etc.).
- Type:
pd.DataFrame
- class cleands.base.variance_model[source]
Bases:
ABC
Mixin for models that expose variance-covariance and inferential stats.
- abstract vcov_params()[source]
Variance-covariance matrix of parameter estimates.
- Returns:
(p x p) covariance matrix for the first n_feat parameters.
- Return type:
np.ndarray
- property std_error
Standard errors for parameters (from vcov_params).
- Type:
np.ndarray
- property t_statistic
t-statistics = params / std_error.
- Type:
np.ndarray
- property p_value
Two-sided p-values under Student-t with df = n_obs - n_feat.
- Type:
np.ndarray
- conf_int(level=0.95)[source]
Confidence intervals for parameters.
- Parameters:
level (float) – Coverage probability (default 0.95).
- Returns:
2 x p array with lower/upper bounds by column.
- Return type:
np.ndarray
- property tidy
Tidy per-parameter table (no CIs).
- Type:
pd.DataFrame
- tidyci(level=0.95, ci=True)[source]
Tidy per-parameter table with optional confidence intervals.
- Parameters:
level (float) – CI level (default 0.95).
ci (bool) – If True, include CI columns.
- Returns:
Columns include variable, estimate, std.error, t.statistic, p.value, and optionally ci.lower, ci.upper.
- Return type:
pd.DataFrame
- property glance: DataFrame
Model-level summary table.
- Type:
pd.DataFrame
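The chain from vcov_params to standard errors and t-statistics can be sketched for ordinary least squares. The classical covariance estimator s²(XᵀX)⁻¹ used here is an assumption; concrete models may supply a different vcov_params. The p-values and confidence intervals additionally need Student-t tail probabilities and quantiles (for example from scipy.stats.t), which this sketch leaves out.

```python
import numpy as np

# Hypothetical design matrix (intercept + one feature) and response.
x = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.9, 3.2, 4.8, 7.1, 9.0, 11.1])
n_obs, n_feat = x.shape

xtx_inv = np.linalg.inv(x.T @ x)
params = xtx_inv @ x.T @ y                 # OLS estimates
residuals = y - x @ params
s2 = float(residuals @ residuals) / (n_obs - n_feat)

vcov = s2 * xtx_inv                        # assumed classical OLS covariance
std_error = np.sqrt(np.diag(vcov))         # standard errors from the diagonal
t_statistic = params / std_error           # as documented above
```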
- class cleands.base.SupervisedModel(formula, data, *args, **kwargs)[source]
Bases:
ABC
Abstract base class for supervised models constructed from a formula.
Subclasses must set the class attribute MODEL_TYPE to a concrete supervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.
- Variables:
formula (str) – Formula string used to specify the model.
x_vars (list[str]) – Names of predictor variables.
y_var (str) – Name of response variable.
data (pd.DataFrame) – Parsed DataFrame containing predictors and response.
model (supervised_model) – Fitted underlying model implementation.
- Parameters:
formula (str)
data (DataFrame)
- tidyci(level=0.95, ci=True)[source]
Return a tidy coefficient table with optional confidence intervals.
- Parameters:
level (float, default=0.95) – Confidence level for intervals.
ci (bool, default=True) – Whether to include confidence intervals.
- Returns:
Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.
- Return type:
pd.DataFrame
- property tidy: DataFrame
Return a tidy table of parameter estimates without confidence intervals.
Equivalent to calling tidyci() with ci=False.
- Returns:
Table of parameter estimates.
- Return type:
pd.DataFrame
- property glance: DataFrame
Return model-level summary statistics.
- Returns:
One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).
- Return type:
pd.DataFrame
- class cleands.base.PredictionModel(formula, data, *args, **kwargs)[source]
Bases:
SupervisedModel, ABC
Concrete interface for supervised prediction models.
Extends SupervisedModel by adding a predict() method for generating predictions on new data.
- Parameters:
formula (str)
data (DataFrame)
- class cleands.base.ClassificationModel(formula, data, *args, **kwargs)[source]
Bases:
SupervisedModel, ABC
Concrete interface for supervised classification models.
Extends SupervisedModel by adding a classify() method for generating classifications on new data.
- Parameters:
formula (str)
data (DataFrame)
- classify(new_data)[source]
Generate classifications on new data.
- Parameters:
new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
- Returns:
Classifications indexed to new_data.index and named after the response variable (y_var).
- Return type:
pd.Series
- predict_proba(new_data)[source]
Predict probabilities for classes on new data.
- Parameters:
new_data (pd.DataFrame) – DataFrame containing the predictor variables for the observations to be assigned to classes.
- Returns:
Predicted probabilities with shape (n_samples, n_classes). Columns are named class=0, class=1, ..., class=r corresponding to the r classes returned by the underlying model.
- Return type:
pd.DataFrame
- class cleands.base.UnsupervisedModel(formula, data, *args, **kwargs)[source]
Bases:
ABC
Abstract base class for unsupervised models constructed from a formula.
Subclasses must set the class attribute MODEL_TYPE to a concrete unsupervised model implementation. This wrapper handles parsing a formula, extracting predictor and response variables, and fitting the underlying algorithm.
- Variables:
formula (str) – Formula string used to specify the model.
x_vars (list[str]) – Names of predictor variables.
y_var (str) – Name of response variable.
data (pd.DataFrame) – Parsed DataFrame containing predictors and response.
model (supervised_model) – Fitted underlying model implementation.
- Parameters:
formula (str)
data (DataFrame)
- tidyci(level=0.95, ci=True)[source]
Return a tidy coefficient table with optional confidence intervals.
- Parameters:
level (float, default=0.95) – Confidence level for intervals.
ci (bool, default=True) – Whether to include confidence intervals.
- Returns:
Tidy table of parameter estimates. If the model outputs a variable column of matching length, it is replaced with the predictor names from x_vars.
- Return type:
pd.DataFrame
- property tidy: DataFrame
Return a tidy table of parameter estimates without confidence intervals.
Equivalent to calling tidyci() with ci=False.
- Returns:
Table of parameter estimates.
- Return type:
pd.DataFrame
- property glance: DataFrame
Return model-level summary statistics.
- Returns:
One-row DataFrame of model fit diagnostics (e.g., log-likelihood, R², AIC).
- Return type:
pd.DataFrame
- class cleands.base.ClusteringModel(formula, data, *args, **kwargs)[source]
Bases:
SupervisedModel, ABC
Concrete interface for supervised clustering models.
Extends SupervisedModel by adding a cluster() method for assigning cluster labels to new data.
- Parameters:
formula (str)
data (DataFrame)
- class cleands.base.DimensionReductionModel(formula, data, *args, **kwargs)[source]
Bases:
SupervisedModel, ABC
Concrete interface for supervised dimension-reduction models.
Extends SupervisedModel by adding a reduce() method for projecting new data into a lower-dimensional space.
- Parameters:
formula (str)
data (DataFrame)
- reduce(new_data)[source]
Project new data into the reduced space.
- Parameters:
new_data (pd.DataFrame) – DataFrame containing the predictor variables referenced in the original formula.
- Returns:
Reduced representation with shape (n_samples, r). Columns are named z1, z2, ..., zr corresponding to the r components returned by the underlying model.
- Return type:
pd.DataFrame
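The output shape and column naming described for reduce can be sketched with pandas. Everything here except the z1, ..., zr naming is hypothetical: the loadings matrix, the x_vars list, and the choice to carry over new_data.index are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical new data and predictor names taken from a formula.
new_data = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0], "c": [5.0, 6.0, 7.0]},
    index=["r1", "r2", "r3"],
)
x_vars = ["a", "b", "c"]

# Hypothetical loadings mapping 3 features onto r = 2 components.
loadings = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])


def reduce(frame):
    """Project the formula's predictors and label the columns z1..zr."""
    scores = frame[x_vars].to_numpy() @ loadings
    columns = [f"z{i + 1}" for i in range(scores.shape[1])]
    # Assumption: the result keeps the index of the input frame.
    return pd.DataFrame(scores, index=frame.index, columns=columns)


reduced = reduce(new_data)
```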