cleands.utils module

Utility functions for statistical modeling and machine learning.

This module provides helper routines for array manipulation, encoding, optimization, resampling, and dataset splitting. Many of these utilities are thin wrappers around NumPy, SciPy, and Pandas functionality, standardized for consistent use across the CleanDS framework.

Key features:

Encoding: one-hot, frequency, and probability encoding of categorical arrays.
Math helpers: numerically stable sigmoid (expit), horizontal/vertical stack.
Optimization: gradient descent, Newton’s method, grid search.
Resampling: train/test split, k-fold cross-validation, bootstrap sampling.
Combinatorics: set product and binomial coefficient matrix constructor (C).
Intercept helpers: add or append intercept columns to data matrices.

cleands.utils.one_hot_encode(x)

Convert integer vector to one-hot encoded matrix.

Parameters:

x (np.ndarray) – Integer labels of shape (n,).

Returns:

One-hot encoded array of shape (n, k), where k = x.max() + 1.

Return type:

np.ndarray
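
The encoding described above can be illustrated with a minimal NumPy sketch. The helper name `one_hot_sketch` is hypothetical and is not the cleands implementation; the integer output dtype is also an assumption.

```python
import numpy as np

def one_hot_sketch(x):
    """Hypothetical stand-in for one_hot_encode; not the library's code."""
    x = np.asarray(x, dtype=int)
    k = x.max() + 1                       # number of classes, as documented
    out = np.zeros((x.shape[0], k), dtype=int)
    out[np.arange(x.shape[0]), x] = 1     # one active column per row
    return out

labels = np.array([0, 2, 1, 2])
encoded = one_hot_sketch(labels)          # shape (4, 3)
```

Each row contains exactly one 1, in the column given by the label.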

cleands.utils.itemfreq(x, axis=None, classes=None)

Frequency count of integer labels.

Parameters:

x (np.ndarray) – Integer labels.
axis (Optional[int]) – Axis to count over. Defaults to None (flattened).
classes (Optional[int]) – Number of classes to assume. Defaults to x.max()+1.

Returns:

Frequency counts.

Return type:

np.ndarray

cleands.utils.itemprob(x, axis=None, classes=None)

Relative frequencies (probabilities) of integer labels.

Parameters:

x (np.ndarray) – Integer labels.
axis (Optional[int]) – Axis to compute proportions along.
classes (Optional[int]) – Number of classes to assume. Defaults to x.max()+1.

Returns:

Probabilities across classes.

Return type:

np.ndarray
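
As a rough illustration of frequency and probability encoding, the sketch below covers only the flattened case (axis=None). The names `itemfreq_sketch` and `itemprob_sketch` are hypothetical, and the use of `np.bincount` is an assumption about one reasonable way to compute the counts, not a statement about the library's internals.

```python
import numpy as np

def itemfreq_sketch(x, classes=None):
    """Hypothetical: count integer labels over the flattened array."""
    x = np.asarray(x, dtype=int).ravel()
    k = x.max() + 1 if classes is None else classes
    return np.bincount(x, minlength=k)    # counts for labels 0..k-1

def itemprob_sketch(x, classes=None):
    """Hypothetical: normalize the counts so they sum to 1."""
    counts = itemfreq_sketch(x, classes)
    return counts / counts.sum()

labels = np.array([0, 1, 1, 3])
freq = itemfreq_sketch(labels)             # classes defaults to x.max() + 1 = 4
prob = itemprob_sketch(labels, classes=5)  # an extra, unobserved class gets 0
```

Passing `classes` explicitly keeps the output length fixed even when some labels never appear in `x`.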

cleands.utils.expit(x)

Numerically stable sigmoid function.

Parameters: x (np.ndarray) – Input array.
Returns: Sigmoid-transformed values in (0, 1).
Return type: np.ndarray
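
One common way to make the sigmoid numerically stable is to branch on the sign of the input so that `np.exp` is only ever called on non-positive values. The sketch below shows that technique; `expit_sketch` is a hypothetical name, and whether cleands uses this exact formulation is an assumption.

```python
import numpy as np

def expit_sketch(x):
    """Hypothetical stable sigmoid; exp() never sees a positive argument."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0 use 1 / (1 + exp(-x)); exp(-x) <= 1, so no overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0 use exp(x) / (1 + exp(x)); exp(x) < 1, so no overflow.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

values = expit_sketch(np.array([-1000.0, 0.0, 1000.0]))
```

In floating point the extreme inputs saturate to exactly 0.0 and 1.0, even though the mathematical range is the open interval (0, 1).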

cleands.utils.hstack(*args)

Column-wise stack with auto-reshaping of 1D arrays.

Parameters: *args (np.ndarray) – Arrays to stack.
Returns: Horizontally stacked 2D array.
Return type: np.ndarray

cleands.utils.vstack(*args)

Row-wise stack with auto-reshaping of 1D arrays.

Parameters: *args (np.ndarray) – Arrays to stack.
Returns: Vertically stacked 2D array.
Return type: np.ndarray
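
The "auto-reshaping" behavior of the two stacking helpers is not spelled out here. The sketch below assumes that 1D inputs become columns for horizontal stacking and rows for vertical stacking; both that interpretation and the names `hstack_sketch` and `vstack_sketch` are assumptions for illustration.

```python
import numpy as np

def hstack_sketch(*args):
    """Hypothetical: treat 1D inputs as (n, 1) columns, then stack."""
    arrays = [np.asarray(a) for a in args]
    cols = [a.reshape(-1, 1) if a.ndim == 1 else a for a in arrays]
    return np.hstack(cols)

def vstack_sketch(*args):
    """Hypothetical: treat 1D inputs as (1, n) rows, then stack."""
    arrays = [np.asarray(a) for a in args]
    rows = [a.reshape(1, -1) if a.ndim == 1 else a for a in arrays]
    return np.vstack(rows)

a = np.array([1, 2, 3])
b = np.array([[4, 5], [6, 7], [8, 9]])
h = hstack_sketch(a, b)                       # shape (3, 3)
v = vstack_sketch(a, np.array([[4, 5, 6]]))   # shape (2, 3)
```

Under this assumption the result is always 2D, which matches the documented return descriptions.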

cleands.utils.bind(*args)

Concatenate arrays along first axis.

Parameters: *args (np.ndarray) – Arrays to concatenate.
Returns: Concatenated array.
Return type: np.ndarray

cleands.utils.grid_search(func, space, axis=None, maximize=False)

Perform grid search by evaluating a function on a search space.

Parameters:

func (Callable) – Function mapping space -> score array.
space (np.ndarray) – Search points.
axis (Optional[int]) – Axis to reduce along.
maximize (bool) – Whether to maximize (default False).

Returns:

Index of optimum along axis.

Return type:

np.ndarray
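
A grid search of this shape can be sketched as evaluating the function once on the whole search space and taking an argmin or argmax. `grid_search_sketch` is a hypothetical name, and the assumption that `func` is vectorized over `space` is inferred from the parameter description rather than confirmed.

```python
import numpy as np

def grid_search_sketch(func, space, axis=None, maximize=False):
    """Hypothetical: return the index of the best score along `axis`."""
    scores = np.asarray(func(space))      # func maps the whole space to scores
    if maximize:
        return np.argmax(scores, axis=axis)
    return np.argmin(scores, axis=axis)

space = np.linspace(-2.0, 2.0, 41)        # grid with spacing 0.1
idx = grid_search_sketch(lambda s: (s - 0.5) ** 2, space)
best = space[idx]                          # grid point closest to 0.5
```

Note that the return value is an index into the search space, not the optimal point itself, so the caller looks up `space[idx]`.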

cleands.utils.gradient_descent(gradient, init_x, learning_rate=0.005, threshold=1e-10, max_reps=10000, maximize=False, obj=None, step_shrink=0.5, min_step=1e-12, tol_step=1e-12, armijo_c1=1e-4)

Basic gradient method with robust stopping and optional Armijo backtracking.

Parameters:

gradient (Callable[[ndarray], ndarray]) – Gradient function g(x).
init_x (ndarray) – Initial point.
learning_rate (float) – Initial step size.
threshold (float) – Convergence threshold on ||g(x)||_2.
max_reps (int) – Maximum iterations.
maximize (bool) – If True, performs gradient ascent.
obj (Callable[[ndarray], float] | None) – Optional objective function f(x). If provided, perform Armijo backtracking.
step_shrink (float) – Backtracking multiplier in (0,1) when Armijo fails.
min_step (float) – Minimum allowable step size during backtracking before giving up.
tol_step (float) – Convergence threshold on parameter step size ||Δx||_2.
armijo_c1 (float) – Armijo constant (typically 1e-4).

Returns:

Final iterate and convergence flag.

Return type:

(x, converged)
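
To show how the stopping rules and Armijo backtracking described above interact, here is a simplified sketch for the minimization case only. The name `gradient_descent_sketch` is hypothetical, the `maximize` option is omitted, and the ordering of the checks is an assumption rather than the library's actual control flow.

```python
import numpy as np

def gradient_descent_sketch(gradient, init_x, learning_rate=0.005,
                            threshold=1e-10, max_reps=10000, obj=None,
                            step_shrink=0.5, min_step=1e-12,
                            tol_step=1e-12, armijo_c1=1e-4):
    """Hypothetical minimizer; returns (x, converged)."""
    x = np.asarray(init_x, dtype=float)
    for _ in range(max_reps):
        g = gradient(x)
        if np.linalg.norm(g) < threshold:          # gradient-norm stopping rule
            return x, True
        step = learning_rate
        if obj is not None:
            fx = obj(x)
            # Armijo condition: f(x - s g) <= f(x) - c1 * s * ||g||^2
            while obj(x - step * g) > fx - armijo_c1 * step * (g @ g):
                step *= step_shrink                # shrink until sufficient decrease
                if step < min_step:
                    return x, False                # give up on backtracking
        new_x = x - step * g
        if np.linalg.norm(new_x - x) < tol_step:   # step-size stopping rule
            return new_x, True
        x = new_x
    return x, False

target = np.array([1.0, -2.0])
x_hat, converged = gradient_descent_sketch(
    gradient=lambda x: 2.0 * (x - target),         # gradient of ||x - target||^2
    init_x=np.zeros(2),
    learning_rate=0.1,
    obj=lambda x: float(np.sum((x - target) ** 2)),
)
```

Supplying `obj` lets the step size shrink automatically when the initial learning rate overshoots; without it, the sketch takes fixed-size steps.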

cleands.utils.newton(gradient, hessian, init_x, max_reps=100, tolerance=1e-14)

Newton-Raphson optimizer for root-finding/maximum likelihood.

Parameters:

gradient (Callable) – Gradient function.
hessian (Callable) – Hessian function.
init_x (np.ndarray) – Initial guess.
max_reps (int) – Max iterations.
tolerance (float) – Stopping threshold for update size.

Returns:

(solution, iterations) if converged.

Return type:

Tuple[np.ndarray, int]

Raises:

Exception – If convergence is not achieved within max_reps iterations.
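
The Newton-Raphson update solves H(x) Δ = g(x) and sets x ← x − Δ until the update is smaller than the tolerance. A minimal sketch follows; `newton_sketch` is a hypothetical name, and using `np.linalg.solve` instead of an explicit inverse is a design choice of this sketch, not necessarily of the library.

```python
import numpy as np

def newton_sketch(gradient, hessian, init_x, max_reps=100, tolerance=1e-14):
    """Hypothetical Newton-Raphson loop; returns (solution, iterations)."""
    x = np.asarray(init_x, dtype=float)
    for i in range(1, max_reps + 1):
        step = np.linalg.solve(hessian(x), gradient(x))  # solve H step = g
        x = x - step
        if np.linalg.norm(step) < tolerance:             # update-size stopping rule
            return x, i
    raise Exception("Newton's method did not converge")

# Minimize f(x) = x - log(x), whose gradient 1 - 1/x vanishes at x = 1.
root, iters = newton_sketch(
    gradient=lambda x: 1.0 - 1.0 / x,
    hessian=lambda x: np.diag(1.0 / x ** 2),
    init_x=np.array([0.5]),
    tolerance=1e-12,
)
```

Because convergence near the solution is quadratic, the loop usually needs only a handful of iterations when the starting point is reasonable.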

cleands.utils.add_intercept(x_vars, y_var, data)

Add an intercept column to a DataFrame and feature list.

Parameters:

x_vars (list[str]) – Feature variable names.
y_var (str) – Target variable name.
data (pd.DataFrame) – Input dataset.

Returns:

Updated (x_vars, y_var, new data).

Return type:

Tuple[list[str], str, pd.DataFrame]

cleands.utils.test_train_split(x, y, test_ratio=0.1, seed=None)

Split arrays into train/test sets.

Parameters:

x (np.ndarray) – Features.
y (np.ndarray) – Targets.
test_ratio (float) – Proportion for test set.
seed (Optional[int]) – RNG seed.

Returns:

(x_train, x_test, y_train, y_test).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
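
A seeded random split of this kind can be sketched by permuting row indices and slicing off the test portion. `split_sketch` is a hypothetical name; the rounding rule for the test-set size and the use of `np.random.default_rng` are assumptions.

```python
import numpy as np

def split_sketch(x, y, test_ratio=0.1, seed=None):
    """Hypothetical: returns (x_train, x_test, y_train, y_test)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    order = rng.permutation(n)                 # shuffled row indices
    n_test = int(round(n * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(40).reshape(20, 2)               # row i is [2i, 2i + 1]
y = np.arange(20)
x_train, x_test, y_train, y_test = split_sketch(x, y, test_ratio=0.25, seed=0)
```

Indexing `x` and `y` with the same index arrays keeps features and targets aligned row by row.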

cleands.utils.test_split_pandas(data, seed=None, test_ratio=0.1)

Split a Pandas DataFrame into train/test subsets.

Parameters:

data (pd.DataFrame) – Dataset.
seed (Optional[int]) – RNG seed.
test_ratio (float) – Proportion for test set.

Returns:

(x_train, x_test).

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

cleands.utils.k_fold_cross_validation(x, y, folds=5, seed=None)

Generate train/test splits for k-fold cross-validation.

Parameters:

x (np.ndarray) – Features.
y (np.ndarray) – Targets.
folds (int) – Number of folds.
seed (Optional[int]) – RNG seed.

Returns:

List of (x_train, x_test, y_train, y_test).

Return type:

list[Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]]
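
The k-fold procedure can be sketched by shuffling the row indices once, cutting them into `folds` nearly equal chunks, and using each chunk in turn as the test set. `k_fold_sketch` is a hypothetical name, and shuffling before splitting is an assumption suggested by the `seed` parameter.

```python
import numpy as np

def k_fold_sketch(x, y, folds=5, seed=None):
    """Hypothetical: list of (x_train, x_test, y_train, y_test) tuples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])
    splits = []
    for test_idx in np.array_split(order, folds):   # near-equal chunks
        train_idx = np.setdiff1d(order, test_idx)   # every row not in this fold
        splits.append((x[train_idx], x[test_idx], y[train_idx], y[test_idx]))
    return splits

x = np.arange(20).reshape(10, 2)
y = np.arange(10)
splits = k_fold_sketch(x, y, folds=5, seed=0)
```

Every observation appears in exactly one test fold, so averaging a score over the folds uses each row for evaluation once.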

cleands.utils.bootstrap(model, x, y, seed=None, bootstraps=1000)

Generate bootstrap resamples and fit a model on each.

Parameters:

model (Type[supervised_model]) – Model class to fit.
x (np.ndarray) – Features.
y (np.ndarray) – Targets.
seed (Optional[int]) – RNG seed.
bootstraps (int) – Number of bootstrap samples.

Returns:

Fitted models from bootstrap samples.

Return type:

list[supervised_model]
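
Bootstrap resampling draws n rows with replacement and refits the model on each draw. The sketch below assumes the model class is fitted by calling `model(x, y)`; that constructor convention, the toy `MeanModel` class, and the name `bootstrap_sketch` are all hypothetical.

```python
import numpy as np

class MeanModel:
    """Hypothetical toy model: 'fitting' just stores the mean of y."""
    def __init__(self, x, y):
        self.mean = float(np.mean(y))

def bootstrap_sketch(model, x, y, seed=None, bootstraps=1000):
    """Hypothetical: fit `model` on `bootstraps` resamples of (x, y)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    fits = []
    for _ in range(bootstraps):
        idx = rng.integers(0, n, size=n)   # sample row indices with replacement
        fits.append(model(x[idx], y[idx]))
    return fits

x = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
fits = bootstrap_sketch(MeanModel, x, y, seed=0, bootstraps=200)
means = np.array([m.mean for m in fits])   # bootstrap distribution of the mean
```

The spread of a fitted quantity across the returned models, here `means.std()`, gives a rough estimate of its sampling variability.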

cleands.utils.set_product(*args)

Cartesian product of multiple iterables.

Parameters: *args (Tuple[Iterable]) – Iterables to combine.
Returns: Cartesian product tuples.
Return type: list[Tuple]
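
The Cartesian product of several iterables is available in the standard library as `itertools.product`. Whether cleands delegates to it is an assumption, and `set_product_sketch` is a hypothetical name.

```python
from itertools import product

def set_product_sketch(*args):
    """Hypothetical: Cartesian product of the given iterables, as a list."""
    return list(product(*args))

pairs = set_product_sketch([0, 1], "ab")
# pairs == [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]
```

The last iterable varies fastest, so the result is ordered like nested for-loops.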

cleands.utils.intercept(x)

Prepend an intercept column of ones to feature matrix.

Parameters: x (np.ndarray) – Feature matrix.
Returns: Feature matrix with intercept column added.
Return type: np.ndarray
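
Prepending an intercept amounts to stacking a column of ones to the left of the features. In the sketch below, `intercept_sketch` is a hypothetical name, and promoting a 1D input to a single column is an assumption.

```python
import numpy as np

def intercept_sketch(x):
    """Hypothetical: return [1 | x] with the ones column first."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)               # assume a 1D input is one feature
    ones = np.ones((x.shape[0], 1))
    return np.hstack([ones, x])

features = np.array([[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
design = intercept_sketch(features)        # shape (3, 3)
```

With the ones column in position 0, the first regression coefficient is the intercept term.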

cleands.utils.C(n, r)

Construct combinatorial matrix for n choose r.

Parameters:

n (int) – Number of elements.
r (int) – Number chosen.

Returns:

Binary matrix representing subsets.

Return type:

np.ndarray

Raises:

Exception – If n < r.
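
One plausible reading of "binary matrix representing subsets" is a matrix with one row per r-element subset of n items and a 1 in each selected column. That layout (rows as subsets, in lexicographic order) is an assumption, as is the name `C_sketch`.

```python
from itertools import combinations

import numpy as np

def C_sketch(n, r):
    """Hypothetical: indicator matrix of all r-subsets of n elements."""
    if n < r:
        raise Exception("n must be at least r")
    subsets = list(combinations(range(n), r))
    out = np.zeros((len(subsets), n), dtype=int)
    for row, subset in enumerate(subsets):
        out[row, list(subset)] = 1          # mark the chosen elements
    return out

m = C_sketch(4, 2)                           # 6 subsets of size 2 from 4 items
```

Under this layout the number of rows equals the binomial coefficient "n choose r", and every row sums to r.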