cleands.utils module
Utility functions for statistical modeling and machine learning.
This module provides helper routines for array manipulation, encoding, optimization, resampling, and dataset splitting. Many of these utilities are thin wrappers around NumPy, SciPy, and pandas functionality, standardized for consistent use across the CleanDS framework.
- Key features:
Encoding: one-hot, frequency, and probability encoding of categorical arrays.
Math helpers: numerically stable sigmoid (expit), horizontal/vertical stacking.
Optimization: gradient descent, Newton’s method, grid search.
Resampling: train/test split, k-fold cross-validation, bootstrap sampling.
Combinatorics: set product and binomial coefficient matrix constructor (C).
Intercept helpers: add or append intercept columns to data matrices.
- cleands.utils.one_hot_encode(x)
Convert integer vector to one-hot encoded matrix.
- Parameters:
x (np.ndarray) – Integer labels of shape (n,).
- Returns:
One-hot encoded array of shape (n, k), where k = x.max() + 1.
- Return type:
np.ndarray
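The mapping from labels to an (n, k) matrix can be illustrated with plain NumPy. This is a minimal sketch of the documented behaviour, not the cleands implementation: the name `one_hot_sketch` is hypothetical, and the integer output dtype is an assumption.

```python
import numpy as np

def one_hot_sketch(x):
    """Hypothetical stand-in: map integer labels of shape (n,) to an (n, k) 0/1 matrix."""
    x = np.asarray(x, dtype=int)
    k = x.max() + 1                      # number of columns, as documented above
    out = np.zeros((x.shape[0], k), dtype=int)  # dtype is an assumption
    out[np.arange(x.shape[0]), x] = 1    # exactly one entry set per row
    return out

labels = np.array([0, 2, 1, 2])
encoded = one_hot_sketch(labels)
print(encoded)
```

Each row contains a single 1, placed in the column given by the corresponding label.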
- cleands.utils.itemfreq(x, axis=None, classes=None)
Frequency count of integer labels.
- Parameters:
x (np.ndarray) – Integer labels.
axis (Optional[int]) – Axis to count over. Defaults to None (flattened).
classes (Optional[int]) – Number of classes to assume. Defaults to x.max()+1.
- Returns:
Frequency counts.
- Return type:
np.ndarray
- cleands.utils.itemprob(x, axis=None, classes=None)
Relative frequencies (probabilities) of integer labels.
- Parameters:
x (np.ndarray) – Integer labels.
axis (Optional[int]) – Axis to compute proportions along.
classes (Optional[int]) – Number of classes to assume. Defaults to x.max()+1.
- Returns:
Probabilities across classes.
- Return type:
np.ndarray
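For the flattened case (axis=None), the counts and proportions described for itemfreq and itemprob can be sketched with `np.bincount`. The helper names below are hypothetical and only cover the flattened case; how the library handles a non-None axis is not shown here.

```python
import numpy as np

def item_counts_sketch(x, classes=None):
    """Hypothetical stand-in: count occurrences of each non-negative integer label."""
    x = np.asarray(x, dtype=int).ravel()     # flattened, as with axis=None
    if classes is None:
        classes = x.max() + 1                # documented default
    return np.bincount(x, minlength=classes)

def item_probs_sketch(x, classes=None):
    """Hypothetical stand-in: relative frequencies, i.e. counts divided by the total."""
    counts = item_counts_sketch(x, classes)
    return counts / counts.sum()

labels = np.array([0, 1, 1, 3])
counts = item_counts_sketch(labels)            # one slot per class 0..3
probs = item_probs_sketch(labels, classes=5)   # class 4 never occurs
print(counts, probs)
```

Passing `classes` explicitly keeps the output length fixed even when the largest class is absent from a particular sample.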
- cleands.utils.expit(x)
Numerically stable sigmoid function.
- Parameters:
x (np.ndarray) – Input array.
- Returns:
Sigmoid-transformed values in (0,1).
- Return type:
np.ndarray
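One common way to make the sigmoid numerically stable is to branch on the sign of the input so that `exp` is only ever evaluated on non-positive arguments. The sketch below shows that technique; whether cleands uses this exact formulation is an assumption, and `stable_sigmoid` is a hypothetical name.

```python
import numpy as np

def stable_sigmoid(x):
    """Hypothetical stand-in: sigmoid that never calls exp on a large positive number."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) lies in (0, 1], so 1 / (1 + exp(-x)) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) lies in (0, 1), so exp(x) / (1 + exp(x)) cannot overflow.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

Mathematically the result lies in (0, 1); in floating point, extreme inputs saturate to exactly 0.0 or 1.0 instead of producing overflow warnings or NaN.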
- cleands.utils.hstack(*args)
Column-wise stack with auto-reshaping of 1D arrays.
- Parameters:
*args (np.ndarray) – Arrays to stack.
- Returns:
Horizontally stacked 2D array.
- Return type:
np.ndarray
- cleands.utils.vstack(*args)
Row-wise stack with auto-reshaping of 1D arrays.
- Parameters:
*args (np.ndarray) – Arrays to stack.
- Returns:
Vertically stacked 2D array.
- Return type:
np.ndarray
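The "auto-reshaping" of 1D arrays in the two stacking helpers can be pictured as follows. This sketch assumes a 1D array of length n is treated as an (n, 1) column by hstack and as a (1, n) row by vstack; the function names are hypothetical and the actual reshaping rules in cleands may differ.

```python
import numpy as np

def hstack_sketch(*args):
    # Assumption: each 1D array of length n becomes an (n, 1) column.
    cols = [np.asarray(a).reshape(-1, 1) if np.ndim(a) == 1 else np.asarray(a)
            for a in args]
    return np.hstack(cols)

def vstack_sketch(*args):
    # Assumption: each 1D array of length n becomes a (1, n) row.
    return np.vstack([np.atleast_2d(a) for a in args])

v = np.array([1, 2, 3])
m = np.array([[10, 20], [30, 40], [50, 60]])
wide = hstack_sketch(v, m)     # v becomes the first column
tall = vstack_sketch(v, m.T)   # v becomes the first row
print(wide)
print(tall)
```

Without the reshape, `np.hstack` would reject the mix of a 1D vector and a 2D matrix, which is the inconvenience such wrappers typically remove.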
- cleands.utils.bind(*args)
Concatenate arrays along first axis.
- Parameters:
*args (np.ndarray) – Arrays to concatenate.
- Returns:
Concatenated array.
- Return type:
np.ndarray
- cleands.utils.grid_search(func, space, axis=None, maximize=False)
Perform grid search by evaluating a function on a search space.
- Parameters:
func (Callable) – Function mapping space -> score array.
space (np.ndarray) – Search points.
axis (Optional[int]) – Axis to reduce along.
maximize (bool) – Whether to maximize (default False).
- Returns:
Index of optimum along axis.
- Return type:
np.ndarray
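A grid search of this shape reduces to evaluating the function on all candidates and taking an argmin or argmax. The sketch below assumes exactly that; `grid_search_sketch` is a hypothetical name and the quadratic loss is made up for illustration.

```python
import numpy as np

def grid_search_sketch(func, space, axis=None, maximize=False):
    """Hypothetical stand-in: index of the best score returned by func(space)."""
    scores = np.asarray(func(space))
    return scores.argmax(axis=axis) if maximize else scores.argmin(axis=axis)

space = np.linspace(-2.0, 2.0, 41)        # candidates spaced 0.1 apart
loss = lambda s: (s - 0.5) ** 2           # illustrative objective, minimized at 0.5
best = grid_search_sketch(loss, space)
best_point = space[best]
print(best, best_point)
```

Note that the return value is an index into the search space, so the optimal point itself is recovered with `space[best]`.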
- cleands.utils.gradient_descent(gradient, init_x, learning_rate=0.005, threshold=1e-10, max_reps=10000, maximize=False, obj=None, step_shrink=0.5, min_step=1e-12, tol_step=1e-12, armijo_c1=1e-4)
Basic gradient method with robust stopping and optional Armijo backtracking.
- Parameters:
gradient (Callable[[ndarray], ndarray]) – Gradient function g(x).
init_x (ndarray) – Initial point.
learning_rate (float) – Initial step size.
threshold (float) – Convergence threshold on ||g(x)||_2.
max_reps (int) – Maximum iterations.
maximize (bool) – If True, performs gradient ascent.
obj (Callable[[ndarray], float] | None) – Optional objective function f(x). If provided, the step size is chosen by Armijo backtracking.
step_shrink (float) – Backtracking multiplier in (0,1) when Armijo fails.
min_step (float) – Minimum allowable step size during backtracking before giving up.
tol_step (float) – Convergence threshold on parameter step size ||Δx||_2.
armijo_c1 (float) – Armijo constant (typically 1e-4).
- Returns:
Final iterate and convergence flag.
- Return type:
(x, converged)
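How the parameters above interact is easiest to see in code. The following is a minimal sketch of gradient descent with Armijo backtracking under stated assumptions: it covers minimization only (no `maximize` flag), it requires `obj`, and the order of the stopping checks is a guess. `gd_armijo_sketch` is a hypothetical name, not the library routine.

```python
import numpy as np

def gd_armijo_sketch(gradient, obj, init_x, learning_rate=0.005, threshold=1e-10,
                     max_reps=10000, step_shrink=0.5, min_step=1e-12,
                     tol_step=1e-12, armijo_c1=1e-4):
    x = np.asarray(init_x, dtype=float)
    for _ in range(max_reps):
        g = gradient(x)
        if np.linalg.norm(g) < threshold:          # gradient-norm stopping rule
            return x, True
        step = learning_rate
        # Armijo condition: f(x - step*g) <= f(x) - c1 * step * ||g||^2.
        while obj(x - step * g) > obj(x) - armijo_c1 * step * (g @ g):
            step *= step_shrink                    # backtrack
            if step < min_step:                    # give up: no acceptable step
                return x, False
        x_new = x - step * g
        if np.linalg.norm(x_new - x) < tol_step:   # step-size stopping rule
            return x_new, True
        x = x_new
    return x, False                                # iteration budget exhausted

# Illustrative problem: minimize ||x - target||^2.
target = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2.0 * (x - target)
x_min, converged = gd_armijo_sketch(grad, f, np.zeros(2), learning_rate=0.1)
print(x_min, converged)
```

The Armijo test only accepts a step that decreases the objective by at least a fraction `armijo_c1` of what the local linear model predicts, which guards against overshooting when `learning_rate` is too large.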
- cleands.utils.newton(gradient, hessian, init_x, max_reps=100, tolerance=1e-14)
Newton-Raphson iteration for finding a root of the gradient, for example in maximum likelihood estimation.
- Parameters:
gradient (Callable) – Gradient function.
hessian (Callable) – Hessian function.
init_x (np.ndarray) – Initial guess.
max_reps (int) – Max iterations.
tolerance (float) – Stopping threshold for update size.
- Returns:
(solution, iterations) if converged.
- Return type:
Tuple[np.ndarray, int]
- Raises:
Exception – If convergence is not achieved within max_reps iterations.
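The Newton-Raphson update solves H(x) Δ = g(x) and sets x ← x − Δ until the update is smaller than the tolerance. The sketch below follows the documented signature and return value, but the norm used for the stopping test and the exception message are assumptions, and `newton_sketch` is a hypothetical name.

```python
import numpy as np

def newton_sketch(gradient, hessian, init_x, max_reps=100, tolerance=1e-14):
    x = np.asarray(init_x, dtype=float)
    for i in range(1, max_reps + 1):
        # Solve H(x) * step = g(x) rather than forming the inverse Hessian.
        step = np.linalg.solve(hessian(x), gradient(x))
        x = x - step
        if np.linalg.norm(step) < tolerance:   # assumed stopping rule: small update
            return x, i
    raise Exception("Newton iteration did not converge")

# Illustrative problem: f(x) = sum(exp(x) - 2x), whose gradient vanishes at x = ln 2.
grad = lambda x: np.exp(x) - 2.0
hess = lambda x: np.diag(np.exp(x))
root, iters = newton_sketch(grad, hess, np.array([0.0, 1.0]))
print(root, iters)
```

Because convergence near the solution is quadratic, a tolerance as tight as 1e-14 is typically reached in only a handful of iterations on well-behaved problems.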
- cleands.utils.add_intercept(x_vars, y_var, data)
Add an intercept column to a DataFrame and feature list.
- Parameters:
x_vars (list[str]) – Feature variable names.
y_var (str) – Target variable name.
data (pd.DataFrame) – Input dataset.
- Returns:
Updated (x_vars, y_var, new data).
- Return type:
Tuple[list[str], str, pd.DataFrame]
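The idea of keeping the feature list and the DataFrame in sync can be sketched with pandas. The column name `"(intercept)"`, its position at the front of the list, and the choice to copy instead of mutating the input are all assumptions made for illustration; `add_intercept_sketch` is a hypothetical name.

```python
import pandas as pd

def add_intercept_sketch(x_vars, y_var, data, name="(intercept)"):
    """Hypothetical stand-in: add a constant column and register it as a feature."""
    new_data = data.copy()          # assumption: the input frame is left untouched
    new_data[name] = 1.0            # constant column of ones
    return [name] + list(x_vars), y_var, new_data

df = pd.DataFrame({"x1": [1.0, 2.0], "y": [3.0, 5.0]})
x_vars, y_var, new_df = add_intercept_sketch(["x1"], "y", df)
print(x_vars, list(new_df.columns))
```

Returning the updated names together with the data means downstream code can select `new_df[x_vars]` without knowing whether an intercept was added.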
- cleands.utils.test_train_split(x, y, test_ratio=0.1, seed=None)
Split arrays into train/test sets.
- Parameters:
x (np.ndarray) – Features.
y (np.ndarray) – Targets.
test_ratio (float) – Proportion for test set.
seed (Optional[int]) – RNG seed.
- Returns:
(x_train, x_test, y_train, y_test).
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
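A random split of this kind usually permutes the row indices once and slices the permutation, so that x and y stay aligned. The sketch below assumes that approach, rounds the test size to the nearest integer, and uses NumPy's `default_rng`; all three are assumptions, and `split_sketch` is a hypothetical name.

```python
import numpy as np

def split_sketch(x, y, test_ratio=0.1, seed=None):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    perm = rng.permutation(n)                  # shuffled row indices
    n_test = int(round(n * test_ratio))        # assumption: round to nearest
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    # Same index arrays for x and y keep features and targets aligned.
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(40).reshape(20, 2)               # row i is [2i, 2i + 1]
y = np.arange(20)
x_train, x_test, y_train, y_test = split_sketch(x, y, test_ratio=0.25, seed=0)
print(x_train.shape, x_test.shape)
```

Fixing `seed` makes the split reproducible, which matters when comparing models on the same held-out rows.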
- cleands.utils.test_split_pandas(data, seed=None, test_ratio=0.1)
Split a Pandas DataFrame into train/test subsets.
- Parameters:
data (pd.DataFrame) – Dataset.
seed (Optional[int]) – RNG seed.
test_ratio (float) – Proportion for test set.
- Returns:
(x_train, x_test).
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- cleands.utils.k_fold_cross_validation(x, y, folds=5, seed=None)
Generate train/test splits for k-fold cross-validation.
- Parameters:
x (np.ndarray) – Features.
y (np.ndarray) – Targets.
folds (int) – Number of folds.
seed (Optional[int]) – RNG seed.
- Returns:
List of (x_train, x_test, y_train, y_test).
- Return type:
list[Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]]
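In k-fold cross-validation every row serves as test data exactly once. A minimal sketch, assuming the rows are shuffled once and cut into `folds` nearly equal parts with `np.array_split` (the library's partitioning may differ; `k_fold_sketch` is a hypothetical name):

```python
import numpy as np

def k_fold_sketch(x, y, folds=5, seed=None):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(x.shape[0])             # shuffle row indices once
    splits = []
    for test_idx in np.array_split(perm, folds):   # one chunk per fold
        train_idx = np.setdiff1d(perm, test_idx)   # all remaining rows
        splits.append((x[train_idx], x[test_idx], y[train_idx], y[test_idx]))
    return splits

x = np.arange(20).reshape(10, 2)
y = np.arange(10)
splits = k_fold_sketch(x, y, folds=5, seed=1)
for x_tr, x_te, y_tr, y_te in splits:
    print(x_tr.shape, x_te.shape)
```

Each tuple follows the (x_train, x_test, y_train, y_test) order documented above, so a model can be fitted on the first and third elements and scored on the second and fourth.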
- cleands.utils.bootstrap(model, x, y, seed=None, bootstraps=1000)
Generate bootstrap resamples and fit a model on each.
- Parameters:
model (Type[supervised_model]) – Model class to fit.
x (np.ndarray) – Features.
y (np.ndarray) – Targets.
seed (Optional[int]) – RNG seed.
bootstraps (int) – Number of bootstrap samples.
- Returns:
Fitted models from bootstrap samples.
- Return type:
list[supervised_model]
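Bootstrapping draws n rows with replacement and refits the model on each draw. The sketch below assumes the model class is fitted by calling `model(x, y)`, which is a guess about the `supervised_model` interface; `MeanModel` and `bootstrap_sketch` are hypothetical names used only for illustration.

```python
import numpy as np

class MeanModel:
    """Hypothetical stand-in for a model class: 'fits' by storing the mean of y."""
    def __init__(self, x, y):
        self.mean = float(np.mean(y))

def bootstrap_sketch(model, x, y, seed=None, bootstraps=1000):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    fitted = []
    for _ in range(bootstraps):
        idx = rng.integers(0, n, size=n)       # sample row indices with replacement
        fitted.append(model(x[idx], y[idx]))   # assumption: constructor fits the model
    return fitted

x = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * x[:, 0] + 1.0
models = bootstrap_sketch(MeanModel, x, y, seed=0, bootstraps=200)
means = np.array([m.mean for m in models])
print(means.mean(), means.std())
```

The spread of a fitted quantity across the returned models, here the stored mean, gives a bootstrap estimate of its sampling variability.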
- cleands.utils.set_product(*args)
Cartesian product of multiple iterables.
- Parameters:
*args (Iterable) – Iterables to combine.
- Returns:
Cartesian product tuples.
- Return type:
list[Tuple]
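A Cartesian product of this kind is what the standard library's `itertools.product` computes, so the behaviour can be sketched in one line. The ordering shown (rightmost iterable varying fastest) is that of `itertools.product` and is only assumed to match cleands; `set_product_sketch` is a hypothetical name.

```python
from itertools import product

def set_product_sketch(*args):
    """Hypothetical stand-in: all combinations taking one element from each iterable."""
    return list(product(*args))

# Illustrative hyperparameter grid with made-up values.
grid = set_product_sketch([0.1, 1.0], ["l1", "l2"], [True])
print(grid)
```

A list of tuples like this is a natural way to enumerate hyperparameter combinations before scoring them.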