cleands.formula module

Formula parsing utilities for building design matrices from strings.

This module provides a lightweight, NumPy/Pandas-friendly parser for model formulas. It supports:

Intercept handling via the special column “(intercept)”.
Basic terms (column names), interactions with “:” and products with “*”.
Inclusion/exclusion using “+” and “-” (handled via make_pretty_minus).
Parentheses grouping.
Powers using “**” (or “^” if USE_CARET=True).
“As-is” expressions via I(<python expression>), evaluated against data.
A curated set of NumPy elementwise functions (see NUMPY_FUNCS).
Special polynomial generators such as “quadratic(x1,x2)”, “cubic(…)”, etc.

Key entry points:

parse(): returns x_vars, y_var, conditionals, and the processed DataFrame.
design helpers such as generate_interactions().

Notes

The parser mutates a copy of the input DataFrame and returns it.
(intercept) is added to the data as a column of ones and is kept in x_vars unless the formula excludes it.

cleands.formula.USE_CARET: bool = False: Whether to accept ‘^’ as a power operator (translated to ‘**’).

cleands.formula.LOGGING: bool = False: Enable debug prints from parse steps when True.

cleands.formula.bind(x)

Flatten a list of lists by one level.

Parameters: x (list) – List whose elements are iterables to be chained.
Returns: Single flattened list.
Return type: list
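
The one-level flattening described above can be sketched with the standard library. This is an illustrative re-implementation rather than the module's source, and the name flatten_one_level is hypothetical.

```python
from itertools import chain


def flatten_one_level(x):
    """Chain the elements of x (each an iterable) into a single list."""
    return list(chain.from_iterable(x))


flat = flatten_one_level([["a", "b"], ["c"], []])
print(flat)  # ['a', 'b', 'c']

# Only the outermost level is removed; deeper nesting is kept.
partly_nested = flatten_one_level([[1, [2]], [3]])
```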

cleands.formula.unique(x)

Return list with original order but unique entries.

Parameters: x (list) – Input list.
Returns: De-duplicated list preserving first occurrence order.
Return type: list
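
A minimal sketch of order-preserving de-duplication, assuming the items are hashable. The name unique_in_order is hypothetical, and the module may implement this differently.

```python
def unique_in_order(x):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys keep insertion order, so the first occurrence wins.
    return list(dict.fromkeys(x))


terms = unique_in_order(["b", "a", "b", "c", "a"])
print(terms)  # ['b', 'a', 'c']
```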

cleands.formula.split_expression(expression, delimiter='+')

Split an expression by a delimiter respecting parentheses depth.

This avoids splitting inside nested parentheses.

Parameters:

expression (str) – Expression string to split.
delimiter (str) – Delimiter character (default: ‘+’).

Returns:

The top-level fragments.

Return type:

list[str]
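
Depth-aware splitting can be sketched as below: a counter tracks the parenthesis depth, and the delimiter only splits when the depth is zero. The function name split_top_level is hypothetical, and the sketch assumes a single-character delimiter.

```python
def split_top_level(expression, delimiter="+"):
    """Split on delimiter, ignoring delimiters nested inside parentheses."""
    fragments, current, depth = [], [], 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == delimiter and depth == 0:
            # Top-level delimiter: close the current fragment.
            fragments.append("".join(current))
            current = []
        else:
            current.append(ch)
    fragments.append("".join(current))
    return fragments


parts = split_top_level("a+(b+c)+d")
print(parts)  # ['a', '(b+c)', 'd']
```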

cleands.formula.match_parens(expression)

If expression is fully parenthesized, return the inside; else None.

Parameters: expression (str) – Candidate string like “(a+b)”.
Returns: Inner content without outer parentheses if valid; otherwise None.
Return type: Optional[str]
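
Checking only the first and last characters is not enough, because “(a)+(b)” also starts with “(” and ends with “)”. The sketch below, under the hypothetical name strip_outer_parens, requires the opening parenthesis to close at the very last character. It illustrates the idea and is not taken from the module.

```python
def strip_outer_parens(expression):
    """Return the inside of a fully parenthesized string, else None."""
    if len(expression) < 2 or expression[0] != "(" or expression[-1] != ")":
        return None
    depth = 0
    last = len(expression) - 1
    for i, ch in enumerate(expression):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            # The first "(" must not close before the final character.
            if depth == 0 and i != last:
                return None
    return expression[1:-1] if depth == 0 else None


inner = strip_outer_parens("(a+b)")
print(inner)  # a+b
```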

cleands.formula.make_pretty_minus(expression)

Normalize ‘-’ to ‘+-’ at top level to simplify inclusion/exclusion logic.

Example

“x - y + z” -> “x+-y+z” (and leading ‘+’ removed if present)

Parameters: expression (str) – Raw expression.
Returns: Normalized expression.
Return type: str
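
The normalization in the example above might look like the following sketch. Removing whitespace is an assumption inferred from the example output, the name normalize_minus is hypothetical, and the input is assumed to be a raw expression that has not been normalized yet.

```python
def normalize_minus(expression):
    """Rewrite top-level '-' as '+-' and drop a leading '+'."""
    out, depth = [], 0
    for ch in expression.replace(" ", ""):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Minus signs inside parentheses are left untouched.
        out.append("+-" if ch == "-" and depth == 0 else ch)
    result = "".join(out)
    return result[1:] if result.startswith("+") else result


normalized = normalize_minus("x - y + z")
print(normalized)  # x+-y+z
```

After this step, splitting on “+” yields fragments that are either plain terms or terms prefixed with “-”.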

cleands.formula.bin(x)

Count occurrences of each element in a list.

Parameters: x (list) – Input list.
Returns: Mapping item -> count.
Return type: dict
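
An equivalent count can be obtained with collections.Counter. The sketch below uses the hypothetical name count_items and assumes hashable items.

```python
from collections import Counter


def count_items(x):
    """Return a plain dict mapping each item to its number of occurrences."""
    return dict(Counter(x))


counts = count_items(["a", "b", "a"])
print(counts)  # {'a': 2, 'b': 1}
```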

cleands.formula.parse(formula, data)

Parse a full formula into design metadata and a processed DataFrame.

Syntax:

y ~ rhs [| conditionals]

Where rhs can contain:

column names
interactions with ‘:’ (e.g., a:b)
products with ‘*’ (expanded to main effects + interactions unless distribution is detected)
‘+’ and ‘-’ to include/exclude terms
powers via ‘**’ (or ‘^’ if USE_CARET=True)
‘.’ to include all columns
special generators like ‘quadratic(a,b)’, etc.
‘I(expr)’ to evaluate raw Python/NumPy expressions on columns

The function:

adds ‘(intercept)’ to data,
parses the left-hand side (y),
returns selected x-vars (after inclusion/exclusion),
returns optional conditionals to be passed downstream.

Parameters:

formula (str) – Full formula string.
data (pd.DataFrame) – Source data.

Returns:

x_vars: Ordered unique design column names (includes ‘(intercept)’ unless removed).
y_var: Dependent variable name.
conditionals: Parsed conditional columns (after inclusion/exclusion).
processed: A copy of data with derived columns added, restricted to [y_var] + x_vars + conditionals.

Return type:

Tuple[list[str], str, list[str], pd.DataFrame]
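
The top-level syntax y ~ rhs [| conditionals] can be taken apart as in the sketch below. It shows only the splitting step, under the hypothetical name split_formula. It assumes “~” and “|” each appear at most once and never inside I(…), and it does not reproduce how parse() processes the pieces.

```python
def split_formula(formula):
    """Split 'y ~ rhs | conditionals' into its three stripped parts."""
    lhs, _, rest = formula.partition("~")
    rhs, _, conditionals = rest.partition("|")
    return lhs.strip(), rhs.strip(), conditionals.strip()


with_cond = split_formula("y ~ x1 + x2 | z")
print(with_cond)  # ('y', 'x1 + x2', 'z')

# The conditional part is optional and comes back as an empty string.
without_cond = split_formula("y ~ x1 + x2")
```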

cleands.formula.parse_expression(expression, data)

Parse a right-hand-side-like expression into included and excluded terms.

This function orchestrates:

normalization (make_pretty_minus, removing “np.” / “numpy.”, caret handling),
attempting parse_basic,
falling back to parse_complex.

Parameters:

expression (str) – RHS-like expression (may be empty).
data (pd.DataFrame) – DataFrame to mutate with derived columns.

Returns:

(included_terms, excluded_terms)

Return type:

Optional[tuple[list[str], list[str]]]

Notes

Returns None when the expression cannot be parsed. Callers typically test the result for truthiness and do not expect None in normal flows.

cleands.formula.parse_basic(expression, data)

Handle simple cases: literals, parentheses, single terms, all-cols, powers, sums.

Rules (order matters):

Empty string -> ([], [])
Parenthesized -> parse inner
“1” -> intercept
Valid term -> [term]
“.” -> all current columns
“(…)**k” -> expand to interactions up to power k
“a+b+…” -> sum of sub-expressions (recursively parsed)
“-expr” -> invert included/excluded sets (for minus handling)
Special power funcs (e.g., “quadratic(a,b)”)

Parameters:

expression (str) – Candidate expression.
data (pd.DataFrame) – Data to mutate with derived columns.

Returns:

(included, excluded) or None.

Return type:

Optional[tuple[list[str], list[str]]]
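
The sum and “-expr” rules interact as follows: each “+” fragment contributes to the included list, and a leading “-” swaps the included and excluded lists of the fragment it prefixes. The sketch below handles only flat sums of plain names that have already been normalized to the “+-” form. The name split_included_excluded is hypothetical, and the final filtering step is an assumption about how exclusions are applied downstream.

```python
def split_included_excluded(expression):
    """Parse a normalized flat sum such as 'x+-y+z' into (included, excluded)."""
    included, excluded = [], []
    for fragment in expression.split("+"):
        if not fragment:
            continue
        if fragment.startswith("-"):
            # Negation inverts the roles of the two lists.
            inner_inc, inner_exc = split_included_excluded(fragment[1:])
            inc, exc = inner_exc, inner_inc
        else:
            inc, exc = [fragment], []
        included.extend(inc)
        excluded.extend(exc)
    return included, excluded


included, excluded = split_included_excluded("x+-y+z")
print(included, excluded)  # ['x', 'z'] ['y']

# Assumed final step: keep included terms that were not excluded.
selected = [t for t in included if t not in excluded]
```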

cleands.formula.parse_complex(expression, data)

Handle products ‘*’ and interactions ‘:’ with distribution/expansion logic.

Strategy:

Try splitting by ‘*’ via parse_complex_expression_by_splitting_on_string. If the result has distributed=True, the product was distributed across its factors and the returned terms are used directly. Otherwise, interactions are generated from the collected terms.
Try ‘:’ similarly; ensure resulting string is a valid interaction.

Parameters:

expression (str) – Candidate expression with ‘*’ or ‘:’.
data (pd.DataFrame) – Data to mutate.

Returns:

(included, excluded) or None.

Return type:

Optional[Tuple[list[str], list[str]]]

Raises:

ValueError – If negations are detected in product/interaction contexts or invalid interaction strings are produced.

cleands.formula.parse_complex_expression_by_splitting_on_string(expression, data, delimiter='*')

Split by a delimiter (‘*’ or ‘:’) and attempt recursive parsing/distribution.

For ‘*’:

If any sub-expression yields multiple included terms (and no excluded), attempt distribution across the product.
If distribution succeeds, return the distributed terms with {‘error’: False, ‘distributed’: True, ‘terms’: […]}.
Otherwise, return the collected simple terms and mark {‘error’: False, ‘distributed’: False, ‘terms’: […]}, leaving the caller to generate interactions.

For ‘:’:

Just return the list of terms; the caller will validate/construct the final interaction string.

Parameters:

expression (str) – Input expression.
data (pd.DataFrame) – Data to mutate during parsing/evaluation.
delimiter (str) – Either ‘*’ or ‘:’.

Returns:

A dictionary with keys:

‘error’ (bool): True if excluded terms invalidated the operation.
‘distributed’ (bool): True if product distribution occurred.
‘terms’ (list[str] | None): Collected raw terms when successful.

Return type:

Optional[Dict[str, Any]]

cleands.formula.parse_term(term, data)

Parse a single term by attempting NumPy func, interaction, or as-is.

Order:

in_numpy()
is_interaction()
is_as_is()

Parameters:

term (str) – A candidate term string.
data (pd.DataFrame) – Data to be mutated if term is derived.

Returns:

True if the term was successfully parsed/applied to data.

Return type:

bool

cleands.formula.is_interaction(expression, data)

Create interaction column for colon-separated terms.

Example

“a:b:c” -> data[“a:b:c”] = data[“a”] * data[“b”] * data[“c”]

Parameters:

expression (str) – Interaction expression with ‘:’.
data (pd.DataFrame) – Data to mutate.

Returns:

True if an interaction was created; False otherwise.

Return type:

bool
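
The interaction column is the elementwise product of its parts. To stay free of third-party dependencies, the sketch below uses a plain dict of lists as a stand-in for the DataFrame. The name add_interaction is hypothetical, and returning False for unknown columns is an assumption.

```python
from math import prod


def add_interaction(expression, data):
    """Store the row-wise product of colon-separated columns under expression."""
    parts = expression.split(":")
    if len(parts) < 2 or any(p not in data for p in parts):
        return False
    columns = [data[p] for p in parts]
    # zip(*columns) walks the columns row by row.
    data[expression] = [prod(row) for row in zip(*columns)]
    return True


data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
created = add_interaction("a:b:c", data)
print(data["a:b:c"])  # [15, 48]
```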

cleands.formula.in_numpy(expression, data)

Evaluate a recognized NumPy unary function or treat as existing column.

If the expression has the form “<func>(col)” and <func> is a key in NUMPY_FUNCS, a new column is added that holds the function applied to data[col]. If the expression is already a column name, nothing is added and the function returns True.

Parameters:

expression (str) – Either a column name or “<func>(col)”.
data (pd.DataFrame) – Data to mutate.

Returns:

True if the expression is a known column or created successfully.

Return type:

bool
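
The lookup described above can be sketched with a regular expression and a function table. The table FUNCS below is a hypothetical stand-in for NUMPY_FUNCS, whose actual contents are not listed on this page. It uses math functions applied item by item instead of NumPy, and a dict of lists instead of a DataFrame.

```python
import math
import re

# Hypothetical stand-in for NUMPY_FUNCS.
FUNCS = {"log": math.log, "sqrt": math.sqrt, "exp": math.exp}
CALL = re.compile(r"^(\w+)\((\w+)\)$")


def apply_unary(expression, data):
    """Return True for an existing column or a successfully applied '<func>(col)'."""
    if expression in data:
        return True
    match = CALL.match(expression)
    if match is None:
        return False
    func, col = match.groups()
    if func not in FUNCS or col not in data:
        return False
    data[expression] = [FUNCS[func](v) for v in data[col]]
    return True


data = {"x": [1.0, 4.0]}
ok = apply_unary("sqrt(x)", data)
print(data["sqrt(x)"])  # [1.0, 2.0]
```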

cleands.formula.is_as_is(expression, data)

Evaluate a raw Python/NumPy expression with I(…).

Replaces bare column names inside I(…) with data[‘col’] and prefixes bare function names found in NUMPY_FUNCS with “np.” before the expression is evaluated.

Example

I((x1 + x2)**2) or I(sqrt(x))

Parameters:

expression (str) – Expression beginning with ‘I’.
data (pd.DataFrame) – Data to evaluate against.

Returns:

True if successfully evaluated and assigned; False otherwise.

Return type:

bool
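
The rewriting step can be illustrated without evaluating anything. The sketch below, under the hypothetical name rewrite_as_is, takes the column names and function names as explicit sets and returns the rewritten source string. Giving column names priority over function names is an assumption, and the evaluation step is deliberately left out.

```python
import re


def rewrite_as_is(expression, columns, funcs):
    """Rewrite the inside of I(...) so names refer to data[...] and np.<func>."""
    match = re.fullmatch(r"I\((.*)\)", expression)
    if match is None:
        return None

    def swap(name_match):
        name = name_match.group(0)
        if name in columns:
            return f"data['{name}']"
        if name in funcs:
            return f"np.{name}"
        return name

    # Identifiers start with a letter or underscore, so numeric literals are skipped.
    return re.sub(r"[A-Za-z_]\w*", swap, match.group(1))


squared = rewrite_as_is("I((x1 + x2)**2)", {"x1", "x2"}, {"sqrt"})
print(squared)  # (data['x1'] + data['x2'])**2
rooted = rewrite_as_is("I(sqrt(x))", {"x"}, {"sqrt"})
print(rooted)  # np.sqrt(data['x'])
```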

cleands.formula.generate_interactions(x, data, power=None)

Generate all unique interaction terms up to a given order.

Parameters:

x (list[str]) – Base term names (already validated/created in data).
data (pd.DataFrame) – Data to mutate with interactions.
power (Optional[int]) – Maximum interaction order. Defaults to len(x).

Returns:

Sorted, unique interaction strings that were generated.

Return type:

list[str]

Raises:

ValueError – If any generated term fails to create an interaction in data.
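
The set of interaction names up to a given order can be enumerated with itertools.combinations. The sketch below, under the hypothetical name interaction_names, only produces the names and does not create columns. It starts at order 2, which is an assumption, since this page does not say whether single terms are included in the result.

```python
from itertools import combinations


def interaction_names(x, power=None):
    """List sorted, unique 'a:b:...' names for orders 2 up to power."""
    max_order = len(x) if power is None else power
    names = set()
    for order in range(2, max_order + 1):
        for combo in combinations(x, order):
            names.add(":".join(combo))
    return sorted(names)


all_orders = interaction_names(["a", "b", "c"])
print(all_orders)  # ['a:b', 'a:b:c', 'a:c', 'b:c']
pairs_only = interaction_names(["a", "b", "c"], power=2)
print(pairs_only)  # ['a:b', 'a:c', 'b:c']
```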

cleands.formula.special_power(terms, data, power=1)

Generate special-power terms (linear, quadratic, etc.) for a set of terms.

Parameters:

terms (list[str]) – Base terms to expand.
data (pd.DataFrame) – DataFrame where generated columns will be stored.
power (int, optional) – Maximum power to generate. Defaults to 1.

Returns:

Names of generated terms stored in data.

Return type:

list[str]

Notes

For a single term x, quadratic produces x and I(x**2).
For multiple terms, interaction powers like I(x*y) and I(x**2*y) may be created.
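
One way to produce names consistent with these notes is to enumerate every monomial up to the requested total degree. The sketch below does that with combinations_with_replacement. The name monomial_names, the ordering of the output, and the assumption that all monomials up to the degree are generated are illustrative and not confirmed by this page.

```python
from collections import Counter
from itertools import combinations_with_replacement


def monomial_names(terms, power=1):
    """Name every monomial in terms with total degree 1 through power."""
    names = []
    for degree in range(1, power + 1):
        for combo in combinations_with_replacement(terms, degree):
            if degree == 1:
                names.append(combo[0])
                continue
            # Collapse repeated factors into powers, e.g. (x, x, y) -> x**2*y.
            factors = [
                t if k == 1 else f"{t}**{k}" for t, k in Counter(combo).items()
            ]
            names.append("I(" + "*".join(factors) + ")")
    return names


single = monomial_names(["x"], power=2)
print(single)  # ['x', 'I(x**2)']
double = monomial_names(["x", "y"], power=2)
print(double)  # ['x', 'y', 'I(x**2)', 'I(x*y)', 'I(y**2)']
```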

cleands.formula.special_power_funcs: Registry mapping special polynomial keywords to generator callables. Each value is functools.partial(special_power, power=k) for the power k shown:

linear (1), quadratic (2), cubic (3), quartic (4), quintic (5), sextic and hexic (6), septic (7), octic (8), nonic (9), decic (10), duodecic (12), vigintic (20).

cleands.formula.check_special_power_funcs(expression, data)

Detect and expand special polynomial helpers like ‘quadratic(…)’.

Parameters:

expression (str) – Expression beginning with a registered keyword.
data (pd.DataFrame) – Data to mutate with generated terms.

Returns:

Generated term list if matched; otherwise None.

Return type:

Optional[list[str]]
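
Keyword detection against a registry of functools.partial objects can be sketched as follows. Everything here is a stand-in: describe replaces the real generator, REGISTRY holds only three of the keywords, and splitting arguments on “,” assumes plain column names without nested calls.

```python
import re
from functools import partial


def describe(terms, data, power=1):
    """Hypothetical stand-in generator: report what would be expanded."""
    return [f"{t} up to degree {power}" for t in terms]


# Same shape as the registry described on this page: keyword -> partial(..., power=k).
REGISTRY = {
    "linear": partial(describe, power=1),
    "quadratic": partial(describe, power=2),
    "cubic": partial(describe, power=3),
}


def expand_keyword(expression, data):
    """Return generated terms for 'keyword(a,b,...)', or None if not registered."""
    match = re.fullmatch(r"(\w+)\((.*)\)", expression)
    if match is None or match.group(1) not in REGISTRY:
        return None
    terms = [t.strip() for t in match.group(2).split(",")]
    return REGISTRY[match.group(1)](terms, data)


generated = expand_keyword("quadratic(x1,x2)", {})
print(generated)  # ['x1 up to degree 2', 'x2 up to degree 2']
```

Binding the power with functools.partial lets every keyword share one generator signature, so the lookup code does not need to know which degree a keyword stands for.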