cleands.formula module

Formula parsing utilities for building design matrices from strings.

This module provides a lightweight, NumPy/Pandas-friendly parser for model formulas. It supports:

  • Intercept handling via the special column “(intercept)”.

  • Basic terms (column names), interactions with “:” and products with “*”.

  • Inclusion/exclusion using “+” and “-” (handled via make_pretty_minus).

  • Parentheses grouping.

  • Powers using “**” (or “^” if USE_CARET=True).

  • “As-is” expressions via I(<python expression>), evaluated against data.

  • A curated set of NumPy elementwise functions (see NUMPY_FUNCS).

  • Special polynomial generators such as “quadratic(x1,x2)”, “cubic(…)”, etc.

Key entry points:
  • parse(): returns x_vars, y_var, conditionals, and the processed DataFrame.

  • design helpers such as generate_interactions().

Notes

  • The parser mutates a copy of the input DataFrame and returns it.

  • A constant column “(intercept)” with value 1 is always added to the data; it stays in x_vars unless the formula excludes it.

cleands.formula.USE_CARET: bool = False

Whether to accept ‘^’ as a power operator (translated to ‘**’).

cleands.formula.LOGGING: bool = False

Enable debug prints from parse steps when True.

cleands.formula.bind(x)[source]

Flatten a list of lists by one level.

Parameters:

x (list) – List whose elements are iterables to be chained.

Returns:

Single flattened list.

Return type:

list
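As an illustration of one-level flattening, here is a minimal sketch that assumes bind behaves like chaining its elements together. The name bind_sketch is hypothetical and is not part of the module.

```python
from itertools import chain


def bind_sketch(x):
    """Flatten a list of iterables by exactly one level (hypothetical stand-in)."""
    return list(chain.from_iterable(x))


nested = [["a", "b"], ["c"], [["d"]]]
flat = bind_sketch(nested)
# Only one level is removed, so the inner list ["d"] survives as an element.
print(flat)  # ['a', 'b', 'c', ['d']]
```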

cleands.formula.unique(x)[source]

Return the unique entries of a list, preserving their original order.

Parameters:

x (list) – Input list.

Returns:

De-duplicated list preserving first occurrence order.

Return type:

list
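Order-preserving de-duplication can be sketched as follows. This assumes the elements are hashable (term names are strings); unique_sketch is a hypothetical name, and the module's own implementation may differ.

```python
def unique_sketch(x):
    """Drop repeated entries while keeping first-occurrence order."""
    # dict keys keep insertion order (Python 3.7+), so the first occurrence wins.
    return list(dict.fromkeys(x))


terms = unique_sketch(["(intercept)", "x1", "x2", "x1", "(intercept)"])
print(terms)  # ['(intercept)', 'x1', 'x2']
```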

cleands.formula.split_expression(expression, delimiter='+')[source]

Split an expression by a delimiter respecting parentheses depth.

This avoids splitting inside nested parentheses.

Parameters:
  • expression (str) – Expression string to split.

  • delimiter (str) – Delimiter character (default: ‘+’).

Returns:

The top-level fragments.

Return type:

list[str]
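The depth-aware split described above can be sketched with a simple counter. This is an illustrative version only: split_top_level is a hypothetical name, and it assumes a single-character delimiter and balanced parentheses.

```python
def split_top_level(expression, delimiter="+"):
    """Split on `delimiter` only where the parenthesis depth is zero."""
    fragments, current, depth = [], [], 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == delimiter and depth == 0:
            # Top-level delimiter: close the current fragment.
            fragments.append("".join(current))
            current = []
        else:
            current.append(ch)
    fragments.append("".join(current))
    return fragments


parts = split_top_level("a+(b+c)*d+log(e)")
print(parts)  # ['a', '(b+c)*d', 'log(e)']
```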

cleands.formula.match_parens(expression)[source]

If the whole expression is wrapped in one matching pair of outer parentheses, return the inside; otherwise return None.

Parameters:

expression (str) – Candidate string like “(a+b)”.

Returns:

Inner content without outer parentheses if valid; otherwise None.

Return type:

Optional[str]
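The distinction between “(a+b)” and “(a)+(b)” is the subtle part: both start with “(” and end with “)”, but only the first is fully parenthesized. A minimal sketch of that check, using the hypothetical name match_parens_sketch:

```python
def match_parens_sketch(expression):
    """Return the inner text if one outer pair of parentheses wraps everything."""
    if len(expression) < 2 or expression[0] != "(" or expression[-1] != ")":
        return None
    depth = 0
    for i, ch in enumerate(expression):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None  # unbalanced closing parenthesis
            if depth == 0 and i < len(expression) - 1:
                return None  # the first "(" closed before the end
    return expression[1:-1] if depth == 0 else None


print(match_parens_sketch("(a+b)"))    # a+b
print(match_parens_sketch("(a)+(b)"))  # None
```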

cleands.formula.make_pretty_minus(expression)[source]

Normalize ‘-’ to ‘+-’ at top level to simplify inclusion/exclusion logic.

Example

“x - y + z” -> “x+-y+z” (and leading ‘+’ removed if present)

Parameters:

expression (str) – Raw expression.

Returns:

Normalized expression.

Return type:

str
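One way to reproduce the example above is sketched below. It assumes whitespace is removed and that only minus signs outside parentheses are rewritten; pretty_minus_sketch is a hypothetical name, and edge cases such as a negative exponent are not handled here.

```python
def pretty_minus_sketch(expression):
    """Rewrite top-level '-' as '+-' so a later split on '+' isolates exclusions."""
    out, depth = [], 0
    for ch in expression.replace(" ", ""):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Insert '+' before a top-level '-' unless one is already there.
        if ch == "-" and depth == 0 and (not out or out[-1] != "+"):
            out.append("+")
        out.append(ch)
    return "".join(out).lstrip("+")


print(pretty_minus_sketch("x - y + z"))  # x+-y+z
print(pretty_minus_sketch("I(a-b) - c"))  # I(a-b)+-c
```

After this normalization, splitting on “+” at the top level yields fragments such as “-y” that can be routed to the excluded list.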

cleands.formula.bin(x)[source]

Count occurrences of each element in a list.

Parameters:

x (list) – Input list.

Returns:

Mapping item -> count.

Return type:

dict
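The counting behaviour matches what collections.Counter provides in the standard library. The sketch below is an equivalent under that assumption; bin_sketch is a hypothetical name.

```python
from collections import Counter


def bin_sketch(x):
    """Map each distinct item to the number of times it occurs."""
    return dict(Counter(x))


counts = bin_sketch(["a", "b", "a", "a:b", "b"])
print(counts)  # {'a': 2, 'b': 2, 'a:b': 1}
```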

cleands.formula.parse(formula, data)[source]

Parse a full formula into design metadata and a processed DataFrame.

Syntax:

y ~ rhs [| conditionals]

Where rhs can contain:
  • column names

  • interactions with ‘:’ (e.g., a:b)

  • products with ‘*’ (expanded to main effects plus interactions; when a factor is itself a sum of terms, the product is distributed over that sum instead)

  • ‘+’ and ‘-’ to include/exclude terms

  • powers via ‘**’ (or ‘^’ if USE_CARET=True)

  • ‘.’ to include all columns

  • special generators like ‘quadratic(a,b)’, etc.

  • ‘I(expr)’ to evaluate raw Python/NumPy expressions on columns

The function:
  • adds ‘(intercept)’ to data,

  • parses the left-hand side (y),

  • returns selected x-vars (after inclusion/exclusion),

  • returns optional conditionals to be passed downstream.

Parameters:
  • formula (str) – Full formula string.

  • data (pd.DataFrame) – Source data.

Returns:

  • x_vars: Ordered unique design column names (includes ‘(intercept)’ unless removed).

  • y_var: Dependent variable name.

  • conditionals: Parsed conditional columns (after inclusion/exclusion).

  • processed: A copy of data with derived columns added, restricted to [y_var] + x_vars + conditionals.

Return type:

Tuple[list[str], str, list[str], pd.DataFrame]
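The top-level syntax “y ~ rhs [| conditionals]” can be illustrated with a small splitter. This is a sketch of the first step only, not the module's parser: split_formula is a hypothetical helper, and it assumes “|” does not occur inside the right-hand side.

```python
def split_formula(formula):
    """Split 'y ~ rhs | conditionals' into its three parts (conditionals may be empty)."""
    lhs, tilde, rest = formula.partition("~")
    if not tilde:
        raise ValueError("formula must contain '~'")
    rhs, _, conditionals = rest.partition("|")
    return lhs.strip(), rhs.strip(), conditionals.strip()


y_var, rhs, cond = split_formula("y ~ x1 + x2*x3 | group")
print(y_var, "|", rhs, "|", cond)  # y | x1 + x2*x3 | group
```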

cleands.formula.parse_expression(expression, data)[source]

Parse a right-hand-side-like expression into included and excluded terms.

This function orchestrates:
  1. normalization (make_pretty_minus, removing “np.” / “numpy.”, caret handling),

  2. attempting parse_basic,

  3. falling back to parse_complex.

Parameters:
  • expression (str) – RHS-like expression (may be empty).

  • data (pd.DataFrame) – DataFrame to mutate with derived columns.

Returns:

(included_terms, excluded_terms)

Return type:

Optional[tuple[list[str], list[str]]]

Notes

  • Returns None on failure, but callers typically rely on truthiness and do not expect None in normal flows.
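The normalization in step 1 can be sketched as plain string rewriting. The version below covers only prefix stripping and caret handling (minus normalization is left out); normalize_sketch and its use_caret parameter are hypothetical stand-ins for the module's behaviour and its USE_CARET flag.

```python
def normalize_sketch(expression, use_caret=False):
    """Strip 'numpy.' / 'np.' prefixes and optionally translate '^' to '**'."""
    expression = expression.replace("numpy.", "").replace("np.", "")
    if use_caret:
        expression = expression.replace("^", "**")
    return expression


print(normalize_sketch("np.log(x)+numpy.sqrt(z)"))     # log(x)+sqrt(z)
print(normalize_sketch("(a+b)^2", use_caret=True))     # (a+b)**2
```

A plain substring replacement like this would also alter a column name that happens to contain “np.”, so a real implementation may need to be more careful.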

cleands.formula.parse_basic(expression, data)[source]

Handle simple cases: the intercept literal “1”, parentheses, single terms, the all-columns shorthand “.”, powers, and sums.

Rules (order matters):
  • Empty string -> ([], [])

  • Parenthesized -> parse inner

  • “1” -> intercept

  • Valid term -> [term]

  • “.” -> all current columns

  • “(…)**k” -> expand to interactions up to power k

  • “a+b+…” -> sum of sub-expressions (recursively parsed)

  • “-expr” -> invert included/excluded sets (for minus handling)

  • Special power funcs (e.g., “quadratic(a,b)”)

Parameters:
  • expression (str) – Candidate expression.

  • data (pd.DataFrame) – Data to mutate with derived columns.

Returns:

(included, excluded) or None.

Return type:

Optional[tuple[list[str], list[str]]]

cleands.formula.parse_complex(expression, data)[source]

Handle products ‘*’ and interactions ‘:’ with distribution/expansion logic.

Strategy:
  • Try splitting by ‘*’ via parse_complex_expression_by_splitting_on_string. If the result has distributed=True, the product was distributed over its sub-expressions and the resulting terms are returned. Otherwise, generated interactions are added to the collected terms.

  • Try ‘:’ similarly; ensure resulting string is a valid interaction.

Parameters:
  • expression (str) – Candidate expression with ‘*’ or ‘:’.

  • data (pd.DataFrame) – Data to mutate.

Returns:

(included, excluded) or None.

Return type:

Optional[tuple[list[str], list[str]]]

Raises:

ValueError – If negations are detected in product/interaction contexts or invalid interaction strings are produced.

cleands.formula.parse_complex_expression_by_splitting_on_string(expression, data, delimiter='*')[source]

Split by a delimiter (‘*’ or ‘:’) and attempt recursive parsing/distribution.

For ‘*’:
  • If any sub-expression yields multiple included terms (and no excluded), attempt distribution across the product.

  • If distribution succeeds, return the distributed terms with {‘error’: False, ‘distributed’: True, ‘terms’: […]}.

  • Otherwise, return the collected simple terms and mark {‘error’: False, ‘distributed’: False, ‘terms’: […]}, leaving the caller to generate interactions.

For ‘:’:
  • Just return the list of terms; the caller will validate/construct the final interaction string.

Parameters:
  • expression (str) – Input expression.

  • data (pd.DataFrame) – Data to mutate during parsing/evaluation.

  • delimiter (str) – Either ‘*’ or ‘:’.

Returns:

A dictionary with keys:
  • ‘error’ (bool): True if excluded terms invalidated the operation.

  • ‘distributed’ (bool): True if product distribution occurred.

  • ‘terms’ (list[str] | None): Collected raw terms when successful.

Return type:

Optional[Dict[str, Any]]

cleands.formula.parse_term(term, data)[source]

Parse a single term by trying, in order, a NumPy function, an interaction, and an as-is expression.

Order:
  1. in_numpy()

  2. is_interaction()

  3. is_as_is()

Parameters:
  • term (str) – A candidate term string.

  • data (pd.DataFrame) – Data to be mutated if term is derived.

Returns:

True if the term was successfully parsed/applied to data.

Return type:

bool

cleands.formula.is_interaction(expression, data)[source]

Create interaction column for colon-separated terms.

Example

“a:b:c” -> data[“a:b:c”] = data[“a”] * data[“b”] * data[“c”]

Parameters:
  • expression (str) – Interaction expression with ‘:’.

  • data (pd.DataFrame) – Data to mutate.

Returns:

True if an interaction was created; False otherwise.

Return type:

bool
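The elementwise product in the example can be illustrated without pandas. In the sketch below a plain dict of lists stands in for the DataFrame, and add_interaction is a hypothetical name; the real function operates on DataFrame columns.

```python
from math import prod


def add_interaction(expression, data):
    """Store the row-wise product of the colon-separated columns under `expression`."""
    names = expression.split(":")
    if len(names) < 2 or any(name not in data for name in names):
        return False  # not an interaction, or a component column is missing
    columns = [data[name] for name in names]
    data[expression] = [prod(row) for row in zip(*columns)]
    return True


# A dict of lists is used here only as a stand-in for a DataFrame.
data = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [2, 2, 2]}
created = add_interaction("a:b:c", data)
print(created, data["a:b:c"])  # True [8, 20, 36]
```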

cleands.formula.in_numpy(expression, data)[source]

Evaluate a recognized NumPy unary function or treat as existing column.

If expression matches a key in NUMPY_FUNCS in the form “<func>(col)”, the new column is added as that function applied to data[col]. If the expression is already a column name, this returns True.

Parameters:
  • expression (str) – Either a column name or “<func>(col)”.

  • data (pd.DataFrame) – Data to mutate.

Returns:

True if the expression is a known column or created successfully.

Return type:

bool
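The “<func>(col)” matching can be sketched with a regular expression and a lookup table. Everything named here is an assumption for illustration: FUNCS uses functions from the math module as a stand-in for NUMPY_FUNCS, a dict of lists stands in for the DataFrame, and apply_unary is a hypothetical name.

```python
import math
import re

# Stand-in for NUMPY_FUNCS; the real registry maps names to NumPy functions.
FUNCS = {"log": math.log, "sqrt": math.sqrt, "exp": math.exp}


def apply_unary(expression, data):
    """Return True for an existing column or a successfully applied '<func>(col)'."""
    if expression in data:
        return True
    match = re.fullmatch(r"(\w+)\((.+)\)", expression)
    if not match or match.group(1) not in FUNCS or match.group(2) not in data:
        return False
    func, col = FUNCS[match.group(1)], match.group(2)
    data[expression] = [func(value) for value in data[col]]
    return True


data = {"x": [1.0, 4.0, 9.0]}
ok = apply_unary("sqrt(x)", data)
print(ok, data["sqrt(x)"])  # True [1.0, 2.0, 3.0]
```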

cleands.formula.is_as_is(expression, data)[source]

Evaluate a raw Python/NumPy expression with I(…).

Replaces bare column names inside I(…) with data[‘col’] and prefixes bare function names that appear in NUMPY_FUNCS with “np.”.

Example

I((x1 + x2)**2) or I(sqrt(x))

Parameters:
  • expression (str) – Expression beginning with ‘I’.

  • data (pd.DataFrame) – Data to evaluate against.

Returns:

True if successfully evaluated and assigned; False otherwise.

Return type:

bool
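The rewriting step described above can be shown in isolation, without evaluating anything. The sketch assumes column names are plain identifiers; rewrite_as_is, columns, and funcs are hypothetical names, and the module's actual substitution logic may differ.

```python
import re


def rewrite_as_is(expression, columns, funcs):
    """Turn 'I(<expr>)' into a string that refers to data[...] and np.<func>."""
    if not (expression.startswith("I(") and expression.endswith(")")):
        return None
    inner = expression[2:-1]

    def qualify(match):
        token = match.group(0)
        if token in columns:
            return f"data['{token}']"  # bare column name
        if token in funcs:
            return f"np.{token}"       # bare function name
        return token                   # anything else is left untouched

    return re.sub(r"[A-Za-z_]\w*", qualify, inner)


print(rewrite_as_is("I((x1 + x2)**2)", {"x1", "x2"}, {"sqrt"}))
# (data['x1'] + data['x2'])**2
print(rewrite_as_is("I(sqrt(x))", {"x"}, {"sqrt"}))
# np.sqrt(data['x'])
```

The resulting string is what would then be evaluated against the data. Because that evaluation runs arbitrary Python, formulas containing I(…) should come from trusted input.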

cleands.formula.generate_interactions(x, data, power=None)[source]

Generate all unique interaction terms up to a given order.

Parameters:
  • x (list[str]) – Base term names (already validated/created in data).

  • data (pd.DataFrame) – Data to mutate with interactions.

  • power (Optional[int]) – Maximum interaction order. Defaults to len(x).

Returns:

Sorted, unique interaction strings that were generated.

Return type:

list[str]

Raises:

ValueError – If any generated term fails to create an interaction in data.
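The set of names produced can be sketched with itertools.combinations. This version builds names only and does not touch any data; it assumes that “interaction” means order two or higher and that the result is sorted lexicographically. interaction_names is a hypothetical name.

```python
from itertools import combinations


def interaction_names(x, power=None):
    """List colon-joined interactions of order 2..power (default: len(x))."""
    power = len(x) if power is None else power
    names = {
        ":".join(combo)
        for order in range(2, power + 1)
        for combo in combinations(x, order)
    }
    return sorted(names)


print(interaction_names(["a", "b", "c"]))
# ['a:b', 'a:b:c', 'a:c', 'b:c']
print(interaction_names(["a", "b", "c"], power=2))
# ['a:b', 'a:c', 'b:c']
```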

cleands.formula.special_power(terms, data, power=1)[source]

Generate special-power terms (linear, quadratic, etc.) for a set of terms.

Parameters:
  • terms (list[str]) – Base terms to expand.

  • data (pd.DataFrame) – DataFrame where generated columns will be stored.

  • power (int, optional) – Maximum power to generate. Defaults to 1.

Returns:

Names of generated terms stored in data.

Return type:

list[str]

Notes

  • For a single term x, quadratic produces x and I(x**2).

  • For multiple terms, interaction powers like I(x*y) and I(x**2*y) may be created.
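The naming pattern in the notes above is consistent with enumerating all monomials of total degree up to power. The sketch below works under that assumption and generates names only; monomial_names is hypothetical, and the order and exact spelling of names produced by the module may differ.

```python
from collections import Counter
from itertools import combinations_with_replacement


def monomial_names(terms, power=1):
    """Name every monomial in `terms` with total degree from 1 to `power`."""
    names = []
    for degree in range(1, power + 1):
        for combo in combinations_with_replacement(terms, degree):
            if degree == 1:
                names.append(combo[0])  # plain term, no I(...) wrapper
                continue
            exponents = Counter(combo)  # e.g. ('x', 'x', 'y') -> {'x': 2, 'y': 1}
            factors = [t if e == 1 else f"{t}**{e}" for t, e in exponents.items()]
            names.append("I(" + "*".join(factors) + ")")
    return names


print(monomial_names(["x"], power=2))
# ['x', 'I(x**2)']
print(monomial_names(["x", "y"], power=2))
# ['x', 'y', 'I(x**2)', 'I(x*y)', 'I(y**2)']
```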

cleands.formula.special_power_funcs = {
    'cubic': functools.partial(<function special_power>, power=3),
    'decic': functools.partial(<function special_power>, power=10),
    'duodecic': functools.partial(<function special_power>, power=12),
    'hexic': functools.partial(<function special_power>, power=6),
    'linear': functools.partial(<function special_power>, power=1),
    'nonic': functools.partial(<function special_power>, power=9),
    'octic': functools.partial(<function special_power>, power=8),
    'quadratic': functools.partial(<function special_power>, power=2),
    'quartic': functools.partial(<function special_power>, power=4),
    'quintic': functools.partial(<function special_power>, power=5),
    'septic': functools.partial(<function special_power>, power=7),
    'sextic': functools.partial(<function special_power>, power=6),
    'vigintic': functools.partial(<function special_power>, power=20),
}

Registry mapping special polynomial keywords to generator callables.

cleands.formula.check_special_power_funcs(expression, data)[source]

Detect and expand special polynomial helpers like ‘quadratic(…)’.

Parameters:
  • expression (str) – Expression beginning with a registered keyword.

  • data (pd.DataFrame) – Data to mutate with generated terms.

Returns:

Generated term list if matched; otherwise None.

Return type:

Optional[list[str]]
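The keyword lookup can be sketched with a registry of functools.partial objects, mirroring the shape of special_power_funcs. The generator here is a stub that only reports what it would expand; REGISTRY, describe_power, and check_keyword are hypothetical names, and the argument split assumes no nested parentheses or commas inside arguments.

```python
import re
from functools import partial


def describe_power(terms, power=1):
    """Stub generator: describe the expansion instead of creating columns."""
    return [f"{term} up to power {power}" for term in terms]


# Small stand-in for the registry; each keyword fixes the `power` argument.
REGISTRY = {
    "linear": partial(describe_power, power=1),
    "quadratic": partial(describe_power, power=2),
    "cubic": partial(describe_power, power=3),
}


def check_keyword(expression):
    """Return the generator's output for 'keyword(a,b,...)', or None if no match."""
    match = re.fullmatch(r"(\w+)\((.*)\)", expression.replace(" ", ""))
    if not match or match.group(1) not in REGISTRY:
        return None
    terms = [t for t in match.group(2).split(",") if t]
    return REGISTRY[match.group(1)](terms)


print(check_keyword("quadratic(x1, x2)"))
# ['x1 up to power 2', 'x2 up to power 2']
print(check_keyword("log(x)"))  # None
```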