cleands.formula module
Formula parsing utilities for building design matrices from strings.
This module provides a lightweight, NumPy/Pandas-friendly parser for model formulas. It supports:
Intercept handling via the special column “(intercept)”.
Basic terms (column names), interactions with “:” and products with “*”.
Inclusion/exclusion using “+” and “-” (handled via make_pretty_minus).
Parentheses grouping.
Powers using “**” (or “^” if USE_CARET=True).
“As-is” expressions via I(<python expression>), evaluated against data.
A curated set of NumPy elementwise functions (see NUMPY_FUNCS).
Special polynomial generators such as “quadratic(x1,x2)”, “cubic(…)”, etc.
- Key entry points:
parse(): returns x_vars, y_var, conditionals, and the processed DataFrame.
design helpers such as generate_interactions().
Notes
The parser mutates a copy of the input DataFrame and returns it.
(intercept) is always added as a column with value 1 unless excluded.
- cleands.formula.USE_CARET: bool = False
Whether to accept ‘^’ as a power operator (translated to ‘**’).
- cleands.formula.LOGGING: bool = False
Enable debug prints from parse steps when True.
- cleands.formula.bind(x)[source]
Flatten a list of lists by one level.
- Parameters:
x (list) – List whose elements are iterables to be chained.
- Returns:
Single flattened list.
- Return type:
list
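The one-level flattening described above can be sketched with the standard library. This is an illustrative stand-in, not the module's source; the name bind_sketch is hypothetical.

```python
from itertools import chain

def bind_sketch(x):
    """Flatten a list of iterables by exactly one level (hypothetical helper)."""
    # chain.from_iterable concatenates the inner iterables without recursing,
    # so deeper nesting is left untouched.
    return list(chain.from_iterable(x))

print(bind_sketch([["a", "b"], ["c"], []]))  # ['a', 'b', 'c']
```

Only the outermost level is removed, so an inner list such as [2] in [[1, [2]], [3]] stays a list.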
- cleands.formula.unique(x)[source]
Return the unique entries of a list in their original order.
- Parameters:
x (list) – Input list.
- Returns:
De-duplicated list preserving first occurrence order.
- Return type:
list
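A minimal sketch of order-preserving de-duplication, assuming the entries are hashable (as column-name strings are). The name unique_sketch is hypothetical and the module may implement this differently.

```python
def unique_sketch(x):
    """Drop repeated entries, keeping the first occurrence of each (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(x))

print(unique_sketch(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```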
- cleands.formula.split_expression(expression, delimiter='+')[source]
Split an expression by a delimiter respecting parentheses depth.
This avoids splitting inside nested parentheses.
- Parameters:
expression (str) – Expression string to split.
delimiter (str) – Delimiter character (default: ‘+’).
- Returns:
The top-level fragments.
- Return type:
list[str]
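Depth-aware splitting can be sketched by tracking how many parentheses are open while scanning the string. The function below is an illustration under the assumption of a single-character delimiter; split_top_level is a hypothetical name, not the module's implementation.

```python
def split_top_level(expression, delimiter="+"):
    """Split on `delimiter` only where no parenthesis is open (hypothetical helper)."""
    parts, current, depth = [], [], 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == delimiter and depth == 0:
            # Top-level delimiter: close the current fragment.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

print(split_top_level("a+(b+c)+d"))  # ['a', '(b+c)', 'd']
```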
- cleands.formula.match_parens(expression)[source]
If expression is fully parenthesized, return the inside; else None.
- Parameters:
expression (str) – Candidate string like “(a+b)”.
- Returns:
Inner content without outer parentheses if valid; otherwise None.
- Return type:
Optional[str]
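The "fully parenthesized" check has one subtle case: a string such as "(a)+(b)" starts with "(" and ends with ")" but is not wrapped by a single pair. A sketch of one way to handle this follows; inner_if_wrapped is a hypothetical name and the module's logic may differ.

```python
from typing import Optional

def inner_if_wrapped(expression: str) -> Optional[str]:
    """Return the inside of one enclosing pair of parentheses, else None (hypothetical helper)."""
    if len(expression) < 2 or expression[0] != "(" or expression[-1] != ")":
        return None
    depth = 0
    last = len(expression) - 1
    for i, ch in enumerate(expression):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i != last:
                # The first "(" closed before the end, e.g. "(a)+(b)".
                return None
    return expression[1:-1] if depth == 0 else None

print(inner_if_wrapped("(a+b)"))    # a+b
print(inner_if_wrapped("(a)+(b)"))  # None
```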
- cleands.formula.make_pretty_minus(expression)[source]
Normalize ‘-’ to ‘+-’ at top level to simplify inclusion/exclusion logic.
Example
“x - y + z” -> “x+-y+z” (and leading ‘+’ removed if present)
- Parameters:
expression (str) – Raw expression.
- Returns:
Normalized expression.
- Return type:
str
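The normalization in the example above can be sketched as follows, assuming whitespace is stripped first and that only minus signs outside parentheses are rewritten. The name pretty_minus_sketch is hypothetical, and edge cases such as negative exponents are not handled.

```python
def pretty_minus_sketch(expression):
    """Rewrite top-level '-' as '+-' and drop a leading '+' (hypothetical helper)."""
    out, depth = [], 0
    for ch in expression.replace(" ", ""):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Only touch minus signs that are not nested inside parentheses.
        out.append("+-" if ch == "-" and depth == 0 else ch)
    result = "".join(out)
    return result[1:] if result.startswith("+") else result

print(pretty_minus_sketch("x - y + z"))  # x+-y+z
```

After this step every top-level term is separated by "+", so exclusion reduces to checking whether a fragment starts with "-".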
- cleands.formula.bin(x)[source]
Count occurrences of each element in a list.
- Parameters:
x (list) – Input list.
- Returns:
Mapping item -> count.
- Return type:
dict
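A counting helper of this kind can be sketched with collections.Counter. The name bin_sketch is hypothetical and the module's version may be written differently.

```python
from collections import Counter

def bin_sketch(x):
    """Map each distinct item to its number of occurrences (hypothetical helper)."""
    return dict(Counter(x))

print(bin_sketch(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```

Note that the documented function shares its name with Python's built-in bin(), so a star import of the module would shadow the built-in.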
- cleands.formula.parse(formula, data)[source]
Parse a full formula into design metadata and a processed DataFrame.
- Syntax:
y ~ rhs [| conditionals]
- Where rhs can contain:
column names
interactions with ‘:’ (e.g., a:b)
products with ‘*’ (expanded to main effects + interactions, unless a factor parses to multiple terms, in which case the product is distributed over them)
‘+’ and ‘-’ to include/exclude terms
powers via ‘**’ (or ‘^’ if USE_CARET=True)
‘.’ to include all columns
special generators like ‘quadratic(a,b)’, etc.
‘I(expr)’ to evaluate raw Python/NumPy expressions on columns
- The function:
adds ‘(intercept)’ to data,
parses the left-hand side (y),
returns selected x-vars (after inclusion/exclusion),
returns optional conditionals to be passed downstream.
- Parameters:
formula (str) – Full formula string.
data (pd.DataFrame) – Source data.
- Returns:
x_vars: Ordered unique design column names (includes ‘(intercept)’ unless removed).
y_var: Dependent variable name.
conditionals: Parsed conditional columns (after inclusion/exclusion).
processed: A copy of data with derived columns added, restricted to [y_var] + x_vars + conditionals.
- Return type:
Tuple[list[str], str, list[str], pd.DataFrame]
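The top-level shape of the syntax, y ~ rhs [| conditionals], can be illustrated by splitting the string on its two separators. This is only a sketch of the first step, under the assumption that "~" and "|" do not occur inside the terms themselves; split_formula is a hypothetical name and not part of the module.

```python
def split_formula(formula):
    """Split 'y ~ rhs | conditionals' into its three parts (hypothetical helper)."""
    lhs, sep, rest = formula.partition("~")
    if not sep:
        raise ValueError("formula must contain '~'")
    # The conditional part is optional; partition yields '' when '|' is absent.
    rhs, _, conditionals = rest.partition("|")
    return lhs.strip(), rhs.strip(), conditionals.strip()

print(split_formula("y ~ x1 + x2 | z"))  # ('y', 'x1 + x2', 'z')
```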
- cleands.formula.parse_expression(expression, data)[source]
Parse a right-hand-side-like expression into included and excluded terms.
- This function orchestrates:
normalization (make_pretty_minus, removing “np.” / “numpy.”, caret handling),
attempting parse_basic,
falling back to parse_complex.
- Parameters:
expression (str) – RHS-like expression (may be empty).
data (pd.DataFrame) – DataFrame to mutate with derived columns.
- Returns:
(included_terms, excluded_terms)
- Return type:
tuple[list[str], list[str]]
Notes
Returns None on failure, but callers typically rely on truthiness and do not expect None in normal flows.
- cleands.formula.parse_basic(expression, data)[source]
Handle simple cases: literals, parentheses, single terms, all-cols, powers, sums.
- Rules (order matters):
Empty string -> ([], [])
Parenthesized -> parse inner
“1” -> intercept
Valid term -> [term]
“.” -> all current columns
“(… )**k” -> expand to interactions up to power k
“a+b+…” -> sum of sub-expressions (recursively parsed)
“-expr” -> invert included/excluded sets (for minus handling)
Special power funcs (e.g., “quadratic(a,b)”)
- Parameters:
expression (str) – Candidate expression.
data (pd.DataFrame) – Data to mutate with derived columns.
- Returns:
(included, excluded) or None.
- Return type:
Optional[tuple[list[str], list[str]]]
- cleands.formula.parse_complex(expression, data)[source]
Handle products ‘*’ and interactions ‘:’ with distribution/expansion logic.
- Strategy:
Try splitting by ‘*’ via parse_complex_expression_by_splitting_on_string. If distributed=True, the product was distributable and we return terms. Otherwise, we add generated interactions.
Try ‘:’ similarly; ensure resulting string is a valid interaction.
- Parameters:
expression (str) – Candidate expression with ‘*’ or ‘:’.
data (pd.DataFrame) – Data to mutate.
- Returns:
(included, excluded) or None.
- Return type:
Optional[Tuple[list[str], list[str]]]
- Raises:
ValueError – If negations are detected in product/interaction contexts or invalid interaction strings are produced.
- cleands.formula.parse_complex_expression_by_splitting_on_string(expression, data, delimiter='*')[source]
Split by a delimiter (‘*’ or ‘:’) and attempt recursive parsing/distribution.
- For ‘*’:
If any sub-expression yields multiple included terms (and no excluded), attempt distribution across the product.
If distribution succeeds, return the distributed terms with {‘error’: False, ‘distributed’: True, ‘terms’: […]}.
Otherwise, return the collected simple terms and mark {‘error’: False, ‘distributed’: False, ‘terms’: […]}, leaving the caller to generate interactions.
- For ‘:’:
Just return the list of terms; the caller will validate/construct the final interaction string.
- Parameters:
expression (str) – Input expression.
data (pd.DataFrame) – Data to mutate during parsing/evaluation.
delimiter (str) – Either ‘*’ or ‘:’.
- Returns:
- A dictionary with keys:
’error’ (bool): True if excluded terms invalidated the operation.
’distributed’ (bool): True if product distribution occurred.
’terms’ (list[str] | None): Collected raw terms when successful.
- Return type:
Optional[Dict[str, Any]]
- cleands.formula.parse_term(term, data)[source]
Parse a single term by trying, in order, a NumPy function, an interaction, and an as-is expression.
- Order:
in_numpy()
is_interaction()
is_as_is()
- Parameters:
term (str) – A candidate term string.
data (pd.DataFrame) – Data to be mutated if term is derived.
- Returns:
True if the term was successfully parsed/applied to data.
- Return type:
bool
- cleands.formula.is_interaction(expression, data)[source]
Create interaction column for colon-separated terms.
Example
“a:b:c” -> data[“a:b:c”] = data[“a”] * data[“b”] * data[“c”]
- Parameters:
expression (str) – Interaction expression with ‘:’.
data (pd.DataFrame) – Data to mutate.
- Returns:
True if an interaction was created; False otherwise.
- Return type:
bool
- cleands.formula.in_numpy(expression, data)[source]
Evaluate a recognized NumPy unary function or treat as existing column.
If expression matches a key in NUMPY_FUNCS in the form “<func>(col)”, the new column is added as that function applied to data[col]. If the expression is already a column name, this returns True.
- Parameters:
expression (str) – Either a column name or “<func>(col)”.
data (pd.DataFrame) – Data to mutate.
- Returns:
True if the expression is a known column or created successfully.
- Return type:
bool
- cleands.formula.is_as_is(expression, data)[source]
Evaluate a raw Python/NumPy expression with I(…).
Replaces bare column names in the interior with data[‘col’] and ensures bare NumPy function names are qualified with np. if present in NUMPY_FUNCS.
Example
I((x1 + x2)**2) or I(sqrt(x))
- Parameters:
expression (str) – Expression beginning with ‘I’.
data (pd.DataFrame) – Data to evaluate against.
- Returns:
True if successfully evaluated and assigned; False otherwise.
- Return type:
bool
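The rewriting step described above (bare column names become data['col'], bare function names gain an np. prefix) can be sketched with a regular expression over identifiers. Everything here is an assumption for illustration: the name rewrite_as_is, its parameters, and the small function set standing in for NUMPY_FUNCS. It only builds the rewritten string and does not evaluate it.

```python
import re

# Identifiers not preceded by a word character or a dot, so "np.sqrt" is left alone.
_IDENT = re.compile(r"(?<![\w.])[A-Za-z_]\w*")

def rewrite_as_is(expression, columns, funcs):
    """Rewrite the interior of I(...) for evaluation against `data` (hypothetical helper)."""
    if not (expression.startswith("I(") and expression.endswith(")")):
        return None
    interior = expression[2:-1]

    def substitute(match):
        name = match.group(0)
        if name in columns:
            return f"data['{name}']"
        if name in funcs:
            return f"np.{name}"
        return name  # leave unknown names (for example "np") untouched

    return _IDENT.sub(substitute, interior)

print(rewrite_as_is("I(sqrt(x))", ["x"], {"sqrt"}))  # np.sqrt(data['x'])
```

A column whose name collides with a function name would be treated as a column in this sketch, since columns are checked first.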
- cleands.formula.generate_interactions(x, data, power=None)[source]
Generate all unique interaction terms up to a given order.
- Parameters:
x (list[str]) – Base term names (already validated/created in data).
data (pd.DataFrame) – Data to mutate with interactions.
power (Optional[int]) – Maximum interaction order. Defaults to len(x).
- Returns:
Sorted, unique interaction strings that were generated.
- Return type:
list[str]
- Raises:
ValueError – If any generated term fails to create an interaction in data.
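The naming side of interaction generation can be sketched with itertools.combinations. This sketch assumes that interactions start at order 2 (main effects are handled elsewhere), that each term appears at most once per interaction, and that "sorted" means plain string sorting; none of these is confirmed by the documentation. It only builds names and does not create columns. The name interaction_names is hypothetical.

```python
from itertools import combinations

def interaction_names(terms, power=None):
    """List colon-joined interaction names up to order `power` (hypothetical helper)."""
    max_order = len(terms) if power is None else power
    names = set()
    for order in range(2, max_order + 1):
        for combo in combinations(terms, order):
            names.add(":".join(combo))
    return sorted(names)

print(interaction_names(["a", "b", "c"], power=2))  # ['a:b', 'a:c', 'b:c']
```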
- cleands.formula.special_power(terms, data, power=1)[source]
Generate special-power terms (linear, quadratic, etc.) for a set of terms.
- Parameters:
terms (list[str]) – Base terms to expand.
data (pd.DataFrame) – DataFrame where generated columns will be stored.
power (int, optional) – Maximum power to generate. Defaults to 1.
- Returns:
Names of generated terms stored in data.
- Return type:
list[str]
Notes
For a single term x, quadratic produces x and I(x**2). For multiple terms, interaction powers like I(x*y) and I(x**2*y) may be created.
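The term names mentioned in the notes are consistent with enumerating every monomial up to the requested total degree. The sketch below reproduces those names under that assumption; the ordering of the output and the name monomial_names are hypothetical, and no columns are created.

```python
from collections import Counter
from itertools import combinations_with_replacement

def monomial_names(terms, power=1):
    """Name every monomial in `terms` with total degree 1..power (hypothetical helper)."""
    names = []
    for degree in range(1, power + 1):
        for combo in combinations_with_replacement(terms, degree):
            if degree == 1:
                names.append(combo[0])  # linear terms keep their bare name
                continue
            counts = Counter(combo)
            factors = [t if counts[t] == 1 else f"{t}**{counts[t]}"
                       for t in terms if t in counts]
            names.append("I(" + "*".join(factors) + ")")
    return names

print(monomial_names(["x"], power=2))  # ['x', 'I(x**2)']
```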
- cleands.formula.special_power_funcs = {'cubic': functools.partial(<function special_power>, power=3), 'decic': functools.partial(<function special_power>, power=10), 'duodecic': functools.partial(<function special_power>, power=12), 'hexic': functools.partial(<function special_power>, power=6), 'linear': functools.partial(<function special_power>, power=1), 'nonic': functools.partial(<function special_power>, power=9), 'octic': functools.partial(<function special_power>, power=8), 'quadratic': functools.partial(<function special_power>, power=2), 'quartic': functools.partial(<function special_power>, power=4), 'quintic': functools.partial(<function special_power>, power=5), 'septic': functools.partial(<function special_power>, power=7), 'sextic': functools.partial(<function special_power>, power=6), 'vigintic': functools.partial(<function special_power>, power=20)}
Registry mapping special polynomial keywords to generator callables.
- cleands.formula.check_special_power_funcs(expression, data)[source]
Detect and expand special polynomial helpers like ‘quadratic(…)’.
- Parameters:
expression (str) – Expression beginning with a registered keyword.
data (pd.DataFrame) – Data to mutate with generated terms.
- Returns:
Generated term list if matched; otherwise None.
- Return type:
Optional[list[str]]
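Detection of a keyword call such as quadratic(x1,x2) can be sketched with a regular expression and a keyword-to-power table. The table below copies a few entries from the special_power_funcs registry shown earlier; the function name detect_special_power and its return shape are assumptions, and the sketch stops at recognizing the call rather than generating terms.

```python
import re

# Subset of the documented registry, reduced to keyword -> power.
POWERS = {"linear": 1, "quadratic": 2, "cubic": 3, "quartic": 4, "quintic": 5}

_CALL = re.compile(r"^(\w+)\((.*)\)$")

def detect_special_power(expression):
    """Return (power, argument names) for a registered keyword call, else None (hypothetical helper)."""
    match = _CALL.match(expression.strip())
    if match is None or match.group(1) not in POWERS:
        return None
    args = [arg.strip() for arg in match.group(2).split(",") if arg.strip()]
    return POWERS[match.group(1)], args

print(detect_special_power("quadratic(x1,x2)"))  # (2, ['x1', 'x2'])
```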