Pyteryx/alteryx_runner_spec.md

1678 lines
55 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

# Alteryx Open-Source Workflow Runner — Technical Specification
> **Target audience:** Agentic coding tools (opencode, pi, etc.)
> **Stack:** Python 3.11+, DuckDB, Polars, PyArrow
> **License target:** Apache 2.0
---
## 1. Project Overview
Build a Python-native runner that parses Alteryx Designer workflow XML files (`.yxmd`) and executes each tool node against real data using DuckDB (for SQL-heavy transforms and I/O) and Polars (for streaming/in-memory frame operations). The runner does **not** require Alteryx to be installed.
### 1.1 Design Principles
- **Faithful semantics**: match Alteryx output column order, naming, data types and NULL behaviour exactly.
- **DAG execution**: parse the `<Connections>` graph, build a topological execution plan, execute nodes in dependency order.
- **Lazy where possible**: use Polars LazyFrame / DuckDB views so materialisation happens once at sink nodes.
- **Pluggable tool registry**: each tool is a self-contained Python class registered by its XML `Plugin` string; new tools can be dropped in without touching core execution logic.
- **No GUI dependency**: CLI-first; expose a Python API for programmatic use.
---
## 2. Repository Layout
```
alteryx_runner/
├── cli.py # Entry point: `python -m alteryx_runner run workflow.yxmd`
├── engine/
│ ├── __init__.py
│ ├── parser.py # XML → WorkflowGraph
│ ├── executor.py # Topological executor
│ ├── graph.py # WorkflowGraph, Node, Connection dataclasses
│ ├── context.py # RunContext: connection pool, temp dir, constants
│ └── type_mapper.py # Alteryx type → Polars/DuckDB dtype
├── tools/
│ ├── __init__.py # Tool registry
│ ├── base.py # BaseTool ABC
│ ├── inout/
│ │ ├── input_data.py
│ │ ├── output_data.py
│ │ ├── text_input.py
│ │ └── browse.py
│ ├── preparation/
│ │ ├── filter_tool.py
│ │ ├── formula_tool.py
│ │ ├── select_tool.py
│ │ ├── sort_tool.py
│ │ ├── sample_tool.py
│ │ ├── unique_tool.py
│ │ ├── generate_rows.py
│ │ ├── multi_row_formula.py
│ │ ├── multi_field_formula.py
│ │ ├── record_id.py
│ │ └── auto_field.py
│ ├── join/
│ │ ├── join_tool.py
│ │ ├── join_multiple.py
│ │ ├── union_tool.py
│ │ ├── append_fields.py
│ │ └── find_replace.py
│ ├── parse/
│ │ ├── datetime_tool.py
│ │ ├── regex_tool.py
│ │ └── text_to_columns.py
│ └── transform/
│ ├── summarize_tool.py
│ ├── cross_tab.py
│ └── transpose_tool.py
├── expression/
│ ├── __init__.py
│ ├── transpiler.py # Alteryx expression → DuckDB SQL / Polars expr
│ └── functions.py # Built-in function mappings
├── tests/
│ ├── fixtures/ # .yxmd files (from uploaded samples)
│ └── test_*.py
└── pyproject.toml
```
---
## 3. .yxmd File Format
Alteryx workflow files are UTF-8 XML. The root element is `<AlteryxDocument>`.
### 3.1 Top-Level Structure
```xml
<AlteryxDocument yxmdVer="2022.1">
<Nodes>
<Node ToolID="1"> ... </Node>
<Node ToolID="2"> ... </Node>
</Nodes>
<Connections>
<Connection>
<Origin ToolID="1" Connection="Output" />
<Destination ToolID="2" Connection="Input" />
</Connection>
</Connections>
<Properties>
<!-- workflow-level settings -->
<SortedGrouping value="True" />
<MetaInfo> ... </MetaInfo>
</Properties>
</AlteryxDocument>
```
### 3.2 Node Structure
```xml
<Node ToolID="74">
<GuiSettings Plugin="AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect">
<Position x="282" y="186" />
</GuiSettings>
<Properties>
<Configuration>
<!-- tool-specific XML -->
</Configuration>
<Annotation DisplayMode="0">
<DefaultAnnotationText>...</DefaultAnnotationText>
</Annotation>
</Properties>
<EngineSettings EngineDll="AlteryxBasePluginsEngine.dll"
EngineDllEntryPoint="AlteryxSelect" />
<MetaInfo connection="Output">
<RecordInfo>
<Field name="CustomerID" type="Int32" source="..." />
</RecordInfo>
</MetaInfo>
</Node>
```
**Key attributes for the runner:**
- `ToolID`: integer, unique per workflow. Use as node identifier.
- `Plugin` in `<GuiSettings>`: maps to the tool class (see registry table §4).
- `<Configuration>`: tool-specific settings; parse per-tool.
- `<MetaInfo>/<RecordInfo>/<Field>`: output schema hint (available on input tools; not always present on transform tools).
- `<EngineSettings>`: `Macro` attribute signals a macro/yxmc sub-workflow; handle recursively.
### 3.3 Connection Structure
```xml
<Connection name="#1" Wireless="True">
<Origin ToolID="155" Connection="Output" />
<Destination ToolID="156" Connection="Input" />
</Connection>
```
- `Connection` attribute on `<Origin>` / `<Destination>`: the anchor label.
- Most tools: `"Output"` / `"Input"`.
- Filter: `"True"` / `"False"`.
- Join: `"Left"` / `"Right"` (inputs), `"J"` / `"L"` / `"R"` (outputs).
- JoinMultiple: `"Input"` (numbered), `"Output"`.
- Union: numbered `"Input1"`, `"Input2"`, etc.
- `Wireless="True"`: logical connection, no visual wire; treat identically to wired.
- `name="#N"`: numbered for multi-input tools; determines input slot ordering.
### 3.4 ChildNodes (Tool Containers)
```xml
<Node ToolID="159">
<GuiSettings Plugin="AlteryxGuiToolkit.ToolContainer.ToolContainer" />
<ChildNodes>
<Node ToolID="156"> ... </Node>
<Node ToolID="157"> ... </Node>
</ChildNodes>
</Node>
```
Flatten `<ChildNodes>` into the parent graph. The container itself is a no-op passthrough for execution.
---
## 4. Tool Registry
Map `Plugin` string → Python class. The runner looks up the `Plugin` from `<GuiSettings>`.
| Plugin String | Class | Category |
|---|---|---|
| `AlteryxBasePluginsGui.DbFileInput.DbFileInput` | `InputDataTool` | In/Out |
| `AlteryxBasePluginsGui.DbFileOutput.DbFileOutput` | `OutputDataTool` | In/Out |
| `AlteryxBasePluginsGui.TextInput.TextInput` | `TextInputTool` | In/Out |
| `AlteryxBasePluginsGui.BrowseV2.BrowseV2` | `BrowseTool` | In/Out |
| `AlteryxBasePluginsGui.Filter.Filter` | `FilterTool` | Preparation |
| `AlteryxBasePluginsGui.Formula.Formula` | `FormulaTool` | Preparation |
| `AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect` | `SelectTool` | Preparation |
| `AlteryxBasePluginsGui.Sort.Sort` | `SortTool` | Preparation |
| `AlteryxBasePluginsGui.Sample.Sample` | `SampleTool` | Preparation |
| `AlteryxBasePluginsGui.Unique.Unique` | `UniqueTool` | Preparation |
| `AlteryxBasePluginsGui.GenerateRows.GenerateRows` | `GenerateRowsTool` | Preparation |
| `AlteryxBasePluginsGui.MultiRowFormula.MultiRowFormula` | `MultiRowFormulaTool` | Preparation |
| `AlteryxBasePluginsGui.MultiFieldFormula.MultiFieldFormula` | `MultiFieldFormulaTool` | Preparation |
| `AlteryxBasePluginsGui.RecordID.RecordID` | `RecordIDTool` | Preparation |
| `AlteryxBasePluginsGui.AutoField.AutoField` | `AutoFieldTool` | Preparation |
| `AlteryxBasePluginsGui.Join.Join` | `JoinTool` | Join |
| `AlteryxBasePluginsGui.JoinMultiple.JoinMultiple` | `JoinMultipleTool` | Join |
| `AlteryxBasePluginsGui.Union.Union` | `UnionTool` | Join |
| `AlteryxBasePluginsGui.AppendFields.AppendFields` | `AppendFieldsTool` | Join |
| `AlteryxBasePluginsGui.FindReplace.FindReplace` | `FindReplaceTool` | Join |
| `AlteryxBasePluginsGui.DateTime.DateTime` | `DateTimeTool` | Parse |
| `AlteryxBasePluginsGui.RegEx.RegEx` | `RegExTool` | Parse |
| `AlteryxBasePluginsGui.TextToColumns.TextToColumns` | `TextToColumnsTool` | Parse |
| `AlteryxSpatialPluginsGui.Summarize.Summarize` | `SummarizeTool` | Transform |
| `AlteryxBasePluginsGui.CrossTab.CrossTab` | `CrossTabTool` | Transform |
| `AlteryxBasePluginsGui.Transpose.Transpose` | `TransposeTool` | Transform |
| `AlteryxGuiToolkit.ToolContainer.ToolContainer` | `PassthroughTool` | Documentation |
| `AlteryxGuiToolkit.TextBox.TextBox` | `NullTool` | Documentation |
---
## 5. Data Model
### 5.1 Alteryx → Python Type Mapping
| Alteryx Type | Polars dtype | DuckDB type | Notes |
|---|---|---|---|
| `Bool` | `pl.Boolean` | `BOOLEAN` | |
| `Byte` | `pl.UInt8` | `TINYINT UNSIGNED` | 0255 |
| `Int16` | `pl.Int16` | `SMALLINT` | |
| `Int32` | `pl.Int32` | `INTEGER` | |
| `Int64` | `pl.Int64` | `BIGINT` | |
| `Float` | `pl.Float32` | `FLOAT` | |
| `Double` | `pl.Float64` | `DOUBLE` | |
| `FixedDecimal` | `pl.Decimal(p,s)` | `DECIMAL(p,s)` | size="19.2" → DECIMAL(19,2) |
| `String` | `pl.String` | `VARCHAR` | fixed max length |
| `V_String` | `pl.String` | `VARCHAR` | variable length |
| `WString` | `pl.String` | `VARCHAR` | wide string |
| `V_WString` | `pl.String` | `VARCHAR` | variable wide string |
| `Date` | `pl.Date` | `DATE` | |
| `Time` | `pl.Time` | `TIME` | |
| `DateTime` | `pl.Datetime` | `TIMESTAMP` | |
| `SpatialObj` | `pl.String` | `VARCHAR` | GeoJSON string; skip in non-spatial mode |
| `Blob` | `pl.Binary` | `BLOB` | |
### 5.2 Internal Frame Type
All frames are passed between tools as `polars.DataFrame`. DuckDB is used for:
- Reading/writing external files (CSV, Parquet, XLSX via `duckdb.read_csv`, `read_parquet`, etc.)
- Complex SQL generation (joins, aggregates, window functions)
- Expression evaluation where transpilation to DuckDB SQL is simpler than Polars
Conversion helpers:
```python
import polars as pl
import duckdb
def polars_to_duckdb(df: pl.DataFrame, name: str, con: duckdb.DuckDBPyConnection):
con.register(name, df.to_arrow())
def duckdb_to_polars(con: duckdb.DuckDBPyConnection, sql: str) -> pl.DataFrame:
return pl.from_arrow(con.execute(sql).arrow())
```
---
## 6. Core Engine
### 6.1 Parser (`engine/parser.py`)
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET
@dataclass
class FieldDef:
name: str
type: str
size: Optional[int] = None
source: Optional[str] = None
@dataclass
class NodeDef:
tool_id: int
plugin: str
config: ET.Element # raw <Configuration> element
output_schema: List[FieldDef] = field(default_factory=list)
position: tuple = (0, 0)
@dataclass
class ConnectionDef:
origin_id: int
origin_anchor: str # e.g. "Output", "True", "J"
dest_id: int
dest_anchor: str # e.g. "Input", "Left", "#1"
name: Optional[str] = None
wireless: bool = False
@dataclass
class WorkflowGraph:
nodes: Dict[int, NodeDef]
connections: List[ConnectionDef]
properties: ET.Element # <Properties> element
def parse_workflow(path: str) -> WorkflowGraph:
"""Parse .yxmd XML into WorkflowGraph."""
tree = ET.parse(path)
root = tree.getroot()
nodes = {}
for node_el in root.iter("Node"): # iter flattens ChildNodes
tid = int(node_el.attrib["ToolID"])
gui = node_el.find("GuiSettings")
plugin = gui.attrib.get("Plugin", "") if gui is not None else ""
config = node_el.find("Properties/Configuration")
pos_el = gui.find("Position") if gui is not None else None
pos = (int(pos_el.attrib.get("x",0)), int(pos_el.attrib.get("y",0))) if pos_el is not None else (0,0)
schema = _parse_schema(node_el)
nodes[tid] = NodeDef(tid, plugin, config, schema, pos)
connections = []
for conn_el in root.findall("Connections/Connection"):
orig = conn_el.find("Origin")
dest = conn_el.find("Destination")
connections.append(ConnectionDef(
origin_id=int(orig.attrib["ToolID"]),
origin_anchor=orig.attrib.get("Connection","Output"),
dest_id=int(dest.attrib["ToolID"]),
dest_anchor=dest.attrib.get("Connection","Input"),
name=conn_el.attrib.get("name"),
wireless=conn_el.attrib.get("Wireless","False")=="True"
))
props = root.find("Properties")
return WorkflowGraph(nodes, connections, props)
def _parse_schema(node_el: ET.Element) -> List[FieldDef]:
fields = []
for f in node_el.findall(".//MetaInfo/RecordInfo/Field"):
size_str = f.attrib.get("size")
size = int(float(size_str)) if size_str else None
fields.append(FieldDef(f.attrib["name"], f.attrib.get("type","V_String"), size))
return fields
```
### 6.2 Executor (`engine/executor.py`)
```python
from collections import defaultdict, deque
import polars as pl
from .graph import WorkflowGraph, ConnectionDef
from .context import RunContext
from tools import get_tool_class
def execute(graph: WorkflowGraph, ctx: RunContext):
"""Topological BFS execution."""
# Build adjacency
in_degree = defaultdict(int)
successors = defaultdict(list) # tool_id → [(conn, dest_id)]
predecessors = defaultdict(list) # tool_id → [(conn, origin_id)]
for c in graph.connections:
in_degree[c.dest_id] += 1
successors[c.origin_id].append(c)
predecessors[c.dest_id].append(c)
for tid in graph.nodes:
if tid not in in_degree:
in_degree[tid] = 0
# Output buffers: (tool_id, anchor) → DataFrame
outputs: dict[tuple, pl.DataFrame] = {}
queue = deque([tid for tid, deg in in_degree.items() if deg == 0])
while queue:
tid = queue.popleft()
node = graph.nodes[tid]
tool_cls = get_tool_class(node.plugin)
if tool_cls is None:
# Unsupported / documentation tool — pass through if single input
_passthrough(tid, predecessors, outputs, successors, in_degree, queue)
continue
tool = tool_cls(node, ctx)
# Gather inputs: dict of anchor → DataFrame
inputs = {}
for c in predecessors[tid]:
df = outputs.get((c.origin_id, c.origin_anchor))
if df is not None:
inputs[c.dest_anchor] = df
# Execute
result = tool.execute(inputs) # returns dict[anchor, DataFrame]
for anchor, df in result.items():
outputs[(tid, anchor)] = df
# Decrement in-degree of successors
for c in successors[tid]:
in_degree[c.dest_id] -= 1
if in_degree[c.dest_id] == 0:
queue.append(c.dest_id)
return outputs
```
### 6.3 Base Tool (`tools/base.py`)
```python
from abc import ABC, abstractmethod
from typing import Dict
import polars as pl
import xml.etree.ElementTree as ET
from engine.graph import NodeDef
from engine.context import RunContext
class BaseTool(ABC):
def __init__(self, node: NodeDef, ctx: RunContext):
self.node = node
self.ctx = ctx
self.config = node.config # ET.Element
@abstractmethod
def execute(self, inputs: Dict[str, pl.DataFrame]) -> Dict[str, pl.DataFrame]:
"""
inputs: dict of anchor_name → DataFrame
returns: dict of anchor_name → DataFrame
"""
def _cfg(self, xpath: str, default=None):
"""Helper: get text of config sub-element."""
el = self.config.find(xpath) if self.config is not None else None
return el.text if el is not None else default
def _cfg_attr(self, xpath: str, attr: str, default=None):
el = self.config.find(xpath) if self.config is not None else None
return el.attrib.get(attr, default) if el is not None else default
```
---
## 7. Tool Implementations
### 7.1 InputData (`AlteryxBasePluginsGui.DbFileInput.DbFileInput`)
**XML Configuration:**
```xml
<Configuration>
<File OutputFileName="" FileFormat="19" SearchSubDirs="False" RecordLimit="50">
path/to/file.csv
</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter>
<HeaderRow>True</HeaderRow>
<ImportLine>1</ImportLine> <!-- 1-based line to start reading -->
<CodePage>28591</CodePage>
</FormatSpecificOptions>
</Configuration>
```
**File path syntax:**
- `path/to/file.csv` — plain file
- `path/file.xlsx|||``SheetName$`` ` — Excel sheet (triple pipe separator)
- `path/file.xlsx|||``<List of Sheet Names>`` ` — returns column: `Sheet Names`
- `path/file.xlsx|||``NamedRange`` ` — named range
**FileFormat codes (partial):**
| Code | Format |
|---|---|
| 0 | CSV / delimited text |
| 19 | YXDB (Alteryx native) |
| 25 | Excel (.xlsx) |
| 6 | Fixed-width |
**Implementation strategy:**
```python
class InputDataTool(BaseTool):
def execute(self, inputs):
file_el = self.config.find("File")
raw_path = file_el.text.strip()
fmt = int(file_el.attrib.get("FileFormat","0"))
record_limit = file_el.attrib.get("RecordLimit","") or None
limit = int(record_limit) if record_limit else None
opts = self.config.find("FormatSpecificOptions") or ET.Element("x")
# Resolve relative path against workflow directory
path, sheet = self._parse_path(raw_path)
df = self._read(path, fmt, sheet, opts, limit)
return {"Output": df}
def _parse_path(self, raw):
if "|||" in raw:
path, sheet = raw.split("|||", 1)
return path.strip(), sheet.strip().strip("`")
return raw.strip(), None
def _read(self, path, fmt, sheet, opts, limit):
import duckdb, polars as pl
con = self.ctx.duckdb_con
if fmt == 0: # CSV
delim = opts.findtext("Delimeter", ",")
header = opts.findtext("HeaderRow","True") == "True"
skip = max(0, int(opts.findtext("ImportLine","1") or 1) - 1)
df = pl.read_csv(path, separator=delim, has_header=header, skip_rows=skip)
elif fmt == 25: # Excel
df = pl.read_excel(path, sheet_name=sheet)
elif fmt == 19: # YXDB — use DuckDB + yxdb reader or fallback
raise NotImplementedError("YXDB: use yxdb-py library")
else:
df = pl.read_csv(path) # best-effort fallback
if limit:
df = df.head(limit)
return df
```
**YXDB support:** Use the `yxdb` Python package (`pip install yxdb`) which reads Alteryx's native binary format.
---
### 7.2 TextInput (`AlteryxBasePluginsGui.TextInput.TextInput`)
**XML Configuration:**
```xml
<Configuration>
<NumRows value="6" />
<Fields>
<Field name="CustomerID" />
<Field name="FirstName" />
</Fields>
<Data>
<r>
<c>49</c>
<c>Thomas</c>
</r>
<r>
<c>456</c>
<c></c> <!-- empty → null -->
</r>
</Data>
</Configuration>
```
**Implementation:**
```python
class TextInputTool(BaseTool):
def execute(self, inputs):
fields = [f.attrib["name"] for f in self.config.findall("Fields/Field")]
rows = []
for r in self.config.findall("Data/r"):
cells = r.findall("c")
row = {}
for i, f in enumerate(fields):
el = cells[i] if i < len(cells) else None
text = el.text if el is not None else None
row[f] = text if text != "" else None
rows.append(row)
df = pl.DataFrame(rows, schema={f: pl.String for f in fields})
return {"Output": df}
```
Note: All columns are `String` unless downstream Select/Formula coerces them.
---
### 7.3 OutputData (`AlteryxBasePluginsGui.DbFileOutput.DbFileOutput`)
**XML Configuration:**
```xml
<Configuration>
<File MaxRecords="" FileFormat="0">%temp%output.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter>
<HeaderRow>True</HeaderRow>
<LineEndStyle>CRLF</LineEndStyle>
<CodePage>28591</CodePage>
</FormatSpecificOptions>
<MultiFile value="False" />
<!-- Multi-file split by field value: -->
<MultiFile value="True" />
<MultiFileType>Suffix</MultiFileType> <!-- Suffix | Prefix -->
<MultiFileField>Region</MultiFileField>
<KeepField value="True" />
</Configuration>
```
**Implementation notes:**
- Resolve `%temp%` to `tempfile.gettempdir()`.
- If `MultiFile value="True"`: partition by `MultiFileField`, write one file per unique value, append/prepend value to filename.
- If `MaxRecords` non-empty: split into chunks of that size, auto-number extra files with `_1`, `_2` suffix.
---
### 7.4 Browse (`AlteryxBasePluginsGui.BrowseV2.BrowseV2`)
**Behaviour:** Passthrough — output equals input. In the runner, log a preview (first N rows) to stdout or a callback.
```python
class BrowseTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
if self.ctx.verbose:
print(f"[Browse ToolID={self.node.tool_id}]\n{df.head(20)}")
return {"Output": df}
```
---
### 7.5 Filter (`AlteryxBasePluginsGui.Filter.Filter`)
**XML Configuration — two modes:**
**Mode: Simple (Basic filter)**
```xml
<Configuration>
<Expression>[CustomerID] &gt; 30</Expression>
<Mode>Simple</Mode>
<Simple>
<Operator>&gt;</Operator>
<Field>CustomerID</Field>
<Operands>
<Operand>30</Operand>
<DateType>fixed</DateType> <!-- fixed | today | tomorrow | yesterday -->
<IgnoreTimeInDateTime>True</IgnoreTimeInDateTime>
<PeriodType>days</PeriodType> <!-- for period operators -->
<PeriodCount>2</PeriodCount>
</Operands>
</Simple>
</Configuration>
```
**Mode: Custom (expression)**
```xml
<Configuration>
<Expression><![CDATA[[Region]=="South" OR REGEX_Match(UPPERCASE([Region]), ".*WEST")]]></Expression>
<Mode>Custom</Mode>
</Configuration>
```
**Anchors:** `Input``True` (rows matching), `False` (rows not matching).
**Operator mapping (Simple mode):**
| Alteryx Operator | Polars equivalent |
|---|---|
| `=` / `Equals` | `col == value` |
| `!=` / `Does not equal` | `col != value` |
| `>` | `col > value` |
| `>=` | `col >= value` |
| `<` | `col < value` |
| `<=` | `col <= value` |
| `IsNull` | `col.is_null()` |
| `IsNotNull` | `col.is_not_null()` |
| `Contains` | `col.str.contains(value)` |
| `Does not contain` | `~col.str.contains(value)` |
| `Is empty` | `col.is_null() | col == ""` |
| `Is not empty` | `col.is_not_null() & (col != "")` |
| `Comes before (<)` | `col < value` (string compare) |
| `Comes after (>)` | `col > value` (string compare) |
| `Is true` | `col == True` |
| `Is false` | `col == False` |
| `Range` | `(col >= start) & (col <= end)` |
| `PeriodAfter` | date window around anchor date |
| `PeriodBefore` | date window before anchor date |
**Date dynamic values:**
- `today``datetime.date.today()`
- `tomorrow``today + timedelta(1)`
- `yesterday``today - timedelta(1)`
**Custom mode:** transpile the `<Expression>` text via `expression/transpiler.py` (see §8).
**Implementation:**
```python
class FilterTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
mode = self._cfg("Mode", "Custom")
if mode == "Simple":
mask = self._build_simple_mask(df)
else:
expr_text = self._cfg("Expression","True")
mask = self.ctx.transpiler.eval_mask(df, expr_text)
return {
"True": df.filter(mask),
"False": df.filter(~mask),
}
```
---
### 7.6 Formula (`AlteryxBasePluginsGui.Formula.Formula`)
**XML Configuration:**
```xml
<Configuration>
<FormulaFields>
<FormulaField
expression="IF [Latitude] &gt; 39.7 THEN &quot;North&quot; ELSE &quot;South&quot; ENDIF"
field="Region"
size="1073741823"
type="V_WString" />
<FormulaField
expression="ROUND([Spend]/[Visits],1)"
field="AverageSpendPerVisit"
size="19.2"
type="FixedDecimal" />
</FormulaFields>
</Configuration>
```
**Behaviour:**
- Each `<FormulaField>` is applied sequentially; later expressions can reference columns created by earlier ones.
- If `field` matches an existing column: update in place.
- If `field` is new: append as new column.
- Expression HTML entities: `&gt;``>`, `&lt;``<`, `&quot;``"`, `&#xA;` → newline.
**Implementation:**
```python
class FormulaTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
for ff in self.config.findall("FormulaFields/FormulaField"):
expr_text = ff.attrib["expression"]
field = ff.attrib["field"]
dtype = self.ctx.type_mapper.map(ff.attrib.get("type","V_WString"),
ff.attrib.get("size"))
series = self.ctx.transpiler.eval_series(df, expr_text, field, dtype)
if field in df.columns:
df = df.with_columns(series.alias(field))
else:
df = df.with_columns(series.alias(field))
return {"Output": df}
```
---
### 7.7 Select (`AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect`)
**XML Configuration:**
```xml
<Configuration>
<OrderChanged value="True" />
<CommaDecimal value="False" />
<SelectFields>
<SelectField field="CustomerID" selected="True" rename="Customer ID" description="..." />
<SelectField field="Visits" selected="True" type="Int16" size="2" />
<SelectField field="*Unknown" selected="True" /> <!-- pass-through unknown cols -->
</SelectFields>
</Configuration>
```
**Behaviour:**
- `selected="False"`: drop the column.
- `rename` attribute: rename column.
- `type` + `size` attribute on SelectField: cast column to new type.
- `*Unknown` row: if `selected="True"`, pass through any incoming columns not explicitly listed; if `selected="False"`, drop them.
- `OrderChanged="True"`: output columns in the order listed in `<SelectFields>`.
**Implementation:**
```python
class SelectTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
fields = self.config.findall("SelectFields/SelectField")
keep_unknown = False
explicit = {} # original_name → (selected, rename, dtype)
for sf in fields:
name = sf.attrib["field"]
if name == "*Unknown":
keep_unknown = sf.attrib.get("selected","True") == "True"
continue
explicit[name] = {
"selected": sf.attrib.get("selected","True") == "True",
"rename": sf.attrib.get("rename", name),
"type": sf.attrib.get("type"),
"size": sf.attrib.get("size"),
}
order_changed = self.config.attrib.get("OrderChanged","False") == "True"
# Build column list in config order if OrderChanged, else original order
result_cols = []
processed = set()
for sf in fields:
name = sf.attrib["field"]
if name == "*Unknown": continue
if name not in df.columns: continue
meta = explicit[name]
if not meta["selected"]: continue
processed.add(name)
col = pl.col(name)
if meta["type"]:
col = col.cast(self.ctx.type_mapper.map(meta["type"], meta["size"]))
result_cols.append(col.alias(meta["rename"]))
if keep_unknown:
for c in df.columns:
if c not in processed:
result_cols.append(pl.col(c))
if not order_changed:
# respect incoming column order
result_cols = sorted(result_cols, key=lambda e: df.columns.index(e._pyexpr.root_names()[0]) if hasattr(e, '_pyexpr') else 999)
return {"Output": df.select(result_cols)}
```
---
### 7.8 Sort (`AlteryxBasePluginsGui.Sort.Sort`)
**XML Configuration:**
```xml
<Configuration>
<SortInfo locale="0">
<Field field="LastName" order="Ascending" />
<Field field="FirstName" order="Ascending" />
</SortInfo>
</Configuration>
```
**Attributes:**
- `order`: `"Ascending"` | `"Descending"`
- `locale`: `0` = default byte-order sort; `1033` = dictionary order (natural number sort for strings, English locale)
**Implementation:**
```python
class SortTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
sort_fields = self.config.findall("SortInfo/Field")
by = [f.attrib["field"] for f in sort_fields]
descending = [f.attrib.get("order","Ascending") == "Descending" for f in sort_fields]
locale = self.config.findtext("SortInfo[@locale]") or "0"
# locale 1033 = natural sort (numeric strings as numbers)
# Polars: use sort with maintain_order=True for stability
df = df.sort(by=by, descending=descending, maintain_order=True)
return {"Output": df}
```
Note: For `locale="1033"` (dictionary/natural order on string columns containing numbers), apply a Schwartzian transform: extract numeric prefix for sort, then restore original values.
---
### 7.9 Sample (`AlteryxBasePluginsGui.Sample.Sample`)
**XML Configuration:**
```xml
<Configuration>
<Mode>First</Mode> <!-- First | Last | Sample | Random | NPercent -->
<N>3</N>
<GroupFields orderChanged="False">
<Field name="City" />
</GroupFields>
</Configuration>
```
**Modes:**
| Mode | Behaviour |
|---|---|
| `First` | Return first N rows (per group if GroupFields present) |
| `Last` | Return last N rows |
| `Sample` | Return 1 of every N rows (deterministic, not random) |
| `Random` | Each row has 1-in-N chance of inclusion (non-deterministic) |
| `NPercent` | Return first N% of rows |
**Implementation:**
```python
class SampleTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
mode = self._cfg("Mode","First")
n = int(self._cfg("N","1"))
group_fields = [f.attrib["name"] for f in self.config.findall("GroupFields/Field")]
if not group_fields:
return {"Output": self._sample_flat(df, mode, n)}
# Grouped sampling
parts = []
for _, group_df in df.group_by(group_fields, maintain_order=True):
parts.append(self._sample_flat(group_df, mode, n))
return {"Output": pl.concat(parts)}
def _sample_flat(self, df, mode, n):
if mode == "First":
return df.head(n)
elif mode == "Last":
return df.tail(n)
elif mode == "Sample":
indices = range(0, len(df), n)
return df[list(indices)]
elif mode == "Random":
mask = [random.random() < (1/n) for _ in range(len(df))]
return df.filter(mask)
elif mode == "NPercent":
count = max(1, int(len(df) * n / 100))
return df.head(count)
return df
```
---
### 7.10 Unique (`AlteryxBasePluginsGui.Unique.Unique`)
**XML Configuration:**
```xml
<Configuration>
<UniqueFields>
<Field name="LastName" />
<Field name="FirstName" />
</UniqueFields>
</Configuration>
```
**Behaviour:**
- Group by the listed fields.
- First occurrence of each group → `Unique` anchor output.
- All subsequent duplicates → `Duplicate` anchor output.
- Column order preserved.
**Anchors:** `Input``Unique`, `Duplicate`
**Implementation:**
```python
class UniqueTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
key_fields = [f.attrib["name"] for f in self.config.findall("UniqueFields/Field")]
df = df.with_row_index("_row_idx")
first_idx = df.group_by(key_fields, maintain_order=True).agg(pl.col("_row_idx").first())["_row_idx"]
unique_df = df.filter(pl.col("_row_idx").is_in(first_idx)).drop("_row_idx")
dup_df = df.filter(~pl.col("_row_idx").is_in(first_idx)).drop("_row_idx")
return {"Unique": unique_df, "Duplicate": dup_df}
```
---
### 7.11 Join (`AlteryxBasePluginsGui.Join.Join`)
**XML Configuration:**
```xml
<Configuration>
<JoinByRecordPos value="False" />
<!-- OR: join by field -->
<SelectConfiguration>
<Configuration outputConnection="J">
<SelectFields>
<!-- fields to include in J output -->
<SelectField field="L_CustomerID" selected="True" />
<SelectField field="R_OrderDate" selected="True" />
</SelectFields>
</Configuration>
<Configuration outputConnection="L"> ... </Configuration>
<Configuration outputConnection="R"> ... </Configuration>
</SelectConfiguration>
</Configuration>
```
The join key pairs are stored as child elements:
```xml
<JoinFields>
<JoinField Left="CustomerID" Right="CustomerID" />
</JoinFields>
```
**Anchors:**
- Inputs: `Left`, `Right`
- Outputs: `J` (inner join), `L` (left unmatched), `R` (right unmatched)
**Column naming on J output:**
- If both inputs share a column name (other than join key): prefix with `L_` and `R_`.
- Join key columns: appear once (from Left input by default).
**Implementation using DuckDB:**
```python
class JoinTool(BaseTool):
def execute(self, inputs):
left = inputs.get("Left", pl.DataFrame())
right = inputs.get("Right", pl.DataFrame())
by_pos = self._cfg_attr("JoinByRecordPos","value","False") == "True"
con = self.ctx.duckdb_con
con.register("_left", left.to_arrow())
con.register("_right", right.to_arrow())
if by_pos:
left = left.with_row_index("_pos")
right = right.with_row_index("_pos")
# join on _pos
join_keys = [("_pos","_pos")]
else:
join_keys = [(jf.attrib["Left"], jf.attrib["Right"])
for jf in self.config.findall("JoinFields/JoinField")]
j_df, l_df, r_df = self._execute_join(left, right, join_keys, con)
return {"J": j_df, "L": l_df, "R": r_df}
def _execute_join(self, left, right, join_keys, con):
# Build column lists with disambiguation
l_cols = left.columns
r_cols = right.columns
key_l = [k[0] for k in join_keys]
key_r = [k[1] for k in join_keys]
# ... build SQL for inner, left anti, right anti
# Inner join (J):
on_clause = " AND ".join(f"l.{k[0]} = r.{k[1]}" for k in join_keys)
j_sql = f"SELECT * FROM _left l INNER JOIN _right r ON {on_clause}"
l_sql = f"SELECT l.* FROM _left l LEFT ANTI JOIN _right r ON {on_clause}"
r_sql = f"SELECT r.* FROM _right r LEFT ANTI JOIN _left l ON {on_clause}"
j_df = duckdb_to_polars(con, j_sql)
l_df = duckdb_to_polars(con, l_sql)
r_df = duckdb_to_polars(con, r_sql)
return j_df, l_df, r_df
```
---
### 7.12 Union (`AlteryxBasePluginsGui.Union.Union`)
**XML Configuration:**
```xml
<Configuration>
<Mode>Auto</Mode> <!-- Auto | ByName | ByPosition -->
</Configuration>
```
**Modes:**
- `Auto` / `ByName`: match columns by name; fill missing columns with NULL.
- `ByPosition`: stack columns positionally regardless of name; use first input's column names.
**Anchors:** Multiple `Input1`, `Input2`, ... `InputN`; single `Output`.
**Implementation:**
```python
class UnionTool(BaseTool):
def execute(self, inputs):
mode = self._cfg("Mode","Auto")
# Sort input anchors: Input, Input1, Input2... or by numeric suffix
sorted_keys = sorted(inputs.keys(), key=lambda k: int(k.replace("Input","") or 0))
dfs = [inputs[k] for k in sorted_keys]
if mode == "ByPosition":
# Rename all to first df's columns
names = dfs[0].columns
renamed = [df.rename(dict(zip(df.columns, names))) for df in dfs]
result = pl.concat(renamed, how="diagonal")
else: # ByName / Auto
result = pl.concat(dfs, how="diagonal_relaxed")
return {"Output": result}
```
---
### 7.13 Summarize (`AlteryxSpatialPluginsGui.Summarize.Summarize`)
**XML Configuration:**
```xml
<Configuration>
<SummarizeFields>
<SummarizeField field="Region" action="GroupBy" rename="Region" />
<SummarizeField field="Spend" action="Sum" rename="Total_Spend" />
<SummarizeField field="CustomerID" action="Count" rename="Num_Customers" />
<SummarizeField field="Score" action="Avg" rename="Avg_Score" />
<SummarizeField field="City" action="Concatenate" rename="Cities"
separator=", " order="Ascending" />
<SummarizeField field="Visits" action="Percentile" rename="P50"
percentile="50" />
</SummarizeFields>
</Configuration>
```
**Action mapping:**
| Alteryx Action | DuckDB / Polars equivalent |
|---|---|
| `GroupBy` | `group_by` key |
| `Sum` | `pl.col(f).sum()` |
| `Count` | `pl.col(f).count()` |
| `Count Non Null` | `pl.col(f).drop_nulls().count()` |
| `Count Distinct` | `pl.col(f).n_unique()` |
| `Count Distinct Non Null` | `pl.col(f).drop_nulls().n_unique()` |
| `Count Null` | `pl.col(f).is_null().sum()` |
| `Min` | `pl.col(f).min()` |
| `Max` | `pl.col(f).max()` |
| `Avg` | `pl.col(f).mean()` |
| `Median` | `pl.col(f).median()` |
| `Std Deviation` | `pl.col(f).std()` |
| `Variance` | `pl.col(f).var()` |
| `Percentile` | `pl.col(f).quantile(p/100)` |
| `Concatenate` | `pl.col(f).sort().cast(str).str.join(sep)` |
| `First` | `pl.col(f).first()` |
| `Last` | `pl.col(f).last()` |
| `Mode` | custom UDF |
**Implementation:**
```python
class SummarizeTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
fields = self.config.findall("SummarizeFields/SummarizeField")
group_fields = [f.attrib["field"] for f in fields if f.attrib.get("action")=="GroupBy"]
agg_exprs = []
for f in fields:
action = f.attrib.get("action","GroupBy")
field = f.attrib["field"]
rename = f.attrib.get("rename", field)
if action == "GroupBy": continue
expr = self._build_agg(field, action, f.attrib)
agg_exprs.append(expr.alias(rename))
if group_fields:
result = df.group_by(group_fields, maintain_order=True).agg(agg_exprs)
else:
result = df.select(agg_exprs)
return {"Output": result}
```
---
### 7.14 JoinMultiple (`AlteryxBasePluginsGui.JoinMultiple.JoinMultiple`)
**XML Configuration:**
```xml
<Configuration>
<JoinByRecPos value="True" />
<OutputJoinOnly value="False" />
<SelectConfiguration>
<Configuration outputConnection="Output">
<SelectFields>
<SelectField field="Input_#1_CustomerID" selected="True" />
<SelectField field="Input_#2_JoinDate" selected="True" />
<SelectField field="*Unknown" selected="True" />
</SelectFields>
</Configuration>
</SelectConfiguration>
</Configuration>
```
**Behaviour:**
- Joins N inputs simultaneously (by position or by field).
- Column names from each input are prefixed with `Input_#N_` to disambiguate.
- Single `Output` anchor.
- `OutputJoinOnly="False"`: include all rows (outer join on all inputs).
---
### 7.15 AppendFields (`AlteryxBasePluginsGui.AppendFields.AppendFields`)
**Behaviour:** Cartesian product of a small "Source" input appended to every row of a large "Target" input. One-to-many: every Target row gets one or more Source rows' fields added.
**Anchors:** `Target` (large), `Source` (small) → `Output`
**Implementation:** Cross join if source is small; otherwise warn.
---
### 7.16 CrossTab (`AlteryxBasePluginsGui.CrossTab.CrossTab`)
**XML Configuration:**
```xml
<Configuration>
<GroupFields>
<Field name="Region" />
</GroupFields>
<HeaderField field="Year" />
<DataField field="Sales" method="Sum" />
</Configuration>
```
**Behaviour:** Pivot table: rows = GroupBy values, columns = unique values of `HeaderField`, cell = aggregate of `DataField`.
**Implementation using DuckDB PIVOT:**
```python
class CrossTabTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
con = self.ctx.duckdb_con
con.register("_ct_input", df.to_arrow())
header = self._cfg("HeaderField/@field") or self.config.find("HeaderField").attrib["field"]
data = self.config.find("DataField")
data_field = data.attrib["field"]
method = data.attrib.get("method","Sum").upper()
groups = [f.attrib["name"] for f in self.config.findall("GroupFields/Field")]
group_clause = ", ".join(groups)
# DuckDB PIVOT
sql = f"""
PIVOT _ct_input ON "{header}"
USING {method}("{data_field}")
GROUP BY {group_clause}
"""
result = duckdb_to_polars(con, sql)
return {"Output": result}
```
---
### 7.17 Transpose (`AlteryxBasePluginsGui.Transpose.Transpose`)
**XML Configuration:**
```xml
<Configuration>
<KeyFields>
<Field name="CustomerID" />
</KeyFields>
<DataFields>
<Field name="Visits" />
<Field name="Spend" />
</DataFields>
</Configuration>
```
**Behaviour:** Un-pivot (melt). Key fields remain; each Data field becomes a row with `Name` (field name) and `Value` columns.
**Implementation:**
```python
class TransposeTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
keys = [f.attrib["name"] for f in self.config.findall("KeyFields/Field")]
data_cols = [f.attrib["name"] for f in self.config.findall("DataFields/Field")]
result = df.melt(id_vars=keys, value_vars=data_cols,
variable_name="Name", value_name="Value")
return {"Output": result}
```
---
### 7.18 DateTime Parse (`AlteryxBasePluginsGui.DateTime.DateTime`)
**XML Configuration:**
```xml
<Configuration>
<Format>%d/%m/%Y</Format>
<InputField>DateString</InputField>
<OutputField>ParsedDate</OutputField>
<InputType>String</InputType>
<OutputType>Date</OutputType>
</Configuration>
```
**Implementation:** `pl.col(input).str.to_date(format)` or `str.to_datetime(format)`.
---
### 7.19 RegEx (`AlteryxBasePluginsGui.RegEx.RegEx`)
**XML Configuration:**
```xml
<Configuration>
<Field>EmailAddress</Field>
<Expression>(\w+)@(\w+)\.(\w+)</Expression>
<Replace>
<ReplaceString>$1_at_$2</ReplaceString>
</Replace>
<Method>Match</Method> <!-- Match | Replace | Parse | Token -->
<OutputField>Matched</OutputField>
<FullMatch value="True" />
<CaseInsensitive value="False" />
</Configuration>
```
**Methods:**
- `Match`: boolean column; True if pattern matches.
- `Replace`: replace with `ReplaceString` (use `$1`, `$2` for capture groups → `\1`, `\2` in Python).
- `Parse`: extract capture groups into new columns (one per group).
- `Token`: split by pattern, return one row per token.
---
### 7.20 TextToColumns (`AlteryxBasePluginsGui.TextToColumns.TextToColumns`)
**XML Configuration:**
```xml
<Configuration>
<Field>AddressField</Field>
<Delimiter>,</Delimiter>
<NumCols value="3" />
<RootName>Address_</RootName>
<SplitToRows value="False" />
</Configuration>
```
**Behaviour:**
- `SplitToRows="False"`: split into `NumCols` new columns named `{RootName}1`, `{RootName}2`, etc.
- `SplitToRows="True"`: one row per token.
---
### 7.21 GenerateRows (`AlteryxBasePluginsGui.GenerateRows.GenerateRows`)
**XML Configuration:**
```xml
<Configuration>
<UpdateField value="False" /> <!-- True = update existing field, False = create -->
<CreateField_Name>JoinDate</CreateField_Name>
<CreateField_Type>Date</CreateField_Type>
<Expression_Init>DateTimeFormat(DateTimeNow(),"%Y-%m-%d")</Expression_Init>
<Expression_Cond>[JoinDate] &lt;= DateTimeAdd("2025-01-01", 5, "Days")</Expression_Cond>
<Expression_Loop>DateTimeAdd([JoinDate], 1, "days")</Expression_Loop>
<RecordCount value="0" /> <!-- 0 = unlimited (use condition) -->
</Configuration>
```
**Behaviour:** Generate rows by:
1. Initialise the target field using `Expression_Init`.
2. While `Expression_Cond` is true:
a. Emit current row.
b. Update field with `Expression_Loop`.
**Implementation note:** Because GenerateRows typically has no input (or one input row acts as a seed), implement as an iterative Python loop evaluating the transpiled expressions.
---
### 7.22 MultiRowFormula (`AlteryxBasePluginsGui.MultiRowFormula.MultiRowFormula`)
**XML Configuration:**
```xml
<Configuration>
<FormulaField field="RunningTotal" expression="[Row-1:RunningTotal] + [Visits]"
type="Double" size="8" />
<NumRows value="1" />
<GroupByFields>
<Field name="Region" />
</GroupByFields>
</Configuration>
```
**Expression syntax:**
- `[Row-1:FieldName]` → previous row's value of `FieldName` (null at boundary).
- `[Row+1:FieldName]` → next row's value.
- `[Row-N:FieldName]` → N rows back.
**Implementation:** Use Polars `shift()` for lookback/lookahead, or DuckDB window functions (`LAG`, `LEAD`).
---
### 7.23 RecordID (`AlteryxBasePluginsGui.RecordID.RecordID`)
**XML Configuration:**
```xml
<Configuration>
<Field>RecordID</Field>
<StartValue>1</StartValue>
<FieldType>Int32</FieldType>
</Configuration>
```
**Implementation:** `df.with_row_index(name=field, offset=start_value - 1)` then cast to int type.
---
## 8. Expression Transpiler
The `expression/transpiler.py` module converts Alteryx expression syntax into either:
- A Polars expression (`pl.Expr`) for simple column-level operations.
- A DuckDB SQL fragment for complex/multi-step cases.
### 8.1 Syntax Differences
| Alteryx | Python/Polars/DuckDB |
|---|---|
| `[ColumnName]` | `pl.col("ColumnName")` / `"ColumnName"` |
| `"string literal"` | `'string literal'` |
| `IF ... THEN ... ELSEIF ... ELSE ... ENDIF` | `pl.when(...).then(...).when(...).then(...).otherwise(...)` / `CASE WHEN` |
| `IIF(cond, true_val, false_val)` | `pl.when(cond).then(true_val).otherwise(false_val)` |
| `IsNull([Field])` | `pl.col("Field").is_null()` |
| `!IsNull([Field])` | `pl.col("Field").is_not_null()` |
| `NULL()` | `None` / `pl.lit(None)` |
| `AND` / `OR` / `NOT` | `&` / `\|` / `~` (Polars) or `AND`/`OR`/`NOT` (SQL) |
| `==` (equality) | `==` |
| `!=` | `!=` |
| Row reference `[Row-1:Field]` | `pl.col("Field").shift(1)` |
### 8.2 Built-in Function Mapping
**String functions:**
| Alteryx | DuckDB SQL / Polars |
|---|---|
| `Uppercase([F])` | `UPPER(F)` / `pl.col(F).str.to_uppercase()` |
| `Lowercase([F])` | `LOWER(F)` / `pl.col(F).str.to_lowercase()` |
| `Titlecase([F])` | custom: capitalise each word |
| `Trim([F])` | `TRIM(F)` |
| `LTrim([F])` | `LTRIM(F)` |
| `RTrim([F])` | `RTRIM(F)` |
| `Length([F])` | `LENGTH(F)` / `pl.col(F).str.len_chars()` |
| `Left([F],n)` | `LEFT(F,n)` / `pl.col(F).str.slice(0,n)` |
| `Right([F],n)` | `RIGHT(F,n)` |
| `Substring([F],start,len)` | `SUBSTR(F,start,len)` (1-indexed) |
| `FindString([F],pattern)` | `INSTR(F,pattern)` (1-indexed, 0=not found) |
| `ReplaceChar([F],old,new)` | `REPLACE(F,old,new)` |
| `StringToDate([F],fmt)` | `STRPTIME(F,fmt)` |
| `ToString([num],decimals)` | `PRINTF('%.Nf', num)` |
| `Contains([F],str)` | `CONTAINS(F,str)` |
| `StartsWith([F],str)` | `STARTS_WITH(F,str)` |
| `EndsWith([F],str)` | `ENDS_WITH(F,str)` |
| `REGEX_Match([F],pat)` | `REGEXP_MATCHES(F,pat)` |
| `REGEX_Replace([F],pat,repl)` | `REGEXP_REPLACE(F,pat,repl)` |
| `REGEX_CountMatches([F],pat)` | custom UDF |
**Math functions:**
| Alteryx | DuckDB / Polars |
|---|---|
| `ABS([F])` | `ABS(F)` |
| `CEIL([F])` | `CEIL(F)` |
| `FLOOR([F])` | `FLOOR(F)` |
| `ROUND([F],n)` | `ROUND(F,n)` |
| `SQRT([F])` | `SQRT(F)` |
| `POW([F],n)` | `POWER(F,n)` |
| `LOG([F])` | `LN(F)` |
| `LOG10([F])` | `LOG10(F)` |
| `MOD([F],n)` | `F % n` |
| `MIN([A],[B])` | `LEAST(A,B)` |
| `MAX([A],[B])` | `GREATEST(A,B)` |
| `RandInt(n)` | `FLOOR(RANDOM()*n)` |
**Date/Time functions:**
| Alteryx | DuckDB / Polars |
|---|---|
| `DateTimeNow()` | `NOW()` / `datetime.datetime.now()` |
| `DateTimeToday()` | `CURRENT_DATE` |
| `DateTimeAdd([D],n,"days")` | `D + INTERVAL n DAY` |
| `DateTimeDiff([D1],[D2],"days")` | `DATEDIFF('day', D2, D1)` |
| `DateTimeFormat([D],fmt)` | `STRFTIME(D, fmt)` |
| `ToDate([D])` | `CAST(D AS DATE)` |
| `DateTimeYear([D])` | `YEAR(D)` |
| `DateTimeMonth([D])` | `MONTH(D)` |
| `DateTimeDay([D])` | `DAY(D)` |
| `DateTimeHour([D])` | `HOUR(D)` |
**Conditional:**
| Alteryx | DuckDB |
|---|---|
| `IF [F] > 0 THEN "pos" ELSE "neg" ENDIF` | `CASE WHEN F > 0 THEN 'pos' ELSE 'neg' END` |
| `IIF([F]>0, "pos", "neg")` | `IF(F>0, 'pos', 'neg')` |
| `Switch([F], default, v1, r1, v2, r2)` | `CASE F WHEN v1 THEN r1 WHEN v2 THEN r2 ELSE default END` |
### 8.3 Transpiler Implementation Strategy
Use a recursive descent parser or a grammar-based approach (e.g. `lark-parser` or `pyparsing`).
**Recommended approach:**
1. Tokenise the expression (handle `[column]`, string literals, function calls, operators, keywords).
2. Build an AST.
3. Walk the AST and emit DuckDB SQL or Polars expressions.
4. For `eval_mask(df, expr)`: register df as a DuckDB view, run `SELECT *, ({sql_expr}) AS _mask FROM df`, return the boolean series.
5. For `eval_series(df, expr, field, dtype)`: run `SELECT ({sql_expr}) AS {field} FROM df`.
**Fallback:** If an expression cannot be transpiled, raise `UnsupportedExpressionError` with the original expression text.
---
## 9. RunContext
```python
import duckdb
import tempfile
from pathlib import Path
class RunContext:
def __init__(self, workflow_dir: str, verbose: bool = False):
self.workflow_dir = Path(workflow_dir)
self.verbose = verbose
self.duckdb_con = duckdb.connect(":memory:")
self.temp_dir = Path(tempfile.mkdtemp())
self.transpiler = ExpressionTranspiler(self.duckdb_con)
self.type_mapper = TypeMapper()
self.constants: dict = {} # workflow-level constants from <Properties>
def resolve_path(self, path: str) -> Path:
"""Resolve relative paths against workflow directory, expand %temp%."""
path = path.replace("%temp%", str(self.temp_dir) + "/")
p = Path(path)
if not p.is_absolute():
p = self.workflow_dir / p
return p
```
---
## 10. CLI
```
python -m alteryx_runner run <workflow.yxmd> [options]
Options:
--output-dir PATH Write output files to this directory (overrides workflow paths)
--param KEY=VALUE Set workflow constant
--verbose Print Browse tool results and execution log
--dry-run Parse and validate only; do not execute
--format {json,csv,parquet} Default output format for Browse nodes
```
**Example:**
```bash
python -m alteryx_runner run MyWorkflow.yxmd --verbose --output-dir ./results
```
---
## 11. Dependencies (`pyproject.toml`)
```toml
[project]
name = "alteryx-runner"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"polars>=0.20",
"duckdb>=0.10",
"pyarrow>=14",
"yxdb>=0.1", # Alteryx YXDB binary format
"openpyxl>=3.1", # Excel read/write
"lark>=1.1", # Expression parser grammar
"click>=8.0", # CLI
]
[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "ruff", "mypy"]
```
---
## 12. Test Strategy
Each tool should have:
1. A unit test loading the corresponding `.yxmd` fixture from `tests/fixtures/`.
2. Assertions on output schema (column names, dtypes).
3. Assertions on output data (row count, specific cell values).
**Priority test fixtures (from uploaded files):**
| File | Tools covered |
|---|---|
| `Filter.yxmd` | TextInput, Filter (Simple+Custom), JoinMultiple, GenerateRows, Formula |
| `Formula.yxmd` | InputData (YXDB), Formula (static/conditional/modify/multi) |
| `Select.yxmd` | InputData, Select (simple/type-cast/rename-reorder/prefix) |
| `Sort.yxmd` | TextInput, Sort (single/multi/string-numeric/dictionary) |
| `Sample.yxmd` | InputData, Sample (First/Sample/Random/NPercent/Grouped) |
| `Unique.yxmd` | InputData, Unique |
| `Input_Data.yxmd` | InputData (CSV/Excel/sheets/named-ranges/record-limit) |
| `Output_Data.yxmd` | TextInput, OutputData (CSV/YXDB/multi-file/max-records) |
| `Browse.yxmd` | InputData, Browse (passthrough, profiling) |
| `Text_Input.yxmd` | TextInput |
---
## 13. Known Limitations & Out-of-Scope
- **Spatial tools**: `SpatialObj` columns are passed through as opaque strings; spatial operations (Buffer, Distance, etc.) are not implemented.
- **In-Database tools**: The `LockIn*` plugin family (In-DB tools) requires a live database connection; not supported.
- **Macros** (`.yxmc`): The `EngineDll="Macro"` setting triggers sub-workflow execution. Implement basic macro recursion but mark as best-effort.
- **Predictive tools** (R-based `.yxmc`): Out of scope.
- **YXDB write**: Writing `.yxdb` output requires the `yxdb` library's write API; fall back to Parquet if unavailable.
- **Gallery/Server connections**: All `SavedDataConnections` and `Gallery` sources will raise `NotImplementedError`.
- **Report/Rendering tools**: `PortfolioPlugins*` family is out of scope.
- **Wireless connections**: Treated identically to wired connections.
---
## 14. Execution Example
Given `Sort.yxmd` from the uploaded fixtures:
```
TextInput (ToolID=64)
Sort: CustomerID Ascending (ToolID=26)
↓ (no downstream in this example workflow)
```
The runner will:
1. Parse XML → `WorkflowGraph` with 4 Sort nodes all sourced from TextInput node 64.
2. Execute `TextInput` → DataFrame with columns: `CustomerID`, `FirstName`, `LastName`, `Gender`, `JoinDate`, `Region`, `Score`.
3. Execute `Sort(26)``df.sort("CustomerID", descending=False)`.
4. Since no downstream connections exist (no Output/Browse), store results in the context.
---
## 15. XML Quick-Reference Card
Below is the distilled XML config shape for each implemented tool, as extracted from the uploaded `.yxmd` example files.
### Filter (Simple)
```xml
<Configuration>
<Expression>[CustomerID] &gt; 30</Expression>
<Mode>Simple</Mode>
<Simple>
<Operator>&gt;</Operator>
<Field>CustomerID</Field>
<Operands><Operand>30</Operand><DateType>fixed</DateType></Operands>
</Simple>
</Configuration>
```
### Filter (Custom)
```xml
<Configuration>
<Expression><![CDATA[[Region]=="South" OR REGEX_Match(UPPERCASE([Region]),".*WEST")]]></Expression>
<Mode>Custom</Mode>
</Configuration>
```
### Formula
```xml
<Configuration>
<FormulaFields>
<FormulaField expression="Titlecase([City])" field="City" size="256" type="String" />
<FormulaField expression="IF [Lat] &gt; 39.7 THEN &quot;N&quot; ELSE &quot;S&quot; ENDIF"
field="Region" size="1073741823" type="V_WString" />
</FormulaFields>
</Configuration>
```
### Select
```xml
<Configuration>
<OrderChanged value="True" />
<CommaDecimal value="False" />
<SelectFields>
<SelectField field="CustomerID" selected="True" rename="Cust ID" description="PK" />
<SelectField field="Visits" selected="True" type="Int16" size="2" />
<SelectField field="*Unknown" selected="False" />
</SelectFields>
</Configuration>
```
### Sort
```xml
<Configuration>
<SortInfo locale="0">
<Field field="LastName" order="Ascending" />
<Field field="FirstName" order="Descending" />
</SortInfo>
</Configuration>
```
### Sample
```xml
<Configuration>
<Mode>First</Mode>
<N>3</N>
<GroupFields orderChanged="False">
<Field name="City" />
</GroupFields>
</Configuration>
```
### Unique
```xml
<Configuration>
<UniqueFields>
<Field name="CustomerID" />
</UniqueFields>
</Configuration>
```
### InputData (CSV)
```xml
<Configuration>
<File OutputFileName="" FileFormat="0" RecordLimit="50">path/to/file.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter><HeaderRow>True</HeaderRow><ImportLine>1</ImportLine>
</FormatSpecificOptions>
</Configuration>
```
### InputData (Excel)
```xml
<Configuration>
<File FileFormat="25">path/to/file.xlsx|||`SheetName$`</File>
<FormatSpecificOptions>
<FirstRowData>False</FirstRowData><ImportLine>1</ImportLine>
</FormatSpecificOptions>
</Configuration>
```
### OutputData (CSV)
```xml
<Configuration>
<File MaxRecords="" FileFormat="0">%temp%output.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter><HeaderRow>True</HeaderRow><LineEndStyle>CRLF</LineEndStyle>
</FormatSpecificOptions>
<MultiFile value="False" />
</Configuration>
```
### TextInput
```xml
<Configuration>
<NumRows value="3" />
<Fields>
<Field name="Col1" /><Field name="Col2" />
</Fields>
<Data>
<r><c>val1</c><c>val2</c></r>
<r><c>val3</c><c /></r> <!-- empty → NULL -->
</Data>
</Configuration>
```