55 KiB
Alteryx Open-Source Workflow Runner — Technical Specification
Target audience: Agentic coding tools (opencode, pi, etc.) Stack: Python 3.11+, DuckDB, Polars, PyArrow License target: Apache 2.0
1. Project Overview
Build a Python-native runner that parses Alteryx Designer workflow XML files (.yxmd) and executes each tool node against real data using DuckDB (for SQL-heavy transforms and I/O) and Polars (for streaming/in-memory frame operations). The runner does not require Alteryx to be installed.
1.1 Design Principles
- Faithful semantics: match Alteryx output column order, naming, data types and NULL behaviour exactly.
- DAG execution: parse the
<Connections>graph, build a topological execution plan, execute nodes in dependency order. - Lazy where possible: use Polars LazyFrame / DuckDB views so materialisation happens once at sink nodes.
- Pluggable tool registry: each tool is a self-contained Python class registered by its XML
Pluginstring; new tools can be dropped in without touching core execution logic. - No GUI dependency: CLI-first; expose a Python API for programmatic use.
2. Repository Layout
alteryx_runner/
├── cli.py # Entry point: `python -m alteryx_runner run workflow.yxmd`
├── engine/
│ ├── __init__.py
│ ├── parser.py # XML → WorkflowGraph
│ ├── executor.py # Topological executor
│ ├── graph.py # WorkflowGraph, Node, Connection dataclasses
│ ├── context.py # RunContext: connection pool, temp dir, constants
│ └── type_mapper.py # Alteryx type → Polars/DuckDB dtype
├── tools/
│ ├── __init__.py # Tool registry
│ ├── base.py # BaseTool ABC
│ ├── inout/
│ │ ├── input_data.py
│ │ ├── output_data.py
│ │ ├── text_input.py
│ │ └── browse.py
│ ├── preparation/
│ │ ├── filter_tool.py
│ │ ├── formula_tool.py
│ │ ├── select_tool.py
│ │ ├── sort_tool.py
│ │ ├── sample_tool.py
│ │ ├── unique_tool.py
│ │ ├── generate_rows.py
│ │ ├── multi_row_formula.py
│ │ ├── multi_field_formula.py
│ │ ├── record_id.py
│ │ └── auto_field.py
│ ├── join/
│ │ ├── join_tool.py
│ │ ├── join_multiple.py
│ │ ├── union_tool.py
│ │ ├── append_fields.py
│ │ └── find_replace.py
│ ├── parse/
│ │ ├── datetime_tool.py
│ │ ├── regex_tool.py
│ │ └── text_to_columns.py
│ └── transform/
│ ├── summarize_tool.py
│ ├── cross_tab.py
│ └── transpose_tool.py
├── expression/
│ ├── __init__.py
│ ├── transpiler.py # Alteryx expression → DuckDB SQL / Polars expr
│ └── functions.py # Built-in function mappings
├── tests/
│ ├── fixtures/ # .yxmd files (from uploaded samples)
│ └── test_*.py
└── pyproject.toml
3. .yxmd File Format
Alteryx workflow files are UTF-8 XML. The root element is <AlteryxDocument>.
3.1 Top-Level Structure
<AlteryxDocument yxmdVer="2022.1">
<Nodes>
<Node ToolID="1"> ... </Node>
<Node ToolID="2"> ... </Node>
</Nodes>
<Connections>
<Connection>
<Origin ToolID="1" Connection="Output" />
<Destination ToolID="2" Connection="Input" />
</Connection>
</Connections>
<Properties>
<!-- workflow-level settings -->
<SortedGrouping value="True" />
<MetaInfo> ... </MetaInfo>
</Properties>
</AlteryxDocument>
3.2 Node Structure
<Node ToolID="74">
<GuiSettings Plugin="AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect">
<Position x="282" y="186" />
</GuiSettings>
<Properties>
<Configuration>
<!-- tool-specific XML -->
</Configuration>
<Annotation DisplayMode="0">
<DefaultAnnotationText>...</DefaultAnnotationText>
</Annotation>
</Properties>
<EngineSettings EngineDll="AlteryxBasePluginsEngine.dll"
EngineDllEntryPoint="AlteryxSelect" />
<MetaInfo connection="Output">
<RecordInfo>
<Field name="CustomerID" type="Int32" source="..." />
</RecordInfo>
</MetaInfo>
</Node>
Key attributes for the runner:
ToolID: integer, unique per workflow. Use as node identifier.Pluginin<GuiSettings>: maps to the tool class (see registry table §4).<Configuration>: tool-specific settings; parse per-tool.<MetaInfo>/<RecordInfo>/<Field>: output schema hint (available on input tools; not always present on transform tools).<EngineSettings>:Macroattribute signals a macro/yxmc sub-workflow; handle recursively.
3.3 Connection Structure
<Connection name="#1" Wireless="True">
<Origin ToolID="155" Connection="Output" />
<Destination ToolID="156" Connection="Input" />
</Connection>
Connectionattribute on<Origin>/<Destination>: the anchor label.- Most tools:
"Output"/"Input". - Filter:
"True"/"False". - Join:
"Left"/"Right"(inputs),"J"/"L"/"R"(outputs). - JoinMultiple:
"Input"(numbered),"Output". - Union: numbered
"Input1","Input2", etc.
- Most tools:
Wireless="True": logical connection, no visual wire; treat identically to wired.name="#N": numbered for multi-input tools; determines input slot ordering.
3.4 ChildNodes (Tool Containers)
<Node ToolID="159">
<GuiSettings Plugin="AlteryxGuiToolkit.ToolContainer.ToolContainer" />
<ChildNodes>
<Node ToolID="156"> ... </Node>
<Node ToolID="157"> ... </Node>
</ChildNodes>
</Node>
Flatten <ChildNodes> into the parent graph. The container itself is a no-op passthrough for execution.
4. Tool Registry
Map Plugin string → Python class. The runner looks up the Plugin from <GuiSettings>.
| Plugin String | Class | Category |
|---|---|---|
AlteryxBasePluginsGui.DbFileInput.DbFileInput |
InputDataTool |
In/Out |
AlteryxBasePluginsGui.DbFileOutput.DbFileOutput |
OutputDataTool |
In/Out |
AlteryxBasePluginsGui.TextInput.TextInput |
TextInputTool |
In/Out |
AlteryxBasePluginsGui.BrowseV2.BrowseV2 |
BrowseTool |
In/Out |
AlteryxBasePluginsGui.Filter.Filter |
FilterTool |
Preparation |
AlteryxBasePluginsGui.Formula.Formula |
FormulaTool |
Preparation |
AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect |
SelectTool |
Preparation |
AlteryxBasePluginsGui.Sort.Sort |
SortTool |
Preparation |
AlteryxBasePluginsGui.Sample.Sample |
SampleTool |
Preparation |
AlteryxBasePluginsGui.Unique.Unique |
UniqueTool |
Preparation |
AlteryxBasePluginsGui.GenerateRows.GenerateRows |
GenerateRowsTool |
Preparation |
AlteryxBasePluginsGui.MultiRowFormula.MultiRowFormula |
MultiRowFormulaTool |
Preparation |
AlteryxBasePluginsGui.MultiFieldFormula.MultiFieldFormula |
MultiFieldFormulaTool |
Preparation |
AlteryxBasePluginsGui.RecordID.RecordID |
RecordIDTool |
Preparation |
AlteryxBasePluginsGui.AutoField.AutoField |
AutoFieldTool |
Preparation |
AlteryxBasePluginsGui.Join.Join |
JoinTool |
Join |
AlteryxBasePluginsGui.JoinMultiple.JoinMultiple |
JoinMultipleTool |
Join |
AlteryxBasePluginsGui.Union.Union |
UnionTool |
Join |
AlteryxBasePluginsGui.AppendFields.AppendFields |
AppendFieldsTool |
Join |
AlteryxBasePluginsGui.FindReplace.FindReplace |
FindReplaceTool |
Join |
AlteryxBasePluginsGui.DateTime.DateTime |
DateTimeTool |
Parse |
AlteryxBasePluginsGui.RegEx.RegEx |
RegExTool |
Parse |
AlteryxBasePluginsGui.TextToColumns.TextToColumns |
TextToColumnsTool |
Parse |
AlteryxSpatialPluginsGui.Summarize.Summarize |
SummarizeTool |
Transform |
AlteryxBasePluginsGui.CrossTab.CrossTab |
CrossTabTool |
Transform |
AlteryxBasePluginsGui.Transpose.Transpose |
TransposeTool |
Transform |
AlteryxGuiToolkit.ToolContainer.ToolContainer |
PassthroughTool |
Documentation |
AlteryxGuiToolkit.TextBox.TextBox |
NullTool |
Documentation |
5. Data Model
5.1 Alteryx → Python Type Mapping
| Alteryx Type | Polars dtype | DuckDB type | Notes |
|---|---|---|---|
Bool |
pl.Boolean |
BOOLEAN |
|
Byte |
pl.UInt8 |
TINYINT UNSIGNED |
0–255 |
Int16 |
pl.Int16 |
SMALLINT |
|
Int32 |
pl.Int32 |
INTEGER |
|
Int64 |
pl.Int64 |
BIGINT |
|
Float |
pl.Float32 |
FLOAT |
|
Double |
pl.Float64 |
DOUBLE |
|
FixedDecimal |
pl.Decimal(p,s) |
DECIMAL(p,s) |
size="19.2" → DECIMAL(19,2) |
String |
pl.String |
VARCHAR |
fixed max length |
V_String |
pl.String |
VARCHAR |
variable length |
WString |
pl.String |
VARCHAR |
wide string |
V_WString |
pl.String |
VARCHAR |
variable wide string |
Date |
pl.Date |
DATE |
|
Time |
pl.Time |
TIME |
|
DateTime |
pl.Datetime |
TIMESTAMP |
|
SpatialObj |
pl.String |
VARCHAR |
GeoJSON string; skip in non-spatial mode |
Blob |
pl.Binary |
BLOB |
5.2 Internal Frame Type
All frames are passed between tools as polars.DataFrame. DuckDB is used for:
- Reading/writing external files (CSV, Parquet, XLSX via
duckdb.read_csv,read_parquet, etc.) - Complex SQL generation (joins, aggregates, window functions)
- Expression evaluation where transpilation to DuckDB SQL is simpler than Polars
Conversion helpers:
import polars as pl
import duckdb
def polars_to_duckdb(df: pl.DataFrame, name: str, con: duckdb.DuckDBPyConnection):
con.register(name, df.to_arrow())
def duckdb_to_polars(con: duckdb.DuckDBPyConnection, sql: str) -> pl.DataFrame:
return pl.from_arrow(con.execute(sql).arrow())
6. Core Engine
6.1 Parser (engine/parser.py)
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET
@dataclass
class FieldDef:
name: str
type: str
size: Optional[int] = None
source: Optional[str] = None
@dataclass
class NodeDef:
tool_id: int
plugin: str
config: ET.Element # raw <Configuration> element
output_schema: List[FieldDef] = field(default_factory=list)
position: tuple = (0, 0)
@dataclass
class ConnectionDef:
origin_id: int
origin_anchor: str # e.g. "Output", "True", "J"
dest_id: int
dest_anchor: str # e.g. "Input", "Left", "#1"
name: Optional[str] = None
wireless: bool = False
@dataclass
class WorkflowGraph:
nodes: Dict[int, NodeDef]
connections: List[ConnectionDef]
properties: ET.Element # <Properties> element
def parse_workflow(path: str) -> WorkflowGraph:
"""Parse .yxmd XML into WorkflowGraph."""
tree = ET.parse(path)
root = tree.getroot()
nodes = {}
for node_el in root.iter("Node"): # iter flattens ChildNodes
tid = int(node_el.attrib["ToolID"])
gui = node_el.find("GuiSettings")
plugin = gui.attrib.get("Plugin", "") if gui is not None else ""
config = node_el.find("Properties/Configuration")
pos_el = gui.find("Position") if gui is not None else None
pos = (int(pos_el.attrib.get("x",0)), int(pos_el.attrib.get("y",0))) if pos_el is not None else (0,0)
schema = _parse_schema(node_el)
nodes[tid] = NodeDef(tid, plugin, config, schema, pos)
connections = []
for conn_el in root.findall("Connections/Connection"):
orig = conn_el.find("Origin")
dest = conn_el.find("Destination")
connections.append(ConnectionDef(
origin_id=int(orig.attrib["ToolID"]),
origin_anchor=orig.attrib.get("Connection","Output"),
dest_id=int(dest.attrib["ToolID"]),
dest_anchor=dest.attrib.get("Connection","Input"),
name=conn_el.attrib.get("name"),
wireless=conn_el.attrib.get("Wireless","False")=="True"
))
props = root.find("Properties")
return WorkflowGraph(nodes, connections, props)
def _parse_schema(node_el: ET.Element) -> List[FieldDef]:
fields = []
for f in node_el.findall(".//MetaInfo/RecordInfo/Field"):
size_str = f.attrib.get("size")
size = int(float(size_str)) if size_str else None
fields.append(FieldDef(f.attrib["name"], f.attrib.get("type","V_String"), size))
return fields
6.2 Executor (engine/executor.py)
from collections import defaultdict, deque
import polars as pl
from .graph import WorkflowGraph, ConnectionDef
from .context import RunContext
from tools import get_tool_class
def execute(graph: WorkflowGraph, ctx: RunContext):
"""Topological BFS execution."""
# Build adjacency
in_degree = defaultdict(int)
successors = defaultdict(list) # tool_id → [(conn, dest_id)]
predecessors = defaultdict(list) # tool_id → [(conn, origin_id)]
for c in graph.connections:
in_degree[c.dest_id] += 1
successors[c.origin_id].append(c)
predecessors[c.dest_id].append(c)
for tid in graph.nodes:
if tid not in in_degree:
in_degree[tid] = 0
# Output buffers: (tool_id, anchor) → DataFrame
outputs: dict[tuple, pl.DataFrame] = {}
queue = deque([tid for tid, deg in in_degree.items() if deg == 0])
while queue:
tid = queue.popleft()
node = graph.nodes[tid]
tool_cls = get_tool_class(node.plugin)
if tool_cls is None:
# Unsupported / documentation tool — pass through if single input
_passthrough(tid, predecessors, outputs, successors, in_degree, queue)
continue
tool = tool_cls(node, ctx)
# Gather inputs: dict of anchor → DataFrame
inputs = {}
for c in predecessors[tid]:
df = outputs.get((c.origin_id, c.origin_anchor))
if df is not None:
inputs[c.dest_anchor] = df
# Execute
result = tool.execute(inputs) # returns dict[anchor, DataFrame]
for anchor, df in result.items():
outputs[(tid, anchor)] = df
# Decrement in-degree of successors
for c in successors[tid]:
in_degree[c.dest_id] -= 1
if in_degree[c.dest_id] == 0:
queue.append(c.dest_id)
return outputs
6.3 Base Tool (tools/base.py)
from abc import ABC, abstractmethod
from typing import Dict
import polars as pl
import xml.etree.ElementTree as ET
from engine.graph import NodeDef
from engine.context import RunContext
class BaseTool(ABC):
def __init__(self, node: NodeDef, ctx: RunContext):
self.node = node
self.ctx = ctx
self.config = node.config # ET.Element
@abstractmethod
def execute(self, inputs: Dict[str, pl.DataFrame]) -> Dict[str, pl.DataFrame]:
"""
inputs: dict of anchor_name → DataFrame
returns: dict of anchor_name → DataFrame
"""
def _cfg(self, xpath: str, default=None):
"""Helper: get text of config sub-element."""
el = self.config.find(xpath) if self.config is not None else None
return el.text if el is not None else default
def _cfg_attr(self, xpath: str, attr: str, default=None):
el = self.config.find(xpath) if self.config is not None else None
return el.attrib.get(attr, default) if el is not None else default
7. Tool Implementations
7.1 InputData (AlteryxBasePluginsGui.DbFileInput.DbFileInput)
XML Configuration:
<Configuration>
<File OutputFileName="" FileFormat="19" SearchSubDirs="False" RecordLimit="50">
path/to/file.csv
</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter>
<HeaderRow>True</HeaderRow>
<ImportLine>1</ImportLine> <!-- 1-based line to start reading -->
<CodePage>28591</CodePage>
</FormatSpecificOptions>
</Configuration>
File path syntax:
path/to/file.csv— plain filepath/file.xlsx|||``SheetName$``— Excel sheet (triple pipe separator)path/file.xlsx|||``<List of Sheet Names>``— returns column:Sheet Namespath/file.xlsx|||``NamedRange``— named range
FileFormat codes (partial):
| Code | Format |
|---|---|
| 0 | CSV / delimited text |
| 19 | YXDB (Alteryx native) |
| 25 | Excel (.xlsx) |
| 6 | Fixed-width |
Implementation strategy:
class InputDataTool(BaseTool):
def execute(self, inputs):
file_el = self.config.find("File")
raw_path = file_el.text.strip()
fmt = int(file_el.attrib.get("FileFormat","0"))
record_limit = file_el.attrib.get("RecordLimit","") or None
limit = int(record_limit) if record_limit else None
opts = self.config.find("FormatSpecificOptions") or ET.Element("x")
# Resolve relative path against workflow directory
path, sheet = self._parse_path(raw_path)
df = self._read(path, fmt, sheet, opts, limit)
return {"Output": df}
def _parse_path(self, raw):
if "|||" in raw:
path, sheet = raw.split("|||", 1)
return path.strip(), sheet.strip().strip("`")
return raw.strip(), None
def _read(self, path, fmt, sheet, opts, limit):
import duckdb, polars as pl
con = self.ctx.duckdb_con
if fmt == 0: # CSV
delim = opts.findtext("Delimeter", ",")
header = opts.findtext("HeaderRow","True") == "True"
skip = max(0, int(opts.findtext("ImportLine","1") or 1) - 1)
df = pl.read_csv(path, separator=delim, has_header=header, skip_rows=skip)
elif fmt == 25: # Excel
df = pl.read_excel(path, sheet_name=sheet)
elif fmt == 19: # YXDB — use DuckDB + yxdb reader or fallback
raise NotImplementedError("YXDB: use yxdb-py library")
else:
df = pl.read_csv(path) # best-effort fallback
if limit:
df = df.head(limit)
return df
YXDB support: Use the yxdb Python package (pip install yxdb) which reads Alteryx's native binary format.
7.2 TextInput (AlteryxBasePluginsGui.TextInput.TextInput)
XML Configuration:
<Configuration>
<NumRows value="6" />
<Fields>
<Field name="CustomerID" />
<Field name="FirstName" />
</Fields>
<Data>
<r>
<c>49</c>
<c>Thomas</c>
</r>
<r>
<c>456</c>
<c></c> <!-- empty → null -->
</r>
</Data>
</Configuration>
Implementation:
class TextInputTool(BaseTool):
def execute(self, inputs):
fields = [f.attrib["name"] for f in self.config.findall("Fields/Field")]
rows = []
for r in self.config.findall("Data/r"):
cells = r.findall("c")
row = {}
for i, f in enumerate(fields):
el = cells[i] if i < len(cells) else None
text = el.text if el is not None else None
row[f] = text if text != "" else None
rows.append(row)
df = pl.DataFrame(rows, schema={f: pl.String for f in fields})
return {"Output": df}
Note: All columns are String unless downstream Select/Formula coerces them.
7.3 OutputData (AlteryxBasePluginsGui.DbFileOutput.DbFileOutput)
XML Configuration:
<Configuration>
<File MaxRecords="" FileFormat="0">%temp%output.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter>
<HeaderRow>True</HeaderRow>
<LineEndStyle>CRLF</LineEndStyle>
<CodePage>28591</CodePage>
</FormatSpecificOptions>
<MultiFile value="False" />
<!-- Multi-file split by field value: -->
<MultiFile value="True" />
<MultiFileType>Suffix</MultiFileType> <!-- Suffix | Prefix -->
<MultiFileField>Region</MultiFileField>
<KeepField value="True" />
</Configuration>
Implementation notes:
- Resolve
%temp%totempfile.gettempdir(). - If
MultiFile value="True": partition byMultiFileField, write one file per unique value, append/prepend value to filename. - If
MaxRecordsnon-empty: split into chunks of that size, auto-number extra files with_1,_2suffix.
7.4 Browse (AlteryxBasePluginsGui.BrowseV2.BrowseV2)
Behaviour: Passthrough — output equals input. In the runner, log a preview (first N rows) to stdout or a callback.
class BrowseTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
if self.ctx.verbose:
print(f"[Browse ToolID={self.node.tool_id}]\n{df.head(20)}")
return {"Output": df}
7.5 Filter (AlteryxBasePluginsGui.Filter.Filter)
XML Configuration — two modes:
Mode: Simple (Basic filter)
<Configuration>
<Expression>[CustomerID] > 30</Expression>
<Mode>Simple</Mode>
<Simple>
<Operator>></Operator>
<Field>CustomerID</Field>
<Operands>
<Operand>30</Operand>
<DateType>fixed</DateType> <!-- fixed | today | tomorrow | yesterday -->
<IgnoreTimeInDateTime>True</IgnoreTimeInDateTime>
<PeriodType>days</PeriodType> <!-- for period operators -->
<PeriodCount>2</PeriodCount>
</Operands>
</Simple>
</Configuration>
Mode: Custom (expression)
<Configuration>
<Expression><![CDATA[[Region]=="South" OR REGEX_Match(UPPERCASE([Region]), ".*WEST")]]></Expression>
<Mode>Custom</Mode>
</Configuration>
Anchors: Input → True (rows matching), False (rows not matching).
Operator mapping (Simple mode):
| Alteryx Operator | Polars equivalent |
|---|---|
= / Equals |
col == value |
!= / Does not equal |
col != value |
> |
col > value |
>= |
col >= value |
< |
col < value |
<= |
col <= value |
IsNull |
col.is_null() |
IsNotNull |
col.is_not_null() |
Contains |
col.str.contains(value) |
Does not contain |
~col.str.contains(value) |
Is empty |
`col.is_null() |
Is not empty |
col.is_not_null() & (col != "") |
Comes before (<) |
col < value (string compare) |
Comes after (>) |
col > value (string compare) |
Is true |
col == True |
Is false |
col == False |
Range |
(col >= start) & (col <= end) |
PeriodAfter |
date window around anchor date |
PeriodBefore |
date window before anchor date |
Date dynamic values:
today→datetime.date.today()tomorrow→today + timedelta(1)yesterday→today - timedelta(1)
Custom mode: transpile the <Expression> text via expression/transpiler.py (see §8).
Implementation:
class FilterTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
mode = self._cfg("Mode", "Custom")
if mode == "Simple":
mask = self._build_simple_mask(df)
else:
expr_text = self._cfg("Expression","True")
mask = self.ctx.transpiler.eval_mask(df, expr_text)
return {
"True": df.filter(mask),
"False": df.filter(~mask),
}
7.6 Formula (AlteryxBasePluginsGui.Formula.Formula)
XML Configuration:
<Configuration>
<FormulaFields>
<FormulaField
expression="IF [Latitude] > 39.7 THEN "North" ELSE "South" ENDIF"
field="Region"
size="1073741823"
type="V_WString" />
<FormulaField
expression="ROUND([Spend]/[Visits],1)"
field="AverageSpendPerVisit"
size="19.2"
type="FixedDecimal" />
</FormulaFields>
</Configuration>
Behaviour:
- Each
<FormulaField>is applied sequentially; later expressions can reference columns created by earlier ones. - If
fieldmatches an existing column: update in place. - If
fieldis new: append as new column. - Expression HTML entities:
>→>,<→<,"→",
→ newline.
Implementation:
class FormulaTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
for ff in self.config.findall("FormulaFields/FormulaField"):
expr_text = ff.attrib["expression"]
field = ff.attrib["field"]
dtype = self.ctx.type_mapper.map(ff.attrib.get("type","V_WString"),
ff.attrib.get("size"))
series = self.ctx.transpiler.eval_series(df, expr_text, field, dtype)
if field in df.columns:
df = df.with_columns(series.alias(field))
else:
df = df.with_columns(series.alias(field))
return {"Output": df}
7.7 Select (AlteryxBasePluginsGui.AlteryxSelect.AlteryxSelect)
XML Configuration:
<Configuration>
<OrderChanged value="True" />
<CommaDecimal value="False" />
<SelectFields>
<SelectField field="CustomerID" selected="True" rename="Customer ID" description="..." />
<SelectField field="Visits" selected="True" type="Int16" size="2" />
<SelectField field="*Unknown" selected="True" /> <!-- pass-through unknown cols -->
</SelectFields>
</Configuration>
Behaviour:
selected="False": drop the column.renameattribute: rename column.type+sizeattribute on SelectField: cast column to new type.*Unknownrow: ifselected="True", pass through any incoming columns not explicitly listed; ifselected="False", drop them.OrderChanged="True": output columns in the order listed in<SelectFields>.
Implementation:
class SelectTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
fields = self.config.findall("SelectFields/SelectField")
keep_unknown = False
explicit = {} # original_name → (selected, rename, dtype)
for sf in fields:
name = sf.attrib["field"]
if name == "*Unknown":
keep_unknown = sf.attrib.get("selected","True") == "True"
continue
explicit[name] = {
"selected": sf.attrib.get("selected","True") == "True",
"rename": sf.attrib.get("rename", name),
"type": sf.attrib.get("type"),
"size": sf.attrib.get("size"),
}
order_changed = self.config.attrib.get("OrderChanged","False") == "True"
# Build column list in config order if OrderChanged, else original order
result_cols = []
processed = set()
for sf in fields:
name = sf.attrib["field"]
if name == "*Unknown": continue
if name not in df.columns: continue
meta = explicit[name]
if not meta["selected"]: continue
processed.add(name)
col = pl.col(name)
if meta["type"]:
col = col.cast(self.ctx.type_mapper.map(meta["type"], meta["size"]))
result_cols.append(col.alias(meta["rename"]))
if keep_unknown:
for c in df.columns:
if c not in processed:
result_cols.append(pl.col(c))
if not order_changed:
# respect incoming column order
result_cols = sorted(result_cols, key=lambda e: df.columns.index(e._pyexpr.root_names()[0]) if hasattr(e, '_pyexpr') else 999)
return {"Output": df.select(result_cols)}
7.8 Sort (AlteryxBasePluginsGui.Sort.Sort)
XML Configuration:
<Configuration>
<SortInfo locale="0">
<Field field="LastName" order="Ascending" />
<Field field="FirstName" order="Ascending" />
</SortInfo>
</Configuration>
Attributes:
order:"Ascending"|"Descending"locale:0= default byte-order sort;1033= dictionary order (natural number sort for strings, English locale)
Implementation:
class SortTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
sort_fields = self.config.findall("SortInfo/Field")
by = [f.attrib["field"] for f in sort_fields]
descending = [f.attrib.get("order","Ascending") == "Descending" for f in sort_fields]
locale = self.config.findtext("SortInfo[@locale]") or "0"
# locale 1033 = natural sort (numeric strings as numbers)
# Polars: use sort with maintain_order=True for stability
df = df.sort(by=by, descending=descending, maintain_order=True)
return {"Output": df}
Note: For locale="1033" (dictionary/natural order on string columns containing numbers), apply a Schwartzian transform: extract numeric prefix for sort, then restore original values.
7.9 Sample (AlteryxBasePluginsGui.Sample.Sample)
XML Configuration:
<Configuration>
<Mode>First</Mode> <!-- First | Last | Sample | Random | NPercent -->
<N>3</N>
<GroupFields orderChanged="False">
<Field name="City" />
</GroupFields>
</Configuration>
Modes:
| Mode | Behaviour |
|---|---|
First |
Return first N rows (per group if GroupFields present) |
Last |
Return last N rows |
Sample |
Return 1 of every N rows (deterministic, not random) |
Random |
Each row has 1-in-N chance of inclusion (non-deterministic) |
NPercent |
Return first N% of rows |
Implementation:
class SampleTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
mode = self._cfg("Mode","First")
n = int(self._cfg("N","1"))
group_fields = [f.attrib["name"] for f in self.config.findall("GroupFields/Field")]
if not group_fields:
return {"Output": self._sample_flat(df, mode, n)}
# Grouped sampling
parts = []
for _, group_df in df.group_by(group_fields, maintain_order=True):
parts.append(self._sample_flat(group_df, mode, n))
return {"Output": pl.concat(parts)}
def _sample_flat(self, df, mode, n):
if mode == "First":
return df.head(n)
elif mode == "Last":
return df.tail(n)
elif mode == "Sample":
indices = range(0, len(df), n)
return df[list(indices)]
elif mode == "Random":
mask = [random.random() < (1/n) for _ in range(len(df))]
return df.filter(mask)
elif mode == "NPercent":
count = max(1, int(len(df) * n / 100))
return df.head(count)
return df
7.10 Unique (AlteryxBasePluginsGui.Unique.Unique)
XML Configuration:
<Configuration>
<UniqueFields>
<Field name="LastName" />
<Field name="FirstName" />
</UniqueFields>
</Configuration>
Behaviour:
- Group by the listed fields.
- First occurrence of each group →
Uniqueanchor output. - All subsequent duplicates →
Duplicateanchor output. - Column order preserved.
Anchors: Input → Unique, Duplicate
Implementation:
class UniqueTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
key_fields = [f.attrib["name"] for f in self.config.findall("UniqueFields/Field")]
df = df.with_row_index("_row_idx")
first_idx = df.group_by(key_fields, maintain_order=True).agg(pl.col("_row_idx").first())["_row_idx"]
unique_df = df.filter(pl.col("_row_idx").is_in(first_idx)).drop("_row_idx")
dup_df = df.filter(~pl.col("_row_idx").is_in(first_idx)).drop("_row_idx")
return {"Unique": unique_df, "Duplicate": dup_df}
7.11 Join (AlteryxBasePluginsGui.Join.Join)
XML Configuration:
<Configuration>
<JoinByRecordPos value="False" />
<!-- OR: join by field -->
<SelectConfiguration>
<Configuration outputConnection="J">
<SelectFields>
<!-- fields to include in J output -->
<SelectField field="L_CustomerID" selected="True" />
<SelectField field="R_OrderDate" selected="True" />
</SelectFields>
</Configuration>
<Configuration outputConnection="L"> ... </Configuration>
<Configuration outputConnection="R"> ... </Configuration>
</SelectConfiguration>
</Configuration>
The join key pairs are stored as child elements:
<JoinFields>
<JoinField Left="CustomerID" Right="CustomerID" />
</JoinFields>
Anchors:
- Inputs:
Left,Right - Outputs:
J(inner join),L(left unmatched),R(right unmatched)
Column naming on J output:
- If both inputs share a column name (other than join key): prefix with
L_andR_. - Join key columns: appear once (from Left input by default).
Implementation using DuckDB:
class JoinTool(BaseTool):
def execute(self, inputs):
left = inputs.get("Left", pl.DataFrame())
right = inputs.get("Right", pl.DataFrame())
by_pos = self._cfg_attr("JoinByRecordPos","value","False") == "True"
con = self.ctx.duckdb_con
con.register("_left", left.to_arrow())
con.register("_right", right.to_arrow())
if by_pos:
left = left.with_row_index("_pos")
right = right.with_row_index("_pos")
# join on _pos
join_keys = [("_pos","_pos")]
else:
join_keys = [(jf.attrib["Left"], jf.attrib["Right"])
for jf in self.config.findall("JoinFields/JoinField")]
j_df, l_df, r_df = self._execute_join(left, right, join_keys, con)
return {"J": j_df, "L": l_df, "R": r_df}
def _execute_join(self, left, right, join_keys, con):
# Build column lists with disambiguation
l_cols = left.columns
r_cols = right.columns
key_l = [k[0] for k in join_keys]
key_r = [k[1] for k in join_keys]
# ... build SQL for inner, left anti, right anti
# Inner join (J):
on_clause = " AND ".join(f"l.{k[0]} = r.{k[1]}" for k in join_keys)
j_sql = f"SELECT * FROM _left l INNER JOIN _right r ON {on_clause}"
l_sql = f"SELECT l.* FROM _left l LEFT ANTI JOIN _right r ON {on_clause}"
r_sql = f"SELECT r.* FROM _right r LEFT ANTI JOIN _left l ON {on_clause}"
j_df = duckdb_to_polars(con, j_sql)
l_df = duckdb_to_polars(con, l_sql)
r_df = duckdb_to_polars(con, r_sql)
return j_df, l_df, r_df
7.12 Union (AlteryxBasePluginsGui.Union.Union)
XML Configuration:
<Configuration>
<Mode>Auto</Mode> <!-- Auto | ByName | ByPosition -->
</Configuration>
Modes:
Auto/ByName: match columns by name; fill missing columns with NULL.ByPosition: stack columns positionally regardless of name; use first input's column names.
Anchors: Multiple Input1, Input2, ... InputN; single Output.
Implementation:
class UnionTool(BaseTool):
def execute(self, inputs):
mode = self._cfg("Mode","Auto")
# Sort input anchors: Input, Input1, Input2... or by numeric suffix
sorted_keys = sorted(inputs.keys(), key=lambda k: int(k.replace("Input","") or 0))
dfs = [inputs[k] for k in sorted_keys]
if mode == "ByPosition":
# Rename all to first df's columns
names = dfs[0].columns
renamed = [df.rename(dict(zip(df.columns, names))) for df in dfs]
result = pl.concat(renamed, how="diagonal")
else: # ByName / Auto
result = pl.concat(dfs, how="diagonal_relaxed")
return {"Output": result}
7.13 Summarize (AlteryxSpatialPluginsGui.Summarize.Summarize)
XML Configuration:
<Configuration>
<SummarizeFields>
<SummarizeField field="Region" action="GroupBy" rename="Region" />
<SummarizeField field="Spend" action="Sum" rename="Total_Spend" />
<SummarizeField field="CustomerID" action="Count" rename="Num_Customers" />
<SummarizeField field="Score" action="Avg" rename="Avg_Score" />
<SummarizeField field="City" action="Concatenate" rename="Cities"
separator=", " order="Ascending" />
<SummarizeField field="Visits" action="Percentile" rename="P50"
percentile="50" />
</SummarizeFields>
</Configuration>
Action mapping:
| Alteryx Action | DuckDB / Polars equivalent |
|---|---|
GroupBy |
group_by key |
Sum |
pl.col(f).sum() |
Count |
pl.col(f).count() |
Count Non Null |
pl.col(f).drop_nulls().count() |
Count Distinct |
pl.col(f).n_unique() |
Count Distinct Non Null |
pl.col(f).drop_nulls().n_unique() |
Count Null |
pl.col(f).is_null().sum() |
Min |
pl.col(f).min() |
Max |
pl.col(f).max() |
Avg |
pl.col(f).mean() |
Median |
pl.col(f).median() |
Std Deviation |
pl.col(f).std() |
Variance |
pl.col(f).var() |
Percentile |
pl.col(f).quantile(p/100) |
Concatenate |
pl.col(f).sort().cast(str).str.join(sep) |
First |
pl.col(f).first() |
Last |
pl.col(f).last() |
Mode |
custom UDF |
Implementation:
class SummarizeTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
fields = self.config.findall("SummarizeFields/SummarizeField")
group_fields = [f.attrib["field"] for f in fields if f.attrib.get("action")=="GroupBy"]
agg_exprs = []
for f in fields:
action = f.attrib.get("action","GroupBy")
field = f.attrib["field"]
rename = f.attrib.get("rename", field)
if action == "GroupBy": continue
expr = self._build_agg(field, action, f.attrib)
agg_exprs.append(expr.alias(rename))
if group_fields:
result = df.group_by(group_fields, maintain_order=True).agg(agg_exprs)
else:
result = df.select(agg_exprs)
return {"Output": result}
7.14 JoinMultiple (AlteryxBasePluginsGui.JoinMultiple.JoinMultiple)
XML Configuration:
<Configuration>
<JoinByRecPos value="True" />
<OutputJoinOnly value="False" />
<SelectConfiguration>
<Configuration outputConnection="Output">
<SelectFields>
<SelectField field="Input_#1_CustomerID" selected="True" />
<SelectField field="Input_#2_JoinDate" selected="True" />
<SelectField field="*Unknown" selected="True" />
</SelectFields>
</Configuration>
</SelectConfiguration>
</Configuration>
Behaviour:
- Joins N inputs simultaneously (by position or by field).
- Column names from each input are prefixed with
Input_#N_to disambiguate. - Single
Outputanchor. OutputJoinOnly="False": include all rows (outer join on all inputs).
7.15 AppendFields (AlteryxBasePluginsGui.AppendFields.AppendFields)
Behaviour: Cartesian product of a small "Source" input appended to every row of a large "Target" input. One-to-many: every Target row gets one or more Source rows' fields added.
Anchors: Target (large), Source (small) → Output
Implementation: Cross join if source is small; otherwise warn.
7.16 CrossTab (AlteryxBasePluginsGui.CrossTab.CrossTab)
XML Configuration:
<Configuration>
<GroupFields>
<Field name="Region" />
</GroupFields>
<HeaderField field="Year" />
<DataField field="Sales" method="Sum" />
</Configuration>
Behaviour: Pivot table: rows = GroupBy values, columns = unique values of HeaderField, cell = aggregate of DataField.
Implementation using DuckDB PIVOT:
class CrossTabTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
con = self.ctx.duckdb_con
con.register("_ct_input", df.to_arrow())
header = self._cfg("HeaderField/@field") or self.config.find("HeaderField").attrib["field"]
data = self.config.find("DataField")
data_field = data.attrib["field"]
method = data.attrib.get("method","Sum").upper()
groups = [f.attrib["name"] for f in self.config.findall("GroupFields/Field")]
group_clause = ", ".join(groups)
# DuckDB PIVOT
sql = f"""
PIVOT _ct_input ON "{header}"
USING {method}("{data_field}")
GROUP BY {group_clause}
"""
result = duckdb_to_polars(con, sql)
return {"Output": result}
7.17 Transpose (AlteryxBasePluginsGui.Transpose.Transpose)
XML Configuration:
<Configuration>
<KeyFields>
<Field name="CustomerID" />
</KeyFields>
<DataFields>
<Field name="Visits" />
<Field name="Spend" />
</DataFields>
</Configuration>
Behaviour: Un-pivot (melt). Key fields remain; each Data field becomes a row with Name (field name) and Value columns.
Implementation:
class TransposeTool(BaseTool):
def execute(self, inputs):
df = inputs.get("Input", pl.DataFrame())
keys = [f.attrib["name"] for f in self.config.findall("KeyFields/Field")]
data_cols = [f.attrib["name"] for f in self.config.findall("DataFields/Field")]
result = df.melt(id_vars=keys, value_vars=data_cols,
variable_name="Name", value_name="Value")
return {"Output": result}
7.18 DateTime Parse (AlteryxBasePluginsGui.DateTime.DateTime)
XML Configuration:
<Configuration>
<Format>%d/%m/%Y</Format>
<InputField>DateString</InputField>
<OutputField>ParsedDate</OutputField>
<InputType>String</InputType>
<OutputType>Date</OutputType>
</Configuration>
Implementation: pl.col(input).str.to_date(format) or str.to_datetime(format).
7.19 RegEx (AlteryxBasePluginsGui.RegEx.RegEx)
XML Configuration:
<Configuration>
<Field>EmailAddress</Field>
<Expression>(\w+)@(\w+)\.(\w+)</Expression>
<Replace>
<ReplaceString>$1_at_$2</ReplaceString>
</Replace>
<Method>Match</Method> <!-- Match | Replace | Parse | Token -->
<OutputField>Matched</OutputField>
<FullMatch value="True" />
<CaseInsensitive value="False" />
</Configuration>
Methods:
Match: boolean column; True if pattern matches.Replace: replace withReplaceString(use$1,$2for capture groups →\1,\2in Python).Parse: extract capture groups into new columns (one per group).Token: split by pattern, return one row per token.
7.20 TextToColumns (AlteryxBasePluginsGui.TextToColumns.TextToColumns)
XML Configuration:
<Configuration>
<Field>AddressField</Field>
<Delimiter>,</Delimiter>
<NumCols value="3" />
<RootName>Address_</RootName>
<SplitToRows value="False" />
</Configuration>
Behaviour:
SplitToRows="False": split intoNumColsnew columns named{RootName}1,{RootName}2, etc.SplitToRows="True": one row per token.
7.21 GenerateRows (AlteryxBasePluginsGui.GenerateRows.GenerateRows)
XML Configuration:
<Configuration>
<UpdateField value="False" /> <!-- True = update existing field, False = create -->
<CreateField_Name>JoinDate</CreateField_Name>
<CreateField_Type>Date</CreateField_Type>
<Expression_Init>DateTimeFormat(DateTimeNow(),"%Y-%m-%d")</Expression_Init>
<Expression_Cond>[JoinDate] <= DateTimeAdd("2025-01-01", 5, "Days")</Expression_Cond>
<Expression_Loop>DateTimeAdd([JoinDate], 1, "days")</Expression_Loop>
<RecordCount value="0" /> <!-- 0 = unlimited (use condition) -->
</Configuration>
Behaviour: Generate rows by:
- Initialise the target field using
Expression_Init. - While
Expression_Condis true: a. Emit current row. b. Update field withExpression_Loop.
Implementation note: Because GenerateRows typically has no input (or one input row acts as a seed), implement as an iterative Python loop evaluating the transpiled expressions.
7.22 MultiRowFormula (AlteryxBasePluginsGui.MultiRowFormula.MultiRowFormula)
XML Configuration:
<Configuration>
<FormulaField field="RunningTotal" expression="[Row-1:RunningTotal] + [Visits]"
type="Double" size="8" />
<NumRows value="1" />
<GroupByFields>
<Field name="Region" />
</GroupByFields>
</Configuration>
Expression syntax:
[Row-1:FieldName]→ previous row's value ofFieldName(null at boundary).[Row+1:FieldName]→ next row's value.[Row-N:FieldName]→ N rows back.
Implementation: Use Polars shift() for lookback/lookahead, or DuckDB window functions (LAG, LEAD).
7.23 RecordID (AlteryxBasePluginsGui.RecordID.RecordID)
XML Configuration:
<Configuration>
<Field>RecordID</Field>
<StartValue>1</StartValue>
<FieldType>Int32</FieldType>
</Configuration>
Implementation: df.with_row_index(name=field, offset=start_value - 1) then cast to int type.
8. Expression Transpiler
The expression/transpiler.py module converts Alteryx expression syntax into either:
- A Polars expression (
pl.Expr) for simple column-level operations. - A DuckDB SQL fragment for complex/multi-step cases.
8.1 Syntax Differences
| Alteryx | Python/Polars/DuckDB |
|---|---|
[ColumnName] |
pl.col("ColumnName") / "ColumnName" |
"string literal" |
'string literal' |
IF ... THEN ... ELSEIF ... ELSE ... ENDIF |
pl.when(...).then(...).when(...).then(...).otherwise(...) / CASE WHEN |
IIF(cond, true_val, false_val) |
pl.when(cond).then(true_val).otherwise(false_val) |
IsNull([Field]) |
pl.col("Field").is_null() |
!IsNull([Field]) |
pl.col("Field").is_not_null() |
NULL() |
None / pl.lit(None) |
AND / OR / NOT |
& / | / ~ (Polars) or AND/OR/NOT (SQL) |
== (equality) |
== |
!= |
!= |
Row reference [Row-1:Field] |
pl.col("Field").shift(1) |
8.2 Built-in Function Mapping
String functions:
| Alteryx | DuckDB SQL / Polars |
|---|---|
Uppercase([F]) |
UPPER(F) / pl.col(F).str.to_uppercase() |
Lowercase([F]) |
LOWER(F) / pl.col(F).str.to_lowercase() |
Titlecase([F]) |
custom: capitalise each word |
Trim([F]) |
TRIM(F) |
LTrim([F]) |
LTRIM(F) |
RTrim([F]) |
RTRIM(F) |
Length([F]) |
LENGTH(F) / pl.col(F).str.len_chars() |
Left([F],n) |
LEFT(F,n) / pl.col(F).str.slice(0,n) |
Right([F],n) |
RIGHT(F,n) |
Substring([F],start,len) |
SUBSTR(F,start,len) (1-indexed) |
FindString([F],pattern) |
INSTR(F,pattern) (1-indexed, 0=not found) |
ReplaceChar([F],old,new) |
REPLACE(F,old,new) |
StringToDate([F],fmt) |
STRPTIME(F,fmt) |
ToString([num],decimals) |
PRINTF('%.Nf', num) |
Contains([F],str) |
CONTAINS(F,str) |
StartsWith([F],str) |
STARTS_WITH(F,str) |
EndsWith([F],str) |
ENDS_WITH(F,str) |
REGEX_Match([F],pat) |
REGEXP_MATCHES(F,pat) |
REGEX_Replace([F],pat,repl) |
REGEXP_REPLACE(F,pat,repl) |
REGEX_CountMatches([F],pat) |
custom UDF |
Math functions:
| Alteryx | DuckDB / Polars |
|---|---|
ABS([F]) |
ABS(F) |
CEIL([F]) |
CEIL(F) |
FLOOR([F]) |
FLOOR(F) |
ROUND([F],n) |
ROUND(F,n) |
SQRT([F]) |
SQRT(F) |
POW([F],n) |
POWER(F,n) |
LOG([F]) |
LN(F) |
LOG10([F]) |
LOG10(F) |
MOD([F],n) |
F % n |
MIN([A],[B]) |
LEAST(A,B) |
MAX([A],[B]) |
GREATEST(A,B) |
RandInt(n) |
FLOOR(RANDOM()*n) |
Date/Time functions:
| Alteryx | DuckDB / Polars |
|---|---|
DateTimeNow() |
NOW() / datetime.datetime.now() |
DateTimeToday() |
CURRENT_DATE |
DateTimeAdd([D],n,"days") |
D + INTERVAL n DAY |
DateTimeDiff([D1],[D2],"days") |
DATEDIFF('day', D2, D1) |
DateTimeFormat([D],fmt) |
STRFTIME(D, fmt) |
ToDate([D]) |
CAST(D AS DATE) |
DateTimeYear([D]) |
YEAR(D) |
DateTimeMonth([D]) |
MONTH(D) |
DateTimeDay([D]) |
DAY(D) |
DateTimeHour([D]) |
HOUR(D) |
Conditional:
| Alteryx | DuckDB |
|---|---|
IF [F] > 0 THEN "pos" ELSE "neg" ENDIF |
CASE WHEN F > 0 THEN 'pos' ELSE 'neg' END |
IIF([F]>0, "pos", "neg") |
IF(F>0, 'pos', 'neg') |
Switch([F], default, v1, r1, v2, r2) |
CASE F WHEN v1 THEN r1 WHEN v2 THEN r2 ELSE default END |
8.3 Transpiler Implementation Strategy
Use a recursive descent parser or a grammar-based approach (e.g. lark-parser or pyparsing).
Recommended approach:
- Tokenise the expression (handle
[column], string literals, function calls, operators, keywords). - Build an AST.
- Walk the AST and emit DuckDB SQL or Polars expressions.
- For
eval_mask(df, expr): register df as a DuckDB view, runSELECT *, ({sql_expr}) AS _mask FROM df, return the boolean series. - For
eval_series(df, expr, field, dtype): runSELECT ({sql_expr}) AS {field} FROM df.
Fallback: If an expression cannot be transpiled, raise UnsupportedExpressionError with the original expression text.
9. RunContext
import duckdb
import tempfile
from pathlib import Path
class RunContext:
def __init__(self, workflow_dir: str, verbose: bool = False):
self.workflow_dir = Path(workflow_dir)
self.verbose = verbose
self.duckdb_con = duckdb.connect(":memory:")
self.temp_dir = Path(tempfile.mkdtemp())
self.transpiler = ExpressionTranspiler(self.duckdb_con)
self.type_mapper = TypeMapper()
self.constants: dict = {} # workflow-level constants from <Properties>
def resolve_path(self, path: str) -> Path:
"""Resolve relative paths against workflow directory, expand %temp%."""
path = path.replace("%temp%", str(self.temp_dir) + "/")
p = Path(path)
if not p.is_absolute():
p = self.workflow_dir / p
return p
10. CLI
python -m alteryx_runner run <workflow.yxmd> [options]
Options:
--output-dir PATH Write output files to this directory (overrides workflow paths)
--param KEY=VALUE Set workflow constant
--verbose Print Browse tool results and execution log
--dry-run Parse and validate only; do not execute
--format {json,csv,parquet} Default output format for Browse nodes
Example:
python -m alteryx_runner run MyWorkflow.yxmd --verbose --output-dir ./results
11. Dependencies (pyproject.toml)
[project]
name = "alteryx-runner"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"polars>=0.20",
"duckdb>=0.10",
"pyarrow>=14",
"yxdb>=0.1", # Alteryx YXDB binary format
"openpyxl>=3.1", # Excel read/write
"lark>=1.1", # Expression parser grammar
"click>=8.0", # CLI
]
[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "ruff", "mypy"]
12. Test Strategy
Each tool should have:
- A unit test loading the corresponding
.yxmdfixture fromtests/fixtures/. - Assertions on output schema (column names, dtypes).
- Assertions on output data (row count, specific cell values).
Priority test fixtures (from uploaded files):
| File | Tools covered |
|---|---|
Filter.yxmd |
TextInput, Filter (Simple+Custom), JoinMultiple, GenerateRows, Formula |
Formula.yxmd |
InputData (YXDB), Formula (static/conditional/modify/multi) |
Select.yxmd |
InputData, Select (simple/type-cast/rename-reorder/prefix) |
Sort.yxmd |
TextInput, Sort (single/multi/string-numeric/dictionary) |
Sample.yxmd |
InputData, Sample (First/Sample/Random/NPercent/Grouped) |
Unique.yxmd |
InputData, Unique |
Input_Data.yxmd |
InputData (CSV/Excel/sheets/named-ranges/record-limit) |
Output_Data.yxmd |
TextInput, OutputData (CSV/YXDB/multi-file/max-records) |
Browse.yxmd |
InputData, Browse (passthrough, profiling) |
Text_Input.yxmd |
TextInput |
13. Known Limitations & Out-of-Scope
- Spatial tools:
SpatialObjcolumns are passed through as opaque strings; spatial operations (Buffer, Distance, etc.) are not implemented. - In-Database tools: The
LockIn*plugin family (In-DB tools) requires a live database connection; not supported. - Macros (
.yxmc): TheEngineDll="Macro"setting triggers sub-workflow execution. Implement basic macro recursion but mark as best-effort. - Predictive tools (R-based
.yxmc): Out of scope. - YXDB write: Writing
.yxdboutput requires theyxdblibrary's write API; fall back to Parquet if unavailable. - Gallery/Server connections: All
SavedDataConnectionsandGallerysources will raiseNotImplementedError. - Report/Rendering tools:
PortfolioPlugins*family is out of scope. - Wireless connections: Treated identically to wired connections.
14. Execution Example
Given Sort.yxmd from the uploaded fixtures:
TextInput (ToolID=64)
↓
Sort: CustomerID Ascending (ToolID=26)
↓ (no downstream in this example workflow)
The runner will:
- Parse XML →
WorkflowGraphwith 4 Sort nodes all sourced from TextInput node 64. - Execute
TextInput→ DataFrame with columns:CustomerID,FirstName,LastName,Gender,JoinDate,Region,Score. - Execute
Sort(26)→df.sort("CustomerID", descending=False). - Since no downstream connections exist (no Output/Browse), store results in the context.
15. XML Quick-Reference Card
Below is the distilled XML config shape for each implemented tool, as extracted from the uploaded .yxmd example files.
Filter (Simple)
<Configuration>
<Expression>[CustomerID] > 30</Expression>
<Mode>Simple</Mode>
<Simple>
<Operator>></Operator>
<Field>CustomerID</Field>
<Operands><Operand>30</Operand><DateType>fixed</DateType></Operands>
</Simple>
</Configuration>
Filter (Custom)
<Configuration>
<Expression><![CDATA[[Region]=="South" OR REGEX_Match(UPPERCASE([Region]),".*WEST")]]></Expression>
<Mode>Custom</Mode>
</Configuration>
Formula
<Configuration>
<FormulaFields>
<FormulaField expression="Titlecase([City])" field="City" size="256" type="String" />
<FormulaField expression="IF [Lat] > 39.7 THEN "N" ELSE "S" ENDIF"
field="Region" size="1073741823" type="V_WString" />
</FormulaFields>
</Configuration>
Select
<Configuration>
<OrderChanged value="True" />
<CommaDecimal value="False" />
<SelectFields>
<SelectField field="CustomerID" selected="True" rename="Cust ID" description="PK" />
<SelectField field="Visits" selected="True" type="Int16" size="2" />
<SelectField field="*Unknown" selected="False" />
</SelectFields>
</Configuration>
Sort
<Configuration>
<SortInfo locale="0">
<Field field="LastName" order="Ascending" />
<Field field="FirstName" order="Descending" />
</SortInfo>
</Configuration>
Sample
<Configuration>
<Mode>First</Mode>
<N>3</N>
<GroupFields orderChanged="False">
<Field name="City" />
</GroupFields>
</Configuration>
Unique
<Configuration>
<UniqueFields>
<Field name="CustomerID" />
</UniqueFields>
</Configuration>
InputData (CSV)
<Configuration>
<File OutputFileName="" FileFormat="0" RecordLimit="50">path/to/file.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter><HeaderRow>True</HeaderRow><ImportLine>1</ImportLine>
</FormatSpecificOptions>
</Configuration>
InputData (Excel)
<Configuration>
<File FileFormat="25">path/to/file.xlsx|||`SheetName$`</File>
<FormatSpecificOptions>
<FirstRowData>False</FirstRowData><ImportLine>1</ImportLine>
</FormatSpecificOptions>
</Configuration>
OutputData (CSV)
<Configuration>
<File MaxRecords="" FileFormat="0">%temp%output.csv</File>
<FormatSpecificOptions>
<Delimeter>,</Delimeter><HeaderRow>True</HeaderRow><LineEndStyle>CRLF</LineEndStyle>
</FormatSpecificOptions>
<MultiFile value="False" />
</Configuration>
TextInput
<Configuration>
<NumRows value="3" />
<Fields>
<Field name="Col1" /><Field name="Col2" />
</Fields>
<Data>
<r><c>val1</c><c>val2</c></r>
<r><c>val3</c><c /></r> <!-- empty → NULL -->
</Data>
</Configuration>