Pyteryx/README.md

202 lines
7.3 KiB
Markdown

# Pyteryx — Alteryx Runner
A Python-native runner for Alteryx `.yxmd` workflow files — no Alteryx installation required. Run workflows directly **or** convert them to standalone Python scripts.
## Prerequisites
- **Python 3.11+**
- **[uv](https://docs.astral.sh/uv/)** — fast Python package manager
## Setup
```bash
# Install all dependencies
uv sync
```
---
## Running a Workflow
```bash
uv run python -m alteryx_runner run <path/to/workflow.yxmd> [options]
```
### Example
```bash
uv run python -m alteryx_runner run ./Alteryx_TestWorkflows/Unique\&Sample/Unique\&Sample.yxmd --verbose
```
### Options
| Flag | Description |
|---|---|
| `--verbose` | Print Browse results and detailed execution log |
| `--dry-run` | Parse and validate only — do not execute |
| `--output-dir PATH` | Write output files to a specific directory |
| `--param KEY=VALUE` | Set a workflow constant (repeatable) |
| `--format [csv\|json\|parquet]` | Default output format for Browse nodes (default: `csv`) |
### Dry-Run (Validate Only)
Check that a workflow parses correctly without executing it:
```bash
uv run python -m alteryx_runner run ./Alteryx_TestWorkflows/JoinTesting/JoinTesting.yxmd --dry-run
```
### Custom Output Directory & Format
```bash
uv run python -m alteryx_runner run ./workflow.yxmd --output-dir ./results --format parquet
```
### Workflow Parameters
Pass runtime constants with `--param`:
```bash
uv run python -m alteryx_runner run ./workflow.yxmd --param "StartDate=2024-01-01" --param "Region=West"
```
## Listing Supported Tools
See all registered Alteryx tool plugins:
```bash
uv run python -m alteryx_runner list-tools
```
---
## Converting an Alteryx Workflow to a Python Script
Pyteryx can convert an `.yxmd` workflow into a standalone Python script that uses **Polars** and **DuckDB** — the same libraries the runner uses internally. The generated script reproduces every step of your workflow as readable, sequential Python code that you can run, edit, and commit without any Alteryx dependency.
### How It Works
1. **Parse** — The `.yxmd` XML file is parsed into a directed acyclic graph (DAG) of tool nodes and connections (`alteryx_runner/engine/parser.py`).
2. **Topological Sort** — Nodes are ordered so each tool runs only after its upstream inputs are ready (`alteryx_runner/engine/executor.py`).
3. **Transpile Expressions** — Alteryx formula expressions (e.g. `IF [Amount] > 100 THEN "High" ELSE "Low" ENDIF`) are transpiled to DuckDB SQL fragments (`alteryx_runner/expression/transpiler.py`).
4. **Emit Python** — Each tool node is converted to its Polars/DuckDB equivalent. The final output is a self-contained `.py` file.
### Quick Start
```bash
# Convert an Alteryx workflow to a Python script
uv run python -m alteryx_runner convert <path/to/workflow.yxmd> -o <output_script.py>
```
#### Examples
```bash
# Convert the Join testing workflow
uv run python -m alteryx_runner convert ./Alteryx_TestWorkflows/JoinTesting/JoinTesting.yxmd -o join_pipeline.py
# Convert and immediately run the generated script
uv run python -m alteryx_runner convert ./workflow.yxmd -o pipeline.py
uv run python pipeline.py
```
### What Gets Generated
The output script follows a predictable structure:
```python
#!/usr/bin/env python3
"""Auto-generated by Pyteryx from MyWorkflow.yxmd"""
import polars as pl
import duckdb
# ── Tool 1: Input Data ────────────────────────────
df_1 = pl.read_csv("data/sales.csv")
# ── Tool 3: Filter ────────────────────────────────
df_3 = df_1.filter(pl.sql_expr("\"Region\" = 'West'"))
# ── Tool 5: Formula ───────────────────────────────
con = duckdb.connect(":memory:")
con.register("t", df_3.to_arrow())
df_5 = con.execute("SELECT *, (\"Price\" * \"Qty\") AS \"Total\" FROM t").pl()
# ── Tool 7: Output Data ──────────────────────────
df_5.write_csv("output/results.csv")
```
### Conversion Reference
The table below shows how each supported Alteryx tool maps to its Python equivalent:
| Alteryx Tool | Python / Polars Equivalent |
|---|---|
| **Input Data** | `pl.read_csv()` / `pl.read_excel()` / `pl.read_parquet()` |
| **Output Data** | `df.write_csv()` / `df.write_parquet()` |
| **Text Input** | `pl.DataFrame({...})` (inline literal data) |
| **Browse** | `print(df)` or `df.write_csv()` |
| **Filter** | `df.filter(pl.sql_expr(...))` |
| **Formula** | DuckDB `SELECT *, <expr> AS <field>` |
| **Multi-Row Formula** | DuckDB window functions (`LAG` / `LEAD`) |
| **Multi-Field Formula** | Loop over matching columns with DuckDB |
| **Select** | `df.select()` / `df.rename()` / `df.cast()` |
| **Sort** | `df.sort(...)` |
| **Sample** | `df.head()` / `df.sample()` |
| **Unique** | `df.unique()` |
| **Record ID** | `df.with_row_index(...)` |
| **Auto Field** | Automatic dtype optimization |
| **Generate Rows** | Loop / `pl.DataFrame` construction |
| **Join** | `df.join(other, ...)` |
| **Join Multiple** | Chained `df.join(...)` calls |
| **Union** | `pl.concat([...])` |
| **Append Fields** | `df.join(other, how="cross")` |
| **Find Replace** | DuckDB `REPLACE()` / Polars `.str.replace()` |
| **DateTime** | DuckDB `STRFTIME()` / `STRPTIME()` |
| **RegEx** | `df.with_columns(pl.col(...).str.extract(...))` |
| **Text To Columns** | `df.with_columns(pl.col(...).str.split(...))` |
| **Summarize** | `df.group_by(...).agg(...)` |
| **Cross Tab** | `df.pivot(...)` |
| **Transpose** | `df.transpose()` |
### Expression Transpilation
Alteryx formula expressions are automatically transpiled to DuckDB SQL. Common patterns:
| Alteryx Expression | Generated SQL |
|---|---|
| `[ColumnName]` | `"ColumnName"` |
| `IF [X] > 0 THEN "Y" ENDIF` | `CASE WHEN "X" > 0 THEN 'Y' END` |
| `IIF([A]=1, "yes", "no")` | `CASE WHEN "A"=1 THEN 'yes' ELSE 'no' END` |
| `IsNull([F])` | `"F" IS NULL` |
| `Left([Name], 3)` | `LEFT("Name", 3)` |
| `DateTimeAdd([Date], 7, "days")` | `"Date" + INTERVAL (7) DAY` |
| `[Row-1:Total]` | `LAG("Total", 1) OVER ()` |
## Supported Tool Categories
| Category | Tools |
|---|---|
| **In/Out** | Input Data, Output Data, Browse, Text Input |
| **Preparation** | Filter, Formula, Multi-Field Formula, Multi-Row Formula, Select, Sort, Sample, Unique, Record ID, Auto Field, Generate Rows |
| **Join** | Join, Join Multiple, Union, Append Fields, Find Replace |
| **Parse** | DateTime, RegEx, Text To Columns |
| **Transform** | Summarize, Cross Tab, Transpose |
## Project Structure
```
alteryx_runner/
├── cli.py # CLI entry point
├── engine/
│ ├── parser.py # .yxmd XML → workflow graph
│ ├── executor.py # Topological execution engine
│ ├── graph.py # DAG data structures
│ └── context.py # Runtime context & config
├── expression/ # Alteryx expression parser
└── tools/ # Tool implementations
├── inout/ # Input Data, Output Data, Browse, Text Input
├── join/ # Join, Union, Append Fields, Find Replace
├── parse/ # DateTime, RegEx, Text To Columns
├── preparation/ # Filter, Formula, Sort, Sample, Unique, …
└── transform/ # Summarize, Cross Tab, Transpose
```