⚠️ Private SDK Documentation - This documentation is for customers with private SDK access. Some features and capabilities may vary based on your agreement.

Understanding the fundamental concepts behind the Lume Python SDK will help you build more effective data transformation workflows.

Architecture Overview

The Lume SDK follows a simple but powerful pattern:

CSV Files + Seed Files → Flow Version → Run → Transformed Data (CSV/JSON) + Metrics
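
In practice this pattern maps onto a handful of SDK calls. A minimal end-to-end sketch, assuming the SDK is imported as lume and the bucket paths are placeholders:

import lume  # assumes the SDK is importable as `lume`

# CSV inputs + seed files, pinned to a specific flow version
run = lume.run(
    flow_version="invoice_cleaner:v4",
    input_files=["s3://bucket/invoices.csv"],
    seed_files=["s3://reference/lookup.csv"]
).wait()  # block until the run reaches a terminal state

# Transformed data (CSV by default) plus metrics land in ./output
run.download_all("./output")
print(run.status, run.metrics)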

Key Objects

Flow

A Flow is a logical mapping template that defines:

  • Target Schema: The structure of your output data
  • Transformation Rules: How to map input data to a destination data model
  • Validation Rules: Quality checks and business logic
  • Error Handling: How to handle malformed or missing data
  • Seed Data Integration: How to use reference data during transformation

Flows are created and managed in the Lume UI, not via the API.

Version

A Version is an immutable snapshot of a Flow at a specific point in time. Think of it like a Git commit - once created, it never changes.

# Examples of Flow Versions
"customer_data:v2"      # Version 2 of the customer_data flow
"product_catalog:v1"      # Version 1 of the product_catalog flow

Why Versions?

  • Reproducibility: Same input always produces same output
  • Safety: Changes to flows don’t affect running jobs
  • Rollback: Easy to revert to previous versions
  • Testing: Test new versions before promoting to production (see the sketch after this list)
  • Compliance: Maintain audit trails
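
For example, to cover the Testing and Reproducibility points above, you can run the same input against the current and a candidate version and compare results before promoting. A sketch, assuming the candidate version name is a placeholder and that run.metrics mirrors the metrics.json structure shown later:

import lume  # assumes the SDK is importable as `lume`

input_files = ["s3://raw/invoices.csv"]

# Same input, two immutable versions - the outputs are directly comparable
current = lume.run(flow_version="invoice_cleaner:v4", input_files=input_files).wait()
candidate = lume.run(flow_version="invoice_cleaner:v5", input_files=input_files).wait()

# Promote only if the candidate's error rate does not regress
# (assumes run.metrics mirrors the metrics.json structure shown below)
if candidate.metrics["error_rate"] <= current.metrics["error_rate"]:
    print("invoice_cleaner:v5 looks safe to promote")
else:
    print("invoice_cleaner:v5 regressed - keep v4 pinned")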

Run

A Run is a single execution of a Flow Version against one or more CSV input files and optional seed files.

# Create a run with input and seed files
run = lume.run(
    flow_version="invoice_cleaner:v4",
    input_files=["s3://bucket/file1.csv", "s3://bucket/file2.csv"],
    seed_files=["s3://reference/lookup.csv"]
)

# A run has:
run.id              # Unique identifier
run.status          # Current state
run.metrics         # Performance and quality metrics
run.created_at      # Timestamp

Run Lifecycle

Every run goes through these states:

CREATED → QUEUED → RUNNING → SUCCEEDED/FAILED/PARTIAL_FAILED/CRASHED

Status Meanings

Status            Description                                            Action Required
CREATED           Run created, data uploaded, waiting to be triggered    Trigger the run
QUEUED            Waiting for resources                                  None - will start automatically
RUNNING           Currently processing                                   None - monitor progress
SUCCEEDED         All data processed successfully                        Download results
PARTIAL_FAILED    Some data processed, some failed                       Check rejects, download results
FAILED            All data failed to process                             Investigate errors
CRASHED           System error occurred                                  Contact support
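
The table above translates directly into post-run handling. A minimal sketch that branches on the terminal status once wait() returns (the Best Practices section below shows a similar pattern):

import lume  # assumes the SDK is importable as `lume`

run = lume.run(
    flow_version="invoice_cleaner:v4",
    input_files=["s3://raw/invoices.csv"]
).wait()

if run.status == "SUCCEEDED":
    run.download_all("./output")        # all rows mapped - just download
elif run.status == "PARTIAL_FAILED":
    run.download_all("./output")        # keep the rows that mapped...
    run.download_rejects("./rejects")   # ...and inspect what was rejected
elif run.status == "FAILED":
    print(f"Run {run.id} failed - inspect the errors and input data")
else:  # CRASHED
    print(f"Run {run.id} crashed - contact support")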

Output Structure

Every run produces a consistent output structure:

output/
├── mapped/                    # Successfully transformed data
│   └── part-0000.csv         # or part-0000.json
├── rejects/                   # Rows that failed validation
│   └── part-0000.csv         # or part-0000.json
├── metrics.json              # Summary statistics
└── validation_results.json   # Detailed validation results
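
Once downloaded, this structure can be walked with the standard library. A sketch that reads the run-level metrics and the mapped CSV parts, assuming run.download_all("./output") materializes the layout above under ./output:

import csv
import json
from pathlib import Path

output_dir = Path("./output")  # populated by run.download_all("./output")

# Summary statistics for the whole run
metrics = json.loads((output_dir / "metrics.json").read_text())
print("error rate:", metrics["error_rate"])

# Successfully transformed rows, split across part files
for part in sorted((output_dir / "mapped").glob("part-*.csv")):
    with part.open(newline="") as f:
        for row in csv.DictReader(f):
            pass  # feed each mapped row into your downstream system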

Metrics Overview

The metrics.json file contains:

{
  "run_id": "run_01HX...",
  "flow_version": "invoice_cleaner:v4",
  "row_counts": {
    "input": 14892,
    "mapped": 14317,
    "rejects": 575
  },
  "error_rate": 0.0386,
  "max_error_code": "MISSING_REQUIRED_FIELD",
  "validation_summary": {
    "tests_executed": 7,
    "tests_failed": 1,
    "top_errors": [
      {"error_code": "MISSING_REQUIRED_FIELD", "count": 461},
      {"error_code": "INVALID_DATE_FORMAT", "count": 114}
    ]
  },
  "runtime_seconds": 22
}
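
These fields are convenient for automated quality gates. A sketch that fails a pipeline step when the error rate or validation results exceed your tolerance (the 5% threshold is a placeholder):

import json
from pathlib import Path

metrics = json.loads(Path("./output/metrics.json").read_text())

# error_rate is rejects / input rows (575 / 14892 ≈ 0.0386 in the example above)
if metrics["error_rate"] > 0.05:
    raise RuntimeError(f"Error rate too high: {metrics['error_rate']:.2%}")

if metrics["validation_summary"]["tests_failed"] > 0:
    for err in metrics["validation_summary"]["top_errors"]:
        print(f"{err['error_code']}: {err['count']} rejected rows")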

File Formats

Input Formats

  • CSV: Comma-separated values (only supported input format)

Output Formats

  • CSV: Comma-separated values (default)
  • JSON: JSON Lines format (one JSON object per line)
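
JSON output is written as JSON Lines, so each record parses independently. A sketch for reading a downloaded part file, assuming the part-0000.json path from the Output Structure section above:

import json
from pathlib import Path

# Produced by run.download_mapped("./output", output_format="json")
with Path("./output/mapped/part-0000.json").open() as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        # process the transformed record here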

Seed Files

  • CSV: Reference data files used during transformation
  • Examples: lookup tables, configuration data, master data

Supported Storage

The SDK can read from and write to:

  • Amazon S3: s3://bucket/path/to/file
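
Input, seed, and output locations are all referenced by s3:// URI. If your raw files are staged under a bucket prefix, one way to build the input list is with boto3 (shown only as an illustration; boto3 is not part of the Lume SDK, and the bucket and prefix names are placeholders):

import boto3
import lume  # assumes the SDK is importable as `lume`

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="raw", Prefix="invoices/2024/")

# Build s3:// URIs for every CSV under the prefix
input_files = [
    f"s3://raw/{obj['Key']}"
    for obj in resp.get("Contents", [])
    if obj["Key"].endswith(".csv")
]

run = lume.run(flow_version="invoice_cleaner:v4", input_files=input_files).wait()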

Seed Files

Seed files provide reference data that can be used during the transformation process. Common use cases include:

Lookup Tables

# Customer lookup table
run = lume.run(
    flow_version="invoice_cleaner:v4",
    input_files=["s3://raw/invoices.csv"],
    seed_files=["s3://reference/customer_lookup.csv"]
).wait()

Configuration Data

# Product catalog and pricing rules
run = lume.run(
    flow_version="invoice_processor:v2",
    input_files=["s3://raw/orders.csv"],
    seed_files=[
        "s3://reference/product_catalog.csv",
        "s3://reference/pricing_rules.csv"
    ]
).wait()

Master Data

# Geographic and currency reference data
run = lume.run(
    flow_version="global_sales:v1",
    input_files=["s3://raw/sales.csv"],
    seed_files=[
        "s3://reference/countries.csv",
        "s3://reference/currencies.csv",
        "s3://reference/exchange_rates.csv"
    ]
).wait()

Output Format Selection

You can choose your output format when downloading results:

# Download as CSV (default)
run.download_all("./output", output_format="csv")

# Download as JSON
run.download_all("./output", output_format="json")

# Download only mapped data as JSON
run.download_mapped("./output", output_format="json")

Security and Compliance

See our Security page.

Best Practices

Error Handling

run = lume.run(flow_version="my_flow:v1", input_files=files).wait()

if run.status == "SUCCEEDED":
    # Process successful data
    run.download_all("./output")
elif run.status == "PARTIAL_FAILED":
    # Some data failed, but you still have results
    run.download_all("./output")
    # Investigate rejects
    run.download_rejects("./rejects")
else:
    # Handle complete failure
    raise RuntimeError(f"Run failed with status: {run.status}")

Output Format Selection

# Choose format based on downstream requirements
downstream_system = "warehouse"  # e.g., resolved from your pipeline configuration
if downstream_system == "warehouse":
    run.download_all("./output", output_format="csv")
elif downstream_system == "api":
    run.download_all("./output", output_format="json")