
Pipeline Overview

How source-system data is transformed from raw exports into a diagnosed, vendor-agnostic analytical dataset.


Before we get to the layer-by-layer transformation, there is a structural split at the top of the pipeline.

pre-make gathers all inputs. It extracts source data from the DLZ (governed by the extraction contract), pulls notes and bookmarks from the SIS, publishes them as YAML files in the AFS, and commits to git. Pre-make has side effects — it touches the outside world.

make is a pure function: AFS in → KLS out. No SIS mutation, no external fetches, no network calls. Every build is reproducible from the AFS commit it was built on. This is the security boundary — whoever controls make controls the analytical truth.

For the full architecture (including the five operating modes), see Operating Modes.


1. Starting Point: The Source System Export


Every organization manages its operational data through one or more source systems — ERP platforms, industry-specific software, or custom tools. Each system has its own data model, its own column names, its own conventions. Source systems supported by domain packs include Skidata, CargoWise, SAP MM, Navision, and many more.

The organization exports its data as a set of files — typically Excel workbooks or CSV files. These files are the raw truth. They contain everything — but they are messy, vendor-specific, and not ready for analysis.


2. Conversion: From Vendor Format to Canonical CSVs


The first step is to bring the source-system export into a standardised CSV format that the rest of the pipeline expects. This is where vendor-specific knowledge lives.

A dedicated extraction script reads the export files and produces canonical CSV files. The following tables illustrate a typical pack’s entity model (each pack defines its own entities):

| CSV file | What it describes | Key fields |
| --- | --- | --- |
| cost_centers.csv | Organisational units (departments, wards, operational areas) | ID, name, type |
| materials.csv | Every article tracked in the inventory system — consumables, instruments, supplies | ID, description, group, standard price, unit, billable flag |
| material_classifications.csv | A 3-level hierarchy classifying materials (e.g. 9 → 90 → 90.01: Pharma → Internal pharmacy → Internal preparations) | Code, French name, German name |
| packaging_types.csv | How materials are packaged (piece, box, bottle, etc.) | Code, French name, German name |
| suppliers.csv | Companies that deliver materials | Code, name |
| article_suppliers.csv | Which supplier delivers which material, with their own reference number | Material ID, supplier code, supplier reference, creation date |

| CSV file | What it records | Key fields |
| --- | --- | --- |
| cases.csv | Operational cases or transactions | Case token, start/end time, cost center, type |
| procedures.csv | Procedures performed during cases | Procedure ID, type, case token, timestamp, duration |
| case_material_usage.csv | What items were consumed for each case | Usage ID, case token, material ID, quantity, timestamp, unit price |
| billing_events.csv | What was invoiced for each case | Billing ID, case token, material ID, quantity, billed amount, cost center, status |
| material_movements.csv | Physical movement of goods between locations | Movement ID, material ID, quantity, from/to location, timestamp, type (RECEIVE/TRANSFER/ISSUE/RETURN) |

The conversion handles real-world messiness:

  • Source workbooks may have multiple sheets with different column layouts
  • IDs are mixed alphanumeric (e.g. “638876A” alongside “5299613”)
  • Cost center names may carry site prefixes
  • Sentinel dates like 9999-12-31 are stripped
  • Entities discovered in usage/billing but absent from master data are backfilled
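The cleaning rules above can be sketched as a per-row normalisation pass. This is a minimal illustration; the field names, sentinel set, and helper name are assumptions, not the real extractor's code:

```python
SENTINEL_DATES = {"9999-12-31"}  # placeholder "no end date" values in exports

def clean_row(row: dict) -> dict:
    """Normalise one raw export row (hypothetical field names)."""
    cleaned = dict(row)
    # IDs are mixed alphanumeric ("638876A" next to "5299613") — keep as strings.
    if "material_id" in cleaned:
        cleaned["material_id"] = str(cleaned["material_id"]).strip()
    # Strip sentinel dates like 9999-12-31 back to "no value".
    for key in ("valid_to", "end_date"):
        if cleaned.get(key) in SENTINEL_DATES:
            cleaned[key] = None
    return cleaned
```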

For synthetic tenants (used for development and testing), a data generator produces the same CSV files with deterministic, seeded random data — different organizational profiles, different defect patterns, same structure.

Extraction is a trust boundary. Everything that crosses it must be explicit, pinned, verified on every run, and logged — and any failure must stop the build rather than degrade silently. The configuration IS the contract, and the contract is the audit trail.

Every tenant has a pipeline.yml at the AFS root that lists every file crossing the extract layer:

  • Exact path — no globs, no “pick the latest” heuristics
  • SHA-256 pin — hash mismatches fail the build hard
  • Source type — one of six: system_export, expert_curation, reference_data, pipeline_config, api_feed, db_extract
  • Expected schema — required sheets, required columns, minimum row counts
  • Attribution — who delivered it, when, why, downstream consumers
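The SHA-256 pin check can be sketched in a few lines, assuming the contract supplies the expected digest (the function name is hypothetical):

```python
import hashlib
from pathlib import Path

def verify_pinned_file(path: str, expected_sha256: str) -> None:
    """Fail hard when a file crossing the extract layer drifts from its pin."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(
            f"hash drift for {path}: expected {expected_sha256}, got {digest}"
        )
```

A mismatch raises instead of warning — consistent with the "errors, not warnings" rule below.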

Every extraction run appends a record to afs/state/extract_log.jsonl — an append-only, git-tracked audit trail that answers questions like “which exact file did we read last March? what was its hash?” by simply reading a file.
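Appending to such an append-only JSONL log is deliberately simple. A sketch with illustrative field names (the real record layout is not documented here):

```python
import json
import time

def log_extraction(log_path: str, source_path: str, sha256: str, status: str = "ok") -> None:
    """Append one audit record to the extract log (append-only JSONL)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "source": source_path,
        "sha256": sha256,
        "status": status,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```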

Anomalies — missing files, hash drift, missing sheets, missing columns, extractor write outside declared output — are errors, not warnings. No silent fallbacks.

Read the full rules in the Extraction Guide or the design doc docs/design/extractor_discipline.md.


3. The Medallion Architecture: Bronze → Silver → Gold


Once CSV files are in place, the data flows through three layers of transformation. Each layer has a clear responsibility. Nothing is lost; everything is traceable.

Bronze

What it does: Reads each CSV file exactly as-is into the database. No transformation, no filtering, no judgment.

What it adds:

  • source_file — which CSV file this row came from (for traceability)
  • row_number — the row’s position in the file (for debugging)

Why it matters: Bronze is the faithful record. If something goes wrong downstream, you can always trace back to the exact row in the exact file.
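A Bronze-style load can be sketched as: read verbatim, add only the two traceability columns. The helper below is a hypothetical illustration, not the actual implementation:

```python
import csv

def load_bronze(csv_path: str):
    """Read a CSV exactly as-is, adding only source_file and row_number."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.DictReader(fh), start=1):
            row["source_file"] = csv_path   # which file this row came from
            row["row_number"] = i           # position in the file, for debugging
            yield row
```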

There are 13 Bronze models — one per CSV file:

| Bronze model | Source CSV |
| --- | --- |
| bronze_cases | cases.csv |
| bronze_procedures | procedures.csv |
| bronze_materials | materials.csv |
| bronze_cost_centers | cost_centers.csv |
| bronze_case_material_usage | case_material_usage.csv |
| bronze_billing_events | billing_events.csv |
| bronze_material_movements | material_movements.csv |
| bronze_material_classifications | material_classifications.csv |
| bronze_suppliers | suppliers.csv |
| bronze_article_suppliers | article_suppliers.csv |
| bronze_packaging_types | packaging_types.csv |
| bronze_taxonomy_nodes | taxonomy_nodes.csv |
| bronze_taxonomy_member_mappings | taxonomy_member_mappings.csv |

Source-system dispatch: This is the only place where vendor-specific column names appear. A dispatch mechanism translates vendor columns to canonical names. For example, one source system’s case_token stays case_token, while another’s AUFNR or No_ becomes case_token.
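The dispatch idea can be illustrated with per-vendor column maps. The vendor keys below are placeholders; only the AUFNR / No_ → case_token translations come from the text:

```python
# Hypothetical per-vendor column maps; canonical names on the right.
COLUMN_MAPS = {
    "vendor_a": {"case_token": "case_token"},  # already canonical
    "vendor_b": {"AUFNR": "case_token"},
    "vendor_c": {"No_": "case_token"},
}

def to_canonical(vendor: str, row: dict) -> dict:
    """Rename vendor-specific columns; unmapped columns pass through unchanged."""
    mapping = COLUMN_MAPS[vendor]
    return {mapping.get(col, col): value for col, value in row.items()}
```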

From Bronze onward, the data model is completely source-system-agnostic.

Silver

What it does: Takes each Bronze table and applies domain knowledge — cleaning, type casting, validation, and referential integrity checks.

Key operations:

  • Trimming and normalisation: Remove leading/trailing whitespace, normalise case (inpatient not Inpatient)
  • Type casting: Convert strings to proper types — timestamps, decimals, booleans (handling multilingual values like oui/yes/true/1)
  • Surrogate key generation: Each row gets a stable, unique key derived from its source file and row number
  • Validation: Every row is checked against business rules and gets two columns:
    • is_valid — true or false
    • invalid_reason — a human-readable explanation when false (e.g. “missing case_token”, “orphan material_id”, “invalid billing_status”)
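Two of these operations — multilingual boolean parsing and surrogate keys derived from file plus row position — can be sketched as follows (helper names and the key-truncation length are assumptions):

```python
import hashlib

TRUTHY = {"oui", "yes", "true", "1"}  # multilingual boolean values seen in exports

def parse_bool(raw) -> bool:
    """Interpret messy export booleans like 'oui', 'Yes', 'TRUE', '1'."""
    return str(raw).strip().lower() in TRUTHY

def surrogate_key(source_file: str, row_number: int) -> str:
    """Stable, unique key derived from the row's source file and position."""
    return hashlib.sha256(f"{source_file}:{row_number}".encode()).hexdigest()[:16]
```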

What gets validated per entity:

| Entity | Validation rules |
| --- | --- |
| Cases | Must have a case token; start_time must parse as a timestamp; case_type must be one of inpatient/outpatient/day-case |
| Materials | Must have a material ID; standard_price must be a positive number; material_group must be IMPLANT, CONSUMABLE, KIT, INSTRUMENT, or MEDICATION_MAT |
| Billing events | Must have billing_id, case_token, material_id; case_token must reference a valid case; material_id must reference a valid material; billed_amount must parse as a number; billing_status must be POSTED, PENDING, or REVERSED |
| Material usage | Must have usage_id, case_token, material_id; foreign keys checked against valid cases and materials; quantity and unit_price must be numeric |
| Material movements | Must have movement_id, material_id; movement_type must be RECEIVE, TRANSFER, ISSUE, RETURN, or ADJUST; quantity must be numeric |
| Procedures | Must have procedure_id and case_token; case must exist; duration must be numeric if present |
| Cost centres | Must have cost_center_id and name |
| Suppliers | Must have supplier_code and supplier_name |
| Article-suppliers | Must have material_id and supplier_code; both must reference valid master records |

The design principle: no silent filtering. Silver never drops a row. A billing event that references a non-existent case is marked is_valid = false with invalid_reason = 'orphan case_token' — it stays in the table, visible and queryable. This is essential: knowing what’s wrong with your data is as valuable as having clean data.

Gold

What it does: Filters Silver to only valid rows. This is the final, trusted dataset — the product contract.

Gold models are deliberately simple:

select case_key, case_token, start_time, end_time, cost_center_id, case_type
from silver_cases
where is_valid = true

Gold produces 13 entities that form the canonical data model:

Core dimensions (what things are):

  • Cases — Operational cases, each with a start/end time, cost centre, and type
  • Materials — Every article with its price, group, description, and classification codes
  • Cost centres — Organisational units (departments, operational areas)
  • Procedures — Procedures linked to cases
  • Material classifications — 3-level product hierarchy (L1 → L2 → L3)
  • Suppliers — Material vendors
  • Article-suppliers — Which supplier delivers which material
  • Packaging types — How materials are packaged

Core facts (what happened):

  • Case material usage — What was consumed, when, for which case, at what price
  • Billing events — What was invoiced, when, for which case, at what amount
  • Material movements — How goods moved physically between locations

Taxonomies (how things are organised):

  • Taxonomy nodes — Hierarchical trees (e.g. organisational units, cost centre groupings) with computed depth, materialised path, and leaf-node flags
  • Taxonomy member mappings — Links between tree nodes and dimension members (e.g. “cost centre X belongs to node Y in the OU-structure taxonomy”)
  • Taxonomy closure — A pre-computed table of all ancestor-descendant relationships, enabling efficient hierarchical queries (“show me all cost centres under the Surgery department, at any depth”)
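A closure table can be pre-computed by walking parent links. A sketch, assuming nodes are supplied as a child → parent map (root maps to None); the depth column is included since hierarchical queries typically need it:

```python
def build_closure(parent_of: dict) -> list:
    """Pre-compute every (ancestor, descendant, depth) triple from parent links.

    Includes the depth-0 self pair for each node, so "all cost centres under
    node X, at any depth" is a single lookup on the ancestor column.
    """
    rows = []
    for node in parent_of:
        depth, ancestor = 0, node
        while ancestor is not None:
            rows.append((ancestor, node, depth))
            ancestor = parent_of.get(ancestor)
            depth += 1
    return rows
```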

The Gold layer is formally versioned through a contract (gold_contract.v1.json) that defines every entity, its primary key, and its fields with types. Signals and downstream applications code against this contract, not against implementation details.

gold_article_catalogues, gold_material_movements, and gold_price_series carry a supply_org column — the smallest organizational unit that independently owns assortment, pricing, and fulfillment responsibility for a set of materials. In nuMetrix, canonical values are PHA (pharmacy) and LOG (logistics). The correct pricing grain is (material, supply_org), not material alone. A new supply_org_matched rule in the price palette returns one price per (material, supply_org) pair. Existing rules are unchanged — the fix is additive. See Supply-Org Pricing.
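The grain change can be illustrated by indexing prices on the (material, supply_org) pair rather than on material alone. This is a hypothetical helper; the duplicate-handling behaviour is an assumption:

```python
def build_price_index(price_rows: list) -> dict:
    """Index prices at the correct grain: (material_id, supply_org).

    The same material may legitimately carry different prices under PHA
    and LOG; keying on material alone would silently collapse them.
    """
    index = {}
    for row in price_rows:
        key = (row["material_id"], row["supply_org"])
        if key in index:
            raise ValueError(f"duplicate price for {key}")
        index[key] = row["price"]
    return index
```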


4. Data Quality Reporting

Section titled “4. Data Quality Reporting”

The pipeline doesn’t just transform data — it measures its own quality at every stage.

For each of the 9 transactional entities, the pipeline reports:

  • Total rows ingested
  • How many are valid vs. invalid
  • The invalidity rate as a percentage

This answers: “How clean is the data we received?”

Compares Silver to Gold row counts for each entity:

  • How many rows were in Silver
  • How many made it to Gold
  • How many were lost (and what percentage)

This answers: “How much data did we have to exclude, and is that expected?”

Cross-checks financial totals:

  • Total usage value (quantity × unit price) vs. total billed amount
  • Highlights discrepancies per cost centre and time period

This answers: “Do the numbers add up?”
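The reconciliation can be sketched as two aggregations compared per cost centre. Simplified: it assumes usage rows already carry a cost_center, which in the real model would come from joining usage to its case:

```python
def reconcile(usage_rows: list, billing_rows: list) -> dict:
    """Compare usage value (quantity × unit price) with billed totals per cost centre."""
    used, billed = {}, {}
    for r in usage_rows:
        used[r["cost_center"]] = used.get(r["cost_center"], 0) + r["quantity"] * r["unit_price"]
    for r in billing_rows:
        billed[r["cost_center"]] = billed.get(r["cost_center"], 0) + r["billed_amount"]
    return {
        cc: {"usage": used.get(cc, 0), "billed": billed.get(cc, 0),
             "gap": used.get(cc, 0) - billed.get(cc, 0)}
        for cc in set(used) | set(billed)
    }
```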


5. Multi-Tenant and Cross-Organization Analytics


Each tenant gets its own isolated schema in the database. Tenant A’s data is in tenant_a.*, Tenant B’s is in tenant_b.*. They never mix at the Bronze, Silver, or Gold level.

The Platform layer sits above all tenants. It creates union views across organizations:

platform.platform_gold_cases =
tenant_a.gold_cases (with tenant_id = 'tenant_a')
UNION ALL
tenant_b.gold_cases (with tenant_id = 'tenant_b')
UNION ALL
...

This enables cross-tenant analytics — comparing costs, billing patterns, quality metrics, and signal findings across organizations — while preserving tenant isolation at the data layer.
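Generating such a union view is mechanical; a sketch that emits the SQL for one entity across a tenant list (illustrative, not the actual build code):

```python
def platform_union_sql(entity: str, tenants: list) -> str:
    """Generate the cross-tenant union view for one Gold entity."""
    selects = [
        f"select '{t}' as tenant_id, * from {t}.gold_{entity}" for t in tenants
    ]
    return (
        f"create view platform.platform_gold_{entity} as\n"
        + "\nunion all\n".join(selects)
    )
```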


6. Signals: Automated Diagnostics

Section titled “6. Signals: Automated Diagnostics”

Once the data is certified and unified in Gold, the real analytical work begins. Signals are automated diagnostic queries that scan the dataset for anomalies, inefficiencies, and potential financial risks.

Each signal:

  1. Operates exclusively on Gold entities (never touches Bronze or Silver)
  2. Scans for a specific pattern or anomaly
  3. Produces standardised findings — each finding has a continuous score (0–100, with display labels derived from ranges), the entity affected, a time bucket, the estimated impact, polarity, direction, and supporting evidence

Every signal emits rows with the same structure:

| Field | Meaning |
| --- | --- |
| finding_id | Unique identifier for this finding |
| tenant_id | Which organization |
| signal_id | Which signal detected this |
| score | Continuous significance (0–100); display labels are derived from ranges |
| polarity | positive / negative / neutral |
| direction | improving / stable / worsening |
| entity_type | What kind of entity is affected (Case, Material, etc.) |
| entity_id | Which specific entity |
| time_bucket | When (typically year-month) |
| impact | Structured exposure — value + unit + dimension (financial / operational / quality / cultural / strategic) |
| evidence | Machine-readable details (JSON) |
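The findings contract maps naturally onto a record type. The class below is an illustration of the shape, not the platform's actual code; the range check on score is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    finding_id: str
    tenant_id: str
    signal_id: str
    score: float        # continuous significance, 0–100
    polarity: str       # positive / negative / neutral
    direction: str      # improving / stable / worsening
    entity_type: str    # Case, Material, etc.
    entity_id: str
    time_bucket: str    # typically year-month
    impact: dict        # value + unit + dimension
    evidence: dict      # machine-readable details

    def __post_init__(self):
        if not 0 <= self.score <= 100:
            raise ValueError("score must be in [0, 100]")
```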

Revenue Leakage (balance signal, high severity) Compares total usage value against total billed amount per case. When an organization records 5,000 worth of consumption for a case but only bills 3,500, that’s 1,500 of potential revenue leakage. The signal flags cases where usage exceeds billing by more than 2%, with severity tiers at 5% (medium) and 20% (high).
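The thresholds described translate directly into a severity function. One assumption: the gap is measured relative to the usage value; the text does not state the denominator:

```python
def leakage_severity(usage_value: float, billed_value: float):
    """Flag when usage exceeds billing by more than 2%; tiers at 5% and 20%."""
    if usage_value <= 0:
        return None
    gap = (usage_value - billed_value) / usage_value
    if gap <= 0.02:
        return None        # within tolerance, not flagged
    if gap > 0.20:
        return "high"
    if gap > 0.05:
        return "medium"
    return "low"
```

The worked example from the text — 5,000 of consumption against 3,500 billed — is a 30% gap and lands in the high tier.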

Missing Mandatory Items (mandatory-item signal, high severity) Checks that qualifying cases actually have the corresponding item usage recorded. A qualifying transaction without documented item consumption signals a documentation or billing gap.

Cost Centre Billing Mismatch (balance signal, medium severity) Detects cases where billing events are posted to a different cost centre than the case’s assigned cost centre. If a case belongs to Department A but items are billed to Department B, that’s a cost attribution error that distorts departmental P&L.

I/O Coefficient (hand-written signal, medium-high severity) Compares warehouse TRANSFER quantities against documented USAGE quantities for each item per month. In a balanced system, the I/O coefficient should be close to 1.0 — what leaves the warehouse should roughly equal what’s documented as consumed. Deviations indicate untracked ward-level consumption (coefficient < 1: more goes out than is documented) or over-delivery (coefficient > 1: more is documented than shipped). Filtered to materials priced CHF 1–99 to focus on high-volume consumables where systematic drift has the largest cumulative impact.
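Under that reading, the coefficient is documented usage divided by transferred quantity; the direction of the ratio is inferred from the description, not stated as a formula:

```python
def io_coefficient(transferred_qty: float, documented_usage_qty: float):
    """Documented usage ÷ warehouse transfers; close to 1.0 in a balanced system.

    < 1: more left the warehouse than was documented (untracked consumption).
    > 1: more was documented than shipped (over-delivery).
    """
    if transferred_qty == 0:
        return None  # nothing shipped this month; ratio undefined
    return documented_usage_qty / transferred_qty
```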

The signal framework supports two authoring modes:

Declarative signals (for repeatable patterns):

  • Balance signals compare two aggregates per entity and flag when they diverge beyond a tolerance
  • Mandatory-item signals check that qualifying entities have required related items
  • Distribution-outlier signals flag statistical anomalies using z-scores

These are compiled from YAML definitions into SQL, validated against the Gold contract.
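Compilation from a declarative definition to SQL can be sketched for a balance signal. The field names below are illustrative, not the real YAML schema:

```python
def compile_balance_signal(defn: dict) -> str:
    """Compile a balance-signal definition (normally parsed from YAML) into SQL."""
    return "\n".join([
        f"select {defn['key']},",
        f"       sum({defn['left']}) as left_total,",
        f"       sum({defn['right']}) as right_total",
        f"from {defn['source']}",
        f"group by {defn['key']}",
        f"having abs(sum({defn['left']}) - sum({defn['right']}))",
        f"     > {defn['tolerance']} * sum({defn['left']})",
    ])
```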

Hand-written signals (for unique analytical shapes): When a signal needs to join across fact tables at different grains or apply logic that doesn’t fit the YAML templates, it’s written directly as SQL. It still follows the findings output contract and integrates into the same platform union.

All signal findings from all tenants are unioned into platform.platform_probe_findings, enabling cross-tenant comparison of anomaly patterns, severity distributions, and aggregate financial risk.


7. What’s Next

Today, the organization delivers Excel files that are manually placed in the intake directory. The next step is to automate this — an SFTP drop folder or upload endpoint that triggers the pipeline automatically when new files arrive.

The current pipeline does full rebuilds: every run processes all data from scratch. For tenants with growing datasets, this will need to evolve into incremental processing — only ingesting new or changed rows since the last run. This affects every layer:

  • Bronze must detect and append only new rows
  • Silver must re-validate only affected entities
  • Gold and signals must refresh only the time periods that changed

Reference data — materials, cost centres, suppliers, classifications — changes over time. A material’s price today is not what it was six months ago. A cost centre might be renamed, split, or decommissioned. Currently, the pipeline treats reference data as a snapshot: whatever the export says now is the truth.

The next evolution is temporal reference data management: tracking when each reference record was valid. This means:

  • Every dimension record gets a valid_from and valid_to date
  • Fact records (usage, billing, movements) are joined to the reference data version that was active at the time of the event
  • Signals can detect pricing anomalies by comparing the price at the time of billing against the current standard price
  • Historical analysis becomes accurate: a report for Q3 2024 uses Q3 2024 prices, not today’s prices

This is essential for longitudinal analysis across multi-year datasets.
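A point-in-time lookup against versioned reference data can be sketched as follows. The conventions (valid_from inclusive, valid_to exclusive, None meaning open-ended) are assumptions for the illustration:

```python
def price_at(price_history: list, material_id: str, event_date: str):
    """Return the price version active at event_date (ISO date strings).

    valid_from is inclusive, valid_to exclusive; valid_to = None means the
    version is still current.
    """
    for row in price_history:
        if (row["material_id"] == material_id
                and row["valid_from"] <= event_date
                and (row["valid_to"] is None or event_date < row["valid_to"])):
            return row["price"]
    return None  # no version covers this date
```

Fact rows would join through this lookup so a Q3 2024 report prices materials at their Q3 2024 values.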

The framework is designed to grow. Planned signal directions include:

  • Duplicate detection — finding billing events that appear to be entered twice
  • Timing anomalies — billing events that occur before the associated usage
  • Stale pricing — materials where the billed unit price deviates significantly from the current standard price
  • Phantom references — usage records pointing to materials that don’t exist in the master
  • Partial billing — cases where billed quantity is systematically less than used quantity

The current I/O coefficient signal works at the material level. Extending it to the cost centre level requires resolving the mapping between warehouse location codes (used in movements) and cost centre codes (used in cases). Once this mapping exists, the organization can see which departments are over- or under-consuming relative to what’s delivered — a key insight for inventory management.


The pipeline described here is the same across all five operating modes — Local, Proxy (P2P2P), Cloud (R2), Semi-Cloud, Make-as-a-Service. What varies is where each store lives and who runs the build. See Operating Modes for the full topology matrix.

jazzisnow jinflow is a jazzisnow product
v0.45.1 · built 2026-04-17 08:14 UTC