Pipeline Overview
How source-system data is transformed from raw exports into a diagnosed, vendor-agnostic analytical dataset.
0. Two Phases: Pre-Make and Make
Before we get to the layer-by-layer transformation, there is a structural split at the top of the pipeline.
pre-make gathers all inputs. It extracts source data from the DLZ (governed by the extraction contract), pulls notes and bookmarks from the SIS, publishes them as YAML files in the AFS, and commits to git. Pre-make has side effects — it touches the outside world.
make is a pure function: AFS in → KLS out. No SIS mutation, no external fetches, no network calls. Every build is reproducible from the AFS commit it was built on. This is the security boundary — whoever controls make controls the analytical truth.
For the full architecture (including the five operating modes), see Operating Modes.
1. Starting Point: The Source System Export
Every organization manages its operational data through one or more source systems — ERP platforms, industry-specific software, or custom tools. Each system has its own data model, its own column names, its own conventions. Source systems supported by domain packs include Skidata, CargoWise, SAP MM, Navision, and many more.
The organization exports its data as a set of files — typically Excel workbooks or CSV files. These files are the raw truth. They contain everything — but they are messy, vendor-specific, and not ready for analysis.
2. Conversion: From Vendor Format to Canonical CSVs
The first step is to bring the source-system export into a standardised CSV format that the rest of the pipeline expects. This is where vendor-specific knowledge lives.
A dedicated extraction script reads the export files and produces canonical CSV files. The following tables illustrate a typical pack’s entity model (each pack defines its own entities):
Dimension tables (what things are)
| CSV file | What it describes | Key fields |
|---|---|---|
| cost_centers.csv | Organisational units (departments, wards, operational areas) | ID, name, type |
| materials.csv | Every article tracked in the inventory system — consumables, instruments, supplies | ID, description, group, standard price, unit, billable flag |
| material_classifications.csv | A 3-level hierarchy classifying materials (e.g. 9 → 90 → 90.01: Pharma → Internal pharmacy → Internal preparations) | Code, French name, German name |
| packaging_types.csv | How materials are packaged (piece, box, bottle, etc.) | Code, French name, German name |
| suppliers.csv | Companies that deliver materials | Code, name |
| article_suppliers.csv | Which supplier delivers which material, with their own reference number | Material ID, supplier code, supplier reference, creation date |
Fact tables (what happened)
| CSV file | What it records | Key fields |
|---|---|---|
| cases.csv | Operational cases or transactions | Case token, start/end time, cost center, type |
| procedures.csv | Procedures performed during cases | Procedure ID, type, case token, timestamp, duration |
| case_material_usage.csv | What items were consumed for each case | Usage ID, case token, material ID, quantity, timestamp, unit price |
| billing_events.csv | What was invoiced for each case | Billing ID, case token, material ID, quantity, billed amount, cost center, status |
| material_movements.csv | Physical movement of goods between locations | Movement ID, material ID, quantity, from/to location, timestamp, type (RECEIVE/TRANSFER/ISSUE/RETURN) |
The conversion handles real-world messiness:
- Source workbooks may have multiple sheets with different column layouts
- IDs are mixed alphanumeric (e.g. “638876A” alongside “5299613”)
- Cost center names may carry site prefixes
- Sentinel dates like `9999-12-31` are stripped
- Entities discovered in usage/billing but absent from master data are backfilled
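Two of these clean-ups can be pictured in a few lines. This is an illustrative sketch, not the extraction script itself; the sentinel set and the shape of the backfilled stub record are assumptions.

```python
from __future__ import annotations

# Hypothetical sentinel values; real extractors pin the exact set per vendor.
SENTINEL_DATES = {"9999-12-31"}

def clean_date(raw: str | None) -> str | None:
    """Strip vendor placeholder dates that mean 'no real date'."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in SENTINEL_DATES else value

def backfill_materials(master: dict[str, dict], usage_rows: list[dict]) -> dict[str, dict]:
    """Create stub master records for material IDs seen in usage but missing from master data."""
    out = dict(master)
    for row in usage_rows:
        mid = row["material_id"]
        if mid not in out:
            out[mid] = {"id": mid, "description": "(backfilled from usage)"}
    return out
```

Backfilled stubs keep the referential-integrity checks in Silver meaningful: the row exists, and its provenance is explicit.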
For synthetic tenants (used for development and testing), a data generator produces the same CSV files with deterministic, seeded random data — different organizational profiles, different defect patterns, same structure.
Extractor Discipline
Extraction is a trust boundary. Everything that crosses it must be explicit, pinned, verified on every run, logged, and failure-intolerant. The configuration IS the contract, and the contract is the audit trail.
Every tenant has a pipeline.yml at the AFS root that lists every file crossing the extract layer:
- Exact path — no globs, no “pick the latest” heuristics
- SHA-256 pin — hash mismatches fail the build hard
- Source type — one of six: `system_export`, `expert_curation`, `reference_data`, `pipeline_config`, `api_feed`, `db_extract`
- Expected schema — required sheets, required columns, minimum row counts
- Attribution — who delivered it, when, why, downstream consumers
Every extraction run appends a record to afs/state/extract_log.jsonl — an append-only, git-tracked audit trail that answers questions like “which exact file did we read last March? what was its hash?” by simply reading a file.
Anomalies — missing files, hash drift, missing sheets, missing columns, extractor write outside declared output — are errors, not warnings. No silent fallbacks.
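The fail-hard posture can be illustrated with a minimal hash check plus log append. The function names and the log-record fields here are assumptions; the real contract lives in `pipeline.yml` and the extraction code.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Stream the file in chunks so large exports never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_pin(path: str, pinned_sha256: str,
               log_path: str = "afs/state/extract_log.jsonl") -> None:
    """Hash drift is an error, not a warning: a mismatch fails the build hard."""
    actual = sha256_of(path)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "sha256": actual,
        "ok": actual == pinned_sha256,
    }
    with open(log_path, "a") as log:  # append-only audit trail
        log.write(json.dumps(record) + "\n")
    if actual != pinned_sha256:
        raise RuntimeError(
            f"hash mismatch for {path}: expected {pinned_sha256}, got {actual}"
        )
```

Logging before raising matters: even a failed run leaves a record of what was read and why it was rejected.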
Read the full rules in the Extraction Guide or the design doc docs/design/extractor_discipline.md.
3. The Medallion Architecture: Bronze → Silver → Gold
Once CSV files are in place, the data flows through three layers of transformation. Each layer has a clear responsibility. Nothing is lost; everything is traceable.
3.1 Bronze: Structural Ingestion
What it does: Reads each CSV file exactly as-is into the database. No transformation, no filtering, no judgment.
What it adds:
- `source_file` — which CSV file this row came from (for traceability)
- `row_number` — the row's position in the file (for debugging)
Why it matters: Bronze is the faithful record. If something goes wrong downstream, you can always trace back to the exact row in the exact file.
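Reduced to its essence, a Bronze reader does nothing but copy and annotate. A sketch (the function name is ours, not the pipeline's):

```python
import csv

def bronze_ingest(csv_path: str) -> list:
    """Read a canonical CSV as-is, adding only the two traceability columns."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            row["source_file"] = csv_path   # which file this row came from
            row["row_number"] = i           # position within that file
            rows.append(row)
    return rows
```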
There are 13 Bronze models — one per CSV file:
| Bronze model | Source CSV |
|---|---|
| bronze_cases | cases.csv |
| bronze_procedures | procedures.csv |
| bronze_materials | materials.csv |
| bronze_cost_centers | cost_centers.csv |
| bronze_case_material_usage | case_material_usage.csv |
| bronze_billing_events | billing_events.csv |
| bronze_material_movements | material_movements.csv |
| bronze_material_classifications | material_classifications.csv |
| bronze_suppliers | suppliers.csv |
| bronze_article_suppliers | article_suppliers.csv |
| bronze_packaging_types | packaging_types.csv |
| bronze_taxonomy_nodes | taxonomy_nodes.csv |
| bronze_taxonomy_member_mappings | taxonomy_member_mappings.csv |
Source-system dispatch: This is the only place where vendor-specific column names appear. A dispatch mechanism translates vendor columns to canonical names. For example, one source system’s case_token stays case_token, while another’s AUFNR or No_ becomes case_token.
From Bronze onward, the data model is completely source-system-agnostic.
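The dispatch can be pictured as a per-vendor rename table. The mappings below are illustrative and show only `case_token`; real packs map every canonical column.

```python
# Illustrative vendor-to-canonical column maps; real packs define full mappings.
DISPATCH = {
    "skidata": {"case_token": "case_token"},  # already canonical
    "sap_mm": {"AUFNR": "case_token"},
    "navision": {"No_": "case_token"},
}

def to_canonical(vendor: str, row: dict) -> dict:
    """Rename known vendor columns; pass everything else through unchanged."""
    mapping = DISPATCH[vendor]
    return {mapping.get(col, col): value for col, value in row.items()}
```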
3.2 Silver: Validation and Domain Logic
What it does: Takes each Bronze table and applies domain knowledge — cleaning, type casting, validation, and referential integrity checks.
Key operations:
- Trimming and normalisation: Remove leading/trailing whitespace, normalise case (`inpatient` not `Inpatient`)
- Type casting: Convert strings to proper types — timestamps, decimals, booleans (handling multilingual values like `oui`/`yes`/`true`/`1`)
- Surrogate key generation: Each row gets a stable, unique key derived from its source file and row number
- Validation: Every row is checked against business rules and gets two columns:
  - `is_valid` — true or false
  - `invalid_reason` — a human-readable explanation when false (e.g. "missing case_token", "orphan material_id", "invalid billing_status")
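In spirit, each Silver validator is a pure function from a raw row to the same row plus the two verdict columns. A compressed sketch for billing events, covering only a subset of the rules in the table that follows; the function shape is an assumption:

```python
VALID_STATUSES = {"POSTED", "PENDING", "REVERSED"}

def validate_billing_event(row: dict, valid_cases: set, valid_materials: set) -> dict:
    """Annotate, never drop: invalid rows stay in the table with a readable reason."""
    reason = None
    if not row.get("case_token"):
        reason = "missing case_token"
    elif row["case_token"] not in valid_cases:
        reason = "orphan case_token"
    elif row.get("material_id") not in valid_materials:
        reason = "orphan material_id"
    elif row.get("billing_status") not in VALID_STATUSES:
        reason = "invalid billing_status"
    return {**row, "is_valid": reason is None, "invalid_reason": reason}
```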
What gets validated per entity:
| Entity | Validation rules |
|---|---|
| Cases | Must have a case token; start_time must parse as a timestamp; case_type must be one of inpatient/outpatient/day-case |
| Materials | Must have a material ID; standard_price must be a positive number; material_group must be IMPLANT, CONSUMABLE, KIT, INSTRUMENT, or MEDICATION_MAT |
| Billing events | Must have billing_id, case_token, material_id; case_token must reference a valid case; material_id must reference a valid material; billed_amount must parse as a number; billing_status must be POSTED, PENDING, or REVERSED |
| Material usage | Must have usage_id, case_token, material_id; foreign keys checked against valid cases and materials; quantity and unit_price must be numeric |
| Material movements | Must have movement_id, material_id; movement_type must be RECEIVE, TRANSFER, ISSUE, RETURN, or ADJUST; quantity must be numeric |
| Procedures | Must have procedure_id and case_token; case must exist; duration must be numeric if present |
| Cost centres | Must have cost_center_id and name |
| Suppliers | Must have supplier_code and supplier_name |
| Article-suppliers | Must have material_id and supplier_code; both must reference valid master records |
The design principle: no silent filtering. Silver never drops a row. A billing event that references a non-existent case is marked is_valid = false with invalid_reason = 'orphan case_token' — it stays in the table, visible and queryable. This is essential: knowing what’s wrong with your data is as valuable as having clean data.
3.3 Gold: The Certified Product
What it does: Filters Silver to only valid rows. This is the final, trusted dataset — the product contract.
Gold models are deliberately simple:
```sql
select case_key, case_token, start_time, end_time, cost_center_id, case_type
from silver_cases
where is_valid = true
```

Gold produces 13 entities that form the canonical data model:
Core dimensions (what things are):
- Cases — Operational cases, each with a start/end time, cost centre, and type
- Materials — Every article with its price, group, description, and classification codes
- Cost centres — Organisational units (departments, operational areas)
- Procedures — Procedures linked to cases
- Material classifications — 3-level product hierarchy (L1 → L2 → L3)
- Suppliers — Material vendors
- Article-suppliers — Which supplier delivers which material
- Packaging types — How materials are packaged
Core facts (what happened):
- Case material usage — What was consumed, when, for which case, at what price
- Billing events — What was invoiced, when, for which case, at what amount
- Material movements — How goods moved physically between locations
Taxonomies (how things are organised):
- Taxonomy nodes — Hierarchical trees (e.g. organisational units, cost centre groupings) with computed depth, materialised path, and leaf-node flags
- Taxonomy member mappings — Links between tree nodes and dimension members (e.g. “cost centre X belongs to node Y in the OU-structure taxonomy”)
- Taxonomy closure — A pre-computed table of all ancestor-descendant relationships, enabling efficient hierarchical queries (“show me all cost centres under the Surgery department, at any depth”)
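The closure table can be derived from plain parent pointers. A sketch of the computation; the dictionary-of-parents representation is an assumption:

```python
def taxonomy_closure(parent: dict) -> set:
    """All (ancestor, descendant, depth) pairs, including each node at depth 0."""
    pairs = set()
    for node in parent:
        ancestor, depth = node, 0
        while ancestor is not None:  # walk up to the root
            pairs.add((ancestor, node, depth))
            ancestor = parent[ancestor]
            depth += 1
    return pairs
```

With the closure precomputed, "all cost centres under Surgery, at any depth" becomes a single equality filter on the ancestor column instead of a recursive query.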
The Gold layer is formally versioned through a contract (gold_contract.v1.json) that defines every entity, its primary key, and its fields with types. Signals and downstream applications code against this contract, not against implementation details.
The supply_org dimension on pricing
gold_article_catalogues, gold_material_movements, and gold_price_series carry a supply_org column — the smallest organizational unit that independently owns assortment, pricing, and fulfillment responsibility for a set of materials. In nuMetrix, canonical values are PHA (pharmacy) and LOG (logistics). The correct pricing grain is (material, supply_org), not material alone. A new supply_org_matched rule in the price palette returns one price per (material, supply_org) pair. Existing rules are unchanged — the fix is additive. See Supply-Org Pricing.
4. Quality as a First-Class Citizen
The pipeline doesn't just transform data — it measures its own quality at every stage.
Silver Quality Metrics
For each of the 9 transactional entities, the pipeline reports:
- Total rows ingested
- How many are valid vs. invalid
- The invalidity rate as a percentage
This answers: “How clean is the data we received?”
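The per-entity metric is simple arithmetic over the verdict column. A sketch:

```python
def silver_quality(rows: list) -> dict:
    """Totals, valid/invalid counts, and the invalidity rate for one entity."""
    total = len(rows)
    invalid = sum(1 for r in rows if not r["is_valid"])
    return {
        "total": total,
        "valid": total - invalid,
        "invalid": invalid,
        "invalid_pct": round(100 * invalid / total, 2) if total else 0.0,
    }
```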
Gold Quality Metrics
Compares Silver to Gold row counts for each entity:
- How many rows were in Silver
- How many made it to Gold
- How many were lost (and what percentage)
This answers: “How much data did we have to exclude, and is that expected?”
Reconciliation
Cross-checks financial totals:
- Total usage value (quantity × unit price) vs. total billed amount
- Highlights discrepancies per cost centre and time period
This answers: “Do the numbers add up?”
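At its core the reconciliation is two sums and a difference, grouped in practice by cost centre and period. A minimal ungrouped sketch:

```python
def reconcile(usage_rows: list, billing_rows: list) -> dict:
    """Total usage value (quantity x unit price) versus total billed amount."""
    usage_total = sum(r["quantity"] * r["unit_price"] for r in usage_rows)
    billed_total = sum(r["billed_amount"] for r in billing_rows)
    return {
        "usage_total": usage_total,
        "billed_total": billed_total,
        "gap": usage_total - billed_total,  # positive: more consumed than billed
    }
```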
5. Multi-Tenant and Cross-Organization Analytics
Each tenant gets its own isolated schema in the database. Tenant A’s data is in tenant_a.*, Tenant B’s is in tenant_b.*. They never mix at the Bronze, Silver, or Gold level.
The Platform layer sits above all tenants. It creates union views across organizations:
```
platform.platform_gold_cases =
    tenant_a.gold_cases (with tenant_id = 'tenant_a')
    UNION ALL
    tenant_b.gold_cases (with tenant_id = 'tenant_b')
    UNION ALL ...
```

This enables cross-tenant analytics — comparing costs, billing patterns, quality metrics, and signal findings across organizations — while preserving tenant isolation at the data layer.
6. Detection: The Signal Framework
Once the data is certified and unified in Gold, the real analytical work begins. Signals are automated diagnostic queries that scan the dataset for anomalies, inefficiencies, and potential financial risks.
How Signals Work
Each signal:
- Operates exclusively on Gold entities (never touches Bronze or Silver)
- Scans for a specific pattern or anomaly
- Produces standardised findings — each finding has a continuous score (0–100, with display labels derived from ranges), the entity affected, a time bucket, the estimated impact, polarity, direction, and supporting evidence
The Findings Contract
Every signal emits rows with the same structure:
| Field | Meaning |
|---|---|
| finding_id | Unique identifier for this finding |
| tenant_id | Which organization |
| signal_id | Which signal detected this |
| score | Continuous significance (0–100); display labels are derived from ranges |
| polarity | positive / negative / neutral |
| direction | improving / stable / worsening |
| entity_type | What kind of entity is affected (Case, Material, etc.) |
| entity_id | Which specific entity |
| time_bucket | When (typically year-month) |
| impact | Structured exposure — value + unit + dimension (financial / operational / quality / cultural / strategic) |
| evidence | Machine-readable details (JSON) |
Current Signals
Revenue Leakage (balance signal, high severity) Compares total usage value against total billed amount per case. When an organization records 5,000 worth of consumption for a case but only bills 3,500, that’s 1,500 of potential revenue leakage. The signal flags cases where usage exceeds billing by more than 2%, with severity tiers at 5% (medium) and 20% (high).
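The thresholds above translate directly into a tiering function. A sketch; the "low" label for the 2–5% band is our assumption, since only the medium and high tiers are named:

```python
def leakage_severity(usage_value: float, billed: float):
    """Flag cases where usage exceeds billing by more than 2%, tiered at 5% and 20%."""
    if usage_value <= 0:
        return None
    gap_pct = (usage_value - billed) / usage_value * 100
    if gap_pct > 20:
        return "high"
    if gap_pct > 5:
        return "medium"
    if gap_pct > 2:
        return "low"  # assumption: band between the 2% flag and the 5% tier
    return None
```

For the example above, 5,000 consumed against 3,500 billed is a 30% gap, landing in the high tier.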
Missing Mandatory Items (mandatory-item signal, high severity) Checks that qualifying cases actually have the corresponding item usage recorded. A qualifying transaction without documented item consumption signals a documentation or billing gap.
Cost Centre Billing Mismatch (balance signal, medium severity) Detects cases where billing events are posted to a different cost centre than the case’s assigned cost centre. If a case belongs to Department A but items are billed to Department B, that’s a cost attribution error that distorts departmental P&L.
I/O Coefficient (hand-written signal, medium-high severity) Compares warehouse TRANSFER quantities against documented USAGE quantities for each item per month. In a balanced system, the I/O coefficient should be close to 1.0 — what leaves the warehouse should roughly equal what’s documented as consumed. Deviations indicate untracked ward-level consumption (coefficient < 1: more goes out than is documented) or over-delivery (coefficient > 1: more is documented than shipped). Filtered to materials priced CHF 1–99 to focus on high-volume consumables where systematic drift has the largest cumulative impact.
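The coefficient itself is a per-material, per-month ratio. A sketch; the 10% tolerance band around 1.0 is an assumption, not a documented threshold:

```python
def io_coefficient(transfer_qty: float, usage_qty: float):
    """Documented usage divided by warehouse transfers; None when nothing was shipped."""
    if transfer_qty == 0:
        return None
    return usage_qty / transfer_qty

def classify_io(coeff: float, tol: float = 0.1) -> str:
    if coeff < 1 - tol:
        return "untracked consumption"  # more left the warehouse than was documented
    if coeff > 1 + tol:
        return "over-documentation"     # more documented than shipped
    return "balanced"
```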
Signal Types
The signal framework supports two authoring modes:
Declarative signals (for repeatable patterns):
- Balance signals compare two aggregates per entity and flag when they diverge beyond a tolerance
- Mandatory-item signals check that qualifying entities have required related items
- Distribution-outlier signals flag statistical anomalies using z-scores
These are compiled from YAML definitions into SQL, validated against the Gold contract.
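A declarative balance signal might look roughly like this in YAML. Every field name here is illustrative; the actual schema is defined by the signal compiler, not by this sketch:

```yaml
# Illustrative only — not the pack's real schema.
signal: revenue_leakage
kind: balance
entity: gold_cases
left: sum(usage.quantity * usage.unit_price)   # total usage value per case
right: sum(billing.billed_amount)              # total billed per case
tolerance_pct: 2                               # flag when left exceeds right by > 2%
severity_tiers:
  medium: 5
  high: 20
```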
Hand-written signals (for unique analytical shapes): When a signal needs to join across fact tables at different grains or apply logic that doesn’t fit the YAML templates, it’s written directly as SQL. It still follows the findings output contract and integrates into the same platform union.
Platform Findings
All signal findings from all tenants are unioned into platform.platform_probe_findings, enabling cross-tenant comparison of anomaly patterns, severity distributions, and aggregate financial risk.
7. Outlook: What Comes Next
Section titled “7. Outlook: What Comes Next”Automated Intake
Today, the organization delivers Excel files which are manually placed in the intake directory. The next step is to automate this — an SFTP drop folder or upload endpoint that triggers the pipeline automatically when new files arrive.
Incremental Loads
The current pipeline does full rebuilds: every run processes all data from scratch. For tenants with growing datasets, this will need to evolve into incremental processing — only ingesting new or changed rows since the last run. This affects every layer:
- Bronze must detect and append only new rows
- Silver must re-validate only affected entities
- Gold and signals must refresh only the time periods that changed
Temporal Validity of Reference Data
Reference data — materials, cost centres, suppliers, classifications — changes over time. A material’s price today is not what it was six months ago. A cost centre might be renamed, split, or decommissioned. Currently, the pipeline treats reference data as a snapshot: whatever the export says now is the truth.
The next evolution is temporal reference data management: tracking when each reference record was valid. This means:
- Every dimension record gets a `valid_from` and `valid_to` date
- Fact records (usage, billing, movements) are joined to the reference data version that was active at the time of the event
- Signals can detect pricing anomalies by comparing the price at the time of billing against the current standard price
- Historical analysis becomes accurate: a report for Q3 2024 uses Q3 2024 prices, not today’s prices
This is essential for longitudinal analysis across multi-year datasets.
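The point-in-time join described above reduces to an interval lookup. A sketch, assuming half-open [valid_from, valid_to) validity intervals:

```python
from datetime import date

def price_at(price_versions: list, material_id: str, on: date):
    """Return the price from the version whose validity interval covers the event date."""
    for v in price_versions:
        if (
            v["material_id"] == material_id
            and v["valid_from"] <= on < v["valid_to"]
        ):
            return v["price"]
    return None
```

A report for Q3 2024 would then resolve each fact row through `price_at` with the event's own date, rather than today's snapshot price.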
Additional Signal Types
The framework is designed to grow. Planned signal directions include:
- Duplicate detection — finding billing events that appear to be entered twice
- Timing anomalies — billing events that occur before the associated usage
- Stale pricing — materials where the billed unit price deviates significantly from the current standard price
- Phantom references — usage records pointing to materials that don’t exist in the master
- Partial billing — cases where billed quantity is systematically less than used quantity
Cost Centre–Level I/O Analysis
The current I/O coefficient signal works at the material level. Extending it to the cost centre level requires resolving the mapping between warehouse location codes (used in movements) and cost centre codes (used in cases). Once this mapping exists, the organization can see which departments are over- or under-consuming relative to what’s delivered — a key insight for inventory management.
Related
The pipeline described here is the same across all five operating modes — Local, Proxy (P2P2P), Cloud (R2), Semi-Cloud, Make-as-a-Service. What varies is where each store lives and who runs the build. See Operating Modes for the full topology matrix.