Pipeline Overview
How source-system data is transformed from raw exports into a diagnosed, vendor-agnostic analytical dataset.
0. Two Phases: Pre-Make and Make
Before we get to the layer-by-layer transformation, there is a structural split at the top of the pipeline.
pre-make gathers all inputs. It extracts source data from the DLZ (governed by the extraction contract), pulls notes and bookmarks from the SIS, publishes them as YAML files in the AFS, and commits to git. Pre-make has side effects — it touches the outside world.
make is a pure function: AFS in → KLS out. No SIS mutation, no external fetches, no network calls. Every build is reproducible from the AFS commit it was built on. This is the security boundary — whoever controls make controls the analytical truth.
For the full architecture (including the five operating modes), see Operating Modes.
1. Starting Point: The Source System Export
Every organization manages its operational data through one or more source systems — ERP platforms, industry-specific software, or custom tools. Each system has its own data model, its own column names, its own conventions. Source systems supported by domain packs include Skidata, CargoWise, SAP MM, Navision, and many more.
The organization exports its data as a set of files — typically Excel workbooks or CSV files. These files are the raw truth. They contain everything — but they are messy, vendor-specific, and not ready for analysis.
2. Conversion: From Vendor Format to Canonical CSVs
The first step is to bring the source-system export into a standardised CSV format that the rest of the pipeline expects. This is where vendor-specific knowledge lives.
A dedicated extraction script reads the export files and produces canonical CSV files. The following tables illustrate a typical pack’s entity model (each pack defines its own entities):
Dimension tables (what things are)
| CSV file | What it describes | Key fields |
|---|---|---|
| cost_centers.csv | Organisational units (departments, wards, operational areas) | ID, name, type |
| materials.csv | Every article tracked in the inventory system — consumables, instruments, supplies | ID, description, group, standard price, unit, billable flag |
| material_classifications.csv | A 3-level hierarchy classifying materials (e.g. 9 → 90 → 90.01: Pharma → Internal pharmacy → Internal preparations) | Code, French name, German name |
| packaging_types.csv | How materials are packaged (piece, box, bottle, etc.) | Code, French name, German name |
| suppliers.csv | Companies that deliver materials | Code, name |
| article_suppliers.csv | Which supplier delivers which material, with their own reference number | Material ID, supplier code, supplier reference, creation date |
Fact tables (what happened)
| CSV file | What it records | Key fields |
|---|---|---|
| cases.csv | Operational cases or transactions | Case token, start/end time, cost center, type |
| procedures.csv | Procedures performed during cases | Procedure ID, type, case token, timestamp, duration |
| case_material_usage.csv | What items were consumed for each case | Usage ID, case token, material ID, quantity, timestamp, unit price |
| billing_events.csv | What was invoiced for each case | Billing ID, case token, material ID, quantity, billed amount, cost center, status |
| material_movements.csv | Physical movement of goods between locations | Movement ID, material ID, quantity, from/to location, timestamp, type (RECEIVE/TRANSFER/ISSUE/RETURN) |
The conversion handles real-world messiness:
- Source workbooks may have multiple sheets with different column layouts
- IDs are mixed alphanumeric (e.g. “638876A” alongside “5299613”)
- Cost center names may carry site prefixes
- Sentinel dates like `9999-12-31` are stripped
- Entities discovered in usage/billing but absent from master data are backfilled
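Two of these clean-ups can be pictured in a few lines. This is an illustrative sketch, not the extraction script itself; the sentinel set and the shape of the backfilled stub record are assumptions.

```python
from __future__ import annotations

# Hypothetical sentinel values; real extractors pin the exact set per vendor.
SENTINEL_DATES = {"9999-12-31"}

def clean_date(raw: str | None) -> str | None:
    """Strip vendor placeholder dates that mean 'no real date'."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in SENTINEL_DATES else value

def backfill_materials(master: dict[str, dict], usage_rows: list[dict]) -> dict[str, dict]:
    """Create stub master records for material IDs seen in usage but missing from master data."""
    out = dict(master)
    for row in usage_rows:
        mid = row["material_id"]
        if mid not in out:
            out[mid] = {"id": mid, "description": "(backfilled from usage)"}
    return out
```

Backfilled stubs keep the referential-integrity checks in Silver meaningful: the row exists, and its provenance is explicit.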
For synthetic tenants (used for development and testing), a data generator produces the same CSV files with deterministic, seeded random data — different organizational profiles, different defect patterns, same structure.
Extractor Discipline
Extraction is a trust boundary. Everything that crosses it must be explicit, pinned, verified on every run, logged, and failure-intolerant. The configuration IS the contract, and the contract is the audit trail.
Every tenant has a pipeline.yml at the AFS root that lists every file crossing the extract layer:
- Exact path — no globs, no “pick the latest” heuristics
- SHA-256 pin — hash mismatches fail the build hard
- Source type — one of six: `system_export`, `expert_curation`, `reference_data`, `pipeline_config`, `api_feed`, `db_extract`
- Expected schema — required sheets, required columns, minimum row counts
- Attribution — who delivered it, when, why, downstream consumers
Every extraction run appends a record to afs/state/extract_log.jsonl — an append-only, git-tracked audit trail that answers questions like “which exact file did we read last March? what was its hash?” by simply reading a file.
Anomalies — missing files, hash drift, missing sheets, missing columns, extractor write outside declared output — are errors, not warnings. No silent fallbacks.
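The fail-hard posture can be illustrated with a minimal hash check plus log append. The function names and the log-record fields here are assumptions; the real contract lives in `pipeline.yml` and the extraction code.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Stream the file in chunks so large exports never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_pin(path: str, pinned_sha256: str,
               log_path: str = "afs/state/extract_log.jsonl") -> None:
    """Hash drift is an error, not a warning: a mismatch fails the build hard."""
    actual = sha256_of(path)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "sha256": actual,
        "ok": actual == pinned_sha256,
    }
    with open(log_path, "a") as log:  # append-only audit trail
        log.write(json.dumps(record) + "\n")
    if actual != pinned_sha256:
        raise RuntimeError(
            f"hash mismatch for {path}: expected {pinned_sha256}, got {actual}"
        )
```

Logging before raising matters: even a failed run leaves a record of what was read and why it was rejected.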
Read the full rules in the Extraction Guide or the design doc docs/design/extractor_discipline.md.
3. The Medallion Architecture: Bronze → Silver → Gold
Once CSV files are in place, the data flows through three layers of transformation. Each layer has a clear responsibility. Nothing is lost; everything is traceable.
3.1 Bronze: Structural Ingestion
What it does: Reads each CSV file exactly as-is into the database. No transformation, no filtering, no judgment.
What it adds:
- `source_file` — which CSV file this row came from (for traceability)
- `row_number` — the row's position in the file (for debugging)
Why it matters: Bronze is the faithful record. If something goes wrong downstream, you can always trace back to the exact row in the exact file.
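Reduced to its essence, a Bronze reader does nothing but copy and annotate. A sketch (the function name is ours, not the pipeline's):

```python
import csv

def bronze_ingest(csv_path: str) -> list:
    """Read a canonical CSV as-is, adding only the two traceability columns."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            row["source_file"] = csv_path   # which file this row came from
            row["row_number"] = i           # position within that file
            rows.append(row)
    return rows
```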
There are 13 Bronze models — one per CSV file:
| Bronze model | Source CSV |
|---|---|
| bronze_cases | cases.csv |
| bronze_procedures | procedures.csv |
| bronze_materials | materials.csv |
| bronze_cost_centers | cost_centers.csv |
| bronze_case_material_usage | case_material_usage.csv |
| bronze_billing_events | billing_events.csv |
| bronze_material_movements | material_movements.csv |
| bronze_material_classifications | material_classifications.csv |
| bronze_suppliers | suppliers.csv |
| bronze_article_suppliers | article_suppliers.csv |
| bronze_packaging_types | packaging_types.csv |
| bronze_taxonomy_nodes | taxonomy_nodes.csv |
| bronze_taxonomy_member_mappings | taxonomy_member_mappings.csv |
Source-system dispatch: This is the only place where vendor-specific column names appear. A dispatch mechanism translates vendor columns to canonical names. For example, one source system’s case_token stays case_token, while another’s AUFNR or No_ becomes case_token.
From Bronze onward, the data model is completely source-system-agnostic.
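The dispatch can be pictured as a per-vendor rename table. The mappings below are illustrative and show only `case_token`; real packs map every canonical column.

```python
# Illustrative vendor-to-canonical column maps; real packs define full mappings.
DISPATCH = {
    "skidata": {"case_token": "case_token"},  # already canonical
    "sap_mm": {"AUFNR": "case_token"},
    "navision": {"No_": "case_token"},
}

def to_canonical(vendor: str, row: dict) -> dict:
    """Rename known vendor columns; pass everything else through unchanged."""
    mapping = DISPATCH[vendor]
    return {mapping.get(col, col): value for col, value in row.items()}
```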
3.2 Silver: Validation and Domain Logic
What it does: Takes each Bronze table and applies domain knowledge — cleaning, type casting, validation, and referential integrity checks.
Key operations:
- Trimming and normalisation: Remove leading/trailing whitespace, normalise case (`inpatient` not `Inpatient`)
- Type casting: Convert strings to proper types — timestamps, decimals, booleans (handling multilingual values like `oui`/`yes`/`true`/`1`)
- Surrogate key generation: Each row gets a stable, unique key derived from its source file and row number
- Validation: Every row is checked against business rules and gets two columns:
  - `is_valid` — true or false
  - `invalid_reason` — a human-readable explanation when false (e.g. "missing case_token", "orphan material_id", "invalid billing_status")
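In spirit, each Silver validator is a pure function from a raw row to the same row plus the two verdict columns. A compressed sketch for billing events, covering only a subset of the rules in the table that follows; the function shape is an assumption:

```python
VALID_STATUSES = {"POSTED", "PENDING", "REVERSED"}

def validate_billing_event(row: dict, valid_cases: set, valid_materials: set) -> dict:
    """Annotate, never drop: invalid rows stay in the table with a readable reason."""
    reason = None
    if not row.get("case_token"):
        reason = "missing case_token"
    elif row["case_token"] not in valid_cases:
        reason = "orphan case_token"
    elif row.get("material_id") not in valid_materials:
        reason = "orphan material_id"
    elif row.get("billing_status") not in VALID_STATUSES:
        reason = "invalid billing_status"
    return {**row, "is_valid": reason is None, "invalid_reason": reason}
```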
What gets validated per entity:
| Entity | Validation rules |
|---|---|
| Cases | Must have a case token; start_time must parse as a timestamp; case_type must be one of inpatient/outpatient/day-case |
| Materials | Must have a material ID; standard_price must be a positive number; material_group must be IMPLANT, CONSUMABLE, KIT, INSTRUMENT, or MEDICATION_MAT |
| Billing events | Must have billing_id, case_token, material_id; case_token must reference a valid case; material_id must reference a valid material; billed_amount must parse as a number; billing_status must be POSTED, PENDING, or REVERSED |
| Material usage | Must have usage_id, case_token, material_id; foreign keys checked against valid cases and materials; quantity and unit_price must be numeric |
| Material movements | Must have movement_id, material_id; movement_type must be RECEIVE, TRANSFER, ISSUE, RETURN, or ADJUST; quantity must be numeric |
| Procedures | Must have procedure_id and case_token; case must exist; duration must be numeric if present |
| Cost centres | Must have cost_center_id and name |
| Suppliers | Must have supplier_code and supplier_name |
| Article-suppliers | Must have material_id and supplier_code; both must reference valid master records |
The design principle: no silent filtering. Silver never drops a row. A billing event that references a non-existent case is marked is_valid = false with invalid_reason = 'orphan case_token' — it stays in the table, visible and queryable. This is essential: knowing what’s wrong with your data is as valuable as having clean data.
3.3 Gold: The Certified Product
What it does: Filters Silver to only valid rows. This is the final, trusted dataset — the product contract.
Gold models are deliberately simple:
```sql
select case_key, case_token, start_time, end_time, cost_center_id, case_type
from silver_cases
where is_valid = true
```

Gold produces 13 entities that form the canonical data model:
Core dimensions (what things are):
- Cases — Operational cases, each with a start/end time, cost centre, and type
- Materials — Every article with its price, group, description, and classification codes
- Cost centres — Organisational units (departments, operational areas)
- Procedures — Procedures linked to cases
- Material classifications — 3-level product hierarchy (L1 → L2 → L3)
- Suppliers — Material vendors
- Article-suppliers — Which supplier delivers which material
- Packaging types — How materials are packaged
Core facts (what happened):
- Case material usage — What was consumed, when, for which case, at what price
- Billing events — What was invoiced, when, for which case, at what amount
- Material movements — How goods moved physically between locations
Taxonomies (how things are organised):
- Taxonomy nodes — Hierarchical trees (e.g. organisational units, cost centre groupings) with computed depth, materialised path, and leaf-node flags
- Taxonomy member mappings — Links between tree nodes and dimension members (e.g. “cost centre X belongs to node Y in the OU-structure taxonomy”)
- Taxonomy closure — A pre-computed table of all ancestor-descendant relationships, enabling efficient hierarchical queries (“show me all cost centres under the Surgery department, at any depth”)
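The closure table can be derived from plain parent pointers. A sketch of the computation; the dictionary-of-parents representation is an assumption:

```python
def taxonomy_closure(parent: dict) -> set:
    """All (ancestor, descendant, depth) pairs, including each node at depth 0."""
    pairs = set()
    for node in parent:
        ancestor, depth = node, 0
        while ancestor is not None:  # walk up to the root
            pairs.add((ancestor, node, depth))
            ancestor = parent[ancestor]
            depth += 1
    return pairs
```

With the closure precomputed, "all cost centres under Surgery, at any depth" becomes a single equality filter on the ancestor column instead of a recursive query.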
The Gold layer is formally versioned through a contract (gold_contract.v1.json) that defines every entity, its primary key, and its fields with types. Signals and downstream applications code against this contract, not against implementation details.
The supply_org dimension on pricing
gold_article_catalogues, gold_material_movements, and gold_price_series carry a supply_org column — the smallest organizational unit that independently owns assortment, pricing, and fulfillment responsibility for a set of materials. In nuMetrix, canonical values are PHA (pharmacy) and LOG (logistics). The correct pricing grain is (material, supply_org), not material alone. A new supply_org_matched rule in the price palette returns one price per (material, supply_org) pair. Existing rules are unchanged — the fix is additive. See Supply-Org Pricing.
4. Quality as a First-Class Citizen
The pipeline doesn't just transform data — it measures its own quality at every stage.
Silver Quality Metrics
For each of the 9 transactional entities, the pipeline reports:
- Total rows ingested
- How many are valid vs. invalid
- The invalidity rate as a percentage
This answers: “How clean is the data we received?”
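The per-entity metric is simple arithmetic over the verdict column. A sketch:

```python
def silver_quality(rows: list) -> dict:
    """Totals, valid/invalid counts, and the invalidity rate for one entity."""
    total = len(rows)
    invalid = sum(1 for r in rows if not r["is_valid"])
    return {
        "total": total,
        "valid": total - invalid,
        "invalid": invalid,
        "invalid_pct": round(100 * invalid / total, 2) if total else 0.0,
    }
```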
Gold Quality Metrics
Compares Silver to Gold row counts for each entity:
- How many rows were in Silver
- How many made it to Gold
- How many were lost (and what percentage)
This answers: “How much data did we have to exclude, and is that expected?”
Reconciliation
Cross-checks financial totals:
- Total usage value (quantity × unit price) vs. total billed amount
- Highlights discrepancies per cost centre and time period
This answers: “Do the numbers add up?”
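At its core the reconciliation is two sums and a difference, grouped in practice by cost centre and period. A minimal ungrouped sketch:

```python
def reconcile(usage_rows: list, billing_rows: list) -> dict:
    """Total usage value (quantity x unit price) versus total billed amount."""
    usage_total = sum(r["quantity"] * r["unit_price"] for r in usage_rows)
    billed_total = sum(r["billed_amount"] for r in billing_rows)
    return {
        "usage_total": usage_total,
        "billed_total": billed_total,
        "gap": usage_total - billed_total,  # positive: more consumed than billed
    }
```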
5. Multi-Tenant and Cross-Organization Analytics
Each tenant gets its own isolated schema in the database. Tenant A’s data is in tenant_a.*, Tenant B’s is in tenant_b.*. They never mix at the Bronze, Silver, or Gold level.
The Platform layer sits above all tenants. It creates union views across organizations:
```
platform.platform_gold_cases =
    tenant_a.gold_cases (with tenant_id = 'tenant_a')
    UNION ALL
    tenant_b.gold_cases (with tenant_id = 'tenant_b')
    UNION ALL ...
```

This enables cross-tenant analytics — comparing costs, billing patterns, quality metrics, and signal findings across organizations — while preserving tenant isolation at the data layer.
6. Detection: The Signal Framework
Once the data is certified and unified in Gold, the real analytical work begins. Signals are automated diagnostic queries that scan the dataset for anomalies, inefficiencies, and potential financial risks.
How Signals Work
Each signal:
- Operates exclusively on Gold entities (never touches Bronze or Silver)
- Scans for a specific pattern or anomaly
- Produces standardised findings — each finding has a continuous score (0–100, with display labels derived from ranges), the entity affected, a time bucket, the estimated impact, polarity, direction, and supporting evidence
The Findings Contract
Every signal emits rows with the same structure:
| Field | Meaning |
|---|---|
| finding_id | Unique identifier for this finding |
| tenant_id | Which organization |
| signal_id | Which signal detected this |
| score | Continuous significance (0–100); display labels are derived from ranges |
| polarity | positive / negative / neutral |
| direction | improving / stable / worsening |
| entity_type | What kind of entity is affected (Case, Material, etc.) |
| entity_id | Which specific entity |
| time_bucket | When (typically year-month) |
| impact | Structured exposure — value + unit + dimension (financial / operational / quality / cultural / strategic) |
| evidence | Machine-readable details (JSON) |
Current Signals
Revenue Leakage (balance signal, high severity) Compares total usage value against total billed amount per case. When an organization records 5,000 worth of consumption for a case but only bills 3,500, that’s 1,500 of potential revenue leakage. The signal flags cases where usage exceeds billing by more than 2%, with severity tiers at 5% (medium) and 20% (high).
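The thresholds above translate directly into a tiering function. A sketch; the "low" label for the 2–5% band is our assumption, since only the medium and high tiers are named:

```python
def leakage_severity(usage_value: float, billed: float):
    """Flag cases where usage exceeds billing by more than 2%, tiered at 5% and 20%."""
    if usage_value <= 0:
        return None
    gap_pct = (usage_value - billed) / usage_value * 100
    if gap_pct > 20:
        return "high"
    if gap_pct > 5:
        return "medium"
    if gap_pct > 2:
        return "low"  # assumption: band between the 2% flag and the 5% tier
    return None
```

For the example above, 5,000 consumed against 3,500 billed is a 30% gap, landing in the high tier.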
Missing Mandatory Items (mandatory-item signal, high severity) Checks that qualifying cases actually have the corresponding item usage recorded. A qualifying transaction without documented item consumption signals a documentation or billing gap.
Cost Centre Billing Mismatch (balance signal, medium severity) Detects cases where billing events are posted to a different cost centre than the case’s assigned cost centre. If a case belongs to Department A but items are billed to Department B, that’s a cost attribution error that distorts departmental P&L.
I/O Coefficient (hand-written signal, medium-high severity) Compares warehouse TRANSFER quantities against documented USAGE quantities for each item per month. In a balanced system, the I/O coefficient should be close to 1.0 — what leaves the warehouse should roughly equal what’s documented as consumed. Deviations indicate untracked ward-level consumption (coefficient < 1: more goes out than is documented) or over-delivery (coefficient > 1: more is documented than shipped). Filtered to materials priced CHF 1–99 to focus on high-volume consumables where systematic drift has the largest cumulative impact.
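The coefficient itself is a per-material, per-month ratio. A sketch; the 10% tolerance band around 1.0 is an assumption, not a documented threshold:

```python
def io_coefficient(transfer_qty: float, usage_qty: float):
    """Documented usage divided by warehouse transfers; None when nothing was shipped."""
    if transfer_qty == 0:
        return None
    return usage_qty / transfer_qty

def classify_io(coeff: float, tol: float = 0.1) -> str:
    if coeff < 1 - tol:
        return "untracked consumption"  # more left the warehouse than was documented
    if coeff > 1 + tol:
        return "over-documentation"     # more documented than shipped
    return "balanced"
```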
Signal Types
The signal framework supports two authoring modes:
Declarative signals (for repeatable patterns):
- Balance signals compare two aggregates per entity and flag when they diverge beyond a tolerance
- Mandatory-item signals check that qualifying entities have required related items
- Distribution-outlier signals flag statistical anomalies using z-scores
These are compiled from YAML definitions into SQL, validated against the Gold contract.
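A declarative balance signal might look roughly like this in YAML. Every field name here is illustrative; the actual schema is defined by the signal compiler, not by this sketch:

```yaml
# Illustrative only — not the pack's real schema.
signal: revenue_leakage
kind: balance
entity: gold_cases
left: sum(usage.quantity * usage.unit_price)   # total usage value per case
right: sum(billing.billed_amount)              # total billed per case
tolerance_pct: 2                               # flag when left exceeds right by > 2%
severity_tiers:
  medium: 5
  high: 20
```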
Hand-written signals (for unique analytical shapes): When a signal needs to join across fact tables at different grains or apply logic that doesn’t fit the YAML templates, it’s written directly as SQL. It still follows the findings output contract and integrates into the same platform union.
Platform Findings
All signal findings from all tenants are unioned into platform.platform_probe_findings, enabling cross-tenant comparison of anomaly patterns, severity distributions, and aggregate financial risk.
7. Outlook: What Comes Next
Section titled “7. Outlook: What Comes Next”Automated Intake
Today, the organization delivers Excel files which are manually placed in the intake directory. The next step is to automate this — an SFTP drop folder or upload endpoint that triggers the pipeline automatically when new files arrive.
Incremental Loads
The current pipeline does full rebuilds: every run processes all data from scratch. For tenants with growing datasets, this will need to evolve into incremental processing — only ingesting new or changed rows since the last run. This affects every layer:
- Bronze must detect and append only new rows
- Silver must re-validate only affected entities
- Gold and signals must refresh only the time periods that changed
Temporal Validity of Reference Data
Reference data — materials, cost centres, suppliers, classifications — changes over time. A material’s price today is not what it was six months ago. A cost centre might be renamed, split, or decommissioned. Currently, the pipeline treats reference data as a snapshot: whatever the export says now is the truth.
The next evolution is temporal reference data management: tracking when each reference record was valid. This means:
- Every dimension record gets a `valid_from` and `valid_to` date
- Fact records (usage, billing, movements) are joined to the reference data version that was active at the time of the event
- Signals can detect pricing anomalies by comparing the price at the time of billing against the current standard price
- Historical analysis becomes accurate: a report for Q3 2024 uses Q3 2024 prices, not today’s prices
This is essential for longitudinal analysis across multi-year datasets.
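The point-in-time join described above reduces to an interval lookup. A sketch, assuming half-open [valid_from, valid_to) validity intervals:

```python
from datetime import date

def price_at(price_versions: list, material_id: str, on: date):
    """Return the price from the version whose validity interval covers the event date."""
    for v in price_versions:
        if (
            v["material_id"] == material_id
            and v["valid_from"] <= on < v["valid_to"]
        ):
            return v["price"]
    return None
```

A report for Q3 2024 would then resolve each fact row through `price_at` with the event's own date, rather than today's snapshot price.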
Additional Signal Types
The framework is designed to grow. Planned signal directions include:
- Duplicate detection — finding billing events that appear to be entered twice
- Timing anomalies — billing events that occur before the associated usage
- Stale pricing — materials where the billed unit price deviates significantly from the current standard price
- Phantom references — usage records pointing to materials that don’t exist in the master
- Partial billing — cases where billed quantity is systematically less than used quantity
Cost Centre–Level I/O Analysis
The current I/O coefficient signal works at the material level. Extending it to the cost centre level requires resolving the mapping between warehouse location codes (used in movements) and cost centre codes (used in cases). Once this mapping exists, the organization can see which departments are over- or under-consuming relative to what’s delivered — a key insight for inventory management.
Related
The pipeline described here is the same across all five operating modes — Local, Proxy (P2P2P), Cloud (R2), Semi-Cloud, Make-as-a-Service. What varies is where each store lives and who runs the build. See Operating Modes for the full topology matrix.