Extraction
Extraction is a trust boundary. Everything that crosses it must be listed, pinned, verified, logged, and failure-intolerant.
The configuration IS the contract, and the contract is the audit trail.
Extraction is where the pipeline reaches into the real world — into shared folders, USB sticks, Sapheneia exports, manually-curated Excel workbooks — and pulls bytes into the analytical pipeline. Everything downstream rests on the assumption that this layer answers one question correctly: which exact file was read, and was its content what we expected?
This guide describes the extraction contract: the pipeline.yml file, the six source types, the zero-transform rule, and the CLI surface.
The Seven Principles
These are stated as rules, not preferences. They apply together.
1. Deterministic, not heuristic
No globs. No “latest version wins”. No sorted(dir)[-1]. Every file is pinned by an exact path from a well-defined root. When a new version arrives, somebody updates the configuration — a git-reviewed, auditable change — not the filesystem state.
2. Explicit, not conventional
Reading pipeline.yml tells you the complete universe of what jin make --extract will ever touch. No hidden knowledge inside Python scripts. Every file is listed, by hand, with its exact path, its declared purpose, and its expected shape.
3. Purpose-bound
Every extraction entry declares why the file is being read — in the config, in plain language, reviewable by a person who has never seen the code. Six months from now, anybody reading the config knows exactly what the file is for without having to dig.
4. Content-pinned, not just path-pinned
Every source file declares its expected SHA-256 hash. On every extraction run, the observed hash is compared to the declared hash. If they don’t match, extraction fails hard — the pipeline refuses to proceed, and the operator must either update the config to accept the new hash (a reviewable change) or investigate why the file content changed unexpectedly.
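The hash comparison behind this rule fits in a few lines. This is a sketch, not jinflow’s actual implementation: the function name and error wording are invented here, and the only assumption is that the declared hash comes from the config entry.

```python
import hashlib
from pathlib import Path

def verify_pinned_hash(path: Path, declared_sha256: str) -> str:
    """Hash the file on disk and compare it to the hash pinned in config.

    Returns the observed hash on match; raises on mismatch so the
    pipeline fails hard instead of proceeding with unexpected content.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large exports never load fully into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    observed = h.hexdigest()
    if observed != declared_sha256:
        raise RuntimeError(
            f"{path}: declared sha256 {declared_sha256} "
            f"but observed {observed}; refusing to extract"
        )
    return observed
```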
5. Anomalies are errors, not warnings
At this layer, the list of conditions that must cause a hard failure is long and unforgiving:
- Source file missing at the declared path
- Source file hash does not match the declared hash
- Expected sheet not found in the workbook
- Required column(s) missing from the header
- Extraction produces zero rows when min_rows > 0 is declared
- Output file write fails (permission, disk full, path missing)
- Extractor script exits non-zero
- Extractor writes to any path not listed in the config’s output field
No warnings users can ignore. No “continue anyway” flags. No silent fallbacks.
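What “errors, not warnings” means for the sheet-level conditions can be sketched as follows. The function name and message wording are hypothetical; the point is that every violation raises, and there is no logging-and-continuing path.

```python
def check_sheet_contract(header, n_rows, required_columns, min_rows):
    """Verify one parsed sheet against its declared contract.

    header: list of column names found in the sheet.
    n_rows: number of data rows extracted.
    Every violation raises; there is deliberately no warning path.
    """
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise RuntimeError(f"required column(s) missing from header: {missing}")
    if n_rows == 0 and min_rows > 0:
        raise RuntimeError("extraction produced zero rows but min_rows > 0")
    if n_rows < min_rows:
        raise RuntimeError(f"only {n_rows} rows; contract declares min_rows={min_rows}")
```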
6. Zero-transform
Extractors are dumb. They open the file at the pinned path, verify its hash, parse the declared structure, and emit CSV rows that map one-to-one to the workbook content. No filtering, no calculations, no “smart” handling of missing values, no unit conversions, no date normalization, no deduplication.
Anything that looks like decision-making about the data belongs in Bronze or later — never here.
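To illustrate the zero-transform rule, here is a sketch of the emit step with hypothetical names, using the stdlib csv module. A real extractor parses the pinned workbook first; the point shown here is only that rows pass through one-to-one, untouched.

```python
import csv
from pathlib import Path

def emit_rows_verbatim(rows, columns, out_path: Path) -> int:
    """Write rows one-to-one to CSV. No filtering, no type coercion,
    no fixing of 'bad' values: what was in the sheet lands on disk."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for row in rows:
            # Missing cells become empty strings; nothing is invented.
            writer.writerow({c: row.get(c, "") for c in columns})
    return len(rows)
```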
7. Fully traceable, per run
Every extraction run produces a permanent, append-only audit record in afs/state/extract_log.jsonl — git-tracked, one JSON record per run per entry. Over time the log becomes a queryable history: “when did we last extract the E00 file? what hash? was it successful? how long did it take?” All answerable by reading a file.
Source Taxonomy
Not all ingress is the same. The extraction framework declares what kind of thing each source is, because different kinds need different governance.
| Type | Example | Characteristics | Special governance |
|---|---|---|---|
| system_export | OPALE cases.xlsx, SAP MM AUFK | Machine-generated, reproducible, schema-stable | Automated verification |
| expert_curation | Raffa’s E00_V0.2.xlsx with priority flags | Human-authored, judgmental, schema-volatile | author, reviewed_by, justification required |
| reference_data | ICD-10 codes, ATC, MIGEL, FX rates | External authority, rarely changes | publisher, license, upstream_url required |
| pipeline_config | Severity thresholds, exclusion lists | Instructions to the pipeline | author, reviewed_by, impact_statement, affects required |
| api_feed | HL7 FHIR, REST webhook | Real-time or scheduled pull from a live system | Future phase |
| db_extract | Read-only query against an operational DB | Direct pull, pinned query, connection reference | Future phase |
Why the taxonomy matters: a finding derived from an OPALE system export and a finding derived from a hand-curated spreadsheet have different epistemic weight. The source type travels with the data all the way to the Observation layer and becomes part of the validation surface.
The pipeline.yml Shape
The config lives at afs/scripts/pipeline.yml in the tenant AFS — one config per tenant, fully self-contained, no pack-level inheritance.
```yaml
extract:
  - id: opale_cases_v2026_02
    script: extract_opale_xlsx_csv.py
    purpose:
      summary: "OPALE clinical export — cases, procedures, materials, billing events"
      delivered_by: "Sapheneia (Raffa channel)"
      delivered_at: "2026-02-17"
      comment: "Monthly clinical export. 68K cases, 2.4M billing lines."
    source:
      type: system_export
      path: "opale/xslx/V2026-02/cases.xlsx"   # DLZ-relative; no absolutes
      sha256: "a3b5c7f89012e8d4cdef0123456789abcdef0123456789abcdef0123456789ab"
      size_bytes: 12948192
    output:
      - { path: "build/csv/cases.csv", required: true }
      - { path: "build/csv/procedures.csv", required: true }
      - { path: "build/csv/billing_events.csv", required: true }
    expected:
      sheets:
        - name: "Fälle"
          required_columns: ["ABT", "FALL", "EINTRITT", "AUSTRITT"]
          min_rows: 10000
        - name: "Leistungen"
          required_columns: ["FALL", "LEISTUNG", "BETRAG"]
          min_rows: 100000

  - id: e00_ou_structure_v0_2
    script: extract_e00_abt_kst.py
    purpose:
      summary: "ABT department structure with priority flags and ABT→KST mapping"
      delivered_by: "Raffa (clinical consultant)"
      delivered_at: "2026-03-20"
      comment: "V0.2 replaces V0.1. Column E (Prio) drives is_priority downstream."
    source:
      type: expert_curation
      path: "opale/misc/e00/E00_OU_Structur_RAF_V0.2.xlsx"
      sha256: "d41d8cd98f00b204e9800998ecf8427e1234567890abcdef1234567890abcdef"
      author:
        name: "Raffa"
        role: "Clinical consultant"
        curated_at: "2026-03-20"
    output:
      - { path: "build/csv/abt_departments.csv", required: true }
      - { path: "build/csv/abt_kst_mapping.csv", required: true }
    expected:
      sheets:
        - name: "Tabelle1"
          header_contains: "ABT Code"
          required_columns:
            - "ABT Code"
            - "ABT Name"
            - "Prio"
            - "Korresp. KST Code"
          min_rows: 100
```
Field reference (headline)
| Field | Required | Purpose |
|---|---|---|
| id | yes | Stable identifier for the log and CLI. Unique within the tenant. |
| script | yes | Filename under afs/scripts/. Must exist and be executable. |
| purpose.summary | yes | One-line human description |
| purpose.delivered_by | yes | Who delivered this file, through what channel |
| purpose.delivered_at | yes | ISO date when this version was delivered |
| source.type | yes | One of the six source taxonomy types |
| source.path | yes | DLZ-relative, exact path. No globs, no wildcards. |
| source.sha256 | yes | Full 64-char SHA-256 of the file content |
| source.size_bytes | no | Belt-and-suspenders truncation check |
| output | yes | List of every file the extractor may write. Writing elsewhere is an error. |
| expected.sheets | recommended | Per-sheet parse contract (name, required columns, min rows) |
expert_curation, reference_data, and pipeline_config types require additional attribution fields (author, reviewer, justification, publisher, license, etc.). The config loader rejects incomplete entries before extraction starts.
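The loader’s rejection step could look like the sketch below. The per-type field lists mirror the taxonomy table above; the function name, error wording, and exact field names are assumptions for illustration, not the actual loader.

```python
# Per-type attribution requirements, mirroring the governance column of
# the source taxonomy table. Field names are assumptions for this sketch.
EXTRA_REQUIRED = {
    "expert_curation": ("author", "reviewed_by", "justification"),
    "reference_data": ("publisher", "license", "upstream_url"),
    "pipeline_config": ("author", "reviewed_by", "impact_statement", "affects"),
}

def validate_entry(entry: dict) -> None:
    """Reject an incomplete config entry before any extraction starts."""
    for field in ("id", "script", "purpose", "source", "output"):
        if field not in entry:
            raise ValueError(f"entry missing required field: {field!r}")
    source = entry["source"]
    for field in ("type", "path", "sha256"):
        if field not in source:
            raise ValueError(f"source missing required field: {field!r}")
    # Governed source types need their extra attribution fields, too.
    for field in EXTRA_REQUIRED.get(source["type"], ()):
        if field not in source:
            raise ValueError(
                f"{source['type']} source missing attribution field: {field!r}")
```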
The Extract Log
afs/state/extract_log.jsonl — append-only, git-tracked, one JSON record per run per entry. Every record captures:
- Unique run ID (timestamp + git commit)
- Entry ID that fired, declared + resolved path, declared + observed SHA-256, hash match flag
- Source file size and mtime at time of read
- Every output path written, with its SHA-256 and row count
- Sheet/column verification summary
- Outcome (ok / error) with full error details on failure
- Start, end, duration, user, machine, jinflow version, git commit
The log is git-tracked: every build run produces new log lines that become part of the AFS commit. Over a year of daily builds, the file accumulates a few thousand lines — human-readable, git-diffable, queryable with jq, Python, or DuckDB’s read_json_auto.
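For example, answering “when did we last successfully extract this entry?” is a linear scan over the log. This sketch assumes hypothetical record keys entry_id and outcome, matching the field list above; the real record schema may differ.

```python
import json
from pathlib import Path

def last_successful_run(log_path: Path, entry_id: str):
    """Scan the append-only JSONL log and return the most recent
    successful record for one entry, or None if it never ran clean."""
    latest = None
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("entry_id") == entry_id and record.get("outcome") == "ok":
                latest = record  # later lines win: the log is append-only
    return latest
```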
Extraction as Phase 0 of Make
Extraction is not a sidecar — it’s the first phase of every build.
```sh
jin make                 # Phase 0 (extract) → enrich → compile → build
jin make --skip-extract  # Skip Phase 0 when you know sources haven't changed
jin make --extract-only  # Run Phase 0 only, stop after extraction
```
Incremental detection
- For each entry, read the source file’s current hash.
- Compare to the declared hash in config. If mismatch → hard error (the config is the authority, not the file).
- Compare to the last successful extraction’s source hash (from the log). If same → skip. If different → re-extract.
- After re-extraction, compare the output hashes to the previous run’s outputs. If outputs actually changed → mark downstream as stale.
A build against unchanged sources skips extraction entirely (fast, no I/O). A build after a new delivery (operator already pinned the new hash in config) re-extracts the affected entry and marks its downstream stale.
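The steps above collapse into one small decision per entry. The function name and return values in this sketch are illustrative, not jinflow’s actual API.

```python
def extraction_decision(declared_sha256, on_disk_sha256, last_run_sha256):
    """Classify one entry per the incremental-detection steps.

    last_run_sha256 is the source hash recorded by the last successful
    extraction in the log, or None if the entry never ran.
    Returns 'error', 'skip', or 'extract'.
    """
    if on_disk_sha256 != declared_sha256:
        # The config is the authority, not the file: hard error.
        return "error"
    if last_run_sha256 == on_disk_sha256:
        # Same content already extracted successfully: nothing to do.
        return "skip"
    # New pinned version (or first ever run): re-extract.
    return "extract"
```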
Inspecting the Contract — jin extract
jin make does extraction; jin extract sees extraction. No overlap.
```sh
jin extract             # show every entry with current status
jin extract --list      # machine-readable: id, path, status, last hash
jin extract --check     # verify all declared hashes without running extraction
jin extract --contract  # render the full contract (purpose, source, output, expected)
                        # human-readable: who, what, when, why, downstream impact
```
jin extract --contract is the tool you reach for when you want to audit a tenant’s ingress surface in one go. It renders every entry’s purpose section, source type, attribution, and downstream consumers — the complete story, without having to read code.
jin extract --check is the go-to verification step before a build. It walks every entry, hashes the current file on disk, and tells you: matches, mismatches, missing files, size drift. No extraction actually runs.
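The check walk can be sketched as pure verification over the parsed config: hash every declared source, classify it, write nothing. This covers only the hash and existence checks (not size drift), and the function name and DLZ-root parameter are assumptions.

```python
import hashlib
from pathlib import Path

def check_entries(entries, dlz_root: Path):
    """Hash every declared source on disk and classify each entry as
    'match', 'mismatch', or 'missing'. Nothing is extracted."""
    report = {}
    for entry in entries:
        path = dlz_root / entry["source"]["path"]
        if not path.exists():
            report[entry["id"]] = "missing"
            continue
        observed = hashlib.sha256(path.read_bytes()).hexdigest()
        declared = entry["source"]["sha256"]
        report[entry["id"]] = "match" if observed == declared else "mismatch"
    return report
```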
Scope — Pack vs Tenant
Extractor scripts are pack-level assets. They know how to parse a given kind of workbook — OPALE sheet conventions, SAP MM AUFK exports, winery scale-weigh exports. They live in packs/&lt;pack&gt;/scripts/ and are copied into each tenant’s afs/scripts/ at jin init --pack time.
The pipeline.yml that tells the scripts what to do is tenant-specific. Two tenants in the same pack, using the same source system, at the same point in time, still have different pipeline.yml files — different file deliveries, different delivery channels, different auxiliary files, different file versions at different times. There is no pack-level pipeline.yml, no inheritance, no cascade. Each tenant’s config is fully self-contained.
Boilerplate at this layer is readable. Inheritance at this layer is dangerous.
Related
- Pipeline Overview — how extract, bronze, silver, and gold fit together
- Make Guide — the build pipeline including pre-make
- Tenants Guide — tenant layout and AFS structure
- Design doc: docs/design/extractor_discipline.md — full threat model and migration path