
Extraction


Extraction is a trust boundary. Everything that crosses it must be listed, pinned, verified, logged, and failure-intolerant.

The configuration IS the contract, and the contract is the audit trail.

Extraction is where the pipeline reaches into the real world — into shared folders, USB sticks, Sapheneia exports, manually-curated Excel workbooks — and pulls bytes into the analytical pipeline. Everything downstream rests on the assumption that this layer answers one question correctly: which exact file was read, and was its content what we expected?

This guide describes the extraction contract: the pipeline.yml file, the six source types, the zero-transform rule, and the CLI surface.


These are stated as rules, not preferences. They apply together.

No globs. No “latest version wins”. No `sorted(dir)[-1]`. Every file is pinned by an exact path from a well-defined root. When a new version arrives, somebody updates the configuration — a git-reviewed, auditable change — not the filesystem state.

Reading pipeline.yml tells you the complete universe of what jin make --extract will ever touch. No hidden knowledge inside Python scripts. Every file is listed, by hand, with its exact path, its declared purpose, and its expected shape.

Every extraction entry declares why the file is being read — in the config, in plain language, reviewable by a person who has never seen the code. Six months from now, anybody reading the config knows exactly what the file is for without having to dig.

Every source file declares its expected SHA-256 hash. On every extraction run, the observed hash is compared to the declared hash. If they don’t match, extraction fails hard — the pipeline refuses to proceed, and the operator must either update the config to accept the new hash (a reviewable change) or investigate why the file content changed unexpectedly.
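A minimal sketch of this check, assuming nothing about jinflow’s internals (function names are illustrative):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large workbooks need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()


def verify_source(path: Path, declared_sha256: str) -> None:
    """Fail hard on any mismatch: the config is the authority, not the file."""
    if not path.is_file():
        raise FileNotFoundError(f"source missing at declared path: {path}")
    observed = sha256_of(path)
    if observed != declared_sha256:
        raise ValueError(
            f"{path}: declared {declared_sha256[:12]}…, observed {observed[:12]}…"
        )
```

Note the asymmetry: a mismatch never updates the config automatically — the operator does, in a reviewable commit.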

At this layer, the list of conditions that must cause a hard failure is long and unforgiving:

  • Source file missing at the declared path
  • Source file hash does not match the declared hash
  • Expected sheet not found in the workbook
  • Required column(s) missing from the header
  • Extraction produces zero rows when min_rows > 0 is declared
  • Output file write fails (permission, disk full, path missing)
  • Extractor script exits non-zero
  • Extractor writes to any path not listed in the config’s output field

No warnings users can ignore. No “continue anyway” flags. No silent fallbacks.
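Several of these conditions reduce to blunt comparisons. A sketch of two of them, with hypothetical names:

```python
def check_min_rows(row_count: int, min_rows: int) -> None:
    """Fewer rows than the declared min_rows is a hard error, not a warning."""
    if row_count < min_rows:
        raise ValueError(f"extracted {row_count} rows, expected at least {min_rows}")


def check_output_allowlist(written: set, declared: set) -> None:
    """Writing any path not listed in the config's output field is a hard error."""
    stray = sorted(set(written) - set(declared))
    if stray:
        raise ValueError(f"extractor wrote undeclared path(s): {stray}")
```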

Extractors are dumb. They open the file at the pinned path, verify its hash, parse the declared structure, and emit CSV rows that map one-to-one to the workbook content. No filtering, no calculations, no “smart” handling of missing values, no unit conversions, no date normalization, no deduplication.

Anything that looks like decision-making about the data belongs in Bronze or later — never here.
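The body of a conforming extractor is correspondingly small. A sketch of the emit step (names hypothetical), copying parsed rows one-to-one with no coercion:

```python
import csv
from pathlib import Path


def emit_csv(rows: list, out_path: Path) -> int:
    """Write rows exactly as parsed: one CSV row per source row, blanks stay blank."""
    if not rows:
        raise ValueError("refusing to write an empty extract")
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)  # no filtering, no conversions, no dedup
    return len(rows)
```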

Every extraction run produces a permanent, append-only audit record in afs/state/extract_log.jsonl — git-tracked, one JSON record per run per entry. Over time the log becomes a queryable history: “when did we last extract the E00 file? what hash? was it successful? how long did it take?” All answerable by reading a file.


Not all ingress is the same. The extraction framework declares what kind of thing each source is, because different kinds need different governance.

| Type | Example | Characteristics | Special governance |
| --- | --- | --- | --- |
| system_export | OPALE cases.xlsx, SAP MM AUFK | Machine-generated, reproducible, schema-stable | Automated verification |
| expert_curation | Raffa’s E00_V0.2.xlsx with priority flags | Human-authored, judgmental, schema-volatile | author, reviewed_by, justification required |
| reference_data | ICD-10 codes, ATC, MIGEL, FX rates | External authority, rarely changes | publisher, license, upstream_url required |
| pipeline_config | Severity thresholds, exclusion lists | Instructions to the pipeline | author, reviewed_by, impact_statement, affects required |
| api_feed | HL7 FHIR, REST webhook | Real-time or scheduled pull from a live system | Future phase |
| db_extract | Read-only query against an operational DB | Direct pull, pinned query, connection reference | Future phase |

Why the taxonomy matters: a finding derived from an OPALE system export and a finding derived from a hand-curated spreadsheet have different epistemic weight. The source type travels with the data all the way to the Observation layer and becomes part of the validation surface.


Lives at afs/scripts/pipeline.yml in the tenant AFS — one config per tenant, fully self-contained, no pack-level inheritance.

```yaml
extract:
  - id: opale_cases_v2026_02
    script: extract_opale_xlsx_csv.py
    purpose:
      summary: "OPALE clinical export — cases, procedures, materials, billing events"
      delivered_by: "Sapheneia (Raffa channel)"
      delivered_at: "2026-02-17"
      comment: "Monthly clinical export. 68K cases, 2.4M billing lines."
    source:
      type: system_export
      path: "opale/xslx/V2026-02/cases.xlsx"   # DLZ-relative; no absolutes
      sha256: "a3b5c7f89012e8d4cdef0123456789abcdef0123456789abcdef0123456789ab"
      size_bytes: 12948192
    output:
      - { path: "build/csv/cases.csv", required: true }
      - { path: "build/csv/procedures.csv", required: true }
      - { path: "build/csv/billing_events.csv", required: true }
    expected:
      sheets:
        - name: "Fälle"
          required_columns: ["ABT", "FALL", "EINTRITT", "AUSTRITT"]
          min_rows: 10000
        - name: "Leistungen"
          required_columns: ["FALL", "LEISTUNG", "BETRAG"]
          min_rows: 100000

  - id: e00_ou_structure_v0_2
    script: extract_e00_abt_kst.py
    purpose:
      summary: "ABT department structure with priority flags and ABT→KST mapping"
      delivered_by: "Raffa (clinical consultant)"
      delivered_at: "2026-03-20"
      comment: "V0.2 replaces V0.1. Column E (Prio) drives is_priority downstream."
    source:
      type: expert_curation
      path: "opale/misc/e00/E00_OU_Structur_RAF_V0.2.xlsx"
      sha256: "d41d8cd98f00b204e9800998ecf8427e1234567890abcdef1234567890abcdef"
      author:
        name: "Raffa"
        role: "Clinical consultant"
        curated_at: "2026-03-20"
    output:
      - { path: "build/csv/abt_departments.csv", required: true }
      - { path: "build/csv/abt_kst_mapping.csv", required: true }
    expected:
      sheets:
        - name: "Tabelle1"
          header_contains: "ABT Code"
          required_columns:
            - "ABT Code"
            - "ABT Name"
            - "Prio"
            - "Korresp. KST Code"
          min_rows: 100
```
| Field | Required | Purpose |
| --- | --- | --- |
| id | yes | Stable identifier for the log and CLI. Unique within the tenant. |
| script | yes | Filename under afs/scripts/. Must exist and be executable. |
| purpose.summary | yes | One-line human description |
| purpose.delivered_by | yes | Who delivered this file, through what channel |
| purpose.delivered_at | yes | ISO date when this version was delivered |
| source.type | yes | One of the six source taxonomy types |
| source.path | yes | DLZ-relative, exact path. No globs, no wildcards. |
| source.sha256 | yes | Full 64-char SHA-256 of the file content |
| source.size_bytes | no | Belt-and-suspenders truncation check |
| output | yes | List of every file the extractor may write. Writing elsewhere is an error. |
| expected.sheets | recommended | Per-sheet parse contract (name, required columns, min rows) |

expert_curation, reference_data, and pipeline_config types require additional attribution fields (author, reviewer, justification, publisher, license, etc.). The config loader rejects incomplete entries before extraction starts.
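The rejection logic can be sketched as a pure function over the parsed entry. Field names follow the taxonomy table above; the real loader’s names may differ:

```python
# Extra attribution fields required per source type (per the taxonomy table).
EXTRA_REQUIRED = {
    "expert_curation": ["author", "reviewed_by", "justification"],
    "reference_data": ["publisher", "license", "upstream_url"],
    "pipeline_config": ["author", "reviewed_by", "impact_statement", "affects"],
}


def validate_entry(entry: dict) -> list:
    """Return the list of missing source fields; empty means the entry passes."""
    src = entry.get("source", {})
    missing = [f for f in ("type", "path", "sha256") if f not in src]
    for field in EXTRA_REQUIRED.get(src.get("type"), []):
        if field not in src:
            missing.append(field)
    return missing
```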


afs/state/extract_log.jsonl — append-only, git-tracked, one JSON record per run per entry. Every record captures:

  • Unique run ID (timestamp + git commit)
  • Entry ID that fired, declared + resolved path, declared + observed SHA-256, hash match flag
  • Source file size and mtime at time of read
  • Every output path written, with its SHA-256 and row count
  • Sheet/column verification summary
  • Outcome (ok / error) with full error details on failure
  • Start, end, duration, user, machine, jinflow version, git commit
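An illustrative record (wrapped here for readability — the log stores one line per record; all field names and values below are hypothetical, the real schema is whatever jinflow writes):

```json
{"run_id": "2026-04-17T08:14:02Z+3f9c1a2", "entry_id": "opale_cases_v2026_02",
 "declared_sha256": "a3b5c7f8…", "observed_sha256": "a3b5c7f8…", "hash_match": true,
 "outputs": [{"path": "build/csv/cases.csv", "rows": 68214, "sha256": "…"}],
 "outcome": "ok", "duration_s": 41.7, "user": "ops", "jinflow_version": "0.45.1"}
```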

The log is git-tracked: every build run produces new log lines that become part of the AFS commit. Over a year of daily builds, the file accumulates a few thousand lines — human-readable, git-diffable, queryable with jq, Python, or DuckDB’s read_json_auto.
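For example, answering “when did we last extract entry X, and did it succeed?” is a few lines of Python over the log (record field names are assumptions, not jinflow’s documented schema):

```python
import json
from pathlib import Path
from typing import Optional


def last_run(log_path: Path, entry_id: str) -> Optional[dict]:
    """Return the most recent log record for an entry, or None if never extracted."""
    latest = None
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("entry_id") == entry_id:
            latest = record  # append-only log: later lines are newer
    return latest
```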


Extraction is not a sidecar — it’s the first phase of every build.

```sh
jin make                  # Phase 0 (extract) → enrich → compile → build
jin make --skip-extract   # Skip Phase 0 when you know sources haven't changed
jin make --extract-only   # Run Phase 0 only, stop after extraction
```
  1. For each entry, read the source file’s current hash.
  2. Compare to the declared hash in config. If mismatch → hard error (the config is the authority, not the file).
  3. Compare to the last successful extraction’s source hash (from the log). If same → skip. If different → re-extract.
  4. After re-extraction, compare the output hashes to the previous run’s outputs. If outputs actually changed → mark downstream as stale.
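The per-entry decision in steps 1–3 can be sketched as a pure function (names hypothetical):

```python
from typing import Optional


def plan_entry(declared: str, on_disk: str, last_success: Optional[str]) -> str:
    """Decide Phase 0's action for one entry: 'error', 'skip', or 'extract'."""
    if on_disk != declared:
        return "error"    # config is the authority, not the file on disk
    if last_success == declared:
        return "skip"     # same bytes as the last successful extraction
    return "extract"      # new pinned version, or never extracted before
```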

A build against unchanged sources skips extraction entirely (fast, no I/O). A build after a new delivery (operator already pinned the new hash in config) re-extracts the affected entry and marks its downstream stale.


jin make performs extraction; jin extract inspects it. No overlap.

```sh
jin extract              # show every entry with current status
jin extract --list       # machine-readable: id, path, status, last hash
jin extract --check      # verify all declared hashes without running extraction
jin extract --contract   # render the full contract (purpose, source, output, expected)
                         # human-readable: who, what, when, why, downstream impact
```

jin extract --contract is the tool you reach for when you want to audit a tenant’s ingress surface in one go. It renders every entry’s purpose section, source type, attribution, and downstream consumers — the complete story, without having to read code.

jin extract --check is the go-to verification step before a build. It walks every entry, hashes the current file on disk, and tells you: matches, mismatches, missing files, size drift. No extraction actually runs.


Extractor scripts are pack-level assets. They know how to parse a given kind of workbook — OPALE sheet conventions, SAP MM AUFK exports, winery scale-weigh exports. They live in packs/<pack>/scripts/ and are copied into each tenant’s afs/scripts/ at jin init --pack time.

The pipeline.yml that tells the scripts what to do is tenant-specific. Two tenants in the same pack, using the same source system, at the same point in time, still have different pipeline.yml files — different file deliveries, different delivery channels, different auxiliary files, different file versions at different times. There is no pack-level pipeline.yml, no inheritance, no cascade. Each tenant’s config is fully self-contained.

Boilerplate at this layer is readable. Inheritance at this layer is dangerous.


  • Pipeline Overview — how extract, bronze, silver, and gold fit together
  • Make Guide — the build pipeline including pre-make
  • Tenants Guide — tenant layout and AFS structure
  • Design doc: docs/design/extractor_discipline.md — full threat model and migration path
jazzisnow jinflow is a jazzisnow product
v0.45.1 · built 2026-04-17 08:14 UTC