Extraction
Extraction is a trust boundary. Everything that crosses it must be listed, pinned, verified, logged, and failure-intolerant.
The configuration IS the contract, and the contract is the audit trail.
Extraction is where the pipeline reaches into the real world — into shared folders, USB sticks, Sapheneia exports, manually-curated Excel workbooks — and pulls bytes into the analytical pipeline. Everything downstream rests on the assumption that this layer answers one question correctly: which exact file was read, and was its content what we expected?
This guide describes the extraction contract: the pipeline.yml file, the six source types, the zero-transform rule, and the CLI surface.
The Seven Principles
These are stated as rules, not preferences. They apply together.
1. Deterministic, not heuristic
No globs. No “latest version wins”. No sorted(dir)[-1]. Every file is pinned by an exact path from a well-defined root. When a new version arrives, somebody updates the configuration — a git-reviewed, auditable change — not the filesystem state.
2. Explicit, not conventional
Reading pipeline.yml tells you the complete universe of what jin make --extract will ever touch. No hidden knowledge inside Python scripts. Every file is listed, by hand, with its exact path, its declared purpose, and its expected shape.
3. Purpose-bound
Every extraction entry declares why the file is being read — in the config, in plain language, reviewable by a person who has never seen the code. Six months from now, anybody reading the config knows exactly what the file is for without having to dig.
4. Content-pinned, not just path-pinned
Every source file declares its expected SHA-256 hash. On every extraction run, the observed hash is compared to the declared hash. If they don’t match, extraction fails hard — the pipeline refuses to proceed, and the operator must either update the config to accept the new hash (a reviewable change) or investigate why the file content changed unexpectedly.
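The hash comparison behind this rule fits in a few lines. This is a sketch, not jinflow’s actual implementation: the function name and error wording are invented here, and the only assumption is that the declared hash comes from the config entry.

```python
import hashlib
from pathlib import Path

def verify_pinned_hash(path: Path, declared_sha256: str) -> str:
    """Hash the file on disk and compare it to the hash pinned in config.

    Returns the observed hash on match; raises on mismatch so the
    pipeline fails hard instead of proceeding with unexpected content.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large exports never load fully into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    observed = h.hexdigest()
    if observed != declared_sha256:
        raise RuntimeError(
            f"{path}: declared sha256 {declared_sha256} "
            f"but observed {observed}; refusing to extract"
        )
    return observed
```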
5. Anomalies are errors, not warnings
At this layer, the list of conditions that must cause a hard failure is long and unforgiving:
- Source file missing at the declared path
- Source file hash does not match the declared hash
- Expected sheet not found in the workbook
- Required column(s) missing from the header
- Extraction produces zero rows when min_rows > 0 is declared
- Output file write fails (permission, disk full, path missing)
- Extractor script exits non-zero
- Extractor writes to any path not listed in the config’s output field
No warnings users can ignore. No “continue anyway” flags. No silent fallbacks.
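What “errors, not warnings” means for the sheet-level conditions can be sketched as follows. The function name and message wording are hypothetical; the point is that every violation raises, and there is no logging-and-continuing path.

```python
def check_sheet_contract(header, n_rows, required_columns, min_rows):
    """Verify one parsed sheet against its declared contract.

    header: list of column names found in the sheet.
    n_rows: number of data rows extracted.
    Every violation raises; there is deliberately no warning path.
    """
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise RuntimeError(f"required column(s) missing from header: {missing}")
    if n_rows == 0 and min_rows > 0:
        raise RuntimeError("extraction produced zero rows but min_rows > 0")
    if n_rows < min_rows:
        raise RuntimeError(f"only {n_rows} rows; contract declares min_rows={min_rows}")
```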
6. Zero-transform
Extractors are dumb. They open the file at the pinned path, verify its hash, parse the declared structure, and emit CSV rows that map one-to-one to the workbook content. No filtering, no calculations, no “smart” handling of missing values, no unit conversions, no date normalization, no deduplication.
Anything that looks like decision-making about the data belongs in Bronze or later — never here.
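To illustrate the zero-transform rule, here is a sketch of the emit step with hypothetical names, using the stdlib csv module. A real extractor parses the pinned workbook first; the point shown here is only that rows pass through one-to-one, untouched.

```python
import csv
from pathlib import Path

def emit_rows_verbatim(rows, columns, out_path: Path) -> int:
    """Write rows one-to-one to CSV. No filtering, no type coercion,
    no fixing of 'bad' values: what was in the sheet lands on disk."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for row in rows:
            # Missing cells become empty strings; nothing is invented.
            writer.writerow({c: row.get(c, "") for c in columns})
    return len(rows)
```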
7. Fully traceable, per run
Every extraction run produces a permanent, append-only audit record in afs/state/extract_log.jsonl — git-tracked, one JSON record per run per entry. Over time the log becomes a queryable history: “when did we last extract the E00 file? what hash? was it successful? how long did it take?” All answerable by reading a file.
Source Taxonomy
Not all ingress is the same. The extraction framework declares what kind of thing each source is, because different kinds need different governance.
| Type | Example | Characteristics | Special governance |
|---|---|---|---|
| system_export | OPALE cases.xlsx, SAP MM AUFK | Machine-generated, reproducible, schema-stable | Automated verification |
| expert_curation | Raffa’s E00_V0.2.xlsx with priority flags | Human-authored, judgmental, schema-volatile | author, reviewed_by, justification required |
| reference_data | ICD-10 codes, ATC, MIGEL, FX rates | External authority, rarely changes | publisher, license, upstream_url required |
| pipeline_config | Severity thresholds, exclusion lists | Instructions to the pipeline | author, reviewed_by, impact_statement, affects required |
| api_feed | HL7 FHIR, REST webhook | Real-time or scheduled pull from a live system | Future phase |
| db_extract | Read-only query against an operational DB | Direct pull, pinned query, connection reference | Future phase |
Why the taxonomy matters: a finding derived from an OPALE system export and a finding derived from a hand-curated spreadsheet have different epistemic weight. The source type travels with the data all the way to the Observation layer and becomes part of the validation surface.
The pipeline.yml Shape
The config lives at afs/scripts/pipeline.yml in the tenant AFS — one config per tenant, fully self-contained, no pack-level inheritance.
```yaml
extract:
  - id: opale_cases_v2026_02
    script: extract_opale_xlsx_csv.py
    purpose:
      summary: "OPALE clinical export — cases, procedures, materials, billing events"
      delivered_by: "Sapheneia (Raffa channel)"
      delivered_at: "2026-02-17"
      comment: "Monthly clinical export. 68K cases, 2.4M billing lines."
    source:
      type: system_export
      path: "opale/xslx/V2026-02/cases.xlsx"   # DLZ-relative; no absolutes
      sha256: "a3b5c7f89012e8d4cdef0123456789abcdef0123456789abcdef0123456789ab"
      size_bytes: 12948192
    output:
      - { path: "build/csv/cases.csv", required: true }
      - { path: "build/csv/procedures.csv", required: true }
      - { path: "build/csv/billing_events.csv", required: true }
    expected:
      sheets:
        - name: "Fälle"
          required_columns: ["ABT", "FALL", "EINTRITT", "AUSTRITT"]
          min_rows: 10000
        - name: "Leistungen"
          required_columns: ["FALL", "LEISTUNG", "BETRAG"]
          min_rows: 100000

  - id: e00_ou_structure_v0_2
    script: extract_e00_abt_kst.py
    purpose:
      summary: "ABT department structure with priority flags and ABT→KST mapping"
      delivered_by: "Raffa (clinical consultant)"
      delivered_at: "2026-03-20"
      comment: "V0.2 replaces V0.1. Column E (Prio) drives is_priority downstream."
    source:
      type: expert_curation
      path: "opale/misc/e00/E00_OU_Structur_RAF_V0.2.xlsx"
      sha256: "d41d8cd98f00b204e9800998ecf8427e1234567890abcdef1234567890abcdef"
      author:
        name: "Raffa"
        role: "Clinical consultant"
        curated_at: "2026-03-20"
    output:
      - { path: "build/csv/abt_departments.csv", required: true }
      - { path: "build/csv/abt_kst_mapping.csv", required: true }
    expected:
      sheets:
        - name: "Tabelle1"
          header_contains: "ABT Code"
          required_columns:
            - "ABT Code"
            - "ABT Name"
            - "Prio"
            - "Korresp. KST Code"
          min_rows: 100
```
Field reference (headline)
| Field | Required | Purpose |
|---|---|---|
| id | yes | Stable identifier for the log and CLI. Unique within the tenant. |
| script | yes | Filename under afs/scripts/. Must exist and be executable. |
| purpose.summary | yes | One-line human description |
| purpose.delivered_by | yes | Who delivered this file, through what channel |
| purpose.delivered_at | yes | ISO date when this version was delivered |
| source.type | yes | One of the six source taxonomy types |
| source.path | yes | DLZ-relative, exact path. No globs, no wildcards. |
| source.sha256 | yes | Full 64-char SHA-256 of the file content |
| source.size_bytes | no | Belt-and-suspenders truncation check |
| output | yes | List of every file the extractor may write. Writing elsewhere is an error. |
| expected.sheets | recommended | Per-sheet parse contract (name, required columns, min rows) |
expert_curation, reference_data, and pipeline_config types require additional attribution fields (author, reviewer, justification, publisher, license, etc.). The config loader rejects incomplete entries before extraction starts.
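The loader’s rejection step could look like the sketch below. The per-type field lists mirror the taxonomy table above; the function name, error wording, and exact field names are assumptions for illustration, not the actual loader.

```python
# Per-type attribution requirements, mirroring the governance column of
# the source taxonomy table. Field names are assumptions for this sketch.
EXTRA_REQUIRED = {
    "expert_curation": ("author", "reviewed_by", "justification"),
    "reference_data": ("publisher", "license", "upstream_url"),
    "pipeline_config": ("author", "reviewed_by", "impact_statement", "affects"),
}

def validate_entry(entry: dict) -> None:
    """Reject an incomplete config entry before any extraction starts."""
    for field in ("id", "script", "purpose", "source", "output"):
        if field not in entry:
            raise ValueError(f"entry missing required field: {field!r}")
    source = entry["source"]
    for field in ("type", "path", "sha256"):
        if field not in source:
            raise ValueError(f"source missing required field: {field!r}")
    # Governed source types need their extra attribution fields, too.
    for field in EXTRA_REQUIRED.get(source["type"], ()):
        if field not in source:
            raise ValueError(
                f"{source['type']} source missing attribution field: {field!r}")
```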
The Extract Log
afs/state/extract_log.jsonl — append-only, git-tracked, one JSON record per run per entry. Every record captures:
- Unique run ID (timestamp + git commit)
- Entry ID that fired, declared + resolved path, declared + observed SHA-256, hash match flag
- Source file size and mtime at time of read
- Every output path written, with its SHA-256 and row count
- Sheet/column verification summary
- Outcome (ok / error) with full error details on failure
- Start, end, duration, user, machine, jinflow version, git commit
The log is git-tracked: every build run produces new log lines that become part of the AFS commit. Over a year of daily builds, the file accumulates a few thousand lines — human-readable, git-diffable, queryable with jq, Python, or DuckDB’s read_json_auto.
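For example, answering “when did we last successfully extract this entry?” is a linear scan over the log. This sketch assumes hypothetical record keys entry_id and outcome, matching the field list above; the real record schema may differ.

```python
import json
from pathlib import Path

def last_successful_run(log_path: Path, entry_id: str):
    """Scan the append-only JSONL log and return the most recent
    successful record for one entry, or None if it never ran clean."""
    latest = None
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("entry_id") == entry_id and record.get("outcome") == "ok":
                latest = record  # later lines win: the log is append-only
    return latest
```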
Extraction as Phase 0 of Make
Extraction is not a sidecar — it’s the first phase of every build.
```sh
jin make                 # Phase 0 (extract) → enrich → compile → build
jin make --skip-extract  # Skip Phase 0 when you know sources haven't changed
jin make --extract-only  # Run Phase 0 only, stop after extraction
```
Incremental detection
- For each entry, read the source file’s current hash.
- Compare to the declared hash in config. If mismatch → hard error (the config is the authority, not the file).
- Compare to the last successful extraction’s source hash (from the log). If same → skip. If different → re-extract.
- After re-extraction, compare the output hashes to the previous run’s outputs. If outputs actually changed → mark downstream as stale.
A build against unchanged sources skips extraction entirely (fast, no I/O). A build after a new delivery (operator already pinned the new hash in config) re-extracts the affected entry and marks its downstream stale.
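The steps above collapse into one small decision per entry. The function name and return values in this sketch are illustrative, not jinflow’s actual API.

```python
def extraction_decision(declared_sha256, on_disk_sha256, last_run_sha256):
    """Classify one entry per the incremental-detection steps.

    last_run_sha256 is the source hash recorded by the last successful
    extraction in the log, or None if the entry never ran.
    Returns 'error', 'skip', or 'extract'.
    """
    if on_disk_sha256 != declared_sha256:
        # The config is the authority, not the file: hard error.
        return "error"
    if last_run_sha256 == on_disk_sha256:
        # Same content already extracted successfully: nothing to do.
        return "skip"
    # New pinned version (or first ever run): re-extract.
    return "extract"
```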
Inspecting the Contract — jin extract
jin make does extraction; jin extract sees extraction. No overlap.
```sh
jin extract             # show every entry with current status
jin extract --list      # machine-readable: id, path, status, last hash
jin extract --check     # verify all declared hashes without running extraction
jin extract --contract  # render the full contract (purpose, source, output, expected)
                        # human-readable: who, what, when, why, downstream impact
```
jin extract --contract is the tool you reach for when you want to audit a tenant’s ingress surface in one go. It renders every entry’s purpose section, source type, attribution, and downstream consumers — the complete story, without having to read code.
jin extract --check is the go-to verification step before a build. It walks every entry, hashes the current file on disk, and tells you: matches, mismatches, missing files, size drift. No extraction actually runs.
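The check walk can be sketched as pure verification over the parsed config: hash every declared source, classify it, write nothing. This covers only the hash and existence checks (not size drift), and the function name and DLZ-root parameter are assumptions.

```python
import hashlib
from pathlib import Path

def check_entries(entries, dlz_root: Path):
    """Hash every declared source on disk and classify each entry as
    'match', 'mismatch', or 'missing'. Nothing is extracted."""
    report = {}
    for entry in entries:
        path = dlz_root / entry["source"]["path"]
        if not path.exists():
            report[entry["id"]] = "missing"
            continue
        observed = hashlib.sha256(path.read_bytes()).hexdigest()
        declared = entry["source"]["sha256"]
        report[entry["id"]] = "match" if observed == declared else "mismatch"
    return report
```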
Scope — Pack vs Tenant
Extractor scripts are pack-level assets. They know how to parse a given kind of workbook — OPALE sheet conventions, SAP MM AUFK exports, winery scale-weigh exports. They live in packs/&lt;pack&gt;/scripts/ and are copied into each tenant’s afs/scripts/ at jin init --pack time.
The pipeline.yml that tells the scripts what to do is tenant-specific. Two tenants in the same pack, using the same source system, at the same point in time, still have different pipeline.yml files — different file deliveries, different delivery channels, different auxiliary files, different file versions at different times. There is no pack-level pipeline.yml, no inheritance, no cascade. Each tenant’s config is fully self-contained.
Boilerplate at this layer is readable. Inheritance at this layer is dangerous.
Related
- Pipeline Overview — how extract, bronze, silver, and gold fit together
- Make Guide — the build pipeline including pre-make
- Tenants Guide — tenant layout and AFS structure
- Design doc: docs/design/extractor_discipline.md — full threat model and migration path