The configuration IS the contract.

Extraction is a trust boundary.
Everything crossing it must be listed, pinned, verified, logged.

Heuristic extraction rests the pyramid on sand.

  • Battlefield of Excel versions — E00_V0.1.xlsx, E00_V0.2.xlsx, E00_V0.2_backup.xlsx, E00_V0.2 (copy).xlsx. Which one did we read last March?
  • Silent picking of the wrong file — "latest version by name sort" picks "Version Draft/" over "Version 3/" because 'D' > '3'. Build succeeds. Numbers are wrong.
  • No hash pinning — the delivered file can be replaced silently. The pipeline re-extracts the new content with no alarm. The CFO reads different numbers.
  • Anomalies degrade to warnings — missing files, parse errors, empty results caught in try/except. Downstream tables end up with missing rows. Nothing stops the line.
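
The name-sort failure in the second bullet is easy to reproduce. A minimal sketch, assuming the "latest version" heuristic is a plain lexicographic sort; the function name and the extra folder names are hypothetical:

```python
# Hypothetical folder names, modeled on the bullet above.
candidates = ["Version 1/", "Version 2/", "Version 3/", "Version Draft/"]


def latest_by_name(names):
    """The heuristic under criticism: the last name in sort order counts as newest."""
    return sorted(names)[-1]


picked = latest_by_name(candidates)
print(picked)  # Version Draft/  ('D' is code point 68, '3' is 51)
```

Nothing raises and nothing is logged. The build simply reads the draft.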

Stated as rules, not preferences.
They apply together.

1-3: Be Explicit
1. Deterministic, not heuristic. No globs.
2. Explicit, not conventional. No hidden knowledge in scripts.
3. Purpose-bound. Why this file, in plain language, in the config.
4-5: Fail Hard
4. Content-pinned. SHA-256 mismatch = hard fail.
5. Anomalies are errors, not warnings. No silent fallbacks. No "continue anyway" flags.
6-7: Be Honest
6. Zero-transform. Extractors copy bytes. No reshaping.
7. Fully traceable per run. Append-only JSONL log, git-tracked.
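
The two "Fail Hard" rules can be sketched in a few lines. This is an illustration under stated assumptions, not jinflow's actual code: ExtractionError, sha256_of and verify_pin are hypothetical names, and the only library used is Python's standard hashlib.

```python
import hashlib
from pathlib import Path


class ExtractionError(Exception):
    """Hypothetical error type: any anomaly raises, nothing downgrades to a warning."""


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pin(path: Path, declared_sha256: str) -> None:
    """Raise unless the file exists and its content matches the declared pin."""
    if not path.is_file():
        raise ExtractionError(f"declared source missing: {path}")
    observed = sha256_of(path)
    if observed != declared_sha256.lower():
        raise ExtractionError(
            f"SHA-256 mismatch for {path}: declared {declared_sha256}, observed {observed}"
        )
```

There is deliberately no flag to skip the check and no except clause around it. A mismatch stops the run.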

jin make does. jin extract sees.

jin make
Phase 0: extract
Phases 1-5: build
Incremental by hash.
jin extract
--contract (render)
--check (verify)
--list (machine-readable)
extract_log.jsonl
Append-only.
Git-tracked.
Every run, every file.

Extraction is Phase 0 of every make. Not a flag. Not opt-in.
Skipped only when source hashes match the last successful extraction.
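
One way to read the skip rule, as a hedged sketch: compare the hashes observed now against those recorded by the last successful extraction, and skip only when nothing differs. The function names and the id-to-hash mapping are assumptions for illustration.

```python
from typing import Dict, List, Optional


def changed_sources(current: Dict[str, str],
                    last_success: Optional[Dict[str, str]]) -> List[str]:
    """Return source ids whose SHA-256 differs from the last successful run.

    Both arguments map a source id to a hex digest. A source that was
    added or removed since the last run also counts as changed.
    """
    if last_success is None:
        # No successful extraction on record: everything must be extracted.
        return sorted(current)
    ids = set(current) | set(last_success)
    return sorted(i for i in ids if current.get(i) != last_success.get(i))


def can_skip_extraction(current: Dict[str, str],
                        last_success: Optional[Dict[str, str]]) -> bool:
    """Phase 0 is skipped only when a prior success exists and nothing changed."""
    return last_success is not None and not changed_sources(current, last_success)
```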

Six source types.
Each has its own governance.

  • system_export — machine-generated, reproducible, large, schema-stable (OPALE cases.xlsx). Automated verification.
  • expert_curation — human-authored, judgmental (Raffa's E00_V0.2.xlsx). Requires author, reviewer, justification.
  • reference_data — external authority (ICD-10, ATC, MIGEL). Requires publisher, license, upstream URL.
  • pipeline_config — severity thresholds, exclusion lists. Strictest: author + reviewer + impact_statement + affects.
  • api_feed and db_extract — future phases. Schema hooks today.
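
The per-type governance above could be enforced with a small lookup. A sketch only: the field names follow the bullets, except upstream_url, which is an assumed spelling of "upstream URL". The empty sets for system_export, api_feed and db_extract are also assumptions, since the list names no required metadata for them.

```python
# Required governance metadata per source type (field names partly assumed).
REQUIRED_METADATA = {
    "system_export": set(),  # assumption: verified automatically, no human fields
    "expert_curation": {"author", "reviewer", "justification"},
    "reference_data": {"publisher", "license", "upstream_url"},
    "pipeline_config": {"author", "reviewer", "impact_statement", "affects"},
    "api_feed": set(),       # future phase, schema hook only
    "db_extract": set(),     # future phase, schema hook only
}


def missing_metadata(source: dict) -> list:
    """Return the sorted governance fields a source entry lacks or leaves empty."""
    source_type = source.get("type")
    if source_type not in REQUIRED_METADATA:
        raise ValueError(f"unknown source type: {source_type!r}")
    return sorted(f for f in REQUIRED_METADATA[source_type] if not source.get(f))
```

In keeping with the fail-hard rules, a caller would treat any non-empty result as an error, not a warning.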

One file. Per tenant. Self-contained.

id + purpose
Stable identifier; human description with delivered_by, delivered_at, downstream_consumers.
Six months from now, anybody reading the config knows what this file is for.
source.type + path + sha256
Type from the six-type taxonomy; DLZ-relative exact path; 64-char SHA-256 pin; optional size_bytes.
Hash mismatch = hard fail. Size drift is caught as a truncation check.
output[]
Every file the extractor may write. Writing elsewhere is an error.
No file that touches the pipeline is invisible to the config.
expected.sheets[]
Per-sheet parse contract: name, required_columns, min_rows, header_contains.
Missing column = error. Empty sheet when min_rows > 0 = error.
Tenant-specific
No pack-level pipeline.yml. No inheritance. No cascade.
Boilerplate at this layer is readable. Inheritance is dangerous.
Git-reviewed updates
New file version = edit the config (SHA-256 change).
The act of bumping V0.2 to V0.3 is a code review, not a file drop.
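
The expected.sheets[] contract lends itself to a short check. The sketch below assumes the workbook has already been read into a mapping of sheet name to (header columns, data row count). ContractError, check_sheets and the sample column names are hypothetical. header_contains is left out because its exact semantics are not stated here, and treating a missing sheet as an error is an assumption in the spirit of the rules.

```python
from typing import Dict, List, Tuple


class ContractError(Exception):
    """Hypothetical error raised when a workbook violates expected.sheets[]."""


def check_sheets(expected_sheets: List[dict],
                 observed: Dict[str, Tuple[List[str], int]]) -> None:
    """Compare declared sheet contracts with what was actually parsed."""
    errors = []
    for spec in expected_sheets:
        name = spec["name"]
        if name not in observed:
            errors.append(f"{name}: sheet missing")  # assumption: also an error
            continue
        columns, row_count = observed[name]
        missing = [c for c in spec.get("required_columns", []) if c not in columns]
        if missing:
            errors.append(f"{name}: missing required columns {missing}")
        min_rows = spec.get("min_rows", 0)
        if row_count < min_rows:
            errors.append(f"{name}: {row_count} rows, expected at least {min_rows}")
    if errors:
        # Report every violation at once, then stop the run.
        raise ContractError("; ".join(errors))


# Hypothetical contract and observation; this call passes silently.
expected = [{"name": "cases", "required_columns": ["case_id", "severity"], "min_rows": 1}]
check_sheets(expected, {"cases": (["case_id", "severity", "note"], 120)})
```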

Append-only. Git-tracked.
The audit trail that always answers.

Per-run records
Run ID (timestamp + git commit).
Declared + observed SHA-256.
Every output path, hash, row count.
Start, end, duration.
User, machine, jinflow version.
Always queryable
grep, jq, Python, DuckDB (read_json_auto).
The log IS the history.
No archeology needed.
Bonus: solves the empty-build-no-commit gap
The log file always dirties the AFS tree on every build.
The auto-commit at the end of make always has something to record.
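
A minimal sketch of an append-only JSONL log and one query against it, using only the standard library. The record keys (declared_sha256, observed_sha256 and so on) are assumed names modeled on the per-run fields listed above, not the actual log schema.

```python
import json
from pathlib import Path
from typing import List


def append_run_record(log_path: Path, record: dict) -> None:
    """Append one JSON object as one line. Existing lines are never rewritten."""
    line = json.dumps(record, sort_keys=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def read_records(log_path: Path) -> List[dict]:
    """Load every record; each non-empty line is an independent JSON document."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def hash_mismatches(records: List[dict]) -> List[dict]:
    """Example query: records where the observed hash differed from the declared pin."""
    return [r for r in records if r["declared_sha256"] != r["observed_sha256"]]
```

Because each line stands alone, the same file stays readable with grep and jq, and every append shows up in git as a clean one-line addition.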

Config → Check → Run → Log.

pipeline.yml (declared) → SHA-256 (verified) → xlsx_to_csv (no transform) → build/csv/ (outputs) → extract_log (audit)

The extractor is a wrench, not a brain.
Every byte it writes was described in the config first.

No undeclared file enters the pipeline. Ever.
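
A sketch of how the declared-outputs rule might be enforced after an extractor runs: compare what was written against output[] and fail on any extra path. The names are hypothetical, and paths are treated as plain relative POSIX strings for simplicity.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


class UndeclaredOutputError(Exception):
    """Hypothetical error: the extractor wrote a file the config never listed."""


def undeclared_outputs(written: Iterable[str], declared: Iterable[str]) -> List[str]:
    """Return written paths that are absent from the declared output list."""
    allowed = {str(PurePosixPath(p)) for p in declared}
    return sorted({str(PurePosixPath(p)) for p in written} - allowed)


def enforce_declared_outputs(written: Iterable[str], declared: Iterable[str]) -> None:
    """Raise if anything outside output[] was written."""
    extra = undeclared_outputs(written, declared)
    if extra:
        raise UndeclaredOutputError(f"undeclared outputs written: {extra}")
```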

The configuration IS the contract.
The contract is the audit trail.

Listed, pinned, verified, logged, failure-intolerant.
Not a convenience problem. A foundation problem.

If the extract layer can silently drift, every finding above it rests on sand.

jinflow.io