Entity Registry — Concept Document

Date: 2026-02-27

What It Is

The Entity Registry is a metadata-driven replacement for all hardcoded domain assumptions in the Explorer. Instead of the Explorer knowing that gold_cases has a start_time column, or that signal entity type Case maps to dimension key cases, or that materials have related billing events joined on material_id — all of this knowledge lives in YAML files, gets compiled into dbt registry tables, and the Explorer reads it at runtime.

The Explorer code may assume the metadata schema. It may not assume any business content.

This follows the same pattern as signals, theses, and verdicts: YAML source of truth → compiler → dbt SQL → runtime consumption. The Entity Registry is the final piece that makes the Explorer fully domain-agnostic.

The Problem

The Explorer currently works because ~25 hardcoded assumptions scattered across 5 TypeScript files happen to match the current Gold schema. Adding a new Gold entity requires editing multiple files in multiple layers. Worse, these assumptions are invisible — they’re inline constants, heuristics, and naming conventions that only work by coincidence.

Inventory of Hardcoded Assumptions

explorer/src/lib/server/queries/dimensions.ts

Line(s)	Assumption	What it encodes
5–7	`EXCLUDED_TABLES`	`gold_article_suppliers` is not browseable
10–15	`EXCLUDED_PREFIXES`	`gold_taxonomy_`, `gold_quality_`, `gold_reconciliation_`, `gold_io_coefficient` are not browseable
23–28	`ENTITY_TYPE_MAP`	Maps signal entity_type (`Case`/`Material`/`Procedure`/`CostCenter`) → dimension keys
64	PK inference	First column ending in `_key` is primary key
204–231	`RELATED_FACTS`	9 relationship definitions: which tables join, on which columns, display order, and 2 custom SQL templates

explorer/src/lib/server/queries/summary.ts

Line(s)	Assumption	What it encodes
9–16	`YearFilteredCounts`	Hardcoded interface with `cases`, `procedures`, `usage`, `billing`, `billing_amount`, `movements`
34	`start_time`	Cases’ primary timestamp
38	`billing_timestamp`	Billing events’ primary timestamp
77–89	5 entity+timestamp pairs	cases/start_time, procedures/procedure_timestamp, usage/usage_timestamp, billing/billing_timestamp, movements/movement_timestamp
86	`billed_amount`	The monetary amount column for billing
115–121	`getTenantLandingStats`	Hardcoded references to `gold_cases` and `gold_billing_events`

explorer/src/lib/server/queries/hypotheses.ts

Line(s)	Assumption	What it encodes
120–124	`ENTITY_NAME_MAP`	Maps entity_type → table/id-column/description-column for 3 entity types

explorer/src/lib/server/queries/taxonomies.ts

Line(s)	Assumption	What it encodes
63–67	Member caption JOIN	Hardcoded LEFT JOIN to `gold_cost_centers.name` and `gold_materials.material_description`

explorer/src/routes/api/jumpbar/+server.ts

Line(s)	Assumption	What it encodes
20–34	`pickDisplayColumns`	Heuristic: skip `_key` columns, pattern-match “name”/“description” for label, “type”/“group”/“category” for detail

explorer/src/lib/i18n/locales/{en,de,fr}.json

Line(s)	Assumption	What it encodes
41–61	`dimensions.*` keys	Static display names for each entity type in 3 languages

Total: 22+ distinct hardcoded assumptions across 6 files, not counting the duplicated categoryColors and statusStyles maps in Svelte components.

What Goes Wrong

Adding a new Gold entity (e.g., gold_suppliers) requires editing ENTITY_TYPE_MAP, EXCLUDED_TABLES, potentially RELATED_FACTS, summary.ts queries, all 3 locale JSON files, and hoping you didn’t miss anything.
Changing a column name in the Gold contract silently breaks Explorer heuristics.
Different tenants with different entities is impossible — the Explorer assumes all tenants share the same Gold schema shape.
The _key suffix heuristic works today but is fragile. It can’t distinguish surrogate keys from business IDs.

Design

Guiding Principle

The Explorer code may assume the metadata schema. It may not assume any business content.

The Explorer knows two things: (1) there is a table called entity_registry with a fixed set of columns, and (2) there is a table called entity_relationships with a fixed set of columns. Everything else — entity names, column roles, relationships, display labels, icons — comes from the data.

Architecture

entities/*.yaml                              ← source of truth
      │
      ▼
scripts/entitycompile.py ──validates──▶ contracts/gold_contract.v1.json
      │
      ├──▶ dbt/{pack}/models/registry/entity_registry.sql
      └──▶ dbt/{pack}/models/registry/entity_relationships.sql
                  │
                  ▼
            dbt build --select registry
                  │
                  ├──▶ {tenant}.entity_registry        (TABLE)
                  ├──▶ {tenant}.entity_relationships    (TABLE)
                  │
                  ▼
            platform_entity_registry       (VIEW, from first tenant)
            platform_entity_relationships  (VIEW, from first tenant)
                  │
                  ▼
      explorer/src/lib/server/queries/entityRegistry.ts
        loadRegistry() → per-tenant cache, invalidated by DB mtime
        getBrowseableEntities() / getEntityByKey() / getRelationships() / ...
                  │
                  ▼
      All Explorer consumers — zero hardcoded entity knowledge

This is the same pattern as signals (probes/*.yaml → signalcompile.py → signal_findings__*.sql) and theses (hypotheses/*.yaml → hypothesiscompile.py → hypothesis_verdicts.sql).

YAML Source Format

Entity definitions live in entities/ at the repo root (same level as probes/, hypotheses/, diagnoses/). One file per entity. The entity_id must match the filename (without .yaml).

Complete Example: `entities/cases.yaml`

entity_id: cases
dbt_model: gold_cases
display_name:
  en: Cases
  de: Fälle
  fr: Cas
icon: "◈"
is_browseable: true
columns:
  primary_key: case_key
  business_id: case_token
  label: drg_description
  detail: case_type
  timestamp: start_time
  amount: null
probe_entity_type: Case
taxonomy:
  dimension_type: null
  caption_column: null
relationships:
  - target_model: gold_case_material_usage
    join_column: case_token
    order_column: usage_timestamp
    label: { en: Usage, de: Verbrauch, fr: Consommation }
  - target_model: gold_billing_events
    join_column: case_token
    order_column: billing_timestamp
    label: { en: Billing, de: Abrechnung, fr: Facturation }
  - target_model: gold_procedures
    join_column: case_token
    order_column: procedure_timestamp
    label: { en: Procedures, de: Eingriffe, fr: Procédures }

Complete Example: `entities/materials.yaml`

entity_id: materials
dbt_model: gold_materials
display_name:
  en: Materials
  de: Materialien
  fr: Matériaux
icon: "📦"
is_browseable: true
columns:
  primary_key: material_key
  business_id: material_id
  label: material_description
  detail: material_group
  timestamp: null
  amount: standard_price
probe_entity_type: Material
taxonomy:
  dimension_type: material
  caption_column: material_description
relationships:
  - target_model: gold_case_material_usage
    join_column: material_id
    order_column: usage_timestamp
    label: { en: Usage, de: Verbrauch, fr: Consommation }
  - target_model: gold_billing_events
    join_column: material_id
    order_column: billing_timestamp
    label: { en: Billing, de: Abrechnung, fr: Facturation }
  - target_model: gold_material_movements
    join_column: material_id
    order_column: movement_timestamp
    label: { en: Movements, de: Bewegungen, fr: Mouvements }
  - target_model: gold_article_suppliers
    join_column: material_id
    order_column: supplier_code
    label: { en: Suppliers, de: Lieferanten, fr: Fournisseurs }
    custom_sql: |
      SELECT a.supplier_code, s.supplier_name, a.supplier_reference, a.creation_date
      FROM {schema}.gold_article_suppliers a
      LEFT JOIN {schema}.gold_suppliers s ON a.supplier_code = s.supplier_code
      WHERE a.material_id = $1 ORDER BY a.supplier_code LIMIT 100

Non-Browseable Example: `entities/article_suppliers.yaml`

entity_id: article_suppliers
dbt_model: gold_article_suppliers
display_name:
  en: Article Suppliers
  de: Artikellieferanten
  fr: Fournisseurs d'articles
icon: "🔗"
is_browseable: false
columns:
  primary_key: article_supplier_key
  business_id: material_id
  label: null
  detail: null
  timestamp: null
  amount: null
probe_entity_type: null
taxonomy:
  dimension_type: null
  caption_column: null

YAML Field Reference

Field	Required	Type	Description
`entity_id`	Yes	string	Dimension key, must match filename
`dbt_model`	Yes	string	Gold table name (must start with `gold_`)
`display_name`	Yes	{en, de, fr}	Tri-lingual display name
`icon`	Yes	string	Single emoji/character for UI
`is_browseable`	Yes	boolean	Whether entity appears in dimension browser
`columns.primary_key`	Yes	string	Surrogate key column
`columns.business_id`	Yes	string	Human-meaningful ID column
`columns.label`	No	string	Display name/description column
`columns.detail`	No	string	Secondary info column (type, group, category)
`columns.timestamp`	No	string	Primary timestamp for time-axis queries
`columns.amount`	No	string	Monetary amount column
`probe_entity_type`	No	string	Maps to probe findings `entity_type` field
`taxonomy.dimension_type`	No	string	Value in `taxonomy_member_mappings.dimension_type`
`taxonomy.caption_column`	No	string	Column used for taxonomy member captions
`relationships`	No	list	Related fact definitions (see below)

Relationship Fields

Field	Required	Type	Description
`target_model`	Yes	string	Gold table to query
`join_column`	Yes	string	Column on both source and target for the join
`order_column`	Yes	string	Default ORDER BY column
`label`	Yes	{en, de, fr}	Section heading in entity detail page
`custom_sql`	No	string	Full SQL template with `{schema}` and `$1` placeholders

Entity List

~12 YAML files covering all current Gold entities:

File	dbt_model	Browseable	probe_entity_type	Relationships
`cases.yaml`	`gold_cases`	Yes	`Case`	usage, billing, procedures
`materials.yaml`	`gold_materials`	Yes	`Material`	usage, billing, movements, suppliers
`cost_centers.yaml`	`gold_cost_centers`	Yes	`CostCenter`	—
`suppliers.yaml`	`gold_suppliers`	Yes	—	materials (via junction)
`procedures.yaml`	`gold_procedures`	Yes	`Procedure`	—
`case_material_usage.yaml`	`gold_case_material_usage`	Yes	—	—
`billing_events.yaml`	`gold_billing_events`	Yes	`BillingEvent`	—
`material_movements.yaml`	`gold_material_movements`	Yes	—	—
`article_suppliers.yaml`	`gold_article_suppliers`	No	—	—
`material_classifications.yaml`	`gold_material_classifications`	Yes	—	—
`packaging_types.yaml`	`gold_packaging_types`	Yes	—	—
`service_mandates.yaml`	`gold_service_mandates`	Yes	—	—

Schema Definitions

`entity_registry` Table

One row per entity. Materialized as TABLE in dbt/{pack}/models/registry/entity_registry.sql.

Column	Type	Nullable	Description
`entity_id`	VARCHAR	No	Dimension key (e.g. `cases`, `materials`)
`dbt_model`	VARCHAR	No	Table name (e.g. `gold_cases`)
`display_name_en`	VARCHAR	No	Display name, English
`display_name_de`	VARCHAR	No	Display name, German
`display_name_fr`	VARCHAR	No	Display name, French
`icon`	VARCHAR	No	Single emoji character
`is_browseable`	BOOLEAN	No	Appears in dimension browser
`primary_key_column`	VARCHAR	No	Surrogate key column name
`business_id_column`	VARCHAR	No	Human-meaningful ID column
`label_column`	VARCHAR	Yes	Display name column
`detail_column`	VARCHAR	Yes	Secondary info column
`timestamp_column`	VARCHAR	Yes	Primary timestamp for time-axis queries
`amount_column`	VARCHAR	Yes	Monetary amount column
`probe_entity_type`	VARCHAR	Yes	Maps to probe findings `entity_type`
`taxonomy_dimension_type`	VARCHAR	Yes	`dimension_type` in `taxonomy_member_mappings`
`taxonomy_caption_column`	VARCHAR	Yes	Column for taxonomy member captions

`entity_relationships` Table

One row per relationship. Materialized as TABLE in dbt/{pack}/models/registry/entity_relationships.sql.

Column	Type	Nullable	Description
`source_entity_id`	VARCHAR	No	FK to `entity_registry.entity_id`
`join_column`	VARCHAR	No	Column on source entity providing the join value
`target_model`	VARCHAR	No	Gold table to query
`target_join_column`	VARCHAR	No	Column in target to match (usually same as `join_column`)
`order_column`	VARCHAR	No	Default ORDER BY column
`label_en`	VARCHAR	No	Label, English
`label_de`	VARCHAR	No	Label, German
`label_fr`	VARCHAR	No	Label, French
`custom_sql`	VARCHAR	Yes	Full SQL template with `{schema}` and `$1` placeholders

Platform Views

Same pattern as other registries (from first tenant):

dbt/{pack}/models/platform/platform_entity_registry.sql
dbt/{pack}/models/platform/platform_entity_relationships.sql

Design Decisions

Two tables, not one — entity metadata and relationship metadata have different cardinality (1:N). Normalized, matches the signal_registry + probe_findings pattern.
models/registry/ directory, not models/gold/ — metadata registries don’t belong in the gold business layer.
No gold_ prefix — entity_registry, not gold_entity_registry. These are metadata, not business entities.
custom_sql for complex joins — only 2 of 9 current relationships need it (both via gold_article_suppliers junction table). Avoids over-engineering a join DSL.

Compiler

`scripts/entitycompile.py`

Follows the pattern of signalcompile.py and hypothesiscompile.py:

Reads all entities/*.yaml files
Validates each YAML:
- entity_id must match filename (without .yaml)
- dbt_model must start with gold_
- primary_key_column and business_id_column are required
- probe_entity_type, if set, must be a known contract entity type
- All column references must exist in contracts/gold_contract.v1.json
- target_model in relationships must reference a real Gold model
- custom_sql must contain {schema} and $1 placeholders
- Warns if a contract entity has no corresponding YAML file
Generates two dbt SQL models via UNION ALL of SELECT literals
Supports --check flag for CI dry-run

`scripts/entitycheck.py`

Validates YAML structure and cross-references without generating SQL. Analogous to signalcheck.py and hypothesischeck.py.

Integration into `rebuild.sh`

New Step 2f after lineage compile:

log "━━━ Step 2f: Compile entity registry ━━━"
run_cmd "$VENV/python" "$SCRIPTS_DIR/entitycompile.py"

Commands

# Validate entity YAML definitions
python3 scripts/entitycheck.py

# Compile YAML → dbt SQL
python3 scripts/entitycompile.py

# Dry-run: check if compiled SQL is up to date
python3 scripts/entitycompile.py --check

# Build registry for a tenant
cd dbt/{pack} && .venv/bin/dbt build --select registry \
  --vars '{"tenant_id": "my_tenant"}'

Explorer Consumption

`explorer/src/lib/server/queries/entityRegistry.ts`

A single module that loads, caches, and queries the registry tables. All domain knowledge in the Explorer flows through this module.

interface EntityRegistryEntry {
  entity_id: string;
  dbt_model: string;
  display_name_en: string;
  display_name_de: string;
  display_name_fr: string;
  icon: string;
  is_browseable: boolean;
  primary_key_column: string;
  business_id_column: string;
  label_column: string | null;
  detail_column: string | null;
  timestamp_column: string | null;
  amount_column: string | null;
  probe_entity_type: string | null;
  taxonomy_dimension_type: string | null;
  taxonomy_caption_column: string | null;
}

interface EntityRelationship {
  source_entity_id: string;
  join_column: string;
  target_model: string;
  target_join_column: string;
  order_column: string;
  label_en: string;
  label_de: string;
  label_fr: string;
  custom_sql: string | null;
}

Public API

Function	Returns	Replaces
`getRegistry(tenant)`	Load + cache both tables	—
`getBrowseableEntities(tenant)`	`is_browseable = true` entities	`discoverDimensions()` + exclusion lists
`getEntityByKey(tenant, key)`	Lookup by `entity_id`	PK heuristic + `pickDisplayColumns`
`getEntityByProbeType(tenant, type)`	Lookup by `probe_entity_type`	`ENTITY_TYPE_MAP`
`getRelationships(tenant, entityId)`	Relationships for entity	`RELATED_FACTS`
`getEntitiesWithTimestamp(tenant)`	`timestamp_column IS NOT NULL`	Hardcoded `summary.ts` timestamps
`getEntitiesWithAmount(tenant)`	`amount_column IS NOT NULL`	Hardcoded `summary.ts` amounts
`getTaxonomyMemberEntities(tenant)`	`taxonomy_dimension_type IS NOT NULL`	Hardcoded `taxonomies.ts` JOINs

Cache Strategy

Piggybacks on the existing getDb() reconnect logic in db.ts (which already tracks dbMtimeMs). When the DB mtime changes after dbt build, the registry cache is invalidated. No TTL needed.

Graceful Degradation

If entity_registry table does not exist (tenant not yet rebuilt after registry is added), fall back to current information_schema discovery + heuristics. This allows a smooth rollout — the Explorer works with or without the registry tables during the transition period.

What Gets Replaced

File	Hardcoded Knowledge	Registry Replacement
`dimensions.ts` — `ENTITY_TYPE_MAP`	signal entity_type → dimension key	`probe_entity_type` column
`dimensions.ts` — `EXCLUDED_TABLES/PREFIXES`	which tables are browseable	`is_browseable` column
`dimensions.ts` — `RELATED_FACTS`	9 FK relationships + 2 custom SQL	`entity_relationships` table
`dimensions.ts` — PK heuristic	first `_key` column = surrogate key	`primary_key_column` column
`dimensions.ts` — `discoverDimensions()`	schema introspection + exclusions	`SELECT * FROM entity_registry WHERE is_browseable`
`summary.ts` — timestamps	hardcoded `start_time`, `billing_timestamp`	`timestamp_column` column
`summary.ts` — amounts	hardcoded `billed_amount`	`amount_column` column
`hypotheses.ts` — `ENTITY_NAME_MAP`	3 entity types → table/id/desc	`business_id_column` + `label_column`
`taxonomies.ts` — member JOINs	hardcoded `gold_cost_centers`/`gold_materials`	`taxonomy_dimension_type` + `taxonomy_caption_column`
`jumpbar/+server.ts` — `pickDisplayColumns`	regex heuristic for id/label/detail	`business_id_column` + `label_column` + `detail_column`
i18n `dimensions.*` keys	static display names per entity	`display_name_en/de/fr` columns

Implementation Phases

Phase 1: Foundation (no Explorer changes)

Create entities/ directory with ~12 YAML files
Write scripts/entitycompile.py (compiler + validator)
Write scripts/entitycheck.py (validator only)
Create dbt models: models/registry/entity_registry.sql, entity_relationships.sql
Create platform views: platform_entity_registry.sql, platform_entity_relationships.sql
Add Step 2f to scripts/rebuild.sh
Rebuild and verify tables populated

Verification: SELECT * FROM my_tenant.entity_registry returns all entities with correct metadata.

Phase 2: Explorer Registry Module

Create explorer/src/lib/server/queries/entityRegistry.ts with load/cache/query functions
Implement graceful fallback: if entity_registry table missing, fall back to current heuristics

Verification: Module compiles, cache works, fallback triggers correctly on pre-registry tenants.

Phase 3: Migrate Consumers (one file at a time)

dimensions.ts — replace ENTITY_TYPE_MAP, EXCLUDED_*, RELATED_FACTS, PK heuristic, discoverDimensions()
jumpbar/+server.ts — replace pickDisplayColumns and dynamic dimension search
hypotheses.ts — replace ENTITY_NAME_MAP
taxonomies.ts — replace hardcoded member caption JOINs
summary.ts — replace hardcoded timestamp/amount column references

Verification: Each file migration is verified independently — Explorer behavior is identical before and after.

Phase 4: Cleanup

Remove all dead hardcoded code (ENTITY_TYPE_MAP, EXCLUDED_TABLES, RELATED_FACTS, etc.)
Remove dimensions.* i18n keys (display names now come from registry)
Update CLAUDE.md to document the Entity Registry

Verification: Grep Explorer src/ for hardcoded gold_ table names — zero hits outside entityRegistry.ts fallback.

Files to Create/Modify

Action	File
Create	`entities/*.yaml` (~12 files)
Create	`scripts/entitycompile.py`
Create	`scripts/entitycheck.py`
Create	`dbt/{pack}/models/registry/entity_registry.sql`
Create	`dbt/{pack}/models/registry/entity_relationships.sql`
Create	`dbt/{pack}/models/platform/platform_entity_registry.sql`
Create	`dbt/{pack}/models/platform/platform_entity_relationships.sql`
Create	`explorer/src/lib/server/queries/entityRegistry.ts`
Modify	`scripts/rebuild.sh` — add Step 2f: entity compile
Modify	`explorer/src/lib/server/queries/dimensions.ts`
Modify	`explorer/src/lib/server/queries/summary.ts`
Modify	`explorer/src/lib/server/queries/hypotheses.ts`
Modify	`explorer/src/lib/server/queries/taxonomies.ts`
Modify	`explorer/src/routes/api/jumpbar/+server.ts`
Modify	`CLAUDE.md`

Testing

The Entity Registry is unusually testable. Every layer — YAML validation, SQL compilation, DuckDB tables, Explorer consumption — can be tested in isolation with synthetic data. No real tenant data needed. No dbt project needed for the integration tests. Just a DuckDB instance and some generated schemas.

Tier 1: Unit Tests

Compiler Tests (`tests/test_entitycompile.py`)

Pure Python, no DuckDB. Feed YAML dicts into the compiler functions, assert correct SQL output or correct validation errors.

Test	Input	Expected
Valid entity	Complete YAML dict	SQL SELECT literal with all 16 columns
Missing required field	YAML without `primary_key`	Validation error naming the field
`entity_id` ≠ filename	`entity_id: foo` in `bar.yaml`	Validation error
`dbt_model` without `gold_`	`dbt_model: silver_cases`	Validation error
Unknown column reference	`timestamp: nonexistent_col`	Warning (column not in contract)
Relationship with `custom_sql` missing `{schema}`	SQL without placeholder	Validation error
Relationship with `custom_sql` missing `$1`	SQL without parameter	Validation error
No YAML for contract entity	Contract has `BillingEvent`, no `billing_events.yaml`	Warning (not error)
`--check` mode, SQL up to date	Generated SQL matches file on disk	Exit 0
`--check` mode, SQL stale	Generated SQL differs from file on disk	Exit 1

Registry Module Tests (`explorer/src/lib/server/queries/entityRegistry.test.ts`)

TypeScript unit tests against a real in-memory DuckDB. Create the two registry tables, insert known rows, then test every public API function.

Test	Setup	Assert
`getBrowseableEntities`	3 entities, 1 with `is_browseable = false`	Returns 2
`getEntityByKey`	Insert entity with `entity_id = 'cases'`	Returns correct entry
`getEntityByKey` — missing	Query for `entity_id = 'nonexistent'`	Returns `null`
`getEntityByProbeType`	Insert entity with `probe_entity_type = 'Case'`	Returns correct entry
`getEntityByProbeType` — null	Insert entity with `probe_entity_type = null`	Not returned
`getRelationships`	Insert 3 relationships for `materials`	Returns 3 in order
`getEntitiesWithTimestamp`	2 with timestamp, 1 without	Returns 2
`getEntitiesWithAmount`	1 with amount, 2 without	Returns 1
`getTaxonomyMemberEntities`	1 with `taxonomy_dimension_type`, 2 without	Returns 1
Cache invalidation	Load registry, change DB mtime, reload	Fresh data returned
Graceful degradation	No `entity_registry` table exists	Falls back to `information_schema` discovery

Tier 2: Integration Tests

Verify the full pipeline: YAML → compiler → SQL → DuckDB → Explorer module.

# Run from repo root
python3 -m pytest tests/test_entity_registry_integration.py -v

Start with real entities/*.yaml files
Run entitycompile.py programmatically
Execute generated SQL in a fresh in-memory DuckDB
Query the resulting tables
Assert every YAML entity appears in entity_registry with correct values
Assert every YAML relationship appears in entity_relationships

Tier 3: Three-Party Adversarial E2E Test

This is the centrepiece of the testing strategy. Three independent parties that never communicate directly — they only share a DuckDB file and a common report format.

The Idea

                    ┌─────────────────────┐
                    │  Test Orchestrator   │
                    │  (seeded PRNG)       │
                    └──────┬──────────────┘
                           │
              generates random schema spec
                           │
           ┌───────────────┼───────────────┐
           ▼                               ▼
   ┌───────────────┐               ┌───────────────┐
   │  Processor     │               │  Explorer      │
   │                │               │                │
   │  1. Reads spec │               │  1. Opens DB   │
   │  2. Creates    │               │     (read-only)│
   │     DuckDB     │──── .duckdb ──│  2. Discovers  │
   │     schema +   │     file      │     schema via │
   │     registry   │               │     registry   │
   │  3. Writes     │               │  3. Writes     │
   │     INTENTION  │               │     REPORT     │
   └───────┬───────┘               └───────┬───────┘
           │                               │
           │    ┌───────────────┐          │
           └───▶│  Comparator   │◀─────────┘
                │               │
                │  Diffs intent │
                │  vs report    │
                │               │
                │  PASS / FAIL  │
                └───────────────┘

The Processor and Explorer are completely independent. They share no code, no imports, no function calls. The only coupling is:

The DuckDB file (Processor writes, Explorer reads)
The intention/report format (a shared JSON schema)

This means: if the Explorer can correctly report what the Processor built — for any random schema — the registry system works.

The Orchestrator

Generates a random-but-consistent schema specification using a seeded PRNG. Each test run is deterministic and reproducible.

The spec controls:

How many entities (3–15)
Entity names (random words, not domain-specific terms — to prove domain-agnosticism)
Which columns each entity has (random column names + types)
Which column roles are assigned (primary_key, business_id, label, detail, timestamp, amount)
Which entities are browseable
Which entities have a probe_entity_type
How many relationships exist (0–3 per entity)
Whether any relationships use custom_sql
Tri-lingual display names (generated from entity name + locale suffix)

The spec is passed to both the Processor and the Explorer test harness. But critically, the Explorer harness only uses the spec to know which tenant schema to query — it does not peek at the spec to know what it should find.

The Processor

Receives the schema spec. Creates a real DuckDB database with:

Gold tables — one gold_{entity_id} table per entity, with the columns specified, populated with 5–50 random rows
Registry tables — entity_registry and entity_relationships, populated from the spec (as if entitycompile.py had run)
Relationship target tables — for each relationship, the target table exists with the join column and order column

Then writes an intention manifest (intention.json):

{
  "seed": 42,
  "schema": "test_tenant_42",
  "entities": [
    {
      "entity_id": "widgets",
      "dbt_model": "gold_widgets",
      "display_name_en": "Widgets",
      "display_name_de": "Widgets_de",
      "display_name_fr": "Widgets_fr",
      "is_browseable": true,
      "primary_key_column": "widget_key",
      "business_id_column": "widget_code",
      "label_column": "widget_name",
      "detail_column": "widget_category",
      "timestamp_column": "created_at",
      "amount_column": "unit_price",
      "probe_entity_type": "Widget",
      "row_count": 23,
      "relationships": [
        {
          "target_model": "gold_widget_events",
          "join_column": "widget_code",
          "order_column": "event_time",
          "label_en": "Events",
          "has_custom_sql": false
        }
      ]
    }
  ]
}

The intention manifest describes what the Processor claims it built. It is a pure declaration — no DuckDB queries, no introspection. Just the spec, serialized.

The Explorer

Opens the DuckDB read-only. Has no access to the schema spec or the intention manifest. It only knows the tenant schema name.

Uses the entityRegistry.ts module (or a Python equivalent that mirrors its logic) to:

Load entity_registry and entity_relationships from the schema
For each registered entity, verify the Gold table exists and query its columns
For each relationship, verify the target table exists and the join column is present
For each entity with timestamp_column, verify the column exists and contains temporal data
For each entity with amount_column, verify the column exists and contains numeric data
Count rows in each Gold table

Then writes a discovery report (report.json):

{
  "schema": "test_tenant_42",
  "entities": [
    {
      "entity_id": "widgets",
      "dbt_model": "gold_widgets",
      "display_name_en": "Widgets",
      "display_name_de": "Widgets_de",
      "display_name_fr": "Widgets_fr",
      "is_browseable": true,
      "primary_key_column": "widget_key",
      "business_id_column": "widget_code",
      "label_column": "widget_name",
      "detail_column": "widget_category",
      "timestamp_column": "created_at",
      "amount_column": "unit_price",
      "probe_entity_type": "Widget",
      "table_exists": true,
      "row_count": 23,
      "pk_is_unique": true,
      "timestamp_column_type": "TIMESTAMP",
      "amount_column_type": "DOUBLE",
      "relationships": [
        {
          "target_model": "gold_widget_events",
          "join_column": "widget_code",
          "order_column": "event_time",
          "target_exists": true,
          "join_column_exists": true,
          "has_custom_sql": false
        }
      ]
    }
  ]
}

The report describes what the Explorer actually found. It includes everything the intention declares, plus structural validation (table exists, PK is unique, column types are correct).

The Comparator

Receives both intention.json and report.json. Compares them field by field.

Exact-match assertions:

Field	Rule
`entity_id`	Every intention entity appears in report
`dbt_model`	Matches exactly
`display_name_{en,de,fr}`	Matches exactly
`is_browseable`	Matches exactly
`primary_key_column`	Matches exactly
`business_id_column`	Matches exactly
`label_column`	Matches exactly (or both null)
`detail_column`	Matches exactly (or both null)
`timestamp_column`	Matches exactly (or both null)
`amount_column`	Matches exactly (or both null)
`probe_entity_type`	Matches exactly (or both null)
`row_count`	Matches exactly
Relationship count	Same number per entity
Relationship `target_model`	Matches exactly
Relationship `join_column`	Matches exactly

Structural assertions (report-only):

Field	Rule
`table_exists`	Must be `true`
`pk_is_unique`	Must be `true`
`timestamp_column_type`	Must be a temporal type (`TIMESTAMP`, `DATE`)
`amount_column_type`	Must be a numeric type (`DOUBLE`, `DECIMAL`, `INTEGER`)
`target_exists`	Must be `true`
`join_column_exists`	Must be `true`

Surplus/deficit detection:

Check	Meaning
Entity in report but not in intention	Explorer found something the Processor didn’t declare (ghost entity)
Entity in intention but not in report	Explorer missed something the Processor built (blind spot)
Extra relationship in report	Explorer found an undeclared FK
Missing relationship in report	Explorer missed a declared FK

Test Scenarios

The orchestrator runs multiple seeds, each producing a different random schema. Some seeds are engineered to stress specific edge cases:

Seed	Scenario	What it tests
1	Minimal — 1 entity, no relationships, no optional columns	Baseline
2	Maximal — 15 entities, all column roles filled, 3 relationships each	Scale
3	All non-browseable	`getBrowseableEntities` returns empty
4	No timestamps, no amounts	`getEntitiesWithTimestamp` / `WithAmount` return empty
5	Mixed null columns	Some entities have label, some don’t
6	Relationship with `custom_sql`	SQL template with `{schema}` and `$1`
7	Duplicate `probe_entity_type` across 2 entities	`getEntityByProbeType` returns first match
8	Entity names that need quoting	Spaces, hyphens, unicode
9	100 random seeds	Property-based coverage

Running the Tests

# Single seed (fast, for debugging)
python3 tests/test_entity_registry_e2e.py --seed 42

# All named scenarios
python3 tests/test_entity_registry_e2e.py --scenarios

# Property-based (100 random seeds)
python3 tests/test_entity_registry_e2e.py --property 100

# Verbose mode (prints intention + report + diff)
python3 tests/test_entity_registry_e2e.py --seed 42 -v

Why This Works

The three-party architecture has a key property: the Processor and Explorer can’t collude. They share no code. If the Explorer returns a correct report, it’s because the registry system genuinely works — not because someone hardcoded the right answers.

This catches bugs that unit tests miss:

Compiler generates SQL that DuckDB rejects (syntax errors in edge cases)
Registry table schema doesn’t match what the Explorer expects (column type mismatch)
Cache returns stale data after schema change
Graceful degradation path is never exercised
Column role metadata is correct in the registry but the Explorer queries the wrong column

It also serves as a living specification: the intention/report format documents exactly what the registry system promises.

File Layout

tests/
  test_entity_registry_e2e.py          ← orchestrator + comparator
  entity_registry/
    processor.py                       ← builds DuckDB from spec
    explorer.py                        ← discovers schema from registry
    schemas.py                         ← intention + report JSON schemas
    conftest.py                        ← pytest fixtures (tmp DuckDB, seeds)

Tier 4: Regression Smoke Test

After the real Entity Registry is deployed, a lightweight check that runs as part of rebuild.sh:

Step	What	How
1	Compiler runs	`python3 scripts/entitycompile.py` — exit 0
2	CI dry-run	`python3 scripts/entitycompile.py --check` — exit 0
3	Registry populated	`SELECT count(*) FROM {tenant}.entity_registry` > 0
4	Relationships populated	`SELECT count(*) FROM {tenant}.entity_relationships` > 0
5	No orphan references	Every `entity_relationships.source_entity_id` exists in `entity_registry`
6	Type safety	`npx svelte-check` — zero errors
7	No hardcoded remnants	Grep Explorer `src/` for hardcoded `gold_` table names — zero hits outside fallback

Open Questions

Should entity_relationships support asymmetric joins? — Currently join_column is assumed to be the same on both sides. The custom_sql escape hatch handles the 2 cases where this doesn’t hold. If more asymmetric joins appear, add a target_join_column field.
Should the registry include non-Gold entities? — Silver audit signals reference Silver tables. For now, Silver entities are out of scope — they don’t appear in the Explorer. Revisit if Silver browsing is added.
Should the YAML include column-level metadata? — e.g., which columns are numeric, which are dates, which should be formatted as currency. Currently the Explorer infers this from DuckDB types at query time. Adding column metadata to the registry would enable richer formatting, but it’s a lot of YAML for marginal benefit today.

jinflow is a jazzisnow product

v0.45.1 · built 2026-04-17 08:14 UTC