Entity Registry — Concept Document
Date: 2026-02-27
What It Is
Section titled “What It Is”The Entity Registry is a metadata-driven replacement for all hardcoded domain assumptions in the Explorer. Instead of the Explorer knowing that gold_cases has a start_time column, or that signal entity type Case maps to dimension key cases, or that materials have related billing events joined on material_id — all of this knowledge lives in YAML files, gets compiled into dbt registry tables, and the Explorer reads it at runtime.
The Explorer code may assume the metadata schema. It may not assume any business content.
This follows the same pattern as signals, theses, and verdicts: YAML source of truth → compiler → dbt SQL → runtime consumption. The Entity Registry is the final piece that makes the Explorer fully domain-agnostic.
The Problem
Section titled “The Problem”The Explorer currently works because ~25 hardcoded assumptions scattered across 5 TypeScript files happen to match the current Gold schema. Adding a new Gold entity requires editing multiple files in multiple layers. Worse, these assumptions are invisible — they’re inline constants, heuristics, and naming conventions that only work by coincidence.
Inventory of Hardcoded Assumptions
Section titled “Inventory of Hardcoded Assumptions”explorer/src/lib/server/queries/dimensions.ts
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 5–7 | EXCLUDED_TABLES | gold_article_suppliers is not browseable |
| 10–15 | EXCLUDED_PREFIXES | gold_taxonomy_*, gold_quality_*, gold_reconciliation_*, gold_io_coefficient* are not browseable |
| 23–28 | ENTITY_TYPE_MAP | Maps signal entity_type (Case/Material/Procedure/CostCenter) → dimension keys |
| 64 | PK inference | First column ending in _key is primary key |
| 204–231 | RELATED_FACTS | 9 relationship definitions: which tables join, on which columns, display order, and 2 custom SQL templates |
explorer/src/lib/server/queries/summary.ts
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 9–16 | YearFilteredCounts | Hardcoded interface with cases, procedures, usage, billing, billing_amount, movements |
| 34 | start_time | Cases’ primary timestamp |
| 38 | billing_timestamp | Billing events’ primary timestamp |
| 77–89 | 5 entity+timestamp pairs | cases/start_time, procedures/procedure_timestamp, usage/usage_timestamp, billing/billing_timestamp, movements/movement_timestamp |
| 86 | billed_amount | The monetary amount column for billing |
| 115–121 | getTenantLandingStats | Hardcoded references to gold_cases and gold_billing_events |
explorer/src/lib/server/queries/hypotheses.ts
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 120–124 | ENTITY_NAME_MAP | Maps entity_type → table/id-column/description-column for 3 entity types |
explorer/src/lib/server/queries/taxonomies.ts
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 63–67 | Member caption JOIN | Hardcoded LEFT JOIN to gold_cost_centers.name and gold_materials.material_description |
explorer/src/routes/api/jumpbar/+server.ts
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 20–34 | pickDisplayColumns | Heuristic: skip _key columns, pattern-match “name”/“description” for label, “type”/“group”/“category” for detail |
explorer/src/lib/i18n/locales/{en,de,fr}.json
| Line(s) | Assumption | What it encodes |
|---|---|---|
| 41–61 | dimensions.* keys | Static display names for each entity type in 3 languages |
Total: 22+ distinct hardcoded assumptions across 6 files, not counting the duplicated categoryColors and statusStyles maps in Svelte components.
What Goes Wrong
Section titled “What Goes Wrong”- Adding a new Gold entity (e.g.,
gold_suppliers) requires editingENTITY_TYPE_MAP,EXCLUDED_TABLES, potentiallyRELATED_FACTS,summary.tsqueries, all 3 locale JSON files, and hoping you didn’t miss anything. - Changing a column name in the Gold contract silently breaks Explorer heuristics.
- Different tenants with different entities is impossible — the Explorer assumes all tenants share the same Gold schema shape.
- The
_keysuffix heuristic works today but is fragile. It can’t distinguish surrogate keys from business IDs.
Design
Section titled “Design”Guiding Principle
Section titled “Guiding Principle”The Explorer code may assume the metadata schema. It may not assume any business content.
The Explorer knows two things: (1) there is a table called entity_registry with a fixed set of columns, and (2) there is a table called entity_relationships with a fixed set of columns. Everything else — entity names, column roles, relationships, display labels, icons — comes from the data.
Architecture
Section titled “Architecture”entities/*.yaml ← source of truth │ ▼scripts/entitycompile.py ──validates──▶ contracts/gold_contract.v1.json │ ├──▶ dbt/{pack}/models/registry/entity_registry.sql └──▶ dbt/{pack}/models/registry/entity_relationships.sql │ ▼ dbt build --select registry │ ├──▶ {tenant}.entity_registry (TABLE) ├──▶ {tenant}.entity_relationships (TABLE) │ ▼ platform_entity_registry (VIEW, from first tenant) platform_entity_relationships (VIEW, from first tenant) │ ▼ explorer/src/lib/server/queries/entityRegistry.ts loadRegistry() → per-tenant cache, invalidated by DB mtime getBrowseableEntities() / getEntityByKey() / getRelationships() / ... │ ▼ All Explorer consumers — zero hardcoded entity knowledgeThis is the same pattern as signals (probes/*.yaml → signalcompile.py → signal_findings__*.sql) and theses (hypotheses/*.yaml → hypothesiscompile.py → hypothesis_verdicts.sql).
YAML Source Format
Section titled “YAML Source Format”Entity definitions live in entities/ at the repo root (same level as probes/, hypotheses/, diagnoses/). One file per entity. The entity_id must match the filename (without .yaml).
Complete Example: entities/cases.yaml
Section titled “Complete Example: entities/cases.yaml”entity_id: casesdbt_model: gold_casesdisplay_name: en: Cases de: Fälle fr: Casicon: "◈"is_browseable: truecolumns: primary_key: case_key business_id: case_token label: drg_description detail: case_type timestamp: start_time amount: nullprobe_entity_type: Casetaxonomy: dimension_type: null caption_column: nullrelationships: - target_model: gold_case_material_usage join_column: case_token order_column: usage_timestamp label: { en: Usage, de: Verbrauch, fr: Consommation } - target_model: gold_billing_events join_column: case_token order_column: billing_timestamp label: { en: Billing, de: Abrechnung, fr: Facturation } - target_model: gold_procedures join_column: case_token order_column: procedure_timestamp label: { en: Procedures, de: Eingriffe, fr: Procédures }Complete Example: entities/materials.yaml
Section titled “Complete Example: entities/materials.yaml”entity_id: materialsdbt_model: gold_materialsdisplay_name: en: Materials de: Materialien fr: Matériauxicon: "📦"is_browseable: truecolumns: primary_key: material_key business_id: material_id label: material_description detail: material_group timestamp: null amount: standard_priceprobe_entity_type: Materialtaxonomy: dimension_type: material caption_column: material_descriptionrelationships: - target_model: gold_case_material_usage join_column: material_id order_column: usage_timestamp label: { en: Usage, de: Verbrauch, fr: Consommation } - target_model: gold_billing_events join_column: material_id order_column: billing_timestamp label: { en: Billing, de: Abrechnung, fr: Facturation } - target_model: gold_material_movements join_column: material_id order_column: movement_timestamp label: { en: Movements, de: Bewegungen, fr: Mouvements } - target_model: gold_article_suppliers join_column: material_id order_column: supplier_code label: { en: Suppliers, de: Lieferanten, fr: Fournisseurs } custom_sql: | SELECT a.supplier_code, s.supplier_name, a.supplier_reference, a.creation_date FROM {schema}.gold_article_suppliers a LEFT JOIN {schema}.gold_suppliers s ON a.supplier_code = s.supplier_code WHERE a.material_id = $1 ORDER BY a.supplier_code LIMIT 100Non-Browseable Example: entities/article_suppliers.yaml
Section titled “Non-Browseable Example: entities/article_suppliers.yaml”entity_id: article_suppliersdbt_model: gold_article_suppliersdisplay_name: en: Article Suppliers de: Artikellieferanten fr: Fournisseurs d'articlesicon: "🔗"is_browseable: falsecolumns: primary_key: article_supplier_key business_id: material_id label: null detail: null timestamp: null amount: nullprobe_entity_type: nulltaxonomy: dimension_type: null caption_column: nullYAML Field Reference
Section titled “YAML Field Reference”| Field | Required | Type | Description |
|---|---|---|---|
entity_id | Yes | string | Dimension key, must match filename |
dbt_model | Yes | string | Gold table name (must start with gold_) |
display_name | Yes | {en, de, fr} | Tri-lingual display name |
icon | Yes | string | Single emoji/character for UI |
is_browseable | Yes | boolean | Whether entity appears in dimension browser |
columns.primary_key | Yes | string | Surrogate key column |
columns.business_id | Yes | string | Human-meaningful ID column |
columns.label | No | string | Display name/description column |
columns.detail | No | string | Secondary info column (type, group, category) |
columns.timestamp | No | string | Primary timestamp for time-axis queries |
columns.amount | No | string | Monetary amount column |
probe_entity_type | No | string | Maps to probe findings entity_type field |
taxonomy.dimension_type | No | string | Value in taxonomy_member_mappings.dimension_type |
taxonomy.caption_column | No | string | Column used for taxonomy member captions |
relationships | No | list | Related fact definitions (see below) |
Relationship Fields
Section titled “Relationship Fields”| Field | Required | Type | Description |
|---|---|---|---|
target_model | Yes | string | Gold table to query |
join_column | Yes | string | Column on both source and target for the join |
order_column | Yes | string | Default ORDER BY column |
label | Yes | {en, de, fr} | Section heading in entity detail page |
custom_sql | No | string | Full SQL template with {schema} and $1 placeholders |
Entity List
Section titled “Entity List”~12 YAML files covering all current Gold entities:
| File | dbt_model | Browseable | probe_entity_type | Relationships |
|---|---|---|---|---|
cases.yaml | gold_cases | Yes | Case | usage, billing, procedures |
materials.yaml | gold_materials | Yes | Material | usage, billing, movements, suppliers |
cost_centers.yaml | gold_cost_centers | Yes | CostCenter | — |
suppliers.yaml | gold_suppliers | Yes | — | materials (via junction) |
procedures.yaml | gold_procedures | Yes | Procedure | — |
case_material_usage.yaml | gold_case_material_usage | Yes | — | — |
billing_events.yaml | gold_billing_events | Yes | BillingEvent | — |
material_movements.yaml | gold_material_movements | Yes | — | — |
article_suppliers.yaml | gold_article_suppliers | No | — | — |
material_classifications.yaml | gold_material_classifications | Yes | — | — |
packaging_types.yaml | gold_packaging_types | Yes | — | — |
service_mandates.yaml | gold_service_mandates | Yes | — | — |
Schema Definitions
Section titled “Schema Definitions”entity_registry Table
Section titled “entity_registry Table”One row per entity. Materialized as TABLE in dbt/{pack}/models/registry/entity_registry.sql.
| Column | Type | Nullable | Description |
|---|---|---|---|
entity_id | VARCHAR | No | Dimension key (e.g. cases, materials) |
dbt_model | VARCHAR | No | Table name (e.g. gold_cases) |
display_name_en | VARCHAR | No | Display name, English |
display_name_de | VARCHAR | No | Display name, German |
display_name_fr | VARCHAR | No | Display name, French |
icon | VARCHAR | No | Single emoji character |
is_browseable | BOOLEAN | No | Appears in dimension browser |
primary_key_column | VARCHAR | No | Surrogate key column name |
business_id_column | VARCHAR | No | Human-meaningful ID column |
label_column | VARCHAR | Yes | Display name column |
detail_column | VARCHAR | Yes | Secondary info column |
timestamp_column | VARCHAR | Yes | Primary timestamp for time-axis queries |
amount_column | VARCHAR | Yes | Monetary amount column |
probe_entity_type | VARCHAR | Yes | Maps to probe findings entity_type |
taxonomy_dimension_type | VARCHAR | Yes | dimension_type in taxonomy_member_mappings |
taxonomy_caption_column | VARCHAR | Yes | Column for taxonomy member captions |
entity_relationships Table
Section titled “entity_relationships Table”One row per relationship. Materialized as TABLE in dbt/{pack}/models/registry/entity_relationships.sql.
| Column | Type | Nullable | Description |
|---|---|---|---|
source_entity_id | VARCHAR | No | FK to entity_registry.entity_id |
join_column | VARCHAR | No | Column on source entity providing the join value |
target_model | VARCHAR | No | Gold table to query |
target_join_column | VARCHAR | No | Column in target to match (usually same as join_column) |
order_column | VARCHAR | No | Default ORDER BY column |
label_en | VARCHAR | No | Label, English |
label_de | VARCHAR | No | Label, German |
label_fr | VARCHAR | No | Label, French |
custom_sql | VARCHAR | Yes | Full SQL template with {schema} and $1 placeholders |
Platform Views
Section titled “Platform Views”Same pattern as other registries (from first tenant):
dbt/{pack}/models/platform/platform_entity_registry.sqldbt/{pack}/models/platform/platform_entity_relationships.sql
Design Decisions
Section titled “Design Decisions”- Two tables, not one — entity metadata and relationship metadata have different cardinality (1:N). Normalized, matches the signal_registry + probe_findings pattern.
models/registry/directory, notmodels/gold/— metadata registries don’t belong in the gold business layer.- No
gold_prefix —entity_registry, notgold_entity_registry. These are metadata, not business entities. custom_sqlfor complex joins — only 2 of 9 current relationships need it (both viagold_article_suppliersjunction table). Avoids over-engineering a join DSL.
Compiler
Section titled “Compiler”scripts/entitycompile.py
Section titled “scripts/entitycompile.py”Follows the pattern of signalcompile.py and hypothesiscompile.py:
- Reads all
entities/*.yamlfiles - Validates each YAML:
entity_idmust match filename (without.yaml)dbt_modelmust start withgold_primary_key_columnandbusiness_id_columnare requiredprobe_entity_type, if set, must be a known contract entity type- All column references must exist in
contracts/gold_contract.v1.json target_modelin relationships must reference a real Gold modelcustom_sqlmust contain{schema}and$1placeholders- Warns if a contract entity has no corresponding YAML file
- Generates two dbt SQL models via UNION ALL of SELECT literals
- Supports
--checkflag for CI dry-run
scripts/entitycheck.py
Section titled “scripts/entitycheck.py”Validates YAML structure and cross-references without generating SQL. Analogous to signalcheck.py and hypothesischeck.py.
Integration into rebuild.sh
Section titled “Integration into rebuild.sh”New Step 2f after lineage compile:
log "━━━ Step 2f: Compile entity registry ━━━"run_cmd "$VENV/python" "$SCRIPTS_DIR/entitycompile.py"Commands
Section titled “Commands”# Validate entity YAML definitionspython3 scripts/entitycheck.py
# Compile YAML → dbt SQLpython3 scripts/entitycompile.py
# Dry-run: check if compiled SQL is up to datepython3 scripts/entitycompile.py --check
# Build registry for a tenantcd dbt/{pack} && .venv/bin/dbt build --select registry \ --vars '{"tenant_id": "my_tenant"}'Explorer Consumption
Section titled “Explorer Consumption”explorer/src/lib/server/queries/entityRegistry.ts
Section titled “explorer/src/lib/server/queries/entityRegistry.ts”A single module that loads, caches, and queries the registry tables. All domain knowledge in the Explorer flows through this module.
interface EntityRegistryEntry { entity_id: string; dbt_model: string; display_name_en: string; display_name_de: string; display_name_fr: string; icon: string; is_browseable: boolean; primary_key_column: string; business_id_column: string; label_column: string | null; detail_column: string | null; timestamp_column: string | null; amount_column: string | null; probe_entity_type: string | null; taxonomy_dimension_type: string | null; taxonomy_caption_column: string | null;}
interface EntityRelationship { source_entity_id: string; join_column: string; target_model: string; target_join_column: string; order_column: string; label_en: string; label_de: string; label_fr: string; custom_sql: string | null;}Public API
Section titled “Public API”| Function | Returns | Replaces |
|---|---|---|
getRegistry(tenant) | Load + cache both tables | — |
getBrowseableEntities(tenant) | is_browseable = true entities | discoverDimensions() + exclusion lists |
getEntityByKey(tenant, key) | Lookup by entity_id | PK heuristic + pickDisplayColumns |
getEntityByProbeType(tenant, type) | Lookup by probe_entity_type | ENTITY_TYPE_MAP |
getRelationships(tenant, entityId) | Relationships for entity | RELATED_FACTS |
getEntitiesWithTimestamp(tenant) | timestamp_column IS NOT NULL | Hardcoded summary.ts timestamps |
getEntitiesWithAmount(tenant) | amount_column IS NOT NULL | Hardcoded summary.ts amounts |
getTaxonomyMemberEntities(tenant) | taxonomy_dimension_type IS NOT NULL | Hardcoded taxonomies.ts JOINs |
Cache Strategy
Section titled “Cache Strategy”Piggybacks on the existing getDb() reconnect logic in db.ts (which already tracks dbMtimeMs). When the DB mtime changes after dbt build, the registry cache is invalidated. No TTL needed.
Graceful Degradation
Section titled “Graceful Degradation”If entity_registry table does not exist (tenant not yet rebuilt after registry is added), fall back to current information_schema discovery + heuristics. This allows a smooth rollout — the Explorer works with or without the registry tables during the transition period.
What Gets Replaced
Section titled “What Gets Replaced”| File | Hardcoded Knowledge | Registry Replacement |
|---|---|---|
dimensions.ts — ENTITY_TYPE_MAP | signal entity_type → dimension key | probe_entity_type column |
dimensions.ts — EXCLUDED_TABLES/PREFIXES | which tables are browseable | is_browseable column |
dimensions.ts — RELATED_FACTS | 9 FK relationships + 2 custom SQL | entity_relationships table |
dimensions.ts — PK heuristic | first _key column = surrogate key | primary_key_column column |
dimensions.ts — discoverDimensions() | schema introspection + exclusions | SELECT * FROM entity_registry WHERE is_browseable |
summary.ts — timestamps | hardcoded start_time, billing_timestamp | timestamp_column column |
summary.ts — amounts | hardcoded billed_amount | amount_column column |
hypotheses.ts — ENTITY_NAME_MAP | 3 entity types → table/id/desc | business_id_column + label_column |
taxonomies.ts — member JOINs | hardcoded gold_cost_centers/gold_materials | taxonomy_dimension_type + taxonomy_caption_column |
jumpbar/+server.ts — pickDisplayColumns | regex heuristic for id/label/detail | business_id_column + label_column + detail_column |
i18n dimensions.* keys | static display names per entity | display_name_en/de/fr columns |
Implementation Phases
Section titled “Implementation Phases”Phase 1: Foundation (no Explorer changes)
Section titled “Phase 1: Foundation (no Explorer changes)”- Create
entities/directory with ~12 YAML files - Write
scripts/entitycompile.py(compiler + validator) - Write
scripts/entitycheck.py(validator only) - Create dbt models:
models/registry/entity_registry.sql,entity_relationships.sql - Create platform views:
platform_entity_registry.sql,platform_entity_relationships.sql - Add Step 2f to
scripts/rebuild.sh - Rebuild and verify tables populated
Verification: SELECT * FROM my_tenant.entity_registry returns all entities with correct metadata.
Phase 2: Explorer Registry Module
Section titled “Phase 2: Explorer Registry Module”- Create
explorer/src/lib/server/queries/entityRegistry.tswith load/cache/query functions - Implement graceful fallback: if
entity_registrytable missing, fall back to current heuristics
Verification: Module compiles, cache works, fallback triggers correctly on pre-registry tenants.
Phase 3: Migrate Consumers (one file at a time)
Section titled “Phase 3: Migrate Consumers (one file at a time)”dimensions.ts— replaceENTITY_TYPE_MAP,EXCLUDED_*,RELATED_FACTS, PK heuristic,discoverDimensions()jumpbar/+server.ts— replacepickDisplayColumnsand dynamic dimension searchhypotheses.ts— replaceENTITY_NAME_MAPtaxonomies.ts— replace hardcoded member caption JOINssummary.ts— replace hardcoded timestamp/amount column references
Verification: Each file migration is verified independently — Explorer behavior is identical before and after.
Phase 4: Cleanup
Section titled “Phase 4: Cleanup”- Remove all dead hardcoded code (
ENTITY_TYPE_MAP,EXCLUDED_TABLES,RELATED_FACTS, etc.) - Remove
dimensions.*i18n keys (display names now come from registry) - Update
CLAUDE.mdto document the Entity Registry
Verification: Grep Explorer src/ for hardcoded gold_ table names — zero hits outside entityRegistry.ts fallback.
Files to Create/Modify
Section titled “Files to Create/Modify”| Action | File |
|---|---|
| Create | entities/*.yaml (~12 files) |
| Create | scripts/entitycompile.py |
| Create | scripts/entitycheck.py |
| Create | dbt/{pack}/models/registry/entity_registry.sql |
| Create | dbt/{pack}/models/registry/entity_relationships.sql |
| Create | dbt/{pack}/models/platform/platform_entity_registry.sql |
| Create | dbt/{pack}/models/platform/platform_entity_relationships.sql |
| Create | explorer/src/lib/server/queries/entityRegistry.ts |
| Modify | scripts/rebuild.sh — add Step 2f: entity compile |
| Modify | explorer/src/lib/server/queries/dimensions.ts |
| Modify | explorer/src/lib/server/queries/summary.ts |
| Modify | explorer/src/lib/server/queries/hypotheses.ts |
| Modify | explorer/src/lib/server/queries/taxonomies.ts |
| Modify | explorer/src/routes/api/jumpbar/+server.ts |
| Modify | CLAUDE.md |
Testing
Section titled “Testing”The Entity Registry is unusually testable. Every layer — YAML validation, SQL compilation, DuckDB tables, Explorer consumption — can be tested in isolation with synthetic data. No real tenant data needed. No dbt project needed for the integration tests. Just a DuckDB instance and some generated schemas.
Tier 1: Unit Tests
Section titled “Tier 1: Unit Tests”Compiler Tests (tests/test_entitycompile.py)
Section titled “Compiler Tests (tests/test_entitycompile.py)”Pure Python, no DuckDB. Feed YAML dicts into the compiler functions, assert correct SQL output or correct validation errors.
| Test | Input | Expected |
|---|---|---|
| Valid entity | Complete YAML dict | SQL SELECT literal with all 16 columns |
| Missing required field | YAML without primary_key | Validation error naming the field |
entity_id ≠ filename | entity_id: foo in bar.yaml | Validation error |
dbt_model without gold_ | dbt_model: silver_cases | Validation error |
| Unknown column reference | timestamp: nonexistent_col | Warning (column not in contract) |
Relationship with custom_sql missing {schema} | SQL without placeholder | Validation error |
Relationship with custom_sql missing $1 | SQL without parameter | Validation error |
| No YAML for contract entity | Contract has BillingEvent, no billing_events.yaml | Warning (not error) |
--check mode, SQL up to date | Generated SQL matches file on disk | Exit 0 |
--check mode, SQL stale | Generated SQL differs from file on disk | Exit 1 |
Registry Module Tests (explorer/src/lib/server/queries/entityRegistry.test.ts)
Section titled “Registry Module Tests (explorer/src/lib/server/queries/entityRegistry.test.ts)”TypeScript unit tests against a real in-memory DuckDB. Create the two registry tables, insert known rows, then test every public API function.
| Test | Setup | Assert |
|---|---|---|
getBrowseableEntities | 3 entities, 1 with is_browseable = false | Returns 2 |
getEntityByKey | Insert entity with entity_id = 'cases' | Returns correct entry |
getEntityByKey — missing | Query for entity_id = 'nonexistent' | Returns null |
getEntityByProbeType | Insert entity with probe_entity_type = 'Case' | Returns correct entry |
getEntityByProbeType — null | Insert entity with probe_entity_type = null | Not returned |
getRelationships | Insert 3 relationships for materials | Returns 3 in order |
getEntitiesWithTimestamp | 2 with timestamp, 1 without | Returns 2 |
getEntitiesWithAmount | 1 with amount, 2 without | Returns 1 |
getTaxonomyMemberEntities | 1 with taxonomy_dimension_type, 2 without | Returns 1 |
| Cache invalidation | Load registry, change DB mtime, reload | Fresh data returned |
| Graceful degradation | No entity_registry table exists | Falls back to information_schema discovery |
Tier 2: Integration Tests
Section titled “Tier 2: Integration Tests”Verify the full pipeline: YAML → compiler → SQL → DuckDB → Explorer module.
# Run from repo rootpython3 -m pytest tests/test_entity_registry_integration.py -v- Start with real
entities/*.yamlfiles - Run
entitycompile.pyprogrammatically - Execute generated SQL in a fresh in-memory DuckDB
- Query the resulting tables
- Assert every YAML entity appears in
entity_registrywith correct values - Assert every YAML relationship appears in
entity_relationships
Tier 3: Three-Party Adversarial E2E Test
Section titled “Tier 3: Three-Party Adversarial E2E Test”This is the centrepiece of the testing strategy. Three independent parties that never communicate directly — they only share a DuckDB file and a common report format.
The Idea
Section titled “The Idea” ┌─────────────────────┐ │ Test Orchestrator │ │ (seeded PRNG) │ └──────┬──────────────┘ │ generates random schema spec │ ┌───────────────┼───────────────┐ ▼ ▼ ┌───────────────┐ ┌───────────────┐ │ Processor │ │ Explorer │ │ │ │ │ │ 1. Reads spec │ │ 1. Opens DB │ │ 2. Creates │ │ (read-only)│ │ DuckDB │──── .duckdb ──│ 2. Discovers │ │ schema + │ file │ schema via │ │ registry │ │ registry │ │ 3. Writes │ │ 3. Writes │ │ INTENTION │ │ REPORT │ └───────┬───────┘ └───────┬───────┘ │ │ │ ┌───────────────┐ │ └───▶│ Comparator │◀─────────┘ │ │ │ Diffs intent │ │ vs report │ │ │ │ PASS / FAIL │ └───────────────┘The Processor and Explorer are completely independent. They share no code, no imports, no function calls. The only coupling is:
- The DuckDB file (Processor writes, Explorer reads)
- The intention/report format (a shared JSON schema)
This means: if the Explorer can correctly report what the Processor built — for any random schema — the registry system works.
The Orchestrator
Section titled “The Orchestrator”Generates a random-but-consistent schema specification using a seeded PRNG. Each test run is deterministic and reproducible.
The spec controls:
- How many entities (3–15)
- Entity names (random words, not domain-specific terms — to prove domain-agnosticism)
- Which columns each entity has (random column names + types)
- Which column roles are assigned (primary_key, business_id, label, detail, timestamp, amount)
- Which entities are browseable
- Which entities have a
probe_entity_type - How many relationships exist (0–3 per entity)
- Whether any relationships use
custom_sql - Tri-lingual display names (generated from entity name + locale suffix)
The spec is passed to both the Processor and the Explorer test harness. But critically, the Explorer harness only uses the spec to know which tenant schema to query — it does not peek at the spec to know what it should find.
The Processor
Section titled “The Processor”Receives the schema spec. Creates a real DuckDB database with:
- Gold tables — one
gold_{entity_id}table per entity, with the columns specified, populated with 5–50 random rows - Registry tables —
entity_registryandentity_relationships, populated from the spec (as ifentitycompile.pyhad run) - Relationship target tables — for each relationship, the target table exists with the join column and order column
Then writes an intention manifest (intention.json):
{ "seed": 42, "schema": "test_tenant_42", "entities": [ { "entity_id": "widgets", "dbt_model": "gold_widgets", "display_name_en": "Widgets", "display_name_de": "Widgets_de", "display_name_fr": "Widgets_fr", "is_browseable": true, "primary_key_column": "widget_key", "business_id_column": "widget_code", "label_column": "widget_name", "detail_column": "widget_category", "timestamp_column": "created_at", "amount_column": "unit_price", "probe_entity_type": "Widget", "row_count": 23, "relationships": [ { "target_model": "gold_widget_events", "join_column": "widget_code", "order_column": "event_time", "label_en": "Events", "has_custom_sql": false } ] } ]}The intention manifest describes what the Processor claims it built. It is a pure declaration — no DuckDB queries, no introspection. Just the spec, serialized.
The Explorer
Section titled “The Explorer”Opens the DuckDB read-only. Has no access to the schema spec or the intention manifest. It only knows the tenant schema name.
Uses the entityRegistry.ts module (or a Python equivalent that mirrors its logic) to:
- Load
entity_registryandentity_relationshipsfrom the schema - For each registered entity, verify the Gold table exists and query its columns
- For each relationship, verify the target table exists and the join column is present
- For each entity with
timestamp_column, verify the column exists and contains temporal data - For each entity with
amount_column, verify the column exists and contains numeric data - Count rows in each Gold table
Then writes a discovery report (report.json):
{ "schema": "test_tenant_42", "entities": [ { "entity_id": "widgets", "dbt_model": "gold_widgets", "display_name_en": "Widgets", "display_name_de": "Widgets_de", "display_name_fr": "Widgets_fr", "is_browseable": true, "primary_key_column": "widget_key", "business_id_column": "widget_code", "label_column": "widget_name", "detail_column": "widget_category", "timestamp_column": "created_at", "amount_column": "unit_price", "probe_entity_type": "Widget", "table_exists": true, "row_count": 23, "pk_is_unique": true, "timestamp_column_type": "TIMESTAMP", "amount_column_type": "DOUBLE", "relationships": [ { "target_model": "gold_widget_events", "join_column": "widget_code", "order_column": "event_time", "target_exists": true, "join_column_exists": true, "has_custom_sql": false } ] } ]}The report describes what the Explorer actually found. It includes everything the intention declares, plus structural validation (table exists, PK is unique, column types are correct).
The Comparator
Section titled “The Comparator”Receives both intention.json and report.json. Compares them field by field.
Exact-match assertions:
| Field | Rule |
|---|---|
entity_id | Every intention entity appears in report |
dbt_model | Matches exactly |
display_name_{en,de,fr} | Matches exactly |
is_browseable | Matches exactly |
primary_key_column | Matches exactly |
business_id_column | Matches exactly |
label_column | Matches exactly (or both null) |
detail_column | Matches exactly (or both null) |
timestamp_column | Matches exactly (or both null) |
amount_column | Matches exactly (or both null) |
probe_entity_type | Matches exactly (or both null) |
row_count | Matches exactly |
| Relationship count | Same number per entity |
Relationship target_model | Matches exactly |
Relationship join_column | Matches exactly |
Structural assertions (report-only):
| Field | Rule |
|---|---|
table_exists | Must be true |
pk_is_unique | Must be true |
timestamp_column_type | Must be a temporal type (TIMESTAMP, DATE) |
amount_column_type | Must be a numeric type (DOUBLE, DECIMAL, INTEGER) |
target_exists | Must be true |
join_column_exists | Must be true |
Surplus/deficit detection:
| Check | Meaning |
|---|---|
| Entity in report but not in intention | Explorer found something the Processor didn’t declare (ghost entity) |
| Entity in intention but not in report | Explorer missed something the Processor built (blind spot) |
| Extra relationship in report | Explorer found an undeclared FK |
| Missing relationship in report | Explorer missed a declared FK |
Test Scenarios
Section titled “Test Scenarios”The orchestrator runs multiple seeds, each producing a different random schema. Some seeds are engineered to stress specific edge cases:
| Seed | Scenario | What it tests |
|---|---|---|
| 1 | Minimal — 1 entity, no relationships, no optional columns | Baseline |
| 2 | Maximal — 15 entities, all column roles filled, 3 relationships each | Scale |
| 3 | All non-browseable | getBrowseableEntities returns empty |
| 4 | No timestamps, no amounts | getEntitiesWithTimestamp / WithAmount return empty |
| 5 | Mixed null columns | Some entities have label, some don’t |
| 6 | Relationship with custom_sql | SQL template with {schema} and $1 |
| 7 | Duplicate probe_entity_type across 2 entities | getEntityByProbeType returns first match |
| 8 | Entity names that need quoting | Spaces, hyphens, unicode |
| 9 | 100 random seeds | Property-based coverage |
Running the Tests
Section titled “Running the Tests”# Single seed (fast, for debugging)python3 tests/test_entity_registry_e2e.py --seed 42
# All named scenariospython3 tests/test_entity_registry_e2e.py --scenarios
# Property-based (100 random seeds)python3 tests/test_entity_registry_e2e.py --property 100
# Verbose mode (prints intention + report + diff)python3 tests/test_entity_registry_e2e.py --seed 42 -vWhy This Works
Section titled “Why This Works”The three-party architecture has a key property: the Processor and Explorer can’t collude. They share no code. If the Explorer returns a correct report, it’s because the registry system genuinely works — not because someone hardcoded the right answers.
This catches bugs that unit tests miss:
- Compiler generates SQL that DuckDB rejects (syntax errors in edge cases)
- Registry table schema doesn’t match what the Explorer expects (column type mismatch)
- Cache returns stale data after schema change
- Graceful degradation path is never exercised
- Column role metadata is correct in the registry but the Explorer queries the wrong column
It also serves as a living specification: the intention/report format documents exactly what the registry system promises.
File Layout
Section titled “File Layout”tests/ test_entity_registry_e2e.py ← orchestrator + comparator entity_registry/ processor.py ← builds DuckDB from spec explorer.py ← discovers schema from registry schemas.py ← intention + report JSON schemas conftest.py ← pytest fixtures (tmp DuckDB, seeds)Tier 4: Regression Smoke Test
Section titled “Tier 4: Regression Smoke Test”After the real Entity Registry is deployed, a lightweight check that runs as part of rebuild.sh:
| Step | What | How |
|---|---|---|
| 1 | Compiler runs | python3 scripts/entitycompile.py — exit 0 |
| 2 | CI dry-run | python3 scripts/entitycompile.py --check — exit 0 |
| 3 | Registry populated | SELECT count(*) FROM {tenant}.entity_registry > 0 |
| 4 | Relationships populated | SELECT count(*) FROM {tenant}.entity_relationships > 0 |
| 5 | No orphan references | Every entity_relationships.source_entity_id exists in entity_registry |
| 6 | Type safety | npx svelte-check — zero errors |
| 7 | No hardcoded remnants | Grep Explorer src/ for hardcoded gold_ table names — zero hits outside fallback |
Open Questions
Section titled “Open Questions”- Should
entity_relationshipssupport asymmetric joins? — Currentlyjoin_columnis assumed to be the same on both sides. Thecustom_sqlescape hatch handles the 2 cases where this doesn’t hold. If more asymmetric joins appear, add atarget_join_columnfield. - Should the registry include non-Gold entities? — Silver audit signals reference Silver tables. For now, Silver entities are out of scope — they don’t appear in the Explorer. Revisit if Silver browsing is added.
- Should the YAML include column-level metadata? — e.g., which columns are numeric, which are dates, which should be formatted as currency. Currently the Explorer infers this from DuckDB types at query time. Adding column metadata to the registry would enable richer formatting, but it’s a lot of YAML for marginal benefit today.