
Entity Registry — Concept Document

Date: 2026-02-27

The Entity Registry is a metadata-driven replacement for all hardcoded domain assumptions in the Explorer. Instead of the Explorer knowing that gold_cases has a start_time column, or that signal entity type Case maps to dimension key cases, or that materials have related billing events joined on material_id — all of this knowledge lives in YAML files, gets compiled into dbt registry tables, and the Explorer reads it at runtime.

The Explorer code may assume the metadata schema. It may not assume any business content.

This follows the same pattern as signals, theses, and verdicts: YAML source of truth → compiler → dbt SQL → runtime consumption. The Entity Registry is the final piece that makes the Explorer fully domain-agnostic.

The Explorer currently works because ~25 hardcoded assumptions scattered across 5 TypeScript files happen to match the current Gold schema. Adding a new Gold entity requires editing multiple files in multiple layers. Worse, these assumptions are invisible — they’re inline constants, heuristics, and naming conventions that only work by coincidence.

explorer/src/lib/server/queries/dimensions.ts

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 5–7 | EXCLUDED_TABLES | gold_article_suppliers is not browseable |
| 10–15 | EXCLUDED_PREFIXES | gold_taxonomy_*, gold_quality_*, gold_reconciliation_*, gold_io_coefficient* are not browseable |
| 23–28 | ENTITY_TYPE_MAP | Maps signal entity_type (Case/Material/Procedure/CostCenter) → dimension keys |
| 64 | PK inference | First column ending in _key is primary key |
| 204–231 | RELATED_FACTS | 9 relationship definitions: which tables join, on which columns, display order, and 2 custom SQL templates |

explorer/src/lib/server/queries/summary.ts

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 9–16 | YearFilteredCounts | Hardcoded interface with cases, procedures, usage, billing, billing_amount, movements |
| 34 | start_time | Cases’ primary timestamp |
| 38 | billing_timestamp | Billing events’ primary timestamp |
| 77–89 | 5 entity+timestamp pairs | cases/start_time, procedures/procedure_timestamp, usage/usage_timestamp, billing/billing_timestamp, movements/movement_timestamp |
| 86 | billed_amount | The monetary amount column for billing |
| 115–121 | getTenantLandingStats | Hardcoded references to gold_cases and gold_billing_events |

explorer/src/lib/server/queries/hypotheses.ts

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 120–124 | ENTITY_NAME_MAP | Maps entity_type → table/id-column/description-column for 3 entity types |

explorer/src/lib/server/queries/taxonomies.ts

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 63–67 | Member caption JOIN | Hardcoded LEFT JOIN to gold_cost_centers.name and gold_materials.material_description |

explorer/src/routes/api/jumpbar/+server.ts

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 20–34 | pickDisplayColumns | Heuristic: skip _key columns, pattern-match “name”/“description” for label, “type”/“group”/“category” for detail |

explorer/src/lib/i18n/locales/{en,de,fr}.json

| Line(s) | Assumption | What it encodes |
| --- | --- | --- |
| 41–61 | dimensions.* keys | Static display names for each entity type in 3 languages |

Total: 22+ distinct hardcoded assumptions across 6 files, not counting the duplicated categoryColors and statusStyles maps in Svelte components.

  • Adding a new Gold entity (e.g., gold_suppliers) requires editing ENTITY_TYPE_MAP, EXCLUDED_TABLES, potentially RELATED_FACTS, summary.ts queries, all 3 locale JSON files, and hoping you didn’t miss anything.
  • Changing a column name in the Gold contract silently breaks Explorer heuristics.
  • Supporting tenants with different entities is impossible: the Explorer assumes all tenants share the same Gold schema shape.
  • The _key suffix heuristic works today but is fragile. It can’t distinguish surrogate keys from business IDs.

The Explorer code may assume the metadata schema. It may not assume any business content.

The Explorer knows two things: (1) there is a table called entity_registry with a fixed set of columns, and (2) there is a table called entity_relationships with a fixed set of columns. Everything else — entity names, column roles, relationships, display labels, icons — comes from the data.

entities/*.yaml                          ← source of truth
    │
    ▼
scripts/entitycompile.py ──validates──▶ contracts/gold_contract.v1.json
    ├──▶ dbt/{pack}/models/registry/entity_registry.sql
    └──▶ dbt/{pack}/models/registry/entity_relationships.sql
    │
    ▼
dbt build --select registry
    ├──▶ {tenant}.entity_registry (TABLE)
    ├──▶ {tenant}.entity_relationships (TABLE)
    │
    ▼
platform_entity_registry (VIEW, from first tenant)
platform_entity_relationships (VIEW, from first tenant)
    │
    ▼
explorer/src/lib/server/queries/entityRegistry.ts
    loadRegistry() → per-tenant cache, invalidated by DB mtime
    getBrowseableEntities() / getEntityByKey() / getRelationships() / ...
    │
    ▼
All Explorer consumers: zero hardcoded entity knowledge

This is the same pattern as signals (probes/*.yaml → signalcompile.py → signal_findings__*.sql) and theses (hypotheses/*.yaml → hypothesiscompile.py → hypothesis_verdicts.sql).

Entity definitions live in entities/ at the repo root (same level as probes/, hypotheses/, diagnoses/). One file per entity. The entity_id must match the filename (without .yaml).

# entities/cases.yaml
entity_id: cases
dbt_model: gold_cases
display_name:
  en: Cases
  de: Fälle
  fr: Cas
icon: ""
is_browseable: true
columns:
  primary_key: case_key
  business_id: case_token
  label: drg_description
  detail: case_type
  timestamp: start_time
  amount: null
probe_entity_type: Case
taxonomy:
  dimension_type: null
  caption_column: null
relationships:
  - target_model: gold_case_material_usage
    join_column: case_token
    order_column: usage_timestamp
    label: { en: Usage, de: Verbrauch, fr: Consommation }
  - target_model: gold_billing_events
    join_column: case_token
    order_column: billing_timestamp
    label: { en: Billing, de: Abrechnung, fr: Facturation }
  - target_model: gold_procedures
    join_column: case_token
    order_column: procedure_timestamp
    label: { en: Procedures, de: Eingriffe, fr: Procédures }

# entities/materials.yaml
entity_id: materials
dbt_model: gold_materials
display_name:
  en: Materials
  de: Materialien
  fr: Matériaux
icon: "📦"
is_browseable: true
columns:
  primary_key: material_key
  business_id: material_id
  label: material_description
  detail: material_group
  timestamp: null
  amount: standard_price
probe_entity_type: Material
taxonomy:
  dimension_type: material
  caption_column: material_description
relationships:
  - target_model: gold_case_material_usage
    join_column: material_id
    order_column: usage_timestamp
    label: { en: Usage, de: Verbrauch, fr: Consommation }
  - target_model: gold_billing_events
    join_column: material_id
    order_column: billing_timestamp
    label: { en: Billing, de: Abrechnung, fr: Facturation }
  - target_model: gold_material_movements
    join_column: material_id
    order_column: movement_timestamp
    label: { en: Movements, de: Bewegungen, fr: Mouvements }
  - target_model: gold_article_suppliers
    join_column: material_id
    order_column: supplier_code
    label: { en: Suppliers, de: Lieferanten, fr: Fournisseurs }
    custom_sql: |
      SELECT a.supplier_code, s.supplier_name, a.supplier_reference, a.creation_date
      FROM {schema}.gold_article_suppliers a
      LEFT JOIN {schema}.gold_suppliers s ON a.supplier_code = s.supplier_code
      WHERE a.material_id = $1 ORDER BY a.supplier_code LIMIT 100

Non-Browseable Example: entities/article_suppliers.yaml

entity_id: article_suppliers
dbt_model: gold_article_suppliers
display_name:
  en: Article Suppliers
  de: Artikellieferanten
  fr: Fournisseurs d'articles
icon: "🔗"
is_browseable: false
columns:
  primary_key: article_supplier_key
  business_id: material_id
  label: null
  detail: null
  timestamp: null
  amount: null
probe_entity_type: null
taxonomy:
  dimension_type: null
  caption_column: null

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| entity_id | Yes | string | Dimension key, must match filename |
| dbt_model | Yes | string | Gold table name (must start with gold_) |
| display_name | Yes | {en, de, fr} | Tri-lingual display name |
| icon | Yes | string | Single emoji/character for UI |
| is_browseable | Yes | boolean | Whether entity appears in dimension browser |
| columns.primary_key | Yes | string | Surrogate key column |
| columns.business_id | Yes | string | Human-meaningful ID column |
| columns.label | No | string | Display name/description column |
| columns.detail | No | string | Secondary info column (type, group, category) |
| columns.timestamp | No | string | Primary timestamp for time-axis queries |
| columns.amount | No | string | Monetary amount column |
| probe_entity_type | No | string | Maps to probe findings entity_type field |
| taxonomy.dimension_type | No | string | Value in taxonomy_member_mappings.dimension_type |
| taxonomy.caption_column | No | string | Column used for taxonomy member captions |
| relationships | No | list | Related fact definitions (see below) |

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| target_model | Yes | string | Gold table to query |
| join_column | Yes | string | Column on both source and target for the join |
| order_column | Yes | string | Default ORDER BY column |
| label | Yes | {en, de, fr} | Section heading in entity detail page |
| custom_sql | No | string | Full SQL template with {schema} and $1 placeholders |

~12 YAML files covering all current Gold entities:

| File | dbt_model | Browseable | probe_entity_type | Relationships |
| --- | --- | --- | --- | --- |
| cases.yaml | gold_cases | Yes | Case | usage, billing, procedures |
| materials.yaml | gold_materials | Yes | Material | usage, billing, movements, suppliers |
| cost_centers.yaml | gold_cost_centers | Yes | CostCenter | |
| suppliers.yaml | gold_suppliers | Yes | | materials (via junction) |
| procedures.yaml | gold_procedures | Yes | Procedure | |
| case_material_usage.yaml | gold_case_material_usage | Yes | | |
| billing_events.yaml | gold_billing_events | Yes | BillingEvent | |
| material_movements.yaml | gold_material_movements | Yes | | |
| article_suppliers.yaml | gold_article_suppliers | No | | |
| material_classifications.yaml | gold_material_classifications | Yes | | |
| packaging_types.yaml | gold_packaging_types | Yes | | |
| service_mandates.yaml | gold_service_mandates | Yes | | |

One row per entity. Materialized as TABLE in dbt/{pack}/models/registry/entity_registry.sql.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| entity_id | VARCHAR | No | Dimension key (e.g. cases, materials) |
| dbt_model | VARCHAR | No | Table name (e.g. gold_cases) |
| display_name_en | VARCHAR | No | Display name, English |
| display_name_de | VARCHAR | No | Display name, German |
| display_name_fr | VARCHAR | No | Display name, French |
| icon | VARCHAR | No | Single emoji character |
| is_browseable | BOOLEAN | No | Appears in dimension browser |
| primary_key_column | VARCHAR | No | Surrogate key column name |
| business_id_column | VARCHAR | No | Human-meaningful ID column |
| label_column | VARCHAR | Yes | Display name column |
| detail_column | VARCHAR | Yes | Secondary info column |
| timestamp_column | VARCHAR | Yes | Primary timestamp for time-axis queries |
| amount_column | VARCHAR | Yes | Monetary amount column |
| probe_entity_type | VARCHAR | Yes | Maps to probe findings entity_type |
| taxonomy_dimension_type | VARCHAR | Yes | dimension_type in taxonomy_member_mappings |
| taxonomy_caption_column | VARCHAR | Yes | Column for taxonomy member captions |

One row per relationship. Materialized as TABLE in dbt/{pack}/models/registry/entity_relationships.sql.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| source_entity_id | VARCHAR | No | FK to entity_registry.entity_id |
| join_column | VARCHAR | No | Column on source entity providing the join value |
| target_model | VARCHAR | No | Gold table to query |
| target_join_column | VARCHAR | No | Column in target to match (usually same as join_column) |
| order_column | VARCHAR | No | Default ORDER BY column |
| label_en | VARCHAR | No | Label, English |
| label_de | VARCHAR | No | Label, German |
| label_fr | VARCHAR | No | Label, French |
| custom_sql | VARCHAR | Yes | Full SQL template with {schema} and $1 placeholders |

Same pattern as other registries (from first tenant):

  • dbt/{pack}/models/platform/platform_entity_registry.sql
  • dbt/{pack}/models/platform/platform_entity_relationships.sql

  • Two tables, not one: entity metadata and relationship metadata have different cardinality (1:N). Normalized, matches the signal_registry + probe_findings pattern.
  • models/registry/ directory, not models/gold/: metadata registries don’t belong in the gold business layer.
  • No gold_ prefix: entity_registry, not gold_entity_registry. These are metadata, not business entities.
  • custom_sql for complex joins: only 2 of 9 current relationships need it (both via the gold_article_suppliers junction table). Avoids over-engineering a join DSL.
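
To make the custom_sql escape hatch concrete, here is a minimal Python sketch of how a consumer might turn one entity_relationships row into a query. The function name build_relationship_sql, the identifier check, and the default LIMIT of 100 (borrowed from the custom template in the materials example) are assumptions for illustration, not the Explorer's actual code, which lives in TypeScript.

```python
import re
from typing import Optional

# Assumption: only plain identifiers are accepted; anything else is rejected
# rather than quoted.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_relationship_sql(schema: str, rel: dict, limit: int = 100) -> str:
    """Build the SQL for one relationship row; the caller binds $1."""
    schema = _ident(schema)
    custom: Optional[str] = rel.get("custom_sql")
    if custom:
        # Escape hatch: the template carries its own joins and its own $1.
        return custom.replace("{schema}", schema)
    # Default path: a simple equality join on the target table.
    return (
        f"SELECT * FROM {schema}.{_ident(rel['target_model'])} "
        f"WHERE {_ident(rel['target_join_column'])} = $1 "
        f"ORDER BY {_ident(rel['order_column'])} LIMIT {int(limit)}"
    )
```

With this shape, the seven simple relationships need no SQL in YAML at all, and the two junction-table cases pass through their template with only the schema substituted.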

Follows the pattern of signalcompile.py and hypothesiscompile.py:

  1. Reads all entities/*.yaml files
  2. Validates each YAML:
    • entity_id must match filename (without .yaml)
    • dbt_model must start with gold_
    • columns.primary_key and columns.business_id are required (they compile to primary_key_column and business_id_column)
    • probe_entity_type, if set, must be a known contract entity type
    • All column references must exist in contracts/gold_contract.v1.json
    • target_model in relationships must reference a real Gold model
    • custom_sql must contain {schema} and $1 placeholders
    • Warns if a contract entity has no corresponding YAML file
  3. Generates two dbt SQL models via UNION ALL of SELECT literals
  4. Supports --check flag for CI dry-run
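
The validation and generation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real entitycompile.py: the function names (validate_entity, sql_literal, registry_select, compile_registry) are hypothetical, the contract cross-checks against gold_contract.v1.json are left out, and entities are assumed to be already parsed into dicts.

```python
from pathlib import Path

REQUIRED = ("entity_id", "dbt_model", "display_name", "icon", "is_browseable", "columns")


def validate_entity(entity: dict, filename: str) -> list[str]:
    """Return a list of validation errors (empty when the entity is valid)."""
    errors = [f"missing required field: {f}" for f in REQUIRED if f not in entity]
    if entity.get("entity_id") != Path(filename).stem:
        errors.append("entity_id must match filename")
    if not str(entity.get("dbt_model", "")).startswith("gold_"):
        errors.append("dbt_model must start with gold_")
    cols = entity.get("columns") or {}
    for role in ("primary_key", "business_id"):
        if not cols.get(role):
            errors.append(f"missing required field: columns.{role}")
    for rel in entity.get("relationships") or []:
        sql = rel.get("custom_sql")
        if sql and ("{schema}" not in sql or "$1" not in sql):
            errors.append(f"custom_sql for {rel.get('target_model')} needs {{schema}} and $1")
    return errors


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    return "'" + str(value).replace("'", "''") + "'"


def registry_select(entity: dict) -> str:
    """One SELECT of literals covering the 16 entity_registry columns."""
    cols, tax, name = entity["columns"], entity.get("taxonomy") or {}, entity["display_name"]
    pairs = [
        ("entity_id", entity["entity_id"]), ("dbt_model", entity["dbt_model"]),
        ("display_name_en", name["en"]), ("display_name_de", name["de"]),
        ("display_name_fr", name["fr"]), ("icon", entity["icon"]),
        ("is_browseable", entity["is_browseable"]),
        ("primary_key_column", cols["primary_key"]),
        ("business_id_column", cols["business_id"]),
        ("label_column", cols.get("label")), ("detail_column", cols.get("detail")),
        ("timestamp_column", cols.get("timestamp")), ("amount_column", cols.get("amount")),
        ("probe_entity_type", entity.get("probe_entity_type")),
        ("taxonomy_dimension_type", tax.get("dimension_type")),
        ("taxonomy_caption_column", tax.get("caption_column")),
    ]
    return "SELECT " + ", ".join(f"{sql_literal(v)} AS {k}" for k, v in pairs)


def compile_registry(entities: list[dict]) -> str:
    """Join one literal SELECT per entity with UNION ALL."""
    return "\nUNION ALL\n".join(registry_select(e) for e in entities)
```

A real compiler would probably also cast the NULL literals so every branch of the UNION ALL has the same column types; that detail is omitted here.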

Validates YAML structure and cross-references without generating SQL. Analogous to signalcheck.py and hypothesischeck.py.

New Step 2f after lineage compile:

log "━━━ Step 2f: Compile entity registry ━━━"
run_cmd "$VENV/python" "$SCRIPTS_DIR/entitycompile.py"

# Validate entity YAML definitions
python3 scripts/entitycheck.py
# Compile YAML → dbt SQL
python3 scripts/entitycompile.py
# Dry-run: check if compiled SQL is up to date
python3 scripts/entitycompile.py --check
# Build registry for a tenant
cd dbt/{pack} && .venv/bin/dbt build --select registry \
    --vars '{"tenant_id": "my_tenant"}'

explorer/src/lib/server/queries/entityRegistry.ts


A single module that loads, caches, and queries the registry tables. All domain knowledge in the Explorer flows through this module.

interface EntityRegistryEntry {
  entity_id: string;
  dbt_model: string;
  display_name_en: string;
  display_name_de: string;
  display_name_fr: string;
  icon: string;
  is_browseable: boolean;
  primary_key_column: string;
  business_id_column: string;
  label_column: string | null;
  detail_column: string | null;
  timestamp_column: string | null;
  amount_column: string | null;
  probe_entity_type: string | null;
  taxonomy_dimension_type: string | null;
  taxonomy_caption_column: string | null;
}

interface EntityRelationship {
  source_entity_id: string;
  join_column: string;
  target_model: string;
  target_join_column: string;
  order_column: string;
  label_en: string;
  label_de: string;
  label_fr: string;
  custom_sql: string | null;
}

| Function | Returns | Replaces |
| --- | --- | --- |
| getRegistry(tenant) | Load + cache both tables | |
| getBrowseableEntities(tenant) | is_browseable = true entities | discoverDimensions() + exclusion lists |
| getEntityByKey(tenant, key) | Lookup by entity_id | PK heuristic + pickDisplayColumns |
| getEntityByProbeType(tenant, type) | Lookup by probe_entity_type | ENTITY_TYPE_MAP |
| getRelationships(tenant, entityId) | Relationships for entity | RELATED_FACTS |
| getEntitiesWithTimestamp(tenant) | timestamp_column IS NOT NULL | Hardcoded summary.ts timestamps |
| getEntitiesWithAmount(tenant) | amount_column IS NOT NULL | Hardcoded summary.ts amounts |
| getTaxonomyMemberEntities(tenant) | taxonomy_dimension_type IS NOT NULL | Hardcoded taxonomies.ts JOINs |
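
The lookups in this API are thin filters over the two loaded tables. The Python sketch below mirrors that logic on plain dicts; it is an illustration only. The real module is TypeScript and takes a tenant argument, whereas these hypothetical functions take an already-loaded registry of the assumed shape {"entities": [...], "relationships": [...]}.

```python
from typing import Optional


def get_browseable_entities(reg: dict) -> list[dict]:
    """Entities that should appear in the dimension browser."""
    return [e for e in reg["entities"] if e["is_browseable"]]


def get_entity_by_key(reg: dict, key: str) -> Optional[dict]:
    """Look up an entity by entity_id; None if it is not registered."""
    return next((e for e in reg["entities"] if e["entity_id"] == key), None)


def get_entity_by_probe_type(reg: dict, probe_type: Optional[str]) -> Optional[dict]:
    """First entity whose probe_entity_type matches; never matches on None."""
    if probe_type is None:
        return None
    return next(
        (e for e in reg["entities"] if e["probe_entity_type"] == probe_type), None
    )


def get_relationships(reg: dict, entity_id: str) -> list[dict]:
    """Relationship rows for one source entity, in stored order."""
    return [r for r in reg["relationships"] if r["source_entity_id"] == entity_id]


def get_entities_with_timestamp(reg: dict) -> list[dict]:
    """Entities that declare a primary timestamp column."""
    return [e for e in reg["entities"] if e["timestamp_column"] is not None]


def get_entities_with_amount(reg: dict) -> list[dict]:
    """Entities that declare a monetary amount column."""
    return [e for e in reg["entities"] if e["amount_column"] is not None]
```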

Piggybacks on the existing getDb() reconnect logic in db.ts (which already tracks dbMtimeMs). When the DB mtime changes after dbt build, the registry cache is invalidated. No TTL needed.
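
As a rough illustration of mtime-based invalidation, here is a minimal Python sketch. The class name MtimeCache and the loader callback are hypothetical; the actual implementation is the TypeScript logic in db.ts and may differ, for example by tracking milliseconds rather than nanoseconds.

```python
import os
from typing import Callable, Dict, Optional


class MtimeCache:
    """Per-tenant cache that is dropped whenever the DB file's mtime changes."""

    def __init__(self, db_path: str, loader: Callable[[str], dict]):
        self._db_path = db_path
        self._loader = loader              # loads the registry for one tenant
        self._mtime_ns: Optional[int] = None
        self._by_tenant: Dict[str, dict] = {}

    def get(self, tenant: str) -> dict:
        mtime_ns = os.stat(self._db_path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # The file was rewritten (e.g. by dbt build): forget everything.
            self._by_tenant.clear()
            self._mtime_ns = mtime_ns
        if tenant not in self._by_tenant:
            self._by_tenant[tenant] = self._loader(tenant)
        return self._by_tenant[tenant]
```

Because the check is a single stat call per lookup, there is no window in which stale data is served after a rebuild, which is why a TTL adds nothing here.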

If the entity_registry table does not exist (the tenant has not been rebuilt since the registry was added), fall back to the current information_schema discovery and heuristics. This allows a smooth rollout: the Explorer works with or without the registry tables during the transition period.

| File | Hardcoded Knowledge | Registry Replacement |
| --- | --- | --- |
| dimensions.ts: ENTITY_TYPE_MAP | signal entity_type → dimension key | probe_entity_type column |
| dimensions.ts: EXCLUDED_TABLES/PREFIXES | which tables are browseable | is_browseable column |
| dimensions.ts: RELATED_FACTS | 9 FK relationships + 2 custom SQL | entity_relationships table |
| dimensions.ts: PK heuristic | first _key column = surrogate key | primary_key_column column |
| dimensions.ts: discoverDimensions() | schema introspection + exclusions | SELECT * FROM entity_registry WHERE is_browseable |
| summary.ts: timestamps | hardcoded start_time, billing_timestamp | timestamp_column column |
| summary.ts: amounts | hardcoded billed_amount | amount_column column |
| hypotheses.ts: ENTITY_NAME_MAP | 3 entity types → table/id/desc | business_id_column + label_column |
| taxonomies.ts: member JOINs | hardcoded gold_cost_centers/gold_materials | taxonomy_dimension_type + taxonomy_caption_column |
| jumpbar/+server.ts: pickDisplayColumns | regex heuristic for id/label/detail | business_id_column + label_column + detail_column |
| i18n dimensions.* keys | static display names per entity | display_name_en/de/fr columns |

  1. Create entities/ directory with ~12 YAML files
  2. Write scripts/entitycompile.py (compiler + validator)
  3. Write scripts/entitycheck.py (validator only)
  4. Create dbt models: models/registry/entity_registry.sql, entity_relationships.sql
  5. Create platform views: platform_entity_registry.sql, platform_entity_relationships.sql
  6. Add Step 2f to scripts/rebuild.sh
  7. Rebuild and verify tables populated

Verification: SELECT * FROM my_tenant.entity_registry returns all entities with correct metadata.

  1. Create explorer/src/lib/server/queries/entityRegistry.ts with load/cache/query functions
  2. Implement graceful fallback: if entity_registry table missing, fall back to current heuristics

Verification: Module compiles, cache works, fallback triggers correctly on pre-registry tenants.

Phase 3: Migrate Consumers (one file at a time)

  1. dimensions.ts — replace ENTITY_TYPE_MAP, EXCLUDED_*, RELATED_FACTS, PK heuristic, discoverDimensions()
  2. jumpbar/+server.ts — replace pickDisplayColumns and dynamic dimension search
  3. hypotheses.ts — replace ENTITY_NAME_MAP
  4. taxonomies.ts — replace hardcoded member caption JOINs
  5. summary.ts — replace hardcoded timestamp/amount column references

Verification: Each file migration is verified independently — Explorer behavior is identical before and after.

  1. Remove all dead hardcoded code (ENTITY_TYPE_MAP, EXCLUDED_TABLES, RELATED_FACTS, etc.)
  2. Remove dimensions.* i18n keys (display names now come from registry)
  3. Update CLAUDE.md to document the Entity Registry

Verification: Grep Explorer src/ for hardcoded gold_ table names — zero hits outside entityRegistry.ts fallback.

| Action | File |
| --- | --- |
| Create | entities/*.yaml (~12 files) |
| Create | scripts/entitycompile.py |
| Create | scripts/entitycheck.py |
| Create | dbt/{pack}/models/registry/entity_registry.sql |
| Create | dbt/{pack}/models/registry/entity_relationships.sql |
| Create | dbt/{pack}/models/platform/platform_entity_registry.sql |
| Create | dbt/{pack}/models/platform/platform_entity_relationships.sql |
| Create | explorer/src/lib/server/queries/entityRegistry.ts |
| Modify | scripts/rebuild.sh (add Step 2f: entity compile) |
| Modify | explorer/src/lib/server/queries/dimensions.ts |
| Modify | explorer/src/lib/server/queries/summary.ts |
| Modify | explorer/src/lib/server/queries/hypotheses.ts |
| Modify | explorer/src/lib/server/queries/taxonomies.ts |
| Modify | explorer/src/routes/api/jumpbar/+server.ts |
| Modify | CLAUDE.md |

The Entity Registry is unusually testable. Every layer — YAML validation, SQL compilation, DuckDB tables, Explorer consumption — can be tested in isolation with synthetic data. No real tenant data needed. No dbt project needed for the integration tests. Just a DuckDB instance and some generated schemas.

Compiler Tests (tests/test_entitycompile.py)


Pure Python, no DuckDB. Feed YAML dicts into the compiler functions, assert correct SQL output or correct validation errors.

| Test | Input | Expected |
| --- | --- | --- |
| Valid entity | Complete YAML dict | SQL SELECT literal with all 16 columns |
| Missing required field | YAML without primary_key | Validation error naming the field |
| entity_id ≠ filename | entity_id: foo in bar.yaml | Validation error |
| dbt_model without gold_ | dbt_model: silver_cases | Validation error |
| Unknown column reference | timestamp: nonexistent_col | Warning (column not in contract) |
| Relationship with custom_sql missing {schema} | SQL without placeholder | Validation error |
| Relationship with custom_sql missing $1 | SQL without parameter | Validation error |
| No YAML for contract entity | Contract has BillingEvent, no billing_events.yaml | Warning (not error) |
| --check mode, SQL up to date | Generated SQL matches file on disk | Exit 0 |
| --check mode, SQL stale | Generated SQL differs from file on disk | Exit 1 |
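
The two --check rows reduce to comparing freshly generated SQL with what is on disk and returning an exit code. A minimal Python sketch of that comparison, assuming a hypothetical helper check_up_to_date that receives a mapping from output path to generated SQL (the real flag handling in entitycompile.py may differ):

```python
import sys
from pathlib import Path


def check_up_to_date(generated: dict) -> int:
    """Return 0 if every generated model matches its file on disk, else 1."""
    stale = [
        path
        for path, sql in generated.items()
        if not Path(path).is_file() or Path(path).read_text(encoding="utf-8") != sql
    ]
    for path in stale:
        # Report on stderr so CI logs show which model needs recompiling.
        print(f"stale: {path}", file=sys.stderr)
    return 1 if stale else 0
```

A caller would typically end with sys.exit(check_up_to_date(...)), so CI fails whenever someone edits a YAML file without recompiling.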

Registry Module Tests (explorer/src/lib/server/queries/entityRegistry.test.ts)


TypeScript unit tests against a real in-memory DuckDB. Create the two registry tables, insert known rows, then test every public API function.

| Test | Setup | Assert |
| --- | --- | --- |
| getBrowseableEntities | 3 entities, 1 with is_browseable = false | Returns 2 |
| getEntityByKey | Insert entity with entity_id = 'cases' | Returns correct entry |
| getEntityByKey (missing) | Query for entity_id = 'nonexistent' | Returns null |
| getEntityByProbeType | Insert entity with probe_entity_type = 'Case' | Returns correct entry |
| getEntityByProbeType (null) | Insert entity with probe_entity_type = null | Not returned |
| getRelationships | Insert 3 relationships for materials | Returns 3 in order |
| getEntitiesWithTimestamp | 2 with timestamp, 1 without | Returns 2 |
| getEntitiesWithAmount | 1 with amount, 2 without | Returns 1 |
| getTaxonomyMemberEntities | 1 with taxonomy_dimension_type, 2 without | Returns 1 |
| Cache invalidation | Load registry, change DB mtime, reload | Fresh data returned |
| Graceful degradation | No entity_registry table exists | Falls back to information_schema discovery |

Verify the full pipeline: YAML → compiler → SQL → DuckDB → Explorer module.

# Run from repo root
python3 -m pytest tests/test_entity_registry_integration.py -v

  1. Start with real entities/*.yaml files
  2. Run entitycompile.py programmatically
  3. Execute generated SQL in a fresh in-memory DuckDB
  4. Query the resulting tables
  5. Assert every YAML entity appears in entity_registry with correct values
  6. Assert every YAML relationship appears in entity_relationships

This is the centrepiece of the testing strategy. It involves three independent parties that never communicate directly; they share only a DuckDB file and a common report format.

                 ┌─────────────────────┐
                 │  Test Orchestrator  │
                 │    (seeded PRNG)    │
                 └──────┬──────────────┘
          generates random schema spec
        ┌───────────────┼───────────────┐
        ▼                               ▼
┌───────────────┐               ┌───────────────┐
│   Processor   │               │   Explorer    │
│               │               │               │
│ 1. Reads spec │               │ 1. Opens DB   │
│ 2. Creates    │               │    (read-only)│
│    DuckDB     │──── .duckdb ──│ 2. Discovers  │
│    schema +   │     file      │    schema via │
│    registry   │               │    registry   │
│ 3. Writes     │               │ 3. Writes     │
│    INTENTION  │               │    REPORT     │
└───────┬───────┘               └───────┬───────┘
        │                               │
        │    ┌───────────────┐          │
        └───▶│  Comparator   │◀─────────┘
             │               │
             │ Diffs intent  │
             │ vs report     │
             │               │
             │ PASS / FAIL   │
             └───────────────┘

The Processor and Explorer are completely independent. They share no code, no imports, no function calls. The only coupling is:

  1. The DuckDB file (Processor writes, Explorer reads)
  2. The intention/report format (a shared JSON schema)

This means: if the Explorer can correctly report what the Processor built — for any random schema — the registry system works.

Generates a random-but-consistent schema specification using a seeded PRNG. Each test run is deterministic and reproducible.

The spec controls:

  • How many entities (3–15)
  • Entity names (random words, not domain-specific terms — to prove domain-agnosticism)
  • Which columns each entity has (random column names + types)
  • Which column roles are assigned (primary_key, business_id, label, detail, timestamp, amount)
  • Which entities are browseable
  • Which entities have a probe_entity_type
  • How many relationships exist (0–3 per entity)
  • Whether any relationships use custom_sql
  • Tri-lingual display names (generated from entity name + locale suffix)

The spec is passed to both the Processor and the Explorer test harness. But critically, the Explorer harness only uses the spec to know which tenant schema to query — it does not peek at the spec to know what it should find.
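
A seeded generator along these lines could produce such a spec. This is a minimal Python sketch, not the orchestrator itself: generate_spec, random_word, the 50% and 80% probabilities, and the fixed event_time order column are all illustrative assumptions. The entity count (3–15), relationship count (0–3), and locale-suffixed display names follow the description above; the 5–50 row count follows the Processor's stated range.

```python
import random

OPTIONAL_ROLES = ("label", "detail", "timestamp", "amount")
LOCALES = ("en", "de", "fr")


def random_word(rng: random.Random, length: int = 6) -> str:
    """Random lowercase word, deliberately not a domain term."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))


def generate_spec(seed: int) -> dict:
    """Build a schema spec that depends only on the seed."""
    rng = random.Random(seed)  # private generator: same seed, same spec
    names: list[str] = []
    for _ in range(rng.randint(3, 15)):
        name = random_word(rng)
        while name in names:  # keep entity names unique
            name = random_word(rng)
        names.append(name)

    entities = []
    for name in names:
        columns = {"primary_key": f"{name}_key", "business_id": f"{name}_code"}
        for role in OPTIONAL_ROLES:
            columns[role] = f"{name}_{role}" if rng.random() < 0.5 else None
        relationships = [
            {
                "target_model": f"gold_{name}_{random_word(rng)}",
                "join_column": columns["business_id"],
                "order_column": "event_time",  # hypothetical fixed name
                "has_custom_sql": False,
            }
            for _ in range(rng.randint(0, 3))
        ]
        entities.append({
            "entity_id": name,
            "dbt_model": f"gold_{name}",
            "display_name": {
                loc: name if loc == "en" else f"{name}_{loc}" for loc in LOCALES
            },
            "is_browseable": rng.random() < 0.8,
            "columns": columns,
            "probe_entity_type": name.capitalize() if rng.random() < 0.5 else None,
            "row_count": rng.randint(5, 50),
            "relationships": relationships,
        })
    return {"seed": seed, "schema": f"test_tenant_{seed}", "entities": entities}
```

Using a private random.Random instance rather than the module-level functions keeps runs reproducible even if other test code also draws random numbers.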

Receives the schema spec. Creates a real DuckDB database with:

  1. Gold tables: one gold_{entity_id} table per entity, with the columns specified, populated with 5–50 random rows
  2. Registry tables: entity_registry and entity_relationships, populated from the spec (as if entitycompile.py had run)
  3. Relationship target tables: for each relationship, the target table exists with the join column and order column

Then writes an intention manifest (intention.json):

{
  "seed": 42,
  "schema": "test_tenant_42",
  "entities": [
    {
      "entity_id": "widgets",
      "dbt_model": "gold_widgets",
      "display_name_en": "Widgets",
      "display_name_de": "Widgets_de",
      "display_name_fr": "Widgets_fr",
      "is_browseable": true,
      "primary_key_column": "widget_key",
      "business_id_column": "widget_code",
      "label_column": "widget_name",
      "detail_column": "widget_category",
      "timestamp_column": "created_at",
      "amount_column": "unit_price",
      "probe_entity_type": "Widget",
      "row_count": 23,
      "relationships": [
        {
          "target_model": "gold_widget_events",
          "join_column": "widget_code",
          "order_column": "event_time",
          "label_en": "Events",
          "has_custom_sql": false
        }
      ]
    }
  ]
}

The intention manifest describes what the Processor claims it built. It is a pure declaration — no DuckDB queries, no introspection. Just the spec, serialized.

Opens the DuckDB file read-only. It never reads the contents of the schema spec or the intention manifest; the only thing it takes from the orchestrator is the tenant schema name.

Uses the entityRegistry.ts module (or a Python equivalent that mirrors its logic) to:

  1. Load entity_registry and entity_relationships from the schema
  2. For each registered entity, verify the Gold table exists and query its columns
  3. For each relationship, verify the target table exists and the join column is present
  4. For each entity with timestamp_column, verify the column exists and contains temporal data
  5. For each entity with amount_column, verify the column exists and contains numeric data
  6. Count rows in each Gold table

Then writes a discovery report (report.json):

{
  "schema": "test_tenant_42",
  "entities": [
    {
      "entity_id": "widgets",
      "dbt_model": "gold_widgets",
      "display_name_en": "Widgets",
      "display_name_de": "Widgets_de",
      "display_name_fr": "Widgets_fr",
      "is_browseable": true,
      "primary_key_column": "widget_key",
      "business_id_column": "widget_code",
      "label_column": "widget_name",
      "detail_column": "widget_category",
      "timestamp_column": "created_at",
      "amount_column": "unit_price",
      "probe_entity_type": "Widget",
      "table_exists": true,
      "row_count": 23,
      "pk_is_unique": true,
      "timestamp_column_type": "TIMESTAMP",
      "amount_column_type": "DOUBLE",
      "relationships": [
        {
          "target_model": "gold_widget_events",
          "join_column": "widget_code",
          "order_column": "event_time",
          "target_exists": true,
          "join_column_exists": true,
          "has_custom_sql": false
        }
      ]
    }
  ]
}

The report describes what the Explorer actually found. It includes everything the intention declares, plus structural validation (table exists, PK is unique, column types are correct).

Receives both intention.json and report.json. Compares them field by field.

Exact-match assertions:

| Field | Rule |
| --- | --- |
| entity_id | Every intention entity appears in report |
| dbt_model | Matches exactly |
| display_name_{en,de,fr} | Matches exactly |
| is_browseable | Matches exactly |
| primary_key_column | Matches exactly |
| business_id_column | Matches exactly |
| label_column | Matches exactly (or both null) |
| detail_column | Matches exactly (or both null) |
| timestamp_column | Matches exactly (or both null) |
| amount_column | Matches exactly (or both null) |
| probe_entity_type | Matches exactly (or both null) |
| row_count | Matches exactly |
| Relationship count | Same number per entity |
| Relationship target_model | Matches exactly |
| Relationship join_column | Matches exactly |

Structural assertions (report-only):

| Field | Rule |
| --- | --- |
| table_exists | Must be true |
| pk_is_unique | Must be true |
| timestamp_column_type | Must be a temporal type (TIMESTAMP, DATE) |
| amount_column_type | Must be a numeric type (DOUBLE, DECIMAL, INTEGER) |
| target_exists | Must be true |
| join_column_exists | Must be true |

Surplus/deficit detection:

| Check | Meaning |
| --- | --- |
| Entity in report but not in intention | Explorer found something the Processor didn’t declare (ghost entity) |
| Entity in intention but not in report | Explorer missed something the Processor built (blind spot) |
| Extra relationship in report | Explorer found an undeclared FK |
| Missing relationship in report | Explorer missed a declared FK |
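
The three groups of rules above map naturally onto one comparison function. The following Python sketch is an assumption about how a comparator could be structured, not the project's implementation; the function name compare and the failure message wording are hypothetical, while the field names come from the intention and report examples.

```python
EXACT_FIELDS = (
    "dbt_model", "display_name_en", "display_name_de", "display_name_fr",
    "is_browseable", "primary_key_column", "business_id_column", "label_column",
    "detail_column", "timestamp_column", "amount_column", "probe_entity_type",
    "row_count",
)
TEMPORAL = ("TIMESTAMP", "DATE")
NUMERIC = ("DOUBLE", "DECIMAL", "INTEGER")


def _rel_keys(entity: dict) -> set:
    """Identify each relationship by (target_model, join_column)."""
    return {(r["target_model"], r["join_column"]) for r in entity.get("relationships", [])}


def compare(intention: dict, report: dict) -> list[str]:
    """Return human-readable failures; an empty list means PASS."""
    failures = []
    want = {e["entity_id"]: e for e in intention["entities"]}
    got = {e["entity_id"]: e for e in report["entities"]}
    # Surplus/deficit detection on entities.
    failures += [f"{eid}: ghost entity" for eid in sorted(got.keys() - want.keys())]
    failures += [f"{eid}: blind spot" for eid in sorted(want.keys() - got.keys())]
    for eid in sorted(want.keys() & got.keys()):
        w, g = want[eid], got[eid]
        # Exact-match assertions (None == None covers "both null").
        for field in EXACT_FIELDS:
            if w.get(field) != g.get(field):
                failures.append(f"{eid}.{field}: {w.get(field)!r} != {g.get(field)!r}")
        # Structural assertions, which only the report can answer.
        for flag in ("table_exists", "pk_is_unique"):
            if g.get(flag) is not True:
                failures.append(f"{eid}.{flag}: must be true")
        ts_type = str(g.get("timestamp_column_type") or "").upper()
        if g.get("timestamp_column") and not ts_type.startswith(TEMPORAL):
            failures.append(f"{eid}.timestamp_column_type: not temporal")
        amt_type = str(g.get("amount_column_type") or "").upper()
        if g.get("amount_column") and not amt_type.startswith(NUMERIC):
            failures.append(f"{eid}.amount_column_type: not numeric")
        # Surplus/deficit detection on relationships.
        failures += [f"{eid}: extra relationship {k}" for k in sorted(_rel_keys(g) - _rel_keys(w))]
        failures += [f"{eid}: missing relationship {k}" for k in sorted(_rel_keys(w) - _rel_keys(g))]
        for r in g.get("relationships", []):
            for flag in ("target_exists", "join_column_exists"):
                if r.get(flag) is not True:
                    failures.append(f"{eid} -> {r['target_model']}.{flag}: must be true")
    return failures
```

Returning a list of failures rather than stopping at the first mismatch makes the verbose diff for a single seed much more useful when debugging.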

The orchestrator runs multiple seeds, each producing a different random schema. Some seeds are engineered to stress specific edge cases:

| Seed | Scenario | What it tests |
| --- | --- | --- |
| 1 | Minimal: 1 entity, no relationships, no optional columns | Baseline |
| 2 | Maximal: 15 entities, all column roles filled, 3 relationships each | Scale |
| 3 | All non-browseable | getBrowseableEntities returns empty |
| 4 | No timestamps, no amounts | getEntitiesWithTimestamp / WithAmount return empty |
| 5 | Mixed null columns | Some entities have label, some don’t |
| 6 | Relationship with custom_sql | SQL template with {schema} and $1 |
| 7 | Duplicate probe_entity_type across 2 entities | getEntityByProbeType returns first match |
| 8 | Entity names that need quoting | Spaces, hyphens, unicode |
| 9 | 100 random seeds | Property-based coverage |

# Single seed (fast, for debugging)
python3 tests/test_entity_registry_e2e.py --seed 42
# All named scenarios
python3 tests/test_entity_registry_e2e.py --scenarios
# Property-based (100 random seeds)
python3 tests/test_entity_registry_e2e.py --property 100
# Verbose mode (prints intention + report + diff)
python3 tests/test_entity_registry_e2e.py --seed 42 -v

The three-party architecture has a key property: the Processor and Explorer can’t collude. They share no code. If the Explorer returns a correct report, it’s because the registry system genuinely works — not because someone hardcoded the right answers.

This catches bugs that unit tests miss:

  • Compiler generates SQL that DuckDB rejects (syntax errors in edge cases)
  • Registry table schema doesn’t match what the Explorer expects (column type mismatch)
  • Cache returns stale data after schema change
  • Graceful degradation path is never exercised
  • Column role metadata is correct in the registry but the Explorer queries the wrong column

It also serves as a living specification: the intention/report format documents exactly what the registry system promises.

tests/
  test_entity_registry_e2e.py      ← orchestrator + comparator
  entity_registry/
    processor.py                   ← builds DuckDB from spec
    explorer.py                    ← discovers schema from registry
    schemas.py                     ← intention + report JSON schemas
    conftest.py                    ← pytest fixtures (tmp DuckDB, seeds)

After the real Entity Registry is deployed, a lightweight check that runs as part of rebuild.sh:

| Step | What | How |
| --- | --- | --- |
| 1 | Compiler runs | python3 scripts/entitycompile.py exits 0 |
| 2 | CI dry-run | python3 scripts/entitycompile.py --check exits 0 |
| 3 | Registry populated | SELECT count(*) FROM {tenant}.entity_registry returns > 0 |
| 4 | Relationships populated | SELECT count(*) FROM {tenant}.entity_relationships returns > 0 |
| 5 | No orphan references | Every entity_relationships.source_entity_id exists in entity_registry |
| 6 | Type safety | npx svelte-check reports zero errors |
| 7 | No hardcoded remnants | Grep Explorer src/ for hardcoded gold_ table names: zero hits outside fallback |

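  1. Should entity_relationships support asymmetric joins? Currently the YAML has a single join_column that is assumed to be the same on both sides, even though the compiled entity_relationships table already carries a separate target_join_column. The custom_sql escape hatch handles the 2 cases where this doesn’t hold. If more asymmetric joins appear, add a target_join_column field to the YAML.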
  2. Should the registry include non-Gold entities? — Silver audit signals reference Silver tables. For now, Silver entities are out of scope — they don’t appear in the Explorer. Revisit if Silver browsing is added.
  3. Should the YAML include column-level metadata? — e.g., which columns are numeric, which are dates, which should be formatted as currency. Currently the Explorer infers this from DuckDB types at query time. Adding column metadata to the registry would enable richer formatting, but it’s a lot of YAML for marginal benefit today.