feat(schema): schema-driven attributes end-to-end with persistent ontology#256
feat(schema): schema-driven attributes end-to-end with persistent ontology#256galshubeli wants to merge 8 commits into
Conversation
…ology Wires declared PropertyType / RelationType.properties from the GraphSchema through extraction, storage, and retrieval. The ontology persists in a dedicated <data_graph>__ontology FalkorDB graph and accumulates as the union of every schema ever registered, so ingest passes can use subset/ extension schemas while retrieval always sees the global ontology. - core/models: PropertyType type-normalising validator, RelationType.properties, ExtractedEntity/Relation.attributes, reserved-name rejection, GraphSchema.merge. - storage/ontology_store (new): OntologyStore.load/register/clear against <data_graph>__ontology; idempotent unions; pattern lists deduped. - ingestion/graph_extraction: conditional attribute block in VERIFY_EXTRACT_RELS_PROMPT (empty for property-less schemas, zero drift), per-type coercion at the extractor boundary, required-missing record drop, aggregators carry attributes (last-write-wins), conversion merges into props. - ingestion/pipeline: _validate_attributes pass after _prune, then re-run _filter_quality to cascade-clean dangling relationships. - retrieval/cypher_generation: replaces hardcoded SCHEMA_PROMPT with build_schema_prompt(schema, question) + render_schema_block; synthesises one numeric-attribute example per declared INTEGER/FLOAT property; validate_cypher accepts dynamic schema labels. - retrieval/multi_path: schema threaded into execute_cypher_retrieval. - api/main: GraphRAG owns the OntologyStore, registers local schema on each ingest, refresh_ontology() propagates the global ontology to retrieval, public get_ontology() / refresh_ontology() methods. Forward-only evolution: nodes ingested before a property was declared remain without that property; FalkorDB's MERGE ... SET n += props handles fill-in naturally and WHERE p.attr > N naturally excludes them. Tests: 124 new/extended unit tests; full suite 819 passed, 23 skipped (integration tests gated by RUN_INTEGRATION=1). Property-less schemas produce identical output to the prior pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds schema property typing and reserved names, live ontology inference (OntologyStore), attribute-aware extraction and coercion, pipeline attribute validation, schema-aware Cypher generation, retrieval wiring, facade methods to inspect/refresh/save the global schema, and tests for all layers. ChangesPersistent Schema Ontology with Attribute Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
graphrag_sdk/tests/test_ontology_store.py (1)
116-166: ⚡ Quick winAdd a regression test for same-named properties across different owners.
Please add a test that registers two types (e.g., entity
Personand relationWORKS_AT) both with property"since"and asserts independent metadata persistence. This will catch owner-scoping regressions in ontology property upserts.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/tests/test_ontology_store.py` around lines 116 - 166, Add a regression test in tests/test_ontology_store.py that registers two types sharing the same property name to ensure owner-scoped property upserts: create a GraphSchema with an EntityType(label="Person", properties=[PropertyType(name="since", type="DATE", required=True)]) and a RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has properties=[PropertyType(name="since", type="INTEGER", required=False)]; call await store.register(schema) and then inspect fake_graph.calls to assert there are two separate MERGE/SET operations for the property (look for "MERGE (o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for each include the correct owner (e.g., owner label or the relation/entity context) and the distinct type/required values to prove they were persisted independently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 196-203: delete_all() currently clears only the data graph leaving
the persisted ontology in OntologyStore (_ontology_store) and a stale
_global_schema; update delete_all() to also clear the ontology store and reset
in-memory schema: call the OntologyStore.clear()/delete_all() API (or implement
one) on self._ontology_store, then set self._global_schema back to a clean
default (e.g., self.schema or an empty GraphSchema) and ensure
refresh_ontology()/load paths will repopulate from the cleared store; reference
OntologyStore, delete_all(), _ontology_store, _global_schema, and
refresh_ontology() when making the change.
In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 314-315: The merge() implementation currently uses
"incoming.description or current.description" which treats empty strings as
falsey and prevents an explicit empty description from overwriting an existing
one; change those checks in merge() to use an explicit None check (e.g., use
incoming.description if incoming.description is not None else
current.description) so last-write-wins applies even when incoming.description
== "". Apply the same fix to the other occurrence that uses the same pattern
(the second use around the properties merge).
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py`:
- Around line 261-265: SCHEMA_PROMPT currently points to _SCHEMA_PROMPT_TEMPLATE
which requires schema_block and attribute_examples and thus breaks legacy
callers using SCHEMA_PROMPT.format(question=...); replace the simple alias with
a compatibility shim object (or small helper) named SCHEMA_PROMPT that
implements a format(**kwargs) method which: if kwargs contains only "question"
(legacy call), delegates to build_schema_prompt(question=kwargs["question"]) and
returns that built string, otherwise delegates to
_SCHEMA_PROMPT_TEMPLATE.format(**kwargs); ensure the shim also behaves like a
string for non-format uses (e.g., __str__ or __repr__ returning the underlying
template) so callers of SCHEMA_PROMPT continue to work.
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 177-181: The MERGE for OntologyProperty is using only {name:
$name}, causing properties with the same name across different owners to
collide; update the upsert so the property node is keyed by both name and its
owner context (e.g. include owner or owner_label in the MERGE pattern that
creates p) instead of just name, and adjust any related uniqueness constraints
or queries that expect a single-keyed property (refer to the MERGE creating
p:OntologyProperty, the variable p, and owner_label in this block).
---
Nitpick comments:
In `@graphrag_sdk/tests/test_ontology_store.py`:
- Around line 116-166: Add a regression test in tests/test_ontology_store.py
that registers two types sharing the same property name to ensure owner-scoped
property upserts: create a GraphSchema with an EntityType(label="Person",
properties=[PropertyType(name="since", type="DATE", required=True)]) and a
RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has
properties=[PropertyType(name="since", type="INTEGER", required=False)]; call
await store.register(schema) and then inspect fake_graph.calls to assert there
are two separate MERGE/SET operations for the property (look for "MERGE
(o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for
each include the correct owner (e.g., owner label or the relation/entity
context) and the distinct type/required values to prove they were persisted
independently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9aab5a64-1ab3-4985-8285-6595928e4a23
📒 Files selected for processing (14)
graphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/core/models.pygraphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.pygraphrag_sdk/src/graphrag_sdk/ingestion/pipeline.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.pygraphrag_sdk/src/graphrag_sdk/storage/__init__.pygraphrag_sdk/src/graphrag_sdk/storage/ontology_store.pygraphrag_sdk/tests/test_attribute_prompt.pygraphrag_sdk/tests/test_cypher_generation.pygraphrag_sdk/tests/test_graph_extraction.pygraphrag_sdk/tests/test_models.pygraphrag_sdk/tests/test_ontology_store.pygraphrag_sdk/tests/test_pipeline.py
The data graph already contains labels, relationship types, properties, and endpoint patterns — querying it is the source of truth, not a parallel persisted graph. Schema-as-config is supported via plain JSON files. - storage/ontology_store: rewritten as inference-only. OntologyStore.infer() calls db.labels() / db.relationshipTypes(), samples keys()+typeof() per label, derives RELATES sub-types from rel_type values, and pulls endpoint patterns. Structural labels/edges and reserved property keys are filtered out so they never leak into the LLM-facing schema. - core/models: GraphSchema.from_file / save_to_file convenience methods for the schema-as-config workflow (versioned JSON, hand-editable, shareable). - api/main: GraphRAG.get_ontology() returns inferred ∪ self.schema so user- declared descriptions and required flags survive (and properties declared but not yet extracted still appear in the retrieval prompt). New save_ontology(path) writes the global ontology to a JSON file. Ingest no longer writes to a separate ontology graph. The previous PR introduced a persistent <data_graph>__ontology FalkorDB graph; that was solving "declared property with zero instances" — a case that is essentially invisible to retrieval (empty results either way). The runtime cost was a second graph to back up, clean up, and reason about. Inference + an opt-in JSON file covers the same ground in less code. Tests: 830 passed, 23 skipped. test_ontology_store rewritten around mocked-driver introspection. test_models gains a JSON round-trip test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
graphrag_sdk/src/graphrag_sdk/api/main.py (1)
596-601:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRefresh or invalidate
_global_schemaafter every graph mutation, not only ingest.This best-effort refresh only runs on
_ingest_single().update(),delete_document(), anddelete_all()also change the graph, but they leave_global_schemaand the retrieval strategy's cached schema untouched, so subsequent retrieval can use stale labels/properties until the caller manually invokesrefresh_ontology().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/api/main.py` around lines 596 - 601, The patch currently refreshes the global ontology only in _ingest_single(); update(), delete_document(), and delete_all() also mutate the graph but don't refresh or invalidate the cached schema, causing stale retrievals; modify those mutation paths (the update(), delete_document(), and delete_all() methods) to either call await self.refresh_ontology() after a successful mutation or explicitly clear/invalidate self._global_schema and notify the retrieval strategy to clear its cached schema (use the same mechanism the ingest path uses to update retrieval_strategy's cache), so that retrievals see new/removed properties immediately.graphrag_sdk/src/graphrag_sdk/core/models.py (1)
330-333:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPreserve explicit empty descriptions in
merge().These branches still use
incoming.description or current.description, somerge()cannot intentionally clear a description with""even though the method documents last-write-wins behavior.Suggested fix
ent_by_label[e.label] = EntityType( label=cur.label, - description=e.description or cur.description, + description=e.description if e.description is not None else cur.description, properties=_merge_props(cur.properties, e.properties), ) @@ rel_by_label[r.label] = RelationType( label=cur.label, - description=r.description or cur.description, + description=r.description if r.description is not None else cur.description, patterns=merged_patterns, properties=_merge_props(cur.properties, r.properties), )Also applies to: 348-350
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/core/models.py` around lines 330 - 333, In merge(), the code uses falsy checks (incoming.description or current.description) which prevents intentionally clearing a description with an empty string; update the description selection to check for None explicitly (e.g., use incoming.description if incoming.description is not None else current.description) in the branches that construct EntityType (references: ent_by_label, EntityType construction, variables e/cur or incoming/current) and the analogous branch around lines 348-350 so last-write-wins semantics accept "" as a valid value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 111-119: The infer()/get_ontology() path only collects relation
types when "RELATES" exists, dropping direct user edge labels like
WORKS_AT/LOCATED_IN; update the logic around rel_types and relations (the
relations: list[RelationType] block and use of self._infer_relates_subtypes) to
also include any non-structural rel_types besides RELATES by converting each
such label into a RelationType (including property keys and endpoint patterns)
or by calling a new helper to infer their properties using the same sample_size;
keep excluding structural edges (PART_OF, NEXT_CHUNK, MENTIONED_IN) and preserve
the existing RELATES handling via self._infer_relates_subtypes(sample_size).
---
Duplicate comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 596-601: The patch currently refreshes the global ontology only in
_ingest_single(); update(), delete_document(), and delete_all() also mutate the
graph but don't refresh or invalidate the cached schema, causing stale
retrievals; modify those mutation paths (the update(), delete_document(), and
delete_all() methods) to either call await self.refresh_ontology() after a
successful mutation or explicitly clear/invalidate self._global_schema and
notify the retrieval strategy to clear its cached schema (use the same mechanism
the ingest path uses to update retrieval_strategy's cache), so that retrievals
see new/removed properties immediately.
In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 330-333: In merge(), the code uses falsy checks
(incoming.description or current.description) which prevents intentionally
clearing a description with an empty string; update the description selection to
check for None explicitly (e.g., use incoming.description if
incoming.description is not None else current.description) in the branches that
construct EntityType (references: ent_by_label, EntityType construction,
variables e/cur or incoming/current) and the analogous branch around lines
348-350 so last-write-wins semantics accept "" as a valid value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee8ce5db-b86d-454b-94a9-d1ef91a6f6f8
📒 Files selected for processing (5)
graphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/core/models.pygraphrag_sdk/src/graphrag_sdk/storage/ontology_store.pygraphrag_sdk/tests/test_models.pygraphrag_sdk/tests/test_ontology_store.py
🚧 Files skipped from review as they are similar to previous changes (1)
- graphrag_sdk/tests/test_models.py
- graph_extraction.py: regroup imports (datetime alongside stdlib); remove unused EntityType import; wrap two over-length attribute-fetch lines. - api/main.py: split a long get_ontology return expression. ruff check src/graphrag_sdk/ passes locally. Tests: 830 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's "ruff format --check src/" flagged 5 files. Pure formatting; no behavioral change. Tests: 830 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 140-146: The query that builds subtypes_result currently applies
LIMIT before DISTINCT, causing rare r.rel_type values to be dropped; update the
Cypher used in the get_ontology (or wherever subtypes_result is produced) so
DISTINCT r.rel_type is applied prior to sampling/limiting: select DISTINCT
r.rel_type AS rel_type (filter out NULLs), then apply the LIMIT $limit to that
deduplicated set before returning rel_type via self._conn.query.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 61f452a0-7e8d-4c3d-aa3e-7be0ec4c8ff3
📒 Files selected for processing (5)
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.pygraphrag_sdk/src/graphrag_sdk/ingestion/pipeline.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.pygraphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
✅ Files skipped from review due to trivial changes (1)
- graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
🚧 Files skipped from review as they are similar to previous changes (3)
- graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
- graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
- graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
There was a problem hiding this comment.
Pull request overview
Implements schema-driven attributes across extraction, ingestion validation, and schema-aware Cypher generation, plus an ontology mechanism to supply a global schema to retrieval.
Changes:
- Adds attribute coercion/extraction support driven by
GraphSchemaproperty declarations, and validates/drops records missing required attributes during ingestion. - Introduces
OntologyStoreand threads schema/ontology into retrieval so Cypher prompt/validation can be schema-aware. - Extends unit test coverage across models, extraction, ingestion pipeline, ontology inference, and Cypher generation.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py | Builds schema-aware prompt blocks/examples and validates labels against an optional GraphSchema. |
| graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py | New ontology inference helper that introspects the live graph for labels/relationship subtypes/properties. |
| graphrag_sdk/src/graphrag_sdk/core/models.py | Adds schema property typing/validation, reserved-name checks, schema merge, and extracted attributes fields. |
| graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py | Adds attribute prompt block + per-type coercion and carries attributes through aggregation/storage conversion. |
| graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py | Adds _validate_attributes pass after pruning and re-runs quality filtering. |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py | Threads schema into Cypher retrieval execution when enabled. |
| graphrag_sdk/src/graphrag_sdk/api/main.py | Instantiates OntologyStore, adds get/refresh/save ontology APIs, and propagates schema to retrieval strategy. |
| graphrag_sdk/src/graphrag_sdk/storage/init.py | Exports OntologyStore. |
| graphrag_sdk/tests/test_attribute_prompt.py | New tests for attribute prompt rendering and coercion helpers. |
| graphrag_sdk/tests/test_cypher_generation.py | Adds tests for schema block rendering, prompt building, and schema-aware Cypher validation. |
| graphrag_sdk/tests/test_graph_extraction.py | Adds tests for attribute coercion, required-attribute dropping, aggregation, and property merge behavior. |
| graphrag_sdk/tests/test_models.py | Updates/extends schema model tests (type normalization, reserved keys, merge). |
| graphrag_sdk/tests/test_ontology_store.py | New tests for ontology inference logic with a mocked FalkorDB driver. |
| graphrag_sdk/tests/test_pipeline.py | Adds tests for ingestion attribute validation behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Question: "What organizations are related to the technology?" | ||
| ```cypher | ||
| MATCH (o:Organization)-[r:RELATES]-(t:Technology) | ||
| RETURN o.name AS organization, t.name AS technology, r.rel_type AS relation, r.fact AS evidence | ||
| LIMIT 20 | ||
| ``` | ||
| ```{attribute_examples} | ||
|
|
||
| ## Your task |
| """Ontology inference from the live data graph. | ||
|
|
||
| The schema is **derived** from what's in the data graph, not maintained in a | ||
| separate persistent graph. This keeps the architecture honest: the source of | ||
| truth for "what entities and relations exist" is the data itself. | ||
|
|
| property name appears in both, the incoming type/description/required | ||
| overrides — last-write-wins, matching the persisted ontology's | ||
| register() semantics. |
| # Ontology is inferred from the live data graph (no separate | ||
| # persistent graph). User-supplied ``schema`` is merged on top at | ||
| # retrieval time so declared descriptions / required flags / not-yet- | ||
| # extracted properties survive. | ||
| self._ontology_store = OntologyStore(self._conn) | ||
| # Global ontology used at retrieval time. Initially just the local | ||
| # schema; refresh_ontology() merges in the inferred view on demand | ||
| # and after each ingest. | ||
| self._global_schema: GraphSchema = self.schema | ||
|
|
||
| # Default retrieval strategy | ||
| self._retrieval_strategy = retrieval_strategy or MultiPathRetrieval( | ||
| graph_store=self._graph_store, | ||
| vector_store=self._vector_store, | ||
| embedder=self.embedder, | ||
| llm=self.llm, | ||
| schema=self._global_schema, | ||
| ) |
`required=True` was a footgun for an LLM-extraction pipeline: when the LLM couldn't find a value in the text (often because the text didn't state it), the entire entity was silently dropped. That destroyed real data. New behavior: every declared property name appears in `ExtractedEntity.attributes` with either the coerced value or `None`. The storage layer's existing `_clean_properties` strips `None` before writing so the graph sees "key missing" — the right null semantics for retrieval (`WHERE p.age > N` naturally excludes nodes without `age`). - core/models: remove `required` field from PropertyType. - ingestion/graph_extraction: `_coerce_attributes` now returns just a dict with `None` for missing/uncoercible values; never drops records. The drop-on-missing-required branches and aggregated warnings are gone. - ingestion/pipeline: `_validate_attributes` strips undeclared keys but never drops records. Cascade `_filter_quality` still runs but no longer has anything to cascade in the attribute-validation case. - storage/ontology_store: docstring no longer mentions required flags. - tests: collapsed the three "required missing" cases into a single "every declared property appears in result; uncoercible -> None" test per file. 829 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| "For each verified entity and each extracted relationship, also extract any " | ||
| "of the declared properties below. Use null when the property is not stated " | ||
| "in the text. Coerce to the declared type (INTEGER, FLOAT, BOOLEAN, DATE in " | ||
| "ISO 8601, STRING, LIST). Place extracted values inside an `attributes` " | ||
| "object on each entity / relationship.", |
The data graph alone wasn't enough to anchor the schema: - types are lossy (FalkorDB `typeof()` collapses DATE → "string") - multiple processes/instances couldn't share a declarative view - nothing caught schema-typo contradictions across sessions - declarative metadata (descriptions, future flags) had no canonical home Restore the persistent ontology graph (``<data>__ontology``) as the single source of truth, with additive-only semantics: - OntologyStore.register() validates incoming schema against the persisted ontology. Re-typing an existing property raises OntologyContradictionError before any partial state is persisted; additions (new labels, properties, patterns) go through unchanged. - OntologyStore.load() / clear() round out the lifecycle. - GraphRAG._ensure_ontology_initialized() lazy first-touch loads the persisted ontology and registers self.schema. Called from ingest (so contradictions surface before expensive extraction) and from get_ontology / retrieval paths. - delete_all() now drops both data and ontology graphs and resets the initialised flag so the next ingest re-registers self.schema cleanly. - get_ontology() always reads from the ontology graph; refresh_ontology() is a thin re-load for cross-process freshness. save_ontology(path) remains as a JSON-file bridge. This reverts the architecture of 54e4e05 (drop ontology graph, infer from data). The "save a graph" simplicity wasn't worth losing validation, multi-process safety, type fidelity, and a queryable schema artifact. Tests: 825 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y one Before, a user passing no schema got an empty ontology graph even though extraction was using DEFAULT_ENTITY_TYPES under the hood. The anchor said "we know nothing" while the LLM was producing Person/Organization/... nodes. Now ``_ensure_ontology_initialized()`` has three branches: - user passed a schema → register it (validate + persist) - ontology graph already populated (prior session) → use as-is - both empty → register DEFAULT_ENTITY_TYPES so the ontology graph faithfully reflects what the extractor will produce After this, ``get_ontology()``, the Cypher-generation prompt, and the extractor all read from the same source. Test fixture: ``mock_conn`` in test_facade.py now stubs the connection's ``_driver.select_graph()`` chain so OntologyStore can open the ontology graph handle against the mock. 825 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wires declared
PropertyType/RelationType.propertiesfrom theGraphSchemathrough extraction, storage, and retrieval. The ontology persists in a dedicated<data_graph>__ontologyFalkorDB graph and accumulates as the union of every schema ever registered, so ingest passes can use subset / extension schemas while retrieval always sees the global ontology.Forward-only evolution: nodes ingested before a property was declared remain without that property; FalkorDB's
MERGE … SET n += propshandles fill-in naturally andWHERE p.attr > Nnaturally excludes them — no backfill, no merge-strategy machinery.What changed
core/models—PropertyTypetype-normalising validator (rejects unknown types); newRelationType.properties;ExtractedEntity/Relation.attributesdict; reserved-name collisions rejected at config time;GraphSchema.merge()for ontology union.storage/ontology_store(new) — persistent ontology in<data_graph>__ontologygraph. Idempotentregister()(union-merge),load(),clear(). Pattern lists deduped on union.ingestion/graph_extraction—VERIFY_EXTRACT_RELS_PROMPTgains a conditional## Attribute extractionblock (empty for property-less schemas → zero token drift). Per-type coercion at the extractor boundary (INTEGER,FLOAT,BOOLEAN,DATE,LIST,STRING). Records missingrequired=Trueproperties are dropped with aggregated warnings. Aggregators carry attributes through dedup with last-write-wins. Conversion merges attributes into graph props.ingestion/pipeline— new_validate_attributespass after_prune, then re-runs_filter_qualityto cascade-clean dangling relationships. SkipsUnknown-labelled nodes.retrieval/cypher_generation— replaces hardcodedSCHEMA_PROMPTwithbuild_schema_prompt(schema, question)+render_schema_block(schema). Synthesises one numeric-attribute example per declaredINTEGER/FLOATproperty so the LLM learns the column shape.validate_cypheraccepts aschemafor dynamic label validation; falls back to the historic hardcoded set when omitted.retrieval/multi_path— schema threaded intoexecute_cypher_retrieval.api/main—GraphRAGconstructs anOntologyStore, registers the local schema on each ingest, andrefresh_ontology()propagates the global ontology to the retrieval strategy. Publicget_ontology()/refresh_ontology()for inspection.Algorithm
Ingest (per run, with
local_schema):ontology_store.register(local_schema)→ union into persisted ontology.attributesvia the augmented prompt; deterministic strategies (GLiNER-only) leave them empty.required=Truefailure.MERGE … SET n += properties— new keys land, existing keys overwrite, untouched keys stay.Retrieval (per question):
<data_graph>__ontology.WHERE p.age > 30naturally excludes pre-agenodes.Backwards compatibility
Schemas without declared
propertiesproduce identical output to the prior pipeline:"", so token-for-token the prompt is unchanged._validate_attributesshort-circuits on empty schemas.render_schema_block(GraphSchema())falls back to the historic hardcoded label set.Test plan
RUN_INTEGRATION=1).test_ontology_store,test_attribute_prompt,test_models,test_graph_extraction,test_pipeline,test_cypher_generation.agedeclared; ingest pass 2 withagedeclared; query "Who is older than 50?" viakg.chat_session(). Verify generated Cypher usesp.age; verify pre-pass-2 Persons aren't in the result; verifykg.get_ontology()returns the union.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests