Skip to content

feat(schema): schema-driven attributes end-to-end with persistent ontology#256

Open
galshubeli wants to merge 8 commits into
mainfrom
feat/schema-driven-attributes
Open

feat(schema): schema-driven attributes end-to-end with persistent ontology#256
galshubeli wants to merge 8 commits into
mainfrom
feat/schema-driven-attributes

Conversation

@galshubeli
Copy link
Copy Markdown
Collaborator

@galshubeli galshubeli commented May 17, 2026

Summary

Wires declared PropertyType / RelationType.properties from the GraphSchema through extraction, storage, and retrieval. The ontology persists in a dedicated <data_graph>__ontology FalkorDB graph and accumulates as the union of every schema ever registered, so ingest passes can use subset / extension schemas while retrieval always sees the global ontology.

Forward-only evolution: nodes ingested before a property was declared remain without that property; FalkorDB's MERGE … SET n += props handles fill-in naturally and WHERE p.attr > N naturally excludes them — no backfill, no merge-strategy machinery.

What changed

  • core/modelsPropertyType type-normalising validator (rejects unknown types); new RelationType.properties; ExtractedEntity/Relation.attributes dict; reserved-name collisions rejected at config time; GraphSchema.merge() for ontology union.
  • storage/ontology_store (new) — persistent ontology in <data_graph>__ontology graph. Idempotent register() (union-merge), load(), clear(). Pattern lists deduped on union.
  • ingestion/graph_extractionVERIFY_EXTRACT_RELS_PROMPT gains a conditional ## Attribute extraction block (empty for property-less schemas → zero token drift). Per-type coercion at the extractor boundary (INTEGER, FLOAT, BOOLEAN, DATE, LIST, STRING). Records missing required=True properties are dropped with aggregated warnings. Aggregators carry attributes through dedup with last-write-wins. Conversion merges attributes into graph props.
  • ingestion/pipeline — new _validate_attributes pass after _prune, then re-runs _filter_quality to cascade-clean dangling relationships. Skips Unknown-labelled nodes.
  • retrieval/cypher_generation — replaces hardcoded SCHEMA_PROMPT with build_schema_prompt(schema, question) + render_schema_block(schema). Synthesises one numeric-attribute example per declared INTEGER/FLOAT property so the LLM learns the column shape. validate_cypher accepts a schema for dynamic label validation; falls back to the historic hardcoded set when omitted.
  • retrieval/multi_path — schema threaded into execute_cypher_retrieval.
  • api/mainGraphRAG constructs an OntologyStore, registers the local schema on each ingest, and refresh_ontology() propagates the global ontology to the retrieval strategy. Public get_ontology() / refresh_ontology() for inspection.

Algorithm

Ingest (per run, with local_schema):

  1. ontology_store.register(local_schema) → union into persisted ontology.
  2. Existing extraction runs; LLM-driven strategies fill attributes via the augmented prompt; deterministic strategies (GLiNER-only) leave them empty.
  3. Type-coerce at the extractor boundary; drop records on required=True failure.
  4. Aggregate (existing dedup + per-key LWW for attributes).
  5. Validate against schema; cascade clean dangling rels.
  6. Write via existing MERGE … SET n += properties — new keys land, existing keys overwrite, untouched keys stay.

Retrieval (per question):

  1. Load global ontology from <data_graph>__ontology.
  2. Render schema block into the Cypher generation prompt.
  3. LLM authors Cypher referencing any declared attribute. WHERE p.age > 30 naturally excludes pre-age nodes.

Backwards compatibility

Schemas without declared properties produce identical output to the prior pipeline:

  • The augmented prompt section renders to "", so token-for-token the prompt is unchanged.
  • _validate_attributes short-circuits on empty schemas.
  • render_schema_block(GraphSchema()) falls back to the historic hardcoded label set.

Test plan

  • Full suite: 819 passed, 23 skipped (integration tests gated by RUN_INTEGRATION=1).
  • 124 new/extended unit tests across test_ontology_store, test_attribute_prompt, test_models, test_graph_extraction, test_pipeline, test_cypher_generation.
  • Smoke test against live FalkorDB: ingest pass 1 without age declared; ingest pass 2 with age declared; query "Who is older than 50?" via kg.chat_session(). Verify generated Cypher uses p.age; verify pre-pass-2 Persons aren't in the result; verify kg.get_ontology() returns the union.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Async ontology management: get, refresh, and save a global schema; ingestion now best-effort refreshes ontology after single-document ingest
    • Schema-driven model improvements: normalized property types, reserved-name checks, required-property enforcement during ingestion and retrieval
    • Schema-aware extraction & retrieval: attribute extraction with type coercion/merging and schema-aware Cypher generation/execution; ontology discovery exported from storage
  • Tests

    • Added comprehensive tests for attribute prompts/coercion, schema models/validation, cypher generation, ontology inference, and ingestion attribute validation

Review Change Stack

…ology

Wires declared PropertyType / RelationType.properties from the GraphSchema
through extraction, storage, and retrieval. The ontology persists in a
dedicated <data_graph>__ontology FalkorDB graph and accumulates as the
union of every schema ever registered, so ingest passes can use subset/
extension schemas while retrieval always sees the global ontology.

- core/models: PropertyType type-normalising validator, RelationType.properties,
  ExtractedEntity/Relation.attributes, reserved-name rejection, GraphSchema.merge.
- storage/ontology_store (new): OntologyStore.load/register/clear against
  <data_graph>__ontology; idempotent unions; pattern lists deduped.
- ingestion/graph_extraction: conditional attribute block in
  VERIFY_EXTRACT_RELS_PROMPT (empty for property-less schemas, zero drift),
  per-type coercion at the extractor boundary, required-missing record drop,
  aggregators carry attributes (last-write-wins), conversion merges into props.
- ingestion/pipeline: _validate_attributes pass after _prune, then re-run
  _filter_quality to cascade-clean dangling relationships.
- retrieval/cypher_generation: replaces hardcoded SCHEMA_PROMPT with
  build_schema_prompt(schema, question) + render_schema_block; synthesises
  one numeric-attribute example per declared INTEGER/FLOAT property;
  validate_cypher accepts dynamic schema labels.
- retrieval/multi_path: schema threaded into execute_cypher_retrieval.
- api/main: GraphRAG owns the OntologyStore, registers local schema on each
  ingest, refresh_ontology() propagates the global ontology to retrieval,
  public get_ontology() / refresh_ontology() methods.

Forward-only evolution: nodes ingested before a property was declared
remain without that property; FalkorDB's MERGE ... SET n += props handles
fill-in naturally and WHERE p.attr > N naturally excludes them.

Tests: 124 new/extended unit tests; full suite 819 passed, 23 skipped
(integration tests gated by RUN_INTEGRATION=1). Property-less schemas
produce identical output to the prior pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds schema property typing and reserved names, live ontology inference (OntologyStore), attribute-aware extraction and coercion, pipeline attribute validation, schema-aware Cypher generation, retrieval wiring, facade methods to inspect/refresh/save the global schema, and tests for all layers.

Changes

Persistent Schema Ontology with Attribute Support

Layer / File(s) Summary
Schema validation and attribute data models
graphrag_sdk/src/graphrag_sdk/core/models.py, graphrag_sdk/tests/test_models.py
Adds supported property types and reserved names, normalizes/validates PropertyType.type, adds RelationType.properties, GraphSchema validation, and attributes fields on ExtractedEntity/ExtractedRelation. Tests cover type normalization, reserved-name rejection, JSON round-trip, and merge semantics.
Live ontology inference (OntologyStore)
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py, graphrag_sdk/src/graphrag_sdk/storage/__init__.py, graphrag_sdk/tests/test_ontology_store.py
New OntologyStore.infer() inspects the FalkorDB graph (labels, relationshipTypes, sampled property keys/types, RELATES subtypes, endpoint patterns), maps DB types to SDK PropertyType, and returns a GraphSchema. Tests mock driver calls and validate inference, filtering, and failure fallbacks.
Extraction strategy with attribute coercion
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py, graphrag_sdk/tests/test_attribute_prompt.py, graphrag_sdk/tests/test_graph_extraction.py
Two-step extraction renders attribute schema in prompts, parses/coerces LLM attributes into declared types, drops records missing required attributes, aggregates attributes (last-write-wins), and merges them into node/relationship properties. Tests cover rendering, coercion, dropping, aggregation, and prompt integration.
Pipeline attribute validation
graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py, graphrag_sdk/tests/test_pipeline.py
Pipeline runs _validate_attributes() after pruning to drop undeclared keys (except SDK-reserved names) and remove nodes/relations missing required schema-declared attributes; tests verify no-op on empty schema, attribute filtering, and required-field enforcement.
Schema-aware Cypher generation and validation
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py, graphrag_sdk/tests/test_cypher_generation.py
Adds build_schema_prompt(schema, question), optional schema parameters to generate_cypher, validate_cypher, and execute_cypher_retrieval; renders schema blocks and attribute examples in prompts and validates labels against the schema. Tests cover rendering fallback, label validation, and prompt contents.
Multi-path retrieval schema wiring
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py, graphrag_sdk/src/graphrag_sdk/storage/__init__.py
MultiPathRetrieval accepts an optional schema and forwards it to the Cypher retrieval branch so schema-aware Cypher is used when enabled; storage re-exports OntologyStore.
GraphRAG facade ontology integration
graphrag_sdk/src/graphrag_sdk/api/main.py
Facade creates OntologyStore, maintains _global_schema seeded from local schema, wires it into retrieval, exposes get_ontology(), refresh_ontology(), save_ontology(), and best-effort refreshes ontology after single-document ingest.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • FalkorDB/GraphRAG-SDK#253: Modifies Cypher generation/retrieval interfaces and prompts; may overlap with schema-aware Cypher changes here.

Suggested reviewers

  • Naseem77

Poem

🐰 I hopped through types, through schemas bright,

I trimmed and uppercased them through the night,
I inferred the graph and coaxed attributes true,
Merged, coerced, and pipelined — then I nibbled new dew,
Now graphs and rabbits both feel light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding schema-driven attributes end-to-end with persistent ontology management.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/schema-driven-attributes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread graphrag_sdk/tests/test_ontology_store.py Fixed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
graphrag_sdk/tests/test_ontology_store.py (1)

116-166: ⚡ Quick win

Add a regression test for same-named properties across different owners.

Please add a test that registers two types (e.g., entity Person and relation WORKS_AT) both with property "since" and asserts independent metadata persistence. This will catch owner-scoping regressions in ontology property upserts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_ontology_store.py` around lines 116 - 166, Add a
regression test in tests/test_ontology_store.py that registers two types sharing
the same property name to ensure owner-scoped property upserts: create a
GraphSchema with an EntityType(label="Person",
properties=[PropertyType(name="since", type="DATE", required=True)]) and a
RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has
properties=[PropertyType(name="since", type="INTEGER", required=False)]; call
await store.register(schema) and then inspect fake_graph.calls to assert there
are two separate MERGE/SET operations for the property (look for "MERGE
(o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for
each include the correct owner (e.g., owner label or the relation/entity
context) and the distinct type/required values to prove they were persisted
independently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 196-203: delete_all() currently clears only the data graph leaving
the persisted ontology in OntologyStore (_ontology_store) and a stale
_global_schema; update delete_all() to also clear the ontology store and reset
in-memory schema: call the OntologyStore.clear()/delete_all() API (or implement
one) on self._ontology_store, then set self._global_schema back to a clean
default (e.g., self.schema or an empty GraphSchema) and ensure
refresh_ontology()/load paths will repopulate from the cleared store; reference
OntologyStore, delete_all(), _ontology_store, _global_schema, and
refresh_ontology() when making the change.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 314-315: The merge() implementation currently uses
"incoming.description or current.description" which treats empty strings as
falsey and prevents an explicit empty description from overwriting an existing
one; change those checks in merge() to use an explicit None check (e.g., use
incoming.description if incoming.description is not None else
current.description) so last-write-wins applies even when incoming.description
== "". Apply the same fix to the other occurrence that uses the same pattern
(the second use around the properties merge).

In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py`:
- Around line 261-265: SCHEMA_PROMPT currently points to _SCHEMA_PROMPT_TEMPLATE
which requires schema_block and attribute_examples and thus breaks legacy
callers using SCHEMA_PROMPT.format(question=...); replace the simple alias with
a compatibility shim object (or small helper) named SCHEMA_PROMPT that
implements a format(**kwargs) method which: if kwargs contains only "question"
(legacy call), delegates to build_schema_prompt(question=kwargs["question"]) and
returns that built string, otherwise delegates to
_SCHEMA_PROMPT_TEMPLATE.format(**kwargs); ensure the shim also behaves like a
string for non-format uses (e.g., __str__ or __repr__ returning the underlying
template) so callers of SCHEMA_PROMPT continue to work.

In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 177-181: The MERGE for OntologyProperty is using only {name:
$name}, causing properties with the same name across different owners to
collide; update the upsert so the property node is keyed by both name and its
owner context (e.g. include owner or owner_label in the MERGE pattern that
creates p) instead of just name, and adjust any related uniqueness constraints
or queries that expect a single-keyed property (refer to the MERGE creating
p:OntologyProperty, the variable p, and owner_label in this block).

---

Nitpick comments:
In `@graphrag_sdk/tests/test_ontology_store.py`:
- Around line 116-166: Add a regression test in tests/test_ontology_store.py
that registers two types sharing the same property name to ensure owner-scoped
property upserts: create a GraphSchema with an EntityType(label="Person",
properties=[PropertyType(name="since", type="DATE", required=True)]) and a
RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has
properties=[PropertyType(name="since", type="INTEGER", required=False)]; call
await store.register(schema) and then inspect fake_graph.calls to assert there
are two separate MERGE/SET operations for the property (look for "MERGE
(o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for
each include the correct owner (e.g., owner label or the relation/entity
context) and the distinct type/required values to prove they were persisted
independently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9aab5a64-1ab3-4985-8285-6595928e4a23

📥 Commits

Reviewing files that changed from the base of the PR and between a90ad18 and 576bafa.

📒 Files selected for processing (14)
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/core/models.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
  • graphrag_sdk/src/graphrag_sdk/storage/__init__.py
  • graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
  • graphrag_sdk/tests/test_attribute_prompt.py
  • graphrag_sdk/tests/test_cypher_generation.py
  • graphrag_sdk/tests/test_graph_extraction.py
  • graphrag_sdk/tests/test_models.py
  • graphrag_sdk/tests/test_ontology_store.py
  • graphrag_sdk/tests/test_pipeline.py

Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/core/models.py
Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated
The data graph already contains labels, relationship types, properties,
and endpoint patterns — querying it is the source of truth, not a parallel
persisted graph. Schema-as-config is supported via plain JSON files.

- storage/ontology_store: rewritten as inference-only. OntologyStore.infer()
  calls db.labels() / db.relationshipTypes(), samples keys()+typeof() per
  label, derives RELATES sub-types from rel_type values, and pulls endpoint
  patterns. Structural labels/edges and reserved property keys are filtered
  out so they never leak into the LLM-facing schema.
- core/models: GraphSchema.from_file / save_to_file convenience methods for
  the schema-as-config workflow (versioned JSON, hand-editable, shareable).
- api/main: GraphRAG.get_ontology() returns inferred ∪ self.schema so user-
  declared descriptions and required flags survive (and properties declared
  but not yet extracted still appear in the retrieval prompt). New
  save_ontology(path) writes the global ontology to a JSON file. Ingest no
  longer writes to a separate ontology graph.

The previous PR introduced a persistent <data_graph>__ontology FalkorDB
graph; that was solving "declared property with zero instances" — a case
that is essentially invisible to retrieval (empty results either way). The
runtime cost was a second graph to back up, clean up, and reason about.
Inference + an opt-in JSON file covers the same ground in less code.

Tests: 830 passed, 23 skipped. test_ontology_store rewritten around
mocked-driver introspection. test_models gains a JSON round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
graphrag_sdk/src/graphrag_sdk/api/main.py (1)

596-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh or invalidate _global_schema after every graph mutation, not only ingest.

This best-effort refresh only runs on _ingest_single(). update(), delete_document(), and delete_all() also change the graph, but they leave _global_schema and the retrieval strategy's cached schema untouched, so subsequent retrieval can use stale labels/properties until the caller manually invokes refresh_ontology().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py` around lines 596 - 601, The patch
currently refreshes the global ontology only in _ingest_single(); update(),
delete_document(), and delete_all() also mutate the graph but don't refresh or
invalidate the cached schema, causing stale retrievals; modify those mutation
paths (the update(), delete_document(), and delete_all() methods) to either call
await self.refresh_ontology() after a successful mutation or explicitly
clear/invalidate self._global_schema and notify the retrieval strategy to clear
its cached schema (use the same mechanism the ingest path uses to update
retrieval_strategy's cache), so that retrievals see new/removed properties
immediately.
graphrag_sdk/src/graphrag_sdk/core/models.py (1)

330-333: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve explicit empty descriptions in merge().

These branches still use incoming.description or current.description, so merge() cannot intentionally clear a description with "" even though the method documents last-write-wins behavior.

Suggested fix
                 ent_by_label[e.label] = EntityType(
                     label=cur.label,
-                    description=e.description or cur.description,
+                    description=e.description if e.description is not None else cur.description,
                     properties=_merge_props(cur.properties, e.properties),
                 )
@@
                 rel_by_label[r.label] = RelationType(
                     label=cur.label,
-                    description=r.description or cur.description,
+                    description=r.description if r.description is not None else cur.description,
                     patterns=merged_patterns,
                     properties=_merge_props(cur.properties, r.properties),
                 )

Also applies to: 348-350

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py` around lines 330 - 333, In
merge(), the code uses falsy checks (incoming.description or
current.description) which prevents intentionally clearing a description with an
empty string; update the description selection to check for None explicitly
(e.g., use incoming.description if incoming.description is not None else
current.description) in the branches that construct EntityType (references:
ent_by_label, EntityType construction, variables e/cur or incoming/current) and
the analogous branch around lines 348-350 so last-write-wins semantics accept ""
as a valid value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 111-119: The infer()/get_ontology() path only collects relation
types when "RELATES" exists, dropping direct user edge labels like
WORKS_AT/LOCATED_IN; update the logic around rel_types and relations (the
relations: list[RelationType] block and use of self._infer_relates_subtypes) to
also include any non-structural rel_types besides RELATES by converting each
such label into a RelationType (including property keys and endpoint patterns)
or by calling a new helper to infer their properties using the same sample_size;
keep excluding structural edges (PART_OF, NEXT_CHUNK, MENTIONED_IN) and preserve
the existing RELATES handling via self._infer_relates_subtypes(sample_size).

---

Duplicate comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 596-601: The patch currently refreshes the global ontology only in
_ingest_single(); update(), delete_document(), and delete_all() also mutate the
graph but don't refresh or invalidate the cached schema, causing stale
retrievals; modify those mutation paths (the update(), delete_document(), and
delete_all() methods) to either call await self.refresh_ontology() after a
successful mutation or explicitly clear/invalidate self._global_schema and
notify the retrieval strategy to clear its cached schema (use the same mechanism
the ingest path uses to update retrieval_strategy's cache), so that retrievals
see new/removed properties immediately.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 330-333: In merge(), the code uses falsy checks
(incoming.description or current.description) which prevents intentionally
clearing a description with an empty string; update the description selection to
check for None explicitly (e.g., use incoming.description if
incoming.description is not None else current.description) in the branches that
construct EntityType (references: ent_by_label, EntityType construction,
variables e/cur or incoming/current) and the analogous branch around lines
348-350 so last-write-wins semantics accept "" as a valid value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee8ce5db-b86d-454b-94a9-d1ef91a6f6f8

📥 Commits

Reviewing files that changed from the base of the PR and between 576bafa and 54e4e05.

📒 Files selected for processing (5)
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/core/models.py
  • graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
  • graphrag_sdk/tests/test_models.py
  • graphrag_sdk/tests/test_ontology_store.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • graphrag_sdk/tests/test_models.py

Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated
- graph_extraction.py: regroup imports (datetime alongside stdlib); remove
  unused EntityType import; wrap two over-length attribute-fetch lines.
- api/main.py: split a long get_ontology return expression.

ruff check src/graphrag_sdk/ passes locally. Tests: 830 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's "ruff format --check src/" flagged 5 files. Pure formatting; no
behavioral change. Tests: 830 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 140-146: The query that builds subtypes_result currently applies
LIMIT before DISTINCT, causing rare r.rel_type values to be dropped; update the
Cypher used in the get_ontology (or wherever subtypes_result is produced) so
DISTINCT r.rel_type is applied prior to sampling/limiting: select DISTINCT
r.rel_type AS rel_type (filter out NULLs), then apply the LIMIT $limit to that
deduplicated set before returning rel_type via self._conn.query.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61f452a0-7e8d-4c3d-aa3e-7be0ec4c8ff3

📥 Commits

Reviewing files that changed from the base of the PR and between 0597de9 and adc223f.

📒 Files selected for processing (5)
  • graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
  • graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
✅ Files skipped from review due to trivial changes (1)
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
  • graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py

Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements schema-driven attributes across extraction, ingestion validation, and schema-aware Cypher generation, plus an ontology mechanism to supply a global schema to retrieval.

Changes:

  • Adds attribute coercion/extraction support driven by GraphSchema property declarations, and validates/drops records missing required attributes during ingestion.
  • Introduces OntologyStore and threads schema/ontology into retrieval so Cypher prompt/validation can be schema-aware.
  • Extends unit test coverage across models, extraction, ingestion pipeline, ontology inference, and Cypher generation.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py Builds schema-aware prompt blocks/examples and validates labels against an optional GraphSchema.
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py New ontology inference helper that introspects the live graph for labels/relationship subtypes/properties.
graphrag_sdk/src/graphrag_sdk/core/models.py Adds schema property typing/validation, reserved-name checks, schema merge, and extracted attributes fields.
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py Adds attribute prompt block + per-type coercion and carries attributes through aggregation/storage conversion.
graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py Adds _validate_attributes pass after pruning and re-runs quality filtering.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py Threads schema into Cypher retrieval execution when enabled.
graphrag_sdk/src/graphrag_sdk/api/main.py Instantiates OntologyStore, adds get/refresh/save ontology APIs, and propagates schema to retrieval strategy.
graphrag_sdk/src/graphrag_sdk/storage/init.py Exports OntologyStore.
graphrag_sdk/tests/test_attribute_prompt.py New tests for attribute prompt rendering and coercion helpers.
graphrag_sdk/tests/test_cypher_generation.py Adds tests for schema block rendering, prompt building, and schema-aware Cypher validation.
graphrag_sdk/tests/test_graph_extraction.py Adds tests for attribute coercion, required-attribute dropping, aggregation, and property merge behavior.
graphrag_sdk/tests/test_models.py Updates/extends schema model tests (type normalization, reserved keys, merge).
graphrag_sdk/tests/test_ontology_store.py New tests for ontology inference logic with a mocked FalkorDB driver.
graphrag_sdk/tests/test_pipeline.py Adds tests for ingestion attribute validation behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 229 to 236
Question: "What organizations are related to the technology?"
```cypher
MATCH (o:Organization)-[r:RELATES]-(t:Technology)
RETURN o.name AS organization, t.name AS technology, r.rel_type AS relation, r.fact AS evidence
LIMIT 20
```
```{attribute_examples}

## Your task
Comment on lines +1 to +6
"""Ontology inference from the live data graph.

The schema is **derived** from what's in the data graph, not maintained in a
separate persistent graph. This keeps the architecture honest: the source of
truth for "what entities and relations exist" is the data itself.

Comment on lines +312 to +314
property name appears in both, the incoming type/description/required
overrides — last-write-wins, matching the persisted ontology's
register() semantics.
Comment on lines 198 to 215
# Ontology is inferred from the live data graph (no separate
# persistent graph). User-supplied ``schema`` is merged on top at
# retrieval time so declared descriptions / required flags / not-yet-
# extracted properties survive.
self._ontology_store = OntologyStore(self._conn)
# Global ontology used at retrieval time. Initially just the local
# schema; refresh_ontology() merges in the inferred view on demand
# and after each ingest.
self._global_schema: GraphSchema = self.schema

# Default retrieval strategy
self._retrieval_strategy = retrieval_strategy or MultiPathRetrieval(
graph_store=self._graph_store,
vector_store=self._vector_store,
embedder=self.embedder,
llm=self.llm,
schema=self._global_schema,
)
`required=True` was a footgun for an LLM-extraction pipeline: when the LLM
couldn't find a value in the text (often because the text didn't state it),
the entire entity was silently dropped. That destroyed real data.

New behavior: every declared property name appears in
`ExtractedEntity.attributes` with either the coerced value or `None`. The
storage layer's existing `_clean_properties` strips `None` before writing
so the graph sees "key missing" — the right null semantics for retrieval
(`WHERE p.age > N` naturally excludes nodes without `age`).

- core/models: remove `required` field from PropertyType.
- ingestion/graph_extraction: `_coerce_attributes` now returns just a dict
  with `None` for missing/uncoercible values; never drops records. The
  drop-on-missing-required branches and aggregated warnings are gone.
- ingestion/pipeline: `_validate_attributes` strips undeclared keys but
  never drops records. Cascade `_filter_quality` still runs but no longer
  has anything to cascade in the attribute-validation case.
- storage/ontology_store: docstring no longer mentions required flags.
- tests: collapsed the three "required missing" cases into a single
  "every declared property appears in result; uncoercible -> None" test
  per file. 829 passed, 23 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +130 to +134
"For each verified entity and each extracted relationship, also extract any "
"of the declared properties below. Use null when the property is not stated "
"in the text. Coerce to the declared type (INTEGER, FLOAT, BOOLEAN, DATE in "
"ISO 8601, STRING, LIST). Place extracted values inside an `attributes` "
"object on each entity / relationship.",
galshubeli and others added 2 commits May 19, 2026 15:58
The data graph alone wasn't enough to anchor the schema:
- types are lossy (FalkorDB `typeof()` collapses DATE → "string")
- multiple processes/instances couldn't share a declarative view
- nothing caught schema-typo contradictions across sessions
- declarative metadata (descriptions, future flags) had no canonical home

Restore the persistent ontology graph (``<data>__ontology``) as the single
source of truth, with additive-only semantics:

- OntologyStore.register() validates incoming schema against the persisted
  ontology. Re-typing an existing property raises OntologyContradictionError
  before any partial state is persisted; additions (new labels, properties,
  patterns) go through unchanged.
- OntologyStore.load() / clear() round out the lifecycle.
- GraphRAG._ensure_ontology_initialized() lazy first-touch loads the
  persisted ontology and registers self.schema. Called from ingest (so
  contradictions surface before expensive extraction) and from get_ontology /
  retrieval paths.
- delete_all() now drops both data and ontology graphs and resets the
  initialised flag so the next ingest re-registers self.schema cleanly.
- get_ontology() always reads from the ontology graph; refresh_ontology()
  is a thin re-load for cross-process freshness. save_ontology(path)
  remains as a JSON-file bridge.

This reverts the architecture of 54e4e05 (drop ontology graph, infer from
data). The "save a graph" simplicity wasn't worth losing validation,
multi-process safety, type fidelity, and a queryable schema artifact.

Tests: 825 passed, 23 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y one

Before, a user passing no schema got an empty ontology graph even though
extraction was using DEFAULT_ENTITY_TYPES under the hood. The anchor said
"we know nothing" while the LLM was producing Person/Organization/... nodes.

Now ``_ensure_ontology_initialized()`` has three branches:

- user passed a schema → register it (validate + persist)
- ontology graph already populated (prior session) → use as-is
- both empty → register DEFAULT_ENTITY_TYPES so the ontology graph
  faithfully reflects what the extractor will produce

After this, ``get_ontology()``, the Cypher-generation prompt, and the
extractor all read from the same source.

Test fixture: ``mock_conn`` in test_facade.py now stubs the connection's
``_driver.select_graph()`` chain so OntologyStore can open the ontology
graph handle against the mock. 825 passed, 23 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema-driven attributes: typed properties through extraction → storage → retrieval

2 participants