feat(schema): schema-driven attributes end-to-end with persistent ontology by galshubeli · Pull Request #256 · FalkorDB/GraphRAG-SDK

galshubeli · 2026-05-17T13:49:35Z

Summary

Wires declared PropertyType / RelationType.properties from the GraphSchema through extraction, storage, and retrieval. The ontology persists in a dedicated <data_graph>__ontology FalkorDB graph and accumulates as the union of every schema ever registered, so ingest passes can use subset / extension schemas while retrieval always sees the global ontology.

Forward-only evolution: nodes ingested before a property was declared remain without that property; FalkorDB's MERGE … SET n += props handles fill-in naturally and WHERE p.attr > N naturally excludes them — no backfill, no merge-strategy machinery.

What changed

core/models — PropertyType type-normalising validator (rejects unknown types); new RelationType.properties; ExtractedEntity/Relation.attributes dict; reserved-name collisions rejected at config time; GraphSchema.merge() for ontology union.
storage/ontology_store (new) — persistent ontology in <data_graph>__ontology graph. Idempotent register() (union-merge), load(), clear(). Pattern lists deduped on union.
ingestion/graph_extraction — VERIFY_EXTRACT_RELS_PROMPT gains a conditional ## Attribute extraction block (empty for property-less schemas → zero token drift). Per-type coercion at the extractor boundary (INTEGER, FLOAT, BOOLEAN, DATE, LIST, STRING). Records missing required=True properties are dropped with aggregated warnings. Aggregators carry attributes through dedup with last-write-wins. Conversion merges attributes into graph props.
ingestion/pipeline — new _validate_attributes pass after _prune, then re-runs _filter_quality to cascade-clean dangling relationships. Skips Unknown-labelled nodes.
retrieval/cypher_generation — replaces hardcoded SCHEMA_PROMPT with build_schema_prompt(schema, question) + render_schema_block(schema). Synthesises one numeric-attribute example per declared INTEGER/FLOAT property so the LLM learns the column shape. validate_cypher accepts a schema for dynamic label validation; falls back to the historic hardcoded set when omitted.
retrieval/multi_path — schema threaded into execute_cypher_retrieval.
api/main — GraphRAG constructs an OntologyStore, registers the local schema on each ingest, and refresh_ontology() propagates the global ontology to the retrieval strategy. Public get_ontology() / refresh_ontology() for inspection.

Algorithm

Ingest (per run, with local_schema):

ontology_store.register(local_schema) → union into persisted ontology.
Existing extraction runs; LLM-driven strategies fill attributes via the augmented prompt; deterministic strategies (GLiNER-only) leave them empty.
Type-coerce at the extractor boundary; drop records on required=True failure.
Aggregate (existing dedup + per-key LWW for attributes).
Validate against schema; cascade clean dangling rels.
Write via existing MERGE … SET n += properties — new keys land, existing keys overwrite, untouched keys stay.

Retrieval (per question):

Load global ontology from <data_graph>__ontology.
Render schema block into the Cypher generation prompt.
LLM authors Cypher referencing any declared attribute. WHERE p.age > 30 naturally excludes pre-age nodes.

Backwards compatibility

Schemas without declared properties produce identical output to the prior pipeline:

The augmented prompt section renders to "", so token-for-token the prompt is unchanged.
_validate_attributes short-circuits on empty schemas.
render_schema_block(GraphSchema()) falls back to the historic hardcoded label set.

Test plan

Full suite: 819 passed, 23 skipped (integration tests gated by RUN_INTEGRATION=1).
124 new/extended unit tests across test_ontology_store, test_attribute_prompt, test_models, test_graph_extraction, test_pipeline, test_cypher_generation.
Smoke test against live FalkorDB: ingest pass 1 without age declared; ingest pass 2 with age declared; query "Who is older than 50?" via kg.chat_session(). Verify generated Cypher uses p.age; verify pre-pass-2 Persons aren't in the result; verify kg.get_ontology() returns the union.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Async ontology management: get, refresh, and save a global schema; ingestion now best-effort refreshes ontology after single-document ingest
- Schema-driven model improvements: normalized property types, reserved-name checks, required-property enforcement during ingestion and retrieval
- Schema-aware extraction & retrieval: attribute extraction with type coercion/merging and schema-aware Cypher generation/execution; ontology discovery exported from storage
Tests
- Added comprehensive tests for attribute prompts/coercion, schema models/validation, cypher generation, ontology inference, and ingestion attribute validation

…ology Wires declared PropertyType / RelationType.properties from the GraphSchema through extraction, storage, and retrieval. The ontology persists in a dedicated <data_graph>__ontology FalkorDB graph and accumulates as the union of every schema ever registered, so ingest passes can use subset/ extension schemas while retrieval always sees the global ontology. - core/models: PropertyType type-normalising validator, RelationType.properties, ExtractedEntity/Relation.attributes, reserved-name rejection, GraphSchema.merge. - storage/ontology_store (new): OntologyStore.load/register/clear against <data_graph>__ontology; idempotent unions; pattern lists deduped. - ingestion/graph_extraction: conditional attribute block in VERIFY_EXTRACT_RELS_PROMPT (empty for property-less schemas, zero drift), per-type coercion at the extractor boundary, required-missing record drop, aggregators carry attributes (last-write-wins), conversion merges into props. - ingestion/pipeline: _validate_attributes pass after _prune, then re-run _filter_quality to cascade-clean dangling relationships. - retrieval/cypher_generation: replaces hardcoded SCHEMA_PROMPT with build_schema_prompt(schema, question) + render_schema_block; synthesises one numeric-attribute example per declared INTEGER/FLOAT property; validate_cypher accepts dynamic schema labels. - retrieval/multi_path: schema threaded into execute_cypher_retrieval. - api/main: GraphRAG owns the OntologyStore, registers local schema on each ingest, refresh_ontology() propagates the global ontology to retrieval, public get_ontology() / refresh_ontology() methods. Forward-only evolution: nodes ingested before a property was declared remain without that property; FalkorDB's MERGE ... SET n += props handles fill-in naturally and WHERE p.attr > N naturally excludes them. Tests: 124 new/extended unit tests; full suite 819 passed, 23 skipped (integration tests gated by RUN_INTEGRATION=1). Property-less schemas produce identical output to the prior pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-17T13:49:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds schema property typing and reserved names, live ontology inference (OntologyStore), attribute-aware extraction and coercion, pipeline attribute validation, schema-aware Cypher generation, retrieval wiring, facade methods to inspect/refresh/save the global schema, and tests for all layers.

Changes

Persistent Schema Ontology with Attribute Support

Layer / File(s)	Summary
Schema validation and attribute data models `graphrag_sdk/src/graphrag_sdk/core/models.py`, `graphrag_sdk/tests/test_models.py`	Adds supported property types and reserved names, normalizes/validates `PropertyType.type`, adds `RelationType.properties`, `GraphSchema` validation, and `attributes` fields on `ExtractedEntity`/`ExtractedRelation`. Tests cover type normalization, reserved-name rejection, JSON round-trip, and merge semantics.
Live ontology inference (OntologyStore) `graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`, `graphrag_sdk/src/graphrag_sdk/storage/__init__.py`, `graphrag_sdk/tests/test_ontology_store.py`	New `OntologyStore.infer()` inspects the FalkorDB graph (labels, relationshipTypes, sampled property keys/types, RELATES subtypes, endpoint patterns), maps DB types to SDK `PropertyType`, and returns a `GraphSchema`. Tests mock driver calls and validate inference, filtering, and failure fallbacks.
Extraction strategy with attribute coercion `graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py`, `graphrag_sdk/tests/test_attribute_prompt.py`, `graphrag_sdk/tests/test_graph_extraction.py`	Two-step extraction renders attribute schema in prompts, parses/coerces LLM `attributes` into declared types, drops records missing required attributes, aggregates attributes (last-write-wins), and merges them into node/relationship properties. Tests cover rendering, coercion, dropping, aggregation, and prompt integration.
Pipeline attribute validation `graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py`, `graphrag_sdk/tests/test_pipeline.py`	Pipeline runs `_validate_attributes()` after pruning to drop undeclared keys (except SDK-reserved names) and remove nodes/relations missing required schema-declared attributes; tests verify no-op on empty schema, attribute filtering, and required-field enforcement.
Schema-aware Cypher generation and validation `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py`, `graphrag_sdk/tests/test_cypher_generation.py`	Adds `build_schema_prompt(schema, question)`, optional `schema` parameters to `generate_cypher`, `validate_cypher`, and `execute_cypher_retrieval`; renders schema blocks and attribute examples in prompts and validates labels against the schema. Tests cover rendering fallback, label validation, and prompt contents.
Multi-path retrieval schema wiring `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py`, `graphrag_sdk/src/graphrag_sdk/storage/__init__.py`	`MultiPathRetrieval` accepts an optional `schema` and forwards it to the Cypher retrieval branch so schema-aware Cypher is used when enabled; storage re-exports `OntologyStore`.
GraphRAG facade ontology integration `graphrag_sdk/src/graphrag_sdk/api/main.py`	Facade creates `OntologyStore`, maintains `_global_schema` seeded from local schema, wires it into retrieval, exposes `get_ontology()`, `refresh_ontology()`, `save_ontology()`, and best-effort refreshes ontology after single-document ingest.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Schema-driven attributes: typed properties through extraction → storage → retrieval #260: Changes implement schema-driven typed attribute extraction and ontology APIs that match this issue's objectives.

Possibly related PRs

FalkorDB/GraphRAG-SDK#253: Modifies Cypher generation/retrieval interfaces and prompts; may overlap with schema-aware Cypher changes here.

Suggested reviewers

Naseem77

Poem

🐰 I hopped through types, through schemas bright,

I trimmed and uppercased them through the night,
I inferred the graph and coaxed attributes true,
Merged, coerced, and pipelined — then I nibbled new dew,
Now graphs and rabbits both feel light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.92% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding schema-driven attributes end-to-end with persistent ontology management.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/schema-driven-attributes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

graphrag_sdk/tests/test_ontology_store.py (1)

116-166: ⚡ Quick win

Add a regression test for same-named properties across different owners.

Please add a test that registers two types (e.g., entity Person and relation WORKS_AT) both with property "since" and asserts independent metadata persistence. This will catch owner-scoping regressions in ontology property upserts.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_ontology_store.py` around lines 116 - 166, Add a
regression test in tests/test_ontology_store.py that registers two types sharing
the same property name to ensure owner-scoped property upserts: create a
GraphSchema with an EntityType(label="Person",
properties=[PropertyType(name="since", type="DATE", required=True)]) and a
RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has
properties=[PropertyType(name="since", type="INTEGER", required=False)]; call
await store.register(schema) and then inspect fake_graph.calls to assert there
are two separate MERGE/SET operations for the property (look for "MERGE
(o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for
each include the correct owner (e.g., owner label or the relation/entity
context) and the distinct type/required values to prove they were persisted
independently.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 196-203: delete_all() currently clears only the data graph leaving
the persisted ontology in OntologyStore (_ontology_store) and a stale
_global_schema; update delete_all() to also clear the ontology store and reset
in-memory schema: call the OntologyStore.clear()/delete_all() API (or implement
one) on self._ontology_store, then set self._global_schema back to a clean
default (e.g., self.schema or an empty GraphSchema) and ensure
refresh_ontology()/load paths will repopulate from the cleared store; reference
OntologyStore, delete_all(), _ontology_store, _global_schema, and
refresh_ontology() when making the change.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 314-315: The merge() implementation currently uses
"incoming.description or current.description" which treats empty strings as
falsey and prevents an explicit empty description from overwriting an existing
one; change those checks in merge() to use an explicit None check (e.g., use
incoming.description if incoming.description is not None else
current.description) so last-write-wins applies even when incoming.description
== "". Apply the same fix to the other occurrence that uses the same pattern
(the second use around the properties merge).

In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py`:
- Around line 261-265: SCHEMA_PROMPT currently points to _SCHEMA_PROMPT_TEMPLATE
which requires schema_block and attribute_examples and thus breaks legacy
callers using SCHEMA_PROMPT.format(question=...); replace the simple alias with
a compatibility shim object (or small helper) named SCHEMA_PROMPT that
implements a format(**kwargs) method which: if kwargs contains only "question"
(legacy call), delegates to build_schema_prompt(question=kwargs["question"]) and
returns that built string, otherwise delegates to
_SCHEMA_PROMPT_TEMPLATE.format(**kwargs); ensure the shim also behaves like a
string for non-format uses (e.g., __str__ or __repr__ returning the underlying
template) so callers of SCHEMA_PROMPT continue to work.

In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 177-181: The MERGE for OntologyProperty is using only {name:
$name}, causing properties with the same name across different owners to
collide; update the upsert so the property node is keyed by both name and its
owner context (e.g. include owner or owner_label in the MERGE pattern that
creates p) instead of just name, and adjust any related uniqueness constraints
or queries that expect a single-keyed property (refer to the MERGE creating
p:OntologyProperty, the variable p, and owner_label in this block).

---

Nitpick comments:
In `@graphrag_sdk/tests/test_ontology_store.py`:
- Around line 116-166: Add a regression test in tests/test_ontology_store.py
that registers two types sharing the same property name to ensure owner-scoped
property upserts: create a GraphSchema with an EntityType(label="Person",
properties=[PropertyType(name="since", type="DATE", required=True)]) and a
RelationType(label="WORKS_AT", patterns=[("Person","Company")]) that also has
properties=[PropertyType(name="since", type="INTEGER", required=False)]; call
await store.register(schema) and then inspect fake_graph.calls to assert there
are two separate MERGE/SET operations for the property (look for "MERGE
(o)-[:HAS_PROPERTY]->" and the params passed) and that the params/metadata for
each include the correct owner (e.g., owner label or the relation/entity
context) and the distinct type/required values to prove they were persisted
independently.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9aab5a64-1ab3-4985-8285-6595928e4a23

📥 Commits

Reviewing files that changed from the base of the PR and between a90ad18 and 576bafa.

📒 Files selected for processing (14)

graphrag_sdk/src/graphrag_sdk/api/main.py
graphrag_sdk/src/graphrag_sdk/core/models.py
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
graphrag_sdk/src/graphrag_sdk/storage/__init__.py
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
graphrag_sdk/tests/test_attribute_prompt.py
graphrag_sdk/tests/test_cypher_generation.py
graphrag_sdk/tests/test_graph_extraction.py
graphrag_sdk/tests/test_models.py
graphrag_sdk/tests/test_ontology_store.py
graphrag_sdk/tests/test_pipeline.py

The data graph already contains labels, relationship types, properties, and endpoint patterns — querying it is the source of truth, not a parallel persisted graph. Schema-as-config is supported via plain JSON files. - storage/ontology_store: rewritten as inference-only. OntologyStore.infer() calls db.labels() / db.relationshipTypes(), samples keys()+typeof() per label, derives RELATES sub-types from rel_type values, and pulls endpoint patterns. Structural labels/edges and reserved property keys are filtered out so they never leak into the LLM-facing schema. - core/models: GraphSchema.from_file / save_to_file convenience methods for the schema-as-config workflow (versioned JSON, hand-editable, shareable). - api/main: GraphRAG.get_ontology() returns inferred ∪ self.schema so user- declared descriptions and required flags survive (and properties declared but not yet extracted still appear in the retrieval prompt). New save_ontology(path) writes the global ontology to a JSON file. Ingest no longer writes to a separate ontology graph. The previous PR introduced a persistent <data_graph>__ontology FalkorDB graph; that was solving "declared property with zero instances" — a case that is essentially invisible to retrieval (empty results either way). The runtime cost was a second graph to back up, clean up, and reason about. Inference + an opt-in JSON file covers the same ground in less code. Tests: 830 passed, 23 skipped. test_ontology_store rewritten around mocked-driver introspection. test_models gains a JSON round-trip test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

graphrag_sdk/src/graphrag_sdk/api/main.py (1)

596-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh or invalidate _global_schema after every graph mutation, not only ingest.

This best-effort refresh only runs on _ingest_single(). update(), delete_document(), and delete_all() also change the graph, but they leave _global_schema and the retrieval strategy's cached schema untouched, so subsequent retrieval can use stale labels/properties until the caller manually invokes refresh_ontology().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py` around lines 596 - 601, The patch
currently refreshes the global ontology only in _ingest_single(); update(),
delete_document(), and delete_all() also mutate the graph but don't refresh or
invalidate the cached schema, causing stale retrievals; modify those mutation
paths (the update(), delete_document(), and delete_all() methods) to either call
await self.refresh_ontology() after a successful mutation or explicitly
clear/invalidate self._global_schema and notify the retrieval strategy to clear
its cached schema (use the same mechanism the ingest path uses to update
retrieval_strategy's cache), so that retrievals see new/removed properties
immediately.

graphrag_sdk/src/graphrag_sdk/core/models.py (1)

330-333: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve explicit empty descriptions in merge().

These branches still use incoming.description or current.description, so merge() cannot intentionally clear a description with "" even though the method documents last-write-wins behavior.

Suggested fix

                 ent_by_label[e.label] = EntityType(
                     label=cur.label,
-                    description=e.description or cur.description,
+                    description=e.description if e.description is not None else cur.description,
                     properties=_merge_props(cur.properties, e.properties),
                 )
@@
                 rel_by_label[r.label] = RelationType(
                     label=cur.label,
-                    description=r.description or cur.description,
+                    description=r.description if r.description is not None else cur.description,
                     patterns=merged_patterns,
                     properties=_merge_props(cur.properties, r.properties),
                 )

Also applies to: 348-350

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py` around lines 330 - 333, In
merge(), the code uses falsy checks (incoming.description or
current.description) which prevents intentionally clearing a description with an
empty string; update the description selection to check for None explicitly
(e.g., use incoming.description if incoming.description is not None else
current.description) in the branches that construct EntityType (references:
ent_by_label, EntityType construction, variables e/cur or incoming/current) and
the analogous branch around lines 348-350 so last-write-wins semantics accept ""
as a valid value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 111-119: The infer()/get_ontology() path only collects relation
types when "RELATES" exists, dropping direct user edge labels like
WORKS_AT/LOCATED_IN; update the logic around rel_types and relations (the
relations: list[RelationType] block and use of self._infer_relates_subtypes) to
also include any non-structural rel_types besides RELATES by converting each
such label into a RelationType (including property keys and endpoint patterns)
or by calling a new helper to infer their properties using the same sample_size;
keep excluding structural edges (PART_OF, NEXT_CHUNK, MENTIONED_IN) and preserve
the existing RELATES handling via self._infer_relates_subtypes(sample_size).

---

Duplicate comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 596-601: The patch currently refreshes the global ontology only in
_ingest_single(); update(), delete_document(), and delete_all() also mutate the
graph but don't refresh or invalidate the cached schema, causing stale
retrievals; modify those mutation paths (the update(), delete_document(), and
delete_all() methods) to either call await self.refresh_ontology() after a
successful mutation or explicitly clear/invalidate self._global_schema and
notify the retrieval strategy to clear its cached schema (use the same mechanism
the ingest path uses to update retrieval_strategy's cache), so that retrievals
see new/removed properties immediately.

In `@graphrag_sdk/src/graphrag_sdk/core/models.py`:
- Around line 330-333: In merge(), the code uses falsy checks
(incoming.description or current.description) which prevents intentionally
clearing a description with an empty string; update the description selection to
check for None explicitly (e.g., use incoming.description if
incoming.description is not None else current.description) in the branches that
construct EntityType (references: ent_by_label, EntityType construction,
variables e/cur or incoming/current) and the analogous branch around lines
348-350 so last-write-wins semantics accept "" as a valid value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee8ce5db-b86d-454b-94a9-d1ef91a6f6f8

📥 Commits

Reviewing files that changed from the base of the PR and between 576bafa and 54e4e05.

📒 Files selected for processing (5)

graphrag_sdk/src/graphrag_sdk/api/main.py
graphrag_sdk/src/graphrag_sdk/core/models.py
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
graphrag_sdk/tests/test_models.py
graphrag_sdk/tests/test_ontology_store.py

🚧 Files skipped from review as they are similar to previous changes (1)

graphrag_sdk/tests/test_models.py

- graph_extraction.py: regroup imports (datetime alongside stdlib); remove unused EntityType import; wrap two over-length attribute-fetch lines. - api/main.py: split a long get_ontology return expression. ruff check src/graphrag_sdk/ passes locally. Tests: 830 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's "ruff format --check src/" flagged 5 files. Pure formatting; no behavioral change. Tests: 830 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 140-146: The query that builds subtypes_result currently applies
LIMIT before DISTINCT, causing rare r.rel_type values to be dropped; update the
Cypher used in the get_ontology (or wherever subtypes_result is produced) so
DISTINCT r.rel_type is applied prior to sampling/limiting: select DISTINCT
r.rel_type AS rel_type (filter out NULLs), then apply the LIMIT $limit to that
deduplicated set before returning rel_type via self._conn.query.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61f452a0-7e8d-4c3d-aa3e-7be0ec4c8ff3

📥 Commits

Reviewing files that changed from the base of the PR and between 0597de9 and adc223f.

📒 Files selected for processing (5)

graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py

✅ Files skipped from review due to trivial changes (1)

graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py

🚧 Files skipped from review as they are similar to previous changes (3)

graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py

Copilot

Pull request overview

Implements schema-driven attributes across extraction, ingestion validation, and schema-aware Cypher generation, plus an ontology mechanism to supply a global schema to retrieval.

Changes:

Adds attribute coercion/extraction support driven by GraphSchema property declarations, and validates/drops records missing required attributes during ingestion.
Introduces OntologyStore and threads schema/ontology into retrieval so Cypher prompt/validation can be schema-aware.
Extends unit test coverage across models, extraction, ingestion pipeline, ontology inference, and Cypher generation.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py	Builds schema-aware prompt blocks/examples and validates labels against an optional `GraphSchema`.
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py	New ontology inference helper that introspects the live graph for labels/relationship subtypes/properties.
graphrag_sdk/src/graphrag_sdk/core/models.py	Adds schema property typing/validation, reserved-name checks, schema merge, and extracted attributes fields.
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py	Adds attribute prompt block + per-type coercion and carries attributes through aggregation/storage conversion.
graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py	Adds `_validate_attributes` pass after pruning and re-runs quality filtering.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py	Threads schema into Cypher retrieval execution when enabled.
graphrag_sdk/src/graphrag_sdk/api/main.py	Instantiates `OntologyStore`, adds get/refresh/save ontology APIs, and propagates schema to retrieval strategy.
graphrag_sdk/src/graphrag_sdk/storage/init.py	Exports `OntologyStore`.
graphrag_sdk/tests/test_attribute_prompt.py	New tests for attribute prompt rendering and coercion helpers.
graphrag_sdk/tests/test_cypher_generation.py	Adds tests for schema block rendering, prompt building, and schema-aware Cypher validation.
graphrag_sdk/tests/test_graph_extraction.py	Adds tests for attribute coercion, required-attribute dropping, aggregation, and property merge behavior.
graphrag_sdk/tests/test_models.py	Updates/extends schema model tests (type normalization, reserved keys, merge).
graphrag_sdk/tests/test_ontology_store.py	New tests for ontology inference logic with a mocked FalkorDB driver.
graphrag_sdk/tests/test_pipeline.py	Adds tests for ingestion attribute validation behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 Question: "What organizations are related to the technology?"
 ```cypher
 MATCH (o:Organization)-[r:RELATES]-(t:Technology)
 RETURN o.name AS organization, t.name AS technology, r.rel_type AS relation, r.fact AS evidence
 LIMIT 20
-```
+```{attribute_examples}

 ## Your task


+"""Ontology inference from the live data graph.
+
+The schema is **derived** from what's in the data graph, not maintained in a
+separate persistent graph. This keeps the architecture honest: the source of
+truth for "what entities and relations exist" is the data itself.
+


+          property name appears in both, the incoming type/description/required
+          overrides — last-write-wins, matching the persisted ontology's
+          register() semantics.


+        # Ontology is inferred from the live data graph (no separate
+        # persistent graph). User-supplied ``schema`` is merged on top at
+        # retrieval time so declared descriptions / required flags / not-yet-
+        # extracted properties survive.
+        self._ontology_store = OntologyStore(self._conn)
+        # Global ontology used at retrieval time. Initially just the local
+        # schema; refresh_ontology() merges in the inferred view on demand
+        # and after each ingest.
+        self._global_schema: GraphSchema = self.schema
+
        # Default retrieval strategy
        self._retrieval_strategy = retrieval_strategy or MultiPathRetrieval(
            graph_store=self._graph_store,
            vector_store=self._vector_store,
            embedder=self.embedder,
            llm=self.llm,
+            schema=self._global_schema,
        )


`required=True` was a footgun for an LLM-extraction pipeline: when the LLM couldn't find a value in the text (often because the text didn't state it), the entire entity was silently dropped. That destroyed real data. New behavior: every declared property name appears in `ExtractedEntity.attributes` with either the coerced value or `None`. The storage layer's existing `_clean_properties` strips `None` before writing so the graph sees "key missing" — the right null semantics for retrieval (`WHERE p.age > N` naturally excludes nodes without `age`). - core/models: remove `required` field from PropertyType. - ingestion/graph_extraction: `_coerce_attributes` now returns just a dict with `None` for missing/uncoercible values; never drops records. The drop-on-missing-required branches and aggregated warnings are gone. - ingestion/pipeline: `_validate_attributes` strips undeclared keys but never drops records. Cascade `_filter_quality` still runs but no longer has anything to cascade in the attribute-validation case. - storage/ontology_store: docstring no longer mentions required flags. - tests: collapsed the three "required missing" cases into a single "every declared property appears in result; uncoercible -> None" test per file. 829 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

+        "For each verified entity and each extracted relationship, also extract any "
+        "of the declared properties below. Use null when the property is not stated "
+        "in the text. Coerce to the declared type (INTEGER, FLOAT, BOOLEAN, DATE in "
+        "ISO 8601, STRING, LIST). Place extracted values inside an `attributes` "
+        "object on each entity / relationship.",


The data graph alone wasn't enough to anchor the schema: - types are lossy (FalkorDB `typeof()` collapses DATE → "string") - multiple processes/instances couldn't share a declarative view - nothing caught schema-typo contradictions across sessions - declarative metadata (descriptions, future flags) had no canonical home Restore the persistent ontology graph (``<data>__ontology``) as the single source of truth, with additive-only semantics: - OntologyStore.register() validates incoming schema against the persisted ontology. Re-typing an existing property raises OntologyContradictionError before any partial state is persisted; additions (new labels, properties, patterns) go through unchanged. - OntologyStore.load() / clear() round out the lifecycle. - GraphRAG._ensure_ontology_initialized() lazy first-touch loads the persisted ontology and registers self.schema. Called from ingest (so contradictions surface before expensive extraction) and from get_ontology / retrieval paths. - delete_all() now drops both data and ontology graphs and resets the initialised flag so the next ingest re-registers self.schema cleanly. - get_ontology() always reads from the ontology graph; refresh_ontology() is a thin re-load for cross-process freshness. save_ontology(path) remains as a JSON-file bridge. This reverts the architecture of 54e4e05 (drop ontology graph, infer from data). The "save a graph" simplicity wasn't worth losing validation, multi-process safety, type fidelity, and a queryable schema artifact. Tests: 825 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y one Before, a user passing no schema got an empty ontology graph even though extraction was using DEFAULT_ENTITY_TYPES under the hood. The anchor said "we know nothing" while the LLM was producing Person/Organization/... nodes. Now ``_ensure_ontology_initialized()`` has three branches: - user passed a schema → register it (validate + persist) - ontology graph already populated (prior session) → use as-is - both empty → register DEFAULT_ENTITY_TYPES so the ontology graph faithfully reflects what the extractor will produce After this, ``get_ontology()``, the Cypher-generation prompt, and the extractor all read from the same source. Test fixture: ``mock_conn`` in test_facade.py now stubs the connection's ``_driver.select_graph()`` chain so OntologyStore can open the ontology graph handle against the mock. 825 passed, 23 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-code-quality Bot found potential problems May 17, 2026

View reviewed changes

Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py Fixed

Comment thread graphrag_sdk/tests/test_ontology_store.py Fixed

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated

Comment thread graphrag_sdk/src/graphrag_sdk/core/models.py

Comment thread graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py

Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated

github-code-quality Bot found potential problems May 18, 2026

View reviewed changes

Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py Fixed

This was referenced May 18, 2026

Schema-driven attributes: typed properties through extraction → storage → retrieval #260

Open

Ontology learning: auto-discover schema from a corpus #240

Open

galshubeli linked an issue May 18, 2026 that may be closed by this pull request

Schema-driven attributes: typed properties through extraction → storage → retrieval #260

Open

3 tasks

style(ruff): apply ruff format to feature files

adc223f

CI's "ruff format --check src/" flagged 5 files. Pure formatting; no behavioral change. Tests: 830 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Outdated

Merge branch 'main' into feat/schema-driven-attributes

dcc4069

galshubeli requested a review from Naseem77 May 19, 2026 10:48

galshubeli self-assigned this May 19, 2026

galshubeli requested a review from Copilot May 19, 2026 10:49

Copilot started reviewing on behalf of galshubeli May 19, 2026 10:50 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

github-code-quality Bot found potential problems May 19, 2026

View reviewed changes

galshubeli and others added 2 commits May 19, 2026 15:58

Conversation

galshubeli commented May 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Algorithm

Backwards compatibility

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

galshubeli commented May 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading