Skip to content

Add Lean language indexing with optional LSP resolution#548

Open
JJYYY-JJY wants to merge 1 commit into
colbymchenry:mainfrom
JJYYY-JJY:main
Open

Add Lean language indexing with optional LSP resolution#548
JJYYY-JJY wants to merge 1 commit into
colbymchenry:mainfrom
JJYYY-JJY:main

Conversation

@JJYYY-JJY
Copy link
Copy Markdown

Summary

Adds Lean (.lean) support to CodeGraph via a vendored tree-sitter-lean.wasm grammar and a dedicated Lean extractor.
Lean indexing works statically and offline by default. When lake or lean is available, CodeGraph can run a best-effort Lean/Lake LSP definition pass to improve unresolved reference edges without making the Lean toolchain required.

Changes

  • Add .lean language detection and Tree-sitter parsing.
  • Add Lean extraction for declarations, namespaces, imports, sections, attributes, instances, variables, and reference candidates.
  • Add local Lean module import resolution.
  • Add optional Lean LSP definition resolution for unresolved references.
  • Add Lean resolver metadata provenance via resolvedBy: "lean-lsp".
  • Add .lake/ to default ignored dependency/build/cache directories, with .gitignore negation still allowing explicit opt in.
  • Add Lean extraction and resolution tests.
  • Update README and docs to describe Lean as static extraction with optional Lean/Lake LSP definition resolution.

Configuration

New optional environment variables:

  • CODEGRAPH_LEAN_SEMANTICS
  • CODEGRAPH_LEAN_LSP_COMMAND
  • CODEGRAPH_LEAN_LSP_TIMEOUT_MS
  • CODEGRAPH_LEAN_LSP_REF_LIMIT

Verification

  • node scripts/add-lang/check-grammar.mjs src/extraction/wasm/tree-sitter-lean.wasm <valid-lean-sample> 20
  • npm test -- __tests__/lean-extraction.test.ts __tests__/resolution.test.ts
  • npm run build
  • npm test
  • npm --prefix site run build
  • git diff --check

Copilot AI review requested due to automatic review settings May 29, 2026 05:09
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e54d9a6ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +343 to +345
line: Math.max(0, (ref.line || 1) - 1),
character: Math.max(0, ref.column || 0),
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Convert tree-sitter byte columns before LSP requests

For Lean files containing multibyte Unicode before a reference on the same line, this sends the tree-sitter startPosition.column byte offset directly as an LSP character position. Because the worker initializes with empty capabilities and does not negotiate/use byte offsets, the Lean server interprets character as the protocol position encoding, so common Lean text such as α, , or before target makes textDocument/definition query the wrong character and semantic resolution silently falls back or resolves incorrectly. Convert the extracted byte column to the LSP encoding for the opened line before sending the request, and do the inverse for returned locations before comparing to tree-sitter node ranges.

Useful? React with 👍 / 👎.

Comment on lines +184 to +185
const nameNode = getChildByField(node, 'name');
if (!nameNode) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Read Lean import modules from the module field

For any Lean file with import Foo.Bar, the vendored tree-sitter-lean grammar exposes the imported identifier under the module field, so looking up name returns null and this method exits before creating either the import node or the unresolved import reference. That prevents Lean import edges from ever being created and also leaves importedModules empty, so static candidate resolution loses the imported-module prefixes when LSP is disabled or unavailable.

Useful? React with 👍 / 👎.

Comment on lines +460 to +464
this.walkDescendants(declaration, (node) => {
if (node.type !== 'app') return true;
if (this.isInsideNodeType(node, 'attributes') || this.isInInheritancePosition(node)) return true;
const fn = getChildByField(node, 'fn');
if (!fn) return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match the vendored Lean application nodes for calls

For ordinary Lean calls such as Foo.bar 1, this branch never emits a call reference because the vendored tree-sitter-lean grammar does not produce app nodes with an fn field for applications. As a result, static-only indexing misses Lean call edges entirely whenever the optional LSP pass is disabled, unavailable, or capped by CODEGRAPH_LEAN_LSP_REF_LIMIT; update this traversal to the actual application node and function field names from the vendored grammar.

Useful? React with 👍 / 👎.

Comment on lines +18 to +21
const DECLARATION_TYPES = new Set([
'def',
'theorem',
'abbrev',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include lemma declarations in Lean extraction

For any Lean declaration written with lemma, findDeclarationPayload only accepts node types from this set and therefore returns null because the vendored grammar exposes lemma separately from theorem. Lemmas are one of the common declaration forms in Lean projects, so their symbol nodes, references, docstrings, and visibility are all dropped; add lemma to the declaration/name-position sets and map it like theorem.

Useful? React with 👍 / 👎.

Comment on lines +423 to +424
if (child.type === 'field') {
this.createMember(parent, 'field', child);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle Lean struct_field nodes when extracting fields

For structures and classes that declare fields in a where block, the vendored Lean grammar emits those members as struct_field nodes, but this traversal only creates a field for nodes typed field. Consequently fields such as structure Point where x : Nat are omitted from the graph even though the parent structure is indexed; include struct_field here and pass it through createMember.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant