Skip to content

Add JSON AST IR with cfg resolution for multi-language XDR code generation#518

Open
tamirms wants to merge 9 commits intostellar:mainfrom
tamirms:tamir/json-ast-ir
Open

Add JSON AST IR with cfg resolution for multi-language XDR code generation#518
tamirms wants to merge 9 commits intostellar:mainfrom
tamirms:tamir/json-ast-ir

Conversation

@tamirms
Copy link
Copy Markdown

@tamirms tamirms commented Mar 24, 2026

Summary

Add a --json-ast flag to the XDR generator that outputs a language-neutral JSON IR for consumption by code generators in other languages (e.g., Go). Uses shadow types in ir.rs decoupled from the parser's AST, so the IR shape can evolve independently without modifying the parser.

JSON IR format

  • Flat list of definitions, each carrying its computed properties directly
  • Internally-tagged encoding: {"kind": "struct", "name": "Foo", "fixed_size": 12, ...}
  • fixed_size only on structs and unions (where it's a non-trivial computation). Enums are always 4 bytes by XDR spec; typedefs are resolved by following the chain — same as consumers do for everything else
  • Pre-resolved values: union case idents resolved to integers, named sizes resolved to integers, const values inlined
  • Union cases as [{"value": 0, "name": "TAG_A"}, {"value": 1}] — single struct with optional name, not parallel arrays
  • resolved_features metadata when --feature is used

Shadow types (ir.rs)

Dedicated IR types for JSON emission, constructed from the AST via ir::build(). This avoids adding serde derives, serde tags, or variant reshaping to the parser's AST. The only AST additions are CfgExpr::evaluate() and UnionArm.name.

Wire size computation lives in ir.rs (not in the parser) since it's only used by the IR builder. The ir::build() function is the single entry point — takes &XdrSpec + resolved features, returns IR. TypeInfo construction, wire size computation, and const resolution are internal details.

Cfg resolution (--feature flag)

  • --feature <name> evaluates #ifdef/#else/#endif and filters the spec
  • Comma-delimited for multiple features (consistent with Cargo)
  • Removes definitions, union arms, and enum members whose cfg evaluates to false
  • Clears cfg annotations on surviving elements
  • Restricted to --json-ast (Rust codegen uses #[cfg(...)] attributes directly)

Other changes

  • CfgExpr::evaluate() method on the AST type
  • Checked arithmetic in wire size computation
  • Snapshot JSON IR to xdr-definitions-json/xdr.json
  • Union arm name field preserved in AST for downstream consumers

Copy link
Copy Markdown
Member

@leighmcculloch leighmcculloch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. A few asks inline.

Can we also:

  • Generate the JSON and write it to a new folder at the root, xdr-definitions-json? In this repo anything that can be generated gets snapshotted into the repo on every commit. Just do that for the 'curr' branch for the moment.
  • Hold merging this until #515 which is critical path for other work that's happening at the moment.

/// XDR type specification.
#[derive(Debug, Clone, PartialEq, Eq)]
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize)]
#[serde(tag = "kind")]
Copy link
Copy Markdown
Member

@leighmcculloch leighmcculloch Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use on the types #[serde(rename_all = "snake_case")] and remove the tag = "kind" attribute?

The XDR-JSON generated by this repo uses snake case, and it's less surprising if we aren't flipping back and forth between styles.

The JSON generated by this repo uses external tags, which is the default, rather than internal tags which is what tag = "kind" does.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used #[serde(rename_all = "snake_case")] on the types. But, I still kept tag = "kind" because the JSON IR is designed for consumption by non-Rust code generators (Go today, potentially others). External tags produce two different shapes for the same enum — bare strings for unit variants ("int") and wrapper objects for struct variants ({"opaque_fixed": {"size": ...}}). The consumer has to handle both shapes. Internal tags produce uniform objects for every variant ({"kind": "int"}, {"kind": "opaque_fixed", "size": ...}), which is simpler to unmarshal in Go and is the standard pattern for tagged unions in TypeScript and other languages.

But, if you still think it's better to remove tag = "kind" I can do that and I will just have to write extra unmarshalling logic on the Go side to handle it.

tamirms and others added 5 commits April 1, 2026 11:40
Add --json-ast flag that outputs the parsed XDR specification as JSON,
enriched with computed type properties:

- fixed_wire_size: XDR wire size for fixed-size types (RFC 4506)
- is_recursive: cycle detection in the type graph
- union_arm_is_recursive: per-arm recursion flags
- definition_order: topological ordering for dependency-safe emission
- Union arm field names preserved in AST

These properties enable Go (and other) code generators to consume the
parsed XDR specification without reimplementing type analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from serde default externally-tagged encoding to internally-tagged
encoding (#[serde(tag = "kind")]) for Definition, Type, Size, and
UnionCaseValue enums. This produces cleaner JSON:

  Before: {"Ident": "AccountId"}
  After:  {"kind": "Ident", "ident": "AccountId"}

This eliminates the need for custom JSON unmarshalers in consuming code
generators (e.g., the Go generator dropped 125 lines of unmarshaling code).

Other changes:
- Convert newtype variants to struct variants for uniform serde output
- Remove is_recursive and union_arm_is_recursive from ComputedProperties
  (consumers handle recursion via depth-bounded traversal instead)
- Remove definition_order from JSON output (consumers order by dependency)
- Add kind field to TypeProperties and const_values to ComputedProperties
- Remove unused resolve_typedef_to_builtin_with, Definition::is_nested,
  Definition::file_index methods
- Extract resolved_builtin() helper in Rust generator
- Simplify discriminant_is_builtin check to use resolve_typedef_to_builtin
- Remove dead type case from escape_type_name
- Remove unused serde_json dependency from xdr-parser
- Add validation: error if neither --output nor --json-ast specified
- Add type property tests (wire size, kind values, typedef chains)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use #[serde(rename_all = "snake_case")] on all enums and remove
  #[serde(tag = "kind")] to use serde's default external tagging.
  Matches the repo's XDR-JSON convention.

- Replace TypeProperties.kind: &'static str with a proper DefKind enum.
  Type-safe and serializes identically with rename_all.

- Add Serialize derive to CfgExpr for the new cfg field on UnionArm.

- Add documentation on fixed_wire_size: when None, the consumer reads
  wire data (length prefixes, discriminants) to determine size at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The computed properties (fixed_wire_size, kind) are keyed by name.
If cfg-gated redefinitions produce two definitions with the same name,
TypeInfo::build silently keeps the last one, making computed properties
wrong for the first. Detect this in the JSON IR path and error out
with a clear message.

This only affects --json-ast; the Rust code generator handles
same-name redefinitions correctly via #[cfg] attributes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Switch serde encoding from external to internal tags (tag = "kind")
  on Definition, Type, UnionCaseValue, and Size enums. Internal tags
  produce uniform JSON objects with a "kind" discriminator, which is
  the standard cross-language pattern for tagged unions.

- Add --feature flag (restricted to --json-ast) for cfg resolution.
  When specified, cfg expressions are evaluated against the feature set,
  filtered definitions/arms/members are removed, and cfg annotations
  are cleared before emission. Includes union arm coverage validation
  to catch incomplete unions after filtering.

- Add CfgExpr::evaluate() method alongside existing render()/negate()/and().

- Snapshot JSON IR for xdr/curr/ to xdr-definitions-json/curr.json
  with Makefile target.

- 21 new tests covering cfg evaluation, filtering, and validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tamirms tamirms force-pushed the tamir/json-ast-ir branch 5 times, most recently from 95230f6 to 548f22d Compare April 1, 2026 17:13
@tamirms tamirms changed the title Add JSON AST output for multi-language XDR code generation Add JSON AST IR with cfg resolution for multi-language XDR code generation Apr 1, 2026
@tamirms tamirms force-pushed the tamir/json-ast-ir branch 5 times, most recently from 2f4fc9e to 4c3cfea Compare April 1, 2026 20:03
@tamirms tamirms marked this pull request as ready for review April 1, 2026 20:39
@tamirms
Copy link
Copy Markdown
Author

tamirms commented Apr 1, 2026

@leighmcculloch can you take another look at the PR now? it is ready for review. I have pulled in the latest changes from master including #515 and I have included the xdr-definitions-json fixture file as well

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a language-neutral JSON representation of parsed XDR (plus computed metadata) to enable non-Rust code generators to consume the existing Rust parser output, with optional cfg/feature resolution during JSON emission.

Changes:

  • Add serde-serializable AST + a --json-ast CLI output mode (with optional --feature cfg resolution restricted to JSON output).
  • Add computed per-definition metadata (e.g., fixed_wire_size, kind, const_values) for the emitted IR.
  • Extend parsing to preserve union arm field names, and update AST shapes (Type, Size, UnionCaseValue) for internally-tagged JSON.

Reviewed changes

Copilot reviewed 17 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
xdr-generator-rust/xdr-parser/src/types.rs Adds computed properties (including fixed wire size) and updates matches for the new AST shapes.
xdr-generator-rust/xdr-parser/src/tests/types.rs New tests covering wire-size computation and preserved union arm names.
xdr-generator-rust/xdr-parser/src/tests/parser.rs Updates parser tests for the new Type/Size shapes.
xdr-generator-rust/xdr-parser/src/tests/mod.rs Registers the new types test module.
xdr-generator-rust/xdr-parser/src/parser.rs Preserves union arm field names and updates AST construction to struct-style enum variants.
xdr-generator-rust/xdr-parser/src/lexer.rs Makes IntBase serializable to support JSON emission paths.
xdr-generator-rust/xdr-parser/src/ast.rs Adds serde serialization + internal tagging (kind) and updates AST variant representations.
xdr-generator-rust/xdr-parser/Cargo.toml Adds serde dependency for AST serialization.
xdr-generator-rust/generator/src/types.rs Updates generator type mapping logic for the new AST variant shapes.
xdr-generator-rust/generator/src/tests/types.rs Updates generator tests for the new AST shapes.
xdr-generator-rust/generator/src/tests/mod.rs Registers cfg filtering tests.
xdr-generator-rust/generator/src/tests/cfg_filter.rs Adds tests for cfg evaluation, filtering behavior, and union coverage validation.
xdr-generator-rust/generator/src/naming.rs Updates union case naming logic for the new UnionCaseValue shapes.
xdr-generator-rust/generator/src/main.rs Adds --json-ast/--feature, implements cfg filtering + union coverage validation, and emits JSON IR.
xdr-generator-rust/generator/src/generator.rs Updates discriminant handling and size extraction for new Type shapes.
xdr-generator-rust/generator/Cargo.toml Adds serde_json for JSON IR emission.
xdr-generator-rust/Cargo.lock Locks new serde/serde_json dependencies.
Makefile Adds xdr-definitions-json target and wires it into make generate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tamirms tamirms force-pushed the tamir/json-ast-ir branch from 4c3cfea to b41cf99 Compare April 2, 2026 07:21
The IR emitter faithfully represents the filtered XDR — semantic
validation of the source (union coverage, dangling references) is
the consumer's responsibility, not the emitter's. Remove
validate_union_coverage and simplify filter_spec to just filter.

Use checked arithmetic in wire size computation: checked_add,
checked_mul, and u32::try_from for const resolution. Return None
(variable size) on overflow or unresolvable constants instead of
panicking or silently wrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 19 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

xdr-generator-rust/xdr-parser/src/lexer.rs:105

  • IntBase is now serialized, but without any serde renaming. This will emit JSON values like "Decimal"/"Hexadecimal", which is inconsistent with the rest of the IR’s snake_case convention and may be awkward for non-Rust consumers. Consider adding #[serde(rename_all = "snake_case")] to IntBase (or explicitly renaming variants) to keep the JSON schema consistent.
/// The base (radix) of an integer literal.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize)]
pub enum IntBase {
    Decimal,
    Hexadecimal,
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +172 to +177
pub struct ComputedProperties {
/// Per-type properties (fixed wire size).
pub types: HashMap<String, TypeProperties>,
/// Resolved constant values (const name → integer value).
pub const_values: HashMap<String, i64>,
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ComputedProperties uses HashMap for types/const_values, which makes the emitted --json-ast output nondeterministic because HashMap iteration order is randomized. Since the Makefile snapshots JSON to a file, this will cause noisy diffs across runs. Consider switching these maps to BTreeMap (or emitting sorted Vec<(String, ...)>) so JSON output is stable.

Copilot uses AI. Check for mistakes.
Comment on lines +294 to +303
Type::Array { element_type, size } => {
let elem_sz = self.type_wire_size(element_type, cache, in_progress)?;
let count = self.resolve_size(size)?;
elem_sz.checked_mul(count)
}
Type::Ident { ident: name } => self.compute_wire_size(name, cache, in_progress),
// Variable-size types
Type::OpaqueVar { max_size: _ } | Type::String { max_size: _ } | Type::VarArray { .. } | Type::Optional { element_type: _ } => {
None
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type_wire_size resolves Type::Ident by calling compute_wire_size(name, ...) using the raw identifier string. But TypeInfo::definitions is keyed by the transformed name from type_name_fn in TypeInfo::build, so this can fail to resolve sizes when a non-identity naming function is used. Consider looking up via the same naming function used to build TypeInfo (e.g., store it, or store an additional map from original XDR name -> key), or store definitions keyed by original name for wire-size purposes.

Copilot uses AI. Check for mistakes.
Comment on lines +253 to +270
Some(Definition::Union(u)) => {
let disc_size: u32 = 4;
let mut arm_sizes: Vec<Option<u32>> = Vec::new();
for arm in &u.arms {
match &arm.type_ {
None => arm_sizes.push(Some(0)), // void
Some(t) => arm_sizes.push(self.type_wire_size(t, cache, in_progress)),
}
}
if arm_sizes.is_empty()
|| arm_sizes.iter().any(|s| s.is_none())
|| arm_sizes.iter().filter_map(|s| *s).collect::<HashSet<_>>().len() != 1
{
None
} else {
disc_size.checked_add(arm_sizes[0]?)
}
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union wire size uses a hard-coded disc_size: u32 = 4. This is correct for the RFC-allowed discriminant types (int/uint/bool/enum), but the parser appears to accept other discriminant types as well; if one slips through (e.g., hyper), the computed fixed_wire_size will be wrong. Consider computing the discriminant size via type_wire_size(&u.discriminant.type_, ...) (and returning None if it’s not fixed / not 4 bytes) to keep the computation consistent with the parsed AST.

Copilot uses AI. Check for mistakes.
/// Feature names are compared case-insensitively.
pub fn evaluate(&self, features: &std::collections::HashSet<String>) -> bool {
match self {
CfgExpr::Feature(name) => features.contains(&name.to_lowercase()),
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CfgExpr::evaluate is documented as “case-insensitive feature matching”, but it only lowercases the expression-side name and does a direct HashSet::contains lookup. This only works if the caller has already lowercased all entries in features. To match the doc/API contract, consider normalizing the input set inside evaluate (e.g., compare with eq_ignore_ascii_case, or require/encode a LowercaseFeatureSet newtype so callers can’t pass mixed-case sets).

Suggested change
CfgExpr::Feature(name) => features.contains(&name.to_lowercase()),
CfgExpr::Feature(name) => {
features.iter().any(|f| f.eq_ignore_ascii_case(name))
}

Copilot uses AI. Check for mistakes.
Comment on lines +130 to +136
/// Filter an XDR spec in-place: remove definitions and sub-elements whose cfg
/// evaluates to false, then clear all remaining cfg annotations.
fn filter_spec(spec: &mut XdrSpec, features: &HashSet<String>) {
filter_definitions(&mut spec.definitions, features);
for ns in &mut spec.namespaces {
filter_definitions(&mut ns.definitions, features);
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter_spec only filters spec.namespaces one level deep. Since Namespace supports nested namespaces (Namespace { namespaces: Vec<Namespace>, ... }), cfg-gated definitions inside nested namespaces will not be removed and their cfg annotations won’t be cleared. Consider making filter_spec recurse into ns.namespaces as well (and add a test covering nested namespaces).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I noticed about the format is that it requires matching up information from multiple places. There's a separate computed section that you have to lookup fixed size information, when it would be more convenient and intuitive if that was in the definition itself. And the computer const values would also ideally be in the definition of the const too. If they were combined it'd be a list of definitions, and each definition will contain their details, and maybe computed values can be under a computed field of each definition. I think that would be easier to work with. Wdyt?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes a lot of sense. In an earlier conversation you mentioned the idea of using shadow types for the IR. I think that's the right approach here and I've implemented that in the next commit I'm pushing. There will be separate IR types in ir.rs that are purpose-built for JSON emission, decoupled from the parser's AST.

This also means minimal changes to the parser — no serde derives on AST types, no serde tags, no reshaping variants (e.g. tuple to struct) just for serialization. The only AST additions are CfgExpr::evaluate() for feature filtering and a name field on UnionArm.

The shadow types let us build exactly the consumer-friendly format you're describing, plus a few other improvements:

  • Each definition carries its computed properties directly (fixed_size, resolved const values) — no separate computed section
  • Union case values pre-resolved to integers with optional ident names
  • Named sizes (like opaque data[MAX_SIZE]) resolved to integers
  • Union arm cases use a single {value, name?} struct instead of parallel arrays

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these tests test existing functionality, not new functionality, could we add them in a separate quick pr?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests all cover new functionality added in this PR — CfgExpr::evaluate(), filter_spec, ComputedProperties/wire size computation, and the new UnionArm.name field. None of them test pre-existing code.

fn test_base_type_ref_opaque_fixed() {
assert_eq!(
base_type_ref(&Type::OpaqueFixed(Size::Literal(32)), None),
base_type_ref(&Type::OpaqueFixed { size: Size::Literal { literal: 32 } }, None),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of this change using the named fields is a plus I think, it makes this code much more self-descriptive when reading 👍🏻

tamirms and others added 3 commits April 3, 2026 10:30
Replace serde derives on AST types with dedicated shadow types in ir.rs
for JSON IR emission. This minimizes changes to the parser — no serde
tags, no tuple-to-struct variant reshaping. The only AST additions are
CfgExpr::evaluate() and UnionArm.name.

IR format improvements:
- Each definition embeds its computed properties (fixed_size) directly
- Union case values pre-resolved with optional ident names
- Named sizes resolved to integers
- Merged case representation (single struct instead of parallel arrays)
- Multiple --feature flags supported (comma-delimited)

Safety improvements:
- Panic on unresolved references instead of silent unwrap_or(0)
- i64::from/u64::from instead of as casts
- u64::try_from for negative constant detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… Union.default_arm

- Enum.fixed_size was always Some(4) — XDR enums are always 4 bytes
- Typedef.fixed_size was pre-resolved chain value — consumer already
  follows typedef chains for everything else, inconsistent to pre-resolve
  only this one property
- Union.default_arm was always None — parser rejects default arms,
  adding the field later is not a breaking JSON change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The wire size computation is only used by the IR builder, not by the
parser or Rust code generator. Moving it to ir.rs:

- Eliminates ComputedProperties/TypeProperties types (replaced by a
  plain HashMap internal to ir.rs)
- Exposes a single ir::build() entry point that takes &XdrSpec and
  returns IR (hides TypeInfo, wire size computation, const resolution)
- Collapses 13 lines of IR construction in main.rs to one call
- Moves wire size tests from parser to generator (where the code lives)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tamirms tamirms force-pushed the tamir/json-ast-ir branch from 0446e70 to 48c3df4 Compare April 3, 2026 10:52
@tamirms tamirms requested a review from leighmcculloch April 3, 2026 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants