
Opschema metadata#6280

Open
mzient wants to merge 25 commits into NVIDIA:main from mzient:opschema-metadata

Conversation

@mzient
Contributor

@mzient mzient commented Apr 3, 2026

Co-authored-by: Rostan Tabet rtabet@nvidia.com

Category:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

Description:

This change adds static metadata inference (ndim, layout, dtype) to OpSchema. Most operators can infer it from OpSpec.
OpSpec now carries the statically inferred metadata.
Actual inputs and outputs, as seen in the workspace, are now automatically validated against OpSpec in OperatorBase.

There is a default policy for handling metadata. It is opt-in, but it can be enabled for all schemas declared with DALI_SCHEMA by defining DALI_SCHEMA_DEFAULT_METADATA_POLICY as nonzero. This is done in the DALI project itself, so all internal operators implement the default policy and must either opt out or override it if they do not conform.
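To illustrate the opt-in / opt-out / override idea, here is a minimal sketch. This is not DALI's actual OpSchema API; the class and method names (`Schema`, `no_metadata_inference`, `output_ndim`) are hypothetical stand-ins, and the build flag is modeled as a class attribute:

```python
from typing import Callable, Optional

class Schema:
    """Toy model of an opt-in default metadata policy (not DALI's real API)."""

    # Stand-in for the DALI_SCHEMA_DEFAULT_METADATA_POLICY build flag.
    DEFAULT_METADATA_POLICY = True

    def __init__(self, name: str):
        self.name = name
        # Under the default policy, outputs inherit the input's ndim unchanged.
        self.infer_ndim: Optional[Callable[[Optional[int]], Optional[int]]] = (
            (lambda in_ndim: in_ndim) if Schema.DEFAULT_METADATA_POLICY else None
        )

    def no_metadata_inference(self):
        """Opt out: this operator does not conform to the default policy."""
        self.infer_ndim = None
        return self

    def output_ndim(self, fn):
        """Override the default with operator-specific inference."""
        self.infer_ndim = fn
        return self

# An expand_dims-like schema adds one dimension to the input:
expand = Schema("ExpandDims").output_ndim(
    lambda in_ndim: None if in_ndim is None else in_ndim + 1
)
print(expand.infer_ndim(3))  # 4
```

The key design point mirrored here is that unknown metadata stays `None` rather than raising, so inference can run at graph-build time even when some producers cannot provide metadata.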

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@mzient mzient force-pushed the opschema-metadata branch from b0330dc to 156fd04 Compare April 3, 2026 18:08
@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds static metadata inference (ndim, dtype, layout) to OpSchema and OpSpec, propagates it through the op graph during Pipeline::Build(), and automatically validates actual inputs/outputs against the inferred descriptors in OperatorBase::Setup and Run. The scope is large (91 files) but the architectural approach — lazy-cached per-output callbacks with schema inheritance, DFS propagation in node_meta.cc, and a __debug escape-hatch for eager operators — is sound.

  • P1 (expand_dims.cc): The new OutputLayout schema inference lambda does not guard against negative axis values. ComputeDataNodeMetadata runs before the operator constructor, so DALI_ENFORCE(0 <= axis) never fires first; a negative axis loops past all input dims and hits assert(src_axis < ndim) or UB. one_hot.cc in the same PR handles this correctly with a return nullopt guard.
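The guard pattern Greptile credits to one_hot.cc can be sketched as follows. This is an illustrative Python model, not the actual C++ lambda; the function name and the `"?"` placeholder dimension name are assumptions:

```python
def output_layout(input_layout, input_ndim, axes):
    """Sketch of the one_hot-style guard: normalize negative axes and
    return None ("unknown") instead of asserting on bad input."""
    if input_layout is None or input_ndim is None:
        return None  # metadata unknown at build time; defer to runtime checks
    out_ndim = input_ndim + len(axes)
    norm = []
    for a in axes:
        if a < 0:
            a += out_ndim  # normalize a negative axis against the output rank
        if not 0 <= a < out_ndim:
            return None    # out of range: let the operator's own DALI_ENFORCE fire later
        norm.append(a)
    layout = list(input_layout)
    for a in sorted(norm):
        layout.insert(a, "?")  # placeholder name for the newly inserted axis
    return "".join(layout)

print(output_layout("HWC", 3, [-1]))  # "HWC?"
```

Returning `None` on out-of-range axes is what keeps build-time inference from crashing before the operator constructor gets a chance to report a proper error.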

Confidence Score: 4/5

The PR is mostly safe but has one P1 defect in expand_dims schema inference that should be fixed before merging.

All prior review concerns (prev_c_idx, dead output_dtype_fn, label typos, join axis sign) are tracked in existing threads. One new P1 issue was found: negative axes in the expand_dims OutputLayout lambda trigger an assertion / undefined behaviour during Pipeline::Build() before the operator's own validation can fire. The remaining finding (dead variable input_idx) is P2. Score 4 reflects the single outstanding P1.

dali/operators/generic/expand_dims.cc (OutputLayout lambda, negative axis handling)

Important Files Changed

Filename Overview
dali/pipeline/operator/op_schema.cc Core metadata inference logic (CalculateOutputDType/NDim/Layout, GetCorrespondingExpandedOutputLayout); contains dead variable input_idx and the prev_c_idx-after-insert ordering bug noted in prior threads
dali/pipeline/operator/op_schema.h Introduces OutputDTypeFunc/OutputNDimFunc/OutputLayoutFunc aliases, lazy-cached flattened function vectors, and the UseDefaultMetadataPolicy / AutoExpandDims builder API
dali/pipeline/operator/op_spec.cc AddInput/AddArgumentInput extended with optional metadata; InferOutputMetadata() delegates to schema Calculate* methods
dali/pipeline/operator/op_spec.h InOutDesc gains ndim/dtype/layout fields; output_name_idx_ map key is InOutDesc but keys are always inserted with nullopt metadata so heterogeneous (name,device) lookup remains correct
dali/pipeline/operator/operator.cc ValidateInputMetadata/ValidateOutputMetadata added; skips empty batches, honours __debug flag for eager / debug-mode operators
dali/pipeline/operator/operator.h Setup and Run gain an optional validate_metadata parameter (default true); SequenceOperator overrides pass it down correctly, skipping validation for the inner expanded workspace
dali/pipeline/graph/node_meta.cc New file: DFS propagation of producer output metadata into consumer input descriptors, then InferOutputMetadata per node
dali/pipeline/pipeline.cc ComputeDataNodeMetadata inserted before executor build — correct placement in the build sequence
dali/pipeline/operator/eager_operator.h Adds __debug=true to bypass validation; intentional since eager ops skip the full graph-build / metadata-inference pass
dali/operators/generic/expand_dims.cc OutputLayout lambda does not normalise negative axes; negative values cause assert/UB in schema inference before the operator's own DALI_ENFORCE fires
dali/operators/generic/one_hot.cc OutputNDim/OutputLayout lambdas correctly normalise negative axis (axis += ndim+1) and guard unknown ndim/layout with nullopt
dali/python/nvidia/dali/data_node.py DataNode gains ndim/dtype/layout sourced from OpSpec.OutputDesc at construction; fields are None until InferOutputMetadata runs, which is fine for the graph-mode path
dali/python/nvidia/dali/experimental/dynamic/_invocation.py ndim/dtype/layout lazily inferred via _init_spec before falling back to full deferred evaluation

Sequence Diagram

sequenceDiagram
    participant Py as Python (pipeline build)
    participant PL as Pipeline::Build()
    participant NM as node_meta::ComputeDataNodeMetadata
    participant OS as OpSchema::Calculate*
    participant EX as Executor::Build (operator ctor)
    participant OP as OperatorBase::Setup/Run

    Py->>PL: Build(output_descs)
    PL->>NM: ComputeDataNodeMetadata(graph)
    loop DFS over OpNodes
        NM->>NM: propagate producer OutputDesc → consumer InputDesc
        NM->>OS: InferOutputMetadata() → CalculateOutputDType/NDim/Layout
        OS-->>NM: optional<dtype/ndim/layout> stored in OpSpec::outputs_
    end
    NM-->>PL: metadata populated
    PL->>EX: executor_->Build(graph)
    EX->>EX: instantiate operators (DALI_ENFORCE axis validations run here)
    EX-->>PL: built
    PL-->>Py: pipeline ready

    Py->>OP: Setup(output_desc, ws)
    OP->>OP: ValidateInputMetadata(ws, spec)
    OP->>OP: SetupImpl()
    Py->>OP: Run(ws)
    OP->>OP: RunImpl()
    OP->>OP: ValidateOutputMetadata(ws, spec)

Reviews (4): Last reviewed commit: "Make default metadata policy opt-in. Mak..."

Contributor

@jantonguirao jantonguirao left a comment


LGTM, except for the bugs raised by greptile and some minor comments

Comment on lines +35 to +36
if (window_size == 0)
return {};
Contributor Author


Without it, a debug build fired an assertion. This is unrelated to this change, but it prevented running all the tests in a debug build.

@dali-automaton
Collaborator

CI MESSAGE: [47951113]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [47951113]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [47993982]: BUILD STARTED


def __init__(self, name, device="cpu", source=None):
def __init__(
self, name, device="cpu", source=None, index=0, ndim=None, dtype=None, layout=None
Collaborator


Nitpick: missing type hints

Contributor Author


We pass some wild stuff there at times, so I wouldn't risk putting type hints here.

@dali-automaton
Collaborator

CI MESSAGE: [47993982]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [48010759]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [48023970]: BUILD STARTED

@mzient mzient force-pushed the opschema-metadata branch 2 times, most recently from 5f38835 to 20b9194 Compare April 8, 2026 14:33
@dali-automaton
Collaborator

CI MESSAGE: [48024121]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [48010759]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [48024121]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [48125071]: BUILD STARTED

@mzient mzient force-pushed the opschema-metadata branch from b4b3b73 to 63b8346 Compare April 9, 2026 17:24
@dali-automaton
Collaborator

CI MESSAGE: [48133713]: BUILD STARTED

@mzient
Contributor Author

mzient commented Apr 9, 2026

@greptileai Please look at expand_dims.cc:71 - out-of-range axes are handled here.

@greptile-apps
Contributor

greptile-apps bot commented Apr 9, 2026

You're right - I see the check at line 85 in GenerateSrcDims():

DALI_ENFORCE(d < ndim,
  make_string("Data has not enough dimensions to add new axes at specified indices."));

This correctly validates that axes aren't out of range. When an axis index is too large, d will exceed ndim, triggering this enforcement. Thanks for the clarification!


Contributor

@jantonguirao jantonguirao left a comment


Only minor comments left

dtype = self._operator._op_spec.OutputDesc(result_index)[3]
if dtype is not None:
return dtype
self.run(self._eval_context)
Contributor


[Minor] DRY — the ndim, dtype, and layout methods now share the same pattern:

if self._results is None:
    if init_spec := getattr(self._operator, "_init_spec", None):
        init_spec(self._inputs, self._args)
        val = self._operator._op_spec.OutputDesc(result_index)[N]
        if val is not None:
            return val
    self.run(self._eval_context)

Consider extracting a helper like _get_static_meta(self, result_index, field_index) to avoid repeating the init_spec / OutputDesc / fallback logic three times.

@dali-automaton
Collaborator

CI MESSAGE: [48411019]: BUILD FAILED

rostan-t and others added 25 commits April 14, 2026 11:06
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
…ALI.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
…y input layout.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@mzient mzient force-pushed the opschema-metadata branch from 3a571e7 to d876708 Compare April 14, 2026 09:06
@dali-automaton
Collaborator

CI MESSAGE: [48484956]: BUILD STARTED

layout = self._operator._op_spec.OutputDesc(result_index)[4]
if layout is not None:
layout = str(layout)
return None if layout == "" else layout
Contributor


[Minor] The layout method's fast path returns None for empty layout (return None if layout == "" else layout), but the fallback path returns self._results[result_index].layout() which returns an empty string for no-layout tensors. This means the two code paths return different values for the same semantic state.

Also, the type hint says -> str but the fast path can return None.
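One way to make the two paths agree is to normalize both through a single helper. This is a sketch of the reviewer's point, not code from the PR; treating `None` as the canonical "no layout" value is an assumption here (the fix could equally canonicalize on `""`):

```python
from typing import Optional

def normalize_layout(layout) -> Optional[str]:
    """Map both 'no layout' spellings (None and the empty string returned by
    TensorList.layout()) to a single sentinel, so the static fast path and
    the evaluated fallback path return the same value."""
    if layout is None:
        return None
    layout = str(layout)
    return layout if layout else None

print(normalize_layout(""))     # None
print(normalize_layout("HWC"))  # "HWC"
```

With this applied to both return points, the `-> Optional[str]` type hint also becomes accurate for every path.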

auto input_layout = input_desc.layout.value_or("");

if (input_layout.empty()) {
// If the layout was empty, we need the number of dimesnions, as "" is legal for any ndim.
Copy link
Copy Markdown
Contributor


[Nit] Typo: dimesnionsdimensions

