Add Quality Control and Validation Constraints #8

@dragon-ai-agent

Problem

BioStride could benefit from additional validation rules and constraints to ensure data correctness, beyond basic types and required fields.

Current State

Some good constraints already exist:

  • pH: 0-14 range ✓
  • humidity: 0-100% ✓
  • Required fields properly specified ✓

Missing Constraints

Numeric Range Validation

Fields that should have logical bounds:

  • completeness: Should be 0-100% (currently unconstrained)
  • resolution: Should be positive (Angstroms cannot be negative)
  • temperature: Could have reasonable bounds (absolute zero to practical max)
  • molecular_weight: Should be positive
  • concentration: Should be non-negative
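As a rough illustration of the bounds listed above, the following sketch checks a plain dictionary record against a small bounds table. The table layout, the function name check_ranges, and the record shape are all assumptions made for this example and are not part of BioStride or LinkML. Temperature is left out because the issue does not state its unit.

```python
# Hypothetical bounds table: field name -> (minimum, maximum, minimum_is_exclusive).
# None means "no bound on that side". Values mirror the list in this issue.
BOUNDS = {
    "completeness": (0.0, 100.0, False),    # percent
    "resolution": (0.0, None, True),        # must be strictly positive
    "molecular_weight": (0.0, None, True),  # must be strictly positive
    "concentration": (0.0, None, False),    # zero is allowed
}


def check_ranges(record: dict) -> list[str]:
    """Return one message per out-of-range numeric field in `record`."""
    errors = []
    for field, (lo, hi, lo_exclusive) in BOUNDS.items():
        value = record.get(field)
        if value is None:
            continue  # absent fields are a required-field concern, not a range concern
        if lo is not None and (value < lo or (lo_exclusive and value == lo)):
            op = ">" if lo_exclusive else ">="
            errors.append(f"{field}={value} must be {op} {lo}")
        if hi is not None and value > hi:
            errors.append(f"{field}={value} must be <= {hi}")
    return errors


if __name__ == "__main__":
    print(check_ranges({"completeness": 120, "resolution": 0}))
```

The exclusive flag is there because "positive" and "non-negative" differ only at zero, and an inclusive minimum of 0 cannot express the former.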

Cross-Entity Validation

Logical relationships that could be validated:

  • ExperimentRun.sample_id should match an actual Sample.id in the same Study
  • WorkflowRun.input_files should reference existing DataFile objects
  • Instrument technique should match ExperimentRun technique type
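The first relationship above (ExperimentRun.sample_id pointing at a Sample.id in the same Study) could be checked after loading with something like the sketch below. The dataclasses are simplified stand-ins, and the attribute names samples and experiment_runs on Study are assumptions for illustration; the real BioStride classes may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    id: str


@dataclass
class ExperimentRun:
    id: str
    sample_id: str


@dataclass
class Study:
    # Hypothetical container layout, assumed for this sketch only.
    id: str
    samples: list[Sample] = field(default_factory=list)
    experiment_runs: list[ExperimentRun] = field(default_factory=list)


def find_dangling_sample_refs(study: Study) -> list[str]:
    """List every ExperimentRun whose sample_id has no matching Sample in the Study."""
    known_ids = {sample.id for sample in study.samples}
    return [
        f"ExperimentRun {run.id}: sample_id {run.sample_id!r} not found in Study {study.id}"
        for run in study.experiment_runs
        if run.sample_id not in known_ids
    ]


if __name__ == "__main__":
    study = Study(
        id="study-1",
        samples=[Sample(id="s1")],
        experiment_runs=[ExperimentRun("r1", "s1"), ExperimentRun("r2", "s2")],
    )
    for problem in find_dangling_sample_refs(study):
        print(problem)
```

The same pattern (collect known IDs into a set, then scan the referencing objects) would apply to WorkflowRun.input_files and DataFile.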

Format Validation

Fields that could have pattern constraints:

  • Email addresses in operator fields
  • Date formats (if standardizing on ISO 8601)
  • ID patterns (if adopting consistent CURIE schemes)
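To make the three format checks concrete, here is a minimal sketch using Python regular expressions. The patterns are deliberately simplified assumptions: the email pattern only checks for a local part, an "@", and a dotted domain; the date check accepts only the YYYY-MM-DD calendar form of ISO 8601; and the CURIE pattern only checks for a prefix, a colon, and a non-empty local part. The function names are hypothetical.

```python
import re
from datetime import date

# Simplified, illustrative patterns; real-world rules are usually stricter.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
CURIE_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*:\S+")


def is_email(value: str) -> bool:
    return EMAIL_RE.fullmatch(value) is not None


def is_iso_date(value: str) -> bool:
    """Accept only YYYY-MM-DD strings that name a real calendar date."""
    if ISO_DATE_RE.fullmatch(value) is None:
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


def is_curie(value: str) -> bool:
    return CURIE_RE.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_email("operator@example.org"))
    print(is_iso_date("2024-02-30"))
    print(is_curie("biostride:sample-001"))
```

A regex alone cannot reject an impossible date, which is why the date check combines a shape test with an actual parse. In a schema, only the regex half would map onto a pattern constraint.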

Benefits

  1. Automatic validation: Catch errors during data curation
  2. Data quality: Ensure logical consistency
  3. User feedback: Clear error messages for invalid data
  4. Integration support: Better LinkML validation test coverage

Implementation Approach

Phase 1: Simple numeric constraints

Add minimum/maximum values to appropriate numeric fields:

  • completeness: 0-100 range
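  • resolution: minimum_value 0 (the bound is inclusive, so 0 itself would still pass; a strictly positive check needs an additional rule)
  • molecular_weight: minimum_value 0 (same caveat as resolution)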

Phase 2: Pattern validation

Add regex patterns for structured fields such as IDs, email addresses, and dates.

Phase 3: Cross-reference validation

Explore LinkML capabilities for validating references between objects (support may be limited in the current LinkML version).

Alignment with Standards

  • LinkML validation: Leverages built-in constraint mechanisms
  • JSON Schema: Could export enhanced constraints
  • Data quality standards: Follows FAIR principles for validated data

Implementation Notes

  • Add constraints incrementally to avoid breaking existing examples
  • Test against current example files to ensure compatibility
  • Document validation rules for users
  • Decide whether a constraint violation should fail validation (strict) or only emit a warning
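One way to leave the strict-versus-warning question open during an incremental rollout is a single reporting function with a switch. This is a sketch under assumptions: the ValidationError class, the report function, and its strict flag are hypothetical names, and the list of problem strings is assumed to come from whatever checks are in place.

```python
import warnings


class ValidationError(ValueError):
    """Raised in strict mode when any constraint is violated."""


def report(problems: list[str], strict: bool = True) -> None:
    """Fail on violations in strict mode; otherwise emit one warning per problem."""
    if not problems:
        return
    if strict:
        raise ValidationError("; ".join(problems))
    for problem in problems:
        warnings.warn(problem, UserWarning, stacklevel=2)


if __name__ == "__main__":
    # Lenient mode: existing example files keep loading while issues are surfaced.
    report(["completeness=120 must be <= 100"], strict=False)
```

Starting in lenient mode and switching to strict once the current example files pass cleanly would fit the incremental approach described in these notes.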

Priority

Low-Medium - Quality improvement that doesn't break functionality but enhances data reliability.

Future Considerations

As LinkML evolves, explore more sophisticated validation:

  • Cross-object integrity rules
  • Conditional constraints (if technique=X, then field Y is required)
  • Integration with external validation services
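The conditional constraint mentioned above (if technique=X, then field Y is required) can be sketched as a lookup table applied after loading. The table, the function name check_conditional, and the dictionary record shape are assumptions for illustration; "X" and "Y" are kept as the placeholders used in this issue rather than replaced with real technique or field names.

```python
# Hypothetical rule table: technique value -> fields that become required.
# "X" and "Y" are the placeholders from the issue text, not real names.
CONDITIONAL_REQUIRED = {
    "X": ["Y"],
}


def check_conditional(record: dict) -> list[str]:
    """Return a message for each field that the record's technique makes mandatory but is missing."""
    required = CONDITIONAL_REQUIRED.get(record.get("technique"), [])
    return [
        f"technique={record['technique']} requires field {name!r}"
        for name in required
        if record.get(name) is None
    ]


if __name__ == "__main__":
    print(check_conditional({"technique": "X"}))
    print(check_conditional({"technique": "X", "Y": 1.5}))
```

Records with an unlisted or missing technique pass through untouched, so rules of this kind can be added one at a time.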
