ADR: Pipeline spec #6900
# Pipeline spec

- Authors: Ben Sherman
- Status: accepted
- Deciders: Ben Sherman, Paolo Di Tommaso, Phil Ewels
- Date: 2025-12-12
- Tags: pipelines
## Summary

Provide a way for Nextflow to describe inherent properties of a pipeline that can be easily consumed by external systems.
## Problem Statement

A Nextflow pipeline is defined by Nextflow scripts (`.nf` files) and configuration (`.config` files). However, many aspects of a pipeline are of interest to external systems, such as:

- Metadata (e.g. name, authors, license)
- Pipeline parameters and outputs
- Software dependency versions (e.g. modules)

Acquiring this information directly from the source code requires parsing (or even executing) Nextflow code, which is generally not feasible for external systems. Additionally, it may be desirable to provide information that is not practical to express, or otherwise does not belong, in Nextflow code (e.g. display icons for pipeline parameters).
Primary use cases:

- **Viewing pipelines:** Display pipeline information (name, author, parameters, outputs) in an external user interface.

- **Form validation:** Validate pipeline parameters at launch time, prior to running the pipeline.

- **Pipeline chaining:** Validate a pipeline chain at launch time, allowing downstream pipeline inputs to reference upstream pipeline outputs that are compatible based on their respective pipeline specs.

- **Pipeline registry:** Enable pipelines to be published and executed as immutable software artifacts via the Nextflow registry, instead of cloning the source code repository.
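To make the pipeline-chaining use case concrete, here is a deliberately crude Python sketch: it treats a downstream input as compatible with an upstream output when both reference the same record schema. Everything in it (the `can_chain` helper, the spec fragments, path equality as "compatibility") is an assumption for illustration, not part of the proposed design; a real implementation would compare the resolved schemas themselves.

```python
# Deliberately crude sketch of launch-time pipeline chaining: an upstream
# output can feed a downstream input when both reference the same record
# schema. The layout (top-level "output", per-parameter "schema") follows
# the example spec later in this ADR; can_chain is a hypothetical helper.

def can_chain(upstream: dict, output_name: str,
              downstream: dict, param_name: str) -> bool:
    out = upstream.get("output", {}).get(output_name, {})
    inp = (downstream.get("$defs", {})
                     .get("all_options", {})
                     .get("properties", {})
                     .get(param_name, {}))
    return "schema" in out and out["schema"] == inp.get("schema")

# Minimal spec fragments for two hypothetical pipelines
upstream = {"output": {"samples": {"schema": "assets/schema_samples.json"}}}
downstream = {"$defs": {"all_options": {"properties": {
    "input": {"type": "string", "schema": "assets/schema_samples.json"},
    "transcriptome": {"type": "string", "format": "file-path"},
}}}}
```

Under this (over-simplified) rule, `upstream`'s `samples` output could be wired to `downstream`'s `input` parameter, but not to `transcriptome`.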
## Solution

### Pipeline spec definition

The schema for pipeline specs is defined in [nextflow-io/schemas](https://github.com/nextflow-io/schemas/blob/main/pipeline/v1/schema.json). It was originally defined as the *meta-schema* for the [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema), a standard developed by the nf-core community to model pipeline parameters using JSON schema. The nf-core schema for a pipeline is typically defined as `nextflow_schema.json` in the project root.

Since the meta-schema was transferred to the `nextflow-io` GitHub organization, it is now considered an official Nextflow standard:
- The Nextflow language server uses the schema to provide code intelligence for pipeline parameters in Nextflow scripts.

- The Seqera Platform uses the schema to validate pipeline parameters at launch time.

- The `nf-schema` plugin, also under `nextflow-io`, uses the schema to validate pipeline parameters at runtime.
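As a hypothetical illustration of the launch-time validation mentioned above, the sketch below checks a params map against a simplified nf-core-style parameter group. It only handles `type` and `required`; real consumers use a full JSON Schema validator, and the `TYPE_MAP` and `validate_params` names are inventions for this sketch.

```python
# Sketch only: launch-time validation of params against a simplified
# nf-core-style parameter group. Real systems use a full JSON Schema
# validator; TYPE_MAP and validate_params are hypothetical names.

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_params(schema: dict, params: dict) -> list[str]:
    errors = []
    # every required parameter must be present
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"missing required parameter: {name}")
    # every recognised parameter must have the declared JSON type
    for name, value in params.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            continue  # unrecognised keys are typically ignored
        expected = TYPE_MAP.get(spec.get("type", "string"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"parameter {name}: expected {spec['type']}")
    return errors

# Parameter group shaped like the example spec later in this ADR
schema = {
    "properties": {
        "input": {"type": "string", "format": "file-path"},
        "transcriptome": {"type": "string", "format": "file-path"},
    },
    "required": ["input"],
}
```

With this schema, `{"input": "samples.csv"}` passes, while `{"transcriptome": 42}` fails twice (missing `input`, wrong type for `transcriptome`).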
The pipeline spec adopts the structure of the nf-core schema, with only the following nominal changes:

- *nf-core schema* becomes *pipeline spec*
- *nf-core meta-schema* becomes *schema for pipeline specs*
- `nextflow_schema.json` becomes `nextflow_spec.json`

Preserving the structure of the original nf-core schema makes the migration process as easy as possible for users. At the same time, the nomenclature changes are needed to reduce confusion over the different kinds of schemas and to align with existing Nextflow standards (i.e. plugin specs, module specs).
The nf-core schema already defines the title, description, and parameters of a pipeline. The pipeline spec adds the following new properties:

- `version`: pipeline release version
- `contributors`: list of pipeline contributors (name, email, affiliation, etc.)
- `documentation`: project documentation URL
- `homePage`: project home page
- `keywords`: relevant keywords
- `license`: project license
- `requires`: runtime requirements
  - `nextflow`: Nextflow version constraint
  - `modules`: list of module versions used by the pipeline
- `output`: list of pipeline outputs (name, type, description, etc.)

Examples of these are shown in the following section on pipeline spec generation.
### Pipeline spec generation

Nextflow should be able to generate a pipeline spec from the pipeline source code:
- The parameter schema can be generated from the `params` block and associated record types.

- Samplesheet schemas (e.g. `schema_input.json`) can be generated from the record types used by the corresponding parameters.

- The `output` section can be generated from the `output` block.

- Most of the other fields can be inferred from the `manifest` config scope in the main config file.
For example, given the following pipeline script and config:

**`main.nf`**
```groovy
params {
    // Samplesheet containing the input paired-end reads
    input: List<FastqPair>

    // The input transcriptome file
    transcriptome: Path

    // Directory containing multiqc configuration
    multiqc: Path = "${projectDir}/multiqc"
}

record FastqPair {
    id           : String
    fastq_1      : Path
    fastq_2      : Path?
    strandedness : Strandedness
}

enum Strandedness {
    FORWARD,
    REVERSE,
    UNSTRANDED,
    AUTO
}

workflow {
    // ...
}

output {
    // List of aligned samples
    samples: Channel<AlignedSample> {
        path { sample ->
            sample.fastqc >> 'fastqc/'
            sample.bam >> 'align/'
            sample.bai >> 'align/'
        }
        index {
            path 'samples.json'
        }
    }

    // MultiQC summary report
    multiqc_report: Path {
        path '.'
    }
}

record AlignedSample {
    id: String
    fastqc: Path
    bam: Path?
    bai: Path?
}
```
**`nextflow.config`**

```groovy
manifest {
    name = 'nf-core/rnaseq'
    contributors = [
        [
            name: 'Harshil Patel',
            affiliation: 'Seqera',
            github: '@drpatelh',
            contribution: ['author'],
            orcid: '0000-0003-2707-7940'
        ],
        [
            name: 'Phil Ewels',
            affiliation: 'Seqera',
            github: '@ewels',
            contribution: ['author'],
            orcid: '0000-0003-4101-2502'
        ],
    ]
    description = 'RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.'
    nextflowVersion = '!>=25.04.3'
    version = '3.23.0'
}
```
The following pipeline spec should be produced:

**`nextflow_spec.json`**

```json
{
    // metadata
    "$schema": "https://raw.githubusercontent.com/nextflow-io/schemas/main/pipeline/v1/schema.json",
    "$id": "https://raw.githubusercontent.com/nf-core/rnaseq/refs/tags/3.23.0/nextflow_spec.json",
    "title": "nf-core/rnaseq",
    "description": "RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.",
    "version": "3.23.0",
    "contributors": [
        {
            "name": "Harshil Patel",
            "affiliation": "Seqera",
            "github": "@drpatelh",
            "contribution": ["author"],
            "orcid": "0000-0003-2707-7940"
        },
        {
            "name": "Phil Ewels",
            "affiliation": "Seqera",
            "github": "@ewels",
            "contribution": ["author"],
            "orcid": "0000-0003-4101-2502"
        }
    ],

    // inputs
    "type": "object",
    "$defs": {
        "all_options": {
            "title": "Parameters",
            "type": "object",
            "properties": {
                "input": {
                    "type": "string",
                    "format": "file-path",
                    "description": "Samplesheet containing the input paired-end reads",
                    "schema": "assets/schema_input.json"
                },
                "transcriptome": {
                    "type": "string",
                    "format": "file-path",
                    "description": "The input transcriptome file"
                },
                "multiqc": {
                    "type": "string",
                    "format": "directory-path",
                    "description": "Directory containing multiqc configuration",
                    "default": "${projectDir}/multiqc"
                }
            }
        }
    },
    "allOf": [
        { "$ref": "#/$defs/all_options" }
    ],

    // outputs
    "output": {
        "samples": {
            "description": "List of aligned samples",
            "schema": "assets/schema_samples.json",
            "path": "samples.json"
        },
        "multiqc_report": {
            "description": "MultiQC summary report",
            "type": "file"
            // (path)
        }
    },

    // software dependencies
    "requires": {
        "nextflow": "!>=25.04.3"
    }
}
```

*Review discussion (on the parameters section):*

**Reviewer:** Maybe it will be more clear if the input parameters are also in an […]

**Author:** That was the original plan, but we wanted to avoid breaking the JSON-schema validity of the file. Basically because it's useful to be able to load it and throw it into any JSON-schema validation library (unrecognised keys are typically ignored).

**Reviewer:** Maybe a higher level use of definitions could help here? Like having one definitions for […]

**Author:** Depending on how the auto-generation works out, it might be feasible to maintain an actual […]. In theory, the user wouldn't need to worry about the resulting duplication because they shouldn't modify those bits anyway. They would only be adding things to the JSON-schema part.

**Reviewer:** Sounds good, let me know if you have something for me to test. I'd love to give it a go.

*Review discussion (on the `output` section):*

**Reviewer:** I might be a bit fuzzy here, but this looks like a sub-schema […]. What is the purpose of this section beyond documentation? If it's intended for e.g. validating outputs, then it would be better if it was a properly defined sub-schema I think?

**Author:** Right now, each output can declare either a […]. In theory, we could embed the schema (e.g. for […]).
**`assets/schema_input.json`**

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "fastq_1": {
                "type": "string",
                "format": "file-path",
                "exists": true
            },
            "fastq_2": {
                "type": "string",
                "format": "file-path",
                "exists": true
            },
            "strandedness": {
                "type": "string",
                "enum": ["forward", "reverse", "unstranded", "auto"]
            }
        },
        "required": ["id", "fastq_1", "strandedness"]
    }
}
```
**`assets/schema_samples.json`**

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "fastqc": {
                "type": "string",
                "format": "directory-path"
            },
            "bam": {
                "type": "string",
                "format": "file-path"
            },
            "bai": {
                "type": "string",
                "format": "file-path"
            }
        },
        "required": ["id", "fastqc"]
    }
}
```
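A samplesheet schema like the ones above could plausibly be derived from a record type by mapping each field to a JSON-schema property and treating non-optional fields as required. The sketch below is hypothetical (`record_to_schema` and the field table are inventions mirroring the `FastqPair` record; the `Strandedness` enum would map to an `"enum"` keyword, omitted here):

```python
# Hypothetical sketch: derive a samplesheet JSON schema from a record
# type. Fields mirror the FastqPair record above; optional fields
# (trailing `?` in Nextflow) are simply left out of "required".

FIELDS = [  # (name, json_type, format, optional)
    ("id", "string", None, False),
    ("fastq_1", "string", "file-path", False),
    ("fastq_2", "string", "file-path", True),
    ("strandedness", "string", None, False),
]

def record_to_schema(fields) -> dict:
    props, required = {}, []
    for name, jtype, fmt, optional in fields:
        props[name] = {"type": jtype}
        if fmt:
            props[name]["format"] = fmt
        if not optional:
            required.append(name)
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "array",
        "items": {"type": "object", "properties": props, "required": required},
    }

schema = record_to_schema(FIELDS)
```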
Notes:

- The `manifest` config options are effectively converted directly to JSON with only nominal changes, such as `manifest.name` -> `title` (preserving the structure of the original nf-core schema) and `nextflowVersion` -> `requires.nextflow` (leaving space for module versions in the future).

- The parameter schema follows the structure of the nf-core schema, which defines *parameter groups* under `$defs` and combines them using JSON schema keywords such as `allOf`. This section should be generated with sensible defaults, since some properties (e.g. the group name) cannot be specified in pipeline code.

- Each output in the `output` section should specify either a type (e.g. `file`, `directory`) or a schema (e.g. if the output is a collection of records). Like parameters, the schema for an individual output should reference an external JSON schema file.

*Review discussion (on parameter groups):*

**Reviewer:** Ungrouped parameters are also allowed in the schema. Do we still want to support these, or would you prefer to have everything in definitions?

**Member:** Good catch, I was thinking this yesterday and meant to leave a comment but forgot. Yes, we need to accept top-level ungrouped properties (ungrouped params):

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
    "title": "Nextflow pipeline parameters",
    "description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
    "type": "object",
    "properties": {
        "some_parameter": {
            "type": "string"
        }
    }
}
```

**Author:** Good to know. In that case I think Nextflow will generate ungrouped parameters by default and then preserve whatever groups the user adds.

**Member:** Yes, that's how we do it in nf-core too.
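The manifest-to-spec field mapping described in the first note above can be sketched as a simple dictionary transformation (the `manifest_to_spec` helper is hypothetical; the renames follow this ADR):

```python
# Hypothetical sketch of mapping the `manifest` config scope onto
# pipeline-spec fields, with the renames described in this ADR:
# manifest.name -> title, nextflowVersion -> requires.nextflow.

def manifest_to_spec(manifest: dict) -> dict:
    spec = {
        "title": manifest.get("name"),
        "description": manifest.get("description"),
        "version": manifest.get("version"),
        "contributors": manifest.get("contributors", []),
    }
    if "nextflowVersion" in manifest:
        spec["requires"] = {"nextflow": manifest["nextflowVersion"]}
    # drop fields that were absent from the manifest
    return {k: v for k, v in spec.items() if v not in (None, [])}

spec = manifest_to_spec({
    "name": "nf-core/rnaseq",
    "nextflowVersion": "!>=25.04.3",
    "version": "3.23.0",
})
```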
### Pipeline spec synchronization

The pipeline spec may contain additional fields that cannot be sourced from the pipeline code (e.g. the `fa_icon` property in the parameter schema). Such fields can be useful for external systems even if they aren't relevant to the pipeline execution.

As a result, the pipeline spec cannot be completely inferred from pipeline code. Instead, the generated pipeline spec should be treated as a "skeleton" that can be extended by the user with additional fields.

- When generating the pipeline spec, Nextflow should use any existing spec and preserve information that isn't inferred from pipeline code.

- Any inconsistencies between the existing spec and the pipeline code (e.g. missing or extra parameters) should be reported as errors.
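The skeleton-plus-extensions behavior could look roughly like the following merge (the `merge_params` helper is hypothetical; Nextflow's actual implementation may differ):

```python
# Hedged sketch of the "skeleton + user extensions" merge: regeneration
# keeps user-added keys (such as fa_icon) from the existing spec and
# reports parameters that exist in the spec but not in the code. The
# nested layout ($defs/all_options/properties) mirrors the example spec
# earlier in this ADR.

def merge_params(generated: dict, existing: dict) -> list[str]:
    """Merge user-added keys into `generated` in place; return errors."""
    errors = []
    gen = generated["$defs"]["all_options"]["properties"]
    old = existing.get("$defs", {}).get("all_options", {}).get("properties", {})
    for name, props in old.items():
        if name not in gen:
            errors.append(f"parameter `{name}` is not defined in the pipeline code")
            continue
        for key, value in props.items():
            gen[name].setdefault(key, value)  # preserve e.g. fa_icon
    return errors

generated = {"$defs": {"all_options": {"properties": {
    "input": {"type": "string", "format": "file-path"},
}}}}
existing = {"$defs": {"all_options": {"properties": {
    "input": {"type": "string", "fa_icon": "fas fa-file-csv"},
    "legacy_param": {"type": "string"},
}}}}
errors = merge_params(generated, existing)
```

Here the regenerated `input` entry keeps the user-added `fa_icon`, while the stale `legacy_param` is surfaced as an error rather than silently carried forward.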
## Links

- [nextflow-io/schemas](https://github.com/nextflow-io/schemas)
- [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema)
- Examples: [nextflow_schema.json](https://github.com/nf-core/rnaseq/blob/3.23.0/nextflow_schema.json) and [schema_input.json](https://github.com/nf-core/rnaseq/blob/3.23.0/assets/schema_input.json)
- [JSON schema](https://json-schema.org/)
---

*Review discussion (on the overall approach):*

**Reviewer:** I'm personally not a fan of this kind of information duplication in what are essentially all source files (assuming the JSON schema is still providing some of the validation logic and is not just a representation of it). If this is the route we are headed down, I think it would be much preferable for `nextflow_spec.json` to become the source of truth, with this information pulled over by Nextflow when it's needed in the source code itself (consider how e.g. Python's `pyproject.toml` or Rust's `Cargo.toml` work). I'd almost go further and question a bit the logic of combining metadata and source-code validation logic (the params schema) into a single document? I guess it's mostly about a convenient single source for e.g. Seqera Platform launch forms?

**Author:** The pipeline spec would be a source of truth for external systems, but not for Nextflow itself. The source of truth for Nextflow is the pipeline code. I suppose Nextflow could use the pipeline spec to perform additional validation that goes beyond what can be defined in the `params` block, but I would rather avoid that if possible. E.g. instead of using the pipeline spec to validate file extensions, we could add something like "blob types" to the Nextflow language that allow you to define specializations of the `Path` type such as `Fastq`, `Bam`, etc.

**Reviewer:** The problem as I see it right now is that using JSON schema via `nf-schema` allows us to cleanly validate things beyond the simple type of the param. We can for example use `min` and `max` to bound an integer param. We can check string lengths. We can check things like the length of an array. Unless the intention is to add ways to constrain the param types in Nextflow in such a way, this validation will need to continue, and to do so using `nf-schema` it will require the use of the `params` part of this new spec document?

**Author:** We have considered adding declarative ways to validate those kinds of things, for example:

```groovy
params {
    num_iterations: Integer = 100 {
        min 1
        max 1000
    }
}
```

But Paolo and I weren't too keen on doing this, at least not yet, since it would add a lot of noise to the `params` block. Meanwhile, you can always do this kind of validation with regular code:

```groovy
workflow {
    if( params.num_iterations < 1 || params.num_iterations > 1000 ) {
        error "Parameter `num_iterations` should be between 1 and 1000"
    }
    // ...
}
```

So it seems like we have the tools to validate everything that nf-schema validates through Nextflow code. But I'd like to wait and see how people use these features before we make any more drastic changes.

**Reviewer:** The declarative method would be much better in my personal opinion. I don't know how others feel, but I was under the impression that part of the benefit of `nf-schema` was not having to use these kinds of conditional checks in code; prior to the original `nf-validation` you would see huge chunks of if-else blocks at the start of pipelines. I think this is a bigger consideration too if there is any intention for Nextflow to use record types for native samplesheet validation and conversion to lists of records (which I got the sense from some teasers from Phil is something that could be coming), because there is often more of this sort of validation needed there, and do we really want to be doing something that potentially looks like […]? This also has the effect of de-coupling the type-level checking from the value-level checking, meaning if you ever modified the type you'd need to then locate the value check and ensure it was still compatible. Note also how in the last example, where a value can take multiple types (not sure if type unions are actually supported in native Nextflow type-casting yet), you would need to check the type before checking the value. The above issues will also apply to `params`, just without needing to be in the `map` call.

**Author:** All fair points. We're just trying to take it one step at a time. The `params` block will provide much of the type-level validation, including samplesheets. But we have to find the appropriate line between declarative vs imperative validation in Nextflow, and that will take time, so I don't want to over-commit on anything yet.