# Pipeline spec

- Authors: Ben Sherman
- Status: accepted
- Deciders: Ben Sherman, Paolo Di Tommaso, Phil Ewels
- Date: 2025-12-12
- Tags: pipelines

## Summary

Provide a way for Nextflow to describe inherent properties of a pipeline that can be easily consumed by external systems.

## Problem Statement

A Nextflow pipeline is defined by Nextflow scripts (`.nf` files) and configuration (`.config` files). However, there are many aspects of a pipeline which are of interest to external systems, such as:

- Metadata (e.g. name, authors, license)
- Pipeline parameters and outputs
- Software dependency versions (e.g. modules)

Acquiring this information directly from the source code requires parsing (or even executing) Nextflow code, which is generally not feasible for external systems. Additionally, it may be desirable to record information that is not practical to express in Nextflow code, or that otherwise does not belong there (e.g. display icons for pipeline parameters).

Primary use cases:

* **Viewing pipelines:** Display pipeline information (name, author, parameters, outputs) in an external user interface.

* **Form validation:** Validate pipeline parameters at launch time, prior to running the pipeline.

* **Pipeline chaining:** Validate a pipeline chain at launch time, allowing downstream pipeline inputs to reference upstream pipeline outputs that are compatible based on their respective pipeline specs.

* **Pipeline registry:** Enable pipelines to be published and executed as immutable software artifacts via the Nextflow registry, instead of cloning the source code repository.
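To make the form-validation use case concrete, an external launch form could check a user-supplied parameter set against the spec's parameter schema before starting a run. The following is a minimal, dependency-free Python sketch; the schema fragment and function are illustrative, not part of any real tool:

```python
import json

# Illustrative fragment of a pipeline spec's parameter schema
schema = json.loads("""
{
    "properties": {
        "input": {"type": "string", "format": "file-path"},
        "multiqc": {"type": "string"}
    },
    "required": ["input"]
}
""")

# Map JSON-schema type names to Python types
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_params(params: dict, schema: dict) -> list[str]:
    """Return validation errors for a parameter set (empty list = valid)."""
    errors = []
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"missing required parameter: {name}")
    for name, value in params.items():
        prop = schema["properties"].get(name)
        if prop is None:
            errors.append(f"unknown parameter: {name}")
        elif not isinstance(value, JSON_TYPES[prop["type"]]):
            errors.append(f"parameter {name}: expected {prop['type']}")
    return errors
```

In practice an off-the-shelf validator such as Python's `jsonschema` package would be used instead, since it also handles formats, enums, and nested schemas.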

## Solution

### Pipeline spec definition

The schema for pipeline specs is defined in [nextflow-io/schemas](https://github.com/nextflow-io/schemas/blob/main/pipeline/v1/schema.json). It was originally defined as the *meta-schema* for the [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema), a standard developed by the nf-core community to model pipeline parameters using JSON schema. The nf-core schema for a pipeline is typically defined as `nextflow_schema.json` in the project root.

Since the meta-schema was transferred to the `nextflow-io` GitHub organization, it is now considered an official Nextflow standard:

- The Nextflow language server uses the schema to provide code intelligence for pipeline parameters in Nextflow scripts.

- The Seqera Platform uses the schema to validate pipeline parameters at launch time.

- The `nf-schema` plugin, also under `nextflow-io`, uses the schema to validate pipeline parameters at runtime.

The pipeline spec adopts the structure of the nf-core schema, with only the following nominal changes:

- *nf-core schema* becomes *pipeline spec*
- *nf-core meta-schema* becomes *schema for pipeline specs*
- `nextflow_schema.json` becomes `nextflow_spec.json`

Preserving the structure of the original nf-core schema makes the migration process as easy as possible for users. At the same time, the nomenclature changes are needed to reduce confusion over different kinds of schemas and align with existing Nextflow standards (i.e. plugin specs, module specs).

The nf-core schema already defines the title, description, and parameters of a pipeline. The pipeline spec adds the following new properties:

- `version`: pipeline release version
- `contributors`: list of pipeline contributors (name, email, affiliation, etc.)
- `documentation`: project documentation URL
- `homePage`: project home page
- `keywords`: relevant keywords
- `license`: project license
- `requires`: runtime requirements
  - `nextflow`: Nextflow version constraint
  - `modules`: list of module versions used by the pipeline
- `output`: list of pipeline outputs (name, type, description, etc.)

Examples of these are shown in the following section on pipeline spec generation.

### Pipeline spec generation

Nextflow should be able to generate a pipeline spec from the pipeline source code:

- The parameter schema can be generated from the `params` block and associated record types.

- Samplesheet schemas (e.g. `schema_input.json`) can be generated from the record types used by corresponding parameters.

- The `output` section can be generated from the `output` block.

- Most of the other fields can be inferred from the `manifest` config scope in the main config file.
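The record-to-schema step can be illustrated with a small Python sketch. The function name, the type table, and the `"Path?"` optional-marker convention are hypothetical stand-ins for whatever Nextflow's generator would actually do, and enum types are simplified to plain strings here:

```python
# Hypothetical mapping from Nextflow record field types to JSON-schema fragments
TYPE_MAP = {
    "String": {"type": "string"},
    "Path": {"type": "string", "format": "file-path"},
    "Integer": {"type": "integer"},
}

def record_to_schema(fields: dict[str, str]) -> dict:
    """Derive a samplesheet JSON schema from record fields,
    where a trailing '?' marks an optional field."""
    props, required = {}, []
    for name, type_name in fields.items():
        optional = type_name.endswith("?")
        base = type_name.rstrip("?")
        # Unknown types (e.g. enums) fall back to plain strings in this sketch
        props[name] = dict(TYPE_MAP.get(base, {"type": "string"}))
        if not optional:
            required.append(name)
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "array",
        "items": {"type": "object", "properties": props, "required": required},
    }

# Mirrors the FastqPair record from the example below
schema = record_to_schema({
    "id": "String",
    "fastq_1": "Path",
    "fastq_2": "Path?",
    "strandedness": "Strandedness",
})
```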

For example, given the following pipeline script and config:

**`main.nf`**

```groovy
params {
    // Samplesheet containing the input paired-end reads
    input: List<FastqPair>

    // The input transcriptome file
    transcriptome: Path

    // Directory containing multiqc configuration
    multiqc: Path = "${projectDir}/multiqc"
}

record FastqPair {
    id : String
    fastq_1 : Path
    fastq_2 : Path?
    strandedness : Strandedness
}

enum Strandedness {
    FORWARD,
    REVERSE,
    UNSTRANDED,
    AUTO
}

workflow {
    // ...
}

output {
    // List of aligned samples
    samples: Channel<AlignedSample> {
        path { sample ->
            sample.fastqc >> 'fastqc/'
            sample.bam >> 'align/'
            sample.bai >> 'align/'
        }
        index {
            path 'samples.json'
        }
    }

    // MultiQC summary report
    multiqc_report: Path {
        path '.'
    }
}

record AlignedSample {
    id: String
    fastqc: Path
    bam: Path?
    bai: Path?
}
```

**`nextflow.config`**

```groovy
manifest {
    name = 'nf-core/rnaseq'
    contributors = [
        [
            name: 'Harshil Patel',
            affiliation: 'Seqera',
            github: '@drpatelh',
            contribution: ['author'],
            orcid: '0000-0003-2707-7940'
        ],
        [
            name: 'Phil Ewels',
            affiliation: 'Seqera',
            github: '@ewels',
            contribution: ['author'],
            orcid: '0000-0003-4101-2502'
        ]
    ]
    description = 'RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.'
    nextflowVersion = '!>=25.04.3'
    version = '3.23.0'
}
```

The following pipeline spec should be produced:

**`nextflow_spec.json`**

```json
{
    // metadata
    "$schema": "https://raw.githubusercontent.com/nextflow-io/schemas/main/pipeline/v1/schema.json",
    "$id": "https://raw.githubusercontent.com/nf-core/rnaseq/refs/tags/3.23.0/nextflow_spec.json",
    "title": "nf-core/rnaseq",
    "description": "RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.",
    "version": "3.23.0",
    "contributors": [
        {
            "name": "Harshil Patel",
            "affiliation": "Seqera",
            "github": "@drpatelh",
            "contribution": ["author"],
            "orcid": "0000-0003-2707-7940"
        },
        {
            "name": "Phil Ewels",
            "affiliation": "Seqera",
            "github": "@ewels",
            "contribution": ["author"],
            "orcid": "0000-0003-4101-2502"
        }
    ],

    // inputs
    "type": "object",
    "$defs": {
        "all_options": {
            "title": "Parameters",
            "type": "object",
            "properties": {
                "input": {
                    "type": "string",
                    "format": "file-path",
                    "description": "Samplesheet containing the input paired-end reads",
                    "schema": "assets/schema_input.json"
                },
                "transcriptome": {
                    "type": "string",
                    "format": "file-path",
                    "description": "The input transcriptome file"
                },
                "multiqc": {
                    "type": "string",
                    "format": "directory-path",
                    "description": "Directory containing multiqc configuration",
                    "default": "${projectDir}/multiqc"
                }
            }
        }
    },
    "allOf": [
        {
            "$ref": "#/$defs/all_options"
        }
    ],

    // outputs
    "output": {
        "samples": {
            "description": "List of aligned samples",
            "schema": "assets/schema_samples.json",
            "path": "samples.json"
        },
        "multiqc_report": {
            "description": "MultiQC summary report",
            "type": "file"
            // (path)
        }
    },

    // software dependencies
    "requires": {
        "nextflow": "!>=25.04.3"
    }
}
```

**`assets/schema_input.json`**

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "fastq_1": {
                "type": "string",
                "format": "file-path",
                "exists": true
            },
            "fastq_2": {
                "type": "string",
                "format": "file-path",
                "exists": true
            },
            "strandedness": {
                "type": "string",
                "enum": ["forward", "reverse", "unstranded", "auto"]
            }
        },
        "required": ["id", "fastq_1", "strandedness"]
    }
}
```

**`assets/schema_samples.json`**

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "fastqc": {
                "type": "string",
                "format": "directory-path"
            },
            "bam": {
                "type": "string",
                "format": "file-path"
            },
            "bai": {
                "type": "string",
                "format": "file-path"
            }
        },
        "required": ["id", "fastqc"]
    }
}
```

Notes:

- The `manifest` config options are converted directly to JSON with only nominal changes, such as `manifest.name` -> `title` (preserving the structure of the original nf-core schema) and `nextflowVersion` -> `requires.nextflow` (leaving space for module versions in the future).

- The parameter schema follows the structure of the nf-core schema, which defines *parameter groups* under `$defs` and combines them using JSON schema properties such as `allOf`. Ungrouped parameters at the top level are also allowed. This section should be generated with sensible defaults since some properties (e.g. group name) cannot be specified in pipeline code.

- Each output in the `output` section should specify either a type (e.g. `file`, `directory`) or a schema (e.g. if the output is a collection of records). Like parameters, the schema for an individual output should reference an external JSON schema file.
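The manifest-to-spec field mapping described in these notes can be sketched as follows. The function and its behavior are illustrative, not Nextflow's actual generator:

```python
def manifest_to_spec(manifest: dict) -> dict:
    """Map `manifest` config options to pipeline spec fields,
    e.g. name -> title and nextflowVersion -> requires.nextflow."""
    spec = {
        "title": manifest.get("name"),
        "description": manifest.get("description"),
        "version": manifest.get("version"),
        "contributors": manifest.get("contributors"),
    }
    if "nextflowVersion" in manifest:
        # Nested under `requires` to leave room for module versions later
        spec["requires"] = {"nextflow": manifest["nextflowVersion"]}
    # Drop fields that are absent from the manifest
    return {k: v for k, v in spec.items() if v is not None}
```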

### Pipeline spec synchronization

The pipeline spec may contain additional fields that cannot be sourced from the pipeline code (e.g., the `fa_icon` property in the parameter schema). Such fields can be useful for external systems even if they aren't relevant to the pipeline execution.

As a result, the pipeline spec cannot be completely inferred from pipeline code. Instead, the generated pipeline spec should be treated as a "skeleton" that can be extended by the user with additional fields.

- When generating the pipeline spec, Nextflow should use any existing spec and preserve information that isn't inferred from pipeline code.

- Any inconsistencies between the existing spec and pipeline code (e.g. missing or extra parameters) should be reported as errors.
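A minimal sketch of this merge-and-check behavior for the parameter section of the spec (names are hypothetical; the real implementation may differ):

```python
def sync_params(generated: dict, existing: dict) -> tuple[dict, list[str]]:
    """Merge user-added fields (e.g. fa_icon) from an existing spec into
    newly generated parameter properties, reporting inconsistencies."""
    merged, errors = {}, []
    for name, props in generated.items():
        merged[name] = dict(props)
        if name in existing:
            for key, value in existing[name].items():
                # Preserve extra fields; generated values win for shared keys,
                # since the pipeline code is the source of truth
                if key not in props:
                    merged[name][key] = value
        else:
            errors.append(f"parameter `{name}` is missing from the existing spec")
    for name in existing:
        if name not in generated:
            errors.append(f"parameter `{name}` is not defined in the pipeline code")
    return merged, errors
```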

## Links

- [nextflow-io/schemas](https://github.com/nextflow-io/schemas)
- [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema)
- Examples: [nextflow_schema.json](https://github.com/nf-core/rnaseq/blob/3.23.0/nextflow_schema.json) and [schema_input.json](https://github.com/nf-core/rnaseq/blob/3.23.0/assets/schema_input.json)
- [JSON schema](https://json-schema.org/)