diff --git a/CHANGELOG.md b/CHANGELOG.md index cad1c567..e8f58861 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,14 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### `Added` -- [#483](https://github.com/nf-core/funcscan/pull/483) **New screening workflow** for CAZyme Gene Cluster (CGC) and substrate prediction, through dbCAN (by @HaidYi) -- [#483](https://github.com/nf-core/funcscan/pull/483) Added support for preannotated input with optional GFF column in samplesheet for dbCAN CAZyme Gene Cluster (CGC) and substrate prediction, with new `--dbcan_skip_cgc` and `--dbcan_skip_substrate` parameters (by @HaidYi) - [#500](https://github.com/nf-core/funcscan/pull/500) Updated pipeline template to nf-core/tools version 3.4.1 (by @jfy133) - [#508](https://github.com/nf-core/funcscan/pull/508) Added support for antiSMASH's --clusterhmmer, --fullhmmer, and --tigrfam options (❤️ to @yusukepockyby for requesting, @jfy133) - [#506](https://github.com/nf-core/funcscan/pull/506) Added support GECCO convert for generation of additional files useful for downstream analysis (by @SkyLexS) - [#507](https://github.com/nf-core/funcscan/pull/507) Updated to nf-core template v3.5.1 (by @jfy133) - [#510](https://github.com/nf-core funcscan/pull/510) Fixed code to make Nextflow strict-syntax compliant (by @jfy133) -- [#521](https://github.com/nf-core funcscan/pull/521) Added option to turn on RGI's own cleanup of intermediate files (❤️ to @SamD28 for requesting, added by @jfy133) +- [#519](https://github.com/nf-core/funcscan/pull/519) Added BiG-SLiCE (`bigslice`) as a new BGC clustering tool in the BGC subworkflow. BiG-SLiCE clusters BGC sequences detected by antiSMASH and/or GECCO into Gene Cluster Families (GCFs) using an HMM-based approach. It is activated with `--bgc_run_bigslice` and requires `--bgc_bigslice_db`.
(by @SkyLexS) ### `Fixed` diff --git a/CITATIONS.md b/CITATIONS.md index f6c0630f..6705d544 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -38,6 +38,12 @@ > Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). [DOI: 10.1099/mgen.0.000685](https://doi.org/10.1099/mgen.0.000685) +- [BiG-SLiCE](https://github.com/medema-group/bigslice) + + > Kautsar, S. A., van der Hooft, J. J. J., de Ridder, D., & Medema, M. H. (2021). BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters. GigaScience, 10(1), giaa154. [DOI: 10.1093/gigascience/giaa154](https://doi.org/10.1093/gigascience/giaa154) + + > Kautsar, S. A., et al. (2026). BiG-SLiCE 2.0: improved gene cluster family diversity mapping. Nature Communications. [DOI: 10.1038/s41467-026-68733-5](https://doi.org/10.1038/s41467-026-68733-5) + - [comBGC](https://github.com/nf-core/funcscan) > Frangenberg, J., Fellows Yates, J. A., Ibrahim, A., Perelo, L., & Beber, M. E. (2023). nf-core/funcscan: 1.0.0 - German Rollmops - 2023-02-15. [DOI: 10.5281/zenodo.7643100](https://doi.org/10.5281/zenodo.7643099) diff --git a/README.md b/README.md index 55ceaf73..2aee71e7 100644 --- a/README.md +++ b/README.md @@ -39,10 +39,9 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s 4. Annotation of coding sequences from 3. to obtain general protein families and domains with [`InterProScan`](https://github.com/ebi-pf-team/interproscan) 5. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify) 6. 
Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg). [`argNorm`](https://github.com/BigDataBiology/argNorm) is used to map the outputs of `DeepARG`, `AMRFinderPlus`, and `ABRicate` to the [`Antibiotic Resistance Ontology`](https://www.ebi.ac.uk/ols4/ontologies/aro) for consistent ARG classification terms. -7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/) -8. Screening contigs for carbohydrate-active enzymes (CAZymes), CAZyme gene clusters and substrates with [run_dbcan](https://github.com/bcb-unl/run_dbcan). -9. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/paleobiotechnology/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs -10. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/) +7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`BiG-SLiCE`](https://github.com/medema-group/bigslice), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/) +8. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs +9. 
Software version and methods text reporting with [`MultiQC`](http://multiqc.info/) ![funcscan metro workflow](docs/images/funcscan_metro_workflow.png) @@ -93,7 +92,7 @@ nf-core/funcscan was originally written by Jasmin Frangenberg, Anan Ibrahim, Lou We thank the following people for their extensive assistance in the development of this pipeline: -Adam Talbot, Alexandru Mizeranschi, Haidong Yi, Hugo Tavares, Júlia Mir Pedrol, Martin Klapper, Mehrdad Jaberi, Robert Syme, Rosa Herbst, Vedanth Ramji, @Microbion, Dediu Octavian-Codrin. +Adam Talbot, Alexandru Mizeranschi, Hugo Tavares, Júlia Mir Pedrol, Martin Klapper, Mehrdad Jaberi, Robert Syme, Rosa Herbst, Vedanth Ramji, @Microbion, Dediu Octavian-Codrin. ## Contributions and Support diff --git a/conf/modules.config b/conf/modules.config index c8a394a9..a0ddac1b 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -541,6 +541,29 @@ process { ] } + withName: BIGSLICE { + errorStrategy = 'ignore' + ext.args = [ + params.bgc_bigslice_complete ? '--complete' : '', + params.bgc_bigslice_threshold != 0.4 ? "--threshold ${params.bgc_bigslice_threshold}" : '', + params.bgc_bigslice_threshold_pct != 0.0 ? "--threshold_pct ${params.bgc_bigslice_threshold_pct}" : '', + params.bgc_bigslice_n_ranks != 1 ? "--n_ranks ${params.bgc_bigslice_n_ranks}" : '' + ].join(' ').trim() + publishDir = [ + path: { "${params.outdir}/bgc/" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename }, + ] + } + + withName: BIGSLICE_DOWNLOADDB { + publishDir = [ + path: { "${params.outdir}/bgc/bigslice_db" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename }, + ] + } + withName: HAMRONIZATION_ABRICATE { publishDir = [ path: { "${params.outdir}/arg/hamronization/abricate" }, diff --git a/docs/output.md b/docs/output.md index 5229e037..b4db67bc 100644 --- a/docs/output.md +++ b/docs/output.md @@ -5,13 +5,12 @@ The output of nf-core/funcscan provides reports for each of the functional groups: - **antibiotic resistance genes** (tools: [ABRicate](https://github.com/tseemann/abricate), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [fARGene](https://github.com/fannyhb/fargene), [RGI](https://card.mcmaster.ca/analyze/rgi) – summarised by [hAMRonization](https://github.com/pha4ge/hAMRonization). Results from ABRicate, AMRFinderPlus, and DeepARG are normalised to [ARO](https://obofoundry.org/ontology/aro.html) by [argNorm](https://github.com/BigDataBiology/argNorm).) -- **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/paleobiotechnology/AMPcombi)) -- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) -- **carbohydrate-active enzymes (CAZymes)**, CAZyme gene clusters and substrates (tools: [run_dbcan](https://github.com/bcb-unl/run_dbcan)) +- **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/Darcy220606/AMPcombi)) +- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), 
[BiGSLiCE](https://github.com/medema-group/bigslice), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) As a general workflow, we recommend to first look at the summary reports ([ARGs](#hamronization), [AMPs](#ampcombi), [BGCs](#combgc)), to get a general overview of what hits have been found across all the tools of each functional group. After which, you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also includes the output from the functional annotation steps of either [prokka](https://github.com/tseemann/prokka), [pyrodigal](https://github.com/althonos/pyrodigal), [prodigal](https://github.com/hyattpd/Prodigal), or [Bakta](https://github.com/oschwengers/bakta) if the `--save_annotations` flag was set. Additionally, taxonomic classifications from [MMseqs2](https://github.com/soedinglab/MMseqs2) are saved if the `--taxa_classification_mmseqs_db_savetmp` and `--taxa_classification_mmseqs_taxonomy_savetmp` flags are set. -Similarly, all downloaded databases are saved (i.e. from [MMseqs2](https://github.com/soedinglab/MMseqs2), [antiSMASH](https://docs.antismash.secondarymetabolites.org), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [Bakta](https://github.com/oschwengers/bakta), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [RGI](https://github.com/arpcard/rgi), [AMPcombi](https://github.com/paleobiotechnology/AMPcombi), and/or [run_dbcan](https://github.com/bcb-unl/run_dbcan)) into the output directory `/databases/` if the `--save_db` flag was set. +Similarly, all downloaded databases are saved (i.e. 
from [MMseqs2](https://github.com/soedinglab/MMseqs2), [antiSMASH](https://docs.antismash.secondarymetabolites.org), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [Bakta](https://github.com/oschwengers/bakta), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [RGI](https://github.com/arpcard/rgi), and/or [AMPcombi](https://github.com/Darcy220606/AMPcombi)) into the output directory `/databases/` if the `--save_db` flag was set. Furthermore, for reproducibility, versions of all software used in the run is presented in a [MultiQC](http://multiqc.info) report. @@ -39,11 +38,10 @@ results/ | └── rgi/ ├── bgc/ | ├── antismash/ +| ├── bigslice/ | ├── deepbgc/ | ├── gecco/ | └── hmmsearch/ -├── cazyme/ -| └── dbcan/ ├── databases/ ├── multiqc/ ├── pipeline_info/ @@ -66,11 +64,11 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes p Input contig QC with: -- [SeqKit](https://bioinf.shenwei.me/seqkit/) (default) – for separating into long- and short- categories +- [SeqKit](https://bioinf.shenwei.me/seqkit/) (default) - for separating into long- and short- categories Taxonomy classification of nucleotide sequences with: -- [MMseqs2](https://github.com/soedinglab/MMseqs2) (default) – for contig taxonomic classification using 2bLCA. +- [MMseqs2](https://github.com/soedinglab/MMseqs2) (default) - for contig taxonomic classification using 2bLCA. ORF prediction and annotation with any of: @@ -101,22 +99,19 @@ Antimicrobial Peptides (AMPs): Biosynthetic Gene Clusters (BGCs): - [antiSMASH](#antismash) – biosynthetic gene cluster detection. -- [deepBGC](#deepbgc) – biosynthetic gene cluster detection, using a deep learning model. +- [BiGSLiCE](#bigslice) – biosynthetic gene cluster super-linear clustering engine. +- [deepBGC](#deepbgc) - biosynthetic gene cluster detection, using a deep learning model. 
- [GECCO](#gecco) – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs). - [hmmsearch](#hmmsearch) – biosynthetic gene cluster detection, based on hidden Markov models. -Carbohydrate-active enzymes (CAZYMEs) - -- [run_dbcan](https://github.com/bcb-unl/run_dbcan) – carbohydrate-active enzyme (CAZyme), CAZyme gene clusters and substrate detection. - Output Summaries: -- [AMPcombi](#ampcombi) – summary report of antimicrobial peptide gene output from various detection tools -- [hAMRonization](#hamronization) – summary of antimicrobial resistance gene output from various detection tools -- [argNorm](#argNorm) – Normalize ARG annotations from [ABRicate](#abricate), [AMRFinderPlus](#amrfinderplus), and [DeepARG](#deeparg) to the ARO -- [comBGC](#combgc) – summary of biosynthetic gene cluster output from various detection tools -- [MultiQC](#multiqc) – report of all software and versions used in the pipeline -- [Pipeline information](#pipeline-information) – report metrics generated during the workflow execution +- [AMPcombi](#ampcombi) – summary report of antimicrobial peptide gene output from various detection tools. +- [hAMRonization](#hamronization) – summary of antimicrobial resistance gene output from various detection tools. +- [argNorm](#argNorm) - Normalize ARG annotations from [ABRicate](#abricate), [AMRFinderPlus](#amrfinderplus), and [DeepARG](#deeparg) to the ARO +- [comBGC](#combgc) – summary of biosynthetic gene cluster output from various detection tools. +- [MultiQC](#multiqc) – report of all software and versions used in the pipeline. +- [Pipeline information](#pipeline-information) – report metrics generated during the workflow execution. ## Tool details @@ -393,7 +388,7 @@ Output Summaries: ### BGC detection tools -[antiSMASH](#antismash), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). +[antiSMASH](#antismash), [BiGSLiCE](#bigslice), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). 
Note that the BGC tools are run on a set of annotations generated on only long contigs (3000 bp or longer) by default. These specific filtered FASTA files are under `bgc/seqkit/`, and annotations files are under `annotation//long/`, if the corresponding saving flags are specified (see [parameter docs](https://nf-co.re/funcscan/parameters)). However the same annotations _should_ also be annotation files in the sister `all/` directory. @@ -435,6 +430,27 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation [antiSMASH](https://docs.antismash.secondarymetabolites.org) (**anti**biotics & **S**econdary **M**etabolite **A**nalysis **SH**ell) is a tool for rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial genomes. It identifies biosynthetic loci covering all currently known secondary metabolite compound classes in a rule-based fashion using profile HMMs and aligns the identified regions at the gene cluster level to their nearest relatives from a database containing experimentally verified gene clusters (MIBiG). +#### BiGSLiCE + +
+Output files + +- `bigslice/` + - `/` + - `result/` + - `data.db`: SQLite database containing results for BGCs, CDSs, Gene Cluster Families (GCFs), HMMs and HSPs. + - `tsv_export/` (optional): TSV exports of all parsed BGC metadata, vectorized features and clustering results. Only produced when `--bgc_bigslice_export_tsv` is set. + - `tmp/` + - `/` + - `*.fa`: predicted biosynthetic features as FASTA files, one file per hit HMM. + +
+ +[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). +It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. +BiG-SLiCE is activated with `--bgc_run_bigslice` and requires the HMM database to be supplied via `--bgc_bigslice_db`. At least one of antiSMASH or GECCO (with GECCO convert output in `bigslice` format) must also be enabled. +All results are stored in an SQLite database (`data.db`), which can be explored with standard SQLite tools or via the [BiG-SLiCE interactive web interface](https://github.com/medema-group/bigslice#running-the-query-mode-and-visualization). + #### deepBGC
@@ -479,35 +495,6 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation The additional GFF3, GenBank, or FASTA files from `--bgc_gecco_runconvert`, can be useful for additional further analysis of the BGC hits. -### CAZyme annotation tools - -#### run_dbcan - -
-Output files - -- `cazyme/` - - `dbcan/` - - `cazyme_annotation/` - - `_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation - - `_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation - - `_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation - - `_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation - - `cgc/` - - `_cgc.gff`: GFF file containing the CAZyme gene clusters (CGC) identified by dbCAN. This file is generated from the dbCAN annotation and contains the locations of CAZyme gene clusters in the genome - - `_cgc_standard_out.tsv`: Standard output file from dbCAN for CAZyme gene clusters (CGC) in a tabular format. This file summarizes the CAZyme gene clusters identified in the genome - - `_diamond.out.tc`: TSV file containing the diamond output for transporter annotation - - `_TF_hmm_results.tsv`: TSV file containing the results of transcription factor screening - - `_STP_hmm_results.tsv`: TSV file containing the results of signaling transduction proteins (STP) annotation - - `substrate/` - - `_total_cgc_info.tsv`: TSV file summarizing the total additional genes in the genome - - `_substrate_prediction.tsv`: TSV file containing the substrate predictions based on the CGC annotations from dbCAN - - `_synteny_pdf/`: Directory containing one or more PDF files showing the syntenic regions of the CGCs in DNA sequence as identified by dbCAN - -
- -[run_dbcan](https://github.com/bcb-unl/run_dbcan) is an automated tool for carbohydrate-active enzyme (CAZyme), CAZyme gene cluster and substrate annotation. - ### Summary tools [AMPcombi](#ampcombi), [hAMRonization](#hamronization), [comBGC](#combgc), [MultiQC](#multiqc), [pipeline information](#pipeline-information), [argNorm](#argnorm). @@ -583,12 +570,12 @@ In that case we recommend to lower the AMP prediction thresholds and run more th
-[AMPcombi](https://github.com/paleobiotechnology/AMPcombi) summarizes the results of **antimicrobial peptide (AMP)** prediction tools (ampir, AMPlify, Macrel, and other non-nf-core supported tools) into a single table and aligns the hits against a reference AMP database for functional, structural and taxonomic classification using [MMseqs2](https://github.com/soedinglab/MMseqs2). +[AMPcombi](https://github.com/Darcy220606/AMPcombi) summarizes the results of **antimicrobial peptide (AMP)** prediction tools (ampir, AMPlify, Macrel, and other non-nf-core supported tools) into a single table and aligns the hits against a reference AMP database for functional, structural and taxonomic classification using [MMseqs2](https://github.com/soedinglab/MMseqs2). It further assigns the physiochemical properties (e.g. hydrophobicity, molecular weight) using the [Biopython toolkit](https://github.com/biopython/biopython) and clusters the resulting AMP hits from all samples using [MMseqs2](https://github.com/soedinglab/MMseqs2). To further filter the recovered AMPs using the presence of signaling peptides, the output file `Ampcombi_summary_cluster.tsv` or `ampcombi_complete_summary_taxonomy.tsv.gz` can be used downstream as detailed [here](https://ampcombi.readthedocs.io/en/main/usage.html#signal-peptide). The final tables generated may also be visualized and explored using an interactive [user interface](https://ampcombi.readthedocs.io/en/main/visualization.html). -AMPcombi interface +AMPcombi interface #### hAMRonization diff --git a/docs/usage.md b/docs/usage.md index 35eee206..faafe8ea 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -169,6 +169,44 @@ When the annotation is run with Prokka, the resulting `.gbk` file passed to anti If antiSMASH is run for BGC detection, we recommend to **not** run Prokka for annotation but instead use the default annotation tool (Pyrodigal), or switch to Prodigal or (for bacteria only!) Bakta. 
::: +### BiGSLiCE + +[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). +It is activated with `--bgc_run_bigslice` and requires at least one BGC source to be enabled: + +- antiSMASH (default BGC tool) +- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice` + +BiG-SLiCE does **not** discover BGCs itself; it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. +The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#databases-and-reference-files) for details); it is not auto-downloaded by the pipeline. + +By default, BiG-SLiCE only writes a `data.db` SQLite database. +To additionally export all results as tab-separated text files, pass `--bgc_bigslice_export_tsv`. + +The following optional parameters can be used to tune the clustering behaviour: + +| Pipeline parameter | BiG-SLiCE flag | Description | +| ------------------------------ | ----------------- | ---------------------------------------------------------------------------------------------- | +| `--bgc_bigslice_complete` | `--complete` | Force a full re-clustering run from scratch | +| `--bgc_bigslice_threshold` | `--threshold` | Jaccard index threshold for GCF membership (default: 0.3) | +| `--bgc_bigslice_threshold_pct` | `--threshold_pct` | Percentage-based GCF membership threshold (mutually exclusive with `--bgc_bigslice_threshold`) | +| `--bgc_bigslice_n_ranks` | `--n_ranks` | Number of initial GCF centroids (default: 3000) | + +::: note +`--bgc_bigslice_threshold` and `--bgc_bigslice_threshold_pct` are mutually exclusive: the pipeline will error at startup if both are set to non-default values. +::: + +::: warning +`--bgc_bigslice_complete` forces BiG-SLiCE to cluster **all** input BGCs, including those with no significant HMM hits.
+This requires a sufficiently large dataset; with fewer than ~10–15 samples the run will fail with `Exception: Not enough input for clustering`. +::: + +::: warning +`--bgc_bigslice_n_ranks` must **not exceed the number of BGCs** in the input dataset. +Setting it to a value larger than the dataset size will cause BiG-SLiCE to fail with `ValueError: Expected n_neighbors <= n_samples_fit`. +The default of 3000 is suitable for large public datasets; reduce this value when working with smaller datasets. +::: + ## Databases and reference files Various tools of nf-core/funcscan use databases and reference files to operate. @@ -527,6 +565,25 @@ deepbgc_db/ └── myDetectors*.pkl ``` +### BiGSLiCE + +BiG-SLiCE requires its own HMM database. Unlike most other tools in funcscan, the pipeline does **not** auto-download this database; there is no built-in download command in the tool itself. The database must be downloaded manually and supplied with `--bgc_bigslice_db`. + +Download the latest pre-built database archive from the [BiG-SLiCE GitHub releases page](https://github.com/medema-group/bigslice/releases): + +```bash +wget https://github.com/medema-group/bigslice/releases/latest/download/bigslice-models.tar.gz +tar -xzf bigslice-models.tar.gz +``` + +Then supply the extracted directory to the pipeline: + +```bash +--bgc_bigslice_db '////' +``` + +The database directory should contain subdirectories such as `biosynthetic_pfams/` and `sub_pfams/` at the top level. + ### InterProScan [InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow with `--run_protein_annotation` will download and unzip the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/) version 5.72-103.0. The database can be saved in the output directory `/databases/interproscan/` if the `--save_db` is turned on.
diff --git a/modules.json b/modules.json index f7832716..df1f923c 100644 --- a/modules.json +++ b/modules.json @@ -70,6 +70,16 @@ "git_sha": "72c983560c9b9c2a02ff636451a5e5008f7d020b", "installed_by": ["modules"] }, + "bigslice": { + "branch": "master", + "git_sha": "8ff67cb10964982d41bae43ac3fe7ada16d09ef8", + "installed_by": ["modules"] + }, + "bigslice/downloaddb": { + "branch": "master", + "git_sha": "8ff67cb10964982d41bae43ac3fe7ada16d09ef8", + "installed_by": ["modules"] + }, "deeparg/downloaddata": { "branch": "master", "git_sha": "81880787133db07d9b4c1febd152c090eb8325dc", diff --git a/modules/nf-core/bigslice/downloaddb/environment.yml b/modules/nf-core/bigslice/downloaddb/environment.yml new file mode 100644 index 00000000..7c42576a --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/environment.yml @@ -0,0 +1,8 @@ +--- +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json +channels: + - conda-forge + - bioconda +dependencies: + - "bioconda::bigslice=2.0.2" + - "conda-forge::python=3.10.18" diff --git a/modules/nf-core/bigslice/downloaddb/main.nf b/modules/nf-core/bigslice/downloaddb/main.nf new file mode 100644 index 00000000..d2a19985 --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/main.nf @@ -0,0 +1,43 @@ +process BIGSLICE_DOWNLOADDB { + tag "${meta.id}" + label 'process_single' + + conda "${moduleDir}/environment.yml" + container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container + ? 'https://depot.galaxyproject.org/singularity/bigslice:2.0.2--pyh8ed023e_0' + : 'quay.io/biocontainers/bigslice:2.0.2--pyh8ed023e_0'}" + + input: + val meta + + output: + tuple val(meta), path ("bigslice-models") , emit: db + // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions. 
+ tuple val("${task.process}"), val('bigslice'), val("2.0.2"), topic: versions, emit: versions_bigslice + tuple val("${task.process}"), val('python'), eval("python --version | sed 's/Python //'"), topic: versions, emit: versions_python + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + """ + # Copy the script to the work directory so it writes bigslice-models/ here + # (download_bigslice_hmmdb uses __file__ to determine the output path; + # copying it here ensures bigslice-models/ is created in the work directory) + + cp \$(which download_bigslice_hmmdb) ./download_bigslice_hmmdb_local + + python \\ + ./download_bigslice_hmmdb_local \\ + ${args} + + """ + + stub: + """ + mkdir -p bigslice-models/biosyn_pfam bigslice-models/sub_pfams + touch bigslice-models/biosyn_pfam/Biosyn_pfams.hmm + touch bigslice-models/sub_pfams/corepfam.tsv + """ +} diff --git a/modules/nf-core/bigslice/downloaddb/meta.yml b/modules/nf-core/bigslice/downloaddb/meta.yml new file mode 100644 index 00000000..95d17104 --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/meta.yml @@ -0,0 +1,83 @@ +name: "bigslice_downloaddb" +description: | + Downloads and extracts the BiG-SLiCE HMM database (biosynthetic and sub Pfams) + using the bundled `download_bigslice_hmmdb` script shipped with BiG-SLiCE. + The resulting directory can be passed directly as the `hmmdb` input to the + `BIGSLICE` module. 
+keywords: + - biosynthetic gene clusters + - genomics + - database + - download +tools: + - "bigslice": + description: A highly scalable, user-interactive tool for the large scale + analysis of Biosynthetic Gene Clusters data + homepage: "https://github.com/medema-group/bigslice" + documentation: "https://github.com/medema-group/bigslice" + tool_dev_url: "https://github.com/medema-group/bigslice" + doi: "10.1093/gigascience/giaa154" + licence: + - "AGPL v3-or-later" + identifier: "" +input: + - meta: + type: map + description: Groovy Map containing sample information e.g. `[ id:'test' ]` +output: + db: + - - meta: + type: map + description: Groovy Map containing sample information e.g. `[ id:'test' ]` + - bigslice-models: + type: directory + description: Downloaded and extracted BiG-SLiCE HMM database directory + containing biosynthetic Pfam HMMs and sub-Pfam profiles. Pass this + directly as `hmmdb` to the `BIGSLICE` module. + pattern: "bigslice-models" + versions_bigslice: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - 2.0.2: + type: string + description: The version of the tool + versions_python: + - - ${task.process}: + type: string + description: The name of the process + - python: + type: string + description: The name of the tool + - python --version | sed 's/Python //': + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - 2.0.2: + type: string + description: The version of the tool + - - ${task.process}: + type: string + description: The name of the process + - python: + type: string + description: The name of the tool + - python --version | sed 's/Python //': + type: eval + description: The expression to obtain the version of the tool +authors: + - "@vagkaratzas" + - 
"@SkyLex" +maintainers: + - "@vagkaratzas" + - "@SkyLex" diff --git a/modules/nf-core/bigslice/downloaddb/tests/main.nf.test b/modules/nf-core/bigslice/downloaddb/tests/main.nf.test new file mode 100644 index 00000000..01ec8fe3 --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/tests/main.nf.test @@ -0,0 +1,59 @@ +nextflow_process { + + name "Test Process BIGSLICE_DOWNLOADDB" + script "../main.nf" + process "BIGSLICE_DOWNLOADDB" + + tag "modules" + tag "modules_nfcore" + tag "bigslice" + tag "bigslice_downloaddb" + tag "bigslice/downloaddb" + + test("bigslice - downloaddb") { + + when { + process { + """ + input[0] = [ id: 'test' ] + """ + } + } + + then { + assert process.success + def dbDir = file(process.out.db[0][1]) + def topDirs = [] + dbDir.eachDir { topDirs << it.name } + assertAll( + { assert dbDir.isDirectory() }, + { assert snapshot( + topDirs.sort(), + ["versions_bigslice": process.out.versions_bigslice] + ).match() }, + { assert process.out.versions_python } + ) + } + + } + + test("bigslice - downloaddb - stub") { + + options "-stub" + + when { + process { + """ + input[0] = [ id: 'test' ] + """ + } + } + + then { + assert process.success + assert snapshot(sanitizeOutput(process.out)).match() + } + + } + +} diff --git a/modules/nf-core/bigslice/downloaddb/tests/main.nf.test.snap b/modules/nf-core/bigslice/downloaddb/tests/main.nf.test.snap new file mode 100644 index 00000000..a73b08e1 --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/tests/main.nf.test.snap @@ -0,0 +1,64 @@ +{ + "bigslice - downloaddb": { + "content": [ + [ + "biosynthetic_pfams", + "sub_pfams" + ], + { + "versions_bigslice": [ + [ + "BIGSLICE_DOWNLOADDB", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-04-28T16:52:06.219494172" + }, + "bigslice - downloaddb - stub": { + "content": [ + { + "db": [ + [ + { + "id": "test" + }, + [ + [ + "Biosyn_pfams.hmm:md5,d41d8cd98f00b204e9800998ecf8427e" + ], + [ 
+ "corepfam.tsv:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ] + ] + ], + "versions_bigslice": [ + [ + "BIGSLICE_DOWNLOADDB", + "bigslice", + "2.0.2" + ] + ], + "versions_python": [ + [ + "BIGSLICE_DOWNLOADDB", + "python", + "3.10.18" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-05-06T16:13:18.283218194" + } +} \ No newline at end of file diff --git a/modules/nf-core/bigslice/downloaddb/tests/nextflow.config b/modules/nf-core/bigslice/downloaddb/tests/nextflow.config new file mode 100644 index 00000000..76ab8bba --- /dev/null +++ b/modules/nf-core/bigslice/downloaddb/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: BIGSLICE_DOWNLOADDB { + ext.args = '' + } +} diff --git a/modules/nf-core/bigslice/environment.yml b/modules/nf-core/bigslice/environment.yml new file mode 100644 index 00000000..de8fdfbb --- /dev/null +++ b/modules/nf-core/bigslice/environment.yml @@ -0,0 +1,7 @@ +--- +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json +channels: + - conda-forge + - bioconda +dependencies: + - "bioconda::bigslice=2.0.2" diff --git a/modules/nf-core/bigslice/main.nf b/modules/nf-core/bigslice/main.nf new file mode 100644 index 00000000..ae301171 --- /dev/null +++ b/modules/nf-core/bigslice/main.nf @@ -0,0 +1,63 @@ +process BIGSLICE { + tag "${meta.id}" + label 'process_medium' + + conda "${moduleDir}/environment.yml" + container "${workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container + ? 
'https://depot.galaxyproject.org/singularity/bigslice:2.0.2--pyh8ed023e_0' + : 'quay.io/biocontainers/bigslice:2.0.2--pyh8ed023e_0'}" + + input: + tuple val(meta), path(bgc, stageAs: 'bgc_files/s*/*') + path(hmmdb) + val(export_tsv) + + output: + tuple val(meta), path("${prefix}/result") , emit: output + tuple val(meta), path("${prefix}/result/tsv_export") , emit: tsv, optional: true + // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions. + tuple val("${task.process}"), val('bigslice'), val("2.0.2"), topic: versions, emit: versions_bigslice + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def args2 = task.ext.args2 ?: '' + prefix = task.ext.prefix ?: "${meta.id}" + def sample = meta.id + def export_tsv_cmd = export_tsv ? "bigslice --export-tsv ${prefix}/result/tsv_export --program_db_folder ${hmmdb} ${args2} ${prefix}" : '' + """ + mkdir -p input/dataset/${sample} input/taxonomy + find bgc_files -name '*.gbk' | xargs -I{} cp {} input/dataset/${sample}/ + + printf "# dataset_name\\tdataset_path\\ttaxonomy_path\\tdescription\\n" > input/datasets.tsv + printf "dataset\\tdataset\\ttaxonomy/taxonomy.tsv\\tBGC dataset\\n" >> input/datasets.tsv + + touch input/taxonomy/taxonomy.tsv + + bigslice \\ + ${args} \\ + --num_threads ${task.cpus} \\ + -i input \\ + --program_db_folder ${hmmdb} \\ + ${prefix} + + ${export_tsv_cmd} + """ + + stub: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}" + """ + echo ${args} + + mkdir -p ${prefix}/result/tmp/2e555308dfc411186cf012334262f127 + touch ${prefix}/result/data.db + touch ${prefix}/result/tmp/2e555308dfc411186cf012334262f127/test.fa + if ${export_tsv}; then + mkdir -p ${prefix}/result/tsv_export + touch ${prefix}/result/tsv_export/bgcs.tsv + fi + """ +} diff --git a/modules/nf-core/bigslice/meta.yml b/modules/nf-core/bigslice/meta.yml new file mode 100644 index 00000000..54caea09 --- /dev/null 
+++ b/modules/nf-core/bigslice/meta.yml @@ -0,0 +1,93 @@ +name: "bigslice" +description: | + A scalable tool for large-scale analysis of Biosynthetic Gene Clusters (BGCs). + It takes genome regions in GenBank format along with an HMM database and produces a SQLite database and FASTA outputs of predicted features. +keywords: + - biosynthetic gene clusters + - genomics + - analysis +tools: + - "bigslice": + description: A highly scalable, user-interactive tool for the large scale + analysis of Biosynthetic Gene Clusters data + homepage: "https://github.com/medema-group/bigslice" + documentation: "https://github.com/medema-group/bigslice" + tool_dev_url: "https://github.com/medema-group/bigslice" + doi: "10.1093/gigascience/giaa154" + licence: + - "AGPL v3-or-later" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information + e.g. `[ id:'sample1' ]` + - bgc: + type: file + description: | + List of GenBank (.gbk) files containing genomic region annotations for BiG-SLiCE input. + Each file represents a BGC region. The module internally organises them into the required + BiG-SLiCE input folder structure (datasets.tsv and taxonomy TSV). + pattern: "*.gbk" + ontologies: [] + - hmmdb: + type: directory + description: | + Path to the BiG-SLiCE HMM database folder containing biosynthetic and sub Pfams for annotation, in the required BiG-SLiCE format. + An example directory in compressed archive format can be found here: https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz + - export_tsv: + type: boolean + description: | + If true, runs a second BiG-SLiCE invocation to export all results from the SQLite database + to TSV files under `tsv_export/`. Additional arguments for this step can be passed via `task.ext.args2`. 
+output: + output: + - - meta: + type: map + description: Groovy Map containing sample/dataset information + - ${prefix}/result: + type: directory + description: | + BiG-SLiCE result directory containing the SQLite database (`data.db`), + predicted feature FASTA files (`tmp/**/*.fa`), and optionally TSV exports + (`tsv_export/`) when `export_tsv` is `true`. + pattern: "result" + tsv: + - - meta: + type: map + description: Groovy Map containing sample/dataset information + - ${prefix}/result/tsv_export: + type: directory + description: | + Directory containing TSV exports of all parsed BGC metadata, vectorized + features and clustering results. Only present when `export_tsv` input is + set to `true`. + pattern: "tsv_export" + versions_bigslice: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - 2.0.2: + type: string + description: The version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - 2.0.2: + type: string + description: The version of the tool +authors: + - "@vagkaratzas" + - "@SkyLex" +maintainers: + - "@vagkaratzas" + - "@SkyLex" diff --git a/modules/nf-core/bigslice/tests/main.nf.test b/modules/nf-core/bigslice/tests/main.nf.test new file mode 100644 index 00000000..f895cfd0 --- /dev/null +++ b/modules/nf-core/bigslice/tests/main.nf.test @@ -0,0 +1,178 @@ +nextflow_process { + + name "Test Process BIGSLICE" + script "../main.nf" + process "BIGSLICE" + config "./nextflow.config" + + tag "modules" + tag "modules_nfcore" + tag "bigslice" + tag "aria2" + tag "untar" + + setup { + run("ARIA2", alias: "ARIA2_HMMDB") { + script "../../aria2/main.nf" + process { + """ + input[0] = [ + [ id:'test_hmm_db' ], + 'https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz' // https URL + ] + """ + } + } + 
+ run("UNTAR", alias: "UNTAR_HMMDB") { + script "../../untar/main.nf" + process { + """ + input[0] = ARIA2_HMMDB.out.downloaded_file + """ + } + } + + run("ARIA2", alias: "ARIA2_GBK") { + script "../../aria2/main.nf" + process { + """ + input[0] = [ + [ id:'test_gbk' ], + params.modules_testdata_base_path + 'genomics/prokaryotes/streptomyces_coelicolor/fixtures_bigslice_gbk.tar.gz' // https URL + ] + """ + } + } + + run("UNTAR", alias: "UNTAR_GBK") { + script "../../untar/main.nf" + process { + """ + input[0] = ARIA2_GBK.out.downloaded_file + """ + } + } + } + + test("streptomyces_coelicolor - bigslice - gbk") { + + when { + process { + """ + // Flatten the GBK directory into a list of individual GBK files with meta + input[0] = UNTAR_GBK.out.untar.map { meta, dir -> + def gbk_files = [] + dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it } + [ meta, gbk_files ] + } + input[1] = UNTAR_HMMDB.out.untar.map{ it -> it[1] } + input[2] = false + """ + } + } + + then { + assert process.success + def resultDir = file(process.out.output[0][1]) + def allNames = [] + def tmpFaCount = 0 + resultDir.eachFileRecurse { f -> + if (!f.isDirectory()) { + def rel = resultDir.toPath().relativize(f.toPath()).toString() + if (rel.startsWith('tmp/') || rel.startsWith('tmp\\')) { + if (f.name.endsWith('.fa')) tmpFaCount++ + } else { + allNames.add(f.name) + } + } + } + assertAll( + { assert resultDir.isDirectory() }, + { assert tmpFaCount > 0 }, + { assert snapshot( + allNames.sort(), + process.out.findAll { key, val -> key.startsWith("versions")} + ).match() } + ) + } + + } + + test("streptomyces_coelicolor - bigslice - gbk - export_tsv") { + + when { + process { + """ + // Flatten the GBK directory into a list of individual GBK files with meta + input[0] = UNTAR_GBK.out.untar.map { meta, dir -> + def gbk_files = [] + dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it } + [ meta, gbk_files ] + } + input[1] = UNTAR_HMMDB.out.untar.map{ it -> it[1] } + 
input[2] = true + """ + } + } + + then { + assert process.success + def resultDir = file(process.out.output[0][1]) + def allNames = [] + def tmpFaCount = 0 + resultDir.eachFileRecurse { f -> + if (!f.isDirectory()) { + def rel = resultDir.toPath().relativize(f.toPath()).toString() + if (rel.startsWith('tmp/') || rel.startsWith('tmp\\')) { + if (f.name.endsWith('.fa')) tmpFaCount++ + } else { + allNames.add(f.name) + } + } + } + assertAll( + { assert resultDir.isDirectory() }, + { assert tmpFaCount > 0 }, + { assert file(process.out.tsv[0][1]).isDirectory() }, + { assert snapshot( + allNames.sort(), + process.out.findAll { key, val -> key.startsWith("versions")} + ).match() } + ) + } + + } + + test("streptomyces_coelicolor - bigslice - gbk - stub") { + + options "-stub" + + when { + process { + """ + // Flatten the GBK directory into a list of individual GBK files with meta + input[0] = UNTAR_GBK.out.untar.map { meta, dir -> + def gbk_files = [] + dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it } + [ meta, gbk_files ] + } + input[1] = UNTAR_HMMDB.out.untar.map{ it -> it[1] } + input[2] = false + """ + } + } + + then { + assert process.success + assertAll( + { assert snapshot( + process.out, + process.out.findAll { key, val -> key.startsWith("versions")} + ).match() } + ) + } + + } + +} diff --git a/modules/nf-core/bigslice/tests/main.nf.test.snap b/modules/nf-core/bigslice/tests/main.nf.test.snap new file mode 100644 index 00000000..d945a37e --- /dev/null +++ b/modules/nf-core/bigslice/tests/main.nf.test.snap @@ -0,0 +1,121 @@ +{ + "streptomyces_coelicolor - bigslice - gbk - export_tsv": { + "content": [ + [ + "bgc_features_1.pkl", + "bgc_metadata.tsv", + "data.db", + "gcf_membership.tsv", + "gcf_models_1.pkl", + "run_metadata.tsv" + ], + { + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-04-03T01:11:26.257005672" + }, + 
"streptomyces_coelicolor - bigslice - gbk - stub": { + "content": [ + { + "0": [ + [ + { + "id": "test_gbk" + }, + [ + "data.db:md5,d41d8cd98f00b204e9800998ecf8427e", + [ + [ + "test.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ] + ] + ] + ], + "1": [ + + ], + "2": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ], + "output": [ + [ + { + "id": "test_gbk" + }, + [ + "data.db:md5,d41d8cd98f00b204e9800998ecf8427e", + [ + [ + "test.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ] + ] + ] + ], + "tsv": [ + + ], + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + }, + { + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-04-02T22:45:23.737040708" + }, + "streptomyces_coelicolor - bigslice - gbk": { + "content": [ + [ + "bgc_features_1.pkl", + "data.db", + "gcf_models_1.pkl" + ], + { + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-04-03T01:10:19.794409662" + } +} \ No newline at end of file diff --git a/modules/nf-core/bigslice/tests/nextflow.config b/modules/nf-core/bigslice/tests/nextflow.config new file mode 100644 index 00000000..2986e346 --- /dev/null +++ b/modules/nf-core/bigslice/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: BIGSLICE { + ext.prefix = "test_bigslice" + } +} diff --git a/nextflow.config b/nextflow.config index 0bc5360a..9dbe79e5 100644 --- a/nextflow.config +++ b/nextflow.config @@ -257,6 +257,15 @@ params { bgc_gecco_convertmode = 'clusters' bgc_gecco_convertformat = 'gff' + + bgc_run_bigslice = false + bgc_bigslice_db = null + bgc_bigslice_complete = false + bgc_bigslice_export_tsv = false + bgc_bigslice_threshold = 0.4 + bgc_bigslice_threshold_pct = 0.0 + bgc_bigslice_n_ranks = 1 + bgc_run_hmmsearch = false bgc_hmmsearch_models = null bgc_hmmsearch_savealignments = false 
@@ -561,6 +570,14 @@ manifest { contribution: ['contributor'], orcid: '', ], + [ + name: 'Dediu Octavian-Codrin', + affiliation: '', + email: '', + github: 'https://github.com/SkyLexS', + contribution: ['contributor'], + orcid: '', + ], ] homePage = 'https://github.com/nf-core/funcscan' description = """Pipeline for screening for functional components of assembled contigs""" diff --git a/nextflow_schema.json b/nextflow_schema.json index 26133958..d1fb5889 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -404,7 +404,7 @@ "type": "integer", "default": 1, "description": "Minimum contig size required for annotation (bp).", - "help_text": "Specify the minimum contig lengths to carry out annotations on. The Prokka developers recommend that this should be more than 200 bp, if you plan to submit such annotations to NCBI.\n\nFor more information please check the Prokka [documentation](https://github.com/tseemann/prokka).\n\n> Modifies tool parameter(s):\n> - Prokka: `--mincontiglen`", + "help_text": "Specify the minimum contig lengths to carry out annotations on. The Prokka developers recommend that this should be ≥ 200 bp, if you plan to submit such annotations to NCBI.\n\nFor more information please check the Prokka [documentation](https://github.com/tseemann/prokka).\n\n> Modifies tool parameter(s):\n> - Prokka: `--mincontiglen`", "fa_icon": "fas fa-ruler-horizontal" }, "annotation_prokka_evalue": { @@ -1070,14 +1070,14 @@ }, "arg_rgi_includeloose": { "type": "boolean", - "description": "Include all of loose, strict and perfect hits (i.e. more than 95% identity) found by RGI.", + "description": "Include all of loose, strict and perfect hits (i.e. ≥ 95% identity) found by RGI.", "help_text": "When activated RGI output will include 'Loose' hits in addition to 'Strict' and 'Perfect' hits. 
The 'Loose' algorithm works outside of the detection model cut-offs to provide detection of new, emergent threats and more distant homologs of AMR genes, but will also catalog homologous sequences and spurious partial matches that may not have a role in AMR.\n\nFor more information check the RGI [documentation](https://github.com/arpcard/rgi).\n\n> Modifies tool parameter(s):\n> - RGI_MAIN: `--include_loose`", "fa_icon": "far fa-hand-scissors" }, "arg_rgi_includenudge": { "type": "boolean", "description": "Suppresses the default behaviour of RGI with `--arg_rgi_includeloose`.", - "help_text": "This flag suppresses the default behaviour of RGI, by listing all 'Loose' matches of more than 95% identity as 'Strict' or 'Perfect', regardless of alignment length.\n\nFor more information check the RGI [documentation](https://github.com/arpcard/rgi).\n\n> Modifies tool parameter(s):\n> - RGI_MAIN: `--include_nudge`", + "help_text": "This flag suppresses the default behaviour of RGI, by listing all 'Loose' matches of ≥ 95% identity as 'Strict' or 'Perfect', regardless of alignment length.\n\nFor more information check the RGI [documentation](https://github.com/arpcard/rgi).\n\n> Modifies tool parameter(s):\n> - RGI_MAIN: `--include_nudge`", "fa_icon": "fas fa-hand-scissors" }, "arg_rgi_lowquality": { @@ -1465,6 +1465,55 @@ }, "fa_icon": "fas fa-angle-double-right" }, + "bgc_bigslice": { + "title": "BGC: BiG-SLiCE", + "type": "object", + "default": "", + "properties": { + "bgc_run_bigslice": { + "type": "boolean", + "description": "Run BiG-SLiCE to cluster detected BGCs into gene cluster families (GCFs)." + }, + "bgc_bigslice_db": { + "type": "string", + "description": "Path to the pre-downloaded BiG-SLiCE HMM database directory.", + "help_text": "Supply the path to a local copy of the BiG-SLiCE HMM database. 
The database can be downloaded from the BiG-SLiCE GitHub releases page:\n\n```bash\nwget https://github.com/medema-group/bigslice/releases/latest/download/bigslice-models.tar.gz\ntar -xzf bigslice-models.tar.gz\n```\n\nThe directory should contain subdirectories such as `biosynthetic_pfams/` and `sub_pfams/` at its top level.\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: `--program_db_folder`", + "fa_icon": "fas fa-database" + }, + "bgc_bigslice_export_tsv": { + "type": "boolean", + "description": "Export BiG-SLiCE results as TSV files in addition to the SQLite database.", + "help_text": "Passes `--export-tsv` to BiG-SLiCE, which exports all parsed BGC metadata, vectorized features and clustering results as tab-separated text files alongside the default `data.db` SQLite database." + }, + "bgc_bigslice_complete": { + "type": "boolean", + "description": "Restrict BiG-SLiCE clustering to complete BGCs (those not on a contig edge).", + "help_text": "By default BiG-SLiCE clusters all input BGCs, including fragmented ones. Activating this flag restricts clustering to complete BGCs (`on_contig_edge = False`). If all detected BGCs are fragmented, BiG-SLiCE will fail with 'Not enough input for clustering.', so avoid this flag when your input sequences are short or fragmented.\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: `--complete`" + }, + "bgc_bigslice_threshold": { + "type": "number", + "default": 0.4, + "description": "Jaccard index threshold for considering a BGC as a member of a GCF.", + "help_text": "Controls the minimum Jaccard index alignment score required for a BGC to be assigned to a Gene Cluster Family (GCF). 
Lower values increase sensitivity (more assignments) at the cost of specificity.\n\nFor more information see the BiG-SLiCE [documentation](https://github.com/medema-group/bigslice/wiki/Program-parameters).\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: `--threshold`", + "fa_icon": "fas fa-sliders-h" + }, + "bgc_bigslice_threshold_pct": { + "type": "number", + "default": 0.0, + "description": "Percentage-based BGC-to-GCF membership threshold.", + "help_text": "An alternative membership threshold expressed as a percentage of the total number of hits. When set above 0, a BGC is assigned to a GCF only if at least this percentage of its domain hits overlap with the GCF centroid.\n\nFor more information see the BiG-SLiCE [documentation](https://github.com/medema-group/bigslice/wiki/Program-parameters).\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: `--threshold_pct`", + "fa_icon": "fas fa-percent" + }, + "bgc_bigslice_n_ranks": { + "type": "integer", + "default": 1, + "description": "Number of best GCF hits to report for each BGC membership assignment.", + "help_text": "Sets how many top-scoring GCF hits are recorded per BGC during the membership assignment step. Increasing this value reports more alternative GCF assignments per BGC at the cost of longer runtime.\n\nFor more information see the BiG-SLiCE [documentation](https://github.com/medema-group/bigslice/wiki/Program-parameters).\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: `--n_ranks`", + "fa_icon": "fas fa-list-ol" + } + }, + "description": "Parameters for BiG-SLiCE clustering of biosynthetic gene clusters (BGCs) into gene cluster families (GCFs). 
More info: https://github.com/medema-group/bigslice" + }, "bgc_hmmsearch": { "title": "BGC: hmmsearch", "type": "object", @@ -1775,6 +1824,9 @@ { "$ref": "#/$defs/bgc_gecco" }, + { + "$ref": "#/$defs/bgc_bigslice" + }, { "$ref": "#/$defs/bgc_hmmsearch" }, diff --git a/subworkflows/local/bgc.nf b/subworkflows/local/bgc.nf index e12c21c0..06eb6e14 100644 --- a/subworkflows/local/bgc.nf +++ b/subworkflows/local/bgc.nf @@ -13,6 +13,8 @@ include { COMBGC } from '../../modules/local/com include { TABIX_BGZIP as BGC_TABIX_BGZIP } from '../../modules/nf-core/tabix/bgzip' include { MERGE_TAXONOMY_COMBGC } from '../../modules/local/merge_taxonomy_combgc' include { GECCO_CONVERT } from '../../modules/nf-core/gecco/convert' +include { BIGSLICE } from '../../modules/nf-core/bigslice' +include { BIGSLICE_DOWNLOADDB } from '../../modules/nf-core/bigslice/downloaddb' workflow BGC { take: @@ -116,6 +118,27 @@ workflow BGC { GECCO_CONVERT(ch_gecco_clusters_and_gbk, params.bgc_gecco_convertmode, params.bgc_gecco_convertformat) ch_versions = ch_versions.mix(GECCO_CONVERT.out.versions) } + // BIGSLICE + if (params.bgc_run_bigslice) { + + def gecco_bigslice = !params.bgc_skip_gecco && params.bgc_gecco_runconvert && params.bgc_gecco_convertformat == 'bigslice' + + if (!params.bgc_skip_antismash && gecco_bigslice) { + ch_bigslice_input = ANTISMASH_ANTISMASH.out.gbk_results.mix(GECCO_CONVERT.out.bigslice) + } else if (!params.bgc_skip_antismash) { + ch_bigslice_input = ANTISMASH_ANTISMASH.out.gbk_results + } else { + ch_bigslice_input = GECCO_CONVERT.out.bigslice + } + + ch_bigslice_grouped = ch_bigslice_input + .map { _meta, files -> files } + .collect() + .map { files -> [ [id: 'bigslice'], files.flatten() ] } + + BIGSLICE_DOWNLOADDB([ id: 'bigslice_db' ]) + BIGSLICE(ch_bigslice_grouped, BIGSLICE_DOWNLOADDB.out.db.map { _meta, db -> db }, params.bgc_bigslice_export_tsv) + } // HMMSEARCH if (params.bgc_run_hmmsearch) { if (params.bgc_hmmsearch_models) { diff --git 
a/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf b/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf index a64b84cf..d348338a 100644 --- a/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf @@ -176,6 +176,20 @@ def validateInputParameters() { error("[nf-core/funcscan] ERROR: when specifying --bgc_gecco_convertmode 'clusters', --bgc_gecco_convertformat can only be set to 'gff'. You specified --bgc_gecco_convertformat '${params.bgc_gecco_convertformat}'. Check input!") } } + if (params.run_bgc_screening && params.bgc_run_bigslice) { + if (params.bgc_skip_antismash && (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) { + error('[nf-core/funcscan] ERROR: BiG-SLiCE requires at least one of: (1) antiSMASH enabled, or (2) GECCO enabled with GECCO convert in bigslice format. Please check your parameters.') + } + if (params.bgc_bigslice_threshold != 0.4 && params.bgc_bigslice_threshold_pct != 0.0) { + error('[nf-core/funcscan] ERROR: --bgc_bigslice_threshold and --bgc_bigslice_threshold_pct are mutually exclusive. Please specify only one of the two.') + } + if (params.bgc_bigslice_complete) { + log.warn('[nf-core/funcscan] WARNING: --bgc_bigslice_complete restricts BiG-SLiCE clustering to complete (non-contig-edge) BGCs only. If all detected BGCs are fragmented (on_contig_edge = True), BiG-SLiCE will fail with "Not enough input for clustering." Consider removing --bgc_bigslice_complete if your input sequences are short or fragmented.') + } + if (params.bgc_bigslice_n_ranks != 1) { + log.warn("[nf-core/funcscan] WARNING: --bgc_bigslice_n_ranks is set to ${params.bgc_bigslice_n_ranks}. BiG-SLiCE will fail if this value exceeds the total number of BGCs detected in your dataset (n_neighbors must be <= n_samples). 
Consider using the default value (1) for small datasets.") + } + } } // @@ -235,6 +249,7 @@ def toolCitationText() { !params.bgc_skip_deepbgc ? "deepBGC (Hannigan et al. 2019)," : "", !params.bgc_skip_gecco ? "GECCO (Carroll et al. 2021)," : "", params.bgc_run_hmmsearch ? "HMMER (Eddy 2011)," : "", + params.bgc_run_bigslice ? "BiG-SLiCE (Kautsar et al. 2021, Kautsar et al. 2026)," : "", ". The output from the biosynthetic gene cluster screening tools were standardised and summarised with comBGC (Frangenberg et al. 2023).", ].join(' ').replaceAll(', +.', ".").trim() @@ -292,6 +307,8 @@ def toolBibliographyText() { !params.bgc_skip_antismash ? '
  • Blin, K., Shaw, S., Vader, L., Szenei, J., Reitz, Z.L., Augustijn, H.E., Cediel-Becerra, J.D.D., de Crécy-Lagard, V., Koetsier, R.A., Williams, S.E., Cruz-Morales, P., Wongwas, S., Segurado Luchsinger, A.E., Biermann, F., Korenskaia, A., Zdouc, M.M., Meijer, D., Terlouw, B.R., van der Hooft, J.J.J., Ziemert, N., Helfrich, E.J.N., Masschelein, J., Corre, C., Chevrette, M.G., van Wezel, G.P., Medema, M.H., Weber, T., 2025. antiSMASH 8.0: extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation. Nucleic Acids Res. 53, W32-W38. DOI: Hannigan, G. D., Prihoda, D., Palicka, A., Soukup, J., Klempir, O., Rampula, L., Durcak, J., Wurst, M., Kotowski, J., Chang, D., Wang, R., Piizzi, G., Temesi, G., Hazuda, D. J., Woelk, C. H., & Bitton, D. A. (2019). A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic acids research, 47(18), e110. DOI: 10.1093/nar/gkz654
  • ' : "", !params.bgc_skip_gecco ? '
  • Carroll, L. M., Larralde, M., Fleck, J. S., Ponnudurai, R., Milanese, A., Cappio Barazzone, E. & Zeller, G. (2021). Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv DOI: 10.1101/2021.05.03.442509
  • ' : "", + params.bgc_run_bigslice ? '
  • Kautsar, S. A., van der Hooft, J. J. J., de Ridder, D., & Medema, M. H. (2021). BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters. GigaScience, 10(1), giaa154. DOI: 10.1093/gigascience/giaa154
  • ' : "", + params.bgc_run_bigslice ? '
  • Kautsar, S. A., et al. (2026). BiG-SLiCE 2.0: improved gene cluster family diversity mapping. Nature Communications. DOI: 10.1038/s41467-026-68733-5
  • ' : "", '
  • Frangenberg, J., Fellows Yates, J. A., Ibrahim, A., Perelo, L., & Beber, M. E. (2023). nf-core/funcscan: 1.0.0 - German Rollmops - 2023-02-15. https://doi.org/10.5281/zenodo.7643100
  • ', ].join(' ').replaceAll(', +.', ".").trim()