diff --git a/AUTHORS b/AUTHORS index 2fee14e35f..ece0e9df3a 100644 --- a/AUTHORS +++ b/AUTHORS @@ -5,9 +5,12 @@ Metagraph was conceived and written by: Andre Kahles Further contributors/collaborators: + Oleksandr Kulkov Daniel Danciu + Marc Zimmermann Christopher Barber Contributing students: Sara Javadzadeh No Jan Studeny + Thomas Zhou diff --git a/COPYRIGHT b/COPYRIGHT index 1768c99bc3..a8b0d9dfdd 100644 --- a/COPYRIGHT +++ b/COPYRIGHT @@ -1,6 +1,6 @@ Metagraph is provided free of charge under the GPLv3 license: -Copyright (c) 2014-2019, Mikhail Karasikov, Harun Mustafa, Andre Kahles, Gunnar Raetsch +Copyright (c) 2014-2026, Mikhail Karasikov, Harun Mustafa, Oleksandr Kulkov, Andre Kahles, Gunnar Raetsch (for further author information, please refer to the AUTHORS file) All rights reserved. diff --git a/README.md b/README.md index d6ec44b2db..f0d4e9aabd 100644 --- a/README.md +++ b/README.md @@ -1,316 +1,342 @@ -# Metagenome Graph Project +

+ + + MetaGraph + +

+# MetaGraph: Metagenome Graph Project + +[![Platform: Linux | macOS](https://img.shields.io/badge/platform-Linux%20%7C%20macOS-brightgreen)](#quick-start) [![GitHub release (latest by date)](https://img.shields.io/github/v/release/ratschlab/metagraph)](https://github.com/ratschlab/metagraph/releases) +[![Bioconda version](https://img.shields.io/conda/vn/bioconda/metagraph)](https://bioconda.github.io/recipes/metagraph/README.html) [![bioconda downloads](https://img.shields.io/conda/dn/bioconda/metagraph?color=blue)](https://bioconda.github.io/recipes/metagraph/README.html) -[![install with conda](https://img.shields.io/badge/install%20with-conda-brightgreen.svg?style=flat)](#conda) +[![install with conda](https://img.shields.io/badge/install%20with-conda-brightgreen.svg?style=flat)](#1-install) [![install with docker](https://img.shields.io/badge/install%20with-docker-brightgreen)](#docker) [![install from source](https://img.shields.io/badge/install%20from-source-lightgrey)](#install-from-sources) -[![documentation](https://img.shields.io/badge/-online%20docs-grey)](https://metagraph.ethz.ch/static/docs/index.html) - -MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment. - -The default index representations in MetaGraph are extremely scalable and support building graphs with trillions of nodes and millions of annotation labels. -At the same time, the provided workflows and their careful implementation, combined with low-level optimizations of the core data structures, enable exceptional query and alignment performance. 
+[![DOI](https://img.shields.io/badge/DOI-10.1038%2Fs41586--025--09603--w-blue)](https://doi.org/10.1038/s41586-025-09603-w) +[![documentation](https://img.shields.io/badge/docs-online-blue.svg)](https://metagraph.ethz.ch/static/docs/index.html) -#### Main features: -* Large-scale indexing of sequences -* [Python API](https://metagraph.ethz.ch/static/docs/api.html) for querying in the server mode -* Encoding [**k-mer counts**](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-counts) (e.g., expression values) and [**k-mer coordinates**](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-coordinates) in source sequences (e.g., for lossless encoding of genomes) -* **Sequence alignment** against very large annotated graphs (sub-k seeding allows using arbitrarily short seeds) -* Scalable cleaning of very large de Bruijn graphs (to remove sequencing errors) -* Support for custom alphabets (e.g., {A,C,G,T,N} or amino acids) -* Algorithms for [differential assembly](https://metagraph.ethz.ch/static/docs/sequence_assembly.html#differential-assembly) +**Scalable indexing and querying of annotated genome graphs, from a handful of genomes to petabase-scale sequence repositories.** -#### Design choices in MetaGraph: -* Use of succinct data structures and efficient representation schemes for extremely high scalability -* Algorithmic choices that work efficiently with succinct data structures (e.g., always prefer batched operations) -* Modular support of different graph and annotation representations -* Use of generic and extensible interfaces to support adding custom index representations / algorithms with little code overhead. +Think of MetaGraph as a search engine for sequencing data: index your reads or assemblies once, then query, align, or recover source positions in milliseconds. -## Documentation -Online documentation is available at https://metagraph.ethz.ch/static/docs/index.html. Offline sources are [here](metagraph/docs/source). 
+A MetaGraph index has two components: a **de Bruijn graph** that stores all *k*-mers extracted from the input sequences, and an **annotation matrix** that links each *k*-mer to its source labels (for example the sample, sequence header, expression level, or position). -## Citation +```mermaid +flowchart LR + F["FASTA / FASTQ
reads, assemblies, transcripts"] + F -->|build| G[("de Bruijn graph
graph.dbg")] + F -->|annotate| M[("annotation matrix
*.annodbg")] + G --> Q(("query / align")) + M --> Q + Qseq["query sequence"] --> Q + Q --> R["matches
labels, counts, positions"] +``` -If you are using MetaGraph or the index resources for your work, please cite: +### Features -> Karasikov M, Mustafa H, Danciu D, Kulkov O, Zimmermann M, Barber C, Rätsch G, Kahles A. Efficient and accurate search in petabase-scale sequence repositories. *Nature*. 2025;647: 1036–1044. -> https://www.nature.com/articles/s41586-025-09603-w +- **Search in public archives.** [metagraph.ethz.ch](https://metagraph.ethz.ch) hosts a search engine over 56 petabases of public sequencing data. See [MetaGraph Online](#metagraph-online). +- **Index your own data.** Build a *k*-mer index over reads, assemblies, or transcripts; query for matching labels. +- **Optional per-*k*-mer payloads.** Attach [counts](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-counts) (abundance, e.g. expression or coverage) or [coordinates](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-coordinates) (positions that losslessly encode source sequences and return per-target hit positions). +- **Sequence alignment.** Align sequences to the full annotated graph, with sub-*k* seeding for arbitrarily short queries. +- **Scalable graph cleaning.** Strip sequencing errors out of very large de Bruijn graphs. +- **[Differential assembly](https://metagraph.ethz.ch/static/docs/sequence_assembly.html#differential-assembly).** Extract sequences present in one group of samples and absent from another. +- **[Python API & HTTP server](https://metagraph.ethz.ch/static/docs/api.html).** Drive MetaGraph from Python or query a running instance over HTTP.
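The two-component layout described above (a *k*-mer set plus a matrix linking each *k*-mer to its labels) can be sketched with a toy in-memory index. This is only an illustration under simplifying assumptions: `build_index` and `kmers` are hypothetical helper names, and a plain Python dict stands in for MetaGraph's succinct graph and compressed annotation formats, which work very differently internally.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(records, k):
    """Toy index (hypothetical helper, not MetaGraph's API).

    The dict keys play the role of the de Bruijn graph's k-mer set;
    the values play the role of rows of the annotation matrix.
    """
    index = defaultdict(set)
    for label, seq in records.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return dict(index)


# Two made-up "samples" that share the 4-mers CGTA and GTAC.
records = {"sampleA": "ACGTAC", "sampleB": "CGTACG"}
index = build_index(records, k=4)

print(sorted(index["CGTA"]))  # ['sampleA', 'sampleB']
print(sorted(index["ACGT"]))  # ['sampleA']
```

Looking up a *k*-mer returns the labels of every input that contains it, which is the basic operation a query builds on.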
-BibTeX +Under the hood + +- **Succinct data structures**: the default `succinct` (BOSS) graph representation uses only 2–4 bits per *k*-mer. +- **Modular annotation formats**: `ColumnCompressed`, `RowDiff`, `RowSparse`, `Rainbowfish`, plus count- and coordinate-aware variants. Pick the compression/speed tradeoff that fits your scale. +- **Custom alphabets**: `{A,C,G,T}`, `{A,C,G,T,N}`, amino acids, case-sensitive DNA, or compile-time custom alphabets. +- **Extensible by design**: generic interfaces let developers add new graph/annotation representations or algorithms with little code. +- **Memory-mapped loading**: pass `--mmap` to any subcommand for fast cold start and low query-time RAM (NVMe recommended; SSD works but slower). +- **Scales to trillions of *k*-mers and millions of labels**: petabase-scale collections have been indexed end-to-end. -```bibtex -@article{karasikov2025metagraph, - title={Efficient and accurate search in petabase-scale sequence repositories}, - author={Karasikov, Mikhail and Mustafa, Harun and Danciu, Daniel and Kulkov, Oleksandr and Zimmermann, Marc and Barber, Christopher and R{\"a}tsch, Gunnar and Kahles, Andr{\'e}}, - journal={Nature}, - volume={647}, - number={8091}, - pages={1036--1044}, - year={2025}, - publisher={Nature Publishing Group}, - doi={10.1038/s41586-025-09603-w} -} -```
-## Install +> **Full documentation:** https://metagraph.ethz.ch/static/docs/index.html (offline copy in [`metagraph/docs/source`](metagraph/docs/source)). -### Conda +## MetaGraph Online -Install the [latest release](https://github.com/ratschlab/metagraph/releases/latest) on Linux or Mac OS X with Anaconda: +Try MetaGraph without installing anything: [metagraph.ethz.ch](https://metagraph.ethz.ch) hosts a search engine over indexes built from 56 petabases of public DNA, RNA, and protein sequencing data: RefSeq, UHGG, Tara Oceans, UniParc, and more (see the [databases list](https://metagraph.ethz.ch/indexes)). Paste a query sequence and pick the indexes to search. -``` -conda install -c bioconda -c conda-forge metagraph -``` +Prefer local analysis? The prebuilt indexes are also published on [AWS Open Data](https://registry.opendata.aws/metagraph/) for download (no AWS account needed) and offline querying. See [Preconstructed indexes](https://metagraph.ethz.ch/static/docs/resources.html#preconstructed-indexes). -### Docker +## Quick start -If docker is available on the system, immediately get started with +### 1. Install +```bash +conda create -n metagraph python +conda activate metagraph +conda install -c bioconda -c conda-forge metagraph +pip install "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows" ``` -docker pull ghcr.io/ratschlab/metagraph:master -docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \ - metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa -``` -and replace `${HOME}` with a directory on the host system to map it under `/mnt` in the container. -By default, it executes the binary compiled for the `DNA` alphabet {A,C,G,T}. 
-To run the binary compiled for the `DNA5` or `Protein` alphabet, just replace `metagraph` with `metagraph_DNA5` or `metagraph_Protein`, respectively, e.g.: -``` -docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \ - metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa -``` +`metagraph` is the compiled binary; `metagraph-workflows` is the Python wrapper that drives the full build pipeline. -One can see that running MetaGraph with docker is very easy. Also, the following command (or similar) may be handy to see what directory is mounted in the container: -``` -docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master ls /mnt -``` +### 2. Build an index -For more complex workflows, consider running docker in the interactive mode: -``` -$ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master +Clone the repo for the bundled test data, then build: + +```bash +# Only to fetch the bundled example *.fa files; skip this if you have your own data +git clone https://github.com/ratschlab/metagraph.git && cd metagraph -root@5c42291cc9cf:/# ls /mnt/ -root@5c42291cc9cf:/# metagraph --version +metagraph-workflows build <(ls metagraph/tests/data/*.fa) -o out/ --primary ``` -All different versions of the container image are listed [here](https://github.com/ratschlab/metagraph/pkgs/container/metagraph). +`--primary` indexes one strand per *k*-mer pair (about half the size; appropriate when read strand orientation is unknown, e.g. typical short-read sequencing). -### Install From Sources +Internally this chains `metagraph build → annotate → row-diff transform → BRWT clustering → BRWT relaxation` and produces `graph.dbg`, the more compact `graph_small.dbg` (smaller and slower at access, useful when RAM or storage is tight), and the default annotation `graph.relax.row_diff_brwt.annodbg` (`RowDiff`). 
 See the [Quick start guide](https://metagraph.ethz.ch/static/docs/quick_start.html) in the docs for a step-by-step walkthrough of index construction. -To compile from source (e.g., for builds with custom alphabet or other configurations), see [documentation online](https://metagraph.ethz.ch/static/docs/installation.html#install-from-source). +
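To see why `--primary` gives an index of roughly half the size, it may help to count distinct *k*-mers with and without collapsing each *k*-mer and its reverse complement into one representative. The sketch below is a rough illustration: it assumes the representative is the lexicographically smaller of the two strands, which is a common convention but not necessarily the orientation MetaGraph keeps, and `distinct_kmers` is a hypothetical helper.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over {A,C,G,T}."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative per k-mer / reverse-complement pair.

    Assumption for illustration: the lexicographically smaller strand.
    """
    return min(kmer, reverse_complement(kmer))


def distinct_kmers(seqs, k, one_strand_per_pair):
    """Count distinct k-mers, optionally collapsing strand pairs."""
    seen = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            seen.add(canonical(kmer) if one_strand_per_pair else kmer)
    return len(seen)


# The same made-up fragment observed once on each strand.
reads = ["ACGGT", reverse_complement("ACGGT")]

print(distinct_kmers(reads, 3, one_strand_per_pair=False))  # 6
print(distinct_kmers(reads, 3, one_strand_per_pair=True))   # 3
```

When reads come from both strands in unknown orientation, as in this toy input, storing one strand per pair loses no information and halves the number of stored *k*-mers.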
+Real-workload example with file list and hardware budget + +```bash +metagraph-workflows build samples.txt -o out/ --primary \ + -p 34 --mem-gb 70 --disk-swap-dir /scratch/swap +``` +The positional argument accepts a file list (one path per line), a directory, or process substitution. Per-stage memory caps, thread packing, and BRWT parameters are derived from the budget. See `metagraph-workflows build --help` for all options. -## Quick start: build an index in one command +
-For most users, the easiest entry point is the Snakemake wrapper, which -runs the full indexing pipeline — graph construction, annotation, and -all row-diff / BRWT transforms — as a single command. +### 3. Query -The wrapper ships as a separate Python package; the `metagraph` conda -recipe only installs the C++ binary, so the workflow CLI needs an extra -`pip install` step: +Query the index (any fasta works as input; the bundled test data is fine for a smoke test): ```bash -conda install -c bioconda -c conda-forge metagraph # the metagraph binary -pip install -U "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows" +metagraph query --query-mode matches -p 8 \ + -i out/graph.dbg -a out/graph.relax.row_diff_brwt.annodbg \ + metagraph/tests/data/transcripts_100.fa ``` -Then run the full pipeline as: +Other ways to use the index: + +- [`metagraph align`](https://metagraph.ethz.ch/static/docs/sequence_search.html#sequence-to-graph-alignment): align sequences to the graph and report the alignment itself. With a coordinate-aware annotator it acts as a read mapper, returning source positions. +- `metagraph query --align`: align each query to the graph first, then report the *labels* of the best-scoring alignment instead of requiring exact *k*-mer matches. Useful for divergent or noisy queries. (So `align` returns alignments/positions; `query --align` returns labels found via alignment.) +- [`metagraph server_query`](https://metagraph.ethz.ch/static/docs/api.html): start an HTTP server that answers queries over a REST API (also drivable from the Python client). + +The [Minimal example](#minimal-example) below walks through each step on a smaller dataset. + +## Minimal example + +
+Step-by-step walkthrough with the metagraph CLI (build → annotate → query → stats) + +A hands-on demo using `metagraph` directly (no workflow wrapper). A *label* is whatever tag you want each *k*-mer associated with: a filename, a fasta header, or a custom string. `--anno-header` below produces one label per fasta record. ```bash -metagraph-workflows build samples.txt -o out/ --primary +cd metagraph/tests/data + +# 1. Build a de Bruijn graph (k-mer index) with k=31 +metagraph build -v -p 4 -k 31 -o samples metasub_fake_data_simple.fa + +# 2. Construct an annotation matrix linking each k-mer to its source label. +metagraph annotate -v -p 4 -i samples.dbg --anno-header \ + -o samples metasub_fake_data_simple.fa + +# 3. Query: for each query sequence, report all labels whose k-mers cover ≥80% of it. +metagraph query --query-mode matches \ + -i samples.dbg -a samples.column.annodbg \ + --min-kmers-fraction-label 0.8 \ + metasub_fake_data_simple.fa + +# 4. Print graph + annotation stats. +metagraph stats -a samples.column.annodbg samples.dbg ``` -`samples.txt` is a text file listing your input sample paths (one per -line); a directory of sample files works just as well. `out/` will -contain `graph.dbg`, `graph_small.dbg`, and the requested annotation -artifacts. You can also feed a list inline with process substitution: +Outputs `samples.dbg` (the de Bruijn graph) and `samples.column.annodbg` (3 labels, one per fasta record). Sample query output (tab-separated: query index, query header, then one `
-Tell the workflow how much hardware to use; everything else (memory -caps per stage, per-column buffer sizing, BRWT clustering parameters) -is derived automatically: +## More recipes + +
+Direct CLI: align / assemble / differential assembly / stats ```bash -metagraph-workflows build samples.txt -o out/ --primary \ - -p 34 --mem-gb 70 --disk-swap-dir /scratch/swap -``` +# Build a de Bruijn graph from a fasta (single sample) +metagraph build -v -p 8 -k 31 --mem-cap-gb 10 -o graph data.fa.gz -Useful switches: -- `-p N` — maximum CPU cores to use (defaults to all cores) -- `--mem-gb GB` — approximate RAM budget per rule (default 16) -- `--disk-swap-dir DIR` — directory for on-disk spill buffers -- `--primary` — build a primary graph (recommended for most workloads) -- `--anno-type FMT` — request a specific annotation format - (repeat for multiple outputs); the default is `relax.row_diff_brwt` -- `--with-counts` / `--with-coords` — count- or coordinate-aware - annotation (mutually exclusive) -- `--graph EXISTING.dbg` — reuse an already-built graph and only run - the annotation + transforms - -See `metagraph/workflows/README.rst` for setup and the full option -list (`metagraph-workflows build --help`). - - -## Typical workflow -1. Build de Bruijn graph from Fasta files, FastQ files, or [KMC k-mer counters](https://github.com/refresh-bio/KMC/):\ -`./metagraph build` -2. Annotate graph using the column compressed annotation:\ -`./metagraph annotate` -3. Transform the built annotation to a different annotation scheme:\ -`./metagraph transform_anno` -4. 
Query annotated graph\ -`./metagraph query` - -### Example -``` -DATA="../tests/data/transcripts_1000.fa" +# Build using disk swap (limits RAM at the cost of disk I/O) +metagraph build -v -p 8 -k 31 --mem-cap-gb 10 --disk-swap /scratch/swap -o graph data.fa.gz -./metagraph build -k 12 -o transcripts_1000 $DATA +# Annotate a graph with file-based labels (one label per input file) +metagraph annotate -v -p 8 --anno-filename -i graph.dbg -o annotation data.fa.gz -./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA +# Align sequences to the graph (plain sequence-to-graph alignment, no labels). +metagraph align -v -i graph.dbg query.fa -./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA +# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every +# k-mer of the reported path lies in every reported label. +metagraph align -v -i graph.dbg -a annotation.row_diff_brwt.annodbg reads.fq -./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg -``` +# Same, but with a coordinate-aware annotator: the walk is additionally +# coordinate-consistent (positions step by ±1 per node), so hits map to +# source positions. Functions as a read mapper over indexed genomes. +metagraph align -v -i graph.dbg -a annotation.row_diff_brwt_coord.annodbg reads.fq -### Print usage -`./metagraph` +# query --align: aligns each query to the graph WITHOUT label constraints, +# then fetches labels for the highest-scoring walk. Use when exact k-mer +# matching is too strict (divergent or noisy queries). 
+metagraph query --align -i graph.dbg -a annotation.row_diff_brwt.annodbg query.fa -### Build graph +# Assemble unitigs from a graph (writes assembled.fasta.gz) +metagraph assemble -v graph.dbg -o assembled --unitigs -* #### Simple build -```bash -./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \ - -o /graph /*.fasta.gz \ -2>&1 | tee /log.txt -``` +# Differential assembly: JSON rules define which label groups must be present vs. absent. +# Sample rule files: metagraph/tests/data/example.diff.json, example_simple.diff.json. +metagraph assemble -v graph.dbg --unitigs \ + -a annotation.column.annodbg \ + --diff-assembly-rules diff_assembly_rules.json \ + -o diff_assembled -* #### Build with disk swap (use to limit the RAM usage) -```bash -./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap \ - -o /graph /*.fasta.gz \ -2>&1 | tee /log.txt +# Stats: graph only, annotation only, or both +metagraph stats graph.dbg +metagraph stats -a annotation.column.annodbg +metagraph stats -a annotation.column.annodbg graph.dbg ``` -#### Build from k-mers filtered with KMC -```bash -K=20 -./KMC/kmc -ci5 -t4 -k$K -m5 -fm .fasta.gz .cutoff_5 ./KMC -./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph .cutoff_5.kmc_pre -``` +See the [online docs](https://metagraph.ethz.ch/static/docs/index.html) for the full subcommand reference, alphabet builds, and the row-diff / BRWT clustering pipeline. -### Annotate graph -```bash -./metagraph annotate -v --anno-type row --fasta-anno \ - -i primates.dbg \ - -o primates \ - ~/fasta_zurich/refs_chimpanzee_primates.fa -``` +
-### Convert annotation to Multi-BRWT -1) Cluster columns -```bash -./metagraph transform_anno -v --linkage --greedy \ - -o linkage.txt \ - --subsample R \ - -p NCORES \ - primates.column.annodbg -``` -Requires `N*R/8 + 6*N^2` bytes of RAM, where `N` is the number of columns and `R` is the number of rows subsampled. +
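The two RAM estimates quoted above, `N*R/8 + 6*N^2` bytes for the column clustering and `M*V/8 + Size(BRWT)` bytes for the Multi-BRWT construction, are easy to evaluate before launching a job. The helper below is a minimal sketch: the function names and all example numbers are made up for illustration, and the formulas are rough estimates rather than guarantees.

```python
def linkage_ram_bytes(num_columns, subsampled_rows):
    """Estimate RAM for column clustering: N*R/8 + 6*N^2 bytes.

    N = number of columns (labels), R = number of subsampled rows.
    """
    return num_columns * subsampled_rows / 8 + 6 * num_columns ** 2


def brwt_build_ram_bytes(num_rows, parallel_nodes, brwt_size_bytes):
    """Estimate RAM for Multi-BRWT construction: M*V/8 + Size(BRWT) bytes.

    M = number of rows in the annotation, V = nodes merged concurrently.
    """
    return num_rows * parallel_nodes / 8 + brwt_size_bytes


# Hypothetical annotation: 10,000 labels, 1M subsampled rows.
clustering = linkage_ram_bytes(10_000, 1_000_000)
print(f"clustering: {clustering / 1e9:.2f} GB")  # clustering: 1.85 GB

# Hypothetical build: 1e9 rows, 16 nodes in parallel, 0.5 GB final BRWT.
build = brwt_build_ram_bytes(1_000_000_000, 16, 500_000_000)
print(f"BRWT build: {build / 1e9:.2f} GB")  # BRWT build: 2.50 GB
```

Note that the `6*N^2` term grows quadratically with the number of labels, so for annotations with very many columns it dominates the clustering estimate regardless of the subsample size.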
+Build from KMC k-mer counters (e.g., to filter low-abundance k-mers) -2) Construct Multi-BRWT -```bash -./metagraph transform_anno -v -p NCORES --anno-type brwt \ - --linkage-file linkage.txt \ - -o primates \ - --parallel-nodes V \ - -p NCORES \ - primates.column.annodbg -``` -Requires `M*V/8 + Size(BRWT)` bytes of RAM, where `M` is the number of rows in the annotation and `V` is the number of nodes merged concurrently. +[KMC](https://github.com/refresh-bio/KMC) is an extremely efficient *k*-mer counter that can pre-process inputs and filter by abundance. Build a graph from *k*-mers occurring at least 5 times: -### Query graph ```bash -./metagraph query -v -i /graph.dbg \ - -a /annotation.column.annodbg \ - --min-kmers-fraction-label 0.8 --labels-delimiter ", " \ - query_seq.fa -``` +K=31 +# Count k-mers with KMC (drops k-mers with abundance < 5) +./KMC/kmc -ci5 -t4 -k$K -m5 -fq input.fastq.gz input.cutoff_5 ./KMC -### Align to graph -```bash -./metagraph align -v -i /graph.dbg query_seq.fa +# Build the graph directly from the KMC counter +metagraph build -v -p 4 -k $K -o graph input.cutoff_5.kmc_pre ``` -### Assemble sequences +Add `--count-kmers` to keep the KMC abundance counts as a weight vector alongside the graph (`graph.dbg.weights`). The weighted graph is the input for [`metagraph clean`](https://metagraph.ethz.ch/static/docs/quick_start.html#graph-cleaning) and for indexing *k*-mer counts. + +
+ +
+Common query flags + ```bash -./metagraph assemble -v /graph.dbg \ - -o assembled.fa \ - --unitigs +# Filter labels with low k-mer coverage (default 0.7) +metagraph query --min-kmers-fraction-label 0.8 ... + +# Delimiter joining labels in `--query-mode labels` output (default ":") +metagraph query --query-mode labels --labels-delimiter ", " ... + +# Per-k-mer presence/absence bitmask per matching label +metagraph query --query-mode signature ... + +# Indexed k-mer counts (requires count-aware annotation, see online docs) +metagraph query --query-mode counts ... + +# k-mer coordinates / per-target positions (requires coordinate-aware annotation) +metagraph query --query-mode coords ... + +# Server mode (Python API / HTTP queries) +metagraph server_query -i graph.dbg -a annotation.row_diff_brwt.annodbg --port 5555 --parallel 8 ``` -### Assemble differential sequences +
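The effect of `--min-kmers-fraction-label` can be pictured as a per-label threshold on the share of query *k*-mers that carry that label. The following sketch assumes a simple "matched fraction is at least the threshold" rule. `labels_passing` is a hypothetical helper, and MetaGraph's exact rounding and tie handling may differ.

```python
def labels_passing(kmer_labels, min_fraction):
    """Return labels found in at least min_fraction of the query k-mers.

    kmer_labels holds one set of labels per query k-mer; an empty set
    means the k-mer was not found in the index.
    """
    total = len(kmer_labels)
    if total == 0:
        return []
    counts = {}
    for labels in kmer_labels:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return sorted(label for label, n in counts.items()
                  if n / total >= min_fraction)


# Made-up query with 5 k-mers: A matches 5/5, B matches 4/5, C matches 3/5.
query = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A"},
]

print(labels_passing(query, 0.8))  # ['A', 'B']
print(labels_passing(query, 0.5))  # ['A', 'B', 'C']
```

Raising the threshold trades sensitivity for specificity: label `C`, which covers 60% of the query, is reported at 0.5 but dropped at 0.8.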
+ +## More install options + +The recommended conda install is in [Quick start](#quick-start). MetaGraph runs on Linux and macOS. Bioconda ships the `DNA` and `Protein` alphabets (`metagraph` is symlinked to `metagraph_DNA`); the Docker image adds `DNA5`. For other alphabets, build from source. + +### Docker + ```bash -./metagraph assemble -v /graph.dbg \ - --unitigs \ - -a /annotation.column.annodbg \ - --diff-assembly-rules diff_assembly_rules.json \ - -o diff_assembled.fa +docker pull ghcr.io/ratschlab/metagraph:master +docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \ + metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa ``` -See [`metagraph/tests/data/example.diff.json`](metagraph/tests/data/example.diff.json) and [`metagraph/tests/data/example_simple.diff.json`](metagraph/tests/data/example_simple.diff.json) for sample files. +Replace `${HOME}` with the host directory you want to expose under `/mnt`. The default image targets the `DNA` alphabet `{A,C,G,T}`; the `DNA5` and `Protein` variants are invoked as `metagraph_DNA5` / `metagraph_Protein` from the same image. All published images are listed [here](https://github.com/ratschlab/metagraph/pkgs/container/metagraph). + +
+Advanced Docker usage + +Inspect what's mounted in the container: -### Get stats -Stats for graph ```bash -./metagraph stats graph.dbg +docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master ls /mnt ``` -Stats for annotation + +Run the `DNA5` / `Protein` variants: + ```bash -./metagraph stats -a annotation.column.annodbg +docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \ + metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa ``` -Stats for both + +Drop into an interactive shell for multi-step workflows: + ```bash -./metagraph stats -a annotation.column.annodbg graph.dbg +docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master +# inside container: +metagraph --version +metagraph build -v -k 31 -o /mnt/graph /mnt/data.fa.gz ``` -## Developer Notes +
+ +### Install from sources -### Build a docker container +See the [installation guide](https://metagraph.ethz.ch/static/docs/installation.html#install-from-source) for custom alphabet / debug builds. -Simply run `docker build .` +## Contributing -### Makefile +
+Developer notes: docker build, releases -The `Makefile` in the top level source directory can be used to build and test `metagraph` more conveniently. The following -arguments are supported: -* `env`: environment in which to compile/run (`""`: on the host, `docker`: in a docker container) -* `alphabet`: compile metagraph for a certain alphabet (e.g. `DNA` or `Protein`, default `DNA`) -* `additional_cmake_args`: additional arguments to pass to cmake. +**Build a docker container.** Run `docker build .` -Examples: +**Releases.** Three steps: 1) bump the version in `package.json`; 2) tag the commit with that version; 3) create a GitHub release. -``` -# compiles metagraph in a docker container for the `DNA` alphabet -make build-metagraph env=docker alphabet=DNA -``` +
-### Update and create a new release +## Citation + +If MetaGraph or its index resources are useful in your work, please cite: -Creating a new version release is done in three steps: +> Karasikov M, Mustafa H, Danciu D, Kulkov O, Zimmermann M, Barber C, Rätsch G, Kahles A. Efficient and accurate search in petabase-scale sequence repositories. *Nature*. 2025;647: 1036–1044. -1. Update package.json and set the version -2. Add a tag with that new version -3. Make a new release on github +
+BibTeX + +```bibtex +@article{karasikov2025metagraph, + title={Efficient and accurate search in petabase-scale sequence repositories}, + author={Karasikov, Mikhail and Mustafa, Harun and Danciu, Daniel and Kulkov, Oleksandr and Zimmermann, Marc and Barber, Christopher and R{\"a}tsch, Gunnar and Kahles, Andr{\'e}}, + journal={Nature}, + volume={647}, + number={8091}, + pages={1036--1044}, + year={2025}, + publisher={Nature Publishing Group}, + doi={10.1038/s41586-025-09603-w} +} +``` + +
## License -Metagraph is distributed under the GPLv3 License (see LICENSE). -Please find further information in the AUTHORS and COPYRIGHTS files. + +MetaGraph is distributed under the GPLv3 License (see [LICENSE](LICENSE)). See also [AUTHORS](AUTHORS) and [COPYRIGHT](COPYRIGHT). diff --git a/metagraph/docs/source/images/metagraph_logo.svg b/metagraph/docs/source/images/metagraph_logo.svg new file mode 100644 index 0000000000..8dd4605e67 --- /dev/null +++ b/metagraph/docs/source/images/metagraph_logo.svg @@ -0,0 +1,111 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/metagraph/docs/source/images/metagraph_logo_dark.svg b/metagraph/docs/source/images/metagraph_logo_dark.svg new file mode 100644 index 0000000000..295fbf6fad --- /dev/null +++ b/metagraph/docs/source/images/metagraph_logo_dark.svg @@ -0,0 +1,111 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/metagraph/docs/source/quick_start.rst b/metagraph/docs/source/quick_start.rst index bf856ed76f..34c2e59c97 100644 --- a/metagraph/docs/source/quick_start.rst +++ b/metagraph/docs/source/quick_start.rst @@ -676,6 +676,16 @@ The conversion to ``Multi-BRWT`` can be done either changing the value passed with flag ``--subsample ``. The 1M rows subsampled by default are usually enough even for very large annotations. Increasing this value usually does not lead to any significantly better compression. +.. 
note:: + Rough RAM requirements of the two stages above: + + * computing the column clustering (linkage) needs about ``N*R/8 + 6*N^2`` bytes, where ``N`` + is the number of columns (labels) and ``R`` the number of subsampled rows + (flag ``--subsample``, 1M by default); + * constructing the Multi-BRWT needs about ``M*V/8 + Size(BRWT)`` bytes, where ``M`` is the + number of rows in the annotation and ``V`` the number of nodes processed in parallel + (flag ``--parallel-nodes``). + Finally, the internal structure of the BRWT tree can be relaxed (which is always recommended to do) to increase the arity of its internal nodes and enhance the compression::