diff --git a/AUTHORS b/AUTHORS
index 2fee14e35f..ece0e9df3a 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -5,9 +5,12 @@ Metagraph was conceived and written by:
Andre Kahles
Further contributors/collaborators:
+ Oleksandr Kulkov
Daniel Danciu
+ Marc Zimmermann
Christopher Barber
Contributing students:
Sara Javadzadeh No
Jan Studeny
+ Thomas Zhou
diff --git a/COPYRIGHT b/COPYRIGHT
index 1768c99bc3..a8b0d9dfdd 100644
--- a/COPYRIGHT
+++ b/COPYRIGHT
@@ -1,6 +1,6 @@
Metagraph is provided free of charge under the GPLv3 license:
-Copyright (c) 2014-2019, Mikhail Karasikov, Harun Mustafa, Andre Kahles, Gunnar Raetsch
+Copyright (c) 2014-2026, Mikhail Karasikov, Harun Mustafa, Oleksandr Kulkov, Andre Kahles, Gunnar Raetsch
(for further author information, please refer to the AUTHORS file)
All rights reserved.
diff --git a/README.md b/README.md
index d6ec44b2db..f0d4e9aabd 100644
--- a/README.md
+++ b/README.md
@@ -1,316 +1,342 @@
-# Metagenome Graph Project
+
+
+
+
+
+
+# MetaGraph: Metagenome Graph Project
+
+[](#quick-start)
[](https://github.com/ratschlab/metagraph/releases)
+[](https://bioconda.github.io/recipes/metagraph/README.html)
[](https://bioconda.github.io/recipes/metagraph/README.html)
-[](#conda)
+[](#1-install)
[](#docker)
[](#install-from-sources)
-[](https://metagraph.ethz.ch/static/docs/index.html)
-
-MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment.
-
-The default index representations in MetaGraph are extremely scalable and support building graphs with trillions of nodes and millions of annotation labels.
-At the same time, the provided workflows and their careful implementation, combined with low-level optimizations of the core data structures, enable exceptional query and alignment performance.
+[](https://doi.org/10.1038/s41586-025-09603-w)
+[](https://metagraph.ethz.ch/static/docs/index.html)
-#### Main features:
-* Large-scale indexing of sequences
-* [Python API](https://metagraph.ethz.ch/static/docs/api.html) for querying in the server mode
-* Encoding [**k-mer counts**](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-counts) (e.g., expression values) and [**k-mer coordinates**](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-coordinates) in source sequences (e.g., for lossless encoding of genomes)
-* **Sequence alignment** against very large annotated graphs (sub-k seeding allows using arbitrarily short seeds)
-* Scalable cleaning of very large de Bruijn graphs (to remove sequencing errors)
-* Support for custom alphabets (e.g., {A,C,G,T,N} or amino acids)
-* Algorithms for [differential assembly](https://metagraph.ethz.ch/static/docs/sequence_assembly.html#differential-assembly)
+**Scalable indexing and querying of annotated genome graphs, from a handful of genomes to petabase-scale sequence repositories.**
-#### Design choices in MetaGraph:
-* Use of succinct data structures and efficient representation schemes for extremely high scalability
-* Algorithmic choices that work efficiently with succinct data structures (e.g., always prefer batched operations)
-* Modular support of different graph and annotation representations
-* Use of generic and extensible interfaces to support adding custom index representations / algorithms with little code overhead.
+Think of MetaGraph as a search engine for sequencing data: index your reads or assemblies once, then query, align, or recover source positions in milliseconds.
-## Documentation
-Online documentation is available at https://metagraph.ethz.ch/static/docs/index.html. Offline sources are [here](metagraph/docs/source).
+A MetaGraph index has two components: a **de Bruijn graph** that stores all *k*-mers extracted from the input sequences, and an **annotation matrix** that links each *k*-mer to its source labels (for example the sample, sequence header, expression level, or position).
-## Citation
+```mermaid
+flowchart LR
+ F["FASTA / FASTQ reads, assemblies, transcripts"]
+ F -->|build| G[("de Bruijn graph graph.dbg")]
+ F -->|annotate| M[("annotation matrix *.annodbg")]
+ G --> Q(("query / align"))
+ M --> Q
+ Qseq["query sequence"] --> Q
+ Q --> R["matches labels, counts, positions"]
+```
-If you are using MetaGraph or the index resources for your work, please cite:
+### Features
-> Karasikov M, Mustafa H, Danciu D, Kulkov O, Zimmermann M, Barber C, Rätsch G, Kahles A. Efficient and accurate search in petabase-scale sequence repositories. *Nature*. 2025;647: 1036–1044.
-> https://www.nature.com/articles/s41586-025-09603-w
+- **Search in public archives.** [metagraph.ethz.ch](https://metagraph.ethz.ch) hosts a search engine over 56 petabases of public sequencing data. See [MetaGraph Online](#metagraph-online).
+- **Index your own data.** Build a *k*-mer index over reads, assemblies, or transcripts; query for matching labels.
+- **Optional per-*k*-mer payloads.** Attach [counts](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-counts) (abundance, e.g. expression or coverage) or [coordinates](https://metagraph.ethz.ch/static/docs/quick_start.html#index-k-mer-coordinates) (positions that losslessly encode source sequences and return per-target hit positions).
+- **Sequence alignment.** Align sequences to the full annotated graph, with sub-*k* seeding for arbitrarily short queries.
+- **Scalable graph cleaning.** Strip sequencing errors out of very large de Bruijn graphs.
+- **[Differential assembly](https://metagraph.ethz.ch/static/docs/sequence_assembly.html#differential-assembly).** Extract sequences present in one group of samples and absent from another.
+- **[Python API & HTTP server](https://metagraph.ethz.ch/static/docs/api.html).** Drive MetaGraph from Python or query a running instance over HTTP.
-BibTeX
+Under the hood
+
+- **Succinct data structures**: the default `succinct` (BOSS) graph representation uses only 2–4 bits per *k*-mer.
+- **Modular annotation formats**: `ColumnCompressed`, `RowDiff`, `RowSparse`, `Rainbowfish`, plus count- and coordinate-aware variants. Pick the compression/speed tradeoff that fits your scale.
+- **Custom alphabets**: `{A,C,G,T}`, `{A,C,G,T,N}`, amino acids, case-sensitive DNA, or compile-time custom alphabets.
+- **Extensible by design**: generic interfaces let developers add new graph/annotation representations or algorithms with little code.
+- **Memory-mapped loading**: pass `--mmap` to any subcommand for fast cold start and low query-time RAM (NVMe recommended; SSD works but is slower).
+- **Scales to trillions of *k*-mers and millions of labels**: petabase-scale collections have been indexed end-to-end.
-```bibtex
-@article{karasikov2025metagraph,
- title={Efficient and accurate search in petabase-scale sequence repositories},
- author={Karasikov, Mikhail and Mustafa, Harun and Danciu, Daniel and Kulkov, Oleksandr and Zimmermann, Marc and Barber, Christopher and R{\"a}tsch, Gunnar and Kahles, Andr{\'e}},
- journal={Nature},
- volume={647},
- number={8091},
- pages={1036--1044},
- year={2025},
- publisher={Nature Publishing Group},
- doi={10.1038/s41586-025-09603-w}
-}
-```
-## Install
+> **Full documentation:** <https://metagraph.ethz.ch/static/docs/index.html> (offline copy in [`metagraph/docs/source`](metagraph/docs/source)).
-### Conda
+## MetaGraph Online
-Install the [latest release](https://github.com/ratschlab/metagraph/releases/latest) on Linux or Mac OS X with Anaconda:
+Try MetaGraph without installing anything: [metagraph.ethz.ch](https://metagraph.ethz.ch) hosts a search engine over indexes built from 56 petabases of public DNA, RNA, and protein sequencing data, including RefSeq, UHGG, Tara Oceans, UniParc, and more (see the [databases list](https://metagraph.ethz.ch/indexes)). Paste a query sequence and pick the indexes to search.
-```
-conda install -c bioconda -c conda-forge metagraph
-```
+Prefer local analysis? The prebuilt indexes are also published on [AWS Open Data](https://registry.opendata.aws/metagraph/) for download (no AWS account needed) and offline querying. See [Preconstructed indexes](https://metagraph.ethz.ch/static/docs/resources.html#preconstructed-indexes).
-### Docker
+## Quick start
-If docker is available on the system, immediately get started with
+### 1. Install
+```bash
+conda create -n metagraph python
+conda activate metagraph
+conda install -c bioconda -c conda-forge metagraph
+pip install "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"
```
-docker pull ghcr.io/ratschlab/metagraph:master
-docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
- metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa
-```
-and replace `${HOME}` with a directory on the host system to map it under `/mnt` in the container.
-By default, it executes the binary compiled for the `DNA` alphabet {A,C,G,T}.
-To run the binary compiled for the `DNA5` or `Protein` alphabet, just replace `metagraph` with `metagraph_DNA5` or `metagraph_Protein`, respectively, e.g.:
-```
-docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
- metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa
-```
+`metagraph` is the compiled binary; `metagraph-workflows` is the Python wrapper that drives the full build pipeline.
-One can see that running MetaGraph with docker is very easy. Also, the following command (or similar) may be handy to see what directory is mounted in the container:
-```
-docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master ls /mnt
-```
+### 2. Build an index
-For more complex workflows, consider running docker in the interactive mode:
-```
-$ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master
+Clone the repo for the bundled test data, then build:
+
+```bash
+# Only to fetch the bundled example *.fa files; skip this if you have your own data
+git clone https://github.com/ratschlab/metagraph.git && cd metagraph
-root@5c42291cc9cf:/# ls /mnt/
-root@5c42291cc9cf:/# metagraph --version
+metagraph-workflows build <(ls metagraph/tests/data/*.fa) -o out/ --primary
```
-All different versions of the container image are listed [here](https://github.com/ratschlab/metagraph/pkgs/container/metagraph).
+`--primary` indexes one strand per *k*-mer pair (about half the size; appropriate when read strand orientation is unknown, e.g. typical short-read sequencing).
-### Install From Sources
+Internally this chains `metagraph build → annotate → row-diff transform → BRWT clustering → BRWT relaxation` and produces `graph.dbg`, the more compact `graph_small.dbg` (smaller but slower to access, useful when RAM or storage is tight), and the default annotation `graph.relax.row_diff_brwt.annodbg` (`RowDiff`). See the [Quick start guide](https://metagraph.ethz.ch/static/docs/quick_start.html) in the docs for a step-by-step walkthrough of index construction.
-To compile from source (e.g., for builds with custom alphabet or other configurations), see [documentation online](https://metagraph.ethz.ch/static/docs/installation.html#install-from-source).
+
+Real-workload example with file list and hardware budget
+
+```bash
+metagraph-workflows build samples.txt -o out/ --primary \
+ -p 34 --mem-gb 70 --disk-swap-dir /scratch/swap
+```
+The positional argument accepts a file list (one path per line), a directory, or process substitution. Per-stage memory caps, thread packing, and BRWT parameters are derived from the budget. See `metagraph-workflows build --help` for all options.
-## Quick start: build an index in one command
+
-For most users, the easiest entry point is the Snakemake wrapper, which
-runs the full indexing pipeline — graph construction, annotation, and
-all row-diff / BRWT transforms — as a single command.
+### 3. Query
-The wrapper ships as a separate Python package; the `metagraph` conda
-recipe only installs the C++ binary, so the workflow CLI needs an extra
-`pip install` step:
+Query the index (any fasta works as input; the bundled test data is fine for a smoke test):
```bash
-conda install -c bioconda -c conda-forge metagraph # the metagraph binary
-pip install -U "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"
+metagraph query --query-mode matches -p 8 \
+ -i out/graph.dbg -a out/graph.relax.row_diff_brwt.annodbg \
+ metagraph/tests/data/transcripts_100.fa
```
-Then run the full pipeline as:
+Other ways to use the index:
+
+- [`metagraph align`](https://metagraph.ethz.ch/static/docs/sequence_search.html#sequence-to-graph-alignment): align sequences to the graph and report the alignment itself. With a coordinate-aware annotator it acts as a read mapper, returning source positions.
+- `metagraph query --align`: align each query to the graph first, then report the *labels* of the best-scoring alignment instead of requiring exact *k*-mer matches. Useful for divergent or noisy queries. (So `align` returns alignments/positions; `query --align` returns labels found via alignment.)
+- [`metagraph server_query`](https://metagraph.ethz.ch/static/docs/api.html): start an HTTP server that answers queries over a REST API (also drivable from the Python client).
+
+The [Minimal example](#minimal-example) below walks through each step on a smaller dataset.
+
+## Minimal example
+
+
+Step-by-step walkthrough with the metagraph CLI (build → annotate → query → stats)
+
+A hands-on demo using `metagraph` directly (no workflow wrapper). A *label* is whatever tag you want each *k*-mer associated with: a filename, a fasta header, or a custom string. `--anno-header` below produces one label per fasta record.
```bash
-metagraph-workflows build samples.txt -o out/ --primary
+cd metagraph/tests/data
+
+# 1. Build a de Bruijn graph (k-mer index) with k=31
+metagraph build -v -p 4 -k 31 -o samples metasub_fake_data_simple.fa
+
+# 2. Construct an annotation matrix linking each k-mer to its source label.
+metagraph annotate -v -p 4 -i samples.dbg --anno-header \
+ -o samples metasub_fake_data_simple.fa
+
+# 3. Query: for each query sequence, report all labels whose k-mers cover ≥80% of it.
+metagraph query --query-mode matches \
+ -i samples.dbg -a samples.column.annodbg \
+ --min-kmers-fraction-label 0.8 \
+ metasub_fake_data_simple.fa
+
+# 4. Print graph + annotation stats.
+metagraph stats -a samples.column.annodbg samples.dbg
```
-`samples.txt` is a text file listing your input sample paths (one per
-line); a directory of sample files works just as well. `out/` will
-contain `graph.dbg`, `graph_small.dbg`, and the requested annotation
-artifacts. You can also feed a list inline with process substitution:
+Outputs `samples.dbg` (the de Bruijn graph) and `samples.column.annodbg` (3 labels, one per fasta record). Sample query output (tab-separated: query index, query header, then one `
-Tell the workflow how much hardware to use; everything else (memory
-caps per stage, per-column buffer sizing, BRWT clustering parameters)
-is derived automatically:
+## More recipes
+
+
+Direct CLI: align / assemble / differential assembly / stats
```bash
-metagraph-workflows build samples.txt -o out/ --primary \
- -p 34 --mem-gb 70 --disk-swap-dir /scratch/swap
-```
+# Build a de Bruijn graph from a fasta (single sample)
+metagraph build -v -p 8 -k 31 --mem-cap-gb 10 -o graph data.fa.gz
-Useful switches:
-- `-p N` — maximum CPU cores to use (defaults to all cores)
-- `--mem-gb GB` — approximate RAM budget per rule (default 16)
-- `--disk-swap-dir DIR` — directory for on-disk spill buffers
-- `--primary` — build a primary graph (recommended for most workloads)
-- `--anno-type FMT` — request a specific annotation format
- (repeat for multiple outputs); the default is `relax.row_diff_brwt`
-- `--with-counts` / `--with-coords` — count- or coordinate-aware
- annotation (mutually exclusive)
-- `--graph EXISTING.dbg` — reuse an already-built graph and only run
- the annotation + transforms
-
-See `metagraph/workflows/README.rst` for setup and the full option
-list (`metagraph-workflows build --help`).
-
-
-## Typical workflow
-1. Build de Bruijn graph from Fasta files, FastQ files, or [KMC k-mer counters](https://github.com/refresh-bio/KMC/):\
-`./metagraph build`
-2. Annotate graph using the column compressed annotation:\
-`./metagraph annotate`
-3. Transform the built annotation to a different annotation scheme:\
-`./metagraph transform_anno`
-4. Query annotated graph\
-`./metagraph query`
-
-### Example
-```
-DATA="../tests/data/transcripts_1000.fa"
+# Build using disk swap (limits RAM at the cost of disk I/O)
+metagraph build -v -p 8 -k 31 --mem-cap-gb 10 --disk-swap /scratch/swap -o graph data.fa.gz
-./metagraph build -k 12 -o transcripts_1000 $DATA
+# Annotate a graph with file-based labels (one label per input file)
+metagraph annotate -v -p 8 --anno-filename -i graph.dbg -o annotation data.fa.gz
-./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA
+# Align sequences to the graph (plain sequence-to-graph alignment, no labels).
+metagraph align -v -i graph.dbg query.fa
-./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA
+# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
+# k-mer of the reported path lies in every reported label.
+metagraph align -v -i graph.dbg -a annotation.row_diff_brwt.annodbg reads.fq
-./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg
-```
+# Same, but with a coordinate-aware annotator: the walk is additionally
+# coordinate-consistent (positions step by ±1 per node), so hits map to
+# source positions. Functions as a read mapper over indexed genomes.
+metagraph align -v -i graph.dbg -a annotation.row_diff_brwt_coord.annodbg reads.fq
-### Print usage
-`./metagraph`
+# query --align: aligns each query to the graph WITHOUT label constraints,
+# then fetches labels for the highest-scoring walk. Use when exact k-mer
+# matching is too strict (divergent or noisy queries).
+metagraph query --align -i graph.dbg -a annotation.row_diff_brwt.annodbg query.fa
-### Build graph
+# Assemble unitigs from a graph (writes assembled.fasta.gz)
+metagraph assemble -v graph.dbg -o assembled --unitigs
-* #### Simple build
-```bash
-./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
- -o /graph /*.fasta.gz \
-2>&1 | tee /log.txt
-```
+# Differential assembly: JSON rules define which label groups must be present vs. absent.
+# Sample rule files: metagraph/tests/data/example.diff.json, example_simple.diff.json.
+metagraph assemble -v graph.dbg --unitigs \
+ -a annotation.column.annodbg \
+ --diff-assembly-rules diff_assembly_rules.json \
+ -o diff_assembled
-* #### Build with disk swap (use to limit the RAM usage)
-```bash
-./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap \
- -o /graph /*.fasta.gz \
-2>&1 | tee /log.txt
+# Stats: graph only, annotation only, or both
+metagraph stats graph.dbg
+metagraph stats -a annotation.column.annodbg
+metagraph stats -a annotation.column.annodbg graph.dbg
```
-#### Build from k-mers filtered with KMC
-```bash
-K=20
-./KMC/kmc -ci5 -t4 -k$K -m5 -fm .fasta.gz .cutoff_5 ./KMC
-./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph .cutoff_5.kmc_pre
-```
+See the [online docs](https://metagraph.ethz.ch/static/docs/index.html) for the full subcommand reference, alphabet builds, and the row-diff / BRWT clustering pipeline.
-### Annotate graph
-```bash
-./metagraph annotate -v --anno-type row --fasta-anno \
- -i primates.dbg \
- -o primates \
- ~/fasta_zurich/refs_chimpanzee_primates.fa
-```
+
-### Convert annotation to Multi-BRWT
-1) Cluster columns
-```bash
-./metagraph transform_anno -v --linkage --greedy \
- -o linkage.txt \
- --subsample R \
- -p NCORES \
- primates.column.annodbg
-```
-Requires `N*R/8 + 6*N^2` bytes of RAM, where `N` is the number of columns and `R` is the number of rows subsampled.
+
+Build from KMC k-mer counters (e.g., to filter low-abundance k-mers)
-2) Construct Multi-BRWT
-```bash
-./metagraph transform_anno -v -p NCORES --anno-type brwt \
- --linkage-file linkage.txt \
- -o primates \
- --parallel-nodes V \
- -p NCORES \
- primates.column.annodbg
-```
-Requires `M*V/8 + Size(BRWT)` bytes of RAM, where `M` is the number of rows in the annotation and `V` is the number of nodes merged concurrently.
+[KMC](https://github.com/refresh-bio/KMC) is an extremely efficient *k*-mer counter that can pre-process inputs and filter by abundance. Build a graph from *k*-mers occurring at least 5 times:
-### Query graph
```bash
-./metagraph query -v -i /graph.dbg \
- -a /annotation.column.annodbg \
- --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
- query_seq.fa
-```
+K=31
+# Count k-mers with KMC (drops k-mers with abundance < 5)
+./KMC/kmc -ci5 -t4 -k$K -m5 -fq input.fastq.gz input.cutoff_5 ./KMC
-### Align to graph
-```bash
-./metagraph align -v -i /graph.dbg query_seq.fa
+# Build the graph directly from the KMC counter
+metagraph build -v -p 4 -k $K -o graph input.cutoff_5.kmc_pre
```
-### Assemble sequences
+Add `--count-kmers` to keep the KMC abundance counts as a weight vector alongside the graph (`graph.dbg.weights`). The weighted graph is the input for [`metagraph clean`](https://metagraph.ethz.ch/static/docs/quick_start.html#graph-cleaning) and for indexing *k*-mer counts.
+
+
+
+
+Common query flags
+
```bash
-./metagraph assemble -v /graph.dbg \
- -o assembled.fa \
- --unitigs
+# Filter labels with low k-mer coverage (default 0.7)
+metagraph query --min-kmers-fraction-label 0.8 ...
+
+# Delimiter joining labels in `--query-mode labels` output (default ":")
+metagraph query --query-mode labels --labels-delimiter ", " ...
+
+# Per-k-mer presence/absence bitmask per matching label
+metagraph query --query-mode signature ...
+
+# Indexed k-mer counts (requires count-aware annotation, see online docs)
+metagraph query --query-mode counts ...
+
+# k-mer coordinates / per-target positions (requires coordinate-aware annotation)
+metagraph query --query-mode coords ...
+
+# Server mode (Python API / HTTP queries)
+metagraph server_query -i graph.dbg -a annotation.row_diff_brwt.annodbg --port 5555 --parallel 8
```
-### Assemble differential sequences
+
+
+## More install options
+
+The recommended conda install is in [Quick start](#quick-start). MetaGraph runs on Linux and macOS. Bioconda ships the `DNA` and `Protein` alphabets (`metagraph` is symlinked to `metagraph_DNA`); the Docker image adds `DNA5`. For other alphabets, build from source.
+
+### Docker
+
```bash
-./metagraph assemble -v /graph.dbg \
- --unitigs \
- -a /annotation.column.annodbg \
- --diff-assembly-rules diff_assembly_rules.json \
- -o diff_assembled.fa
+docker pull ghcr.io/ratschlab/metagraph:master
+docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
+ metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa
```
-See [`metagraph/tests/data/example.diff.json`](metagraph/tests/data/example.diff.json) and [`metagraph/tests/data/example_simple.diff.json`](metagraph/tests/data/example_simple.diff.json) for sample files.
+Replace `${HOME}` with the host directory you want to expose under `/mnt`. The default image targets the `DNA` alphabet `{A,C,G,T}`; the `DNA5` and `Protein` variants are invoked as `metagraph_DNA5` / `metagraph_Protein` from the same image. All published images are listed [here](https://github.com/ratschlab/metagraph/pkgs/container/metagraph).
+
+
+Advanced Docker usage
+
+Inspect what's mounted in the container:
-### Get stats
-Stats for graph
```bash
-./metagraph stats graph.dbg
+docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master ls /mnt
```
-Stats for annotation
+
+Run the `DNA5` / `Protein` variants:
+
```bash
-./metagraph stats -a annotation.column.annodbg
+docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
+ metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa
```
-Stats for both
+
+Drop into an interactive shell for multi-step workflows:
+
```bash
-./metagraph stats -a annotation.column.annodbg graph.dbg
+docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master
+# inside container:
+metagraph --version
+metagraph build -v -k 31 -o /mnt/graph /mnt/data.fa.gz
```
-## Developer Notes
+
+
+### Install from sources
-### Build a docker container
+See the [installation guide](https://metagraph.ethz.ch/static/docs/installation.html#install-from-source) for custom alphabet / debug builds.
-Simply run `docker build .`
+## Contributing
-### Makefile
+
+Developer notes: docker build, releases
-The `Makefile` in the top level source directory can be used to build and test `metagraph` more conveniently. The following
-arguments are supported:
-* `env`: environment in which to compile/run (`""`: on the host, `docker`: in a docker container)
-* `alphabet`: compile metagraph for a certain alphabet (e.g. `DNA` or `Protein`, default `DNA`)
-* `additional_cmake_args`: additional arguments to pass to cmake.
+**Build a docker container.** Run `docker build .`
-Examples:
+**Releases.** Three steps: 1) bump the version in `package.json`; 2) tag the commit with that version; 3) create a GitHub release.
-```
-# compiles metagraph in a docker container for the `DNA` alphabet
-make build-metagraph env=docker alphabet=DNA
-```
+
-### Update and create a new release
+## Citation
+
+If MetaGraph or its index resources are useful in your work, please cite:
-Creating a new version release is done in three steps:
+> Karasikov M, Mustafa H, Danciu D, Kulkov O, Zimmermann M, Barber C, Rätsch G, Kahles A. Efficient and accurate search in petabase-scale sequence repositories. *Nature*. 2025;647: 1036–1044.
-1. Update package.json and set the version
-2. Add a tag with that new version
-3. Make a new release on github
+
+BibTeX
+
+```bibtex
+@article{karasikov2025metagraph,
+ title={Efficient and accurate search in petabase-scale sequence repositories},
+ author={Karasikov, Mikhail and Mustafa, Harun and Danciu, Daniel and Kulkov, Oleksandr and Zimmermann, Marc and Barber, Christopher and R{\"a}tsch, Gunnar and Kahles, Andr{\'e}},
+ journal={Nature},
+ volume={647},
+ number={8091},
+ pages={1036--1044},
+ year={2025},
+ publisher={Nature Publishing Group},
+ doi={10.1038/s41586-025-09603-w}
+}
+```
+
+
## License
-Metagraph is distributed under the GPLv3 License (see LICENSE).
-Please find further information in the AUTHORS and COPYRIGHTS files.
+
+MetaGraph is distributed under the GPLv3 License (see [LICENSE](LICENSE)). See also [AUTHORS](AUTHORS) and [COPYRIGHT](COPYRIGHT).
diff --git a/metagraph/docs/source/images/metagraph_logo.svg b/metagraph/docs/source/images/metagraph_logo.svg
new file mode 100644
index 0000000000..8dd4605e67
--- /dev/null
+++ b/metagraph/docs/source/images/metagraph_logo.svg
@@ -0,0 +1,111 @@
+
+
diff --git a/metagraph/docs/source/images/metagraph_logo_dark.svg b/metagraph/docs/source/images/metagraph_logo_dark.svg
new file mode 100644
index 0000000000..295fbf6fad
--- /dev/null
+++ b/metagraph/docs/source/images/metagraph_logo_dark.svg
@@ -0,0 +1,111 @@
+
+
diff --git a/metagraph/docs/source/quick_start.rst b/metagraph/docs/source/quick_start.rst
index bf856ed76f..34c2e59c97 100644
--- a/metagraph/docs/source/quick_start.rst
+++ b/metagraph/docs/source/quick_start.rst
@@ -676,6 +676,16 @@ The conversion to ``Multi-BRWT`` can be done either
changing the value passed with flag ``--subsample ``. The 1M rows subsampled by default are usually enough
even for very large annotations. Increasing this value usually does not lead to any significantly better compression.
+.. note::
+ Rough RAM requirements of the two stages above:
+
+ * computing the column clustering (linkage) needs about ``N*R/8 + 6*N^2`` bytes, where ``N``
+ is the number of columns (labels) and ``R`` the number of subsampled rows
+ (flag ``--subsample``, 1M by default);
+ * constructing the Multi-BRWT needs about ``M*V/8 + Size(BRWT)`` bytes, where ``M`` is the
+ number of rows in the annotation and ``V`` the number of nodes processed in parallel
+ (flag ``--parallel-nodes``).
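+
+   As a rough worked example with hypothetical sizes (assumed for illustration, not measured):
+   for ``N = 1000`` columns and the default ``R = 10^6`` subsampled rows, the linkage stage
+   would need about
+
+   .. math::
+
+      \frac{N R}{8} + 6 N^2 = \frac{10^3 \cdot 10^6}{8} + 6 \cdot (10^3)^2
+      = 1.25 \times 10^8 + 6 \times 10^6 \approx 1.31 \times 10^8 \ \text{bytes} \approx 131 \ \text{MB},
+
+   and for an assumed ``M = 10^9`` rows with ``V = 16`` nodes processed in parallel, the
+   construction stage would need about
+
+   .. math::
+
+      \frac{M V}{8} + \mathrm{Size(BRWT)} = \frac{10^9 \cdot 16}{8} + \mathrm{Size(BRWT)}
+      = 2 \times 10^9 \ \text{bytes} + \mathrm{Size(BRWT)} \approx 2 \ \text{GB} + \mathrm{Size(BRWT)}.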
+
Finally, the internal structure of the BRWT tree can be relaxed (which is always recommended to do) to increase
the arity of its internal nodes and enhance the compression::