
TetriServe: Serve your DiT models like Tetris! 🧩

arXiv ASPLOS 2026 License Python 3.10+ CUDA 12.4+

Multi-GPU diffusion model serving with dynamic GPU allocation and SLO-aware scheduling.
Pack inference requests like Tetris — maximize GPU utilization, meet latency SLOs.


Overview

TetriServe is a serving system for diffusion models (FLUX, SD3) that dynamically allocates GPUs per request using sequence-parallel inference. Rather than fixing a static degree of parallelism, TetriServe assigns each request the optimal number of GPUs at runtime based on its resolution and latency SLO — achieving up to 32% higher SLO attainment than fixed sequence-parallel baselines.

Key ideas:

  • 🔲 Dynamic GPU allocation — each request gets the right number of GPUs based on resolution × latency SLO
  • SLO-aware scheduling — dyn_slo_schedule_global_dp packs requests to maximize SLO attainment
  • 📐 Supports FLUX and SD3 on multi-GPU clusters (tested on 8× H100 80GB, 4× A40 48GB)

(Animation) Three DiT requests with different resolutions and SLOs. Fixed sequence parallelism (xDiT SP=1, SP=4) misses multiple deadlines; TetriServe packs requests like Tetris 🧩, adapting SP per step, and meets all SLOs.
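The per-request allocation idea can be sketched as follows. This is a minimal illustration, not TetriServe's actual scheduler: the latency model and scaling-efficiency numbers below are made-up assumptions (the real system loads a fitted performance model from --perf_model_dir).

```python
# Illustrative sketch: pick the smallest sequence-parallel (SP) degree that
# meets a request's latency SLO. All numbers here are hypothetical.

# Assumed scaling efficiency of sequence parallelism at each SP degree.
SP_EFFICIENCY = {1: 1.0, 2: 0.9, 4: 0.8, 8: 0.7}

def predicted_latency(single_gpu_latency_s: float, sp: int) -> float:
    """Predicted latency when one request is split across sp GPUs."""
    return single_gpu_latency_s / (sp * SP_EFFICIENCY[sp])

def choose_sp_degree(single_gpu_latency_s: float, slo_s: float, max_gpus: int = 8) -> int:
    """Smallest SP degree whose predicted latency meets the SLO,
    leaving the remaining GPUs free for other requests."""
    for sp in (1, 2, 4, 8):
        if sp <= max_gpus and predicted_latency(single_gpu_latency_s, sp) <= slo_s:
            return sp
    return max_gpus  # best effort: SLO unattainable even at full parallelism

# Under these assumed numbers, a request taking ~16 s on one GPU with a
# 10 s SLO gets 2 GPUs; a relaxed 20 s SLO needs only 1.
```

Smaller or slack-SLO requests release GPUs for tighter ones — the "Tetris packing" intuition.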


Results

Evaluated on FLUX.1-dev (8× H100 80GB) and SD3 (4× A40 48GB) against fixed sequence-parallel baselines (xDiT SP=1/2/4/8) and a resolution-specific SP oracle (RSSP).

SLO Attainment Ratio (SAR) — FLUX on 8× H100:

SLO Scale   Best Baseline (SAR)   TetriServe (SAR)   Gain
1.0×        ~74%                  ~76%               +2 pp
1.1×        ~65%                  ~83%               +18 pp (+28% relative)
1.2×        ~69%                  ~91%               +22 pp (+32% relative)
1.5×        ~80%                  ~95%               +15 pp
  • +10% avg SAR over best baseline on Uniform workload mix
  • +15% avg SAR over best baseline on Skewed (large-resolution-heavy) workload mix
  • Gains are largest at tight SLOs (1.0×–1.2×) where fixed SP is most constrained
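SLO Attainment Ratio (SAR) is the fraction of requests that finish within their latency SLO. A minimal computation, with toy numbers (not paper data):

```python
# SAR = (# requests whose end-to-end latency meets its SLO) / (# requests)

def sar(latencies_s, slos_s):
    """Return the fraction of requests whose latency meets its SLO."""
    met = sum(1 for lat, slo in zip(latencies_s, slos_s) if lat <= slo)
    return met / len(latencies_s)

# Three of four toy requests meet their SLOs -> SAR = 0.75
print(sar([8.0, 15.0, 22.0, 5.0], [10.0, 16.0, 20.0, 6.0]))
```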

See the paper for full evaluation details, ablation studies, and A40/SD3 results.


Installation

Requirements: Python 3.10+, CUDA 12.4+, 2+ NVIDIA GPUs

git clone https://github.com/DiT-Serving/TetriServe && cd TetriServe

# Create venv and install
uv venv .venv --python 3.10
uv pip install -p .venv torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
uv pip install -p .venv -e ".[diffusers]"
uv pip install -p .venv --no-build-isolation "flash-attn==2.6.3"

# Verify
uv run --no-sync python -c "import tetriserve; print('OK')"
Docker
docker build -t tetriserve .

docker run --gpus all --ipc=host --network host --rm \
  -v $(pwd):/workspaces/TetriServe \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  tetriserve

Quick Start

# Login to HuggingFace (for model weights)
huggingface-cli login

# Launch TetriServe (4 GPUs, SD3)
uv run --no-sync python -m tetriserve.server.launcher \
  --nnodes 1 --nproc_per_node 4 \
  --master_addr 127.0.0.1 --master_port 1037 \
  --perf_model_dir log/benchmark/scaling_efficiency/ \
  --schedule_logic "fcfs_optimal_shape" \
  --output_type 'pil' \
  --model "stabilityai/stable-diffusion-3-medium-diffusers"

Send a request:

curl -X POST "http://localhost:1037/v1/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A beautiful sunset over mountains",
    "height": 1024, "width": 1024,
    "num_inference_steps": 20,
    "latency_threshold": 16
  }'
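The same request can be sent from Python. A minimal stdlib-only client sketch; the endpoint and field names are taken from the curl example above, and the defaults are illustrative:

```python
# Minimal client for the /v1/generate endpoint shown above.
import json
import urllib.request

def build_generate_request(prompt: str, height: int = 1024, width: int = 1024,
                           num_inference_steps: int = 20,
                           latency_threshold: float = 16) -> dict:
    """Assemble the JSON body used by the generate endpoint."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "latency_threshold": latency_threshold,  # per-request latency SLO (seconds)
    }

def generate(payload: dict, url: str = "http://localhost:1037/v1/generate") -> bytes:
    """POST the request to a running TetriServe server; return the raw response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

payload = build_generate_request("A beautiful sunset over mountains")
# generate(payload)  # requires a running server; uncomment to send
```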

Benchmarking

Reproduce the paper results on 8× H100:

# Baseline (data-parallel, fixed SP)
python benchmark/serving_benchmark/benchmark_serving.py \
  --config benchmark/serving_benchmark/configs/benchmark_flux_H100_baseline.yaml

# TetriServe (dynamic SLO-aware scheduling)
python benchmark/serving_benchmark/benchmark_serving.py \
  --config benchmark/serving_benchmark/configs/benchmark_flux_H100_tetris_dp.yaml

Configs are also provided for A40, L40, and SD3; see benchmark/serving_benchmark/configs/.


Documentation


Citation

@inproceedings{tetriserve2026,
  title     = {TetriServe: Efficiently Serving Mixed DiT Workloads},
  author    = {Runyu Lu and Shiqi He and Wenxuan Tan and Shenggui Li and Ruofan Wu and Jeff J. Ma and Ang Chen and Mosharaf Chowdhury},
  booktitle = {The 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26), Volume 2},
  year      = {2026},
  url       = {https://arxiv.org/abs/2510.01565}
}

Acknowledgements

TetriServe builds on the following open-source projects:

  • xDiT 0.4.3 — sequence-parallel DiT inference
  • vLLM — serving system design patterns
  • SGLang — scheduler design
