boostr

ML framework in Rust. Write once. Run on any backend.

Production-grade LLM primitives — flash attention, quantization, MoE, state-space models, KV caching, and distributed training — built on numr so the same code runs on CPU, CUDA, and WebGPU.

boostr extends numr with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.

Why boostr

  • One codebase, every backend. Write once against numr's Runtime; run on CPU (SIMD), CUDA (PTX), or WebGPU (WGSL) by switching a feature flag — no per-device dispatch, no rewrite.
  • No vendor lock-in. Every kernel is native — no cuBLAS, cuDNN, or MKL. Flash attention, quantized matmul, and fused optimizers are all hand-written per backend.
  • Backends are a foundation concern. Hardware support lives in numr, so new backends added there flow up to boostr automatically — the abstraction is built in, not bolted on per device.
  • Train and serve from one stack. The same primitives power oxidizr training and blazr inference — no Python runtime, single-binary deployment.

Who it's for

  • LLM trainers — distributed training with ZeRO (stages 1/2/3), tensor and pipeline parallelism (1F1B, GPipe, ZeroBubble), mixed precision, and fused optimizers.
  • Inference engineers — flash attention v2/v3, paged KV cache, continuous batching, speculative decoding, and prefix caching for high-throughput serving.
  • Quantization & compression researchers — 26 GGUF-compatible formats with a dedicated QuantTensor type and per-backend dequant / quantized-matmul kernels.
  • Architecture researchers — LLaMA, Mamba2 (SSD kernels), and hybrid transformer/SSM models, with an extensible system for custom architectures.
  • WebAssembly & edge developers — the WebGPU backend targets consumer GPUs (Vulkan/Metal/DX12) with no CUDA dependency.

Key Capabilities

Quantization

  • 26 formats (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K–Q8_K, IQ1_S–IQ4_XS, TQ1_0, TQ2_0
  • QuantTensor type for block-quantized data
  • Per-backend kernels: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
  • Zero-copy GGUF loading with memory mapping
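
To make "block-quantized" concrete, here is a minimal std-only sketch of a Q8_0-style scheme: 32 values share one scale and are stored as signed 8-bit integers. It is an illustration, not boostr's `QuantTensor` or its kernels. `BlockQ8`, `quantize_q8`, and `dequantize_q8` are hypothetical names, and the scale is kept as `f32` here, whereas GGUF stores it as f16.

```rust
/// Values per block; 32 matches the GGUF Q8_0 block size.
const BLOCK: usize = 32;

/// One quantized block: a shared scale plus 32 signed 8-bit values.
/// Assumption: f32 scale for simplicity (GGUF uses f16).
#[derive(Debug, Clone)]
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

/// Quantize a slice whose length is a multiple of BLOCK.
fn quantize_q8(values: &[f32]) -> Vec<BlockQ8> {
    assert!(values.len() % BLOCK == 0, "length must be a multiple of 32");
    values
        .chunks_exact(BLOCK)
        .map(|chunk| {
            // The largest magnitude in the block maps to +/-127.
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; BLOCK];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, quants }
        })
        .collect()
}

/// Recover approximate f32 values: value = quant * scale.
fn dequantize_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}

fn main() {
    let input: Vec<f32> = (0..64).map(|i| (i as f32 - 32.0) * 0.1).collect();
    let blocks = quantize_q8(&input);
    let restored = dequantize_q8(&blocks);
    let max_err = input
        .iter()
        .zip(&restored)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("{} blocks, max abs error {max_err:.5}", blocks.len());
}
```

The round-trip error per element is bounded by half the block's scale, which is why smaller blocks or more bits per value trade size for accuracy.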

Attention

  • Flash Attention v2/v3 with fused QKV projection
  • Multi-Head Latent Attention (MLA) — compressed KV cache
  • Grouped Query Attention (GQA) and multi-head variants
  • Paged attention for memory-efficient inference
  • Variable-length attention with ragged tensors
  • Prefix caching for context reuse
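
The idea that lets flash attention avoid materializing the full score matrix is the online softmax: keys are processed tile by tile while only a running max, a running sum, and a running output are kept. The sketch below shows this for a single query row on plain `Vec<f32>` data. It is a CPU illustration of the technique, not boostr's kernel, and `attend_tiled` and `attend_naive` are hypothetical helper names.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference: softmax over all scores at once, then a weighted sum of values.
fn attend_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = weights.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w / sum * x;
        }
    }
    out
}

/// Tiled version: returns the output row and its log-sum-exp.
fn attend_tiled(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    tile: usize,
) -> (Vec<f32>, f32) {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        // Scores for this tile only.
        let scores: Vec<f32> = k_tile.iter().map(|k| dot(q, k) * scale).collect();
        let tile_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);
        // Rescale everything accumulated under the old max.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }
        for (s, v) in scores.iter().zip(v_tile) {
            let p = (s - new_max).exp();
            running_sum += p;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        running_max = new_max;
    }
    for a in acc.iter_mut() {
        *a /= running_sum;
    }
    (acc, running_max + running_sum.ln())
}

fn main() {
    let q = vec![0.5f32, -1.0, 2.0, 0.25];
    let keys: Vec<Vec<f32>> = (0..5)
        .map(|i| (0..4).map(|j| ((i * 4 + j) as f32 * 0.37).sin()).collect())
        .collect();
    let values: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32, 1.0 - i as f32]).collect();
    let (tiled, lse) = attend_tiled(&q, &keys, &values, 2);
    println!("tiled = {tiled:?}, lse = {lse}");
    println!("naive = {:?}", attend_naive(&q, &keys, &values));
}
```

Both functions should agree up to floating-point rounding regardless of the tile size. The log-sum-exp returned alongside the output is the kind of per-row statistic a backward pass can reuse.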

Position Encodings

  • RoPE with split-half and interleaved layouts, plus ALiBi attention biases
  • YaRN for length extrapolation
  • Efficient fused implementations on all backends
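
The two RoPE layouts differ only in which elements of a head vector are paired for rotation. The following sketch applies the standard rotation to one vector at one position. It is a scalar reference under the usual RoPE definition, not boostr's fused implementation, and `RopeLayout` and `apply_rope` are hypothetical names.

```rust
#[derive(Clone, Copy)]
enum RopeLayout {
    /// Pairs element i with element i + d/2.
    SplitHalf,
    /// Pairs element 2i with element 2i + 1.
    Interleaved,
}

/// Rotate one head vector in place for token position `pos`.
fn apply_rope(x: &mut [f32], pos: usize, base: f32, layout: RopeLayout) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    let half = d / 2;
    for i in 0..half {
        // Pair i rotates with frequency base^(-2i/d).
        let freq = base.powf(-2.0 * (i as f32) / (d as f32));
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (ia, ib) = match layout {
            RopeLayout::SplitHalf => (i, i + half),
            RopeLayout::Interleaved => (2 * i, 2 * i + 1),
        };
        let (a, b) = (x[ia], x[ib]);
        x[ia] = a * cos - b * sin;
        x[ib] = a * sin + b * cos;
    }
}

fn main() {
    let mut split = [1.0f32, 0.0, 0.0, 0.0];
    let mut inter = split;
    apply_rope(&mut split, 3, 10000.0, RopeLayout::SplitHalf);
    apply_rope(&mut inter, 3, 10000.0, RopeLayout::Interleaved);
    println!("split-half:  {split:?}");
    println!("interleaved: {inter:?}");
}
```

Because each pair is a pure rotation, the vector norm is unchanged. Weights trained with one layout need to be permuted before they can be used with the other.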

Model Architectures

  • LLaMA — standard and tensor-parallelized
  • Mamba2 — state space models with SSD kernels
  • Hybrid — mixed transformer/SSM models
  • Extensible architecture system for custom models

Neural Network Modules

  • Linear — standard and quantized variants
  • Embedding for token embeddings
  • LayerNorm, RMSNorm with fused implementations
  • MoE layers with expert routing and load balancing
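
As an illustration of expert routing, the sketch below picks the top-k experts for one token from its router logits and counts how many tokens land on each expert. It assumes softmax gating with the kept weights renormalized to sum to 1, which is one common convention and not necessarily the one boostr uses. `route_top_k` and `expert_load` are hypothetical names.

```rust
/// Return the k highest-probability experts as (expert index, weight),
/// with the kept weights renormalized to sum to 1.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over the expert logits.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    let mut ranked: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();
    // Highest probability first; the sort is stable, so ties keep index order.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);

    let kept: f32 = ranked.iter().map(|(_, p)| p).sum();
    for (_, p) in ranked.iter_mut() {
        *p /= kept;
    }
    ranked
}

/// Count how many tokens were routed to each expert.
fn expert_load(assignments: &[Vec<(usize, f32)>], num_experts: usize) -> Vec<usize> {
    let mut load = vec![0usize; num_experts];
    for token in assignments {
        for (expert, _) in token {
            load[*expert] += 1;
        }
    }
    load
}

fn main() {
    let router_logits = [[1.0f32, 3.0, 2.0, 0.0], [0.0, 0.0, 5.0, 4.0]];
    let assignments: Vec<_> = router_logits.iter().map(|l| route_top_k(l, 2)).collect();
    println!("assignments: {assignments:?}");
    println!("load per expert: {:?}", expert_load(&assignments, 4));
}
```

A load-balancing loss typically penalizes an uneven per-expert count like the one `expert_load` reports.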

Inference Infrastructure

  • Paged KV cache with block allocator for memory efficiency
  • Request scheduler with continuous batching
  • Prefix caching for prompt reuse
  • Speculative decoding with adaptive draft depth and verification kernels
  • Flash decoding for single-token decode (CUDA, auto-selected when the query length S_q = 1)
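
A paged KV cache stores each sequence's keys and values in fixed-size blocks and keeps a per-sequence block table, so memory is claimed one block at a time instead of being reserved for the maximum length. The sketch below models only that bookkeeping, with no tensors involved. It is not boostr's `PagedKvCache`, and `BlockAllocator` and `Sequence` are hypothetical types.

```rust
/// Hands out fixed-size physical blocks from a free list.
struct BlockAllocator {
    free: Vec<usize>,
    block_size: usize,
}

impl BlockAllocator {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        // Reversed so that pop() hands out block 0 first.
        BlockAllocator { free: (0..num_blocks).rev().collect(), block_size }
    }

    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn release(&mut self, block: usize) {
        self.free.push(block);
    }

    fn available(&self) -> usize {
        self.free.len()
    }
}

/// Per-sequence state: logical block i lives in physical block block_table[i].
#[derive(Default)]
struct Sequence {
    block_table: Vec<usize>,
    len: usize,
}

impl Sequence {
    /// Reserve a slot for one more token.
    /// Returns (physical block, offset in block), or None if memory is exhausted.
    fn append(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let offset = self.len % alloc.block_size;
        if offset == 0 {
            // The current block is full (or this is the first token).
            self.block_table.push(alloc.alloc()?);
        }
        self.len += 1;
        Some((*self.block_table.last()?, offset))
    }

    /// Return all blocks to the allocator when the request finishes.
    fn release(self, alloc: &mut BlockAllocator) {
        for block in self.block_table {
            alloc.release(block);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(3, 4);
    let mut seq = Sequence::default();
    for token in 0..5 {
        let slot = seq.append(&mut alloc);
        println!("token {token} -> {slot:?}");
    }
    println!("free blocks while running: {}", alloc.available());
    seq.release(&mut alloc);
    println!("free blocks after release: {}", alloc.available());
}
```

Because blocks return to the free list as soon as a request finishes, a scheduler doing continuous batching can admit a new request whenever enough blocks are available.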

Training

  • Optimizers: AdamW, Lamb, SGD with gradient clipping
  • Mixed precision (AMP) with automatic loss scaling
  • Gradient accumulation and checkpointing
  • Learning rate scheduling (warmup, cosine, linear decay)
  • Distributed training:
    • ZeRO stage 1/2/3 (optimizer-state/gradient/parameter sharding)
    • Tensor parallelism with communicators
    • Pipeline parallelism (1F1B, GPipe, ZeroBubble schedules)
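
For the learning rate scheduling item, a warmup followed by cosine decay can be written as a pure function of the step index. This is a generic sketch of that schedule shape, not boostr's scheduler API. `lr_at` and its parameters are hypothetical.

```rust
use std::f32::consts::PI;

/// Linear warmup to `peak` over `warmup` steps, then cosine decay to
/// `min_lr` at step `total`. Steps past `total` stay at `min_lr`.
fn lr_at(step: u32, warmup: u32, total: u32, peak: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Step 0 already gets a non-zero rate; step warmup-1 reaches peak.
        return peak * ((step + 1) as f32) / (warmup as f32);
    }
    let decay_steps = total.saturating_sub(warmup).max(1);
    let progress = (((step - warmup) as f32) / (decay_steps as f32)).min(1.0);
    let cosine = 0.5 * (1.0 + (PI * progress).cos());
    min_lr + (peak - min_lr) * cosine
}

fn main() {
    for step in [0, 50, 99, 100, 550, 1000] {
        println!("step {step:>4}: lr = {:.6}", lr_at(step, 100, 1000, 3e-4, 3e-5));
    }
}
```

Keeping the schedule stateless makes it straightforward to resume from a checkpoint, since the rate depends only on the global step.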

Model Loading

  • SafeTensors: Zero-copy memory-mapped loading
  • GGUF: Full format support with block-quantized tensors
  • Format auto-detection
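
Format detection between these two formats can be done from the first bytes of a file: GGUF files begin with the ASCII magic `GGUF`, and SafeTensors files begin with an 8-byte little-endian header length followed by a JSON object. The sketch below applies that check to a byte slice. How boostr performs its own detection is not shown in this README, so `ModelFormat` and `detect_format` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Guess the model format from the first bytes of a file.
fn detect_format(header: &[u8]) -> ModelFormat {
    // GGUF: 4-byte ASCII magic.
    if header.starts_with(b"GGUF") {
        return ModelFormat::Gguf;
    }
    // SafeTensors: u64 little-endian JSON header length, then the JSON itself.
    if header.len() > 8 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&header[..8]);
        let json_len = u64::from_le_bytes(len_bytes);
        if json_len > 0 && header[8] == b'{' {
            return ModelFormat::SafeTensors;
        }
    }
    ModelFormat::Unknown
}

fn main() {
    let gguf = b"GGUF\x03\x00\x00\x00";
    let mut safetensors = 7u64.to_le_bytes().to_vec();
    safetensors.extend_from_slice(b"{\"a\":1}");
    println!("{:?}", detect_format(gguf));
    println!("{:?}", detect_format(&safetensors));
    println!("{:?}", detect_format(b"not a model"));
}
```

A production loader would also validate the header length against the file size before trusting it.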

Multi-Backend

  • CPU: SIMD kernels (AVX2, NEON), native ops
  • CUDA: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
  • WebGPU: WGSL shaders, cross-platform GPU support

Architecture

┌───────────────────────────────────────────────────────┐
│                    boostr                             │
│   (attention, RoPE, MoE, quantization, model loaders) │
└──────────────────────────┬────────────────────────────┘
                           │
                        (uses)
                           │
┌──────────────────────────▼───────────────────────────┐
│                      numr                            │
│   (tensors, ops, runtime, autograd, linalg, FFT)     │
└──────────────────────────────────────────────────────┘

Design principles:

  • Extension traits: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
  • QuantTensor: Separate type for quantized data with custom kernels
  • impl_generic: Composite ops composed from numr primitives, same logic on all backends
  • Custom kernels: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
  • Vendor-agnostic: No cuBLAS, cuDNN, or MKL; all native kernels
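
The extension-trait and impl_generic principles can be sketched with toy types: an ML op is declared as a trait whose default method is composed from primitive ops, and a blanket impl makes it available on every client that provides those primitives. None of the names below are numr's or boostr's real API. `PrimitiveOps`, `ToyCpuClient`, and `RmsNormOps` are stand-ins invented for this illustration.

```rust
/// Stand-in for the primitive ops a backend client already provides.
trait PrimitiveOps {
    fn mul(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
    fn mean(&self, a: &[f32]) -> f32;
    fn scale(&self, a: &[f32], s: f32) -> Vec<f32>;
}

/// Toy CPU backend.
struct ToyCpuClient;

impl PrimitiveOps for ToyCpuClient {
    fn mul(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x * y).collect()
    }
    fn mean(&self, a: &[f32]) -> f32 {
        a.iter().sum::<f32>() / a.len() as f32
    }
    fn scale(&self, a: &[f32], s: f32) -> Vec<f32> {
        a.iter().map(|x| x * s).collect()
    }
}

/// Extension trait: adds an ML op to existing clients without a new tensor
/// or client type. The default body uses only primitives, so the same logic
/// runs on any backend; a backend can override it with a fused kernel.
trait RmsNormOps: PrimitiveOps {
    fn rms_norm(&self, x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
        let mean_sq = self.mean(&self.mul(x, x));
        let normed = self.scale(x, 1.0 / (mean_sq + eps).sqrt());
        self.mul(&normed, weight)
    }
}

/// Blanket impl: every client with the primitives gets the op.
impl<C: PrimitiveOps> RmsNormOps for C {}

fn main() {
    let client = ToyCpuClient;
    let out = client.rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-6);
    println!("{out:?}");
}
```

In this sketch, a backend added later only has to implement `PrimitiveOps` to receive `rms_norm`, which mirrors the README's claim that new backends in numr flow up to boostr.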

Quick Start

Installation

Add to Cargo.toml:

[dependencies]
boostr = "<latest-version>"

# With CUDA support (requires CUDA 12.x)
# boostr = { version = "0.1", features = ["cuda"] }

# With WebGPU support
# boostr = { version = "0.1", features = ["wgpu"] }

Build

# CPU build
cargo build --release

# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

# WebGPU support
cargo build --release --features wgpu

# Run tests
cargo test
cargo test --features cuda

Basic Usage

use boostr::*;
use boostr::ops::traits::attention::flash::FlashAttentionOps;
use numr::ops::RandomOps;
use numr::runtime::cpu::{CpuClient, CpuDevice};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CpuClient::new(CpuDevice::new());

    // Create random tensors via numr's RandomOps
    let q = client.randn(&[1, 8, 32, 64], DType::F32)?; // [batch, heads, seq, dim]
    let k = client.randn(&[1, 8, 32, 64], DType::F32)?;
    let v = client.randn(&[1, 8, 32, 64], DType::F32)?;

    // Flash Attention forward pass
    let (output, _lse) = client.flash_attention_fwd(
        &q, &k, &v,
        8,    // num_heads
        8,    // num_kv_heads
        64,   // head_dim
        true, // causal
        0,    // window_size (0 = no sliding window)
        None, // kv_seq_len override
    )?;

    Ok(())
}

Loading a Model

use boostr::format::Gguf;
use boostr::{CpuRuntime, DType};
use numr::runtime::Runtime;

// Open a GGUF model file (with optional memory mapping)
let mut gguf = Gguf::open("model.gguf")?;
let metadata = gguf.metadata();
let device = <CpuRuntime as Runtime>::Device::default();

// Load tensors — quantized as QuantTensor, others as f32
for name in gguf.tensor_names().map(|s| s.to_string()).collect::<Vec<_>>() {
    let info = gguf.tensor_info(&name)?;
    if info.ggml_type.is_quantized() {
        let qt = gguf.load_tensor_quantized::<CpuRuntime>(&name, &device)?;
    } else {
        let t = gguf.load_tensor_f32::<CpuRuntime>(&name, &device)?;
    }
}

Inference with KV Cache

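use boostr::inference::PagedKvCache;

// Assumes `client`, `num_layers`, `batch_size`, `max_seq_len`, `head_dim`,
// and `seq_len` are already defined by the surrounding code.

// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(
    &client,
    num_layers,
    batch_size,
    max_seq_len,
    head_dim,
)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    // ... forward pass using kv_cache; for each layer, `layer_idx` is the
    // layer index and `k`, `v` are that layer's new key/value tensors ...
    kv_cache.update(layer_idx, &k, &v)?;
}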

Feature Flags

Feature      Purpose                              Dependencies
cpu          CPU backend (default)                numr
cuda         CUDA GPU acceleration (CUDA 12.x)    numr/cuda, cudarc
nccl         Multi-GPU via NCCL                   numr/nccl
wgpu         WebGPU cross-platform GPU            numr/wgpu
distributed  Distributed inference over nexar     nexar, anyhow, bytemuck
f16          Half-precision float support         numr/f16
fp8          FP8 precision support                numr/fp8
tts-g2p      Grapheme-to-phoneme via espeak-ng¹   espeakng

¹ Requires libespeak-ng available at runtime.
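
Feature flags like these are resolved at compile time, so downstream code can branch on them with `cfg!` or `#[cfg(...)]`. The sketch below shows the general Cargo mechanism using the feature names from the table. It assumes the consuming crate declares features with the same names, and `preferred_backend` is a hypothetical helper, not part of boostr.

```rust
/// Report which backend this binary was compiled to prefer.
/// Assumption: the crate defines `cuda` and `wgpu` features of its own
/// that forward to boostr's features of the same name.
fn preferred_backend() -> &'static str {
    if cfg!(feature = "cuda") {
        "cuda"
    } else if cfg!(feature = "wgpu") {
        "wgpu"
    } else {
        // No GPU feature enabled: fall back to the default CPU backend.
        "cpu"
    }
}

fn main() {
    println!("compiled for backend: {}", preferred_backend());
}
```

Built with no extra features this reports `cpu`. Building with `--features cuda` would switch the branch without any change to the source.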

Module Overview

  • ops/ — ML-specific operations (attention, RoPE, MoE, etc.)
  • quant/ — Quantized tensors and kernels (26 formats)
  • nn/ — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
  • model/ — Model architectures (LLaMA, Mamba2, Hybrid)
  • format/ — Model loaders (SafeTensors, GGUF)
  • inference/ — Inference infrastructure (KV cache, scheduling, batching)
  • optimizer/ — Training optimizers (AdamW, Lamb, SGD)
  • trainer/ — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
  • distributed/ — Multi-GPU coordination

Performance

boostr's performance comes from the following techniques:

  • Fused kernels — Attention, layer norm, optimizer steps compiled to single kernels
  • Custom quantization — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
  • Memory efficiency — Paged KV cache, prefix caching, gradient checkpointing
  • Distributed training — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
  • Zero-copy loading — Memory-mapped GGUF with quantized weights

Ecosystem

boostr is part of the ml-rust organization:

  • numr — Foundational numerical computing (tensors, ops, linalg, FFT)
  • boostr — ML framework (this project)
  • oxidizr — Training framework for Mamba2, MLA, MoE (uses boostr)
  • blazr — Inference server with OpenAI-compatible API (uses boostr)
  • compressr — Model quantization and compression (uses boostr)
  • splintr — High-performance BPE tokenizer

Building from Source

Requirements

  • Rust 1.85+
  • For CUDA: CUDA 12.x and cudarc dependencies
  • For WebGPU: wgpu and platform GPU drivers

Clone and Build

git clone https://github.com/ml-rust/boostr.git
cd boostr

# CPU
cargo build --release

# CUDA
cargo build --release --features cuda

# Run tests
cargo test --all-features

# Format and lint
cargo fmt --all
cargo clippy --all-targets

Documentation

Testing

# Run all tests
cargo test --all-features

# Specific test suite
cargo test ops::attention --all-features

# Verbose output
cargo test --all-features -- --nocapture

Contributing

Contributions are welcome! See CONTRIBUTING.md for architecture conventions, the impl_generic pattern, and pull request guidance.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

boostr builds on the numerical foundation provided by numr and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).
