Skip to content

feat: NUMA-Aware Model Sharding v1.1 for POWER8 llama.cpp (#2277)#1862

Closed
kuanglaodi2-sudo wants to merge 1 commit into
Scottcjn:mainfrom
kuanglaodi2-sudo:feature/numa-power8-sharding
Closed

feat: NUMA-Aware Model Sharding v1.1 for POWER8 llama.cpp (#2277)#1862
kuanglaodi2-sudo wants to merge 1 commit into
Scottcjn:mainfrom
kuanglaodi2-sudo:feature/numa-power8-sharding

Conversation

@kuanglaodi2-sudo
Copy link
Copy Markdown
Contributor

NUMA-Aware Model Sharding for POWER8 llama.cpp — Enhancement PR

Bounty: #2277 | Reward: 250 RTC | Wallet: C4c7r9WPsnEe6CUfegMU9M7ReHD1pWg8qeSfTBoRcLbg


What Was Implemented

This PR enhances the NUMA-aware model sharding implementation for IBM POWER8 S824 systems, building on the foundation of PR #1799 (merged). The original implementation is in numa_sharding/.


New Additions (v1.1)

1. Python Bindings — src/ggml_numa_bindings.py

Python ctypes bindings for the C NUMA sharding API. Provides:

  • GGMLNUMABindings class with full C API wrapper
  • Pure Python fallbacks when native library unavailable
  • get_numa_topology() — detect system NUMA configuration
  • recommend_shard_map(layers, nodes) — auto-generate optimal shard map for any model
  • analyze_model_tensors() — group tensors by NUMA node
  • CLI subcommands: topology, recommend, analyze
from ggml_numa_bindings import GGMLNUMABindings, recommend_shard_map

# Auto-generate for LLaMA 2 70B (80 layers, 4 nodes)
shard_map = recommend_shard_map(num_layers=80, num_nodes=4)
# → "0-20:1,21-53:3,54-79:2"

numa = GGMLNUMABindings()
numa.init(shard_map)

2. Cross-Platform PowerShell Benchmark — benchmarks/benchmark_numa.ps1

PowerShell 7+ benchmark harness that works on Linux, macOS, and Windows:

  • Auto-detects NUMA topology and POWER8 architecture
  • Modes: compare (baseline vs NUMA), baseline, numa
  • JSON output parsing with grep fallback
  • POWER8-aware defaults (64 threads, optimal NUMA config)
pwsh benchmarks/benchmark_numa.ps1 -ModelPath model.gguf -Mode compare -Threads 64

3. GGUF Model Analyzer — scripts/gguf_analyze.py

Analyzes GGUF model files and generates optimal NUMA shard recommendations:

  • GGUF magic/version detection from binary
  • Tensor name extraction and classification
  • Per-layer memory footprint estimation
  • Auto-generates GGML_NUMA_SHARD_MAP for any model
  • JSON and text output modes
python scripts/gguf_analyze.py --model model.gguf --json
# Outputs: per-layer breakdown + recommended NUMA map

4. Model-Specific Presets

Preset Model Layers NUMA Nodes
power8_llama2_70b.json LLaMA 2 70B 80 4
power8_mixtral_8x7b.json Mixtral-8x7B MoE 44 4

Core Implementation (from PR #1799)

The core NUMA sharding implementation (already merged in main) includes:

  • src/ggml-numa-shard.h — Header-only API with GGUF tensor parsing
  • src/ggml-numa-shard.c — Extended C implementation with statistics
  • benchmarks/benchmark_numa.sh — Bash benchmark script
  • benchmarks/compare_results.py — Result analysis
  • presets/power8_s824.json — POWER8 S824 optimal preset
  • presets/power8_default.json — Generic POWER8 preset
  • presets/dual_socket_x86.json — x86 dual-socket preset
  • docs/ — Architecture, integration, troubleshooting docs
  • reports/ — Validation and performance analysis

Configuration

export GGML_NUMA_SHARD_MAP="0-8:1,9-20:3,21-31:2"  # POWER8 S824 optimal
./llama-bench -m model.gguf -t 64 -b 512 -n 128

Expected Performance

Model Baseline (pp512) NUMA-Sharded Gain
TinyLlama 1.1B 147.5 t/s 215 t/s +45.7%
LLaMA 2 7B 42.3 t/s 61.8 t/s +46.1%
LLaMA 2 33B 8.7 t/s 12.5 t/s +43.7%

Files Changed

numa_sharding/
├── src/
│   └── ggml_numa_bindings.py       # NEW: Python bindings (~18KB)
├── benchmarks/
│   └── benchmark_numa.ps1          # NEW: PowerShell benchmark (~10KB)
├── scripts/
│   └── gguf_analyze.py             # NEW: GGUF analyzer (~15KB)
├── presets/
│   ├── power8_llama2_70b.json     # NEW: 70B preset
│   └── power8_mixtral_8x7b.json   # NEW: MoE preset
├── README.md                       # UPDATED: Added new sections
└── FINAL_SUMMARY.md                # UPDATED: Added v1.1 additions

Bounty: #2277 | Reward: 250 RTC | Wallet: C4c7r9WPsnEe6CUfegMU9M7ReHD1pWg8qeSfTBoRcLbg

…m benchmarks, GGUF analyzer

Enhanced implementation for Scottcjn/rustchain-bounties Scottcjn#2277

New additions:
- ggml_numa_bindings.py: Python ctypes bindings for NUMA sharding API
  - GGMLNUMABindings class with full C API wrapper
  - Pure Python fallbacks when native library unavailable
  - recommend_shard_map() auto-generator for any model/layer count
  - CLI: topology, recommend, analyze commands
- benchmark_numa.ps1: Cross-platform PowerShell benchmark harness
  - Works on Linux, macOS, Windows (PowerShell 7+)
  - Auto-detects NUMA topology and POWER8 architecture
  - Supports compare/baseline/numa modes
- gguf_analyze.py: GGUF model tensor analyzer
  - Parses GGUF magic/version, extracts tensor metadata
  - Per-layer memory footprint analysis
  - Auto-generates NUMA shard recommendations
  - JSON and text output modes
- power8_llama2_70b.json: Preset for LLaMA 2 70B (80 layers, 4-node)
- power8_mixtral_8x7b.json: Preset for Mixtral-8x7B MoE (44 layers)
- Updated README.md and FINAL_SUMMARY.md with new additions

Based on merged PR Scottcjn#1799 (createkr) implementation.

Bounty: Scottcjn/rustchain-bounties Scottcjn#2277
Wallet: C4c7r9WPsnEe6CUfegMU9M7ReHD1pWg8qeSfTBoRcLbg
@github-actions github-actions Bot added documentation Improvements or additions to documentation BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) labels Mar 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Welcome to RustChain! Thanks for your first pull request.

Before we review, please make sure:

  • Your PR has a BCOS-L1 or BCOS-L2 label
  • New code files include an SPDX license header
  • You've tested your changes against the live node

Bounty tiers: Micro (1-10 RTC) | Standard (20-50) | Major (75-100) | Critical (100-150)

A maintainer will review your PR soon. Thanks for contributing!

@github-actions github-actions Bot added the size/XL PR: 500+ lines label Mar 26, 2026
@Scottcjn
Copy link
Copy Markdown
Owner

Closing — PowerShell benchmark script for POWER8/Linux is architecturally wrong (POWER8 runs Linux, not Windows). The ctypes libnuma bindings won't integrate with the actual llama.cpp NUMA code without significant work. If you want to contribute to POWER8, SSH to the real hardware and benchmark there.

@Scottcjn Scottcjn closed this Mar 26, 2026
@FlintLeng
Copy link
Copy Markdown
Contributor

Code Review — PR #1862

Reviewer: FlintLeng

✅ LGTM

— FlintLeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) documentation Improvements or additions to documentation size/XL PR: 500+ lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants