PSIM is a C++ simulator for evaluating distributed machine learning execution protocols under different network topologies and load-balancing strategies.
The simulator models protocol task graphs that combine compute tasks and communication flows, places them on simulated machines, routes flows through a configurable network, and reports completion time, flow-level metrics, utilization, and load-balancing decisions.
This repository also includes the experiment scripts used for the IFIP Networking 2025 paper:
Foresight: Joint Time and Space Scheduling for Efficient Distributed ML Training
PSIM represents a distributed ML workload as a dependency graph of tasks:
- Compute tasks execute on a machine or accelerator.
- Flow tasks transfer data between source and destination devices.
- Empty tasks act as graph markers, synchronizers, or logging points.
During simulation, PSIM advances active compute tasks and flows in discrete time steps. Network flows register their requested bandwidth on each bottleneck in their path, bottlenecks allocate bandwidth according to the configured allocator, and completed tasks trigger their dependent tasks.
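The execution model described above can be illustrated with a minimal Python sketch. It is not PSIM's C++ implementation: the `Task` class, the `simulate` function, and the single fixed `rate` shared by all tasks are assumptions made for illustration. In PSIM, a flow's rate comes from the bottleneck allocator instead.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    kind: str                 # "compute", "flow", or "empty"
    work: float = 0.0         # compute time or bytes left; 0 for empty tasks
    deps: list = field(default_factory=list)  # names of prerequisite tasks


def simulate(tasks, rate=1.0, step=1.0):
    """Return {task name: finish time} for a dependency graph of tasks."""
    remaining = {t.name: t.work for t in tasks}
    finished = {}
    now = 0.0
    while len(finished) < len(tasks):
        # A task becomes active once all of its dependencies have finished.
        active = [t for t in tasks
                  if t.name not in finished
                  and all(d in finished for d in t.deps)]
        if not active:
            raise ValueError("dependency cycle or unknown dependency")
        # Zero-work tasks (markers, synchronizers) complete without using time.
        instant = [t for t in active if remaining[t.name] <= 0]
        if instant:
            for t in instant:
                finished[t.name] = now
            continue
        # Advance every active task by one discrete time step.
        now += step
        for t in active:
            remaining[t.name] -= rate * step
            if remaining[t.name] <= 1e-12:
                finished[t.name] = now
    return finished


# One training iteration: forward, backward, all-reduce flow, then a sync marker.
tasks = [
    Task("fwd", "compute", 2.0),
    Task("bwd", "compute", 3.0, deps=["fwd"]),
    Task("allreduce", "flow", 4.0, deps=["bwd"]),
    Task("sync", "empty", 0.0, deps=["allreduce"]),
]
finish = simulate(tasks)
```

Because completed tasks are only noticed at step boundaries, a smaller `step` gives finer timing resolution at the cost of more loop iterations.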
- Protocol simulation: models compute tasks, communication flows, and dependency-driven execution.
- Network modeling: supports fat-tree, leaf-spine, and big-switch topologies with explicit bottlenecks.
- Routing and load balancing: includes random, ECMP, round-robin, least-loaded, power-of-k, replay-from-file, and protocol-defined routing modes.
- Bandwidth allocation: supports fair-share, max-min fair-share, fixed-level priority, and priority queue allocation.
- Experiment execution: runs repeated simulations with per-run logs, flow information, load-balancing decisions, and regret measurements.
- Analysis workflow: provides Python orchestration and plotting scripts for generating and processing paper-scale experiments.
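As an illustration of the max-min fair-share idea listed above, here is a minimal Python sketch of progressive filling on a single bottleneck. The function name and the dictionary-based interface are assumptions for this example. PSIM's allocators are implemented in C++ and may differ in detail.

```python
def max_min_fair(capacity, demands):
    """Allocate `capacity` among flows so that no flow can gain bandwidth
    without taking it from a flow that has an equal or smaller allocation."""
    alloc = {flow: 0.0 for flow in demands}
    unsatisfied = dict(demands)
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-12:
        share = remaining / len(unsatisfied)
        # Flows that want no more than the equal share are fully satisfied.
        capped = {f: d for f, d in unsatisfied.items() if d <= share}
        if not capped:
            # Every remaining flow is limited by the link: split it equally.
            for f in unsatisfied:
                alloc[f] = share
            break
        for f, d in capped.items():
            alloc[f] = d
            remaining -= d
            del unsatisfied[f]
    return alloc


# Three flows requesting 2, 4, and 10 units on a link with capacity 10.
allocation = max_min_fair(10, {"a": 2, "b": 4, "c": 10})
```

In this example the small flows receive their full demand and the large flow receives the leftover capacity, so no bandwidth is left unused while any flow still wants it.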
A demo visualization of PSIM's execution with different load-balancing policies is available at:
Example 1: 4 jobs sharing a 32-machine cluster with 4 spines and 4 ToRs. Each job runs a data-parallel training protocol with Ring-Allreduce communication at the end of each training iteration. Each flow randomly picks one of the 4 spines for its path.
Example 2: A similar setup, but with a load-aware routing policy. Each flow picks the least-loaded spine when it starts transmission. The load-aware policy achieves better performance, but requires end hosts to access non-local congestion signals.
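The difference between the two examples can be sketched with a toy spine-selection model in Python. This is an illustration only: the `assign` function is hypothetical, the policy strings are just labels within this sketch, every flow has the same size, and flows never finish, so the load is a simple cumulative count.

```python
import random


def assign(num_flows, num_spines, policy, k=2, seed=0):
    """Return the number of flows placed on each spine under a policy."""
    rng = random.Random(seed)
    load = [0] * num_spines
    for _ in range(num_flows):
        if policy == "random":
            spine = rng.randrange(num_spines)
        elif policy == "leastloaded":
            # Requires a global view of the load on every spine.
            spine = min(range(num_spines), key=load.__getitem__)
        elif policy == "powerofk":
            # Probe k random spines and take the least loaded of those.
            candidates = rng.sample(range(num_spines), k)
            spine = min(candidates, key=load.__getitem__)
        else:
            raise ValueError(f"unknown policy: {policy}")
        load[spine] += 1
    return load


random_load = assign(16, 4, "random")
balanced_load = assign(16, 4, "leastloaded")
```

With 16 equal flows and 4 spines, the least-loaded policy always ends with 4 flows per spine, while random selection can only match or exceed that peak. Power-of-k sits between the two and needs load information from only `k` spines per decision.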
The repository has two main parts:
- src/ and include/ contain the C++ simulator.
- run/ contains the Python experiment orchestration, placement/routing helpers, and plotting scripts used by the paper experiments.
The most useful implementation entry points are:
- src/main.cc sets up command-line configuration, logging, repetitions, and per-run output directories.
- src/psim.cc owns the main simulation loop: starting tasks, advancing flows/compute, collecting history, and logging results.
- src/protocol_builder.cc builds protocol graphs either from input files or from generated experiment metadata.
- src/network.cc and src/core_network.cc implement the network models and bottlenecks.
- src/loadbalancer.cc implements routing and load-balancing policies.
- include/gconfig.h lists the runtime configuration fields populated by command-line options.
PSIM currently expects:
- C++ build tools: a C++17 compiler, CMake, Boost Program Options, and Python development headers/libraries.
- Python packages for experiments and plotting: matplotlib, numpy, pandas, networkx, seaborn, and scipy.
- Git submodules: deps/spdlog and deps/json.
On Ubuntu-like systems, the base system dependencies are typically:
sudo apt-get update
sudo apt-get install -y cmake g++ libboost-all-dev python3-dev
python3 -m pip install matplotlib numpy pandas networkx seaborn scipy

Clone with submodules:

git clone --recursive git@github.com:FaridZandi/psim.git
cd psim

If the repository was already cloned without submodules:

git submodule update --init --recursive

This is required because the CMake build imports deps/spdlog and deps/json.
mkdir -p build
cd build
cmake ..
make -j

The build creates the psim executable under build/.
From the build directory, run the simulator with a protocol input:
./psim \
--protocol-file-dir ../input/128search \
--protocol-file-name vgg128-simtime.txt \
--network-type leafspine \
--lb-scheme roundrobin \
--rep-count 1 \
--console-log-level 5

Output is written under the configured workers directory. By default, PSIM writes to:
build/workers/worker-<worker-id>/run-<rep>/
The path above assumes the command is run from the build/ directory.
Typical generated files include:
- runtime.txt
- results.txt
- lb-decisions.txt
- regrets.txt
- flow-info.txt
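A small Python helper can locate a run directory and report which of the typical output files are present. This is a sketch under assumptions: the helper names are hypothetical, and the worker ID and repetition index are assumed to be substituted into the path pattern shown above as plain values. Since the files are described as typical rather than guaranteed, a missing file is not necessarily an error.

```python
from pathlib import Path

# File names taken from the list of typical generated files above.
EXPECTED_FILES = [
    "runtime.txt",
    "results.txt",
    "lb-decisions.txt",
    "regrets.txt",
    "flow-info.txt",
]


def run_dir(workers_dir, worker_id, rep):
    """Build <workers_dir>/worker-<worker_id>/run-<rep> (assumed layout)."""
    return Path(workers_dir) / f"worker-{worker_id}" / f"run-{rep}"


def missing_outputs(directory):
    """Return the expected file names that do not exist in `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED_FILES
            if not (directory / name).is_file()]


# Hypothetical worker ID and repetition index, for illustration only.
example_dir = run_dir("workers", 0, 1)
```

Calling `missing_outputs(example_dir)` after a run gives a quick check before passing the files to the plotting scripts.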
PSIM is configured through command-line flags that populate the global configuration object in include/gconfig.h.
The most important options are grouped below. For the complete list, run ./build/psim --help.
Workload and protocol inputs:

| Option | Description |
|---|---|
| --machine-count | Number of machines/devices in the simulated cluster. |
| --protocol-file-name | Either an input file name or a built-in protocol builder name such as nethint-test. Multiple names can be comma-separated. |
| --protocol-file-dir | Directory used when --protocol-file-name refers to input files. |
| --placement-file | JSON placement file used by the runtime protocol builder. |
| --timing-file | Optional JSON timing/throttling file used by the runtime protocol builder. |
| --routing-file | JSON routing file used by generated protocols and readprotocol routing. |
| --subflows | Number of subflows to create for generated communication. |
| --isolate-job-id | Run only one job from a generated workload. |

Network topology:

| Option | Description |
|---|---|
| --network-type | Network model: fattree, leafspine, or bigswitch. |
| --link-bandwidth | Base link bandwidth. |
| --ft-server-per-rack | Number of servers per rack. |
| --ft-rack-per-pod | Number of racks per pod. |
| --ft-agg-per-pod | Number of aggregation switches per pod. |
| --ft-pod-count | Number of pods. |
| --ft-core-count | Number of core switches or spines. |
| --ft-server-tor-link-capacity-mult | Multiplier for server-to-ToR link capacity. |
| --ft-tor-agg-link-capacity-mult | Multiplier for ToR-to-aggregation link capacity. |
| --ft-agg-core-link-capacity-mult | Multiplier for aggregation-to-core link capacity. |
| --gpu-per-machine | Number of GPUs per machine in supported topologies. |
| --gpu-gpu-link-capacity-mult | Multiplier for intra-machine GPU link capacity. |

Load balancing, bandwidth allocation, and rate control:

| Option | Description |
|---|---|
| --lb-scheme | Load-balancing policy: random, roundrobin, ecmp, zero, readfile, readprotocol, leastloaded, powerofK, futureload, robinhood, or sita-e. |
| --lb-decisions-file | File used by readfile load balancing. |
| --ecmp-entropy-options | Number of entropy choices used by ECMP. |
| --load-metric | Load signal used by load-aware policies: flowsize, flowcount, utilization, allocated, or registered. |
| --priority-allocator | Bottleneck allocator: priorityqueue, fixedlevels, fairshare, or maxmin. |
| --bn-priority-levels | Number of bottleneck priority levels. |
| --initial-rate | Initial flow sending rate. |
| --min-rate | Minimum flow sending rate. |
| --rate-increase | Multiplicative rate increase factor. |
| --rate-decrease-factor | Multiplicative rate decrease factor. |
| --drop-chance-multiplier | Multiplier used by probabilistic drop/congestion behavior. |
| --punish-oversubscribed | Enable oversubscription penalty behavior. |
| --punish-oversubscribed-min | Lower bound used by oversubscription penalty behavior. |

Execution, output, and logging:

| Option | Description |
|---|---|
| --rep-count | Number of repeated simulation runs. |
| --step-size | Fixed simulation time step. |
| --adaptive-step-size | Enable adaptive step sizing. |
| --adaptive-step-size-min | Minimum adaptive step size. |
| --adaptive-step-size-max | Maximum adaptive step size. |
| --workers-dir | Directory where per-run output is written. |
| --worker-id | Worker identifier used in output paths. |
| --simulation-seed | Base seed used for repeated runs. |
| --console-log-level | Console log verbosity. Higher values are quieter. |
| --file-log-level | File log verbosity. |
| --core-status-profiling-interval | Interval for recording core link status. |
| --no-profile-core-status | Disable core status profiling. |
| --record-bottleneck-history | Record bottleneck allocation history. |
| --record-machine-history | Record per-machine queue history. |
| --print-flow-progress-history | Record per-flow progress history. |
| --export-dot | Export protocol graph DOT files. |

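When driving the simulator from a script, it can help to keep options in a dictionary and expand them into a command line. The sketch below is one possible approach and is not taken from the scripts under run/. The `build_psim_command` helper is hypothetical, and treating `True` as a bare switch is an assumption: confirm with `./build/psim --help` whether a given boolean option expects an explicit value.

```python
def build_psim_command(binary, options):
    """Expand {"network-type": "leafspine", ...} into an argv list."""
    argv = [binary]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            # Assumption: boolean options are bare switches without a value.
            argv.append(flag)
        elif value is False or value is None:
            # Omit disabled or unset options entirely.
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


# Same options as the quick-start command, expressed as data.
command = build_psim_command("./psim", {
    "protocol-file-dir": "../input/128search",
    "protocol-file-name": "vgg128-simtime.txt",
    "network-type": "leafspine",
    "lb-scheme": "roundrobin",
    "rep-count": 1,
    "console-log-level": 5,
})
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting issues.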
The Python experiment scripts also maintain higher-level experiment settings such as placement mode, timing scheme, comparison name, and routing strategy. Those settings are used to generate the placement, timing, and routing files passed into the C++ simulator.
PSIM supports two ways to create protocol graphs.
The original path is to load a protocol file from --protocol-file-dir.
The file loader recognizes lines for:
- Comm: communication tasks.
- Forw and Back: compute tasks.
- AllR: empty/synchronization tasks.
For this mode, --protocol-file-name is the file name, for example:
--protocol-file-dir ../input/128search \
--protocol-file-name vgg128-simtime.txt

Most current experiments use the protocol builder instead of static protocol files. In this mode, --protocol-file-name names a built-in builder, and the simulator constructs the protocol graph at runtime.
The main experiment builder is:
--protocol-file-name nethint-test

nethint-test reads generated experiment metadata and creates the protocol graph inside src/protocol_builder.cc. The key inputs are:
- --placement-file: JSON description of jobs, machine assignments, communication size, compute size, layer count, and iteration count.
- --timing-file: optional JSON timing metadata with per-job iteration offsets and throttle rates.
- --routing-file: JSON routing metadata that maps generated flows to spines/cores and rates.
The Python scripts under run/ generate these files before invoking build/psim. This is the path used by the paper sweeps: Python defines the experiment, produces placement/timing/routing artifacts, then launches the C++ simulator with --protocol-file-name nethint-test.
There are also smaller built-in protocol builders useful for debugging:
- build-ring
- build-all-to-all
- periodic-test
- periodic-test-simple
The paper experiments evaluate Foresight as a coordinated scheduling pipeline rather than as a single load-balancing rule inside the simulator. The Python experiment layer generates a schedule, and the C++ simulator executes that schedule through runtime-built protocols.
At a high level, the workflow is:
- Generate job placements and workload metadata.
- Compute timing decisions that control when job iterations begin.
- Compute routing decisions that assign generated flows to spines/cores.
- Optionally split communication into subflows and search over throttle rates.
- Run PSIM with --protocol-file-name nethint-test and --lb-scheme readprotocol.
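The last step of this workflow can be sketched in Python as follows. This is not the code under run/: the function names and the example file paths are hypothetical, and only the flags documented in this README are used. The placement, timing, and routing files are assumed to have been generated by the earlier steps.

```python
import subprocess


def foresight_command(psim, placement, routing, timing=None, subflows=1):
    """Build the argv for a runtime-built protocol with protocol-defined routing."""
    argv = [
        str(psim),
        "--protocol-file-name", "nethint-test",
        "--lb-scheme", "readprotocol",
        "--placement-file", str(placement),
        "--routing-file", str(routing),
        "--subflows", str(subflows),
    ]
    if timing is not None:
        # The timing file is optional.
        argv += ["--timing-file", str(timing)]
    return argv


def run_psim(argv, cwd=None):
    """Launch the simulator and raise if it exits with a non-zero status."""
    return subprocess.run(argv, cwd=cwd, check=True)


# Hypothetical artifact paths, for illustration only.
argv = foresight_command(
    "build/psim",
    placement="exp/placement.json",
    routing="exp/routing.json",
    timing="exp/timing.json",
    subflows=2,
)
```

Keeping command construction separate from execution makes it easy to log or inspect the exact invocation before starting a large sweep.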
The main scheduling components are represented in the experiment scripts by comparison names:
- TS: time scheduling. Generates per-job iteration offsets through the timing file.
- RO: routing optimization. Generates protocol-defined routing decisions consumed by readprotocol.
- SUB: subflow/throttle search. Splits communication and assigns throttle rates when multiple subflows are enabled.
- REP / rounds: iterative refinement variants controlled by settings such as farid-rounds.
In practice, the Foresight path uses the Python code under run/ to create placement-file, timing-file, and routing-file artifacts, then invokes the C++ simulator to evaluate the resulting execution schedule. The simulator itself remains responsible for task execution, bottleneck bandwidth allocation, flow progress, and final metrics.
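To see why per-job iteration offsets matter, consider the toy Python model below. It is not Foresight's scheduling algorithm. It assumes every job repeats a fixed cycle of compute followed by a communication burst of one unit of link demand, and it only measures how the choice of start offsets changes the peak demand on a shared link. All names and numbers are illustrative.

```python
def link_demand(offsets, period=10, compute=7, horizon=40):
    """Return per-time-slot demand on a shared link for periodic jobs.

    Each job starts at its offset, computes for `compute` slots, then
    communicates for the remaining `period - compute` slots of each cycle.
    """
    demand = [0] * horizon
    for offset in offsets:
        for t in range(offset, horizon):
            phase = (t - offset) % period
            if phase >= compute:
                demand[t] += 1  # job is in its communication phase
    return demand


# Two identical jobs: started together versus staggered by 3 slots.
aligned_peak = max(link_demand([0, 0]))
staggered_peak = max(link_demand([0, 3]))
```

With aligned starts both jobs communicate at the same time and the peak demand is 2. Shifting the second job by the length of the communication burst interleaves the bursts and lowers the peak to 1, at the cost of delaying that job's start.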
The plots below show the same workload before and after Foresight's scheduling decisions. The baseline produces burstier link demand, while Foresight spreads communication over time and routes flows to reduce sustained contention.
Baseline runtime link load (left) and Foresight runtime link load (right).
Additional routing diagnostics:
The remaining-capacity view below shows the state after routing scheduling. The useful property is that the routed flows fit within the available link capacity, so no link remains overloaded.
The final comparison summarizes the impact of these scheduling decisions against the other evaluated methods.
The run/ directory contains Python scripts for reproducing or extending the experiments.
From run/:
# Figure 5
python sweep-components-jobsizes.py
python sweep-components-oversub.py
# Figure 6
python sweep-placement.py
# Figure 7
python sweep-intensity.py
python sweep-topology.py

Experiment results are written under:
run/results/exps/
The experiment scripts expect a built simulator at build/psim and may copy that binary into per-run result directories.
This repository is research-oriented and contains several areas that are good candidates for cleanup:
- Modernize CMake target definitions and project metadata.
- Replace shell-based filesystem operations with std::filesystem.
- Move global configuration out of the singleton-style GConf object.
- Replace fixed-size job progress arrays with dynamically sized containers.
- Clarify the boundary between reusable simulator code and experiment-specific scripts.
- Document the protocol file format with a complete example.
- Add a small smoke-test input and a deterministic quick-start command.
PSIM is actively useful as a research simulator, but the repository still reflects its research-prototype history. The core simulator is implemented in C++, while experiment generation, execution, and plotting are handled by Python scripts under run/.
For new contributors, the best starting points are:
- Build the simulator.
- Run a single small protocol input.
- Inspect the generated results.txt and flow-info.txt.
- Follow one sweep script under run/ to understand how large experiment batches are configured.




