
[WIP] prototype Overlay V2 in Rust#5147

Draft
marta-lokhova wants to merge 15 commits into stellar:master from marta-lokhova:Feb18OverlayV2-clean

Conversation

@marta-lokhova
Contributor

This is still a half-baked prototype of a Rust overlay, but I'm pushing it as a draft for visibility and initial feedback.

The last commit adds a couple of files to docs that give an overview of what this prototype is doing, but the tl;dr is that it implements the overlay as a separate process in Rust.

The good news is that we get many things out of the box via libp2p: Kademlia for peer discovery, peer auth, GossipSub for topology forming, and QUIC for multiplexing. We also benefit from Rust's tokio async scheduler. Of course, these things need tuning and probably a bunch of debugging to iron out all the edge cases, but it's quite encouraging to see something that kinda works after just a few days of "vibe coding" (as confirmed by a good amount of tests on the Rust side and in OverlayIPCTests). It would be interesting to see if we can actually get a reliable initial implementation of the Rust overlay, such that we can optimize individual areas as needed (for example, we know GossipSub isn't what we want, so we can implement a custom solution in Rust rather than poking at C++, which has a bunch of weird legacy stuff like PeerManager etc.).

Another interesting observation from this experiment was finding the clean cut point between overlay <> SCP. Moving most of the peer and flooding components to Rust leaves a single-digit number of interaction points with C++, which is quite a bit easier to reason about and encourages good separation of responsibilities.

Next steps would be to actually get it to run in supercluster (right now there are some annoying issues I'm debugging, like DNS resolution and containers being able to properly talk to each other), and to see what kind of SCP latency we can get from removing head-of-line blocking via multiplexing, plus the natural parallelism we get from separating the processes.

marta-lokhova and others added 15 commits February 18, 2026 18:43
Add the Rust overlay crate to the workspace and update the build system
to compile and link the Rust overlay library alongside stellar-core.

- Add workspace Cargo.toml and overlay/Cargo.toml with libp2p, tokio,
  and IPC dependencies
- Update configure.ac to detect Rust toolchain and set RUSTFLAGS
- Update Makefile.am and src/Makefile.am to build the Rust overlay
  library and link it into stellar-core

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the Rust overlay's core modules:

IPC transport (overlay/src/ipc/):
- Unix domain socket transport for C++ <-> Rust communication
- Binary message framing with length-prefix encoding
- Async message handling with tokio
- Message types for SCP envelopes, transactions, tx sets, and
  peer management commands

Flood control (overlay/src/flood/):
- Mempool for transaction storage and deduplication
- INV/GETDATA protocol for pull-based transaction dissemination
- Inventory batcher for efficient batched advertisement
- Inventory tracker to avoid redundant sends
- Pending request tracking with timeout support
- Transaction set assembly and caching
- Transaction buffer for outbound flow management

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the peer-to-peer networking layer using libp2p:

- config.rs: Network configuration (listen addresses, bootstrap peers,
  Kademlia DHT settings, connection limits)
- libp2p_overlay.rs: Main libp2p swarm implementation with Kademlia DHT
  for peer discovery, gossipsub for message dissemination, and connection
  management with configurable limits
- integrated.rs: Integration layer that bridges IPC transport with the
  libp2p overlay, routing messages between C++ core and the P2P network
- main.rs: Standalone binary entry point for running the overlay as a
  separate process
- http/mod.rs: HTTP server for overlay status, metrics, and admin
  endpoints (peer info, connection stats)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add comprehensive test suites for the Rust overlay:

- e2e_binary.rs: End-to-end tests that launch overlay processes and
  verify IPC communication, message routing, and peer connectivity
- kademlia_test.rs: Tests for Kademlia DHT peer discovery including
  multi-node topologies, peer table convergence, and bootstrap scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the legacy C++ overlay networking layer with a thin IPC bridge
that communicates with the new Rust overlay process.

New files:
- IPC.cpp/h: Low-level IPC transport (Unix domain sockets, message
  framing, async read/write)
- OverlayIPC.cpp/h: High-level IPC protocol handling (message
  serialization, request/response routing)
- RustOverlayManager.cpp/h: OverlayManager implementation that delegates
  networking to the Rust overlay via IPC
- NetworkConstants.h: Shared constants between C++ and Rust

Removed legacy components:
- Peer.cpp/h, TCPPeer.cpp/h: Direct TCP peer connections (now handled
  by Rust libp2p)
- FlowControl.cpp/h, FlowControlCapacity.cpp/h: Per-peer flow control
  (now handled by Rust flood control)
- Floodgate.cpp/h: Transaction flooding (replaced by Rust INV/GETDATA)
- PeerAuth.cpp/h, Hmac.cpp/h: Authentication (now handled by libp2p
  noise protocol)
- PeerManager.cpp/h, PeerDoor.cpp/h: Peer lifecycle management
- ItemFetcher.cpp/h, Tracker.cpp/h: SCP message fetching
- TxAdverts.cpp/h, TxDemandsManager.cpp/h: Pull-mode tx dissemination
- Survey*.cpp/h: Network survey (to be reimplemented)
- OverlayManagerImpl.cpp/h: Legacy overlay manager

Updated tests:
- IPCTests.cpp: IPC transport unit tests
- OverlayIPCTests.cpp: IPC protocol integration tests
- OverlayIPCBenchmark.cpp: IPC throughput benchmarks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the C++ core to work with the Rust overlay via IPC instead of
the legacy in-process overlay.

Herder changes:
- Remove TransactionQueue and TxQueueLimiter (transaction queuing now
  handled by Rust mempool)
- Adapt HerderImpl to submit transactions via IPC
- Update PendingEnvelopes to fetch SCP messages via IPC
- Simplify TxSetFrame to work with externally-assembled tx sets

Application changes:
- Add RustOverlayManager instantiation to ApplicationImpl
- Update Config with Rust overlay settings (IPC socket path, overlay
  binary path)
- Update AppConnector to expose IPC interface
- Adapt CommandHandler for new overlay endpoints

Simulation changes:
- Rewrite Simulation and Topologies to launch Rust overlay processes
- Update LoadGenerator and TxGenerator for IPC-based tx submission
- Adapt CoreTests for new topology model

Other adaptations:
- Database: Remove overlay-specific tables (peers, peer preferences)
- LedgerManager: Remove direct overlay interactions
- Fuzz tests: Adapt for new overlay interface

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update Dockerfiles to install Rust toolchain and build the overlay
crate as part of the stellar-core build pipeline.

- Dockerfile: Add rustup installation, cargo build step
- Dockerfile.testing: Include Rust build for test images

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add comprehensive design document for the Rust overlay, cross-checked
against the actual implementation:

- Architecture overview with ASCII diagrams
- QUIC transport and stream protocol specification
- IPC protocol with all message types and wire formats
- INV/GETDATA pull-based TX flooding protocol (new)
- Mempool, TX set building, and shared state documentation
- Complete test coverage summary (178 Rust + C++ integration tests)
- Known issues and unimplemented features

Excludes session notes, debugging logs, and agent instructions from
dev/ai/ — this document is the single source of truth for the overlay
design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SetPeerConfig handler previously only accepted IP:port format for
peer addresses. In Kubernetes, peers are identified by DNS hostnames
(e.g. pod-0.service.namespace.svc.cluster.local) which were silently
skipped, preventing any peer connections from being established.

This change adds:

- DNS resolution for peer addresses: resolve_peer_addr() handles both
  IP:port (parsed directly) and DNS hostnames (resolved via
  tokio::net::lookup_host), using listen_port as default when no port
  is specified in the hostname string.

- Self-dial detection: collect_local_addrs() builds a set of the
  node's own addresses using instant UDP socket probing at startup,
  plus background DNS hostname resolution for K8s pod IP detection.
  Peers that resolve to a local address are skipped instead of
  producing a confusing 'no addresses for peer' error from libp2p.

- Retry with exponential backoff: when DNS resolution fails (common
  during K8s pod startup when not all peers are DNS-ready yet), a
  background task retries unresolved peers with exponential backoff
  (2s initial delay, 30s max, up to 10 attempts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement all overlay-related metrics from docs/metrics.md in the Rust
overlay V2 process and pipe them to C++ core's libmedida-backed /metrics
HTTP endpoint.

Rust side:
- New overlay/src/metrics.rs: lock-free atomic counters for ~40 metrics
  covering connections, I/O bytes/messages/errors, flood pull-mode
  (advertised/demanded/fulfilled/unfulfilled), connection lifecycle
  (inbound/outbound attempt/establish/drop), per-message-type send/recv
  counters, and timer summaries (recv-transaction, recv-scp, fetch-txset,
  tx-pull-latency, tx-batch-size histogram).
- MetricsSnapshot with serde::Serialize for JSON IPC transport.
- Instrumented libp2p_overlay.rs: ConnectionEstablished/Closed, SCP
  broadcast, TX broadcast/INV batching, GETDATA fulfill/unfulfill,
  TX receive with dedup tracking (unique/duplicate bytes), inbound
  stream reads, TxSet stream reads, and housekeeping timeouts.
- New IPC message types: RequestOverlayMetrics (13) / OverlayMetricsResponse
  (105) for synchronous Core->Overlay metrics request.
- main.rs handles RequestOverlayMetrics by snapshotting and responding.

C++ side:
- OverlayIPC::requestMetrics(): synchronous IPC call with separate
  mutex/CV to avoid collision with getTopTransactions().
- RustOverlayManager::syncOverlayMetrics(): parses JSON snapshot,
  computes deltas for monotonic counters, and updates libmedida
  Meters/Counters/Timers/Histograms accordingly.
- ApplicationImpl::syncAllMetrics() calls syncOverlayMetrics() so
  metrics are fresh whenever /metrics endpoint is queried.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All TX sets were cached with ledger_seq: 0, causing evict_before() to
wipe them on the first ledger close past sequence 5. This made the node
unable to serve TX sets to peers ~600ms after caching them.

- Use current_ledger_seq when inserting into TxSetCache (both
  TxSetReceived from peers and CacheTxSet from Core)
- Extend eviction buffer from 5 to 12 ledgers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>