[WIP] prototype Overlay V2 in Rust#5147
Draft
marta-lokhova wants to merge 15 commits into stellar:master from
Conversation
Add the Rust overlay crate to the workspace and update the build system to compile and link the Rust overlay library alongside stellar-core.

- Add workspace Cargo.toml and overlay/Cargo.toml with libp2p, tokio, and IPC dependencies
- Update configure.ac to detect the Rust toolchain and set RUSTFLAGS
- Update Makefile.am and src/Makefile.am to build the Rust overlay library and link it into stellar-core

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the Rust overlay's core modules.

IPC transport (overlay/src/ipc/):
- Unix domain socket transport for C++ <-> Rust communication
- Binary message framing with length-prefix encoding
- Async message handling with tokio
- Message types for SCP envelopes, transactions, tx sets, and peer management commands

Flood control (overlay/src/flood/):
- Mempool for transaction storage and deduplication
- INV/GETDATA protocol for pull-based transaction dissemination
- Inventory batcher for efficient batched advertisement
- Inventory tracker to avoid redundant sends
- Pending request tracking with timeout support
- Transaction set assembly and caching
- Transaction buffer for outbound flow management

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
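The length-prefix framing mentioned above can be sketched as follows. This is an illustrative std-only sketch, not the PR's actual code: the function names (`encode_frame`, `decode_frame`) and the choice of a 4-byte big-endian length are assumptions, and the real transport does async reads with tokio rather than operating on in-memory buffers.

```rust
use std::convert::TryInto;

/// Prefix a payload with its length as a 4-byte big-endian integer.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = (payload.len() as u32).to_be_bytes().to_vec();
    frame.extend_from_slice(payload);
    frame
}

/// Try to split one complete frame off the front of a read buffer.
/// Returns the payload and the number of bytes consumed, or None if
/// the buffer does not yet hold a full frame.
fn decode_frame(buf: &[u8]) -> Option<(Vec<u8>, usize)> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes(buf[..4].try_into().unwrap()) as usize;
    if buf.len() < 4 + len {
        return None;
    }
    Some((buf[4..4 + len].to_vec(), 4 + len))
}

fn main() {
    let frame = encode_frame(b"scp-envelope");
    // A partial buffer yields None; a complete frame round-trips.
    assert!(decode_frame(&frame[..3]).is_none());
    let (payload, consumed) = decode_frame(&frame).unwrap();
    assert_eq!(payload, b"scp-envelope");
    assert_eq!(consumed, frame.len());
    println!("ok");
}
```

The key property of length-prefix framing is that the decoder never consumes a partial message, which keeps the stream in sync across short reads on the socket.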
Implement the peer-to-peer networking layer using libp2p:

- config.rs: Network configuration (listen addresses, bootstrap peers, Kademlia DHT settings, connection limits)
- libp2p_overlay.rs: Main libp2p swarm implementation with Kademlia DHT for peer discovery, gossipsub for message dissemination, and connection management with configurable limits
- integrated.rs: Integration layer that bridges the IPC transport with the libp2p overlay, routing messages between the C++ core and the P2P network
- main.rs: Standalone binary entry point for running the overlay as a separate process
- http/mod.rs: HTTP server for overlay status, metrics, and admin endpoints (peer info, connection stats)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
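The kind of configuration config.rs carries could look roughly like the sketch below. All names, field choices, and defaults here are illustrative assumptions (the multiaddress string and limits are placeholders, not values from the PR); it only shows the shape of the settings the commit describes: listen addresses, bootstrap peers, Kademlia settings, and connection limits.

```rust
use std::time::Duration;

#[derive(Debug, Clone)]
struct OverlayNetworkConfig {
    /// Multiaddresses the swarm listens on, e.g. a QUIC listen address.
    listen_addrs: Vec<String>,
    /// Known peers dialed at startup to join the Kademlia DHT.
    bootstrap_peers: Vec<String>,
    /// How often to trigger a Kademlia bootstrap/refresh cycle.
    kad_bootstrap_interval: Duration,
    /// Upper bounds enforced by the connection manager.
    max_inbound_connections: u32,
    max_outbound_connections: u32,
}

impl Default for OverlayNetworkConfig {
    fn default() -> Self {
        Self {
            listen_addrs: vec!["/ip4/0.0.0.0/udp/11625/quic-v1".to_string()],
            bootstrap_peers: Vec::new(),
            kad_bootstrap_interval: Duration::from_secs(60),
            max_inbound_connections: 64,
            max_outbound_connections: 8,
        }
    }
}

fn main() {
    let cfg = OverlayNetworkConfig::default();
    assert!(cfg.bootstrap_peers.is_empty());
    println!("{cfg:?}");
}
```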
Add comprehensive test suites for the Rust overlay:

- e2e_binary.rs: End-to-end tests that launch overlay processes and verify IPC communication, message routing, and peer connectivity
- kademlia_test.rs: Tests for Kademlia DHT peer discovery, including multi-node topologies, peer table convergence, and bootstrap scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the legacy C++ overlay networking layer with a thin IPC bridge that communicates with the new Rust overlay process.

New files:
- IPC.cpp/h: Low-level IPC transport (Unix domain sockets, message framing, async read/write)
- OverlayIPC.cpp/h: High-level IPC protocol handling (message serialization, request/response routing)
- RustOverlayManager.cpp/h: OverlayManager implementation that delegates networking to the Rust overlay via IPC
- NetworkConstants.h: Shared constants between C++ and Rust

Removed legacy components:
- Peer.cpp/h, TCPPeer.cpp/h: Direct TCP peer connections (now handled by Rust libp2p)
- FlowControl.cpp/h, FlowControlCapacity.cpp/h: Per-peer flow control (now handled by Rust flood control)
- Floodgate.cpp/h: Transaction flooding (replaced by Rust INV/GETDATA)
- PeerAuth.cpp/h, Hmac.cpp/h: Authentication (now handled by the libp2p noise protocol)
- PeerManager.cpp/h, PeerDoor.cpp/h: Peer lifecycle management
- ItemFetcher.cpp/h, Tracker.cpp/h: SCP message fetching
- TxAdverts.cpp/h, TxDemandsManager.cpp/h: Pull-mode tx dissemination
- Survey*.cpp/h: Network survey (to be reimplemented)
- OverlayManagerImpl.cpp/h: Legacy overlay manager

Updated tests:
- IPCTests.cpp: IPC transport unit tests
- OverlayIPCTests.cpp: IPC protocol integration tests
- OverlayIPCBenchmark.cpp: IPC throughput benchmarks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the C++ core to work with the Rust overlay via IPC instead of the legacy in-process overlay.

Herder changes:
- Remove TransactionQueue and TxQueueLimiter (transaction queuing is now handled by the Rust mempool)
- Adapt HerderImpl to submit transactions via IPC
- Update PendingEnvelopes to fetch SCP messages via IPC
- Simplify TxSetFrame to work with externally-assembled tx sets

Application changes:
- Add RustOverlayManager instantiation to ApplicationImpl
- Update Config with Rust overlay settings (IPC socket path, overlay binary path)
- Update AppConnector to expose the IPC interface
- Adapt CommandHandler for the new overlay endpoints

Simulation changes:
- Rewrite Simulation and Topologies to launch Rust overlay processes
- Update LoadGenerator and TxGenerator for IPC-based tx submission
- Adapt CoreTests for the new topology model

Other adaptations:
- Database: Remove overlay-specific tables (peers, peer preferences)
- LedgerManager: Remove direct overlay interactions
- Fuzz tests: Adapt for the new overlay interface

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update Dockerfiles to install the Rust toolchain and build the overlay crate as part of the stellar-core build pipeline.

- Dockerfile: Add rustup installation and a cargo build step
- Dockerfile.testing: Include the Rust build for test images

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a comprehensive design document for the Rust overlay, cross-checked against the actual implementation:

- Architecture overview with ASCII diagrams
- QUIC transport and stream protocol specification
- IPC protocol with all message types and wire formats
- INV/GETDATA pull-based TX flooding protocol (new)
- Mempool, TX set building, and shared state documentation
- Complete test coverage summary (178 Rust + C++ integration tests)
- Known issues and unimplemented features

Excludes session notes, debugging logs, and agent instructions from dev/ai/; this document is the single source of truth for the overlay design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SetPeerConfig handler previously accepted only the IP:port format for peer addresses. In Kubernetes, peers are identified by DNS hostnames (e.g. pod-0.service.namespace.svc.cluster.local), which were silently skipped, preventing any peer connections from being established. This change adds:

- DNS resolution for peer addresses: resolve_peer_addr() handles both IP:port (parsed directly) and DNS hostnames (resolved via tokio::net::lookup_host), using listen_port as the default when no port is specified in the hostname string.
- Self-dial detection: collect_local_addrs() builds a set of the node's own addresses using instant UDP socket probing at startup, plus background DNS hostname resolution for K8s pod IP detection. Peers that resolve to a local address are skipped instead of producing a confusing 'no addresses for peer' error from libp2p.
- Retry with exponential backoff: when DNS resolution fails (common during K8s pod startup, when not all peers are DNS-ready yet), a background task retries unresolved peers with exponential backoff (2s initial delay, 30s max, up to 10 attempts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
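The retry schedule described above (2s initial delay, 30s cap, up to 10 attempts) can be sketched as a pure function. This is an illustrative sketch only: the function name `retry_delay` and the doubling-with-cap schedule are assumptions, and the real implementation drives this from a background tokio task rather than a bare loop.

```rust
use std::time::Duration;

const INITIAL_DELAY_SECS: u64 = 2;
const MAX_DELAY_SECS: u64 = 30;
const MAX_ATTEMPTS: u32 = 10;

/// Delay before retry `attempt` (0-based): 2s, 4s, 8s, 16s, then capped
/// at 30s. Returns None once the attempt budget is exhausted.
fn retry_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_ATTEMPTS {
        return None;
    }
    let secs = INITIAL_DELAY_SECS
        // Saturate instead of overflowing for large shift amounts.
        .saturating_mul(1u64 << attempt.min(62))
        .min(MAX_DELAY_SECS);
    Some(Duration::from_secs(secs))
}

fn main() {
    assert_eq!(retry_delay(0), Some(Duration::from_secs(2)));
    assert_eq!(retry_delay(1), Some(Duration::from_secs(4)));
    // 2 * 2^4 = 32s, capped to the 30s maximum.
    assert_eq!(retry_delay(4), Some(Duration::from_secs(30)));
    assert_eq!(retry_delay(10), None);
    println!("ok");
}
```

Capping the delay keeps late retries responsive once DNS does come up, while the attempt budget stops the overlay from dialing a peer that was removed from the config forever.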
Implement all overlay-related metrics from docs/metrics.md in the Rust overlay V2 process and pipe them to the C++ core's libmedida-backed /metrics HTTP endpoint.

Rust side:
- New overlay/src/metrics.rs: lock-free atomic counters for ~40 metrics covering connections, I/O bytes/messages/errors, flood pull-mode (advertised/demanded/fulfilled/unfulfilled), connection lifecycle (inbound/outbound attempt/establish/drop), per-message-type send/recv counters, and timer summaries (recv-transaction, recv-scp, fetch-txset, tx-pull-latency, tx-batch-size histogram).
- MetricsSnapshot with serde::Serialize for JSON IPC transport.
- Instrumented libp2p_overlay.rs: ConnectionEstablished/Closed, SCP broadcast, TX broadcast/INV batching, GETDATA fulfill/unfulfill, TX receive with dedup tracking (unique/duplicate bytes), inbound stream reads, TxSet stream reads, and housekeeping timeouts.
- New IPC message types: RequestOverlayMetrics (13) / OverlayMetricsResponse (105) for a synchronous Core->Overlay metrics request.
- main.rs handles RequestOverlayMetrics by snapshotting and responding.

C++ side:
- OverlayIPC::requestMetrics(): synchronous IPC call with a separate mutex/CV to avoid collision with getTopTransactions().
- RustOverlayManager::syncOverlayMetrics(): parses the JSON snapshot, computes deltas for monotonic counters, and updates libmedida Meters/Counters/Timers/Histograms accordingly.
- ApplicationImpl::syncAllMetrics() calls syncOverlayMetrics() so metrics are fresh whenever the /metrics endpoint is queried.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
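The lock-free counter/snapshot pattern described above can be sketched with std atomics. The type and field names here are illustrative placeholders (the real metrics.rs tracks ~40 metrics and derives serde::Serialize on the snapshot); the sketch shows only the core idea: hot-path updates are single `fetch_add`s with relaxed ordering, and the snapshot is a plain-data copy taken on request.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct OverlayMetrics {
    messages_received: AtomicU64,
    bytes_received: AtomicU64,
}

#[derive(Debug, PartialEq)]
struct MetricsSnapshot {
    messages_received: u64,
    bytes_received: u64,
}

impl OverlayMetrics {
    /// Hot path: one relaxed fetch_add per counter, no locks.
    fn record_recv(&self, bytes: u64) {
        // Relaxed is enough here: each counter is independent and the
        // snapshot only needs eventual visibility, not ordering.
        self.messages_received.fetch_add(1, Ordering::Relaxed);
        self.bytes_received.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Copy the counters into a plain struct for transport over IPC.
    fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            messages_received: self.messages_received.load(Ordering::Relaxed),
            bytes_received: self.bytes_received.load(Ordering::Relaxed),
        }
    }
}

fn main() {
    let m = OverlayMetrics::default();
    m.record_recv(128);
    m.record_recv(256);
    let snap = m.snapshot();
    assert_eq!(snap.messages_received, 2);
    assert_eq!(snap.bytes_received, 384);
    println!("ok");
}
```

Because the Rust counters are monotonic while libmedida Meters are mark-based, the C++ side has to diff successive snapshots, which matches the "computes deltas for monotonic counters" step above.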
All TX sets were cached with ledger_seq: 0, causing evict_before() to wipe them on the first ledger close past sequence 5. This made the node unable to serve TX sets to peers ~600ms after caching them.

- Use current_ledger_seq when inserting into TxSetCache (both TxSetReceived from peers and CacheTxSet from Core)
- Extend the eviction buffer from 5 to 12 ledgers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
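A minimal sketch of the ledger-sequence-keyed eviction this fix relies on, assuming a hash-keyed map (the struct layout and method signatures are illustrative, not the PR's actual TxSetCache): each entry records the ledger it was cached at, and evict_before() drops anything older than the buffer. With the old behavior of always inserting at ledger_seq 0, every entry fell below the cutoff as soon as the current ledger passed the buffer size.

```rust
use std::collections::HashMap;

const EVICTION_BUFFER_LEDGERS: u32 = 12; // extended from 5 by this commit

struct TxSetCache {
    // TX set hash -> (ledger_seq it was cached at, encoded bytes)
    entries: HashMap<[u8; 32], (u32, Vec<u8>)>,
}

impl TxSetCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Insert using the *current* ledger sequence. The bug fixed here
    /// was inserting everything at ledger_seq 0, so the first eviction
    /// pass past the buffer wiped the whole cache.
    fn insert(&mut self, hash: [u8; 32], current_ledger_seq: u32, bytes: Vec<u8>) {
        self.entries.insert(hash, (current_ledger_seq, bytes));
    }

    /// Drop entries cached more than EVICTION_BUFFER_LEDGERS ago.
    fn evict_before(&mut self, current_ledger_seq: u32) {
        let cutoff = current_ledger_seq.saturating_sub(EVICTION_BUFFER_LEDGERS);
        self.entries.retain(|_, (seq, _)| *seq >= cutoff);
    }
}

fn main() {
    let mut cache = TxSetCache::new();
    cache.insert([1; 32], 100, vec![]);
    cache.insert([2; 32], 90, vec![]);
    cache.evict_before(105); // cutoff = 93: the ledger-90 entry is evicted
    assert!(cache.entries.contains_key(&[1; 32]));
    assert!(!cache.entries.contains_key(&[2; 32]));
    println!("ok");
}
```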
This is still a half-baked prototype of the Rust overlay, but I'm pushing it as a draft for visibility and initial feedback.
The last commit adds a couple of files to docs that give an overview of what this prototype is doing, but the tl;dr is that it implements the overlay as a separate process in Rust.

The good news is that we can use many things out of the box via libp2p: Kademlia for peer discovery, peer auth, GossipSub for topology forming, and QUIC for multiplexing. We also benefit from Rust's tokio async scheduler. Of course, these things need tuning and probably a bunch of debugging to hash out all the edge cases, but it's quite encouraging to see something that kinda works after just a few days of "vibe coding" (as confirmed by a good amount of tests on the Rust side and in OverlayIPCTests). It would be interesting to see if we can actually get a reliable initial implementation of the Rust overlay, such that we can optimize individual areas as needed (for example, we know GossipSub isn't what we want, so we can implement a custom solution in Rust rather than poking at C++, which has a bunch of weird legacy stuff like PeerManager etc.).
Another interesting observation from this experiment was finding the clean cut point between overlay and SCP. Moving most of the peer and flooding components to Rust leaves a single-digit number of interaction points with C++, which is quite a bit easier to reason about and encourages good separation of responsibilities.
Next steps would be to actually get it to run in supercluster (right now there are some annoying issues I'm debugging, like DNS resolution and containers being able to properly talk to each other), and to see what kind of SCP latency we can get from the removal of head-of-line blocking via multiplexing, plus the natural parallelism we get from the separation of processes.