fix(libp2p): make AutoNAT log messages actionable for operators#4260

Open
sveitser wants to merge 4 commits into main from ma/clearer-autonat-logs
Conversation

@sveitser
Collaborator

@sveitser sveitser commented May 6, 2026

  • Per-probe failures (autonat::OutboundProbeEvent::Error) demoted from warn! to debug!. Each failed probe is a single peer momentarily unable to dial back; the operator cannot fix any individual probe, and the WARN spam drowns out the useful aggregate signal.

  • StatusChanged events promoted from debug! to actionable messages keyed off NatStatus:

    Public(addr): info!  - confirms reachability, includes the address
    Private:      error! - operator MUST act, message names the flag and the
                           network-level checks (firewall, NAT, security group)
    Unknown:      debug! - genuinely uncertain, not actionable
  • Before this change, an operator behind a misconfigured NAT/NodePort saw a flood of opaque "AutoNAT Probe failed to peer ... with error: OutboundRequest( Io(Kind(UnexpectedEof)))" lines and no clear signal that their node was unreachable. After this change they see a single ERROR line naming --libp2p-advertise-address (env ESPRESSO_NODE_LIBP2P_ADVERTISE_ADDRESS) and the firewall checks they need to perform.

  • Drive-by: fix a "fales" typo in a comment and whitelist the bimap crate name in .typos.toml so the pre-commit spell-check passes.
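
The status-to-level mapping described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `NatStatus` here is a local stand-in that mirrors the three variants named in the description (with the address simplified to a `String`), and `Level`, `status_log`, and the exact message wording are hypothetical. Only the flag and environment variable names come from the description.

```rust
/// Local stand-in for the AutoNAT status (hypothetical; address simplified to a String).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NatStatus {
    Public(String),
    Private,
    Unknown,
}

/// Hypothetical log level tag, used so the mapping can be checked without a logger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Debug,
    Info,
    Error,
}

/// Pick a log level and an operator-facing message for a status change.
/// The message text is illustrative; only the flag and env var names are from the PR.
pub fn status_log(status: &NatStatus) -> (Level, String) {
    match status {
        // Reachability confirmed: informational, include the address.
        NatStatus::Public(addr) => (
            Level::Info,
            format!("AutoNAT: node is publicly reachable at {}", addr),
        ),
        // Not reachable: the operator must act, so name the flag and the checks.
        NatStatus::Private => (
            Level::Error,
            String::from(
                "AutoNAT: node is NOT publicly reachable. Set --libp2p-advertise-address \
                 (env ESPRESSO_NODE_LIBP2P_ADVERTISE_ADDRESS) and check firewall, NAT, \
                 and security group rules.",
            ),
        ),
        // Genuinely uncertain: nothing for the operator to do yet.
        NatStatus::Unknown => (Level::Debug, String::from("AutoNAT: reachability unknown")),
    }
}
```

In a real handler the returned level would select between `info!`, `error!`, and `debug!`; returning a value instead keeps the mapping easy to unit test.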

@sveitser sveitser requested a review from twittner May 6, 2026 13:39
@sveitser
Collaborator Author

sveitser commented May 6, 2026

One of the things I don't really know due to not being familiar with libp2p is when the status changes. I will check that soon.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the typo configuration, fixes a typo in a comment, and improves AutoNAT logging in the libp2p networking layer. The logging changes now distinguish between public, private, and unknown NAT statuses, providing actionable error messages for private nodes. I have kept the review comment regarding the tight coupling of consensus-specific terminology in the networking layer, as it identifies a significant architectural improvement opportunity.

Comment thread crates/hotshot/libp2p-networking/src/network/node.rs Outdated
@sveitser
Collaborator Author

sveitser commented May 6, 2026

One of the things I don't really know due to not being familiar with libp2p is when the status changes. I will check that soon.

So it seems this would fire if the first check fails, which is bad. Will fix.

sveitser added 2 commits May 6, 2026 16:37
- Demote Unknown -> Private to warn since confidence is 0 on the first flip
  and a single transient probe failure would otherwise fire the loud error.
- Emit the operator-facing error only once confidence reaches confidence_max
  in Private status, indicating the result is reproduced by repeated probes.
- Latch the error per Private episode and reset on transitions back to
  Public or Unknown so recovery + re-loss re-fires.
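
A minimal sketch of the latching behaviour these commit notes describe, under stated assumptions: the caller passes in the current status together with the AutoNAT confidence and `confidence_max` on every update, and all names here (`Status`, `Emit`, `PrivateErrorLatch`, `observe`) are hypothetical rather than taken from the PR. `Status` drops the address payload for brevity.

```rust
/// Simplified reachability status (hypothetical; the address payload is omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Public,
    Private,
    Unknown,
}

/// What the caller should log for one observation (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Emit {
    Nothing,
    Warn,
    Error,
}

/// Where we are within the current Private episode, if any.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Episode {
    NotPrivate,
    Unconfirmed,
    Reported,
}

/// Fires the loud error at most once per Private episode.
pub struct PrivateErrorLatch {
    episode: Episode,
}

impl PrivateErrorLatch {
    pub fn new() -> Self {
        Self { episode: Episode::NotPrivate }
    }

    /// Feed one status observation plus the current confidence values.
    pub fn observe(&mut self, status: Status, confidence: usize, confidence_max: usize) -> Emit {
        match status {
            Status::Private => {
                if self.episode == Episode::Reported {
                    // Already reported for this episode: stay quiet.
                    Emit::Nothing
                } else if confidence >= confidence_max {
                    // Repeated probes agree: emit the operator-facing error once.
                    self.episode = Episode::Reported;
                    Emit::Error
                } else if self.episode == Episode::NotPrivate {
                    // First flip to Private with low confidence: only warn.
                    self.episode = Episode::Unconfirmed;
                    Emit::Warn
                } else {
                    Emit::Nothing
                }
            }
            // Leaving Private resets the latch so a later loss re-fires.
            Status::Public | Status::Unknown => {
                self.episode = Episode::NotPrivate;
                Emit::Nothing
            }
        }
    }
}
```

With `confidence_max` set to 3, a first `Private` observation at confidence 0 yields `Warn`, the observation that reaches confidence 3 yields `Error`, further `Private` observations yield `Nothing`, and a `Public` observation followed by a confirmed `Private` yields `Error` again.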
@sveitser sveitser marked this pull request as ready for review May 6, 2026 15:09
@github-actions
Contributor

github-actions Bot commented May 6, 2026

Nextest failures (1) in this run

| Test | Attempts | Time (s) | Main history |
| --- | --- | --- | --- |
| cliquenet::bench/bench1::tcp/10 MiB | 1 | 25.07 | passing |

See the step summary for flaky tests and slowest tests.

2 participants