
Feat/auto-restart tunnels on ping failure (+ optional fallback tunnel)#1182

Open
naonak wants to merge 9 commits into wgtunnel:master from naonak:feat/auto-restart-v2

Conversation

naonak (Contributor) commented Feb 28, 2026

Auto-restart tunnel on ping failure

Summary

Adds an optional auto-restart mechanism that monitors the active WireGuard tunnel and automatically restarts it when ping monitoring detects sustained connectivity failure. Entirely opt-in, configurable under Settings → Tunnel Monitoring → Auto-restart. See #1036.


Problem

A WireGuard tunnel can silently stop passing traffic, leaving the configured ping target unreachable. Without manual intervention, the tunnel stays "Up" in the UI while being effectively dead.


What's new

Functional

  • Auto-restart on ping failure — restarts the tunnel after N consecutive ping-failure intervals reported by the existing ping monitor
  • Post-restart verification — 5 s after the tunnel comes back UP, performs a fresh ping to confirm recovery; cooldown only starts if verification fails
  • Early recovery during cooldown — periodic pings continue running during cooldown; if they succeed before the timer expires, the next restart is skipped
  • Exponential backoff — optionally doubles the cooldown between each attempt
  • Give-up action — after max attempts: either keep monitoring (do nothing) or stop the tunnel entirely
  • Recovery notifications — always active; snackbar when the tunnel recovers or when max attempts are exhausted (no per-setting toggle)
  • No restart when internet is unavailable — pings are skipped and marked NoConnectivity when connectivityManager.allNetworks reports no physical network with NET_CAPABILITY_VALIDATED; prevents spurious restarts during ISP outages or mobile data being disabled
  • Real-time status in tunnel list — the tunnel card shows live restart progress with attempt counter on every phase (restarting 1/3…, verifying 1/3…, restart 1/3 · next in 30s), plus a cumulative restart counter inline with uptime (uptime: 3m · ↺ 4)
  • Fallback tunnel — when max attempts are exhausted, optionally switch to a designated fallback tunnel instead of stopping or doing nothing. Configurable globally (default fallback for all tunnels) and per-tunnel (override). Emits a SwitchedToFallback notification. Self-reference is prevented to avoid restart loops.
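The exponential-backoff rule above can be sketched as a pure function (function and parameter names are illustrative, not the PR's actual code):

```kotlin
// Sketch of the cooldown-with-backoff rule: with backoff enabled, the
// cooldown doubles on each attempt (30s -> 60s -> 120s ...); otherwise it
// stays constant. `attempt` is 0-based here by assumption.
fun nextCooldownMs(baseCooldownMs: Long, attempt: Int, backoffEnabled: Boolean): Long =
    if (backoffEnabled) baseCooldownMs shl attempt else baseCooldownMs
```

With the default 30 s cooldown, attempts 1–3 would wait 30 s, 60 s, and 120 s respectively when backoff is on.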

Configuration

Setting                               Default      Description
Restart cooldown                      30 s         Minimum time between restart attempts
Consecutive failures before restart   3            Ping-failure streak required to trigger
Exponential backoff                   off          Double cooldown on each attempt
Max attempts                          5            Give up after N failed restart attempts
Give-up action                        Do nothing   Do nothing or stop the tunnel
Fallback tunnel                       off          Switch to a fallback tunnel after max attempts
Default fallback tunnel               —            Global fallback used unless overridden per-tunnel
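As a rough illustration, the defaults in the table map to a settings type like the following (field names are assumptions, not necessarily the PR's actual MonitoringSettings schema):

```kotlin
// Illustrative mirror of the configuration defaults above.
data class AutoRestartSettings(
    val enabled: Boolean = false,               // feature is entirely opt-in
    val restartCooldownMs: Long = 30_000L,      // 30 s between restart attempts
    val pingFailuresBeforeRestart: Int = 3,     // failure streak to trigger
    val exponentialBackoff: Boolean = false,    // double cooldown each attempt
    val maxAttempts: Int = 5,                   // give up after N attempts
    val stopTunnelOnGiveUp: Boolean = false,    // false = "Do nothing"
    val fallbackEnabled: Boolean = false,
    val defaultFallbackTunnelId: Long? = null,  // null = no global fallback
)
```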

Technical design

HandshakeRestartHandler

Core of the feature. One monitoring coroutine per active tunnel, started when the tunnel appears in activeTunnels and cancelled when it leaves (via StateFlow observation). A Mutex serialises job lifecycle to prevent races during rapid tunnel transitions.

Trigger logic (awaitPingFailures)
Waits for pingFailuresBeforeRestart consecutive ping cycles where all targets report unreachable, using distinctUntilChanged on pingStates to track actual new cycles rather than reacting to every stats emission.
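The counting rule behind this trigger can be modelled in isolation (the real handler operates on a pingStates flow with distinctUntilChanged; this sketch covers only the streak logic, with illustrative names):

```kotlin
// Count consecutive ping cycles where every target is unreachable;
// any successful cycle resets the streak.
class FailureStreak(private val threshold: Int) {
    private var streak = 0

    /** Feed one completed ping cycle; returns true when a restart should trigger. */
    fun onCycle(allTargetsUnreachable: Boolean): Boolean {
        streak = if (allTargetsUnreachable) streak + 1 else 0
        return streak >= threshold
    }
}
```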

Restart / verify / cooldown cycle

awaitPingFailures()   <- N consecutive failures

loop attempt++:

  [RESTARTING]  -- periodic pings suppressed
  stopTunnel -> delay(300 ms)
  guard: if another tunnel became active -> abort (auto-tunnel took over)
  startTunnel -> wait UP (30 s timeout)

  [VERIFYING]  -- periodic pings suppressed
  delay(5 s settle)
  direct ping
  ok  -> ConnectionRestored, attempt=0, re-arm awaitPingFailures
  fail -> ...

  if attempt >= maxAttempts ->
    if fallback enabled -> SwitchedToFallback, stop current, start fallback, return
    else -> ConnectionPermanentlyLost / stop

  [COOLDOWN]  -- periodic pings ACTIVE
  race(cooldownMs):
    pingStates all reachable -> ConnectionRestored, attempt=0, re-arm
    timeout -> loop (attempt++)

Ping suppression during restart
TunnelMonitorHandler checks restartProgress before issuing periodic pings and skips the cycle only while isRestarting or isVerifying. Pings remain active during cooldown so early recovery can be detected.
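The suppression predicate is small but asymmetric, which is worth spelling out (phase names here are illustrative):

```kotlin
// Periodic pings skip only the restarting and verifying phases; cooldown
// deliberately keeps pinging so early recovery can short-circuit the timer.
enum class RestartPhase { IDLE, RESTARTING, VERIFYING, COOLDOWN, AWAITING_RECOVERY }

fun shouldSkipPeriodicPing(phase: RestartPhase): Boolean =
    phase == RestartPhase.RESTARTING || phase == RestartPhase.VERIFYING
```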

Auto-tunnel coordination
After stopping the tunnel, before restarting it, the handler checks whether another tunnel became active (e.g. auto-tunnel switched to a mobile-data tunnel). If so, the restart is aborted cleanly — the auto-tunnel's decision takes priority.

Recovery flow

  1. Ping streak detected -> ConnectionDegrading notification (attempt N/max)
  2. Tunnel stopped + restarted -> 5 s settle -> verification ping
    • 2a. Ping succeeds -> ConnectionRestored, attempt counter resets, monitor re-arms
    • 2b. Ping fails -> cooldown (with live ping monitoring for early exit) -> loop
  3. Max attempts reached:
    • 3a. Fallback enabled -> SwitchedToFallback, stop current tunnel, start fallback, handler exits
    • 3b. DO_NOTHING -> ConnectionPermanentlyLost, suspends until natural ping recovery then re-arms
    • 3c. STOP_TUNNEL -> ConnectionPermanentlyLost, tunnel stopped, handler exits
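The three-way decision at step 3 can be sketched as a single function (types and names are hypothetical; the fallback id is assumed to be already resolved to an existing tunnel, per the behaviour described when the configured fallback no longer exists):

```kotlin
// Decision at max attempts, mirroring steps 3a–3c above.
enum class GiveUpAction { DO_NOTHING, STOP_TUNNEL }

sealed interface GiveUpOutcome {
    data class SwitchToFallback(val fallbackTunnelId: Long) : GiveUpOutcome // 3a
    object AwaitNaturalRecovery : GiveUpOutcome                             // 3b
    object StopTunnel : GiveUpOutcome                                       // 3c
}

fun onMaxAttempts(action: GiveUpAction, fallbackTunnelId: Long?): GiveUpOutcome = when {
    fallbackTunnelId != null -> GiveUpOutcome.SwitchToFallback(fallbackTunnelId)
    action == GiveUpAction.DO_NOTHING -> GiveUpOutcome.AwaitNaturalRecovery
    else -> GiveUpOutcome.StopTunnel
}
```

Note that a usable fallback takes priority over either give-up action, matching the rule that ConnectionPermanentlyLost is not emitted when a fallback is available.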

UI — restart progress sequence

restarting 1/3…             (pings suppressed)
verifying 1/3…              (pings suppressed — direct ping)
restart 1/3 · next in 30s   (pings active -> early recovery possible)
restarting 2/3…
verifying 2/3…
restart 2/3 · next in 60s   (backoff)
restarting 3/3…
verifying 3/3…
-> awaiting ping recovery     (pings active — natural recovery)
-> or tunnel stopped
-> or switched to fallback tunnel

TunnelRestartProgress is a pure in-memory domain type flowing HandshakeRestartHandler -> TunnelManager -> SharedAppViewModel -> TunnelsUiState -> TunnelList — not persisted.
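The labels in the sequence above suggest a rendering along these lines (a sketch; the actual TunnelRestartProgress fields and formatting code may differ):

```kotlin
// In-memory progress state plus the tunnel-list label derived from it.
enum class ProgressPhase { RESTARTING, VERIFYING, COOLDOWN, AWAITING_RECOVERY }

data class RestartProgress(
    val phase: ProgressPhase,
    val attempt: Int,
    val maxAttempts: Int,
    val nextRetryInSec: Int = 0, // only meaningful during COOLDOWN
)

fun label(p: RestartProgress): String = when (p.phase) {
    ProgressPhase.RESTARTING -> "restarting ${p.attempt}/${p.maxAttempts}…"
    ProgressPhase.VERIFYING -> "verifying ${p.attempt}/${p.maxAttempts}…"
    ProgressPhase.COOLDOWN -> "restart ${p.attempt}/${p.maxAttempts} · next in ${p.nextRetryInSec}s"
    ProgressPhase.AWAITING_RECOVERY -> "awaiting ping recovery"
}
```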

Database

  • MonitoringSettings entity extended with new fields (sane defaults via auto-migration)
  • TunnelConfig entity extended with fallbackTunnelId (DB v35)

Also included


Test plan

Happy path

  • Enable ping monitoring + auto-restart; block peer ICMP -> tunnel does not restart until pingFailuresBeforeRestart consecutive failure cycles are observed, then restarts
  • Progress visible in tunnel list: restarting 1/N… -> verifying 1/N… -> restart 1/N · next in Xs (countdown live) -> cleared on success
  • "Connection restored" notification emitted after successful verification ping; attempt counter resets, monitor re-arms
  • totalRestarts counter increments and is shown inline with uptime (uptime: 3m · ↺ 2) across multiple recovery cycles
  • Health dot forced to UNHEALTHY (red) throughout the restart cycle, even if WireGuard briefly reports healthy

Cooldown early recovery

  • Block endpoint -> restart -> verify fails -> during cooldown, unblock endpoint -> pings succeed -> "Connection restored" without triggering next restart

Exponential backoff

  • With backoff enabled, cooldown doubles each attempt: 30s -> 60s -> 120s…
  • With backoff disabled, cooldown stays constant

Max attempts — DO_NOTHING

  • After max attempts: ConnectionPermanentlyLost notification (indicates tunnel still running), progress freezes on awaiting ping recovery
  • No verifying… flash before settling on awaiting ping recovery (no false-positive race)
  • When connectivity naturally recovers: "Connection restored" notification, progress cleared, monitor re-arms automatically

Max attempts — STOP_TUNNEL

  • After max attempts: ConnectionPermanentlyLost notification (indicates tunnel stopped), tunnel is actually stopped, progress cleared
  • Handler exits; no further restart attempts

Max attempts — Fallback tunnel

  • After max attempts with fallback enabled: SwitchedToFallback notification, current tunnel stops, fallback tunnel starts
  • Per-tunnel fallback overrides the global default fallback
  • Setting the fallback to the same tunnel as the source is prevented (no restart loop)
  • ConnectionPermanentlyLost is NOT emitted when a fallback is available
  • awaiting recovery progress clears immediately when fallback switch begins (not after)
  • If the configured fallback tunnel no longer exists, falls back to give-up action (DO_NOTHING / STOP_TUNNEL)
  • Toggling the failing tunnel OFF during a fallback switch: restart loop aborted, tunnel stays off
  • Fallback tunnel screen shows current fallback name per tunnel; selecting a new one updates immediately
  • DB upgrade from v34: fallbackTunnelId defaults to null (no fallback) for all existing tunnels

Auto-tunnel interaction

  • With auto-tunnel enabled (WiFi->A, mobile->B): trigger restart on A, then switch to mobile data mid-restart -> B activates, A's restart handler aborts cleanly, no flip-flop

Settings changes mid-cycle

  • Disabling auto-restart (or disabling ping) mid-cycle: current restart cancelled, progress cleared immediately
  • Re-enabling auto-restart: monitoring re-arms on the next ping cycle

Manual intervention

  • Toggling the tunnel switch OFF during an active restart: restart cancelled cleanly, no phantom progress remains
  • Deleting the tunnel while restarting: job cancelled, no crash, no orphaned progress

DB migration

  • Upgrade from version 29: no crash, monitoring_settings created with all defaults (auto-restart off, cooldown 30s, max 5 attempts, DO_NOTHING)

naonak (Contributor, Author) commented Mar 1, 2026

Hey everyone 👋

The auto-restart feature is ready for broader testing — I've been running it for a while and can't reproduce any more bugs. If you've been waiting for a way to automatically recover from silent tunnel failures, now's a great time to give it a try.

You can find it under Settings → Tunnel Monitoring → Auto-restart (requires ping monitoring to be enabled first).

Any feedback — edge cases, unexpected behaviour, UI quirks — is welcome. Thanks!

naonak and others added 2 commits March 10, 2026 14:16
Introduces MonitoringSettings Room entity and domain model to persist
auto-restart configuration: enabled flag, ping failure threshold,
cooldown duration, max restart attempts, exponential backoff toggle,
and on-max-attempts action (keep waiting or stop tunnel).

BackendMessage sealed class defines typed tunnel lifecycle events:
ConnectionDegrading, ConnectionRestored, ConnectionPermanentlyLost.
TunnelRestartProgress domain state tracks the full restart lifecycle
(idle → restarting → verifying → cooldown → awaiting recovery).

DB migrated from version 29 to 35.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fication

Implements HandshakeRestartHandler, a coroutine-based state machine that
monitors ping health and automatically restarts the tunnel when consecutive
ping failures exceed the configured threshold.

Restart flow:
1. N consecutive ping failures → stop + restart tunnel (attempt 1/max)
2. 5 s verification ping after tunnel comes UP confirms recovery
3. On verification failure → exponential (or fixed) cooldown, then retry
4. Pings remain active during cooldown → early recovery skips next restart
5. After max attempts: emit ConnectionPermanentlyLost; if DO_NOTHING,
   suspend until natural ping recovery then re-arm automatically
6. On successful verification or natural recovery → emit ConnectionRestored,
   reset counter, re-arm monitor

Edge cases handled:
- Abort restart cycle when auto-tunnel switches to a different tunnel
- Skip unnecessary restart when ping recovers during cooldown
- Always poll WireGuard stats regardless of Doze mode (prerequisite fix)

TunnelMonitoringHandler wires HandshakeRestartHandler alongside the existing
ping/handshake monitors. TunnelManager exposes restart progress state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@naonak naonak force-pushed the feat/auto-restart-v2 branch from a9eee47 to 48e50c2 Compare March 10, 2026 13:17
naonak (Contributor, Author) commented Mar 10, 2026

I have kept testing and everything looks good up to this point. I grouped the commits to facilitate the code review process.

@naonak naonak force-pushed the feat/auto-restart-v2 branch 2 times, most recently from af67ffa to e2c65f2 Compare March 10, 2026 14:15
AutoRestartScreen: configures auto-restart (enable/disable, ping failures
before restart, cooldown, max attempts, exponential backoff, on-max-attempts
action). Accessible from Settings → Tunnel monitoring.

TunnelList: inline restart progress label below tunnel name shows the current
phase — "restarting 1/3…", "verifying 1/3…", "restart 1/3 · next in 28s",
"awaiting ping recovery" — and total restart counter alongside uptime
("uptime: 4m · ↺ 3"). Dot color forced to UNHEALTHY during active restart.

MonitoringViewModel bridges MonitoringSettings persistence and exposes
restartProgress state from TunnelManager to the UI layer.

Snackbar notifications emitted on ConnectionRestored and
ConnectionPermanentlyLost (always active, no per-setting toggle).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@naonak naonak force-pushed the feat/auto-restart-v2 branch from e2c65f2 to b39a60d Compare March 10, 2026 14:32
naonak added a commit to naonak/wgtunnel that referenced this pull request Mar 11, 2026
When all pings fail (timeout -> Icmp.PingResult.Failed), rttList stays
empty and stats.transmitted was never assigned, leaving it at 0.

Move stats.transmitted = count before the rttList.isNotEmpty() check so
it always reflects the number of attempted pings, matching the expected
semantics of "packets transmitted".

This unblocks HandshakeRestartHandler.awaitPingFailures() (introduced in
wgtunnel#1182) which requires transmitted > 0 to distinguish a real failure from
pings not routed through the tunnel.
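The described fix amounts to hoisting one assignment above an early-exit condition; a minimal sketch (stats type, field names, and function shape are assumptions):

```kotlin
// Assign `transmitted` before the empty-RTT check, so a run where every
// ping timed out still reports how many pings were attempted.
data class PingStats(var transmitted: Int = 0, var received: Int = 0, var avgRttMs: Double = 0.0)

fun summarize(count: Int, rttList: List<Double>): PingStats {
    val stats = PingStats()
    stats.transmitted = count          // moved up: set even when rttList is empty
    if (rttList.isNotEmpty()) {
        stats.received = rttList.size
        stats.avgRttMs = rttList.average()
    }
    return stats
}
```

With the assignment inside the `isNotEmpty()` branch (the old placement), an all-timeout run would leave `transmitted == 0`, which awaitPingFailures() could not distinguish from pings that never ran.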
Remove the cooldownMs > pingIntervalMs guard. The withTimeoutOrNull block
already handles both cases correctly — it expires after cooldownMs when no
recovery is detected, and exits early if pings succeed. This enables early
recovery detection even when cooldown <= pingInterval, at zero extra cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@naonak naonak force-pushed the feat/auto-restart-v2 branch from 24979f9 to ca72d74 Compare March 11, 2026 19:46
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@naonak naonak force-pushed the feat/auto-restart-v2 branch from 6370b11 to 3443313 Compare March 18, 2026 18:57
naonak and others added 2 commits March 19, 2026 09:47
- DB v35: add fallbackTunnelId to TunnelConfig, isFallbackEnabled and
  defaultFallbackTunnelId to MonitoringSettings
- HandshakeRestartHandler: switch to fallback on max failures, emit
  SwitchedToFallback notification; fix race condition (keep restarting=true
  until after startTunnel); prevent self-reference fallback loop;
  clear "awaiting recovery" progress before fallback switch; only emit
  ConnectionPermanentlyLost when no fallback available
- TunnelConfig.equals(): include fallbackTunnelId so StateFlow emits
  on fallback change and FallbackTunnelScreen recomposes correctly
- FallbackTunnelScreen: per-tunnel fallback picker with SurfaceRow
  expandedContent pattern
- AutoRestartScreen: global fallback toggle + default fallback dropdown,
  disabled state grays out dropdown
- DropdownSelector: add enabled param with disabled color
- Navigation: Route.FallbackTunnel + navbar state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When restarting=true blocks job cancellation during a stop/start cycle,
the activeTunnels collector skips cleanup. After restarting=false the
collector won't re-fire since activeTunnels hasn't changed, leaving the
job alive to restart the tunnel indefinitely even after a user toggle-off.

Fix: after clearing restarting flag, check if the tunnel is still in
activeTunnels. If absent, it was stopped externally during the protected
window — clear progress and return to break the loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
naonak (Contributor, Author) commented Mar 19, 2026

I've added a fallback tunnel feature to this PR.

When max restart attempts are exhausted, the tunnel can now automatically switch to a designated fallback tunnel instead of stopping or doing nothing. Configurable globally (default fallback for all tunnels) and per-tunnel. Full details in the updated PR description.

I'm stopping improvements here — the feature set is complete and the code is ready for review.

@naonak naonak changed the title Feat/auto-restart tunnels on ping failure 1036 Feat/auto-restart tunnels on ping failure (+ optional fallback tunnel) Mar 21, 2026
@fiveseven7 commented:

Is it possible to also restart the tunnel when the log monitor detects failed handshake initiations?
