Feat/auto-restart tunnels on ping failure (+ optional fallback tunnel)#1182
naonak wants to merge 9 commits into wgtunnel:master
Conversation
Hey everyone 👋 The auto-restart feature is ready for broader testing — I've been running it for a while and can't reproduce any more bugs. If you've been waiting for a way to automatically recover from silent tunnel failures, now's a great time to give it a try. You can find it under Settings → Tunnel Monitoring → Auto-restart (requires ping monitoring to be enabled first). Any feedback — edge cases, unexpected behaviour, UI quirks — is welcome. Thanks!
Introduces MonitoringSettings Room entity and domain model to persist auto-restart configuration: enabled flag, ping failure threshold, cooldown duration, max restart attempts, exponential backoff toggle, and on-max-attempts action (keep waiting or stop tunnel).

BackendMessage sealed class defines typed tunnel lifecycle events: ConnectionDegrading, ConnectionRestored, ConnectionPermanentlyLost.

TunnelRestartProgress domain state tracks the full restart lifecycle (idle → restarting → verifying → cooldown → awaiting recovery).

DB migrated from version 29 to 35.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fication

Implements HandshakeRestartHandler, a coroutine-based state machine that monitors ping health and automatically restarts the tunnel when consecutive ping failures exceed the configured threshold.

Restart flow:
1. N consecutive ping failures → stop + restart tunnel (attempt 1/max)
2. 5 s verification ping after tunnel comes UP confirms recovery
3. On verification failure → exponential (or fixed) cooldown, then retry
4. Pings remain active during cooldown → early recovery skips next restart
5. After max attempts: emit ConnectionPermanentlyLost; if DO_NOTHING, suspend until natural ping recovery then re-arm automatically
6. On successful verification or natural recovery → emit ConnectionRestored, reset counter, re-arm monitor

Edge cases handled:
- Abort restart cycle when auto-tunnel switches to a different tunnel
- Skip unnecessary restart when ping recovers during cooldown
- Always poll WireGuard stats regardless of Doze mode (prerequisite fix)

TunnelMonitoringHandler wires HandshakeRestartHandler alongside the existing ping/handshake monitors. TunnelManager exposes restart progress state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
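The consecutive-failure trigger in step 1 can be sketched as a pure function. This is illustrative only — PingCycle and shouldTriggerRestart are made-up names, not the actual API:

```kotlin
// Hypothetical sketch: a restart fires only after N consecutive ping cycles in
// which every target failed; any cycle with a reachable target resets the count.
data class PingCycle(val targetsReachable: List<Boolean>)

fun shouldTriggerRestart(cycles: List<PingCycle>, failuresBeforeRestart: Int): Boolean {
    var consecutive = 0
    for (cycle in cycles) {
        // A cycle counts as a failure only when ALL targets were unreachable.
        if (cycle.targetsReachable.isNotEmpty() && cycle.targetsReachable.none { it }) {
            consecutive++
            if (consecutive >= failuresBeforeRestart) return true
        } else {
            consecutive = 0 // any reachable target re-arms the counter
        }
    }
    return false
}
```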
force-pushed from a9eee47 to 48e50c2
I have kept testing and everything looks good up to this point. I grouped the commits to facilitate the code review process.
force-pushed from af67ffa to e2c65f2
AutoRestartScreen: configures auto-restart (enable/disable, ping failures
before restart, cooldown, max attempts, exponential backoff, on-max-attempts
action). Accessible from Settings → Tunnel monitoring.
TunnelList: inline restart progress label below tunnel name shows the current
phase — "restarting 1/3…", "verifying 1/3…", "restart 1/3 · next in 28s",
"awaiting ping recovery" — and total restart counter alongside uptime
("uptime: 4m · ↺ 3"). Dot color forced to UNHEALTHY during active restart.
MonitoringViewModel bridges MonitoringSettings persistence and exposes
restartProgress state from TunnelManager to the UI layer.
Snackbar notifications emitted on ConnectionRestored and
ConnectionPermanentlyLost (always active, no per-setting toggle).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
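As a rough illustration of how the inline label could be derived from restart state — the Phase and Progress types are assumptions; only the label strings mirror the ones quoted above:

```kotlin
// Illustrative mapping from restart state to the inline progress label shown
// below the tunnel name. Type and phase names are hypothetical.
enum class Phase { IDLE, RESTARTING, VERIFYING, COOLDOWN, AWAITING_RECOVERY }

data class Progress(val phase: Phase, val attempt: Int, val maxAttempts: Int, val secondsToNext: Int = 0)

fun progressLabel(p: Progress): String? = when (p.phase) {
    Phase.IDLE -> null // no label when no restart is in flight
    Phase.RESTARTING -> "restarting ${p.attempt}/${p.maxAttempts}…"
    Phase.VERIFYING -> "verifying ${p.attempt}/${p.maxAttempts}…"
    Phase.COOLDOWN -> "restart ${p.attempt}/${p.maxAttempts} · next in ${p.secondsToNext}s"
    Phase.AWAITING_RECOVERY -> "awaiting ping recovery"
}
```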
force-pushed from e2c65f2 to b39a60d
When all pings fail (timeout -> Icmp.PingResult.Failed), rttList stays empty and stats.transmitted was never assigned, leaving it at 0.

Move stats.transmitted = count before the rttList.isNotEmpty() check so it always reflects the number of attempted pings, matching the expected semantics of "packets transmitted".

This unblocks HandshakeRestartHandler.awaitPingFailures() (introduced in wgtunnel#1182) which requires transmitted > 0 to distinguish a real failure from pings not routed through the tunnel.
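A minimal reconstruction of the ordering fix, assuming a simplified PingStats shape (the real class has more fields):

```kotlin
// Simplified stand-in for the ping statistics type.
data class PingStats(var transmitted: Int = 0, var received: Int = 0, var avgRttMs: Long = 0)

fun computeStats(count: Int, rttList: List<Long>): PingStats {
    val stats = PingStats()
    stats.transmitted = count // moved BEFORE the isNotEmpty() check: always set
    if (rttList.isNotEmpty()) {
        stats.received = rttList.size
        stats.avgRttMs = rttList.sum() / rttList.size
    }
    return stats
}
```

With this ordering, an all-timeout run yields transmitted > 0 with received == 0, which is what awaitPingFailures() needs to recognise a real failure.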
Remove the cooldownMs > pingIntervalMs guard. The withTimeoutOrNull block already handles both cases correctly — it expires after cooldownMs when no recovery is detected, and exits early if pings succeed. This enables early recovery detection even when cooldown <= pingInterval, at zero extra cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
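The "whichever comes first" semantics can be modelled without coroutines as a pure decision over ping-result arrival times. Illustrative only — the real code uses a suspending withTimeoutOrNull block:

```kotlin
// pingResults: (arrival time in ms since cooldown start, success?).
// Early recovery wins if any successful ping result lands before the cooldown
// expires; otherwise the cooldown runs out and the next restart attempt proceeds.
fun cooldownOutcome(cooldownMs: Long, pingResults: List<Pair<Long, Boolean>>): String {
    val firstSuccess = pingResults.firstOrNull { it.second }?.first
    return if (firstSuccess != null && firstSuccess < cooldownMs) "early-recovery"
    else "cooldown-expired"
}
```

Note there is no guard on cooldownMs vs the ping interval: a success arriving inside a short cooldown still counts, which is exactly what removing the guard enables.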
force-pushed from 24979f9 to ca72d74
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
force-pushed from 6370b11 to 3443313
- DB v35: add fallbackTunnelId to TunnelConfig, isFallbackEnabled and defaultFallbackTunnelId to MonitoringSettings
- HandshakeRestartHandler: switch to fallback on max failures, emit SwitchedToFallback notification; fix race condition (keep restarting=true until after startTunnel); prevent self-reference fallback loop; clear "awaiting recovery" progress before fallback switch; only emit ConnectionPermanentlyLost when no fallback available
- TunnelConfig.equals(): include fallbackTunnelId so StateFlow emits on fallback change and FallbackTunnelScreen recomposes correctly
- FallbackTunnelScreen: per-tunnel fallback picker with SurfaceRow expandedContent pattern
- AutoRestartScreen: global fallback toggle + default fallback dropdown; disabled state grays out dropdown
- DropdownSelector: add enabled param with disabled color
- Navigation: Route.FallbackTunnel + navbar state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
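A hedged sketch of the fallback decision described above. The precedence of the per-tunnel setting over the global default is one plausible reading of the commit message, and every name here is illustrative:

```kotlin
// Resolve which tunnel (if any) to fall back to at max failures.
// Per-tunnel fallback wins over the global default; self-reference is rejected
// so a tunnel can never "fall back" to itself and loop.
fun resolveFallback(tunnelId: Int, tunnelFallbackId: Int?, globalEnabled: Boolean, defaultFallbackId: Int?): Int? {
    val candidate = tunnelFallbackId ?: if (globalEnabled) defaultFallbackId else null
    return candidate?.takeIf { it != tunnelId } // never fall back to ourselves
}

// ConnectionPermanentlyLost is emitted only when no usable fallback exists.
fun messageOnMaxAttempts(fallbackId: Int?): String =
    if (fallbackId != null) "SwitchedToFallback" else "ConnectionPermanentlyLost"
```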
When restarting=true blocks job cancellation during a stop/start cycle, the activeTunnels collector skips cleanup. After restarting=false the collector won't re-fire since activeTunnels hasn't changed, leaving the job alive to restart the tunnel indefinitely even after a user toggle-off.

Fix: after clearing the restarting flag, check whether the tunnel is still in activeTunnels. If absent, it was stopped externally during the protected window — clear progress and return to break the loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
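The fix can be sketched as a small decision function — MonitorDecision and afterRestartWindow are hypothetical names for illustration:

```kotlin
enum class MonitorDecision { CONTINUE_MONITORING, CLEAR_AND_EXIT }

// After the restarting flag is cleared, verify the tunnel is still active.
// If it was stopped externally during the protected stop/start window, the
// collector won't re-fire, so the handler must clear progress and exit itself.
fun afterRestartWindow(tunnelId: Int, activeTunnelIds: Set<Int>): MonitorDecision =
    if (tunnelId in activeTunnelIds) MonitorDecision.CONTINUE_MONITORING
    else MonitorDecision.CLEAR_AND_EXIT // user toggled off mid-restart: don't resurrect the tunnel
```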
I've added a fallback tunnel feature to this PR. When max restart attempts are exhausted, the tunnel can now automatically switch to a designated fallback tunnel instead of stopping or doing nothing. Configurable globally (default fallback for all tunnels) and per-tunnel. Full details in the updated PR description. I'm stopping improvements here — the feature set is complete and the code is ready for review.
Is it possible to also restart the tunnel when the log monitor detects failed handshake initiations? |
Auto-restart tunnel on ping failure
Summary
Adds an optional auto-restart mechanism that monitors the active WireGuard tunnel and automatically restarts it when ping monitoring detects sustained connectivity failure. It is entirely opt-in, configurable under Settings → Tunnel Monitoring → Auto-restart. See #1036.
Problem
A WireGuard tunnel can silently stop passing traffic; when that happens the configured ping target becomes unreachable, but without manual intervention the tunnel stays "Up" in the UI while being effectively dead.
What's new
Functional
- NoConnectivity state when connectivityManager.allNetworks reports no physical network with NET_CAPABILITY_VALIDATED; prevents spurious restarts during ISP outages or mobile data being disabled
- Inline restart progress labels (restarting 1/3…, verifying 1/3…, restart 1/3 · next in 30s), plus a cumulative restart counter inline with uptime (uptime: 3m · ↺ 4)
- Optional fallback tunnel switch with a SwitchedToFallback notification. Self-reference is prevented to avoid restart loops.
Configuration
Technical design
HandshakeRestartHandler
Core of the feature. One monitoring coroutine per active tunnel, started when the tunnel appears in activeTunnels and cancelled when it leaves (via StateFlow observation). A Mutex serialises job lifecycle to prevent races during rapid tunnel transitions.
Trigger logic (awaitPingFailures)
Waits for pingFailuresBeforeRestart consecutive ping cycles where all targets report unreachable, using distinctUntilChanged on pingStates to track actual new cycles rather than reacting to every stats emission.
Restart / verify / cooldown cycle
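The cycle might be captured as a pure transition function. This is a sketch under assumed phase and event names — the real handler is a suspending coroutine, not a table-driven machine:

```kotlin
enum class RestartPhase { RESTARTING, VERIFYING, COOLDOWN, AWAITING_RECOVERY, DONE }

fun step(phase: RestartPhase, event: String, attempt: Int, maxAttempts: Int): RestartPhase = when {
    // tunnel came back UP: run the verification ping
    phase == RestartPhase.RESTARTING && event == "tunnel-up" -> RestartPhase.VERIFYING
    // verification succeeded: ConnectionRestored, monitor re-arms
    phase == RestartPhase.VERIFYING && event == "ping-ok" -> RestartPhase.DONE
    // verification failed: cooldown, or give up after max attempts
    phase == RestartPhase.VERIFYING && event == "ping-fail" ->
        if (attempt >= maxAttempts) RestartPhase.AWAITING_RECOVERY else RestartPhase.COOLDOWN
    // ping recovered during cooldown: skip the next restart entirely
    phase == RestartPhase.COOLDOWN && event == "ping-ok" -> RestartPhase.DONE
    // cooldown expired without recovery: try the next restart
    phase == RestartPhase.COOLDOWN && event == "cooldown-expired" -> RestartPhase.RESTARTING
    // natural recovery after max attempts (DO_NOTHING): re-arm
    phase == RestartPhase.AWAITING_RECOVERY && event == "ping-ok" -> RestartPhase.DONE
    else -> phase
}
```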
Ping suppression during restart
TunnelMonitorHandler checks restartProgress before issuing periodic pings and skips the cycle only while isRestarting or isVerifying. Pings remain active during cooldown so early recovery can be detected.
Auto-tunnel coordination
After stopping the tunnel, before restarting it, the handler checks whether another tunnel became active (e.g. auto-tunnel switched to a mobile-data tunnel). If so, the restart is aborted cleanly — the auto-tunnel's decision takes priority.
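A minimal sketch of this check, assuming the handler can read the set of currently active tunnel ids at that point (the monitored tunnel has just been stopped, so any other active tunnel means auto-tunnel took over):

```kotlin
// Abort the in-flight restart if a different tunnel became active between the
// stop and the start; the auto-tunnel's decision takes priority.
fun shouldAbortRestart(monitoredTunnelId: Int, activeTunnelIds: Set<Int>): Boolean =
    activeTunnelIds.any { it != monitoredTunnelId } // another tunnel took over
```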
Recovery flow
- On restart trigger: ConnectionDegrading notification (attempt N/max)
- On verification success or natural recovery: ConnectionRestored, attempt counter resets, monitor re-arms
- At max attempts with a fallback configured: SwitchedToFallback, stop current tunnel, start fallback, handler exits
- At max attempts with DO_NOTHING -> ConnectionPermanentlyLost, suspends until natural ping recovery then re-arms
- At max attempts with STOP_TUNNEL -> ConnectionPermanentlyLost, tunnel stopped, handler exits
UI — restart progress sequence
TunnelRestartProgress is a pure in-memory domain type flowing HandshakeRestartHandler -> TunnelManager -> SharedAppViewModel -> TunnelsUiState -> TunnelList — not persisted.
Database
- MonitoringSettings entity extended with new fields (sane defaults via auto-migration)
- TunnelConfig entity extended with fallbackTunnelId (DB v35)
Also included
fix: pingWithStats() transmitted always 0 on ping timeout — stats.transmitted was only set inside if (rttList.isNotEmpty()), so when all pings failed (timeout), transmitted stayed 0. This prevented awaitPingFailures() from ever triggering a restart (see fix(ping): transmitted always 0 when all pings fail (timeout) #1197)
fix: always poll WireGuard stats regardless of Doze mode — stats polling was gated on isDeviceIdleMode; removed the gate so handshake timestamps remain up to date in the background (fix(core): always poll WireGuard stats regardless of Doze mode #1177)
Test plan
Happy path
- Restart triggers only after pingFailuresBeforeRestart consecutive failure cycles are observed, then restarts
- Progress label sequence: restarting 1/N… -> verifying 1/N… -> restart 1/N · next in Xs (countdown live) -> cleared on success
- totalRestarts counter increments and is shown inline with uptime (uptime: 3m · ↺ 2) across multiple recovery cycles
Cooldown early recovery
Exponential backoff
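A plausible shape for the exponential cooldown schedule — the doubling rule and the cap are assumptions, not confirmed by the PR:

```kotlin
// Cooldown between attempts: fixed when the exponential toggle is off,
// doubling per attempt when on, with an assumed safety cap.
fun cooldownForAttempt(baseMs: Long, attempt: Int, exponential: Boolean, capMs: Long = 15 * 60_000L): Long {
    if (!exponential || attempt <= 1) return baseMs
    val factor = 1L shl (attempt - 1).coerceAtMost(20) // 2^(attempt-1), overflow-safe
    return (baseMs * factor).coerceAtMost(capMs)
}
```

With a 30 s base, attempts 1–4 would wait 30 s, 60 s, 120 s, 240 s under this rule.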
Max attempts — DO_NOTHING
- ConnectionPermanentlyLost notification (indicates tunnel still running), progress freezes on awaiting ping recovery
- No verifying… flash before settling on awaiting ping recovery (no false-positive race)
Max attempts — STOP_TUNNEL
- ConnectionPermanentlyLost notification (indicates tunnel stopped), tunnel is actually stopped, progress cleared
Max attempts — Fallback tunnel
- SwitchedToFallback notification, current tunnel stops, fallback tunnel starts
- ConnectionPermanentlyLost is NOT emitted when a fallback is available
- awaiting recovery progress clears immediately when fallback switch begins (not after)
- fallbackTunnelId defaults to null (no fallback) for all existing tunnels
Auto-tunnel interaction
Settings changes mid-cycle
Manual intervention
DB migration
- monitoring_settings created with all defaults (auto-restart off, cooldown 30s, max 5 attempts, DO_NOTHING)
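For reference, the migrated defaults could be modelled as a plain data class. Field names are guesses at the Room entity; only the default values come from the note above (the ping-failure threshold and backoff defaults are assumptions):

```kotlin
enum class OnMaxAttemptsAction { DO_NOTHING, STOP_TUNNEL }

// Hypothetical shape of the defaults written by the auto-migration.
data class MonitoringSettingsDefaults(
    val autoRestartEnabled: Boolean = false,
    val pingFailuresBeforeRestart: Int = 3,  // assumption: not stated in the migration note
    val cooldownSeconds: Int = 30,
    val maxRestartAttempts: Int = 5,
    val exponentialBackoff: Boolean = false, // assumption
    val onMaxAttempts: OnMaxAttemptsAction = OnMaxAttemptsAction.DO_NOTHING,
)
```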