Feat/auto restart stale handshake 1036 #1175
Closed
naonak wants to merge 28 commits into wgtunnel:master from
Conversation
Add three new columns to MonitoringSettings (v29→30→31):
- isRestartOnHandshakeTimeoutEnabled: master toggle
- maxHandshakeRestartAttempts: rate-limit cap (default 5/hour)
- restartCooldownSeconds: delay between restart attempts (default 30s)

Room auto-migrations handle the schema upgrade transparently.

Closes wgtunnel#1036

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HandshakeRestartHandler that monitors active tunnels and automatically restarts them when a trigger condition is detected:
- Stale WireGuard handshake (no handshake renewed in ~3.5 min)
- All pinged peers unreachable (when ping monitoring is enabled)

Key behaviors:
- Rate limiting: max N attempts per hour (configurable)
- Configurable cooldown between restarts
- Reacts to network changes (reconnect triggers an early check after a 10s grace period)
- cancelAndClear(): safe cancellation on manual tunnel stop
- Cooldown cancels early if the tunnel recovers (withTimeoutOrNull race)
- triggerReason() checks ping failure first to avoid a misleading "Stale handshake" label when ping is the actual trigger
- failingPingTargets uses PingState.pingTarget (the actual IP), not the map key (which is the peer's Base64 public key)

Exposes TunnelRestartProgress via StateFlow for UI consumption. TunnelManager wires the handler; the internal restartTunnel bypasses cancelAndClear to avoid self-cancellation during auto-restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
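The stale-handshake trigger above can be sketched as a pure check. This is a minimal illustration, not the handler's actual code: the constant name, function name, and exact threshold are assumptions based on the "~3.5 min" figure stated in the commit message (WireGuard itself rekeys roughly every two minutes).

```kotlin
// Illustrative threshold: ~3.5 minutes without a renewed handshake counts as stale.
// STALE_HANDSHAKE_MS and isHandshakeStale are hypothetical names, not wgtunnel's.
const val STALE_HANDSHAKE_MS = 210_000L

fun isHandshakeStale(lastHandshakeEpochMillis: Long?, nowMillis: Long): Boolean =
    // No handshake ever recorded also counts as stale.
    lastHandshakeEpochMillis == null ||
        nowMillis - lastHandshakeEpochMillis > STALE_HANDSHAKE_MS
```

A peer that has never completed a handshake is treated the same as one whose handshake has aged out, so a tunnel that silently fails at startup is also caught.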
AutoRestartScreen (new):
- Toggle to enable/disable auto-restart on stale handshake
- Dropdown for max restart attempts (rate limit per hour)
- Dropdown for cooldown between restarts (15 / 30 / 60 / 120 / 300s)
SettingsScreen: quick toggle + navigation entry for AutoRestartScreen
TunnelMonitoringScreen: toggle for "Use ping for detection"
(isPingMonitoringEnabled) — enables ping-based restart trigger
TunnelList: live status display during auto-restart cycle:
- Switch stays ON while restarting (tunnel remains visually active)
- Health indicator color frozen at trigger color (yellow/red) for
the entire restart+cooldown cycle instead of going gray
- Descriptive text below tunnel name:
"Stale handshake · Restarting… 1/5"
"Ping unreachable: 1.1.1.1 · Restart 2/5 · next in 28s"
(countdown hidden at last attempt since no next retry follows)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the per-restart notification with a stateful persistent notification system that tracks the tunnel recovery lifecycle:
- ConnectionDegrading: ongoing notification updated on each restart attempt, showing the reason (stale handshake / ping unreachable) and progress (N/M)
- ConnectionRestored: ongoing notification dismissed + brief "Connection restored" notification
- ConnectionPermanentlyLost: permanent failure notification after max attempts
- ConnectionCancelled: silent dismiss on manual tunnel stop

Adds an opt-in toggle (recovery_notifications) in the auto-restart config screen.

Consolidates DB migrations 29→30 (single step, no release in between).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When both conditions are true (stale handshake AND ping unreachable), report STALE_HANDSHAKE as the cause — not PING_FAILURE. Ping fails as a consequence of a broken WireGuard handshake, not the other way around. PING_FAILURE is only meaningful when the handshake is fresh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
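The prioritization rule above can be captured in a few lines. This is a sketch of the logic described in the commit message; the enum and function names (`RestartCause`, `triggerReason`) are illustrative, not the project's actual identifiers.

```kotlin
enum class RestartCause { STALE_HANDSHAKE, PING_FAILURE }

// Handshake staleness wins when both conditions hold: ping failure is usually
// a consequence of a broken handshake, so PING_FAILURE is only reported when
// the handshake is still fresh.
fun triggerReason(isHandshakeStale: Boolean, allPingsFailing: Boolean): RestartCause? =
    when {
        isHandshakeStale -> RestartCause.STALE_HANDSHAKE
        allPingsFailing -> RestartCause.PING_FAILURE
        else -> null
    }
```

Returning `null` when neither condition holds lets the caller treat "no cause" as "do not restart".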
When all auto-restart attempts fail (ConnectionPermanentlyLost), the user can now choose what happens next: keep waiting (DO_NOTHING, default) or stop the tunnel (STOP_TUNNEL). The failure notification also differentiates both cases — "VPN stopped" is appended when the tunnel was shut down. DB: version 30 → 31 (AutoMigration, max_attempts_action column). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move "use ping for detection" from TunnelMonitoringScreen to AutoRestartScreen where it belongs logically. The toggle is disabled (greyed out) when ping monitoring is not enabled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re restart

Adds pingFailuresBeforeRestart setting (1-5, default 1) to require N consecutive failing ping intervals before triggering an auto-restart. Avoids aggressive restarts on transient network glitches. Stale-handshake restarts remain unaffected.

Also adds an enabled param to LabelledDropdown, propagated from SurfaceRow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
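The consecutive-failure gate can be sketched as a tiny streak counter: any successful ping interval resets the streak, and the restart only fires once N failing intervals have occurred in a row. The class and method names here are illustrative, not wgtunnel's.

```kotlin
// Hypothetical streak counter for the pingFailuresBeforeRestart gate.
class PingStreak(private val threshold: Int) {
    private var failures = 0

    // Record the outcome of one ping interval; returns true when the
    // consecutive-failure count reaches the configured threshold.
    fun record(pingFailed: Boolean): Boolean {
        failures = if (pingFailed) failures + 1 else 0
        return failures >= threshold
    }
}
```

With the default threshold of 1 this degenerates to the old behavior (restart on the first failing interval), which is why the setting is backward compatible.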
Adds isBackoffEnabled toggle (DB v33). When on, the cooldown after each failed restart doubles (base × 2^(attempt-1)), capped at 5 min. Base cooldown remains configurable via restartCooldownSeconds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
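The backoff formula above (base × 2^(attempt-1), capped at 5 min as of this commit) is simple enough to show directly. The function name and parameters are illustrative; a later commit in this PR removes the fixed cap in favor of a give-up timeout.

```kotlin
import kotlin.math.min

// Exponential backoff sketch: doubles the cooldown after each failed restart,
// capped at capSeconds (300s = the 5-minute cap described above).
fun cooldownSeconds(baseSeconds: Long, attempt: Int, capSeconds: Long = 300L): Long =
    min(baseSeconds * (1L shl (attempt - 1)), capSeconds)
```

For a 30s base cooldown this yields 30s, 60s, 120s, 240s, then 300s (capped) for attempts 1 through 5.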
…streak counter

tunStateFlow emits on every WireGuard stats poll (bytes, handshake time…), so the previous drop(1).first() returned almost instantly, causing the consecutive-failures streak to hit the threshold in milliseconds rather than across N real ping intervals. Now we wait until lastPingAttemptMillis changes, which only happens when the ping service completes a new ping cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The tunStateFlow.first{} predicate only waited for a new ping timestamp,
which would hang forever if ping was disabled mid-wait (pingStates cleared
→ newPingTime always null). Now also breaks out if the tunnel is no longer
in a triggering state, so recovery during streak wait is handled cleanly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
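The escape condition described above can be sketched as a pure predicate for the `first { }` collector: stop waiting either when the ping service has completed a new cycle (the timestamp advanced past the baseline) or when the tunnel has left the triggering state (recovery, or ping disabled mid-wait). All names here are illustrative, not the handler's actual code.

```kotlin
// Hypothetical predicate for the streak wait: true = stop waiting.
fun shouldStopWaiting(
    baselinePingMillis: Long?,
    newPingMillis: Long?,
    stillTriggering: Boolean,
): Boolean =
    // Break out on recovery (avoids hanging forever when ping is disabled
    // mid-wait and newPingMillis stays null)...
    !stillTriggering ||
        // ...or when a genuinely new ping cycle has completed.
        (newPingMillis != null && newPingMillis != baselinePingMillis)
```

The first clause is the fix from this commit: without it, a cleared pingStates map keeps `newPingMillis` null forever and the collector never completes.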
'Consecutive ping failures before first restart' better conveys that this setting only applies to the initial restart triggered by ping failures, not to subsequent restart attempts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When exponential backoff is enabled, "Max restarts per hour" no longer works reliably (timestamps age out of the 1h window, resetting the counter). Replace the arbitrary 5-min cap and the attempt-count limit with a time-based "give up after X minutes" setting (default 1h).
- New DB field backoffTimeoutMinutes (default 60, DB v33→34)
- Backoff OFF: keep count-based give-up (max N restarts per hour)
- Backoff ON: give up when elapsed time since first restart > timeout
- Removed MAX_BACKOFF_SECONDS cap — backoff now grows freely (natural limit via the timeout)
- UI: "Give up after" dropdown (15min/30min/1h/2h) grayed out when backoff is OFF; "Max restarts per hour" grayed out when backoff is ON

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

Switch from time-based give-up (fixed minutes) to attempt-count-based give-up with a dynamic time display in the UI. The dropdown shows the estimated cumulative time (e.g. "3 attempts (~3m30)") computed from the current restartCooldownSeconds setting, making the relationship between cooldown and total retry duration explicit.

DB v34→v35: removes backoff_timeout_minutes, adds backoff_max_attempts (default=3).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the UI countdown hits 0 but the handler hasn't yet set isRestarting=true for the next attempt (or cleared _restartProgress after the final one), the `else` branch incorrectly showed the "max reached" string for a brief moment. Fix: only show "max reached" when attemptNumber >= maxAttempts; show nothing (null) during the transient gap between countdown expiry and handler state update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
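The fixed decision logic can be sketched as a single expression. This is a simplified model of the status text described in this PR (it omits, for instance, hiding the countdown on the last attempt); the function name, parameters, and exact strings are illustrative.

```kotlin
// Hypothetical status-text selector for the tunnel row.
fun statusText(
    isRestarting: Boolean,
    countdownSeconds: Int,
    attemptNumber: Int,
    maxAttempts: Int,
): String? = when {
    isRestarting -> "Restarting… $attemptNumber/$maxAttempts"
    countdownSeconds > 0 -> "Restart $attemptNumber/$maxAttempts · next in ${countdownSeconds}s"
    // The fix: "max reached" only when the counter has actually hit the limit...
    attemptNumber >= maxAttempts -> "Max restarts reached"
    // ...and nothing during the transient gap between countdown expiry and
    // the handler's next state update.
    else -> null
}
```

Returning `null` in the gap means the row briefly shows no status line instead of a misleading "max reached".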
Useful for testing and for exponential backoff with short initial intervals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

v34 was never shipped, so the intermediate schema was missing, breaking the auto-migration KSP step. Merging both steps into one additive migration (add backoff_max_attempts with default=3, no column removal needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ff toggle

Instead of graying out, show only the relevant setting:
- Backoff ON → show "Give up after" (attempt count with time estimate)
- Backoff OFF → show "Max restarts per hour"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was showing sum of all cooldowns including after the last attempt. Should show time from first to last restart (sum of inter-restart cooldowns): base * (2^(n-1) - 1) instead of base * (2^n - 1) Example: 2 attempts, 3s cooldown → 3s (not 9s). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
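The corrected estimate can be written out directly: with n attempts there are only n-1 inter-restart cooldowns, so the geometric sum is base × (2^(n-1) - 1). The function name is illustrative.

```kotlin
// Time from first to last restart under doubling backoff:
// base + 2*base + ... + 2^(n-2)*base = base * (2^(n-1) - 1).
// The old (buggy) formula, base * (2^n - 1), also counted a cooldown
// after the final attempt, which never elapses.
fun estimatedRetryWindowSeconds(baseSeconds: Long, attempts: Int): Long =
    baseSeconds * ((1L shl (attempts - 1)) - 1)
```

Matching the commit's example: 2 attempts with a 3s base cooldown give 3s (one cooldown between the two restarts), not 9s.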
After max attempts reached + tunnel recovery, _restartProgress was never cleared in the DO_NOTHING path, leaving "max reached" permanently visible in TunnelList even after the tunnel recovered. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Race: coroutine could increment _restartCounts after cancelAndClear cleared it (non-suspend update). Also fixes count persisting when auto-restart feature is toggled off/on (init block cancels jobs without cancelAndClear). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unnel toggle

When a tunnel is toggled off then on, WireGuard may retain stale statistics (old handshake timestamps) from before the stop until the new handshake completes. The monitoring loop was reading the state immediately on fresh start, causing shouldTrigger() to return true and triggering an immediate restart.

Add a 30s startup grace period (STARTUP_GRACE_MS) that waits for the tunnel to reach a healthy state before entering the monitoring loop. Only applies on a fresh start (empty restartTimestamps); jobs recreated after an auto-restart use the existing cooldown mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the hardcoded STARTUP_GRACE_MS (30s) with a user-configurable startupGraceSeconds setting (DB v34→35). Added a dropdown in the Auto-restart screen with options: disabled (0s), 10s, 15s, 30s (default), 60s. Setting 0 disables the grace period entirely for users who need immediate monitoring on tunnel startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing is off Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
currentPingTime was captured from `state` (read at the top of the while loop),
but tunStateFlow.first { } starts collecting later. Another coroutine (the ping
monitor) could update the StateFlow between these two reads, causing first { } to
receive the already-advanced ping time as its first replay value and exit
immediately — effectively allowing a single ping cycle to satisfy a streak of N.
Fix: re-read currentPingTime from tunStateFlow.value right before first { }.
Since there is no suspend point between tunStateFlow.value and the first { } call,
the StateFlow cannot be updated between the two reads, so the replayed current
value always matches our baseline and we correctly wait for the next cycle.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add post-restart grace period after each restart to prevent rapid restart loops when cooldown < WireGuard re-handshake time
- Add pre-restart verification ping series (using NetworkUtils.pingWithStats) before restarting, to skip unnecessary restarts when the tunnel is recoverable
- Add 5s option to the startup grace period dropdown
- Remove 3s option from the restart cooldown dropdown (too short)
- Rename "Use ping to trigger restart" to "Use ping monitoring"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor (Author): updated here: #1176
Closes #1036
What this PR adds
Automatically restarts a WireGuard tunnel when a connectivity issue is detected, without requiring manual intervention. Two trigger conditions are supported:
- Stale WireGuard handshake (no handshake renewed in ~3.5 min)
- All pinged peers unreachable (when ping monitoring is enabled)
How it works
A new HandshakeRestartHandler runs alongside each active tunnel. It periodically checks the handshake age and/or ping state and, when a degraded condition is detected, triggers an automatic restart subject to the configured rate limit and cooldown.
Configuration (Settings → Monitoring → Auto-restart)
Recovery notifications
Instead of one notification per restart, a single persistent notification evolves with the tunnel state:
- Stale handshake · restarting… (1/5)
- Connection restored
- Stale handshake · max restarts reached

Can be disabled in the auto-restart config screen.
Tunnel list display
When auto-restart is active, the tunnel row shows:
- Stale handshake · restarting (2/5)
- Max restarts reached when permanently lost
- ↺ 3 next to the uptime
Database
Single migration 29 → 30 adding 4 columns to monitoring_settings:
- is_restart_on_handshake_timeout_enabled
- max_handshake_restart_attempts
- restart_cooldown_seconds
- is_recovery_notification_enabled