
Feat/auto restart stale handshake 1036 #1175

Closed
naonak wants to merge 28 commits into wgtunnel:master from naonak:feat/auto-restart-stale-handshake-1036

Conversation

Contributor

@naonak naonak commented Feb 24, 2026

Closes #1036

What this PR adds

Automatically restarts a WireGuard tunnel when a connectivity issue is detected, without requiring manual intervention. Two trigger conditions are supported:

  • Stale handshake – when the time since the last successful handshake exceeds the WireGuard timeout threshold
  • Ping failure – when the configured ping target becomes unreachable (requires ping monitoring to be enabled)

How it works

A new HandshakeRestartHandler runs alongside each active tunnel. It periodically checks the handshake age and/or ping state, and when a degraded condition is detected:

  1. The tunnel is restarted (up to a configurable maximum number of attempts)
  2. A cooldown delay is observed between each attempt
  3. If the connection recovers, monitoring resumes normally
  4. If the maximum number of attempts is reached, the handler waits for the tunnel to recover on its own (e.g. after a network change)
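The cycle above can be sketched as a small state transition. This is an illustrative model only (the names `Health`, `RestartState`, and `step` are hypothetical, not the PR's actual API); the real HandshakeRestartHandler lives in the diff:

```kotlin
// Hypothetical sketch of the restart loop's core decision, assuming a
// simple health signal. Not the PR's real HandshakeRestartHandler.
enum class Health { HEALTHY, DEGRADED }

data class RestartState(val attempts: Int, val givenUp: Boolean)

// One monitoring tick: decide whether to keep monitoring, restart,
// give up, or passively wait for the tunnel to recover on its own.
fun step(state: RestartState, health: Health, maxAttempts: Int): Pair<RestartState, String> =
    when {
        health == Health.HEALTHY ->
            RestartState(0, false) to "monitor"          // recovered: reset counters
        state.givenUp ->
            state to "wait-for-recovery"                 // max reached: wait passively
        state.attempts >= maxAttempts ->
            state.copy(givenUp = true) to "give-up"
        else ->
            state.copy(attempts = state.attempts + 1) to "restart-then-cooldown"
    }
```

A recovery at any point resets the attempt counter, matching step 3 above.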

Configuration (Settings → Monitoring → Auto-restart)

Setting                  Default  Description
Enable auto-restart      off      Master toggle
Restart cooldown         30 s     Delay between restart attempts
Max attempts             5        Max restarts before giving up
Recovery notifications   on       Android notification during recovery

Recovery notifications

Instead of one notification per restart, a single persistent notification evolves with the tunnel state:

  • Degraded – ongoing notification: Stale handshake · restarting… (1/5)
  • Restored – brief notification: Connection restored
  • Permanently lost – persistent notification: Stale handshake · max restarts reached
  • Manual stop – notification silently dismissed

Can be disabled in the auto-restart config screen.

Tunnel list display

When auto-restart is active, the tunnel row shows:

  • Current restart reason and attempt count while restarting: Stale handshake · restarting (2/5)
  • Max restarts reached when permanently lost
  • Total restart count since tunnel was activated: ↺ 3 next to the uptime

Database

Single migration 29 → 30 adding 4 columns to monitoring_settings:

  • is_restart_on_handshake_timeout_enabled
  • max_handshake_restart_attempts
  • restart_cooldown_seconds
  • is_recovery_notification_enabled
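The column names come from the PR; the SQL below is a sketch of what the 29 → 30 step would look like if written by hand (Room's auto-migration generates the real statements, and the defaults shown here are assumptions based on the settings table above):

```kotlin
// Illustrative ALTER TABLE statements for the 29→30 migration.
// Column names are from the PR; defaults are assumed, not verified.
val migration29to30 = listOf(
    "ALTER TABLE monitoring_settings ADD COLUMN is_restart_on_handshake_timeout_enabled INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE monitoring_settings ADD COLUMN max_handshake_restart_attempts INTEGER NOT NULL DEFAULT 5",
    "ALTER TABLE monitoring_settings ADD COLUMN restart_cooldown_seconds INTEGER NOT NULL DEFAULT 30",
    "ALTER TABLE monitoring_settings ADD COLUMN is_recovery_notification_enabled INTEGER NOT NULL DEFAULT 1",
)
```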

naonak and others added 5 commits February 24, 2026 11:18
Add three new columns to MonitoringSettings (v29→30→31):
- isRestartOnHandshakeTimeoutEnabled: master toggle
- maxHandshakeRestartAttempts: rate-limit cap (default 5/hour)
- restartCooldownSeconds: delay between restart attempts (default 30s)

Room auto-migrations handle the schema upgrade transparently.

Closes wgtunnel#1036

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HandshakeRestartHandler that monitors active tunnels and
automatically restarts them when a trigger condition is detected:
- Stale WireGuard handshake (no handshake renewed in ~3.5 min)
- All pinged peers unreachable (when ping monitoring is enabled)

Key behaviors:
- Rate limiting: max N attempts per hour (configurable)
- Configurable cooldown between restarts
- Reacts to network changes (reconnect triggers early check after 10s grace)
- cancelAndClear(): safe cancellation on manual tunnel stop
- Cooldown cancels early if tunnel recovers (withTimeoutOrNull race)
- triggerReason() checks ping failure first to avoid misleading
  "Stale handshake" label when ping is the actual trigger
- failingPingTargets uses PingState.pingTarget (the actual IP),
  not the map key (which is the peer Base64 public key)

Exposes TunnelRestartProgress via StateFlow for UI consumption.
TunnelManager wires the handler; internal restartTunnel bypasses
cancelAndClear to avoid self-cancellation during auto-restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoRestartScreen (new):
- Toggle to enable/disable auto-restart on stale handshake
- Dropdown for max restart attempts (rate limit per hour)
- Dropdown for cooldown between restarts (15 / 30 / 60 / 120 / 300s)

SettingsScreen: quick toggle + navigation entry for AutoRestartScreen

TunnelMonitoringScreen: toggle for "Use ping for detection"
(isPingMonitoringEnabled) — enables ping-based restart trigger

TunnelList: live status display during auto-restart cycle:
- Switch stays ON while restarting (tunnel remains visually active)
- Health indicator color frozen at trigger color (yellow/red) for
  the entire restart+cooldown cycle instead of going gray
- Descriptive text below tunnel name:
    "Stale handshake · Restarting… 1/5"
    "Ping unreachable: 1.1.1.1 · Restart 2/5 · next in 28s"
  (countdown hidden at last attempt since no next retry follows)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-restart notification with a stateful persistent notification
system that tracks tunnel recovery lifecycle:

- ConnectionDegrading: ongoing notification updates on each restart attempt
  showing reason (stale handshake / ping unreachable) and progress (N/M)
- ConnectionRestored: ongoing dismissed + brief "Connection restored" notification
- ConnectionPermanentlyLost: permanent failure notification after max attempts
- ConnectionCancelled: silent dismiss on manual tunnel stop

Adds opt-in toggle (recovery_notifications) in the auto-restart config screen.
Consolidates DB migrations 29→30 (single step, no release in between).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When both conditions are true (stale handshake AND ping unreachable),
report STALE_HANDSHAKE as the cause — not PING_FAILURE. Ping fails
as a consequence of a broken WireGuard handshake, not the other way
around. PING_FAILURE is only meaningful when the handshake is fresh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
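The precedence rule from this commit reduces to a short pure function. This is a minimal sketch (names are illustrative, not the PR's actual signatures):

```kotlin
enum class TriggerReason { STALE_HANDSHAKE, PING_FAILURE, NONE }

// Precedence per the commit above: when both conditions hold, a stale
// handshake is the cause and ping failure is only a symptom, so
// PING_FAILURE is reported only while the handshake is still fresh.
fun triggerReason(handshakeStale: Boolean, allPingsFailing: Boolean): TriggerReason =
    when {
        handshakeStale -> TriggerReason.STALE_HANDSHAKE
        allPingsFailing -> TriggerReason.PING_FAILURE
        else -> TriggerReason.NONE
    }
```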
naonak and others added 23 commits February 25, 2026 10:14
When all auto-restart attempts fail (ConnectionPermanentlyLost), the user
can now choose what happens next: keep waiting (DO_NOTHING, default) or
stop the tunnel (STOP_TUNNEL). The failure notification also differentiates
both cases — "VPN stopped" is appended when the tunnel was shut down.

DB: version 30 → 31 (AutoMigration, max_attempts_action column).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move "use ping for detection" from TunnelMonitoringScreen to
AutoRestartScreen where it belongs logically. The toggle is disabled
(greyed out) when ping monitoring is not enabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re restart

Adds pingFailuresBeforeRestart setting (1-5, default 1) to require N consecutive
failing ping intervals before triggering an auto-restart. Avoids aggressive restarts
on transient network glitches. Stale-handshake restarts remain unaffected.

Also adds enabled param to LabelledDropdown, propagated from SurfaceRow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds isBackoffEnabled toggle (DB v33). When on, the cooldown after each
failed restart doubles (base × 2^(attempt-1)), capped at 5 min.
Base cooldown remains configurable via restartCooldownSeconds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
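The backoff described above (base × 2^(attempt−1), capped at 5 min) can be sketched as follows; note a later commit removes the cap, so this reflects the formula as of this commit only:

```kotlin
import kotlin.math.min

// Exponential backoff: cooldown doubles after each failed attempt,
// base * 2^(attempt - 1), capped at 300 s (5 min) as of this commit.
fun backoffSeconds(baseSeconds: Long, attempt: Int, capSeconds: Long = 300): Long =
    min(baseSeconds shl (attempt - 1), capSeconds)
```

For a 30 s base: attempt 1 waits 30 s, attempt 2 waits 60 s, and attempt 5 would be 480 s but is clamped to 300 s.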
…streak counter

tunStateFlow emits on every WireGuard stats poll (bytes, handshake time…),
so the previous drop(1).first() returned almost instantly, causing the
consecutive-failures streak to hit the threshold in milliseconds rather
than across N real ping intervals.

Now we wait until lastPingAttemptMillis changes, which only happens when
the ping service completes a new ping cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The tunStateFlow.first{} predicate only waited for a new ping timestamp,
which would hang forever if ping was disabled mid-wait (pingStates cleared
→ newPingTime always null). Now also breaks out if the tunnel is no longer
in a triggering state, so recovery during streak wait is handled cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'Consecutive ping failures before first restart' better conveys that
this setting only applies to the initial restart triggered by ping failures,
not to subsequent restart attempts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When exponential backoff is enabled, "Max restarts per hour" no longer
works reliably (timestamps age out of the 1h window, resetting the
counter). Replace the arbitrary 5-min cap and the attempt-count limit
with a time-based "give up after X minutes" setting (default 1h).

- New DB field backoffTimeoutMinutes (default 60, DB v33→34)
- Backoff OFF: keep count-based give-up (max N restarts per hour)
- Backoff ON: give up when elapsed time since first restart > timeout
- Removed MAX_BACKOFF_SECONDS cap — backoff now grows freely (natural
  limit via the timeout)
- UI: "Give up after" dropdown (15min/30min/1h/2h) grayed when backoff
  OFF; "Max restarts per hour" grayed when backoff ON

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

Switch from time-based give-up (fixed minutes) to attempt-count-based
give-up with dynamic time display in UI. The dropdown shows estimated
cumulative time (e.g. "3 attempts (~3m30)") computed from the current
restartCooldownSeconds setting, making the relationship between cooldown
and total retry duration explicit.

DB v34→v35: removes backoff_timeout_minutes, adds backoff_max_attempts (default=3).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the UI countdown hits 0 but the handler hasn't yet set isRestarting=true
for the next attempt (or cleared _restartProgress after the final one), the
`else` branch incorrectly showed the "max reached" string for a brief moment.

Fix: only show "max reached" when attemptNumber >= maxAttempts; show nothing
(null) during the transient gap between countdown expiry and handler state update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
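The fixed label logic amounts to a three-way decision. A minimal sketch, with hypothetical names (`statusLabel` and its parameters are illustrative, not the PR's real composable state):

```kotlin
// Only claim "max reached" when the attempt count proves it; return
// null during the transient gap between countdown expiry and the
// handler updating its state, so no stale string flashes in the UI.
fun statusLabel(isRestarting: Boolean, attemptNumber: Int, maxAttempts: Int): String? =
    when {
        isRestarting -> "Restarting… $attemptNumber/$maxAttempts"
        attemptNumber >= maxAttempts -> "Max restarts reached"
        else -> null
    }
```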
Useful for testing and for exponential backoff with short initial intervals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

v34 was never shipped so the intermediate schema was missing, breaking
the auto-migration KSP step. Merging both steps into one additive
migration (add backoff_max_attempts with default=3, no column removal needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ff toggle

Instead of graying out, show only the relevant setting:
- Backoff ON  → show "Give up after" (attempt count with time estimate)
- Backoff OFF → show "Max restarts per hour"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was showing sum of all cooldowns including after the last attempt.
Should show time from first to last restart (sum of inter-restart cooldowns):
  base * (2^(n-1) - 1)  instead of  base * (2^n - 1)

Example: 2 attempts, 3s cooldown → 3s (not 9s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
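The corrected estimate sums only the n−1 cooldowns between restarts, giving base × (2^(n−1) − 1) under doubling backoff. A sketch with an illustrative function name:

```kotlin
// Time from first to last restart under doubling backoff: the sum of
// the (n - 1) inter-restart cooldowns, i.e. base * (2^(n-1) - 1).
// The cooldown after the final attempt is deliberately excluded.
fun estimatedRetrySpanSeconds(baseSeconds: Long, attempts: Int): Long =
    baseSeconds * ((1L shl (attempts - 1)) - 1)
```

This reproduces the commit's example: 2 attempts at a 3 s cooldown span 3 s, not 9 s.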
After max attempts reached + tunnel recovery, _restartProgress was never
cleared in the DO_NOTHING path, leaving "max reached" permanently visible
in TunnelList even after the tunnel recovered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Race: coroutine could increment _restartCounts after cancelAndClear cleared
it (non-suspend update). Also fixes count persisting when auto-restart
feature is toggled off/on (init block cancels jobs without cancelAndClear).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unnel toggle

When a tunnel is toggled off then on, WireGuard may retain stale statistics
(old handshake timestamps) from before the stop until the new handshake completes.
The monitoring loop was reading the state immediately on fresh start, causing
shouldTrigger() to return true and triggering an immediate restart.

Add a 30s startup grace period (STARTUP_GRACE_MS) that waits for the tunnel to
reach a healthy state before entering the monitoring loop. Only applies on fresh
start (empty restartTimestamps); jobs recreated after auto-restart use the
existing cooldown mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the hardcoded STARTUP_GRACE_MS (30s) with a user-configurable
startupGraceSeconds setting (DB v34→35). Added a dropdown in the Auto-restart
screen with options: disabled (0s), 10s, 15s, 30s (default), 60s.

Setting 0 disables the grace period entirely for users who need immediate
monitoring on tunnel startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing is off

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
currentPingTime was captured from `state` (read at the top of the while loop),
but tunStateFlow.first { } starts collecting later. Another coroutine (the ping
monitor) could update the StateFlow between these two reads, causing first { } to
receive the already-advanced ping time as its first replay value and exit
immediately — effectively allowing a single ping cycle to satisfy a streak of N.

Fix: re-read currentPingTime from tunStateFlow.value right before first { }.
Since there is no suspend point between tunStateFlow.value and the first { } call,
the StateFlow cannot be updated between the two reads, so the replayed current
value always matches our baseline and we correctly wait for the next cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add post-restart grace period after each restart to prevent rapid restart
  loops when cooldown < WireGuard re-handshake time
- Add pre-restart verification ping series (using NetworkUtils.pingWithStats)
  before restarting to skip unnecessary restarts when tunnel is recoverable
- Add 5s option to startup grace period dropdown
- Remove 3s option from restart cooldown dropdown (too short)
- Rename 'Use ping to trigger restart' to 'Use ping monitoring'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

naonak commented Feb 26, 2026

updated here : #1176

@naonak naonak closed this Feb 26, 2026

Development

Successfully merging this pull request may close these issues.

[FEATURE] - Restart tunnel after handshake exceeds certain time

1 participant