
Feat/auto restart stale handshake 1036 #1175

Closed
naonak wants to merge 28 commits into wgtunnel:master from naonak:feat/auto-restart-stale-handshake-1036

Conversation

Contributor

@naonak naonak commented Feb 24, 2026

Closes #1036

What this PR adds

Automatically restarts a WireGuard tunnel when a connectivity issue is detected, without requiring manual intervention. Two trigger conditions are supported:

  • Stale handshake – when the time since the last successful handshake exceeds the WireGuard timeout threshold
  • Ping failure – when the configured ping target becomes unreachable (requires ping monitoring to be enabled)

How it works

A new HandshakeRestartHandler runs alongside each active tunnel. It periodically checks the handshake age and/or ping state, and when a degraded condition is detected:

  1. The tunnel is restarted (up to a configurable maximum number of attempts)
  2. A cooldown delay is observed between each attempt
  3. If the connection recovers, monitoring resumes normally
  4. If the maximum number of attempts is reached, the handler waits for the tunnel to recover on its own (e.g. after a network change)
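The cycle above can be sketched as a small state transition. This is an illustrative model only (the names `Health`, `RestartState`, and `step` are hypothetical, not the PR's actual API); the real HandshakeRestartHandler lives in the diff:

```kotlin
// Hypothetical sketch of the restart loop's core decision, assuming a
// simple health signal. Not the PR's real HandshakeRestartHandler.
enum class Health { HEALTHY, DEGRADED }

data class RestartState(val attempts: Int, val givenUp: Boolean)

// One monitoring tick: decide whether to keep monitoring, restart,
// give up, or passively wait for the tunnel to recover on its own.
fun step(state: RestartState, health: Health, maxAttempts: Int): Pair<RestartState, String> =
    when {
        health == Health.HEALTHY ->
            RestartState(0, false) to "monitor"          // recovered: reset counters
        state.givenUp ->
            state to "wait-for-recovery"                 // max reached: wait passively
        state.attempts >= maxAttempts ->
            state.copy(givenUp = true) to "give-up"
        else ->
            state.copy(attempts = state.attempts + 1) to "restart-then-cooldown"
    }
```

A recovery at any point resets the attempt counter, matching step 3 above.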

Configuration (Settings → Monitoring → Auto-restart)

Setting                  Default  Description
Enable auto-restart      off      Master toggle
Restart cooldown         30 s     Delay between restart attempts
Max attempts             5        Max restarts before giving up
Recovery notifications   on       Android notification during recovery

Recovery notifications

Instead of one notification per restart, a single persistent notification evolves with the tunnel state:

  • Degraded – ongoing notification: Stale handshake · restarting… (1/5)
  • Restored – brief notification: Connection restored
  • Permanently lost – persistent notification: Stale handshake · max restarts reached
  • Manual stop – notification silently dismissed

Can be disabled in the auto-restart config screen.

Tunnel list display

When auto-restart is active, the tunnel row shows:

  • Current restart reason and attempt count while restarting: Stale handshake · restarting (2/5)
  • Max restarts reached when permanently lost
  • Total restart count since tunnel was activated: ↺ 3 next to the uptime

Database

Single migration 29 → 30 adding 4 columns to monitoring_settings:

  • is_restart_on_handshake_timeout_enabled
  • max_handshake_restart_attempts
  • restart_cooldown_seconds
  • is_recovery_notification_enabled
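The column names come from the PR; the SQL below is a sketch of what the 29 → 30 step would look like if written by hand (Room's auto-migration generates the real statements, and the defaults shown here are assumptions based on the settings table above):

```kotlin
// Illustrative ALTER TABLE statements for the 29→30 migration.
// Column names are from the PR; defaults are assumed, not verified.
val migration29to30 = listOf(
    "ALTER TABLE monitoring_settings ADD COLUMN is_restart_on_handshake_timeout_enabled INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE monitoring_settings ADD COLUMN max_handshake_restart_attempts INTEGER NOT NULL DEFAULT 5",
    "ALTER TABLE monitoring_settings ADD COLUMN restart_cooldown_seconds INTEGER NOT NULL DEFAULT 30",
    "ALTER TABLE monitoring_settings ADD COLUMN is_recovery_notification_enabled INTEGER NOT NULL DEFAULT 1",
)
```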

naonak and others added 5 commits February 24, 2026 11:18
Add three new columns to MonitoringSettings (v29→30→31):
- isRestartOnHandshakeTimeoutEnabled: master toggle
- maxHandshakeRestartAttempts: rate-limit cap (default 5/hour)
- restartCooldownSeconds: delay between restart attempts (default 30s)

Room auto-migrations handle the schema upgrade transparently.

Closes wgtunnel#1036

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HandshakeRestartHandler that monitors active tunnels and
automatically restarts them when a trigger condition is detected:
- Stale WireGuard handshake (no handshake renewed in ~3.5 min)
- All pinged peers unreachable (when ping monitoring is enabled)

Key behaviors:
- Rate limiting: max N attempts per hour (configurable)
- Configurable cooldown between restarts
- Reacts to network changes (reconnect triggers early check after 10s grace)
- cancelAndClear(): safe cancellation on manual tunnel stop
- Cooldown cancels early if tunnel recovers (withTimeoutOrNull race)
- triggerReason() checks ping failure first to avoid misleading
  "Stale handshake" label when ping is the actual trigger
- failingPingTargets uses PingState.pingTarget (the actual IP),
  not the map key (which is the peer Base64 public key)

Exposes TunnelRestartProgress via StateFlow for UI consumption.
TunnelManager wires the handler; internal restartTunnel bypasses
cancelAndClear to avoid self-cancellation during auto-restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoRestartScreen (new):
- Toggle to enable/disable auto-restart on stale handshake
- Dropdown for max restart attempts (rate limit per hour)
- Dropdown for cooldown between restarts (15 / 30 / 60 / 120 / 300s)

SettingsScreen: quick toggle + navigation entry for AutoRestartScreen

TunnelMonitoringScreen: toggle for "Use ping for detection"
(isPingMonitoringEnabled) — enables ping-based restart trigger

TunnelList: live status display during auto-restart cycle:
- Switch stays ON while restarting (tunnel remains visually active)
- Health indicator color frozen at trigger color (yellow/red) for
  the entire restart+cooldown cycle instead of going gray
- Descriptive text below tunnel name:
    "Stale handshake · Restarting… 1/5"
    "Ping unreachable: 1.1.1.1 · Restart 2/5 · next in 28s"
  (countdown hidden at last attempt since no next retry follows)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-restart notification with a stateful persistent notification
system that tracks tunnel recovery lifecycle:

- ConnectionDegrading: ongoing notification updates on each restart attempt
  showing reason (stale handshake / ping unreachable) and progress (N/M)
- ConnectionRestored: ongoing dismissed + brief "Connection restored" notification
- ConnectionPermanentlyLost: permanent failure notification after max attempts
- ConnectionCancelled: silent dismiss on manual tunnel stop

Adds opt-in toggle (recovery_notifications) in the auto-restart config screen.
Consolidates DB migrations 29→30 (single step, no release in between).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When both conditions are true (stale handshake AND ping unreachable),
report STALE_HANDSHAKE as the cause — not PING_FAILURE. Ping fails
as a consequence of a broken WireGuard handshake, not the other way
around. PING_FAILURE is only meaningful when the handshake is fresh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
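The precedence rule from this commit reduces to a short pure function. This is a minimal sketch (names are illustrative, not the PR's actual signatures):

```kotlin
enum class TriggerReason { STALE_HANDSHAKE, PING_FAILURE, NONE }

// Precedence per the commit above: when both conditions hold, a stale
// handshake is the cause and ping failure is only a symptom, so
// PING_FAILURE is reported only while the handshake is still fresh.
fun triggerReason(handshakeStale: Boolean, allPingsFailing: Boolean): TriggerReason =
    when {
        handshakeStale -> TriggerReason.STALE_HANDSHAKE
        allPingsFailing -> TriggerReason.PING_FAILURE
        else -> TriggerReason.NONE
    }
```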
naonak and others added 23 commits February 25, 2026 10:14
When all auto-restart attempts fail (ConnectionPermanentlyLost), the user
can now choose what happens next: keep waiting (DO_NOTHING, default) or
stop the tunnel (STOP_TUNNEL). The failure notification also differentiates
both cases — "VPN stopped" is appended when the tunnel was shut down.

DB: version 30 → 31 (AutoMigration, max_attempts_action column).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move "use ping for detection" from TunnelMonitoringScreen to
AutoRestartScreen where it belongs logically. The toggle is disabled
(greyed out) when ping monitoring is not enabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re restart

Adds pingFailuresBeforeRestart setting (1-5, default 1) to require N consecutive
failing ping intervals before triggering an auto-restart. Avoids aggressive restarts
on transient network glitches. Stale-handshake restarts remain unaffected.

Also adds enabled param to LabelledDropdown, propagated from SurfaceRow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds isBackoffEnabled toggle (DB v33). When on, the cooldown after each
failed restart doubles (base × 2^(attempt-1)), capped at 5 min.
Base cooldown remains configurable via restartCooldownSeconds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
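The backoff described above (base × 2^(attempt−1), capped at 5 min) can be sketched as follows; note a later commit removes the cap, so this reflects the formula as of this commit only:

```kotlin
import kotlin.math.min

// Exponential backoff: cooldown doubles after each failed attempt,
// base * 2^(attempt - 1), capped at 300 s (5 min) as of this commit.
fun backoffSeconds(baseSeconds: Long, attempt: Int, capSeconds: Long = 300): Long =
    min(baseSeconds shl (attempt - 1), capSeconds)
```

For a 30 s base: attempt 1 waits 30 s, attempt 2 waits 60 s, and attempt 5 would be 480 s but is clamped to 300 s.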
…streak counter

tunStateFlow emits on every WireGuard stats poll (bytes, handshake time…),
so the previous drop(1).first() returned almost instantly, causing the
consecutive-failures streak to hit the threshold in milliseconds rather
than across N real ping intervals.

Now we wait until lastPingAttemptMillis changes, which only happens when
the ping service completes a new ping cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The tunStateFlow.first{} predicate only waited for a new ping timestamp,
which would hang forever if ping was disabled mid-wait (pingStates cleared
→ newPingTime always null). Now also breaks out if the tunnel is no longer
in a triggering state, so recovery during streak wait is handled cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'Consecutive ping failures before first restart' better conveys that
this setting only applies to the initial restart triggered by ping failures,
not to subsequent restart attempts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When exponential backoff is enabled, "Max restarts per hour" no longer
works reliably (timestamps age out of the 1h window, resetting the
counter). Replace the arbitrary 5-min cap and the attempt-count limit
with a time-based "give up after X minutes" setting (default 1h).

- New DB field backoffTimeoutMinutes (default 60, DB v33→34)
- Backoff OFF: keep count-based give-up (max N restarts per hour)
- Backoff ON: give up when elapsed time since first restart > timeout
- Removed MAX_BACKOFF_SECONDS cap — backoff now grows freely (natural
  limit via the timeout)
- UI: "Give up after" dropdown (15min/30min/1h/2h) grayed when backoff
  OFF; "Max restarts per hour" grayed when backoff ON

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

Switch from time-based give-up (fixed minutes) to attempt-count-based
give-up with dynamic time display in UI. The dropdown shows estimated
cumulative time (e.g. "3 attempts (~3m30)") computed from the current
restartCooldownSeconds setting, making the relationship between cooldown
and total retry duration explicit.

DB v34→v35: removes backoff_timeout_minutes, adds backoff_max_attempts (default=3).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the UI countdown hits 0 but the handler hasn't yet set isRestarting=true
for the next attempt (or cleared _restartProgress after the final one), the
`else` branch incorrectly showed the "max reached" string for a brief moment.

Fix: only show "max reached" when attemptNumber >= maxAttempts; show nothing
(null) during the transient gap between countdown expiry and handler state update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
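The fixed label logic amounts to a three-way decision. A minimal sketch, with hypothetical names (`statusLabel` and its parameters are illustrative, not the PR's real composable state):

```kotlin
// Only claim "max reached" when the attempt count proves it; return
// null during the transient gap between countdown expiry and the
// handler updating its state, so no stale string flashes in the UI.
fun statusLabel(isRestarting: Boolean, attemptNumber: Int, maxAttempts: Int): String? =
    when {
        isRestarting -> "Restarting… $attemptNumber/$maxAttempts"
        attemptNumber >= maxAttempts -> "Max restarts reached"
        else -> null
    }
```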
Useful for testing and for exponential backoff with short initial intervals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Attempts

v34 was never shipped so the intermediate schema was missing, breaking
the auto-migration KSP step. Merging both steps into one additive
migration (add backoff_max_attempts with default=3, no column removal needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ff toggle

Instead of graying out, show only the relevant setting:
- Backoff ON  → show "Give up after" (attempt count with time estimate)
- Backoff OFF → show "Max restarts per hour"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was showing sum of all cooldowns including after the last attempt.
Should show time from first to last restart (sum of inter-restart cooldowns):
  base * (2^(n-1) - 1)  instead of  base * (2^n - 1)

Example: 2 attempts, 3s cooldown → 3s (not 9s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
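The corrected estimate sums only the n−1 cooldowns between restarts, giving base × (2^(n−1) − 1) under doubling backoff. A sketch with an illustrative function name:

```kotlin
// Time from first to last restart under doubling backoff: the sum of
// the (n - 1) inter-restart cooldowns, i.e. base * (2^(n-1) - 1).
// The cooldown after the final attempt is deliberately excluded.
fun estimatedRetrySpanSeconds(baseSeconds: Long, attempts: Int): Long =
    baseSeconds * ((1L shl (attempts - 1)) - 1)
```

This reproduces the commit's example: 2 attempts at a 3 s cooldown span 3 s, not 9 s.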
After max attempts reached + tunnel recovery, _restartProgress was never
cleared in the DO_NOTHING path, leaving "max reached" permanently visible
in TunnelList even after the tunnel recovered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Race: coroutine could increment _restartCounts after cancelAndClear cleared
it (non-suspend update). Also fixes count persisting when auto-restart
feature is toggled off/on (init block cancels jobs without cancelAndClear).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unnel toggle

When a tunnel is toggled off then on, WireGuard may retain stale statistics
(old handshake timestamps) from before the stop until the new handshake completes.
The monitoring loop was reading the state immediately on fresh start, causing
shouldTrigger() to return true and triggering an immediate restart.

Add a 30s startup grace period (STARTUP_GRACE_MS) that waits for the tunnel to
reach a healthy state before entering the monitoring loop. Only applies on fresh
start (empty restartTimestamps); jobs recreated after auto-restart use the
existing cooldown mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the hardcoded STARTUP_GRACE_MS (30s) with a user-configurable
startupGraceSeconds setting (DB v34→35). Added a dropdown in the Auto-restart
screen with options: disabled (0s), 10s, 15s, 30s (default), 60s.

Setting 0 disables the grace period entirely for users who need immediate
monitoring on tunnel startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing is off

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
currentPingTime was captured from `state` (read at the top of the while loop),
but tunStateFlow.first { } starts collecting later. Another coroutine (the ping
monitor) could update the StateFlow between these two reads, causing first { } to
receive the already-advanced ping time as its first replay value and exit
immediately — effectively allowing a single ping cycle to satisfy a streak of N.

Fix: re-read currentPingTime from tunStateFlow.value right before first { }.
Since there is no suspend point between tunStateFlow.value and the first { } call,
the StateFlow cannot be updated between the two reads, so the replayed current
value always matches our baseline and we correctly wait for the next cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add post-restart grace period after each restart to prevent rapid restart
  loops when cooldown < WireGuard re-handshake time
- Add pre-restart verification ping series (using NetworkUtils.pingWithStats)
  before restarting to skip unnecessary restarts when tunnel is recoverable
- Add 5s option to startup grace period dropdown
- Remove 3s option from restart cooldown dropdown (too short)
- Rename 'Use ping to trigger restart' to 'Use ping monitoring'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

naonak commented Feb 26, 2026

updated here : #1176

@naonak naonak closed this Feb 26, 2026

Development

Successfully merging this pull request may close these issues.

[FEATURE] - Restart tunnel after handshake exceeds certain time

1 participant