
Feat/auto-restart tunnels on stale handshake or ping failure 1036 EXPERIMENTAL #1176

Closed
naonak wants to merge 8 commits into wgtunnel:master from naonak:feat/auto-restart-pr

@naonak naonak commented Feb 26, 2026

Auto-restart tunnels on stale handshake or ping failure (EXPERIMENTAL)

WORK IN PROGRESS.

Summary

Adds an optional auto-restart mechanism that monitors the active WireGuard tunnel and automatically restarts it when a connection problem is detected. The feature is entirely opt-in and configurable through a new screen under Settings → Tunnel Monitoring → Auto-restart.


Problem

WireGuard tunnels can silently stop passing traffic when:

  • The last handshake becomes stale (~3.5 min without a successful re-key)
  • All configured ping targets are unreachable for several consecutive intervals

Without manual intervention the tunnel stays "Up" in the UI while being effectively dead.


What's new

Functional

  • Auto-restart on stale handshake — restarts the tunnel when the WireGuard handshake threshold is exceeded, without requiring the user to toggle the tunnel manually
  • Restart on ping failure — optionally also restarts after N consecutive ping-failure intervals reported by the existing ping monitor
  • Pre-restart verification — when ping is enabled, performs a fresh ping series just before each restart attempt; skips the restart if any target is reachable (tunnel recovered on its own)
  • Exponential backoff — optionally doubles the cooldown between each attempt, up to a configurable number of attempts
  • Give-up action — after max attempts: either keep monitoring (do nothing) or stop the tunnel entirely
  • Recovery notifications — notifies the user when the tunnel that was restarting comes back healthy
  • Real-time status in tunnel list — the tunnel card shows live restart progress: attempt count, countdown to next retry, trigger reason (stale handshake / ping failure), and failing ping targets

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| Restart cooldown | 30s | Minimum time between restart attempts |
| Startup grace period | 15s | Delay before first check after tunnel start |
| Restart on ping failure | off | Use ping failures as an additional trigger |
| Consecutive failures before restart | 3 | Ping failure streak required |
| Exponential backoff | off | Double cooldown on each attempt |
| Max attempts (backoff) | 5 | Give up after N attempts with backoff |
| Max attempts (no backoff) | 10 | Max attempts per hour without backoff |
| Give-up action | Do nothing | Do nothing or stop tunnel |
| Recovery notifications | on | Notify when tunnel recovers |

Technical design

HandshakeRestartHandler

The core of the feature. A monitoring coroutine is started when the tunnel comes up and cancelled when it goes down (via activeTunnels StateFlow). A Mutex serialises job lifecycle.

Trigger logic (shouldTrigger)

  1. Stale handshake — always checked first (WireGuard already waits ~3.5 min)
  2. Ping failure — all attempted pings unreachable, gated on isPingMonitoringEnabled
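
The ordering above can be sketched as a small pure function. This is an illustration, not the PR's code: `TunnelSnapshot`, `RestartReason`, and the 210 s constant (derived from the "~3.5 min" figure) are assumptions; only `shouldTrigger` and `isPingMonitoringEnabled` are names taken from the description.

```kotlin
// Hypothetical types for illustration; the real handler's data model is not shown in the PR text.
enum class RestartReason { STALE_HANDSHAKE, PING_FAILURE }

data class TunnelSnapshot(
    val lastHandshakeEpochMillis: Long, // time of the last successful handshake
    val pingReachable: List<Boolean>,   // one entry per ping target that was attempted
)

// Assumed threshold: the description only says "~3.5 min".
const val STALE_HANDSHAKE_MILLIS = 210_000L

fun shouldTrigger(
    snapshot: TunnelSnapshot,
    nowMillis: Long,
    isPingMonitoringEnabled: Boolean,
): RestartReason? {
    // 1. The stale-handshake check always runs first.
    val handshakeAgeMillis = nowMillis - snapshot.lastHandshakeEpochMillis
    if (handshakeAgeMillis > STALE_HANDSHAKE_MILLIS) return RestartReason.STALE_HANDSHAKE

    // 2. Ping failure counts only when ping monitoring is enabled, at least
    //    one ping was attempted, and every attempted target was unreachable.
    if (isPingMonitoringEnabled &&
        snapshot.pingReachable.isNotEmpty() &&
        snapshot.pingReachable.none { it }
    ) {
        return RestartReason.PING_FAILURE
    }
    return null
}
```

Returning a nullable reason rather than a Boolean lets the caller forward the trigger reason to the UI without a second lookup.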

False-positive protection

  • Startup grace — waits for the tunnel to reach a healthy state before the first check; prevents false triggers from stale kernel stats retained across tunnel restarts
  • Ping streak threshold — ping failures must repeat for N consecutive intervals; waits for an actual new ping cycle (not just any stats update) using lastPingAttemptMillis comparison
  • Pre-restart verification — re-pings targets fresh before committing to a restart, using NetworkUtils.pingWithStats()
  • Post-restart grace — mirrors startup grace after each restart; prevents rapid-fire loops when cooldown < WireGuard re-keying time, since isTunnelStale() can still fire on stale stats before the new handshake completes
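
A minimal sketch of the ping streak threshold, assuming the handler receives a `lastPingAttemptMillis` value with every stats update. The class name `PingStreakTracker` and its method are hypothetical; the point is that a stats update that does not carry a new ping cycle must not change the streak.

```kotlin
// Hypothetical helper; only lastPingAttemptMillis is a name from the PR description.
class PingStreakTracker(private val failuresBeforeRestart: Int) {
    private var lastSeenPingAttemptMillis = 0L

    var streak = 0
        private set

    /** Returns true once the failure streak has reached the configured threshold. */
    fun onStatsUpdate(lastPingAttemptMillis: Long, allTargetsFailed: Boolean): Boolean {
        // Only a changed timestamp means a new ping cycle actually ran;
        // any other stats update is ignored so one failure is not counted twice.
        if (lastPingAttemptMillis != lastSeenPingAttemptMillis) {
            lastSeenPingAttemptMillis = lastPingAttemptMillis
            streak = if (allTargetsFailed) streak + 1 else 0
        }
        return streak >= failuresBeforeRestart
    }
}
```

With a threshold of 3, three distinct failed ping cycles are needed before the tracker reports true, and a single successful cycle resets the count.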

Rate limiting

  • Timestamps are recorded in an ArrayDeque<Long>
  • Without backoff: timestamps older than 1 hour are pruned on each check
  • With backoff: cooldown × 2^(attempt-1), capped at attempt 31 to prevent Long overflow
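
The two rate-limiting rules can be sketched as follows. The function names and signatures are assumptions (the PR mentions a `computeCooldown()` but does not show it); the sketch only illustrates the capped doubling and the one-hour pruning window.

```kotlin
const val ONE_HOUR_MILLIS = 3_600_000L

// Cooldown for the given 1-based attempt number.
// With backoff: base × 2^(attempt-1), with the attempt clamped to 1..31
// so the shift never exceeds 30 bits.
fun computeCooldownMillis(baseCooldownMillis: Long, attempt: Int, backoffEnabled: Boolean): Long {
    if (!backoffEnabled) return baseCooldownMillis
    val exponent = attempt.coerceIn(1, 31) - 1
    return baseCooldownMillis * (1L shl exponent)
}

// Without backoff: drop restart timestamps older than one hour, so the
// remaining size is the number of attempts made in the last hour.
fun pruneOldAttempts(timestamps: ArrayDeque<Long>, nowMillis: Long) {
    while (timestamps.isNotEmpty() && nowMillis - timestamps.first() > ONE_HOUR_MILLIS) {
        timestamps.removeFirst()
    }
}
```

Because timestamps are appended in chronological order, pruning only ever needs to look at the head of the deque.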

Network change reactivity

  • networkChangeFlow observes connectivity state changes (WiFi ↔ Cellular ↔ Ethernet) and wakes the monitoring loop immediately after a 3 s grace, avoiding the full ~3.5 min stale-handshake wait after a network switch

Give-up

  • DO_NOTHING — suspends until the tunnel recovers or goes down, then resets timestamps and resumes monitoring
  • STOP_TUNNEL — calls stopTunnel(id) and returns (job terminates)
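
One way to model the two give-up branches is to map the configured action to an explicit next step. `MaxAttemptsAction` and its two values come from the PR; `GiveUpStep` and `onMaxAttemptsReached` are hypothetical names used only for this sketch, and the actual suspension and `stopTunnel(id)` call are left to the caller.

```kotlin
enum class MaxAttemptsAction { DO_NOTHING, STOP_TUNNEL }

// Hypothetical result type describing what the monitoring job does next.
sealed interface GiveUpStep {
    // Suspend until the tunnel recovers or goes down, then reset timestamps and resume.
    object AwaitRecoveryThenReset : GiveUpStep

    // Stop the tunnel and let the monitoring job terminate.
    data class StopAndExit(val tunnelId: Int) : GiveUpStep
}

fun onMaxAttemptsReached(action: MaxAttemptsAction, tunnelId: Int): GiveUpStep =
    when (action) {
        MaxAttemptsAction.DO_NOTHING -> GiveUpStep.AwaitRecoveryThenReset
        MaxAttemptsAction.STOP_TUNNEL -> GiveUpStep.StopAndExit(tunnelId)
    }
```

An exhaustive `when` over the enum means a future third action would fail to compile until it is handled here.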

Database

  • MonitoringSettings entity extended with 9 new fields (all with sane defaults via auto-migration)
  • DB version 29 → 31 (two auto-migrations)
  • MaxAttemptsAction stored as a string enum via DatabaseConverters
  • TunnelRestartProgress is a pure domain state type — not persisted, lives only in memory

UI

  • AutoRestartScreen exposes all settings through MonitoringViewModel (Orbit MVI pattern)
  • AutoRestartScreen shows a warning banner when battery optimization is enabled, with a direct tap-to-disable shortcut — battery optimization can prevent auto-restart from firing reliably on some devices
  • LabelledNumberDropdown added as a reusable component for numeric option lists
  • Backoff give-up dropdown shows estimated total wait time (e.g. 5 attempts (~4m35s)) computed from computeCooldown() so the user can reason about the effective timeout
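
A rough sketch of how such a dropdown label could be built. The PR does not show how the estimate is composed (for example, whether grace periods are included), so this version sums the backoff cooldowns only and its numbers may differ from the example above. All function names here are hypothetical.

```kotlin
// Sum of base × 2^(attempt-1) over all attempts, in seconds (cooldowns only).
fun estimatedTotalWaitSeconds(baseCooldownSeconds: Long, attempts: Int): Long =
    (1..attempts).sumOf { attempt ->
        baseCooldownSeconds * (1L shl (attempt.coerceAtMost(31) - 1))
    }

// Formats seconds as "XmYs", or "Ys" when under a minute.
fun formatDuration(totalSeconds: Long): String {
    val minutes = totalSeconds / 60
    val seconds = totalSeconds % 60
    return if (minutes > 0) "${minutes}m${seconds}s" else "${seconds}s"
}

fun attemptsLabel(attempts: Int, baseCooldownSeconds: Long): String {
    val estimate = formatDuration(estimatedTotalWaitSeconds(baseCooldownSeconds, attempts))
    return "$attempts attempts (~$estimate)"
}
```

For instance, `attemptsLabel(5, 15)` yields `5 attempts (~7m45s)` under this cooldown-only assumption.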
  • TunnelRestartProgress flows from HandshakeRestartHandler → TunnelManager → SharedAppViewModel → TunnelsUiState → TunnelList

Also included

fix: reduce network change grace period from 10 s to 3 s

HandshakeRestartHandler observes network transitions (WiFi ↔ LTE ↔ Ethernet) and wakes the restart loop early to avoid waiting the full ~3.5 min stale-handshake window. The previous grace period of 10 s was longer than necessary: 3 s is enough to distinguish a real network switch from a momentary drop, while still reacting quickly enough to restart the tunnel before the user notices the outage.

fix: show battery optimization warning in auto-restart screen

If Android battery optimization is active, the app process can be throttled or delayed in the background, preventing auto-restart from firing reliably (especially on devices running Android < 14 with aggressive OEM power management). A contextual warning banner is now shown at the top of the Auto-restart screen whenever battery optimization is not disabled, with a tap action that opens the system exemption prompt directly.


Test plan

  • Enable auto-restart, disconnect network — tunnel restarts within cooldown + grace period
  • Enable ping failure trigger, block ICMP — restart triggers after N consecutive failures
  • Verify pre-restart verification skips restart when tunnel self-recovers mid-cooldown
  • Enable backoff — confirm cooldown doubles each attempt
  • Reach max attempts with STOP_TUNNEL action — tunnel stops, notification shown
  • Reach max attempts with DO_NOTHING — monitoring resumes after manual recovery
  • Toggle tunnel off manually during auto-restart — restart cancelled cleanly
  • Startup grace: toggle tunnel on/off rapidly — no spurious restart on startup
  • Recovery notification shown when tunnel comes back healthy

naonak and others added 4 commits February 26, 2026 14:35
Introduces the data model for the auto-restart feature:

- MonitoringSettings entity/domain model with all configurable fields:
  isAutoRestartEnabled, restartCooldownSeconds, maxHandshakeRestartAttempts,
  startupGraceSeconds, isRecoveryNotificationEnabled, isPingMonitoringEnabled,
  pingFailuresBeforeRestart, isBackoffEnabled, backoffMaxAttempts,
  maxAttemptsAction
- MaxAttemptsAction enum: DO_NOTHING or STOP_TUNNEL when max attempts reached
- TunnelRestartProgress domain state for real-time UI feedback
- BackendMessage extended with restart-related events (restarting, recovered,
  max attempts reached)
- MonitoringSettingsMapper for entity ↔ domain conversion
- DatabaseConverters updated for new types
- AppDatabase bumped to v31 with auto-migrations (v29→30, v30→31)
- DB schema snapshots for v30 and v31

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HandshakeRestartHandler runs a coroutine per active tunnel and handles
all auto-restart logic. Key behaviours:

Restart triggers
- Stale handshake: restarts when last handshake exceeds the WireGuard
  threshold (always active when auto-restart is enabled)
- Ping failure streak: optionally restarts after N consecutive ping
  failures reported by the ping monitor (isPingMonitoringEnabled)

False-positive protection
- Startup grace period: skips restart checks for configurable seconds
  after the tunnel first starts, avoiding false triggers during the
  initial WireGuard handshake
- Post-restart grace period: waits after each restart before re-checking,
  preventing rapid-fire loops when the cooldown is shorter than the
  WireGuard re-keying time
- Pre-restart verification pings: when ping is enabled, performs a fresh
  ping series just before restarting; skips restart if any target is
  reachable (tunnel self-recovered)

Rate limiting & give-up
- Configurable cooldown between attempts (restartCooldownSeconds)
- Optional exponential backoff: doubles cooldown each attempt up to
  backoffMaxAttempts, then triggers maxAttemptsAction
- maxAttemptsAction: DO_NOTHING (keep monitoring) or STOP_TUNNEL

Observability
- Emits TunnelRestartProgress events consumed by the UI for real-time
  status display (countdown, attempt count, restart reason)
- Recovery notifications via NotificationMonitor when a tunnel that was
  restarting comes back healthy

Integration
- TunnelManager creates one HandshakeRestartHandler per tunnel start and
  cancels it on stop
- TunnelLifecycleManager and TunnelProvider updated to expose the required
  state flows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New AutoRestartScreen accessible from Settings > Tunnel Monitoring:

- Enable/disable auto-restart toggle
- Restart cooldown dropdown (5s → 5min)
- Startup grace period dropdown (0 → 60s)
- Restart on ping failure toggle (gated on ping being enabled)
- Consecutive ping failures threshold (1–5)
- Exponential backoff toggle with give-up attempts dropdown;
  dropdown label shows estimated total wait time for quick tuning
- Max attempts action: do nothing or stop tunnel
- Recovery notifications toggle

Navigation: added AutoRestart route, entry in MainActivity nav graph,
and navbar state mapping. MonitoringViewModel exposes all settings
as state with individual update intents. LabelledNumberDropdown added
as a new reusable component for numeric option lists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each tunnel card now displays live restart progress when
HandshakeRestartHandler is active:

- "Restarting… (attempt N)" during a restart
- "Next restart in Xs" countdown during cooldown
- Restart reason: stale handshake or ping failure
- Ping target when the trigger is a ping failure
- "Max attempts reached" when give-up action fires
- Status clears automatically on tunnel recovery

SharedAppViewModel collects TunnelRestartProgress from TunnelManager
and exposes it as a StateFlow. TunnelsUiState carries the progress map
keyed by tunnel ID. SettingsViewModel passes it through to the tunnels
screen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

naonak commented Feb 26, 2026

#1036

naonak and others added 4 commits February 26, 2026 15:14
- AppDatabase: consolidate auto-migrations 31→32→33→34→35 into a
  single AutoMigration(31, 35); intermediate schema files were never
  committed so Room could not generate the migration code
- TunnelManager: remove `override val restartCounts` which was absent
  from the TunnelProvider interface (restartCounts is managed internally
  by HandshakeRestartHandler; attemptNumber in TunnelRestartProgress
  serves the same purpose externally)
- SharedAppViewModel: remove redundant restartCounts from the combine;
  use restartProgress only (which already contains attemptNumber)
- TunnelList: remove restartCount parameter from TunnelStatisticsRow
  call (parameter was removed from the composable signature)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add description under "Restart on ping failure" (requires ping monitoring)
- Add description under "Startup grace period"
- Tune defaults: grace 30→10s, cooldown 30→15s, ping failures before restart 1→2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A 10-second grace period left traffic failing silently for too long on
WiFi→LTE transitions. 3 seconds is sufficient to distinguish a real network
switch from a brief drop, while limiting unnecessary downtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…creen

When auto-restart is active but battery optimization is NOT disabled,
Android may restrict the monitoring process (especially on pre-Android-14
devices). A contextual banner now appears at the top of the screen with
a direct link to the system battery optimization settings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@naonak naonak changed the title from "Feat/auto-restart tunnels on stale handshake or ping failure 1036" to "Feat/auto-restart tunnels on stale handshake or ping failure 1036 EXPERIMENTAL" on Feb 27, 2026

naonak commented Feb 28, 2026

new version #1182

@naonak naonak closed this Feb 28, 2026