fix: app freezes on exit #8769

Draft
sandrade-dcl wants to merge 12 commits into dev from fix/exit-app-delay

sandrade-dcl (Contributor) commented May 13, 2026

Exit-app freeze investigation (#8764)

Bug summary

When the user clicks EXIT in a standalone IL2CPP build, the process hangs for
20-95+ seconds before terminating. 100% reproducible per QA, cross-platform
(Windows + Mac), build-only — does NOT repro in the Unity Editor. Worst case
seen on Genesis Plaza with many peers connected.

Root-cause narrowing

WinDbg attach on a hung process consistently shows:

  • Main thread is blocked inside il2cpp::vm::Runtime::Shutdown →
    il2cpp::os::Thread::Shutdown, waiting for managed threads to detach.
  • 11–16 native threads named tokio-runtime-worker (from livekit_ffi) are
    still attached to the IL2CPP managed runtime and have not detached,
    stalling the shutdown for as long as they stay alive.
  • IL2CPP's managed-thread shutdown only returns once those threads detach,
    which only happens when the threads themselves exit. The threads exit when
    the tokio runtime that owns them is shut down. The tokio runtime shuts down
    when the LiveKit Room handles are disconnected/dropped.

What this PR adds (investigation tooling, not a fix yet)

A set of command-line bisection flags under --exit-test-* that toggle
specific behaviors during VoiceChatPlugin init / shutdown. These exist so we
can isolate the offending attachment path on real QA builds without rebuilding
for every hypothesis. They are diagnostic-only and will be removed once the
real fix lands.

Flags added

  • --exit-test-voice-init-stop=N
    Stops VoiceChatPlugin.InitializeAsync at stage N (1..8). Used to bisect
    which init step starts the FFI workers that later refuse to detach.

  • --exit-test-skip-nearby-voice-systems
    Skips registering the entire NEARBY voice ECS system block in
    InjectToWorld.

  • --exit-test-nearby-inject-stop=N
    Stops the NEARBY system injection at stage N (1..7). Bisects which specific
    NEARBY system pulls FFI into the runtime.

  • --exit-test-skip-audio-source-create
    In NearbyAudioBindingSystem, skips the sourceFactory.Create() /
    LivekitAudioSource.Play() call. Tests whether the active AudioSource
    binding (which triggers OnAudioFilterRead → AudioStream.ReadAudio on the
    audio thread) is what keeps the tokio workers attached.

    Important caveat (discovered late): this flag bypasses the
    LivekitAudioSource creation only AFTER registry.GetActiveStream(key)
    has already executed. That call is what synchronously constructs an
    AudioStreamInternal and subscribes to FFI events — i.e. the actual FFI
    surface — so this flag does not test what its name suggests. The runs
    collected with it should be re-read accordingly. See
    --exit-test-skip-get-active-stream for the real bypass.

  • --exit-test-skip-get-active-stream
    In NearbyAudioBindingSystem, bypasses the registry.GetActiveStream(key)
    call entirely (placed BEFORE the existing GetActiveStream call). That call
    lazily constructs a LiveKit AudioStreamInternal whose constructor does a
    synchronous FFI request AND subscribes to FfiClient.AudioStreamEventReceived.
    Callbacks on that event are dispatched from livekit_ffi tokio worker
    threads, which is how those threads become attached to the IL2CPP managed
    runtime. If exit becomes consistently fast with this flag, the AudioStream
    subscription is the attachment trigger.

  • --exit-test-disconnect-rooms-on-quit
    Hooks Application.wantsToQuit. First invocation cancels the quit, runs
    DisconnectAsync on IslandRoom and VoiceChatRoom concurrently
    (10s cap), then re-issues Application.Quit(). The second pass observes
    disconnectsCompleted=true and lets Unity proceed with shutdown.

    Earlier attempt did this in Dispose() with .AsTask().Wait() and
    deadlocked — blocking the main thread prevented PlayerLoop ticks, so the
    UniTask continuations inside DisconnectInstruction.AwaitWithSuccess could
    not resume, and the 3s cancellation always fired before the disconnect
    completed. The wantsToQuit flow keeps the PlayerLoop active during the
    disconnect.

  • --exit-test-post-disconnect-delay-ms=N
    Inserts an await UniTask.Delay(N) between the room-disconnect completion
    and the final Application.Quit(). Tests the hypothesis that after
    DisconnectAsync returns, the FFI tokio runtime needs a brief wind-down
    window before its worker threads actually exit and detach from IL2CPP.

Code change shipped behind no flag

NearbyAudioSourceFactory.DisposeRoot() now explicitly stops and frees both
LIVE and LEGACY LivekitAudioSource instances before destroying the parent
container. Previously only pool-resident instances ran the Stop() + Free()
release path; live instances were torn down via Unity.Destroy() on the
parent container, which fires OnDestroy() but bypasses Stop() / Free().
The AudioSource then stays in Play state long enough for one or more
OnAudioFilterRead callbacks to cross into livekit_ffi via
AudioStream.ReadAudio, attaching the consuming thread to IL2CPP.

This change did not by itself eliminate the freeze (WinDbg dumps with the fix
applied showed the same 16 tokio workers attached), but it closes a real
leak that is consistent with the freeze pattern and is a strict improvement.

Experiment log

All times are wall-clock from EXIT click to process termination on a Windows
IL2CPP standalone build, same machine, same scene (Genesis Plaza, populated).
Single run per row unless otherwise noted — variance is high (see notes), so
treat individual rows as indicative, not as precise measurements.

Phase 1 — VoiceChatPlugin.InitializeAsync bisection

Goal: find out whether the offending attachment happens during init.

| # | Command | Exit time |
|---|---|---|
| 1 | Decentraland.exe (baseline) | 21.95 s |
| 2 | --exit-test-voice-init-stop=1 | 4.65 s |
| 3 | --exit-test-voice-init-stop=2 | 3.65 s |
| 4 | --exit-test-voice-init-stop=3 | 6.23 s |
| 5 | --exit-test-voice-init-stop=4 | 5.40 s |
| 6 | --exit-test-voice-init-stop=5 | 6.55 s |
| 7 | --exit-test-voice-init-stop=6 | 3.05 s |
| 8 | --exit-test-voice-init-stop=7 | 5.81 s |
| 9 | --exit-test-voice-init-stop=8 | 5.21 s |
| 10 | --exit-test-skip-nearby-voice-systems | 3.05 s |

Observation: stopping InitializeAsync at any stage (even stage 8, i.e. full
init complete) keeps exit fast (≤7 s). What makes it slow is what runs
after InitializeAsync — specifically the NEARBY ECS systems being injected
into the world (row 10 confirms: skipping the whole NEARBY block → 3 s).

Phase 2 — Within InjectToWorld NEARBY block bisection

Goal: identify which specific NEARBY system causes the attachment. Stages
correspond to the order systems are registered in
VoiceChatPlugin.InjectToWorld:

  1. NearbyLivekitBridgeSystem
  2. NearbyAudibleRangeSystem
  3. NearbyAudioBindingSystem
  4. NearbyAudioPositionSystem
  5. NearbyAudioCleanupSystem
  6. NearbyVoiceChatNametagSystem
  7. NearbyVoiceChatDebugSystem

| # | Command | Exit time |
|---|---|---|
| 1 | --exit-test-nearby-inject-stop=1 | 6.13 s |
| 2 | --exit-test-nearby-inject-stop=2 | 13.61 s |
| 3 | --exit-test-nearby-inject-stop=3 | 32.80 s |
| 4 | --exit-test-nearby-inject-stop=4 | 30.63 s |
| 5 | --exit-test-nearby-inject-stop=5 | 46.24 s |
| 6 | --exit-test-nearby-inject-stop=6 | 19.56 s |
| 7 | --exit-test-nearby-inject-stop=7 | 45.05 s |

Observation (N=1): stage 1 fast, stage 2+ slow — pointed at
NearbyAudibleRangeSystem as the trigger. Re-running stage 4 produced
30.63 s instead of an earlier 3.51 s, which forced a re-run of stages 1–3
with N=3 to control for variance (see next phase).

Phase 2b — N=3 re-confirmation of stages 1, 2, 3

| Stage | Run 1 | Run 2 | Run 3 | Systems registered |
|---|---|---|---|---|
| 1 | 4.90 s | 3.31 s | 5.47 s | Bridge |
| 2 | 6.87 s | 5.46 s | 4.08 s | Bridge + AudibleRange |
| 3 | 44.44 s | 3.06 s | 45.68 s | Bridge + AudibleRange + AudioBinding |

Observation: stages 1 and 2 are both consistently fast (3–7 s). The freeze
only appears from stage 3 onward, and even then bimodally (2 of 3 runs slow,
1 of 3 fast). The previous N=1 reading that implicated
NearbyAudibleRangeSystem was an artifact of single-sample noise. The most
likely culprit is the system added at stage 3: NearbyAudioBindingSystem.
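To make the single-sample problem concrete, the readings can be classified mechanically. This is a small Python sketch over the numbers in the Phase 2 and Phase 2b tables; the 7 s cut-off for "fast" is an assumption taken from the "≤7 s" wording used in these observations, and the function names are hypothetical.

```python
FAST_THRESHOLD_S = 7.0  # assumption: "fast" means an exit time of at most 7 s

# Exit times in seconds, keyed by --exit-test-nearby-inject-stop stage.
PHASE_2_SINGLE_RUN = {
    1: [6.13], 2: [13.61], 3: [32.80], 4: [30.63],
    5: [46.24], 6: [19.56], 7: [45.05],
}
PHASE_2B_THREE_RUNS = {
    1: [4.90, 3.31, 5.47],
    2: [6.87, 5.46, 4.08],
    3: [44.44, 3.06, 45.68],
}


def slow_runs(times, threshold=FAST_THRESHOLD_S):
    """Count the runs whose exit time exceeds the threshold."""
    return sum(t > threshold for t in times)


def first_slow_stage(results, threshold=FAST_THRESHOLD_S):
    """Return the lowest stage with at least one slow run, or None."""
    for stage in sorted(results):
        if slow_runs(results[stage], threshold) > 0:
            return stage
    return None
```

With the single-run data first_slow_stage points at stage 2, while the three-run data moves it to stage 3 with 2 of 3 runs slow. Because stage 3 itself produced one fast run, any single run at a later stage can also look fast by chance.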

Inside NearbyAudioBindingSystem.CreateAndBindAudioSourcesToStreamers, for
every (walletId, sid) pair that's not yet bound, the system executes
registry.GetActiveStream(key) BEFORE any of the existing --exit-test-*
checks. That call delegates to room.AudioStreams.ActiveStream(key) which:

  1. Synchronously issues an FFI NewAudioStreamRequest.
  2. Constructs an AudioStreamInternal whose constructor subscribes to
    FfiClient.Instance.AudioStreamEventReceived.
  3. Stores the resulting AudioStream in the per-Room Streams dictionary.

The subscription in step 2 is the attachment surface: every time the FFI
publishes an audio frame event, the handler is invoked from a livekit_ffi
tokio worker thread. The [MonoPInvokeCallback] reverse-P/Invoke path attaches
the calling native thread to the IL2CPP managed runtime, and IL2CPP does not
detach it again until that thread itself exits, which only happens when the
tokio runtime owning it is shut down.
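The attachment rule described here can be captured in a toy model. The Python sketch below is not IL2CPP or LiveKit code: ToyManagedRuntime and its methods are hypothetical, and the model simply encodes the claims above (a callback from a native worker attaches that worker, and only the worker exiting detaches it).

```python
class ToyManagedRuntime:
    """Toy model of managed-runtime thread attachment (hypothetical)."""

    def __init__(self):
        self.attached = set()  # native workers currently attached
        self.handlers = []     # managed event handlers

    def subscribe(self, handler):
        self.handlers.append(handler)

    def unsubscribe(self, handler):
        self.handlers.remove(handler)

    def dispatch_from(self, worker, event):
        """A native worker delivers an event into managed code."""
        if not self.handlers:
            return  # no managed callback is invoked, so nothing attaches
        self.attached.add(worker)  # the first callback attaches the worker
        for handler in list(self.handlers):
            handler(event)

    def worker_exited(self, worker):
        """Only a worker exiting removes it from the attached set."""
        self.attached.discard(worker)

    def shutdown_can_complete(self):
        return not self.attached
```

In this model, unsubscribing every handler after a few events have been dispatched leaves shutdown_can_complete() false; it only turns true once each attached worker has exited. That matches the reported behaviour where unsubscribing alone does not release the shutdown.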

The previous --exit-test-skip-audio-source-create flag was bypassing the
LivekitAudioSource.Create() / .Play() calls (the OnAudioFilterRead →
ReadAudio path), but the call sequence runs GetActiveStream before that
check — so the attachment was happening regardless of whether the flag was
set. This explains why exit was still slow (58.61 s) with that flag on:
we were never actually testing the hypothesis we thought we were.

The new --exit-test-skip-get-active-stream flag moves the bypass earlier
to skip GetActiveStream itself. Pending: 3-run validation.

Phase 3 — Targeted hypotheses

| # | Command | Exit time | Notes |
|---|---|---|---|
| 1 | --exit-test-skip-audio-source-create | 58.61 s | Did NOT actually skip the FFI attachment surface (see phase-2b caveat). GetActiveStream still ran. |
| 2 | --exit-test-disconnect-rooms-on-quit (broken sync-over-async impl in Dispose) | 39.18 s | Untestable: disconnect cancelled at 3 s by sync-over-async deadlock. |
| 3 | both flags above | 39.43 s | Same as run 2. |
| 4 | --exit-test-disconnect-rooms-on-quit (rewritten via Application.wantsToQuit), run 1 | 58.99 s | Disconnect completed cleanly. |
| 5 | same, run 2 | 44.39 s | Disconnect completed cleanly. |
| 6 | same, run 3 | 4.73 s | Disconnect completed cleanly. |
| 7 | --exit-test-disconnect-rooms-on-quit --exit-test-post-disconnect-delay-ms=2000 | > 40 s | Disconnect completed, 2 s delay elapsed, exit still slow. |

Observation: with the rewritten disconnect flow, DisconnectAsync always
completes inside its 10 s cap, but the total exit time varies wildly across
runs (4.7 s to 59 s). Adding a 2 s wind-down delay after disconnect does
not deterministically help (still > 40 s). Conclusion: simply disconnecting
the room + waiting is not the right fix on its own. The most likely reason
is that worker threads already attached to IL2CPP via in-flight
AudioStreamEventReceived callbacks remain attached until the tokio
runtime itself shuts down, which appears not to happen reliably even after
DisconnectAsync returns and audioStreams.Free() unsubscribes the
handlers.

Current hypothesis under test

The current best hypothesis is that the attachment surface is the
FfiClient.AudioStreamEventReceived subscription created by each
AudioStreamInternal instance, which is constructed lazily via
registry.GetActiveStream(key) inside NearbyAudioBindingSystem.

The next test will run --exit-test-skip-get-active-stream for 3 runs in
the same conditions used for the phase-2b table. Expected readings:

  • 3/3 fast (≤ 7 s): GetActiveStream confirmed as the trigger. The real
    fix is to guarantee every AudioStream is disposed before
    Application.Quit() is allowed to proceed (Room.DisconnectAsync already
    calls audioStreams.Free() so this likely means we need a deterministic
    point where we await the FFI's own shutdown signal, not just the local
    unsubscribe).
  • Mixed results: GetActiveStream is part of the picture but there are
    additional attachment paths (other FfiClient.* events, Room events,
    participant events) that need to be covered too.
  • 3/3 still slow: the attachment is somewhere outside the AudioStream
    surface entirely. The next things to look at would be the other
    FfiClient event subscriptions and the room/track callbacks that
    NearbyAudioStreamsRegistry itself wires up.
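How much weight a 3-run result can carry depends on how often a fast exit happens by chance. The sketch below is a rough Python calculation under loudly stated assumptions: runs are treated as independent, and the chance of a slow exit without the flag is taken as 2/3, an estimate read off the stage-3 row of the phase-2b table (only three samples). prob_fast_runs is a hypothetical helper.

```python
from fractions import Fraction
from math import comb


def prob_fast_runs(k, n, p_slow):
    """P(exactly k of n independent runs are fast) for a given slow probability."""
    p_fast = 1 - p_slow
    return comb(n, k) * p_fast**k * p_slow**(n - k)


# Assumption: without the flag, 2 of 3 runs are slow (stage 3, phase 2b).
P_SLOW_ASSUMED = Fraction(2, 3)

all_fast_by_chance = prob_fast_runs(3, 3, P_SLOW_ASSUMED)  # Fraction(1, 27)
all_slow_by_chance = prob_fast_runs(0, 3, P_SLOW_ASSUMED)  # Fraction(8, 27)
mixed_by_chance = 1 - all_fast_by_chance - all_slow_by_chance  # Fraction(2, 3)
```

Under these assumptions a flag with no effect would still give 3/3 fast runs about 1 time in 27 (roughly 4%), so a 3/3 fast result is fairly strong but not conclusive evidence. A mixed result is what an ineffective flag would produce most of the time.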

Mitigation path (room disconnect + delay) remains a candidate fallback but
is not deterministic on its own — the 40+ s exit time with a 2 s delay
indicates the FFI tokio runtime is not winding down promptly even after
the room is disconnected.

Files touched

  • Assets/DCL/Infrastructure/Global/AppArgs/AppArgsFlags.cs
  • Assets/DCL/PluginSystem/Global/VoiceChatPlugin.cs
  • Assets/DCL/VoiceChat/NearbyVoiceChat/Systems/NearbyAudioBindingSystem.cs
  • Assets/DCL/VoiceChat/NearbyVoiceChat/Core/NearbyAudioSourceFactory.cs

How to test

The freeze does NOT reproduce in the Editor. Build a Windows IL2CPP standalone
on this branch, run with the flag combination relevant to the hypothesis under
test, and measure wall-clock time from clicking EXIT to process termination.
Player.log lines tagged EXIT TEST: trace each bisection branch.
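When checking a run, it helps to pull only the tagged lines out of Player.log. The following is a minimal Python sketch; the helper name is hypothetical, and only the "EXIT TEST:" tag comes from this PR, so nothing else about the log line layout is assumed.

```python
def exit_test_lines(log_text, tag="EXIT TEST:"):
    """Return the stripped log lines that contain the bisection tag, in order."""
    return [line.strip() for line in log_text.splitlines() if tag in line]
```

Reading the file with open(path, encoding="utf-8", errors="replace").read() and passing the text to exit_test_lines gives the sequence of bisection branches that were actually taken, which confirms a flag was read before its exit time is recorded.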

What's next

Once we identify the deterministic mitigation (likely: disconnect rooms +
small wind-down delay, OR an explicit FFI runtime shutdown call), the diag
flags get stripped and only the production fix remains.

…eads

- Skip `SegmentServerDispose` during `Application.quitting` to avoid main thread stalls caused by Rust runtime thread joins during HTTP timeouts.
sandrade-dcl self-assigned this May 13, 2026
sandrade-dcl added the force-build label (Used to trigger a build on draft PR) May 13, 2026
github-actions Bot commented May 13, 2026

sandrade-dcl added the clean-build label (Used to trigger clean build on PR) May 14, 2026
sandrade-dcl changed the title from "investigation: app freezes on exit" to "fix: app freezes on exit" May 14, 2026
The previous diagnosis pointed at SegmentServerDispose() as the cause of
the exit freeze, but the WinDbg attach with GameAssembly.pdb symbols
revealed the real culprit is livekit_ffi tokio threads attached to the
IL2CPP runtime (confirmed empirically by Ashley disabling VoiceChatPlugin).
The Segment skip therefore brings no value and is being reverted to keep
the experimentation branch focused on the actual cause.

Refs #8764
Adds two command-line toggles to narrow down which VoiceChatPlugin
component leaves livekit_ffi tokio threads attached to the IL2CPP
runtime, blocking process shutdown (Ashley's voice-chat-disabled test
proved the freeze lives inside this plugin).

  --exit-test-voice-init-stop=N        (N in 1..8) stops InitializeAsync
                                        after stage N
  --exit-test-skip-nearby-voice-systems skips the NEARBY ECS systems in
                                        InjectToWorld

Stages map to sequential component instantiations in InitializeAsync:
  1 VoiceChatMicrophoneHandler + VoiceChatMicrophoneStateManager
  2 MicrophoneTrackPublisher    (first to touch VoiceChatRoom)
  3 RemoteTrackListener         (second to touch VoiceChatRoom)
  4 VoiceChatRoomManager
  5 VoiceChatNametagsHandler
  6 MicrophoneAudioToggleHandler
  7 VoiceChatPanelPresenter
  8 VoiceChatDebugContainer

Flag-gated. No behavior change without the flags. Investigation tooling,
to be removed once the offending component is identified.

Refs #8764
sandrade-dcl removed the clean-build label (Used to trigger clean build on PR) May 14, 2026
sandrade-dcl and others added 8 commits May 14, 2026 12:17
The VOICE_CHAT report category is filtered out of Player.log builds,
which hides the bisection flag confirmation. Move both EXIT TEST log
lines to ReportCategory.ALWAYS so they're always visible and we can
verify the flag is being read.

Refs #8764
Bisection round 1 (--exit-test-voice-init-stop=N) ruled out every
component instantiated in InitializeAsync and the NEARBY block of
InitializeAsync itself. The freeze is triggered only when the 7 ECS
systems in InjectToWorld's NEARBY block run.

Adds --exit-test-nearby-inject-stop=N (N in 1..7) to stop after each
of the 7 NearbyXxxSystem.InjectToWorld(...) calls, so we can pin down
which one (or which subset) leaves livekit_ffi tokio threads attached
to the IL2CPP runtime.

Refs #8764
NearbyAudioSourceFactory.DisposeRoot() destroyed the pool's parent
GameObject without first running Stop+Free on the LIVE (and legacy)
LivekitAudioSource instances. Unity.Destroy() on those GameObjects
fires LivekitAudioSource.OnDestroy() which only disposes the WavWriter;
it does not stop the underlying AudioSource nor null the
Weak<AudioStream> reference. The AudioSource therefore stays in Play
state long enough for one or more OnAudioFilterRead callbacks to cross
into livekit_ffi via AudioStream.ReadAudio. Those FFI calls keep the
consuming threads attached to the IL2CPP managed runtime; the threads
never detach, and il2cpp::vm::Runtime::Shutdown deadlocks waiting on
them - the multi-second to minute-long EXIT freeze tracked in #8764.

This change iterates liveInstances and legacyInstances and explicitly
invokes Stop()+Free() on each before pool.Dispose() and the parent
container destruction. Mirrors the in-pool teardown path that already
runs via ResetForPool (onRelease).

Refs #8764

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the WinDbg re-attach showing the DisposeRoot fix did not move
the deadlock pattern (same 16 tokio-runtime-worker threads still attached
to the IL2CPP runtime), adds two complementary toggles to isolate the
real attachment-keeping path in a single build.

  --exit-test-skip-audio-source-create
      In NearbyAudioBindingSystem, skip the sourceFactory.Create()
      call (and resulting NearbyAudioSourceComponent). The system
      stays registered and iterates avatars normally — only the
      LivekitAudioSource creation/Play() that triggers
      OnAudioFilterRead → AudioStream.ReadAudio is bypassed.

      Pass-fail: if EXIT is fast with this flag, the active
      AudioSource binding is what keeps tokio workers attached.
      If still slow, the mere registration of the NEARBY systems
      is enough (independent of audio data flow).

  --exit-test-disconnect-rooms-on-quit
      In VoiceChatPlugin.Dispose, call DisconnectAsync on
      IslandRoom and VoiceChatRoom (capped to 3s total).
      Disconnecting the rooms shuts down their LiveKit tokio
      runtime, which terminates the worker threads and lets
      them detach from IL2CPP, freeing Runtime::Shutdown.

      Pass-fail: if EXIT is fast with this flag alone (no other
      flags), the absence of room disconnect on quit is the
      cause. This would point to the proper fix being a
      disconnect call somewhere in the production shutdown path.

Refs #8764
…tsToQuit

Previous attempt used .AsTask().Wait() in Dispose() which deadlocked: blocking
the main thread prevented PlayerLoop ticks, so the UniTask continuations inside
DisconnectAsync.AwaitWithSuccess could not resume, and the disconnect was
cancelled at the 3s timeout without ever completing.

New approach: hook Application.wantsToQuit so the room disconnects run while
the PlayerLoop is still pumping. The handler returns false to cancel the quit,
fires DisconnectRoomsThenQuitAsync().Forget(), and the async task re-issues
Application.Quit() once both rooms are disconnected (or the 10s timeout fires).
The second wantsToQuit invocation observes disconnectsCompleted=true and
returns true so Unity proceeds with shutdown.

Investigation of #8764.
Allows inserting a delay between DisconnectAsync completion and Application.Quit()
to test the hypothesis that the FFI tokio runtime needs a wind-down window after
the room disconnect to fully detach its worker threads from IL2CPP.

Default 0 keeps existing behavior. Use 1000/2000/3000 to bisect the value that
makes exit consistently fast across runs.

Refs #8764.
The existing SKIP_AUDIO_SOURCE_CREATE flag was skipping LivekitAudioSource.Create()
but NOT the registry.GetActiveStream(key) call that precedes it. GetActiveStream
constructs an AudioStreamInternal that synchronously hits FFI and subscribes to
FfiClient.AudioStreamEventReceived — callbacks on that event arrive from livekit_ffi
tokio worker threads, which is how those threads become attached to the IL2CPP
managed runtime.

This new flag moves the bypass earlier so the GetActiveStream call itself is
skipped. If exit becomes consistently fast with this flag (and slow without it),
the AudioStream subscription is the attachment trigger and the real fix is to
ensure every AudioStream gets disposed on quit, on top of room DisconnectAsync.

Refs #8764.
sandrade-dcl deleted the fix/exit-app-delay branch May 15, 2026 12:28
sandrade-dcl restored the fix/exit-app-delay branch May 22, 2026 09:22
sandrade-dcl reopened this May 22, 2026
sandrade-dcl added the clean-build label (Used to trigger clean build on PR) May 22, 2026
m3taphysics (Collaborator) commented

PR #8769, run #26290163948

Builds: Windows change, Windows baseline, macOS change, macOS baseline

Framework 13 i7

| Metric | Change | Baseline | Delta | Improvement |
|---|---|---|---|---|
| Samples | 2700 | 2700 | | |
| CPU average | 33.3 ms | 33.3 ms | 0.0 ms | -0.0% |
| CPU 1% worst | 34.0 ms | 33.8 ms | 0.3 ms | -0.8% |
| CPU 0.1% worst | 36.9 ms | 36.6 ms | 0.3 ms | -0.8% |
| GPU average | 8.2 ms | 7.5 ms | 0.7 ms | -9.5% 🔴 |
| GPU 1% worst | 19.8 ms | 18.5 ms | 1.3 ms | -7.0% 🔴 |
| GPU 0.1% worst | 25.6 ms | 23.7 ms | 1.9 ms | -8.0% 🔴 |
