fix: app freezes on exit #8769
Draft
sandrade-dcl wants to merge 12 commits into
…eads - Skip `SegmentServerDispose` during `Application.quitting` to avoid main thread stalls caused by Rust runtime thread joins during HTTP timeouts.
Windows and Mac build successful in Unity Cloud! You can find a link to the downloadable artifact below.
The previous diagnosis pointed at SegmentServerDispose() as the cause of the exit freeze, but the WinDbg attach with GameAssembly.pdb symbols revealed the real culprit is livekit_ffi tokio threads attached to the IL2CPP runtime (confirmed empirically by Ashley disabling VoiceChatPlugin). The Segment skip therefore brings no value and is being reverted to keep the experimentation branch focused on the actual cause. Refs #8764
Adds two command-line toggles to narrow down which VoiceChatPlugin
component leaves livekit_ffi tokio threads attached to the IL2CPP
runtime, blocking process shutdown (Ashley's voice-chat-disabled test
proved the freeze lives inside this plugin).
  --exit-test-voice-init-stop=N (N in 1..8)
      stops InitializeAsync after stage N
  --exit-test-skip-nearby-voice-systems
      skips the NEARBY ECS systems in InjectToWorld
Stages map to sequential component instantiations in InitializeAsync:
  1 VoiceChatMicrophoneHandler + VoiceChatMicrophoneStateManager
  2 MicrophoneTrackPublisher (first to touch VoiceChatRoom)
  3 RemoteTrackListener (second to touch VoiceChatRoom)
  4 VoiceChatRoomManager
  5 VoiceChatNametagsHandler
  6 MicrophoneAudioToggleHandler
  7 VoiceChatPanelPresenter
  8 VoiceChatDebugContainer
Flag-gated. No behavior change without the flags. Investigation tooling,
to be removed once the offending component is identified.
Refs #8764
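A minimal sketch of how a stage-limited toggle such as --exit-test-voice-init-stop=N could be read and applied. The names ExitTestArgs, TryGetStageFlag and ShouldRunStage are hypothetical and are not taken from AppArgsFlags.cs; only the flag name, the 1..8 range and the "stop after stage N" semantics come from the commit message above.

```csharp
using System;

public static class ExitTestArgs
{
    // Hypothetical helper: returns true and sets `stage` when `args` contains
    // "<flag>=N" with N inside [min, max]; otherwise returns false and leaves
    // `stage` at 0.
    public static bool TryGetStageFlag(string[] args, string flag, int min, int max, out int stage)
    {
        stage = 0;
        string prefix = flag + "=";

        foreach (string arg in args)
        {
            if (!arg.StartsWith(prefix, StringComparison.Ordinal))
                continue;

            // Reject non-numeric or out-of-range values so that a typo
            // cannot silently change behavior.
            if (int.TryParse(arg.Substring(prefix.Length), out int value) && value >= min && value <= max)
            {
                stage = value;
                return true;
            }

            return false;
        }

        return false;
    }

    // Hypothetical gate: without the flag every stage runs; with the flag,
    // stage N itself still runs and everything after it is skipped.
    public static bool ShouldRunStage(int currentStage, bool flagPresent, int stopAfter) =>
        !flagPresent || currentStage <= stopAfter;
}
```

A caller could pass Environment.GetCommandLineArgs() as args and check ShouldRunStage before each of the eight instantiation steps, which keeps the default path untouched when the flag is absent.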
The VOICE_CHAT report category is filtered out of Player.log builds, which hides the bisection flag confirmation. Move both EXIT TEST log lines to ReportCategory.ALWAYS so they're always visible and we can verify the flag is being read. Refs #8764
Bisection round 1 (--exit-test-voice-init-stop=N) ruled out every component instantiated in InitializeAsync and the NEARBY block of InitializeAsync itself. The freeze is triggered only when the 7 ECS systems in InjectToWorld's NEARBY block run. Adds --exit-test-nearby-inject-stop=N (N in 1..7) to stop after each of the 7 NearbyXxxSystem.InjectToWorld(...) calls, so we can pin down which one (or which subset) leaves livekit_ffi tokio threads attached to the IL2CPP runtime. Refs #8764
NearbyAudioSourceFactory.DisposeRoot() destroyed the pool's parent GameObject without first running Stop+Free on the LIVE (and legacy) LivekitAudioSource instances. Unity.Destroy() on those GameObjects fires LivekitAudioSource.OnDestroy(), which only disposes the WavWriter; it does not stop the underlying AudioSource nor null the Weak<AudioStream> reference.
The AudioSource therefore stays in Play state long enough for one or more OnAudioFilterRead callbacks to cross into livekit_ffi via AudioStream.ReadAudio. Those FFI calls keep the consuming threads attached to the IL2CPP managed runtime; the threads never detach, and il2cpp::vm::Runtime::Shutdown deadlocks waiting on them. This is the multi-second to minute-long EXIT freeze tracked in #8764.
This change iterates liveInstances and legacyInstances and explicitly invokes Stop()+Free() on each before pool.Dispose() and the parent container destruction. Mirrors the in-pool teardown path that already runs via ResetForPool (onRelease).
Refs #8764
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
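The teardown order described in this commit can be sketched roughly as follows. IPooledAudioSource and AudioSourceTeardown are hypothetical stand-ins, since the real LivekitAudioSource and NearbyAudioSourceFactory code is not shown in this text; the sketch only illustrates "Stop+Free every tracked instance, then destroy the container".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal surface of a pooled audio source.
public interface IPooledAudioSource
{
    void Stop(); // stop playback so no further audio callbacks fire
    void Free(); // release the stream reference held by the source
}

public static class AudioSourceTeardown
{
    // Stops and frees every live and legacy instance, then runs the
    // container teardown. Returns how many instances were released.
    public static int StopAndFreeAll(
        IEnumerable<IPooledAudioSource> live,
        IEnumerable<IPooledAudioSource> legacy,
        Action destroyContainer)
    {
        int released = 0;

        foreach (IEnumerable<IPooledAudioSource> group in new[] { live, legacy })
        {
            // Copy first so a callback that mutates the tracked set
            // cannot invalidate the enumeration.
            foreach (IPooledAudioSource source in new List<IPooledAudioSource>(group))
            {
                try
                {
                    source.Stop();
                    source.Free();
                    released++;
                }
                catch (Exception e)
                {
                    // One failing instance must not skip the remaining ones.
                    Console.Error.WriteLine("Stop/Free failed: " + e.Message);
                }
            }
        }

        // Only now is it safe to destroy the parent container.
        destroyContainer();
        return released;
    }
}
```

The point of the ordering is that the container is destroyed only after every instance has gone through the same release path that pooled instances already use.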
Following the WinDbg re-attach showing the DisposeRoot fix did not move
the deadlock pattern (same 16 tokio-runtime-worker threads still attached
to the IL2CPP runtime), adds two complementary toggles to isolate the
real attachment-keeping path in a single build.
  --exit-test-skip-audio-source-create
      In NearbyAudioBindingSystem, skip the sourceFactory.Create()
      call (and resulting NearbyAudioSourceComponent). The system
      stays registered and iterates avatars normally; only the
      LivekitAudioSource creation/Play() that triggers
      OnAudioFilterRead → AudioStream.ReadAudio is bypassed.
      Pass-fail: if EXIT is fast with this flag, the active
      AudioSource binding is what keeps tokio workers attached.
      If still slow, the mere registration of the NEARBY systems
      is enough (independent of audio data flow).
  --exit-test-disconnect-rooms-on-quit
      In VoiceChatPlugin.Dispose, call DisconnectAsync on
      IslandRoom and VoiceChatRoom (capped to 3s total).
      Disconnecting the rooms shuts down their LiveKit tokio
      runtime, which terminates the worker threads and lets
      them detach from IL2CPP, freeing Runtime::Shutdown.
      Pass-fail: if EXIT is fast with this flag alone (no other
      flags), the absence of room disconnect on quit is the
      cause. This would point to the proper fix being a
      disconnect call somewhere in the production shutdown path.
Refs #8764
…tsToQuit
Previous attempt used .AsTask().Wait() in Dispose(), which deadlocked: blocking the main thread prevented PlayerLoop ticks, so the UniTask continuations inside DisconnectAsync.AwaitWithSuccess could not resume, and the disconnect was cancelled at the 3s timeout without ever completing.
New approach: hook Application.wantsToQuit so the room disconnects run while the PlayerLoop is still pumping. The handler returns false to cancel the quit, fires DisconnectRoomsThenQuitAsync().Forget(), and the async task re-issues Application.Quit() once both rooms are disconnected (or the 10s timeout fires). The second wantsToQuit invocation observes disconnectsCompleted=true and returns true so Unity proceeds with shutdown.
Investigation of #8764.
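The two-pass quit flow described above can be sketched without engine dependencies. QuitGate is a hypothetical class, not the code in VoiceChatPlugin: in Unity, OnWantsToQuit would be subscribed to Application.wantsToQuit, requestQuit would be Application.Quit, and the continuation would run via UniTask on the main thread. Here plain Task is used instead, so continuations may resume on a pool thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the wantsToQuit handler. The disconnect work
// and the quit request are injected so the sketch stays self-contained.
public sealed class QuitGate
{
    private readonly Func<Task> disconnectRooms;
    private readonly Action requestQuit;
    private readonly TimeSpan timeout;
    private bool disconnectStarted;
    private bool disconnectsCompleted;

    public QuitGate(Func<Task> disconnectRooms, Action requestQuit, TimeSpan timeout)
    {
        this.disconnectRooms = disconnectRooms;
        this.requestQuit = requestQuit;
        this.timeout = timeout;
    }

    // Finishes once the quit has been re-issued; exposed for callers and tests.
    public Task Completion { get; private set; } = Task.CompletedTask;

    // Returns false to cancel the first quit request, and true once the
    // disconnect has finished or timed out.
    public bool OnWantsToQuit()
    {
        if (disconnectsCompleted)
            return true;

        if (!disconnectStarted)
        {
            disconnectStarted = true;
            Completion = DisconnectThenQuitAsync();
        }

        return false;
    }

    private async Task DisconnectThenQuitAsync()
    {
        // Yield first so the handler is never re-entered synchronously
        // from inside the first quit request.
        await Task.Yield();

        // Wait for the disconnect or the cap, whichever comes first.
        // Nothing here blocks the calling thread.
        await Task.WhenAny(disconnectRooms(), Task.Delay(timeout));

        disconnectsCompleted = true;
        requestQuit(); // triggers the second OnWantsToQuit pass
    }
}
```

Because the first pass only starts the work and returns, the thread that drives the quit request stays free to keep ticking, which is the property the blocking Dispose() variant lacked.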
Allows inserting a delay between DisconnectAsync completion and Application.Quit() to test the hypothesis that the FFI tokio runtime needs a wind-down window after the room disconnect to fully detach its worker threads from IL2CPP. Default 0 keeps existing behavior. Use 1000/2000/3000 to bisect the value that makes exit consistently fast across runs. Refs #8764.
The existing SKIP_AUDIO_SOURCE_CREATE flag was skipping LivekitAudioSource.Create() but NOT the registry.GetActiveStream(key) call that precedes it. GetActiveStream constructs an AudioStreamInternal that synchronously hits FFI and subscribes to FfiClient.AudioStreamEventReceived — callbacks on that event arrive from livekit_ffi tokio worker threads, which is how those threads become attached to the IL2CPP managed runtime. This new flag moves the bypass earlier so the GetActiveStream call itself is skipped. If exit becomes consistently fast with this flag (and slow without it), the AudioStream subscription is the attachment trigger and the real fix is to ensure every AudioStream gets disposed on quit, on top of room DisconnectAsync. Refs #8764.
PR #8769, run #26290163948 Builds: Windows change, Windows baseline, macOS change, macOS baseline Framework 13 i7
Exit-app freeze investigation (#8764)
Bug summary
When the user clicks EXIT in a standalone IL2CPP build, the process hangs for
20-95+ seconds before terminating. 100% reproducible per QA, cross-platform
(Windows + Mac), build-only — does NOT repro in the Unity Editor. Worst case
seen on Genesis Plaza with many peers connected.
Root-cause narrowing
WinDbg attach on a hung process consistently shows:
  il2cpp::vm::Runtime::Shutdown → il2cpp::os::Thread::Shutdown, waiting for managed threads to detach.
  tokio-runtime-worker threads (from livekit_ffi) are still attached to the IL2CPP managed runtime and never detach, holding the shutdown indefinitely.
Shutdown cannot complete until those threads detach, which only happens when the threads themselves exit. The threads exit when
the tokio runtime that owns them is shut down. The tokio runtime shuts down
when the LiveKit Room handles are disconnected/dropped.
What this PR adds (investigation tooling, not a fix yet)
A set of command-line bisection flags under --exit-test-* that toggle
specific behaviors during VoiceChatPlugin init / shutdown. These exist so we
can isolate the offending attachment path on real QA builds without rebuilding
for every hypothesis. They are diagnostic-only and will be removed once the
real fix lands.
Flags added
--exit-test-voice-init-stop=N
    Stops VoiceChatPlugin.InitializeAsync at stage N (1..8). Used to bisect
    which init step starts the FFI workers that later refuse to detach.
--exit-test-skip-nearby-voice-systems
    Skips registering the entire NEARBY voice ECS system block in
    InjectToWorld.
--exit-test-nearby-inject-stop=N
    Stops the NEARBY system injection at stage N (1..7). Bisects which specific
    NEARBY system pulls FFI into the runtime.
--exit-test-skip-audio-source-create
    In NearbyAudioBindingSystem, skips the sourceFactory.Create() /
    LivekitAudioSource.Play() call. Tests whether the active AudioSource
    binding (which triggers OnAudioFilterRead → AudioStream.ReadAudio on the
    audio thread) is what keeps the tokio workers attached.
    Important caveat (discovered late): this flag bypasses the
    LivekitAudioSource creation only AFTER registry.GetActiveStream(key) has
    already executed. That call is what synchronously constructs an
    AudioStreamInternal and subscribes to FFI events (i.e. the actual FFI
    surface), so this flag does not test what its name suggests. The runs
    collected with it should be re-read accordingly. See
    --exit-test-skip-get-active-stream for the real bypass.
--exit-test-skip-get-active-stream
    In NearbyAudioBindingSystem, bypasses the registry.GetActiveStream(key)
    call entirely (the check is placed BEFORE the existing GetActiveStream
    call). That call lazily constructs a LiveKit AudioStreamInternal whose
    constructor does a synchronous FFI request AND subscribes to
    FfiClient.AudioStreamEventReceived.
    Callbacks on that event are dispatched from livekit_ffi tokio worker
    threads, which is how those threads become attached to the IL2CPP managed
    runtime. If exit becomes consistently fast with this flag, the AudioStream
    subscription is the attachment trigger.
--exit-test-disconnect-rooms-on-quit
    Hooks Application.wantsToQuit. First invocation cancels the quit, runs
    DisconnectAsync on IslandRoom and VoiceChatRoom concurrently
    (10s cap), then re-issues Application.Quit(). The second pass observes
    disconnectsCompleted=true and lets Unity proceed with shutdown.
    An earlier attempt did this in Dispose() with .AsTask().Wait() and
    deadlocked: blocking the main thread prevented PlayerLoop ticks, so the
    UniTask continuations inside DisconnectInstruction.AwaitWithSuccess could
    not resume, and the 3s cancellation always fired before the disconnect
    completed. The wantsToQuit flow keeps the PlayerLoop active during the
    disconnect.
--exit-test-post-disconnect-delay-ms=N
    Inserts an await UniTask.Delay(N) between the room-disconnect completion
    and the final Application.Quit(). Tests the hypothesis that after
    DisconnectAsync returns, the FFI tokio runtime needs a brief wind-down
    window before its worker threads actually exit and detach from IL2CPP.
Code change shipped behind no flag
NearbyAudioSourceFactory.DisposeRoot() now explicitly stops and frees both
LIVE and LEGACY LivekitAudioSource instances before destroying the parent
container. Previously only pool-resident instances ran the Stop()+Free()
release path; live instances were torn down via Unity.Destroy() on the
parent container, which fires OnDestroy() but bypasses Stop()/Free().
The AudioSource then stays in Play state long enough for one or more
OnAudioFilterRead callbacks to cross into livekit_ffi via
AudioStream.ReadAudio, attaching the consuming thread to IL2CPP.
This change did not by itself eliminate the freeze (WinDbg dumps with the fix
applied showed the same 16 tokio workers attached), but it closes a real
leak that is consistent with the freeze pattern and is a strict improvement.
Experiment log
All times are wall-clock from EXIT click to process termination on a Windows
IL2CPP standalone build, same machine, same scene (Genesis Plaza, populated).
Single run per row unless otherwise noted — variance is high (see notes), so
treat individual rows as indicative, not as precise measurements.
Phase 1 — VoiceChatPlugin.InitializeAsync bisection
Goal: find out whether the offending attachment happens during init.
Row  Configuration
1    Decentraland.exe (baseline)
2    --exit-test-voice-init-stop=1
3    --exit-test-voice-init-stop=2
4    --exit-test-voice-init-stop=3
5    --exit-test-voice-init-stop=4
6    --exit-test-voice-init-stop=5
7    --exit-test-voice-init-stop=6
8    --exit-test-voice-init-stop=7
9    --exit-test-voice-init-stop=8
10   --exit-test-skip-nearby-voice-systems
Observation: stopping InitializeAsync at any stage (even stage 8, i.e. full
init complete) keeps exit fast (≤7 s). What makes it slow is what runs
after InitializeAsync: specifically, the NEARBY ECS systems being injected
into the world (row 10 confirms: skipping the whole NEARBY block → 3 s).
Phase 2 — Within InjectToWorld NEARBY block bisection
Goal: identify which specific NEARBY system causes the attachment. Stages
correspond to the order systems are registered in
VoiceChatPlugin.InjectToWorld:
  1 NearbyLivekitBridgeSystem
  2 NearbyAudibleRangeSystem
  3 NearbyAudioBindingSystem
  4 NearbyAudioPositionSystem
  5 NearbyAudioCleanupSystem
  6 NearbyVoiceChatNametagSystem
  7 NearbyVoiceChatDebugSystem
Configurations run:
  --exit-test-nearby-inject-stop=1
  --exit-test-nearby-inject-stop=2
  --exit-test-nearby-inject-stop=3
  --exit-test-nearby-inject-stop=4
  --exit-test-nearby-inject-stop=5
  --exit-test-nearby-inject-stop=6
  --exit-test-nearby-inject-stop=7
Observation (N=1): stage 1 fast, stage 2+ slow, which pointed at
NearbyAudibleRangeSystem as the trigger. Re-running stage 4 produced
30.63 s instead of an earlier 3.51 s, which forced a re-run of stages 1–3
with N=3 to control for variance (see next phase).
Phase 2b — N=3 re-confirmation of stages 1, 2, 3
Observation: stages 1 AND 2 are both consistently fast (3–7 s). The freeze
only appears starting at stage 3, and even then bimodally (2/3 slow, 1/3
fast). The previous N=1 reading that implicated NearbyAudibleRangeSystem
was an artifact of single-sample noise. The real culprit is the system
added at stage 3: NearbyAudioBindingSystem.
Inside NearbyAudioBindingSystem.CreateAndBindAudioSourcesToStreamers, for
every (walletId, sid) pair that's not yet bound, the system executes
registry.GetActiveStream(key) BEFORE any of the existing --exit-test-*
checks. That call delegates to room.AudioStreams.ActiveStream(key) which:
  1 makes a synchronous FFI NewAudioStreamRequest.
  2 constructs an AudioStreamInternal whose constructor subscribes to FfiClient.Instance.AudioStreamEventReceived.
  3 stores the AudioStream in the per-Room Streams dictionary.
The subscription in step 2 is the attachment surface: every time the FFI
publishes an audio frame event, the handler is invoked from a livekit_ffi
tokio worker thread. [MonoPInvokeCallback] attaches the calling native
thread to the IL2CPP managed runtime, and the runtime never detaches it again
unless that thread itself exits, which it only does when the tokio
runtime owning it is shut down.
The previous --exit-test-skip-audio-source-create flag was bypassing the
LivekitAudioSource.Create() / .Play() calls (the OnAudioFilterRead →
ReadAudio path), but the call sequence runs GetActiveStream before that
check, so the attachment was happening regardless of whether the flag was
set. This explains why exit was still slow (58.61 s) with that flag on:
we were never actually testing the hypothesis we thought we were.
The new --exit-test-skip-get-active-stream flag moves the bypass earlier
to skip GetActiveStream itself. Pending: 3-run validation.
Phase 3 — Targeted hypotheses
Configurations run:
  --exit-test-skip-audio-source-create (GetActiveStream still ran)
  --exit-test-disconnect-rooms-on-quit (broken sync-over-async impl in Dispose)
  --exit-test-disconnect-rooms-on-quit (rewritten via Application.wantsToQuit), run 1
  --exit-test-disconnect-rooms-on-quit --exit-test-post-disconnect-delay-ms=2000
Observation: with the rewritten disconnect flow, DisconnectAsync always
completes inside its 10 s cap, but the total exit time varies wildly across
runs (4.7 s to 59 s). Adding a 2 s wind-down delay after disconnect does
not deterministically help (still > 40 s). Conclusion: simply disconnecting
the room + waiting is not the right fix on its own. The most likely reason
is that worker threads already attached to IL2CPP via in-flight
AudioStreamEventReceived callbacks remain attached until the tokio
runtime itself shuts down, which appears not to happen reliably even after
DisconnectAsync returns and audioStreams.Free() unsubscribes the
handlers.
Current hypothesis under test
The current best hypothesis is that the attachment surface is the
FfiClient.AudioStreamEventReceived subscription created by each
AudioStreamInternal instance, which is constructed lazily via
registry.GetActiveStream(key) inside NearbyAudioBindingSystem.
The next test will run --exit-test-skip-get-active-stream for 3 runs in
the same conditions used for the phase-2b table. Expected readings:
  … GetActiveStream confirmed as the trigger. The real
    fix is to guarantee every AudioStream is disposed before
    Application.Quit() is allowed to proceed (Room.DisconnectAsync already
    calls audioStreams.Free(), so this likely means we need a deterministic
    point where we await the FFI's own shutdown signal, not just the local
    unsubscribe).
  … GetActiveStream is part of the picture but there are
    additional attachment paths (other FfiClient.* events, Room events,
    participant events) that need to be covered too.
  … surface entirely. The next things to look at would be the other
    FfiClient event subscriptions and the room/track callbacks that
    NearbyAudioStreamsRegistry itself wires up.
Mitigation path (room disconnect + delay) remains a candidate fallback but
is not deterministic on its own: the 40+ s exit time with a 2 s delay
indicates the FFI tokio runtime is not winding down promptly even after
the room is disconnected.
Files touched
  Assets/DCL/Infrastructure/Global/AppArgs/AppArgsFlags.cs
  Assets/DCL/PluginSystem/Global/VoiceChatPlugin.cs
  Assets/DCL/VoiceChat/NearbyVoiceChat/Systems/NearbyAudioBindingSystem.cs
  Assets/DCL/VoiceChat/NearbyVoiceChat/Core/NearbyAudioSourceFactory.cs
How to test
The freeze does NOT reproduce in the Editor. Build a Windows IL2CPP standalone
on this branch, run with the flag combination relevant to the hypothesis under
test, and measure wall-clock time from clicking EXIT to process termination.
Player.log lines tagged EXIT TEST: trace each bisection branch.
What's next
Once we identify the deterministic mitigation (likely: disconnect rooms +
small wind-down delay, OR an explicit FFI runtime shutdown call), the diag
flags get stripped and only the production fix remains.