allocate: thread-local allocations. #212

Merged
froggey merged 11 commits into froggey:master from iskamag:tlabka
Jun 16, 2026

Conversation

@iskamag

@iskamag iskamag commented May 8, 2026

Contributor

No description provided.

@iskamag iskamag marked this pull request as ready for review May 8, 2026 09:10
@iskamag iskamag marked this pull request as draft May 14, 2026 22:07
@iskamag iskamag marked this pull request as ready for review May 14, 2026 22:12
@froggey

froggey commented May 17, 2026

Owner

Due to cross-CPU migration, I don't think this is SMP-safe at the moment. Each CPU has its own TLAB bump/limit, but nothing here ensures that the limit being compared against belongs to the right CPU if a migration occurs between the limit load and the bump update.

  1. %do-allocate-from-general-area called
  2. Load fs:tlab-limit
  3. Migration occurs. This switches CPU and changes fs, which makes the already-loaded limit stale
  4. Fetch/inc fs:tlab-base
  5. Compare new bump pointer against the stale limit (this is effectively another cpu's limit)

Same issue on arm64.
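
The interleaving above can be replayed deterministically. The following is a minimal Python model, not the kernel code: the `Cpu` class, `racy_allocate`, and all addresses are made up for illustration, and the migration is forced by hand at step 3.

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    bump: int   # next free address in this CPU's TLAB (hypothetical field)
    limit: int  # end of this CPU's TLAB (hypothetical field)

def racy_allocate(cpus, start_cpu, migrate_to, size):
    """Replay steps 2-5 with a migration forced between the limit load and the bump update."""
    limit = cpus[start_cpu].limit        # step 2: load the limit through the current CPU
    current = migrate_to                 # step 3: the thread now runs on another CPU
    old = cpus[current].bump             # step 4: fetch/inc on the *new* CPU's bump pointer
    cpus[current].bump = old + size
    fits = cpus[current].bump <= limit   # step 5: compare against the stale limit
    return old, fits

# CPU 0 has plenty of room; CPU 1 has only 0x10 bytes left in its TLAB.
cpus = [Cpu(bump=0x1000, limit=0x9000), Cpu(bump=0x20F0, limit=0x2100)]
addr, fits = racy_allocate(cpus, start_cpu=0, migrate_to=1, size=0x40)

# The check passed, yet the object runs past the end of CPU 1's TLAB.
overran = addr + 0x40 > cpus[1].limit
```

In this model the fast path reports success even though the allocation extends beyond the TLAB it was carved from, which is the failure mode the steps describe.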

Possible solutions:

  1. Put the TLAB in the thread object instead of the CPU. This means we have more TLABs and memory pressure increases, but then they are truly thread-local. Keeping the allocation meters accurate might be a pain here. The advantage is that no synchronization is needed at all on the hot path, because a thread's own thread object is inherently stable.
  2. Disable interrupts on the hot path? Ugly and scary: disabling interrupts is dangerous, and it gets very tricky when the path also needs to write arbitrary heap memory.
  3. Keep the TLAB on the CPU, but go back to using atomics and load the CPU struct once at the start of %do-allocate-from-general-area, so that the same struct is used the whole way through.
  4. Something else? Secret fourth thing?
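
As a rough illustration of option 3, the sketch below reads the CPU struct once and then takes both the bump and the limit from that same struct. It is a Python model under stated assumptions: `CpuTlab`, `fetch_add`, and `current_cpu` are hypothetical names, and a lock stands in for a hardware atomic fetch-and-add.

```python
import threading

class CpuTlab:
    """Hypothetical per-CPU TLAB; the lock stands in for an atomic fetch-and-add."""
    def __init__(self, bump, limit):
        self.bump = bump
        self.limit = limit
        self._lock = threading.Lock()

    def fetch_add(self, size):
        with self._lock:
            old = self.bump
            self.bump = old + size
            return old

def allocate(current_cpu, size):
    cpu = current_cpu()            # read the CPU struct exactly once
    old = cpu.fetch_add(size)      # atomic bump on that struct
    if old + size <= cpu.limit:    # limit comes from the same struct, so it cannot be stale
        return old
    return None                    # fast path failed; a real allocator would refill here

cpu0 = CpuTlab(0x1000, 0x2000)
cpu1 = CpuTlab(0x8000, 0x8010)
calls = []

def current_cpu():
    # The first call reports CPU 0; any later call would report CPU 1 (a simulated migration).
    calls.append(1)
    return cpu0 if len(calls) == 1 else cpu1

addr = allocate(current_cpu, 0x40)
```

The atomic is needed in this variant because, after a migration, the thread may be bumping a TLAB that another thread on the original CPU is also using.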

Aside from this, I really like the approach. It's surprisingly non-invasive.

@froggey

froggey commented May 17, 2026

Owner

arm64 build completes successfully with my changes btw

@iskamag

iskamag commented May 17, 2026

Contributor Author

TSX would've fixed this.

Anyway, I think option 1 is better. Since Claimore would be thread-local...
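
A minimal sketch of what option 1 amounts to, assuming the bump and limit simply become fields of the thread object. The `Thread` class and its field names are hypothetical, not the names used in the merged commits.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tlab_bump: int   # next free address in this thread's own TLAB
    tlab_limit: int  # end of this thread's own TLAB
    cpu: int = 0     # CPU the thread currently runs on; irrelevant to allocation

def allocate(thread, size):
    # No atomics and no CPU lookup: only this thread ever touches these fields.
    new_bump = thread.tlab_bump + size
    if new_bump > thread.tlab_limit:
        return None  # TLAB exhausted; a real allocator would take the slow path
    addr = thread.tlab_bump
    thread.tlab_bump = new_bump
    return addr

t = Thread(tlab_bump=0x1000, tlab_limit=0x1080)
first = allocate(t, 0x40)
t.cpu = 1                   # a migration changes where the thread runs, not its TLAB
second = allocate(t, 0x40)
third = allocate(t, 0x40)   # does not fit in the remaining space
```

Because the bump and limit travel with the thread, a migration between any two steps leaves them consistent with each other.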

@froggey

froggey commented Jun 15, 2026

Owner

Changes look good, but we still have CPU-local accounting (bytes-consed, cons-allocation-count, cons-fast-path-hits, general-allocation-count, general-fast-path-hits). Are they worth keeping, and do they have the same race condition as allocation did?

@iskamag

iskamag commented Jun 15, 2026

Contributor Author

Technically there's a race, but on x86 the counters should still add up to the same total with a tiny skew between CPUs, while ARM would lose counts. However, there is no real data corruption.

Whether to keep those counters is up to you.

@iskamag

iskamag commented Jun 15, 2026

Contributor Author

i.e. x86 could have 7+3 allocs instead of 6+4
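
One way to read the difference is modeled below. It assumes that x86 bumps each counter with a single read-modify-write instruction, which an interrupt cannot split, and that arm64 uses a separate load, add, and store. Neither detail is confirmed in this thread, and the function names are made up for illustration.

```python
def single_instruction_counts(events, ncpus=2):
    """Each event is (cpu_that_allocated, cpu_whose_counter_was_bumped)."""
    counters = [0] * ncpus
    for _alloc_cpu, counter_cpu in events:
        counters[counter_cpu] += 1   # indivisible: it lands on exactly one CPU's counter
    return counters

# 6 allocations on CPU 0 and 4 on CPU 1, but one thread migrated to CPU 0
# between allocating and bumping the counter.
events = [(0, 0)] * 6 + [(1, 1)] * 3 + [(1, 0)]
skewed = single_instruction_counts(events)

def split_sequence_lost_update():
    """Replay a load/add/store increment interrupted between the load and the store."""
    counter = 0
    stale = counter          # thread A loads the counter, then is preempted
    counter += 1             # thread B performs a complete increment meanwhile
    counter = stale + 1      # thread A resumes and stores its stale value plus one
    return counter           # two increments happened, one survived

lost = split_sequence_lost_update()
```

In the first case the per-CPU split is off by one but the total is intact; in the second an increment disappears entirely.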

@froggey

froggey commented Jun 16, 2026

Owner

I don't think the GC relies on these counters, so I'm happy to merge this. Maybe we can revisit accuracy at a later time.

@froggey froggey merged commit bd1fa49 into froggey:master Jun 16, 2026
1 check passed

2 participants