From cc089f03f5e18db4a228c6094d7734304dabee3a Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Thu, 12 Dec 2024 17:38:49 +0000 Subject: [PATCH 01/13] [sysvabi64] Add chapter on Thread Local Storage The thread local storage chapter contains: * A description of Thread Local Storage based on addenda32 * The key design decisions of AArch64 TLS such as tls variant, tls dialect, TCB size. * The ABI required code sequence for TLSDESC that must be emitted exactly, as GNU ld requires it to be. * Sequences for the different code-models. * Relaxations for GD->IE, GD->LE and IE->LE. * Synchronization requirements for Lazy TLSDESC. With advice not to support it due to overhead of synchronization. --- sysvabi64/sysvabi64-tls.svg | 283 +++++++++++++++++ sysvabi64/sysvabi64.rst | 607 +++++++++++++++++++++++++++++++++++- 2 files changed, 889 insertions(+), 1 deletion(-) create mode 100644 sysvabi64/sysvabi64-tls.svg diff --git a/sysvabi64/sysvabi64-tls.svg b/sysvabi64/sysvabi64-tls.svg new file mode 100644 index 00000000..d96ff878 --- /dev/null +++ b/sysvabi64/sysvabi64-tls.svg @@ -0,0 +1,283 @@ [283 lines of SVG markup not reproduced. Figure: SystemV AArch64 TLS addressing architecture, showing tp, the TCB, the TLS blocks, the dtv entries for TLS index 1 to N, and a DSO's text segment, data segment and GOT, where the GOT stores the component index N and the offset in TLS of an imported variable.] \ No newline at end of file diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 47762cf3..34ed4ecd 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -22,7 +22,14 @@ .. _SCO-ELF: http://www.sco.com/developers/gabi .. _SYM-VER: http://www.akkadia.org/drepper/symbol-versioning .. _SYSVABI: https://github.com/ARM-software/abi-aa/releases -.. _TLSDESC: http://www.fsfla.org/~lxoliva/writeups/TLS/paper-lk2006.pdf +.. _ELFTLS: https://www.uclibc.org/docs/tls.pdf +.. _TLSDESC: http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt + +.. role:: c(code) + :language: c + +.. 
role:: cpp(code) + :language: cpp System V ABI for the Arm® 64-bit Architecture (AArch64) ******************************************************* @@ -222,6 +229,7 @@ Change History | | | - Update ifunc resolver content to include | | | | information on AT_HWCAP3,4 fields. | | | | - Document Function Multi-Versioning. | + | | | - Added chapter on Thread Local Storage (TLS) | +------------+------------------------------+-------------------------------------------------------+ References @@ -626,6 +634,8 @@ syntax is of the form ``#::`` +-----------------------+-------------+---------------------------------------+ | ``gottprel`` | ``adrp`` | R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 | +-----------------------+-------------+---------------------------------------+ + | ``gottprel`` | ``ldr`` | R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 | + +-----------------------+-------------+---------------------------------------+ | ``gottprel_lo12`` | ``ldr`` | R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC | +-----------------------+-------------+---------------------------------------+ | ``tprel`` | ``add`` | R_AARCH64_TLSLE_ADD_TPREL_LO12 | @@ -2064,6 +2074,601 @@ See `MemTagABIELF64`_ and `PAuthABIELF64`_ for details of reserved tags. PageBreak oneColumn +Thread Local Storage +==================== + +Introduction to thread local storage +------------------------------------ + +Thread Local Storage (TLS) is a class of own data (static storage) that – +like the stack – is instanced once for each thread of execution. It fits +into the abstract storage hierarchy as follows. + +* (Most global) Program-own data (static and extern variables, instanced + once per program/process). + +* Thread local storage (variables instanced once per thread, shared between + all accessing function activations). + +* (Most local) Automatic data (stack variables, instanced once per function + activation, per thread). + +* How to denote TLS in source programs. 
+ + C++11 and C11 use :c:`thread_local T t...`; A GCC extension uses + :c:`__thread T t...`; this is Q-o-I. + +* How to represent the initializing images of TLS in object files, and how + to define symbols in TLS. + + The rules for ELF are well established (see ``SHF_TLS``, ``STT_TLS`` in + SCO-ELF_). + +* How a loader or run-time system creates instances of TLS per-thread at + execution time. + + This is part of ABI for the platform or execution environment. + +* How to relocate, statically and dynamically, with respect to symbols + defined in TLS (for details of relocations relevant to AArch64 Linux see + AAELF64_). + +* How code must address variables allocated in TLS (the subject of the + notes below). + +It is the last two bullet points that are the subject of this ABI. + +Introduction to TLS addressing +------------------------------ + +In the most general form, a program is constructed dynamically from an +executable and a number of shared libraries. Each component, +(executable or shared library) can be mapped into multiple +processes. Additionally a shared library can be loaded dynamically by +a program, rather than being part of the initial process image +constructed when the program is first loaded. + +For the purpose of addressing TLS, components, referred to as modules, +of an application are identified using indexes. The module index for +the executable is always 1, but the module indexes for shared +libraries are allocated at process start time, or when a shared +library is loaded dynamically via dlopen. A shared library may have a +different module index in two different processes so its per-thread +module index must be part of its program-own state (or be queried +dynamically). The run-time system is responsible for maintaining a +per-thread vector of pointers to allocated TLS regions indexed by +these module indexes. 
+ +There is a system resource called the Thread Pointer (TP) that +typically, points to a Thread Control Block (TCB) for the currently +executing thread which, in turn, points to the Dynamic Thread Vector +(DTV) for that thread. + +.. raw:: pdf + + PageBreak oneColumn + +SystemV AArch64 TLS addressing architecture +------------------------------------------- + +The figure below depicts the fundamental components of the TLS +addressing architecture used by SystemV for AArch64. + +.. _SystemV AArch64 TLS addressing architecture: + +.. figure:: sysvabi64-tls.svg + + SystemV AArch64 TLS addressing architecture + +The TLS data for a module is called the TLS Block. + +The thread pointer points directly to the Thread Control Block (TCB). + +The size of the TCB is 16-bytes, where the first 8 bytes contain the +pointer to the Dynamic Thread Vector (DTV), and the other 8 bytes are +reserved for the implementation. + +Following the TCB and any required alignment padding, the TLS Blocks +of the modules loaded at process start form the static TLS Block. The +memory for the TLS Block is allocated at process start time. + +The TLS Blocks for modules loaded dynamically via dlopen are known as +dynamic TLS. + +Index 0 of the Dynamic Thread Vector DTV[0] typically contains a +generation counter which can be used to update or reallocate the DTV +when dynamic modules are opened or closed. + +Index N, where N > 0, of the Dynamic Thread Vector DTV[N] points to +the TLS block for module N. + +To calculate the address of a TLS variable in any given module, static +or dynamic, the expression ``TP[0][Module id][offset in module]`` can +be used. The function ``__tls_get_address(module_id, offset)`` returns +the result of this calculation. + +The calculation above is the most general and it can be applied to both static and dynamic TLS. There are four defined models of accessing TLS that trade off generality for performance. In order of descending generality: + + 1. 
General Dynamic, can be used anywhere. + + 2. Local Dynamic, can be used anywhere where the definition of the + TLS variable and the access are from the same module. + + 3. Initial Exec, can be used for TLS variables defined in the + static TLS block. + + 4. Local Exec, can be used in the executable for TLS variables + defined in the executables static TLS block. + +SystemV AArch64 TLS addressing +------------------------------ + +AArch64 TLS SystemV design choices + +* AArch64 uses variant 1 TLS as described in ELFTLS_. + +* The thread pointer (TP) is always accessible via the ``TPIDR_EL0`` + system register. This can be accessed via inlining a ``mrs`` + instruction to read the thread pointer. + +* The compiler can generate code that supports a TLS block size of 4 + KiB, 16 MiB, 4GiB or 16EiB, depending on the addressing mode. The + default is 16 MiB for all addressing modes. + +The static and dynamic linker must agree on the size of the padding +between the TCB and the executables TLS Block. Using ``TCB`` as the +size of the TCB (16 bytes), ``PAD`` as the size of the padding bytes, +and ``PT_TLS`` as the program header with type PT_TLS. ``PAD`` must be +the smallest positive integer that satisfies the following congruence: + +``TP + TCB + PAD ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)`` + +Given that ``TP ≡ 0 (modulo PT_TLS.p_align)``. An expression +for `PAD` is ``PAD = (PT_TLS.p_vaddr - TCB) mod PT_TLS.p_align``. + +A significant number of dynamic linkers use a different calculation +that requires ``PT_TLS.p_vaddr ≡ 0 (modulo PT_TLS.p_align)`` to +correctly align the executables TLS block. For maximum compatibility, +static linkers and any linker scripts including TLS, are recommended +to align the TLS block so that `PT_TLS.p_vaddr ≡ 0 (modulo p_align)`. + +There are two dialects of TLS supported by the relocations defined in +AAELF64_, the traditional dialect described by ELFTLS_ and the +descriptor dialect described by TLSDESC_. 
This document describes only +the descriptor dialect as this is the default dialect for GCC and the +only dialect supported by clang. + +Code sequences for accessing TLS variables +------------------------------------------ + +The code sequences below assume the default TLS block size of 16 MiB. +This permits the Local Exec model to use a pair of add instructions +with a combined 24-bit immediate field. Larger TLS sizes can be +supported by using a ``movz`` and one or more ``movk`` instructions to +construct an offset from the thread-pointer in a register. + +A code model may use a sequence from a less restrictive code model. + +In the code-sequences below: + +* ``tp`` is a core register containing the thread-pointer. + +* ``gp`` is a core register containing the base of the GOT. + +* ``xn`` is an arbitrary core register. Numbered core registers such + as ``x0`` and ``x1`` refer to the specific core register. + +* ``.tlsdesccall`` is an assembler directive that adds a + ``R_AARCH64_TLSDESC_CALL`` relocation to the next instruction. + +* ``.tlsdescldr`` is an assembler directive that adds a + ``R_AARCH64_TLSDESC_LDR`` relocation to the next instruction. + +* ``.tlsdescadd`` is an assembler directive that adds a + ``R_AARCH64_TLSDESC_ADD`` relocation to the next instruction. + +General Dynamic +^^^^^^^^^^^^^^^ + +General Dynamic is the most general form of accessing TLS. It supports +static and dynamic TLS. + +To permit static linker relaxation, the TLSDESC code sequences must be +emitted exactly as specified, with no other instruction breaking up +the sequence, and with exactly the same registers used. + +The code sequences below return the offset of the TLS variable from +``tp`` in ``x0``. Getting the address of the TLS variable requires +additional code to add ``x0`` to ``tp``; this is not part +of the ABI required TLSDESC code sequence. + +Small Code Model + +.. 
code-block:: asm + + adrp x0, :tlsdesc:var // R_AARCH64_TLSDESC_ADR_PAGE21 var + ldr x1, [x0, #:tlsdesc_lo12:var] // R_AARCH64_TLSDESC_LD64_LO12 var + add x0, x0, #:tlsdesc_lo12:var // R_AARCH64_TLSDESC_ADD_LO12 var + .tlsdesccall var + blr x1 // R_AARCH64_TLSDESC_CALL var + // offset of var from tp in x0 + +Tiny Code Model + +.. code-block:: asm + + ldr x1, :tlsdesc:var // R_AARCH64_TLSDESC_LD_PREL19 var + adr x0, :tlsdesc:var // R_AARCH64_TLSDESC_ADR_PREL21 var + .tlsdesccall var + blr x1 // R_AARCH64_TLSDESC_CALL var + // offset of var from tp in x0 + +Large Code Model + +.. code-block:: asm + + movz x0, #:tlsdesc_off_g1:var // R_AARCH64_TLSDESC_OFF_G1 var + movk x0, #:tlsdesc_off_g0_nc:var // R_AARCH64_TLSDESC_OFF_G0_NC var + .tlsdescldr var + ldr x1, [gp, x0] // R_AARCH64_TLSDESC_LDR var + .tlsdescadd var + add x0, gp, x0 // R_AARCH64_TLSDESC_ADD var + .tlsdesccall var + blr x1 // R_AARCH64_TLSDESC_CALL var + // offset of var from tp in x0 + +Local Dynamic +^^^^^^^^^^^^^ + +Local Dynamic is a special case of general dynamic where the compiler +knows that the TLS variable is defined in the same module as the code +that is accessing the variable. In this case the offset of the TLS +variable from the start of the module's TLS block is a static link +time constant. Instead of dynamically calculating the offset of the +TLS variable from the thread-pointer, the offset of the module's TLS +block from the thread-pointer is calculated, then the offset of the +TLS variable within that block is added. This is more efficient than +general dynamic when more than one TLS variable from the same module +is accessed from the same function, but less efficient when accessing +a single TLS variable. + +The code sequence for local dynamic is the same as general dynamic and +like general dynamic must be emitted exactly as specified. There are no +specific relocations for Local Dynamic using the descriptor dialect. 
A +special symbol ``_TLS_MODULE_BASE_`` is used to get a tlsdesccall to +return the offset of the module's TLS block from the thread pointer. + +Code-generators are not required to implement local dynamic and can +emit general dynamic in its place. + +Initial Exec +^^^^^^^^^^^^ + +Initial Exec can be used for static TLS. The location of the module's +TLS block and the offset of the TLS variable within that block are +run-time constants. The dynamic-loader computes the offset from the +thread-pointer and places it in a GOT entry. The GOT entry is +relocated by dynamic relocation ``R_AARCH64_TLS_TPREL64``. + +A shared-library that contains Initial Exec TLS must have the +``DF_STATIC_TLS`` dynamic tag set. An attempt to load a shared library +with ``DF_STATIC_TLS`` via ``dlopen`` will be rejected. + +Small Code model + +The static linker is permitted to relax the instructions below to +Local Exec individually using the relocation directive. The +instructions do not have to be contiguous. + +.. code-block:: asm + + adrp xn, :gottprel: var // R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 var + ldr xn, [xn, #:gottprel_lo12:var] // R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC var + // offset of var from tp in xn + +Tiny Code model + +.. code-block:: asm + + ldr xn, :gottprel:var // R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 var + // offset of var from tp in xn + +Large Code model + +.. code-block:: asm + + movz xn, #:gottprel_g1:var // R_AARCH64_TLSIE_MOVW_GOTTPREL_G1 var + movk xn, #:gottprel_g0_nc:var // R_AARCH64_TLSIE_MOVW_GOTTPREL_G0_NC var + ldr xn, [gp, xn] + // offset of var from tp in xn + +Local Exec +^^^^^^^^^^ + +Local Exec is used for accesses to the executable's TLS block. The +executable always has the TLS module index of 1 so the offsets of the +TLS variables from the thread pointer are static link time +constants. The code sequences are the same for all code models. + +The instruction sequences below are not ABI. 
Using the instructions +and relocations below increases the chances of static linkers applying +the relaxations in (AAELF64_) when the size of the executables TLS +block is smaller than 16 KiB. + +.. code-block:: asm + + add xn, xn, :tprel_hi12:var, lsl #12 // R_AARCH64_TLSLE_ADD_TPREL_HI12 var + add xn, xn, :tprel_lo12_nc:var // R_AARCH64_TLSLE_ADD_TPREL_LO12_NC var + // offset of var from tp in xn + +Optimization to load a 64-bit var directly into a core register. + +.. code-block:: asm + + add xn, tp, :tprel_hi12:var, lsl #12 // R_AARCH64_TLSLE_ADD_TPREL_HI12 var + ldr xn, [xn, #:tprel_lo12_nc:var] // R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC var + +Static link time TLS Relaxations +-------------------------------- + +The static linker can relax a more general TLS model to a more +constrained TLS model when the TLS variables meet the requirements for +using the constrained model. + +The Relaxations described below can be automatically applied to code +sequences in the executable. Relaxing from general dynamic will +prevent a shared library from being opened at runtime via dlopen so +should not be applied automatically. + +The static linker should use the relocation directives to distinguish +between code models. + +General Dynamic to Initial Exec +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This relaxation can be performed when the TLS variable is defined in a +module that is part of static TLS. + +Small Code Model + +.. code-block:: asm + + adrp x0, :gottprel:var // R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 var + ldr x0, [x0, :gottprel_lo12:var] // R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC var + nop + nop + // offset of var from tp in x0 + +Tiny Code Model + +.. code-block:: asm + + ldr x0, :gottprel:var // R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 var + nop + nop + // offset of var from tp in x0 + +Large Code Model + +.. 
code-block:: asm + + movz x0, #:gottprel_g1:var // R_AARCH64_TLSIE_MOVW_GOTTPREL_G1 var + movk x0, #:gottprel_g0_nc:var // R_AARCH64_TLSIE_MOVW_GOTTPREL_G0_NC var + ldr x0, [gp, x0] + nop + // offset of var from tp in x0 + +General Dynamic to Local Exec +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This relaxation can be performed when the TLS variable is defined in +the executable. + +Small Code Model + +.. code-block:: asm + + movz x0, :tprel_g1:var // R_AARCH64_TLSLE_MOVW_TPREL_G1 var + movk x0, :tprel_g0_nc:var // R_AARCH64_TLSLE_MOVW_TPREL_G0_NC var + nop + nop + // offset of var from tp in x0 + +Tiny Code Model + +.. code-block:: asm + + movz x0, :tprel_g1:var // R_AARCH64_TLSLE_MOVW_TPREL_G1 var + movk x0, :tprel_g0_nc:var // R_AARCH64_TLSLE_MOVW_TPREL_G0_NC var + nop + // offset of var from tp in x0 + +Large Code Model + +.. code-block:: asm + + movz x0, :tprel_g1:var // R_AARCH64_TLSLE_MOVW_TPREL_G1 var + movk x0, :tprel_g0_nc:var // R_AARCH64_TLSLE_MOVW_TPREL_G0_NC var + nop + nop + nop + // offset of var from tp in x0 + +Initial Exec to Local Exec +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This relaxation is only defined for the Small Code model. It can be +performed when the TLS variable is defined in the executable. The +static linker is permitted to relax each instruction individually, +using the relocation directive to identify the instruction. The +destination register must be preserved. + +.. code-block:: asm + + movz xn, :tprel_g1:var // R_AARCH64_TLSLE_MOVW_TPREL_G1 var + movk xn, :tprel_g0_nc:var // R_AARCH64_TLSLE_MOVW_TPREL_G0_NC var + +TLS Descriptors +--------------- + +The TLS Descriptor dialect permits a dynamic linker to use the +location and properties of the TLS symbol to select an optimal +resolver function. + +The static relocations with a prefix of ``R_AARCH64_TLSDESC_`` +targeting TLS symbol ``var`` instruct the static linker to create a +TLS Descriptor for ``var``. The TLS Descriptor for a variable is +stored in a pair of consecutive GOT entries, N and N + 1. 
The GOT +entry for N has a dynamic ``R_AARCH64_TLSDESC`` relocation targeting +the TLS symbol for ``var``. + +When resolving the ``R_AARCH64_TLSDESC`` relocation, the dynamic +loader places the address of the chosen resolver function in the first +GOT entry, and the argument for the chosen resolver function in the +second GOT entry. + +The AArch64 C and assembler examples are adapted from the AArch32 +TLSDESC_ paper. The C code below represents the TLS Descriptor. + +.. code-block:: c + + // Argument passed to TLS resolver functions. + struct tlsdesc + { + ptrdiff_t (*resolver)(struct tlsdesc *); + union + { + void *pointer; + long value; + } argument; + }; + +TLS Resolver Functions +---------------------- + +The TLS resolver functions are not standardized by this ABI as they +are internal to the dynamic linker. Programs must not directly refer +to TLS resolver functions. + +Calling Convention +^^^^^^^^^^^^^^^^^^ + +TLS resolver functions have one argument, the address of the TLS +descriptor, passed in ``x0``, they return the offset of the variable +from the thread pointer in ``x0``. + +TLS resolver functions must save all registers that they modify with +the exception of ``x0``, ``x1``, ``x30`` and the processor flags. + +Example Resolver Functions +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +These examples are for illustrative purposes only. Due to the +restrictions on calling convention, the resolver routines must be +written in assembly language. + +Static TLS Specialization + +When the TLS variable is in the static TLS block, the offset from the +thread pointer is fixed at runtime. The dynamic loader can calculate +the offset and place it in the TLS descriptor. All the static TLS +resolver function needs to do is extract the offset and return it. + +.. code-block:: asm + + _dl_tlsdesc_return: + // x0 contains pointer to struct tlsdesc. 
+ // tlsdesc.argument.value contains offset of variable from TP + ldr x0, [x0, #8] + ret + +Dynamic TLS Specialization + +When the TLS variable is defined in dynamic TLS the address of the TLS +variable must be calculated by the resolver function using +``__tls_get_addr``. The resolver function returns the offset from the +thread pointer by subtracting the address of the thread pointer from +the address of the TLS variable. In practice an implementation of the +dynamic TLS resolver contains many platform specific details outside +of the scope of the ABI. An example of how a dynamic resolver might be +implemented can be found in the Dynamic Specialization section of +TLSDESC_. + +Undefined Weak Symbols + +An undefined weak symbol has the value 0. As the resolver function +returns an offset from the thread pointer, it returns the negated +thread pointer value, which cancels to 0 when added to the thread +pointer. + +.. code-block:: asm + + __dl_tlsdesc_undefweak: + mrs x0, tpidr_el0 + neg x0, x0 + ret + +Lazy resolution of R_AARCH64_TLSDESC +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The TLSDESC_ paper describes an optional mechanism to resolve TLSDESC +calls lazily. Lazy resolution for TLSDESC resolver functions is not +recommended on AArch64. Additional synchronization is required for +each TLSDESC call, which has a significant effect on performance. The +text below describes the additional synchronization that is +needed. + +Instead of fully resolving the ``R_AARCH64_TLSDESC`` relocation at +module load time, a lazy resolver function runs on the first TLSDESC +call. The lazy resolver updates the TLS Descriptor with the actual +resolver function and the parameter to the actual resolver +function. In a multi-threaded program when lazy TLS is in use, the +resolver functions must ensure that the write to the parameter in the +TLS descriptor has completed before reading it. + +.. 
code-block:: asm + + // Code to obtain the offset of var from thread-pointer. + // Loads the address of the resolver function into x1. + // Places the address of the TLS Descriptor into x0. + adrp x0, :tlsdesc:var + ldr x1, [x0, #:tlsdesc_lo12:var] + add x0, x0, #:tlsdesc_lo12:var + .tlsdesccall var + blr x1 // _dl_tlsdesc_return + + // Resolver function + _dl_tlsdesc_return: + // Load the parameter from the TLS descriptor. Without + // synchronization this load can read an old value prior + // to the lazy resolver's update to the descriptor completing. + ldr x0, [x0, #8] + ret + +The recommended way to ensure synchronization between the lazy +resolver update of the TLS Descriptor and the actual resolver function +accessing the TLS Descriptor is: + +* The TLS lazy resolver function uses a store release when updating + the address of the resolver function in the TLS Descriptor. + +* The actual entry function uses a load acquire on the address of the + resolver function, with a destination register of xzr. + +Referring to the example above, the code for the resolver function becomes: + +.. code-block:: asm + + // Resolver function + _dl_tlsdesc_return: + // Guaranteed to complete after the lazy resolver's store release + // of the address in [x0]. + ldar xzr, [x0] + // Access the parameter. + ldr x0, [x0, #8] + ret + Libraries ========= From 78ccdb37c12734367971d64e9192298e442d0195 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 21 Feb 2025 14:33:03 +0000 Subject: [PATCH 02/13] [sysvabi64] Review comments from Claudio * Edits to split up the bullet points in How to denote TLS in source. * Changed program-own state to process-state as the thread-id may not be stored separately from the programs data. * Removed typically from some of the descriptions as the typically will almost always be the case for a sysvabi platform. * Linked alignment padding to the definition. * Provided a bit more information about generation counters. 
--- sysvabi64/sysvabi64.rst | 40 ++++++++++++++++++++++++++-------------- 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 34ed4ecd..dd7f28ba 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2093,6 +2093,9 @@ into the abstract storage hierarchy as follows. * (Most local) Automatic data (stack variables, instanced once per function activation, per thread). +Rules governing thread local storage on AArch64 +----------------------------------------------- + * How to denote TLS in source programs. C++11 and C11 use :c:`thread_local T t...`; A GCC extension uses @@ -2109,6 +2112,8 @@ into the abstract storage hierarchy as follows. This is part of ABI for the platform or execution environment. +This document and AA_ELF64_ are concerned with: + * How to relocate, statically and dynamically, with respect to symbols defined in TLS (for details of relocations relevant to AArch64 Linux see AAELF64_). @@ -2116,8 +2121,6 @@ into the abstract storage hierarchy as follows. * How code must address variables allocated in TLS (the subject of the notes below). -It is the last two bullet points that are the subject of this ABI. - Introduction to TLS addressing ------------------------------ @@ -2134,13 +2137,13 @@ the executable is always 1, but the module indexes for shared libraries are allocated at process start time, or when a shared library is loaded dynamically via dlopen. A shared library may have a different module index in two different processes so its per-thread -module index must be part of its program-own state (or be queried +module index must be part of its process state (or be queried dynamically). The run-time system is responsible for maintaining a per-thread vector of pointers to allocated TLS regions indexed by these module indexes. 
There is a system resource called the Thread Pointer (TP) that -typically, points to a Thread Control Block (TCB) for the currently +points to a Thread Control Block (TCB) for the currently executing thread which, in turn, points to the Dynamic Thread Vector (DTV) for that thread. @@ -2168,26 +2171,34 @@ The size of the TCB is 16-bytes, where the first 8 bytes contain the pointer to the Dynamic Thread Vector (DTV), and the other 8 bytes are reserved for the implementation. -Following the TCB and any required alignment padding, the TLS Blocks -of the modules loaded at process start form the static TLS Block. The -memory for the TLS Block is allocated at process start time. +Following the TCB and any required alignment padding (defined in +`SystemV AArch64 TLS addressing`_), the TLS Blocks of the modules +loaded at process start form the static TLS Block. The memory for the +TLS Block is allocated at process start time. The TLS Blocks for modules loaded dynamically via dlopen are known as dynamic TLS. -Index 0 of the Dynamic Thread Vector DTV[0] typically contains a -generation counter which can be used to update or reallocate the DTV -when dynamic modules are opened or closed. +Index N, where N > 0, of the Dynamic Thread Vector DTV[N] is a pointer +to the TLS block for module N. -Index N, where N > 0, of the Dynamic Thread Vector DTV[N] points to -the TLS block for module N. +Index 0 of the Dynamic Thread Vector DTV[0] is reserved for use by the +platform. It typically contains the thread's generation counter which +can be used to update or reallocate the DTV when TLS variables in +dynamic modules loaded by ``dlopen`` are first used. When a dynamic +TLS variable is accessed the thread's generation count is compared +with the global generation count which can be used to trigger updates +of the DTV. The details are platform specific. 
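Review note: the TCB and DTV layout described above can be modeled in a few lines of C. This is only a sketch under stated assumptions. The type and function names (``dtv_slot``, ``tcb``, ``tls_address``) are hypothetical and not part of the ABI, the generation counter is modeled as a plain 64-bit integer, and a real run-time would also handle DTV reallocation and lazily allocated blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one DTV entry; the names are illustrative only. */
typedef union dtv_slot {
    uint64_t generation;   /* DTV[0]: reserved for the platform, typically
                              the thread's generation counter */
    unsigned char *block;  /* DTV[N], N > 0: pointer to the TLS block of
                              module N */
} dtv_slot;

/* Hypothetical model of the TCB that the thread pointer points to. */
typedef struct tcb {
    dtv_slot *dtv;   /* first 8 bytes: pointer to the DTV */
    void *reserved;  /* other 8 bytes: reserved for the implementation */
} tcb;

/* Follow the DTV pointer held in the TCB, select the TLS block of the
   requested module, then add the variable's offset within that block. */
void *tls_address(const tcb *tp, size_t module_id, size_t offset)
{
    assert(module_id > 0);  /* index 0 is not a TLS block pointer */
    return tp->dtv[module_id].block + offset;
}
```

A caller would pass the thread pointer value as ``tp``. With module index 1 this models an access to the executable's TLS block.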
To calculate the address of a TLS variable in any given module, static or dynamic, the expression ``TP[0][Module id][offset in module]`` can be used. The function ``__tls_get_address(module_id, offset)`` returns the result of this calculation. -The calculation above is the most general and it can be applied to both static and dynamic TLS. There are four defined models of accessing TLS that trade off generality for performance. In order of descending generality: +The calculation above is the most general and it can be applied to +both static and dynamic TLS. There are four defined models of +accessing TLS that trade off generality for performance. In order of +descending generality: 1. General Dynamic, can be used anywhere. @@ -2656,7 +2667,8 @@ accessing the TLS Descriptor is: * The actual entry function uses a load acquire on the address of the resolver function, with a destination register of xzr. -Referring to the example above, the code for the resolver function becomes: +Referring to the example above, the code for the resolver function +becomes: .. code-block:: asm From 10de38dc3908fa944ac1f5be1e0f49dcf0770b03 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 21 Feb 2025 15:45:57 +0000 Subject: [PATCH 03/13] [sysvabi64] Review comments from Maskray * Rearranged formulas and used TCBsize to make it clearer. * Taken out "significant" from a significant number of dynamic linkers. * Give reason for using relaxation rather than optimization. * Clarify that there is no requirement to implement any TLSDESC resolver given in the sysvabi. --- sysvabi64/sysvabi64.rst | 47 ++++++++++++++++++++++++++++------------- 1 file changed, 32 insertions(+), 15 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index dd7f28ba..93f8e5ab 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2226,22 +2226,33 @@ AArch64 TLS SystemV design choices KiB, 16 MiB, 4GiB or 16EiB, depending on the addressing mode. 
The default is 16 MiB for all addressing modes. +Recall that the Thread Pointer ``TP`` points to the start of the ``TCB``. + The static and dynamic linker must agree on the size of the padding -between the TCB and the executables TLS Block. Using ``TCB`` as the -size of the TCB (16 bytes), ``PAD`` as the size of the padding bytes, -and ``PT_TLS`` as the program header with type PT_TLS. ``PAD`` must be -the smallest positive integer that satisfies the following congruence: +between the TCB and the executables TLS Block. Using ``TCBsize`` as the +size of the TCB (16 bytes), ``PADsize`` as the size of the padding bytes, +and ``PT_TLS`` as the program header with type PT_TLS. + +The Thread Pointer ``TP`` and also the address of the start of the +``TCB``, must satisfy the requirement. + +``TP ≡ 0 (modulo PT_TLS.p_align)``. -``TP + TCB + PAD ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)`` +``PADsize`` must be the smallest positive integer that satisfies the +following congruence: -Given that ``TP ≡ 0 (modulo PT_TLS.p_align)``. An expression -for `PAD` is ``PAD = (PT_TLS.p_vaddr - TCB) mod PT_TLS.p_align``. +``TCBsize + PADsize ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)``. -A significant number of dynamic linkers use a different calculation -that requires ``PT_TLS.p_vaddr ≡ 0 (modulo PT_TLS.p_align)`` to -correctly align the executables TLS block. For maximum compatibility, -static linkers and any linker scripts including TLS, are recommended -to align the TLS block so that `PT_TLS.p_vaddr ≡ 0 (modulo p_align)`. +An expression for ``PADsize`` is therefore: + +``PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align``. + +A number of dynamic linkers use a different calculation that requires +``PT_TLS.p_vaddr ≡ 0 (modulo PT_TLS.p_align)`` to correctly align the +executables TLS block, for either static or dynamic TLS. 
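Review note: the ``PADsize`` expression above can be checked with a small C sketch. The helper names are hypothetical. The sketch assumes ``p_align`` is a non-zero power of two, so the modulo reduces to a mask, and it reads "smallest positive integer" as allowing zero, which is what the closed-form expression yields.

```c
#include <assert.h>
#include <stdint.h>

enum { TCB_SIZE = 16 };  /* TCBsize from the text */

/* PADsize = (p_vaddr - TCBsize) mod p_align.
   Assumes p_align is a non-zero power of two. Unsigned wrap-around keeps
   the result correct when p_vaddr is smaller than TCBsize. */
uint64_t tls_pad_size(uint64_t p_vaddr, uint64_t p_align)
{
    assert(p_align != 0 && (p_align & (p_align - 1)) == 0);
    return (p_vaddr - TCB_SIZE) & (p_align - 1);
}

/* Offset of the executable's TLS block from the thread pointer. */
uint64_t tls_block_offset(uint64_t p_vaddr, uint64_t p_align)
{
    return TCB_SIZE + tls_pad_size(p_vaddr, p_align);
}
```

For example, ``p_vaddr = 0x1000`` with ``p_align = 64`` gives a pad of 48 bytes, so the TLS block starts 64 bytes after ``TP``. When ``p_vaddr`` is itself a multiple of ``p_align``, the offset is simply ``TCBsize`` rounded up to ``p_align``.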
For maximum +compatibility, static linkers and any linker scripts including TLS, +are recommended to align the TLS block so that `PT_TLS.p_vaddr ≡ 0 +(modulo p_align)`. There are two dialects of TLS supported by the relocations defined in AAELF64_, the traditional dialect described by ELFTLS_ and the @@ -2422,6 +2433,11 @@ Optimization to load a 64-bit var directly into a core register. Static link time TLS Relaxations -------------------------------- +Relaxation is a term used by the TLS literature such as ELFTLS_ to +represent an optimization. AAELF64_ has used optimization for similar +link-time instruction sequence optimizations. This document will use +relaxation to be consistent with existing references. + The static linker can relax a more general TLS model to a more constrained TLS model when the TLS variables meet the requirements for using the constrained model. @@ -2574,9 +2590,10 @@ the exception of ``x0``, ``x1``, ``x30`` and the processor flags. Example Resolver Functions ^^^^^^^^^^^^^^^^^^^^^^^^^^ -These examples are for illustrative purposes only. Due to the -restrictions on calling convention, the resolver routines must be -written in assembly language. +These examples are for illustrative purposes only. There is no +requirement for any of the following resolver functions to be +implemented. Due to the restrictions on calling convention, the +resolver routines must be written in assembly language. 
Static TLS Specialization From 75882bef67461dffbcd2f83dd9d9c834e30bef54 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 21 Feb 2025 15:59:25 +0000 Subject: [PATCH 04/13] fix CI due to broken reference --- sysvabi64/sysvabi64.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 93f8e5ab..0f6e1472 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2112,7 +2112,7 @@ Rules governing thread local storage on AArch64 This is part of ABI for the platform or execution environment. -This document and AA_ELF64_ are concerned with: +This document and AAELF64_ are concerned with: * How to relocate, statically and dynamically, with respect to symbols defined in TLS (for details of relocations relevant to AArch64 Linux see From cec90b5800f1a15e57fda86696af5cd457672249 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Tue, 18 Mar 2025 14:00:27 +0000 Subject: [PATCH 05/13] [sysvabi64] Correct local exec instruction to use tp as input Change the input register in add xn, xn, :tprel_hi12:var, lsl #12 to the thread pointer tp. We want to calculate the offset from the thread pointer so it needs to be an input of the add. --- sysvabi64/sysvabi64.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 0f6e1472..f2752aed 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2419,7 +2419,7 @@ block is smaller than 16 KiB. .. 
code-block:: asm - add xn, xn, :tprel_hi12:var, lsl #12 // R_AARCH64_TLSLE_ADD_TPREL_HI12 var + add xn, tp, :tprel_hi12:var, lsl #12 // R_AARCH64_TLSLE_ADD_TPREL_HI12 var add xn, xn, :tprel_lo12_nc:var // R_AARCH64_TLSLE_ADD_TPREL_LO12_NC var // offset of var from tp in xn From a9bf78aa4160cc5c0701cb9afa73b1b6bbe4414a Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Mon, 7 Apr 2025 13:50:13 +0100 Subject: [PATCH 06/13] [sysvabi64] document TLSDESC resolver extension register reqs Document the decision in the GCC mailing list thread TLSDESC clobber ABI stability/futureproofness? https://gcc.gnu.org/legacy-ml/gcc/2018-10/msg00112.html TLSDESC resolver functions assume that any registers added by an extension are caller saved for a TLSDESC call. A brief summary: Dynamic TLS may be lazy allocated upon the first use of a TLSDESC resolver. This may involve calls to heap allocation functions provided by the user, which may use registers from extensions like SVE and SME. As the resolver function can't know what is saved it would have to save all SVE and SME state. This would be way more expensive than a caller save, and an older libc written prior to the introduction of the extension would be unaware of them so the caller has to do the save. * The SVE and SME state is already --- sysvabi64/sysvabi64.rst | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index f2752aed..52c5804c 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -14,6 +14,7 @@ .. _AAELF64: https://github.com/ARM-software/abi-aa/releases .. _CPPABI64: https://developer.arm.com/docs/ihi0059/latest .. _GCABI: https://itanium-cxx-abi.github.io/cxx-abi/abi.html +.. _GCCML: https://gcc.gnu.org/legacy-ml/gcc/2018-10/msg00112.html .. _LINUX_ABI: https://github.com/hjl-tools/linux-abi/wiki .. _MemTagABIELF64: https://github.com/ARM-software/abi-aa/releases .. 
_PAuthABIELF64: https://github.com/ARM-software/abi-aa/releases @@ -254,6 +255,8 @@ This document refers to, or is referred to by, the following documents. +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ | GCABI_ | https://itanium-cxx-abi.github.io/cxx-abi/abi.html | Generic C++ ABI | +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ + | GCCML_ | https://gcc.gnu.org/legacy-ml/gcc/2018-10/msg00112.html | GCC Mailing list topic TLSDESC clobber ABI stability/futureproofness? | + +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ | HWCAP_ | https://www.kernel.org/doc/html/latest/arch/arm64/elf_hwcaps.html | Linux Kernel HWCAPs interface | +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ | LINUX_ABI_ | https://github.com/hjl-tools/linux-abi/wiki | Linux Extensions to gABI | @@ -2584,8 +2587,13 @@ TLS resolver functions have one argument, the address of the TLS descriptor, passed in ``x0``, they return the offset of the variable from the thread pointer in ``x0``. -TLS resolver functions must save all registers that they modify with -the exception of ``x0``, ``x1``, ``x30`` and the processor flags. +TLS resolver functions must save all general-purpose and SIMD&FP +registers that they modify with the exception of ``x0``, ``x1``, +``x30`` and the processor flags. + +TLS resolver functions are not required to save any register added by +an extension, such as the scalable vector registers or the SVE +predicate registers. See `GCCML`_ for details. 
Example Resolver Functions ^^^^^^^^^^^^^^^^^^^^^^^^^^ From cd67e16aa551f655b186e22feffac46238fbc75e Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 27 Jun 2025 09:09:50 +0100 Subject: [PATCH 07/13] [sysvabi64] Simple fixes from Ties' review [NFC] --- sysvabi64/sysvabi64.rst | 78 ++++++++++++++++++++++------------------- 1 file changed, 41 insertions(+), 37 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 52c5804c..c660a864 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2099,19 +2099,19 @@ into the abstract storage hierarchy as follows. Rules governing thread local storage on AArch64 ----------------------------------------------- -* How to denote TLS in source programs. +* How to denote TLS in source programs: C++11 and C11 use :c:`thread_local T t...`; A GCC extension uses :c:`__thread T t...`; this is Q-o-I. * How to represent the initializing images of TLS in object files, and how - to define symbols in TLS. + to define symbols in TLS: The rules for ELF are well established (see ``SHF_TLS``, ``STT_TLS`` in SCO-ELF_). * How a loader or run-time system creates instances of TLS per-thread at - execution time. + execution time: This is part of ABI for the platform or execution environment. @@ -2127,6 +2127,10 @@ This document and AAELF64_ are concerned with: Introduction to TLS addressing ------------------------------ +This section covers only the definitions required to understand the +AArch64 specific details. A more in-depth description to TLS +addressing in general can be found in ELFTLS_. + In the most general form, a program is constructed dynamically from an executable and a number of shared libraries. Each component, (executable or shared library) can be mapped into multiple @@ -2134,9 +2138,9 @@ processes. Additionally a shared library can be loaded dynamically by a program, rather than being part of the initial process image constructed when the program is first loaded. 
-For the purpose of addressing TLS, components, referred to as modules, -of an application are identified using indexes. The module index for -the executable is always 1, but the module indexes for shared +For the purpose of addressing TLS, components of an application, +referred to as modules, are identified using indexes. The module index +for the executable is always 1, but the module indexes for shared libraries are allocated at process start time, or when a shared library is loaded dynamically via dlopen. A shared library may have a different module index in two different processes so its per-thread @@ -2195,7 +2199,7 @@ of the DTV. The details are platform specific. To calculate the address of a TLS variable in any given module, static or dynamic, the expression ``TP[0][Module id][offset in module]`` can -be used. The function ``__tls_get_address(module_id, offset)`` returns +be used. The function ``__tls_get_addr(module_id, offset)`` returns the result of this calculation. The calculation above is the most general and it can be applied to @@ -2217,12 +2221,12 @@ descending generality: SystemV AArch64 TLS addressing ------------------------------ -AArch64 TLS SystemV design choices +AArch64 TLS SystemV design choices: * AArch64 uses variant 1 TLS as described in ELFTLS_. * The thread pointer (TP) is always accessible via the ``TPIDR_EL0`` - system register. This can be accessed via inlining a ``mrs`` + system register. This can be accessed via inlining an ``mrs`` instruction to read the thread pointer. * The compiler can generate code that supports a TLS block size of 4 @@ -2237,7 +2241,7 @@ size of the TCB (16 bytes), ``PADsize`` as the size of the padding bytes, and ``PT_TLS`` as the program header with type PT_TLS. The Thread Pointer ``TP`` and also the address of the start of the -``TCB``, must satisfy the requirement. +``TCB``, must satisfy the requirement: ``TP ≡ 0 (modulo PT_TLS.p_align)``. 
@@ -2270,13 +2274,13 @@ The code sequences below assume the default TLS block size of 16 MiB, this permits the Local Exec model to use of a pair of add instructions with a combined 24-bit immediate field. Larger TLS sizes can be supported by using a ``movz`` and one or more ``movk`` instructions to -construct an offset from the thread-pointer in a register. +construct an offset from the thread pointer in a register. A code model may use a sequence from a less restrictive code model. In the code-sequences below: -* ``tp`` is a core register containing the thread-pointer. +* ``tp`` is a core register containing the thread pointer. * ``gp`` is a core register containing the base of the GOT. @@ -2304,10 +2308,10 @@ the sequence, with exactly the same registers used. The code sequences below return the offset of the TLS variable from ``tp`` in ``x0``. To get the address of the TLS variable requires -additional code to add ``x0`` to be added to ``tp``, this is not part -of the ABI required TLSDESC code sequence. +additional code to add ``x0`` to ``tp``, this is not part of the ABI +required TLSDESC code sequence. -Small Code Model +Small Code Model; .. code-block:: asm @@ -2318,7 +2322,7 @@ Small Code Model blr x1 // R_AARCH64_TLSDESC_CALL var // offset of var from tp in x0 -Tiny Code Model +Tiny Code Model; .. code-block:: asm @@ -2328,7 +2332,7 @@ Tiny Code Model blr x1 // R_AARCH64_TLSDESC_CALL var // offset of var from tp in x0 -Large Code Model +Large Code Model; .. code-block:: asm @@ -2349,9 +2353,9 @@ Local Dynamic is a special case of general dynamic where the compiler knows that the TLS variable is defined in the same module as the code that is accessing the variable. In this case the offset of the TLS variable from the start of the module's TLS block is a static link -time constant. Instead of dynamically calculating the offset of the -TLS variable from the thread-pointer. 
The offset of the module's TLS -block from the thread-pointer is calculated, then the offset of the +time constant, instead of dynamically calculating the offset of the +TLS variable from the thread pointer. The offset of the module's TLS +block from the thread pointer is calculated, then the offset of the TLS variable within that block is added. This is more efficient than general dynamic when more than one TLS variable from the same module is accessed from the same function, but less efficient when accessing @@ -2372,14 +2376,14 @@ Initial Exec Initial Exec can be used for static TLS. The location of the module's TLS block and the offset of the TLS variable within that block are run-time constants. The dynamic-loader computes the offset from the -thread-pointer and places it in a GOT entry. The GOT entry is +thread pointer and places it in a GOT entry. The GOT entry is relocated by dynamic relocation ``R_AARCH64_TLS_TPREL64``. A shared-library that contains Initial Exec TLS must have the ``DF_STATIC_TLS`` dynamic tag set. An attempt to load a shared library with ``DF_STATIC_TLS`` via ``dlopen`` will be rejected. -Small Code model +Small Code model; The static linker is permitted to relax the instructions below to Local Exec individually using the relocation directive. The @@ -2391,14 +2395,14 @@ instructions do not have to be contiguous. ldr xn, [xn, #:gottprel_lo12:var] // R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC var // offset of var from tp in xn -Tiny Code model +Tiny Code model; .. code-block:: asm ldr xn, :gottprel:var // R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 var // offset of var from tp in xn -Large Code model +Large Code model; .. code-block:: asm @@ -2415,10 +2419,10 @@ executable always has the TLS module index of 1 so the offsets of the TLS variables from the thread pointer are static link time constants. The code sequences are the same for all code models. -The instruction sequences below are not ABI. 
Using the instructions -and relocations below increases the chances of static linkers applying -the relaxations in (AAELF64_) when the size of the executables TLS -block is smaller than 16 KiB. +The instruction sequences below are not required by the ABI but using +the instructions and relocations below increases the chances of static +linkers applying the relaxations in (AAELF64_) when the size of the +executables TLS block is smaller than 16 KiB. .. code-block:: asm @@ -2459,7 +2463,7 @@ General Dynamic to Initial Exec This relaxation can be performed when the TLS variable is defined in a module that is part of static TLS. -Small Code Model +Small Code Model; .. code-block:: asm @@ -2469,7 +2473,7 @@ Small Code Model nop // offset of var from tp in x0 -Tiny Code Model +Tiny Code Model; .. code-block:: asm @@ -2478,7 +2482,7 @@ Tiny Code Model nop // offset of var from tp in x0 -Large Code Model +Large Code Model; .. code-block:: asm @@ -2494,7 +2498,7 @@ General Dynamic to Local Exec This relaxation can be performed when the TLS variable is defined in the executable. -Small Code Model +Small Code Model; .. code-block:: asm @@ -2504,7 +2508,7 @@ Small Code Model nop // offset of var from tp in x0 -Tiny Code Model +Tiny Code Model; .. code-block:: asm @@ -2513,7 +2517,7 @@ Tiny Code Model nop // offset of var from tp in x0 -Large Code Model +Large Code Model; .. code-block:: asm @@ -2603,7 +2607,7 @@ requirement for any of the following resolver functions to be implemented. Due to the restrictions on calling convention, the resolver routines must be written in assembly language. -Static TLS Specialization +Static TLS Specialization: When the TLS variable is in the static TLS block, the offset from the thread pointer is fixed at runtime. The dynamic loader can calculate @@ -2618,7 +2622,7 @@ resolver function needs to do is extract the offset and return it. 
ldr x0, [x0, #8] ret -Dynamic TLS Specialization +Dynamic TLS Specialization: When the TLS variable is defined in dynamic TLS the address of the TLS variable must be calculated by the resolver function using @@ -2665,7 +2669,7 @@ TLS descriptor has completed before reading it. .. code-block:: asm - // Code to obtain the offset of var from thread-pointer. + // Code to obtain the offset of var from thread pointer. // Loads the address of the resolver function into x1. // Places the address of the TLS Descriptor into x0. adrp x0, :tlsdesc:var From 1c4b738c66142094d3f77b3c109d54a69650f3e2 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 27 Jun 2025 11:01:12 +0100 Subject: [PATCH 08/13] Update description of the generation counter and __tls_get_addr Include a pseudo code description of __tls_get_addr with deferred TLS for dynamic modules. --- sysvabi64/sysvabi64.rst | 46 ++++++++++++++++++++++++++++++----------- 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index c660a864..62d21077 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2189,23 +2189,45 @@ dynamic TLS. Index N, where N > 0, of the Dynamic Thread Vector DTV[N] is a pointer to the TLS block for module N. -Index 0 of the Dynamic Thread Vector DTV[0] is reserved for use by the -platform. It typically contains the thread's generation counter which -can be used to update or reallocate the DTV when TLS variables in -dynamic modules loaded by ``dlopen`` are first used. When a dynamic -TLS variable is accessed the thread's generation count is compared -with the global generation count which can be used to trigger updates -of the DTV. The details are platform specific. - To calculate the address of a TLS variable in any given module, static or dynamic, the expression ``TP[0][Module id][offset in module]`` can be used. The function ``__tls_get_addr(module_id, offset)`` returns the result of this calculation. 
-The calculation above is the most general and it can be applied to -both static and dynamic TLS. There are four defined models of -accessing TLS that trade off generality for performance. In order of -descending generality: +Index 0 of the Dynamic Thread Vector DTV[0] is reserved for use by the +platform. It is typically used to store the thread's generation +counter. In an implementation that supports deferred allocation of +TLS, a global generation number is incremented whenever the number of +dynamic modules changes due to ``dlopen`` or ``dlclose``. In the +``__tls_get_addr(module_id, offset)`` function, if the thread's +generation count is less than the global generation number, the +thread's DTV is updated, and the TLS for the ``module_id`` is +allocated if it is not present. + +In pseudo code + +.. code-block:: c + + /* tls_get_addr with deferred allocation */ + void * __tls_get_addr(size_t module_id, size_t offset) + { + dtv = get_thread_dtv(); + + if (dtv[0].generation_counter != global_generation_number) + /* includes setting the thread's generation counter to + the global_generation_number */ + update_thread_dtv(); + + if (dtv[module_id] == unallocated) + allocate_tls(dtv, module_id); + + return dtv[module_id][offset]; + } + +The calculation in __tls_get_addr is the most general and it can be +applied to both static and dynamic TLS. There are four defined models +of accessing TLS that trade off generality for performance. In order +of descending generality: 1. General Dynamic, can be used anywhere. From 48e730b9b54a148f59505fd4c5d3c7df88e5e3cb Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 27 Jun 2025 11:54:20 +0100 Subject: [PATCH 09/13] Reword derivation of TLS padding to make it easier to understand Use integers modulo m to avoid excess use of (modulo m). Explain the congruence symbol. Put expression first so derivation is optional. 
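For reference, the ``PADsize`` expression can be checked with a small C sketch. This is only an illustration under stated assumptions, not text from the ABI: ``tls_pad_size`` and ``TCB_SIZE`` are hypothetical names, alignments are assumed to be powers of two, and a ``p_align`` of 0 or 1 is treated as "no alignment requirement".

```c
#include <assert.h>
#include <stdint.h>

/* Size of the AArch64 TCB in bytes (16, as stated in the document). */
#define TCB_SIZE 16u

/*
 * Hypothetical helper, not part of any ABI or libc.
 * Returns the number of padding bytes between the end of the TCB and
 * the executable's TLS block, given the p_vaddr and p_align fields of
 * the PT_TLS program header:
 *
 *   PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align
 */
static inline uint64_t tls_pad_size(uint64_t p_vaddr, uint64_t p_align)
{
    /* A p_align of 0 or 1 means no alignment is required. */
    if (p_align <= 1)
        return 0;

    /* Assumption of this sketch: the alignment is a power of two. */
    assert((p_align & (p_align - 1)) == 0);

    /* Unsigned wrap-around keeps the result correct modulo a power of
       two, even when p_vaddr is smaller than TCB_SIZE. */
    return (p_vaddr - TCB_SIZE) & (p_align - 1);
}
```

With ``p_vaddr = 0x1000`` and ``p_align = 64`` the sketch gives 48, which matches ``PT_TLS.p_align - TCBsize`` for a ``p_vaddr`` that is already aligned.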
--- sysvabi64/sysvabi64.rst | 63 +++++++++++++++++++++++++---------------- 1 file changed, 39 insertions(+), 24 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 62d21077..773e00f7 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2247,6 +2247,12 @@ AArch64 TLS SystemV design choices: * AArch64 uses variant 1 TLS as described in ELFTLS_. +* There are two dialects of TLS supported by the relocations defined + in AAELF64_, the traditional dialect described by ELFTLS_ and the + descriptor dialect described by TLSDESC_. This document describes + only the descriptor dialect as this is the default dialect for GCC + and the only dialect supported by clang. + * The thread pointer (TP) is always accessible via the ``TPIDR_EL0`` system register. This can be accessed via inlining an ``mrs`` instruction to read the thread pointer. @@ -2255,39 +2261,48 @@ AArch64 TLS SystemV design choices: KiB, 16 MiB, 4GiB or 16EiB, depending on the addressing mode. The default is 16 MiB for all addressing modes. -Recall that the Thread Pointer ``TP`` points to the start of the ``TCB``. - -The static and dynamic linker must agree on the size of the padding -between the TCB and the executables TLS Block. Using ``TCBsize`` as the -size of the TCB (16 bytes), ``PADsize`` as the size of the padding bytes, -and ``PT_TLS`` as the program header with type PT_TLS. - -The Thread Pointer ``TP`` and also the address of the start of the -``TCB``, must satisfy the requirement: - -``TP ≡ 0 (modulo PT_TLS.p_align)``. +* The TLS for an executable or shared-library is described by the + ``PT_TLS`` program header. -``PADsize`` must be the smallest positive integer that satisfies the -following congruence: +Recall from the diagram in `SystemV AArch64 TLS addressing +architecture`_ that the Thread Pointer ``TP`` points to the start of +the ``TCB``, which is followed by 0 or more bytes of alignment +padding, then the executable's TLS block. 
-``TCBsize + PADsize ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)``. +The ``TP``, and hence the start of the ``TCB``, must be aligned to a +``PT_TLS.p_align`` boundary. This can be expressed as ``TP ≡ 0 (modulo +PT_TLS.p_align)`` where ``≡`` means congruent to. -An expression for ``PADsize`` is therefore: +The static and dynamic linker must agree on the size of the padding +(``PADsize``) between the TCB and the executable's TLS Block. Using +``TCBsize`` as the size of the TCB (16 bytes), the following expression can be used to calculate ``PADsize`` from the ``PT_TLS`` program header: ``PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align``. A number of dynamic linkers use a different calculation that requires ``PT_TLS.p_vaddr ≡ 0 (modulo PT_TLS.p_align)`` to correctly align the -executables TLS block, for either static or dynamic TLS. For maximum +executable's TLS block. In this case the expression above simplifies to +``PADsize = Max(0, PT_TLS.p_align - TCBsize)``. For maximum compatibility, static linkers and any linker scripts including TLS, -are recommended to align the TLS block so that `PT_TLS.p_vaddr ≡ 0 -(modulo p_align)`. - -There are two dialects of TLS supported by the relocations defined in -AAELF64_, the traditional dialect described by ELFTLS_ and the -descriptor dialect described by TLSDESC_. This document describes only -the descriptor dialect as this is the default dialect for GCC and the -only dialect supported by clang. +are recommended to align the TLS block so that ``PT_TLS.p_vaddr ≡ 0 +(modulo p_align)``. This requires the start of the TLS to be aligned +to the maximum alignment of the ``.tdata`` and ``.tbss`` sections. + +The expression for ``PADsize`` above can be derived from the +requirement that ``PADsize`` must be the smallest non-negative integer +that satisfies the following congruence: + +``TP + TCBsize + PADsize ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)``. + +Using integers modulo m, where m is ``PT_TLS.p_align``:
+``TP:sub:m + TCBsize:sub:m + PADsize:sub:m = PT_TLS.p_vaddr:sub:m`` + +As ``TP:sub:m`` is 0 as ``TP ≡ 0 (modulo PT_TLS.p_align)`` rearranging +we get: + +``PADsize:sub:m = PT_TLS.p_vaddr:sub:m - TCBsize:sub:m`` +which is equivalent to +``PADsize:sub:m = (PT_TLS.p_vaddr - TCBsize):sub:m``. Code sequences for accessing TLS variables ------------------------------------------ From 555d8afe7fea0e39c3a8b76ee2e7a38191578f40 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 27 Jun 2025 13:36:35 +0100 Subject: [PATCH 10/13] Move around some paragraphs to introduce some topics before they are used. --- sysvabi64/sysvabi64.rst | 53 ++++++++++++++++++++++------------------- 1 file changed, 29 insertions(+), 24 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 773e00f7..109a96cf 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2229,7 +2229,7 @@ applied to both static and dynamic TLS. There are four defined models of accessing TLS that trade off generality for performance. In order of descending generality: - 1. General Dynamic, can be used anywhere. + 1. General Dynamic, also known as Global Dynamic, can be used anywhere. 2. Local Dynamic, can be used anywhere where the definition of the TLS variable and the access are from the same module. @@ -2304,6 +2304,21 @@ we get: which is equivalent to ``PADsize:sub:m = (PT_TLS.p_vaddr - TCBsize):sub:m``. +TLS Descriptors +--------------- + +AArch64 uses the TLS Descriptor dialect for the general dynamic model. +The TLS Descriptor dialect permits a dynamic linker to use the +location and properties of the TLS symbol to select an optimal +resolver function. + +The static relocations with a prefix of ``R_AARCH64_TLSDESC_`` +targeting TLS symbol ``var``, instruct the static linker to create a +TLS Descriptor for ``var``. The TLS Descriptor for a variable is +stored in a pair of consecutive GOT entries, N and N + 1. 
The GOT +entry for N has a dynamic ``R_AARCH64_TLSDESC`` relocation targeting +the TLS symbol for ``var``. + Code sequences for accessing TLS variables ------------------------------------------ @@ -2333,6 +2348,16 @@ In the code-sequences below: * ``.tlsdescadd`` is an assembler directive that adds a ``R_AARCH64_TLSDESC_ADD`` relocation to the next instruction. +Relaxation is a term used by the TLS literature such as ELFTLS_ to +represent an optimization. AAELF64_ has used optimization for similar +link-time instruction sequence optimizations. This document will use +relaxation to be consistent with existing references. + +The static linker can relax a more general TLS model to a more +constrained TLS model when the TLS variables meet the requirements for +using the constrained model. The section `Static link time TLS +Relaxations`_ describes the details of the permitted relaxations. + General Dynamic ^^^^^^^^^^^^^^^ @@ -2458,7 +2483,7 @@ constants. The code sequences are the same for all code models. The instruction sequences below are not required by the ABI but using the instructions and relocations below increases the chances of static -linkers applying the relaxations in (AAELF64_) when the size of the +linkers applying the optimizations in (AAELF64_) when the size of the executables TLS block is smaller than 16 KiB. .. code-block:: asm @@ -2477,15 +2502,6 @@ Optimization to load a 64-bit var directly into a core register. Static link time TLS Relaxations -------------------------------- -Relaxation is a term used by the TLS literature such as ELFTLS_ to -represent an optimization. AAELF64_ has used optimization for similar -link-time instruction sequence optimizations. This document will use -relaxation to be consistent with existing references. - -The static linker can relax a more general TLS model to a more -constrained TLS model when the TLS variables meet the requirements for -using the constrained model. 
- The Relaxations described below can be automatically applied to code sequences in the executable. Relaxing from general dynamic will prevent a shared library from being opened at runtime via dlopen so @@ -2579,19 +2595,8 @@ destination register must be preserved. movz xn, :tprel_g1:var // R_AARCH64_TLSLE_MOVW_TPREL_G1 var movk xn, :tprel_g0:var // R_AARCH64_TLSLE_MOVW_TPREL_G0_NC var -TLS Descriptors ---------------- - -The TLS Descriptor dialect permits a dynamic linker to use the -location and properties of the TLS symbol to select an optimal -resolver function. - -The static relocations with a prefix of ``R_AARCH64_TLSDESC_`` -targeting TLS symbol ``var``, instruct the static linker to create a -TLS Descriptor for ``var``. The TLS Descriptor for a variable is -stored in a pair of consecutive GOT entries, N and N + 1. The GOT -entry for N has a dynamic ``R_AARCH64_TLSDESC`` relocation targeting -the TLS symbol for ``var``. +TLS Descriptor resolver functions +--------------------------------- When resolving the ``R_AARCH64_TLSDESC`` relocation, the dynamic loader places the address of the chosen resolver function in the first From 72c1c77f45aa641a36f15e9084652ffef2d6b3ea Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Fri, 15 Aug 2025 14:52:29 +0100 Subject: [PATCH 11/13] [sysvabi64] Move TLSDESC resolvers to separate design doc The TLSDESC resolver functions are not ABI so we can move them out of the sysvabi64 document. Providing some examples that can be used by a dynamic linker is still useful so move this to the design documents section. Add a comment about DTV surplus TLS that permits a dynamic loader to dlopen a DSO with initial-exec TLS. There can be a small number of performance critical shared-libraries that use initial exec TLS, but are expected to be opened via dlopen, particularly by scripting languages like python. 
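As an aid to reading the moved resolver examples, the two-entry TLS descriptor they operate on can be modelled in portable C. This is a hedged sketch only: real resolvers must be written in assembly because of the TLSDESC calling convention, and every name below (``tlsdesc_model``, ``model_static_resolver``, ``model_resolve_static``, ``model_tls_address``) is hypothetical and not part of any dynamic linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the pair of consecutive GOT entries that form a TLS
   descriptor: the resolver address first, its argument second. */
struct tlsdesc_model {
    ptrdiff_t (*resolver)(struct tlsdesc_model *);
    union {
        void *pointer;
        ptrdiff_t value;
    } argument;
};

/* C stand-in for the static TLS resolver: the offset from the thread
   pointer was precomputed by the loader, so it is simply returned. */
static inline ptrdiff_t model_static_resolver(struct tlsdesc_model *desc)
{
    return desc->argument.value;
}

/* Stand-in for the dynamic loader processing R_AARCH64_TLSDESC for a
   variable in static TLS: pick the resolver and store its argument. */
static inline void model_resolve_static(struct tlsdesc_model *desc,
                                        ptrdiff_t tp_offset)
{
    desc->resolver = model_static_resolver;
    desc->argument.value = tp_offset;
}

/* Stand-in for the access sequence: call through the descriptor to get
   the offset, then add it to the thread pointer. */
static inline void *model_tls_address(struct tlsdesc_model *desc, char *tp)
{
    return tp + desc->resolver(desc);
}
```

The model keeps the resolver address in the first slot and the argument in the second, mirroring the ``ldr x0, [x0, #8]`` load in the static resolver example.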
--- design-documents/tlsdesc-resolvers.rst | 137 +++++++++++++++++++++++++ sysvabi64/sysvabi64.rst | 127 +++-------------------- 2 files changed, 150 insertions(+), 114 deletions(-) create mode 100644 design-documents/tlsdesc-resolvers.rst diff --git a/design-documents/tlsdesc-resolvers.rst b/design-documents/tlsdesc-resolvers.rst new file mode 100644 index 00000000..b96d4f9a --- /dev/null +++ b/design-documents/tlsdesc-resolvers.rst @@ -0,0 +1,137 @@ +.. + Copyright (c) 2023-2025, Arm Limited and its affiliates. All rights reserved. + CC-BY-SA-4.0 AND Apache-Patent-License + See LICENSE file for details + +.. _SYSVABI64: https://github.com/ARM-software/abi-aa/tree/main/sysvabi64/sysvabi64.rst +.. _TLSDESC: http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt + +Thread Local Storage TLSDESC resolver functions +*********************************************** + +Preamble +======== + +Background +---------- + +The ``R_AARCH64_TLSDESC`` dynamic relocation is platform specific. The +dynamic loader is expected to choose an appropriate resolver function +for the context. This document provides some example resolver +functions. + +These examples are for illustrative purposes only. There is no +requirement for any of the following resolver functions to be +implemented. + +The ABI requirements for the calling convention of resolver functions are +described in `SYSVABI64`_. + +Example Resolver Functions +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Due to the restrictions on calling convention, the +resolver routines must be written in assembly language. + +Static TLS Specialization: + +When the TLS variable is in the static TLS block, the offset from the +thread pointer is fixed at runtime. The dynamic loader can calculate +the offset and place it in the TLS descriptor. All the static TLS +resolver function needs to do is extract the offset and return it. + +.. code-block:: asm + + _dl_tlsdesc_return: + // x0 contains pointer to struct tlsdesc.
+ // tlsdesc.argument.value contains offset of variable from TP + ldr x0, [x0, #8] + ret + +Dynamic TLS Specialization: + +When the TLS variable is defined in dynamic TLS the address of the TLS +variable must be calculated by the resolver function using +``__tls_get_addr``. The resolver function returns the offset from the +thread pointer by subtracting the address of the thread pointer from +the address of the TLS variable. In practice an implementation of the +dynamic TLS resolver contains many platform specific details outside +of the scope of the ABI. An example of how a dynamic resolver might be +implemented can be found in the Dynamic Specialization section of +TLSDESC_. + +Undefined Weak Symbols: + +An undefined weak symbol has the value 0. As the resolver function +returns an offset from the Thread Pointer, to get a value of 0 when +added to the Thread Pointer the resolver function returns the negated +thread pointer value, which cancels to 0 when added to the thread +pointer. + +.. code-block:: asm + + _dl_tlsdesc_undefweak: + mrs x0, tpidr_el0 + neg x0, x0 + ret + +Lazy resolution of R_AARCH64_TLSDESC +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The TLSDESC_ paper describes an optional mechanism to resolve TLSDESC +calls lazily. Lazy resolution for TLSDESC resolver functions is not +recommended on AArch64. Additional synchronization is required for +each TLSDESC call, which has a significant effect on performance. The +text below describes the additional synchronization that is +needed. + +Instead of fully resolving the ``R_AARCH64_TLSDESC`` relocation at +module load time, a lazy resolver function runs on the first TLSDESC +call. The lazy resolver updates the TLS Descriptor with the actual +resolver function and the parameter to the actual resolver +function. In a multi-threaded program when lazy TLS is in use, the +resolver functions must ensure that the write to the parameter in the +TLS descriptor has completed before reading it. + +..
code-block:: asm + + // Code to obtain the offset of var from thread pointer. + // Loads the address of the resolver function into x1. + // Places the address of the TLS Descriptor into x0. + adrp x0, :tlsdesc:var + ldr x1, [x0, #:tlsdesc_lo12:var] + add x0, x0, #:tlsdesc_lo12:var + .tlsdesccall var + blr x1 // _dl_tlsdesc_return + + // Resolver function + _dl_tlsdesc_return: + // Load the parameter from the TLS descriptor. Without + // synchronization this load can read an old value prior + // to the lazy resolver's update to the descriptor completing. + ldr x0, [x0, #8] + ret + +The recommended way to ensure synchronization between the lazy +resolver update of the TLS Descriptor and the actual resolver function +accessing the TLS Descriptor is: + +* The TLS lazy resolver function uses a store release when updating + the address of the resolver function in the TLS Descriptor. + +* The actual entry function uses a load acquire on the address of the + resolver function, with a destination register of xzr. + +Referring to the example above, the code for the resolver function +becomes: + +.. code-block:: asm + + // Resolver function + _dl_tlsdesc_return: + // Guaranteed to complete after the lazy resolver's store release + // of the address in [x0]. + ldar xzr, [x0] + // Access the parameter. + ldr x0, [x0, #8] + ret diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 109a96cf..36c742cf 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -25,6 +25,7 @@ .. _SYSVABI: https://github.com/ARM-software/abi-aa/releases .. _ELFTLS: https://www.uclibc.org/docs/tls.pdf .. _TLSDESC: http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt +.. _TLSDESCRES: https://github.com/ARM-software/abi-aa/tree/main/design-documents/tlsdesc-resolvers.rst .. role:: c(code) :language: c @@ -271,6 +272,8 @@ This document refers to, or is referred to by, the following documents.
+-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ | SYM-VER_ | http://people.redhat.com/drepper/symbol-versioning | GNU Symbol Versioning | +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ + | TLSDESCRES_ | design-documents/tlsdesc-resolvers | TLSDESC resolver function examples | + +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ Terms and Abbreviations ----------------------- @@ -2442,8 +2445,12 @@ thread pointer and places it in a GOT entry. The GOT entry is relocated by dynamic relocation ``R_AARCH64_TLS_TPREL64``. A shared-library that contains Initial Exec TLS must have the -``DF_STATIC_TLS`` dynamic tag set. An attempt to load a shared library -with ``DF_STATIC_TLS`` via ``dlopen`` will be rejected. +``DF_STATIC_TLS`` dynamic tag set. In the general case an attempt to +load a shared library with ``DF_STATIC_TLS`` via ``dlopen`` will be +rejected. Some dynamic loaders implement a surplus of DTV slots that +permit a fixed number of ``DF_STATIC_TLS`` modules to be dynamically +loaded. Whether a DTV surplus is available and how many slots are +available is implementation defined. Small Code model; @@ -2604,7 +2611,7 @@ GOT entry, and the argument for the chosen resolver function in the second GOT entry. The AArch64 C and assembler examples are adapted from the AArch32 -TLSDESC_ paper. The C code below represents the TLS Descriptor. +`TLSDESC`_ paper. The C code below represents the TLS Descriptor. .. code-block:: c @@ -2626,6 +2633,9 @@ The TLS resolver functions are not standardized by this ABI as they are internal to the dynamic linker. Programs must not directly refer to TLS resolver functions. 
+The `TLSDESCRES`_ document contains information on how a platform +might implement the resolver functions. + Calling Convention ^^^^^^^^^^^^^^^^^^ @@ -2641,117 +2651,6 @@ TLS resolver functions are not required to save any register added by an extension, such as the scalable vector registers or the SVE predicate registers. See `GCCML`_ for details. -Example Resolver Functions -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -These examples are for illustrative purposes only. There is no -requirement for any of the following resolver functions to be -implemented. Due to the restrictions on calling convention, the -resolver routines must be written in assembly language. - -Static TLS Specialization: - -When the TLS variable is in the static TLS block, the offset from the -thread pointer is fixed at runtime. The dynamic loader can calculate -the offset and place it in the TLS descriptor. All the static TLS -resolver function needs to do is extract the offset and return it. - -.. code-block:: asm - - _dl_tlsdesc_return: - // x0 contains pointer to struct tlsdesc. - // tlsdesc.argument.value contains offset of variable from TP - ldr x0, [x0, #8] - ret - -Dynamic TLS Specialization: - -When the TLS variable is defined in dynamic TLS the address of the TLS -variable must be calculated by the resolver function using -``__tls_get_addr``. The resolver function returns the offset from the -thread pointer by subtracting the address of the thread pointer from -the address of the TLS variable. In practice an implementation of the -dynamic TLS resolver contains many platform specific details outside -of the scope of the ABI. An example of how a dynamic resolver might be -implemented can be found in the Dynamic Specialization section of -TLSDESC_. - -Undefined Weak Symbols - -An undefined weak symbol has the value 0. 
As the resolver function -returns an offset from the Thread Pointer, to get a value of 0 when -added to the Thread Pointer the resolver function returns a negative -thread pointer value that cancels to 0 when added to the thread -pointer. - -.. code-block:: asm - - __dl_tlsdesc_undefweak: - mrs x0, tpidr_el0 - neg x0, x0 - ret - -Lazy resolution of R_AARCH64_TLSDESC -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The TLSDESC_ paper describes an optional mechanism to resolve TLSDESC -calls lazily. Lazy resolution for TLSDESC resolver functions is not -recommended on AArch64. Additional synchronization is required for -each TLSDESC call, which has a significant affect on performance. The -description below describes the additional synchronization that is -needed. - -Instead of fully resolving the ``R_AARCH64_TLSDESC`` relocation at -module load time, a lazy resolver function runs on the first TLSDESC -call. The lazy resolver updates the TLS Descriptor with the actual -resolver function and the parameter to the actual resolver -function. In a multi-threaded program when lazy TLS in use, the -resolver functions must ensure that the write to the parameter in the -TLS descriptor has completed before reading it. - -.. code-block:: asm - - // Code to obtain the offset of var from thread pointer. - // Loads the address of the resolver function into x1. - // Places the address of the TLS Descriptor into x0. - adrp x0, :tlsdesc:var - ldr x1, [x0, #:tlsdesc_lo12:var] - add x0, x0, #:tlsdesc_lo12:var] - .tlsdesccall var - blr x1 // _dl_desc_return - - // Resolver function - _dl_tlsdesc_return: - // load the parameter from the TLS descriptor. Without - // synchronization this load can read an old value prior - // to the lazy resolvers update to the descriptor completing. 
- ldr x0, [x0, #8] - ret - -The recommended way to ensure synchronization between the lazy -resolver update of the TLS Descriptor and the actual resolver function -accessing the TLS Descriptor is: - -* The TLS lazy resolver function uses a store release when updating - the address of the resolver function in the TLS Descriptor. - -* The actual entry function uses a load acquire on the address of the - resolver function, with a destination register of xzr. - -Referring to the example above, the code for the resolver function -becomes: - -.. code-block:: asm - - // Resolver function - _dl_tlsdesc_return: - // Guaranteed to complete after the lazy resolvers store release - // of the address in [x0]. - ldar xzr, [x0] - // Access the parameter. - ldr x0, [x0, #8] - ret - Libraries ========= From 0834e04806dbbd677dad5947b304346878fbef11 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Wed, 25 Mar 2026 11:33:24 +0000 Subject: [PATCH 12/13] [sysvabi64] Yury's review comments * Added `` `` to some variables. * Added some more section headings. * Used code-blocks for formula. * Fixed reference to design document. --- sysvabi64/sysvabi64.rst | 92 ++++++++++++++++++++--------------------- 1 file changed, 44 insertions(+), 48 deletions(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 36c742cf..3da7bcf6 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -272,7 +272,7 @@ This document refers to, or is referred to by, the following documents. 
+-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ | SYM-VER_ | http://people.redhat.com/drepper/symbol-versioning | GNU Symbol Versioning | +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ - | TLSDESCRES_ | design-documents/tlsdesc-resolvers | TLSDESC resolver function examples | + | TLSDESCRES_ | design-documents/tlsdesc-resolvers.rst | TLSDESC resolver function examples | +-----------------+-------------------------------------------------------------------+-----------------------------------------------------------------------------+ Terms and Abbreviations @@ -2099,13 +2099,8 @@ into the abstract storage hierarchy as follows. * (Most local) Automatic data (stack variables, instanced once per function activation, per thread). -Rules governing thread local storage on AArch64 ------------------------------------------------ - -* How to denote TLS in source programs: - - C++11 and C11 use :c:`thread_local T t...`; A GCC extension uses - :c:`__thread T t...`; this is Q-o-I. +Scope of the section +-------------------- * How to represent the initializing images of TLS in object files, and how to define symbols in TLS: @@ -2207,30 +2202,14 @@ generation count is less than the global generation number, the thread's DTV is updated, and the TLS for the ``module_id`` is allocated if it is not present. -In pseudo code - -.. 
code-block:: c - - /* tls_get_addr with deferred allocation */ - void * __tls_get_addr(size_t module_id, size_t offset) - { - dtv = get_thread_dtv(); - - if (dtv[0].generation_counter != global_generation_number) - /* includes setting the thread's generation counter to - the global_generation_number */ - update_thread_dtv(); +The calculation in ``__tls_get_addr`` is the most general and it can be +applied to both static and dynamic TLS. - if (dtv[module_id] == unallocated) - allocate_tls(dtv, module_id); - - return dtv[module_id][offset]; - } +TLS models +---------- -The calculation in __tls_get_addr is the most general and it can be -applied to both static and dynamic TLS. There are four defined models -of accessing TLS that trade off generality for performance. In order -of descending generality: +There are four defined models of accessing TLS that trade off +generality for performance. In descending order of generality: 1. General Dynamic, also known as Global Dynamic, can be used anywhere. @@ -2243,10 +2222,8 @@ of descending generality: 4. Local Exec, can be used in the executable for TLS variables defined in the executables static TLS block. -SystemV AArch64 TLS addressing ------------------------------- - -AArch64 TLS SystemV design choices: +AArch64 TLS SystemV design choices +---------------------------------- * AArch64 uses variant 1 TLS as described in ELFTLS_. @@ -2256,9 +2233,9 @@ AArch64 TLS SystemV design choices: only the descriptor dialect as this is the default dialect for GCC and the only dialect supported by clang. -* The thread pointer (TP) is always accessible via the ``TPIDR_EL0`` - system register. This can be accessed via inlining an ``mrs`` - instruction to read the thread pointer. +* The thread pointer (``TP``) is always accessible via the + ``TPIDR_EL0`` system register. This can be accessed via inlining an + ``mrs`` instruction to read the thread pointer.
* The compiler can generate code that supports a TLS block size of 4 KiB, 16 MiB, 4GiB or 16EiB, depending on the addressing mode. The @@ -2267,6 +2244,9 @@ AArch64 TLS SystemV design choices: * The TLS for an executable or shared-library is described by the ``PT_TLS`` program header. +TP, TCB and padding size +------------------------ + Recall from the diagram in `SystemV AArch64 TLS addressing architecture`_ that the Thread Pointer ``TP`` points to the start of the ``TCB``, which is followed by 0 or more bytes of alignment @@ -2278,9 +2258,13 @@ PT_TLS.p_align)`` where ``≡`` means congruent to. The static and dynamic linker must agree on the size of the padding (``PADsize``) between the TCB and the executable's TLS Block. Using -``TCBsize`` as the size of the TCB (16 bytes), the following expression can be used to calcluate ``PADsize`` from the ``PT_TLS`` program header. +``TCBsize`` as the size of the TCB (16 bytes), the following +expression can be used to calculate ``PADsize`` from the ``PT_TLS`` +program header. -``PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align``. +.. code-block:: c + + PADsize = (PT_TLS.p_vaddr - TCBsize) mod PT_TLS.p_align A number of dynamic linkers use a different calculation that requires ``PT_TLS.p_vaddr ≡ 0 (modulo PT_TLS.p_align)`` to correctly align the @@ -2295,17 +2279,29 @@ The expression for ``PADsize`` above can be derived from the requirement that ``PADsize`` must be the smallest positive integer that satisfies the following congruence: -``TP`` + ``TCBsize + PADsize ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align)``. +.. code-block:: c + + TP + TCBsize + PADsize ≡ PT_TLS.p_vaddr (modulo PT_TLS.p_align) + +Using integers modulo m, where m is ``PT_TLS.p_align``, denoted by a (m) +suffix in the formula below. + +.. code-block:: c -Using Integers modulo m where (``PT_TLS.p_align``).
-``TP:sub:m + TCBsize:sub:m + PADsize:sub:m = PT_TLS.p_vaddr:sub:m`` + TP(m) + TCBsize(m) + PADsize(m) = PT_TLS.p_vaddr(m) -As ``TP:sub:m`` is 0 as ``TP ≡ 0 (modulo PT_TLS.p_align)`` rearranging +As ``TP(m)`` is 0, because ``TP ≡ 0 (modulo PT_TLS.p_align)``, rearranging we get: -``PADsize:sub:m = PT_TLS.p_vaddr:sub:m - TCBsize:sub:m`` -which is equivalent to -``PADsize:sub:m = (PT_TLS.p_vaddr - TCBsize):sub:m``. +.. code-block:: c + + PADsize(m) = PT_TLS.p_vaddr(m) - TCBsize(m) + +which is equivalent to: + +.. code-block:: c + + PADsize(m) = (PT_TLS.p_vaddr - TCBsize)(m) TLS Descriptors --------------- @@ -2316,7 +2312,7 @@ location and properties of the TLS symbol to select an optimal resolver function. The static relocations with a prefix of ``R_AARCH64_TLSDESC_`` -targeting TLS symbol ``var``, instruct the static linker to create a +targeting TLS symbol ``var`` instruct the static linker to create a TLS Descriptor for ``var``. The TLS Descriptor for a variable is stored in a pair of consecutive GOT entries, N and N + 1. The GOT entry for N has a dynamic ``R_AARCH64_TLSDESC`` relocation targeting @@ -2460,7 +2456,7 @@ instructions do not have to be contiguous. .. code-block:: asm - adrp xn, :gottprel: var // R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 var + adrp xn, :gottprel:var // R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 var ldr xn, [xn, #:gottprel_lo12:var] // R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC var // offset of var from tp in xn From 5076042bfe7fd67d7263f35d9042030491e602a8 Mon Sep 17 00:00:00 2001 From: Peter Smith Date: Wed, 25 Mar 2026 13:11:15 +0000 Subject: [PATCH 13/13] [sysvabi64] Fixup broken link [NFC] Previous review comments changed name of a section.
--- sysvabi64/sysvabi64.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysvabi64/sysvabi64.rst b/sysvabi64/sysvabi64.rst index 3da7bcf6..33e590e1 100644 --- a/sysvabi64/sysvabi64.rst +++ b/sysvabi64/sysvabi64.rst @@ -2177,7 +2177,7 @@ pointer to the Dynamic Thread Vector (DTV), and the other 8 bytes are reserved for the implementation. Following the TCB and any required alignment padding (defined in -`SystemV AArch64 TLS addressing`_), the TLS Blocks of the modules +`TP, TCB and padding size`_), the TLS Blocks of the modules loaded at process start form the static TLS Block. The memory for the TLS Block is allocated at process start time.