aboutsummaryrefslogtreecommitdiffstats
path: root/docs/Atomics.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/Atomics.rst')
-rw-r--r--docs/Atomics.rst441
1 files changed, 441 insertions, 0 deletions
diff --git a/docs/Atomics.rst b/docs/Atomics.rst
new file mode 100644
index 0000000..db279590
--- /dev/null
+++ b/docs/Atomics.rst
@@ -0,0 +1,441 @@
+.. _atomics:
+
+==============================================
+LLVM Atomic Instructions and Concurrency Guide
+==============================================
+
+.. contents::
+ :local:
+
+Introduction
+============
+
+Historically, LLVM has not had very strong support for concurrency; some minimal
+intrinsics were provided, and ``volatile`` was used in some cases to achieve
+rough semantics in the presence of concurrency. However, this is changing;
+there are now new instructions which are well-defined in the presence of threads
+and asynchronous signals, and the model for existing instructions has been
+clarified in the IR.
+
+The atomic instructions are designed specifically to provide readable IR and
+optimized code generation for the following:
+
+* The new C++0x ``<atomic>`` header. (`C++0x draft available here
+ <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
+ <http://www.open-std.org/jtc1/sc22/wg14/>`_.)
+
+* Proper semantics for Java-style memory, for both ``volatile`` and regular
+ shared variables. (`Java Specification
+ <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)
+
+* gcc-compatible ``__sync_*`` builtins. (`Description
+ <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)
+
+* Other scenarios with atomic semantics, including ``static`` variables with
+ non-trivial constructors in C++.
+
+Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
+which ensures that every volatile load and store happens and is performed in the
+stated order. A couple examples: if a SequentiallyConsistent store is
+immediately followed by another SequentiallyConsistent store to the same
+address, the first store can be erased. This transformation is not allowed for a
+pair of volatile stores. On the other hand, a non-volatile non-atomic load can
+be moved across a volatile load freely, but not an Acquire load.
+
+This document is intended to provide a guide to anyone either writing a frontend
+for LLVM or working on optimization passes for LLVM with a guide for how to deal
+with instructions with special semantics in the presence of concurrency. This
+is not intended to be a precise guide to the semantics; the details can get
+extremely complicated and unreadable, and are not usually necessary.
+
+.. _Optimization outside atomic:
+
+Optimization outside atomic
+===========================
+
+The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
+lead to undefined results in a concurrent environment; see `NotAtomic`_. This
+section specifically goes into the one optimizer restriction which applies in
+concurrent environments, which gets a bit more of an extended description
+because any optimization dealing with stores needs to be aware of it.
+
+From the optimizer's point of view, the rule is that if there are not any
+instructions with atomic ordering involved, concurrency does not matter, with
+one exception: if a variable might be visible to another thread or signal
+handler, a store cannot be inserted along a path where it might not execute
+otherwise. Take the following example:
+
+.. code-block:: c
+
+ /* C code, for readability; run through clang -O2 -S -emit-llvm to get
+ equivalent IR */
+ int x;
+ void f(int* a) {
+ for (int i = 0; i < 100; i++) {
+ if (a[i])
+ x += 1;
+ }
+ }
+
+The following is equivalent in non-concurrent situations:
+
+.. code-block:: c
+
+ int x;
+ void f(int* a) {
+ int xtemp = x;
+ for (int i = 0; i < 100; i++) {
+ if (a[i])
+ xtemp += 1;
+ }
+ x = xtemp;
+ }
+
+However, LLVM is not allowed to transform the former to the latter: it could
+indirectly introduce undefined behavior if another thread can access ``x`` at
+the same time. (This example is particularly of interest because before the
+concurrency model was implemented, LLVM would perform this transformation.)
+
+Note that speculative loads are allowed; a load which is part of a race returns
+``undef``, but does not have undefined behavior.
+
+Atomic instructions
+===================
+
+For cases where simple loads and stores are not sufficient, LLVM provides
+various atomic instructions. The exact guarantees provided depend on the
+ordering; see `Atomic orderings`_.
+
+``load atomic`` and ``store atomic`` provide the same basic functionality as
+non-atomic loads and stores, but provide additional guarantees in situations
+where threads and signals are involved.
+
+``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
+atomic store (where the store is conditional for ``cmpxchg``), but no other
+memory operation can happen on any thread between the load and store. Note that
+LLVM's cmpxchg does not provide quite as many options as the C++0x version.
+
+A ``fence`` provides Acquire and/or Release ordering which is not part of
+another operation; it is normally used along with Monotonic memory operations.
+A Monotonic load followed by an Acquire fence is roughly equivalent to an
+Acquire load.
+
+Frontends generating atomic instructions generally need to be aware of the
+target to some degree; atomic instructions are guaranteed to be lock-free, and
+therefore an instruction which is wider than the target natively supports can be
+impossible to generate.
+
+.. _Atomic orderings:
+
+Atomic orderings
+================
+
+In order to achieve a balance between performance and necessary guarantees,
+there are six levels of atomicity. They are listed in order of strength; each
+level includes all the guarantees of the previous level except for
+Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)
+
+.. _NotAtomic:
+
+NotAtomic
+---------
+
+NotAtomic is the obvious, a load or store which is not atomic. (This isn't
+really a level of atomicity, but is listed here for comparison.) This is
+essentially a regular load or store. If there is a race on a given memory
+location, loads from that location return undef.
+
+Relevant standard
+ This is intended to match shared variables in C/C++, and to be used in any
+ other context where memory access is necessary, and a race is impossible. (The
+ precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)
+
+Notes for frontends
+ The rule is essentially that all memory accessed with basic loads and stores
+ by multiple threads should be protected by a lock or other synchronization;
+ otherwise, you are likely to run into undefined behavior. If your frontend is
+ for a "safe" language like Java, use Unordered to load and store any shared
+ variable. Note that NotAtomic volatile loads and stores are not properly
+ atomic; do not try to use them as a substitute. (Per the C/C++ standards,
+ volatile does provide some limited guarantees around asynchronous signals, but
+ atomics are generally a better solution.)
+
+Notes for optimizers
+ Introducing loads to shared variables along a codepath where they would not
+ otherwise exist is allowed; introducing stores to shared variables is not. See
+ `Optimization outside atomic`_.
+
+Notes for code generation
+ The one interesting restriction here is that it is not allowed to write to
+ bytes outside of the bytes relevant to a store. This is mostly relevant to
+ unaligned stores: it is not allowed in general to convert an unaligned store
+ into two aligned stores of the same width as the unaligned store. Backends are
+ also expected to generate an i8 store as an i8 store, and not an instruction
+ which writes to surrounding bytes. (If you are writing a backend for an
+ architecture which cannot satisfy these restrictions and cares about
+ concurrency, please send an email to llvmdev.)
+
+Unordered
+---------
+
+Unordered is the lowest level of atomicity. It essentially guarantees that races
+produce somewhat sane results instead of having undefined behavior. It also
+guarantees the operation to be lock-free, so it do not depend on the data being
+part of a special atomic structure or depend on a separate per-process global
+lock. Note that code generation will fail for unsupported atomic operations; if
+you need such an operation, use explicit locking.
+
+Relevant standard
+ This is intended to match the Java memory model for shared variables.
+
+Notes for frontends
+ This cannot be used for synchronization, but is useful for Java and other
+ "safe" languages which need to guarantee that the generated code never
+ exhibits undefined behavior. Note that this guarantee is cheap on common
+ platforms for loads of a native width, but can be expensive or unavailable for
+ wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
+ languages would normally split a 64-bit store on ARM into two 32-bit unordered
+ stores.)
+
+Notes for optimizers
+ In terms of the optimizer, this prohibits any transformation that transforms a
+ single load into multiple loads, transforms a store into multiple stores,
+ narrows a store, or stores a value which would not be stored otherwise. Some
+ examples of unsafe optimizations are narrowing an assignment into a bitfield,
+ rematerializing a load, and turning loads and stores into a memcpy
+ call. Reordering unordered operations is safe, though, and optimizers should
+ take advantage of that because unordered operations are common in languages
+ that need them.
+
+Notes for code generation
+ These operations are required to be atomic in the sense that if you use
+ unordered loads and unordered stores, a load cannot see a value which was
+ never stored. A normal load or store instruction is usually sufficient, but
+ note that an unordered load or store cannot be split into multiple
+ instructions (or an instruction which does multiple memory operations, like
+ ``LDRD`` on ARM).
+
+Monotonic
+---------
+
+Monotonic is the weakest level of atomicity that can be used in synchronization
+primitives, although it does not provide any general synchronization. It
+essentially guarantees that if you take all the operations affecting a specific
+address, a consistent ordering exists.
+
+Relevant standard
+ This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
+ standards for the exact definition.
+
+Notes for frontends
+ If you are writing a frontend which uses this directly, use with caution. The
+ guarantees in terms of synchronization are very weak, so make sure these are
+ only used in a pattern which you know is correct. Generally, these would
+ either be used for atomic operations which do not protect other memory (like
+ an atomic counter), or along with a ``fence``.
+
+Notes for optimizers
+ In terms of the optimizer, this can be treated as a read+write on the relevant
+ memory location (and alias analysis will take advantage of that). In addition,
+ it is legal to reorder non-atomic and Unordered loads around Monotonic
+ loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
+ operations are unlikely to be used in ways which would make those
+ optimizations useful.
+
+Notes for code generation
+ Code generation is essentially the same as that for unordered for loads and
+ stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
+ to appear as a single operation.
+
+Acquire
+-------
+
+Acquire provides a barrier of the sort necessary to acquire a lock to access
+other memory with normal loads and stores.
+
+Relevant standard
+ This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
+ used for C++0x/C1x ``memory_order_consume``.
+
+Notes for frontends
+ If you are writing a frontend which uses this directly, use with caution.
+ Acquire only provides a semantic guarantee when paired with a Release
+ operation.
+
+Notes for optimizers
+ Optimizers not aware of atomics can treat this like a nothrow call. It is
+ also possible to move stores from before an Acquire load or read-modify-write
+ operation to after it, and move non-Acquire loads from before an Acquire
+ operation to after it.
+
+Notes for code generation
+ Architectures with weak memory ordering (essentially everything relevant today
+ except x86 and SPARC) require some sort of fence to maintain the Acquire
+ semantics. The precise fences required varies widely by architecture, but for
+ a simple implementation, most architectures provide a barrier which is strong
+ enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
+ such a fence after the equivalent Monotonic operation is sufficient to
+ maintain Acquire semantics for a memory operation.
+
+Release
+-------
+
+Release is similar to Acquire, but with a barrier of the sort necessary to
+release a lock.
+
+Relevant standard
+ This corresponds to the C++0x/C1x ``memory_order_release``.
+
+Notes for frontends
+ If you are writing a frontend which uses this directly, use with caution.
+ Release only provides a semantic guarantee when paired with a Acquire
+ operation.
+
+Notes for optimizers
+ Optimizers not aware of atomics can treat this like a nothrow call. It is
+ also possible to move loads from after a Release store or read-modify-write
+ operation to before it, and move non-Release stores from after an Release
+ operation to before it.
+
+Notes for code generation
+ See the section on Acquire; a fence before the relevant operation is usually
+ sufficient for Release. Note that a store-store fence is not sufficient to
+ implement Release semantics; store-store fences are generally not exposed to
+ IR because they are extremely difficult to use correctly.
+
+AcquireRelease
+--------------
+
+AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
+barrier (for fences and operations which both read and write memory).
+
+Relevant standard
+ This corresponds to the C++0x/C1x ``memory_order_acq_rel``.
+
+Notes for frontends
+ If you are writing a frontend which uses this directly, use with caution.
+ Acquire only provides a semantic guarantee when paired with a Release
+ operation, and vice versa.
+
+Notes for optimizers
+ In general, optimizers should treat this like a nothrow call; the the possible
+ optimizations are usually not interesting.
+
+Notes for code generation
+ This operation has Acquire and Release semantics; see the sections on Acquire
+ and Release.
+
+SequentiallyConsistent
+----------------------
+
+SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
+and Release semantics for stores. Additionally, it guarantees that a total
+ordering exists between all SequentiallyConsistent operations.
+
+Relevant standard
+ This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile, and
+ the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.
+
+Notes for frontends
+ If a frontend is exposing atomic operations, these are much easier to reason
+ about for the programmer than other kinds of operations, and using them is
+ generally a practical performance tradeoff.
+
+Notes for optimizers
+ Optimizers not aware of atomics can treat this like a nothrow call. For
+ SequentiallyConsistent loads and stores, the same reorderings are allowed as
+ for Acquire loads and Release stores, except that SequentiallyConsistent
+ operations may not be reordered.
+
+Notes for code generation
+ SequentiallyConsistent loads minimally require the same barriers as Acquire
+ operations and SequentiallyConsistent stores require Release
+ barriers. Additionally, the code generator must enforce ordering between
+ SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
+ is usually done by emitting either a full fence before the loads or a full
+ fence after the stores; which is preferred varies by architecture.
+
+Atomics and IR optimization
+===========================
+
+Predicates for optimizer writers to query:
+
+* ``isSimple()``: A load or store which is not volatile or atomic. This is
+ what, for example, memcpyopt would check for operations it might transform.
+
+* ``isUnordered()``: A load or store which is not volatile and at most
+ Unordered. This would be checked, for example, by LICM before hoisting an
+ operation.
+
+* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicate, but note
+ that they return true for any operation which is volatile or at least
+ Monotonic.
+
+* Alias analysis: Note that AA will return ModRef for anything Acquire or
+ Release, and for the address accessed by any Monotonic operation.
+
+To support optimizing around atomic operations, make sure you are using the
+right predicates; everything should work if that is done. If your pass should
+optimize some atomic operations (Unordered operations in particular), make sure
+it doesn't replace an atomic load or store with a non-atomic operation.
+
+Some examples of how optimizations interact with various kinds of atomic
+operations:
+
+* ``memcpyopt``: An atomic operation cannot be optimized into part of a
+ memcpy/memset, including unordered loads/stores. It can pull operations
+ across some atomic operations.
+
+* LICM: Unordered loads/stores can be moved out of a loop. It just treats
+ monotonic operations like a read+write to a memory location, and anything
+ stricter than that like a nothrow call.
+
+* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
+ be DSE'ed in some cases, but it's tricky to reason about, and not especially
+ important.
+
+* Folding a load: Any atomic load from a constant global can be constant-folded,
+ because it cannot be observed. Similar reasoning allows scalarrepl with
+ atomic loads and stores.
+
+Atomics and Codegen
+===================
+
+Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
+On architectures which use barrier instructions for all atomic ordering (like
+ARM), appropriate fences are split out as the DAG is built.
+
+The MachineMemOperand for all atomic operations is currently marked as volatile;
+this is not correct in the IR sense of volatile, but CodeGen handles anything
+marked volatile very conservatively. This should get fixed at some point.
+
+Common architectures have some way of representing at least a pointer-sized
+lock-free ``cmpxchg``; such an operation can be used to implement all the other
+atomic operations which can be represented in IR up to that size. Backends are
+expected to implement all those operations, but not operations which cannot be
+implemented in a lock-free manner. It is expected that backends will give an
+error when given an operation which cannot be implemented. (The LLVM code
+generator is not very helpful here at the moment, but hopefully that will
+change.)
+
+The implementation of atomics on LL/SC architectures (like ARM) is currently a
+bit of a mess; there is a lot of copy-pasted code across targets, and the
+representation is relatively unsuited to optimization (it would be nice to be
+able to optimize loops involving cmpxchg etc.).
+
+On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
+generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
+fences generate an ``MFENCE``, other fences do not cause any code to be
+generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
+uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
+other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
+on the users of the result, some ``atomicrmw`` operations can be translated into
+operations like ``LOCK AND``, but that does not work in general.
+
+On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
+SequentiallyConsistent semantics require barrier instructions for every such
+operation. Loads and stores generate normal instructions. ``cmpxchg`` and
+``atomicrmw`` can be represented using a loop with LL/SC-style instructions
+which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
+on ARM, etc.). At the moment, the IR does not provide any way to represent a
+weak ``cmpxchg`` which would not require a loop.