author     Bill Wendling <isanbard@gmail.com>    2012-06-29 09:00:01 +0000
committer  Bill Wendling <isanbard@gmail.com>    2012-06-29 09:00:01 +0000
commit     3bf24bdb1bcdc72aed4e2313aa3dafe3bc6746a0 (patch)
tree       6724278bedef622560021a0d695bf1ef4a5a89f6 /docs
parent     16eeb6f5ebc978b03745177b9ac82684ab1c6932 (diff)
Sphinxify the Atomics documentation.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@159416 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs')
-rw-r--r--  docs/Atomics.html | 569 ----
-rw-r--r--  docs/Atomics.rst  | 441 ++++
2 files changed, 441 insertions(+), 569 deletions(-)
diff --git a/docs/Atomics.html b/docs/Atomics.html
deleted file mode 100644
index 2358f4d..0000000
--- a/docs/Atomics.html
+++ /dev/null
diff --git a/docs/Atomics.rst b/docs/Atomics.rst
new file mode 100644
index 0000000..db279590
--- /dev/null
+++ b/docs/Atomics.rst

.. _atomics:

==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some
minimal intrinsics were provided, and ``volatile`` was used in some cases to
achieve rough semantics in the presence of concurrency.  However, this is
changing; there are now new instructions which are well-defined in the presence
of threads and asynchronous signals, and the model for existing instructions
has been clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++0x ``<atomic>`` header.  (`C++0x draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.)  (`C1x draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables.  (`Java Specification
  <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

* gcc-compatible ``__sync_*`` builtins.  (`Description
  <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++
volatile, which ensures that every volatile load and store happens and is
performed in the stated order.  A couple of examples: if a
SequentiallyConsistent store is immediately followed by another
SequentiallyConsistent store to the same address, the first store can be
erased.  This transformation is not allowed for a pair of volatile stores.  On
the other hand, a non-volatile non-atomic load can be moved across a volatile
load freely, but not an Acquire load.
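A minimal IR sketch of those two store-pair cases (the function names and the
global ``@x`` are hypothetical, standing in for any shared variable):

.. code-block:: llvm

  @x = global i32 0

  define void @seq_cst_pair() {
    ; The first store is dead: a SequentiallyConsistent store to the same
    ; address immediately follows it, so it may be erased.
    store atomic i32 1, i32* @x seq_cst, align 4
    store atomic i32 2, i32* @x seq_cst, align 4
    ret void
  }

  define void @volatile_pair() {
    ; Neither store may be erased: volatile requires both stores to happen,
    ; in this order.
    store volatile i32 1, i32* @x, align 4
    store volatile i32 2, i32* @x, align 4
    ret void
  }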
This document is intended as a guide for anyone either writing a frontend for
LLVM or working on optimization passes for LLVM, covering how to deal with
instructions with special semantics in the presence of concurrency.  This is
not intended to be a precise guide to the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_.  This
section specifically goes into the one optimizer restriction which applies in
concurrent environments, which gets a bit more of an extended description
because any optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise.  Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time.  (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions.  The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by
an atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.  Note
that LLVM's cmpxchg does not provide quite as many options as the C++0x
version.

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load.

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can
be impossible to generate.
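For concreteness, a sketch of what each of these instructions looks like in
the 3.x IR syntax; the orderings and the ``@example`` wrapper are arbitrary
choices for illustration:

.. code-block:: llvm

  define i32 @example(i32* %p) {
    ; Atomic load and store.
    %v = load atomic i32* %p acquire, align 4
    store atomic i32 %v, i32* %p release, align 4

    ; Read-modify-write: atomically add 1, returning the old value.
    %old = atomicrmw add i32* %p, i32 1 monotonic

    ; Compare-and-exchange: store 1 if *%p equals %old; returns the value
    ; loaded, whether or not the exchange happened.
    %seen = cmpxchg i32* %p, i32 %old, i32 1 seq_cst

    ; A fence not attached to any particular memory operation.
    fence seq_cst

    ret i32 %seen
  }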
.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity.  They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release.  (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic.  (This isn't
really a level of atomicity, but is listed here for comparison.)  This is
essentially a regular load or store.  If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible.
  (The precise definition is in `LangRef Memory Model
  <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior.  If your frontend
  is for a "safe" language like Java, use Unordered to load and store any
  shared variable.  Note that NotAtomic volatile loads and stores are not
  properly atomic; do not try to use them as a substitute.  (Per the C/C++
  standards, volatile does provide some limited guarantees around asynchronous
  signals, but atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not.
  See `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store.  This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store.  Backends
  are also expected to generate an i8 store as an i8 store, and not an
  instruction which writes to surrounding bytes.  (If you are writing a
  backend for an architecture which cannot satisfy these restrictions and
  cares about concurrency, please send an email to llvmdev.)

Unordered
---------

Unordered is the lowest level of atomicity.  It essentially guarantees that
races produce somewhat sane results instead of having undefined behavior.  It
also guarantees that the operation is lock-free, so it does not depend on the
data being part of a special atomic structure or on a separate per-process
global lock.  Note that code generation will fail for unsupported atomic
operations; if you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior.  Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable
  for wider loads, like a 64-bit store on ARM.  (A frontend for Java or other
  "safe" languages would normally split a 64-bit store on ARM into two 32-bit
  unordered stores, as sketched after this section.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms
  a single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise.
  Some examples of unsafe optimizations are narrowing an assignment into a
  bitfield, rematerializing a load, and turning loads and stores into a memcpy
  call.  Reordering unordered operations is safe, though, and optimizers
  should take advantage of that because unordered operations are common in
  languages that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored.  A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM).
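A sketch of the split mentioned in the frontend notes, assuming a hypothetical
``@store_long`` helper emitted by a Java-style frontend for 32-bit ARM; racing
readers may see a mix of old and new halves, but never an out-of-thin-air
value:

.. code-block:: llvm

  define void @store_long(i64* %p, i64 %v) {
    ; Split the i64 into two i32 halves.
    %lo = trunc i64 %v to i32
    %shifted = lshr i64 %v, 32
    %hi = trunc i64 %shifted to i32
    ; Store each half with an Unordered 32-bit store.
    %p32 = bitcast i64* %p to i32*
    %p.hi = getelementptr i32* %p32, i32 1
    store atomic i32 %lo, i32* %p32 unordered, align 4
    store atomic i32 %hi, i32* %p.hi unordered, align 4
    ret void
  }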
Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in
synchronization primitives, although it does not provide any general
synchronization.  It essentially guarantees that if you take all the
operations affecting a specific address, a consistent ordering exists.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  The guarantees in terms of synchronization are very weak, so make sure these
  are only used in a pattern which you know is correct.  Generally, these
  would either be used for atomic operations which do not protect other memory
  (like an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the
  relevant memory location (and alias analysis will take advantage of that).
  In addition, it is legal to reorder non-atomic and Unordered loads around
  Monotonic loads.  CSE/DSE and a few other optimizations are allowed, but
  Monotonic operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for unordered for loads and
  stores.  No fences are required.  ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acquire``.  It should also
  be used for C++0x/C1x ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call.  It is
  also possible to move stores from before an Acquire load or
  read-modify-write operation to after it, and move non-Acquire loads from
  before an Acquire operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant
  today except x86 and SPARC) require some sort of fence to maintain the
  Acquire semantics.  The precise fences required vary widely by architecture,
  but for a simple implementation, most architectures provide a barrier which
  is strong enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.).
  Putting such a fence after the equivalent Monotonic operation is sufficient
  to maintain Acquire semantics for a memory operation.
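As one sketch of the Monotonic-plus-fence pairing, a hypothetical spinlock
acquire built from a Monotonic read-modify-write followed by an Acquire fence,
roughly equivalent to performing the ``atomicrmw`` with Acquire ordering
directly:

.. code-block:: llvm

  define void @lock(i32* %l) {
  entry:
    br label %spin
  spin:
    ; Try to take the lock: atomically swap in 1; seeing 0 means it was free.
    %old = atomicrmw xchg i32* %l, i32 1 monotonic
    %won = icmp eq i32 %old, 0
    br i1 %won, label %locked, label %spin
  locked:
    ; Acquire fence: later loads and stores may not be reordered before it,
    ; so accesses to the protected data stay inside the critical section.
    fence acquire
    ret void
  }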
Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call.  It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release.  Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on
  Acquire and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for
loads and Release semantics for stores.  Additionally, it guarantees that a
total ordering exists between all SequentiallyConsistent operations.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile,
  and the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call.  For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release barriers.
  Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads.
  This is usually done by emitting either a full fence before the loads or a
  full fence after the stores; which is preferred varies by architecture.
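One sketch of why the total order matters: in the Dekker-style handshake
below (hypothetical flag globals), Acquire/Release alone would allow both
threads to load 0, but SequentiallyConsistent does not:

.. code-block:: llvm

  @flag_a = global i32 0
  @flag_b = global i32 0

  define i32 @thread_a() {
    ; Thread B runs the mirror image with the two flags swapped.  Because
    ; all four operations fall in one seq_cst total order, at least one of
    ; the two threads must load 1.
    store atomic i32 1, i32* @flag_a seq_cst, align 4
    %b = load atomic i32* @flag_b seq_cst, align 4
    ret i32 %b
  }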
Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic.  This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered.  This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but
  note that they return true for any operation which is volatile or at least
  Monotonic.

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done.  If your pass should
optimize some atomic operations (Unordered operations in particular), make
sure it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* memcpyopt: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores.  It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop.  It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores.  Monotonic stores
  can be DSE'ed in some cases, but it's tricky to reason about, and not
  especially important.

* Folding a load: Any atomic load from a constant global can be
  constant-folded, because it cannot be observed.  Similar reasoning allows
  scalarrepl with atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*``
opcodes.  On architectures which use barrier instructions for all atomic
ordering (like ARM), appropriate fences are split out as the DAG is built.

The MachineMemOperand for all atomic operations is currently marked as
volatile; this is not correct in the IR sense of volatile, but CodeGen handles
anything marked volatile very conservatively.  This should get fixed at some
point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the
other atomic operations which can be represented in IR up to that size.
Backends are expected to implement all those operations, but not operations
which cannot be implemented in a lock-free manner.  It is expected that
backends will give an error when given an operation which cannot be
implemented.  (The LLVM code generator is not very helpful here at the moment,
but hopefully that will change.)

The implementation of atomics on LL/SC architectures (like ARM) is currently a
bit of a mess; there is a lot of copy-pasted code across targets, and the
representation is relatively unsuited to optimization (it would be nice to be
able to optimize loops involving cmpxchg etc.).

On x86, all atomic loads generate a ``MOV``.  SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``.  SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated.  cmpxchg uses the ``LOCK CMPXCHG`` instruction.  ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``.
Depending on the users of the result, some ``atomicrmw`` operations can be
translated into operations like ``LOCK AND``, but that does not work in
general.
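For concreteness, a sketch annotating each operation with the x86 instruction
the paragraph above says it generates (actual instruction selection can vary
with the surrounding code; the function is hypothetical):

.. code-block:: llvm

  define void @x86_examples(i32* %p, i32 %v) {
    %a = load atomic i32* %p seq_cst, align 4      ; MOV
    store atomic i32 %v, i32* %p seq_cst, align 4  ; XCHG
    store atomic i32 %v, i32* %p release, align 4  ; MOV
    %b = atomicrmw add i32* %p, i32 1 seq_cst      ; XADD (with LOCK)
    %c = cmpxchg i32* %p, i32 %b, i32 %v seq_cst   ; LOCK CMPXCHG
    fence seq_cst                                  ; MFENCE
    ret void
  }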
On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
SequentiallyConsistent semantics require barrier instructions for every such
operation.  Loads and stores generate normal instructions.  ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and
``STREX`` on ARM, etc.).  At the moment, the IR does not provide any way to
represent a weak ``cmpxchg`` which would not require a loop.
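Written out at the IR level, the loop for an ``atomicrmw`` with no direct
instruction (``nand``, for example) looks roughly like the following sketch;
targets emit the equivalent at the machine level, using ``LOCK CMPXCHG`` on
x86 or an LL/SC pair on ARM:

.. code-block:: llvm

  define i32 @rmw_nand(i32* %p, i32 %v) {
  entry:
    %init = load i32* %p
    br label %loop
  loop:
    %old = phi i32 [ %init, %entry ], [ %seen, %loop ]
    ; Compute nand = ~(old & v) and try to install it; retry if another
    ; thread changed the location in the meantime.
    %and = and i32 %old, %v
    %new = xor i32 %and, -1
    %seen = cmpxchg i32* %p, i32 %old, i32 %new seq_cst
    %ok = icmp eq i32 %seen, %old
    br i1 %ok, label %done, label %loop
  done:
    ret i32 %old
  }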