========================================
Performance Tips for Frontend Authors
========================================

.. contents::
   :local:
   :depth: 2

Abstract
========

The intended audience of this document is developers of language frontends
targeting LLVM IR. This document is home to a collection of tips on how to
generate IR that optimizes well. As with any optimizer, LLVM has its strengths
and weaknesses. In some cases, surprisingly small changes in the source IR
can have a large effect on the generated code.

Avoid loads and stores of large aggregate type
==================================================

LLVM currently does not optimize loads and stores of large :ref:`aggregate
types <t_aggregate>` (i.e. structs and arrays) well. As an alternative,
consider loading individual fields from memory.

Aggregates that are smaller than the largest (performant) load or store
instruction supported by the targeted hardware are well supported. These can
be an effective way to represent collections of small packed fields.

Prefer zext over sext when legal
========================================

On some architectures (X86_64 is one), sign extension can involve an extra
instruction whereas zero extension can be folded into a load. LLVM will try to
replace a sext with a zext when it can be proven safe, but if you have
information in your source language about the range of an integer value, it
can be profitable to use a zext rather than a sext.

Alternatively, you can :ref:`specify the range of the value using metadata
<range-metadata>` and LLVM can do the sext to zext conversion for you.

Zext GEP indices to machine register width
==================================================

Internally, LLVM often promotes the width of GEP indices to machine register
width. When it does so, it will default to using sign extension (sext)
operations for safety. If your source language provides information about
the range of the index, you may wish to manually extend indices to machine
register width using a zext instruction.
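To make the preceding sections concrete, here is a minimal IR sketch; the
``%struct.point`` type and the ``@get_y`` function are invented for
illustration. Rather than loading an entire ``%struct.point``, only the field
that is actually needed is loaded, and the index (which the hypothetical
source language knows to be non-negative) is widened with a zext and used in
an inbounds GEP.

.. code-block:: llvm

   %struct.point = type { double, double, double }

   define double @get_y(%struct.point* %points, i32 %i) {
   entry:
     ; The frontend knows %i is never negative, so a zext is legal and easier
     ; for the optimizer to reason about than the sext it would insert itself.
     %idx = zext i32 %i to i64
     ; Address only the one field that is needed rather than loading the
     ; whole aggregate.
     %y.addr = getelementptr inbounds %struct.point, %struct.point* %points, i64 %idx, i32 1
     %y = load double, double* %y.addr
     ret double %y
   }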
Other things to consider
==============================

#. Make sure that a DataLayout is provided (this will likely become required in
   the near future, but is certainly important for optimization).

#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
   generally hard for an optimizer, so providing these facts from the frontend
   can be very impactful. For languages which need overflow semantics,
   consider using the :ref:`overflow intrinsics <int_overflow>`. (An IR sketch
   of both approaches follows this list.)

#. Use fast-math flags on floating point operations if legal. If you don't
   need strict IEEE floating point semantics, there are a number of additional
   optimizations that can be performed. This can be highly impactful for
   floating point intensive computations.

#. Use inbounds on geps. This can help to disambiguate some aliasing queries.

#. Add noalias/align/dereferenceable/nonnull to function arguments and return
   values as appropriate.

#. Mark functions as readnone/readonly or noreturn/nounwind when known. The
   optimizer will try to infer these flags, but may not always be able to.
   Manual annotations are particularly important for external functions that
   the optimizer can not analyze.

#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
   analysis); prefer GEPs.

#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
   intrinsics where possible. Common profitable uses are for stack-like data
   structures (thus allowing dead store elimination) and for describing
   lifetimes of allocas (thus allowing smaller stack sizes). (See the sketch
   after this list.)

#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
   otherwise-non-deducible pointer aliasing facts.

#. Use the "most-private" possible linkage types for the functions being defined
   (private, internal, or linkonce_odr preferably).

#. Mark invariant locations using !invariant.load and TBAA's constant flags.

#. Prefer globals over inttoptr of a constant address - this gives you
   dereferenceability information. In MCJIT, use getSymbolAddress to provide
   the actual address.

#. Be wary of ordered and atomic memory operations. They are hard to optimize
   and may not be well optimized by the current optimizer. Depending on your
   source language, you may consider using fences instead.

#. If calling a function which is known to throw an exception (unwind), use
   an invoke with a normal destination which contains an unreachable
   instruction. This form conveys to the optimizer that the call returns
   abnormally. For an invoke which neither returns normally nor requires unwind
   code in the current function, you can use a noreturn call instruction if
   desired. This is generally not required because the optimizer will convert
   an invoke with an unreachable unwind destination to a call instruction.

#. If your language uses range checks, consider using the IRCE pass. It is not
   currently part of the standard pass order.

#. For languages with numerous rarely executed guard conditions (e.g. null
   checks, type checks, range checks), consider adding an extra execution or
   two of LoopUnswitch and LICM to your pass order. The standard pass order,
   which is tuned for C and C++ applications, may not be sufficient to remove
   all dischargeable checks from loops.

#. Use profile metadata to indicate statically known cold paths, even if
   dynamic profiling information is not available. This can make a large
   difference in code placement and thus the performance of tight loops. (See
   the sketch after this list.)

#. When generating code for loops, try to avoid terminating the header block of
   the loop earlier than necessary. If the terminator of the loop header
   block is a loop exiting conditional branch, the effectiveness of LICM will
   be limited for loads not in the header. (This is due to the fact that LLVM
   may not know such a load is safe to speculatively execute and thus can't
   lift an otherwise loop invariant load unless it can prove the exiting
   condition is not taken.) It can be profitable, in some cases, to emit such
   instructions into the header even if they are not used along a rarely
   executed path that exits the loop. This guidance specifically does not
   apply if the condition which terminates the loop header is itself invariant,
   or can be easily discharged by inspecting the loop index variables.

#. In hot loops, consider duplicating instructions from small basic blocks
   which end in highly predictable terminators into their successor blocks.
   If a hot successor block contains instructions which can be vectorized
   with the duplicated ones, this can provide a noticeable throughput
   improvement. Note that this is not always profitable and does involve a
   potentially large increase in code size.

#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
   of predecessors). Among other issues, the register allocator is known to
   perform badly when confronted with such structures. The only exception to
   this guidance is that a unified return block with high in-degree is fine.
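The following sketch illustrates the nsw/overflow advice above; the functions
``@unchecked_add``, ``@checked_add`` and ``@overflow_error`` are hypothetical.
When the source language guarantees an addition cannot overflow, a plain add
carrying the nsw flag is all that is needed; when overflow must be detected,
the overflow intrinsics are generally easier for the optimizer to handle than
a hand-rolled check.

.. code-block:: llvm

   declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)
   declare void @overflow_error() noreturn

   ; The language proved this addition cannot overflow, so say so with nsw.
   define i32 @unchecked_add(i32 %a, i32 %b) {
     %sum = add nsw i32 %a, %b
     ret i32 %sum
   }

   ; The language needs to detect overflow, so use the overflow intrinsic,
   ; which returns both the result and an overflow bit.
   define i32 @checked_add(i32 %a, i32 %b) {
   entry:
     %res = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
     %ovf = extractvalue { i32, i1 } %res, 1
     br i1 %ovf, label %trap, label %ok

   ok:
     %sum = extractvalue { i32, i1 } %res, 0
     ret i32 %sum

   trap:
     call void @overflow_error()
     unreachable
   }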
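Next, a sketch of the lifetime intrinsics mentioned above, using the
non-overloaded form of the intrinsics; ``@use_buffer`` is a hypothetical
callee.

.. code-block:: llvm

   declare void @llvm.lifetime.start(i64, i8* nocapture)
   declare void @llvm.lifetime.end(i64, i8* nocapture)
   declare void @use_buffer(i8*)

   define void @scoped_buffer() {
   entry:
     %buf = alloca [32 x i8]
     %p = bitcast [32 x i8]* %buf to i8*
     ; Mark exactly when the 32 bytes of %buf are live.  Outside this region
     ; the stack slot can be reused, enabling dead store elimination and
     ; smaller stack frames.
     call void @llvm.lifetime.start(i64 32, i8* %p)
     call void @use_buffer(i8* %p)
     call void @llvm.lifetime.end(i64 32, i8* %p)
     ret void
   }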
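Finally, a sketch of statically provided profile metadata; ``@fast_path`` and
``@slow_path_error`` are hypothetical, and the specific weights are arbitrary
(only their ratio matters). Even without dynamic profile data, the frontend
often knows that a guard almost never fails, and branch weights let the
optimizer lay out the code accordingly.

.. code-block:: llvm

   declare void @fast_path(i32)
   declare void @slow_path_error()

   define void @guarded(i32 %x, i1 %failed) {
   entry:
     ; The first weight belongs to the first (true) successor, here the
     ; rarely taken error path.
     br i1 %failed, label %error, label %ok, !prof !0

   ok:
     call void @fast_path(i32 %x)
     ret void

   error:
     call void @slow_path_error()
     ret void
   }

   !0 = !{!"branch_weights", i32 1, i32 2000}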
p.s. If you want to help improve this document, patches expanding any of the
above items into standalone sections of their own, with a more complete
discussion, would be very welcome.

Adding to this document
=======================

If you run across a case that you feel deserves to be covered here, please send
a patch to `llvm-commits
<http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>`_ for review.

If you have questions on these items, please direct them to `llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_. The more relevant
context you are able to give to your question, the more likely it is to be
answered.