path: root/src/mesa/drivers/dri/i965/brw_fs_live_variables.cpp
Commit message (Author, Age, Files, Lines)
* i965/ir: Update several stale comments. (Francisco Jerez, 2016-09-14, 1 file, -6/+6)
| | | | Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
* i965/fs: Add wrapper functions for fs_inst::regs_read and ::regs_written. (Francisco Jerez, 2016-09-14, 1 file, -2/+2)
| | | | | | | | | | | | | | This is in preparation for dropping fs_inst::regs_read and ::regs_written in favor of more accurate alternatives expressed in byte units. The main reason these wrappers are useful is that a number of optimization passes implement dataflow analysis with register granularity, so these helpers will come in handy once we've switched register offsets and sizes to the byte representation. The wrapper functions will also make sure that GRF misalignment (currently neglected by most of the back-end) is taken into account correctly in the calculation of regs_read and regs_written. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
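To illustrate why a register-granularity helper has to account for misalignment once offsets and sizes are in bytes, here is a minimal sketch that counts how many whole registers a byte range touches. The name `regs_touched` and the 32-byte register size are assumptions made for this example; they are not the wrappers added by the commit.

```cpp
#include <cassert>

// Assumed GRF size in bytes (hypothetical constant for this sketch).
constexpr unsigned kRegSize = 32;

// Number of whole registers overlapped by the byte range
// [byte_offset, byte_offset + byte_size). The start of the range may sit
// in the middle of a register, so the misalignment is added before
// rounding up.
inline unsigned regs_touched(unsigned byte_offset, unsigned byte_size)
{
    assert(byte_size > 0);
    const unsigned misalignment = byte_offset % kRegSize;
    return (misalignment + byte_size + kRegSize - 1) / kRegSize;
}
```

An aligned 32-byte access touches one register, while the same access starting 16 bytes into a register straddles two, which is the case a plain `size / 32` calculation would miss.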
* i965/fs: Replace fs_reg::reg_offset with fs_reg::offset expressed in bytes. (Francisco Jerez, 2016-09-14, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fs_reg::offset field in byte units introduced in this patch is a more straightforward alternative to the current register offset representation split between fs_reg::reg_offset and ::subreg_offset. The split representation makes it too easy to forget about one of the offsets while dealing with the other, which has led to multiple back-end bugs in the past. To make the matter worse the unit reg_offset was expressed in was rather inconsistent, for uniforms it would be expressed in either 4B or 16B units depending on the back-end, and for most other things it would be expressed in 32B units. This encodes reg_offset as a new offset field expressed consistently in byte units. Each rvalue reference of reg_offset in existing code like 'x = r.reg_offset' is rewritten to 'x = r.offset / reg_unit', and each lvalue reference like 'r.reg_offset = x' is rewritten to 'r.offset = r.offset % reg_unit + x * reg_unit'. Because the change affects a lot of places and is rather non-trivial to verify due to the inconsistent value of reg_unit, I've tried to avoid making any additional changes other than applying the rewrite rule above in order to keep the patch as simple as possible, sometimes at the cost of introducing obvious stupidity (e.g. algebraic expressions that could be simplified given some knowledge of the context) -- I'll clean those up later on in a second pass. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
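The rewrite rule quoted in the message above can be sketched as a pair of helpers. `reg_sketch`, `get_reg_offset` and `set_reg_offset` are hypothetical names used only for illustration; the patch applies the expressions inline rather than through helpers like these.

```cpp
#include <cassert>

// Hypothetical stand-in for fs_reg; only the byte offset is modeled.
struct reg_sketch {
    unsigned offset = 0; // offset in bytes
};

// Rvalue rule: 'x = r.reg_offset' becomes 'x = r.offset / reg_unit'.
inline unsigned get_reg_offset(const reg_sketch &r, unsigned reg_unit)
{
    assert(reg_unit > 0);
    return r.offset / reg_unit;
}

// Lvalue rule: 'r.reg_offset = x' becomes
// 'r.offset = r.offset % reg_unit + x * reg_unit'.
// The remainder term preserves the sub-register part of the offset.
inline void set_reg_offset(reg_sketch &r, unsigned reg_unit, unsigned x)
{
    assert(reg_unit > 0);
    r.offset = r.offset % reg_unit + x * reg_unit;
}
```

With a 32-byte unit, a register at byte offset 40 reads back as register offset 1, and setting the register offset to 3 yields byte offset 104, keeping the 8-byte sub-register part intact.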
* i965/fs: Track flag register liveness with byte granularity. (Francisco Jerez, 2016-05-27, 1 file, -18/+5)
| | | | | | | | | | | | | | | | | | This is required for correctness in presence of multiple 8-wide flag writes (e.g. 8-wide instructions with a conditional mod set) which update a different portion of the same 16-bit flag subregister. Right now we keep track of flag dataflow with 16-bit granularity and consider flag writes to have killed any previous definition of the same subregister even if the write was less than 16 channels wide, which can cause live flag register updates to be dead code-eliminated incorrectly. Additionally this makes sure that we handle 32-wide flag writes and reads which may span multiple flag subregisters so the current approach of just setting/testing a single bit from the live set wouldn't have worked. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
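A minimal sketch of the idea, assuming one liveness bit per byte of flag register (that is, per group of 8 channels). The helper names `flag_byte_mask` and `write_kills` are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Bitmask of flag bytes covered by an instruction that starts at channel
// 'first_channel' and is 'exec_size' channels wide. Each flag byte holds
// the bits for 8 channels (assumption of this sketch).
inline uint32_t flag_byte_mask(unsigned first_channel, unsigned exec_size)
{
    assert(exec_size > 0);
    const unsigned first_byte = first_channel / 8;
    const unsigned last_byte = (first_channel + exec_size - 1) / 8;
    assert(last_byte < 32);

    uint32_t mask = 0;
    for (unsigned b = first_byte; b <= last_byte; b++)
        mask |= uint32_t(1) << b;
    return mask;
}

// A write fully replaces an earlier definition only if it covers every
// byte that definition left live.
inline bool write_kills(uint32_t live_bytes, uint32_t written_bytes)
{
    return (live_bytes & ~written_bytes) == 0;
}
```

Under these assumptions an 8-wide write to the second half of a 16-bit subregister does not kill a 16-wide definition of it, and a 32-wide access yields a mask spanning two subregisters instead of a single bit.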
* i965: Add src/dst interference for certain instructions with hazards. (Kenneth Graunke, 2015-11-30, 1 file, -35/+1)
| When working on tessellation shaders, I created some vec4 virtual opcodes for creating message headers through a sequence like:
|
|    mov(8) g7<1>UD 0x00000000UD { align1 WE_all 1Q compacted };
|    mov(1) g7.5<1>UD 0x00000100UD { align1 WE_all };
|    mov(1) g7<1>UD g0<0,1,0>UD { align1 WE_all compacted };
|    mov(1) g7.3<1>UD g8<0,1,0>UD { align1 WE_all };
|
| This is done in the generator since the vec4 backend can't handle align1 regioning. From the visitor's point of view, this is a single opcode:
|
|    hs_set_output_urb_offsets vgrf7.0:UD, 1U, vgrf8.xxxx:UD
|
| Normally, there's no hazard between sources and destinations - an instruction (naturally) reads its sources, then writes the result to the destination. However, when the virtual instruction generates multiple hardware instructions, we can get into trouble. In the above example, if the register allocator assigned vgrf7 and vgrf8 to the same hardware register, then we'd clobber the source with 0 in the first instruction, and read back the wrong value in the last one.
|
| It occurred to me that this is exactly the same problem we have with SIMD16 instructions that use W/UW or B/UB types with 0 stride. The hardware implicitly decodes them as two SIMD8 instructions, and with the overlapping regions, the first would clobber the second.
|
| Previously, we handled that by incrementing the live range end IP by 1, which works, but is excessive: the next instruction doesn't actually care about that. It might also be the end of control flow. This might keep values alive too long. What we really want is to say "my source and destinations interfere".
|
| This patch creates new infrastructure for doing just that, and teaches the register allocator to add interference when there's a hazard. For my vec4 case, we can determine this by switching on opcodes. For the SIMD16 case, we just move the existing code there.
|
| I audited our existing virtual opcodes that generate multiple instructions; I believe FS_OPCODE_PACK_HALF_2x16_SPLIT needs this treatment as well, but no others.
|
| v2: Rebased by mattst88.
|
| Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
| Reviewed-by: Matt Turner <mattst88@gmail.com>
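The "my source and destinations interfere" idea can be sketched as below. Everything here is hypothetical: `inst_sketch`, its `src_dst_hazard` flag, and the set-of-pairs interference graph are stand-ins chosen for brevity, whereas the patch itself decides the hazard by switching on opcodes and works on the real register allocator graph.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction: virtual register numbers only.
struct inst_sketch {
    int dst;
    std::vector<int> srcs;
    bool src_dst_hazard; // assumed to be precomputed by the caller
};

// Interference graph as a set of unordered vgrf pairs (smaller index first).
using interference_set = std::set<std::pair<int, int>>;

// When the instruction has a source/destination hazard, make the
// destination interfere with every source so the allocator cannot give
// them the same hardware register.
inline void add_hazard_interference(const inst_sketch &inst,
                                    interference_set &graph)
{
    if (!inst.src_dst_hazard)
        return;

    for (int src : inst.srcs) {
        assert(src >= 0 && inst.dst >= 0);
        if (src != inst.dst)
            graph.insert(std::make_pair(std::min(src, inst.dst),
                                        std::max(src, inst.dst)));
    }
}
```

Compared with extending the live range end by one instruction, an explicit edge constrains only the pair of registers involved and leaves the following instruction unaffected.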
* i965: Rename GRF to VGRF. (Matt Turner, 2015-11-13, 1 file, -3/+3)
| | | | | | | | | | The 2-bit hardware register file field is ARF, GRF, MRF, IMM. Rename GRF to VGRF (virtual GRF) so that we can reuse the GRF name to mean an assigned general purpose register. Reviewed-by: Emil Velikov <emil.velikov@collabora.co.uk> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Fix indentation in fs_live_variables::compute_start_end (Iago Toral Quiroga, 2015-10-14, 1 file, -9/+8)
| | | | Reviewed-by: Francisco Jerez <currojerez@riseup.net>
* i965/fs_live_variables: Do liveness analysis bottom-to-top (Jason Ekstrand, 2015-06-24, 1 file, -19/+19)
| From Muchnick's Advanced Compiler Design and Implementation:
|
|    "To determine which variables are live at each point in a flowgraph, we perform a backward data-flow analysis"
|
| Previously, we were walking the blocks forwards and updating the livein and then the liveout. However, the livein calculation depends on the liveout and the liveout depends on the successor blocks. The net result is that it takes one full iteration to go from liveout to livein and then another full iteration to propagate to the predecessors. This works out to an O(n^2) computation where n is the number of blocks. If we run things in the other order, it's O(nl) where l is the maximum loop depth which is practically bounded by 3.
|
| On my HSW desktop, one particular shadertoy test gets a 20% improvement in compile times:
|
|        N     Min     Max  Median      Avg      Stddev
|    x  10  15.965  16.884  16.026  16.1822  0.34736846
|    +  10  12.813  13.052  12.876  12.8891  0.06913666
|    Difference at 95.0% confidence
|        -3.2931 +/- 0.235316
|        -20.3501% +/- 1.45417%
|        (Student's t, pooled s = 0.250444)
|
| Reviewed-by: Matt Turner <mattst88@gmail.com>
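The effect of the walk order can be sketched with a toy solver that visits blocks from last to first. `block_sketch` and `solve_liveness_backward` are hypothetical names, and a single 32-bit word stands in for the real bitsets; this is a sketch of the general backward dataflow scheme, not the code in this file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical basic block: one bit per variable, successors by index.
struct block_sketch {
    uint32_t def = 0, use = 0;
    uint32_t livein = 0, liveout = 0;
    std::vector<int> succs;
};

// Iterates to a fixed point, visiting blocks bottom-to-top so that each
// block sees the livein of its successors computed in the same pass.
// Returns the number of passes, including the final pass that detects
// that nothing changed.
inline int solve_liveness_backward(std::vector<block_sketch> &blocks)
{
    int passes = 0;
    bool progress = true;

    while (progress) {
        progress = false;
        passes++;

        for (int i = static_cast<int>(blocks.size()) - 1; i >= 0; i--) {
            block_sketch &b = blocks[i];

            // liveout is the union of the successors' livein sets.
            uint32_t out = b.liveout;
            for (int s : b.succs) {
                assert(s >= 0 && s < static_cast<int>(blocks.size()));
                out |= blocks[s].livein;
            }

            // livein is what the block uses plus what passes through it.
            const uint32_t in = b.use | (out & ~b.def);

            if (out != b.liveout || in != b.livein) {
                b.liveout = out;
                b.livein = in;
                progress = true;
            }
        }
    }
    return passes;
}
```

On a straight-line chain of four blocks where the first defines a variable and the last uses it, this order propagates liveness through the whole chain in one pass and needs only a second pass to confirm the fixed point.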
* i965/fs: Remove dependency of fs_inst on the visitor class. (Francisco Jerez, 2015-02-10, 1 file, -1/+1)
| | | | | | The fs_visitor argument of fs_inst::regs_read() wasn't used at all. Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965: Factor out virtual GRF allocation to a separate object. (Francisco Jerez, 2015-02-10, 1 file, -4/+4)
| | | | | | | | | | | | | Right now virtual GRF book-keeping and allocation is performed in each visitor class separately (among other hundred different things), leading to duplicated logic in each visitor and preventing layering as it forces any code that manipulates i965 IR and needs to allocate virtual registers to depend on the specific visitor that happens to be used to translate from GLSL IR. v2: Use realloc()/free() to allocate VGRF book-keeping arrays (Connor). Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965/fs: Use const fs_reg & rather than a copy or pointer. (Matt Turner, 2014-12-01, 1 file, -10/+4)
| | | | | | Also while we're touching var_from_reg, just make it an inline function. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Track liveness of the flag register. (Matt Turner, 2014-12-01, 1 file, -0/+36)
| | | | Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Use local pointer to block_data in live intervals. (Matt Turner, 2014-12-01, 1 file, -24/+30)
| | | | | | | The next patch will be simplified because of this, and makes reading the code a lot easier. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Use instruction execution sizes instead of heuristics (Jason Ekstrand, 2014-09-30, 1 file, -5/+5)
| | | | | Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965/fs_live_variables: Use var_from_vgrf insead of repeating the calculation (Jason Ekstrand, 2014-09-30, 1 file, -2/+2)
| | | | | Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965: Remove now unneeded calls to calculate_cfg(). (Matt Turner, 2014-09-24, 1 file, -1/+0)
| | | | | | | Now that nothing invalidates the CFG, we can calculate_cfg() immediately after emit_fb_writes()/emit_thread_end() and never again. Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
* i965: Remove cfg-invalidating parameter from invalidate_live_intervals. (Matt Turner, 2014-09-24, 1 file, -4/+1)
| | | | | | Everything has been converted to preserve the CFG. Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
* i965: Add invalidate_cfg parameter to invalidate_live_intervals(). (Matt Turner, 2014-08-22, 1 file, -2/+3)
| | | | | | | Will let us avoid invalidating the CFG if the optimization pass has removed instructions using the new basic block methods. Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
* i965: Add and use foreach_block macro. (Matt Turner, 2014-08-18, 1 file, -26/+25)
| | | | | Use this as an opportunity to rename 'block_num' to 'num'. block->num is clear, and block->block_num has always been redundant.
* i965: Add cfg to backend_visitor. (Matt Turner, 2014-07-21, 1 file, -7/+5)
| | | | Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
* i965/fs: Pass cfg to calculate_live_intervals(). (Matt Turner, 2014-07-01, 1 file, -4/+8)
| | | | | | | We've often created the CFG immediately before, so use it when available. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
* i965: Add and use foreach_inst_in_block macros. (Matt Turner, 2014-07-01, 1 file, -4/+1)
| | | | Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
* i965/fs: Loop from 0 to inst->sources, not 0 to 3. (Matt Turner, 2014-06-01, 1 file, -1/+1)
| | | | | | Reviewed-by: Chris Forbes <chrisf@ijw.co.nz> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/cfg: Embed exec_node in bblock_link. (Matt Turner, 2014-05-15, 1 file, -2/+1)
| | | | | | | In order to remove bblock_link's inheritance of exec_node. Also makes linked list walk code much nicer. Acked-by: Eric Anholt <eric@anholt.net>
* i965: Generalize the pixel_x/y workaround for all UW types. (Eric Anholt, 2014-05-12, 1 file, -4/+4)
| | | | | | | | | | | | | | | This is the only case where a fs_reg in brw_fs_visitor is used during optimization/code generation, and it meant that optimizations had to be careful to not move pixel_x/y's register number without updating it. Additionally, it turns out we had a couple of other UW values that weren't getting this treatment (like gl_SampleID), so this more general fix is probably a good idea (though I wasn't able to replicate problems with either pixel_[xy]'s values or gl_SampleID, even when telling the register allocator to reuse registers immediately) Reviewed-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Fix vgrf0 live interval when no interpolation was done. (Eric Anholt, 2014-04-08, 1 file, -2/+4)
| | | | | | | | | | When you've got a simple solid-color shader that doesn't generate pixel_x/y interpolation, we were deciding that the first vgrf was both the undefined pixel_x and pixel_y, and extending its live interval to avoid the stride problem. That tricked other optimization that tries to see if a particular instruction is the last use of a variable. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Remove fs_reg::smear. (Francisco Jerez, 2014-02-12, 1 file, -1/+1)
| | | | | | | | | The same effect can be achieved using a combination of ::stride and ::subreg_offset. Remove the less flexible ::smear to keep the data members of fs_reg orthogonal. Reviewed-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Paul Berry <stereotype441@gmail.com>
* i965/fs: Add support for specifying register horizontal strides. (Francisco Jerez, 2014-02-12, 1 file, -1/+1)
| | | | | | | | | | | v2: Some improvements for copy propagation with non-contiguous register strides and mismatching types. v3: Add example of the situation that the copy propagation changes are intended to avoid. Clarify that 'fs_reg::apply_stride()' is expected to work with zero strides too. Reviewed-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Paul Berry <stereotype441@gmail.com>
* i965/fs: Assert that var < num_vars. (Matt Turner, 2014-01-21, 1 file, -0/+2)
| | | | | | Helped to track down a problem in a version of the next commit. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965/fs: Fix the example about overwriting uniforms in SIMD16. (Matt Turner, 2014-01-21, 1 file, -5/+5)
| | | | | | | mov takes only a single source argument. Example instruction inexplicably changed from add to mov in commit f10f5e49. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965/cfg: Clean up cfg_t constructors. (Matt Turner, 2013-12-04, 1 file, -1/+1)
| | | | | | | parent_mem_ctx was unused since db47074a, so remove the two wrappers around create() and make create() the constructor. Reviewed-by: Eric Anholt <eric@anholt.net>
* i965: Handle deallocation of some private ralloc contexts explicitly. (Francisco Jerez, 2013-10-29, 1 file, -1/+1)
| | | | | | | These ralloc contexts belong to a specific object and are being deallocated manually from the class destructor. Now that we've hooked up destructors to ralloc there's no reason for them to be children of any other context, and doing so might lead to double frees under some circumstances. The class destructor has all the responsibility of freeing class memory resources now.
* i965: s/Muchnik/Muchnick/. (Matt Turner, 2013-10-25, 1 file, -1/+1)
| | | | Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/fs: Convert gen7 to using GRFs for texture messages. (Eric Anholt, 2013-10-10, 1 file, -8/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Looking at Lightsmark's shaders, the way we used MRFs (or in gen7's case, GRFs) was bad in a couple of ways. One was that it prevented compute-to-MRF for the common case of a texcoord that gets used exactly once, but where the texcoord setup all gets emitted before the texture calls (such as when it's a bare fragment shader input, which gets interpolated before processing main()). Another was that it introduced a bunch of dependencies that constrained scheduling, and forced waits for texture operations to be done before they are required. For example, we can now move the compute-to-MRF interpolation for the second texture send down after the first send. The downside is that this generally prevents remove_duplicate_mrf_writes() from doing anything, whereas previously it avoided work for the case of sampling from the same texcoord twice. However, I suspect that most of the win that originally justified that code was in avoiding the WAR stall on the first send, which this patch also avoids, rather than the small cost of the extra instruction. We see instruction count regressions in shaders in unigine, yofrankie, savage2, hon, and gstreamer. Improves GLB2.7 performance by 0.633628% +/- 0.491809% (n=121/125, avg of ~66fps, outliers below 61 dropped). Improves openarena performance by 1.01092% +/- 0.66897% (n=425). No significant difference on Lightsmark (n=44). v2: Squash in the fix for register unspilling for send-from-GRF, fixing a segfault in lightsmark. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Acked-by: Matt Turner <mattst88@gmail.com>
* i965/fs: Use per-channel interference for register_coalesce_2(). (Eric Anholt, 2013-10-10, 1 file, -0/+13)
| | | | | | | | | | This will let us coalesce into texture-from-GRF arguments, which would otherwise be prevented due to the live interval for the whole vgrf extending across all the MOVs setting up the channels of the message v2 (Kenneth Graunke): Rebase for renames. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Keep a copy of the live variables class around. (Eric Anholt, 2013-10-10, 1 file, -10/+11)
| | | | | | | | | | | Now optimization passes will be able to look at the per-channel ranges. v2: Rebase on various optimization pass changes. v3 (Kenneth Graunke): Rename live_variables to live_intervals; split introduction of invalidate_live_intervals() into a separate patch. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Remove start/end aliases in compute_live_intervals(). (Kenneth Graunke, 2013-10-10, 1 file, -8/+6)
| | | | | | | | | | | | | In compute_live_intervals(), start and end are shorter names for the virtual_grf_start and virtual_grf_end class members. Now that the fs_live_intervals class has arrays named start and end which are indexed by var, rather than VGRF, reusing the name is confusing. Plus, most of the code has been factored out, so using the long names isn't as inconvenient. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/fs: Track live variable ranges on a per-channel level. (Eric Anholt, 2013-10-10, 1 file, -74/+76)
| | | | | | | | | | | This is the information we'll actually use to replace the virtual_grf_start[]/end[] arrays. No change in shader-db. v2 (Kenneth Graunke): Rebase; minor comment updates. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Factor def[]/use[] setup out to a separate function. (Eric Anholt, 2013-10-10, 1 file, -16/+41)
| | | | | | | | | These blocks are about to grow some more code, and the indentation was getting out of hand. v2 (Kenneth Graunke): Rebase, minor typo fixes and style changes. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Create a helper function for invalidating live intervals. (Kenneth Graunke, 2013-10-10, 1 file, -0/+6)
| | | | | | | | | | For now, this simply sets live_intervals_valid = false, but in the future it will do something more sophisticated. Based on a patch by Eric Anholt. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/fs: Do live variables dataflow analysis on a per-channel level. (Eric Anholt, 2013-10-10, 1 file, -17/+61)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This significantly improves our handling of VGRFs of size > 1. Previously, we only marked VGRFs as def'd if the whole register was written by a single instruction. Large VGRFs which were written piecemeal would not be considered def'd at all, even if they were ultimately completely written. Without being def'd, these were then marked "live in" to the basic block, often extending the range to preceding blocks and sometimes even the start of the program. The new per-component tracking gives more accurate live intervals, which makes register coalescing more effective. In the future, this should help with texturing from GRFs on Gen7+. A sampler message might be represented by a 2-register VGRF which holds the texture coordinates. If those are incoming varyings, they'll be produced by two PLN instructions, which are piecemeal writes. No reduction in shader-db instruction counts. However, code which prints the live interval ranges does show that some VGRFs now have smaller (and more correct) live intervals. v2: Rebase on current send-from-GRF code requiring adding extra use[]s. v3: Rebase on live intervals fix to include defs in the end of the interval. v4 (Kenneth Graunke): Rebase; split off a few preparatory patches; add lots of comments; minor style changes; rewrite commit message. v5 (Eric Anholt): whitespace nit. Written-by: Eric Anholt <eric@anholt.net> [v1-3] Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> [v4] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Eric Anholt <eric@anholt.net> (v4)
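The per-component tracking described above can be sketched as follows. `live_vars_sketch` and its members are hypothetical, apart from the idea of a `var_from_vgrf` table, whose name appears elsewhere in this log; each register-sized component of a VGRF gets its own "var" index so that piecemeal writes are recorded individually.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-component liveness bookkeeping.
struct live_vars_sketch {
    std::vector<int> sizes;         // size of each VGRF in registers
    std::vector<int> var_from_vgrf; // first var index of each VGRF
    std::vector<bool> def;          // one def flag per var

    explicit live_vars_sketch(const std::vector<int> &vgrf_sizes)
        : sizes(vgrf_sizes)
    {
        int num_vars = 0;
        for (int size : sizes) {
            assert(size > 0);
            var_from_vgrf.push_back(num_vars);
            num_vars += size;
        }
        def.assign(num_vars, false);
    }

    // Map (VGRF, register offset) to a flat var index.
    int var(int vgrf, int reg_offset) const
    {
        assert(vgrf >= 0 && vgrf < static_cast<int>(sizes.size()));
        assert(reg_offset >= 0 && reg_offset < sizes[vgrf]);
        return var_from_vgrf[vgrf] + reg_offset;
    }

    // Record a write to a single component of a VGRF.
    void mark_def(int vgrf, int reg_offset)
    {
        def[var(vgrf, reg_offset)] = true;
    }

    // True once every component of the VGRF has been written.
    bool fully_defined(int vgrf) const
    {
        for (int i = 0; i < sizes[vgrf]; i++) {
            if (!def[var(vgrf, i)])
                return false;
        }
        return true;
    }
};
```

In this model a 2-register VGRF written by two separate instructions ends up fully defined, whereas whole-register tracking would never have marked it as def'd.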
* i965/fs: Rename num_vars to num_vgrfs in live interval analysis. (Kenneth Graunke, 2013-10-10, 1 file, -7/+6)
| | | | | | | | | num_vars was shorthand for the number of virtual GRFs. num_vgrfs is a bit clearer. Plus, the next patch will introduce "vars" which are distinct from vgrfs. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/fs: Short-circuit a loop in live variable analysis. (Kenneth Graunke, 2013-10-10, 1 file, -5/+6)
| | | | | | | | This has no functional effect, but should make subsequent changes a little simpler. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/fs: Fix test for smearing enabled on an instruction. (Eric Anholt, 2013-05-29, 1 file, -1/+1)
| We were expanding the live range too far, breaking register_coalesce_2() and compute_to_mrf() on 16-wide shaders.
|
| Turning it back on improves GLB2.7 performance by 0.239355% +/- 0.0850649% (n=398).
|
| shader-db stats are:
|    total instructions in shared programs: 1627211 -> 1609262 (-1.10%)
|    instructions in affected programs: 450351 -> 432402 (-3.99%)
|
| While 33 new 16-wide shaders are gained, 70 are lost. Despite that, tropics (the app that lost the most 16-wide) shows a .41% +/- .16% (n=7/8, first-run outlier removed) performance improvement on my HSW.
|
| Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/fs: Make virtual grf live intervals actually cover their used range. (Eric Anholt, 2013-05-09, 1 file, -55/+21)
| Previously, we would sometimes not consider a write to a register to extend the end of the interval, nor would we consider a read before a write to extend the start. This made for a bunch of complicated logic related to how to treat the results when dead code might be present. Instead, just extend the interval and fix dead code elimination to know how to remove it.
|
| Interestingly, this actually results in a tiny bit more optimization:
|    total instructions in shared programs: 1391220 -> 1390799 (-0.03%)
|    instructions in affected programs: 14037 -> 13616 (-3.00%)
|
| v2: Fix a theoretical problem with the simd16 workaround if dst == src, where we would revert the bump of the live range.
|
| Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v1)
* i965/fs: Add a helper function for checking for partial register updates. (Eric Anholt, 2013-04-12, 1 file, -3/+1)
| | | | | | | | These checks were all over, and every time I wrote one I had to try to decide again what the cases were for partial updates. v2: Fix inadvertent reladdr check removal. Reviewed-by: Matt Turner <mattst88@gmail.com>
* mesa: Add a macro to bitset for determining bitset size. (Eric Anholt, 2013-04-12, 1 file, -2/+1)
| | | | Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965/fs: Bake regs_written into the IR instead of recomputing it later. (Eric Anholt, 2013-04-01, 1 file, -1/+1)
| | | | | | | | | For sampler messages, it depends on the target gen, and on gen4 SIMD16-sampler-on-SIMD8-execution we were returning 4 instead of 8 like we should. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> NOTE: This is a candidate for the 9.1 branch.
* i965/fs: Improve live variables calculation performance. (Eric Anholt, 2013-03-11, 1 file, -22/+26)
| | | | | | | | | | | | We can execute way fewer instructions by doing our boolean manipulation on an "int" of bits at a time, while also reducing our working set size. Reduces compile time of L4D2's slowest shader from 4s to 1.1s (-72.4% +/- 0.2%, n=10) v2: Remove redundant masking (noted by Ken) Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
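The word-at-a-time approach can be sketched as a merge step that ORs one bitset into another a whole word per iteration instead of testing individual bits. `bitset_word` and `merge_live_words` are hypothetical names for this example and are not the Mesa bitset macros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using bitset_word = uint32_t;

// OR every word of 'src' into 'dst'. Returns true if any bit was newly
// set, which is what a fixed-point dataflow loop needs in order to know
// whether another pass is required.
inline bool merge_live_words(std::vector<bitset_word> &dst,
                             const std::vector<bitset_word> &src)
{
    assert(dst.size() == src.size());
    bool changed = false;

    for (std::size_t i = 0; i < dst.size(); i++) {
        const bitset_word new_bits = src[i] & ~dst[i];
        if (new_bits) {
            dst[i] |= new_bits;
            changed = true;
        }
    }
    return changed;
}
```

Each loop iteration handles 32 variables at once, so both the instruction count and the working set shrink compared with a per-variable boolean array.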
* i965/fs: Fix register allocation for uniform pull constants in 16-wide. (Eric Anholt, 2013-03-11, 1 file, -23/+31)
| | | | | | | | | | | | | | | We were allowing a compressed instruction to write a register that contained the last use of a uniform pull constant (either UBO load or push constant spillover), so it would get half its values smashed. Since we need to see the actual instruction to decide this, move the pre-gen6 pixel_x/y logic here, which should improve the performance of register allocation since virtual_grf_interferes() is called more than once per instruction. NOTE: This is a candidate for the stable branches. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61317 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>