path: root/src/gallium/drivers/vc4/vc4_register_allocate.c
Commit message | Author | Age | Files | Lines
* vc4: Fix register class handling of DDX/DDY arguments. | Eric Anholt | 2016-11-24 | 1 | -1/+1
  I had this exactly backwards, but apparently the piglit tests were all landing in r0-r3 anyway.
  Cc: "13.0" <mesa-stable@lists.freedesktop.org>
  (cherry picked from commit 977d8b526b983c8d19df00af224033389f8ab7c8)
* vc4: Don't abort when a shader compile fails. | Eric Anholt | 2016-11-23 | 1 | -1/+2
  It's much better to just skip the draw call entirely. Getting this information out of register allocation will also be useful for implementing threaded fragment shaders, which will need to retry non-threaded if RA fails.
  Cc: <mesa-stable@lists.freedesktop.org>
  (cherry picked from commit 4d019bd703e7c20d56d5b858577607115b4926a3)
* vc4: Add register allocation support for MUL output rotation. | Eric Anholt | 2016-08-25 | 1 | -0/+13
  We need the source to be in r0-r3, so make a new register class for it. It will be up to the surrounding passes to make sure that the r0-r3 allocation of its source won't conflict with any other class requirements on that temp.
* vc4: Implement live intervals using a CFG. | Eric Anholt | 2016-07-12 | 1 | -38/+8
  Right now our CFG is always a trivial single basic block, but that will change when we enable loops.
* vc4: Add a "qir_for_each_inst_inorder" macro and use it in many places. | Eric Anholt | 2016-07-12 | 1 | -2/+2
  We have the prior list_foreach() all over the code, but I need to move where instructions live as part of adding support for control flow. Start by just converting to a helper iterator macro. (The simpler "qir_for_each_inst()" will be used for the for-each-inst-in-a-block iterator macro later.)
* vc4: Switch the unpack ops to being unpack flags on a mov. | Eric Anholt | 2015-10-26 | 1 | -9/+15
  This paves the way for copy propagating our unpacks. We end up with a small change on shader-db:
  total instructions in shared programs: 89390 -> 89251 (-0.16%)
  instructions in affected programs: 19041 -> 18902 (-0.73%)
  which appears to be because we no longer convert MOVs for an FMAX dst, r4.unpack, r4.unpack (instead of the previous MOV dst, r4.unpack), and this ends up with a slightly better schedule.
* vc4: Fix up the test for whether the unpack can be from r4. | Eric Anholt | 2015-10-26 | 1 | -8/+2
  We can do 16a/16b from float as well. No difference on shader-db.
* vc4: Actually allow math results to allocate into r4. | Eric Anholt | 2015-08-21 | 1 | -1/+6
  I switched us to tracking whether the results *could* go to r4, but then didn't make a separate register class for the class bits that included r4. Switch the "any" class to actually be "any", and name the "any but r4" class more appropriately.
  total instructions in shared programs: 96798 -> 94680 (-2.19%)
  instructions in affected programs: 62736 -> 60618 (-3.38%)
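  The class-bit idea in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the bit names, the enum, and choose_reg_class() are hypothetical and are not taken from vc4_register_allocate.c. Each temporary starts with every bit set, instructions clear the bits they cannot tolerate, and the surviving bit pattern selects a register class.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical class bits: one bit per place a temporary may live. */
#define CLASS_BIT_A         (1u << 0)  /* regfile A (assumed name) */
#define CLASS_BIT_B_OR_ACC  (1u << 1)  /* regfile B or r0-r3 (assumed name) */
#define CLASS_BIT_R4        (1u << 2)  /* the r4 accumulator (assumed name) */
#define CLASS_BITS_ALL      (CLASS_BIT_A | CLASS_BIT_B_OR_ACC | CLASS_BIT_R4)

/* Hypothetical register classes corresponding to the bit patterns. */
enum reg_class {
    REG_CLASS_ANY,         /* every location, including r4 */
    REG_CLASS_ANY_BUT_R4,  /* every location except r4 */
    REG_CLASS_A_ONLY,      /* regfile A only */
    REG_CLASS_INVALID,     /* no class exists for this bit pattern */
};

/* Map the bits that survived constraint propagation to a class. */
static enum reg_class
choose_reg_class(uint8_t class_bits)
{
    /* Only the three known bits may be set. */
    assert((class_bits & ~CLASS_BITS_ALL) == 0);

    switch (class_bits) {
    case CLASS_BITS_ALL:
        return REG_CLASS_ANY;
    case CLASS_BIT_A | CLASS_BIT_B_OR_ACC:
        return REG_CLASS_ANY_BUT_R4;
    case CLASS_BIT_A:
        return REG_CLASS_A_ONLY;
    default:
        return REG_CLASS_INVALID;
    }
}
```

  The point of the sketch is that a pattern with the r4 bit still set needs its own class. Without one, a temporary that is allowed in r4 would be mapped to the same class as one that is not, and the tracking would have no effect.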
* vc4: Fold the 16-bit integer pack into the instructions generating it. | Eric Anholt | 2015-08-21 | 1 | -6/+7
  total instructions in shared programs: 97580 -> 96798 (-0.80%)
  instructions in affected programs: 52826 -> 52044 (-1.48%)
* vc4: Allow unpack_8[abcd]_f's src to stay in r4. | Eric Anholt | 2015-08-20 | 1 | -1/+15
  I had QPU emit code to do it, but forgot to flag the register class.
  total instructions in shared programs: 97974 -> 97590 (-0.39%)
  instructions in affected programs: 25291 -> 24907 (-1.52%)
* vc4: Switch QPU_PACK_SCALED to be two non-SSA instructions. | Eric Anholt | 2015-08-20 | 1 | -1/+2
  total instructions in shared programs: 98159 -> 98136 (-0.02%)
  instructions in affected programs: 12279 -> 12256 (-0.19%)
* vc4: Allow QIR registers to be non-SSA. | Eric Anholt | 2015-08-20 | 1 | -2/+3
  Now that we have NIR, most of the optimization we still need to do is peepholes on instruction selection rather than general dataflow operations. This means we want to be able to have QIR be a lot closer to the actual QPU instructions, just with virtual registers. Allowing multiple instructions writing the same register opens up a lot of possibilities.
* util/ra: Make allocating conflict lists optional | Jason Ekstrand | 2015-08-18 | 1 | -1/+1
  Since i965 is now using make_reg_conflicts_transitive and doesn't need q-value computations, they are disabled on i965. They are enabled everywhere else so that they get the old behavior. This reduces the time spent in eglInitialize() on BDW by around 10-15%.
  Reviewed-by: Eric Anholt <eric@anholt.net>
* vc4: Make r4-writes implicitly move to a temp, and allocate temps to r4. | Eric Anholt | 2015-08-04 | 1 | -19/+64
  Previously, SFU values always moved to a temporary, and TLB color reads and texture reads always lived in r4. Instead, we can have these results just be normal temporaries, and the register allocator can leave the values in r4 when they don't interfere with anything else using r4.
  shader-db results:
  total instructions in shared programs: 100809 -> 100040 (-0.76%)
  instructions in affected programs: 42383 -> 41614 (-1.81%)
* vc4: Add better debug for register allocation failure. | Eric Anholt | 2015-07-14 | 1 | -1/+5
* vc4: Convert from simple_list.h to list.h | Eric Anholt | 2015-05-29 | 1 | -7/+2
  list.h is a nicer and more familiar set of list functions/macros.
* vc4: Move the tests for src needing to be an A register to vc4_qir.c. | Eric Anholt | 2015-01-15 | 1 | -17/+5
  I want it from another location.
* vc4: Avoid the save/restore of r3 for raddr conflicts, just use ra31. | Eric Anholt | 2015-01-11 | 1 | -2/+2
  Turns out this was harmful to code quality:
  total instructions in shared programs: 39487 -> 38845 (-1.63%)
  instructions in affected programs: 22522 -> 21880 (-2.85%)
  This costs us yet another register, which is painful since it means more programs might fail to compile. However, the alternative was causing us trouble where we'd save/restore r3 while it contained a MIN-ed direct texture offset, causing the kernel to fail to validate our shaders (such as in GLB2.7).
* vc4: Add support for 16-bit signed/unsigned norm/scaled vertex attrs. | Eric Anholt | 2014-12-15 | 1 | -0/+4
* vc4: Add support for 8-bit unnormalized vertex attrs. | Eric Anholt | 2014-12-15 | 1 | -0/+4
* vc4: Rename UNPACK_8* to UNPACK_8*_F. | Eric Anholt | 2014-12-15 | 1 | -4/+4
  There is an equivalent unpack function without conversion to float if you use an integer operation instead.
* vc4: Reserve rb31 instead of r3 for raddr conflict spills. | Eric Anholt | 2014-12-09 | 1 | -3/+3
  This increases the cost of a raddr b conflict spill (save r3 to rb31, move src1 to r3, move rb31 back to r3 when done, instead of just move src1 to r3), but on average thanks to instruction pairing it's more worthwhile to have another accumulator.
  total instructions in shared programs: 46428 -> 46171 (-0.55%)
  instructions in affected programs: 38030 -> 37773 (-0.68%)
* vc4: Prioritize allocating accumulators to short-lived values. | Eric Anholt | 2014-12-09 | 1 | -14/+59
  The register allocator walks from the end of the nodes array looking for trivially-allocatable things to put on the stack, meaning (assuming everything is trivially colorable and gets put on the stack in a single pass) the low node numbers get allocated first. The things allocated first happen to get the lower-numbered registers, which is to say the fast accumulators that can be paired more easily.
  When we previously made the nodes match the temporary register numbers, we'd end up putting the shader inputs (VS or FS) in the accumulators, which are often long-lived values. By prioritizing the shortest-lived values for allocation, we can get a lot more instructions that involve accumulators, and thus fewer conflicts for raddr and WS.
  total instructions in shared programs: 52870 -> 46428 (-12.18%)
  instructions in affected programs: 52260 -> 45818 (-12.33%)
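  The reordering described above can be sketched as a sort on live-range length followed by a temp-to-node mapping. This is a minimal sketch, not the code from this commit: struct temp_range, compare_live_length(), and assign_nodes_by_live_length() are hypothetical names, and live ranges are assumed to be given as instruction indices with end >= start.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-temporary record describing its live range. */
struct temp_range {
    uint32_t temp;   /* index of the temporary (virtual register) */
    uint32_t start;  /* first instruction where the value is live */
    uint32_t end;    /* last instruction where the value is live */
};

/* qsort() comparator: shorter live ranges sort first. */
static int
compare_live_length(const void *a, const void *b)
{
    const struct temp_range *ra = a;
    const struct temp_range *rb = b;
    uint32_t len_a = ra->end - ra->start;
    uint32_t len_b = rb->end - rb->start;

    if (len_a != len_b)
        return len_a < len_b ? -1 : 1;

    /* Tie-break on the temp index so the order is deterministic. */
    return (ra->temp > rb->temp) - (ra->temp < rb->temp);
}

/*
 * Sort the ranges in place and fill temp_to_node[] so that the
 * shortest-lived temporary gets node 0, the next gets node 1, and so on.
 * Both arrays must hold `count` entries, and every temp index must be
 * smaller than `count`.
 */
static void
assign_nodes_by_live_length(struct temp_range *ranges, uint32_t count,
                            uint32_t *temp_to_node)
{
    assert(ranges != NULL && temp_to_node != NULL);

    qsort(ranges, count, sizeof(*ranges), compare_live_length);

    for (uint32_t node = 0; node < count; node++) {
        assert(ranges[node].temp < count);
        temp_to_node[ranges[node].temp] = node;
    }
}
```

  With this mapping, an allocator that hands the lowest-numbered registers to the lowest-numbered nodes would give the accumulators to the short-lived values, rather than to long-lived shader inputs that happen to have low temp indices.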
* vc4: Interleave register allocation from regfile A and B. | Eric Anholt | 2014-12-08 | 1 | -39/+38
  The register allocator prefers low-index registers from vc4_regs[] in the configuration we're using, which is good because it means we prioritize allocating the accumulators (which are faster). On the other hand, it was causing raddr conflicts because everything beyond r0-r2 ended up in regfile A until you got massive register pressure. By interleaving, we end up getting more instruction pairing from getting non-conflicting raddrs and QPU_WSes.
  total instructions in shared programs: 55957 -> 52719 (-5.79%)
  instructions in affected programs: 46855 -> 43617 (-6.91%)
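  The interleaved ordering described above can be illustrated with a small table builder. This is a sketch under assumptions: enum reg_file, struct reg, and build_interleaved_regs() are hypothetical and do not reproduce the real vc4_regs[] table. It only shows the ordering idea, accumulators first and then A and B entries alternating.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register file tags. */
enum reg_file { FILE_ACC, FILE_A, FILE_B };

/* Hypothetical entry in a register preference table. */
struct reg {
    enum reg_file file;
    int index;
};

/*
 * Fill `out` with accumulators first, then ra0, rb0, ra1, rb1, ...
 * An allocator that prefers low table indices then alternates between
 * the two register files instead of exhausting regfile A first.
 * Returns the number of entries written.
 */
static size_t
build_interleaved_regs(struct reg *out, size_t capacity,
                       int num_acc, int num_per_file)
{
    size_t n = 0;

    assert(out != NULL);
    assert(num_acc >= 0 && num_per_file >= 0);
    assert(capacity >= (size_t)num_acc + 2 * (size_t)num_per_file);

    for (int i = 0; i < num_acc; i++)
        out[n++] = (struct reg){ FILE_ACC, i };

    for (int i = 0; i < num_per_file; i++) {
        out[n++] = (struct reg){ FILE_A, i };
        out[n++] = (struct reg){ FILE_B, i };
    }

    return n;
}
```

  For example, with three accumulators and two registers per file, the table reads r0, r1, r2, ra0, rb0, ra1, rb1, so two neighbouring low-pressure temporaries tend to land in different files.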
* vc4: Put dead writes into the NOP register when generating code. | Eric Anholt | 2014-09-23 | 1 | -1/+8
  They still provide register pressure since I haven't made a special class for them, but since they're only live for one instruction it probably doesn't matter. This improves the readability of QPU assembly.
* vc4: Add support for 8-bit unorm/snorm vertex inputs. | Eric Anholt | 2014-09-23 | 1 | -0/+8
* vc4: Switch to using Mesa's register allocator. | Eric Anholt | 2014-09-23 | 1 | -105/+105
  This will let me more reliably allocate a-file registers, which are going to be even more in demand when I start using a-file unpacks.
  Also fixes a bug where the reservation of payload registers (FRAG_Z/W) was off by one, but fixing the off-by-one just caused register allocation to fail entirely.
* vc4: Make a static list of all the registers. | Eric Anholt | 2014-09-23 | 1 | -12/+82
* vc4: Use the same method as for FRAG_Z to handle fragcoord W. | Eric Anholt | 2014-09-19 | 1 | -0/+8
  I need to get the non-reciprocal version of W for interpolation, anyway.
* vc4: Fix memory leaks in register allocation. | Eric Anholt | 2014-09-16 | 1 | -0/+3
* vc4: Move register allocation to a separate file. | Eric Anholt | 2014-09-16 | 1 | -0/+157
  I'm going to be rewriting it all, and having it mixed up with the QIR-to-QPU opcode translation was messy.