summaryrefslogtreecommitdiffstats
path: root/src/mesa/drivers/dri/i965/gen7_gs_state.c
Commit message (Collapse)AuthorAgeFilesLines
* i965: Eliminate brw->gs.prog_data pointer.Kenneth Graunke2016-10-051-14/+17
| | | | | | | | | | | | Just say no to: - brw->gs.base.prog_data = &brw->gs.prog_data->base.base; We'll just use the brw_stage_prog_data pointer in brw_stage_state and downcast it to brw_gs_prog_data as needed. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Timothy Arceri <timothy.arcero@collabora.com>
* i965: get rid of duplicated values from gen_device_infoLionel Landwerlin2016-09-231-1/+2
| | | | | | | | Now that we have gen_device_info mutable, we can update its values and drop all copies we had in brw_context. Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Fix cross-primitive scratch corruption when changing the per-thread ↵Francisco Jerez2016-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allocation. I haven't found any mention of this in the hardware docs, but experimentally what seems to be going on is that when the per-thread scratch slot size is changed between two pipelined draw calls, shader invocations using the old and new scratch size setting may end up being executed in parallel, causing their scratch offset calculations to be based in a different partitioning of the scratch space, which can cause their thread-local scratch space to overlap leading to cross-thread scratch corruption. I've been experimenting with alternative workarounds, like emitting a PIPE_CONTROL with DC flush and CS stall between draw (or dispatch compute) calls using different per-thread scratch allocation settings, or avoiding reuse of the scratch BO if the per-thread scratch allocation doesn't exactly match the original. Both seem to be as effective as this workaround, but they have potential performance implications, while this should be basically for free. Fixes over 40 failures in our CI system with spilling forced on (including CTS, dEQP and Piglit failures) on a number of different platforms from Gen4 to Gen9. The 'glsl-max-varyings' piglit test seems to be able to reproduce this bug consistently in the vertex shader on at least Gen4, Gen8 and Gen9 with spilling forced on. Cc: <mesa-stable@lists.freedesktop.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Make all atoms to track BRW_NEW_BLORP by defaultKenneth Graunke2016-04-231-0/+1
| | | | Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com
* i965: Don't tell the hardware about our UAV access.Francisco Jerez2015-10-091-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hardware documentation relating to the UAV HW-assisted coherency mechanism and UAV access enable bits is scarce and sometimes contradictory, and there's quite some guesswork behind this commit, so let me summarize the background first: HSW and later hardware have infrastructure to support a stricter form of data coherency between shader invocations from separate primitives. The mechanism is controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and _PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on the 3DPRIMITIVE command. Regardless of whether "UAV Coherency Required" is set, the hardware fixed-function units will increment a per-stage semaphore for each request received if "Accesses UAV" is set for the same or any lower stage. An implicit DC flush is emitted by the lowermost stage with "Accesses UAV" set once it's done processing the request, this also happens regardless of the value of "UAV Coherency Required". The completion of the DC flush will cause the same stage and all previous ones to decrement the semaphore, marking the UAV accesses for the primitive as coherent with L3. The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline stall before any threads are dispatched for the first FF stage with "Accesses UAV" set until the semaphore is cleared for the same stage. Effectively this guarantees that UAV memory accesses performed by previous primitives from any stage will be strictly ordered (and thanks to the implicit DC flush visible in memory) with UAV accesses from the following primitives. None of this is required by the usual image, atomic counter and SSBO GL APIs which have very relaxed cross-primitive coherency and ordering requirements, so we don't actually ever set the "UAV Coherency Required" bit -- Ordering with respect to shader invocations from previous stages on the same primitive where there is a data dependency is of course already guaranteed as the spec requires, regardless of this mechanism being enabled. We do set the "Accesses UAV" bits though since my commit ac7664e493655e290783c23a0412b9c70936da50 (which this patch partially reverts), mainly because of comments like the following from the BDW PRM: > 3DSTATE_GS >[...] > 12 Accesses UAV > Format: Enable > This field must be set when GS has a UAV access. There are similar comments in the documentation for the other 3DSTATE_*S commands. The "must" part is misleading and unjustified AFAIK. Most of the "Accesses UAV" bits don't seem to have any side effects other than the implicit DC flushes and the related book-keeping in anticipation for a subsequent primitive with "UAV Coherency Required" set, so in most cases they are unnecessary and may incur a performance penalty. There is an exception though. On Gen8+ the PS_EXTRA UAV access bit influences the calculation of the PS UAV-only and ThreadDispatchEnable signals which on previous generations were set explicitly by the driver, so we cannot always avoid enabling it on the PS stage. The primary motivation for this change is that in fact the hardware coherency mechanism is buggy and will cause a rather non-deterministic hang on Gen8 when VS is the only stage with "Accesses UAV" set and the processing of a request terminates immediately after the implicit DC flush is sent for a previous primitive with no additional vertices being emitted for the second primitive, what will cause the hardware to skip sending a second DC flush and cause the VS to stall indefinitely waiting for a response from the DC (BDWGFX HSD 1912017). This hardware bug can be reproduced on current master with the spec@arb_shader_image_load_store@host-mem-barrier@Indirect/RaW piglit subtest (if you have the patience to run it a few dozen times). The proposed workaround is to insert CS STALLs speculatively between 3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage only. Because this would affect one of the hottest paths in the driver and likely decrease performance even further due to the unnecessary serialization, and because we don't actually need the implicit DC flushes, it seems better to just disable them. Cc: 11.0 <mesa-stable@lists.freedesktop.org>
* i965/gen7-8: Poke the 3DSTATE UAV access enable bits.Francisco Jerez2015-08-111-1/+3
| | | | | | v2: Set the PS UAV-only bit on HSW (Ken). Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Create a shader_dispatch_mode enum to replace VS/GS fields.Kenneth Graunke2015-06-011-1/+1
| | | | | | | | | | | | | We used to store the GS dispatch mode in brw_gs_prog_data while separately storing the VS dispatch mode in brw_vue_prog_data::simd8. This patch introduces an enum to represent all possible dispatch modes, and stores it in brw_vue_prog_data::dispatch_mode, unifying the two. Based on a suggestion by Matt Turner. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
* i965: Rename brw_vec4_prog_data/key to brw_bue_prog_data/keyKristian Høgsberg2014-12-101-1/+1
| | | | | | | | | | | These structs aren't vec4 specific, they are shared by shader stages operating on Vertex URB Entries (VUEs). VUEs are the data structures in the URB that hold vertex data between the pipeline geometry stages. Using vue in the name instead of vec4 makes a lot more sense, especially when we add scalar vertex shader support. Signed-off-by: Kristian Høgsberg <krh@bitplanet.net> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Move BRW_NEW_*_PROG_DATA flags to .brw (not .cache).Kenneth Graunke2014-12-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | I put the BRW_NEW_*_PROG_DATA flags at the beginning so that brw_state_cache.c can still continue using 1 << brw_cache_id. I also added a comment explaining the difference between BRW_NEW_*_PROG_DATA and BRW_NEW_*_PROGRAM, as it took me a long time to remember it. Non-mechanical changes: - brw_state_cache.c and brw_ff_gs.c now signal .brw, not .cache. - brw_state_upload.c - INTEL_DEBUG=state changes. - brw_context.h - bit definition merging. v2: Correct the explanation of BRW_NEW_*_PROG_DATA to mention state-based recompiles, and nix the "proper subset" claim, as it's false. (Caught by Kristian Høgsberg). Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net> Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965: Rename CACHE_NEW_*_PROG to BRW_NEW_*_PROG_DATA.Kenneth Graunke2014-12-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we've moved a bunch of CACHE_NEW_* bits to BRW_NEW_*, the only ones that are left are legitimately related to the program cache. Yet, it seems a bit wasteful to have an entire bitfield for only 7 bits. State upload is one of the hottest paths in the driver. For each atom in the list, we call check_state() to see if it needs to be emitted. Currently, this involves comparing three separate bitfields (mesa, brw, and cache). Consolidating the brw and cache bitfields would save a small amount of CPU overhead per atom. Broadwell, for example, has 57 state atoms, so this small savings can add up. CACHE_NEW_*_PROG covers the brw_*_prog_data structures, as well as the offset into the program cache BO (prog_offset). Since most uses refer to brw_*_prog_data, I decided to use BRW_NEW_*_PROG_DATA as the name. Removing "cache" completely is a bit painful, so I decided to do it in several patches for easier review, and to separate mechanical changes from manual ones. This one simply renames things, and was made via: $ for file in *.[ch]; do sed -i -e 's/CACHE_NEW_\([A-Z_\*]*\)_PROG/BRW_NEW_\1_PROG_DATA/g' \ -e 's/BRW_NEW_WM_PROG_DATA/BRW_NEW_FS_PROG_DATA/g' $file done Note that BRW_NEW_*_PROG_DATA is still in .cache, not .brw! The next patch will remedy this flaw. It will also fix the alphabetization issues. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net> Acked-by: Matt Turner <mattst88@gmail.com>
* i965: Alphabetize brw_tracked_state flags and use a consistent style.Kenneth Graunke2014-11-291-3/+3
| | | | | | | | | | | | | | | | Most of the dirty flags were listed in some arbitrary order. Some used bonus parenthesis. Some put multiple flags on one line, others put one per line. Some used tabs instead of spaces...but only on some lines. This patch settles on one flag per line, in alphabetical order, using spaces instead of tabs, and sheds the unnecessary parentheses. Sorting was mostly done with vim's visual block feature and !sort, although I alphabetized short lists by hand; it was pretty manual. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net> Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965/gs: Reuse gen6 constant push buffers setup code in gen7+.Samuel Iglesias Gonsalvez2014-09-191-33/+0
| | | | | | | | The code required for gen6 and gen7+ is almost the same, so reuse it. Signed-off-by: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Acked-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/gs: Use single dispatch mode as fallback to dual object mode when possible.Iago Toral Quiroga2014-09-191-3/+1
| | | | | | | | | | | | | | | | | | Currently, when a geometry shader can't use dual object mode we fall back to dual instance mode, however, when invocations == 1, single dispatch mode is more performant and equally efficient in terms of register pressure. Single dispatch mode requires that the driver can handle interleaving of input registers, but this is already supported (dual instance mode has the same requirement). However, to take full advantage of single dispatch mode to reduce register pressure we would also need the ability to store two separate vec4 output values into vec8 registers, which would approximately double our capacity to store temporary values, but currently the vec4 visitor and generator classes do not support this, so at the moment register pressure in single and dual instance modes is the same. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Acked-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Move curb_read_length/total_scratch to brw_stage_prog_data.Kenneth Graunke2014-09-031-2/+2
| | | | | | | | All shader stages have these fields, so it makes sense to store them in the common base structure, rather than duplicating them in each. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965/gen6+: Merge VS/GS and WM push constant buffer upload paths.Eric Anholt2014-07-021-3/+3
| | | | Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Move dispatch_grf_start_reg and first_curbe_grf into stage_prog_data.Eric Anholt2014-07-021-1/+1
| | | | | | | I wanted to access this value from stage-generic code, so stop storing it under two different names. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Move push constant state packets to push constant update time.Eric Anholt2014-05-021-14/+15
| | | | | | | -0.553779% +/- 0.423394% effect on cairo-perf-trace runtime on glamor (n=612) Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Drop unnecessary state flag for units on NEW_BINDING_TABLE.Eric Anholt2014-05-021-1/+0
| | | | | | | | | | | Commit 30259856a8a82a55c030df1ad052e505c61144bc moved the state packets to table generation time, but forgot to make this change. Apparently the performance win there was about not reemitting the table pointers on unrelated state changes. No performance difference on cairo on glamor (n=118). Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/gen7+: Move sampler state packets to the stage sampler state table update.Eric Anholt2014-05-021-7/+1
| | | | | | | | | | | | Now that we have the stage state coming into our setup of sampler states, it's easy to drop an identifier into it of which stage the stage_state is, and then look up which packet to emit in a little table. No performance difference on cairo on glamor (n=492). v2: Don't forget to do the workaround flush on IVB. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Move binding table update packets to binding table setup time.Eric Anholt2014-03-101-6/+0
| | | | | | | | | | | | | | | | | | | This keeps us from needing to reemit all the other stage state just because a surface changed. Improves unoptimized glamor x11perf -f8text by 1.10201% +/- 0.489869% (n=296). [v1] v2: - Drop binding table packets from Gen8 unit state as well. - Pass _3DSTATE_BINDING_TABLE_POINTERS_XS to brw_upload_binding_table, cutting even more code. v3: Don't forget to drop them from 3DSTATE_GS (botched refactor in v2). Signed-off-by: Eric Anholt <eric@anholt.net> [v1] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v1] Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> [v2, v3] Reviewed-by: Eric Anholt <eric@anholt.net> [v3]
* i965: support instanced GS on gen7Jordan Justen2014-02-201-0/+2
| | | | | | | | | | v3: * Properly prevent dual object mode execution when the invocation count > 1 Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Paul Berry <stereotype441@gmail.com> Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
* i965/gen7: Emit workaround flush when changing GS enable state.Paul Berry2013-11-181-0/+16
| | | | | | | | | | v2: Don't go to extra work to avoid extraneous flushes. (Previous experiments in the kernel have suggested that flushing the pipeline when it is already empty is extremely cheap). Cc: "10.0" <mesa-stable@lists.freedesktop.org> Reviewed-by: Eric Anholt <eric@anholt.net>
* i965: Tell the unit states how many binding table entries we have.Eric Anholt2013-11-051-1/+3
| | | | | | | | | | Before the series with 3c9dc2d31b80fc73bffa1f40a91443a53229c8e2 to dynamically assign our binding table indices, we didn't really track our binding table count per shader, so we never filled in these fields. Affects cairo-gl trace runtime by -2.47953% +/- 1.07281% (n=20) Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/gs: Add the ability to compile a DUAL_INSTANCED geometry shader.Paul Berry2013-10-241-1/+3
| | | | | | Not yet enabled. Reviewed-by: Eric Anholt <eric@anholt.net>
* i965/gs: Set the REORDER bit in 3DSTATE_GS.Paul Berry2013-10-151-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | Ivy Bridge's "reorder enable" bit gives us a binary choice for the order in which vertices from triangle strips are delivered to the geometry shader. Neither choice follows the OpenGL spec, but setting the bit is better, because it gets triangle orientation correct. Haswell replaces the "reorder enable" bit with a new "reorder mode" bit (which occupies the same location in the command packet). This bit gives us a different binary choice, which affects both triangle strips and triangle strips with adjacency. Setting the bit ("reorder trailing") gives the proper order according to the OpenGL spec. So in either case we want to set the bit. On Ivy Bridge, fixes piglit test "triangle-strip-orientation". On Haswell, fixes piglit tests "glsl-1.50-geometry-primitive-types {GL_TRIANGLE_STRIP,GL_TRIANGLE_STRIP_ADJACENCY}" and "glsl-1.50-geometry-tri-strip-ordering-with-prim-restart *". v2: Rename the bit to "REORDER_TRAILING" for consistency with Haswell docs. Reviewed-by: Matt Turner <mattst88@gmail.com>
* i965/gs: Fix incorrect numbering of DWORDs in 3DSTATE_GSPaul Berry2013-10-011-3/+4
| | | | | | | | | | | | | | | In commit 247f90c77e8f3894e963d796628246ba0bde27b5 (i965/gs: Set control data header size/format appropriately for EndPrimitive()), I incorrectly numbered the DWORDs in the 3DSTATE_GS command starting from 1 instead of starting from 0. This caused the control data format to be programmed into the wrong DWORD, resulting in corruption in some geometry shaders that used an output type of points. This patch numbers the DWORDs starting from 0, as we do for all other commands, which causes the control data format to be programmed into the correct DWORD. Reviewed-by: Chad Versace <chad.versace@linux.intel.com>
* i965/gs: Implement basic gl_PrimitiveIDIn functionality.Paul Berry2013-09-171-0/+2
| | | | | | | | | | | | | | | | | If the geometry shader refers to the built-in variable gl_PrimitiveIDIn, we need to set a bit in 3DSTATE_GS to tell the hardware to dispatch primitive ID to r1, and we need to leave room for it when allocating registers. Note: this feature doesn't yet work properly when software primitive restart is in use (the primitive ID counter will incorrectly reset with each primitive restart, since software primitive restart works by performing multiple draw calls). I plan to address that in a future patch series. Fixes piglit test "spec/glsl-1.50/execution/geometry/primitive-id-in". Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/gs: Set control data header size/format appropriately for EndPrimitive().Paul Berry2013-09-111-14/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The gen7 geometry shader uses a "control data header" at the beginning of the output URB entry to store either (a) flag bits (1 bit/vertex) indicating whether EndPrimitive() was called after each vertex, or (b) stream ID bits (2 bits/vertex) indicating which stream each vertex should be sent to (when multiple transform feedback streams are in use). Fortunately, OpenGL only requires separate streams to be supported when the output type is points, and EndPrimitive() only has an effect when the output type is line_strip or triangle_strip, so it's not a problem that these two uses of the control data header are mutually exclusive. This patch modifies do_vec4_gs_prog() to determine the correct hardware settings for configuring the control data header, and modifies upload_gs_state() to propagate these settings to the hardware. In addition, it modifies do_vec4_gs_prog() to ensure that the output URB entry is large enough to contain both the output vertices *and* the control data header. Finally, it modifies vec4_gs_visitor so that it accounts for the size of the control data header when computing the offset within the URB where output vertex data should be stored. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> v2: Fixed incorrect handling of IVB/HSW differences. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/gs: Add a state atom to set up geometry shader state.Paul Berry2013-09-111-0/+144
v2: Do not attempt to share the code that uploads 3DSTATE_BINDING_TABLE_POINTERS_GS, 3DSTATE_SAMPLER_STATE_POINTERS_GS, or 3DSTATE_GS with VS. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> v3: Add _NEW_TRANSFORM to gen7_gs_state. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>