| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
Just say no to:
- brw->gs.base.prog_data = &brw->gs.prog_data->base.base;
We'll just use the brw_stage_prog_data pointer in brw_stage_state
and downcast it to brw_gs_prog_data as needed.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Timothy Arceri <timothy.arcero@collabora.com>
|
|
|
|
|
|
|
|
| |
Now that we have gen_device_info mutable, we can update its values and drop
all copies we had in brw_context.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
allocation.
I haven't found any mention of this in the hardware docs, but
experimentally what seems to be going on is that when the per-thread
scratch slot size is changed between two pipelined draw calls, shader
invocations using the old and new scratch size setting may end up
being executed in parallel, causing their scratch offset calculations
to be based in a different partitioning of the scratch space, which
can cause their thread-local scratch space to overlap leading to
cross-thread scratch corruption.
I've been experimenting with alternative workarounds, like emitting a
PIPE_CONTROL with DC flush and CS stall between draw (or dispatch
compute) calls using different per-thread scratch allocation settings,
or avoiding reuse of the scratch BO if the per-thread scratch
allocation doesn't exactly match the original. Both seem to be as
effective as this workaround, but they have potential performance
implications, while this should be basically for free.
Fixes over 40 failures in our CI system with spilling forced on
(including CTS, dEQP and Piglit failures) on a number of different
platforms from Gen4 to Gen9. The 'glsl-max-varyings' piglit test
seems to be able to reproduce this bug consistently in the vertex
shader on at least Gen4, Gen8 and Gen9 with spilling forced on.
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
| |
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The hardware documentation relating to the UAV HW-assisted coherency
mechanism and UAV access enable bits is scarce and sometimes
contradictory, and there's quite some guesswork behind this commit, so
let me summarize the background first: HSW and later hardware have
infrastructure to support a stricter form of data coherency between
shader invocations from separate primitives. The mechanism is
controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and
_PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on
the 3DPRIMITIVE command.
Regardless of whether "UAV Coherency Required" is set, the hardware
fixed-function units will increment a per-stage semaphore for each
request received if "Accesses UAV" is set for the same or any lower
stage. An implicit DC flush is emitted by the lowermost stage with
"Accesses UAV" set once it's done processing the request, this also
happens regardless of the value of "UAV Coherency Required". The
completion of the DC flush will cause the same stage and all previous
ones to decrement the semaphore, marking the UAV accesses for the
primitive as coherent with L3.
The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline
stall before any threads are dispatched for the first FF stage with
"Accesses UAV" set until the semaphore is cleared for the same stage.
Effectively this guarantees that UAV memory accesses performed by
previous primitives from any stage will be strictly ordered (and
thanks to the implicit DC flush visible in memory) with UAV accesses
from the following primitives.
None of this is required by the usual image, atomic counter and SSBO
GL APIs which have very relaxed cross-primitive coherency and ordering
requirements, so we don't actually ever set the "UAV Coherency
Required" bit -- Ordering with respect to shader invocations from
previous stages on the same primitive where there is a data dependency
is of course already guaranteed as the spec requires, regardless of
this mechanism being enabled. We do set the "Accesses UAV" bits
though since my commit ac7664e493655e290783c23a0412b9c70936da50 (which
this patch partially reverts), mainly because of comments like the
following from the BDW PRM:
> 3DSTATE_GS
>[...]
> 12 Accesses UAV
> Format: Enable
> This field must be set when GS has a UAV access.
There are similar comments in the documentation for the other
3DSTATE_*S commands. The "must" part is misleading and unjustified
AFAIK. Most of the "Accesses UAV" bits don't seem to have any side
effects other than the implicit DC flushes and the related
book-keeping in anticipation for a subsequent primitive with "UAV
Coherency Required" set, so in most cases they are unnecessary and may
incur a performance penalty. There is an exception though. On Gen8+
the PS_EXTRA UAV access bit influences the calculation of the PS
UAV-only and ThreadDispatchEnable signals which on previous
generations were set explicitly by the driver, so we cannot always
avoid enabling it on the PS stage.
The primary motivation for this change is that in fact the hardware
coherency mechanism is buggy and will cause a rather non-deterministic
hang on Gen8 when VS is the only stage with "Accesses UAV" set and the
processing of a request terminates immediately after the implicit DC
flush is sent for a previous primitive with no additional vertices
being emitted for the second primitive, what will cause the hardware
to skip sending a second DC flush and cause the VS to stall
indefinitely waiting for a response from the DC (BDWGFX HSD 1912017).
This hardware bug can be reproduced on current master with the
spec@arb_shader_image_load_store@host-mem-barrier@Indirect/RaW piglit
subtest (if you have the patience to run it a few dozen times).
The proposed workaround is to insert CS STALLs speculatively between
3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage
only. Because this would affect one of the hottest paths in the
driver and likely decrease performance even further due to the
unnecessary serialization, and because we don't actually need the
implicit DC flushes, it seems better to just disable them.
Cc: 11.0 <mesa-stable@lists.freedesktop.org>
|
|
|
|
|
|
| |
v2: Set the PS UAV-only bit on HSW (Ken).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We used to store the GS dispatch mode in brw_gs_prog_data while
separately storing the VS dispatch mode in brw_vue_prog_data::simd8.
This patch introduces an enum to represent all possible dispatch modes,
and stores it in brw_vue_prog_data::dispatch_mode, unifying the two.
Based on a suggestion by Matt Turner.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
|
|
|
|
|
|
|
|
|
|
|
| |
These structs aren't vec4 specific, they are shared by shader stages
operating on Vertex URB Entries (VUEs). VUEs are the data structures in
the URB that hold vertex data between the pipeline geometry stages.
Using vue in the name instead of vec4 makes a lot more sense, especially
when we add scalar vertex shader support.
Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I put the BRW_NEW_*_PROG_DATA flags at the beginning so that
brw_state_cache.c can still continue using 1 << brw_cache_id.
I also added a comment explaining the difference between
BRW_NEW_*_PROG_DATA and BRW_NEW_*_PROGRAM, as it took me a long time
to remember it.
Non-mechanical changes:
- brw_state_cache.c and brw_ff_gs.c now signal .brw, not .cache.
- brw_state_upload.c - INTEL_DEBUG=state changes.
- brw_context.h - bit definition merging.
v2: Correct the explanation of BRW_NEW_*_PROG_DATA to mention
state-based recompiles, and nix the "proper subset" claim,
as it's false. (Caught by Kristian Høgsberg).
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we've moved a bunch of CACHE_NEW_* bits to BRW_NEW_*, the only
ones that are left are legitimately related to the program cache. Yet,
it seems a bit wasteful to have an entire bitfield for only 7 bits.
State upload is one of the hottest paths in the driver. For each atom
in the list, we call check_state() to see if it needs to be emitted.
Currently, this involves comparing three separate bitfields (mesa, brw,
and cache). Consolidating the brw and cache bitfields would save a
small amount of CPU overhead per atom. Broadwell, for example, has
57 state atoms, so this small savings can add up.
CACHE_NEW_*_PROG covers the brw_*_prog_data structures, as well as the
offset into the program cache BO (prog_offset). Since most uses refer
to brw_*_prog_data, I decided to use BRW_NEW_*_PROG_DATA as the name.
Removing "cache" completely is a bit painful, so I decided to do it in
several patches for easier review, and to separate mechanical changes
from manual ones. This one simply renames things, and was made via:
$ for file in *.[ch]; do
sed -i -e 's/CACHE_NEW_\([A-Z_\*]*\)_PROG/BRW_NEW_\1_PROG_DATA/g' \
-e 's/BRW_NEW_WM_PROG_DATA/BRW_NEW_FS_PROG_DATA/g' $file
done
Note that BRW_NEW_*_PROG_DATA is still in .cache, not .brw!
The next patch will remedy this flaw. It will also fix the
alphabetization issues.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
Acked-by: Matt Turner <mattst88@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the dirty flags were listed in some arbitrary order. Some used
bonus parenthesis. Some put multiple flags on one line, others put one
per line. Some used tabs instead of spaces...but only on some lines.
This patch settles on one flag per line, in alphabetical order, using
spaces instead of tabs, and sheds the unnecessary parentheses.
Sorting was mostly done with vim's visual block feature and !sort,
although I alphabetized short lists by hand; it was pretty manual.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
|
|
|
|
|
|
|
|
| |
The code required for gen6 and gen7+ is almost the same, so reuse it.
Signed-off-by: Samuel Iglesias Gonsalvez <siglesias@igalia.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when a geometry shader can't use dual object mode we fall back to
dual instance mode, however, when invocations == 1, single dispatch mode is
more performant and equally efficient in terms of register pressure.
Single dispatch mode requires that the driver can handle interleaving of
input registers, but this is already supported (dual instance mode has
the same requirement). However, to take full advantage of single dispatch mode
to reduce register pressure we would also need the ability to store two
separate vec4 output values into vec8 registers, which would approximately
double our capacity to store temporary values, but currently the vec4 visitor
and generator classes do not support this, so at the moment register pressure
in single and dual instance modes is the same.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
| |
All shader stages have these fields, so it makes sense to store them in
the common base structure, rather than duplicating them in each.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
| |
I wanted to access this value from stage-generic code, so stop storing it
under two different names.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
| |
-0.553779% +/- 0.423394% effect on cairo-perf-trace runtime on glamor
(n=612)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 30259856a8a82a55c030df1ad052e505c61144bc moved the state packets to
table generation time, but forgot to make this change. Apparently the
performance win there was about not reemitting the table pointers on
unrelated state changes.
No performance difference on cairo on glamor (n=118).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we have the stage state coming into our setup of sampler states,
it's easy to drop an identifier into it of which stage the stage_state is,
and then look up which packet to emit in a little table.
No performance difference on cairo on glamor (n=492).
v2: Don't forget to do the workaround flush on IVB.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This keeps us from needing to reemit all the other stage state just
because a surface changed.
Improves unoptimized glamor x11perf -f8text by 1.10201% +/- 0.489869%
(n=296). [v1]
v2:
- Drop binding table packets from Gen8 unit state as well.
- Pass _3DSTATE_BINDING_TABLE_POINTERS_XS to brw_upload_binding_table,
cutting even more code.
v3: Don't forget to drop them from 3DSTATE_GS (botched refactor in v2).
Signed-off-by: Eric Anholt <eric@anholt.net> [v1]
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v1]
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> [v2, v3]
Reviewed-by: Eric Anholt <eric@anholt.net> [v3]
|
|
|
|
|
|
|
|
|
|
| |
v3:
* Properly prevent dual object mode execution when
the invocation count > 1
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
v2: Don't go to extra work to avoid extraneous flushes. (Previous
experiments in the kernel have suggested that flushing the pipeline
when it is already empty is extremely cheap).
Cc: "10.0" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
|
|
|
|
|
|
|
|
|
|
| |
Before the series with 3c9dc2d31b80fc73bffa1f40a91443a53229c8e2 to
dynamically assign our binding table indices, we didn't really track our
binding table count per shader, so we never filled in these fields.
Affects cairo-gl trace runtime by -2.47953% +/- 1.07281% (n=20)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
| |
Not yet enabled.
Reviewed-by: Eric Anholt <eric@anholt.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ivy Bridge's "reorder enable" bit gives us a binary choice for the
order in which vertices from triangle strips are delivered to the
geometry shader. Neither choice follows the OpenGL spec, but setting
the bit is better, because it gets triangle orientation correct.
Haswell replaces the "reorder enable" bit with a new "reorder mode"
bit (which occupies the same location in the command packet). This
bit gives us a different binary choice, which affects both triangle
strips and triangle strips with adjacency. Setting the bit ("reorder
trailing") gives the proper order according to the OpenGL spec.
So in either case we want to set the bit.
On Ivy Bridge, fixes piglit test "triangle-strip-orientation".
On Haswell, fixes piglit tests "glsl-1.50-geometry-primitive-types
{GL_TRIANGLE_STRIP,GL_TRIANGLE_STRIP_ADJACENCY}" and
"glsl-1.50-geometry-tri-strip-ordering-with-prim-restart *".
v2: Rename the bit to "REORDER_TRAILING" for consistency with Haswell
docs.
Reviewed-by: Matt Turner <mattst88@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit 247f90c77e8f3894e963d796628246ba0bde27b5 (i965/gs: Set
control data header size/format appropriately for EndPrimitive()), I
incorrectly numbered the DWORDs in the 3DSTATE_GS command starting
from 1 instead of starting from 0. This caused the control data
format to be programmed into the wrong DWORD, resulting in corruption
in some geometry shaders that used an output type of points.
This patch numbers the DWORDs starting from 0, as we do for all other
commands, which causes the control data format to be programmed into
the correct DWORD.
Reviewed-by: Chad Versace <chad.versace@linux.intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the geometry shader refers to the built-in variable
gl_PrimitiveIDIn, we need to set a bit in 3DSTATE_GS to tell the
hardware to dispatch primitive ID to r1, and we need to leave room for
it when allocating registers.
Note: this feature doesn't yet work properly when software primitive
restart is in use (the primitive ID counter will incorrectly reset
with each primitive restart, since software primitive restart works by
performing multiple draw calls). I plan to address that in a future
patch series.
Fixes piglit test "spec/glsl-1.50/execution/geometry/primitive-id-in".
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The gen7 geometry shader uses a "control data header" at the beginning
of the output URB entry to store either
(a) flag bits (1 bit/vertex) indicating whether EndPrimitive() was
called after each vertex, or
(b) stream ID bits (2 bits/vertex) indicating which stream each vertex
should be sent to (when multiple transform feedback streams are in
use).
Fortunately, OpenGL only requires separate streams to be supported
when the output type is points, and EndPrimitive() only has an effect
when the output type is line_strip or triangle_strip, so it's not a
problem that these two uses of the control data header are mutually
exclusive.
This patch modifies do_vec4_gs_prog() to determine the correct
hardware settings for configuring the control data header, and
modifies upload_gs_state() to propagate these settings to the
hardware.
In addition, it modifies do_vec4_gs_prog() to ensure that the output
URB entry is large enough to contain both the output vertices *and*
the control data header.
Finally, it modifies vec4_gs_visitor so that it accounts for the size
of the control data header when computing the offset within the URB
where output vertex data should be stored.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
v2: Fixed incorrect handling of IVB/HSW differences.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|
|
v2: Do not attempt to share the code that uploads
3DSTATE_BINDING_TABLE_POINTERS_GS, 3DSTATE_SAMPLER_STATE_POINTERS_GS,
or 3DSTATE_GS with VS.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
v3: Add _NEW_TRANSFORM to gen7_gs_state.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
|