summaryrefslogtreecommitdiffstats
path: root/src/mesa/drivers/dri/i965/gen7_cs_state.c
Commit message (Collapse)AuthorAgeFilesLines
* i965: Add missing BRW_NEW_CS_PROG_DATA to compute constant atom.Kenneth Graunke2016-10-041-1/+2
| | | | | | | | | CACHE_NEW_CS_PROG hasn't existed in quite a long time...the old comment was there, but not the actual bit. Cc: mesa-stable@lists.freedesktop.org Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: get rid of duplicated values from gen_device_infoLionel Landwerlin2016-09-231-2/+2
| | | | | | | | Now that we have gen_device_info mutable, we can update its values and drop all copies we had in brw_context. Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* intel/i965: make gen_device_info mutableLionel Landwerlin2016-09-231-1/+1
| | | | | | | | | | | | Make gen_device_info a mutable structure so we can update the fields that can be refined by querying the kernel (like subslices and EU numbers). This patch does not make any functional change, it just makes gen_get_device_info() fill a structure rather than returning a const pointer. Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Rename intelScreen to screen.Kenneth Graunke2016-09-201-2/+2
| | | | | | | | "intelScreen" is wordy and also doesn't fit our style guidelines. "screen" is shorter, which is nice, because we use it fairly often. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
* intel: s/brw_device_info/gen_device_info/Jason Ekstrand2016-09-031-1/+1
| | | | | | | | | | | | | Generated by: sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.c sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.h sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.c sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.cpp sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.h Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: use new subroutine index uploader.Dave Airlie2016-08-231-0/+3
| | | | | | | | This plugs the subroutine index updates into the i965 backend, where it loads constants. Signed-off-by: Dave Airlie <airlied@redhat.com> Acked-by: Andres Gomez <agomez@igalia.com>
* i965: Use ISL for emitting buffer surface statesJason Ekstrand2016-07-151-1/+1
| | | | | | Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> Reviewed-by: Chad Versace <chad.versace@intel.com>
* i965: Fix comment about CS scratch space encodings on Broadwell+.Kenneth Graunke2016-06-161-1/+1
| | | | | | I typo'd this. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Fix cross-primitive scratch corruption when changing the per-thread ↵Francisco Jerez2016-06-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allocation. I haven't found any mention of this in the hardware docs, but experimentally what seems to be going on is that when the per-thread scratch slot size is changed between two pipelined draw calls, shader invocations using the old and new scratch size setting may end up being executed in parallel, causing their scratch offset calculations to be based in a different partitioning of the scratch space, which can cause their thread-local scratch space to overlap leading to cross-thread scratch corruption. I've been experimenting with alternative workarounds, like emitting a PIPE_CONTROL with DC flush and CS stall between draw (or dispatch compute) calls using different per-thread scratch allocation settings, or avoiding reuse of the scratch BO if the per-thread scratch allocation doesn't exactly match the original. Both seem to be as effective as this workaround, but they have potential performance implications, while this should be basically for free. Fixes over 40 failures in our CI system with spilling forced on (including CTS, dEQP and Piglit failures) on a number of different platforms from Gen4 to Gen9. The 'glsl-max-varyings' piglit test seems to be able to reproduce this bug consistently in the vertex shader on at least Gen4, Gen8 and Gen9 with spilling forced on. Cc: <mesa-stable@lists.freedesktop.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Fix encode_slm_size() to take a generation, not a device info.Kenneth Graunke2016-06-131-1/+2
| | | | | | | | | | | | | | | In the Vulkan driver, we have the generation number (a compile time constant) but not necessarily the brw_device_info struct. I meant to rework the function to take a generation number instead of a brw_device_info pointer to accomodate this. But I forgot, and left it taking a brw_device_info pointer, while making Vulkan pass the generation number (8, 9, ...) directly. This led to crashes. Brown paper bag fix for commit 87d062a94080373995170f51063a9649. Cc: "12.0" <mesa-stable@lists.freedesktop.org> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96504 Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Use the correct number of threads for compute shaders.Kenneth Graunke2016-06-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were programming the number of threads per subslice, when we should have been programming the total number of threads on the GPU as a whole. Thanks to Curro and Jordan for helping track this down! On Skylake GT3e: - Improves performance in Unreal's Elemental Demo by roughly 1.5-1.7x. - Improves performance in Synmark's Gl43CSDof by roughly 3.7x. - Improves performance in Synmark's Gl43GSCloth by roughly 1.18x. On Broadwell GT2: - Improves performance in Unreal's Elemental Demo by roughly 1.2-1.5x. - Improves performance in Synmark's Gl43CSDof by roughly 2.0x. - Improves performance in Synmark's Gl43GSCloth by 1.47035% +/- 0.255654% (n=25). On Haswell GT3e: - Improves performance in Unreal's Elemental Demo (in GL 4.3 mode) by roughly 1.10x. - Improves performance in Synmark's Gl43CSDof by roughly 1.18x. - Decreases performance in Synmark's Gl43CSCloth by -1.99484% +/- 0.432771% (n=64). On Ivybridge GT2: - Improves performance in Unreal's Elemental Demo (in GL 4.2 mode) by roughly 1.03x. - Improves performance in Synmark's G/43CSDof by roughly 1.25x. - No change in Synmark's Gl43CSCloth (n=28). Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Fix CS scratch size calculations on Ivybridge and Baytrail.Kenneth Graunke2016-06-121-2/+4
| | | | | | | | | These are linear, not powers of two, and much more limited. Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Fix Haswell CS per-thread scratch space encoding.Kenneth Graunke2016-06-121-2/+14
| | | | | | | | | | | | | | | | | | | | | | Most scratch stages use power of two sizes, in kilobytes, where 0 means 1kB. But compute shaders on Haswell have a minimum of 2kB, and use a representation where 0 = 2kB. This meant that we were effectively telling the hardware to allocate each thread twice as much space as we meant to, while simultaneously not allocating that much space in the buffer, leading to overflows. Note that the existing code is completely wrong for Ivybridge, but that will take additional work to sort out, so I've left it as is for now. A subsequent commit will take care of that. Together with the previous patches, this fixes rendering corruption on Synmark's Gl43CSDof on Haswell. Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Fix shared local memory size for Gen9+.Kenneth Graunke2016-06-121-9/+2
| | | | | | | | | | | | | | | | | | | | | Skylake changes the representation of shared local memory size: Size | 0 kB | 1 kB | 2 kB | 4 kB | 8 kB | 16 kB | 32 kB | 64 kB | ------------------------------------------------------------------- Gen7-8 | 0 | none | none | 1 | 2 | 4 | 8 | 16 | ------------------------------------------------------------------- Gen9+ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | The old formula would substantially underallocate the amount of space. This fixes GPU hangs on Skylake when running with full thread counts. v2: Fix the Vulkan driver too, use a helper function, and fix the table in the comments and commit message. Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Remove old CS local ID handlingJordan Justen2016-06-011-4/+1
| | | | | | | | | | | | | The old method pushed data for each channels uvec3 data of gl_LocalInvocationID. The new method pushes 1 dword of data that is a 'thread local ID' value. Based on that value, we can generate gl_LocalInvocationIndex and gl_LocalInvocationID with some calculations. Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: Support new local ID push constant & cross-thread constantsJordan Justen2016-06-011-45/+49
| | | | | | | | | | | | | | | | | The cross thread constant support appears on Haswell. It allows us to upload a set of uniform data for all threads without duplicating it per thread. We also support per-thread data which allows us to store a per-thread ID in one of the uniforms that can be used to calculate the gl_LocalInvocationIndex and gl_LocalInvocationID variables. v4: * Support the old local ID push constant layout as well (Jason) Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: Store number of threads in brw_cs_prog_dataJordan Justen2016-06-011-22/+10
| | | | | | Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: Make all atoms to track BRW_NEW_BLORP by defaultKenneth Graunke2016-04-231-0/+3
| | | | Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com
* glsl: move to compiler/Emil Velikov2016-01-261-1/+1
| | | | | | Signed-off-by: Emil Velikov <emil.velikov@collabora.com> Acked-by: Matt Turner <mattst88@gmail.com> Acked-by: Jose Fonseca <jfonseca@vmware.com>
* i965: Trigger CS state reemission when new sampler state is uploaded.Francisco Jerez2016-01-191-0/+1
| | | | | | | | | | This reuses the NEW_SAMPLER_STATE_TABLE state bit (currently only used on pre-Gen7 hardware) to signal that the sampler state tables have changed in order to make sure that the GPGPU interface descriptor is updated. Reviewed-by: Marta Lofstedt <marta.lofstedt@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965/gen8/cs: Gen8 requires 64 byte alignment for push constant dataJordan Justen2015-12-221-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BDW PRM Vol2a: Command Reference: Instructions, section MEDIA_CURBE_LOAD, says that 'CURBE Total Data Length' and 'CURBE Data Start Address' are 64-byte aligned. This is different from previous gens, that were 32-byte aligned. v2 (Jordan): - CURBE Data Start Address is also 64-byte aligned. - The call to brw_state_batch should also use 64-byte alignment. - Improve PRM reference. v3: * New patch from Jordan. Always align base and size to 64 bytes. Fixes the following SSBO CTS tests on BDW: ES31-CTS.shader_storage_buffer_object.basic-atomic-case1-cs ES31-CTS.shader_storage_buffer_object.basic-operations-case1-cs ES31-CTS.shader_storage_buffer_object.basic-operations-case2-cs ES31-CTS.shader_storage_buffer_object.basic-stdLayout_UBO_SSBO-case2-cs ES31-CTS.shader_storage_buffer_object.advanced-write-fragment-cs ES31-CTS.shader_storage_buffer_object.advanced-indirectAddressing-case2-cs ES31-CTS.shader_storage_buffer_object.advanced-matrix-cs And many other CS CTS tests as reported by Marta Lofstedt. (Commit message is from Iago, but in v3, code is from Jordan.) Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Tested-by: Iago Toral Quiroga <itoral@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Enable shared local memory for CS shared variablesJordan Justen2015-12-091-0/+12
| | | | | | | | | v3: * Check shared variable size at link time Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
* i965/state: Get rid of dword_pitch arguments to buffer functionsJason Ekstrand2015-12-071-1/+1
| | | | | Cc: "11.0" <mesa-stable@lists.freedesktop.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: Clean up #includes in the compiler.Matt Turner2015-11-241-0/+2
| | | | Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
* i965: Setup pull constant state for compute programsJordan Justen2015-11-011-0/+31
| | | | | Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
* i965/cs: Split out helper for building local id payloadKristian Høgsberg Kristensen2015-10-081-71/+9
| | | | | | | | | | | | The initial motivation for this patch was to avoid calling brw_cs_prog_local_id_payload_dwords() in gen7_cs_state.c from the compiler. This commit ends up refactoring things a bit more so as to split out the logic to build the local id payload to brw_fs.cpp. This moves the payload building closer to the compiler code that uses the payload layout and makes it available to other users of the compiler. Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> Signed-off-by: Kristian Høgsberg Kristensen <krh@bitplanet.net>
* i965/cs: Remove the prog argument from local_id_payload_dwordsJason Ekstrand2015-10-021-4/+3
| | | | Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965/cs: Re-emit cs_state when surfaces have changedJordan Justen2015-09-291-1/+2
| | | | | | | | | Unlike rendering (BINDING_TABLE_POINTERS_*S), compute doesn't have a binding table pointers command. Instead it is part of the MEDIA_INTERFACE_DESCRIPTOR structure loaded by the brw_cs_state atom. Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
* i965/cs: Re-emit push constants and cs_state on new batchesJordan Justen2015-09-291-2/+4
| | | | | | | | | | We need to re-emit push constansts when a new batch is started since the push constants are stored in the batch. We also need to re-emit the MEDIA_INTERFACE_DESCRIPTOR (in brw_cs_state) since it is stored in the batch. Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
* i965: Move compute shader code aroundKristian Høgsberg Kristensen2015-09-141-0/+347
This moves the compute shader code around in order to make the way the code is split up more consistent. There should be no functional changes. Typically we have a few files per stage: brw_vs.c, brw_wm.c brw_gs.c: code to drive code generation and implement precompiling and cache search. genX_<stage>_state.c gen specific implementation of the state emission for the shader stage. The brw_*_emit() functions are all in the same files as the visitor classes they use (with the exception of VS, which may use either vec4 or fs). To make compute follow this convention, we move the brw_cs_emit() function into brw_fs.cpp. We can then rename brw_cs.cpp to brw_cs.c and do this in C like the other similar files. Finally, move state setup and atoms to gen7_cs_state.c. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Signed-off-by: Kristian Høgsberg Kristensen <krh@bitplanet.net>