summaryrefslogtreecommitdiffstats
path: root/src/mesa/drivers/dri/i965/brw_compiler.c
Commit message (Collapse)AuthorAgeFilesLines
* mesa: remove gl_shader_compiler_options::EmitNoNoiseMarek Olšák2016-10-191-1/+0
| | | | | | | it's always true Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
* intel: s/brw_device_info/gen_device_info/Jason Ekstrand2016-09-031-1/+1
| | | | | | | | | | | | | Generated by: sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.c sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.h sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.c sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.cpp sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.h Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Rewrite FS input handling to use the new NIR intrinsics.Kenneth Graunke2016-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This eliminates the need to walk the list of input variables, recurse into their types (via logic largely redundant with nir_lower_io), and interpolate all possible inputs up front. The backend no longer has to care about variables at all, which eliminates complications from trying to pack multiple variables into the same location. Instead, each intrinsic specifies exactly what's needed. This should unblock Timothy's work on GL_ARB_enhanced_layouts. Each load_interpolated_input intrinsic corresponds to PLN instructions, while load_barycentric_at_* intrinsics correspond to pixel interpolator messages. The pixel/centroid/sample barycentric intrinsics simply refer to payload fields (delta_xy[]), and don't actually generate any code. Because we use a single intrinsic for both centroid-qualified variables and interpolateAtCentroid(), they become indistinguishable. We stop sending pixel interpolator messages for those, and instead use the payload provided data, which should be considerably faster. On Broadwell: total instructions in shared programs: 9067751 -> 9067570 (-0.00%) instructions in affected programs: 145902 -> 145721 (-0.12%) helped: 422 HURT: 209 total spills in shared programs: 2849 -> 2899 (1.76%) spills in affected programs: 760 -> 810 (6.58%) helped: 0 HURT: 10 total fills in shared programs: 3910 -> 3950 (1.02%) fills in affected programs: 617 -> 657 (6.48%) helped: 0 HURT: 10 LOST: 3 GAINED: 3 The differences mostly appear to be slight changes in MOVs. v2: Use nir_shader_compiler_options::use_interpolated_input_intrinsics flag rather than passing it directly to nir_lower_io. Use the unreachable() macro rather than assert in one place. (Review feedback from Chris Forbes.) Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Chris Forbes <chrisforbes@google.com> Acked-by: Jason Ekstrand <jason@jlekstrand.net>
* i965/compiler: Bring back the INTEL_PRECISE_TRIG environment variableJason Ekstrand2016-06-131-0/+2
| | | | | | | | | | | This was removed in d9546b0c5d and replced with the precise_trig driconf option. However, we still need precise trig in the Vulkan driver so this commit brings back the environment variable and compiler->precise_trig is effectively the logical OR of the two. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96484 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Cc: "12.0" <mesa-stable@lists.freedesktop.org>
* i965: Integrate precise trig into configuration infrastructureGurchetan Singh2016-06-071-2/+0
| | | | | | | | | | | | | | With this change, to enable precise SIN and COS instructions on Intel hardware, one can put <option name="precise_trig" value="true"/> in the proper drirc file. V2: Make option name more generic Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Stephane Marchesin <stephane.marchesin@gmail.com>
* i965: Enable cross-thread constants and compact local IDs for hsw+Jordan Justen2016-06-011-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cross thread constant support appears on Haswell. It allows us to upload a set of uniform data for all threads without duplicating it per thread. One complication is that cross-thread constants are loaded into registers before per-thread constants. Previously, our local IDs were loaded before the uniform data and treated as 'payload' data, even though they were actually pushed into the registers like the other uniform data. Therefore, in this patch we simultaneously enable a newer layout where each thread now uses a single uniform slot for a unique local ID for the thread. This uniform is handled specially to make sure it is added last into the uniform push constant registers. This minimizes our usage of push constant registers, and maximizes our ability to use cross-thread constants for registers. To swap from the old to the new layout, we also need to flip some lowering pass switches to let our driver handle the lowering instead. We also no longer force thread_local_id_index to -1. v4: * Minimize size of patch that switches from the old local ID layout to the new layout (Jason) Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* nir: Make lowering gl_LocalInvocationIndex optionalJordan Justen2016-06-011-1/+2
| | | | | | Cc: "12.0" <mesa-stable@lists.freedesktop.org> Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: Move compiler debug functions to intel_screen.cJason Ekstrand2016-05-261-42/+0
| | | | | | | They reference the compiler so they shouldn't go in libi965_compiler.la. Reviewed-by: Emil Velikov <emil.velikov@collabora.com> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
* glsl: Add an option to clamp block indices when lowering UBO/SSBOsJason Ekstrand2016-05-231-0/+1
| | | | | | | | This prevents array overflow when the block is actually an array of UBOs or SSBOs. On some hardware such as i965, such overflows can cause GPU hangs. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* mesa/main: Add support for GL_ARB_cull_distance (v2)Tobias Klausmann2016-05-141-1/+1
| | | | | | | | | | | | airlied: v2: rename LowerClipDistance to LowerCombinedClipCullDistnace. I don't think we want any other behaviour with any current hw. Signed-off-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de> Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com> Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Kristian Høgsberg <krh@bitplanet.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
* i965: Enable scalar GS by default.Kenneth Graunke2016-05-121-1/+1
| | | | | | | | | | | | | | | | | I'd originally left this off because Orbital Explorer was hanging the GPU, but it seems to be working these days. There have been a bunch of changes since then, so we probably fixed something. On my Broadwell laptop, both Synmark/GSCloth and Orbital Explorer seem to run at approximately the same framerate in either mode. This is despite large reductions in instruction count for Synmark, and large increases for Orbital Explorer. It apparently just doesn't matter. Switching to scalar mode will gain us fp64 support in the next release, as vec4-mode support isn't yet ready. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Stop splitting fma() prior to optimizationJason Ekstrand2016-05-111-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | According to the GLSL spec, if the user uses the fma() intrinsic to generate a precise-consumed value, and you have it in your hardware, you shouldn't split it. For a while now, we've been splitting all ffma's up-front and then planned to fuse them later which isn't valid. Correctly handling the GLSL behaviour fixes rendering corruptions in Tomb Raider. The only reason why doing this possibly helped before was for ARB programs which is handled by the previous commit. Shader-db results on Haswell: total instructions in shared programs: 7560300 -> 7561510 (0.02%) instructions in affected programs: 56265 -> 57475 (2.15%) helped: 86 HURT: 291 The only shaders in the database that are affected are from "Shadow of Mordor" which is the first app in our database to use fma(). We could, at some point in the future, split inexact ffma opcodes which would fix the shader-db regressions since Shadow of Mordor doesn't ues precise. However, this fixes a bug now and and the shader-db impact is fairly small. Reported-by: Kenneth Graunke <kenneth@whitecape.org> Cc: "11.1 11.2" <mesa-stable@lists.freedesktop.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
* i965: use double lowering passConnor Abbott2016-05-101-0/+1
| | | | | | | | | | v2: also lower trunc, ceil, floor, fract and roundEven (Iago) v3: also lower mod for doubles (Sam) Signed-off-by: Iago Toral Quiroga <itoral@igalia.com> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net>
* i965: enable lrp lowering for doublesSamuel Iglesias Gonsálvez2016-05-101-0/+1
| | | | | | | | | Broadwell and previous generations does not support lrp instruction operating with doubles. Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Francisco Jerez <currojerez@riseup.net>
* Revert "Revert "i965: Switch to scalar TCS by default.""Kenneth Graunke2016-05-091-1/+1
| | | | | | | | This reverts commit bd326c229c528a214c9fda705e7a961cfa49ac9e. Now that we've fixed the GPU hangs, let's turn it back on. Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
* Revert "i965: Switch to scalar TCS by default."Kenneth Graunke2016-05-051-1/+1
| | | | | | | This reverts commit b593737ed8349b280fa29242c35f565b59ab3025. Apparently it causes GPU hangs on some image load store tests. Let's turn it back off until we figure out why.
* i965: Switch to scalar TCS by default.Kenneth Graunke2016-05-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Normally, we expect SIMD8 shaders to be more instructions than SIMD4x2 shaders, as it takes four instructions to operate on a vec4, rather than a single instruction. However, the benefit is that it can process 8 objects per shader thread instead of 2. Surprisingly, the shader-db statistics show an improvement in both instruction and cycle counts: Synmark: -31.25% instructions, -29.27% cycles, 0 hurt. Tessmark: -36.92% instructions, -37.81% cycles, 0 hurt. Unigine Heaven: -3.42% instructions, -17.95% cycles, 0 hurt. Shadow of Mordor: +13.24% instructions (26 with fewer instructions, 45 with more), -5.23% cycles (44 with fewer cycles, 27 with more cycles). Presumably, this is because the SIMD8 URB messages are a much more natural fit than the SIMD4x2 URB messages - there's a ton less header setup. I benchmarked Shadow of Mordor and Unigine Heaven on my Skylake GT3e, and the performance seems to be the same or increase ever so slightly (< 1 FPS difference). So I believe it's strictly superior. There's also a lot more optimization potential we can do in scalar mode. This will also help us finish fp64 support, as scalar support is going to land much sooner than vec4-mode support. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Matt Turner <mattst88@gmail.com>
* nir: Separate 32 and 64-bit fmod loweringSamuel Iglesias Gonsálvez2016-05-041-1/+1
| | | | | | | | Split 32-bit and 64-bit fmod lowering as the drivers might need to lower them separately inside NIR depending on the HW support. Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* i965: Write a scalar TCS backend that runs in SINGLE_PATCH mode.Kenneth Graunke2016-05-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | | Unlike most shader stages, the Hull Shader hardware makes us explicitly tell it how many threads to dispatch and manually configure the channel mask. One perk of this is that we have a lot of flexibility - we can run it in either SIMD4x2 or SIMD8 mode. Treating it as SIMD8 means that shaders with 8 or fewer output vertices (which is overwhemingly the common case) can be handled by a single thread. This has several intriguing properties: - Accessing input arrays with gl_InvocationID as the index is a simple SIMD8 URB read with g1 as the header. No indirect addressing required. - Barriers are no-ops. - We could potentially do output shadowing to combine writes, as the concurrency concerns are gone. (We don't do this yet, though.) v2: Drop first_non_payload_grf change, as it was always adding 0 (caught by Jordan Justen). Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
* nir: rename lower_flrp to lower_flrp32Samuel Iglesias Gonsálvez2016-04-281-1/+1
| | | | | | | A later patch will add lower_flrp64 option to NIR. Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* nir/lower_system_values: Add support for several computed valuesJason Ekstrand2016-04-111-1/+2
| | | | Reviewed-by: Rob Clark <robdclark@gmail.com>
* i965: Add an INTEL_PRECISE_TRIG=1 option to fix SIN/COS output range.Kenneth Graunke2016-04-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The SIN and COS instructions on Intel hardware can produce values slightly outside of the [-1.0, 1.0] range for a small set of values. Obviously, this can break everyone's expectations about trig functions. According to an internal presentation, the COS instruction can produce a value up to 1.000027 for inputs in the range (0.08296, 0.09888). One suggested workaround is to multiply by 0.99997, scaling down the amplitude slightly. Apparently this also minimizes the error function, reducing the maximum error from 0.00006 to about 0.00003. When enabled, fixes 16 dEQP precision tests dEQP-GLES31.functional.shaders.builtin_functions.precision. {cos,sin}.{highp,mediump}_compute.{scalar,vec2,vec4,vec4}. at the cost of making every sin and cos call more expensive (about twice the number of cycles on recent hardware). Enabling this option has been shown to reduce GPUTest Volplosion performance by about 10%. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
* i965: Have NIR lower flrp on pre-GEN6 vec4 backendIan Romanick2016-03-221-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we were doing the lowering by hand in vec4_visitor::emit_lrp. By doing it in NIR, we have the opportunity for NIR to do additional optimization of the expanded code. This also enables optimizations added by the next commit. shader-db results: G4X / Ironlake total instructions in shared programs: 4024401 -> 4016538 (-0.20%) instructions in affected programs: 447686 -> 439823 (-1.76%) helped: 2623 HURT: 0 total cycles in shared programs: 84375846 -> 84328296 (-0.06%) cycles in affected programs: 16964960 -> 16917410 (-0.28%) helped: 2556 HURT: 41 Unsurprisingly, no changes on later platforms. v2: Formatting and comment changes suggested by Matt. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Matt Turner <mattst88@gmail.com>
* mesa: Remove EmitCondCodes.Matt Turner2016-03-011-1/+0
| | | | | | Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Acked-by: Brian Paul <brianp@vmware.com>
* i965/gen7+: Use NIR for lowering of pack/unpack opcodes.Matt Turner2016-02-011-0/+15
| | | | Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
* i965/fs: Switch from GLSL IR to NIR for un/packHalf2x16 scalarizing.Matt Turner2016-02-011-0/+2
| | | | Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
* i965: Make separate nir_options for scalar/vector stages.Matt Turner2016-02-011-28/+33
| | | | | | | We'll want to have different lowering options set for scalar/vector stages. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
* i965: Move brw_compiler_create() to new brw_compiler.c.Matt Turner2016-02-011-0/+157
A future patch will want to use designated initalizers, which aren't available in C++, but this is C. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>