I’ve been working on an elaborate shader recently, and its performance would significantly benefit from using as many as 9 passes all by itself. Moreover, my hope is to run it after Maister’s 3-pass NTSC shader (which can technically be reduced to two passes by combining the first two, but that would just add to the number of permutations that already exist for it). How difficult would it be to extend RetroArch and the Cg shader spec to allow as many as, say, 16 passes (for future headroom)? Are there any plans to do so? Alternately, how difficult would it be to add support for more advanced Cg shader profiles on GPUs that support them?
The shader I’m working on can technically be kept down to 5 passes with some tricks, but those tricks come at a performance cost. For instance, there are 2 independent source images, each requiring a different kind of 2-pass resampling/convolution. Normally I’d do this part in a total of 4 passes, but the 8-pass limit has forced me to get creative: stack the two images vertically and perform both convolutions in parallel in a total of 2 passes. (I can afford 6 passes in total if I also write a condensed 2-pass version of the NTSC shader, but using 7 passes means no NTSC shader at all…and using 4 passes for this step brings the shader up to 7 passes minimum.)
Unfortunately, this comes at an obscene cost: one of the images requires as many as 126 taps in each pass. That wouldn’t be a big deal if it had 2 passes all to itself, because the image is very small, but the image stacked above it is much larger. It also wouldn’t be that big of a deal if RetroArch’s shader profile supported dynamic branches like fp40 does…but it doesn’t (it seems to be fp30 with a looser instruction limit?). As a result, all 126 texture reads have to be performed not only for each pixel of the small image they’re meant for, but for every pixel of the unrelated output image above it, and for every other pixel whose result would otherwise be discarded.
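To illustrate what I mean, here’s a rough Cg sketch of the stacked pass, assuming fp40-style dynamic branching were available (the names `small_image_top`, `offsets`, and `weights` are placeholders for illustration, not anything literal from my shader, and 126-element uniform arrays might hit other limits in practice):

```cg
// Hypothetical fp40 fragment: only pixels belonging to the small image
// stacked at the bottom pay for the full 126-tap convolution.
float4 main_fragment(float2 texCoord : TEXCOORD0,
                     uniform sampler2D source : TEXUNIT0,
                     uniform float small_image_top,  // placeholder: y-coordinate of the stack split
                     uniform float2 offsets[126],    // placeholder: tap offsets
                     uniform float weights[126])     // placeholder: tap weights
    : COLOR
{
    if (texCoord.y >= small_image_top)
    {
        // Heavy path: the 126-tap convolution for the small image.
        float4 sum = float4(0.0, 0.0, 0.0, 0.0);
        for (int i = 0; i < 126; i++)
            sum += weights[i] * tex2D(source, texCoord + offsets[i]);
        return sum;
    }
    // Cheap path: pixels of the larger stacked image above skip the taps.
    return tex2D(source, texCoord);
}
```

Under fp30-style compilation, a branch like this gets flattened, so both sides execute for every pixel anyway, which is exactly the problem I’m describing.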
If I had extra passes or dynamic branches, those 126 taps over a small output area would be amortized enough that they’d only amount to roughly 2 to 6 texture reads per viewport pixel, depending on some factors. With dynamic branching in particular, there would be far fewer samples altogether…but instead, they add more like 25 texture reads per viewport pixel. It’s not absolutely prohibitive, but it limits 60 FPS performance to GPUs more expensive than I’d like.
If nothing else, are there any plans to add mipmapping or even anisotropic filtering support for external textures specified in the .cgp file? It’s no guarantee, but I might not even have to do that troublesome convolution at all if there were options for better texture filtering.
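For reference, here’s the kind of thing I have in mind: the existing .cgp external texture syntax, plus hypothetical per-texture filtering options (the `_mipmap` and `_anisotropy` keys below are made up to show the idea; as far as I know only `_linear` exists today):

```
# Current Cg preset syntax for an external lookup texture:
textures = "background"
background = "background.png"
background_linear = true

# Hypothetical additions (these options don't exist yet):
background_mipmap = true       # generate mipmaps for this texture
background_anisotropy = 8.0    # max anisotropic filtering level
```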
Thanks