I’ve been working on an elaborate shader recently, and its performance would significantly benefit from using as many as 9 passes all by itself. Moreover, my hope is to run it after Maister’s 3-pass NTSC shader (which can technically be reduced to two passes by combining the first two, but that would just add to the number of permutations that already exist for it). How difficult would it be to extend RetroArch and the Cg shader spec to allow as many as, say, 16 passes (for future headroom)? Are there any plans to do so? Alternately, how difficult would it be to add support for more advanced Cg shader profiles on GPUs that support them?
The shader I’m working on can technically be kept down to 5 passes with some tricks, but those tricks come at a performance cost. For instance, there are 2 independent source images, each requiring a different kind of 2-pass resampling/convolution. Normally I’d do this part in a total of 4 passes, but the 8-pass limit has forced me to get creative: stack the two images vertically and perform both convolutions in parallel in a total of 2 passes. (I can afford 6 passes in total if I also write a condensed 2-pass version of the NTSC shader, but using 7 passes means no NTSC shader at all…and using 4 passes for this step brings the shader up to 7 passes minimum.)
Unfortunately, this comes at an obscene cost: one of the images requires as many as 126 taps in each pass. That wouldn’t be a big deal if it had 2 passes all to itself, because the image is very small, but the image stacked above it is much larger. It also wouldn’t be that big of a deal if RetroArch’s shader profile supported dynamic branches like fp40 does…but it doesn’t (it seems to be fp30 with a looser instruction limit?). As a result, all 126 texture reads have to be performed not only for each pixel of the small image they’re meant for, but for every pixel of the unrelated output image above it, and for every other pixel whose result would otherwise be discarded.
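To illustrate what I mean, here’s a rough Cg sketch of the stacked pass, assuming fp40-style dynamic branching were available (the names `small_image_top`, `offsets`, and `weights` are placeholders for illustration, not anything literal from my shader, and 126-element uniform arrays might hit other limits in practice):

```cg
// Hypothetical fp40 fragment: only pixels belonging to the small image
// stacked at the bottom pay for the full 126-tap convolution.
float4 main_fragment(float2 texCoord : TEXCOORD0,
                     uniform sampler2D source : TEXUNIT0,
                     uniform float small_image_top,  // placeholder: y-coordinate of the stack split
                     uniform float2 offsets[126],    // placeholder: tap offsets
                     uniform float weights[126])     // placeholder: tap weights
    : COLOR
{
    if (texCoord.y >= small_image_top)
    {
        // Heavy path: the 126-tap convolution for the small image.
        float4 sum = float4(0.0, 0.0, 0.0, 0.0);
        for (int i = 0; i < 126; i++)
            sum += weights[i] * tex2D(source, texCoord + offsets[i]);
        return sum;
    }
    // Cheap path: pixels of the larger stacked image above skip the taps.
    return tex2D(source, texCoord);
}
```

Under fp30-style compilation, a branch like this gets flattened, so both sides execute for every pixel anyway, which is exactly the problem I’m describing.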
If I had extra passes or dynamic branches, those 126 taps over a small output area would be amortized enough that they’d only amount to roughly 2 to 6 texture reads per viewport pixel, depending on some factors. With dynamic branching in particular, there would be far fewer samples altogether…but instead, they add more like 25 texture reads per viewport pixel. It’s not absolutely prohibitive, but it limits 60 FPS performance to GPUs more expensive than I’d like.
If nothing else, are there any plans to add mipmapping or even anisotropic filtering support for external textures specified in the .cgp file? It’s no guarantee, but I might not even have to do that troublesome convolution at all if there were options for better texture filtering.
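For reference, here’s the kind of thing I have in mind: the existing .cgp external texture syntax, plus hypothetical per-texture filtering options (the `_mipmap` and `_anisotropy` keys below are made up to show the idea; as far as I know only `_linear` exists today):

```
# Current Cg preset syntax for an external lookup texture:
textures = "background"
background = "background.png"
background_linear = true

# Hypothetical additions (these options don't exist yet):
background_mipmap = true       # generate mipmaps for this texture
background_anisotropy = 8.0    # max anisotropic filtering level
```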
Thanks