Lanczos2 sharp shader

Hyllian · 15 January 2017 05:07

I can confirm the anti-ringing code works with the Catmull-Rom filter inside the crt-hyllian shader! Yes! Sharpness and non-ring FTW!

The next crt-hyllian shader will have this.

system · 15 January 2017 05:07

I’m glad you got the anti-ringing working! The screenshots look quite smooth at the edges. I’ve been trying all day yesterday and today to get that “Jinc with fewer samples” shader working too, but no dice yet…I’m running into some nasty nondeterministic issues with the method from GPU Pro 2, so I can’t even tell how many bugs I’m trying to fix yet. :-/

> Played around with sRGB a bit today. It appears that aliasing GL_RGBA8 and GL_SRGB8_ALPHA8 for GL cores is simply not possible. I get different swizzling on my nVidia GTX 760. I hope that’s a bug in their driver, because the GL spec says the two should be compatible formats …
>
> glCopyTexSubImage2D from FBO seems to be a big speed hit since it has to go through a format conversion. glCopyImageSubData (raw copy) has the same swizzling issue as glTextureView.
>
> The workaround is to use GL_TEXTURE_SWIZZLE_RGBA, but if it’s not programmatically possible to figure out how the swizzling happens, then ye … :[

Wow, that’s not good news at all. I haven’t really looked into this yet, but are the libretro GL cores writing to a GL_BGRA format texture by any chance? If so, I wonder what would happen if the GL_RGBA format was used? If it suddenly works, it might be a driver bug after all.

Maister · 15 January 2017 05:07

There is no GL_BGRA/GL_BGRA8 internal format. It is not specified how GL_RGBA8 is represented internally. The last two params you pass to glTexSubImage2D (e.g. GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV) just specify the exact input format (if it doesn’t match what the internal format wants, you have to convert, etc.). However, it’s obvious here that GL_RGBA8 and GL_SRGB8_ALPHA8 don’t have the same internal representation on the GPU, which I find very awkward considering that RGBA8 and SRGB8_ALPHA8 are essentially exactly the same thing except for an optional gamma-table lookup.
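
A small illustration of the point about external formats, assuming libretro-style XRGB8888 pixels held in 32-bit words (the function name is illustrative, not a GL or libretro API). This sketch only models how the format/type pair tells GL to read client memory; it says nothing about how the driver stores the internal format.

```python
def unpack_bgra_8888_rev(word):
    """Split one 32-bit texel the way GL reads GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV.

    Packed types are read as whole native-endian words. With a *_REV type the
    first component named by the format (B for GL_BGRA) sits in the least
    significant bits, so the word is laid out as 0xAARRGGBB.
    Returns the channels in (R, G, B, A) order.
    """
    b = word & 0xFF
    g = (word >> 8) & 0xFF
    r = (word >> 16) & 0xFF
    a = (word >> 24) & 0xFF
    return r, g, b, a


# Which internal format (GL_RGBA8 or GL_SRGB8_ALPHA8) these channels land in,
# and how the driver orders them in VRAM, is a separate, implementation-defined matter.
print(unpack_bgra_8888_rev(0x80FF4020))  # (255, 64, 32, 128)
```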

Anyways, I’ve sent a bug report to nVidia.

I guess the only thing we can rely on, if anything, is that FBO textures can be sRGB. That sounds pretty limited to me, but it’s far easier to implement at least, and it might make sense anyway: the input format you get from a libretro (GL) core is probably not perfect sRGB. Also, if sRGB is not supported, you won’t get wrong gamma, just more banding and/or squashed blacks, and using higher-precision formats for FBOs would be a valid workaround.
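
To put a number on the “more banding and/or squashed blacks” remark: a sketch, assuming the standard piecewise sRGB decode curve and round-to-nearest quantization, of what happens when linearized values land in a plain 8-bit UNORM FBO instead of an sRGB one.

```python
def srgb_to_linear(c):
    """sRGB decode curve (IEC 61966-2-1), input and output in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def store_linear_in_rgba8(srgb_code):
    """Decode an 8-bit sRGB code to linear light, then quantize into a plain 8-bit UNORM FBO."""
    return round(srgb_to_linear(srgb_code / 255) * 255)


codes = [store_linear_in_rgba8(k) for k in range(256)]
print(codes[:8])        # [0, 0, 0, 0, 0, 0, 0, 1]: the darkest inputs collapse to black
print(len(set(codes)))  # fewer than 256 distinct levels survive
```

The gamma stays correct on average; what is lost is the ability to tell the darkest shades apart.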

I guess we also want the option to mipmap the input texture as well as FBO results? That would be handy for bloom and/or fake HDR-like effects, as you could blur over higher LOD levels without having to stamp out a shitton of shader passes.

system · 15 January 2017 05:07

Yeah, that makes sense. I know GL_BGRA, etc. has to match the input pixel data, but I was starting to wonder if OpenGL (or nVidia’s implementation) also used it as a hint for how to represent the data internally (perhaps as a subtype of the actual internal format parameter).

As far as the rest goes: It seems implementing sRGB FBO’s is no problem for pass1+, but it’s clearly hairier for pass0. Can you rephrase the solution you’re suggesting? I didn’t catch your meaning.

I do agree the input format coming from libretro cores is almost never perfect sRGB, at least for the current cores. For instance, NTSC consoles output an RGB format based on NTSC colorimetry and 2.2 gamma, which predates sRGB. (The colorimetry difference doesn’t actually matter in this case, because the decoding step only handles the sRGB gamma curve.) However, the point of decoding the input as sRGB is that the sRGB curve is very close to 2.2 gamma anyway, and using the sRGB FBO allows for gamma-correct bilinear filtering. It’s the difference between taking one bilinear texture sample with no pow() function and taking four texture samples with carefully snapped coordinates followed by four pow() functions. One use case for this is a much faster Gaussian blur of the input using half as many bilinear samples (without needing pow() at all).
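
A one-texel-pair sketch of why gamma-correct filtering matters, assuming the standard sRGB curve as a stand-in for 2.2 gamma. Blending halfway between black and white in gamma space gives a different answer than decoding first, which is what an sRGB texture's bilinear filter does in hardware (and what the four-sample pow() approach does by hand).

```python
def srgb_to_linear(c):
    """sRGB decode curve (IEC 61966-2-1)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(x):
    """Inverse of srgb_to_linear."""
    if x <= 0.0031308:
        return x * 12.92
    return 1.055 * x ** (1 / 2.4) - 0.055


def blend_gamma_space(a, b, t):
    """Plain bilinear weight applied to gamma-encoded values (RGBA8 texture)."""
    return a * (1 - t) + b * t


def blend_linear_space(a, b, t):
    """Decode, blend, re-encode: the result an sRGB texture's filter gives you,
    re-encoded here only so the two numbers are directly comparable."""
    mixed = srgb_to_linear(a) * (1 - t) + srgb_to_linear(b) * t
    return linear_to_srgb(mixed)


# Halfway between a black and a white texel:
print(blend_gamma_space(0.0, 1.0, 0.5))             # 0.5 (too dark)
print(round(blend_linear_space(0.0, 1.0, 0.5), 3))  # 0.735
```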

Using mipmaps on the input texture and FBO results would indeed allow for some faster operations if you ignore the mipmap generation overhead, but I imagine it might be prohibitively slow in practice, unless it’s only used for the largest of blurs. In comparison, it’s cheap to generate mipmaps for an LUT texture, because you only have to do it when the LUT is initially loaded, but FBO mipmaps would have to be computed once a frame for each affected FBO…ouch. Still, it’s probably worth trying, since the default would be “off” anyway. The speed might surprise me.

Maister · 15 January 2017 05:07

You could have a first pass being a “linearizing” pass, simply decoding the input frame with “correct” gamma, and stick it in an sRGB FBO. That’s done at 1x scale and should be pretty fast considering the resolution.
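
A numeric sketch of that pipeline for a single dark pixel, assuming the first pass decodes with a plain 2.2 power curve and that the sRGB FBO quantizes to 8 bits after encoding on write and decodes on read. The function names are hypothetical. It compares the same linearized value stored in a plain RGBA8 FBO versus an sRGB one.

```python
def srgb_to_linear(c):
    """sRGB decode curve (IEC 61966-2-1)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(x):
    """Inverse of srgb_to_linear."""
    if x <= 0.0031308:
        return x * 12.92
    return 1.055 * x ** (1 / 2.4) - 0.055


def linearize(v, gamma=2.2):
    """Hypothetical pass 0: undo a simple power-law display gamma."""
    return v ** gamma


def through_rgba8(linear):
    """Write to a plain 8-bit FBO, then sample it back."""
    return round(linear * 255) / 255


def through_srgb8(linear):
    """Write to an sRGB FBO (hardware encodes, keeps 8 bits), then sample it (hardware decodes)."""
    return srgb_to_linear(round(linear_to_srgb(linear) * 255) / 255)


dark = linearize(16 / 255)  # a dark input pixel, about 0.0023 in linear light
for name, store in (("RGBA8", through_rgba8), ("SRGB8_ALPHA8", through_srgb8)):
    err = abs(store(dark) - dark) / dark
    print(f"{name:13} relative error {err:.1%}")
```

The sRGB target keeps dark values within a few percent, while the plain 8-bit target rounds this one up to the first nonzero code.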

Having sRGB support on the input texture for GL cores simply appears to be too big of a portability hazard for it to be worth going down that road.

Mipmapping an FBO is pretty fast. It’s generally a very big win if you’re trying to do large blurs. (20x20 blur on full texture or 3x3 blur on higher LOD version? I’d take mipmaps).
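
A sketch of the mipmap argument in plain Python, assuming a simple 2x2 box filter for mip generation (the filter glGenerateMipmap actually uses is up to the implementation) and linear-light values. It also counts how many base-level texels a small kernel spans at a higher LOD.

```python
def downsample_2x2(img):
    """Build the next mip level with a 2x2 box filter.

    Assumes even dimensions and linear-light values (an sRGB FBO's mip
    generation is expected to decode before averaging).
    """
    h, w = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]


def footprint_in_base_texels(taps, lod):
    """Width, in level-0 texels, spanned by `taps` adjacent samples at mip level `lod`."""
    return taps * 2 ** lod


print(downsample_2x2([[0, 4], [8, 12]]))  # [[6.0]]
# A 3x3 kernel at LOD 3 spans 24x24 base texels with 9 fetches,
# versus 400 fetches for a direct 20x20 kernel at LOD 0.
print(footprint_in_base_texels(3, 3), 3 * 3, 20 * 20)  # 24 9 400
```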

system · 15 January 2017 05:07

Thanks for the clarification! I was trying to figure out why you were talking about banding and black crush, but it all seems obvious now: As you said, if you do a first pass that linearizes and use sRGB FBO’s from that point forward, you can use the same shader chain whether sRGB is supported or not…and if it isn’t, the gamma is still correct, and you just get banding to let you know what went wrong. I wonder if it’s worth adding an option to automatically use floating point FBO’s in place of sRGB FBO’s when sRGB isn’t present? It would be very slow, but it would allow users to trade off quality and performance. (Unrelated tangent: There are still occasions when floating point FBO’s are outright superior too, not just for precision, but for packing multiple values for the next pass with something like pack_2half().)

It’s a shame the swizzling issue is creating such complications, but a linearizing first pass written to an sRGB FBO is definitely a huge improvement over the current situation. Is that really faster than glCopyTexSubImage2D though? If so, that surprises me…

Also, I’m glad to hear mipmapping FBO’s is viable! I didn’t want to suggest it before, because I thought it would be futile, but if I recall correctly it’s actually done for bloom operations in game engines anyway, so yeah. Blurring is one of the most time-consuming shader tasks, so combined with sRGB FBO’s for direct bilinear sampling, mipmapped FBO’s would be a huge win.
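
The “half as many bilinear samples” idea is the widely used linear-sampling trick for separable Gaussian blurs, sketched here in 1D as a generic technique rather than code from any shader in this thread. It only works because the hardware filter interpolates linearly in the space where you want to average, which is exactly what sRGB textures provide. The `linear_fetch` helper is a hypothetical stand-in for a hardware bilinear fetch.

```python
import math


def gaussian_weights(radius, sigma):
    """Normalized one-sided weights w[0..radius] of a symmetric 1D Gaussian."""
    w = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(radius + 1)]
    total = w[0] + 2 * sum(w[1:])
    return [x / total for x in w]


def merge_taps(w):
    """Fold texel pairs (1,2), (3,4), ... into one bilinear tap each.

    A linear fetch at offset a + w[a+1] / (w[a] + w[a+1]), scaled by
    w[a] + w[a+1], returns exactly w[a]*s[a] + w[a+1]*s[a+1].
    """
    taps = []
    for a in range(1, len(w), 2):
        wa, wb = w[a], (w[a + 1] if a + 1 < len(w) else 0.0)
        taps.append((a + wb / (wa + wb), wa + wb))
    return w[0], taps


def linear_fetch(s, x):
    """1D stand-in for a hardware bilinear fetch at fractional position x."""
    i = math.floor(x)
    f = x - i
    return s[i] * (1 - f) + s[i + 1] * f


def blur_direct(s, c, w):
    return w[0] * s[c] + sum(w[i] * (s[c - i] + s[c + i]) for i in range(1, len(w)))


def blur_merged(s, c, center, taps):
    return center * s[c] + sum(
        wt * (linear_fetch(s, c - off) + linear_fetch(s, c + off)) for off, wt in taps)


w = gaussian_weights(radius=8, sigma=3.0)
center, taps = merge_taps(w)
signal = [math.sin(0.7 * i) + (i % 3) for i in range(40)]
print(1 + 2 * (len(w) - 1), "fetches vs", 1 + 2 * len(taps))  # 17 fetches vs 9
print(abs(blur_direct(signal, 20, w) - blur_merged(signal, 20, center, taps)) < 1e-12)  # True
```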

As a side note, I wonder if the mipmapping algorithm takes into account whether the FBO is sRGB or not (i.e. if it decodes, generates mipmap levels, then re-encodes)? The GL_EXT_framebuffer_sRGB spec doesn’t say. The GL_EXT_texture_sRGB spec makes strong suggestions for implementations to do it properly, but it doesn’t mandate it (or maybe it can’t, due to conflicting language in the OpenGL standard leaving mipmapping methods up to the implementation).

Maister · 15 January 2017 05:07

The GL spec does not mandate that bilinear filtering on sRGB is “correct”. It’s possible for it to do bilinear filtering, then gamma. However, no modern GPUs are that primitive. Same thing with mipmapping. I expect it to just work.

In GLES2, you cannot do sRGB + mipmapping, at least not with GL_EXT_sRGB. There is an extension to do sRGB + mipmapping in GLES2: https://www.khronos.org/registry/gles/e … p_sRGB.txt. The GLES3 spec doesn’t say that you cannot mipmap sRGB, so I guess it’s supported.

EDIT: I don’t know of any GPUs where float FBOs would be supported (patented!) and sRGB is not. GL_RGBA16 is a more likely fallback format I guess.

Hyllian · 15 January 2017 05:07

Jinc and Sinc filters added to the repo.

Though the Jinc ones are using Sinc as a Jinc approximation.

Inside new “windowed” folder: https://github.com/libretro/common-shad … r/windowed

There are three jinc flavors: default, sharp and sharper. I also added lanczos2-sharp.

All of them have the anti-ringing code.

system · 15 January 2017 05:07

I wonder how well-supported that GLES2 extension is nowadays? Even if it’s rare, it’s not so bad though: The shaders that most need to combine sRGB and mipmapping are the ones that do heavy blooms, and mobile users probably aren’t going to be using too many of those until they get beefier GPU’s anyway.

That wouldn’t be bad at all. The format clamps to [0.0, 1.0] and expands to a 16-bit int range, right? Heck, it’s a better fallback than floating point FBO’s anyway, because it’s half the size of GL_RGBA32F and better precision than sRGB.
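
For reference, GL_RGBA16 is an unsigned normalized format, so values are clamped to [0, 1] and stored as 16-bit integers, and a texel takes 8 bytes against 16 for GL_RGBA32F. The sketch below, assuming the standard sRGB decode curve, compares the linear-light size of the smallest step near black and near white for the three candidate formats.

```python
def srgb_to_linear(c):
    """sRGB decode curve (IEC 61966-2-1)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def identity(c):
    """UNORM formats store linear values directly."""
    return c


def step_sizes(decode, max_code):
    """Linear-light size of the first step above black and the last step below white."""
    darkest = decode(1 / max_code)
    brightest = 1.0 - decode((max_code - 1) / max_code)
    return darkest, brightest


# name: (nominal bytes per RGBA texel, (darkest step, brightest step))
FORMATS = {
    "GL_RGBA8": (4, step_sizes(identity, 255)),
    "GL_SRGB8_ALPHA8": (4, step_sizes(srgb_to_linear, 255)),
    "GL_RGBA16": (8, step_sizes(identity, 65535)),
}
# GL_RGBA32F would be 16 bytes per texel, twice GL_RGBA16.

for name, (size, (dark, bright)) in FORMATS.items():
    print(f"{name:16} {size:2} B  darkest step {dark:.1e}  brightest step {bright:.1e}")
```

GL_RGBA16 has the finest steps at both ends, while 8-bit sRGB trades coarser steps near white for much finer ones near black.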

I was under the impression the jinc and sinc kernels are identical anyway. The only difference is usage, right? (That is, you use a 2D radial distance for jinc and do it all in one pass to avoid angular bias.)

The “windowed” folder was a really good idea, by the way. There are so many different variations on sinc/jinc windowing that I can imagine there being a huge family of shaders in there eventually.

Maister · 15 January 2017 05:07

Alright, I guess I’ll try to implement sRGB + mipmapping support tomorrow.

Hyllian · 15 January 2017 05:07

Theoretically, they’re very distinct functions. But for X < 2.5 I’ve found a sinc function that is a very good approximation of the jinc. The biggest difference between the sinc used in the windowed filters and the jinc is the location of the zeros: the sinc has its zeros at 1.0 and 2.0, and the jinc at 1.22 and 2.233. The sinc function can be used in cylindrical coordinates too; the lanczos2-sharp I added uses it.
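
The zero locations quoted above are easy to check numerically. A sketch assuming the common normalization jinc(x) = 2·J1(πx)/(πx) (the zeros do not depend on the constant factor); J1 is computed from its power series so the snippet needs only the standard library. This only compares the unwindowed kernels; it does not reproduce the specific sinc approximation used in the windowed shaders.

```python
import math


def bessel_j1(x, terms=40):
    """Bessel function J1 from its power series (accurate for |x| up to ~10)."""
    return sum(
        (-1) ** k / (math.factorial(k) * math.factorial(k + 1)) * (x / 2) ** (2 * k + 1)
        for k in range(terms)
    )


def jinc(x):
    """jinc(x) = 2*J1(pi*x) / (pi*x), normalized so jinc(0) = 1."""
    if x == 0:
        return 1.0
    px = math.pi * x
    return 2 * bessel_j1(px) / px


def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x)."""
    if x == 0:
        return 1.0
    px = math.pi * x
    return math.sin(px) / px


def find_zero(f, lo, hi, iters=60):
    """Bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


print([round(find_zero(sinc, a, b), 4) for a, b in ((0.9, 1.1), (1.9, 2.1))])  # [1.0, 2.0]
print([round(find_zero(jinc, a, b), 4) for a, b in ((1.1, 1.3), (2.1, 2.4))])  # [1.2197, 2.2331]
```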

system · 15 January 2017 05:07

I feel kind of bad that you’ve decided to take all the work upon yourself, but I’m also very grateful for it. You know the codebase inside and out, and I’d just be stumbling around learning things the hard way.

Whoa, you’re right, I was totally off base: http://mathworld.wolfram.com/JincFunction.html For some reason I thought the ImageMagick guys invented jinc as a name for a cylindrical sinc, and I had totally forgotten your discussion of jinc as a Bessel function too. Those are some weird zero locations. I mean, it’s not so much the locations themselves but the fact that they aren’t multiples of each other. Apparently it works though, and I guess it’s part of the reason why the even-lobed jincs are so good at [precisely] getting rid of dithering by accident.

Maister · 15 January 2017 05:07

I’ve implemented sRGB FBOs and mipmapping. Works on GLES as well as desktop GL.

I’ve added a shittybloom.cgp to common-shaders bloom/ folder which uses the new features. Probably will only work on nVidia due to tex2Dlod and tex2Dlod with offsets. GLSL backend works of course with mipmap/sRGB, but cg2glsl cannot convert these new shaders yet.

Hyllian · 15 January 2017 05:07

Does this option make any existing shader code in common-shaders obsolete? If so, which?

Maister · 15 January 2017 05:07

Here’s with and without shittybloom.cgp for reference in MGS: http://imgur.com/CyLxLjW,BRCOExu

Hyllian: No, I don’t think so. Maybe some of the CRT shaders. With sRGB you can for example avoid having to constantly gamma-correct and apply gamma again on every FBO pass. And since you can work in linear space with good precision, you can do correct bilinear filtering as well.

Then again, neither sRGB nor mipmapping is implemented on PS3 atm, so it wouldn’t be as portable. Consider it experimental for now.

system · 15 January 2017 05:07

You have just made my life so much easier.

So, since we don’t have to worry about the input to the first pass, it appears srgb_framebufferN affects the pass output (rather than the input) for consistency with float_framebufferN…awesome. Can you also clarify the last pass behavior? My best understanding is this: 1.) You cannot set srgb_framebuffer on the true last pass, because it writes directly to the framebuffer. 2.) You can set srgb_framebuffer on the “last pass” if it has an explicit scale parameter, but it won’t really affect the output of the final image. The implicit true final pass will decode as sRGB and subsequently write to a regular RGBA framebuffer without gamma correction. If you want to do gamma correction, you should do it in the last pass (or “last pass”) yourself, and unless you want banding in the whites, it’s a bad idea to set srgb_framebuffer for the “last pass.” Hopefully I have that right. Users have to manually decode gamma in the first pass, so it’s probably best for symmetry that they have to manually encode gamma in the last pass anyway, as we talked about earlier.

Looking at the new code, it looks like the if/else if priorities are inconsistent for srgb and float FBO’s in gl.c: The actual code at line 525 gives priority to the float FBO’s, but the error checking at line 513 gives priority to the sRGB FBO’s. Say both are enabled in the .cgp file for some reason: Should this have defined or undefined behavior? Currently, if sRGB FBO’s are supported but float FBO’s are not, there will be no error message, but the float FBO will be passed over for the sRGB FBO. If it’s supposed to be defined behavior that the float FBO takes priority, users/authors won’t be informed of its failure in that case.

On second thought, I think sRGB FBO’s should probably take priority for two reasons: 1.) They’re faster 2.) If the shader author enables both, they’re saying they don’t care which is used. That means that they don’t want/need the advantages of float framebuffers over sRGB framebuffers (like packing multiple output values).
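
A sketch of the priority rule argued for above, with hypothetical names; it is not the code in gl.c. sRGB wins when both flags are set and available, and any request the GPU cannot honor produces a warning instead of a silent substitution.

```python
from enum import Enum


class FboFormat(Enum):
    RGBA8 = "rgba8"  # default 8-bit UNORM target
    SRGB = "srgb"    # srgb_framebufferN
    FLOAT = "float"  # float_framebufferN


def choose_fbo_format(want_srgb, want_float, has_srgb, has_float):
    """Hypothetical resolution of one pass's FBO flags, sRGB first.

    Returns (format, warnings) so that a request the GPU cannot honor is
    reported instead of silently replaced.
    """
    warnings = []
    if want_srgb:
        if has_srgb:
            return FboFormat.SRGB, warnings
        warnings.append("sRGB FBO requested but unsupported")
    if want_float:
        if has_float:
            return FboFormat.FLOAT, warnings
        warnings.append("float FBO requested but unsupported")
    return FboFormat.RGBA8, warnings


# Both flags set on capable hardware: sRGB is chosen, no warnings.
print(choose_fbo_format(True, True, has_srgb=True, has_float=True))
# Both flags set, no sRGB support: falls back to float and says so.
print(choose_fbo_format(True, True, has_srgb=False, has_float=True))
```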

UPDATE: Anyway, I just swapped the priority and sent a pull request.

As a side note, did you decide against GL_RGBA16 fallback?

Maister · 15 January 2017 05:07

Yes, your assumptions are correct here. If last pass doesn’t write to FBO, *_framebuffer options are simply ignored. If last pass has a scale parameter, you force it to go through an FBO first. This will be sRGB/float if you really tell it to, but there is no reason why you should do this ofc. The “dumb” scaling pass afterwards just blits the result, and it does not deal with gamma-issues at all. If you want to be truly gamma-correct, you should avoid the implicit last pass.

Until I see cases where GL_RGBA16 rendertargets are supported, but not sRGB, I’ll avoid that. GLES doesn’t support GL_RGBA16 textures anyways.

I merged your changes and did some smaller cleanups.

Ideally, I’d like a framebuffer_format%u parameter, where you could specify whatever format you want (rgb10_a2, srgb8, rgba8, rgba16, rgba16f, rgba32f, etc …). The downside is that almost none of these formats are portable, so it’s kinda pointless. Float FBO support was added rather hastily, but I really needed it for some shaders I was working on at the time. When we can rely on GLES3+ and GL3+ being everywhere, FBO formats could be made more flexible.
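
What such an option might look like on the preset-parsing side, as a hypothetical sketch: `framebuffer_format%u` is only the idea floated above, not an existing preset key, and the byte counts are nominal sizes rather than what a driver actually allocates. It does nothing about the portability problem; it just maps names to GL internal-format enum names.

```python
# Hypothetical mapping for a per-pass framebuffer_format%u preset option.
# name -> (GL internal format enum name, nominal bytes per texel)
FRAMEBUFFER_FORMATS = {
    "rgba8": ("GL_RGBA8", 4),
    "srgb8": ("GL_SRGB8", 3),
    "rgb10_a2": ("GL_RGB10_A2", 4),
    "rgba16": ("GL_RGBA16", 8),
    "rgba16f": ("GL_RGBA16F", 8),
    "rgba32f": ("GL_RGBA32F", 16),
}


def parse_framebuffer_formats(preset, num_passes, default="rgba8"):
    """Resolve framebuffer_format0 .. framebuffer_format{N-1} from a parsed preset dict.

    Unknown names are rejected rather than silently replaced; whether the GPU
    can actually render to the chosen format still has to be checked at runtime.
    """
    formats = []
    for i in range(num_passes):
        name = preset.get(f"framebuffer_format{i}", default).strip().lower()
        if name not in FRAMEBUFFER_FORMATS:
            raise ValueError(f"pass {i}: unknown framebuffer_format {name!r}")
        formats.append(FRAMEBUFFER_FORMATS[name][0])
    return formats


print(parse_framebuffer_formats({"framebuffer_format1": "rgba16f"}, 3))
# ['GL_RGBA8', 'GL_RGBA16F', 'GL_RGBA8']
```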

system · 15 January 2017 05:07

Yeah, that makes sense: It would be silly to go through a ton of trouble to make everything gamma-correct, only to let the implicit last-pass do a gamma-space resize operation. Still, it’s not as bad as doing a Gaussian blur in gamma-space or something like that.

Oh, I must have misunderstood you earlier then. I thought you were suggesting GL_RGBA16 as a fallback for GLES in particular.

Cool, thanks. Sorry about the missing whitespace in the ternary operator by the way.

I had the same thought earlier today actually, but it’s a shame adding it would create duplicate functionality in the Cg shader spec: float_framebuffer is already part of the standard, and srgb_framebuffer is consistent with it…but if it’s officially added, then both would presumably become deprecated once framebuffer_format is added.

Out of curiosity, what did you originally need float framebuffers for? Were you just trying to avoid banding, or were you trying to pack multiple values to unpack in a later pass, or were you doing something totally different?

system · 15 January 2017 05:07

Hey Hyllian, I’m sorry to say I won’t be able to help optimize your Jinc shader after all. I really tried, but it turns out distributing texture fetches across a 2x2 pixel quad probably won’t help for resize shaders. Sorry.

The technique should be faster for texture-fetch-bound Gaussian blurs at the same resolution, because the weights can be computed statically. Resize shaders are another matter entirely though, because the weights have to be computed at runtime. A 5x5 filter (which doesn’t make sense for anything with negative lobes) requires computing all 36 weights over a 6x6 window before zeroing out the 11 outside each destination fragment’s 5x5 window. A 4x4 filter (Lanczos2) unfortunately requires the same number of samples (11 are duplicates), 25 weight calculations, and zeroing out 9 of them. I had enough morbid curiosity to finally get them to work after a few days of debugging (ugh), but all the ALU operations cause the performance to actually be slower than the regular version…so it’s slower, more complicated, and only works with derivatives.
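
The ALU side of that trade-off is easy to tabulate. A small sketch of the weight bookkeeping described above (texture fetches are not modeled, and the function name is illustrative): for an n x n kernel shared across a 2x2 quad, the union window is (n+1) x (n+1), and each fragment zeroes the (n+1)² − n² = 2n + 1 weights outside its own window.

```python
def quad_weight_cost(n):
    """Per-fragment weight evaluations for an n x n resize kernel.

    Regular version: each fragment evaluates its own n*n weights.
    Quad-shared version: every fragment evaluates the whole (n+1) x (n+1)
    union window of its 2x2 quad, then zeroes the weights outside its own window.
    """
    union = (n + 1) ** 2
    return {"regular": n * n, "quad_shared": union, "zeroed": union - n * n}


for n in (4, 5):
    print(n, quad_weight_cost(n))
# 4 {'regular': 16, 'quad_shared': 25, 'zeroed': 9}
# 5 {'regular': 25, 'quad_shared': 36, 'zeroed': 11}
```

Since the weights of a resize kernel depend on each fragment's subpixel position, the extra evaluations cannot be hoisted out, which is consistent with the slowdown reported above.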

Hyllian · 15 January 2017 05:08

I made a ddt-jinc shader. It looks like a jinc that’s even sharper than the jinc-sharper shader:

http://pastebin.com/PPLNfCcy