Lanczos2 sharp shader

Sadly, the Cg runtime automatically falls back to arbvp1/arbfp1 unless you’re running an nVidia card, so we’re quite limited in shader features if we want decent compatibility, unless a brand new Cg -> GLSL translator is written.

Derivatives are problematic due to two things:

  • GLES2 doesn’t support derivatives by default (they need the GL_OES_standard_derivatives extension).

  • The cgc compiler crashes when trying to cross-compile code that uses derivatives to GLES. We really need automatic conversion support to GLSL.

  • The CGP scale can be inferred by passing a flat-shaded IN.output_size / IN.video_size from the vertex shader to the fragment shader. That doesn’t help if you want to #ifdef on it, I suppose …

  • It’s possible to override functions based on profile at least (see the sketch after this list): http://http.developer.nvidia.com/Cg/Cg_language.html

  • A way to pass generic key/value defines to GLSL and Cg shaders sounds like a fair feature. But then again, remember that the Cg format is not RetroArch-only. Randomly adding lots of features to the spec will hurt compatibility with other projects.
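
For reference, the profile-override trick mentioned in the list above looks roughly like this in Cg (a sketch; the function name and bodies are made up for illustration):

//  Generic fallback used by down-level profiles (arbfp1, etc.): no derivatives.
float2 pixel_footprint(float2 uv, float2 texture_size)
{
    return 1.0 / texture_size;
}
//  Picked automatically instead of the generic one when compiling for fp40,
//  where ddx()/ddy() are available.
fp40 float2 pixel_footprint(float2 uv, float2 texture_size)
{
    return abs(ddx(uv)) + abs(ddy(uv));
}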

Speaking of cgc crashing with GLSL conversion, I made a very simple test shader for testing my mipmapping support. The shader only reads the input, does a tiled read from an LUT at a very small scale, and multiplies them together before returning the result…no derivatives or anything fancy. The test is just about seeing whether the LUT was successfully mipmapped or not (if it wasn’t, the aliasing/moiré will make your eyes scream in pain). It works fine in RetroArch, and it compiles fine by itself on the cgc command line too. I tried converting to GLSL to make sure the mipmapping was working right there too, but the conversion script crashed cgc because it didn’t like a colon somewhere (having to do with the input/output semantics). It seems a little brittle. :-/
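
For the record, the test shader is roughly the following (reconstructed from the description above, not the actual file; the LUT sampler name, the texCoord field and the tiling factor are stand-ins following the usual RetroArch Cg template):

float4 main_fragment(in out_vertex VAR, uniform sampler2D s_p : TEXUNIT0, uniform sampler2D checker_lut, uniform input IN) : COLOR
{
    float4 frame = tex2D(s_p, VAR.texCoord);
    //  Tile the LUT at a very small scale; without mipmapping this aliases horribly.
    float4 lut = tex2D(checker_lut, VAR.texCoord * 32.0);
    return frame * lut;
}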

That only works for getting the scale of the current pass. I’m talking about pass2 needing to know the output resolution of the final pass (say pass7)…that’s the issue I had. To get around it, I had to set pass2’s scale to viewport mode with e.g. a 0.125 scale, duplicate that 0.125 into an #include file for user settings, and then calculate: viewport_resolution = IN.output_size/scale_in_include_file_which_may_be_out_of_sync_with_actual_scale_in_cgp_file; important_scale_value = f(viewport_resolution);
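
Concretely, the hack looks something like this (a sketch; the file and macro names are made up):

//  user_settings.inc -- has to be kept in sync with pass2's scale in the .cgp by hand
#define PASS2_VIEWPORT_SCALE 0.125

//  ...and then in pass2's fragment shader:
float2 viewport_resolution = IN.output_size / PASS2_VIEWPORT_SCALE;
//  ...and derive whatever pass2 actually needs (the f(viewport_resolution) above) from this.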

Is there a way to tell from a shader what profile was loaded though? Otherwise you don’t know if you need to override or not (plus, there’s no user-made substitute for something like ddx/ddy, so you need to change a larger piece of code than just that single function call if they aren’t present). There’s nowhere in my code where I absolutely NEED derivatives, but not using them limits user options for quality.

Indeed, and once you add on to a standard, it’s hard to remove a bad feature. That’s why I wanted to focus on adding mipmapping and sRGB support first. They require only trivial additions to the .cgp file format to permit, and it’s hard to get that part wrong. (Well, not really: I went through a couple of different bad designs for specifying sRGB FBO’s before determining the “correct” way to do it.) I felt pretty comfortable with pitching those two ideas to Squarepusher and asking if he’d pull once I implemented them.

Once you start talking about passing a lot more information to shaders though, there are a lot of different ways to design it with different pros/cons…and once they become standardized and adopted by other implementations, they become VERY hard to change again if they’re fundamentally flawed. Generic key:value pairs would basically complete my whole “wish list” of information I wish shaders knew, but they’re a much bigger feature requiring a little more discussion and “design by committee,” not to mention an implementor who knows more about the codebase than I do.

A problem with #defining stuff in shaders is that cross-compiling with cgc actually breaks, since cgc would have to pass the #defines through (and it doesn’t). I hadn’t thought about that. Before advanced features like this are added, I’m really starting to consider a new Cg cross-compiler that isn’t stuck in the early 2000s and can cater to our use cases. … The Cg backend doesn’t work in EGL, core GL or GLES …

You’re referring to a more advanced script to replace cg2glsl.py, not a full-blown compiler, right?

A simplified compiler, yes. Most of it is doable with simple transformations. Using cgc is very, very, very, very brittle. If you’ve seen cg2glsl.py, it’s a pile of hacks that only somewhat happens to work. Maintaining it is pretty hard.

It’s also important that the resulting shader code is somewhat sane, and doesn’t look like “assembly” which cgc output tends to look like.

I’m not really familiar with the transformations shader code goes through before reaching its final binary form in GPU memory, but I’m aware that it’s processed in several stages, cgc being the first (for Cg code that is). Does cgc do nontrivial optimizations for the code generation, or does it leave that all for the next step in the chain anyway?

CGC transforms the code quite heavily; here’s 2xBR for example: http://pastebin.com/Q4WHTqiC Sure, it works, but … yeah. That’s about it.

CGC has a GLES output mode, but as-is, it’s completely broken. cg2glsl.py essentially makes a ton of source transformations with python’s .replace(), adds in lots of varyings where needed, uniforms, etc to whip it into something that can work and dumps out a turd of GLSL code.

In RetroArch with arbvp1/arbfp1, etc., CGC just emits straight assembly. For the nVidia profiles, it’s just a different flavor of assembly.

Well, I guess my concern is that if cgc does a lot of intelligent optimizations, replacing cgc could be an incredible (prohibitive?) amount of work if you’re aiming for full language compliance and comparable performance…unless of course GLSL’s optimizer is better, and your “compiler” just transforms to GLSL like a better version of cg2glsl.py.

CGC -> GLSL doesn’t improve performance. Whatever optimizations are done by CGC are done by the GLSL compiler anyways. CGC just mangles the code which probably just makes it harder for the GLSL compiler to optimize. I can’t say I’ve seen it actually do something cool that any sane GLSL compiler wouldn’t do.

Full language compliance is moot anyways. Being compatible with the existing shaders is probably more worthwhile. Cg is a pretty big language after all.

Yes, a “compiler” here would mean something that lightly parses Cg, determines which input/output variables are there, and emits the appropriate varyings and uniforms. Some language differences can be overcome with #defines, e.g. lerp(), tex2D(), floatN vs vecN, etc. Some are more intricate (matrix ordering, for example), and some are pretty hard without a proper parser (like struct initialization).
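
For the easy cases, that mostly amounts to a small compatibility prelude prepended to the emitted GLSL, something like this (illustrative only, not an existing file):

#define float2 vec2
#define float3 vec3
#define float4 vec4
#define float4x4 mat4
#define lerp(a, b, t) mix(a, b, t)
#define frac(x) fract(x)
#define saturate(x) clamp(x, 0.0, 1.0)
#define tex2D(s, uv) texture2D(s, uv)
//  mul(a, b) can mostly map onto (a) * (b), but matrix ordering is one of the
//  intricate cases mentioned above, so it can't always be handled this blindly.
#define mul(a, b) ((a) * (b))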

It’s not that uncommon to do something like this in rendering engines these days which support D3D and GL. Most of them can enforce a certain coding style to ensure that conversion goes smoothly, but dunno if that’s appropriate for these Cg shaders …

Doubt this would happen in the near future, but I think we need a far better Cg environment before we can start adding features like this.

cgc->GLSL doesn’t improve performance over what though? A comparable handcoded GLSL shader, you mean?

The reason I’m confused is because I truly have no idea what format cgc outputs. :wink: I guess I was under the impression that cgc bypassed the GLSL compiler and produced the same format GLSL does. From what you’re saying though (and from the output sample you gave), cgc just translates from source-level Cg to source-level GLSL or HLSL (and poorly at that), and it never comes any closer to machine code anyway. Is that correct?

That’s a bit scary, considering the shader I’m working on is bigger and more complex than any of the existing ones…but hopefully I don’t use any uncommon language features, haha.

You’re referring to the generic key:value pair feature, right? After all, the sRGB feature can just be made to fail on unsupported platforms the same way as calling ddx(), ddy(), tex2Dlod(), etc. (or it could default to regular FBO’s and textures, giving ugly output and an error message in the console). That is, shaders that only work on some platforms wouldn’t be anything new.

Yes, I suppose adding a “will-look-weird-if-not-supported” sRGB is a possible stop-gap. I guess it’s okay to enforce that the last pass must do gamma correction then? Otherwise, either the static menu blending shaders or all other input textures would suddenly have to be sRGB-aware.

Since we can pick the failure mode in this case, would you prefer, “Will look weird if not supported,” or “Fail to load the shader if not supported?”

Yeah, it’s fine to make users manually gamma-correct the output of the last pass. Actually, it seems to be the sanest decision by far. There might be slight benefits to having a final_output_srgb option in .cgp files, but not enough to justify the downsides:

All you’d really get from that is a very slight potential performance increase (maybe) over manual gamma conversion, or a slightly bigger increase over manually converting to a precise sRGB encoding, but there’s no real reason the output has to be precise sRGB anyway: the output variance between displays is greater than the difference between sRGB and 2.2 gamma. Come to think of it, I think most displays are calibrated to a specific gamma value rather than to precise sRGB anyway…the standard is supposed to be 2.2 (it used to be 1.8 on the Mac, even), but according to Wikipedia, LCD displays may actually operate at more like 2.5.

In contrast, implementing final_output_srgb for .cgp files would either require an additional awkward RGUI option or force the last pass to display wrong when set from RGUI. Plus, it would require more implementation effort for the feature itself, because sometimes the last pass isn’t “really” the last pass: if it’s given a specific scale, RetroArch adds another pass to scale to the output resolution. That would make outputting sRGB in the final pass one of the most complicated parts of the feature to implement…for next to no benefit. Plus, as far as the .cgp file format itself goes, srgb0, srgb1, etc. fit into the existing paradigm, but it’s harder to think of a good name for a “final_output_srgb” option that deserves to be standardized. :wink:
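
For clarity, “manually gamma-correct the output of the last pass” just means the final pass ends with something along these lines (a sketch, using plain 2.2 gamma instead of the exact sRGB curve, as discussed above):

    //  End of the last pass, when the intermediate FBOs hold linear-light values:
    //  convert back to (roughly) display gamma before output. "color" is the
    //  float3 result computed earlier in the pass.
    color = pow(color, float3(1.0/2.2, 1.0/2.2, 1.0/2.2));
    return float4(color, 1.0);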

Alright. Then I have to figure out how this will interact with GL HW render. With CPU-based rendering, it’s easy to just swap out the internal format to GL_SRGB8_ALPHA8, but with HW render the FBO texture cannot simply be SRGB:

  • In GLES, there is no GL_FRAMEBUFFER_SRGB enable bit (it’s always on, short of some obscure extension), so you’d suddenly get gamma correction when you’re not expecting it. Therefore the texture that libretro GL cores render to cannot be sRGB.
  • HW render is not specified for sRGB targets, so HW renderers have had to do gamma correction themselves anyway. Suddenly changing whether or not the libretro GL core has to do gamma correction depending on which shader is running would be horrible.

It might be possible to use aliased textures with glTextureView, or a glCopyTexSubImage from non-sRGB to sRGB, as a workaround. Kinda makes me wish that sRGB were a texture parameter and not a format …

I guess I’d prefer the “will look weird” failure approach. That matches the current failure case for floating-point FBOs. Not having sRGB texture support is pretty uncommon these days, I think. A weird look is better than falling back to the stock shader, because with a silent fallback it would be harder to realize what caused the error.

I figured out (by tweaking) a great and fast replacement for the Jinc function calculation!

The Jinc function is defined as this:

jinc(x) = J1(x)/x, where J1(x) is the first-order Bessel function of the first kind.

J1(x) is calculated as this: x/2 - ((pi*x)^3)/(1!2!2^3) + ((pi*x)^5)/(2!3!2^5) - ((pi*x)^7)/(3!4!2^7) + … and so on, as it is a series.

Dividing by x to get Jinc(x), we get: 1/2 - ((pi*x)^2)/(1!2!2^3) + ((pi*x)^4)/(2!3!2^5) - ((pi*x)^6)/(3!4!2^7) + …

To get the value 1 at x=0, it’s necessary to multiply by two. So, to make a function well suited for a filter, we get this:

2*(1/2 - ((pi*x)^2)/(1!2!2^3) + ((pi*x)^4)/(2!3!2^5) - ((pi*x)^6)/(3!4!2^7) + …)

or 2*Jinc(x). And then we can convolve with a jinc window too, making the filter kernel this: jinc(x)*jinc(x*K), where K is the ratio of the first zero to the second zero of the jinc function (r1=1.22 and r2=2.233).

That series decays very slowly. I verified that only with the first 20 terms can I get a good approximation of Jinc(x) for x<2.5. As the filter we’re building for RetroArch shaders needs at least radius 2, that’s way too much calculation!

So, for x<2.5, I found this function, which is a good approximation for Jinc(x)*Jinc(x*K):

2*sin(pi*x/2)*sin(x*pi*0.825)/(0.825*(pi*x)^2)

Replacing the lanczos2-sharp.cg kernel with that works! And it’s indeed the jinc-windowed Jinc filter with 2 lobes! But I think the result is a bit too blurred for my taste and for pixel art, so I tweaked a bit and found this other function, which is sharper and very good for retro games:

sin(pi*x*0.4)*sin(x*pi*0.825)/(0.4*0.825*pi*x*pi*x)
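
In shader form, that kernel boils down to something like this (a sketch; the function and constant names are mine, and the expression needs a guard at x = 0, where its limit is 1):

float jinc2_weight(float x)
{
    //  Constants from the tweaked formula above.
    const float a  = 0.4;            //  sharpness factor
    const float b  = 0.825;          //  window factor
    const float pi = 3.141592653589793;
    //  The expression is 0/0 at x = 0, but its limit there is 1.0.
    if (x < 0.0001)
        return 1.0;
    return sin(pi*x*a)*sin(pi*x*b)/(a*b*pi*pi*x*x);
}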

I only need an anti-ringing algorithm now.

So, here’s the jinc-sharp shader: http://pastebin.com/TGT9XDSE

BTW: it takes care of dithering in Genesis games!!!

Yeah, the libretro cores are expecting to just write values directly to an FBO without any automatic gamma-correction messing with their values, and it’s important to maintain their expectations. glTextureView does seem to be the appropriate response to this situation, since it allows us to “reinterpret” the existing data as another format, whereas blitting with glCopyTexImage2D or glCopyTexSubImage2D would be the next best thing. This is only a concern for the first FBO though (the input to pass0), because we can freely choose the image format for the rest depending on whether it’s supposed to be sRGB or not.

Can you elaborate on what you mean by the first sentence? I don’t understand your meaning. I definitely understand the second sentence though: We don’t want to retroactively change the libretro API and impose a burden on libretro core developers. That would be a nightmare, haha.

I’m glad you brought up glTextureView, because I had never heard of it before. :slight_smile:

Okay, we’ll go with that. Matching the failure case for floating point FBO’s is a lot more consistent than matching the case for missing functions (ddx() for instance), since both deal with unsupported FBO types.

For anti-ringing, have you tried the simple value-clamping mentioned above? It seems to be what madshi does based on his description. A super-naive version that doesn’t filter out the “main contributors” from the rest of the samples would go like:

float3 min4(float3 a, float3 b, float3 c, float3 d)
{
    return min(a, min(b, min(c, d)));
}
float3 max4(float3 a, float3 b, float3 c, float3 d)
{
    return max(a, max(b, max(c, d)));
}

float4 main_fragment(in out_vertex VAR, uniform sampler2D s_p : TEXUNIT0, uniform input IN) : COLOR
{
    //  blah, blah
    //  ...
    //  Compute the 4x4 matrix of filter weights (float4x4 weights) as usual, then get the 16 texture samples.
    //  Use pow(tex2D(texture, uv), 2.2) to convert each sample to linear light for "proper" color mixing;
    //  if you do, this will make light halos worse but make dark halos better and avoid an overall darkening effect.
    float3 c00 = ...
    //  ...and likewise for c01 through c33.
    //  Get min/max samples.
    float3 min_sample = min4(min4(c00, c01, c02, c03), min4(c10, c11, c12, c13), min4(c20, c21, c22, c23), min4(c30, c31, c32, c33));
    float3 max_sample = max4(max4(c00, c01, c02, c03), max4(c10, c11, c12, c13), max4(c20, c21, c22, c23), max4(c30, c31, c32, c33));
    //  Compute the filtered color.
    float3 color = mul(weights[0], float4x3(c00, c10, c20, c30));
    color += mul(weights[1], float4x3(c01, c11, c21, c31));
    color += mul(weights[2], float4x3(c02, c12, c22, c32));
    color += mul(weights[3], float4x3(c03, c13, c23, c33));
    //  Normalize by the sum of all weights.
    color = color / dot(mul(weights, float4(1.0, 1.0, 1.0, 1.0)), float4(1.0, 1.0, 1.0, 1.0));
    //  Hard-limit the color to the range spanned by the samples.
    color = clamp(color, min_sample, max_sample);
    //  Output; use pow(color, 1.0/2.2) here if you converted to linear light above.
    return float4(color, 1.0);
}

madshi doesn’t say what he means by using only the “main contributors.” I see three likely alternatives:

  1.) He might be referring to judging the influence of a sample by the absolute value of (sample * weight), then taking the min and max unweighted sample values of the top 4, 5, 6, etc. most influential samples.

  2.) He might just be referring to taking the min and max values of the 4, 9, 16, or however many closest samples.

  3.) He might be referring to taking the min and max values of all samples where abs(sample * weight) > thresh for some threshold.

You can quickly experiment with those approaches and optimize after you find out how to get the results you want. Since you’re only taking 16 samples total (for a 2-lobe filter), you might just want to start out by trying the super-naive version first and see where it gets you. If it doesn’t get rid of enough halos, be more restrictive about which samples you use to create your clamping range. If hard-limiting the result based on the sample values looks too “sudden” due to the discontinuous first derivative of min and max (it could even create aliasing), you can try some softer continuous function f(color, min_sample, max_sample) that handles it more smoothly. Either way, creating a hard or soft clamping range based on some subset of the samples is probably what you’re looking for.
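
As one concrete (untested) idea for that softer function: instead of snapping to the clamped value, blend toward it with an adjustable strength, so strength = 1.0 gives the hard clamp back and lower values let a bit of ringing through in exchange for smoother transitions:

float3 soft_limit(float3 color, float3 min_sample, float3 max_sample, float strength)
{
    float3 hard_limited = clamp(color, min_sample, max_sample);
    return lerp(color, hard_limited, strength);
}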

I’m still working on the plain unwindowed 5x5 Jinc with nine samples, but the optimizations should carry over to your shader pretty trivially (for nVidia cards at least, since they rely on derivatives), since only the weights will differ. :slight_smile: I also realized a way to make a 4x4 version with 9 samples, and it’s a little faster than the 5x5 version due to fewer ALU ops, even though it doesn’t require any fewer samples…so I’ll submit that one too once I finish. The delay is partly due to spending a lot of the day posting in this thread, partly because of errand-running, partly because I wanted to better vectorize things (for older GPU’s that don’t have the GCN architecture), and partly because it’s more of a pain in the ass than I thought to translate the algorithm from a Gaussian blur where output_size = video_size to a resize function where output_size != video_size. I’ll get it to you soon enough though…hopefully tonight, unless I get too tired and have to put a bit off until tomorrow. :stuck_out_tongue:


To elaborate on the first sentence: I mean that libretro GL cores have had to gamma-correct themselves, since it was never specified that sRGB FBOs might be used.

Oh, right. For some reason it sounded like you were talking about something different, but now that you’ve clarified, I can’t figure out why I didn’t understand the first time around. Maybe I was tired. :o

Thanks for the suggestion, it indeed works! Though it makes the shader very slow.

But I’ll see if I can optimize it. The implementation is good. I’ll try to reduce the number of colors that go through min and max.

EDIT: I’ve changed the colors a bit and got a great speedup (twice as fast) and better IQ. I reduced the comparison to only the 4 central pixels:

      //  Get min/max samples
      float3 min_sample = min4(c11, c21, c12, c22);
      float3 max_sample = max4(c11, c21, c12, c22);

This is good for pixel art, because it’s plagued by hard edges. And, at hard edges, if you test a color two texels away from the edge, the result gets some strange artifacts. So testing only the colors along the edge (i.e., the central texture lookups in the shader) is indeed the best solution.

Here’s the shader with anti-ringing: http://pastebin.com/9AkrFXaw