A new little shader I did (GLSL)

All signals can be thought of as the sum of some amount of sine wave components (that is what a Fourier transform does, more or less). A 4.2 MHz sine wave is just the highest frequency component you can fit in a signal that is limited to 4.2 MHz. I don’t mean to suggest that luma is modulated onto a sine wave carrier signal or anything like that.

A system can have a dot clock that gives something like 640 pixels per line, like the Amiga, but the video format has to support it too, so that only really applies to RGB.

A console’s dot clock and the bandwidth limits of luma and chroma are two somewhat independent ideas.

For example, the N64 only outputs 640 pixel lines. Any lower resolution is resampled to 640 before output, which is part of why the console is so blurry. In NTSC regions, it also only outputs NTSC-style video, whether through composite, an RF adapter, or S-Video.

In any case, I didn’t mean to derail this thread. Sorry about that.

1 Like

I ported the current GLSL versions of ntsc-mini and tiny_ntsc to Slang and did some testing. I think it’s pretty accurate, and ntsc-mini passes the NTSC test patterns perfectly well, on par with blargg’s NTSC CPU filter. So once the PR is accepted you can run some tests and tell me what you think.

5 Likes

I took a look at your ntsc-mini. I appreciate your devotion to getting each system correct. I also like that you actually create a composite signal and demodulate it, rather than keeping luma and chroma separate and mixing in some artifacts like some of the other shaders. I don’t understand the filter you use on luma, though. I’m assuming this is a notch filter so you can notch out the chroma, but I can’t work it out. Or is it a low pass filter?
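For anyone following along, the modulate/demodulate idea can be sketched roughly like this. This is not the ntsc-mini code: the function names, the `phase` argument and the rounded YIQ coefficients are assumptions for illustration only.

```glsl
// Illustrative sketch only, not taken from ntsc-mini.
// RGB -> YIQ using the commonly quoted rounded NTSC coefficients.
// GLSL matrices are column-major, so each row below is one column.
const mat3 RGB_TO_YIQ = mat3(
    0.299,  0.596,  0.211,   // contribution of R to (Y, I, Q)
    0.587, -0.274, -0.523,   // contribution of G to (Y, I, Q)
    0.114, -0.322,  0.312);  // contribution of B to (Y, I, Q)

// Encode one sample: luma plus chroma quadrature-modulated onto the subcarrier.
// 'phase' (hypothetical parameter) is the subcarrier phase in radians at this
// sample; how it advances per pixel and per line is system specific.
float encode_composite(vec3 rgb, float phase)
{
    vec3 yiq = RGB_TO_YIQ * rgb;
    return yiq.x + yiq.y * cos(phase) + yiq.z * sin(phase);
}

// Demodulation products for one sample. Averaging these over a whole number
// of subcarrier cycles (a low-pass filter) recovers approximately (Y, I, Q).
vec3 demod_products(float composite, float phase)
{
    return composite * vec3(1.0, 2.0 * cos(phase), 2.0 * sin(phase));
}
```

The filter applied to those products is where shaders differ: a plain average, a windowed low pass, or a notch on the luma path all fit this skeleton.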

3 Likes

I’m away from my PC right now, but IIRC I used a windowed filter that cuts off the high frequencies around the point where chroma lives.

2 Likes

Another shader that does smart filtering and gamma correction in one pass is crt-nobody (named smart wgt in the image). To do gamma correction properly, you have to filter the image in some way while it is in linear space.
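A minimal sketch of what "filter while in linear" means, assuming the cheap gamma 2.0 approximation (square to decode, sqrt to encode) that the shaders later in this thread use. The helper names are made up for the example and this is not the crt-nobody code.

```glsl
// Gamma 2.0 approximation (an assumption, chosen because it is cheap).
vec3 to_linear(vec3 c) { return c * c; }
vec3 to_gamma(vec3 c)  { return sqrt(c); }

// Blend two gamma-encoded texels in linear light; t is in [0, 1].
vec3 blend_linear(vec3 a, vec3 b, float t)
{
    return to_gamma(mix(to_linear(a), to_linear(b), t));
}
```

With this, a 50/50 blend of black and white comes out at about 0.71 instead of 0.5, which is why blending in gamma space makes edges look too dark.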

4 Likes

That is a neat little function that I haven’t seen before. It looks like wgt is:

size = clamp(size, -1.0, 1.0);
size = 1.0 - size * size;
return size * size * size;

I’ve been using this (basically a variation of smoothstep):

x = clamp(abs(x), 0.0, 1.0);
return x * x * (2.0 * x - 3.0) + 1.0;

4 Likes

Just tried that. You can also use cos(x)^7 (edit: or ^6), which is almost the same on the -2…+2 interval and probably lighter, since cos/sin are highly optimized :wink:


The difference is about 3.3% at most.

Ofc, if -2…+2 is not enough, there is cos(clamp(x,-1,1))^7, but then I bet the computational cost would be almost the same, just saying…
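Written out as GLSL, the cosine weight could look something like the sketch below (function names are hypothetical). It uses explicit multiplies rather than pow(), since pow(x, y) is undefined in GLSL for x < 0 and cos(x) goes negative once |x| passes pi/2.

```glsl
// cos(x)^7 with explicit multiplies, so negative cosines are handled safely.
float cos7(float x)
{
    float c  = cos(x);
    float c2 = c * c;
    float c4 = c2 * c2;
    return c4 * c2 * c;   // c^7
}

// Clamped variant: the input never leaves [-1, 1], so cos() stays positive.
float cos7_clamped(float x)
{
    return cos7(clamp(x, -1.0, 1.0));
}
```

Note that the clamped variant flattens out at a small non-zero value (about 0.013) outside [-1, 1] rather than reaching exactly zero like wgt does.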

3 Likes

Cos/sin is not lighter than a simple multiply. I can’t test right now, as I switched to Linux Mint and intel-gpu-tools isn’t working because the Windows install (dual boot) has Secure Boot or something enabled (the BIOS prevents some things in that mode). But from previous tests, sin() scanlines, for example, are heavier than crt-pi-style scanlines that use a simple 1.0 - x * x, where x is the fract of the texture scanline. And exp() is heavier than sin(). But you get better quality in return with each step up: sin() and exp() are almost flawless even at non-integer scales.
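To make the three options concrete, here is a rough sketch of the scanline profiles being compared. These are illustrative forms, not the actual crt-pi or other shader code; the function names and the `beam` parameter are assumptions.

```glsl
// y is the vertical position in source texels.
// f is the distance from the scanline centre, in [-0.5, 0.5).

// Cheapest: a parabola, 1 at the centre and 0 at the texel edges.
float scan_parabola(float y)
{
    float f = fract(y) - 0.5;
    return 1.0 - 4.0 * f * f;
}

// Sine profile: also 1 at the centre and 0 at the edges, with a rounder top.
float scan_sine(float y)
{
    return sin(3.14159265 * fract(y));
}

// Gaussian profile: 'beam' sets the width; it never quite reaches 0.
float scan_gauss(float y, float beam)
{
    float f = fract(y) - 0.5;
    return exp(-beam * f * f);
}
```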

1 Like

I’m not saying cos is lighter than mul.

In this case it is cos, pow vs. clamp, abs, 3 muls, add, sub or clamp, 3 muls, sub, add.

Also, sin/cos may or may not be optimized, depending on the target GPU and driver ofc.

Some of those operations end up getting collapsed into one instruction on most architectures. And others get expanded to multiple instructions. For RDNA 4, the three different functions get compiled something like this. (You can mostly ignore the s_delay_alu instructions.)

For wgt:

v_med3_num_f32	 v0,  v0,  -1.0,  1.0 
s_delay_alu   	 instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1) 
v_fma_f32     	 v0,  -v0,  v0,  1.0
v_mul_f32_e32 	 v1,  v0,  v0
s_delay_alu   	 instid0(VALU_DEP_1)
v_mul_f32_e32 	 v0,  v1,  v0

For the cubic function:

v_max_num_f32_e64	 v0,  |v0|,  |v0| clamp
s_delay_alu      	 instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1) 
v_mul_f32_e32    	 v1,  v0,  v0
v_fmaak_f32      	 v0,  2.0,  v0,  0xc0400000 
v_fma_f32        	 v0,  v1,  v0,  1.0

For the cosine, pow(cos(x), 7):

v_mul_f32_e32	 v0,  0.15915494,  v0
s_delay_alu  	 instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_1)
v_cos_f32_e32	 v0,  v0
v_mul_f32_e32	 v1,  v0,  v0
s_delay_alu  	 instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_mul_f32_e32	 v0,  v0,  v1
v_mul_f32_e32	 v1,  v1,  v1
v_mul_f32_e32	 v0,  v0,  v1

So there are a few things at work here:

  • a * x + b can be done with one FMA instruction, which makes calculating polynomials very fast with Horner’s method
  • abs() is basically free on inputs (it gets collapsed into another instruction)
  • it’s not obvious, but clamp(x, 0.0, 1.0) is basically free on outputs, so clamp(abs(x) * x, 0.0, 1.0) could be one mul instruction (note that this is only for 0.0 and 1.0)
  • before the cosine can be evaluated, the value needs to be multiplied by 1/(2*pi), presumably because the hardware v_cos_f32 instruction takes its input in revolutions rather than radians
  • the compiler thinks the fastest way to evaluate pow(x, 7) is with 4 mul (multiply) instructions. For larger exponents, pow(x, y) is often implemented as exp(y * log(x)), so it can be a particularly slow function.
  • transcendental instructions (cos, exp, log, etc) are usually slower than basic arithmetic instructions (but it’s complicated)
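The Horner's method point from the first bullet can be sketched like this. The helper name and the coefficient packing into a vec4 are just choices made for the example.

```glsl
// Evaluate c.x + c.y*x + c.z*x^2 + c.w*x^3 with Horner's method:
// three nested multiply-adds, each of which can map to one FMA.
float horner3(float x, vec4 c)
{
    return ((c.w * x + c.z) * x + c.y) * x + c.x;
}

// The smoothstep-like weight from earlier in the thread: 2x^3 - 3x^2 + 1.
float cubic_weight(float x)
{
    x = clamp(abs(x), 0.0, 1.0);
    return horner3(x, vec4(1.0, 0.0, -3.0, 2.0));
}
```

Whether the compiler actually emits three FMAs depends on the target; with constant coefficients like these it is free to fold away the zero term.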
2 Likes

Nice!

I can’t check now, but beyond the asm instructions, the cycle count usually gives a better view of the final load (cos has usually proven light in that regard on my 10+ year old iGPU).

Also you’re very right about abs and clamp between 0 and 1!

1 Like

This is probably a fast one too; I can’t test actual GPU stress atm.

#define GAMMAIN(color) ((color)*(color))
#define PI 3.14159265358979323846
#define invA  0.7

// Cheap Lanczos-like bump weight
float lanczos_like_bump(float x)
{
    x = abs(x);
    if (x * invA >= 1.0) return 0.0;
    float t = x * invA;
    float t2 = t * t;
    float w = (1.0 - t2);
    w = w * w * w; // (1 - t^2)^3
    return w;
}

void main()
{
    // Floor gives the "center" texel
    // ogl2pos is TEX0.xy*TextureSize in Vertex for speed
    vec2 base = floor(ogl2pos);

    // Taps on each side of the center texel. The bump itself reaches zero
    // at |x| = 1/invA (about 1.43 texels for invA = 0.7).
    int a = 1;

    vec4 sum = vec4(0.0);
    float wsum = 0.0;

    for (int i = -a; i <= a; i++) {
        vec2 p = base + vec2(float(i), 0.0);
        // invdims is like SourceSize.zw in Vertex
        vec2 uv = (vec2(p) + 0.5) * invdims; // center of texel

        // distances in x and y
        float dx = ogl2pos.x - float(p.x);
        //float dy = ogl2pos.y - float(p.y);

        // separable weight = wx * wy
        float wx = lanczos_like_bump(dx);
        //float wy = lanczos_like_bump(dy);
        float w = wx;

        sum += GAMMAIN(COMPAT_TEXTURE(Source, uv)) * w;
        wsum += w;
    }
    // Normalize (one divide per pixel)
    vec4 col = sum / wsum;

    float f = fract(ogl2pos.y)-0.5;
    float l = max(max(col.r,col.g),col.b);
    float beam = mix(15.0,8.0,l);
    col *= sqrt(exp(-beam*f*f));
    FragColor = sqrt(col);
}

3 Likes

Here is a fast Catmull-Rom weight too, with 2 samples to the left and 2 to the right:

#define key  -0.5  // Catmull-Rom

float cubic(float x) {
    float ax = abs(x);
    float a = key;
    float pix = ax*ax;
    if (ax < 1.0) {
        return (a + 2.0) * (pix * ax)
             - (a + 3.0) * (pix)
             + 1.0; // at center
    } else if (ax < 2.0) {
        return a * (pix * ax)
             - 5.0 * a * (pix)
             + 8.0 * a * ax
             - 4.0 * a;
    } else {
        return 0.0;
    }
}
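As a usage sketch, the four tap weights for Catmull-Rom (key = -0.5) can also be expanded into closed form. The function name is hypothetical; the polynomials are what cubic(f + 1.0), cubic(f), cubic(1.0 - f) and cubic(2.0 - f) work out to.

```glsl
// Catmull-Rom weights for the taps at offsets -1, 0, +1, +2 relative to the
// texel just left of the sample point. f is the fractional distance from
// tap 0 to the sample point, in [0, 1).
vec4 catmull_rom_weights(float f)
{
    return vec4(
        ((-0.5 * f + 1.0) * f - 0.5) * f,   // tap -1: cubic(f + 1.0)
        (1.5 * f - 2.5) * f * f + 1.0,      // tap  0: cubic(f)
        ((-1.5 * f + 2.0) * f + 0.5) * f,   // tap +1: cubic(1.0 - f)
        (0.5 * f - 0.5) * f * f);           // tap +2: cubic(2.0 - f)
}
```

These four weights always sum to 1, so no normalizing divide is needed. The outer two are negative or zero, so the result can overshoot and may need a clamp.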

4 Likes

Somehow unexpected results on my end:

void main() {
float size = vTexCoord.x-0.5;
float c = 0.0;


//size = clamp(size, -1.0, 1.0);
//size = 1.0 - size * size;
//c =  size * size * size;
//SIMD8 shader: 11 instructions. 0 loops. 111 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 176 to 144 bytes (18%)
//SIMD16 shader: 11 instructions. 0 loops. 130 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 176 to 144 bytes (18%)
//------------------------


//size = clamp(abs(size), 0.0, 1.0);
//c=size * size * (2.0 * size - 3.0) + 1.0;
//SIMD8 shader: 11 instructions. 0 loops. 102 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 176 to 144 bytes (18%)
//SIMD16 shader: 11 instructions. 0 loops. 122 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 176 to 144 bytes (18%)
//------------------------


//c=pow(cos(size),7.0);
//SIMD8 shader: 8 instructions. 0 loops. 80 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 128 to 96 bytes (25%)
//SIMD16 shader: 8 instructions. 0 loops. 92 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 128 to 96 bytes (25%)
//------------------------


//size = clamp(size, -1.0, 1.0);
//c=pow(cos(size),7.0);
//SIMD8 shader: 10 instructions. 0 loops. 104 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 160 to 128 bytes (20%)
//SIMD16 shader: 10 instructions. 0 loops. 120 cycles. 0:0 spills:fills, 1 sends, scheduled with mode top-down. Promoted 1 constants. Compacted 160 to 128 bytes (20%)

FragColor.rgb = vec3(c); 

}

On Intel/Mesa, this one-liner gives the cycles used, where GLSL75 varies depending on the ordinal position of the Slang shader in the .slangp chain:

MESA_GLSL_CACHE_DISABLE=1 INTEL_DEBUG=fs,fall,perf VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/intel_icd.x86_64.json timeout 7 retroarch /koko/tmp/bench.png --set-shader preset.slangp 2>/koko/tmp/vsfs.txt ; cat /koko/tmp/vsfs.txt |grep -A1 GLSL75|grep SIMD|sort -u|tac

1 Like

With some old CRT shaders I liked to use adaptive linear filtering:

w1 = f;
w2 = 1.0-f;
w1 = mix(w1, w1*w1, sharpness); // can also use w1*w1*w1 or pow instead of mix
w2 = mix(w2, w2*w2, sharpness);
color = (w1*color2 + w2*color1)/(w1+w2);

It can be tested with crt-guest-sm in the glsl repository.

4 Likes

Looks neat, here is a small test with some tweaks I did. I noticed scanlines look more “even”, for some reason, when the color vec3 is used as the driving variable instead of a float like the max() of the color channels.

void main()
{
    vec2 dx = vec2(invdims.x, 0.0);

    vec2 fp = fract(ogl2pos);
    float f = fp.y - 0.5;
    vec2 pos = (floor(ogl2pos) + 0.5) * invdims;

    vec3 c00 = COMPAT_TEXTURE(Source, pos).rgb;
    c00 = GAMMAIN(c00);
    vec3 c01 = COMPAT_TEXTURE(Source, pos - dx).rgb;
    c01 = GAMMAIN(c01);

    float w1 = fp.x;
    w1 = mix(w1, w1*w1, sharpness);
    float w2 = 1.0 - fp.x;
    w2 = mix(w2, w2*w2, sharpness);

    vec3 col1 = (c00*w1 + c01*w2) / (w1 + w2);
    col1 = sqrt(col1);
    vec3 lum = mix(vec3(1.35), vec3(1.05), col1);
    vec3 f1 = exp(-(6.0 - col1*4.0) * f*f * lum);

    vec3 sum = col1 * f1;

    FragColor.rgb = sum;
}

3 Likes

Hey, very nice example. “Vec3 scanlines” are indeed nice looking, also the edges get a bit saturated.

2 Likes

OK, so I managed to run intel_gpu_top, and while I can’t tell exactly how much each shader stresses the GPU, I can see that it draws less power when sin() is used instead of simple 1.0 - 4.0 * f * f scanlines. So sin() seems to be the fastest, at least on an Intel GPU; other GPUs, like the RPi’s, may behave differently.

Every GPU does behave differently. For example, my old HTC One chokes on multiple texture reads (memory bandwidth?) with a shader that intel_gpu_top reports at 1.8 W, while another shader that draws the same power runs well there, because it does a single texture read and uses a trick like smoothstep on the coordinates.

3 Likes

An S-Video-like single pass I wrote in 2-3 hours. This is a true modulate/demodulate shader including scanlines, mask, curvature and everything, with YIQ resolution controls like gtuv50.

8 Likes

Thank you for sharing. Would it be possible to port it to Slang, please?

1 Like