On most architectures, some of those operations get collapsed into a single instruction, while others get expanded into several. On RDNA 4, the three functions compile to something like this. (You can mostly ignore the s_delay_alu instructions.)
For wgt:
v_med3_num_f32 v0, v0, -1.0, 1.0
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
v_fma_f32 v0, -v0, v0, 1.0
v_mul_f32_e32 v1, v0, v0
s_delay_alu instid0(VALU_DEP_1)
v_mul_f32_e32 v0, v1, v0
For the cubic function:
v_max_num_f32_e64 v0, |v0|, |v0| clamp
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_mul_f32_e32 v1, v0, v0
v_fmaak_f32 v0, 2.0, v0, 0xc0400000
v_fma_f32 v0, v1, v0, 1.0
For the cosine, pow(cos(x), 7):
v_mul_f32_e32 v0, 0.15915494, v0
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_1)
v_cos_f32_e32 v0, v0
v_mul_f32_e32 v1, v0, v0
s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
v_mul_f32_e32 v0, v0, v1
v_mul_f32_e32 v1, v1, v1
v_mul_f32_e32 v0, v0, v1
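To make the three listings easier to follow, here is a rough Python sketch that mirrors each one instruction by instruction. The function names (wgt_traced, cubic_traced, cos7_traced) are made up for illustration, and the arithmetic is done in double precision rather than f32, so results will differ slightly from the GPU. It also assumes that v_cos_f32 takes its argument in full turns rather than radians, which is my reading of why the input is scaled first; treat that as an assumption rather than something the listing proves.

```python
import math
import struct

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# The literal 0xc0400000 in the cubic listing, decoded as an IEEE 754 f32.
MINUS_THREE = struct.unpack(">f", bytes.fromhex("c0400000"))[0]

def wgt_traced(x):
    v0 = clamp(x, -1.0, 1.0)       # v_med3_num_f32 v0, v0, -1.0, 1.0
    v0 = -v0 * v0 + 1.0            # v_fma_f32 v0, -v0, v0, 1.0
    v1 = v0 * v0                   # v_mul_f32 v1, v0, v0
    return v1 * v0                 # v_mul_f32 v0, v1, v0

def cubic_traced(x):
    v0 = clamp(abs(x), 0.0, 1.0)   # v_max_num_f32 v0, |v0|, |v0| clamp
    v1 = v0 * v0                   # v_mul_f32 v1, v0, v0
    v0 = 2.0 * v0 + MINUS_THREE    # v_fmaak_f32 v0, 2.0, v0, 0xc0400000
    return v1 * v0 + 1.0           # v_fma_f32 v0, v1, v0, 1.0

def cos7_traced(x):
    v0 = x * 0.15915494            # v_mul_f32 v0, 0.15915494, v0
    # Assumption: the hardware cosine expects turns, so convert back for math.cos.
    v0 = math.cos(2.0 * math.pi * v0)  # v_cos_f32 v0, v0
    v1 = v0 * v0                   # c^2
    v0 = v0 * v1                   # c^3
    v1 = v1 * v1                   # c^4
    return v0 * v1                 # c^7
```

Read this way, the first listing computes (1 - x²)³ on the clamped input, the second computes 1 - 3t² + 2t³ with t = clamp(|x|, 0, 1), and the third computes cos(x)⁷.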
So there are a few things at work here:
- a * x + b can be done with one FMA instruction, which makes calculating polynomials very fast with Horner’s method
- abs() is basically free on inputs (it gets collapsed into another instruction)
- it’s not obvious, but clamp(x, 0.0, 1.0) is basically free on outputs, so clamp(abs(x) * x, 0.0, 1.0) could be one mul instruction (note that this is only for 0.0 and 1.0)
- before the cosine can be evaluated, the value needs to be multiplied by 1/(2*pi) (I don’t know why)
- the compiler thinks the fastest way to evaluate pow(x, 7) is with 4 mul (multiply) instructions. For larger exponents, pow(x, y) is often implemented as exp(y * log(x)), so it can be a particularly slow function.
- transcendental instructions (cos, exp, log, etc.) are usually slower than basic arithmetic instructions (but it’s complicated)
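Two of the points above, Horner’s method and the multiply chain for pow(x, 7), can be sketched in a few lines of Python. The helper names (horner, pow7_chain, pow_exp_log) are hypothetical, and plain Python has no guarantee of a fused multiply-add, so each `acc * x + c` step only stands in for what would be one FMA on the GPU.

```python
import math

def horner(coeffs, x):
    """Evaluate a polynomial; coeffs run from the highest degree down to the constant."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c   # one a * x + b step, i.e. one FMA where the hardware has it
    return acc

def pow7_chain(x):
    """x^7 with 4 multiplies, in the same order as the compiled cosine listing."""
    x2 = x * x      # x^2
    x3 = x * x2     # x^3
    x4 = x2 * x2    # x^4
    return x3 * x4  # x^7

def pow_exp_log(x, y):
    """Generic pow via exp and log; only valid for x > 0."""
    return math.exp(y * math.log(x))
```

For example, horner([2.0, -3.0, 0.0, 1.0], t) evaluates the cubic 2t³ - 3t² + 1. The compiled version is slightly shorter than a textbook Horner chain because the zero coefficient lets it factor out t² and finish with a single FMA.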