Thanks! It’s good to know to be extra cautious with ddx/ddy. I wasn’t even able to run ddx/ddy on my ATI GPU (probably due to the open source drivers), but they didn’t seem to impact my 8800 GTS at all. tex2D calls that take the derivative arguments are extremely slow though: the GPU doesn’t know whether the derivatives will be the same across each 4-fragment block, so it serializes the texture accesses, slowing them down by a factor of 4.
Thankfully I don’t have to use any of those functions. They’re only in one codepath, selected by a user option for how to deal with artifacts caused by anisotropic filtering with manually tiled texture coords (frac/fmod causes a derivative discontinuity at tile boundaries). tex2Dlod is another option (since it can disable anisotropic filtering), but it’s also too advanced for my “baseline” profile, which is based on what my open source radeon driver can handle ;). There are other higher-level solutions too, involving rearranging the work done in each pass. All this, and it doesn’t even matter for the ATI machine I actually play on, because it apparently isn’t using AF at all (and I know of no way to enable it).
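For anyone wondering what the frac/fmod discontinuity looks like in numbers, here is a rough sketch in Python (not shader code). It assumes the hardware estimates ddx as a simple difference between two horizontally adjacent fragments, which is only an approximation of what a real GPU does. The names `TILES`, `PIXEL_STEP` and `quad_ddx` are made up for illustration.

```python
import math

TILES = 4.0                 # hypothetical tiling factor
PIXEL_STEP = 1.0 / 256.0    # hypothetical change in u per screen pixel


def frac(x):
    """Fractional part, analogous to frac() in Cg/HLSL."""
    return x - math.floor(x)


def unwrapped(u):
    """Tiled coordinate before wrapping: continuous everywhere."""
    return u * TILES


def wrapped(u):
    """Manually tiled coordinate: jumps from ~1 back to 0 at tile edges."""
    return frac(u * TILES)


def quad_ddx(f, u_left):
    """Assumed model of ddx: f at the right fragment minus f at the left."""
    return f(u_left + PIXEL_STEP) - f(u_left)


# A fragment pair well inside a tile.
u_inside = 0.125
# A fragment pair straddling the tile boundary at u = 0.25.
u_edge = 0.25 - PIXEL_STEP / 2

ddx_inside_wrapped = quad_ddx(wrapped, u_inside)    # 0.015625
ddx_edge_wrapped = quad_ddx(wrapped, u_edge)        # -0.984375
ddx_edge_unwrapped = quad_ddx(unwrapped, u_edge)    # 0.015625

print("inside tile, wrapped:  ", ddx_inside_wrapped)
print("tile edge, wrapped:    ", ddx_edge_wrapped)
print("tile edge, unwrapped:  ", ddx_edge_unwrapped)
```

The wrapped derivative at the edge is almost a full tile wide, so the filter thinks the footprint is huge and picks a tiny mip level (or smears across the tile with AF), which is the artifact. Taking ddx/ddy of the unwrapped coordinate and passing them to the derivative form of tex2D avoids the jump, at the cost of the slow path mentioned above.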
The good thing about options is that they’re there… the bad thing is that I’m tempted to code in every single one.