The bounty was only posted less than a week ago; it takes time to get takers! (Sometimes a few months.)
It’s better to think of it in terms of the horizontal scan rate:
There’s no difference for the first scanline right after Present(). VSYNC OFF is essentially lagless for the first scanline output right underneath the tearline in the graphics output.
Tests confirm that Present()-to-photons on a CRT is almost non-measurable for the first scanline during VSYNC OFF: API to light hitting eyeballs. Just like the original console’s original beam racing! After all, I am the founder of Blur Busters and inventor of TestUFO, so I understand input lag!
Scanlines are transmitted out of a graphics output (including DisplayPort, DVI, or HDMI) at time intervals matching the current horizontal scan rate (e.g. 67.5 kHz for a 1080p 60 Hz HDMI signal).
So, the higher the Present() rate (frameslicing), the more closely the lag matches the original console or machine. With a tightly optimized jitter margin, the maximum lag averages one frameslice’s worth (the time interval between two Present() or glutSwapBuffers() calls during VSYNC OFF).
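To make the scaling concrete, here is a quick back-of-envelope sketch (the numbers are illustrative examples from this post, not taken from any specific emulator implementation):

```python
# Worst-case added latency of frameslice beam racing is roughly the
# interval between two consecutive Present() calls during VSYNC OFF.
def max_frameslice_lag_ms(frameslices_per_second: float) -> float:
    return 1000.0 / frameslices_per_second

# A coarse config of 4 slices per refresh at 60 Hz = 240 frameslices/sec:
print(max_frameslice_lag_ms(240))   # ~4.17 ms, still well under a 16.7 ms refresh

# A fine-grained 8000 frameslices/sec config (Kefrens Bars rasterdemo rate):
print(max_frameslice_lag_ms(8000))  # 0.125 ms
```

The same one-line formula covers the whole range from Raspberry Pi-class coarse slicing to high-end fine slicing.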
Yes, that said, it can scale up/down simply via a configurable frameslice count (centrally).
Depending on how the frameslice coarseness is configured, I’ve seen it add anywhere from only 1% extra GPU load (e.g. no shaders/filters/HLSL, low frameslice count, power-efficient on a powerful GPU) all the way to maxing out the GPU at 100% (power-hungry 8000 frameslices/second). It’s surprisingly flexible how much GPU power you want to use up.
Less powerful GPUs like Raspberry Pi/Android might go for 4 frameslices, while GTX Titans can approach 10,000 frameslices per second (the Kefrens Bars rasterdemo uses 8000 frameslices/second). Even 4 frameslices still achieves sub-refresh-cycle latency.
NTSC scanrate is approximately 15,734 scanlines per second, so if we’re presenting at about 1,500 frameslices per second, we’ll have a max input lag of approximately 10 NTSC scanlines’ worth (about 10/15734ths of a second).
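Expressed as code, the scanline-equivalence math is just a ratio (using the standard NTSC-M horizontal scan rate of ~15,734 Hz):

```python
NTSC_SCANRATE_HZ = 15734.26  # NTSC-M horizontal scan rate (lines per second)

def lag_in_scanlines(frameslices_per_second: float) -> float:
    # Worst-case lag is one frameslice interval; express it as how many
    # NTSC scanlines scan out of the graphics output during that interval.
    return NTSC_SCANRATE_HZ / frameslices_per_second

print(round(lag_in_scanlines(1500)))  # ~10 scanlines of worst-case lag
```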
When enabling frameslice beam racing simultaneously with RetroArch Native CRT Support, we can replicate original arcade machines’ input lag pretty closely (no lag advantage or disadvantage) for proper original lagfeel, with no surge-execution distortions for mid-refresh-cycle input reads. A game can essentially stream scanlines (a frameslice’s worth) out of the graphics output while emulating at 1:1 speed. Just like the original machine. This is essentially what beam-raced frameslicing does. Frameslices can be 1 pixel row tall, one full screen height’s worth, or any height in between (e.g. 1/4 screen height).
Now, if we have faster GPUs that can output single-scanline frameslices (~15,734 frameslices per second, matching NTSC scanrate), we can pretty much hit the console’s original latency (within one scanline of lag, anyway). Excluding any signal-tech differences (e.g. digital-to-analog conversion latency, but that can be sub-millisecond).
The beauty is we don’t have to have one-scanline-tall frameslices; we can go coarse multi-scanline frameslices instead. Even frameslices 1/4th of a screen height work fine on slower 8-to-10-year-old GPUs and are very doable on mobile GPUs. And the timing of the frameslices can vary safely, as long as the frameslices fit between the realraster (above) and emuraster (below), producing a tearingless VSYNC OFF mode (a lagless VSYNC ON look).
It scales down (slower GPUs with coarse frameslices) and scales up (faster GPUs with fine frameslices, potentially as small as 1-scanline frameslices), and input lag can actually approach exactly that of the original machine for all possible input reads (mid-screen and mid-raster too), to sub-millisecond identicalness for analog outputs. Whenever and with whatever timing any input reads occurred relative to VBI, it’s preserved.
(Note: A “frameslice” is multiple scanlines (rows of pixels) between two metaphorical tearlines, but they are invisible in the jittermargin technique; for more info, see the GitHub entry.)
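As a rough illustration of the jittermargin idea, here is a toy simulation (hypothetical names, a sketch of the concept only, not RetroArch or WinUAE code): the emulated raster must stay ahead of (below) the real raster, and a frameslice is safe to present as long as the real raster trails it within the margin.

```python
# Toy simulation of beam-raced frameslicing with a jitter margin:
# present a frameslice only while the emulated raster leads the real
# raster, so the (metaphorical) tearline always lands in already-
# identical pixels and stays invisible.
REFRESH_HZ = 60
TOTAL_SCANLINES = 525           # NTSC total lines per refresh cycle
SLICES_PER_REFRESH = 10         # coarseness: 10 frameslices per refresh
SLICE_HEIGHT = TOTAL_SCANLINES // SLICES_PER_REFRESH

def real_raster(t_seconds: float) -> int:
    """Scanline the display output is currently scanning (0-based)."""
    frame_pos = (t_seconds * REFRESH_HZ) % 1.0
    return int(frame_pos * TOTAL_SCANLINES)

def safe_to_present(emu_raster: int, t_seconds: float,
                    margin: int = SLICE_HEIGHT) -> bool:
    """True while the emulated raster is below the real raster (ahead of
    the scanout beam) by no more than one jitter margin's worth."""
    return 0 < emu_raster - real_raster(t_seconds) <= margin
```

The real implementation races the actual scanout position (e.g. via a raster-poll API or a time offset from VBLANK), but the safety condition is the same: keep the emulated raster between the real raster and the bottom of the margin.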