Don’t forget buffer backpressure mechanics and the frame queue concepts (two semi-independent extra lag factors). I explained that, depending on circumstances, VSYNC ON can frequently lead to more than 1 frame of input lag. Not all VSYNC ON implementations are identical across applications, drivers, and programming techniques. In many graphics drivers it is often a shallow frame queue (because that’s more stutterless than a traditional double buffer technique). There are indeed ways to force it to behave like a double buffer, but when running fully flat-out, you can end up having 2 more frames of lag unless you use special techniques.
In addition, consider varying-lag distortion effects. I expanded the above post to cover some of the reasons why it’s not always an exact “X frames of lag”: one input read may receive 33ms of lag and the next input read may receive 16ms, because of lag granularity. That granularity comes from the varying timing of the original machine’s input read and how it interacts with the boundaries of the surge-execution intervals of various lag-reducing techniques (i.e. whether or not the emulator lag-distorts the input read). Most retro games read input at a consistent time, but that timing can jitter on the original machine, and it might jitter across the surge-execution intervals, creating a lag granularity effect that did not exist on the original machine. What this can potentially mean is that button-mashing may feel erratic in the emulator and consistent on the original, for a specific retro game whose input reads vary in position within the refresh cycle. There are pros and cons to all of the various input-lag-reduction methods.
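To make the lag granularity idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not how any particular emulator works: it assumes each surge executes instantly at the start of its interval, so an input read that the original machine performs at some offset into the frame is really sampled at the start of the surge interval containing it. The names `surge_sample_time` and `timing_distortion` are made up for illustration.

```python
REFRESH_MS = 1000.0 / 60.0  # assumed 60 Hz refresh cycle


def surge_sample_time(read_offset_ms, surges_per_refresh, refresh_ms=REFRESH_MS):
    """Toy model: real-world moment at which an emulated input read samples the host.

    Assumption: every surge runs instantly at the start of its interval, so the
    read is sampled at the start of the surge interval that contains it.
    """
    interval = refresh_ms / surges_per_refresh
    return int(read_offset_ms // interval) * interval


def timing_distortion(read_offset_ms, surges_per_refresh, refresh_ms=REFRESH_MS):
    """How much earlier (in ms) the emulator samples input than the original would."""
    return read_offset_ms - surge_sample_time(
        read_offset_ms, surges_per_refresh, refresh_ms
    )


if __name__ == "__main__":
    # An input read that jitters around the midpoint of the frame,
    # with two surge intervals per refresh cycle.
    for offset in (8.0, 8.2, 8.4, 8.6):
        print(offset, round(timing_distortion(offset, 2), 3))
```

In this toy model with two surges per refresh, the boundary sits at about 8.33ms. A read that jitters from 8.2ms to 8.4ms crosses it, and its distortion drops from about 8.2ms to under 0.1ms. The original machine only saw 0.2ms of jitter, which is the granularity effect described above.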
Good emulators with optimization may have only 1 frame of input lag (next frame response) but not all of them. Yet none (except beam raced frameslicing) can do same-frame response, e.g. midscreen input reads changing screen content at the bottom of the same display refresh cycle (no need to wait till the next refresh). And none can guarantee a fixed subframe latency offset for input reads (emulators that try for sub-frame latencies often subject them to refresh-cycle-rounding-off effects, caused by surge-execution distortions). Beamraced emulation (render and scanout on the fly) is better able to guarantee consistent lag for all possible input read timings throughout any part of any emulator refresh cycle, relative to the real-world refresh cycle.
Obviously, RunAhead is superior for many things, while beamracing is another good tool for faithful replication of original machine lag while reducing system requirements (at low frameslice rates).
Full answer will require a multi-page reply to explain things. If you want, we can move to the Area51 forum of the Blur Busters Forums to discuss this part further.
I’ll give a partial answer to help conceptually.
Remember, cable scanout is sometimes totally different from panel scanout. There’s no concept such as “clearing the screen” on the monitor side, so forget about it; the emulator can’t do anything about it. (In reality, an impulse display will clear automatically and sample-and-hold LCD displays will hold until the next refresh cycle, but the considerations are exactly identical regardless of whether you connect an original machine or an emulator to it.) So this discussion is irrelevant here. Stop guessing display-side mechanics for now; we’re only comparing emulator vs. original connected to the SAME display, whether it be the same CRT or the same LCD.
We just want the cable to behave identically where possible. (Internal built-in displays also serialize in a cable-like way: phones, tablets, and laptops use sequential scan too.)
Focus on the cable scan POV, and ignore the display scan POV. So let’s focus on cable scan-out: the GPU’s act of reading one pixel row at a time from its front buffer into the output at exact horizontal scanrate intervals.
Also, on the GPU side, Best Practice #9 recommends against clearing the front buffer between emulator refresh cycles, in order to keep the jitter margin huge (wraparound style).
If you’re an oldtimer, another metaphor (if it makes frameslice beamracing easier to understand) is an old reel-to-reel video tape that runs through a record head and a playback head simultaneously.
The Tape Delay Loop Metaphor Might Help
Technically, nothing stops an engineer from putting two heads side by side feeding a tape through both – to record and then playback simultaneously – that’s what an old “analog tape delay loop” is – a record head and a playback head running simultaneously on a tape loop.
Metaphorically, the tape delay loop represents one refresh cycle in our situation. In our beamracing case, the metaphorical “record head” is the delivery of new scanlines (even if it’s surged frameslicefuls at a time) to the front buffer, ahead of the “playback head”, the one-scanline-at-a-time readout of the front buffer into the graphics output (at exact horizontal scanrate intervals).
The front buffer isn’t onscreen instantly; it’s still being read out one pixel row at a time into the graphics output at an exact constant rate (the horizontal scanrate). So you can always keep changing the undelivered portions of the front buffer (including undelivered portions of a frameslice), ad infinitum, as long as your emu raster (new frame buffer data being put into the front buffer one way or another) stays ahead of your real raster (the pixel row readout to the output). This is a great way to understand why we have a full loop of wraparound jitter margin (a full refresh cycle minus one frameslice worth).
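As a rough illustration of that wraparound logic, the following Python sketch checks whether rewriting a given frameslice would tear, given where the real raster currently is. It is a simplified model with assumptions of its own: scanlines are counted as plain integers, vertical blanking is ignored, and the scanline count divides evenly by the frameslice count. The function names are hypothetical.

```python
def slice_bounds(slice_index, slices_per_refresh, total_scanlines):
    """Return (first_scanline, end_scanline_exclusive) of a frameslice."""
    height = total_scanlines // slices_per_refresh
    start = slice_index * height
    return start, start + height


def would_tear(real_raster, slice_index, slices_per_refresh, total_scanlines):
    """True only if the real raster is inside the frameslice being rewritten."""
    start, end = slice_bounds(slice_index, slices_per_refresh, total_scanlines)
    return start <= real_raster % total_scanlines < end


def lead_scanlines(real_raster, slice_index, slices_per_refresh, total_scanlines):
    """Scanlines from the real raster forward (wrapping around) to the slice start."""
    start, _ = slice_bounds(slice_index, slices_per_refresh, total_scanlines)
    return (start - real_raster) % total_scanlines


if __name__ == "__main__":
    # 1080 scanlines, 10 frameslices of 108 scanlines each (assumed numbers).
    print(would_tear(50, 0, 10, 1080))    # real raster inside slice 0
    print(would_tear(1000, 0, 10, 1080))  # rewriting the top while scanning the bottom
```

Under this model, the only unsafe case is the real raster sitting inside the frameslice being changed. Every other position, including the wraparound case of rewriting the top of the frame while the bottom is still scanning out, is tear-free, which is where the "full refresh cycle minus one frameslice" figure comes from.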
You decrease input lag by putting the playback head as close as possible to the record head. That’s tightening the metaphorical beam race margin.
- The jitter margin is the tape between the playback head and the record head.
- The race margin is the tape between the record head and the playback head.
So the result is a looped safety jitter margin of one full refresh cycle minus one frameslice.
The entire tape loop represents one refresh cycle, looping around. So for 1080p, you can have a >900-scanline jitter margin with zero tearing, if you use the wraparound-refresh-cycle technique as described above in step 9 of Best Practices. Ideally you want to race with tight latency, though. If you use a “2 frameslice beam race margin”, that means with 10 frameslices per refresh cycle, you get a 1-frameslice verboten region (tearing risk), an 8-frameslice race-too-fast safety margin, and a 1-frameslice race-too-slow safety margin before tearing appears. That’s 15ms of random beam race error you can get with zero tearing!!
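The arithmetic in that example can be reproduced with a short Python sketch. This is just a worked calculation under the same assumptions as the paragraph above (60Hz, frameslices of equal duration, exactly one frameslice being unsafe at any time); `margin_breakdown` is a hypothetical helper name.

```python
def margin_breakdown(slices_per_refresh, race_margin_slices, refresh_hz=60.0):
    """Split one refresh cycle into verboten / too-fast / too-slow frameslices.

    Returns (verboten, too_fast, too_slow, jitter_margin_ms).
    """
    if not 1 <= race_margin_slices <= slices_per_refresh:
        raise ValueError("race margin must be between 1 and slices_per_refresh")
    verboten = 1                                       # slice currently being rewritten
    too_slow = race_margin_slices - 1                  # slack if the emulator falls behind
    too_fast = slices_per_refresh - race_margin_slices # slack if it races ahead (wraparound)
    refresh_ms = 1000.0 / refresh_hz
    jitter_margin_ms = (too_slow + too_fast) * refresh_ms / slices_per_refresh
    return verboten, too_fast, too_slow, jitter_margin_ms


if __name__ == "__main__":
    print(margin_breakdown(10, 2))
```

For 10 frameslices and a 2-frameslice race margin this gives 1 verboten, 8 too-fast, and 1 too-slow frameslices, i.e. 9 tenths of a 16.7ms refresh cycle, or roughly 15ms.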
In our case, metaphorically, frameslice beam racing is simply the record head surging batches of multiple scanlines onto the metaphorical tape loop (e.g. a movable record head that intermittently records faster than the playback head). The playback head’s playback speed is totally merrily unchanged!! (i.e. the pixel row readout from the front buffer to the GPU’s output jack). The only requirement is that the record head never falls behind and collides with the playback head (aka a tearing artifact). Thankfully this is just a metaphor, and tearlines won’t wreck the metaphorical tape mechanicals and tape loop permanently (ha!): beam racing can recover during the next refresh cycle (so a tearing artifact appears for only 1 refresh cycle). Metaphorically, front buffer rendering (adding one scanline at a time) means the record head doesn’t have to surge ahead (it can record at the same velocity as the playback head).
You can adjust the race margin to somewhere far enough back that your margin is never breached. That is the metaphorical equivalent of the distance between the tape record head (adding new emu lines to the front buffer) and the tape playback head (the GPU output jack transmitting 1 pixel row at a time).
That’s why it’s so forgiving if properly programmed, and thus can be made feasible on 8-year-old GPUs, Android GPUs, and Raspberry Pi GPUs, especially at lower frameslice counts on lower-resolution framebuffers (which emulator framebuffers often are). We’ve found innovative techniques, and we were surprised they hadn’t been used before now. It’s conceptually hard to grasp until you go “Aha!” (like via the user-friendly Blur Busters diagrams, etc).
I can conceptualize this visually in a totally different way if you were not born in the era of analog tape loop, but this should help (in a way) to conceptualize that we’ve successfully achieved a 900+ scanline safety jitter margin for 1080p beam racing, even with wraparound (e.g. Present()ing bottom half while we’re already scanning top half, and Present()ing top half while we’re already scanning bottom half – both situations have NO tearing, because of the way we’ve cleverly done this, with Best Practice #9 two posts ago…) – making it super-forgiving and much more usable on slower-performing systems. Smartphone GPUs can easily do 240 duplicate-frames a second, it’s only extra memory bandwidth to append new frameslices, anyway.
So that is how a 900+ scanline fully-looped across-refresh-cycle wraparound jitter margin is achieved with 1080p frameslice beamracing. At 60Hz, this means up to a ~15ms range of beam race synchronization error before tearing appears! This helps soak up performance imperfections very well during transient beamrace out-of-sync moments, e.g. from background software. And the beamrace margin can also be a configurable value, as a tradeoff between latency and tearline appearance during duress situations.
Modern systems can easily do submillisecond race margins flawlessly, while Android/Pi might need a 4ms race margin, which is still subframe latency!
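To put those race margins in scanline terms, here is a small Python conversion sketch. It assumes a 60Hz mode with 1125 total scanlines per refresh (a common 1080p timing that includes vertical blanking); the real number should come from whatever display mode is actually in use, and the function names are made up for illustration.

```python
def race_margin_scanlines(margin_ms, refresh_hz=60.0, total_scanlines=1125):
    """Convert a race margin in milliseconds to scanlines of readout.

    total_scanlines=1125 is an assumed vertical total for a 1080p mode.
    """
    scanrate_hz = refresh_hz * total_scanlines  # horizontal scanrate
    return margin_ms * scanrate_hz / 1000.0


def race_margin_fraction(margin_ms, refresh_hz=60.0):
    """Race margin as a fraction of one refresh cycle (below 1.0 is subframe)."""
    return margin_ms * refresh_hz / 1000.0


if __name__ == "__main__":
    for ms in (0.5, 4.0):
        print(ms, race_margin_scanlines(ms), race_margin_fraction(ms))
```

Under these assumptions a 4ms race margin is about 270 scanlines, or roughly a quarter of a refresh cycle, which is why it still counts as subframe latency.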
Yes, in the extreme case frameslices can become one pixel row with no jitter margin (like how my Kefrens Bars demo turns a GeForce 1080 into a lowly Atari TIA with raster-realtime big-pixels at nearly 10,000 tearlines per second) but emulators like the jittermargin technique that hides tearing by simply keeping graphics unchanged at realraster and keeping emuraster ahead of realraster. (Like the tape delay loop metaphor explained above).
This is my partial answer. I have to go back to work, but hopefully this helps you understand better…