Also, I hadn’t seen the beamracing post above before making mine - very cool.
@mdrejhon: so if I understand correctly, rather than presenting the framebuffer all at once, you separate it into chunks that you present in a series of smaller flushes, right?
I’m trying to understand how this doesn’t lead to tearing. If we have 10 frame slices, is the thinking that the real raster is then updated 600 times per second total? It is clear that there will be intermediate states where the raster will be half-displaying the past frame and half-displaying the current one, but is the thinking that since we’re changing things so fast, that assuming no jitter, this will not perceptually manifest as tearing?
It’s pretty simple for me to see how this would reduce total latency by something like ~0.5 frame: the top scanline no longer has to wait until the entire buffer is finished to present (~1 frame difference), whereas the bottom scanline is the same (~0 frame difference), so on average you get a 0.5 frame difference.
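A quick sanity check of that ~0.5 frame figure, as a rough sketch only. It assumes a simplified model that is mine, not something measured: with whole-frame presentation every scanline waits until the frame is complete, while with ideal beam racing each scanline becomes visible as soon as it is rendered. The function name is hypothetical.

```python
def average_saving_in_frames(scanlines):
    """Mean latency saved per scanline, in units of one frame time.

    Simplified model (an assumption): scanline i (0 = top) is rendered at
    time i / scanlines. Whole-frame presentation shows it only once the
    frame is complete (time 1.0); ideal beam racing shows it right away.
    """
    savings = [1.0 - i / scanlines for i in range(scanlines)]
    return sum(savings) / scanlines


avg = average_saving_in_frames(240)
print(f"average saving: {avg:.3f} frames")
```

The top scanline saves about a full frame and the bottom one saves almost nothing, so the mean lands near half a frame.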
My second thought is that you would get a further reduction if some main part of the framebuffer updating routine is currently done synchronously (is it?). This would mean that all sorts of stuff is blocked until a complete framebuffer is filled (like polling the input). Having 10x more partial framebuffer calls would let you do things like poll the input 10x per frame rather than just 1x, which would give you, more or less, another half a frame of latency reduction. Coupled with the above, you get a total latency reduction of roughly 1 frame less than doing the entire framebuffer synchronously. Is that the thinking?
@mdrejhon I think @Dwedit would be the best person to talk to about extending the libretro API for this purpose, given his excellent past track record. Hopefully you guys can get some initial implementation going that way. That is assuming @Dwedit is interested.
Essentially, yes. The rest of the screen remains unchanged.
There are many ways to do this:
Full screen blitting (Present() full screen of incompletely-rendered emulator framebuffer)
Partial blitting (Present() ing only a rectangle)
Front buffer (Writing new emulator scanlines directly to front buffer)
The same principles of beam racing jitter margin apply to all of the above. All of them can be made to behave identically (with different GPU bandwidth-waste considerations) at the end of the day.
In all situations, the already-rendered (top portion) of emulator frame buffer is left unchanged and kept in the onscreen buffer. That, in itself, prevents tearing. Duplicate frames don’t show tearing. Likewise, duplicate frameslices don’t show tearing either. Voilà!
The incompleteness never hits the graphics output, as long as new emulator scanlines are appended to the framebuffer before the GPU reads the pixel row out to the graphics output.
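To make the "duplicate frameslices don't tear" point concrete, here is a toy scanout model. It is a sketch under my own simplifying assumptions: each buffer row is just a label saying which emulator frame it came from, and a Present() swaps the whole buffer instantly while the real raster is about to read a given row. The function name is hypothetical, not from any emulator.

```python
def scanout_with_present(old, new, real_raster):
    """Rows the display receives this refresh if buffer `new` replaces
    buffer `old` just before the real raster reads row `real_raster`.

    Rows above the raster were already read from `old`; the rest come
    from `new`.
    """
    return old[:real_raster] + new[real_raster:]


# Frame N has rows 0-3 delivered; rows 4-7 still hold the previous frame.
old = ["N"] * 4 + ["N-1"] * 4
# Same buffer with one more frameslice (rows 4-5) appended.
new = ["N"] * 6 + ["N-1"] * 2

# Real raster is still in the unchanged (duplicate) area: output equals `new`.
on_time = scanout_with_present(old, new, real_raster=2)
# Real raster is already inside the slice being changed: old and new rows mix.
late = scanout_with_present(old, new, real_raster=5)

print(on_time == new)  # the present left no seam behind the raster
print(late)            # row 4 is stale, row 5 is new: a tearline
```

In the on-time case the rows above the raster are identical in both buffers, so the swap is invisible at the output.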
It’s a working & proven technique already implemented in 2 other emulators including WinUAE. Please read the WinUAE thread from Page 8 through to the end.
No tearing. Duplicate frames = no tearing. Duplicate frameslices = no tearing. If the screen is unchanged where the real-raster is, there’s no tearing.
And the jitter-margin is very healthy. If there’s a lot of random variance in the beam-race distance between emulator-raster and real-raster, there is no tearing, as long as it’s all within the jitter margin (which can be almost a full refresh cycle long, with a few additional tricks!) In practice, variances and latency can be easily brought to sub-millisecond on modern systems (>1000 frameslices per second).
Although I risk shameless self-promotion when I say this – being the founder of Blur Busters & inventor of TestUFO, and having co-authored peer reviewed conference papers – I have a lot of practical skills when it comes to rasters. I programmed split screens and sprite multiplication in 6502 assembly language in SuperMon64, so I’m familiar with raster programming myself, too.
There’s no tearing because the screen is unchanged where the realraster is.
It’s already implemented & proven.
Also, there is no “perceptually” – tearing electronically doesn’t happen because it’s unchanged bits & bytes at the real-world raster. It is as if Present() never happened at that area of the screen.
Also, I posted an older BlurBusters article to hype it up a bit (ha!). Since then I’ve collaborated with quite a few emu authors. Shoutout to Tony Wilen, Calamity, Tommy and Unwinder (new RTSS raster-based framerate capping feature using “racing the VBI to do a tearingless VSYNC OFF” – not an emulator, but inspired by our research).
Depends on the VSYNC ON implementations – you can save 2 frames of latency, and also depends on parameters like NVInspector’s “Max Prerendered frames”.
But not only that, the randomness of the input lag savings disappears (e.g. granularity errors of input reads caused by enforced conversion of midscreen input reads to beginning-of-screen input reads, especially with surge execution techniques). Say, you’ve got a flight simulator game that does input reads at random positions throughout the emulator’s screen. Many emulator lag-reducing techniques will round off those reads to refresh cycle granularities out of necessity. So you get linearity distortions, too.
Emulators already do inputreads on the fly so the number of inputreads does not increase or decrease. They already rasterplot into an offscreen frame buffer, because many emulators need to support retro “racing the beam” tricks. What we’re simply doing is beam-racing the visibility of that already existing offscreen frame buffer. In the past, you waited to build up the whole offscreen emulator framebuffer (beamracing in the dark) before displaying it all at once. With the tricks, you can now beamrace the visibility of that existing offscreen progressively-more-complete emulator framebuffer. Basically scanning out more realtime.
In reality with good frameslice beamracing, and execution synchronization (pace emulator CPU cycle exact or scanline exact), the latency between input read and the photon visibility can become practically perfectly constant and consistent (not granularized by the frameslices) with only an exact latency of the decided beamrace margin. (In reality, your input device is limited by the poll rate, e.g. 1000Hz mouse, so that’s the practical new limiting factor, but that is another topic altogether)
Since the granularity is hidden by the jitter margin, the realraster and emuraster can chase at an exact distance apart (e.g. 1ms) despite the frameslicing occurring. It is like a bricklayer laying new bricks on a path while a person behind walks at constant speed, or adding new sections to a road in granular steps while the cars behind move at a constant speed – same “relativity” basis. As long as you don’t go too fast, you don’t run into the incomplete section of the road (or on the screen: tearing), but the realraster and emuraster are always moving at a constant speed, despite the “intervalness” of the frameslicing.
You simply adjust the beamracing margin (e.g. chase margin) to make sure the new frameslices are being added on time before the constant-moving realraster hits the top edge of a not-yet-delivered frameslice (that’s when you get artifacts or tearing problems). Even when presenting frameslices, the GPU still outputs one scanline at a time out of the graphics output. So the frameslices are essentially being “appended” to the GPU’s internal graphics output buffer ahead of the current constant-rate (at horizontal scanrate) readout of pixels by the RAMDAC (analog) or digital transceiver (digital). Sometimes they’re bunched into 2 or 4 scanlines (DisplayPort micropackets). But other than that, there’s no relationship between frameslice height and graphics output – the display doesn’t know what a frameslice is. The frameslice is only relevant to the GPU memory. The realraster and emuraster continue at constant speed, and the frameslice boundaries jump in granular steps but always in between the two (realraster and emuraster, screen-height-wise), so tearlines never become visible. As long as it’s within the jitter margin, there can safely be variances – it doesn’t have to be perfectly synchronous. The margin between emuraster and realraster can vary.
You simply just have to keep appending newly rendered framebuffer data (emulator scanlines) to the “scanning-out onscreen framebuffer” before the pixel row readout (real raster) begins transmitting the pixel rows. Metaphorically, this “onscreen framebuffer” is not actually already onscreen. The front buffer is simply a buffer directly at the graphics output that’s got a transceiver still reading it one scanline at a time (at exact interval, without caring about frame slice sizes or frame slice boundaries), serializing the data to the graphics output. All of the beamracing techniques discussed above are simply various methods of appending new data to the front buffer before the pixel row begins to be transmitted out of the graphics output. It’s just the nature of serializing a 2D image into a 1D graphics cable, and it works the same way from a 1930s analog TV broadcast through a 2020s DisplayPort cable. That’s why frameslice beam racing works on the entire 100 year span without tearing.
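The "boundaries jump, but always stay between the two rasters" idea can be checked with a few lines. This is only a sketch under assumptions of mine: time is measured in scanlines, the emulator runs at exactly real-raster speed a fixed `margin_lines` ahead, every completed frameslice is delivered the instant it is finished, and the end-of-frame/VBI region is ignored. All names are hypothetical.

```python
def boundary_stays_between(total_lines, slice_height, margin_lines):
    """Check that the delivered-frameslice boundary always sits strictly
    ahead of the real raster and never ahead of the emu raster."""
    for real_raster in range(total_lines - margin_lines):
        emu_raster = real_raster + margin_lines
        # Number of rows delivered so far: whole frameslices only.
        delivered = (emu_raster // slice_height) * slice_height
        if not (real_raster < delivered <= emu_raster):
            return False  # the real raster reached an undelivered row
    return True


# 240-line emulator frame, 10 frameslices of 24 lines each.
print(boundary_stays_between(240, 24, margin_lines=48))  # 2-frameslice margin
print(boundary_stays_between(240, 24, margin_lines=12))  # margin under 1 slice
```

In this toy model the chase margin has to be at least one frameslice tall, which matches the intuition behind using a 2-frameslice margin to leave room for jitter.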
Properly implemented frameslice beam racing is actually more forgiving than things like Adaptive VSYNC (e.g. trying to Present() in the VBI between refresh cycles to try to do a tearingless VSYNC OFF), since the jitter margin is much bigger than the height of the VBI. So tearing is less likely to happen! And if tearing happens, it only briefly appears then disappears as the beamracing catches up into the proper jitter margin.
The synchronousness of input reads is unchanged, but the synchronousness of input-reads-to-photons becomes much closer to the original machine. Beam racing replicates the original machine’s latency. That’s what makes it so preservationist friendly and more fair (especially when head-to-heading an emulator with a real machine). So before a real-machine arcade contest, one can train in an emulator configured to replicate original machine latencies via frameslice beam racing.
Some emulators will surge-execute a framesliceful, but that’s not essential. If you wanted, you can simply execute emulator code in cycle exact realtime (like an original machine) and the raster line callback will “decide” when it’s time to deliver a new frameslice on a leading “emu-raster percentage ahead of real-raster” basis (or simply use a 2 frameslice beam race margin). The emuraster and realraster always move at constant scan-rate in this case (emuraster at emu scanrate, realraster at realworld scanrate), with the frameslice boundaries jumping forward in granular steps, but always confined between the emuraster and realraster. Tearing never electronically reaches the graphics output, so tearing doesn’t exist, because the previous frameslice (that the realraster is currently single-scanline-stepping through) is still onscreen unchanged.
RunAhead is superior in reducing lag in many ways, but beam racing (sync between emuraster & realraster) is the only reliable way to duplicate an original machine’s exact latency mechanics. As a bonus, it is less demanding (at low frameslice counts) on the machine. All original times are preserved between realworld/emulator inputread versus the pixels being turned into photons – even for mid-screen input reads on the original machine – and even for certain games that have jittery-on-original-machine input reads that sometimes read right before/after original machine’s VBI – which causes a 1/60sec (16.7ms) randomly varying latency-jumping jitter on emulators doing surge-execution techniques since some input reads miss the VBI and get delayed – the beamracing avoids such a situation.
Whatever distortions to input reads are occurring with all kinds of emulator lag-reducing techniques, all of that disappears with beam racing and you only get a fixed latency offset representing your beamracing margin (which can become really tight, sub-millisecond relative to original machine when both orig machine & emulator are connected to the same display) – with high-frameslice rate or front buffer rendering. Yet it still scales all the way down to simpler machines – latency still remains sub-frame even with coarse 4-frameslice beamracing on Android/Raspberry Pi, by using a low frameslice count and a wider (but still sub-frame) beam racing margin.
So while RunAhead is superior in many ways, and can reliably get less latency than the original machine – frameslice beamracing is much more preservationist-friendly, replicates original machine latency for all possible inputread timings (including mid-screen and jittery input reads).
P.S. If you download WinUAE, turn on the WinUAE tinted-frameslice feature (for debugging/analysis), and you’ll see the tearing (jitter margin debugging technique). It’s fun to watch and a very educational learning experience. Turn it off, and the tearing is not visible – horizontal scrolling platforms look like perfect VSYNC ON.
@mdrejhon: OK, thanks - I’ll read in more detail to write a more detailed response, but as I quickly look at the WinUAE thread, I have a question about this:
Yes, beam racing improves framebuffer lag by sometimes almost an order of magnitude – 40ms worst case to sub-5ms is directly in the territory of full order of magnitude. And can still go closer to near 0ms! That’s literally a jawdropper and totally felt in pinball gaming.
Based on what you wrote above, it looks like rendering delay would be reduced by something like 0.5-1 frames (8-16 ms) using this technique. But going from 40ms to 5ms corresponds to decreasing by roughly 2 frames. How does splitting the frame into slices lead to multi-frame reduction?
One other question: let’s say the real raster has just finished the last frameslice. When the real raster restarts again from the top, does it erase the entire screen and begin again, progressively filling in the new frame (with the bottom unfinished parts being completely erased to black)? Or does it only erase the top bits and start progressively overwriting those, leaving the remnants of the old frame unaltered at the bottom?
Don’t forget buffer backpressure mechanics and the frame queue concepts (two semi-independent extra lag factors). I explained that – depending on circumstances – VSYNC ON can frequently lead to more than 1 frame of input lag. Not all VSYNC ON are identical, between all applications/drivers/programming techniques – in many graphics drivers, it is often a shallow frame queue (because that’s more stutterless than a traditional double buffer technique). There are indeed ways to force it to behave like a double buffer, but when running fully flat-out, you can end up having 2 more frames of lag unless you do special techniques.
In addition, consider varying-lag distortion effects. I edited the above post to make it bigger to cover some of the concepts how it’s not always an exact “X frame of lag” – one input read may receive 33ms lag and the next input read may receive 16ms of input lag, because of lag-granularity caused by varying timing of original machine’s input read, and how it interacts with the boundaries of the surge-execution intervals of various lag-reducing techniques – how the emulator lag-distorts the input read (or not). Most retro games read input at a consistent time, but that timing can jitter on the original machine, and it might jitter across the surge-execution intervals, creating a lag granularity effect that did not exist on the original machine. What this can potentially mean is that button-mashing may feel erratic in the emulator and consistent on original – for a specific retro game (that had varying-in-refresh-cycle input reads). There are pros/cons of all kinds of various input-lag-reduction methods.
Good emulators with optimization may have only 1 frame of input lag (next frame response) but not all of them. Yet none (except beam raced frameslicing) can do same-frame response, e.g. midscreen input reads changing screen content of the bottom of the same display refresh cycle (no need to wait till next refresh). And none can do guaranteed subframe fixed latency offset to input reads (emulators that try for sub-frame latencies often subject them to refresh-cycle-rounding-off effects, caused by surge-execution distortions). Beamraced emulation (render and scanout on the fly) is more able to guarantee consistent lag for all possible input read timings throughout any part of any emulator refresh cycle, relative to the real-world refresh cycle.
Obviously, RunAhead is superior for many things while beamracing is another good tool for faithful replication of original machine lag while reducing system requirements (during low frameslice rates)
Full answer will require a multi-page reply to explain things. If you want, we can move to the Area51 forum of the Blur Busters Forums to discuss this part further.
I’ll give a partial answer to help conceptually.
Remember, cable scanout is sometimes totally different from panel scanout. There’s no concept such as “clearing the screen” on the monitor side. So forget about it, the emulator can’t do anything about it. (In reality, impulse displays will automatically clear and sample-and-hold LCD displays will hold until the next refresh cycle – but the considerations are exactly identical regardless of whether you connect an original machine or an emulator to it.) So this discussion is irrelevant here. Stop guessing displayside mechanics for now, we’re only comparing emulator-vs-original connected to the SAME display, whether it be the same CRT or the same LCD.
We just want the cable to behave identically where possible. (internal builtin displays also serialize “ala cable-like” too, phones, tablets, and laptops sequential scan too).
Focus on cable scan POV, and ignore display scan POV. So let’s focus on cable scan-out – and the GPU act of reading one pixel row at a time from its front buffer into the output at exact horizontal scanrate intervals.
Also, on the GPU side, Best Practice #9 recommends against clearing the front buffer between emulator refresh cycles, in order to keep the jitter margin huge (wraparound style).
If you’re an oldtimer – Another metaphor (if it is easier to understand frameslice beamracing) is an old reel-to-reel video tape that runs through a record head and a playback head simultaneously.
The Tape Delay Loop Metaphor Might Help
Technically, nothing stops an engineer from putting two heads side by side feeding a tape through both – to record and then playback simultaneously – that’s what an old “analog tape delay loop” is – a record head and a playback head running simultaneously on a tape loop.
Metaphorically, the tape delay loop represents one refresh cycle in our situation. In our beamracing case, the metaphorical “record head” is the delivery of new scanlines (even if it’s surged frameslicefuls at a time) to the front buffer, ahead of the “playback head”, the one-scanline-at-a-time readout of the front buffer into the graphics output (at exact horizontal scanrate intervals).
The front buffer isn’t onscreen instantly, it’s still being read out one pixel row at a time into the graphics output at exact constant rate (horizontal scanrate), so you can always keep changing the undelivered portions of the front buffer (including undelivered portions of a frameslice), ad-infinitum, as long as your emu raster (new frame buffer data being put into the front buffer one way or another) stays ahead of the real raster (the pixel row readout to output). This is a great way to understand why we have a full loop of a wraparound jitter margin (full refresh cycle minus one frameslice worth).
Decreasing input lag is by putting the playback head as close as possible to the record head. That’s tightening the metaphorical beam race margin.
The jitter margin is the tape between the playback head and the record head.
The race margin is the tape between the record head and the playback head.
So, a new looped safety jitter margin of one full refresh cycle minus one frameslice.
The entire tape loop represents one refresh cycle, looping around. So for 1080p, you can have a >900-scanline jitter margin with zero tearing, if you use the wraparound-refresh-cycle technique as described above in step 9 of Best Practices. Ideally you want to race with tight latency, though. If you do a “2 frameslice beam race margin”, that means with 10 frameslices per refresh cycle, you get a 1-frameslice verboten region (tearing risk), an 8-frameslice race-too-fast safety margin, and a 1-frameslice race-too-slow safety margin – before tearing appears. That’s 15ms of random beam race error you can get with zero tearing!!
In our case, metaphorically, frameslice beam racing is simply the record head surging batches of multiple scanlines onto the metaphorical tape loop. (e.g. a movable record head that intermittently records faster than the playback head). The playback head’s playback speed is totally merrily unchanged!! (i.e. the pixel row readout from front buffer to GPU’s output jack). As long as the record head never falls behind and collides with the playback head (aka tearing artifact) – thankfully this is just a metaphor, and tearlines won’t wreck the metaphorical tape mechanicals and tape loop permanently (ha!) – beam racing can recover during the next refresh cycle (aka only a 1-refresh-cycle appearance of tearing artifact). Metaphorically, front buffer rendering (adding one scanline at a time) means the record head doesn’t have to surge ahead (it can record at the same velocity as the playback head).
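The arithmetic behind that 15ms figure, as a small sketch. It assumes (my simplification) that the refresh cycle is divided evenly into frameslices and it ignores the VBI; the function and key names are made up for illustration.

```python
def race_margins(slices_per_refresh, race_margin_slices, refresh_hz=60.0):
    """Split one wraparound refresh cycle into verboten/safe regions."""
    slice_ms = 1000.0 / refresh_hz / slices_per_refresh
    verboten = 1  # the frameslice the real raster is scanning out right now
    too_slow = race_margin_slices - verboten             # slack if we run late
    too_fast = slices_per_refresh - race_margin_slices   # slack if we run early
    return {
        "too_slow_slices": too_slow,
        "too_fast_slices": too_fast,
        "safe_error_ms": (too_slow + too_fast) * slice_ms,
    }


m = race_margins(slices_per_refresh=10, race_margin_slices=2)
print(m)
```

With 10 frameslices at 60Hz each slice lasts about 1.67ms, so the 1 + 8 safe slices add up to roughly 15ms of tolerated beam race error.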
You can adjust the race margin to somewhere far enough back that your margin is never breached. That is the metaphorical equivalent of the distance between the tape record head (adding new emu lines to front buffer) and the tape playback head (GPU output jack beginning transmission of 1 pixel row at a time)
That’s why it’s so forgiving if properly programmed, and thus can be made feasible on 8-year-old GPUs, Android GPUs, and Raspberry PI GPUs, especially at lower frameslice counts on lower-resolution framebuffers (which emulators often are), so we’ve found innovative techniques that surprised us why it hasn’t been used before now – it’s conceptually hard for someone to grasp until they go “Aha!”. (like via the user-friendly Blur Busters diagrams, etc).
I can conceptualize this visually in a totally different way if you were not born in the era of analog tape loop, but this should help (in a way) to conceptualize that we’ve successfully achieved a 900+ scanline safety jitter margin for 1080p beam racing, even with wraparound (e.g. Present()ing bottom half while we’re already scanning top half, and Present()ing top half while we’re already scanning bottom half – both situations have NO tearing, because of the way we’ve cleverly done this, with Best Practice #9 two posts ago…) – making it super-forgiving and much more usable on slower-performing systems. Smartphone GPUs can easily do 240 duplicate-frames a second, it’s only extra memory bandwidth to append new frameslices, anyway.
So that is how a 900+ scanline fully-looped across-refresh-cycle wraparound jitter margin is achieved with 1080p frameslice beamracing. At 60Hz, this means up to ~15ms range of beam race synchronization error before tearing appears! This helps soak up performance imperfections very well during transient beamrace out-of-sync, e.g. background software. And the beamrace margin can also be a configurable value, as a tradeoff between latency and tearline-appearance-during-duress-situations.
Modern systems can easily do submillisecond race margins flawlessly, while Android/PI might need a 4ms race margin - still subframe latency!
Yes, in the extreme case frameslices can become one pixel row with no jitter margin (like how my Kefrens Bars demo turns a GeForce 1080 into a lowly Atari TIA with raster-realtime big-pixels at nearly 10,000 tearlines per second) but emulators like the jittermargin technique that hides tearing by simply keeping graphics unchanged at realraster and keeping emuraster ahead of realraster. (Like the tape delay loop metaphor explained above).
This is my partial answer. I have to go back to work, but hopefully this helps you understand better…
Thank you! Very helpful! I admit that some of the terminology is fairly new for me, so I spent a good amount of time searching and found a glossary you posted here that was very useful in getting in sync with some of this stuff.
I also liked the tape loop metaphor that you used - rather appropriate, since my background is primarily in audio digital signal processing (with some occasional image processing thrown in). Video processing is a little different, but so far seems straightforward enough, given the way I’m used to looking at things. Many of the terms you use, and some of the other terms I see thrown around in this discussion, are things that I generally recognize from general DSP jargon. Some others are quite new entirely, so I hope I can quickly get to understanding those terms correctly.
With respect to cable scanout vs panel scanout, after looking at your glossary, I think you’re right that panel scanout is irrelevant to the discussion and cable scanout is what really matters here. Likewise, I’m also less interested in things like USB poll interval lag (and variance). The user will be able to supply their own TV and controller, which will hopefully work well enough. What I’m most interested in is getting my head wrapped around the middle, where there are huge chunks of latency that could, hopefully, be reduced using a method like this. Beyond that, the user can tweak their TV/controller if need be, independently from this - play around with USB poll intervals, or lower the TV’s resolution if need be, or just generally find settings that work.
I am really quite surprised to see that VSync can cause multiple frames of input lag. I thought that VSync synchronizes the monitor playback rate with the GPU framerate, so that there is no jitter and hence no tearing. What I don’t get is, when people use the “Frame advance” feature, where they push a button on the controller and then manually advance frames only to see it register 2-4 frames later – is VSync somehow doing that? Would beamracing be able to help lower latency even in that scenario?
I think I’ll start there for now. Your posts are very detailed and I rather appreciate that - I’ll probably need to read a few times before I get it all. For now I’m most interested in understanding the basic signal path and the major components driving latency which come from things like video processing, rather than input polling and such.
To simplify, let’s focus only on emulator thru graphics output.
Traditionally, an unoptimized emulator on unoptimized drivers can:
Consider whether the emulator does preemptive real input reads before rendering the emulator frame, or does input reads in realtime while doing the render (simulated raster scanout)
Render emulator frame (varies, up to 1/60sec lag, depending on how intensive it is, and if this jiffy is executed or not to speed up individual emulator rendered frame for delivery sooner. Input read can be early in emulator frame, versus late in emulator frame)
Deliver to graphics card (varies, up to 1 frame lag) - Present() blocks until there is room in the frame queue. That’s buffer backpressure lag!
Any frame queues used in the graphics card (varies, 0, or 1 or 2 frame lag). The graphics driver delivers any prerendered frames in sequence as consecutive individual refresh cycles.
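Adding up the stages listed above gives a feel for how far a traditional chain can stretch. This is just a sketch: the per-stage maxima are taken from the list, the minima of 0 are my assumption, and 60Hz is assumed for converting frames to milliseconds.

```python
FRAME_MS = 1000.0 / 60.0  # one refresh cycle at 60 Hz (assumed)

# (stage, min frames, max frames). Maxima follow the list above;
# the minima of 0 are an illustrative assumption.
LAG_CHAIN = [
    ("render emulator frame", 0.0, 1.0),
    ("deliver to graphics card (Present() blocking)", 0.0, 1.0),
    ("driver frame queue", 0.0, 2.0),
]


def chain_bounds_ms(chain, frame_ms=FRAME_MS):
    """Best-case and worst-case total lag of the whole chain, in ms."""
    best = sum(lo for _, lo, _ in chain) * frame_ms
    worst = sum(hi for _, _, hi in chain) * frame_ms
    return best, worst


best_ms, worst_ms = chain_bounds_ms(LAG_CHAIN)
print(f"best {best_ms:.1f} ms, worst {worst_ms:.1f} ms")
```

These are only bounds on the stages named here; a real system sits somewhere in between, and display and input device lag come on top.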
You can make it efficient and tight (1 frame lag) but in reality, it can be awful. Some very old Blur Busters input lag tests of Battlefield 4 had over 60ms input lag even on a display that had less than 10ms input lag, and even when a frame rendered in only 15ms. Conversely, CS:GO reliably achieved approximately 20ms or so. Those tests were from almost five years ago. There are huge variances between software, even for VSYNC ON, OFF, GSYNC… Software plays a role in how much lag is added and how the sync workflows are treated (that’s why you hear of various tricks such as “input delaying” to move the input read closer to output and reduce lag.)
Usually maximum prerendered frames is 1, and that is necessary for compatibility with lots of things such as SLI which must multiplex frames from multiple cards into the same frame queue. It also massively improves frame pacing. A few years ago, there was a controversy with the disappearance of the “0” setting in NVInspector for Max Prerendered Frames.
You can use tricks to reduce a lot of lag in this lag chain, but VSYNC ON in games can vary humongously between different apps. It’s simply the time interval between the input read (which necessarily occurs before rendering in most 3D games) and the pixel hitting the output jack (the point A to B we are limiting scope to for simplicity).
Beam raced frameslices do input read, render AND output in essentially realtime. Just like the original machine did. Faithfully. With it, there can be a mere 1 millisecond between an input read and the actual reacted pixels hitting the graphics output. For any game that does continual input reads mid-scanout, the photons of that can actually hit your eyes in subframe time. Like a mid-screen input read for bottom-of-screen pinball flippers.
You still need to change a lot in every single emulator so it renders the slices and even polls more often than once per refresh, and is that even accurate? I’m pretty sure it’s not for every single case.
@mdrejhon: thanks for that. As a rough pass I think I get the idea of why VSYNC can sometimes lead to multi-frame delay. I do think I’m continuing to get bogged down a little bit w/ the terminology here though.
Right now, the way I think of input timing latency is as follows:
1. The input is a delta function, and we are trying to figure out the total latency (or group delay) in the “impulse response.”
2. The signal path consists of a string of delay lines, one after the other, each of which adds some time delay to the signal.
3. The total time delay is the sum of the time delays of each component.
4. Rather than all of the delays being set in stone, the delay of each component is a random variable according to some probability distribution. We know the range of values each can take, the probability of each, and the mean and variance. For instance, a 100 Hz USB poll interval is a uniform distribution on 0-10ms, which has a mean of 5ms and a stdev of ~3ms.
5. The components are “approximately independent” of one another, at least given reasonable running conditions. Meaning the USB poll interval position does not correlate with, for instance, the refresh interval position, or whatever. Both are equally random, or if there is any correlation, it’s negligible.
6. Because of #5, the expected value of the total latency is the sum of the expected value of the latency of each component.
#5 is probably where the gray area is. Some components might correlate with one another… sometimes… only under certain fundamental conditions… and it’s hard to tell where. That seems to be the basic problem.
If we have two components that tend to correlate significantly, so that a better latency on one suggests a higher or lower probability of a better latency on the other, then we can chunk them into a single component. Ultimately we can always arrive at some chunks of components that do not correlate with one another in any significant way, given at least reasonably normal working conditions.
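The delay-line model above can be written down in a few lines. This is a sketch under stated assumptions: each component is modeled as a uniform distribution, the second component (position within a 60Hz refresh interval) is my illustrative pick, and the function names are hypothetical. One note on the math: the means add whether or not the components are independent; independence is what allows the variances to add.

```python
import math


def uniform_mean_var(low_ms, high_ms):
    """Mean and variance of a uniform distribution on [low_ms, high_ms]."""
    mean = (low_ms + high_ms) / 2.0
    var = (high_ms - low_ms) ** 2 / 12.0
    return mean, var


def total_latency(components):
    """Mean and stdev of the summed delay, assuming independent components."""
    stats = [uniform_mean_var(lo, hi) for lo, hi in components]
    total_mean = sum(mean for mean, _ in stats)
    total_std = math.sqrt(sum(var for _, var in stats))
    return total_mean, total_std


# 100 Hz USB polling: uniform on 0-10 ms.
usb_mean, usb_var = uniform_mean_var(0.0, 10.0)
print(f"USB poll: mean {usb_mean:.1f} ms, stdev {math.sqrt(usb_var):.2f} ms")

# Add a second, illustrative component: position within a 60 Hz refresh.
total_mean, total_std = total_latency([(0.0, 10.0), (0.0, 1000.0 / 60.0)])
print(f"total: mean {total_mean:.2f} ms, stdev {total_std:.2f} ms")
```

The USB numbers reproduce the 5ms mean and ~3ms stdev quoted in item 4.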
Actually, for at least one module, it’s easier than you think.
WinUAE told me it was a quick modification to add basic 60 Hz support.
This is because for “raster accurate” emulation modules (e.g. Nintendo and Super Nintendo emulation):
The emulator module is already plotting one scan line at a time into an offscreen buffer. It’s already happening with the NES module.
The emulator module is already (usually) doing real time input reads while plotting scan lines. It’s already happening with the NES module.
The new raster poll API simply lets the centralized beamracer “peek” at the ALREADY EXISTING module’s ALREADY EXISTING OFFSCREEN FRAMEBUFFER, and grab a frameslice from it.
The centralized code will do the peeking, and the centralized code will do the grabbing of the frameslice itself. The raster poll is simply giving the central code opportunities to do early-peeks of the emulator’s existing offscreen framebuffer, every time a new emulator scan line is written to it.
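To make the division of labour concrete, here is a rough sketch of that idea. It is not the libretro API: the class, the raster_poll name and the present callback are all hypothetical, and the real-raster synchronization and VSYNC OFF flush are left out entirely. It only shows that the core-side change can be a single call per plotted scanline, while the central code decides when a frameslice is complete and grabs it.

```python
class CentralBeamracer:
    """Hypothetical central code that peeks at a core's offscreen framebuffer."""

    def __init__(self, total_lines, slices_per_frame, present):
        self.slice_height = total_lines // slices_per_frame
        self.present = present            # stand-in for a real partial flush
        self.next_slice_start = 0

    def raster_poll(self, framebuffer, lines_done):
        """Called by the core each time a new emulated scanline is written."""
        # Grab every frameslice that has become complete since the last poll.
        while lines_done - self.next_slice_start >= self.slice_height:
            start = self.next_slice_start
            end = start + self.slice_height
            self.present(start, framebuffer[start:end])
            self.next_slice_start = end

    def end_of_frame(self):
        """Reset for the next refresh cycle."""
        self.next_slice_start = 0


def run_core_frame(framebuffer, beamracer):
    """Stand-in for a line-exact core; the only new line is the raster_poll call."""
    for line in range(len(framebuffer)):
        framebuffer[line] = f"row {line}"            # existing per-scanline plotting
        beamracer.raster_poll(framebuffer, line + 1)  # the added hook
    beamracer.end_of_frame()


presented = []
racer = CentralBeamracer(
    total_lines=240,
    slices_per_frame=10,
    present=lambda start, rows: presented.append((start, len(rows))),
)
run_core_frame([None] * 240, racer)
print(presented[:3])   # [(0, 24), (24, 24), (48, 24)]
```

The core never learns what a frameslice is; it only reports progress, which matches the “early-peek” description above.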
For at least one of the easiest RetroArch modules, it looks like only a 10 line modification.
All the complexity is centralized (probably ~1000 lines of code, 3-4 days of programming work). Please re-read my proposal. That’s where most of the RetroArch work lies.
More difficult cores will take a lot more time, but once the core libretro is made beamrace compatible, then the beamrace support can be added to only one module at a time. And from what I have seen, the easiest module will only need a simple hook (10 lines) to turn it into a successful beamracer.
Emulator authors – over the last two decades – have done an amazing job refining realtime beamracing on the emulator offscreen buffer already. So there’s not much work left to glue in the remaining step. So to the authors of the “easy” modules, thank you so much for making it so easy to beamrace for real!
It’s the last piece of the puzzle that most emulator programmers do not understand: the “black box” between Present() and photons – but people like me do. That 1% is complex to understand and this is why I am writing big posts to explain that 1% needed to finish the “full beamrace chain”.
And, even if it’s easier than expected with some modules (NES)…
…It will also be more difficult than expected with other modules (who knows which ones). It depends on how much of the beamracing chain they’ve already completed.
The fact is that both extremes exist.
The beauty is that once it’s implemented in libretro, it can be implemented one module at a time, one by one, beginning with the easiest modules – taking our merry time.
Once the easy module is done, it gives everyone the “aha” moment, and makes some people understand frameslice beam racing much better. (The remaining 1% step needed to finally pull an emulator’s existing internal beamracing out to the real world display).
For some modules, over 99% of the beamracing work is already done. 20 years of beamracing development has done that already, but never beyond the Present() API.
The major complexity will be making libretro compatible. If there’s a lot of layering (e.g. lack of a VSYNC OFF mode, and a lot of black box layers), it has to be refactored somewhat. Basically, VSYNC OFF support needs to be added to libretro, in order for frameslice beamracing to work. It might or might not be a royal headache.
But on the NES module side (at least), it’s quite minor changes there for that particular module since it already does internal beamraced input reads and internal beamraced line-plots into its internal framebuffer.
For the time split between “the core, and the easiest module” – I guesstimate over 95% of programming time will be focussed on the centralized code, and 5% of the time spent on the easiest module. Once done, the bridges can be crossed for the remainder of the modules.
The hardest module might need lots of code – and/or rewriting – to be compatible, but the easiest modules will essentially only need 10 lines of modifications.
The already cycle-exact and raster-exact modules will obviously be the easiest modules, especially if they’re simply (as the NES module is) already rasterplotting one line at a time internally already to an internal frame buffer. Those types will be easy to do frameslice beamracing.
The emulator modules don’t even need to know what the heck a frameslice is, if one re-reads my proposal.
In my proposal, all the emulator module is doing is letting the core code (centralized raster poll code) do early-peeks at the existing offscreen beamraced buffer that most 8-bit and 16-bit emulators already maintain, in order to be compatible with retro-era raster interrupts.
I’m going to cover this from the opposite side first.
This is an easier argument for me to make, because there are fewer variables. Makes latency math simpler.
I’ve got a 1000fps high speed camera. With a test program, I’ve successfully got API to photons in just 3 milliseconds on my fastest LCD display. That is Present() to photons hitting the camera sensor. That includes LCD GtG. That includes DVI/DisplayPort latency. That includes monitor processing latency. I’m able to get this for top edge, center, and bottom edge of screen.
I can probably get <1ms with a CRT and an older graphics card with a direct adaptorless VGA output.
But let’s simplify. So, now we already know the baseline absolute-best proof from my high speed camera, and it is proven “realtime” API to photons, by all practical raster extent.
I’ve also done brief tests that showed 4ms-5ms from mouse click to photons, for some extreme blank-screen VSYNC OFF tests. Now, we know that DisplayLag.com can have a display latency difference from top/center/bottom (e.g. 2ms, 9ms, 17ms). Obviously for simplicity, most sites only report average latency (VBI to screen middle). Which is often half a refresh cycle. Which is why you never see numbers less than 8.3ms on sites such as DisplayLag.com for a 60Hz display (1/2 of 1/60sec = 8.33333ms). That’s simply a stopwatch from VBI-to-raster. With beamracing, the lag is vertically uniform (e.g. 2ms, 2ms, 2ms TOP/CENTER/BOTTOM) for Present-to-Photons during VSYNC OFF frameslice beamracing. (There are micro lag gradients within frameslices – caused by the granularity of frameslice versus the one pixel row at a time scanout of graphics output – but that can be filtered by the tape loop metaphor for consistent unvarying subframe emulator pixel to photons time, as a fixed screen height difference between emu raster and real raster – all easily centralizable inside the central code of the raster poll API, it’s simply a busysleep on RDTSC or QueryPerformanceCounter.)
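A minimal sketch of that busy-sleep, assuming a simplified timing model with no blanking interval and evenly spaced frameslices. The function names and the scheduling formula are illustrative assumptions, not code from any emulator. Python's time.perf_counter_ns stands in for the high-resolution counter (on Windows it is backed by QueryPerformanceCounter).

```python
import time

def busy_wait_until(target_ns):
    """Spin on the high-resolution counter until target_ns has been reached."""
    while time.perf_counter_ns() < target_ns:
        pass
    return time.perf_counter_ns()

def slice_present_time_ns(frame_start_ns, refresh_hz, slice_index,
                          slices_per_frame, margin_slices):
    """Time at which to flush a frameslice so that the emulated raster stays a
    fixed number of slices ahead of the real raster (simplified model)."""
    slice_ns = 1_000_000_000 / refresh_hz / slices_per_frame
    return frame_start_ns + int((slice_index - margin_slices) * slice_ns)

# Example: 60 Hz, 10 frameslices per refresh, 1.5 slice margin, slice number 3.
frame_start = time.perf_counter_ns()
target = slice_present_time_ns(frame_start, 60, 3, 10, 1.5)
woke = busy_wait_until(target)
print(f"flushed {(woke - frame_start) / 1e6:.3f} ms after frame start")
```

Spinning burns a CPU core, but unlike a regular sleep it is not subject to scheduler wake-up granularity, which is the reason a busy-sleep is suggested for this step.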
Briefly going offtopic, but currently the least-laggy LCDs via DisplayPort/DVI tend to be ~3ms for API-to-photons if we're focussing on minimum lag (top lag, or beamraced pixel-for-pixel lag) -- subtract about 8.3ms from the number you see on DisplayLag.com and that's your beamraced input lag attainable. Certainly there are less laggy displays and more laggy displays, but some LCDs are almost as fast as CRTs in response (e.g. just digital transceiver lag & GtG lag, with a few scanline buffered micropacket lag). Although we're not worried about panel scanout, the fact is some of them have realtime synchronous cable-to-panel scanout abilities (also see www.blurbusters.com/lightboost/video for an older example high speed video of how an LCD scans out, and how some blur-reduction strobe backlights work, e.g. LightBoost, ULMB). In non-strobed operation, it's essentially a fast-moving GtG fade zone chasing behind the currently-being-refreshed pixel rows, being refreshed practically on the fly directly from the cable (with only line-buffer processing for overdrive -- unlike old LCDs that often full framebuffered the refresh cycle first).
So all we’re worried about is increases to latency to this absolute-best baseline.
My graphics card can do up to 8000 frameslices per second (Kefrens Bars demo).
Mouse poll 1000Hz adds an average of 0.5ms latency (the midpoint average of 0ms…1ms latency). There is experimentation being done with 2000Hz mice and overclocked 8000Hz mice, so it’s theoretically possible to get lower – USB latency of 0.125ms has been successfully achieved with mouse overclocking.
Emulator frameslice granularity at 2400 frameslices per second (40 frameslices per refresh cycle, HLSL filters disabled, GTX 1080 Ti extreme case) with a 1.5 frameslice average beamrace margin = 1.5/2400sec latency = 0.6ms latency.
So mouse poll latency 0.5ms and beamrace margin latency 0.6ms = 1.1ms lag for input-read-to-pixel-transmitting-on-wire. It could be even less obviously given my computer’s performance. But this is already incredibly small.
While doable on i7s with powerful GPUs, it is overkill for many. A lot are happy with 10-frameslice beamracing (600 frameslices per second). 2 frameslice lag equals 2/600sec or 1/300sec or 3.333ms for the common 10 frameslice WinUAE setting (I’m excluding the 0.5ms from 1000Hz USB input poll, obviously).
It continues to scale down to more lenient margins, like 4-frameslice or 6-frameslice beamracing on slower platforms (e.g. Pi, Android, etc). There’s more lag for that, but still subframe lag compared to any other possible non-RunAhead lag reduction approaches. 4 frameslices with 1.5 slice beamrace margin is still only ~6ms lag – incredibly low for a Raspberry Pi, and I suspect 10 frameslices are doable on the newer mobile GPUs. At 600 frameslices per second, that gets real close to more exactly reproducing faithful-original latencies.
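The arithmetic in the figures above can be reproduced with a few lines. This is just a calculator for the numbers quoted in this post, assuming a 60Hz refresh; the function names are made up for the example.

```python
def average_poll_latency_ms(poll_hz):
    """Average wait for the next poll: half of the poll interval."""
    return 1000.0 / poll_hz / 2.0

def beamrace_margin_ms(slices_per_refresh, margin_slices, refresh_hz=60):
    """Latency added by holding the emulated raster margin_slices ahead."""
    slices_per_second = slices_per_refresh * refresh_hz
    return 1000.0 * margin_slices / slices_per_second

cases = [
    ("40 slices, 1.5 margin", 40, 1.5),
    ("10 slices, 2.0 margin", 10, 2.0),
    (" 4 slices, 1.5 margin", 4, 1.5),
]
poll_ms = average_poll_latency_ms(1000)
for label, slices, margin in cases:
    margin_ms = beamrace_margin_ms(slices, margin)
    print(f"{label}: {margin_ms:.3f} ms margin, "
          f"{margin_ms + poll_ms:.3f} ms including 1000Hz poll")
```

The three rows come out at 0.625ms, 3.333ms and 6.25ms of beamrace margin, matching the ~0.6ms, 3.333ms and ~6ms figures above.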
@mdrejhon: Very cool. I can kind of see how the beam racing thing covers a lot of the different sources of latency inherent in the signal.
I’m still confused on the basic phenomenon of how this doesn’t lead to tearing. Suppose, for instance, you’re playing some game, and you have half of the frame rendered, and it just so happens that half of the character sprite is rendered. Then, midway through the frame, the user pushes some input button, causing the character to move. So input is processed mid-frame.
Do we process the input here, so that the next half of the frame is rendered in accordance with the character moving? And so on, even if they push another button? Basically, do we process multiple sequential inputs within the same frame, even if it’s already been rendered?
If you do, then the thing is, the sprite and position have now changed. If you totally change gears and start rendering the next frame, you would get, I think, a tearline, or at least a visual mess.
On the other hand, if you do just continue to process the current frame the same way as before, then at least under the hood, the input poll routine can get a bit of a head start on things for the next frame. So then your multiple combinations of buttons are partly processed during the current frame, but all get lumped together in the next frame.
It seems something like that, as I think this through. Partly the issue might be that there’s some subtlety in the way Present() works that I don’t quite get.
Many emulators – for NES, SNES, Commodore 64, Apple, several 80s/90s-era MAME modules – render only one pixel row at a time (line-exact emulators), or even a single pixel at a time (cycle-exact emulators).
Sure, the original 8-bit software has done half a frame buffer, but the emulator is simulating a virtual equivalent of a CRT electron gun already! (Some of them single-pixel cycle-exact granularity; others of them scanline-at-time granularity).
So the emulator is already serializing one pixel row at a time. Plotting them to the existing offscreen buffer.
For an NTSC CRT, that is equivalent to roughly 15,734 scanlines per second (Horizontal Scan Rate ≈ 15.7 kHz; 15,625 is the PAL line rate), so that’s one new scanline plotted about every 1/15,734th of a second. So the offscreen framebuffer is getting a new pixel row on average every ~64 microseconds. Some emulators execute synchronously, by putting a busyloop where needed to scanline-pace it – other emulators will surge-execute 1/60th sec worth of emulation (faster than the original machine) to deliver a full framebufferful in traditional PC-based “full frame buffer at a time” workflows. Regardless, the timing of input reads in the original code varies from game to game: some games always read input at the beginning of a refresh cycle, some at the end of a refresh cycle, and some do input reads mid-screen. But regardless, whatever the original game did, it gets preserved when synchronizing emuraster with realraster.
That’s because they have to preserve original beam-race behaviours like raster interrupts. That’s why those particular modules tend to be very beamrace-friendly to the real-world.
Because they’re already doing that, 99% of the work is already essentially done.
The large amount of writing I do in this thread is doing the remaining 1% of the work synchronizing the emulator raster (line-at-a-time) to the real-world raster – which is something that most people don’t realize is now already possible. But, it is indeed a somewhat complex-to-grasp 1% that requires good understanding of the way things used to be done originally.
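For a sense of scale, the per-scanline timing can be computed directly from the horizontal scan rate. A small sketch, with the line rate as a parameter; 15,625 Hz (PAL) and 15,734 Hz (NTSC) are used as example inputs, and the function names are illustrative.

```python
def scanline_period_us(h_scan_hz):
    """Time spent on one scanline, in microseconds."""
    return 1_000_000.0 / h_scan_hz

def scanline_start_us(line_index, h_scan_hz):
    """Offset from the start of scanout at which a given line begins."""
    return line_index * scanline_period_us(h_scan_hz)

for name, rate in [("PAL", 15_625), ("NTSC", 15_734)]:
    print(f"{name}: {scanline_period_us(rate):.2f} us per scanline, "
          f"line 120 starts at {scanline_start_us(120, rate) / 1000.0:.3f} ms")
```

A scanline-paced emulator is effectively trying to hit scanline_start_us for each row; syncing to the real raster means keeping those offsets aligned (plus a small margin) with where the display is actually scanning.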
Does the above help explain why there is no tearing?
@mdrejhon: ok, yes, it does. Sorry for the delay in responding, got busy for a bit here…
I now understand why there is no tearing. So the original game was in charge of protecting against that while doing the beamracing. So, we assume there’s something like that going on.
Doing beamracing means we get as close to the original game as possible. Since the original game is doing some kind of beamrace mid-frame tearing protection, the game itself will add a tiny bit of lag between when you press a button and when it appears on the TV, because it needs to avoid messing up the current frame.
If we then run the above game in an emulator, then not only do we have the original game’s input latency, but we now also have the additional VSync latency, where things are delayed by yet another frame (or two). So it all adds up to increased cumulative latency.
It seems like there are a few case splits here in the way that original games deal with input latency - Yoshi’s Island seems to be particularly slow, for instance, whereas other games might be a lot faster. But, there are lots of ways to split the latency chain up into modules, and going with beamracing affects all equally, so that’s why this is good to implement. Right?
This is a weird way to put it, but I’m trying to make sure I’m balancing the latency “checkbook” correctly. Does the above make sense?
“beamrace mid-frame tearing protection” should be phrased as “beamrace by design”.
Remember… Atari 2600 had no framebuffer memory at all. Zero, none, nada, zilch! They had to buffer a single scanline at a time – generate new graphics in realtime, every scanline.
You see, the only way to do graphics on an Atari 2600 in the 1970s-1980s was beamracing out of necessity.
So the Atari 2600 essentially beamraced several thousand simple “linebuffers” a second. Doing that on a puny 1 MHz CPU is no less than a miraculous programming feat. Raster feats continued for a couple decades afterwards, at least for other graphics special-effects like scroll zones or sprite multiplication.
Later on, even when games gained framebuffers, some special effects (e.g. 16 sprites instead of 8, or a stationary scorebar below a scrolling zone) also required beamracing out of necessity. Basically they intentionally injected a dividing line between two different framebuffers (if you must use the word “tearline” terminology, yes, that’s essentially an ancestor to a modern tearline).
Emulators had to preserve whatever beamracing antics that the original machines did.
Amidst all of this, while not all used lagless input mechanisms, some were essentially sub-scanline lagless – input reads of a joystick controller port had virtually no latency – it was typically just a mechanical joystick, where mechanical switches completed circuits directly on separate pins of a 9-pin joystick port. Up/Down/Left/Right/Button only required 5 wires plus shared ground. Extra wires can be used for things like extra buttons, etc. Anyway, the moment a joystick button is fully pressed, the circuit is completed right there and then. Which directly changes the bits of a byte at a single memory address. The joystick port is often read by a register-read instruction or a PEEK command (in 6502, the machine language programming instruction could be “LDA $DC00” or “LDA $DC01” (Commodore 64 version of joystick register), which is essentially the assembly language / machine language equivalent of the BASIC command “LET J = PEEK(56320)” – hex DC00 equals decimal 56320). The fast joystick-peeking instruction, which executes in microseconds, may actually be an instruction embedded within raster-realtime generated graphics – or might be a few scanlines before – or might be at the beginning of the refresh cycle (blanking interval) or a few refresh cycles ago (e.g. framebuffering workflows – but remember: a lot of this was the era before framebuffers). So latencies are sometimes microseconds between joystick and photons (CRTs can illuminate a ‘pixel’ in essentially microseconds). Or they can sometimes be one or two refresh cycles if the original code reads during a different part of the refresh cycle or in the blanking interval, or even many refresh cycles (e.g. the early simulators, early crude 3D flight simulators, like 1982 Microsoft Flight Simulator running at only 2-3 frames per second with blocky line-drawing graphics).
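As an illustration of how little sits between the switch and the program, here is a sketch that decodes the byte such a read returns. It assumes the usual Commodore 64 layout for the register at $DC00 (bits 0-4 are up, down, left, right and fire, active low, so a cleared bit means the switch is closed); the dictionary standing in for memory and the function name are made up for the example.

```python
JOYSTICK_REGISTER = 0xDC00   # 56320 decimal, the PEEK(56320) address

# Assumed bit layout of the joystick byte (active low).
JOYSTICK_BITS = {"up": 0, "down": 1, "left": 2, "right": 3, "fire": 4}

def decode_joystick(register_value):
    """Return which switches are closed; a 0 bit means 'pressed'."""
    return {
        name: ((register_value >> bit) & 1) == 0
        for name, bit in JOYSTICK_BITS.items()
    }

# Simulated memory: up and fire held down (bits 0 and 4 cleared).
memory = {JOYSTICK_REGISTER: 0b11101110}
state = decode_joystick(memory[JOYSTICK_REGISTER])
print(JOYSTICK_REGISTER, state)
```

On the original machine this is one load instruction plus a few bit tests, which is why the read can sit anywhere in the frame, including in the middle of raster-timed code.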
Regardless, framebufferless workflows sometimes continued to be used on character-buffered platforms (grids of pre-defined graphics used as building blocks) to do things like add extra colors per row or other effects. It varies hugely how an original platform did input reads, but nothing stopped them from doing raster-realtime “input-to-photons behavior” if input reads were done mid-raster. But this is getting offtopic; emulators already (to best effort) preserve this originalness at least to the offscreen framebuffer. So we usually don’t have to do any more work on the input-reading side (even though input read granularity rounds off at polls, e.g. 1ms for 1000Hz polls). The beam raced frameslicing is a modification only to the missing 1% of the emulator “rendering” workflow (subframe raster sync between emu-raster and real-raster) necessary to replicate the original latencies to a much higher accuracy than has ever been achieved before. Regardless, the word “tearing” is a non sequitur for 8-bit programming techniques that did not use frame buffers.
For a good newbie’s guide to this, see Wired Magazine’s https://www.wired.com/2009/03/racing-the-beam/
… I highly recommend that “Racing The Beam” book from Amazon. This will help prepare you for a better understanding of rasterwork.
The two parts of your sentence are (mostly) unrelated to each other. It turns out to be a non sequitur. We’re just simply preserving original input-read phasing (whether it was only 1 microsecond before pixel output – or 1 second before the frame). There can technically be essentially zero lag, if the input read is made raster-realtime mid-scanline (e.g. Atari 2600), reading the joystick controller register while generating pixels.
It is simply a function of the original game programming, nothing more, nothing less. The writing I do is simply bridging the beamracing from the virtual world (offscreen emulator frame buffer) to the real screen. This allows the emulator to replicate original latencies as faithfully as possible, including subtle within-frame and within-scanline time-offsets of input read relative to generating the pixels on the original machine’s original video output.
(Plus a slight amount of additional latency to create the ‘jitter margin’ to soak up computer performance imperfections… but that doesn’t interfere, as the offscreen emulator buffer is still merrily at its original latency – the jittermargin is simply slight extra latency between emuraster and realraster to soak up computer performance issues, allowing continued VSYNC ON looking perfection in less-than-perfect performance conditions.)
If you are a programmer, I suggest you purchase the book to gain a better understanding of the concept of raster programming.
Be careful about applying modern terms such as “tearing” (really, tearing is a modern term more applicable to 3D framebuffers), “we assume” (incorrect: we actually know, so we don’t have to assume. Remember, I have programmed some of these old machines directly. Telling people who have done real rasterwork, whether it be Atari TIA, Amiga Copper, C64 raster interrupts, etc., that their actual work is “assumptions” can be slightly offensive to them when, in actuality, they understand the old platforms), and “framebuffers” (because some old machines had no framebuffers, or character buffers only).
The reason that this final 1% of realtimeness has not been easily bridged is that it required three things, and an expert simultaneously versed in all of them:
1. Understanding how the original software and original machines worked; and
2. Understanding the latency black box between “Present()-to-photons” (software to photons) in a full & proper temporal manner for all Hz & all VRR tech, including differences between pixels on the screen; and
3. The technology to catch up (it did about 8-10 years ago, but see pre-requisites (1) and (2) above).
People who simultaneously understand (1), (2) and (3) are few and far between. Such an individual would also need to know how to apply this to inventing various techniques to sync between old & new.
Likewise, a rocket scientist might not know archaeology, and vice-versa. Or in a more related field, a mathematician may not be able to create a new molecule, and vice-versa, though knowledge applied across boundaries may end up creating a new breakthrough. And sometimes come up with “E=mc^2” simplicity that others can understand.
Several of us (who are familiar with how framebufferless programming worked) are finding it’s a lot simpler than expected once we follow the “best practices” list several posts ago.
(18) Temporarily turn off debug output when programming/debugging real world beam racing. When running in debug mode, create your own built-in graphics console overlay, not a separate console window – don’t use debug console-writing to the IDE or a separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer, and try to buffer your debug-text-writing till your blanking interval, then display it as a block of text at top of screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen will cause far fewer glitches than trying to run a 2nd separate window of text. The IDE debug overheads & shell window overheads can cause massive beam-racing glitches if you try to output debug text – some debug output commands can cause a >16ms stall – I suspect that some IDEs are programmed in a garbage-collected language and sometimes the act of writing console output causes a garbage-collect event to occur. Or some other really nasty operating-system / IDE environment overheads. So if you’re running in debug mode while debugging raster glitches, then temporarily turn off the usual debug output mechanism, and output instead as a graphics-text overlay on your existing 3D framebuffer. Even if it means redundant re-drawing of a line of debugging text at the top edge of the screen every frame.
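A minimal sketch of the buffering pattern from item (18). Everything here is hypothetical: DebugOverlay, log and flush_in_blanking are names invented for the example, and draw_text stands in for whatever text-drawing call your renderer already has. The point is only that logging during the raster is a cheap list append, and all drawing is deferred to the blanking interval.

```python
class DebugOverlay:
    """Queues debug text during the frame and draws it in the blanking interval."""

    def __init__(self, max_lines=8):
        self.pending = []
        self.max_lines = max_lines

    def log(self, message):
        """Cheap: just remember the text. No console or IDE output happens here."""
        self.pending.append(message)

    def flush_in_blanking(self, draw_text):
        """Call once per refresh during the blanking interval.

        draw_text(column, row, text) is a stand-in for the renderer's own
        text call; only the newest max_lines messages are drawn.
        """
        lines = self.pending[-self.max_lines:]
        self.pending = []
        for row, text in enumerate(lines):
            draw_text(0, row, text)
        return len(lines)


# Usage with a fake renderer that records what would be drawn.
drawn = []
overlay = DebugOverlay(max_lines=2)
overlay.log("slice 3 late by 0.2 ms")
overlay.log("slice 7 late by 0.1 ms")
overlay.log("raster guess off by 4 lines")
count = overlay.flush_in_blanking(lambda col, row, text: drawn.append((row, text)))
print(count, drawn)
```

Capping the number of lines keeps the overlay's own draw cost bounded, so the debugging aid does not become a new source of raster glitches.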
BTW, I propose to open and contribute to a BountySource on executing this proposal and successfully enabling frameslice beamracing for at least one module.
Would anyone be interested in me opening a BountySource on this?