An input lag investigation

Don’t forget buffer backpressure mechanics and the frame queue concepts (two semi-independent extra lag factors). I explained that – depending on circumstances – VSYNC ON can frequently lead to more than 1 frame of input lag. Not all VSYNC ON implementations are identical across applications/drivers/programming techniques – in many graphics drivers it is often a shallow frame queue (because that produces less stutter than a traditional double-buffer technique). There are indeed ways to force it to behave like a double buffer, but when running fully flat-out you can end up with 2 or more frames of lag unless you use special techniques.

In addition, consider varying-lag distortion effects. I expanded the above post to cover how it’s not always an exact “X frames of lag” – one input read may receive 33ms of lag and the next input read may receive 16ms, because of lag granularity caused by the varying timing of the original machine’s input read and how it interacts with the boundaries of the surge-execution intervals of various lag-reducing techniques – i.e. how the emulator lag-distorts the input read (or not). Most retro games read input at a consistent time, but that timing can jitter on the original machine, and it might jitter across the surge-execution intervals, creating a lag-granularity effect that did not exist on the original machine. What this can potentially mean is that button-mashing may feel erratic in the emulator yet consistent on the original – for a specific retro game (one with input reads that vary within the refresh cycle). There are pros/cons to all the various input-lag-reduction methods.

Good emulators with optimization may have only 1 frame of input lag (next-frame response), but not all of them. Yet none (except beamraced frameslicing) can do same-frame response, e.g. a midscreen input read changing screen content at the bottom of the same display refresh cycle (no need to wait until the next refresh). And none can guarantee a fixed subframe latency offset for input reads (emulators that try for sub-frame latencies often subject them to refresh-cycle rounding-off effects, caused by surge-execution distortions). Beamraced emulation (render and scanout on the fly) is better able to guarantee consistent lag for all possible input read timings throughout any part of any emulator refresh cycle, relative to the real-world refresh cycle.

Obviously, RunAhead is superior for many things, while beamracing is another good tool for faithful replication of original machine lag while keeping system requirements low (at low frameslice rates).

Full answer will require a multi-page reply to explain things. If you want, we can move to the Area51 forum of the Blur Busters Forums to discuss this part further.

I’ll give a partial answer to help conceptually.

Remember, cable scanout is sometimes totally different from panel scanout. There’s no concept such as “clearing the screen” on the monitor side, so forget about it – the emulator can’t do anything about it. (In reality, an impulse display will automatically clear and a sample-and-hold LCD will hold until the next refresh cycle – but the considerations are exactly identical regardless of whether you connect an original machine or an emulator to it, so that discussion is irrelevant here.) Let’s not guess at display-side mechanics for now; we’re only comparing emulator-vs-original connected to the SAME display, whether it be the same CRT or the same LCD.

We just want the cable to behave identically where possible. (Internal built-in displays also serialize in a cable-like way – phones, tablets, and laptops scan out sequentially too.)

Focus on the cable-scan POV and ignore the display-scan POV. So let’s focus on cable scan-out – the GPU’s act of reading one pixel row at a time from its front buffer into the output at exact horizontal scanrate intervals.

Also, on the GPU side, Best Practice #9 recommends against clearing the front buffer between emulator refresh cycles, in order to keep the jitter margin huge (wraparound style).

If you’re an old-timer, another metaphor (which may make frameslice beamracing easier to understand) is an old reel-to-reel video tape that runs through a record head and a playback head simultaneously.

The Tape Delay Loop Metaphor Might Help

Technically, nothing stops an engineer from putting two heads side by side feeding a tape through both – to record and then playback simultaneously – that’s what an old “analog tape delay loop” is – a record head and a playback head running simultaneously on a tape loop.

Metaphorically, the tape delay loop represents one refresh cycle in our situation. In our beamracing case, the metaphorical “record head” is the delivery of new scanlines (even if it’s surged frameslicefuls at a time) to the front buffer, ahead of the “playback head”, the one-scanline-at-a-time readout of the front buffer into the graphics output (at exact horizontal scanrate intervals).

The front buffer isn’t onscreen instantly; it’s still being read out one pixel row at a time into the graphics output at an exact constant rate (horizontal scanrate), so you can always keep changing the undelivered portions of the front buffer (including undelivered portions of a frameslice), ad infinitum, as long as your real raster (the pixel-row readout to output) stays behind the emu raster (new frame buffer data being put into the front buffer one way or another). This is a great way to understand why we have a full loop of a wraparound jitter margin (a full refresh cycle minus one frameslice’s worth).

Decreasing input lag means putting the playback head as close as possible to the record head. That’s tightening the metaphorical beam race margin.

  • The jitter margin is the tape between the playback head and the record head.
  • The race margin is the tape between the record head and the playback head.

So, a new looped safety jitter margin of one full refresh cycle minus one frameslice.
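If it helps to see that in code form, here is a minimal sketch (hypothetical names and example values, not any emulator’s actual code) of the real-raster estimate and the “verboten region” check the central beamracer would perform:

```c
/* Minimal sketch of the wraparound jitter-margin check (hypothetical names/values).
 * "Playback head" = real raster: the scanline the GPU is currently transmitting
 * out of the front buffer.  "Record head" = emu raster: the latest scanline the
 * emulator has plotted into the front buffer so far. */
#include <stdbool.h>

#define SCANLINES_PER_FRAME 1125   /* 1080p vertical total incl. blanking (example) */
#define SLICES_PER_FRAME    10
#define SLICE_HEIGHT        (SCANLINES_PER_FRAME / SLICES_PER_FRAME)

/* Real raster position, estimated from time elapsed since the last VSYNC timestamp. */
static int real_raster(double secs_since_vsync, double refresh_hz)
{
    double frac = secs_since_vsync * refresh_hz;   /* fraction of a refresh elapsed */
    frac -= (long)frac;                            /* wrap into 0..1 */
    return (int)(frac * SCANLINES_PER_FRAME);
}

/* It is safe to write a frameslice into the front buffer anywhere EXCEPT the slice
 * currently being scanned out of it (the one-frameslice "verboten region").
 * Everything else – racing ahead, or wrapping around behind – shows no tearing,
 * because the untouched portion still holds the previous refresh's pixels
 * (Best Practice #9: never clear the front buffer between emulator refresh cycles). */
static bool slice_update_is_safe(int slice_start_line, int real_raster_line)
{
    int gap = (real_raster_line - slice_start_line + SCANLINES_PER_FRAME) % SCANLINES_PER_FRAME;
    return gap >= SLICE_HEIGHT;
}
```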

The entire tape loop represents one refresh cycle, looping around. So for 1080p, you can have a >900-scanline jitter margin with zero tearing, if you use the wraparound-refresh-cycle technique as described above in step 9 of Best Practices. Ideally you want to race with tight latency, though. If you use a “2-frameslice beam race margin”, that means with 10 frameslices per refresh cycle you have a 1-frameslice verboten region (tearing risk), an 8-frameslice race-too-fast safety margin, and a 1-frameslice race-too-slow safety margin – before tearing appears. That’s 15ms of random beam race error you can get with zero tearing!!

In our case, metaphorically, frameslice beam racing is simply the record head surging batches of multiple scanlines onto the metaphorical tape loop (e.g. a movable record head that intermittently records faster than the playback head). The playback head’s playback speed is totally, merrily unchanged!! (i.e. the pixel-row readout from front buffer to the GPU’s output jack). As long as the record head never falls behind and collides with the playback head (aka a tearing artifact) – thankfully this is just a metaphor, and tearlines won’t wreck the metaphorical tape mechanicals and tape loop permanently (ha!) – beam racing can recover during the next refresh cycle (aka only a 1-refresh-cycle appearance of the tearing artifact). Metaphorically, front buffer rendering (adding one scanline at a time) means the record head doesn’t have to surge ahead (it can record at the same velocity as the playback head).

You can adjust the race margin to somewhere far enough back that your margin is never breached. That is the metaphorical equivalent of the distance between the tape record head (adding new emu lines to the front buffer) and the tape playback head (the GPU output jack beginning transmission of 1 pixel row at a time).

That’s why it’s so forgiving when properly programmed, and thus feasible on 8-year-old GPUs, Android GPUs, and Raspberry Pi GPUs, especially at lower frameslice counts on lower-resolution framebuffers (which emulator framebuffers often are). The techniques we’ve found are innovative enough that we were surprised they hadn’t been used before now – it’s conceptually hard to grasp until the “Aha!” moment (e.g. via the user-friendly Blur Busters diagrams, etc.).

I can conceptualize this visually in a totally different way if you weren’t born in the era of the analog tape loop, but this should help (in a way) to show that we’ve successfully achieved a 900+ scanline safety jitter margin for 1080p beam racing, even with wraparound (e.g. Present()ing the bottom half while we’re already scanning the top half, and Present()ing the top half while we’re already scanning the bottom half – neither situation has any tearing, because of the way we’ve cleverly done this, with Best Practice #9 two posts ago…) – making it super-forgiving and much more usable on slower-performing systems. Smartphone GPUs can easily do 240 duplicate frames a second; it’s only extra memory bandwidth to append new frameslices, anyway.

So that is how a 900+ scanline fully-looped across-refresh-cycle wraparound jitter margin is achieved with 1080p frameslice beamracing. At 60Hz, this means up to a ~15ms range of beam race synchronization error before tearing appears! This helps soak up performance imperfections very well during transient beamrace out-of-sync moments, e.g. background software. And the beamrace margin can also be a configurable value, as a tradeoff between latency and tearline appearance during duress situations.

Modern systems can easily do submillisecond race margins flawlessly, while Android/Pi might need a 4ms race margin – still subframe latency!

Yes, in the extreme case frameslices can become one pixel row with no jitter margin (like how my Kefrens Bars demo turns a GeForce 1080 into a lowly Atari TIA with raster-realtime big pixels at nearly 10,000 tearlines per second), but emulators prefer the jittermargin technique, which hides tearing by simply keeping graphics unchanged at the real raster and keeping the emu raster ahead of the real raster (like the tape delay loop metaphor explained above).

This is my partial answer. I have to go back to work, but hopefully this helps you understand better…

Thank you! Very helpful! I admit that some of the terminology is fairly new for me, so I spent a good amount of time searching and found a glossary you posted here that was very useful in getting in sync with some of this stuff.

I also liked the tape loop metaphor that you used - rather appropriate, since my background is primarily in audio digital signal processing (with some occasional image processing thrown in). Video processing is a little different, but so far seems straightforward enough, given the way I’m used to looking at things. Many of the terms you use, and some of the other terms I see thrown around in this discussion, are things that I generally recognize from general DSP jargon. Some others are quite new entirely, so I hope I can quickly get to understanding those terms correctly.

With respect to cable scanout vs panel scanout, after looking at your glossary, I think you’re right that panel scanout is irrelevant to the discussion and cable scanout is what really matters here. Likewise, I’m also less interested in things like USB poll interval lag (and variance). The user will be able to supply their own TV and controller, which will hopefully work well enough. What I’m most interested in is getting my head wrapped around the middle, where there are huge chunks of latency that could, hopefully, be reduced using a method like this. Beyond that, the user can tweak their TV/controller if need be, independently from this - play around with USB poll intervals, or lower the TV’s resolution if need be, or just generally find settings that work.

I am really quite surprised to see that VSync can cause multiple frames of input lag. I thought that VSync synchronizes the monitor playback rate with the GPU framerate, so that there is no jitter and hence no tearing. What I don’t get is, when people use the “Frame advance” feature, where they push a button on the controller and then manually advance frames only to see it register 2-4 frames later – is VSync somehow doing that? Would beamracing be able to help lower latency even in that scenario?

I think I’ll start there for now. Your posts are very detailed and I rather appreciate that - I’ll probably need to read a few times before I get it all. For now I’m most interested in understanding the basic signal path and the major components driving latency which come from things like video processing, rather than input polling and such.

No, you can’t measure the vsync lag with that method.
What it shows you is the game’s inherent reaction lag and the way input is polled by the emulator.

Vsync adds lag on top of that. You can reduce it with methods such as “Hard GPU Sync” in RA or “Maximum Pre-Rendered Frames” in the Nvidia Panel.

You’re also missing what can be the worst source of lag: the screen’s inherent lag before it starts displaying an image.


The input lag chain is a very complex topic.

To simplify, let’s focus only on the path from emulator through graphics output.

Traditionally, an unoptimized emulator on unoptimized drivers will:

(Also consider whether the emulator does preemptive real input reads before rendering the emulator frame, or does input reads in realtime while doing the render – simulated raster scanout.)

  1. Render the emulator frame (varies, up to 1/60sec of lag, depending on how intensive it is and whether the frame is surge-executed in a jiffy to deliver it sooner. The input read can be early in the emulator frame versus late in the emulator frame).
  2. Deliver it to the graphics card (varies, up to 1 frame of lag) – Present() blocks until there is room in the frame queue. That’s buffer backpressure lag!
  3. Pass through any frame queues used in the graphics card (varies: 0, 1, or 2 frames of lag). The graphics driver delivers any prerendered frames in sequence as consecutive individual refresh cycles.

You can make it efficient and tight (1 frame of lag) but in reality it can be awful. Some very old Blur Busters input lag tests of Battlefield 4 showed over 60ms of input lag even on a display with less than 10ms of input lag, and even when a frame rendered in only 15ms. Conversely, CS:GO reliably achieved approximately 20ms. Those tests were done almost five years ago. There are huge variances between software even for VSYNC ON, OFF, and GSYNC – software plays a role in how much lag it adds and how it treats the sync workflow (that’s why you hear of various tricks such as “input delaying” to move the input read closer to output).

Usually maximum prerendered frames is 1, and that is necessary for compatibility with lots of things such as SLI which must multiplex frames from multiple cards into the same frame queue. It also massively improves frame pacing. A few years ago, there was a controversy with the disappearance of the “0” setting in NVInspector for Max Prerendered Frames.

You can use tricks to reduce a lot of lag in this lag chain, but VSYNC ON lag can vary humongously between different apps; it’s simply the time interval from the input read (which necessarily occurs before rendering in most 3D games) to the pixel hitting the output jack (the point-A-to-B we are limiting scope to for simplicity).
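As one concrete illustration of such a trick – roughly in the spirit of RetroArch’s “Hard GPU Sync” option, but sketched here with generic OpenGL/GLFW calls rather than anyone’s actual code – you can force a CPU/GPU sync right after presenting, so the driver never builds up a deep prerendered-frame queue:

```c
/* Hedged sketch: collapse the frame queue by syncing the CPU to the GPU after the swap.
 * Trades a little throughput headroom for lower, more consistent VSYNC ON input lag. */
#include <GLFW/glfw3.h>

void present_with_hard_sync(GLFWwindow *window)
{
    glfwSwapBuffers(window);  /* VSYNC ON swap (assumes glfwSwapInterval(1) at init) */
    glFinish();               /* block until the GPU has fully consumed the frame,
                                 so the next frame's input read happens as late –
                                 and as close to the display – as possible */
}
```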

Beam raced frameslicing does input read, render AND output in essentially realtime – just like the original machine did. Faithfully. With it, there can be a mere 1 millisecond between an input read and the reacted pixels hitting the graphics output. For any game that does continual input reads mid-scanout, the photons can actually hit your eyes in subframe time – like a mid-screen input read for bottom-of-screen pinball flippers.



You still need to change a lot in every single emulator so it renders the slices and even polls more often than once per refresh – and is that even accurate? I’m pretty sure it’s not for every single case.

I’ll respond more to @mdrejhon in a sec, but as a quick response to the above, how much of this would be with the individual cores and how much would be within the libretro API?

@mdrejhon: thanks for that. As a rough pass I think I get the idea of why VSYNC can sometimes lead to multi-frame delay. I do think I’m continuing to get bogged down a little bit w/ the terminology here though.

Right now, the way I think of input timing latency is as follows:

  1. The input is a delta function, and we are trying to figure out the total latency (or group delay) in the “impulse response.”
  2. The signal path consists of a string of delay lines, one after the other, each of which adds some time delay to the signal.
  3. The total time delay is the sum of the time delays of each component.
  4. Rather than all of the delays being set in stone, the delay of each component is a random variable according to some probability distribution. We know the range of values each can take, the probability of each, and the mean and variance. For instance, a 100 Hz USB poll interval is a uniform distribution on 0-10ms, which has a mean of 5ms and a stdev of ~3ms.
  5. The components are “approximately independent” of one another, at least given reasonable running conditions. Meaning the USB poll interval position does not correlate with, for instance, the refresh interval position, or whatever. Both are equally random, or if there is any correlation, it’s negligible.
  6. Because of #5, the expected value of the total latency is the sum of the expected value of the latency of each component.

#5 is probably where the gray area is. Some components might correlate with one another… sometimes… only under certain fundamental conditions… and it’s hard to tell where. That seems to be the basic problem.

If we have two components that tend to correlate significantly, so that a better latency on one suggests a higher or lower probability of a better latency on the other, then we can chunk them into a single component. Ultimately we can always arrive at some chunks of components that do not correlate with one another in any significant way, given at least reasonably normal working conditions.
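To sanity-check that bookkeeping, here is a tiny sketch (illustrative component values only, nothing measured) that models each stage as an independent random delay and confirms that the expected total is just the sum of the per-stage means:

```c
/* Sketch: total latency as a sum of independent random delays (illustrative values only). */
#include <stdio.h>
#include <stdlib.h>

static double uniform_ms(double lo, double hi)
{
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

int main(void)
{
    const int trials = 100000;
    double sum = 0.0;
    for (int i = 0; i < trials; i++) {
        double usb_poll   = uniform_ms(0.0, 10.0);  /* 100 Hz USB poll: uniform 0-10ms */
        double frame_wait = uniform_ms(0.0, 16.7);  /* waiting for the next 60 Hz refresh */
        double scanout    = uniform_ms(0.0, 16.7);  /* vertical position of the pixel in the scanout */
        sum += usb_poll + frame_wait + scanout;
    }
    /* Expected total ~ 5 + 8.3 + 8.3 ms if the components really are independent. */
    printf("mean total latency: %.2f ms\n", sum / trials);
    return 0;
}
```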

Let’s just start with that, if that makes sense.

Actually, for at least one module, it’s easier than you think.

The WinUAE author told me it was a quick modification to add basic 60 Hz support.

This is because for “raster accurate” emulation modules (e.g. Nintendo and Super Nintendo emulation):

  • The emulator module is already plotting one scan line at a time into an offscreen buffer. It’s already happening with the NES module.
  • The emulator module is already (usually) doing real time input reads while plotting scan lines. It’s already happening with the NES module.
  • The new raster poll API simply lets the centralized beamracer “peek” at the module’s ALREADY EXISTING offscreen framebuffer, and grab a frameslice from it.

The centralized code will do the peeking, and the centralized code will do the grabbing of the frameslice itself. The raster poll is simply giving the central code opportunities to do early-peeks of the emulator’s existing offscreen framebuffer, every time a new emulator scan line is written to it.
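To make that concrete, here is a hedged sketch of what the hook could look like – the names below are hypothetical placeholders, not the actual proposed libretro symbols:

```c
/* Hedged sketch only – hypothetical names, not the actual proposed libretro API. */

#define FRAMESLICE_HEIGHT 24   /* e.g. a 240-line emulator frame divided into 10 slices */

/* Assumed to already exist in the module / central code (hypothetical): */
extern void plot_scanline_to_offscreen_buffer(int line);        /* module-side, already exists */
extern int  real_raster_is_safely_behind(int emu_line);         /* central jitter-margin check */
extern void present_frameslice(int first_line, int last_line);  /* central VSYNC OFF Present() */
extern void retro_raster_poll(int emu_line);                    /* the proposed new hook */

/* The only module-side change: call the poll after each plotted scanline. */
void emu_render_scanline(int line)
{
    plot_scanline_to_offscreen_buffer(line);  /* the module was already doing this */
    retro_raster_poll(line);                  /* new: lets central code early-peek the buffer */
}

/* Central code's implementation of the poll: grab a slice at each slice boundary.
 * (The final slice of the frame would be handled at end-of-frame; omitted for brevity.) */
void retro_raster_poll(int emu_line)
{
    if (emu_line == 0 || emu_line % FRAMESLICE_HEIGHT != 0)
        return;                               /* not at a frameslice boundary yet */

    if (real_raster_is_safely_behind(emu_line))
        present_frameslice(emu_line - FRAMESLICE_HEIGHT, emu_line - 1);
}
```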

For at least one of the easiest Retroarch modules, it looks like only a 10 line modification.

All the complexity is centralized (probably ~1000 lines of code, 3-4 days of programming work). Please re-read my proposal – that’s where the bulk of the RetroArch work lies.

More difficult cores will take a lot more time, but once the libretro core is made beamrace compatible, beamrace support can be added one module at a time. And from what I’ve seen, the easiest module will only need a simple hook (~10 lines) to turn it into a successful beamracer.

Emulator authors – over the last two decades – have done an amazing job refining realtime beamracing on the emulator offscreen buffer already. So it’s not much work left to glue the remaining step. So to the authors of the “easy” modules, thank you so much for making it so easy to beamrace for real!

It’s the last piece of the puzzle that most emulator programmers do not understand – the “black box” between Present() and photons – but people like me do. That 1% is complex to understand, and this is why I am writing big posts to explain the 1% needed to finish the “full beamrace chain”.

And, even if it’s easier than expected with some modules (NES)…

…It will also be more difficult than expected with other modules (who knows which ones). It depends on how much of the beamracing chain they’ve already completed.

The fact is that both extremes exist.

The beauty is that once it’s implemented in libretro, it can be implemented one module at a time, one by one, beginning with the easiest modules – taking our merry time.

Once the easy module is done, it gives everyone the “aha” moment, and makes some people understand frameslice beam racing much better. (The remaining 1% step needed to finally pull an emulator’s existing internal beamracing out to the real world display).

For some modules, >99% of the beamracing work is already done. 20 years of beamracing development has done that already, but never beyond the Present() API.


The major complexity will be making libretro compatible. If there’s a lot of layering (e.g. lack of a VSYNC OFF mode, and a lot of black-box layers), it has to be refactored somewhat. Basically, VSYNC OFF support needs to be added to libretro in order for frameslice beamracing to work. It might or might not be a royal headache.

But on the NES module side (at least), the changes are quite minor for that particular module, since it already does internal beamraced input reads and internal beamraced line-plots into its internal framebuffer.

For the time split between “the core, and the easiest module” – I guesstimate over 95% of programming time will be focused on the centralized code, and 5% of the time spent on the easiest module. Once done, the bridges can be crossed for the remainder of the modules.

The hardest module might need lots of code – and/or rewriting – to be compatible, but the easiest modules will essentially only need 10 lines of modifications.

The already cycle-exact and raster-exact modules will obviously be the easiest, especially if they’re simply (as the NES module is) already raster-plotting one line at a time internally into an internal frame buffer. Those types will be easy to adapt to frameslice beamracing.

The emulator modules don’t even need to know what the heck a frameslice is, if one re-reads my proposal.

In my proposal, all the emulator module is doing is letting the core code (the centralized raster poll code) do early peeks at the existing offscreen beamraced buffer that most 8-bit and 16-bit emulators already maintain, in order to be compatible with retro-era raster interrupts.

I’m going to cover this from the opposite side first.

This is an easier argument for me to make, because there are fewer variables. It makes the latency math simpler.

I’ve got a 1000fps high speed camera. With a test program, I’ve successfully gotten API-to-photons in just 3 milliseconds on my fastest LCD display. That is Present() to photons hitting the camera sensor. That includes LCD GtG. That includes DVI/DisplayPort latency. That includes monitor processing latency. I’m able to get this for the top edge, center, and bottom edge of the screen.

I can likely get <1ms with a CRT and an older graphics card with a direct adaptorless VGA output.

But let’s simplify. We now know the absolute-best baseline from my high speed camera, and it is proven “realtime” API-to-photons, to all practical raster extents.

I’ve also done brief tests that showed 4-5ms from mouse click to photons for some extreme blank-screen VSYNC OFF tests. Now, we know from DisplayLag.com that a display can have a latency difference between top/center/bottom (e.g. 2ms, 9ms, 17ms). Obviously, for simplicity, most sites only report average latency (VBI to screen middle), which is often half a refresh cycle – which is why you never see numbers less than 8.3ms on sites such as DisplayLag.com for a 60Hz display (1/2 of 1/60sec = 8.33ms). That’s simply a stopwatch from VBI to raster. With beamracing, the lag is vertically uniform (e.g. 2ms, 2ms, 2ms top/center/bottom) for Present()-to-photons during VSYNC OFF frameslice beamracing. (There are micro lag gradients within frameslices – caused by the granularity of the frameslice versus the one-pixel-row-at-a-time scanout of the graphics output – but that can be filtered out via the tape loop metaphor for a consistent, unvarying subframe emulator-pixel-to-photons time, as a fixed screen-height difference between emu raster and real raster – all easily centralizable inside the central code of the raster poll API; it’s simply a busysleep on RDTSC or QueryPerformanceCounter.)
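For illustration, that busysleep piece is tiny – a minimal sketch (Windows API shown; other platforms would use their own high-resolution clock):

```c
/* Minimal sketch of the precision busy-sleep this relies on. */
#include <windows.h>

/* Spin until the given QueryPerformanceCounter target tick is reached. */
static void busysleep_until(LONGLONG target_ticks)
{
    LARGE_INTEGER now;
    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart < target_ticks);
}

/* Example: wait until exactly 2.0ms after a previously captured VSYNC timestamp. */
static void wait_2ms_after(LONGLONG vsync_ticks)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);                           /* ticks per second */
    busysleep_until(vsync_ticks + (freq.QuadPart * 2) / 1000);  /* +2ms in ticks */
}
```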

Briefly going offtopic, but currently the least-laggy LCDs via DisplayPort/DVI tend to be ~3ms for API-to-photons if we're focussing on minimum lag (top-of-screen lag, or beamraced pixel-for-pixel lag) – subtract about 8.3ms from the number you see on DisplayLag.com and that's your attainable beamraced input lag. Certainly there are less laggy displays and more laggy displays, but some LCDs are almost as fast as CRTs in response (e.g. just digital transceiver lag & GtG lag, with a few scanlines of buffered micropacket lag). Although we're not worried about panel scanout, the fact is some of them have realtime synchronous cable-to-panel scanout abilities (also see www.blurbusters.com/lightboost/video for an older example high speed video of how an LCD scans out – and how some blur-reduction strobe backlights work (LightBoost, ULMB)). In non-strobed operation, it's essentially a fast-moving GtG fade zone chasing behind the currently-being-refreshed pixel rows, refreshed practically on the fly directly from the cable (with only line-buffer processing for overdrive – unlike old LCDs that often fully framebuffered the refresh cycle first).

So all we’re worried about is increases to latency to this absolute-best baseline.

My graphics card can do up to 8000 frameslices per second (Kefrens Bars demo).

A 1000Hz mouse poll adds an average of 0.5ms latency (the midpoint average of 0ms…1ms latency). There are some 2000Hz mice and overclocked 8000Hz mouse experiments being done, so it’s theoretically possible to get lower – USB latency of 0.125ms has been successfully achieved with mouse overclocking.

Emulator frameslice granularity at 2400 frameslices per second (40 frameslices per refresh cycle, HLSL filters disabled, GTX 1080 Ti extreme case) with a 1.5 frameslice average beamrace margin = 1.5/2400sec latency = 0.6ms latency.

So 0.5ms of mouse poll latency plus 0.6ms of beamrace margin latency = 1.1ms of lag from input read to pixel transmitting on the wire. It could obviously be even less, given my computer’s performance, but this is already incredibly small.

While doable on i7’s with powerful GPUs, it is overkill for many. A lot of people are happy with 10-frameslice beamracing (600 frameslices per second). A 2-frameslice lag equals 2/600sec, or 1/300sec, or 3.333ms for the common 10-frameslice WinUAE setting (I’m excluding the 0.5ms from the 1000Hz USB input poll, obviously).

It continues to scale down to more lenient margins, like 4-frameslice or 6-frameslice beamracing on slower platforms (e.g. Pi, Android, etc). There’s more lag for that, but still subframe lag compared to any other possible non-RunAhead lag reduction approaches. 4 frameslices with a 1.5-slice beamrace margin is still only ~6ms lag – incredibly low for a Raspberry Pi, and I suspect 10 frameslices are doable on the newer mobile GPUs. At 600 frameslices per second, that gets really close to exactly reproducing faithful-original latencies.
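That latency arithmetic generalizes to a one-liner (sketch only; it ignores the separate input-poll latency term):

```c
/* The beamrace-margin latency arithmetic above, generalized. */
double beamrace_margin_latency_ms(int slices_per_refresh, double margin_slices, double refresh_hz)
{
    double slices_per_second = slices_per_refresh * refresh_hz;
    return 1000.0 * margin_slices / slices_per_second;
}
/* e.g. (40, 1.5, 60) -> ~0.6ms    (10, 2.0, 60) -> ~3.3ms    (4, 1.5, 60) -> ~6.3ms */
```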

@mdrejhon: Very cool. I can kind of see how the beam racing thing covers a lot of the different sources of latency inherent in the signal.

I’m still confused on the basic phenomenon of how this doesn’t lead to tearing. Suppose, for instance, you’re playing some game, and you have half of the frame rendered, and it just so happens that half of the character sprite is rendered. Then, midway through the frame, the user pushes some input button, causing the character to move. So input is processed mid-frame.

Do we process the input here, so that the next half of the frame is rendered in accordance with the character moving? And so on, even if they push another button? Basically, do we process multiple sequential inputs within the same frame, even if it’s already been rendered?

If you do, then the thing is, the sprite and position have now changed. If you totally change gears and start rendering the next frame, you would get, I think, a tearline, or at least a visual mess.

On the other hand, if you do just continue to process the current frame the same way as before, then at least under the hood, the input poll routine can get a bit of a head start on things for the next frame. So then your multiple combinations of buttons are partly processed during the current frame, but all get lumped together in the next frame.

It seems something like that, as I think this through. Partly the issue might be that there’s some subtlety in the way Present() works that I don’t quite get.

That’s the thing…

There’s no such thing as “half a frame rendered”.

Many emulators – for NES, SNES, Commodore 64, Apple, several 80s/90s-era MAME modules – render only one pixel row at a time (line-exact emulators), or even a single pixel at a time (cycle-exact emulators).

Sure, the original 8-bit software may have drawn half a framebuffer, but the emulator is already simulating a virtual equivalent of a CRT electron gun! (Some of them at single-pixel, cycle-exact granularity; others at scanline-at-a-time granularity.)

So the emulator is already serializing one pixel row at a time. Plotting them to the existing offscreen buffer.

For an NTSC CRT, that is equivalent to roughly 15,734 scanlines per second (horizontal scan rate ≈ 15.7 kHz), so the offscreen framebuffer is getting a new pixel row about every 1/15734th of a second. Some emulators execute synchronously, using a busyloop where needed to scanline-pace it – other emulators will surge-execute 1/60th sec worth of emulation (faster than the original machine) to deliver a full framebufferful in the traditional PC-based “full frame buffer at a time” workflow. Regardless, the input reads in the original game code vary from game to game: sometimes the input reads are always at the beginning of a refresh cycle, or the end of a refresh cycle, and some games do input reads mid-screen – it really varies from game to game. But regardless, whatever the original game did, it gets preserved when synchronizing the emu raster with the real raster.

That’s because they have to preserve original beam-race behaviours like raster interrupts. That’s why those particular modules tend to be very beamrace-friendly to the real-world.

Because they’re already doing that, 99% of the work is already essentially done.

The large amount of writing I do in this thread is about the remaining 1% of the work: synchronizing the emulator raster (line-at-a-time) to the real-world raster – which is something most people don’t realize is now possible. But it is indeed a somewhat complex-to-grasp 1% that requires a good understanding of the way things used to be done originally.
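To make the “one scanline at a time” point concrete, here is a hedged sketch of what a line-exact emulation loop roughly looks like (hypothetical helper names, not any specific emulator’s code) – note how the input-read timing is whatever the emulated game’s own code does, preserved for free:

```c
/* Hedged sketch of a line-exact emulation loop (hypothetical helper names). */
#define EMU_SCANLINES 262   /* approx. NTSC total scanlines per frame */

extern void emulate_cpu_for_one_scanline(void);        /* may hit the joystick register mid-line */
extern void plot_scanline_to_offscreen_buffer(int line);
extern void retro_raster_poll(int line);               /* the hook from the earlier sketch */

void emulate_one_refresh_cycle(void)
{
    for (int line = 0; line < EMU_SCANLINES; line++) {
        emulate_cpu_for_one_scanline();           /* input reads happen wherever the game put them */
        plot_scanline_to_offscreen_buffer(line);  /* the already-existing offscreen plot */
        retro_raster_poll(line);                  /* central code may grab a frameslice here */
    }
}
```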

Does the above help explain why there is no tearing?

@mdrejhon: ok, yes, it does. Sorry for the delay in responding, got busy for a bit here…

I now understand why there is no tearing. So the original game was in charge of protecting against that while doing the beamracing. So, we assume there’s something like that going on.

Doing beamracing means we get as close to the original game as possible. Since the original game is doing some kind of beamrace mid-frame tearing protection, the game itself will add a tiny bit of lag between when you press a button and when it appears on the TV, because it needs to not mess up the current frame.

If we then run the above game in an emulator, then not only do we have the original game’s input latency, but we now also have the additional VSync latency, where things are delayed by yet another frame (or two). So it all adds up to increased cumulative latency.

It seems like there are a few case splits here in the way that original games deal with input latency - Yoshi’s Island seems to be particularly slow, for instance, whereas other games might be a lot faster. But, there are lots of ways to split the latency chain up into modules, and going with beamracing affects all equally, so that’s why this is good to implement. Right?

This is a weird way to put it, but I’m trying to make sure I’m balancing the latency “checkbook” correctly. Does the above make sense?

“beamrace mid-frame tearing protection” should be phrased as “beamrace by design”.

Remember… Atari 2600 had no framebuffer memory at all. Zero, none, nada, zilch! They had to buffer a single scanline at a time – generate new graphics in realtime, every scanline.

You see, the only way to do graphics on an Atari 2600 in the 1970s-1980s was beamracing out of necessity.

So the Atari 2600 essentially beamraced several thousand simple “linebuffers” a second. Doing that on a puny 1 MHz CPU is no less than a miraculous programming feat. Raster feats continued for a couple of decades afterwards, at least for other graphics special effects like scroll zones or sprite multiplication.

Later on, even when games gained framebuffers, some special effects (e.g. 16 sprites instead of 8, or a stationary scorebar below a scrolling zone) also required beamracing out of necessity. Basically they intentionally injected a dividing line between two different framebuffers (if you must use the word “tearline” terminology, yes, that’s essentially an ancestor to a modern tearline).

Emulators had to preserve whatever beamracing antics that the original machines did.

Amidst all of this, while not all machines used lagless input mechanisms, some were essentially sub-scanline lagless – input reads of a joystick controller port had virtually no latency. It was typically just a mechanical joystick, where mechanical switches completed circuits directly on separate pins of a 9-pin joystick port. Up/Down/Left/Right/Button only required 5 wires plus shared ground; extra wires could be used for things like extra buttons, etc. The moment a joystick button is fully pressed, the circuit is completed right there and then, which directly changes the bits of a byte at a single in-memory address.

The joystick port is often read by a register-read instruction or a PEEK command. In 6502 machine language, the instruction could be “LDA $DC00” or “LDA $DC01” (the Commodore 64 joystick registers), which is essentially the assembly-language / machine-language equivalent of the BASIC command “LET J = PEEK(56320)” – hex DC00 equals decimal 56320. This fast joystick-peeking instruction, which executes in microseconds, may actually be embedded within raster-realtime generated graphics – or might be a few scanlines before – or might be at the beginning of the refresh cycle (blanking interval), or a few refresh cycles earlier (e.g. framebuffered workflows – but remember: a lot of this was the era before framebuffers). So latencies are sometimes microseconds between joystick and photons (CRTs can illuminate a ‘pixel’ in essentially microseconds). Or they can be several milliseconds (one or two refresh cycles) if the original code reads during a different part of the refresh cycle or in the blanking interval, or even many refresh cycles (e.g. early crude 3D flight simulators, like the 1982 Microsoft Flight Simulator running at only 2-3 frames per second with blocky line-drawing graphics). Framebufferless workflows also continued to be used on character-buffered platforms (grids of pre-defined graphics used as building blocks) to do things like add extra colors per row or other effects.

Regardless, how an original platform did input reads varies hugely, but nothing stopped them from doing raster-realtime “input-to-photons” behavior if input reads were done mid-raster. But this is getting offtopic: emulators already (to best effort) preserve this originalness at least to the offscreen framebuffer, so we usually don’t have to do any more work on the input-reading side (even though input-read granularity rounds off at polls, e.g. 1ms for 1000Hz polls). Beamraced frameslicing is only a modification to the missing 1% of the emulator “rendering” workflow (subframe raster sync between emu-raster and real-raster) necessary to replicate the original latencies to a much higher accuracy than has ever been achieved before. Regardless, the word “tearing” is a non-sequitur for 8-bit programming techniques that did not use frame buffers.

For a good newbie’s guide to this, see Wired Magazine’s https://www.wired.com/2009/03/racing-the-beam/ … I highly recommend that “Racing The Beam” book from Amazon. This will help prepare you for a better understanding of rasterwork.

The two parts of your sentence are (mostly) unrelated to each other – it turns out to be a non-sequitur. We’re simply preserving the original input-read phasing (whether it was only 1 microsecond before pixel output – or 1 second before the frame). There can technically be essentially zero lag, if the input read is made raster-realtime mid-scanline (e.g. Atari 2600), reading the joystick controller register while generating pixels.

It is simply a function of the original game programming, nothing more, nothing less. The writings I do is simply bridging the beamracing from the virtual world (offscreen emulator frame buffer) to the real screen. This allows the emulator to replicate original latencies as faithfully as possible, including subtle within-frame and within-scanline time-offsets of input read relative to generating the pixels on the original machine’s original video output.

(Plus a slight amount of additional latency to create the ‘jitter margin’ to soak up computer performance imperfections… but that doesn’t interfere, as the offscreen emulator buffer is still merrily at its original latency – the jitter margin is simply slight extra latency between emu raster and real raster to soak up computer performance issues, allowing continued VSYNC ON-looking perfection in less-than-perfect performance conditions.)

If you are a programmer, I suggest you purchase the book to gain a better understanding of the concept of raster programming.

Be careful applying modern terms such as “tearing” (really, tearing is a modern term more applicable to 3D framebuffers), “we assume” (incorrect: we actually know, so we don’t have to assume – remember, I have programmed some of these old machines directly, and telling people who have done real rasterwork – whether Atari TIA, Amiga Copper, C64 raster interrupts, etc. – that their actual work is an “assumption” can be slightly offensive when, in actuality, they understand the old platforms), and “framebuffers” (because some old machines had no framebuffers, or character buffers only).

The reason this final 1% of realtimeness has not been easily bridged is that it required an expert simultaneously versed in three things:

  1. Understanding how the original software and original machines worked; and
  2. Understanding the latency black box between Present() and photons (software to photons) in a full & proper temporal manner for all Hz & all VRR tech, including differences between pixels on the screen; and
  3. The technology catching up (it did about 8-10 years ago, but see prerequisites (1) and (2) above).

People who simultaneously understand (1), (2), and (3) are few and far between. Such an individual would also need to know how to apply this to inventing various techniques to sync between old & new.

Likewise, a rocket scientist might not know archaeology, and vice-versa. Or in a more related field, a mathematician may not be able to create a new molecule, and vice-versa, though knowledge can end up applied across boundaries to create a new breakthrough – and sometimes yield an “E=mc^2”-like simplicity that others can understand.

Several of us (who are familiar with how framebufferless programming worked) are finding it’s a lot simpler than expected once we follow the “best practices” list several posts ago.

Added a new Best-Practice number 18 to the LibRetro Proposal

(18) Temporarily turn off debug output when programming/debugging real-world beam racing. When running in debug mode, create your own built-in graphics console overlay, not a separate console window – don’t use debug console-writing to the IDE or a separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer, try to buffer your debug-text writing until your blanking interval, and then display it as a block of text at the top of the screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen will cause far fewer glitches than trying to run a 2nd separate window of text – the IDE debug overheads & shell window overheads can cause massive beam-racing glitches if you output debug text there. Some debug output commands can cause a >16ms stall – I suspect some IDEs are programmed in a garbage-collected language and sometimes the act of writing console output triggers a garbage-collect event, or some other really nasty operating-system / IDE environment overhead. So if you’re running in debug mode while debugging raster glitches, temporarily turn off the usual debug output mechanism and output instead to a graphics-text overlay on your existing 3D framebuffer – even if it means redundantly re-drawing a line of debugging text at the top edge of the screen every frame.
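For illustration, a minimal sketch of that buffering idea (the overlay-drawing helper is hypothetical, standing in for whatever your 3D API text routine is):

```c
/* Sketch: queue debug text during the racing portion of the refresh,
 * and only draw it (as an on-screen overlay) during the blanking interval. */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

extern void draw_text_overlay_top_of_screen(const char *text);  /* hypothetical: draws via the 3D API */

static char debug_buf[4096];

/* Call this instead of printf() while beam racing – it only appends to memory. */
void debug_log(const char *fmt, ...)
{
    char line[256];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);
    if (strlen(debug_buf) + strlen(line) + 2 < sizeof debug_buf) {
        strcat(debug_buf, line);
        strcat(debug_buf, "\n");
    }
}

/* Call once per refresh cycle, during the blanking interval. */
void debug_flush_to_overlay(void)
{
    draw_text_overlay_top_of_screen(debug_buf);
    debug_buf[0] = '\0';
}
```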

BTW, I propose to open and contribute to a BountySource on executing this proposal and successfully enabling frameslice beamracing to at least one module.

Would anyone be interested in me opening a BountySource on this?

as i understand it beam racing/chasing is basically a very clever and more “the right way” solution / replacement for what frame delay currently provides in retroarch. its main purpose would be to cut down on that last frame’s worth of latency that’s inherent when pushing full frames with vsync on the host via standard apis (opengl, vulkan, drm, etc). and when i say last frame i mean compared to methods like using hard sync/fences in opengl, which can achieve approximately just 1 frame additional latency over original hardware

however, wouldn’t it be a lot of work for relatively little payoff? the libretro api, retroarch and its shader pipeline, and all the libretro modules of emulators are all currently based on full frames. not only that, but the core logic in many of these emulators is too. (well, to varying degrees. the point being most emulators would need their core logic tweaked as well as the libretro implementation)

it’s also not compatible with run ahead, which, while not a silver bullet, provides at least as good if not better results where supported. (run ahead requires that the emulator has reasonably efficient, fully complete, and side-effect-free serialization support, which not all emulators do or even can provide. and it’s also game dependent) I get that there are pros and cons to each approach, but run ahead at least made a lot of sense for retroarch since it was mostly built upon features that already existed.

personally i think mednafen would be a much better testing ground for this than retroarch. for one thing it would most likely be easier and less disruptive to implement there, and for another, doing so would lay the groundwork for getting all the libretro mednafen cores ready to support this if and when beam chasing ever gets added to retroarch + the libretro api. (note: i would advise against requesting/pestering the mednafen author to implement this)

edit: fixed some terminology

edit 2: nevermind, don’t bother responding to this post, mdrejhon has already addressed pretty much everything i’ve brought up at some point or another in one of his many posts here, there were just sooo many about this i didn’t catch all of it at first. anyway, i still stand by my statement that mednafen is more suited for this

I’m sure there are a lot of people like me lurking in this thread who are very interested in seeing beamracing more widely available (for example by adding these features to the libretro API :heart_eyes_cat:). Now that things are getting closer to working code it does seem like a good time to start up a bounty.

I’ll create a github issue on this & then create a BountySource

How about I start with a $100 bounty, if someone(s) else can pledge to throw in a matching total of $100?

= $200 starter bounty for libretro raster poll API addition + 1 emulator module made compatible (e.g. NES or other)

I’ll match dollar for dollar for anything less.

Negotiable: Can be willing to match more funds beyond $100, this is just a starting point.

Thanks for the tip about mednafen. However, RetroArch as far as I know, is much more well known (including variants like RetroPie, etc).

Yes, addressed all concerns:

  • The practical payoff is actually at least slightly bigger than you think. Beamracing makes it workable on Android/Pi devices too underpowered for RunAhead. It also saves more lag than a VRR display does (yet beamracing is also compatible with VRR mode). Slow emulators on slow platforms will often take 1/60sec to render, then 1/60sec of VSYNC ON buffer delay – two frames of lag on slow platforms (VSYNC ON backpressure latency, unless you do tricks to achieve next-refresh-cycle latency reliably and consistently). It also allows performance-intensive cycle-exact emulators to reduce latency, streaming pixels realtime to the display like the original machine did, without all the attendant pre-delays. It’s also more preservationist-friendly latency-wise, and much more faithful to the original machine’s latency behavior.

  • Frameslice beamracing is still compatible with RunAhead; I posted some diagrams. Although it is not as lag-saving as I thought, it can still reduce the CPU requirements of RunAhead by roughly 1 frame’s worth per 1/60sec, since the last frame no longer needs to be surge-executed, and the visible frame (the last frame) can simply be realtime-streamed to the display at original emulator speed (aka beamracing).

  • Despite the “concept” complexity, it takes surprisingly little code; the difficulty is understanding the black box between Present() and photons on a per-pixel screen-scanout basis – nearly undocumented anywhere else except at Blur Busters. But the 18-point Best Practices list (that all of us have built up – myself, Calamity, Toni, etc.) will save a lot of grief.


Added GitHub Issue:

EDIT: Added $120 Cash to BountySource:

I’ll dollar-match your donations

I will dollar-match all your future donations (between now and the end of September 2018). I’ll donate another $1 every time anyone donates $1, until the BountySource hits the $360 level.

EDIT: That was FAST! The BountySource is now at $1050 and my share maxed out at $360
