I’m going to cover this from the opposite side first.
This is an easier argument for me to make, because there are fewer variables, which makes the latency math simpler.
I've got a 1000fps high speed camera. With a test program, I've successfully measured API-to-photons of just 3 milliseconds on my fastest LCD display. That is Present() to photons hitting the camera sensor. That includes LCD GtG. That includes DVI/DisplayPort latency. That includes monitor processing latency. I'm able to measure this for the top edge, center, and bottom edge of the screen.
I can likely get <1ms with a CRT and an older graphics card that has a direct, adapterless VGA output.
But let's simplify. We now have the absolute-best baseline, proven by my high speed camera: "realtime" API-to-photons across practically the entire raster extent.
I've also done brief tests that showed 4ms-5ms from mouse click to photons, for some extreme blank-screen VSYNC OFF tests. Now, we know that DisplayLag.com can show a display latency difference between top/center/bottom (e.g. 2ms, 9ms, 17ms). Obviously, for simplicity, most sites only report average latency (VBI to screen middle), which is often half a refresh cycle. That is why you never see numbers less than 8.3ms on sites such as DisplayLag.com for a 60Hz display (half of 1/60sec = 8.33ms). It's simply a stopwatch from VBI to raster.

With beamracing, the lag is vertically uniform (e.g. 2ms, 2ms, 2ms top/center/bottom) for Present()-to-photons during VSYNC OFF frameslice beamracing. (There are micro lag gradients within frameslices, caused by the granularity of the frameslice versus the one-pixel-row-at-a-time scanout of graphics output, but those can be filtered out via the tape loop metaphor for a consistent, unvarying subframe emulator-pixel-to-photons time, kept as a fixed screen-height difference between the emu raster and the real raster. All of this is easily centralizable inside the central code of the raster poll API; it's simply a busysleep on RDTSC or QueryPerformanceCounter.)
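To make that busysleep concrete, here is a minimal C++ sketch of the idea (my own illustration, not actual emulator code): it assumes the raster poll API already has a timestamp of the most recent VBI, plus the refresh period and vertical total, and it extrapolates the real raster's scanline from elapsed time.

```cpp
#include <windows.h>

// Seconds from QueryPerformanceCounter, precise enough for sub-ms busysleeps.
static double qpcSeconds()
{
    static LARGE_INTEGER freq = [] {
        LARGE_INTEGER f;
        QueryPerformanceFrequency(&f);
        return f;
    }();
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart / (double)freq.QuadPart;
}

// Busysleep until the real raster is estimated to reach targetScanline,
// by extrapolating elapsed time since the last vertical blank.
static void busySleepUntilScanline(double lastVBlankTime, // timestamp of last VBI (assumed available)
                                   double refreshPeriod,  // e.g. 1.0 / 60.0
                                   int verticalTotal,     // total scanlines per refresh
                                   int targetScanline)
{
    double target = lastVBlankTime +
                    refreshPeriod * (double)targetScanline / (double)verticalTotal;
    while (qpcSeconds() < target)
        ; // spin; OS sleep timers are far too coarse for frameslice margins
}
```

The spin loop is the whole trick: ordinary Sleep() granularity is a millisecond or worse, while a frameslice at 2400 slices/sec is only ~0.4ms wide.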
Briefly going offtopic, but currently the least-laggy LCDs via DisplayPort/DVI tend to be ~3ms for API-to-photons if we're focusing on minimum lag (top-edge lag, or beamraced pixel-for-pixel lag). Subtract about 8.3ms from the number you see on DisplayLag.com and that's the beamraced input lag attainable. Certainly there are less laggy displays and more laggy displays, but some LCDs are almost as fast as CRTs in response (e.g. just digital transceiver lag and GtG lag, plus a few scanlines of buffered micropacket lag). Although we're not worried about panel scanout, the fact is some of them have realtime synchronous cable-to-panel scanout abilities. (Also see www.blurbusters.com/lightboost/video for an older high speed video of how an LCD scans out, and how some blur-reduction strobe backlights work (LightBoost, ULMB).) In non-strobed operation, it's essentially a fast-moving GtG fade zone chasing behind the currently-being-refreshed pixel rows, refreshed practically on the fly directly from the cable (with only line-buffer processing for overdrive, unlike old LCDs that often full-framebuffered the refresh cycle first).

So all we're worried about is increases in latency over this absolute-best baseline.
My graphics card can do up to 8000 frameslices per second (Kefrens Bars demo).
Mouse poll at 1000Hz adds an average of 0.5ms of latency (the midpoint of the 0ms…1ms latency range). There are some 2000Hz mice, and overclocked 8000Hz mouse experimentation is being done, so it's theoretically possible to go lower: USB poll latency of 0.125ms has been successfully achieved with mouse overclocking.
Emulator frameslice granularity at 2400 frameslices per second (40 frameslices per refresh cycle, HLSL filters disabled, GTX 1080 Ti extreme case) with a 1.5-frameslice average beamrace margin works out to 1.5/2400 sec ≈ 0.6ms of latency.
So 0.5ms of mouse poll latency plus 0.6ms of beamrace margin latency = 1.1ms of lag from input read to pixels transmitting on the wire. Obviously, it could be even less given my computer's performance. But this is already incredibly small.
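As a back-of-envelope sketch (illustrative helpers of my own, not emulator code), that budget is just two terms: half the mouse poll interval, plus the beamrace margin divided by the frameslice rate.

```cpp
#include <cstdio>

// Average input-poll latency: midpoint of the 0..1 poll-interval range.
static double avgPollLatencyMs(double pollHz)
{
    return 1000.0 * 0.5 / pollHz;
}

// Beamrace margin latency: margin (in frameslices) over frameslice rate.
static double beamraceLatencyMs(double slicesPerSec, double sliceMargin)
{
    return 1000.0 * sliceMargin / slicesPerSec;
}

int main()
{
    double total = avgPollLatencyMs(1000.0)         // 1000Hz mouse: 0.5ms
                 + beamraceLatencyMs(2400.0, 1.5);  // 2400 slices/sec, 1.5-slice margin: 0.625ms
    printf("input-read-to-wire: %.3f ms\n", total); // prints ~1.125 ms
    return 0;
}
```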
While doable on i7s with powerful GPUs, that is overkill for many. A lot of people are happy with 10-frameslice beamracing (600 frameslices per second). Two frameslices of lag equals 2/600 sec = 1/300 sec = 3.333ms for the common 10-frameslice WinUAE setting (excluding the 0.5ms from the 1000Hz USB input poll, obviously).
It continues to scale down to more lenient margins, like 4-frameslice or 6-frameslice beamracing on slower platforms (e.g. Raspberry Pi, Android, etc). There's more lag for those, but it is still subframe lag, unlike any other possible non-RunAhead lag-reduction approach. 4 frameslices with a 1.5-slice beamrace margin is still only ~6ms of lag, which is incredibly low for a Raspberry Pi, and I suspect 10 frameslices are doable on newer mobile GPUs. At 600 frameslices per second, that gets really close to exactly reproducing faithful-original latencies.
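Here is the same arithmetic run across the configurations discussed above (a sketch assuming a 60Hz refresh throughout; slice counts and margins as given in the text):

```cpp
#include <cstdio>

int main()
{
    // Frameslices per refresh and beamrace margins from the examples above.
    struct { const char *platform; double slicesPerRefresh; double margin; } configs[] = {
        { "GTX 1080 Ti extreme case", 40, 1.5 }, // ~0.6ms
        { "common WinUAE setting",    10, 2.0 }, // ~3.3ms
        { "Raspberry Pi class",        4, 1.5 }, // ~6.3ms
    };
    for (const auto &c : configs) {
        double slicesPerSec = c.slicesPerRefresh * 60.0; // 60Hz refresh assumed
        double lagMs = 1000.0 * c.margin / slicesPerSec;
        printf("%-25s %4.0f slices/sec, %.1f-slice margin: %.2f ms\n",
               c.platform, slicesPerSec, c.margin, lagMs);
    }
    return 0;
}
```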