An input lag investigation

Twinaphex · 20 February 2020 21:20

There is not much, they basically explained to me once in #bizhawk on freenode but seems I don’t have the logs.

Basically waterbox saves contain the whole emulator state, not just the game, so it can be used even with emulators that don’t have such functionality.

Dwedit · 20 April 2018 17:25

Crazy idea time! Requires a MMU.

“Savestate” means to mark all read-write memory pages as Read Only
On a memory write, use the access violation exception handler to copy the memory page elsewhere, then mark the page as read-write, and let the write succeed.
“Loadstate” means to revert the memory pages back.

mdrejhon · 21 April 2018 18:16

Big, big lightbulb moment.

The “Run-Ahead” trick could be used to create a new Nintendo Zapper Gun (using a camera). We would successfully undo the camera latency back to the original Zapper latency.

Compliment on RunAhead, I never thought about that. Personally I prefer beam racing over RunAhead, as it is a purer / more original / faithful / fully natural elimination of input lag… and as Chief Blur Buster I’m somewhat of a purist there. But RunAhead is still an incredible innovation that can even partially compensate for display lag too (at the cost of more RunAhead artifacts – since it’s literally creating 60 “alternate-universes” per second).

It is possible to combine RunAhead with beam racing, to make it a (little bit) more purer (and to eliminate the RunAhead artifacts for bigger RunAhead margins). So RunAhead doesn’t replace beam racing nor vice-versa. They are mutually complementary for various purposes like compensating for display lag of slower displays.

Now, RunAhead innovation makes something else possible: A modified Nintendo Zapper that works on LCD, OLED, everything. You’d obviously have to modify the zapper with a low-lag low-rez 120fps camera like the PlayStation camera at the end of the barrel. And monitor only the center pixels of the Zapper aim for the bright white boxes during those Zapper flashes. But once that modification to a Zapper is made (Ardiuno or Raspberry Pi style project embedded inside a gutted Zapper), the camera latency can be successfully undone via RunAhead technique, and the Zapper-compatible signals inserted at the precise microsecond moment into the emulator, mid-raster, at original lagfree CRT timing.

Voilà. No onscreen cursor, no Wii-style sensor bar, original game code, no game modifications needed, successful original-latency Zapper signalling.

A (moddded) Nintendo Zapper working on an LCD, running unmodified Duck Hunt!

This was impossible to do without the RunAhead invention, but with RunAhead it’s possible to develop this now.

mdrejhon · 22 April 2018 01:02

No, it is not.

How so? I disagree.

They are mutually complementary. You’ll still get the same RunAhead artifacts you already get today, you’re simply doing repeat-beamracing jobs, that’s all. And the emulator raster behavior never changes (and even if realworld raster jitters differently on a second run) if there’s no input changes anyway. So potential RunAhead artifacts (of large RunAhead margins) would be identical with or without beam racing.

The big bonus is you can reduce “Number of Frames to Run Ahead” margin to compensate ONLY for laggier displays (displays unavoidably much lagger than a CRT). That reduces your RunAhead artifacts while getting the latency benefits of beam racing.

Or to use RunAhead to accomplish other magical things like future Nintendo Zapper compatibility for LCD/OLEDs/etc.

Also, fundamentally, you can just as easily beamrace 60fps on 75Hz simply by cherrypicking only 60 out of 75 refresh cycles and beamracing those specific refresh cycles. It’d still be stuttery like 60fps on a 75Hz fixed-Hz VSYNC ON display, just without the input lag. So, in theory, the beamraceable venn diagram can be quite huge depending on implementation.

The only “logic footprint” to the emulator point of view is simply synchronized surge-execution (and appropriate audio buffering which you already do) – e.g. executing at the same speed as display scanout of the specific individual refresh cycles you’ve chosen to beamrace. The other thread (RunAhead) don’t need to worry about beamracing the realworld display, just only the main display thread only. And it’s repeatable (unlimited mulligans) if you want to rewind & repeat on a future display refresh cycle. There’s an unlimited supply of subsequent refresh cycles, which you can selectively choose to realworld beamrace (or not).

Logically, the two is fully combineable (and can be enabled separately or simultaneously).

Thus, adding emulator beam racing is fully compatible with RunAhead (unless I’m missing something)?

Also, momentary beamrace missync (e.g. computer freezes, major HDD virus scanner) simply means brief momentary reappearance of tearlines that instantly disappear the next refresh cycle. Beam racing doesn’t have to be perfect every single refresh cycle, it’s not as if an emulator will crash-n-burn on a single instance of missync between emuraster + realraster. Momentary beamrace failures have less artifacts than RunAhead misprediction artifacts.

hunterk · 22 April 2018 01:02

I think the concern is that beamracing needs incremental updates while runahead is busy jumping back and forth multiple frames at a time. That is, if you’re flushing out incrementally to beam-race, aren’t you going to get fragments of whatever frame is getting rendered right at that moment? So, potentially 2 frames in the past, etc?

mdrejhon · 26 April 2018 14:19

Ummmm…

It doesn’t conceptually work that way.

Unbeknownst to some of us, the emulator modules in RetroArch are already doing incremental updates (in the emulator in the main thread and in the emulator in the RunAhead thread) because of the way emulators already internally rasterplot one scanline at a time to their internal respective frame buffers, before sending the full frame buffer down the software chain after it’s fully drawn.

Both the main and RunAhead threads are already doing incremental updates internally, if using the emulators that are emulating the raster scanouts internally already (i.e. most pre-3D-era machines, like 8-bit and 16-bit consoles, and arcade machines).

Adding beam racing to this workflow is simply adding incremental delivery to this. The whole frame is still delivered uninterrupted, just delivered in realtime. This can easily be done via a callback everytime the internal framebuffer is incrementally updated (as it is already today internally in each emulator module, without RetroArch knowing about it). You only rewind after the end of refresh cycles. RunAhead artifacts remain identical. Yet you can still reduce the RunAhead margin because beamracing saves you some lag, so you can back-off the number of RunAhead frames, and thus, get the same lag target with less RunAhead artifacts. Obviously, you’ll want to the RunAhead emulator in a separate thread, so that the original thread can beamrace at its own leisure. But other than that, there’s no fundamental incompatibility.

Also, at low-frameslice margins on modern GPUs, the performance demands of beam racing is much lower than RunAhead. It’s very scaleable. WinUAE supports 1-frameslice (it just acts similiar to inputdelay behavior), 2-frameslice (50% reduction in VSYNC ON lag), 3-frameslice (66% reduction in VSYNC ON lag), so you have no minimum granularity for properly implemented beam racing. In fact, WinUAE has removed inputdelay and replaced it with this, because 1-frameslice (full screen) beamracing behaves practically identically to inputdelay technique.

TL;DR: From a logical point of view, beam racing is made to become just another form of VSYNC ON. Just a lagless one. The rewind artifacts (of the same large rewind margins) would be identical with and without beam racing. We simply just don’t act on rewinds until the beamraced refresh cycle is complete (A.K.A. the emulator delivers its full frame). Thus, leaving existing RunAhead behaviour unmodified. We’re only giving the centrallized graphics early peeks at the emulator’s existing incremental updates, to allow the display to realtime display the rasters (at least at frameslice granularity).

A Possible Simplified Coding Architecture Path Towards Beam Racing:

(To be mutually agreed between multiple parties)

The great thing is most of RetroArch emulator modules are already doing incremental updates internally. The challenge is the current opaqueness (blackboxing) of this from the “deliver only full frames” architecture. This actually can stay mostly unmodified with a simple addendum in the form of an optional per-raster callback function for emulators that already does incremental updates internally today (aka most 8-bit and 16-bit emulators). We’re simply gaining optional access, that’s all.

I think that this can be simplified with standardizing a per-raster callback function technique and letting the central modules handle the complexity. We simply need to know resolution of emulator frame buffer, whether the existing emulator is already doing incremental updates internally, and add a per-raster callback function. Then thus, we’re simply hooking into this preexisting behaviour (and only for the display thread, not the RunAhead thread). Could be done with a per-raster callback function sending the exact same pointer to the exact same emulator framebuffer (that just has one scanline added). The centralized graphics can simply copy a frameslice every X scanlines, so most the complexity is centralized. The emulator existing framebuffer architecture wouldn’t need major mods, except the central graphics simply now get the opportunity to look at the incomplete emulator framebuffer every time a new scanline has been rasterplotted into it (existing emulator logic already done in many RetroArch modules today).

We’d just simply be having each emulator module call a callback function everytime a new row of pixels is plotted, and letting the core module handle the complexity (deciding when to frameslice by copying a rectangle from the existing emulator module framebuffer + and how much to nanosleep before returning from the callback function).

Yes, it requires a good grasp of architecture, and we don’t have to hurry. We can start simple. One emulator module at a time. We just need to define the appropriate community-agreed-upon new every-raster callback API and document it. There is already discovery on sizes (e.g. width, height of framebuffers). The core beamracing sync module can handle the rest. That keeps beam racing simpler by letting core graphics module decide how many scanlines until frameslicing occurs, and how much nanosleep to do (to keep emulator in sync with real raster).

And when you disable beamracing, you simply an immediate “return;” from the raster callback, existing behaviour continues. You can keep the every-frame-delivery mechanism untouched, regarldess of beamracing enable/disable – you’re simply adding a raster callback to existing emulator modules. This obviously, however, needs to be standardized though, so it has to be done by community agreement how to per-raster callbacks (to allow centrallized display code to optionally do beamracing if they want).

And let the open source community help out updating emulator modules to be properly faithful in beamracing (the great thing is most already are) – and when a beamraced game is not realworld beamracing compatible, that automatically means something is wrong with the faithfulness of the emulator module’s implementation, and gives us an opportunity to fix it for preservation purposes!

The RunAhead thread’s emulator is already doing incremental updates too! Unbeknownst to us. It’s just simply very blackboxed today into full frame deliveries at the moment. Also, we only need to act on the per-raster callbacks only from the emulator that has the actual displaying, and ignore the per-raster callbacks from the RunAhead thread.

For some emulator modules, hidden deep within them (programmed by a programmer outside of RetroArch) already has a raster function to copy a pixel row to an accumulator frame buffer internally. For this, we’d simply be adding a line of code to existing emulator module code to call an optional external callback function (and only if it’s not null).

For frameslice beamracing, the core graphics would usually do immediate returns from the callback function until a frameslice height’s worth of scanlines has been done, then it would act on those (present a frameslice to the display and nanosleep before returning from callback). The nanosleep is just exactly like inputdelay, except it’s nanosleeping mid-refresh-cycle to make sure the emulator rendering never gets too far ahead of the real world raster. (That’s why 1-frameslice beamracing behaves identically to inputdelay).

For frontbuffer beam racing, the row of pixels can be directly copied to the front buffer (for platforms that support front buffer rendering)

For turning off beam racing, simply “return;” immediately from the raster callback, and just act on the existing full-frame delivery mechanism (which merrily continues unmodified).

The raster callback function idea makes things extremely flexible and future-proof, while minimizing modification footprint to emulator modules.

So, it’s should be win-win-win, I’d hope. But yes, let’s keep it simple & slow – because while it is a simple concept to me, it isn’t a simple concept to many people…

In Summary, The Per-Raster Callback Function Idea:

Preserve most of the existing emulator architecture, with one major addition:
Add a per-raster callback from emulator modules to central graphics rendering
Just a few lines per emulator module (possibly as few as ~5 lines)
Most complexity stays centrallized into the centrallized beam racing module
This allows centrallized graphics module an early peek at the existing emulator’s existing framebuffer (that the emulator already, right now today, incrementally updates without us knowing about it!!!) and to nanosleep the correct amount before returning from the callback.
The core graphics module that handles all beamraced emulators, do not have to act every callback, but count the number of scanlines then act only at intervals (in the case of frameslice beamracing). Basically “return;” most of the time except when enough scanlines have occured to fill a new frameslice.
Callback hook is ignored (immediate return) when beamracing is disabled (and also for the no-display RunAhead emulator).
Add this support slowly, KISS. Work on only one emulator at a time.
Keeps most things unmodified, manageable.
Emulators that don’t yet have a raster callback, will continue to work without beamracing support until some open source guy adds the few lines of code to the emulator module
Keeps things cross platform.
Adds no platform-dependent code to emulator modules. …(except optionally, for the core centrallized beam racing module, such as optional #ifdefs to enable things like GSYNC beam racing or FreeSync beam racing on certain platforms, like WinUAE already can)

TL;DR: A callback “retro_set_raster_poll” which would be very similar to “retro_set_video_refresh” except it’s called every raster to allow any centrallized beamraced renderer to (A) peek into the incomplete emulator framebuffer (B) and pace to the beam by nanosleeping before returning from callback (either at raster granularity or frameslice granularity). Stubbed out when the emulator module don’t do rasters (e.g. N64, PS3, etc).

Dwedit · 22 April 2018 02:14

I still don’t see how runahead fits into beamracing.

Let’s take an example, Super Mario World on Snes9x main version, which gets 180FPS on my machine. That’s an average frame time of 5.55…ms.

If it runs for two frames, that is 11.11…ms, or 2/3 of a frame. It will have missed the boat at making the top of the screen for sure.

mdrejhon · 22 April 2018 02:20

That’s behind the scenes.

You don’t have to realworld beam-race the non-visible frames.

RunAhead can continue to run unthrottled.

You only realworld beam-race visible frames, not the RunAhead frames.

There’s no restriction to entering/exiting beamracing modes, or running two threads (real thread and RunAhead thread). One may have to slightly refactor code, if you coded it into a corner, but fundamentally the concepts are not incompatible.

Dwedit · 22 April 2018 02:40

So let me make sure I got this right…

Example, Super Mario World running ahead two frames, using a secondary core.

If we are at a synchronization point due to dirty input:

Main core runs one frame, with visuals disabled and audio enabled. Let’s say it takes about 2ms.
Save state, secondary core loads state. Let’s say that takes 1ms.
Secondary core runs one frame with visuals and audio disabled. Let’s say that takes 2ms again.
Now it’s ready to render a visible frame. Let’s suppose it takes 4ms.

Also supposing…

Suppose it takes 16.6666ms for the display device to render a frame and there is no such thing as vblank times
We reach time 5ms, or about 30% into the screen. If the image changed between now and the end of the frame, there would be nasty tearing.
We can start rendering a visible frame using a beam-racing update method, with screen tearing since we started late.

So it seems that this proposal is just to replace waiting for the entire frame to be available with screen tearing when using runahead?

mdrejhon · 22 April 2018 02:53

Not quite, so let’s see if we can drill down to the specific area of confusion.

Which makes it easy for RunAhead to work with beam racing without tearing artifacts.

Main core will certainly need to slow down a bit and scanout in real time.

By doing this, you avoid the VSYNC ON input lag (which means your “1ms” is actually often “16.7ms” or “33ms” – the driver and display is inserting buffering lag for you).

We’re simply siezing more control of scanout delays to our side, and bypassing. So we’re adding intentional lag in our software, but less lag than the graphics drivers already do for VSYNC ON.

Yes, yes, we’re still slaving our delay to the display scanout velocity, but at least that’s under our control instead of the graphics drivers (VSYNC ON) control.

That can continue as normal, unthrottled, at thje same time interval

That’s not an issue.

The graphics driver already waits for VSYNC. Instead, we intentionally do the waiting for VSYNC ourselves instead of the driver doing that shitty stuff for you. Beam racing gives us control instead.

We intentionally wait out the remainder of refresh cycle. Basically, finish beamracing the rest of the unmodified-input refresh cycle.

(Yes, it adds lag to wait, but less lag than the driver’s VSYNC ON).

Then we beamrace the NEXT refresh cycle, with the modified input signalling. But that is still less lag than than letting the driver do VSYNC ON for you.

No tearing happens. We intentionally manually wait ourselves for the next refresh cycle. But at least we’re manually waiting a (shorter time) than the driver’s own automatic VSYNC ON internal waiting.

Too much lag? Boohoo, just add 1 more frame of RunAhead. But you’ll get less input lag with smaller RunAhead values, when you combine beamracing + RunAhead, than using RunAhead with VSYNC ON.

NO. We are simply replacing the driver’s VSYNC ON waiting with our own (shorter) “remainder-of-refresh-cycle” waiting, so it’s less lag than a graphics drivers’ VSYNC ON handling.

Thus, for the same RunAhead frame count, the RunAhead artifacts are identical (backticking behavior – try playing RetroArch with a RunAhead of 5 frames – you’ll begin to see RunAhead artifacts). The RunAhead artifacts will be identical with beam racing.

There will be no tearing if we intentionally wait for the next refresh cycle. But at least the waiting is shorter than letting the graphics driver do the VSYNC ON waiting!

Thusly, we can decrease RunAhead, in order to get the same amount of effective total lag. And fewer RunAhead artifacts because of smaller RunAhead frames.

mdrejhon · 22 April 2018 02:58

And since we’re doing input reads on the fly in the realworld scanout (Just like inputdelay, but at the tearingless frameslice level – basically a subframe inputdelay) then we’re more likely to catch the controller inputs with less lag.

That’s why we can decrease RunAhead frames to get the same amount of realworld input lag. And thus, fewer RunAhead artifacts.

Yes, if we are beamracing we’re intentionally inputdelaying (only on the display thread), but it’s a subframe inputdelay, on a per-frameslice basis. Shortening the time between inputreads and realworld rasters. Meaning less reliant on large RunAhead margins, because we’ve intentionally slowed down the display thread (realworld display method) with a subframe inputdelay equivalent by slowing the display emulator to be in sync with the realworld raster.

With “RunAhead+Beamracing” you can then, thusly decrease RunAhead by about 1 or 2, and still get the same effective realworld input lag as “RunAhead+VSYNC ON”

mdrejhon · 23 April 2018 22:44

Another way to conceptualize beam racing is to fully understand “inputdelay”.

The existing inputdelay is almost tantamount to 1-frameslice (full screen height) beam racing.

You’re basically racing the VBI now

Now, just make your programmer’s human brain gray matter conceptually visualize the concept beam racing this way. Subdivide the height of the refresh cycle. 2-frameslice beam racing is simply the top and bottom half of refresh cycles. You’re inputdelaying 2 times a refresh cycle. Now imagine 10-frameslice beam racing.

That’s 10 tiny inputdelays 10 times a refresh cycle, tightening input reads to only 1.67ms (under 2ms) between emuraster and realraster. That’s 1/10th refresh cycle input lag!

And, let’s go back and study this jitter margin again.

Jitter margin

If you simply frameslice on top of your old refresh cycle, you actually have a much bigger jitter margin (a full refresh cycle worth minus 1 frameslice). So that makes 2-frameslice beam racing possible too.

1-frameslice beam racing is simply doing one Present() during a VBI.

2-frameslice beam racing is simply doing bottom-half Present() while display-thread of emulator is internally doing top-half. (And top-half Present() while display-thread of emulator is internally doing bottom-half). It’s already duplicate so no tearline appears.

10-frameslice beam racing is simply doing a Present() in any frameslice other than the current emulator scanout area. Since all of them are duplicates in their regions of the screen, there is no tearline.

Repeat after me, VSYNC OFF tearlines are just rasters.

…Although the jitter margin is surprisngly generous, I should add that ideally, you want to Present() as close as possible to the top boundary of the currently-rendering frameslice, for minimum input lag – smallest time interval between inputreads and the visible pixels hitting my eyes.

Beamraced frame slices are just simply merely basically inputdelay (which already works with RunAhead) except done at subframe leagues. Remember inputdelay doesn’t zero-out bottom of scanout. It still takes 1/60sec to scanout. So input read and actual display of bottom edge, can never be less than 1/60sec on a normal 60hz display.

Beam racing shatters the scanout-latency barrier. You can have inputdelay of only 1ms from bottom edge of screen with a beamraced frame slice. In addition to siezing control over delays (instead of letting drivers doing VSYNC ON delays and scanout delays), we do it ourselves. You get to skip VSYNC ON latency, and also gain really subframe latency for mid-screen input reads. This magic can allow you to slightly decrease the count (by 1 or 2) of RunAhead frames when doing RunAhead+BeamRace instead of RunAhead+VSYNC-ON. Even though we are adding delays, but only merely to put input reads closer to frame slices (aka rasters).

I measure input lag. Blur Busters invents lag tests. I have verified. In my YouTube video, all pixels on the entire screen surface is hitting my eyes 3 milliseconds after the Present() API call. Top/Center/Bottom lag is equallized just like it is for VSYNC OFF, bypassing the mandatory scanout latency.

Also, extra refresh cycles (e.g. in-between refresh cycles like 60fps beam racing on a 120Hz monitor) are simply left fallow (repeat-refreshes of whatever was already displayed), you only actively beamrace at 60 intervals per second. Cherrypicking 60 refresh cycles per second to beam-race (which, conceptually, is just simply realtime subframe inputdelay + VSYNC OFF frameslicing)

And sure, you can randomly beamrace only some refresh cycles and not beamrace other refresh cycles. It’s just simply stutter. (Just make sure you complete the beamracing for those specific beamraced refresh cycles).

mdrejhon · 23 April 2018 22:27

This is why WinUAE deleted inputdelay in the newest betas because 1-frameslice beamracing behaves pratically identically to existing inputdelay.

So far, in my beamracing experiments, and my source code, I have gotten beam racing working:

Stutterlessly on VRR
Stutterlessly at even multiples (60Hz, 120Hz, 180Hz, etc)
With stutters at any Hz (60fps@75Hz)
PC and Mac

VRR beamracing is simply using the first Present() to begin the scanout (the display waits for you to Present() before beginning to increment its realworld raster), and then Present() frameslices on the fly through the scanout. This does require the hybrid VRR + VSYNC OFF mode, but that’s available with both GSYNC and FreeSync.

Odd-Hz beamracing has the same stutters VSYNC ON. For example 60fps@75Hz since it’s cherrypicking 60 refresh cycles to beamrace, and leaving 15 refresh cycles ignored (simply repeat-refresh cycles). So that does mean an erratic VBI delay.

Frameslices Per Refresh Cycle

Many laptops: Between 5-20 frameslices
Intel 4000 series: About 6 frameslices
Old GeForces like GTX 650: About ~10-20 frameslices
Parallels virtual machine: 2 frame slices (with VSYNC OFF)
GeForce GTX 1080 Ti: 100-150 frameslices

I really hope GPU vendors re-enable front buffer rendering, then we can simply rasterplot scanlines directly to the front buffer, and it’ll also be much easier for HLSL shaders (simulated CRT electron gun beam), since it becomes more kind of an AddRaster() type of workflow. One of these days…

mdrejhon · 23 April 2018 22:37

Did more work on my Tearline Jedi demo (MonoGame C#). Will release it as Apache 2.0 source code in a few weeks on github. Works on both my PC (native DirectX) and Mac (native OpenGL).

That specific demo module is only 114 lines of C# (identical on PC and Mac, no #ifdefs!!) because I’ve hidden most of my beamracing complexity in my easy cross-platform beam racing framework.

No raster register, just precision time offests from VSYNC timestamps. Which is the magical recipie that made this vastly much more crossplatform. It’ll be a good educational learning sandbox for modern-era realtime VSYNC OFF beam racing.

Each row of text is hitting my eyeballs only 2-3ms after the spriteBatch.DrawString() call on my fastest gaming LCD monitor. API to photons in 2-3ms!

James-F · 22 April 2018 16:56

Nvidia is SO going to steal this from you (for PC use), mark my words.

mdrejhon · 23 April 2018 03:34

Yes, please. And AMD too. And HTC Vive and Oculus.

Then again, they are already doing it (to an extent). NVIDIA already enables beamracing workflows with their VRWorks kit for their frontbuffer-enable support.

I just wish that feature was an accessible industry-standard (ala OpenGL API) workflow to enable direct access to front buffer. Then we’d just simply rasterplot emulator rasters directly to front buffer, or do a modified shader/HLSL logic of a AddRaster() type workflow (To add one pre-fuzzy, shadowmasked scanline, at a time). This would be much higher performance and allow tighter beamrace margins than frameslicing via VSYNC OFF, with beam race margins such as only two or three scanlines between the emuraster + realraster.

Fortunately, all the groundwork we’re doing (VSYNC OFF frame slice beam racing for a tearingless & lagless VSYNC mode) and the concept of the raster callback technique I suggested earlier, is compatible with future front-buffer workflows.

Fortunately beamracing was a staple of the original platforms (1970s tech, like Atari 2600) so many patents related to beam racing have already expired, and there’s a lot of beam racing due diligence like raster interrupt knowledge from Amiga and Commodore programmers (like me). It would be monumentally stupid to re-patent all this existing tech, and some VR headset makers already use unpatented beam-racing stuff (e.g. left eye display while rendering the right eye, etc, piggybacking on left-to-right VR OLED/LCD scan to beamrace alternate eye rendering). Basically a 2-frameslice or 4-frameslice beamrace with 3D rendering. Even Android can do beam-raced virtual reality already with GearVR. Some of the algorithms are clever. Some major headsets already do at least 2-frameslice (alternate-eye beamracing) for virtual reality if you didn’t know! I have noticed some have adaptative beamrace-cancellation algorithms (like disqualifying beamrace-losing frames (postponing them to next refesh cycle) that took too long, and simply repeat-refreshing the previous frame instead). Extra techniques are added to make sure people don’t see frontbuffer-rendering artifacts on beamrace failures (toolate frames), or raster/tearing artifacts, so there are a bunch of failsafes to prevent beamrace-failure artifacts in virtual reality (due to slow frames and such). Anyway, different VR headset makers do things differently. It is all complex stuff if you’re not used to this, but various beamracing workflows are simple ABC stuff to my human mind.

Anyway, from an emulator-usefulness perspective – we’re resurrecting classic beam racing tech with a new “VSYNC OFF tearlines are just rasters” twist, and some best-practices principles that I’ve trailblazed (thanks to my understanding of displays/laptop/Mac/PC/VRR/FreeSync/GSYNC/Android displays – I understand how to beamrace all of these.). Somebody just needs to roll the world’s first crossplatform beam racing demo. Rasters (tearlines) were hiding in plain sight for 20 years, but not one single individual understood displays enough to roll-it-all-together into something sufficiently generic and crossplatform.

rafan · 23 April 2018 07:18

@mdrejhon Thanks for sharing your knowledge with us. It’s definitely an interesting topic.

I also have to say that you’re writing soo much that I’m already getting tired at seeing the lenghty posts. It seems you’re spitting out everything that comes to your mind, instead of thinking what is needed for the other to know and comprehend. Sorry about that, just my opinion.

Still I think it’s a fascinating topic and it would be a shame if the real message gets lost with all the noise. Could you do us (me) a great favour and contemplate on a short bullet pointed list, say a step-by-step guide, that a programmer would need to start on implementing beamracing together with runahead in Retroarch?

I would love to see @Dwedit to delve into this topic, but I’m wondering whether you’re getting the important message from a programming standpoint really across.

I’ll leave you with this quote from Goethe:

In der Beschränkung zeigt sich erst der Meister." (1802)

mdrejhon · 23 April 2018 22:40

As fascinating as beamracing is, I’ll pace myself, and I’ll try to centrallize some beamracing information (e.g. Beam Racing FAQs on blurbusters) so I don’t have to write big posts. (but thats also why I started Blur Busters - the passion)

I admit, I’ve been monopolizing this thread lately, so I’ll insert a breather here and wait for a few programmers to chime in. From now on, I’ll throttle myself to allow more than 50% scrollheight filled with other people’s posts! I’ll also await @Dwedit comments.

Bahn_Yuki · 24 April 2018 04:26

I personally enjoy all the information mdrejhon. Keep it up!

mdrejhon · 26 April 2018 16:36

Here’s an image that replaces my stupendously excessively long posts? (I do apologize)

4-frameslice beam raced lagless VSYNC example. We only beam-race visible frames.

And beam racing brings the newer input reads closer to display scanout (completely bypassing driver sync). The beam raced frameslicing brings you subframe inputdelay benefits. It can mean less reliance on larger RunAhead numbers, allowing potentially decreasing RunAhead processing requirements and RunAhead counts. Since we are now able to take over driver latencies to our side and control 100% of them.

EDIT (to clear confusion):

This is time-based graph, 16.7ms between VSYNC events, 60Hz example only
“Off screen” part is the one under our control.
“On screen” is pixels hitting the video cable (and photons hitting the human eyes on lagless monitors) according to Blur Busters input lag tests
With VSYNC OFF, the first row of pixels right under the tearlines are already being transmitted on the cable within microseconds of Present() – confirmed in oscilloscope tests. Yet we made tearlines invisible (because that screen region is duplicate before/after) with our techniques, so we’ve effectively created a new lagless VSYNC mode.
Four frame-slice beam racing is doing four Present() per refresh cycle spaced 1/4 of 1/60sec apart (~4ms apart) when beamracing a 60Hz display.
Frameslicing allows us to chop up the grey arrows to much shorter ones, accomodating mid-screen input reads.
Beginning of purple arrow (blunt end) is API call Present() or glutSwapBuffers()
End of purple arrow (pointy end) is the pixel being transmitted out of graphics output (aka beginning of monitor’s scan out on a lagless monitor like CRT).
Grey arrows is the maximum delay between emulator pixel render and pixels hitting eyes.
“Input delay” is meant same as “frame delay” (delaying input reads to reduce input lag)
Input lag savings still occur if you skip beamraced input reads in the specific RunAhead use case. Retroactive input read insertions aren’t necessarily part of this beam race (they may be worthless) but it’s only one of the many -benefits of beam racing. But there’s many reasons to beam race anyway. In that case, we are simply using beam racing to bypass driver delays with VSYNC ON, by guaranteeing Present()-to-Pixels latency, bypassing driver latencies. We can optionally use fresher input reads (during the beamraced scanout) to replace the inserted input reads, for the few rare games that have sub-frame response (by repeating input reads after the retroactive input read insertion) but most of the input lag savings is from bypassing driver unpredictabilities with VSYNC ON – to get more guaranteed “Present() to Pixels” latency since new pixels hits the video output practically immediatey after Present() during VSYNC OFF as confirmed by oscilloscope. First pixels underneath tearlines = immediate transmission out of graphics output
It can reduce CPU requirements of RunAhead by ~1 frame because the final frame doesn’t need to be surge-executed.
TL;DR: If you skip input reads as part of beamracing, the beamracing advantages don’t disappear, the beamracing still (A) bypass driver VSYNC latencies and (B) reduces CPU requirements a bit by eliminating need to surge-execute final frame.

(For simplicity, this is 60Hz and only 4-frameslices. Any non-60Hz examples is simply cherrypicking a realworld VBI closest to the 60Hz emulator intervals, and beamracing only that specific refresh cycle.)

Obviously, you can use beam racing alone (more purist), or RunAhead alone (many benefits), or both simultaneously to combine benefits.

Graphics drivers are so freaking stubborn and annoying, and not the fun part of an emulator project when they want to do things that you DONT want them to. (argh, argh – I’ve been there). Most of us assume & think there’s not that gap, but there is (almost always) from the BlurBusters input lag tests even with all the NVInspector tricks. And even if you successfully removed one VSYNC ON gap (e.g. Hard GPU Sync and/or other techniques), next frame response time is very hard to get. It is not guaranteed on all system configurations, you still don’t fully cancel out the benefits of beam racing because of sub-frame response ability (response to midscreen input reads are now accomodated by beamracing). You can see the length of the grey lag arrows! With beamracing+VSYNC OFF, we sieze control over all mid-scanout latencies.

Metaphorically, we’re pretending we’re creating our own VSYNC ON driver in our software. We never interrupt our own beam racing, we always “finish” a beam race. No tearing! That way, if we do that, there’s no difference in visual artifacts from VSYNC ON. Any new rollback’d inputs are handled for the next refresh cycle (in an overlapped operation in two threads. You can RunAhead a new series of offscreen frames in a secondary core while you are beamracing a visible frame).

This means we handle sync delays ourselves (handled via the proposed raster callback idea) instead of the driver doing it for us. Subframe inputdelaying. Frameslice beam racing is simply our own roll-our-own “VSYNC ON” driver (with minor twists like beamraced timing + realtime inputreading every frameslice, behaving as a subframe-latency inputdelay).

And, even as you also have done the rollback’d input insertion, you’re always continually updating the inputs, so the green frame(slices) gets freshest known inputs at the time. Mid-screen inputs too. Thus, can decrease RunAhead margin to get same effective latency, when combining beam racing + RunAhead. Or get even less RunAhead lag at the same RunAhead margin. Whatever users prefer. Or users can on a laggier display with the same RunAhead margin (your spreadsheet will remain valid as it has always been), and it will still feel like a less-laggy display thanks to combined RunAhead+beamracing. Whichever users prefer, keep it at spreadsheet recommendation and it will feel like adding 1 extra RunAhead frame beyond spreadsheet recommendations but without the backticking artifacts (when adding beamracing). Win-win-win?

If I am wrong (@Dwedit – am I?), and it’s still genuinely truly guaranteed impossible then, please sharpie the diagram above to illustrate why. If so, then I’ll eat my words – I need a honest teaching because if I am wrong, it means I totally screwed up & dropped the ball on a specific important detail – but let me know! You don’t have to implement beamracing this year or ever – I just want to clear up any confusions…

{My last post in this specific thread till a developer replies. Scout’s honor.}