An input lag investigation

Nvidia is SO going to steal this from you (for PC use), mark my words.

Yes, please. And AMD too. And HTC Vive and Oculus.

Then again, they are already doing it (to an extent). NVIDIA already enables beamracing workflows with their VRWorks kit for their frontbuffer-enable support.

VRWorks

I just wish that feature was an accessible industry-standard (à la OpenGL API) workflow to enable direct access to the front buffer. Then we’d simply rasterplot emulator rasters directly to the front buffer, or use modified shader/HLSL logic in an AddRaster()-type workflow (adding one pre-fuzzy, shadowmasked scanline at a time). This would be much higher performance and allow tighter beamrace margins than frameslicing via VSYNC OFF, with beam race margins as small as two or three scanlines between the emuraster and the realraster.

Fortunately, all the groundwork we’re doing (VSYNC OFF frame slice beam racing for a tearingless & lagless VSYNC mode) and the concept of the raster callback technique I suggested earlier, is compatible with future front-buffer workflows.

Fortunately, beamracing was a staple of the original platforms (1970s tech, like the Atari 2600), so many patents related to beam racing have already expired, and there’s a lot of beam racing due diligence, like raster interrupt knowledge from Amiga and Commodore programmers (like me). It would be monumentally stupid to re-patent all this existing tech, and some VR headset makers already use unpatented beam-racing techniques (e.g. displaying the left eye while rendering the right eye, piggybacking on the left-to-right VR OLED/LCD scan to beamrace alternate-eye rendering). Basically a 2-frameslice or 4-frameslice beamrace with 3D rendering. Even Android can already do beam-raced virtual reality with GearVR. Some of the algorithms are clever. Some major headsets already do at least 2-frameslice (alternate-eye) beamracing for virtual reality, if you didn’t know! I have noticed that some have adaptive beamrace-cancellation algorithms (such as disqualifying beamrace-losing frames that took too long, postponing them to the next refresh cycle and simply repeat-refreshing the previous frame instead). Extra techniques are added to make sure people don’t see frontbuffer-rendering artifacts on beamrace failures (too-late frames) or raster/tearing artifacts, so there are a bunch of failsafes to prevent beamrace-failure artifacts in virtual reality (due to slow frames and such). Anyway, different VR headset makers do things differently. It is all complex stuff if you’re not used to it, but the various beamracing workflows are simple ABC stuff to my mind.

Anyway, from an emulator-usefulness perspective – we’re resurrecting classic beam racing tech with a new “VSYNC OFF tearlines are just rasters” twist, and some best-practices principles that I’ve trailblazed (thanks to my understanding of displays/laptop/Mac/PC/VRR/FreeSync/GSYNC/Android displays – I understand how to beamrace all of these.). Somebody just needs to roll the world’s first crossplatform beam racing demo. Rasters (tearlines) were hiding in plain sight for 20 years, but not one single individual understood displays enough to roll-it-all-together into something sufficiently generic and crossplatform.

@mdrejhon Thanks for sharing your knowledge with us. It’s definitely an interesting topic.

I also have to say that you’re writing so much that I’m already getting tired of seeing the lengthy posts. It seems you’re spitting out everything that comes to your mind, instead of thinking about what the other person needs to know and comprehend. Sorry about that, just my opinion.

Still I think it’s a fascinating topic and it would be a shame if the real message gets lost with all the noise. Could you do us (me) a great favour and contemplate on a short bullet pointed list, say a step-by-step guide, that a programmer would need to start on implementing beamracing together with runahead in Retroarch?

I would love to see @Dwedit delve into this topic, but I’m wondering whether you’re really getting the important message across from a programming standpoint.

I’ll leave you with this quote from Goethe:

“In der Beschränkung zeigt sich erst der Meister.” (1802) Roughly: “It is in limitation that the master first shows himself.”

As fascinating as beamracing is, I’ll pace myself, and I’ll try to centralize some beamracing information (e.g. Beam Racing FAQs on blurbusters) so I don’t have to write big posts. (But that’s also why I started Blur Busters: the passion.)

I admit, I’ve been monopolizing this thread lately, so I’ll insert a breather here and wait for a few programmers to chime in. From now on, I’ll throttle myself to allow more than 50% scrollheight filled with other people’s posts! I’ll also await @Dwedit comments.

I personally enjoy all the information mdrejhon. Keep it up!

Here’s an image that replaces my stupendously excessively long posts? (I do apologize) :zipper_mouth_face:

4-frameslice beam raced lagless VSYNC example. We only beam-race visible frames.

And beam racing brings the newer input reads closer to display scanout (completely bypassing driver sync). The beam raced frameslicing brings you subframe inputdelay benefits. It can mean less reliance on larger RunAhead numbers, potentially decreasing RunAhead processing requirements and RunAhead counts, since we are now able to take the driver latencies over to our side and control 100% of them.

Beam Raced RunAhead

EDIT (to clear confusion):

  • This is time-based graph, 16.7ms between VSYNC events, 60Hz example only
  • “Off screen” part is the one under our control.
  • “On screen” is pixels hitting the video cable (and photons hitting the human eyes on lagless monitors) according to Blur Busters input lag tests
  • With VSYNC OFF, the first row of pixels right under the tearlines are already being transmitted on the cable within microseconds of Present() – confirmed in oscilloscope tests. Yet we made tearlines invisible (because that screen region is duplicate before/after) with our techniques, so we’ve effectively created a new lagless VSYNC mode.
  • Four frame-slice beam racing is doing four Present() per refresh cycle spaced 1/4 of 1/60sec apart (~4ms apart) when beamracing a 60Hz display.
  • Frameslicing allows us to chop up the grey arrows into much shorter ones, accommodating mid-screen input reads.
  • Beginning of purple arrow (blunt end) is API call Present() or glutSwapBuffers()
  • End of purple arrow (pointy end) is the pixel being transmitted out of graphics output (aka beginning of monitor’s scan out on a lagless monitor like CRT).
  • Grey arrows are the maximum delay between emulator pixel render and pixels hitting eyes.
  • “Input delay” here means the same as “frame delay” (delaying input reads to reduce input lag)
  • Input lag savings still occur if you skip beamraced input reads in the specific RunAhead use case. Retroactive input read insertions aren’t necessarily part of this beam race (they may be worthless), but it’s only one of the many benefits of beam racing, and there are many other reasons to beam race anyway. In that case, we are simply using beam racing to bypass the driver delays of VSYNC ON, by guaranteeing Present()-to-Pixels latency. We can optionally use fresher input reads (during the beamraced scanout) to replace the inserted input reads, for the few rare games that have sub-frame response (by repeating input reads after the retroactive input read insertion), but most of the input lag savings come from bypassing driver unpredictabilities with VSYNC ON – to get a more guaranteed “Present() to Pixels” latency, since new pixels hit the video output practically immediately after Present() during VSYNC OFF, as confirmed by oscilloscope. First pixels underneath tearlines = immediate transmission out of the graphics output.
  • It can reduce CPU requirements of RunAhead by ~1 frame because the final frame doesn’t need to be surge-executed.
  • TL;DR: If you skip input reads as part of beamracing, the beamracing advantages don’t disappear: beamracing still (A) bypasses driver VSYNC latencies and (B) reduces CPU requirements a bit by eliminating the need to surge-execute the final frame.

(For simplicity, this is 60Hz and only 4 frameslices. Any non-60Hz example is simply cherrypicking the realworld VBI closest to the 60Hz emulator intervals, and beamracing only that specific refresh cycle.)
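
To make the 4-frameslice timing in the bullets above concrete, here is a minimal sketch of the arithmetic only. The function name `frameslice_schedule` is made up for illustration, and it assumes the slices are spread evenly over the whole refresh period (ignoring the vertical blanking interval), which a real beamracer would have to account for.

```python
def frameslice_schedule(refresh_hz=60.0, slices=4):
    """Return (present_offset_ms, scanout_fraction) for each frameslice.

    Assumption: the slices are spread evenly over the whole refresh
    period, ignoring the vertical blanking interval.
    """
    period_ms = 1000.0 / refresh_hz      # ~16.7 ms at 60Hz
    interval_ms = period_ms / slices     # ~4.2 ms for 4 slices
    return [(i * interval_ms, i / slices) for i in range(slices)]


if __name__ == "__main__":
    for offset_ms, fraction in frameslice_schedule():
        print(f"Present() at +{offset_ms:.2f} ms, {fraction:.0%} into the scanout")
```

More slices shorten the interval between Present() calls, which is what shortens the grey arrows in the diagram.
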

Obviously, you can use beam racing alone (more purist), or RunAhead alone (many benefits), or both simultaneously to combine benefits.

Graphics drivers are so freaking stubborn and annoying, and not the fun part of an emulator project when they want to do things that you DON’T want them to (argh, argh – I’ve been there). Most of us assume there’s no such gap, but there (almost always) is, according to the BlurBusters input lag tests, even with all the NVInspector tricks. And even if you successfully remove one VSYNC ON gap (e.g. with Hard GPU Sync and/or other techniques), next-frame response time is very hard to get and is not guaranteed on all system configurations. Even then, you still don’t fully cancel out the benefits of beam racing, because of its sub-frame response ability (responses to midscreen input reads are now accommodated by beamracing). You can see the length of the grey lag arrows! With beamracing+VSYNC OFF, we seize control over all mid-scanout latencies.

Metaphorically, we’re pretending we’re creating our own VSYNC ON driver in our software. We never interrupt our own beam racing, we always “finish” a beam race. No tearing! That way, if we do that, there’s no difference in visual artifacts from VSYNC ON. Any new rollback’d inputs are handled for the next refresh cycle (in an overlapped operation in two threads. You can RunAhead a new series of offscreen frames in a secondary core while you are beamracing a visible frame).

This means we handle sync delays ourselves (handled via the proposed raster callback idea) instead of the driver doing it for us. Subframe inputdelaying. Frameslice beam racing is simply our own roll-our-own “VSYNC ON” driver (with minor twists like beamraced timing + realtime inputreading every frameslice, behaving as a subframe-latency inputdelay).

And, even after you have done the rollback’d input insertion, you’re continually updating the inputs, so the green frame(slices) get the freshest known inputs at the time. Mid-screen inputs too. Thus, you can decrease the RunAhead margin to get the same effective latency when combining beam racing + RunAhead, or get even less lag at the same RunAhead margin. Whatever users prefer. Or users can play on a laggier display with the same RunAhead margin (your spreadsheet will remain valid as it has always been), and it will still feel like a less-laggy display thanks to combined RunAhead+beamracing. Whichever users prefer: keep it at the spreadsheet recommendation, and (when adding beamracing) it will feel like adding 1 extra RunAhead frame beyond the spreadsheet recommendation, but without the backticking artifacts. Win-win-win?

If I am wrong (@Dwedit – am I?), and it’s still genuinely, truly guaranteed impossible, then please sharpie the diagram above to illustrate why. If so, then I’ll eat my words – I need an honest teaching, because if I am wrong, it means I totally screwed up & dropped the ball on a specific important detail – but let me know! You don’t have to implement beamracing this year or ever – I just want to clear up any confusion…

{My last post in this specific thread till a developer replies. :slight_smile: Scout’s honor.}

@mdrejhon I don’t think your idea of mixing Run-Ahead and Beam-racing is going to fly.

I edited your picture slightly and put together what happens between two vsyncs. The red square highlights it.

As you can see your frame 3 takes the whole frame because it is beam-racing via 4 slices. Since it needs to run the whole frame in real-time it clearly is / cannot be executed in Run-Ahead mode (which needs to run multiple emulated frames within one single real-time frame). So in your pictured setup the Run-Ahead flow (and thus Run-Ahead concept) gets broken.

Multithreading / parallelization of both concepts will not be able to negate this, since beam racing a single frame takes up the entire real-time frame. So you still need to choose to either have the input displayed from the Run-Ahead frame (A), OR update input state with / during beam racing (B). It’s black or white, there is no grey.

In case A you’ll lose the single most biggest advantage of beam-racing (input state and display refresh close to real hardware), and in case B the run-ahead flow / concept gets broken.

As I understand it, this is what DWedit means when saying both methods are mutually exclusive.

There is still room for discarding the waiting part of beam racing and only keeping the “workaround for buggy drivers with bad vsync” part. But this is now different than the original proposal.

OK, I now see what you mean, but the good news is it doesn’t negate the other advantages of beam racing:

Even if you eliminate beam raced input reads, you can still combine the two by only worrying about retroactive inputread insertions (no beamraced input read updates). In this sense, beamracing simply only becomes another form of VSYNC ON mode.

There are still two other advantages:

(1) Immediate refreshing of delivered pixels. It’s another mode of next-frame-response that bypasses driver’s VSYNC ON delays. (Yes, there are other ways to get next-frame input response, but having an additional method is useful!).

(2) It eliminates the need to surge-execute the final frame, lessening CPU horsepower.

So even if we indeed lose beamraced input reads (whenever in RunAhead mode), we still have the other beamracing benefits (1) and (2).

Metaphorically, we treat beamracing as an in-house VSYNC ON mode with no further inputreading. So this keeps symmetry between VSYNC ON and the beamraced frameslice mode. Perhaps this makes beamracing less useful when combined with RunAhead, but they are not incompatible for other advantages.

Both beam racing and RunAhead should be concurrent options that can be enabled separately at least but they could/should still be compatible with each other (even if you are right and we indeed lose beamraced input reads; focussing only on retroactive input reads). Beam racing works on an old GTX 380 from year 2009 (640x480, output to arcade CRT, no filters/HLSL/fuzzylines renderers), and also works on a Raspberry Pi (4K versions) so it is a CPU-reducing method, as long as the GPU bandwidth is sufficient to do high VSYNC OFF pagefliprate of redundant framebuffers… There are situations where beam racing works on systems that aren’t RunAhead-capable because the CPU is not powerful enough to run emulator 2X faster. So it’s useful to have both options available.

So, if we do that, we might as well make both of them compatible with each other (e.g. via the raster callback API I suggested). Isn’t that the logical move, anyway? :wink: Even if we have to disable the specific on-the-fly inputreading feature during beamracing (only have it do cached inputreads from beginning of cycle, just exactly like for VSYNC ON – pretending our beamracing is simply our own roll-our-own VSYNC ON driver, only for the specific use-case of concurrent use with RunAhead). Obviously, for non-RunAhead use cases, we’d enable beamraced input reads.

Two more beamracing caveats…

  • Not compatible with Freesync/Gsync
  • Not compatible with screen rotation or flipping.

We made beam racing compatible with FreeSync/GSYNC in WinUAE. I taught Toni Wilen how to do it, and it worked perfectly! We use FreeSync+VSYNC OFF as well as GSYNC+VSYNC OFF. The first Present() begins the refresh cycle and the subsequent Present() calls beamrace new frameslices into that VRR refresh cycle. It does mean fast execution for fast scanouts (e.g. 2.4x faster emulator execution during a 1/144sec scanout).

The way VRR beamracing with a hybrid “GSYNC+VSYNC OFF” mode works is:

  • Present() while the monitor is idling, immediately begins a new manually-begun refresh cycle. (Monitor actually waits for Present() before scanning)
  • Present() while the monitor is busy refreshing, behaves as VSYNC OFF (interrupts currently scanning-out refresh cycle)
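
The two rules above amount to a tiny state machine. The following is only a toy model of them, not driver or monitor code: the class name `VrrDisplayModel` and its fields are invented for illustration, and it assumes a fixed scanout time of 1/144sec per refresh cycle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VrrDisplayModel:
    """Toy model of a VRR display in a hybrid 'VRR + VSYNC OFF' mode."""
    scanout_ms: float = 1000.0 / 144            # time to scan out one refresh cycle
    scan_started_at: Optional[float] = None     # None: never scanned, monitor idling

    def is_scanning(self, now_ms: float) -> bool:
        """True while a refresh cycle started earlier is still scanning out."""
        return (self.scan_started_at is not None
                and now_ms - self.scan_started_at < self.scanout_ms)

    def present(self, now_ms: float) -> str:
        """Apply the two rules: idle starts a cycle, busy produces a tearline."""
        if self.is_scanning(now_ms):
            return "tearline"                   # behaves as VSYNC OFF
        self.scan_started_at = now_ms           # monitor was waiting for us
        return "new_refresh_cycle"
```

In this model, a Present() at 0 ms starts a refresh cycle, a Present() at 3 ms lands mid-scanout as a tearline, and a Present() at 10 ms (after the ~6.9 ms scanout has finished) starts the next cycle.
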

For a 144Hz VRR display, if you decide to use the platform raster API (if you beamrace VRR), then RasterStatus.ScanLine (and its ilk) also increments at the velocity of a 144Hz refresh cycle, no matter what the framerate is – even if you trigger the starts of new refresh cycles yourself. The monitor idles in “.InVBlank = true”, as VRR is simply a variable-size blanking interval. So if the monitor is idling, the first Present() immediately turns “.InVBlank = true” into “.InVBlank = false” and .ScanLine begins incrementing almost instantly (within microseconds). Yep, you, the programmer, have control over starting the refresh cycles on a VRR display! That’s how VRR works – the display begins scanning immediately upon Present().

For screen rotation: we simply disable beamracing under screen rotation when the scanout direction diverges between realraster and emuraster. (It’s important to compare just the scan direction: sideways monitors with sideways arcade CRTs mean left-to-right beamracing can work.) There’s an API to check screen rotation on both PC and Mac, and on Android and probably Linux too. Also, the native orientation is nearly always top-to-bottom even on phones (except for a few odd phones like the OnePlus 5, which had the jelly effect reports due to its reverse scan).

Nothing prevents enabling/disabling beamracing on the fly, so it can automatically switch in and out of beamracing every time a tablet is rotated away from its native scan direction. Want me to (offline) write up a proposed raster callback API and general guidelines? And vet it past a few developers?

Recommended Workflow

  • Check that monitor rotation is equal between real & emu -> If unequal, present full frames instead
  • Check if VSYNC OFF is accessible -> If no access to tearlines, present full frames instead
  • Check if VRR is enabled -> If enabled, slightly change the workflow to trigger refresh cycles with the first frameslice
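
The three checks above can be sketched as one decision function. This is only an illustration of the order of the checks: the function name, the argument names and the returned mode strings are all made up here, and how each condition is actually detected is platform-specific and not shown.

```python
def choose_present_mode(real_scan_dir, emu_scan_dir,
                        vsync_off_available, vrr_enabled):
    """Pick a presentation mode following the recommended workflow.

    All names and return values are hypothetical labels for this sketch.
    """
    if real_scan_dir != emu_scan_dir:
        return "full_frames"          # scan directions diverge: no beamracing
    if not vsync_off_available:
        return "full_frames"          # no access to tearlines
    if vrr_enabled:
        return "beamrace_vrr"         # first frameslice triggers the refresh cycle
    return "beamrace_fixed_hz"
```

Because the function is cheap, it could be re-evaluated whenever the display configuration changes, which matches the idea of switching beamracing on and off on the fly.
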

With an optimized beamracing workflow, I get:

  • 60fps on 144Hz GSYNC = perfect smooth beamrace (requires 2.4x exec speed)
  • 60fps on 240Hz FreeSync = perfect smooth beamrace (requires 4x exec speed)
  • 50fps on 100Hz fixed-Hz = perfect smooth beamrace (requires 2x exec speed, beamrace 1 out of 2 refresh cycles)

It is also possible to get:

  • 60fps on 75Hz = stuttery beamrace (beamrace cherrypicked 60 out of 75 refresh cycles)
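
The execution-speed figures in the combinations listed above follow from the ratio of scanout rate to emulator frame rate. Here is a small sketch of that arithmetic; `beamrace_plan` is a hypothetical helper, it assumes the emulator must run at the velocity of the real scanout while beamracing, and it only covers display rates at or above the emulator frame rate.

```python
from fractions import Fraction


def beamrace_plan(emu_fps, display_hz, vrr=False):
    """Return (exec_speed_multiplier, description) for integer fps and Hz."""
    if display_hz < emu_fps:
        raise ValueError("this sketch only covers display_hz >= emu_fps")
    speed = Fraction(display_hz, emu_fps)   # emulator speed during a scanout
    if vrr:
        return speed, "each emulator frame triggers its own refresh cycle"
    if display_hz % emu_fps == 0:
        n = display_hz // emu_fps
        return speed, f"beamrace 1 out of {n} refresh cycles"
    return speed, f"cherrypick {emu_fps} out of {display_hz} refresh cycles (stuttery)"


if __name__ == "__main__":
    for fps, hz, vrr in [(60, 144, True), (60, 240, True), (50, 100, False), (60, 75, False)]:
        speed, note = beamrace_plan(fps, hz, vrr)
        print(f"{fps}fps on {hz}Hz: {float(speed):.2f}x exec speed, {note}")
```
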

Here’s a diagrammatic version. These are the existing BlurBusters-format “filmreel theme” scanout graphs, with modifications to indicate how Present() triggers refresh cycles.

If you Present() 100 times a second at exact intervals, you’ve manually created 100Hz of software-timed refresh cycles.

  • If in VBI, Present() starts scanning a new refresh cycle (VRR behavior).
  • If scanning, Present() interrupts with new frame (VSYNC OFF tearline) if using the hybrid GSYNC+VSYNC OFF mode or the FreeSync+VSYNC OFF mode.
  • The first row of pixels underneath the tearline is on the video cable within microseconds of Present(), same situation as non-VRR.

The same applies to the equivalent presentation calls in other APIs (Mantle, OpenGL, etc.), such as glutSwapBuffers().

That’s why we succeeded in beamracing with GSYNC in WinUAE.

Is it safe to assume 2 frames of runahead would work correctly for all SNES games? I just tested it out on Donkey Kong Country 3 and it feels pretty snappy, like 8-bit NES, which is awesome. I did another quick slow-motion recording with my phone and confirmed it shaved off more than 1.5 frames. Very nice :slight_smile:

I had to enable threaded video for my Celeron N3160 to handle 2 frames of runahead flawlessly (running the Windows 32bit version using Wine on GalliumOS 64bit). I can’t tell with sub-frame accuracy from my 120fps recording but assume it would be pretty much the 2 full frames if running with threaded video disabled.

May I suggest it would be more user friendly if it could also have an automatic setting in addition to setting a manual number of frames, if you could just set runahead frames to “Auto” instead of 1, 2, 3… So it would automatically use 1 frame for the NES cores and 2 frames for the SNES cores for example, if that is known to work without any issues?

Auto would not be optimally tweaked for every game, yet, but as an easy to use “set it and forget it” option to get going. It might be possible to have this optimized per game but for now it could be set per core to begin with. It could also use a database of known and tested games and if the game has been tested simply set runahead to match, otherwise run without, so it will automatically be able to adjust more games as time goes on and the database grows with this info available…
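
As a rough sketch of the “Auto” lookup suggested above (per-game database first, then a per-core default, otherwise no runahead): everything here is hypothetical, including the function name and table layout. The per-game values are only the lag counts quoted later in this thread, used as placeholders, and the per-core table is left empty on purpose.

```python
# Placeholder data: lag counts quoted later in this thread, not a vetted database.
PER_GAME_RUNAHEAD = {
    "Super Mario World": 2,
    "Donkey Kong Country": 1,
    "F-Zero": 1,
}
# Hypothetical per-core defaults; left empty because no safe values are established.
PER_CORE_RUNAHEAD = {}


def auto_runahead(game, core,
                  per_game=PER_GAME_RUNAHEAD, per_core=PER_CORE_RUNAHEAD):
    """Return runahead frames: tested game first, then core default, else 0."""
    if game in per_game:
        return per_game[game]
    return per_core.get(core, 0)   # unknown game and core: run without runahead
```

Falling back to 0 for untested games keeps the option conservative, so a wrong guess never produces visible rollback glitches.
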

Next, I tried out Super Mario 3 with Nestopia, but that did not go so well for me: the sound is choppy. I increased the audio latency from 48ms, which I had used without problems before, to 64ms and also tried 128ms. It seems this core is not yet handling this properly, as changing the audio latency made no difference at all. Nestopia is already very low latency even without runahead, though.

Either way something’s wrong here, framerate seems OK but the sound is not (I’m using OpenAL if that makes any difference). I didn’t enable the new statistics overlay but maybe I have some obvious problem with performance, or is Nestopia less optimized or it’s just not patched for this? The core still runs normally without runahead.

Speaking of user friendly menu options and audio performance… How about having just 48, 96, 192 kHz output rates to choose from? Then it only takes a couple of presses to go from 48000 to 192000, for example. Well, at least the menu already goes in 100Hz steps; still, it’s faster to edit the config file externally than to use the GUI to change that option :slight_smile:

I’m using Nearest resampling and 192kHz rate having desktop composition (“compton” in my case) turned off so this Chromebook can run Snes9x flawlessly. With threaded video enabled it also handles 2 frames runahead, didn’t think I’d be able to get away with any runahead for SNES on this system but performance is perfect, great work!

mdrejhon, how does it behave on dos games on 60hz screen with beamracing? A lot of them are 70fps.

2 frames isn’t safe for SNES, many games have just 1 frame of lag.

The best way to be sure of how many frames of run ahead to use if you’re not sure is to pause with P and advance the frame with K while holding down an action button. Super Mario World has 2 frames of lag while Donkey Kong Country and F-Zero have 1 frame of lag.

Nestopia is compatible with run ahead. You need to make sure that “run ahead second instance” is on when using it with Nestopia or the audio will render the game unplayable.

Beam racing is mainly recommended for refresh rates higher than frame rates. Beamracing 60fps on 75Hz would be cherrypicking 60 refreshcycles out of 75 to beamrace. It’d be just as stuttery as VSYNC ON if you do that.

To do it in a forgiving way, programming it to round off each beamrace start to the nearest refresh cycle, and then beam racing that cycle, would prevent beamracing failures if, for whatever reason, the user stubbornly kept their display at an odd refresh rate. It would click naturally for things like 60fps@120Hz, but would round off to the nearest VSYNCs in the odd-Hz situation.
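
Here is a minimal sketch of that round-off idea: for each emulator frame, pick the real refresh cycle whose start is nearest to the frame’s ideal start time. The helper name is made up, it uses integer arithmetic (rounding halves up) to avoid floating-point surprises, and it assumes integer fps and Hz with Hz at or above fps.

```python
def cherrypick_refresh_cycles(emu_fps, display_hz, frames):
    """Index of the real refresh cycle nearest to each emulator frame start.

    Emulator frame k ideally starts at k / emu_fps seconds, which is
    k * display_hz / emu_fps refresh cycles; round that to the nearest
    whole cycle (halves round up).
    """
    return [(2 * k * display_hz + emu_fps) // (2 * emu_fps)
            for k in range(frames)]


if __name__ == "__main__":
    print(cherrypick_refresh_cycles(60, 120, 5))   # every second cycle
    print(cherrypick_refresh_cycles(60, 75, 9))    # uneven spacing: the stutter
```

At 60fps@120Hz the picks land on every second refresh cycle, while at 60fps@75Hz the spacing between picks alternates between one and two cycles, which is the stutter described above.
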

Now… Beamracing 70fps on 60Hz could theoretically be achieved by cherrypicking 60 of the 70 emulated refresh cycles and rendering the other 10 offscreen (framedropping them). But you’d have to surge-execute those 10, or you’d have to use an intermediate rolling frameslice queue (which creates a minor input-lag slewing artifact, but still less bad than some existing surge-execution input-lag-slew artifacts in today’s emulators when running VSYNC ON 60fps@odd Hz). I wouldn’t bother with such complexity though.

I say just make it beamrace VRR compatible and be done with it, to cover fps>Hz situations, and if an opensource programmer wants to tweak it to be compatible with the fps<Hz situation, let them. One can self-detect beam fallbehinds (failures to keep up) and switch to a “surge execute the whole frame” approach to catch up. So beamracing can be selectively bypassed on the fly, even when it is enabled, in a situation that is suddenly incompatible (screen got rotated, screen lowered in Hz, screen went into windowed mode, etc).

Note: Temporary beamrace failures simply manifest themselves as a brief reappearance of VSYNC OFF tearing until the tearlines roll back inside the jittermargins (whereupon the tearlines instantly disappear in the next refresh cycle). Jittermargins can be a wraparound full refresh cycle minus one frameslice, if you’re writing new emulated scanlines into the existing emulator frame buffer containing the previous emulator refresh cycle. Flipping that to the front buffer creates no tearlines as long as it’s within (a 16.7ms time period minus the time period of one frameslice). Even if you Present() the top part of the next emulator refresh cycle while the previous refresh cycle is scanning out, the tearline does not show, because the bottom part of the buffer is the previous emulator refresh cycle. (That’s why emulator modules ideally SHOULD NOT clear their emulator framebuffers between emulated refresh cycles – that’s not preservationist-proper and makes it incompatible with huge-jitter-margin capability during beam racing.) For the 60Hz situation, here’s an example. For 10 frameslices per cycle (~1.7ms each), that’s a ~15ms jitter margin for beam racing before tearlines accidentally appear – meaning beamracing has to fall behind by roughly that margin to encroach tearlines into a previous refresh cycle, or be too far ahead to encroach tearlines into a next refresh cycle. The amplitude of the jittermargin is 15ms in this particular case, which can make it extremely forgiving on slower systems, including Raspberry Pi and Android devices, if you adjust the beamrace margins as such. The emulator can execute in its own merry way (even cycle-exact emulators) at realtime, with no surge execution needed.
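
The jitter margin arithmetic from the note above (one full refresh cycle minus one frameslice) can be checked with a couple of lines. The function name is made up for illustration, and the calculation assumes equal-length frameslices.

```python
def jitter_margin_ms(refresh_hz=60.0, slices=10):
    """Full refresh period minus one frameslice, in milliseconds."""
    period_ms = 1000.0 / refresh_hz     # ~16.7 ms at 60Hz
    slice_ms = period_ms / slices       # ~1.7 ms for 10 slices
    return period_ms - slice_ms


if __name__ == "__main__":
    print(f"{jitter_margin_ms(60.0, 10):.1f} ms")   # 10 slices at 60Hz
    print(f"{jitter_margin_ms(60.0, 4):.1f} ms")    # 4 slices at 60Hz
```

With fewer, larger frameslices the margin shrinks, since one whole slice is always subtracted from the refresh period.
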

The proposed raster callback API, such as retro_set_raster_poll, uses arguments identical to retro_set_video_refresh to allow a central beamracer module to have early peeks at the existing partial framebuffer (only for supported emulator modules, and only if agreed). It is explained here, but I’ll formalize a better proposal in a separate thread. It may actually only require a 5-line modification to the most easily-supported emulator modules – most of the complexity is hidden in the core beamracing module.
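
To illustrate the shape of such a raster poll, here is a toy Python simulation, not the actual proposal or any existing libretro API. It assumes the core calls the poll once per emulated scanline with the same (data, width, height, pitch) arguments a video refresh callback receives, and that the central module counts scanlines itself. The class `Beamracer`, the `present` stand-in and `emulate_frame` are all invented for this sketch, and the real-raster synchronization (waiting for the display’s scanline) is deliberately left out.

```python
class Beamracer:
    """Hypothetical central beamracing module fed by a per-scanline poll."""

    def __init__(self, height, slices, present):
        self.height = height
        self.slice_height = max(1, height // slices)
        self.present = present      # stand-in for Present() / a buffer swap
        self.lines_done = 0

    def raster_poll(self, data, width, height, pitch):
        """Called by the core after each emulated scanline (assumed)."""
        self.lines_done += 1
        if self.lines_done % self.slice_height == 0:
            # A whole frameslice is ready: early-peek at the partial frame.
            self.present(data, self.lines_done)
        if self.lines_done >= self.height:
            self.lines_done = 0     # next emulated refresh cycle


def emulate_frame(framebuffer, width, height, raster_poll):
    """Toy core: fill one scanline at a time, polling after each one."""
    pitch = width                   # 1 byte per pixel in this toy buffer
    for y in range(height):
        framebuffer[y * pitch:(y + 1) * pitch] = bytes([y & 0xFF]) * width
        raster_poll(framebuffer, width, height, pitch)   # the only added call
```

In this toy, the core-side change is the single `raster_poll(...)` call inside the scanline loop, while the slicing logic lives in the central module.
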

Is there any drawback to using “runahead second instance” in some cases ? If not, maybe it should not even be an option and always be enabled ?

It uses more resources like memory and CPU. I can’t use second instance with BSNES like I need to because it causes the emulator to drop frames like crazy and then causes save state jumping so I have to use SNES9X when I want to use run ahead for Super Nintendo. It’s something that you should only use on emulators that need it like Nestopia which has massive audio problems if you don’t use it.

Same experience with bsnes. There are so many glitches during gameplay. Frames are jumping forward and backward.