An input lag investigation


#688

I personally enjoy all the information mdrejhon. Keep it up!


#689

Here’s an image that replaces my stupendously excessively long posts? (I do apologize) :zipper_mouth_face:

4-frameslice beam raced lagless VSYNC example. We only beam-race visible frames.

And beam racing brings the newer input reads closer to display scanout (completely bypassing driver sync). The beam raced frameslicing brings you subframe inputdelay benefits. It can mean less reliance on larger RunAhead numbers, potentially lowering both RunAhead counts and RunAhead processing requirements, since we are now able to take over driver latencies to our side and control 100% of them.

Beam Raced RunAhead

EDIT (to clear confusion):

  • This is a time-based graph, 16.7ms between VSYNC events, 60Hz example only
  • “Off screen” part is the one under our control.
  • “On screen” is pixels hitting the video cable (and photons hitting the human eyes on lagless monitors) according to Blur Busters input lag tests
  • With VSYNC OFF, the first row of pixels right under the tearlines are already being transmitted on the cable within microseconds of Present() – confirmed in oscilloscope tests. Yet we made tearlines invisible (because that screen region is duplicate before/after) with our techniques, so we’ve effectively created a new lagless VSYNC mode.
  • Four frame-slice beam racing is doing four Present() per refresh cycle spaced 1/4 of 1/60sec apart (~4ms apart) when beamracing a 60Hz display.
  • Frameslicing allows us to chop up the grey arrows into much shorter ones, accommodating mid-screen input reads.
  • Beginning of purple arrow (blunt end) is API call Present() or glutSwapBuffers()
  • End of purple arrow (pointy end) is the pixel being transmitted out of graphics output (aka beginning of monitor’s scan out on a lagless monitor like CRT).
  • Grey arrows are the maximum delay between emulator pixel render and pixels hitting eyes.
  • “Input delay” means the same as “frame delay” (delaying input reads to reduce input lag)
  • Input lag savings still occur if you skip beamraced input reads in the specific RunAhead use case. Retroactive input read insertions aren’t necessarily part of this beam race (they may be worthless), but that’s only one of the many benefits of beam racing, and there are many reasons to beam race anyway. In that case, we are simply using beam racing to bypass the driver delays of VSYNC ON by guaranteeing Present()-to-Pixels latency. We can optionally use fresher input reads (during the beamraced scanout) to replace the inserted input reads, for the few rare games that have sub-frame response (by repeating input reads after the retroactive input read insertion). But most of the input lag savings come from bypassing the driver unpredictabilities of VSYNC ON: new pixels hit the video output practically immediately after Present() during VSYNC OFF, as confirmed by oscilloscope. First pixels underneath tearlines = immediate transmission out of graphics output.
  • It can reduce CPU requirements of RunAhead by ~1 frame because the final frame doesn’t need to be surge-executed.
  • TL;DR: If you skip input reads as part of beamracing, the beamracing advantages don’t disappear. Beamracing still (A) bypasses driver VSYNC latencies and (B) reduces CPU requirements a bit by eliminating the need to surge-execute the final frame.

(For simplicity, this is 60Hz and only 4 frameslices. Any non-60Hz example is simply cherrypicking a real-world VBI closest to the 60Hz emulator intervals, and beamracing only that specific refresh cycle.)
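
To put numbers on the 4-frameslice 60Hz example, here is a minimal sketch of the Present() spacing. It is an idealisation: the function name and the assumption that the first Present() lands exactly on VSYNC are mine, and a real beam racer would time each Present() against the actual raster position instead.

```python
def frameslice_present_offsets(refresh_hz=60.0, slices=4):
    """Return idealised Present() times, in ms after VSYNC, for one refresh cycle."""
    period_ms = 1000.0 / refresh_hz   # 16.7ms at 60Hz
    slice_ms = period_ms / slices     # ~4.17ms for 4 frameslices
    return [i * slice_ms for i in range(slices)]


offsets = frameslice_present_offsets()
print([round(t, 2) for t in offsets])  # [0.0, 4.17, 8.33, 12.5]
```

More frameslices per refresh cycle just shrink the spacing, which is what shortens the grey arrows.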

Obviously, you can use beam racing alone (more purist), or RunAhead alone (many benefits), or both simultaneously to combine benefits.

Graphics drivers are so freaking stubborn and annoying, and not the fun part of an emulator project when they want to do things that you DON’T want them to (argh, argh – I’ve been there). Most of us assume there’s no such gap, but there (almost always) is, going by the BlurBusters input lag tests, even with all the NVInspector tricks. And even if you successfully removed one VSYNC ON gap (e.g. Hard GPU Sync and/or other techniques), next frame response time is very hard to get and is not guaranteed on all system configurations. Even then, you still don’t fully cancel out the benefits of beam racing, because of its sub-frame response ability (responses to midscreen input reads are now accommodated by beamracing). You can see the length of the grey lag arrows! With beamracing+VSYNC OFF, we seize control over all mid-scanout latencies.

Metaphorically, we’re pretending we’re creating our own VSYNC ON driver in our software. We never interrupt our own beam racing, we always “finish” a beam race. No tearing! That way, if we do that, there’s no difference in visual artifacts from VSYNC ON. Any new rollback’d inputs are handled for the next refresh cycle (in an overlapped operation in two threads. You can RunAhead a new series of offscreen frames in a secondary core while you are beamracing a visible frame).

This means we handle sync delays ourselves (handled via the proposed raster callback idea) instead of the driver doing it for us. Subframe inputdelaying. Frameslice beam racing is simply our own roll-our-own “VSYNC ON” driver (with minor twists like beamraced timing + realtime inputreading every frameslice, behaving as a subframe-latency inputdelay).

And, even as you also have done the rollback’d input insertion, you’re always continually updating the inputs, so the green frame(slices) get the freshest known inputs at the time. Mid-screen inputs too. Thus, you can decrease the RunAhead margin to get the same effective latency when combining beam racing + RunAhead. Or get even less lag at the same RunAhead margin. Whatever users prefer. Or users can play on a laggier display with the same RunAhead margin (your spreadsheet will remain valid as it has always been), and it will still feel like a less-laggy display thanks to combined RunAhead+beamracing. Whichever users prefer: keep it at the spreadsheet recommendation and it will feel like adding 1 extra RunAhead frame beyond spreadsheet recommendations, but without the backticking artifacts (when adding beamracing). Win-win-win?

If I am wrong (@Dwedit – am I?), and it’s still genuinely truly guaranteed impossible, then please sharpie the diagram above to illustrate why. If so, then I’ll eat my words – I need an honest teaching because if I am wrong, it means I totally screwed up & dropped the ball on a specific important detail – but let me know! You don’t have to implement beamracing this year or ever – I just want to clear up any confusions…

{My last post in this specific thread till a developer replies. :slight_smile: Scout’s honor.}


#690

@mdrejhon I don’t think your idea of mixing Run-Ahead and Beam-racing is going to fly.

I edited your picture slightly and put together what happens between two vsyncs. The red square highlights it.

As you can see, your frame 3 takes the whole frame because it is beam-racing via 4 slices. Since it needs to run the whole frame in real-time, it clearly is not, and cannot be, executed in Run-Ahead mode (which needs to run multiple emulated frames within one single real-time frame). So in your pictured setup the Run-Ahead flow (and thus the Run-Ahead concept) gets broken.

Multithreading / parallelization of both concepts will not be able to negate this, since beam racing a single frame takes up the entire real-time frame. So you still need to choose to either have the input displayed from the Run-Ahead frame (A), OR update input state with / during beam racing (B). It’s black or white, there is no grey.

In case A you’ll lose the single biggest advantage of beam-racing (input state and display refresh close to real hardware), and in case B the run-ahead flow / concept gets broken.

As I understand it, this is what DWedit means when saying both methods are mutually exclusive.


#691

There is still room for discarding the waiting part of beam racing and only keeping the “workaround for buggy drivers with bad vsync” part. But this is now different than the original proposal.


#692

OK, I now see what you mean, but the good news is it doesn’t negate the other advantages of beam racing:

Even if you eliminate beam raced input reads, you can still combine the two by only worrying about retroactive inputread insertions (no beamraced input read updates). In this sense, beamracing simply only becomes another form of VSYNC ON mode.

There are still two other advantages:

(1) Immediate refreshing of delivered pixels. It’s another mode of next-frame-response that bypasses driver’s VSYNC ON delays. (Yes, there are other ways to get next-frame input response, but having an additional method is useful!).

(2) It eliminates the need to surge-execute the final frame, lessening CPU horsepower.

So even if we indeed lose beamraced input reads (whenever in RunAhead mode), we still have the other beamracing benefits (1) and (2).

Metaphorically, we treat beamracing as an in-house VSYNC ON mode with no further inputreading. So this keeps symmetry between VSYNC ON and the beamraced frameslice mode. Perhaps this makes beamracing less useful when combined with RunAhead, but they are not incompatible for other advantages.

Both beam racing and RunAhead should be concurrent options that can be enabled separately at least but they could/should still be compatible with each other (even if you are right and we indeed lose beamraced input reads; focussing only on retroactive input reads). Beam racing works on an old GTX 380 from year 2009 (640x480, output to arcade CRT, no filters/HLSL/fuzzylines renderers), and also works on a Raspberry Pi (4K versions) so it is a CPU-reducing method, as long as the GPU bandwidth is sufficient to do high VSYNC OFF pagefliprate of redundant framebuffers… There are situations where beam racing works on systems that aren’t RunAhead-capable because the CPU is not powerful enough to run emulator 2X faster. So it’s useful to have both options available.

So, if we do that, we might as well make both of them compatible with each other (e.g. via the raster callback API I suggested). Isn’t that the logical move, anyway? :wink: Even if we have to disable the specific on-the-fly inputreading feature during beamracing (only have it do cached inputreads from beginning of cycle, just exactly like for VSYNC ON – pretending our beamracing is simply our own roll-our-own VSYNC ON driver, only for the specific use-case of concurrent use with RunAhead). Obviously, for non-RunAhead use cases, we’d enable beamraced input reads.


#693

Two more beamracing caveats…

  • Not compatible with Freesync/Gsync
  • Not compatible with screen rotation or flipping.

#694

We made beam racing compatible with FreeSync/GSYNC in WinUAE. I taught Toni Wilen how to do it, and it worked perfectly! We use FreeSync+VSYNC OFF as well as GSYNC+VSYNC OFF. The first Present() begins the refresh cycle and the subsequent Present() calls beamrace new frameslices into that VRR refresh cycle. It does mean fast execution for fast scanouts (e.g. 2.4x faster emulator execution during a 1/144sec scanout).

The way VRR beamracing with a hybrid “GSYNC+VSYNC OFF” mode works is:

  • Present() while the monitor is idling, immediately begins a new manually-begun refresh cycle. (Monitor actually waits for Present() before scanning)
  • Present() while the monitor is busy refreshing, behaves as VSYNC OFF (interrupts currently scanning-out refresh cycle)

For a 144Hz VRR display, if you decide to use the platform raster API (if you beamrace VRR) then the RasterStatus.ScanLine (and ilk) also increments at the velocity of a 144Hz refresh cycle, no matter what the framerate is – even if you trigger the starts of a new refresh cycle. The monitor idles in “.InVBlank = true” as VRR is simply a variable-size blanking interval. So if the monitor is idling, that means the first Present() immediately turns “.InVBlank = true” into “.InVBlank = false” and .ScanLine begins incrementing almost instantly (within microseconds). Yep, you, the programmer, have control over starting the refresh cycles on a VRR display! That’s how VRR works – the display begins scanning immediately upon Present().
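
The two Present() behaviours can be mimicked with a toy model. This is only a sketch for reasoning about the timing: the class and method names are made up, it is not the Direct3D raster API, and it ignores any minimum blanking time a real display needs between scanouts.

```python
class VrrDisplayModel:
    """Toy model of a VRR display driven in a 'VRR + VSYNC OFF' hybrid mode."""

    def __init__(self, max_hz=144.0):
        # Scanout always runs at the velocity of the maximum refresh rate.
        self.scanout_ms = 1000.0 / max_hz
        self.scan_start_ms = None  # None means the display has never scanned yet

    def in_vblank(self, now_ms):
        """True while the display is idling in its variable-size blanking interval."""
        if self.scan_start_ms is None:
            return True
        return now_ms >= self.scan_start_ms + self.scanout_ms

    def present(self, now_ms):
        """Model what a Present() at time now_ms does to the display."""
        if self.in_vblank(now_ms):
            # Idle display: this Present() starts a new refresh cycle right away.
            self.scan_start_ms = now_ms
            return "new refresh cycle"
        # Busy display: behaves like VSYNC OFF and splices in at a tearline.
        return "tearline"


display = VrrDisplayModel(max_hz=144.0)
print(display.present(0.0))   # new refresh cycle
print(display.present(3.0))   # tearline (the ~6.9ms scanout is still running)
print(display.present(10.0))  # new refresh cycle
```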

For screen rotation: We simply disable beamracing in screen rotation when the scanout direction diverges between realraster/emuraster. (It’s important to just compare scan direction: Sideways monitors with sideways arcade CRTs means left-to-right beamracing can work). There’s an API to check screen rotation on both PC and Mac. Android and probably Linux too. Also, the native orientation is nearly always top-bottom even on phones (except for a few odd phones like the OnePlus 5, which had the jelly effect reports due to its reverse scan).

There is nothing preventing enabling/disabling beamracing on the fly. So it can automatically switch in/out of beamracing every time a tablet is rotated away from its native scan direction. Want me to (offline) write up a proposed raster callback API and general guidelines? And vet it past a few developers?

Recommended Workflow

  • Check monitor rotation is equal between real & emu -> If unequal, present full frames instead
  • Check if VSYNC OFF is accessible -> If no access to tearlines, present full frames instead
  • Check if VRR is enabled -> If enabled, slight change to workflow to trigger refresh cycles with first frameslice
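
The workflow above boils down to a small decision function. A minimal sketch, assuming the three checks are already available as plain values; the enum and parameter names are hypothetical and not part of any existing frontend.

```python
from enum import Enum


class PresentMode(Enum):
    FULL_FRAMES = "present full frames"
    BEAMRACE_FIXED_HZ = "beam race on fixed Hz"
    BEAMRACE_VRR = "beam race, first frameslice triggers the refresh cycle"


def choose_present_mode(real_scan_dir, emu_scan_dir, vsync_off_available, vrr_enabled):
    """Pick a presentation mode following the three checks in the workflow."""
    # 1. Scan direction must match between the real display and the emulated raster.
    if real_scan_dir != emu_scan_dir:
        return PresentMode.FULL_FRAMES
    # 2. Without VSYNC OFF there is no access to tearlines, so no frameslicing.
    if not vsync_off_available:
        return PresentMode.FULL_FRAMES
    # 3. On VRR, the first frameslice has to start the refresh cycle.
    if vrr_enabled:
        return PresentMode.BEAMRACE_VRR
    return PresentMode.BEAMRACE_FIXED_HZ


print(choose_present_mode("top-to-bottom", "top-to-bottom", True, False).value)
```

Because nothing here is sticky, the same function can be re-run whenever the screen is rotated or the display mode changes.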

With an optimized beamracing workflow, I get:

  • 60fps on 144Hz GSYNC = perfect smooth beamrace (requires 2.4x exec speed)
  • 60fps on 240Hz FreeSync = perfect smooth beamrace (requires 4x exec speed)
  • 50fps on 100Hz fixed-Hz = perfect smooth beamrace (requires 2x exec speed, beamrace 1 out of 2 refresh cycles)

It is also possible to get:

  • 60fps on 75Hz = stuttery beamrace (beamrace cherrypicked 60 out of 75 refresh cycles)
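
The exec speed figures in these lists are just the ratio of scanout rate to emulated frame rate. A small sketch of that arithmetic follows; the function name and the smoothness rule (integer multiple on fixed Hz, any Hz at or above the frame rate on VRR) are my reading of the cases listed, not a formal spec.

```python
def beamrace_plan(emu_fps, display_hz, vrr=False):
    """Return (required exec speed multiplier, whether the beamrace is smooth)."""
    # The emulator must render a frame in the time of one real scanout.
    exec_speed = display_hz / emu_fps
    if vrr:
        # On VRR we trigger the refresh cycles ourselves, so any Hz >= fps works.
        smooth = display_hz >= emu_fps
    else:
        # On fixed Hz we can only beamrace every Nth refresh cycle smoothly.
        smooth = display_hz % emu_fps == 0
    return exec_speed, smooth


print(beamrace_plan(60, 144, vrr=True))   # (2.4, True)
print(beamrace_plan(60, 240, vrr=True))   # (4.0, True)
print(beamrace_plan(50, 100))             # (2.0, True)
print(beamrace_plan(60, 75))              # (1.25, False)
```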

#695

Here’s a diagrammatic version. These are the existing BlurBusters-format “filmreel theme” scanout graphs, with modifications to indicate how Present() triggers refresh cycles.

If you Present() 100 times a second at exact intervals, you’ve manually created 100Hz of software-timed refresh cycles.

  • If in VBI, Present() starts scanning a new refresh cycle (VRR behavior).
  • If scanning, Present() interrupts with new frame (VSYNC OFF tearline) if using the hybrid GSYNC+VSYNC OFF mode or the FreeSync+VSYNC OFF mode.
  • The first row of pixels underneath the tearline is on the video cable within microseconds of Present(), same situation as non-VRR.

Same for Mantle/OGL/etc applicable presentation APIs such as glutSwapBuffers()

That’s why we succeeded in beamracing with GSYNC in WinUAE.


#696

Is it safe to assume 2 frames runahead would work correctly for all SNES games? I just tested it out on Donkey Kong Country 3 and it feels pretty snappy like NES 8bit which is awesome. I did another quick SloMo recording with my phone and confirmed it shaved off more than 1.5 frames very nice :slight_smile:

I had to enable threaded video for my Celeron N3160 to handle 2 frames of runahead flawlessly (running the Windows 32bit version using Wine on GalliumOS 64bit). I can’t tell with sub-frame accuracy from my 120fps recording but assume it would be pretty much the 2 full frames if running with threaded video disabled.

May I suggest it would be more user friendly if it could also have an automatic setting in addition to setting a manual number of frames, if you could just set runahead frames to “Auto” instead of 1, 2, 3… So it would automatically use 1 frame for the NES cores and 2 frames for the SNES cores for example, if that is known to work without any issues?

Auto would not be optimally tweaked for every game, yet, but as an easy to use “set it and forget it” option to get going. It might be possible to have this optimized per game but for now it could be set per core to begin with. It could also use a database of known and tested games and if the game has been tested simply set runahead to match, otherwise run without, so it will automatically be able to adjust more games as time goes on and the database grows with this info available…
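
To make the “Auto” idea concrete, here is a rough sketch of the lookup order I have in mind: tested game first, then a per-core default, otherwise no runahead. The function, the table layout and the entries are all hypothetical and only for illustration; nothing like this exists in the frontend as far as I know.

```python
def auto_runahead_frames(core, game, tested_games, core_defaults=None):
    """Return the runahead frame count to use for 'Auto'."""
    key = (core, game)
    if key in tested_games:
        # A tested game always wins over the per-core default.
        return tested_games[key]
    if core_defaults and core in core_defaults:
        return core_defaults[core]
    # Unknown game and unknown core: run without runahead.
    return 0


# Illustrative entries only, not verified values.
TESTED_GAMES = {
    ("snes9x", "Super Mario World"): 2,
    ("snes9x", "Donkey Kong Country"): 1,
}
CORE_DEFAULTS = {"nestopia": 1}

print(auto_runahead_frames("snes9x", "Super Mario World", TESTED_GAMES, CORE_DEFAULTS))  # 2
print(auto_runahead_frames("snes9x", "Some Untested Game", TESTED_GAMES, CORE_DEFAULTS))  # 0
```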

Next, I tried out Super Mario 3 with Nestopia, however that did not go so perfectly for me: the sound is choppy. I increased the latency from 48ms, which I used without problems before, to 64ms and also tried 128ms. It seems this core is not yet handling this properly, and changing the audio latency made no difference at all. Nestopia is already very low latency even without this though.

Either way something’s wrong here, framerate seems OK but the sound is not (I’m using OpenAL if that makes any difference). I didn’t enable the new statistics overlay but maybe I have some obvious problem with performance, or is Nestopia less optimized or it’s just not patched for this? The core still runs normally without runahead.

Speaking of user friendly menu options and audio performance… How about having just 48, 96, 192 kHz output rates to choose from? Then it only takes a couple of presses to go from 48000 to 192000 for example. Well, at least the menu goes in 100Hz steps as it is already, still it’s faster to edit the config file externally than using the GUI to change that option :slight_smile:

I’m using Nearest resampling and 192kHz rate having desktop composition (“compton” in my case) turned off so this Chromebook can run Snes9x flawlessly. With threaded video enabled it also handles 2 frames runahead, didn’t think I’d be able to get away with any runahead for SNES on this system but performance is perfect, great work!


#697

mdrejhon, how does it behave on dos games on 60hz screen with beamracing? A lot of them are 70fps.


#698

2 frames isn’t safe for SNES, many games have just 1 frame of lag.


#699

The best way to be sure of how many frames of run ahead to use if you’re not sure is to pause with P and advance the frame with K while holding down an action button. Super Mario World has 2 frames of lag while Donkey Kong Country and F-Zero have 1 frame of lag.

Nestopia is compatible with run ahead. You need to make sure that “run ahead second instance” is on when using it with Nestopia or the audio will render the game unplayable.


#700

Beam racing is mainly recommended for refresh rates higher than frame rates. Beamracing 60fps on 75Hz would be cherrypicking 60 refreshcycles out of 75 to beamrace. It’d be just as stuttery as VSYNC ON if you do that.

To do it in a forgiving way, “programming it in a way to round-off the beamrace-begins to the nearest refresh cycle and then beam racing that” would prevent beamracing failures if, for whatever reason, the user stubbornly kept their display at an odd refresh rate. It would click naturally for things like 60fps@120Hz, but would round off to the nearest VSYNCs for the odd-Hz situation.
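
The round-off idea is simple enough to sketch. This assumes both clocks start together at time zero and uses a made-up function name; a real implementation would work from measured VSYNC timestamps instead of ideal ones.

```python
def nearest_refresh_cycles(emu_fps, display_hz, n_frames):
    """For each emulated frame, return the index of the nearest real refresh cycle."""
    # Emulated frame i ideally starts at i / emu_fps seconds, which is
    # i * display_hz / emu_fps refresh cycles into the display's timeline.
    return [round(i * display_hz / emu_fps) for i in range(n_frames)]


print(nearest_refresh_cycles(60, 120, 3))   # [0, 2, 4]

cycles = nearest_refresh_cycles(60, 75, 60)
print(len(set(cycles)), "of 75 refresh cycles get beamraced in one second")
```

At 60fps@120Hz every second refresh cycle is picked, so it clicks naturally. At 60fps@75Hz the picks are unevenly spaced, which is where the stutter comes from.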

Now… Beamracing 70fps on 60Hz could theoretically be achieved by cherrypicking 60 refreshcycles and offscreen/framedropping 10 of them. But you’d have to surge-execute 10 of them, or you’d have to use an intermediate rolling frameslice queue (which creates a minor input-lag slewing artifact, but still less bad than some existing surge-execution input-lag-slew artifacts in today’s emulators when running VSYNC ON 60fps@odd Hz). I wouldn’t bother with such complexity though.

I say just make it beamrace VRR compatible and be done with it, to cover fps>Hz situations, and if an opensource programmer wants to tweak it to be compatible with the fps<Hz situation, let them. One can self-detect beam fallbehinds (failures to keep up) and switch to the “surge execute the whole frame” approach to catch up. So beamracing can be selectively ignored on the fly even when beamracing is enabled, in a situation that is suddenly incompatible (screen got rotated, screen lowered in Hz, screen went into windowed mode, etc).
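
A fall-behind check can be very small. This is only a sketch under a simplifying assumption: “fallen behind” means the real raster has passed the last scanline the emulator finished, with no wraparound jitter margin taken into account. The function and its return strings are hypothetical.

```python
def beamrace_action(emu_scanline, real_scanline):
    """Decide how to handle the current frame.

    emu_scanline:  last scanline the emulator has finished rendering.
    real_scanline: scanline the display is currently scanning out.
    """
    if emu_scanline < real_scanline:
        # The real raster overtook the emulated raster: stop racing this frame
        # and catch up by running the rest of the frame at full speed.
        return "surge_execute_whole_frame"
    return "continue_beamracing"


print(beamrace_action(emu_scanline=120, real_scanline=90))   # continue_beamracing
print(beamrace_action(emu_scanline=120, real_scanline=200))  # surge_execute_whole_frame
```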

Note: Temporary beamrace fails simply manifest themselves as a brief reappearance of VSYNC OFF tearing until the tearlines roll back inside the jittermargins (whereupon the tearlines disappear the next refresh cycle).

Jittermargins can be a wraparound full refresh cycle minus one frameslice, if you’re writing new emulated scanlines into the existing emulator frame buffer containing the previous emulator refresh cycle. Flipping that to the front buffer creates no tearlines as long as it’s within (16.7ms minus the time period of one frameslice). Even if you Present() the top part of the next emulator refresh cycle while the previous refresh cycle is scanning out, the tearline does not show, because the bottom part of the buffer is still the previous emulator refresh cycle. (That’s why emulator modules ideally SHOULD NOT clear their emulator framebuffers between emulated refresh cycles – that’s not preservationist-proper and makes it incompatible with huge-jitter-margin capability during beam racing).

For the 60Hz situation, here’s an example. With 10 frameslices per cycle (~1.67ms each), that’s a ~15ms jitter margin for beam racing before tearlines accidentally appear – meaning beamracing has to fall behind by roughly that much to encroach tearlines into a previous refresh cycle, or be too far ahead to encroach tearlines into a next refresh cycle. The amplitude of the jittermargin is 15ms in this particular case, which can make it extremely forgiving on slower systems including Raspberry Pi and Android devices if you adjust the beamrace margins as such. The emulator can execute in its own merry way (even cycle-exact emulators) at realtime with no surge execution needed.
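
The jitter margin arithmetic for that example, as a quick sketch (the function name is made up; the formula is just “one refresh period minus one frameslice”):

```python
def jitter_margin_ms(refresh_hz=60.0, slices=10):
    """Wraparound jitter margin: one full refresh period minus one frameslice."""
    period_ms = 1000.0 / refresh_hz   # 16.7ms at 60Hz
    slice_ms = period_ms / slices     # ~1.67ms for 10 frameslices
    return period_ms - slice_ms


print(round(jitter_margin_ms(), 1))            # 15.0
print(round(jitter_margin_ms(slices=4), 1))    # 12.5
```

So fewer, bigger frameslices cost some margin, while more frameslices push the margin closer to the full refresh period.
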

The proposed raster callback API, such as retro_set_raster_poll, uses identical arguments as retro_set_video_refresh to allow a central beamracer module to have early peeks at the existing partial framebuffer (only for supported emulator modules, and only if agreed). It is explained here, but I’ll formalize a better proposal in a separate thread. It may actually only require a 5 line modification to the most easily-supported emulator modules, since most of the complexity is hidden in the core beamracing module.


#701

Is there any drawback to using “runahead second instance” in some cases ? If not, maybe it should not even be an option and always be enabled ?


#702

It uses more resources like memory and CPU. I can’t use second instance with BSNES like I need to because it causes the emulator to drop frames like crazy and then causes save state jumping so I have to use SNES9X when I want to use run ahead for Super Nintendo. It’s something that you should only use on emulators that need it like Nestopia which has massive audio problems if you don’t use it.


#703

Same experience with bsnes. There are so many glitches during gameplay. Frames are jumping forward and backward.


#704

Depends on the game. Some games only need 1 frame, some need 4.


#705

Hmm good to know! I’ll check with the pause + frame advance method to find out the value for each game.

Still have the audio problem in Super Mario 3 with Nestopia though, not only on the Windows 32bit build running on my Chromebook, but I also quickly tested it on my Shield Android TV now and it’s the same problem there. It handles crt-royale shader at full speed with Nestopia and Snes9x as long as threaded video is enabled (the Shield has multiple times the GPU performance compared to the Chromebook integrated Intel GPU).

So two very different systems with the same audio issue on both. No framerate drops, both devices run stable and smooth. Second instance option on or off, and runahead 1 frame, audio is not working properly with Nestopia in Mario 3. Did anyone else try this?


#706

I have no idea what’s going on with the Bsnes cores, they do get out of sync (release buttons randomly) when using the secondary core. The Mednafen BSNES core is fine however.


#707

I only experienced issues with the graphics (frame jumping). I had to activate the second instance to get rid of the button release issue. By the way, mgba also has issues. The sprites are disappearing and reappearing. Also, in mgba some future frames will be displayed randomly.

The 1.7.2 release is great and the latency option is awesome. However, some cores have to be optimized to work with that option flawlessly.