An input lag investigation

I have been doing some experimentation today with RetroArch running on the Raspberry Pi VC4 open source driver (and GL on KMSDRM on the RetroArch side), and I found out that, after precise monitor refresh rate measurement, I can leave VSYNC OFF and still get no tearing at all! Is that expected somehow? Have you guys seen it with other KMSDRM implementations?

Also, I take it hard_gpu_sync makes no sense on KMSDRM, right?

@Brunnis @Twinaphex

On Blur Busters there’s a pretty interesting article on how to reduce input lag in emulators and match the latency of the original device. See here:

Eliminate Input Lag on PC-Based Emulators: Matching the Latency of the Original Device.

And more in-depth discussion in their forum here: Emulator Developers: Lagless VSYNC ON Algorithm.

From what I read it seems a bit focused on using Windows features/APIs, so I’m not sure how well suited this would be for RetroArch. Interesting concept nonetheless.


The biggest issue with that algorithm for RA/libretro, I think, is that it requires running emulators in fractions of a frame, just a few scanlines at a time. Libretro is typically set up for a single frame’s worth of emulation, so I’m not sure how we would be able to break that up.

I hear that the APIs that get the status of the raster (scanline position) do not always work, and fail on some devices.

It’s also mutually exclusive with using run-ahead to compensate for game internal lag, and eliminating the game internal lag is much more powerful.

That looks gorgeous!

Great to see frame time and deviation exposed so conveniently. I have a feeling I’ll be spending a lot more time playing with these stats than playing games :slight_smile:

Ah, possibly I misunderstood?

Is Frame Time supposed to display the time in ms that RetroArch takes to create the frame, or the time in ms that the frame spends on the screen?

The latest nightly on macOS seems to be showing time-on-screen: https://imgur.com/a/ngdYr

Just check the source code, I guess, to be absolutely sure. These values were previously reported at RetroArch exit if you invoked it from the command line; all I did was hook them up so they can be seen in-game.

I can clearly see that it’s not showing the time it took to create the frame.

The relevant times would be (see the sketch after this list):

  • Time spent running the emulator core, excluding time spent in the video callback
  • Time spent in the video callback uploading the texture
  • Time running the shaders
  • Time spent waiting for Present/SwapBuffers to finish
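
For illustration, here’s a minimal sketch of timing those four phases separately. The stubs stand in for the real work; this is not RetroArch’s actual instrumentation, and in libretro the video callback actually runs inside the core’s frame, so the first two phases interleave in practice:

```c
#include <stdio.h>
#include <time.h>

/* Microsecond monotonic clock (POSIX; other platforms would differ). */
static long long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Stubs standing in for the real work, so the sketch compiles. */
static void run_core_frame(void) { }  /* core emulation, minus video callback */
static void upload_texture(void) { }  /* video callback: texture upload */
static void run_shaders(void)    { }  /* shader passes */
static void swap_buffers(void)   { }  /* wait on Present/SwapBuffers */

int main(void)
{
    long long t0 = now_us(); run_core_frame();
    long long t1 = now_us(); upload_texture();
    long long t2 = now_us(); run_shaders();
    long long t3 = now_us(); swap_buffers();
    long long t4 = now_us();

    printf("core %lld us | upload %lld us | shaders %lld us | swap %lld us\n",
           t1 - t0, t2 - t1, t3 - t2, t4 - t3);
    return 0;
}
```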

In case you’d like to improve its behavior through a pull request, let me give you some pointers -

gfx/video_driver.c (line 2399):

Here is where frame_time gets set.

Then, later on, we set video_info.frame_time here -

This is the value that gets used in the statistics.
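
For what it’s worth, here’s a standalone sketch (not RetroArch’s actual code) of the two quantities being discussed: the time spent building a frame versus the wall-clock time between successive flips, which at 60Hz hovers around 16.7ms no matter how quickly the frame was built:

```c
#include <stdio.h>
#include <time.h>

/* Microsecond monotonic clock (POSIX). */
static long long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

static void build_frame(void)   { }  /* stub: core + video work */
static void wait_for_flip(void) { }  /* stub: blocking VSYNC wait */

int main(void)
{
    long long prev_flip = now_us();
    for (int i = 0; i < 3; i++) {
        long long start = now_us();
        build_frame();
        long long busy = now_us() - start;       /* time to create the frame */

        wait_for_flip();
        long long flip = now_us();
        long long frame_time = flip - prev_flip; /* frame-to-frame interval */
        prev_flip = flip;

        printf("busy %lld us, frame-to-frame %lld us\n", busy, frame_time);
    }
    return 0;
}
```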

Good news! I found a way to do beam racing in a cross-platform manner (see video at bottom of post):

Two emulators have successfully implemented the Blur Busters lagless VSYNC ON experiment (via tearingless VSYNC OFF), so it’s actually been validated:

– Toni’s WinUAE now has real-time beam chasing. 40ms input lag reduced to less than 5ms! http://eab.abime.net/showthread.php?t=88777&page=8

– Calamity’s experimental (unreleased) change to GroovyMAME, patch, same-frame lag:

Related developer-oriented forum thread (read all forum pages)

Toni made his beam racing compatible with GSYNC and FreeSync, with my help, via fast-beamracing. So you can have even less lag with VRR. VRR still scans top-to-bottom, just faster. So the emulator CPU runs ultra-fast (e.g. 4x faster) whenever a refresh cycle is scanning out (e.g. 1/240sec scanout top-to-bottom of a “60Hz” refresh cycle). It’s like Einstein where everything is relative: it’s synchronizing 1:1 between emulated raster and real-world raster, just faster scanouts followed by longer pauses between refresh cycles. As a result, you can have slightly less input lag than the original device being emulated, if you combine VRR + fast-beamracing. Beam racing can also be done on selective refresh cycles (e.g. every other refresh cycle during 120Hz), via surge-emulator-CPU-execution in synchronization with a fast-scanning-out real-world raster.
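
To put example numbers on that “surge” idea (nothing below is from any particular implementation, just the arithmetic):

```c
#include <stdio.h>

int main(void)
{
    const double content_hz = 60.0;   /* emulated machine's refresh rate */
    const double scanout_hz = 240.0;  /* how fast the VRR panel scans one frame */

    double scanout_ms = 1000.0 / scanout_hz;     /* ~4.17 ms top-to-bottom */
    double frame_ms   = 1000.0 / content_hz;     /* ~16.7 ms between frames */
    double surge      = scanout_hz / content_hz; /* emulator runs 4x during scanout */
    double pause_ms   = frame_ms - scanout_ms;   /* idle until the next refresh */

    printf("surge factor %.0fx, scanout %.2f ms, pause %.2f ms\n",
           surge, scanout_ms, pause_ms);
    return 0;
}
```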

For understanding the LCD scanout, see the GSYNC beam racing instructions (raster scan line synchronization to real-world raster position on a GSYNC scanout or FreeSync scanout): Page1, Page2 – though for practical considerations, it works best ONLY on 120Hz+ VRR displays due to an annoying graphics driver quirk. Toni of WinUAE asked me many questions and I’ve successfully helped him implement beam racing with GSYNC/FreeSync to have even less lag. That said, beam racing works on slow refresh cycles, fast refresh cycles, and variable refresh cycles – as long as there’s a refresh cycle that’s pretty close to the emulator interval, beam racing can be done on specific chosen refresh cycles – basically catching the caboose of a passing train of a display scanout (or triggering your own scanout in the case of VRR), if you will.

And it doesn’t have to be perfect synchronization between emulator raster and real-world raster, it can be done at the frame-slice level:

So a lagless VSYNC ON emulated via VSYNC OFF, with zero tearing, because the tearing occurs on duplicate frameslices, and duplicates have no effect, so no tearline – voila!
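
In C-flavored pseudocode, the whole loop is surprisingly small. This is only a sketch: the raster query, the slice emulation, and the VSYNC OFF flip are assumed hooks, not any particular API:

```c
#include <stdio.h>

enum { SLICES = 10 };   /* frame slices per refresh -- example value */

/* Assumed platform hooks, stubbed here so the sketch compiles. */
static double frac_into_refresh(void) { return 1.0; } /* 0..1 since last VSYNC */
static void   emulate_slice(int s)    { (void)s; }    /* emulate slice s into framebuffer */
static void   present_vsync_off(void) { }             /* flip whole framebuffer, VSYNC OFF */

/* One beam-raced refresh cycle; assumes we enter right at VSYNC. */
static void beam_raced_frame(void)
{
    for (int s = 0; s < SLICES; s++) {
        /* Emulated raster stays roughly one slice ahead of the real raster. */
        emulate_slice(s);

        /* Flip while the real raster is still above slice s. The new buffer
           differs from the previous flip only in slice s, so the tearline
           falls inside duplicate content and is invisible. */
        present_vsync_off();

        /* Pace: wait for the real raster to enter slice s before emulating
           slice s+1 (a real implementation would sleep, not spin). */
        while (frac_into_refresh() < (double)s / SLICES)
            ;
    }
}

int main(void) { beam_raced_frame(); return 0; }
```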

Toni said it was easier than expected to add beam-racing support in WinUAE to successfully synchronize the emulator raster with the real-world raster. And apparently, you can use higher Hz just fine too (just selectively beam-race the appropriate refresh cycles, in an accelerated surge cycle; it’s simply faster top-to-bottom scanouts) – CPU performance and GPU performance willing. I can do up to 7,000 frameslices per second, so my lag from emulator-pixel-render to photons hitting my eyes can be as little as 2/7000ths of a second plus whatever the pixel response is (at least for bufferless gaming LCD monitors and for CRT displays). Buffered LCDs will add more lag, but won’t interfere with beam racing the video signal, so emulator authors don’t need to worry about how the display buffers the frames – most good desktop gaming LCDs don’t have buffer latency anymore (at their highest Hz) and can refresh the panel in sync with the signal scanout, with only GtG (pixel response) lag.

Although right now RetroArch is fully frame-based, there’s no reason why it couldn’t (eventually, slowly, carefully) add support for optional raster hooks in the coming years.

But given the potential rearchitecturing issues, I’d suggest waiting for other, simpler emulators to pave the way first, before inserting beam racing workflows into RetroArch. Let’s finish a cross-platform beam racing implementation first.

I’m achieving precision tearline positioning with only a microsecond clock offset from VSYNC timestamps, so I don’t need access to a raster register (that’s only cake frosting):

And that’s without access to a raster scan line register! Just generic VSYNC OFF + precision clock counter offsets from a VSYNC timestamp. (There are many ways to get a VSYNC timestamp on many platforms, even while running in VSYNC OFF mode.) VSYNC OFF tearlines are always raster-exact. You simply become a tearline jedi once you understand displays as well as some of us do.
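
The raster estimate that replaces the hardware register is plain proportional arithmetic. A minimal sketch, assuming a 1080p60 signal with 1125 total scanlines (active plus blanking; the real number varies per video mode):

```c
#include <stdio.h>

int main(void)
{
    /* Example numbers: 1080p60 signal, 1125 total scanlines including
       the vertical blanking interval (varies per video mode). */
    const double refresh_hz      = 60.0;
    const double total_scanlines = 1125.0;
    const double frame_us        = 1000000.0 / refresh_hz;

    double t_since_vsync_us = 5000.0;  /* microseconds since last VSYNC timestamp */

    double raster = (t_since_vsync_us / frame_us) * total_scanlines;
    printf("estimated raster: line %.0f of %.0f\n", raster, total_scanlines);
    return 0;
}
```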


Why, this is fantastic!
Don’t be surprised if you see it on Nvidia cards in a few months; unpatented good ideas are stolen swiftly.

They already use beam-racing with virtual reality.

What we’re doing is applying beam racing to emulator rasters in a practical way for the first time. Real-time synchronization between the emulated raster and the real raster for lagless emulator operation is now a magical reality. Toni, the WinUAE author, said it is the emulator holy grail! It’s worth the implementation difficulty! But it was easier for Toni than expected.

But virtual reality has already been doing it at the coarse strip level: – Android beam racing for GearVR

https://www.imgtec.com/blog/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/

– NVIDIA VRWorks front buffer rendering requires beam-racing to work

But what I’ve developed is a way to do this with emulators and do it reliably at the frameslice approximation level, using only VSYNC OFF (without needing a front buffer, though that would help achieve single-scanline slices in the future). Perfectly, with zero tearing, with more frame slice strips than the 4-strip beam-racing VR renderers, and with a safety jitter margin so raster-sync imperfections are completely invisible (until the error is big, and even then it’s usually just a single-frame accidental appearance of tearing). Other than that, in normal operation, it just looks like perfect VSYNC ON, smooth just like the original device, without VSYNC ON lag. A lagless VSYNC ON.
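
One way that safety jitter margin could be expressed (illustrative only; the slice count and margin are made-up numbers): only present the next slice while the real raster is still comfortably above it, so a sync error lands in duplicate content rather than showing a tearline.

```c
#include <stdbool.h>
#include <stdio.h>

enum { SLICES = 10 };

/* True if it is safe to present slice 'next_slice': the real raster must
   still be above its top edge by at least 'margin' slices, so any sync
   error lands in duplicate content instead of fresh pixels. */
static bool safe_to_present(double raster_frac, int next_slice, double margin)
{
    double slice_top = (double)next_slice / SLICES;
    return raster_frac < slice_top - margin / SLICES;
}

int main(void)
{
    printf("%d\n", safe_to_present(0.25, 4, 1.0)); /* raster in slice 2: safe */
    printf("%d\n", safe_to_present(0.38, 4, 1.0)); /* too close: wait or skip */
    return 0;
}
```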

That, we indeedy, can definitely, finally reliably do – even in C# and script-like languages – and it can get practically line-exact if one decides to do optional hardware scanline polling. The emulator holy grail!!


We’re already skipping input lag caused by games’ internal logic in another thread, so we’re not easily impressed! :neutral_face:

Joke aside, that’s interesting. :wink:

I think the beam racing implementation sounds more ideal for me as a CRT user but both ideas are pretty freakin clever.

While it’s a boon for emulator users with CRT, it’s also equally majorly lag-reducing for LCD too.

It’s compatible with any display (yes, even VRR, if you use the GSYNC+VSYNC OFF or the FreeSync+VSYNC OFF simultaneous combo technique as linked). Since output onto a display cable is, by necessity, a top-to-bottom raster serialization of pixels from a 2D plane (screen data), even DisplayPort micropackets are still raster-accurate serializations. Everything on any display cable is top-to-bottom. We’re just piggybacking on the fact that all video outputs on a graphics card still scan out top-to-bottom.

The only thing that really throws it off is a rotated display – e.g. left-to-right scanout – but if you’ve got a left-to-right scanning emulator (e.g. Galaxian or Pac-Man), then you can even beam-chase left-to-right scanouts too. To enable beam racing synchronization (sync between emu raster + real raster), you need the emulated and real scanout directions to match.

If you’ve ever used a Leo Bodnar Lag Tester, you know that it has three flashing squares, Top/Center/Bottom, to measure lag at different parts of a display. It measures lag from VBLANK to the square. So the bottom square often has more lag, unless the display is strobed/pulsed (e.g. LightBoost, DLP, Plasma), in which case the TOP/CENTER/BOTTOM squares are equally laggy.

The latency reduction offsets are similar regardless of what an LCD does. If an LCD (e.g. a fast TN panel) had 3ms/11ms/19ms TOP/CENTER/BOTTOM input lag on the Leo Bodnar Lag Tester – those ~8ms steps are just the 60Hz scanout (~16.7ms top-to-bottom) reaching the center and bottom of the screen later – beam racing makes TOP/CENTER/BOTTOM equally 3ms on many LCD panels, because you’ve bypassed the scanout latency by essentially streaming the rasters in realtime onto the cable. When you use a Leo Bodnar on a CRT, it also measures more lag for the bottom edge, but it’s lagless if you do beam-chased output.

So what you see as “11ms” on DisplayLag.com (CENTER square on Leo Bodnar Lag Tester) will actually be “3ms” lag with the beam-racing method, because beam-racing equalizes the entire display surface to match the input lag of the TOP-edge square in Leo Bodnar Lag Tester. (see…bypassing scanout lag) The lowest value becomes equalized for the entire screen surface.

(Niggly bit for advanced readers: Well, VSYNC OFF frame slices are their own mini lag-gradients, but a 1000-frame-slice-per-second implementation will have only 1/1000 = 1ms lag gradient throughout the frame slice strip … The more frame slices per second, the tinier these mini lag-gradients become. Instead of a lag gradient for the whole display in the vertical dimension, the lag gradients are shrunk to individual frame slices, so each frame slice may be (example numbers only) 3.17ms-thru-4.17ms lag apiece, depending on which scanline inside the frame slice. This frame slice lag-gradient behavior was confirmed via an oscilloscope. That said, these tiny lag gradients are much smaller than full-refresh-cycle lag gradients. Not relevant topic matter for most people here, even emulator developers, but I mention this only for mathematical completeness’ sake.)

Whatever the Leo Bodnar Lag Tester measures for input lag for the TOP square becomes the input lag of the MIDDLE and BOTTOM when you use beam-raced output. The lag is essentially equalized for the whole display surface. So there’s no additional bottom-edge lag when you do beam-raced output, even to LCDs. As many know, Blur Busters does latency tests, and some emu authors have posted high speed video proof on the forum thread now, so it’s validated – realtime beam racing bypasses the mandatory scanout latency of full-framebuffer implementations.


So can we have a proof of concept implementation for this in RetroArch?

It can be done in somebody’s fork or whatever, just so long as people can play with it, experiment and then report back as to how well it’s working.

I’m happy to help any volunteer implement experimental beam racing in any emulator. Toni has been asking me many questions privately.

Currently, Calamity (GroovyMAME), Toni (WinUAE) and I are collaborating to refine this, and I’m writing an open source cross-platform beam racing demo (a Hello World type program for beam racing newbies). Back-and-forth talk is occurring in pages 4-5 of this thread, though even more emails have been sent privately amongst us. If any coding volunteer wants to join in, we can help them become a Tearline Jedi too for their respective emulator work. If no coder resources are available, wait for the creation of a cross-platform beam racing helper framework.


Is there anybody here who feels inclined to take this guy’s ideas and try to make a working proof of concept for RetroArch with them? Apparently, if what he says above is to be believed, he’d be more than willing to provide feedback and help in that process.


This is a great idea! On a high level, the effect of this is the same as using Hard GPU Sync + a very high Frame Delay setting in RetroArch. However, using this method you’re moving from being focused on running the emulator as fast as possible to generate the whole frame (usually CPU limited) to being able to page flip as fast as possible (GPU limited).

I’m guessing this can be slightly tricky to implement in RetroArch, since it would require an API update in addition to core support. It really should be looked into, though.
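
For a flavor of what that API update might look like, here’s a purely hypothetical sketch; none of these names exist in libretro today, and a real design would need much more thought:

```c
/* Purely hypothetical sketch -- none of these symbols exist in the real
   libretro API today. A core supporting raster hooks could hand the
   frontend partial frames as horizontal bands ("frame slices"). */

#include <stddef.h>

/* Hypothetical callback: the core announces that scanlines
   [first_line, first_line + num_lines) of the current frame are done. */
typedef void (*retro_raster_slice_t)(const void *data, unsigned first_line,
                                     unsigned num_lines, size_t pitch);

/* Hypothetical environment call the core would use to register it;
   the number is made up. */
#define RETRO_ENVIRONMENT_SET_RASTER_SLICE_CALLBACK 0x10000

/* A core would then invoke the registered callback every N scanlines from
   inside retro_run(), letting the frontend beam-race each slice instead
   of waiting for the full frame. */
```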


It’s actually better than frame delay because it runs the video updates in sync with the real hardware. So for systems where games race the beam, e.g. the Atari 2600, it is able to show mid-frame changes just like the real hardware does. With frame delay, however high you set the value, this is never possible.