An input lag investigation

@mdrejhon @rafan

It’s possible. The runahead algorithm just needs to run on a frameslice basis instead of once per frame.

f = number of scanlines per frameslice
r = runahead value (in scanlines)

for each frameslice:
    run the emulator for f scanlines; save state
    run the emulator for r scanlines; beam-sync the last f scanlines
    load state

All else equal, this multiplies the workload by the number of frameslices per frame, plus some overhead for (de)serialization. So if the frame is divided into four frameslices, you have to do four times the work. This isn’t as bad as it sounds if we’re using sub-frame runahead values. For example, if r = f, it’s approximately equivalent to a runahead of 1 today, r = 2f is 2 frames of runahead, and so on.
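The loop and the workload math can be sketched in a few lines of Python. This is purely illustrative: the emulator object and its method names (run_scanlines, save_state, load_state, present_last_scanlines) are invented for the sketch, and the scanline count is a placeholder.

```python
# Hypothetical sketch of the per-frameslice runahead loop described above.
# All emulator methods here are invented names, not a real core API.

SCANLINES_PER_FRAME = 240  # placeholder; adjust per emulated system

def frameslice_runahead(emu, f, r):
    """Emulate one frame in slices of f scanlines, running r scanlines ahead."""
    for _ in range(SCANLINES_PER_FRAME // f):
        emu.run_scanlines(f)           # advance the real state by one slice
        state = emu.save_state()
        emu.run_scanlines(r)           # speculatively run r scanlines ahead
        emu.present_last_scanlines(f)  # beam-sync the last f scanlines drawn
        emu.load_state(state)          # rewind to the saved real state

def workload_factor(f, r):
    """Scanlines emulated per frame, relative to plain (no-runahead) emulation."""
    slices = SCANLINES_PER_FRAME // f
    return slices * (f + r) / SCANLINES_PER_FRAME

# With r = f each slice is emulated twice; with r = 3f, four times:
print(workload_factor(60, 60))   # -> 2.0
print(workload_factor(60, 180))  # -> 4.0
```

The factor only counts emulated scanlines; (de)serialization overhead per slice comes on top of it.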

One caveat: since we’re no longer presenting discrete frames, setting the runahead value too high (above the game’s “internal lag”) has an extra side effect. In addition to the usual “frame skipping” you normally get when runahead is too high, you might also see occasional screen tearing, not unlike running with vsync off, except it will only happen when state changes across frameslice boundaries in response to input, which doesn’t happen as frequently as you might imagine.

This is fundamentally unavoidable. You can’t have intra-frame responses without tearing unless the game was designed with it in mind (which obviously won’t be the case if we are using runahead to achieve it).

However, provided you don’t set the runahead value above that of the “internal lag”, you can still reduce input lag without producing visual disturbances.

This makes it more useful for removing (small) amounts of host lag (e.g. driver, display, polling lag) up to the limit of the internal lag, while still maintaining faithful system latency.

A few more thoughts.

To my knowledge, most games poll for input during the vertical blanking period. This limits the value of scanline syncing somewhat, because we can brute-force similar latency reduction with sheer CPU power by just sleeping as long as possible (frame delay in RetroArch), compressing the entire emulated frame into a much smaller window.
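That brute-force trade-off is easy to put numbers on. The model below is a rough sketch, not measured data: the frontend sleeps for the frame delay after vsync, then emulates the whole frame in whatever time is left, so a vblank input poll happens that much later in real time and lag drops by roughly the delay.

```python
# Rough model of frame delay: sleep first, then emulate the frame in the
# remaining time before the next vsync. Numbers are illustrative.

FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 Hz

def max_frame_delay_ms(emulation_ms, margin_ms=1.0):
    """Largest frame delay (ms) that still finishes emulation before vsync."""
    return max(0.0, FRAME_MS - emulation_ms - margin_ms)

# A fast core that emulates a frame in 4 ms leaves room for ~11.7 ms of
# delay, i.e. roughly 11.7 ms less input lag -- but only if the CPU can
# reliably finish the compressed frame in time:
print(round(max_frame_delay_ms(4.0), 1))  # -> 11.7
```

The 1 ms safety margin is an assumption; a slow core (say 20 ms per frame) gets a usable delay of 0 and no benefit at all, which is the CPU-power catch mentioned above.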

A game that polled for input during the last visible scanline is where beamracing would truly shine. In that case it would shave off almost a full frame of input lag. The same is true of cores that exit at the end of the vertical blanking period instead of the beginning (e.g. SNES cores pre-Brunnis lagfix patches). The lagfix changes would be superseded by scanline syncing.

Even for the most performant cores, scanline syncing would let us eke out a millisecond or two. And having frame delay-like latency reduction without the high CPU requirements would be nice.

Finally, here are some crude ASCII diagrams which illustrate the differences across various lag-reduction methods.

I’ve attached a screenshot below in case the forum’s limited width forces you to scroll back and forth.

Legend

| = host v-blank interval
* = input lag
p = game polls input
o = earliest possible point at which a visible reaction could occur

snes9x pre-lagfix

 |                                          |                                          |
 <emulation><<<<<<<<blocking_on_video>>>>>>>><emulation><<<<<<<blocking_on_video>>>>>>>
          p*****************************************************************************o

snes9x post-lagfix

 |                                          |                                          |
 <emulation><<<<<<<<blocking_on_video>>>>>>>><emulation><<<<<<<blocking_on_video>>>>>>>
  p******************************************o

snes9x pre-lagfix with frame_delay:

 |                                          |                                          |
 <<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation><<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation>
                                  p*****************************************************o

snes9x post-lagfix with frame_delay

 |                                          |                                          |
 <<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation><<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation>
                                  p**********o

snes9x post-lagfix with frame_delay, game polls input on last scanline

 |                                          |                                          |
 <<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation><<<<<<<<<<<<sleeping>>>>>>>>>>>><emulation>
                                           p********************************************o

scanline sync

 |                                          |                                          |
<emulation><emulation><emulation><emulation><emulation><emulation><emulation><emulation>
                                  p**********o

scanline sync, game polls input on last scanline

 |                                          |                                          |
<emulation><emulation><emulation><emulation><emulation><emulation><emulation><emulation>
                                           p**********o

Screenshot

I hope that isn’t too difficult to follow. The asterisk trails give you a quick visual guide. Shorter = less lag.

Note: these diagrams consider only lag introduced by the syncing method. I’m disregarding USB polling, display and driver lag, etc., because they’re independent of the sync method. Most games also have at least one frame of internal lag, but that’s out of scope too. In short, the asterisks represent the smallest theoretical time in which you could see a reaction to your input under ideal (unrealistic) conditions.

The diagrams really don’t do scanline sync justice, because I could fit only four <emulation>s (frameslices) in the space I gave myself. If you double the frameslice count from 4 to 8, which is quite doable, you halve the input lag. Nevertheless, they still illustrate some of its benefits, namely fixed input lag irrespective of polling time, and input latency comparable to a large frame delay without the high CPU requirements.
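The halving claim can be sanity-checked with a toy formula. The simplification (mine, not from the diagrams) is that the worst-case sync lag is roughly one frameslice: input that lands just after a slice starts must wait one slice before it can affect the picture.

```python
# Toy model: worst-case sync lag under scanline sync is about one
# frameslice, i.e. frame time divided by the slice count. Host lag
# (polling, display, driver) is ignored, as in the diagrams.

FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 Hz

def scanline_sync_lag_ms(frameslices):
    """Approximate worst-case sync lag in ms for a given frameslice count."""
    return FRAME_MS / frameslices

print(round(scanline_sync_lag_ms(4), 2))  # -> 4.17
print(round(scanline_sync_lag_ms(8), 2))  # -> 2.08
```

Doubling the slice count from 4 to 8 halves the lag, exactly as described above; the cost is the workload multiplier discussed earlier.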


This is one of the topics that most interests me as I really enjoy playing Shoot em ups and side scrollers. I’m always trying to bring the latency down.

My guess is that this topic should be a priority for future builds.

I’ve recently upgraded from Win7 to Win10 (same exact hardware) and I’ve noticed RetroArch was choppy. Turns out my frame delay settings, which were flawless on Win7, need to be lowered by 1-2 ms on Win10. Anyone experience that or figure out any way to mitigate it?

i5-3570k + GTX 1070, latest drivers.

Hi everyone! Long time, no see. Just thought I’d provide a very quick update regarding Raspberry Pi 4 input lag with RetroArch. I’ve made a few quick tests with my trusty old LED-rigged controller. I used the development branch of RetroPie for these tests. My results so far are:

  • Unlike in my previous tests of the Pi 3, I could not measure worse input lag with threaded video enabled on the Pi 4.
  • The default OpenGL driver on the Pi 4 matches the Pi 3 and earlier using the Dispmanx video driver, in terms of input lag.
  • The Max swapchain images setting works as expected. A setting of 2 reduces input lag by one frame.

This is good news (particularly the input lag performance of the new open source GL driver), as it means the Pi 4 now behaves the same in terms of input lag as RetroArch does on PCs. The Pi 4 is obviously still slower than a PC, so which input-lag-reducing settings can be used depends on how well the particular game and emulator run. As always.
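As an aside on the swapchain result: here is a toy model of why that setting helps. It assumes the driver lets the frontend queue completed frames until all swapchain images are in flight (an assumption for illustration, not a measured fact about the Pi 4 driver).

```python
# Toy model of swapchain queueing lag: with N swapchain images, up to
# N - 1 completed frames can sit in the queue ahead of the one on screen.

FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 Hz

def swapchain_queue_lag_ms(images):
    """Worst-case queueing lag in ms for a given swapchain image count."""
    return (images - 1) * FRAME_MS

# Dropping from a common default of 3 images to 2 removes one queued
# frame, i.e. roughly 16.7 ms:
print(round(swapchain_queue_lag_ms(3) - swapchain_queue_lag_ms(2), 1))  # -> 16.7
```

That matches the observation above that a setting of 2 reduces input lag by one frame.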

I ran some tests with Super Mario World 2: Yoshi’s Island using snes9x2010. It appears the following settings work fine (tested both the spinning island scene and some gameplay):

  • Threaded video off
  • Max swapchain images = 2
  • Frame delay = 6

I also forced 1000 Hz polling for USB gamepads (add usbhid.jspoll=1 at end of line in /boot/cmdline.txt in Raspbian). With these settings and a good gamepad, you’ll be approximately 0.8 frames (13 ms) behind a real NES or SNES. Not including your display, of course. Given the fact that most NES and SNES games on an original console took on average 33-50 ms from button press to showing a reaction on screen, being just 13 ms behind is of course very good.

Obviously, for more demanding emulators, you’ll have to scale back the latency reducing settings accordingly.

It’s also worth mentioning that the video driver for the Pi 4 is very much a work in progress. I observed occasional distracting tearing during my tests. It mostly worked fine, though. Still, I’d consider these first tests very much preliminary.

EDIT: To avoid any confusion: These tests were run in a DRM/KMS context, so not under X.


whoa, that’s pretty surprising, but in a good way. Pretty great news all around :slight_smile: Thanks for your testing and reporting, as always!


@Brunnis Have you had a look at RetroFlag’s Classic USB Controller-J /U controller?

No ghost input, great d-pad and buttons, and as far as I can tell no added latency.

I also forced 1000 Hz polling for USB gamepads (add usbhid.jspoll=1 at end of line in /boot/cmdline.txt in Raspbian).

Interesting! Could you measure any performance degradation with this option? I wonder if it would be a ‘safe’ default in RetroPie (do you know what RetroPie defaults to?).

Note that this is not guaranteed to work. For Xbox controllers, for example, when using the kernel’s xpad driver, you need xpad.cpoll=1 for a 1 millisecond poll interval (1000 Hz).

And you need to verify by running the evhz tool:

I’ve found this information on the MiSTer wiki, though, so I’m not sure if this is something that only works there or in general. Need to test.

Update:
Nope, no effect with XInput gamepads. xpad.cpoll is a custom patch in the MiSTer kernel.


Yeah, I bought one a good while ago. It’s really nice in most ways (look, feel, 250 Hz USB polling by default), except for one important aspect: D-pad sensitivity. I noticed immediately when playing Street Fighter II that when rocking your thumb left and right, there’s a very high likelihood of performing an involuntary jump or crouch. This phenomenon is not nearly as likely to occur on my 8bitdo controllers or my original SNES Mini controllers.

I’ve not noticed any performance degradation, but I’ve not run any formal tests on it. I would guess that if there is any measurable performance impact, it would only be seen while any button/stick is being pressed. I guess there might also be some risk that certain devices don’t like being polled at 1kHz. It would be nice if this could become a new default for RetroPie, but it certainly needs thorough testing.

Good info. Thanks.

@Brunnis

I have not noticed any d-pad sensitivity issues so far.

Recently completed Super Castlevania for Snes.

Yeah, it could of course be my sample that is particularly sensitive.


This could be the same problem you are describing; Level1online mentions it in his review of both the US version and the Japanese Famicom version.

I myself have two J versions and don’t have this problem.

So is there any chance waterbox save states could be implemented in the MAME core to eliminate all input lag?


Hello! I’m doing some measurements right now (RetroArch, Windows 10, LCD…) with a custom LED SNES Classic controller + raphnet adapter, my test ROM (NES) and Xperia 960 fps HD video. I will give you my conclusions later (translated from French with Google, sorry).


Hello! I made a short comparison video (only with favorable input timing). Details in descriptions and pinned comment.

https://youtu.be/NyrcPyZtfMg

and another to illustrate the concept of favorable and unfavorable input timing

https://youtu.be/YOMIV6PAyR0

Well of course RetroArch is going to be the fastest with Run Ahead = 1.

Question is, how fast is it without it and only GPU sync ON?


In theory, without Run Ahead it’s +16 Xperia frames (at 960 fps, roughly one 60 Hz frame), without Frame Delay +12.5, and without Hard GPU Sync +32.

Vulkan with “max swapchain images” set to 2 should provide the same latency as Hard GPU Sync 0 in GL, without the increased CPU cost (I think it was just under 20%).

It would be interesting to see if that’s working. :smirk: