An input lag investigation

How does Windows 10 (with the proper settings, windowless fullscreen, hard gpu sync, etc) fare against Linux KMS mode these days?

I’ve measured exactly the same results on Windows 10 as on Linux in KMS mode. However, it depends on the GPU drivers, so each driver really needs to be tested to know for sure. I have run into GPU drivers that performed worse, input lag wise, on both Windows and Linux. In the Windows case, it was a new AMD driver that suddenly introduced 1-2 frames of extra input lag. I reported this to AMD, but I don’t know if they ever fixed it. In the Linux case, it was a new Intel GPU driver that seemed to require so much more system resources that I had to turn down the settings in RetroArch (swapchain_images, frame delay). I think it was when I upgraded from kernel 4.8 to 4.10 that I saw this regression. I ended up rolling back to 4.8.

So, unfortunately, things are a bit volatile when it comes to input lag. In my case, I ended up just building a dedicated box, for which I confirmed low input lag through measurements, and which I intend to keep static for years (i.e. no OS/driver updates). That way I will at least know that input performance is guaranteed.


Thanks very much. I would like to see it.

I’m sorry, but you misunderstood me! The original results (i.e. 4.6 frames input lag) are from running in KMS mode. :slight_smile:

Have you already tried performing the test by compiling your own kernel and RetroArch?

It is a good idea to compile the kernel with the timer set to 1000 Hz and to tune your CPU setup. Another thing: I always compile RetroArch and its cores from GitHub with the following flags, and the same goes for the kernel.
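
For reference, the 1000 Hz timer is just a kernel build option. The relevant fragment of the kernel .config looks roughly like this (a sketch; exact option names can vary between kernel versions):

    # Kernel .config fragment (menuconfig: Processor type and features -> Timer frequency)
    # CONFIG_HZ_250 is not set
    CONFIG_HZ_1000=y
    CONFIG_HZ=1000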

So far, I don’t have any complaints about it.

PS: if you see any errors, it’s because English is not my main language.


Yes, confirmed that the performance drops, and the lag ends up being reduced by 1 frame (as expected) both in measurements and when playing NES Bucky O’Hare (which is essentially impossible to play with any lag). :slight_smile:

Note that I didn’t rewrite the swapchain code that resulted in the 2/3 confusion; that was the driver’s original author (vanfanel). I’m not sure of the intent of those changes.

Note, the max_swapchain_images fix has been merged:

The next RetroArch release (1.5.1) should have all of the stability/performance fixes. If you see further stability issues, open a bug with details.

@brunnis, andrewlxer and all: Hello again, guys! It’s been a while since I last came by this thread! :slight_smile:

Andrewlxer: thanks for your fixes to the dispmanx driver! I thought it was better left as a low-latency-only driver, so I didn’t bother fixing the stability problems and kept it “simpler” because of that. For other needs I don’t share (prioritizing performance over latency has no place in retro gaming, for me), I thought the GLES driver was enough.

Please note that I always use max_swapchain=2, the dispmanx video driver, plain ALSA audio, and linuxraw for the joystick. I don’t have the means to measure input lag, so I chose this fixed set of configuration options (for which we know what input lag we get, thanks to Brunnis and his VERY interesting experiments) and measure performance.
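
For reference, that fixed setup corresponds roughly to the following retroarch.cfg entries (key names quoted from memory, so double-check them against your own config):

    # retroarch.cfg -- low-latency setup on the Pi (sketch; verify the key names)
    video_driver = "dispmanx"
    video_max_swapchain_images = "2"
    audio_driver = "alsa"
    input_joypad_driver = "linuxraw"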

Someone mentioned that setting the kernel timer to 1000 Hz could give some results: well, at least on a system with no CPU usage apart from RetroArch, it makes no difference to performance. I have been experimenting with CPU isolation, leaving one or two cores of the Pi3 just for RA, but again, since no other processes are using the CPU, that doesn’t make any difference either. Realtime RR scheduling doesn’t make any difference either. So, on a system dedicated to RA with no other non-kernel threads running, experiments that involve rebuilding the kernel with custom configuration options are not worth it.
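
For anyone curious, those experiments amount to something like the following (a rough sketch; the isolated core numbers and the RR priority are arbitrary examples):

    # /boot/cmdline.txt (appended to the existing line): keep cores 2 and 3 away from the general scheduler
    isolcpus=2,3

    # Then pin RetroArch to the isolated cores and give it round-robin realtime scheduling
    taskset -c 2,3 chrt --rr 50 retroarch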

However, I recently got an interesting patch merged into the Raspberry Pi kernel by the Pi kernel guys (the 4.9 branch, which is the branch you get when you do rpi-update):

This patch enables custom polling frequencies for joysticks: passing jspoll=2 in cmdline.txt gives a 500 Hz polling rate for the joystick. Not bad. It can be verified using evhz, a very simple program found here:
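
In case it helps, usage looks roughly like this (a sketch; I’m assuming the parameter is exposed through the usbhid module, analogous to the existing mousepoll/kbpoll parameters, and that evhz is built from its single C file; a 2 ms poll interval equals 500 Hz):

    # /boot/cmdline.txt -- append to the existing single line (do not add a line break)
    usbhid.jspoll=2

    # Verify the polling rate with evhz: build it, run as root, then move the stick / press buttons
    gcc -o evhz evhz.c
    sudo ./evhz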

So, Brunnis, could you do a dispmanx input lag test with this, please?


Hey Brunnis, awesome work! Input lag has long been a concern of mine, but it looks like the issue of input lag in emulators has essentially been solved! I was excited enough by the results of your experiments that I had to come out of lurking. A couple of questions:

I don’t think you mentioned if you set hard GPU sync frames to 0 or 1 in your experiments. I assume 0?

You mentioned the Apollo Lake NUCs as maybe being good for a low end, low latency machine. However, I’ve been unable to set hard gpu sync to 0 frames without it causing unbearable slowdown on my NUC6CAYS running Windows 10.

What is the difference in input lag, if any, between a setting of 0 and a setting of 1? In your opinion, is it worth investing in more powerful hardware to run hard gpu sync at 0 instead of 1? Also, would x-less Linux on the same machine have less input lag? Thanks again for your work :smiley:

@vanfanel Sorry for the slow response, vanfanel. That’s great work getting this into the RPi kernel! I’ll try to test this, but it might be a while due to other things going on in my life at the moment. 500 Hz vs default 125 Hz should shave off another 3 ms (0.18 frames) from the average input lag and 6 ms (0.36 frames) at most. Every little bit counts.

@Nesguy Thanks! In my Windows tests (which is where Hard GPU Sync applies), I’ve always used a Hard GPU Sync Frames setting of 0, as you assumed. I haven’t tested with a setting of 1, but it “should” add a full frame period worth of input lag, i.e. 16.7 ms. I would personally invest in a machine that can handle Hard GPU Sync Frames = 0, but if the system is otherwise setup correctly and you have a very low input lag screen, it may not be worth it.

Regarding Apollo Lake, I made a quick test with Windows 10 on my Asrock J4205-ITX based system a long time ago and I believe I got good performance with Hard GPU Sync Frames = 0. On Linux, I use Max Swapchain Images = 2 (similar to using Hard GPU Sync), together with Frame Delay = 8 and things work great. However, I only use Nestopia and Snes9x2010.
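
For reference, those two settings live in retroarch.cfg as something like this (key names from memory; double-check against your config):

    # retroarch.cfg -- Linux DRM/KMS setup (sketch)
    video_max_swapchain_images = "2"   # roughly the KMS counterpart to Hard GPU Sync
    video_frame_delay = "8"            # very demanding; lower it if you see slowdown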

There are several things that could cause your performance issue, for example:

  • Display driver differences (compared to when I tested)
  • Power saving settings (try setting “High Performance” power plan in Windows)
  • Maybe you’re using more demanding emulators?
  • Have you changed the Frame Delay setting to anything other than its default value (0)? This is a very demanding setting.
  • The NUC6CAYS uses a slightly slower Apollo Lake variant than the J4205-ITX. It’s worth mentioning, but I don’t think that’s what causes your issues.

What about the new RawInput input driver introduced with RA 1.6? Is that based on the work you have done, Brunnis? Or does your method produce less lag?

Not based on anything I’ve done. It’s a different driver to handle input and doesn’t “conflict” with any of the stuff I’ve been doing/testing. If it does give lower input lag it will do so in addition to the stuff I’ve found. Nobody has tested the effect of this yet, though.

Hey Brunnis, it was the frame delay setting- I had it set to 5 instead of 0. Switched it back to 0 and now SNES emulators run fine at hard gpu sync 0 frames. Everything is awesome now, thanks! :smiley:

I also set the Intel built-in graphics settings to “high performance” under the power settings, just in case that makes a difference.

Has any input lag testing been done with overlays? Do you think using overlays will increase input lag?

I’d also be interested in knowing if the CRT-Pi shader introduces any input latency, since it’s the only CRT shader that will run full speed on the NUC6CAYS. I tested it using Fudoh’s 240p Test Suite manual lag test, and found no difference in latency between using the crt-pi shader vs. no shader, with an average latency of less than one frame (16ms) in both cases. Sooo… that’s pretty awesome. It’d be nice to confirm this with a more scientific test, though.

Nope, not getting any work done today!

Hi, I’m curious about two things:

1- What is the difference between the input lag of the RetroArch/libretro SNES cores and that of the original Snes9x/ZSNES emulators?

2- What about the input lag on the Pi3 when using GPIO? Is it lower than with a USB controller?

Time for another small update! I’ve just tested the impact on input lag from:

a) Shaders
b) “raw” input driver

I used the same test procedure as always (see original post in this thread), using a Core i7-6700K @ 4.4 GHz and a GTX 1080. RetroArch 1.6.0 was used and testing was performed using Windows 10 and OpenGL. 25 samples were taken for each test case.

Shaders

[Input lag below reported as number of frames at 60 FPS]

  • No shaders: 5.21 avg / 4.25 min / 6.00 max
  • crt-royale-kurozumi (Cg): 5.13 avg / 4.25 min / 6.00 max
  • crt-geom (Cg): 5.22 avg / 4.00 min / 6.25 max
  • crt-geom (GLSL): 5.08 avg / 4.00 min / 6.00 max

There was no difference at all in the amount of input lag between no shader and using shaders. The average, minimum and maximum measured input lag was the same (within measuring tolerances). This means you can use shaders without worrying about introducing extra input lag.

For another data point, I also tested the crt-aperture GLSL shader on my Pentium J4205 system running Ubuntu 16.10 in DRM/KMS mode, using the built-in Intel graphics. I measured input lag with my usual test routine and just like my tests of the other shaders on the GTX 1080 in Windows, input lag performance remained unchanged after activating the shader.

One thing to remember, though, is that running the shader passes takes additional time. In other words, the time required to generate each frame will increase. If you’re using the Frame Delay setting to reduce input lag, you will likely have to decrease the value in order for your computer/device to still be able to finish rendering each frame on time. With my i7-6700K @ 4.4 GHz and GTX 1080, I had to turn frame delay down from 12 to 8 when using the crt-royale-kurozumi shader, thereby increasing input lag by 4 ms.

So, while shaders themselves don’t add any extra input lag, the increased processing time might force you to reduce the frame delay, which will have a small impact on input lag. The good news is that you’ll know exactly how much your input lag increases, since it corresponds to the number of milliseconds of frame delay you have to remove in order to retain 60 FPS.

EDIT: There might be more to this than I initially thought. See post below by hunterk where he shows results from his own testing, clearly showing a negative impact on input lag with some shaders. I have not been able to reproduce this, despite running additional tests (this post has been updated with those additional results).

The “raw” input driver

The raw input driver was introduced in RetroArch 1.6.0 and the hope was that this driver would reduce input lag. Until today, however, no tests had been run comparing it to the default dinput driver.

Unfortunately, my tests show that the raw input driver provides zero difference in input lag. At least it’s not measurable with this test method and equipment.


By the way, on a completely unrelated note, why does the menu shader get deactivated whenever you load a shader preset? Seems strange that such a basic thing as using a shader disables the beautiful shader used for the menu background…


In my testing, some shaders very much affected latency:

CRT-Geom added a heap of latency in both RetroArch and Higan, but it doesn’t seem to be related to the number of passes. Perhaps it’s related to the number of registers being utilized, I dunno /shrug

As for the menu shader, it’s only when you apply Cg shaders, and that’s because the menu shader pipelines for the fancy ribbon, snow and bokeh all use GLSL shaders. There’s no Cg menu shader for the fancy ribbon because it uses derivatives in a way that doesn’t seem possible in Cg, while the others require drawing a fullscreen quad and the Cg pipeline lacks some stuff for that as well, IIRC.

We’re deprecating the Cg shaders anyway, so try to move to GLSL shaders when possible (the Cg shaders aren’t going anywhere and will still be available as long as Nvidia keeps offering their Cg Toolkit; they’re just not a top priority anymore). I hand-converted almost all of them, so it shouldn’t be too painful to switch.


Hmm, I’m kind of surprised that you got 3-5 extra frames of input lag with some of those shaders… I was planning to test a few more shaders initially, but didn’t really have time. Seems I’ll have to go back and at least test crt-geom or crt-lottes to see if I can replicate your results. It’s unfortunate if there’s so much variation between shaders and there’s no way of knowing the exact impact it has on input lag without testing.

Thanks for the info regarding Cg/GLSL shaders.

Crt-geom is my usual shader on many systems. No way it adds 100ms here. Some particular system/driver issue for sure.


I just finished testing the crt-geom shader. I tested both the Cg and the GLSL variants. Both perform exactly the same, input lag wise, as when running without shaders, just like the crt-royale-kurozumi shader. So, to give you some actual numbers (input lag as number of frames):

  • No shaders: 5.21 avg / 4.25 min / 6.00 max
  • crt-royale-kurozumi (Cg): 5.13 avg / 4.25 min / 6.00 max
  • crt-geom (Cg): 5.22 avg / 4.00 min / 6.25 max
  • crt-geom (GLSL): 5.08 avg / 4.00 min / 6.00 max

All of the results are well within measuring tolerances.

I can only speculate as to why you got such high input lag in your tests. Are you sure 60 FPS was maintained at all times? Although I don’t really know anything about shader pipelines, I do think it makes more sense that any shaders are run within the actual frame period. It seems strange that the GPU driver would pipeline the shader execution as indicated by your results.
