An input lag investigation

nfp0 · 29 March 2022 18:26

Got it. Makes sense.

Why disable audio sync? Would this cause any kind of video delay?

I believe you mean video_frame_delay = "0" and video_frame_delay_auto = "false"

Alright, I’ll give it a try! This is the best case scenario, right? With no frames lost in any kind of compositing (except for Mario’s own known 1 frame of delay).

I’ve been conducting my tests on the 240p Test Suite on the Horiz/Vert Stripes test pattern emulated on the 2014 bsnes core. The picture reacts with zero frames of delay, so it’s good to count input lag.

I assume that ~58.33ms best-case scenario is currently only possible on Windows exclusive fullscreen or Linux KMS, correct?

Yeah, I believe this is all the compositor’s responsibility. As seen on the merge request I linked on my previous post here.

Alright, I’ll give that a try too. I’ve never used Weston though. How do I start it up after I start the session?

Thanks a lot for the help!

EDIT: Don’t mind me. I’ve figured Weston out.

RealNC · 29 March 2022 19:05

To quote from https://invent.kde.org/plasma/kwin/-/merge_requests/502:

I wouldn’t worry too much about additional input lag in the low single digit range.

So don’t worry. I on the other hand do worry, so I use X11. If some day Wayland becomes capable of zero overhead output, I’ll consider it and recommend it. For now, I can’t.

e-tank · 29 March 2022 19:38

it shouldn’t under normal circumstances and if configured properly (which it typically is) but there’s a lot of variables to that. on the other hand disabling it definitely won’t add any lag, so it just removes something completely from the equation for testing purposes

yep ty, will fix

correct, it’s the theoretical best case for those particular settings given above. it’s technically possible for a compositing window manager to keep up and match but i haven’t seen any results to reflect this and am not surprised i haven’t yet. though the numbers in the article u linked to look promising.

i did forget to add that the estimate doesn’t take into account monitor/display lag, so unless you’re using a crt or some other stupidly low response time display it should be that + some (hopefully small) amount of ms. we’re talking + something in the low single digits for a decent gaming monitor.

if you were to measure smb1 on native nes hw hooked up to a crt it comes out (on average) to just under ~41.66 ms (2.5 frames * 1000ms/60fps). i’ve come across test results of the game that reflect this a few times now but i can’t recall where off the top of my head… anyway, the 1 extra frame over native hw is due to a fundamental difference in how frames are generated and pushed out in modern applications on frame buffered display hw compared to old frame buffer-less raster on the fly hw like the nes. that’s where runahead and frame delay come into play, as a way of chipping away at or in some cases exceeding that limit. there are other ways too, such as beam chasing that the blur buster ppl came up with, but that’s not something applicable to LR/RA.
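A quick way to sanity-check the frame arithmetic is the minimal sketch below. The function name is made up for illustration, and treating the ~58.33ms best case as 3.5 frames (2.5 on native hw plus the 1 extra frame) is an assumption read off the numbers in this thread, not something measured here. Note that 2.5 frames prints as 41.67 when rounded rather than truncated.

```python
def frames_to_ms(frames: float, fps: float = 60.0) -> float:
    """Convert a number of frames to milliseconds at the given refresh rate."""
    return frames * 1000.0 / fps

# Native NES on a CRT, per the estimate above: 2.5 frames.
print(f"{frames_to_ms(2.5):.2f} ms")  # 41.67 ms

# Assumed modern frame-buffered path: one extra frame on top of that.
print(f"{frames_to_ms(3.5):.2f} ms")  # 58.33 ms
```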

pretty much, i’ve seen results that back up the theoretical numbers on both a few times now, though as mentioned above ideally uncomposited x and wayland would match given enough performance headroom, hopefully they will in time

nfp0 · 29 March 2022 20:06

I’ve done additional tests using your setup (which is pretty much what I was already doing). The only difference is that I kept using the 240p Test Suite methodology as before, for zero frames of in-game latency. I always measure my values on the 2nd half of the screen.

I measured averages for Windows on exclusive full-screen and Weston. So, putting it all together now we have, in order (with frames at 60fps):

95ms (5.7 frames) on Composited X
77ms (4.6 frames) on Kwin Wayland
75ms (4.5 frames) on Weston
65ms (3.9 frames) on Uncomposited X
53ms (3.2 frames) on KMS
51ms (3.0 frames) on Windows exclusive-fullscreen

Note: All of this was measured on Vulkan with max swapchain = 2.

Noise aside, we can conclude then that KMS and Windows have the exact same latency, which is not surprising.

We can also conclude that Wayland on Kwin has the exact same latency as Weston. Not sure what to make of this, as I don’t know how Weston’s presentation queue works.

I also don’t understand why Uncomposited X has more input lag than KMS, as the queue should be the same size. Same for Wayland. All 3 should have the same latency.

Sure thing. I’ve made all the measurements on the same monitor, so that’s a fixed variable. My only purpose was direct comparison. This is a pretty old monitor, so expect some 5~10ms of delay from it, but that’s irrelevant here.

Yeah, I’m aware of the other additional methods, and the additional delay on our modern buffered hardware. But thank you for the thorough explanation!

And it seems my numbers also back that up.

If we remove the 5~10ms latency of my monitor and add the known 1 frame delay in Mario (16.6ms) to the KMS and Windows latency, we kinda reach the ~58.33ms best case number you talked about.
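For anyone who wants to redo that estimate with their own numbers, here is a rough sketch of the same arithmetic. The helper name is hypothetical, and the 5~10ms monitor delay is only the guess given above, not a measured value.

```python
FRAME_MS = 1000.0 / 60.0  # one frame at 60fps, about 16.67ms

def estimated_game_latency(measured_ms, monitor_ms, game_frames=1):
    """Subtract the display's own delay, then add the game's internal frame delay."""
    return measured_ms - monitor_ms + game_frames * FRAME_MS

for name, measured in [("KMS", 53), ("Windows exclusive-fullscreen", 51)]:
    low = estimated_game_latency(measured, monitor_ms=10)   # slow-monitor guess
    high = estimated_game_latency(measured, monitor_ms=5)   # fast-monitor guess
    print(f"{name}: {low:.1f} to {high:.1f} ms")
# KMS: 59.7 to 64.7 ms
# Windows exclusive-fullscreen: 57.7 to 62.7 ms
```

Both ranges sit close to the ~58.33ms figure, which is about as good as hand-counted frames and a guessed monitor delay can be expected to get.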

Now the real question is, assuming all is working as expected, where are Uncomposited X and Wayland losing that extra frame? I believe performance is not the issue here. There must be an additional frame in a queue somewhere. I think I’ll take this up to the Kwin developers to try and figure out if direct scan-out is working or not.

There’s one additional test I can conduct though. On Wayland, a windowed RetroArch instance should, in theory, have one additional frame of latency VS fullscreen. If the value turns out the same, then it’s a guarantee direct scan-out is not working. But I’ll do it later. Today I’m tired of all this measuring.

nfp0 · 29 March 2022 20:12

Agree, I wouldn’t. But these are double-digit differences though. There’s a 24ms difference between Wayland and KMS on my system, when they should be the same. It’s a pretty big disparity.

Either something is wrong, or maybe I misinterpreted what direct scan-out is supposed to do.

e-tank · 30 March 2022 02:25

interesting results for sure, tho i think the next thing to try out would be to compile retroarch with debugging enabled in order to verify that you’re actually getting 2 images in your vulkan swapchain on Kwin or Weston. max_swapchain_images is just what RA will request and not what it’s guaranteed to get. the following will compile the build:

./configure --enable-debug
make

with this build (you can just run it straight from the directory you built it in for our purposes) if you specify either --verbose on the command line or log_verbosity = "true" in your config it should tell you what you need to know

one caveat, i had issues with the debug build of RA crashing on me when trying to create a suitable vulkan context, and in order to fix this issue i had to resort to commenting out all the code in between the 3 #ifdef VULKAN_DEBUG sections in the following function in the following file:

gfx/common/vulkan_common.c:1840:bool vulkan_context_init(gfx_ctx_vulkan_data_t *vk, enum vulkan_wsi_type type)

other than that you may also want to look into trying the alternate amd vulkan driver from what you’re currently using. there’s 2 for linux, radv and amdvlk (i believe the former is what valve has gone with on their hardware) and you can find more info about that here: https://wiki.archlinux.org/title/Vulkan

after writing all the above i had completely forgotten until rn that one of the main RA devs wrote a blog post a while back about his experiences with vulkan on various surface types and driver stacks: https://themaister.net/blog/2018/09/09/the-state-of-window-system-integration-wsi-in-vulkan-for-retro-emulators/ see the section on mesa - wayland - linux, he mentions being provided w/4 images in the swapchain for fifo mode and yet only 3 were ever used O_o ya, looks like this stuff gets real hairy, unfortunately…

nfp0 · 30 March 2022 09:49

I checked that post from Themaister. Very interesting read! Indeed if it is using 3 swap images instead of the 2 requested, that would explain the additional frame of latency outside of KMS.

I could swear the release build of RetroArch outputted the number of swapchain images with verbose logging enabled, but I can’t find it anymore. Has anything in logging changed recently? Or am I remembering the verbose output of the Windows version?

I’m using radv. But I can give amdvlk a try too.

nfp0 · 30 March 2022 11:03

Aaaaand here we are. On Wayland I got:

[INFO] [Vulkan]: Using fences for WSI acquire.
[INFO] [Vulkan]: Using GPU: "AMD RADV SIENNA_CICHLID".
[INFO] [Vulkan]: Queue family 0 supports 1 sub-queues.
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.

So it seems it’s using 4 images (or 3 going by what Themaister said). This is reeeally bad for latency.

Meanwhile I also tested Weston, KMS and X with the following results:

Weston:

[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.

KMS:

[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 2 swapchain images.

X:

[INFO] [Vulkan]: Swapchain supports present mode: 0.
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Swapchain supports present mode: 3.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 3 swapchain images.

Not sure what each swapchain present mode represents, but it seems RetroArch is always requesting mode 2.

Summing it all up, the number of swapchain images is, in order:

4 on Kwin Wayland
4 on Weston
3 on X
2 on KMS

This all matches up with the latency numbers I measured on my other post. Indeed it seems RetroArch is not able to always get the desired 2 swapchain images.
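If you want to check your own setup without reading the log by eye, a small script along these lines should do. It is only a sketch: the function names are made up, and it assumes the verbose log keeps the exact "Got N swapchain images." wording shown in the logs above.

```python
import re

# Matches the line RetroArch printed in the verbose logs quoted above.
IMAGES_RE = re.compile(r"\[Vulkan\]: Got (\d+) swapchain images\.")

def granted_swapchain_images(log_text):
    """Return the image count from the last matching log line, or None if absent."""
    matches = IMAGES_RE.findall(log_text)
    return int(matches[-1]) if matches else None

def check_swapchain(log_text, requested=2):
    """Compare the granted image count against what was requested."""
    granted = granted_swapchain_images(log_text)
    if granted is None:
        return "no swapchain line found (is verbose logging on?)"
    if granted > requested:
        return f"requested {requested}, got {granted}: expect extra queued frames"
    return f"got {granted} images (requested {requested})"

sample = """\
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.
"""
print(check_swapchain(sample))
# requested 2, got 4: expect extra queued frames
```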

What’s the path forward from here? From Themaister’s post, I assume RetroArch is working correctly and always requesting 2 images. Then whose door should we knock? Mesa, or the AMD drivers?

RealNC · 30 March 2022 16:28

Hm. On X11 with the proprietary nvidia driver, it seems to be fine:

[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 2 swapchain images.

I don’t have Wayland installed anymore, but I’ll reinstall it just to see what happens there.

Have you considered using the glcore retroarch driver instead? Maybe it helps.

RealNC · 30 March 2022 16:58

I now tested Wayland. RA is only able to configure 3 swapchain images. If I set it to 2, RA hangs.

nfp0 · 30 March 2022 19:29

Hmmm interesting! Seems to be driver dependent then. I’ll try amdvlk later to see if it helps.

I’ll give it a try. Is it compatible with Slang shaders?

Here it doesn’t crash, but I would rather it crash to signal something is not right than keep running and leave me unable to know it’s using an incorrect number of swap images.

Thank you for the help by testing on your Nvidia!

RealNC · 30 March 2022 20:13

Yes. In fact, it only supports Slang. Unlike the “gl” and “gl1” drivers, “glcore” requires modern OpenGL support by the GPU driver. If your GPU supports Vulkan, then modern OpenGL support should in theory be no problem for it.

Edit:
As a side note, it turns out KWin with Wayland is still not ready. If the compositor crashes, or even just resets due to a GPU driver reset, it kills all applications. This is still listed as a showstopper:

https://community.kde.org/Plasma/Wayland_Showstoppers

Even worse, KWin locks itself to 80FPS when an application uses fullscreen (I use 120Hz for the desktop.)

It seems to me it will take a while yet until kwin+wayland is ready to be actually used.

nfp0 · 30 March 2022 20:57

I have multiple screens at different refresh-rates and I game with VRR, so Wayland is basically a necessity for me. And input lag under VRR is the same as X, so that is a non-issue.

But yeah, there are still quite a few showstoppers to fix before it’s stable. That list used to be huge. It is improving at an astonishing rate. I tried it back in 2021 and it was borderline unusable on KDE. Now I use it daily for work and games and it rarely causes me any issues.

Fedora started shipping it by default too. A bit premature IMO, but in the end it helped speed things up.

Might be an Nvidia specific issue. I have a 180Hz screen and all apps and games run at 180Hz in fullscreen, RetroArch included (before loading a core, ofc).

No idea why that’s happening to you, but I’ve heard Nvidia’s drivers are terrible on Wayland. They’re taking very long to become compatible. Intel and AMD drivers are miles ahead.

Unfortunately I can’t recommend Wayland to anyone using Nvidia for the time being.

nfp0 · 30 September 2022 13:46

SUCCESS!!

At last! I achieved the lowest possible latency on Wayland! Same as on KMS and Windows exclusive fullscreen.

It seems RADV was not playing nice by supplying 4 swap images instead of the requested 2. I installed AMDVLK and now RetroArch gets the requested 2 swap images (I checked on the log). And direct scan-out also seems to be working because now I get only 54ms of latency on Wayland, the theoretical minimum on my system!

So here is the updated table:

95ms (5.7 frames) on Composited X (RADV)
77ms (4.6 frames) on Kwin Wayland (RADV)
75ms (4.5 frames) on Weston (RADV)
65ms (3.9 frames) on Uncomposited X (RADV)
54ms (3.3 frames) on Kwin Wayland (AMDVLK)
53ms (3.2 frames) on KMS (RADV)
51ms (3.0 frames) on Windows exclusive-fullscreen

It is quite possible X also benefits from the AMDVLK swapchain, as it was getting 3 images in RADV instead of the requested 2, but I’ll leave those tests for another day. I’m tired of counting literally thousands of frames by hand.

~~These values also mean direct scan-out is probably working correctly on Wayland, otherwise there would be an additional 16.6ms of latency (at 60fps).~~
EDIT: Direct scan-out has no impact here. See my next post.

Anyways, I’m happy that I can enjoy KMS levels of latency on my Wayland desktop now. Next I’ll tighten the Frame Delay setting as much as possible and I’ll call it a day.

Thanks a lot for helping me figure this out! @e-tank @RealNC

I hope this info reaches people trying to reduce latency as much as they can on their Linux PCs!

nfp0 · 8 August 2024 18:23

I did one last measurement with AMDVLK to verify whether X also benefited from the reduced swapchain images, and sure enough, it did!

Here is my (hopefully final) table:

95ms (5.7 frames) on Composited X (RADV)
87ms (5.3 frames) on Composited X (AMDVLK)
~~77ms (4.6 frames) on Kwin Wayland (RADV)~~ EDIT: Today, RADV is now as fast as AMDVLK below.
75ms (4.5 frames) on Weston (RADV)
65ms (3.9 frames) on Uncomposited X (RADV)
54ms (3.3 frames) on Uncomposited X (AMDVLK)
54ms (3.3 frames) on Kwin Wayland (AMDVLK)
53ms (3.2 frames) on KMS (RADV)
51ms (3.0 frames) on Windows exclusive-fullscreen

So, it seems Windows, KMS, Uncomposited X and Kwin Wayland (the latter two on AMDVLK) all reach the theoretical best-case latency!

If my numbers are correct, I would say that time has come already. Sure my system is just one example, and as we saw, this is very driver dependent. But it would be nice if other AMD users would confirm if their cards also have this swapchain discrepancy between RADV and AMDVLK.

Also, apologies for me insisting so much on direct scan-out. I’ve read a bit more about what it does and it does not do what I thought it would. It’s just a small optimization in the compositor. It does not change anything about the swapchain. It reduces the processing needed when an app is fullscreen, so it might indirectly help with latency if you use the Frame Delay feature, though.

e-tank · 31 March 2022 10:15

these are the vulkan presentation modes. 0 is immediate, no vsync so you’ll get tearing. 1 is mailbox, vsync’d but lets the program keep submitting images and only ever uses the last one submitted (useful for fast forwarding). 2 is fifo, vsync’d and blocks when full, which is crucial for timing in RA and other emulators & retro games. EDIT: Also, 2 is the only mode the vulkan spec requires to always be available
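To make those numbers easier to read in a log, here is a small sketch that maps them to names. The values follow the Vulkan `VkPresentModeKHR` enum; mode 3 (fifo relaxed) isn't covered in the explanation above but shows up in the X log, so it is included. The class and function names are just for illustration.

```python
import re
from enum import IntEnum

class PresentMode(IntEnum):
    IMMEDIATE = 0     # no vsync, can tear
    MAILBOX = 1       # vsync'd, only the most recently submitted image is shown
    FIFO = 2          # vsync'd, blocks when the queue is full
    FIFO_RELAXED = 3  # like FIFO, but may tear when a frame arrives late

def supported_modes(log_text):
    """List the present modes a verbose log reports as supported.

    Raises ValueError for a mode number outside the four listed above.
    """
    numbers = re.findall(r"supports present mode: (\d+)", log_text)
    return [PresentMode(int(n)) for n in numbers]

x_log = """\
[INFO] [Vulkan]: Swapchain supports present mode: 0.
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Swapchain supports present mode: 3.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
"""
print([mode.name for mode in supported_modes(x_log)])
# ['IMMEDIATE', 'MAILBOX', 'FIFO', 'FIFO_RELAXED']
```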

i’m glad u got it sorted out and were able to provide all these interesting results in the process. though i do think it would be a good idea to open an issue on the radv driver over this here: https://gitlab.freedesktop.org/mesa/mesa/-/issues

if the amdvlk driver can manage to provide true double buffering in fifo mode under a compositor then there’s really no reason why radv shouldn’t be able to either. it’s a really important feature to have for running fixed rate content with low input latency

nfp0 · 31 March 2022 10:21

Thanks for the explanation!

Yeah, I’ll search Mesa’s issues and open one if it does not exist already.

Completely agree, double-buffering is very important and I have no idea why it’s not working in RADV. Mesa’s RADV is typically superior to AMDVLK in performance in pretty much anything else though. Maybe it’s a problem specific to the RX 6000 series cards because they’re new.

nfp0 · 1 April 2022 17:39

Since AMDVLK is much less performant than RADV, even on RetroArch, I went back to testing and tried with the glcore driver with Hard GPU Sync enabled, but it gave me an average of 71ms. It’s a pretty bad value compared to my results with Vulkan with Max Swapchain = 2 unfortunately.

I really have to find out why RADV is getting 4 images instead of 2.

EDIT: I also found corruption on some handheld border shaders on AMDVLK.

nfp0 · 5 April 2022 16:51

The RADV additional latency has been identified as a problem on Mesa.

Anyone interested can follow the discussion here:

vanfanel · 13 September 2022 18:27

@nfp0 Can you please build and test this for GL on Wayland?

I have no instruments to test input lag, but this should be an improvement over simply calling eglSwapBuffers() and letting Mesa decide.