An input lag investigation

nfp0 · 28 March 2022 11:43

I’m trying to compare the latency of Vulkan Wayland VS KMS on RetroArch, but I need to set a specific frequency on KMS for consistency purposes.

I have a 4K@120Hz display, and I can boot it in 4K@60Hz mode, but RetroArch changes it to 4K@120Hz when I open it in KMS. I really would like to run it at 60Hz though, for consistent tests among other 60Hz-only setups.

Can anyone help me to force 60Hz? Sorry if this was already requested on this thread.

EDIT: Also, it seems my USB keyboard doesn’t work when I open RetroArch in KMS mode. I tried with the X and UDEV input drivers.

nfp0 · 28 March 2022 23:19

Even though I would still love to know how to force a certain frequency on RetroArch in KMS, I have found another way to conduct the Wayland VS KMS tests on 60Hz, but what I found out seems strange.

I’m using Wayland KDE Plasma (5.24) on Manjaro with RetroArch in fullscreen mode, and assuming direct scan-out is working correctly on RetroArch, I believe the latency should be the exact same as KMS. But that’s not what I measured.

I used my phone’s 480fps slow-motion mode to capture 10 samples each of the latency between a button press on my wired keyboard and my display. I used the exact same RetroArch configuration and hardware for the test. I’m using a Radeon RX 6800 with the AMDGPU driver. Here are my averages:

KMS: 53ms (3.2 frames at 60fps)
Wayland: 77ms (4.6 frames at 60fps)

That’s 1.4 frames of additional delay on Wayland. Where am I losing that much? Assuming direct scan-out, shouldn’t it be technically the same, or very close? Could it be that RetroArch is not engaging Kwin’s direct scan-out code, or is it an internal RetroArch issue?
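For anyone repeating the measurement, here is a minimal sketch of the conversion from slow-motion camera frames to milliseconds and to 60Hz display frames. The per-sample counts are hypothetical placeholders, not the actual samples; only the 480fps rate and the 53ms/77ms averages come from this post.

```python
CAMERA_FPS = 480   # slow-motion capture rate used for the test
DISPLAY_HZ = 60    # refresh rate the comparison is normalized to


def camera_frames_to_ms(frames: float, camera_fps: float = CAMERA_FPS) -> float:
    """Convert a count of camera frames into elapsed milliseconds."""
    return frames * 1000.0 / camera_fps


def ms_to_display_frames(ms: float, display_hz: float = DISPLAY_HZ) -> float:
    """Express a latency in milliseconds as display frames."""
    return ms * display_hz / 1000.0


# HYPOTHETICAL sample data: camera frames counted between the key press
# and the first visible change on screen, one entry per recording.
sample_counts = [24, 26, 25, 27, 25, 24, 26, 25, 26, 26]

average_ms = sum(camera_frames_to_ms(c) for c in sample_counts) / len(sample_counts)
print(f"average: {average_ms:.1f} ms = {ms_to_display_frames(average_ms):.1f} frames")

# Averages reported in the post, expressed in 60Hz frames.
for name, ms in (("KMS", 53), ("Wayland", 77)):
    print(f"{name}: {ms} ms = {ms_to_display_frames(ms):.1f} frames")
```

At 480fps each camera frame covers about 2.08ms, so a single sample cannot resolve anything finer than that; averaging several samples helps.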

I’m sorry if this was already covered in the thread sometime before, but a search turned up nothing specific to my case. Please link me up if that is the case.

RealNC · 28 March 2022 23:36

Wayland has lag. It’s just not made for gaming. It’s made for watching video and browsing the web. The people who develop it are not gamers, don’t understand gaming nor really care about it much. Gaming is a second class citizen there.

If you want low latency, use X11 and disable compositing. This will basically give you the equivalent of Windows 10 with DWM disabled (like when using fullscreen, where Windows turns off DWM.) The DWM equivalent of Wayland cannot be disabled. It’s always there, adding lag.

nfp0 · 29 March 2022 13:11

With the exception of allowing tearing (which is in development for Wayland), both X11 and Wayland have the exact same input lag, assuming direct scan-out is working correctly.

Xaver himself wrote an excellent article explaining and testing the input lag of X11 vs Wayland, even going as far as adding XWayland to the mix. I recommend you check it out:

As you can see, Wayland, and even XWayland have the exact same latency as X11.

You’ll probably notice that FIFO latency seems higher on Wayland, but that’s exactly what I was talking about previously. As Xaver points out on the notes below the latency tables, at the time of testing, dmabuf was not implemented yet:

due to increased buffer bloat (the queue for presentation being one frame bigger) the latency with fifo is higher by one frame than on uncomposited X. This should disappear once all the necessary parts for dmabuf feedback are implemented in Mesa

The thing is, I believe dmabuf is now already implemented in Mesa and Kwin, so direct scan-out should be working with RetroArch and should have the exact same latency as uncomposited X11, or even better. Or am I wrong?

I conducted X11 measurements to corroborate what I’m talking about. Here are the averages together with my previous results:

X composited: 95ms (5.7 frames at 60fps)
X uncomposited: 65ms (3.9 frames at 60fps)
KMS: 53ms (3.2 frames at 60fps)
Wayland: 77ms (4.6 frames at 60fps)

So there’s a difference of 12ms between uncomposited X and Wayland on RetroArch, which might be that missing frame from direct scan-out. I don’t know why uncomposited X can’t match the low latency of KMS though (53ms). Might be an error on my part, or some driver shenanigans.

Does anyone here versed in how RetroArch works internally know if there is anything missing to trigger direct scan-out? Also, does RetroArch use FIFO or Mailbox?

Or does anyone know how to verify that an app is triggering direct scan-out on Wayland?

RealNC · 29 March 2022 16:53

If the applications need to “trigger” direct scanout, then that’s bad.

Also, your reply is kinda funny. It begins by saying Wayland is just as low latency as X11, and then shows it’s not.

nfp0 · 29 March 2022 17:45

From my understanding, the application doesn’t have to do anything other than being fullscreen. It’s a great feature.

It seems to have been merged by Xaver in time for KDE Plasma 5.22, so I should have it (I’m on 5.24) and RetroArch should be benefiting from it.

You missed the point. My latency under Wayland being higher than under X11 is precisely what I’m trying to figure out. It should have the same values under direct scan-out, as demonstrated by Xaver himself, but clearly something is not working correctly on my end. I was hoping another Wayland user on this thread could help me figure it out.

I’m also trying to understand how RetroArch’s Vsync works in more detail.

e-tank · 29 March 2022 19:42

that seemed to be the case at one point but the devs are taking it seriously now thanks in part to valve and the steam deck

with vsync enabled in RA it’ll use fifo, which is usually what you want for applications designed to run at a fixed frame rate, like emulators and retro/retro-like games, which is the kind of stuff LR was primarily designed for. not saying there aren’t any, but I’m not aware of any cores rn that are designed to run in and take advantage of the benefits mailbox mode can offer. anyway…

as a reference for your tests, with these settings:

video_vsync = "true"
audio_sync = "false"
vrr_runloop_enable = "false"
run_ahead_enabled = "false"
video_threaded = "false"
video_frame_delay = "0"
video_frame_delay_auto = "false"
video_max_swapchain_images = "2"
video_hard_sync = "true"
video_hard_sync_frames = "0"

and with any of these drivers:

video_driver = "vulkan"
video_driver = "glcore"
video_driver = "gl"

and on a 60 hz display, if you were to load up super mario bros 1 (in fceumm, nestopia, or mesen) the theoretical average time to see a response on mario from any given input when he’s near the bottom of the screen should be about 58.33 ms (3.5 frames * 1000ms / 60fps) EDIT: assuming a crt display, so most likely this + some (hopefully small) ms display lag

it would be bad, but i’m pretty sure that’s not the case and i just assume they meant via their compositor, which brings me to…

at this time, i do not. but might i suggest you try running RA in weston instead of Kwin/KDE to compare, jic. (just install weston via your package manager and you should see an option for it somewhere on the login manager screen)

nfp0 · 29 March 2022 18:26

Got it. Makes sense.

Why disable audio sync? Would this cause any kind of video delay?

I believe you mean video_frame_delay = "0" and video_frame_delay_auto = "false"

Alright, I’ll give it a try! This is the best case scenario, right? With no frames lost in any kind of compositing (except for Mario’s own known 1 frame of delay).

I’ve been conducting my tests on the 240p Test Suite on the Horiz/Vert Stripes test pattern emulated on the 2014 bsnes core. The picture reacts with zero frames of delay, so it’s good to count input lag.

I assume that ~58.33ms best-case scenario is currently only possible on Windows exclusive fullscreen, or Linux KMS, correct?

Yeah, I believe this is all the compositor’s responsibility. As seen on the merge request I linked on my previous post here.

Alright, I’ll give that a try too. I’ve never used Weston though. How do I start it up after I start the session?

Thanks a lot for the help!

EDIT: Don’t mind me. I’ve figured Weston out.

RealNC · 29 March 2022 19:05

To quote from https://invent.kde.org/plasma/kwin/-/merge_requests/502:

I wouldn’t worry too much about additional input lag in the low single digit range.

So don’t worry. I on the other hand do worry, so I use X11. If some day Wayland becomes capable of zero overhead output, I’ll consider it and recommend it. For now, I can’t.

e-tank · 29 March 2022 19:38

it shouldn’t under normal circumstances and if configured properly (which it typically is) but there’s a lot of variables to that. on the other hand disabling it definitely won’t add any lag, so it just removes something completely from the equation for testing purposes

yep ty, will fix

correct, it’s the theoretical best case for those particular settings given above. it’s technically possible for a compositing window manager to keep up and match but i haven’t seen any results to reflect this and am not surprised i haven’t yet. though the numbers in the article u linked to look promising.

i did forget to add that the estimate doesn’t take into account monitor/display lag, so unless you’re using a crt or some other stupidly low response time display it should be that + some (hopefully small) amount of ms. we’re talking + something in the low single digits for a decent gaming monitor.

if you were to measure smb1 on native nes hw hooked up to a crt it comes out to (on average) about 41.66 ms (2.5 frames * 1000ms/60fps. i’ve come across test results of the game that reflect this a few times now but i can’t recall where off the top of my head…) anyway, the 1 extra frame over native hw is due to a fundamental difference in how frames are generated and pushed out in modern applications on frame buffered display hw compared to old frame buffer-less raster on the fly hw like the nes. that’s where runahead and frame delay come into play, as a way of chipping away at or in some cases exceeding that limit. there are other ways too, such as beam chasing that the blur buster ppl came up with, but that’s not something applicable to LR/RA.

pretty much, i’ve seen results that back up the theoretical numbers on both a few times now, though as mentioned above ideally uncomposited x and wayland would match given enough performance overhead, hopefully they will in time

nfp0 · 29 March 2022 20:06

I’ve done additional tests using your setup (which is pretty much what I was already doing). The only difference is that I kept using the 240p Test Suite methodology as before, for zero frames of in-game latency. I always measure my values on the 2nd half of the screen.
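To make sure a test config really matches the reference settings posted above, a small checker along these lines could help. It is only a sketch: the key names and values are the ones listed earlier in the thread, while the parser is a simplified assumption about the `key = "value"` layout and is not RetroArch’s own config loader.

```python
# Reference values as posted earlier in the thread.
REFERENCE_SETTINGS = {
    "video_vsync": "true",
    "audio_sync": "false",
    "vrr_runloop_enable": "false",
    "run_ahead_enabled": "false",
    "video_threaded": "false",
    "video_frame_delay": "0",
    "video_frame_delay_auto": "false",
    "video_max_swapchain_images": "2",
    "video_hard_sync": "true",
    "video_hard_sync_frames": "0",
}


def parse_cfg(text: str) -> dict:
    """Parse simple `key = "value"` lines; skips blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def find_mismatches(cfg_text: str, reference: dict = REFERENCE_SETTINGS) -> dict:
    """Return {key: (expected, actual)} for keys that differ or are missing."""
    actual = parse_cfg(cfg_text)
    return {
        key: (expected, actual.get(key))
        for key, expected in reference.items()
        if actual.get(key) != expected
    }


# Illustrative input only; in practice pass the contents of your retroarch.cfg.
sample_cfg = 'video_vsync = "true"\nvideo_max_swapchain_images = "3"\n'
for key, (expected, actual) in find_mismatches(sample_cfg).items():
    print(f"{key}: expected {expected!r}, found {actual!r}")
```

Missing keys are reported with `None` as the found value, so a partial config is flagged rather than silently accepted.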

I measured averages for Windows on exclusive full-screen and Weston. So, putting it all together now we have, in order (with frames at 60fps):

95ms (5.7 frames) on Composited X
77ms (4.6 frames) on Kwin Wayland
75ms (4.5 frames) on Weston
65ms (3.9 frames) on Uncomposited X
53ms (3.2 frames) on KMS
51ms (3.0 frames) on Windows exclusive-fullscreen

Note: All of this was measured on Vulkan with max swapchain = 2.
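To make the gaps easier to read, this short sketch expresses each average as extra latency over KMS, in milliseconds and in 60Hz frames. The numbers are the averages listed above; picking KMS as the baseline is just a choice for illustration.

```python
FRAME_MS = 1000.0 / 60.0  # duration of one frame at 60Hz

# Average latencies from the list above, in milliseconds.
results_ms = {
    "Composited X": 95,
    "Kwin Wayland": 77,
    "Weston": 75,
    "Uncomposited X": 65,
    "KMS": 53,
    "Windows exclusive-fullscreen": 51,
}

BASELINE_MS = results_ms["KMS"]


def delta_frames(ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """Extra latency over the baseline, expressed in 60Hz frames."""
    return (ms - baseline_ms) / FRAME_MS


for name, ms in sorted(results_ms.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {ms - BASELINE_MS:+4d} ms  {delta_frames(ms):+.1f} frames")
```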

Noise aside, we can conclude then that KMS and Windows have the exact same latency, which is not surprising.

We can also conclude that Wayland on Kwin has the exact same latency as Weston. Not sure what to make of this, as I don’t know how Weston’s presentation queue works.

I also don’t understand why Uncomposited X has more input lag than KMS, as the queue should be the same size. Same for Wayland. All 3 should have the same latency.

Sure thing. I’ve made all the measurements on the same monitor, so that’s a fixed variable. My only purpose was direct comparison. This is a pretty old monitor, so expect some 5~10ms of delay from it, but that’s irrelevant here.

Yeah, I’m aware of the other additional methods, and the additional delay on our modern buffered hardware. But thank you for the thorough explanation!

And it seems my numbers also back that up.

If we remove the 5~10ms latency of my monitor and add the known 1 frame delay in Mario (16.6ms) to the KMS and Windows latency, we kinda reach the ~58.33ms best case number you talked about.
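A rough check of that arithmetic, as a sketch. The 5~10ms monitor lag is my own estimate and not a measured value, so the result is a range rather than a single number.

```python
FRAME_MS = 1000.0 / 60.0          # one frame at 60Hz
BEST_CASE_MS = 3.5 * FRAME_MS     # theoretical figure quoted in the thread


def adjusted_range(measured_ms, monitor_lag_ms=(5.0, 10.0), game_frames=1.0):
    """Subtract the ESTIMATED monitor lag and add the game's own delay.

    Returns (low, high) because the monitor lag is only known as a range.
    """
    min_lag, max_lag = monitor_lag_ms
    game_ms = game_frames * FRAME_MS
    return measured_ms - max_lag + game_ms, measured_ms - min_lag + game_ms


print(f"theoretical best case: {BEST_CASE_MS:.2f} ms")
for name, measured in (("KMS", 53.0), ("Windows exclusive-fullscreen", 51.0)):
    low, high = adjusted_range(measured)
    print(f"{name}: {low:.1f} to {high:.1f} ms")
```

With these assumptions the Windows range brackets the best-case figure and the KMS range starts slightly above it, which is why "kinda" is the right word.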

Now the real question is, assuming all is working as expected, where are Uncomposited X and Wayland losing that extra frame? I believe performance is not the issue here. There must be an additional frame in a queue somewhere. I think I’ll take this up to the Kwin developers to try and figure out if direct scan-out is working or not.

There’s one additional test I can conduct though. On Wayland, a windowed RetroArch instance should, in theory, have one additional frame of latency VS fullscreen. If the value turns out the same, then it’s a guarantee direct scan-out is not working. But I’ll do it later. Today I’m tired of all this measuring.

nfp0 · 29 March 2022 20:12

Agree, I wouldn’t. But these are double-digit differences though. There’s a 24ms difference between Wayland and KMS on my system, when they should be the same. It’s a pretty big disparity.

Either something is wrong, or maybe I misinterpreted what direct scan-out is supposed to do.

e-tank · 30 March 2022 02:25

interesting results for sure, tho i think the next thing to try out would be to compile retroarch with debugging enabled in order to verify that you’re actually getting 2 images in your vulkan swapchain on Kwin or Weston. max_swapchain_images is just what RA will request and not what it’s guaranteed to get. the following will compile the build:

./configure --enable-debug
make

with this build (you can just run it straight from the directory you built it in for our purposes) if you specify either --verbose on the command line or log_verbosity = "true" in your config it should tell you what you need to know

one caveat, i had issues with the debug build of RA crashing on me when trying to create a suitable vulkan context, and in order to fix this issue i had to resort to commenting out all the code in between the 3 #ifdef VULKAN_DEBUG sections in the following function in the following file:

gfx/common/vulkan_common.c:1840:bool vulkan_context_init(gfx_ctx_vulkan_data_t *vk, enum vulkan_wsi_type type)

other than that you may also want to look into trying the alternate amd vulkan driver from what you’re currently using. there’s 2 for linux, radv and amdvlk (i believe the former is what valve has gone with on their hardware) and you can find more info about that here: https://wiki.archlinux.org/title/Vulkan

after writing all the above i had completely forgotten until rn that one of the main RA devs wrote a blog post a while back about his experiences with vulkan on various surface types and driver stacks: https://themaister.net/blog/2018/09/09/the-state-of-window-system-integration-wsi-in-vulkan-for-retro-emulators/ see the section on mesa - wayland - linux, he mentions being provided w/4 images in the swapchain for fifo mode and yet only 3 were ever used O_o ya, looks like this stuff gets real hairy, unfortunately…

nfp0 · 30 March 2022 09:49

I checked that post from Themaister. Very interesting read! Indeed if it is using 3 swap images instead of the 2 requested, that would explain the additional frame of latency outside of KMS.

I could swear the release build of RetroArch outputted the information about the number of swapchain images with verbose logging enabled, but I can’t find it anymore. Has anything in logging changed recently? Or am I remembering the verbose output of the Windows version?

I’m using radv. But I can give amdvlk a try too.

nfp0 · 30 March 2022 11:03

Aaaaand here we are. On Wayland I got:

[INFO] [Vulkan]: Using fences for WSI acquire.
[INFO] [Vulkan]: Using GPU: "AMD RADV SIENNA_CICHLID".
[INFO] [Vulkan]: Queue family 0 supports 1 sub-queues.
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.

So it seems it’s using 4 images (or 3 going by what Themaister said). This is reeeally bad for latency.
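If you want to pull these values out of a long verbose log instead of scrolling for them, a sketch like the following works on the line format shown above. The regular expressions are written against these exact log lines, so treat them as an assumption that may break if the wording changes between RetroArch versions.

```python
import re

# Patterns matched against the verbose log lines quoted in this post.
IMAGES_RE = re.compile(r"\[Vulkan\]: Got (\d+) swapchain images\.")
MODE_RE = re.compile(r"\[Vulkan\]: Creating swapchain with present mode: (\d+)")


def summarize(log_text: str) -> dict:
    """Return the last reported swapchain image count and present mode.

    The last match is used because a swapchain can be recreated during a run.
    Values are None when the log has no matching line.
    """
    images = IMAGES_RE.findall(log_text)
    modes = MODE_RE.findall(log_text)
    return {
        "images": int(images[-1]) if images else None,
        "present_mode": int(modes[-1]) if modes else None,
    }


sample_log = """\
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.
"""

print(summarize(sample_log))
```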

Meanwhile I also tested Weston, KMS and X with the following results:

Weston:

[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 4 swapchain images.

KMS:

[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 2 swapchain images.

X:

[INFO] [Vulkan]: Swapchain supports present mode: 0.
[INFO] [Vulkan]: Swapchain supports present mode: 1.
[INFO] [Vulkan]: Swapchain supports present mode: 2.
[INFO] [Vulkan]: Swapchain supports present mode: 3.
[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 3 swapchain images.

Not sure what each swapchain present mode represents, but it seems RetroArch is always requesting mode 2.
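For reference, the numbers line up with the `VkPresentModeKHR` enum values from the Vulkan specification, assuming RetroArch logs the raw enum value (I have not confirmed that in its source). A small lookup sketch:

```python
from enum import IntEnum


class PresentMode(IntEnum):
    """Core VkPresentModeKHR values as defined by the Vulkan specification."""
    IMMEDIATE = 0      # no vsync, may tear
    MAILBOX = 1        # vsync, newest queued frame replaces the pending one
    FIFO = 2           # vsync, frames are queued and shown in order
    FIFO_RELAXED = 3   # like FIFO, but may tear when a frame arrives late


def describe(modes):
    """Translate a list of logged mode numbers into enum names."""
    return [PresentMode(m).name for m in modes]


# Supported modes as they appear in the logs above.
logged = {
    "Kwin Wayland": [1, 2],
    "Weston": [1, 2],
    "KMS": [2],
    "X": [0, 1, 2, 3],
}

for platform, modes in logged.items():
    print(f"{platform}: {', '.join(describe(modes))}")

print("requested:", PresentMode(2).name)
```

Under that assumption, mode 2 is FIFO, which agrees with what was said earlier in the thread about RA using fifo when vsync is enabled.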

Summing it all up, the number of swapchain images is, in order:

4 on Kwin Wayland
4 on Weston
3 on X
2 on KMS

This all matches up with the latency numbers I measured on my other post. Indeed it seems RetroArch is not able to always get the desired 2 swapchain images.
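One way to put a number on "matches up" is a simple least-squares fit of latency against swapchain image count. This is only a sketch over four data points, and it assumes the X log above corresponds to the uncomposited X measurement, so read the slope as a rough indication and not a precise cost per image.

```python
FRAME_MS = 1000.0 / 60.0  # one frame at 60Hz

# (platform, swapchain images from the logs, average latency in ms).
# ASSUMPTION: the X entry pairs the 3-image log with the uncomposited X figure.
observations = [
    ("Kwin Wayland", 4, 77.0),
    ("Weston", 4, 75.0),
    ("Uncomposited X", 3, 65.0),
    ("KMS", 2, 53.0),
]


def least_squares(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares([(images, ms) for _, images, ms in observations])
print(f"{slope:.1f} ms per extra swapchain image "
      f"({slope / FRAME_MS:.2f} frames at 60Hz)")
```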

What’s the path forward from here? From Themaister’s post, I assume RetroArch is working correctly and always requesting 2 images. Then whose door should we knock? Mesa, or the AMD drivers?

RealNC · 30 March 2022 16:28

Hm. On X11 with the proprietary nvidia driver, it seems to be fine:

[INFO] [Vulkan]: Creating swapchain with present mode: 2
[INFO] [Vulkan]: Using swapchain size 2560x1440.
[INFO] [Vulkan]: Got 2 swapchain images.

I don’t have Wayland installed anymore, but I’ll reinstall it just to see what happens there.

Have you considered using the glcore retroarch driver instead? Maybe it helps.

RealNC · 30 March 2022 16:58

I now tested Wayland. RA is only able to configure 3 swapchain images. If I set it to 2, RA hangs.

nfp0 · 30 March 2022 19:29

Hmmm interesting! Seems to be driver dependent then. I’ll try amdvlk later to see if it helps.

I’ll give it a try. Is it compatible with Slang shaders?

Here it doesn’t crash, but I would rather have it crash to signal that something is wrong than run silently with an incorrect number of swap images.

Thank you for the help by testing on your Nvidia!

RealNC · 30 March 2022 20:13

Yes. In fact, it only supports Slang. Unlike the “gl” and “gl1” drivers, “glcore” requires modern OpenGL support by the GPU driver. If your GPU supports Vulkan, then modern OpenGL support should in theory be no problem for it.

Edit:
As a side note, it turns out KWin with Wayland is still not ready. If the compositor crashes, or even just resets due to a GPU driver reset, it kills all applications. This is still listed as a showstopper:

https://community.kde.org/Plasma/Wayland_Showstoppers

Even worse, KWin locks itself to 80FPS when an application uses fullscreen (I use 120Hz for the desktop.)

It seems to me it will take a while yet until kwin+wayland is ready to be actually used.

nfp0 · 30 March 2022 20:57

I have multiple screens at different refresh-rates and I game with VRR, so Wayland is basically a necessity for me. And input lag under VRR is the same as X, so that is a non-issue.

But yeah, there’s still quite a few showstoppers to make it stable. That list used to be huge. It is improving at an astonishing rate. I tried it back in 2021 and it was borderline unusable on KDE. Now I use it daily for work and games and it rarely causes me any issues.

Fedora started shipping it by default too. A bit premature IMO, but in the end it helped speed things up.

Might be an Nvidia specific issue. I have a 180Hz screen and all apps and games run at 180Hz in fullscreen, RetroArch included (before loading a core, ofc).

No idea why that’s happening to you, but I’ve heard Nvidia’s drivers are terrible on Wayland. They’re taking very long to become compatible. Intel and AMD drivers are miles ahead.

Unfortunately I can’t recommend Wayland to anyone using Nvidia for the time being.