An input lag investigation

Based on the timing, I’m not sure that ‘discovery’ was truly independent. Cool to see it spread anyway.

Just quickly looking through the commit, I don’t think it’s the same. I’m guessing they had one additional frame of delay due to polling after the frame instead of before, so they would still benefit from the code changes mentioned in this thread.

Looks like your thread’s been locked on byuu’s forum, Brunnis. If anyone wants to stick around for post-game coverage, byuu’s rant continues on his Twitter.

Ethos, ethos, ethos… Who needs logos when you have twelve years of ethos…

Regarding the frame advance latency test, the libretro-test core seems to react to inputs on the very next frame. Since this core is extremely simple (it just polls input and then draws a checkerboard that scrolls in the direction pressed), it should be a good baseline to compare any emulator cores against.

This should also disprove any claims about RetroArch or libretro having a broken input system: this core responds essentially instantly under frame advance, so any extra latency seen with this method must come from the game or the emulator code.
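For reference, a core like that boils down to something like the retro_run() sketch below. This is only an illustration using the standard libretro callbacks, not the actual libretro-test source; the framebuffer and checkerboard drawing are stubbed out.

    #include <stdint.h>
    #include <stddef.h>
    #include "libretro.h"

    /* Callbacks handed to the core by the frontend via retro_set_*(). */
    static retro_input_poll_t    input_poll_cb;
    static retro_input_state_t   input_state_cb;
    static retro_video_refresh_t video_cb;

    static uint16_t framebuffer[240 * 320]; /* RGB565; drawing code omitted */
    static int scroll_x;

    void retro_run(void)
    {
       /* Poll host input at the start of the frame... */
       input_poll_cb();

       /* ...read the pad and update state immediately... */
       if (input_state_cb(0, RETRO_DEVICE_JOYPAD, 0, RETRO_DEVICE_ID_JOYPAD_RIGHT))
          scroll_x++;
       if (input_state_cb(0, RETRO_DEVICE_JOYPAD, 0, RETRO_DEVICE_ID_JOYPAD_LEFT))
          scroll_x--;

       /* ...draw the checkerboard at the new offset (omitted)... */

       /* ...and present a frame that already reflects this frame's input. */
       video_cb(framebuffer, 320, 240, 320 * sizeof(uint16_t));
    }

With this ordering, pressing a direction and advancing a single frame is enough to see the checkerboard move, which is why it makes a clean baseline.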

[QUOTE=ralphonzo;43627]Looks like your thread’s been locked on byuu’s forum, Brunnis. If anyone wants to stick around for post-game coverage, byuu’s rant continues on his Twitter .

Ethos, ethos, ethos… Who needs logos when you have so much ethos…[/QUOTE]

Took seven pages and some drama to get those 3 lines:

Gonna take a vacation now, I had to quote twice in a single message. :slight_smile:

[QUOTE=Tatsuya79;43635]Took seven pages and some drama to get those 3 lines: [/QUOTE]

I’m left with the impression that the difficulty of that conversation was a feature and not a bug.

Sorry for being slow to respond. I’m spending a week in Tuscany, so input lag is not the only thing on my mind right now. :slight_smile:

[QUOTE=UvulaBob;43126]I’ve been reading through this thread, and I’m seeing quite a bit of testing data. Unfortunately, what I don’t see is information about how to set some of the variables used during testing (such as enabling dispmanx) on my Pi.

To start, how do I enable dispmanx, and how/should I set the frame delay for the NES emulator? Thanks![/QUOTE] To enable Dispmanx, edit the main retroarch.cfg and make sure video_driver is set to “dispmanx” instead of the default “gl”. Frame delay can be set via the frame_delay setting in retroarch.cfg. To only enable it for NES, you need to use per-core configuration. If need be, I can give you a more detailed answer once I get back home.
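For anyone who wants the concrete lines, something along these lines in retroarch.cfg should do it. Treat it as a sketch: the frame delay key is shown here as video_frame_delay and the value is just an example, so double-check the key name against your build’s default retroarch.cfg, and put the frame delay in a per-core override if you only want it for NES.

    # main retroarch.cfg
    video_driver = "dispmanx"

    # in the NES core's per-core config override:
    # delay (in ms) after vsync before running the core; start low and
    # increase until you see stutter, then back off
    video_frame_delay = "6"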

[QUOTE=matt;43327]Hi. I don’t have much to add but thought you might like to know that the same reordering of polling code to fix lag was discovered independently in the Provenance emulator for tvOS and iOS:

Anyway, thanks for your work![/QUOTE] Thanks! However, as larskj alluded to, the Provenance fix is not the same as what was done for bsnes/bsnes-mercury and snes9x/snes9x-next. The libretro implementations of those cores were already polling at the right place. The issue that was fixed was in fact in the emulator code, where rearranging the emulator loop provided a further one-frame reduction in input lag. So, Provenance would actually have had a best-case input lag of 3 frames before their fix and before applying my emulator loop rearrangement. :slight_smile:

The libretro implementation of fceumm had the very same problem as the Provenance one you linked to, though.
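In libretro terms, that kind of bug and its fix look roughly like the sketch below. The callbacks are the standard libretro ones; emulate_one_frame(), update_pad_state() and the framebuffer variables are hypothetical stand-ins for the core’s own code, not the actual fceumm or Provenance sources.

    /* Problematic ordering: input is polled after the frame has already
       been emulated, so it can only affect the NEXT frame. */
    static void run_frame_poll_after(void)
    {
       emulate_one_frame();                        /* uses input read last frame   */
       video_cb(framebuffer, width, height, pitch);
       input_poll_cb();                            /* too late for the frame shown */
       update_pad_state();
    }

    /* Fixed ordering: poll first, so this frame's output already reflects
       this frame's input. */
    static void run_frame_poll_before(void)
    {
       input_poll_cb();
       update_pad_state();
       emulate_one_frame();
       video_cb(framebuffer, width, height, pitch);
    }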

[QUOTE=ralphonzo;43627]Looks like your thread’s been locked on byuu’s forum, Brunnis. If anyone wants to stick around for post-game coverage, byuu’s rant continues on his Twitter .

Ethos, ethos, ethos… Who needs logos when you have twelve years of ethos…[/QUOTE] Amazing… He still doesn’t get it. Thanks for giving it a try, though. It’s interesting to see what a firm conviction can do to one’s ability to process new information. While it’s hard to accept that he doesn’t understand the theory behind the fix, it’s even harder to accept, baffling even, that he can ignore the test results. He’s so convinced that it’s impossible to improve input lag that he’d rather believe that not only is the theory bollocks, but that all the test results are botched as well. Quite a stretch.

EDIT: I could of course make the same changes to Higan and send a working build to byuu. However, since I don’t really care that much and I’m 90% sure he’d ignore it anyway, I think I’ll pass. :stuck_out_tongue:

Yeah, I wouldn’t bother.

Has anyone else done any testing with the DRM driver on Raspberry Pi? I’ve been testing it with several different Retroarch cores and with Brunnis’ lag fix it’s the most responsive I’ve found. Unfortunately it isn’t compatible with EmulationStation and has bilinear filtering permanently enabled, but the input lag is as close to real hardware as I’ve seen even on a full PC. I’m hoping once the driver matures some more it will be included in EmulationStation.

Hi. Not sure how new this is, but just in case:

http://byuu.org/articles/latency/

I’ve been testing out input lag in Vulkan with bsnes-mercury-balanced and comparing it to OpenGL.

Test setup

  - Core i7-6700K @ stock frequencies
  - Radeon R9 390 8GB (Radeon Software 16.5.2.1 & 16.7.3, default driver settings unless otherwise noted)
  - HP Z24i monitor (1920x1200)
  - Windows 10 64-bit (Anniversary Update)
  - RetroArch nightly, August 4th 2016
  - bsnes-mercury-balanced v094 + Super Mario World 2: Yoshi’s Island

All tests were run in fullscreen mode (with windowed fullscreen mode disabled). Vsync was enabled and HW bilinear filtering disabled for all tests. GPU hard sync was enabled for the OpenGL tests. The max swapchain setting was set to 2 unless otherwise noted in the test results.

For these tests, only 10 measurements were taken per test case (as opposed to 35 in the previous testing). The test procedure was otherwise the same as described in the first post in this thread, i.e. 240 FPS camera and LED rigged controller.

Results

I’ll show two graphs. One is for the older AMD GPU driver 16.5.2.1, which is what I used in my original post. The other graph shows results from the latest GPU driver 16.7.3. The reason I’ve tested both is that they produce quite different results. The 16.7.3 results also include input lag numbers for different values of the new “Max swapchain images” setting in RetroArch. Finally, I have tested with OpenGL triple buffering enabled and disabled via the Radeon Software driver setting.

Comments

As you can see, the results are not encouraging. First of all, the latest AMD driver regresses quite dramatically when it comes to input lag. Whether using Vulkan or OpenGL, performance is measurably worse. What’s even more puzzling is that there is now also significantly worse performance when triple buffering is enabled. With the previous driver, input lag was the same whether triple buffering was enabled or disabled (which is the expected outcome when using OpenGL). With the new driver and triple buffering enabled, input lag has increased by two whole frame periods compared to the old driver!

EDIT: I just reported this issue to AMD via their online form at www.amd.com/report

As for Vulkan, input lag is consistently worse than with OpenGL + GPU hard sync. I would guess that this is driver related and not something the RetroArch team can do much about, but it would be good if a dev that’s familiar with the Vulkan backend can comment.

It’s also interesting to note that the “Max swapchain images” setting of 2 was slower than 3. I’d have expected the same or better performance. I’d like to do more thorough testing, with more measurements, to confirm this difference, though.

Vulkan stuttering issue?

Finally, an important observation which could very well compromise the Vulkan results in this post: The rendering does not work as expected in Vulkan mode. I first noticed this when playing around with Yoshi’s Island. As soon as I started moving, I noticed that scrolling was stuttery. It stuttered rapidly, seemingly a few times per second, in a quite obvious and distracting way. I then noticed the very same stuttering when scrolling around in the XMB menu. I tried changing pretty much every video setting, but the stuttering remains until I change back to OpenGL and restart RetroArch.

When analyzing the video clips from my tests, I noticed that the issue is that frames are skipped. When jumping with Mario in Yoshi’s Island, I can clearly see how, at certain points, a frame is skipped.

My system is pretty much “fresh” and uses default driver settings. Triple buffering on/off makes no difference. The stuttering appears in both RetroArch 1.3.4 and the nightly build I tested. Same behavior with Radeon Software 16.5.2.1 and 16.7.3.

I’ve seen one other forum member mention the very same issue, but I couldn’t find that post again. I’ve seen no issues on GitHub for this. I doubt that this is a software configuration issue, but I guess it could be specific to the GPU (i.e. Radeon R9 390/390X). Would be great if we could try to get to the bottom of this, because it makes Vulkan unusable for actual gaming on my setup and could also skew the test results.

I have the same doubts about the current state of Vulkan in RetroArch. Each time I play with it I feel some lag and stuttering. I wondered whether it was caused by the slang shaders I use or something, and I changed that swapchain setting… Nothing has really helped yet, and it happens with different cores.

About that byuu article: it’s interesting until it ends up with “it’s not worth it to even try to get 16 ms, think about it”. How come it’s so easy to spot when playing SNES in RetroArch?

The Vulkan vsync-off stuttering issue is known by maister, but I’m not sure he understands why it’s happening in Windows.

Also, triple buffering only works with 3 max swapchain images, and it reduces latency when vsyncing at unlimited draw rates because it allows the software to draw frames into a buffer at any time. I’ll explain that below.

As I understand it, there can be up to three images buffered in the swap chain. The following paragraphs attempt to explain, to the best of my understanding, what 1, 2, and 3 max swapchain images should equate to: vsync off, double buffering (vsync on, the bad method), and triple buffering (vsync on, the good method).

One buffer is called the front buffer. It is the image that the video card reads and sends to the display at the configured refresh rate, with each new scanout starting after the display hardware’s vblank. When max swapchain images is set to 1, this is the only buffer that is supposed to exist, but for some reason this seems broken: you should see tearing (frames not yet fully drawn), because RA is supposed to write directly to this buffer. Instead, we are seeing stuttering, meaning there is either a second swapchain image, or something is blocking frame output and delaying writes to this buffer. When vsync is off, pixels should be drawn as fast as possible directly into this buffer, so the frames sent to the display can be only partially complete, which is what causes tearing.

Another buffer is called the back buffer. It is an image that is copied to the front buffer during vblank, the period between the end of one scanout and the start of the next time the front buffer is sent to the display. When max swapchain images is set to 2, this buffer is only written to after the front buffer has been fully sent to the display (so, after scanout), and a new fully-drawn frame is then copied to the front buffer. Actually, this operation is not a copy but just a swap with the front buffer’s pointer, so it is nearly instant. Since a completed frame must sit in the back buffer waiting for the next vblank before this swap happens, double buffering should cause the most input latency. Using 2 images in the swap chain is called “double buffering”. The usage of this second buffer differs if there is a third buffer.

If there is a third buffer, which is what you get with triple buffering and 3 max swapchain images, it is an image that is copied to the second buffer whenever it’s fully drawn. Actually, this operation is also not a copy but just a pointer swap, so there is always a fully-rendered frame in the back buffer ready to be sent to the front buffer when vblank comes around. This is why triple buffering has the lowest input latency: it always presents the most recently completed frame and never blocks the program from drawing the next one. Triple buffering is a clever method of vsyncing that well-written programs use to reduce input latency, at the small cost of a 50% increase in required video memory.
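Here’s a rough, self-contained C sketch of the pointer-swap behavior I’m describing, purely to contrast double and triple buffering. It is not RetroArch or driver code; wait_for_vblank() and render_frame() are placeholder stubs.

    #include <stdint.h>

    typedef struct { uint32_t pixels[320 * 240]; } image_t;

    static image_t  images[3];
    static image_t *front = &images[0]; /* being scanned out to the display     */
    static image_t *back  = &images[1]; /* completed frame waiting for vblank   */
    static image_t *spare = &images[2]; /* third image, used only when tripling */

    static void wait_for_vblank(void)        { /* placeholder: block until scanout ends */ }
    static void render_frame(image_t *dst)   { (void)dst; /* placeholder: draw here */ }
    static void swap(image_t **a, image_t **b) { image_t *t = *a; *a = *b; *b = t; }

    /* Double buffering (max swapchain images = 2): the renderer cannot reuse
       the back buffer until vblank has passed, so a finished frame can sit
       waiting for up to a full refresh before it reaches the screen. */
    static void present_double(void)
    {
       wait_for_vblank();      /* block                                   */
       swap(&front, &back);    /* pointer swap, effectively instantaneous */
       render_frame(back);     /* only now may the next frame be drawn    */
    }

    /* Triple buffering (max swapchain images = 3): the renderer always has a
       spare image to draw into; at vblank the most recently completed frame
       is the one that ends up on screen. */
    static void present_triple(void)
    {
       render_frame(spare);    /* never blocked by scanout              */
       swap(&back, &spare);    /* newest completed frame becomes "back" */
       /* the driver swaps back <-> front at the next vblank            */
    }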

[QUOTE=Tatsuya79;44581]About that Byuu article it’s interesting until it ends up with “it’s not worth it to even try to get 16ms, think about it”. How come it’s so easy to spot when playing snes in retroarch?[/QUOTE]Alcaro was looking at current higan re: brunnis’ patch and he said it would probably only gain a few ms assuming vsync is disabled. Vsync is what makes it add/remove an entire frame of latency. Current higan disables vsync by default at the cost of video tearing (unless you have a gsync/freesync monitor). I don’t have one of those to test, but I suspect his lack of exclusive fullscreen is causing Windows, at least, to force vsync anyway as part of its compositing chain (i.e., unless Aero is disabled, which can’t even be done on Win8+).

My problem with byuu’s latency essay is that he’s not an expert on any of it, and he brings zero data in either measurements or algorithms to back up his statements. He just says a bunch of stuff and everyone is supposed to believe him because he’s been working on one emulator–which happens to be the most latent of any in its class–for 12 yrs.

If you want to know some obscure SNES timing thing, byuu’s your man, but he knows little-to-nothing about driver buffering/delays, audio mixing/resampling, etc. And that’s fine; nobody can be an expert on everything, but you shouldn’t present yourself as an expert on something when you’re very definitely not.

If Brunnis is interested in doing some more tests, I’d be curious to see if RetroArch is any faster with vsync disabled and maximum runspeed set to 1x (and/or block on audio).

Actually, my fix should not have anything to do with vsync. This is also reflected in the fact that the fix shows the same one-frame improvement when pausing the emulator and advancing frame-by-frame. Whether vsync is on or off has no effect on such a test.

Even though higan doesn’t use vsync, it still limits the framerate to 60 FPS. This means that the frame period is 16.7 ms, and polling at the end of the frame interval instead of at the beginning (as the bsnes cores did and as I believe higan still does) will incur close to an extra frame period of lag. To illustrate, before my patch bsnes did the following:

  1. Run game logic
  2. Render frame
  3. Read input
  4. Output frame

Each frame period starts by running game logic (step 1). Since we haven’t yet read the input during this frame period (step 3), the input available in step 1 is whatever was read in step 3 during the previous frame period. The amount of input lag we add this way corresponds to 16.7 ms minus the execution time of steps 1 and 2. For a decent computer which renders a frame in, say, 2 ms, the potential saving is close to a whole frame period, i.e. 14.7 ms. This is the saving my patch achieves, by rearranging the execution loop to this:

  1. Read input
  2. Run game logic
  3. Render frame
  4. Output frame
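In code form, the rearrangement amounts to nothing more than this (the function names are illustrative, not the actual bsnes/snes9x symbols):

    /* Before the patch: the game logic at the top of the frame period only
       ever sees input that was polled near the end of the PREVIOUS period. */
    static void run_frame_old(void)
    {
       run_game_logic();   /* 1. uses input polled ~14.7 ms earlier          */
       render_frame();     /* 2.                                             */
       read_input();       /* 3. polled here, but not used until next frame  */
       output_frame();     /* 4.                                             */
    }

    /* After the patch: input is polled right before the logic that consumes it. */
    static void run_frame_new(void)
    {
       read_input();       /* 1. fresh input                                 */
       run_game_logic();   /* 2.                                             */
       render_frame();     /* 3.                                             */
       output_frame();     /* 4.                                             */
    }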

I’ll see if I can have a look at that.

Ah, right, that’s pretty straightforward.

Yep. The one thing that slightly complicated it was the fact that we needed to dynamically determine when to perform step 4 (frame output) on a frame-by-frame basis, in order to handle overscan correctly.
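Roughly, the idea is that the frame is handed off as soon as the last visible scanline for that frame has been rendered, and which line that is depends on whether the game has overscan enabled. Something like the sketch below, which is only a simplified illustration with hypothetical names, not the actual patch code (224 vs. 239 are the SNES’s visible line counts without and with overscan):

    /* Called once per emulated scanline. */
    static void on_scanline_complete(int line)
    {
       int last_visible = overscan_enabled() ? 239 : 224;
       if (line == last_visible)
          output_frame();   /* step 4: hand the frame to the frontend now */
    }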

I think the stuttering and latency with Vulkan could very well be driver related. Using a GTX 1070, the Vulkan driver definitely seemed a LOT smoother and faster (less latency) to me than using OpenGL. However after updating my drivers from 368.69 -> 368.81, I started to experience the same choppy scrolling and stuttering that Brunnis was experiencing with Vulkan. After rolling back to 368.69 the stuttering is gone and back to being buttery smooth with scrolling.

[QUOTE=hunterk;44599]Alcaro was looking at current higan re: brunnis’ patch and he said it would probably only gain a few ms assuming vsync is disabled. Vsync is what makes it add/remove an entire frame of latency. Current higan disables vsync by default at the cost of video tearing (unless you have a gsync/freesync monitor). I don’t have one of those to test, but I suspect his lack of exclusive fullscreen is causing Windows, at least, to force vsync anyway as part of its compositing chain (i.e., unless Aero is disabled, which can’t even be done on Win8+).

My problem with byuu’s latency essay is that he’s not an expert on any of it, and he brings zero data in either measurements or algorithms to back up his statements. He just says a bunch of stuff and everyone is supposed to believe him because he’s been working on one emulator–which happens to be the most latent of any in its class–for 12 yrs.

If you want to know some obscure SNES timing thing, byuu’s your man, but he knows little-to-nothing about driver buffering/delays, audio mixing/resampling, etc. And that’s fine; nobody can be an expert on everything, but you shouldn’t present yourself as an expert on something when you’re very definitely not.[/QUOTE]

Byuu’s total disinterest in CRT usage contradicts the idea that he really cares about lag [and the games where it does matter], if you ask me. Ultimately, one should start lag testing with gaming sessions of emulation vs. the real thing, if the real thing is actually demanding and you know it well enough. Even the experts of fighting games and 2D shooters will tell you that this

And it’s why emulation can never come close to the responsiveness of real hardware.

is utterly false in a gaming context. They will never use a flat panel for proper sessions, though.

Have the current libretro bsnes cores (with the Brunnis fixes) been tested vs. the latest higan, under Windows 7 with Aero off? I’m a bit lost on how to enable the Brunnis fixes and whether they have anything to do with RA’s frame delay feature.

@Brunnis: I want to start by thanking you for doing these tests and confirming that the issue your patch addresses is indeed not theoretical but actually causes ~1 frame of input delay. Years ago I brought this up with a few emu authors (you can see how one of those discussions went here: https://web.archive.org/web/20160325102948/https://code.google.com/p/genplus-gx/issues/detail?id=274 , although to his credit ekeeke did eventually change his mind (ty ekeeke!); you can see the resulting patch that changed the emu’s behaviour here: https://bitbucket.org/eke/genesis-plus-gx/commits/ec554b4b702d337168dfcaf4b4a6248062e2db5b ), albeit without any concrete proof to convince people, and essentially everyone told me to piss off. I always hoped to get around to acquiring the hardware needed to do these kinds of tests, but never had the time, energy, etc. to get it done, so really, thank you so much for this. That said, your fix kind of does have to do with vsync.

The real issue here is that in order to minimize input latency, when the guest polls input the host should poll input as close as possible to (or better yet after, which we can now achieve in RetroArch thanks to the frame_delay parameter) the point where the guest machine would be in real time. In other words, ideally when the guest polls input you’d have the host sync, giving us realtime >= emutime, and then poll hardware on the host. Now, this wouldn’t be very practical to do, but by being clever we can at least minimize the amount of time between realtime and emutime when the guest polls input, which brings me to…

The reason your fix works so well is that RetroArch is designed primarily to use vsync in order to sync realtime = emutime. With your fix the guest resumes emulation at the start of its vblank instead of the start of its active frame, and since most games on the guest poll input either in vblank or shortly thereafter, you’ve effectively cut the time between when the guest polls input and when the host does by ~1 frame, since for most games emutime will only be slightly > realtime when the game/guest polls.
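In pseudo-C, the frame pacing I’m describing looks roughly like this. It is not actual RetroArch code; frame_delay_ms stands in for the frame_delay setting and the helper names are made up.

    /* One host frame, as I understand RetroArch's pacing with vsync + frame delay. */
    static void frontend_run_one_frame(void)
    {
       wait_for_vsync();          /* realtime is now aligned with the start of
                                     the guest's vblank                          */
       sleep_ms(frame_delay_ms);  /* frame_delay: push everything later in the
                                     frame period so the poll below lands closer
                                     to when the guest will actually read input  */
       poll_host_input();         /* host-side poll                              */
       core_run_one_frame();      /* guest emulates vblank + the next frame,
                                     consuming the input polled above            */
       present_frame();           /* queued for the next vsync                   */
    }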

Higan, on the other hand, syncs realtime = emutime using audio. For example, if you set the audio latency to 8 ms in higan (8 ms is the lowest my hardware can achieve in RetroArch using just audio_sync without audio issues), then once the guest has built up 8 ms worth of audio samples it uses that to have the host sync realtime = emutime. With this setup, if higan were to poll input on the host on demand, it would only ever do so at most 8 ms ahead of realtime (emutime > realtime); if this is what higan is doing, your patch would have no effect on standalone higan (not higan/bsnes in RetroArch, that’s different). However, if I understand correctly, according to byuu higan only polls input on the host once per frame, at the start of the emulated active frame. If this is true then your fix would reduce input latency in higan by some amount; I don’t think it would be a full frame’s worth on average, but I’m not sure, and I’m too tired at the moment to do the back-of-the-envelope calculation needed to give a ballpark answer…

Anyway, the frame step method you describe is meaningless without context. Yes, it can help pinpoint problems, but you also need to have a good idea of how the emulator works, particularly when it polls input and when/how it syncs realtime = emutime, and also a basic understanding of how the game handles and responds to input.

That said, I’m willing to admit I could be wrong. Regardless, I’m happy to discuss the topic further, well, when I’m not so tired and can find the time at least (I’ll try!).

Also, I’d like to comment on byuu’s article sometime, as from my understanding he’s most definitely wrong on a few issues.