An input lag investigation

I’m not sure about that frame delay setting.

I test it with bsnes-mercury balanced, GPU Hard Sync is ON @0.

Mario World accepts frame delay 5 maximum, is it really faster? I can’t tell for sure. Then Super Aleste can go to frame delay 10. Sometimes I feel it’s fast, sometimes the same, I really can’t tell.

In the end I put it on OFF as I’m afraid it could cause frame skipping / stuttering in different games / cores.

What about the input polling option in Settings > Input called “Poll Type Behavior”, which may help when set to “late” vs “early” or “normal”? Did anyone get any numbers for that? I think the fix is great, but I still think there is a need for a nice spreadsheet detailing all the various latency-related settings and how many frames they save (or add!).

Does V-Sync still add one frame of latency? I remember testing mGBA against real hardware (GBA SP and NDS): I scrolled through the settings in the Pokemon and Mario main menus with the d-pad and recorded with a 60 fps camera. I got a 4-frame response when scrolling on those systems, and the same in RetroArch without V-Sync. With V-Sync enabled, with Hard Sync either off or on, I got one added frame, and this happened on both AMD and Nvidia cards.

However, I get better response with N64 in RA than other emulators, especially testing Doom 64 and Quake.

Also, it’s nice to keep track of SNES cores getting better response times.

Thanks everyone for testing out the code change in bsnes-mercury!

Thanks a lot for this! I will spend some time tomorrow looking at what possible side effects this change could have.

Unless I’ve misunderstood it (and I haven’t spent much time on it), this setting is not all that interesting in most cases. Setting it to “late” allows the input to be polled when the emulator requests it. As far as I can see, the libretro implementation must be specifically written to utilize this. If the input is simply polled before calling the emulator main loop, the “late” setting doesn’t make a difference. The reason that this setting isn’t very interesting is:

  • In most cases, the emulator loop is constructed so that input is read almost immediately after entering the loop. This is what snes9x-next and bsnes-mercury with my fix do. Whether we poll input from the system just before or just after entering the emulator main loop will not make any meaningful difference in this case.
  • Even for emulators that poll late (such as snes9x-next without my fix), “Poll type behavior” set to “late” is not of much use unless the emulator executes rather slowly. If the emulator loop runs quickly, like in a millisecond or two, polling early or late during the loop obviously won’t make much difference (i.e. only a couple of milliseconds at most).

So, in summary, you’d need a combination of pretty slow emulation and an emulator that polls late to reap any benefits from this setting.
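A toy model, not RetroArch code, puts rough numbers on this. It assumes the core reads input at some fixed fraction of the way through its run loop, and that “late” polling removes exactly the wait between the frontend’s poll and that read. The function name and both parameters are made up for illustration.

```cpp
#include <cassert>

// Toy model: milliseconds of input age that "late" polling could remove.
// loop_ms:       how long one emulator run-loop iteration takes on the host
// read_fraction: where in the loop the core reads input (0 = start, 1 = end)
// Both parameters are hypothetical knobs for this sketch, not RetroArch settings.
double late_poll_gain_ms(double loop_ms, double read_fraction) {
    assert(loop_ms >= 0.0);
    assert(read_fraction >= 0.0 && read_fraction <= 1.0);
    // With "early" polling the state is sampled before the loop starts, so it is
    // already loop_ms * read_fraction old when the core finally looks at it.
    return loop_ms * read_fraction;
}
```

Under these assumptions, a 2 ms loop that reads input near the end (fraction 0.9) gains about 1.8 ms, roughly a tenth of a frame, while a core that reads right at the start gains nothing.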

I’ll try this with camera recording on my rig. I believe my 6700K should allow for a quite high frame delay setting.

I honestly don’t think there is much more to do in the emulator(s). I believe the 2-3 frames that it now takes with snes9x-next and bsnes-mercury is inherent to the actual console/games.

On that same theme, I’ve spent a lot of time today looking at the Nestopia source code. Long story short (yeah, it took a while to sift through the sources…): I don’t think Nestopia suffers from the 1 frame additional input lag, like the SNES emulators. Although the source code was a little hard to follow, it looks like Nestopia does the right thing, i.e.:

  1. Polls input.
  2. Kicks off emulator right at the start of VBLANK.
  3. Exits the emulator loop right when the frame has been generated.

I haven’t used the frame advance method on many games with Nestopia, but I have tested Mega Man 2 and it showed 2 frames of emulator input lag. This matches what the SNES emulators can do with my fix and my current hypothesis is that the NES games also have some inherent lag that we probably can’t get rid of.

Hi Brunnis,

Thanks again for all the research you’re doing.

Could you possibly explain again why your bsnes-mercury fix is working? With some of the explanation you gave I’m still a bit puzzled how and why it works. But I’ll try my best to word it like I see it now. Please comment if this is the correct view.

If I would take the perspective of a real SNES (for simplicity take only 240 line case), if I understand you correctly then a real SNES:

  1. polls input at first line when entering vblank (this is line 240)
  2. does all game logic in vertical blank (timewise equivalent of about 22 lines, consisting of say 2 lines frontblanking, 3 lines vsync, rest back blanking)
  3. scans out a raster that displays 240 lines (counting from 0-239) sequentially

Steps 1-3 are what we call a frame, and a real SNES takes about 16.67 ms for each frame. It does this 1-3 cycle over and over until the player turns off the console.

Now I make the step to emulation. The emulation of Step 1 to 3 above on a fairly beefy PC would take about 2 ms for each SNES frame.

The host PC emulating the SNES does this each frame:

  1. get input state
  2. run emulation (steps 1-3 above)
  3. get into a waiting loop (of about 14ms when emulation took 2ms) until host PC screen vertical blank / vsync is reached

This is what we call a frame in the host PC context. It does this frame generation cycle over and over until user quits emulation.

If I understand you correctly, your fix for BSNES is purely focused on Step 2 of the Host PC context. Taking that Step 2 in the Host PC context apart, we get the 1-3 cycle as mentioned for a real SNES. (The only difference is that it takes ~2 ms for each 1-3 cycle instead of the original ~16.67 ms.)

Byuu decided to program his emulation such that he gets the PC host input state at line 241, which is one line after where a real SNES gets the input state. Getting the input state at line 241 and not at 240 actually means that the game logic for that frame doesn’t take into account the host PC input state that was read at step 1. It actually takes until the next run of the game logic (one host pc frame later) for this input to be considered, thus causing the extra frame of delay in BSNES. Your fix moves the input polling event to before that line 240, to make sure game logic always gets fed the most current input state (read at step 1 of host PC -current- frame cycle).

Please let me know whether this is the correct understanding of how your fix shaves off one frame of delay in BSNES-Mercury.

P.S. Thanks again for all the informative posts you made, it really helps in getting a better understanding of how these things are related to one another.

[QUOTE=rafan;41731]Please let me know whether this is the correct understanding of how your fix shaves off one frame of delay in BSNES-Mercury.[/QUOTE]

Yes, that’s pretty much exactly what the fix does. Just to be clear, though, I just moved the loop entry/exit point to line 240 or 225 (depending on overscan setting) instead of 241. The emulator itself reads/polls the input at exactly the same place within the SNES frame as before.
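The effect of the loop entry point can be sketched with a small scanline model. It is a simplification that assumes 262 lines per frame, that the console latches input on the first vblank line, and that the picture is complete on the line just before vblank. The function name is made up and nothing here is bsnes code.

```cpp
#include <cassert>

constexpr int kLinesPerFrame = 262;  // assumed NTSC scanline count

// Returns how many host run-loop iterations pass before a finished picture
// reflects a button press that the host polled right before the first iteration.
// loop_entry_line:   scanline at which each host iteration starts (and ends)
// vblank_start_line: assumed line where the console latches input (240 or 225)
int host_frames_until_visible(int loop_entry_line, int vblank_start_line) {
    assert(loop_entry_line >= 0 && loop_entry_line < kLinesPerFrame);
    assert(vblank_start_line > 0 && vblank_start_line < kLinesPerFrame);
    const int last_visible_line = vblank_start_line - 1;
    bool latched = false;  // has the emulated console seen the press yet?
    int line = loop_entry_line;
    for (int run = 1; run <= 4; ++run) {
        bool picture_reflects_press = false;
        for (int i = 0; i < kLinesPerFrame; ++i) {
            if (line == vblank_start_line) latched = true;
            if (line == last_visible_line && latched) picture_reflects_press = true;
            line = (line + 1) % kLinesPerFrame;
        }
        if (picture_reflects_press) return run;
    }
    return -1;  // not reached with valid arguments
}
```

In this model, entering at line 241 with overscan on takes 2 host frames and entering at 240 takes 1. With overscan off, entering at 226 takes 2 and entering at 225 takes 1.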

Two new important discoveries:

No. 1

In the menus of Mega Man 2, Nestopia achieves single frame latency. This more or less proves two things:

  • That Nestopia is implemented correctly like I thought, i.e. it doesn’t suffer from additional input lag due to a misaligned main loop.
  • Two frames of latency during actual gameplay, such as in Mega Man 2, is a result of how the game is written and would be there on a real console as well.

No. 2

I just tried the frame-advance method on fceumm. I was interested in this, since I tested this emulator on RetroPie with my old camera test setup and seemed to get higher latency than Nestopia. Guess what? fceumm does indeed have one frame higher input lag than Nestopia! In the menus of Mega Man 2 it has 2 frames lag (compared to 1 with Nestopia) and in actual gameplay it has 3 frames lag (compared to 2 with Nestopia).

To be honest, I’m not particularly keen on digging into the fceumm source code as well. However, I have created an issue report (https://github.com/libretro/libretro-fceumm/issues/45) and I’m now hoping that someone else will pick this up and fix it.

That’s great work in finding the causes of latency in emulation. Also, I compared mednafen’s version of bsnes source code to the recent Brunnis fix to bsnes-mercury-libretro. Here is the Brunnis fix first to src/system/system.cpp:

    #if LAGFIX
      if(cpu.vcounter() == (ppu.overscan() == false ? 225 : 240)) scheduler.exit(Scheduler::ExitReason::FrameEvent);
    #else
      if(cpu.vcounter() == 241) scheduler.exit(Scheduler::ExitReason::FrameEvent);
    #endif

And that from mednafen:

    exit_line_counter++;

    // if(cpu.vcounter() == 241) scheduler.exit(Scheduler::FrameEvent);
    if((cpu.vcounter() == 241 && exit_line_counter > 100) || (!ppu.overscan() && cpu.vcounter() == 226)) // Input latency reduction fun.
    {
      //printf("Exit: %u ", cpu.vcounter());
      scheduler.exit(Scheduler::FrameEvent);
    }

It appears that mednafen also used overscan to determine “scheduler.exit”, but the two cpu.vcounter values should instead be decremented by 1. This would mirror the Brunnis fix (save for the condition that exit_line_counter is greater than 100). It may be worthwhile to confirm that the current mednafen changes are not adequate to fully decrease input latency and the effect of exit_line_counter on the snes demos (or a similar use of exit_line_counter to test for increased compatibility).
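The two exit conditions can be compared side by side by restating them as standalone predicates. This is only a re-expression of the snippets quoted above, not code from either emulator: the function names are invented, and how mednafen resets exit_line_counter is not shown in the quote, so it is passed in as a plain parameter.

```cpp
#include <cassert>

// Restatement of the quoted LAGFIX condition from bsnes-mercury.
bool lagfix_exit(int vcounter, bool overscan) {
    return vcounter == (overscan ? 240 : 225);
}

// Restatement of the quoted mednafen condition. exit_line_counter is supplied
// by the caller because its reset logic is not part of the quoted snippet.
bool mednafen_exit(int vcounter, bool overscan, int exit_line_counter) {
    return (vcounter == 241 && exit_line_counter > 100) ||
           (!overscan && vcounter == 226);
}

// First scanline in 0..261 where the predicate fires, or -1 if it never does.
template <typename Pred>
int first_exit_line(Pred pred) {
    for (int line = 0; line < 262; ++line) {
        if (pred(line)) return line;
    }
    return -1;
}
```

With a counter value above 100, the mednafen predicate first fires at line 226 without overscan and at 241 with overscan, one line later than the 225 and 240 used by the LAGFIX condition.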

Okay, so I’ve run a camera test on the “Frame Delay” setting. With Nestopia, I could run Mega Man 2 with a frame delay setting of 12 ms on my Core i7-6700K. If everything works as expected, input lag should reduce by 12/16.67 = 0.72 frames. And the test results are as expected (within tolerances):

Without frame delay:

Average: 4.3 Min: 3.25 Max: 5.25

With frame delay set to 12:

Average: 3.4 Min: 2.5 Max: 4.5
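The expected saving is simple arithmetic, sketched here under the assumption of an exact 60 Hz refresh (16.67 ms per frame). The helper names are made up for this example and are not RetroArch functions.

```cpp
#include <cassert>

constexpr double kFrameMs = 1000.0 / 60.0;  // ~16.67 ms, assuming 60 Hz

// Expected lag reduction, in frames, for a Frame Delay value given in ms.
double expected_frames_saved(double frame_delay_ms) {
    assert(frame_delay_ms >= 0.0 && frame_delay_ms <= kFrameMs);
    return frame_delay_ms / kFrameMs;
}

// Time left between input poll and the next vsync once the delay has elapsed.
double poll_to_vsync_ms(double frame_delay_ms) {
    assert(frame_delay_ms >= 0.0 && frame_delay_ms <= kFrameMs);
    return kFrameMs - frame_delay_ms;
}
```

For a delay of 12 ms this gives about 0.72 frames saved and about 4.67 ms from poll to vsync. The measured averages differ by 4.3 - 3.4 = 0.9 frames, which is in the same ballpark given the spread of the camera measurements.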

This obviously feels great when playing. To understand exactly how good this is and to understand how much room there actually is for improvement, let’s make a simple calculation. We’ll start with the average result (in milliseconds) and remove all the known quantities:

3.4 * 16.666… = 56.67 ms
-4 ms (average time until USB poll)
-8.33 ms (average time until emulator runs)
-4.67 ms (time until emulator finishes first loop and receives vsync. This would be 16.67 ms if the Frame Delay setting was 0, but setting it to 12 has removed 12 ms.)
-16.67 ms (time until emulator finishes second loop and receives vsync)
-11 ms (time for scanning display from top left until reaching the Mega Man character)

Time left unaccounted for: 12 ms
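The subtraction can be double-checked with a few lines of code. This is just a sketch of the arithmetic above, assuming 16.67 ms per frame; the function name is invented for the example.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

constexpr double kFrameMs = 1000.0 / 60.0;  // ~16.67 ms, assuming 60 Hz

// Subtracts every known delay (in ms) from a measured lag given in frames
// and returns what is left over, in ms.
double unaccounted_ms(double measured_frames, const std::vector<double>& known_ms) {
    assert(measured_frames >= 0.0);
    const double total_ms = measured_frames * kFrameMs;
    const double known = std::accumulate(known_ms.begin(), known_ms.end(), 0.0);
    return total_ms - known;
}
```

Calling `unaccounted_ms(3.4, {4.0, 8.33, 4.67, 16.67, 11.0})` comes out at roughly 12 ms, matching the figure above.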

Although the USB polling time could be decreased slightly by increasing the polling rate, there really isn’t that much to do about the other known quantities listed above. The remaining time could come from other small delays within the system (perhaps specifically the GPU driver/hardware). We also haven’t accounted for any delay within the HP Z24i display I’m using. Even if it’s fast, we can probably expect a couple of milliseconds between receiving a signal at the display’s input and getting detectable change of the corresponding pixels.

What about an actual NES on a CRT?

If we go by the hypothesis that the actual NES hardware also has 2 frames of delay in certain cases (such as during Mega Man 2 gameplay) and that it reads input at the beginning of VBLANK, we arrive at:

-8.33 ms (average time until input is actually read)
-16.67 ms (time until NES has finished one frame)
-12 ms (time for running through vblank again and scan out the lines until reaching the Mega Man character at the bottom of the screen)

Expected average input lag for Mega Man 2 on real NES and CRT: [B]2.2 frames[/B]

If the above calculations hold true, our emulated case using an LCD monitor is only 1.2 frames behind the real NES on a CRT. 1.2 frames translates to 20 ms. That’s actually very, very good. :)
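The same comparison as a small sketch, again assuming 16.67 ms per frame and taking the three console delays above (8.33 + 16.67 + 12 = 37 ms) at face value. The helper names are made up for the example.

```cpp
#include <cassert>

constexpr double kFrameMs = 1000.0 / 60.0;  // ~16.67 ms, assuming 60 Hz

// Converts a delay in milliseconds to frames.
double ms_to_frames(double ms) {
    assert(ms >= 0.0);
    return ms / kFrameMs;
}

// Difference, in ms, between a measured emulated lag (given in frames)
// and an estimated console lag (given in ms).
double lag_gap_ms(double emulated_frames, double console_ms) {
    return emulated_frames * kFrameMs - console_ms;
}
```

`ms_to_frames(37.0)` is about 2.22 frames, and `lag_gap_ms(3.4, 37.0)` is about 19.7 ms, which rounds to the 1.2 frames and 20 ms quoted above.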

[QUOTE=Sam33;41747]It may be worthwhile to confirm that the current mednafen changes are not adequate to fully decrease input latency and the effect of exit_line_counter on the snes demos (or a similar use of exit_line_counter to test for increased compatibility).[/QUOTE]

Thanks Sam33! I’ll see if I can have a look at that during the day.

[QUOTE=Brunnis;41746]fceumm does indeed have one frame higher input lag than Nestopia! In the menus of Mega Man 2 it has 2 frames lag (compared to 1 with Nestopia) and in actual gameplay it has 3 frames lag (compared to 2 with Nestopia).[/QUOTE]

Interesting! Currently fceumm is the default in RetroPie. A bit off topic, but can you think of any reason why we shouldn’t just switch to Nestopia as the default? I presume they both work fine on the Pi, but if Nestopia has this advantage…

The only reason I can think of is that Nestopia is slower. My tests (on the i7) indicate that fceumm runs 15-20 percent faster. This is not going to be an issue on the Pi 2 & 3, but it may cause issues with the Pi 1. Would you mind asking the question on the RetroPie forum (or as a GitHub issue) to see if any of the devs would care to comment?

The advantages of fceumm are: a little faster/lighter, better support for a handful of weird Chinese pirate mappers and better determinism (for netplay, so not really an issue here). In short: if Nestopia is full speed on RPi 1/0, it’s probably a better choice.

Brunnis keeps going with great findings! Are some devs already involved in this?

[QUOTE=xadox;41800]Brunnis keeps going with great findings! Are some devs already involved in this?[/QUOTE]

Thanks! Here’s another one: I believe I just found and fixed the lag issue in fceumm. Pull request is here: https://github.com/libretro/libretro-fceumm/pull/46

Guess I can be counted as a dev now… :P

I’ve tested the fix and it performs as expected, i.e. it removes a full frame of lag and brings fceumm up to the same level of input lag performance as Nestopia. Talk about small fix (moving one line of code one line up…).

EDIT: Repo with the fix can be found here: https://github.com/Brunnis/libretro-fceumm

EDIT: I can see that twinaphex just merged the fix into the fceumm master. Yay!

I’ve spent the better part of the day looking at bsnes-mercury and the viability of my first fix. The problem with that one was that it could break compatibility if a game were to change the overscan setting mid-frame. Apparently, no commercial software does, but still… So, I went in again and devised what I believe to be a much better solution. For the details, please see this pull request:

I’d really appreciate some feedback. The code is available in this repository:

Below are downloads to all core variants (accuracy, balanced, performance) for Win x64. I would very much appreciate if you helped test these out. If you do, please use the frame advance method to confirm the improvement.

Accuracy Balanced Performance

Cheers!

That sounds like a much safer/smarter fix. Good work, dude :)

Thanks a lot, Brunnis, that does seem very interesting and I’d love to test it out immediately! However, the links you have posted for the Win x64 DLLs don’t seem to be functional, because they require the user to type the corresponding decrypt key.

Ouch, how noobish of me… Not at home right now, but I’ll fix it as soon as I’m back.

EDIT: Links updated! Here they are again:

Accuracy Balanced Performance

Also, from Alcaro’s comment in the pull request:

Does anyone feel up to testing this and reporting back? Perhaps with screenshots?

Do we need to test any specific game with regards to the bottom scanline and overscan features?

No, I don’t think so. Just one game with overscan and one without.