An input lag investigation

[QUOTE=bidinou;41887]About overclocking… As the RPi 3 tends to heat up a lot, overclocking might have the opposite result after some minutes or hours, since the chip will throttle if it overheats. Hmm, then again only 1 core is used, and the frame delay means even that one core isn’t at 100%.

I couldn’t live without the shaders, though (crt-pi!). Maybe I could get used to a scaler with no bilinear filtering and good scanlines.

Thanks for sharing all this :) I used to be frustrated for months / years because so many people said they noticed no lag.[/QUOTE]

I agree, CRT-pi is great on the PI3. Try editing the crt-pi.glslp file and change filter_linear0 = “true” to filter_linear0 = “false”.

This sets nearest neighbor and really sharpens the image nicely.
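For reference, that’s a one-line change in the preset file (only the parameter mentioned above is shown; the rest of crt-pi.glslp stays as shipped):

```
# crt-pi.glslp
filter_linear0 = "false"   # was "true"; selects nearest-neighbour sampling
```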

You can already set #define sharper; is that similar? (I know, I can figure it out by myself :D)

What is great is that the shader developer got the most out of the Pi while maintaining 60 fps.

I’m back from several days away. I tested the new bsnes code and I see no problem in games with overscan.

Those I found are few:

- The Blues Brothers
- Dragon Quest I & II
- Dragon Quest V

(I checked the 1st and last scanlines for any problem, took some pics: no difference with standard bsnes.)

[QUOTE=Tatsuya79;41937]I’m back from several days away. I tested the new bsnes code and I see no problem in games with overscan.

Those I found are few:

- The Blues Brothers
- Dragon Quest I & II
- Dragon Quest V

(I checked the 1st and last scanlines for any problem, took some pics: no difference with standard bsnes.)[/QUOTE] Great, thanks! I’ll reference your post in the pull requests for both bsnes and bsnes-mercury. Still waiting to hear back from Alcaro after having implemented his suggestions.

[QUOTE=bidinou;41913]Can you set the audio latency significantly lower with JACK enabled (as it can already be lowered with ALSA)? Is the tradeoff in terms of CPU time worth it?

Edit: here, with ALSA + a QAudio DAC+ sound addon, lowering the audio delay makes it necessary to reduce the video_frame_delay setting. So the right tradeoff between input and sound delays has to be figured out. No overclocking here. Methinks input latency prevails.[/QUOTE]

JACK seems to bog the emulator down too much even at the default 64ms audio latency. I’ll stick with ALSA for now. How low can you get the latency with the DAC? I’m currently running ALSA at 32ms.

[QUOTE=vanfanel;41198]On the RA/dispmanx side of things, while you build your own RA executable from github sources, you can change the dispmanx_surface_setup() call that starts on line 447 into this:



dispmanx_surface_setup(_dispvars, 
            width, 
            height, 
            pitch, 
            _dispvars->rgb32 ? 32 : 16,
            _dispvars->rgb32 ? VC_IMAGE_XRGB8888 : VC_IMAGE_RGB565,
            255,
            _dispvars->aspect_ratio, 
            2,
            0,
            &_dispvars->main_surface);


As you can see, I simply changed the 9th parameter from 3 to 2. That will make it use a double buffer instead of a triple buffer, which could make a difference to input lag. The main thread will be blocked more often by the lack of free buffers to draw into, but emulation will be kept from running an additional loop ahead, which should reduce the time between new input being fed to the core and the results being visible on screen. Again, this is how I see it in my head: I may be wrong. I designed this threaded, triple-buffered approach to get the most out of the Pi1’s weak CPU, which had no time to waste waiting for vsync on the main thread, yet I wanted smooth scrolling with no tearing. I succeeded, but it’s not academic; it’s all my guesswork and I could be wrong. I can change the sources for you on GitHub if you want and build a version for you to test; just ask and I will help you as much as I can.[/QUOTE] It took me a while, but I finally got around to doing this. Long story short: I compiled with double buffering and ran a camera test. No change at all.

So, it’s back to the drawing board. The Linux graphics sub-system is definitely uncharted territory for me, so I think someone else will have to look into this some more (I simply don’t think I have the time to go deep into it…). My hypothesis is that the graphics sub-system is causing the additional lag compared to Win 10. It’s a pretty consistent one-frame (16.67 ms) difference between the two operating systems, so I don’t think it can be explained by differences in, for example, input handling. There’s probably some buffering happening somewhere, but is it in user space, the drivers, or the firmware?

@brunnis: thanks a lot for trying. So, if there’s no difference between double and triple buffering, I don’t know what the problem could be on the dispmanx video driver side…

I understand there is another extra latency frame with the GLES driver on the Pi, compared to dispmanx, right?

I have just made a pull request for a new graphics driver, called “plaindrm”. You have to boot the Pi in the new, experimental KMS mode (add “dtoverlay=vc4-kms-v3d” to config.txt) and rebuild RA with --enable-plaindrm. It should be pretty good for latency, and it’s not based on dispmanx but on the “standard” low-level graphics stack. Can you try it? I can provide a Pi3 binary if you don’t want to go through the building process: you do very good work with testing and I don’t want to take up your time.
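Putting the steps from this post together, the procedure would look roughly like this (flag name exactly as given here; the driver is experimental, so details may differ per distro):

```
# /boot/config.txt — boot the Pi with the experimental KMS driver
dtoverlay=vc4-kms-v3d
```

followed by rebuilding RetroArch from source with `./configure --enable-plaindrm && make`.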

Yup.

[QUOTE=vanfanel;41974]I have just made a pull request for a new graphics driver, called “plaindrm”. You have to boot the Pi in the new, experimental KMS mode (add “dtoverlay=vc4-kms-v3d” to config.txt) and rebuild RA with --enable-plaindrm. It should be pretty good for latency, and it’s not based on dispmanx but on the “standard” low-level graphics stack. Can you try it? I can provide a Pi3 binary if you don’t want to go through the building process: you do very good work with testing and I don’t want to take up your time.[/QUOTE] Sounds great! I’ll test, but I’m having a bit of an issue booting the system after adding the vc4 dtoverlay… With the default RetroPie 3.8.1 image, it stops during boot, alternating between three messages:

A start job is running for LSB: Switch to ondemand cpu governor (unless shift key is pressed)
A start job is running for LSB: Raise network interfaces
A start job is running for Load/Save RF Kill Switch Status of rfkill0

After a while, the second message disappears and it keeps alternating between the remaining two. Finally, after 10-15 minutes, the system hangs on the “Switch to ondemand cpu governor” message.

So, I disabled the dtoverlay line in config.txt and proceeded to run rpi-update to update the firmware, followed by apt-get update and apt-get dist-upgrade. I then tried to activate the vc4 driver again and got a different behavior. The system now hangs with this:

The marker at the bottom is not blinking. Before getting to this screen, the line “map: vt02 => fb0” was showing.

When I pulled the plug and rebooted the system again, it just hung at the “map: vt02 => fb0” line instead.

Any ideas?

EDIT: Ohh, and I haven’t recompiled RA yet. Just wanted to see if I could at least boot with the new driver. Maybe that’s the cause of the issue I’m having right now. I’ll see if I can compile and test again.

EDIT: So, I rebuilt RetroArch from master (I saw that your PR was just merged). I compiled with --enable-kms and set video_driver in retroarch.cfg to “drm” (is this correct?). I then activated the vc4 driver again and rebooted. Unfortunately, I ran into almost the same issue as before. In addition to the lines from the photo above, I also got:

[ OK ] Started File System Check on /dev/mmcblk0p1.
Mounting /boot…

And then nothing. :-/

@Brunnis: I don’t know RetroPie, nor do I care about it… Maybe it’s not ready for KMS/DRM, but I could pass you a ready-to-use Raspbian image which will work out of the box with the experimental KMS/DRM driver. What medium could we use to pass you the image? The configure option is --enable-plaindrm (or --enable-drm after backporting, I think?)

[QUOTE=vanfanel;41989]@Brunnis: I don’t know RetroPie, nor do I care about it… Maybe it’s not ready for KMS/DRM, but I could pass you a ready-to-use Raspbian image which will work out of the box with the experimental KMS/DRM driver. What medium could we use to pass you the image?[/QUOTE] Yep, I don’t know exactly what they’ve changed in the RetroPie setup compared to default Raspbian, so it’s probably better to just test with your image. Easiest would be if you could put it on a service like Google Drive, Dropbox, mega.nz, etc. and give me a link to it, or on an FTP you’d be willing to share.

When I looked at the commit, it seemed like they changed it to --enable-kms. Can’t find any reference to “drm” in config.params.sh.

The problem is that you need more CPU power, so you have to reduce the video_frame_delay setting.

With a simple game, I managed to get 24 ms audio latency with the QAudio DAC+; 16 ms didn’t work. That setting is almost OK with the Neo Geo, but only with video_frame_delay 0 instead of 9. So one has to choose between reducing the audio latency by 40 ms or reducing the input latency by 9 ms :) (RPi 3 with no overclocking)

[QUOTE=bidinou;41932]You can already set #define sharper; is that similar? (I know, I can figure it out by myself :D)

What is great is that the shader developer got the most out of the Pi while maintaining 60 fps.[/QUOTE]

Sharper will be some way between the default and tekn0’s suggestion. You could think of them as vaguely resembling:

Default = Small CRT TV fed through SCART
Sharper = CRT computer monitor
Tekn0’s suggestion = LCD with scanlines

Sharper will cause Pi 0/1/2s to drop frames, but Pi 3s might be OK since they clock the GPU faster. (You could also overclock the GPU on earlier Pis.)

If you prefer the look of Tekn0’s suggestion you’re probably better off using the dispmanx driver and an overlay for scanlines since you’ll avoid the extra frame of input lag.

On the subject of input lag…

Whilst immediate-mode GPUs can start both vertex and fragment shading as soon as they get a drawing command, tile-based ones need to have all the geometry passed to them (and vertex shaded) before they can begin fragment shading. This means they’ll only begin fragment shading once they’ve been told there’s no more geometry coming – usually by calling swap buffers. It might be possible to trigger fragment shading earlier by calling glFlush as soon as you’ve sent the last bit of geometry, but I don’t know how that would fit with RetroArch’s drawing of menus/text overlays/etc.
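The glFlush idea could be sketched like this (pseudocode; whether the VideoCore driver actually starts tile binning/fragment work on a flush is exactly the open question above):

```
/* pseudocode: early-kick variant of one frame's render loop */
draw_emulator_frame_quad();       /* last geometry belonging to the game frame */
glFlush();                        /* hint that the frame's geometry is done, so
                                     a tile-based GPU could start shading early */
draw_menus_and_overlays();        /* problem: this is MORE geometry for the same
                                     frame, which is the unresolved concern above */
eglSwapBuffers(display, surface); /* present as usual */
```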

I’m a little confused. So how much input lag do we have now with the Brunnis patch for SNES games? How low can it go?

Apparently there was some additional input lag specific to the (S)NES cores. There is one more frame of lag specific to OpenGL (you can work around it by using the dispmanx video driver, but then you have no pixel shaders – by the way, I’d love to know what good overlays you can use with it :) – I tried a few but the results weren’t satisfying), and there seems to be one more frame of lag which is not understood for the time being.

Also I read this, is it relevant ? “Also, Lakka has now been modified upstream to apply the rt-kernel patch for more consistent scheduling with 1Khz tick rate, preemptible etc. (frame delay setting can be increased further)”

Finally, there is the frame delay setting that can help you win back up to one more frame of lag, but you need a very fast CPU and it is super annoying to tweak (the optimal value differs per system or even per game).

/me enjoys reading this thread even if he is just a dumb user

Edit: brunnis made a summary here: http://libretro.com/forums/showthread.php?t=5428&page=18 Without frame delay there are up to 5 frames of delay, while he measures only about 2 on a real SNES. I clearly see a huge difference between RetroPie on a Pi and the NES/PCE/… FPGA cores on my MiST, for instance. But FPGA computers 1) are very expensive and 2) support few systems.

[QUOTE=bidinou;42111]Apparently there was some additional input lag specific to the (S)NES cores. There is one more frame of lag specific to OpenGL (you can work around it by using the dispmanx video driver, but then you have no pixel shaders – by the way, I’d love to know what good overlays you can use with it :) – I tried a few but the results weren’t satisfying), and there seems to be one more frame of lag which is not understood for the time being.

Also I read this, is it relevant ? “Also, Lakka has now been modified upstream to apply the rt-kernel patch for more consistent scheduling with 1Khz tick rate, preemptible etc. (frame delay setting can be increased further)”

Finally, there is the frame delay setting that can help you win back up to one more frame of lag, but you need a very fast CPU and it is super annoying to tweak (the optimal value differs per system or even per game).

/me enjoys reading this thread even if he is just a dumb user

Edit: brunnis made a summary here: http://libretro.com/forums/showthread.php?t=5428&page=18 Without frame delay there are up to 5 frames of delay, while he measures only about 2 on a real SNES. I clearly see a huge difference between RetroPie on a Pi and the NES/PCE/… FPGA cores on my MiST, for instance. But FPGA computers 1) are very expensive and 2) support few systems.[/QUOTE]

I am in the same boat. Just a dumb user trying to get my head around it all. There is also what seems like half a frame of lag removed from this option: https://github.com/libretro/RetroArch/issues/3100

I am really hoping Lakka will implement all of these options and turn out to be the distro with the lowest lag (DRM/KMS, Brunnis-patched cores, RT kernel at 1 kHz with +15 frame delay, triple buffer option = maximum swapchain images, dispmanx/plaindrm driver). That’s my collection thus far. :)

tekn0: do you mean a high frame delay can be specified regardless of the emulated platform / CPU power, as long as an RT kernel is used?

We really need a wiki page dedicated to this investigation, to help both users & developers keep track of what’s been and what could potentially be achieved.

[QUOTE=Brunnis;41854]Just made a pull request for implementing the exact same fix for bsnes as bsnes-mercury: https://github.com/libretro/bsnes-libretro/pull/16

Hopefully they’ll both be merged very soon. :)

[/QUOTE]

Good news:

http://github.com/libretro/bsnes-libretro/pull/16

Thanks again for all the effort you put into this, very much appreciated. Reading along with all the findings in this forum was a joy too :).

Last but not least. Hopefully you have time and energy left to look at other cores also in the future. No pressure of course!

yeah, it’s in both snes9x and *-next and bsnes and *-mercury. I think CatSFC is the only SNES core we have that could still be affected.

That’s a really good idea… I’ll see what I can do.

[QUOTE=rafan;42154]Good news:

http://github.com/libretro/bsnes-libretro/pull/16 Thanks again for all the effort you put into this, very much appreciated. Reading along with all the findings in this forum was a joy too :).

Last but not least. Hopefully you have time and energy left to look at other cores also in the future. No pressure of course![/QUOTE] Yep, this is great news. The fix was just committed to both the regular bsnes and the bsnes-mercury cores. I’m taking a bit of a break from the emulator analysis stuff for the moment, but I’ll probably spend more time on it in the future. To be honest, though, I’m quite pleased with having removed 1 frame of input lag from fceumm, snes9x, snes9x-next, bsnes and bsnes-mercury. :slight_smile:

However, there’s probably still work to be done on the video pipeline of the Raspberry Pi. I haven’t gotten very far on that testing, though. I have made a few tries getting RetroArch to work on the new OpenGL driver, but no luck so far.

[QUOTE=Brunnis;41992]Yep, I don’t know exactly what they’ve changed in the RetroPie setup compared to default Raspbian, so probably better to just test with your image. Easiest would be if you could just put it on a service like Google Drive, Dropbox, mega.nz, etc. and give me a link to it. Or if you have an FTP you’d be willing to put it on.

When I looked at the commit, it seemed like they changed it to --enable-kms. Can’t find any reference to “drm” in config.params.sh.[/QUOTE]

I’ve been testing this using a standard Raspbian Jessie image, using the RetroPie setup script to install RetroArch and its cores, and then custom-compiling RetroArch to use the KMS/“DRM” driver, and this is looking very promising. I’ve tested several games using the “Brunnis-patched” emulators and I’m getting some very exciting results. Here are a few of the games I’ve tested:

[TABLE=“width: 959”]

Game | Frames | Emulator | Video Driver | Audio Driver
Akumajou Densetsu | 3 | RetroArch\lr-FCEUMM | DRM | SDL2
Super Mario Bros. 3 | 2 | RetroArch\lr-FCEUMM | DRM | SDL2
Mega Man II | 2 | RetroArch\lr-FCEUMM | DRM | SDL2
Castlevania III | 3 | RetroArch\lr-FCEUMM | DRM | SDL2
Castlevania | 3 | RetroArch\lr-FCEUMM | DRM | SDL2
Super Metroid | 2 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Yoshi’s Island | 3 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Super Mario All-Stars | 2 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Seiken Densetsu 3 | 4 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Secret of Mana | 2 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Chrono Trigger | 2 | RetroArch\lr-Snes9x-Next | DRM | SDL2
Super Mario World | 3 | RetroArch\lr-Snes9x-Next | DRM | SDL2

[/TABLE]

I’ve found a few issues when using the KMS/DRM video driver:

- When some SNES games enter and leave hi-res mode, the screen “glitches” for a split second during the transition.
- It doesn’t currently support overlays or shaders.
- Bilinear filtering can’t be disabled.

I tested using the pause/frame-step method on a Raspberry Pi 3 with the scaling_governor set to performance and the core_freq and sdram_freq both overclocked to 500 to keep the sound from crackling.
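For reference, the overclock described here corresponds to these /boot/config.txt lines (values as stated in this post):

```
# /boot/config.txt
core_freq=500    # GPU core clock, MHz
sdram_freq=500   # SDRAM clock, MHz
```

The governor change is the usual sysfs toggle, e.g. writing “performance” to /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.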