Alsa vs. Tinyalsa: Some experiments in performance and latency

In the interest of getting the lowest possible audio latency out of my x86_64 Lakka box, I decided to see if I could determine which of the audio drivers available was the most performant in this regard. The results so far are interesting, but a bit baffling to me all the same, especially since there’s nowhere near as much documentation on the audio drivers and how they perform as I would like, so feedback on this would be quite welcome.

So first things first: I installed Lakka on a generic Dell PC with a Core i5-7500 CPU running at 3.4 GHz, paired with a cheap Radeon R7 250 GPU. I am outputting sound through the onboard audio. I usually set the video driver to vulkan with Vsync off and max swapchain images set to 2, turn on Run-ahead and Automatic Frame Delay, and throw a light CRT shader on top (usually crt-easymode-halation).

As for audio, Lakka only offers the following audio drivers to me: alsa, alsathread, tinyalsa, and oss. Alsathread has been extremely unperformant for me, requiring a latency setting of around 128 ms to prevent stutter. OSS straight-up fails to initialize and I get no audio. That leaves alsa and tinyalsa to compare. The game I used to test everything is Yoshi’s Island using the latest Snes9x core.

When I first installed Lakka, not only was alsathread (the default driver, by the way) not giving me good results, but neither was alsa for some reason. Both would cause stutter and slowdown like crazy. Tinyalsa worked right off the bat, and after a look at the logs, I learned the minimum latency setting it accepts is 21 ms (going below that makes it set itself back up to 64 ms automatically), so I basically set it and forget it, along with the rest of the settings I outlined above.

However, I decided to revisit alsa and see why I was having so much trouble with it. Sure enough, using the 21 ms value I had it at, I immediately got slowdown and stutter, but I began trying a few things to see if I could improve the situation. Disabling frame delay, run-ahead and shaders did little, so instead I tried re-enabling those and just upping swapchain images to 3, and that did the trick for the most part, although I also ended up having to disable frame delay to remove some lingering microstutters. Then I thought, if I raise audio latency instead, can I keep swapchain images at 2 and keep the rest of the goodies as well? Sure enough, raising it upwards of 64 ms did the trick. After a lot of trial and error, I discovered a threshold: as long as audio latency is set to 54 ms or above, I can keep all my latency reduction settings, plus shaders, intact. But as soon as I go below that, the stuttering begins. I tried all this again with the gl and glcore drivers, and I got the same results (well, for some reason I had to enable Vsync using glcore to prevent jitter; gl and vulkan can do without it, apparently).

So what is going on here? Why does tinyalsa allow me to go as low as 21ms, but alsa has to hover around 54? I decided to look at the logs, and I found something curious. When tinyalsa is used and set to 21 ms, this is what it outputs:

[INFO] [TINYALSA]: Using card: 0, device: 0.
[INFO] [TINYALSA]: Can pause: yes.
[INFO] [TINYALSA]: Audio rate: 48000Hz.
[INFO] [TINYALSA]: Buffer size: 4096 frames.
[INFO] [TINYALSA]: Buffer size: 16384 bytes.
[INFO] [TINYALSA]: Frame size: 4 bytes.
[INFO] [TINYALSA]: Latency: 21ms.

However, when set to 54 ms, this is the output:

[INFO] [TINYALSA]: Using card: 0, device: 0.
[INFO] [TINYALSA]: Can pause: yes.
[INFO] [TINYALSA]: Audio rate: 48000Hz.
[INFO] [TINYALSA]: Buffer size: 10432 frames.
[INFO] [TINYALSA]: Buffer size: 16384 bytes.
[INFO] [TINYALSA]: Frame size: 4 bytes.
[INFO] [TINYALSA]: Latency: 54ms.

That’s odd. It mentions two buffer sizes (one in frames, the other in bytes) each time, but only one of them changes. Now, the key here appears to be the frame size of 4 bytes. Sure enough, when set to 21 ms, the given buffer size of 4096 frames times 4 bytes gives you 16384 bytes. But even though the buffer size in frames increases when set to 54 ms, the buffer size in bytes stays the same!
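The arithmetic behind that observation can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the two tinyalsa logs above, while the helper names `buffer_bytes` and `buffer_ms` are made up for illustration and are not part of tinyalsa or RetroArch.

```python
RATE_HZ = 48000       # "Audio rate" from the log
FRAME_SIZE = 4        # "Frame size" in bytes, from the log
LOGGED_BYTES = 16384  # "Buffer size ... bytes", identical in both logs


def buffer_bytes(frames: int, frame_size: int = FRAME_SIZE) -> int:
    """Size in bytes of a buffer holding `frames` audio frames."""
    return frames * frame_size


def buffer_ms(frames: int, rate_hz: int = RATE_HZ) -> float:
    """How many milliseconds of audio `frames` frames represent."""
    return 1000.0 * frames / rate_hz


# (requested latency in ms, buffer size in frames as logged)
for requested_ms, frames in ((21, 4096), (54, 10432)):
    print(
        f"{requested_ms} ms requested: {frames} frames = "
        f"{buffer_bytes(frames)} bytes (log says {LOGGED_BYTES}), "
        f"{buffer_ms(frames):.1f} ms of audio"
    )
```

Two things stand out when this runs. The 54 ms case should come to 41728 bytes rather than 16384, and even the 4096-frame buffer of the 21 ms case holds roughly 85 ms of audio at 48 kHz. Which of the logged figures the driver actually uses cannot be told from the log alone.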

This gave me an idea. When I was first trying out things with alsa, at one point I had tried setting it to the max latency of 512 ms, and it resulted in ridiculous audio delay, so much that even the most casual observer ought to be able to notice it. I tried doing the same with tinyalsa, and, wouldn’t you know it, it sounds virtually identical to setting it to 21 ms! Is the audio latency setting even doing anything with tinyalsa, then?

So, what does alsa report on its log? Well, it’s pretty interesting. This is what it looks like when set to 53 ms, that is, the threshold where it begins to stutter:

[INFO] [Audio]: Set audio input rate to: 31987.32 Hz.
[INFO] [ALSA]: Using floating point format.
[INFO] [ALSA]: Period size: 1024 frames
[INFO] [ALSA]: Buffer size: 2048 frames
[INFO] [ALSA]: Can pause: no.

Raising it to 54 ms causes the following:

[INFO] [Audio]: Set audio input rate to: 31987.32 Hz.
[INFO] [ALSA]: Using floating point format.
[INFO] [ALSA]: Period size: 1024 frames
[INFO] [ALSA]: Buffer size: 3072 frames
[INFO] [ALSA]: Can pause: no.

Aha! Crossing the 54 ms threshold gives us an extra 1024 frames of buffer. That very well may explain the performance drop-off at 53 ms and below.

But hang on, can we go lower than that, performance be damned? It doesn’t appear we can. Even setting latency down to 2 ms gives the exact same output, and I have no idea if the driver is even honoring such a low latency setting (unlike tinyalsa, alsa says nothing about whether it accepts or rejects the latency setting in the log), and I don’t have the equipment to test audio latency to such a fine degree.
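The three data points above (2 ms, 53 ms and 54 ms) happen to fit a simple rounding model, sketched below in Python. Everything in it beyond the logged period size is an assumption: the 48000 Hz output rate is borrowed from the tinyalsa log, and the ideas that the buffer is rounded to the nearest whole number of periods and never drops below two periods are guesses that match the log, not anything confirmed about how RetroArch and ALSA negotiate the buffer.

```python
import math

RATE_HZ = 48000       # assumption: output rate, taken from the tinyalsa log
PERIOD_FRAMES = 1024  # "Period size" from the ALSA log
MIN_PERIODS = 2       # assumption: the buffer never shrinks below two periods


def modelled_buffer_frames(requested_ms: float) -> int:
    """Guess the buffer size by rounding the request to whole periods."""
    requested_frames = requested_ms * RATE_HZ / 1000.0
    # Round half up to the nearest whole number of periods.
    periods = math.floor(requested_frames / PERIOD_FRAMES + 0.5)
    return max(MIN_PERIODS, periods) * PERIOD_FRAMES


for ms in (2, 53, 54):
    frames = modelled_buffer_frames(ms)
    print(
        f"{ms:>2} ms requested -> {frames} frames "
        f"({1000.0 * frames / RATE_HZ:.1f} ms of audio)"
    )
```

Under these assumptions the switch from two to three periods falls at 2560 frames, about 53.3 ms, which would put 53 ms on one side and 54 ms on the other. It would also mean the effective buffer is about 42.7 ms for any setting below that point, so a 2 ms setting would not really be honored.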

But now I’m intrigued. Despite the ostensibly lower latency that tinyalsa appears to accept, it seems to stick to a pre-defined buffer size which is actually BIGGER than what I can achieve with alsa, and raising or lowering the latency value seems to do close to nothing. The latency setting actually does appear to do something in alsa, however, and the smaller buffer size it achieves despite supposedly having a higher latency value makes me wonder if tinyalsa is not just a teensy little bit of a cheater, and alsa is actually the more performant of the two.

I have to admit I am a bit ignorant in regards to a lot of these things, so it would be great if someone who actually knows how these drivers work could tell me whether my hunch is correct or not, at least so I can put this to rest and finally leave well enough alone. :P

Hi. I don’t know how this VERY interesting thread has no answers. I guess people versed in latency factors have moved to FPGAs?

In any case, you can get perfectly stable 32ms latency on GNU/Linux if you:

- Run RetroArch using Vulkan without X11, directly from a TTY.
- Use 2 buffers (max_swapchains = 2), which results in no noticeable input lag.
- Lower the audio rate to 44100.
- Use the vc4hdmi0 ALSA device.

So, no input lag + low audio latency (32ms is very good!) is possible. We’re getting VERY near to what an FPGA can do… on a lowly Pi4!

PipeWire seems to have solved the latency issues. After I switched to it, the audio latency setting can go down to even 5ms (not recommended, but just saying it’s possible.)

If there’s a guide on how to switch your Pi from PulseAudio to PipeWire, I would try it!

@RealNC I have PipeWire built and installed on my Raspberry Pi 4, but in the end it uses ALSA to output audio.

So, are you saying that PipeWire has better latency than plain ALSA? Isn’t that just impossible? PipeWire works on top of ALSA.

If it’s really happening, how can it be technically possible?

If you use “normal” ALSA where it’s not using the device exclusively, then yes. dmix is what ALSA uses to mix the audio of multiple applications together. It has worse latency than PipeWire or PulseAudio.

In my experience, when using PipeWire and selecting the “ALSA” driver in RA, you get latency that’s as good as tinyalsa (which uses the ALSA device directly and exclusively.) And you retain the benefit of not blocking other applications from playing audio.

Note that with PipeWire, if you use the PulseAudio driver in RA, you can’t get less than 24ms audio latency. 24ms PulseAudio latency seems to be the preconfigured minimum in PipeWire. This is probably configurable, but PipeWire’s configuration is rather arcane and I haven’t figured out where and how to change it. 24 is low enough for me, so I just use the PulseAudio driver in RA, but if for some reason you want to try less than that, then use the ALSA driver.

I forgot to mention that a month or so ago, we started fixing low latency audio behavior in some cores. If you can’t get clean audio with values less than 50ms or so, it’s usually the core that causes it.

We fixed the cores where we found issues. One of them was Snes9x, so as a test case, you can use it to find the audio latency limit of your hardware. It should work even at 5ms without too many audio glitches now. Obviously, the heavier on the CPU a core is, the more audio glitches there will be with low audio latency settings. Snes9x is quite light on CPU, so it’s a good test case.

@RealNC Then the same low latencies can be achieved on ALSA if the device is accessed directly, instead of accessing it via dmix, right? That would be possible with the right /etc/asound.conf after looking at the DEVICE/CARD names with aplay -l.
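As a rough illustration of that last step, the Python sketch below pulls the card and device numbers out of `aplay -l` style output and turns them into ALSA `hw:CARD,DEVICE` names, which address the hardware directly rather than through dmix. The sample text is illustrative of what a Pi 4 might print and was not posted in this thread, and `hw_devices` is a hypothetical helper name.

```python
import re

# Illustrative `aplay -l` output; real output will differ per machine.
SAMPLE_APLAY_L = """\
**** List of PLAYBACK Hardware Devices ****
card 0: vc4hdmi0 [vc4-hdmi-0], device 0: MAI PCM i2s-hifi-0 [MAI PCM i2s-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: vc4hdmi1 [vc4-hdmi-1], device 0: MAI PCM i2s-hifi-0 [MAI PCM i2s-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

# Matches lines such as "card 0: vc4hdmi0 [vc4-hdmi-0], device 0: ..."
CARD_LINE = re.compile(
    r"^card (\d+): (\S+) \[[^\]]*\], device (\d+):", re.MULTILINE
)


def hw_devices(aplay_output: str) -> list[tuple[str, str]]:
    """Return (card id, "hw:CARD,DEVICE") pairs found in the listing."""
    return [
        (card_id, f"hw:{card_no},{dev_no}")
        for card_no, card_id, dev_no in CARD_LINE.findall(aplay_output)
    ]


for card_id, pcm_name in hw_devices(SAMPLE_APLAY_L):
    print(f"{card_id}: {pcm_name}")
```

To run it against a real machine, the sample string could be replaced with the captured output of `aplay -l`, for instance via `subprocess.run(["aplay", "-l"], capture_output=True, text=True).stdout`.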

And with regards to the audio latency improvements, yes, I did notice them! What a job you guys did. I am still seeing that Game Boy emulation in general has problems with low latency: Gambatte has small audio dropouts once in a while with anything below 50ms, and mGBA and GearBoy have exactly the same problem. mGBA is perfect for GBA games in low-latency audio, however: I imagine GB audio is different from GBA audio.

Game Boy is weird. It generates audio at 2MHz. SameBoy resamples that to 384kHz, which is then sent to RA, which resamples it to the final output sample rate. Not sure what Gambatte does.

If the Pi can run SameBoy (it’s quite CPU demanding) and you build SameBoy yourself, you can experiment with editing libretro/libretro.c: near the very start of that file, change 384000 to 48000 and see if that improves it.
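To get a feel for what that change means, here is a small Python sketch of how many samples each emulated frame carries at the rates mentioned above. The Game Boy timing constants (a 4194304 Hz clock and 70224 clock cycles per video frame) are not from this thread and are brought in as assumptions, as is reading "2MHz" as half the clock rate; the function names are made up.

```python
GB_CLOCK_HZ = 4_194_304    # assumption: Game Boy master clock
CYCLES_PER_FRAME = 70_224  # assumption: clock cycles per video frame


def samples_per_frame(rate_hz: int) -> float:
    """Audio samples produced during one emulated video frame."""
    return rate_hz * CYCLES_PER_FRAME / GB_CLOCK_HZ


def reduction_factor(old_rate_hz: int, new_rate_hz: int) -> float:
    """How many times fewer samples the new rate produces."""
    return old_rate_hz / new_rate_hz


rates = {
    "hardware (2 MHz)": GB_CLOCK_HZ // 2,
    "core output (384 kHz)": 384_000,
    "edited core output (48 kHz)": 48_000,
}
for label, rate in rates.items():
    print(f"{label}: {samples_per_frame(rate):.1f} samples per frame")

print(f"384 kHz -> 48 kHz: {reduction_factor(384_000, 48_000):.0f}x fewer samples")
```

The point of the experiment is that the frontend's resampler gets one eighth of the samples to process per frame. Whether that is what causes the dropouts is exactly what the test is meant to find out.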

Right. You’d have to completely disable PipeWire though, so that the hardware device becomes accessible.

@RealNC I would love to try the frequency modification in SameBoy, but right now it doesn’t seem to build…

rgbasm -i build/obj/BootROMs/ -i BootROMs/ -o build/bin/BootROMs/dmg_boot.bin.tmp BootROMs/dmg_boot.asm
error: BootROMs/dmg_boot.asm(3) -> BootROMs/hardware.inc(32):
    syntax error, unexpected SET
rgbasm -i build/obj/BootROMs/ -i BootROMs/ -o build/bin/BootROMs/agb_boot.bin.tmp BootROMs/agb_boot.asm
error: BootROMs/agb_boot.asm(2) -> BootROMs/cgb_boot.asm(3) -> BootROMs/hardware.inc(32):
    syntax error, unexpected SET
rgbasm -i build/obj/BootROMs/ -i BootROMs/ -o build/bin/BootROMs/cgb_boot.bin.tmp BootROMs/cgb_boot.asm
error: Assembly aborted (1 error)!
make[1]: *** [Makefile:430: build/bin/BootROMs/dmg_boot.bin] Error 1
make[1]: Leaving directory '/root/src/libretro/SameBoy'
make: *** [Makefile:352: ..//build/bin/BootROMs/dmg_boot.bin] Error 2
make: *** Waiting for unfinished jobs....
rgbasm -i build/obj/BootROMs/ -i BootROMs/ -o build/bin/BootROMs/sgb_boot.bin.tmp BootROMs/sgb_boot.asm
error: BootROMs/cgb_boot.asm(3) -> BootROMs/hardware.inc(32):
    syntax error, unexpected SET
error: BootROMs/sgb_boot.asm(3) -> BootROMs/hardware.inc(32):
    syntax error, unexpected SET
error: Assembly aborted (1 error)!
error: Assembly aborted (1 error)!
make[1]: *** [Makefile:430: build/bin/BootROMs/sgb_boot.bin] Error 1
make[1]: Leaving directory '/root/src/libretro/SameBoy'
make[1]: *** [Makefile:430: build/bin/BootROMs/agb_boot.bin] Error 1
make[1]: Leaving directory '/root/src/libretro/SameBoy'
make: *** [Makefile:352: ..//build/bin/BootROMs/agb_boot.bin] Error 2
make: *** [Makefile:352: ..//build/bin/BootROMs/sgb_boot.bin] Error 2
error: Assembly aborted (1 error)!
make[1]: *** [Makefile:430: build/bin/BootROMs/cgb_boot.bin] Error 1
make[1]: Leaving directory '/root/src/libretro/SameBoy'
make: *** [Makefile:352: ..//build/bin/BootROMs/cgb_boot.bin] Error 2 

Do you know if there’s a way to omit the bootrom building on this core?

Are you trying to build from the libretro repo? My audio changes have not been merged there. They were merged upstream, so try https://github.com/LIJI32/SameBoy.

Oh, it appears you’re either using the master branch of https://github.com/gbdev/rgbds or the 0.6.x release. Yeah, that doesn’t work. Use the 0.5.2 version:

git fetch --tags
git checkout v0.5.2
make clean && make

@RealNC Thanks, got the SameBoy core to build on both the Pi4 and x86_64 Debian. The Pi4 can run this core with no problems at all (~45% usage of one CPU for GB Color emulation, according to top).

However, this core has audio dropouts with less than 50ms delay even if I set the frequency to 48 kHz in libretro.c as you suggested. But here’s the catch: those audio dropouts only happen if I am using “Max Swapchain Images” set to 2. Setting it to 3, they go away. It happens on both the Pi4 and my high-end x86_64 laptop, so it’s not a matter of host CPU performance: it’s a matter of synchronization.

Using “Max Swapchain Images” set to 3 is a major source of input lag, so I always use everything with 2 buffers.

The SNES9X core does not have this problem: using 2 buffers and 32ms audio delay, it works flawlessly. So there’s something with these Game Boy cores…

Indeed, I went back and tested it once more, and I am now able to get lower values than I could before. However, since I am currently on integrated Intel graphics (my Radeon GPU’s fan bit the dust, so out it went), I have noticed that adding a shader and/or lowering the max swapchain images value seems to limit how far down I can go in audio latency before I get stutter, as do more obvious things like enabling Run-Ahead. After some experimentation with these, once again testing with Yoshi’s Island (it being one of the most intensive SNES games), these are the lowest values I could achieve without in-game stuttering:

With max swapchain images set to 2 and BOTH a shader and Run-Ahead: 512 ms (with occasional stutter still, so this is worthless)

With max swapchain images set to 2 and EITHER a shader or Run-Ahead: 34 ms

With max swapchain images set to 2 and NO shader or Run-Ahead: 22 ms

With max swapchain images set to 3 and BOTH a shader and Run-Ahead: 16 ms

With max swapchain images set to 3 and EITHER a shader or Run-Ahead: 12 ms

With max swapchain images set to 3 and NO shader or Run-Ahead: 6 ms

With max swapchain images set to 4 and NO shader or Run-Ahead: 5 ms

All tests were done using the Vulkan video driver. The shader I tested with was Sony Megatron, and Run-Ahead, when enabled, was always set to 1 frame.

One thing I tried was messing with the Resampler settings to see if latency was affected, but even though playing around with the resampler quality is described as affecting latency, I saw zero latency difference between even Lowest and Highest regardless of what other settings I fiddled with. As such, I do believe I am going to set it to Highest and keep it there, because why not?

Anyway, as things stand, I’ll be sticking with max swapchain images set to 3 so that I can use both a shader and Run-Ahead at once, since the latter should, in theory, make up for the extra buffer frame, and I can shave an extra 18 ms off the audio latency (16 ms instead of 34 ms). Can’t wait to replace the GPU, though…
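For anyone who wants to compare the combinations at a glance, the measurements listed above can be put into a small Python lookup. The values are exactly the ones reported in this post; the dictionary layout and the `saving_ms` helper are just one illustrative way to organize them.

```python
# (max swapchain images, how many of {shader, Run-Ahead} are enabled)
#   -> lowest audio latency in ms without stutter, as reported above
LOWEST_STABLE_MS = {
    (2, 2): 512,  # still stuttered occasionally, so effectively unusable
    (2, 1): 34,
    (2, 0): 22,
    (3, 2): 16,
    (3, 1): 12,
    (3, 0): 6,
    (4, 0): 5,
}


def saving_ms(before: tuple[int, int], after: tuple[int, int]) -> int:
    """Audio latency gained by moving from one configuration to another."""
    return LOWEST_STABLE_MS[before] - LOWEST_STABLE_MS[after]


# Swapchain 2 with one extra versus swapchain 3 with both extras.
print(saving_ms((2, 1), (3, 2)), "ms saved")
```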

A low audio latency setting does require available spare CPU capacity. It also requires the GPU to not miss the target frame time. On integrated graphics, using a heavy shader is probably not a good idea.

The best setup is a high refresh rate with VRR (because vsync never triggers then.) Not an option for you currently, but in general, the swapchain length doesn’t matter when your refresh rate is higher than your target FPS and FPS is being limited (“Sync to exact content frame rate” does that.) This lets you use a swapchain length of 3 or 4 without any input latency penalty. If your GPU is not being maxed out, that is. The swapchain length comes into play in two situations: when frames are being blocked by vsync, or when frames are being blocked by the GPU (due to high load.)

This is a comparison between swapchain length of 2 (left) and 4 (right).

With 60Hz vsync, as expected, a swapchain of 2 has less latency than 4:

With 120Hz and “sync to content framerate”, swapchain length does not matter and both give the minimum possible latency:

So since you don’t have to use a swapchain of 2, you can lower audio latency more.

(Note that this has nothing to do with VRR itself. You can achieve the same result by disabling vsync on a 60Hz display and activating the “sync to exact content framerate” option. It’s just that without VRR, it’s gonna be ugly due to tearing and stutter.)

Well, I wouldn’t exactly call Sony Megatron “heavy”, as it’s quite fast compared to most other CRT shaders, including the fastest guest preset, but yeah, at this point I’m definitely hitting the limits of my current setup. Once I replace the GPU (currently eyeing an RX 550), I should, in theory, be able to bring the max swapchain images back down to 2 without much of a problem, and I’ll be curious to see how low I can bring audio latency then. The prospect of a latency value of less than 16 ms is enticing, to say the least.

Once you get to 30ms, further reductions are kind of pointless. Unless you’re sitting far away from your speakers, in which case it makes sense to get lower. (The further away the speakers are from you, the higher the audio latency gets.)
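The distance effect is easy to estimate. The Python sketch below assumes sound travels at roughly 343 m/s (dry air at about 20 °C), a figure that is not from this thread, and the example distances are arbitrary.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees Celsius


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel from the speakers to the listener."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_S


def total_latency_ms(setting_ms: float, distance_m: float) -> float:
    """Audio latency setting plus the travel time through the air."""
    return setting_ms + acoustic_delay_ms(distance_m)


# Example distances: a desk setup, a small room, a living room couch.
for distance in (0.5, 2.0, 4.0):
    print(
        f"{distance} m: +{acoustic_delay_ms(distance):.1f} ms in the air, "
        f"{total_latency_ms(30, distance):.1f} ms total with a 30 ms setting"
    )
```

Each meter adds a little under 3 ms, so at a few meters the air adds roughly as much delay as the gap between a 30 ms and a 20 ms setting.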

I have a living room setup, so low audio latency is key. But yeah, I suppose 16 ms is already stellar.

By the way, after some more testing, it seems I can actually go as high as 2 Run-Ahead frames without stutter, so that’s very nice. Everything is so responsive now, so it seems crazy to think I could theoretically get lower latency still with better hardware.
