What would it take to get a dynarec into beetle-saturn?

One of my frustrations with Saturn emulation is that most cores don’t run well on handheld devices (e.g. Anbernic). YabaSanshiro with the SH-2 dynamic recompiler does run at mostly full speed, but it has some compatibility issues.

I looked into the code a bit to see what it might take to use the dynamic recompiler with beetle-saturn (or maybe Kronos) which has better compatibility.

Here is a list of what I see that would need to be fixed or updated:

  1. The dynarec from yabause does not emulate the cache. This is necessary for compatibility with certain games. Fixing this would require the dynamic recompiler to (optionally) generate code for each memory access to check for cache hit or miss.

  2. x86-64 support is incomplete, and 64-bit ARM is missing entirely. The original Yabause does work on x86-64 in 64-bit mode, but the dynarec only generates 32-bit code. This is a problem because it requires some hacks to make sure that the process is mapped in the lower 4 GB of the address space. uoYabause does have a full x86-64 dynarec based on Titan Project, but this is missing a lot of the optimizations from the original dynarec.

One path to fix this would be to finish the 64-bit support on x86-64 and then port it to 64-bit ARM. (32-bit ARM is already done.) Another possibility would be to keep the original instruction decoder and then use lightrec like beetle-psx. That would require creating a proper intermediate representation, which might be a good idea anyway even if not using GNU Lightning (the code generator that lightrec is built on).
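For the cache emulation in item 1, the check that the generated code would have to perform on each memory access looks roughly like the model below. This is only a Python sketch, not code from Yabause or mednafen. It assumes the commonly documented SH7604 cache geometry (4 KB, 4-way set associative, 16-byte lines, 64 sets), assumes the top three address bits select the region and are not part of the tag, and uses a plain LRU list rather than the hardware’s LRU encoding. The class and method names are made up for illustration.

```python
LINE_BYTES = 16   # assumed cache line size
NUM_SETS = 64     # assumed number of sets (entries)
NUM_WAYS = 4      # assumed associativity


class SH2CacheModel:
    """Hypothetical hit/miss model; tracks tags only, not the cached data."""

    def __init__(self):
        # One list of tags per set, ordered from least to most recently used.
        self.sets = [[] for _ in range(NUM_SETS)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fill the line and evict the LRU way."""
        line = (addr & 0x1FFFFFFF) // LINE_BYTES  # drop the assumed region bits
        index = line % NUM_SETS
        tag = line // NUM_SETS
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)      # move the tag to the most recently used slot
            tags.append(tag)
            self.hits += 1
            return True
        if len(tags) == NUM_WAYS:
            tags.pop(0)           # evict the least recently used way
        tags.append(tag)
        self.misses += 1
        return False
```

A dynarec would presumably inline something equivalent to `access()` (or call a helper) around each load and store, add a miss penalty to the cycle counter, and skip the check entirely when cache emulation is turned off.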

  3. Debug output. When debugging output is enabled, it shows the disassembled SH-2 code, but not the recompiled output (x86, ARM). This isn’t strictly necessary to add, but I’d want to do it before going further.

  4. Threaded compilation to fix the recompilation stuttering problem. The Yabause code is not multithreaded at all, and everything stops while recompilation is happening. The recompiler needs to run in a separate thread like in beetle-psx. This requires adding thread locks, mutexes, etc.

  5. Need to add the VDP1 slowdown hack for certain games.

  6. Maybe also want the additional cycle timing checks like beetle-psx does.

  7. The Yabause dynarec uses a 4K page size for instruction cache invalidation. This code seems to have been inherited from Mupen64plus. This isn’t ideal, but I’m not certain that it actually breaks compatibility with any games. However, the fact that the cache mode hacks in mednafen are necessary does suggest that some games are sensitive to the instruction cache behavior, so this needs to be looked into further.

  8. User interface needs some improvement to work well on Rocknix etc.
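To illustrate why the 4K invalidation granularity in the list above is coarse, here is a simplified Python model of page-based invalidation of translated blocks. It is a sketch under my own assumptions, not the Yabause or Mupen64plus implementation; `BlockCache`, `add_block` and `invalidate_write` are hypothetical names, and the compiled code is treated as an opaque value.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # 4 KB pages


class BlockCache:
    """Tracks translated blocks and the 4 KB pages each one was compiled from."""

    def __init__(self):
        self.blocks = {}                        # block start address -> compiled code
        self.page_to_blocks = defaultdict(set)  # page number -> block start addresses

    def add_block(self, start: int, length: int, code) -> None:
        """Register a block covering [start, start + length) bytes of guest code."""
        self.blocks[start] = code
        first = start >> PAGE_SHIFT
        last = (start + length - 1) >> PAGE_SHIFT
        for page in range(first, last + 1):
            self.page_to_blocks[page].add(start)

    def invalidate_write(self, addr: int) -> int:
        """Drop every block overlapping the page that contains addr; return the count."""
        starts = self.page_to_blocks.pop(addr >> PAGE_SHIFT, set())
        for start in starts:
            self.blocks.pop(start, None)
        # Remove stale references from other pages that the dropped blocks spanned.
        for page in list(self.page_to_blocks):
            self.page_to_blocks[page] -= starts
            if not self.page_to_blocks[page]:
                del self.page_to_blocks[page]
        return len(starts)
```

In this model a single write anywhere in a page discards every block that touches that page, even if the write hit data rather than code. The real SH-2 behaves differently: stale instructions stay in its cache until the line is purged or replaced, which is presumably what cache-sensitive games depend on.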

This will probably take more time than I have right now, so I’m wondering if anyone else would be interested in working on this. It’s mostly clear what needs to be done, but it would take a lot of time for testing and debugging. Maybe we could sponsor a series of bounties for each of these tasks. Would anyone be interested in that?

We haven’t had good luck with bounties on dynarecs, historically, but who knows 🤷

The two VDPs are the bottleneck of Saturn emulation. I doubt replacing an SH-2 interpreter with an SH-2 dynarec is going to have the level of impact you hope for on beetle-saturn. It would probably already be an incredible feat if it reduced CPU usage by 10%.

Kronos is actually using a cached interpreter, whose performance is supposedly close to yabasanshiro’s dynarec (source: the author’s benchmarks when he wrote it).

The main reason behind yabasanshiro’s speed is its hardware renderer, which prefers ignoring how the VDP1 actually works.

I’ve tried YabaSanshiro on RK3566, and it does not run at full speed without the dynarec. The difference is quite noticeable, and much more than 10%.

It’s also quite buggy unfortunately. The VDP2 sprite/layer priority is wrong in some games, and often the emulator just crashes.

This is using the old renderer and not Kronos. RK3566 doesn’t support OpenGL 4, so Kronos doesn’t work at all.

One possibility would be to use the more accurate VDP1/VDP2 from mednafen (beetle) along with the dynarec. However, mednafen is 64-bit only, and the ARM dynarec currently only works in 32-bit mode.

Assuming that could be resolved, I’m not sure what the speed would be like. There would presumably be some amount of frameskip. I’d also want to look into why mednafen needs 64 bits, as this usually isn’t good for performance. Compiling for a 64-bit target generally increases code size, which results in more memory usage and a higher L1/L2 cache miss rate.

It’s rather disappointing that there is still no good Saturn emulation on handhelds. Other than maybe a Steam Deck, things are basically in the same state as ten years ago.

I’ve no doubt SH-2 emulation represents a large part of yabasanshiro’s cpu usage, since VDP emulation is offloaded to the gpu with that emulator.

However, beetle-saturn is a different story: it doesn’t offload VDP emulation to the GPU, so the SH-2 emulation represents a much smaller percentage of beetle-saturn’s CPU usage.

I had some time to look into this further, so I built mednafen with profiling support (specifically I needed -fno-omit-frame-pointer to get the call graph with perf). I assume beetle-saturn would show similar results, but it was easier to compile mednafen 1.32.1 for this purpose.

Mednafen runs most of the VDP2 in a separate thread, even if not using the GPU. The results show about 65% of the CPU time in the main thread, 30% in the VDP2 thread, and a few percent in other processes on the system. Since the VDP2 thread generally gets scheduled on a separate CPU core and doesn’t max it out, the bottleneck is the main thread. If the main thread needs more than 100% of one core, then the game lags.

I tested two games, Panzer Dragoon Saga and Princess Crown. Panzer Dragoon Saga is mostly 3D, while Princess Crown is a 2D side-scroller. Looking at the main thread only, the results are as follows:

Panzer Dragoon Saga:

28.6% SH-2 (DoIDIF, C_MemReadRT, C_MemWriteRT)
26.3% SOUND_Update (RunSCSP, 68K)
8.9% VDP1
2.9% VDP2
2.0% SS_SetEventNT()
0.8% MidSync()

Princess Crown:

28.9% SH-2 (DoIDIF, C_MemReadRT, C_MemWriteRT)
30.3% SOUND_Update (RunSCSP, 68K)
2.4% VDP1
5.1% VDP2
2.6% SS_SetEventNT()
1.0% MidSync()

As expected, Panzer Dragoon Saga uses the VDP1 more heavily for its 3D graphics, but for both games more than half of the main-thread CPU time (about 55% and 59%) is spent in SH-2 and sound (SCSP and 68K) emulation.

A 68K recompiler might help, but the low-hanging fruit is optimizing SH-2 (SH7095) since that dynarec is already written and available.
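To put rough bounds on how much a faster SH-2 core could help the main thread, Amdahl’s law can be applied to the profile figures above. The sketch below uses the 28.6% SH-2 share measured for Panzer Dragoon Saga; the speedup factors of 2x, 4x and 10x are hypothetical values picked for illustration, not measurements of any dynarec.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when one fraction of the time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


SH2_SHARE = 0.286  # SH-2 share of main-thread time (Panzer Dragoon Saga profile)

for k in (2, 4, 10):  # hypothetical dynarec speedups over the interpreter
    s = overall_speedup(SH2_SHARE, k)
    saved = (1.0 - 1.0 / s) * 100.0
    print(f"SH-2 {k}x faster -> main thread {s:.2f}x faster, {saved:.1f}% less CPU time")
```

Whatever the real factor turns out to be, the ceiling is fixed by the profile: even an infinitely fast SH-2 core could not cut main-thread time by more than 28.6%, so the sound emulation would become the next thing to look at.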

This is an example of why emulation of everything from the PS1/SS/N64 generation onward predominantly uses dynamic recompilation.

A cached interpreter could indeed be a good boost without the portability issues of a full-on dynarec.

I’m surprised the sound/68K is using so much. Would something like Musashi be able to slot in there and save some cycles?

YabaSanshiro does use Musashi, so yes. I’d have to look into it more to see what the performance difference is. There’s a bit more going on with sound emulation than just 68K.

I’d expect games like Burning Rangers to be more demanding, but those are surprising numbers nonetheless. I thought it was like Kronos, where VDP1 is the bottleneck.

Burning Rangers is definitely CPU intensive. It seems to heavily use both SH-2 CPUs, and does a lot of 3D drawing with VDP1. SH2 dynarec might help here.

I tried Burning Rangers on YabaSanshiro in RetroArena beta 10, and it gets to the menu, but crashes when you try to start a game. The crash does not happen in Yabause 0.9.15, so there’s a regression in YabaSanshiro, or possibly just something wrong on ARM. Unfortunately I don’t have a good development environment to build all of RetroArena, so efforts to track down that bug will have to wait.

Yabause 0.9.15 needs to be built with -fno-PIC -fno-PIE for the assembly code to work, which then won’t link with certain shared libraries. So this is another thing that needs to get fixed before any of this code can be integrated into RetroArch.

I see videos from ten years ago on YouTube of people running Yabause on Android and ARM Linux, and it looks pretty good. Now I try this, and I find a whole bunch of new bugs and regressions, stuff doesn’t build from source because of broken shared library dependencies, and it’s slow because dynamic recompilation doesn’t work right on 64-bit platforms. I’m wondering what happened. Saturn emulation wasn’t this bad before.

Saturn emulation has been awful for ages. It’s moderately good only on beetle-saturn, and only with a very beefy PC. As it is now, mid-range and low-end handhelds don’t stand a chance. It runs pretty well on my SD865 (which equals my 7th-gen i7 laptop in performance), but there’s no chance on an SD650 (40 fps on YabaSanshiro), and that should be faster than mid-range chips like the RK3566.

N64 is pretty bad too on mid-range and low-end chips.

YabaSanshiro does run nearly full speed on RK3566 (1.8GHz ARM Cortex-A55). I tried games including Arcana Strikes and Panzer Dragoon Saga, and there is some minor frameskip, but it outputs 30+ FPS. It would be fully playable if not for the VDP2 layering issues and random crashes. Burning Rangers might be a bit slow, but many games would be fine.

I would have thought with newer CPUs the dynarec wouldn’t be necessary, but you need a 3+ GHz CPU, and it’s still kinda laggy.

I profiled the CPU usage in Yabause 0.9.15 on x86-64, and it looks surprisingly similar to Mednafen, except for the VDP2 being on the main thread. The sound processor takes about 25% of the time, mostly in generate_sample() and ScspDspExec(), with 68K being a small part. Switching the 68000 core between C68K and Musashi doesn’t seem to make much difference.

Perhaps some of the sound generation could be done in a separate thread, but other than that, the only thing that can clearly be optimized is the SH2, and fixing that dynarec is going to be a big project.

It’s clear what needs to be fixed, but it will take a lot of time to do it all. This is why I’m wondering if we could incentivize development with bounties.

N64 has some similar issues, although not quite as bad. In fact, the dynarec code in Yabause originally came from Mupen64plus. So maybe if this can get fixed, some of it could be applied to N64 emulation also.