What would it take to get a dynarec into beetle-saturn?

One of my frustrations with Saturn emulation is that most cores don’t work too well on handheld devices (e.g. Anbernic). YabaSanshiro with the SH2 dynamic recompiler does run mostly at full speed, but it has some compatibility issues.

I looked into the code a bit to see what it might take to use the dynamic recompiler with beetle-saturn (or maybe Kronos) which has better compatibility.

Here is a list of what I see that would need to be fixed or updated:

  1. The dynarec from yabause does not emulate the cache. This is necessary for compatibility with certain games. Fixing this would require the dynamic recompiler to (optionally) generate code for each memory access to check for cache hit or miss.

  2. x86-64 support is incomplete, and ARM64 is missing entirely. The original Yabause does work on x86-64 in 64-bit mode, but the dynarec only generates 32-bit code. This is a problem because it requires some hacks to make sure that the process is mapped in the lower 4 GB of the address space. uoYabause does have a full x86-64 dynarec based on Titan Project, but it is missing a lot of the optimizations from the original dynarec.

One path to fix this would be to finish the 64-bit support on x86-64 and then port to 64-bit ARM. (32-bit ARM is already done.) Another possibility would be to keep the original instruction decoder and then use lightrec (which is built on GNU Lightning) like beetle-psx. That would require creating a proper intermediate representation, which might be a good idea anyway even if not using lightrec.
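To make the "intermediate representation" idea concrete, here is a minimal Python sketch of decoding a few 16-bit SH-2 opcodes into target-independent operations that an x86-64, ARM64 or lightrec-style backend could consume. The `IROp` type and its operation names are hypothetical and are not taken from Yabause or lightrec; only a handful of opcodes are covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IROp:
    """One target-independent operation (hypothetical IR, for illustration)."""
    op: str
    dst: Optional[int] = None
    src: Optional[int] = None
    imm: Optional[int] = None


def sign_extend8(value: int) -> int:
    """Sign-extend an 8-bit immediate, as SH-2 does for ADD/MOV #imm."""
    return value - 0x100 if value & 0x80 else value


def decode(opcode: int) -> IROp:
    """Decode a small subset of SH-2 opcodes into IR operations."""
    n = (opcode >> 8) & 0xF    # Rn field (destination)
    m = (opcode >> 4) & 0xF    # Rm field (source)
    top = opcode >> 12
    low = opcode & 0xF
    if opcode == 0x0009:                 # NOP
        return IROp("nop")
    if top == 0x3 and low == 0xC:        # ADD Rm,Rn
        return IROp("add", dst=n, src=m)
    if top == 0x6 and low == 0x3:        # MOV Rm,Rn
        return IROp("mov", dst=n, src=m)
    if top == 0x7:                       # ADD #imm,Rn
        return IROp("addi", dst=n, imm=sign_extend8(opcode & 0xFF))
    if top == 0xE:                       # MOV #imm,Rn
        return IROp("movi", dst=n, imm=sign_extend8(opcode & 0xFF))
    # Anything else is left for the real decoder to handle.
    return IROp("unknown", imm=opcode)
```

With a layer like this, each backend only has to translate `IROp` values, and the same list can be printed for debugging before any host code is generated.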

  3. Debug output. When debugging output is enabled, it shows the disassembled SH2 code, but not the recompiled output (x86, ARM). This isn’t strictly necessary to add, but I’d want to do it before going further.

  4. Threaded compilation to fix the recompilation stuttering problem. The Yabause code is not multithreaded at all, and everything stops while recompilation is happening. The recompiler needs to run in a separate thread like in beetle-psx. This requires adding mutexes and other synchronization.

  5. The VDP1 slowdown hack needs to be added for certain games.

  6. It may also be worth adding the additional cycle timing checks that beetle-psx does.

  7. The Yabause dynarec uses a 4K page size for instruction cache invalidation. This code seems to have been inherited from Mupen64plus. This isn’t ideal, but I’m not certain that it actually breaks compatibility with any games. However, the fact that the cache mode hacks in mednafen are necessary does suggest that some games are sensitive to the instruction cache behavior, so this needs to be looked into further.

  8. The user interface needs some improvement to work well on Rocknix etc.
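The cache emulation item at the top of the list amounts to running a lookup like the following on every cached memory access. This is a minimal Python model, assuming the SH7604 geometry of 4 ways, 64 sets and 16-byte lines as I understand it from the hardware manual. The `CacheModel` class is hypothetical, uses a plain LRU list rather than the real replacement bits, and ignores the cache-through and purge address regions.

```python
LINE_SIZE = 16   # bytes per cache line (assumed SH7604 geometry)
NUM_SETS = 64    # entries per way
NUM_WAYS = 4     # set associativity


class CacheModel:
    """Tag-only cache model: tracks hits and misses, stores no data."""

    def __init__(self) -> None:
        # tags[index] holds the tags in that set, most recently used first.
        self.tags = [[] for _ in range(NUM_SETS)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (and fill the line)."""
        line = (addr & 0x1FFFFFFF) // LINE_SIZE   # drop the region bits
        index = line % NUM_SETS                   # address bits 9..4
        tag = line // NUM_SETS                    # address bits 28..10
        ways = self.tags[index]
        if tag in ways:
            ways.remove(tag)
            ways.insert(0, tag)                   # mark as most recently used
            self.hits += 1
            return True
        ways.insert(0, tag)
        if len(ways) > NUM_WAYS:
            ways.pop()                            # evict least recently used
        self.misses += 1
        return False


def access_cost(cache: CacheModel, addr: int, hit_cost: int, miss_cost: int) -> int:
    """Cycle cost of one access; both costs are caller-supplied placeholders."""
    return hit_cost if cache.access(addr) else miss_cost
```

In a recompiler, the equivalent of `access` would be emitted inline (or called as a helper) around each load and store, ideally only when a per-game option enables it, since the check adds work to every memory access.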

This will probably take more time than I have right now, so I’m wondering if anyone else would be interested in working on this. It’s mostly clear what needs to be done, but it would take a lot of time for testing and debugging. Maybe we could sponsor a series of bounties for each of these tasks. Would anyone be interested in that?

We haven’t had good luck with bounties on dynarecs, historically, but who knows :man_shrugging:

The two VDPs are the bottleneck of Saturn emulation. I doubt replacing an SH-2 interpreter with an SH-2 dynarec is gonna have the level of impact you hope for on beetle-saturn. It’d probably already be an incredible feat if it reduced CPU usage by 10%.

Kronos is actually using a cached interpreter, whose performance is supposedly close to yabasanshiro’s dynarec (source: the author’s benchmarks when he wrote it).
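For readers unfamiliar with the term, a cached interpreter decodes each block of guest code once and stores the decoded handlers, so later runs of the same block skip the decode step. The Python sketch below illustrates the idea on a toy subset of SH-2 opcodes. All names are hypothetical, the block simply ends at RTS with its delay slot ignored, and this is not a description of how Kronos actually implements it.

```python
from typing import Callable, Dict, List

Regs = List[int]
Handler = Callable[[Regs], None]
MASK32 = 0xFFFFFFFF
RTS = 0x000B  # used here only as an end-of-block marker


def decode(opcode: int) -> Handler:
    """Turn one opcode into a closure that applies it to the registers."""
    n = (opcode >> 8) & 0xF
    m = (opcode >> 4) & 0xF
    imm = opcode & 0xFF
    if imm & 0x80:
        imm -= 0x100                      # sign-extend the 8-bit immediate
    top, low = opcode >> 12, opcode & 0xF
    if top == 0xE:                        # MOV #imm,Rn
        def mov_imm(r: Regs) -> None:
            r[n] = imm & MASK32
        return mov_imm
    if top == 0x7:                        # ADD #imm,Rn
        def add_imm(r: Regs) -> None:
            r[n] = (r[n] + imm) & MASK32
        return add_imm
    if top == 0x3 and low == 0xC:         # ADD Rm,Rn
        def add_reg(r: Regs) -> None:
            r[n] = (r[n] + r[m]) & MASK32
        return add_reg
    raise NotImplementedError(hex(opcode))


class CachedInterpreter:
    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory                       # address -> 16-bit opcode
        self.blocks: Dict[int, List[Handler]] = {}
        self.decode_count = 0                      # how many opcodes were decoded

    def _build_block(self, pc: int) -> List[Handler]:
        handlers = []
        while self.memory[pc] != RTS:
            handlers.append(decode(self.memory[pc]))
            self.decode_count += 1
            pc += 2
        return handlers

    def run_block(self, pc: int, regs: Regs) -> None:
        block = self.blocks.get(pc)
        if block is None:                          # decode only on first visit
            block = self.blocks[pc] = self._build_block(pc)
        for handler in block:
            handler(regs)
```

Running the same block twice decodes it only once; a real implementation would also need to invalidate cached blocks when the guest writes to code memory.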

The main reason behind yabasanshiro’s speed is its hardware renderer, which prefers ignoring how the VDP1 actually works.

I’ve tried YabaSanshiro on RK3566, and it does not run fullspeed without dynarec. The difference is quite noticeable, and much more than 10%.

It’s also quite buggy unfortunately. The VDP2 sprite/layer priority is wrong in some games, and often the emulator just crashes.

This is using the old renderer and not Kronos. RK3566 doesn’t support OpenGL 4, so Kronos doesn’t work at all.

One possibility would be to use the more accurate VDP1/VDP2 from mednafen (beetle) along with the dynarec. However, mednafen is 64-bit only, and the ARM dynarec currently only works in 32-bit mode.

Assuming that could be resolved, I’m not sure what the speed would be like. There would presumably be some amount of frameskip. I’d also want to look into why mednafen needs 64 bits, as this usually isn’t good for performance. Compiling for a 64-bit target generally increases code size, which results in more memory usage and a higher L1/L2 cache miss rate.

It’s rather disappointing that there is still no good Saturn emulation on handhelds. Other than maybe a Steam Deck, things are basically in the same state as ten years ago.

I’ve no doubt SH-2 emulation represents a large part of yabasanshiro’s cpu usage, since VDP emulation is offloaded to the gpu with that emulator.

However, beetle-saturn is a different story: it doesn’t offload VDP emulation to the GPU, so the SH-2 emulation represents a much smaller percentage of beetle-saturn’s CPU usage.

I had some time to look into this further, so I built mednafen with profiling support (specifically I needed -fno-omit-frame-pointer to get the call graph with perf). I assume beetle-saturn would show similar results, but it was easier to compile mednafen 1.32.1 for this purpose.

Mednafen runs most of the VDP2 in a separate thread, even if not using the GPU. The results show about 65% of the CPU time in the main thread, 30% in the VDP2 thread, and a few percent in other processes on the system. Since the VDP2 thread generally gets scheduled on a separate CPU core and doesn’t max it out, the bottleneck is the main thread. If the main thread needs more than 100% of one core’s time, then the game lags.

I tested two games, Panzer Dragoon Saga and Princess Crown. Panzer Dragoon Saga is mostly 3D, while Princess Crown is a 2D side-scroller. Looking at the main thread only, the results are as follows:

Panzer Dragoon Saga:

28.6% SH-2 (DoIDIF, C_MemReadRT, C_MemWriteRT)
26.3% SOUND_Update (RunSCSP, 68K)
8.9% VDP1
2.9% VDP2
2.0% SS_SetEventNT()
0.8% MidSync()

Princess Crown:

28.9% SH-2 (DoIDIF, C_MemReadRT, C_MemWriteRT)
30.3% SOUND_Update (RunSCSP, 68K)
2.4% VDP1
5.1% VDP2
2.6% SS_SetEventNT()
1.0% MidSync()

As expected, Panzer Dragoon Saga uses the VDP1 more heavily for its 3D graphics, but for both games the clear majority of the CPU time is spent in SH2 and 68K emulation.

A 68K recompiler might help, but the low hanging fruit is optimizing SH2 (SH7095) since that dynarec is already written and available.
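As a rough sanity check on how much a faster SH-2 core could help, Amdahl’s law can be applied to the profile above. The sketch assumes the 28.6% figure is the SH-2 share of main-thread time, and the 4x speedup for recompiled code is a purely hypothetical placeholder, not a measurement of any dynarec.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole when `fraction` of it gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


SH2_SHARE = 0.286          # SH-2 share of main-thread time (Panzer Dragoon Saga)
HYPOTHETICAL_FACTOR = 4.0  # assumed dynarec speedup, for illustration only

estimate = overall_speedup(SH2_SHARE, HYPOTHETICAL_FACTOR)
upper_bound = 1.0 / (1.0 - SH2_SHARE)   # limit if SH-2 emulation cost nothing
```

Under those assumptions the main thread would run about 1.27x faster, and even an infinitely fast SH-2 core could not exceed about 1.40x, because the sound and video work stays on the same thread.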

This is an example of why emulation of everything from the PS1/SS/N64 generation onward predominantly uses dynamic recompilation.

A cached interpreter could indeed be a good boost without the portability issues of a full-on dynarec.

I’m surprised the sound/68K is using so much. Would something like Musashi be able to slot in there and save some cycles?

YabaSanshiro does use Musashi, so yes. I’d have to look into it more to see what the performance difference is. There’s a bit more going on with sound emulation than just 68K.

I’d expect games like Burning Rangers to be more demanding, but those are surprising numbers nonetheless. I thought it was like Kronos, where VDP1 is the bottleneck.

Burning Rangers is definitely CPU intensive. It seems to heavily use both SH-2 CPUs, and does a lot of 3D drawing with VDP1. SH2 dynarec might help here.

I tried Burning Rangers on YabaSanshiro in RetroArena beta 10, and it gets to the menu, but crashes when you try to start a game. The crash does not happen in Yabause 0.9.15, so there’s a regression in YabaSanshiro, or possibly just something wrong on ARM. Unfortunately I don’t have a good development environment to build all of RetroArena, so efforts to track down that bug will have to wait.

Yabause 0.9.15 needs to build with -fno-PIC -fno-PIE for the assembly code to work, which then won’t link with certain shared libraries. So this is another thing that needs to get fixed before any of this code can be integrated into RetroArch.

I see videos from ten years ago on youtube of people running yabause on Android and ARM Linux, and it looks pretty good. Now I try this, and I find a whole bunch of new bugs and regressions, stuff doesn’t build from source because of broken shared library dependencies, and it’s slow because dynamic recompilation doesn’t work right on 64-bit platforms. I’m wondering what happened. Saturn emulation wasn’t this bad before.

Saturn emulation has been awful for ages. It’s moderately good on beetle saturn and a very beefy PC only. As it is now, medium/low handhelds don’t stand a chance. It runs pretty good on my sd865 (equals my i7-7th gen laptop in performance) but no chance on sd650 (40 fps on yabasanshiro) and that should be faster than medium chips like rk3566.

N64 is pretty bad too for medium/low chips.

YabaSanshiro does run nearly full speed on RK3566 (1.8GHz ARM Cortex-A55). I tried games including Arcana Strikes and Panzer Dragoon Saga, and there is some minor frameskip, but it outputs 30+ FPS. It would be fully playable if not for the VDP2 layering issues and random crashes. Burning Rangers might be a bit slow, but many games would be fine.

I would have thought with newer CPUs the dynarec wouldn’t be necessary, but you need a 3+ GHz CPU, and it’s still kinda laggy.

I profiled the CPU usage in Yabause 0.9.15 on x86-64, and it looks surprisingly similar to Mednafen, except for the VDP2 being on the main thread. The sound processor takes about 25% of the time, mostly in generate_sample() and ScspDspExec(), with 68K being a small part. Switching the 68000 core between C68K and Musashi doesn’t seem to make much difference.

Perhaps some of the sound generation could be done in a separate thread, but other than that, the only thing that can clearly be optimized is the SH2, and fixing that dynarec is going to be a big project.

It’s clear what needs to be fixed, but it will take a lot of time to do it all. This is why I’m wondering if we could incentivize development with bounties.

N64 has some similar issues, although not quite as bad. In fact, the dynarec code in Yabause originally came from Mupen64plus. So maybe if this can get fixed, some of it could be applied to N64 emulation also.

Been keeping an eye on this thread. While I can’t contribute on the technical end I can do my part in this way. I’d be more than happy to drop a quick $2-300 maybe more to start depending on the work that needs to be done to get the ball rolling on a bounty for better Saturn emulation. This is pretty much my favorite console of all time so I’m all in for this.

This will need a big bounty as it’s several months of work at least, but I can contribute some too.

There are two things to optimize. One is to try to offload some of the SCSP to a separate thread. I don’t know how well that would work, but it’s worth looking into since this is a big chunk of the CPU time.
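One way to picture the SCSP offload is a worker thread that receives register writes and render requests through a FIFO queue, so the sound state is only ever touched by that one thread. The Python sketch below shows just the structure. `SoundWorker` and its single `volume` value are hypothetical stand-ins for the real SCSP state, and nothing here reflects how mednafen or Yabause organize their sound code.

```python
import queue
import threading


class SoundWorker:
    """Applies register writes and renders samples on a separate thread."""

    def __init__(self) -> None:
        self.commands: queue.Queue = queue.Queue()
        self.output = []      # rendered samples, appended by the worker
        self.volume = 0       # stand-in for the SCSP register state
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def write_register(self, value: int) -> None:
        # Called from the emulation thread; the write is applied later, in order.
        self.commands.put(("write", value))

    def render(self, count: int) -> None:
        # Ask the worker to produce `count` samples with the current state.
        self.commands.put(("render", count))

    def close(self) -> None:
        self.commands.put(("stop", 0))
        self.thread.join()

    def _run(self) -> None:
        while True:
            kind, arg = self.commands.get()
            if kind == "stop":
                return
            if kind == "write":
                self.volume = arg
            else:
                self.output.extend([self.volume] * arg)
```

Because the queue preserves ordering, writes and renders interleave exactly as the main thread issued them. The hard part this sketch leaves out is anything the main thread needs to read back from the sound hardware, which would force it to wait for the worker.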

The SH2 dynamic recompiler is the much more obvious optimization, since we already know that this works, at least in some cases. The question is where to start with this. I see the crashes in YabaSanshiro, but it’s hard to debug that without a point of reference for what it should be doing.

I’m thinking that a more manageable first step would be to start with Yabause 0.9.15 on x86-64.

The obvious problem with building Yabause 0.9.15 on desktop Linux is the broken dependence on Qt due to the HexValidator class and the -fPIC issue. The HexValidator stuff isn’t that important and can be temporarily removed, but newer Qt requires PIC, and building with -fPIC causes the linker to fail on the assembly code in linkage-x64.s.

This assembly code is the main run loop for the SH2 dynarec and some associated functions. It would be possible to write two versions of this (PIC and non-PIC), but probably the easier thing to do is just generate the proper code at runtime. Doing that requires some improvements to the code generator (REX prefixes, 64-bit pointers, etc.).

The code generator is fairly simple and just writes out instructions then goes back and patches the branch instructions. It would be easier to debug if it structured the code into an internal buffer and printed a logfile before assembling it, so I could see what it is doing.
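The "internal buffer plus logfile" approach could look roughly like this: instructions and labels are recorded symbolically first, then a two-pass assemble step resolves branch targets and produces both the machine code and a human-readable listing. This Python sketch handles only three x86 instructions, and `CodeBuffer` is a hypothetical name rather than anything in the existing generator.

```python
import struct

# Encoded size in bytes of each supported instruction.
SIZES = {"nop": 1, "ret": 1, "jmp": 5}


class CodeBuffer:
    def __init__(self) -> None:
        self.items = []   # list of (mnemonic, operand) in program order

    def emit(self, mnemonic: str, operand=None) -> None:
        if mnemonic not in SIZES:
            raise ValueError(f"unsupported instruction: {mnemonic}")
        self.items.append((mnemonic, operand))

    def label(self, name: str) -> None:
        self.items.append(("label", name))

    def assemble(self):
        """Return (machine code bytes, listing lines)."""
        # Pass 1: assign an offset to every label.
        labels, offset = {}, 0
        for mnemonic, operand in self.items:
            if mnemonic == "label":
                labels[operand] = offset
            else:
                offset += SIZES[mnemonic]
        # Pass 2: encode, with branch targets now known.
        code, listing = bytearray(), []
        for mnemonic, operand in self.items:
            if mnemonic == "label":
                listing.append(f"{operand}:")
                continue
            start = len(code)
            if mnemonic == "nop":
                code.append(0x90)
            elif mnemonic == "ret":
                code.append(0xC3)
            else:  # jmp rel32: displacement is relative to the next instruction
                rel = labels[operand] - (start + SIZES["jmp"])
                code += b"\xE9" + struct.pack("<i", rel)
            text = f"{mnemonic} {operand}" if operand else mnemonic
            listing.append(f"  {start:04x}  {code[start:].hex():<10}  {text}")
        return bytes(code), listing
```

Writing the listing lines to a log next to the SH-2 disassembly would show the guest block and the host code generated for it side by side, without the need to go back and patch bytes after the fact.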

Assuming that is fixed, the next problem is that it doesn’t have proper 64-bit pointers everywhere. To fix this, I would need to add the jump table back into arch_init(), add the struct dynarec_local on x86-64, and fix whatever else I overlooked that doesn’t have proper 64-bit pointers.

There’s also the annoying notrack call / endbr64 stuff (Intel CET indirect branch tracking) on newer x86 CPUs, but hopefully that can just be turned off.

Those changes would hopefully get the dynarec working on modern x86. Then we can see which games are working or not working, and maybe fix whatever is needed for Windows and mac. The next step after that would be to backport the code to ARM, or try integrating with beetle-saturn.

Sounds like a lot of work and testing would need to be done before even beginning to bear any fruits, nevertheless I’m still all in. If it’s a task that will take months I also don’t mind adding more to the bounty as time goes on.

YabaSanshiro standalone runs well on my TrimUI Brick, full speed with frame skip one (30/60 actual frames). Don’t know if that’s supposed to be the best it can do. That’s a 2.0 GHz A53 quad core there. Didn’t notice any crashes (?)

That core version (3.4.2) is from an old 2019 build based on the same version as the standalone emulator from that same year. So many new features, like Vulkan support, have been added to Yaba since then. It’s way out of date but does work for the most part.

That’s similar to the performance I’ve seen on ARM devices. Works fine with one frame skip.

The problem is that this needs to be updated to work with newer emulators and newer operating systems (64-bit). Also I am seeing some random crashes in Yaba Sanshiro 1.9.0.

It’s true that the RetroArch core runs a lot slower, at least on the Brick. Standalone keeps a full pace by skipping 1 frame, while the core is stuck at 40 fps and doesn’t look like it is frame skipping; it’s just slower. It’s acceptable to run standalone and skip 1 frame, and I wouldn’t expect miracles anyway on a little handheld. If it could reach full speed I would take it lol

Relevant to this thread, there’s a new Saturn emulator in dev to watch:
