Can’t you just add the compiler flag with an increased value?
For this particular issue, yes, that's one of the things we need to try. But someone needs to get in there, figure out where in the Metal driver/compiler you would set it, then build and test it on a Mac.
If you know anyone who does programming on the Mac and would like to help us, please let us know.
It seems to be a clang thing, and it's a fairly recent change that's bothering more people than just us: https://github.com/llvm/llvm-project/issues/48973
I can code, but I've been trying to retrieve my old Apple account, since Apple doesn't allow me to do it via email (I need it for Xcode).
Good catch, hunterk.
Unfortunately, the C/UNIX world has a bad history of ill-defined scopes of responsibility. Infamously, ANSI C even fails to make clear whose responsibility it is to define how structs are laid out, which led to incompatibilities (most famously the differences between Win32 Delphi and Visual C).
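Not the Delphi-vs-C case itself, but a quick illustration of the underlying padding problem (the struct here is purely hypothetical), using Swift's MemoryLayout to make the invisible bytes visible:

```swift
// Nothing in this declaration says where padding goes; the standard
// left that to each compiler, which is exactly how ABIs diverged.
struct Mixed {
    var a: UInt8   // 1 byte
    var b: UInt32  // 4 bytes, but must sit on a 4-byte boundary
}

// Three padding bytes are silently inserted after `a`:
print(MemoryLayout<Mixed>.size)      // 8, not the 5 bytes of payload
print(MemoryLayout<Mixed>.alignment) // 4
```

Two compilers that disagree about where (or whether) to insert those padding bytes produce structs that cannot be passed between them.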
I can confirm they all appear to work now, although really, really slowly, at least in the nightly build I'm using (on my 8-core M1). Perhaps it's due to a setting I haven't found or haven't set yet; the app has so many settings that these things can be difficult to nail down.
At any rate, real progress!
Hi, I can confirm that Mega Bezel works with the new Vulkan driver, but it just crawls.
On Macs I highly recommend the koko-aio shader (also under bezel) that performs incredibly well and in HDR!
What GPUs are both of you using in your systems?
It's an M1 Silicon Mac with the integrated CPU/GPU design. Performance-wise it's a beast and quite capable, but these shaders make it crawl. A new M2 Pro Mini is in order; we'll see whether that makes a difference. Still, the M1s are not exactly underpowered … if you have any ideas, I'm very happy to test!
Yeah, I've heard the M1 Silicon is supposed to be quite good.
There is another user who has tried this and gotten good performance on their M1 Pro running macOS Ventura.
I’ll get more test results / numbers from them tonight that I can share with you.
I also have an M1 and am experiencing the same issues. Edit: it can run Shadow of the Tomb Raider handily, even though that game runs through Rosetta (x86 translation).
Can you check what the resolution is coming out of the core?
You can do this by changing the second parameter in the shader parameter list, which is something like "resolution debug".
Do you get the same performance when you run something like NES, SNES, or Genesis?
Hi, here are the test results on an M1 Mac Mini with 16 GB of RAM under macOS Ventura:
FinalBurn Neo / 1942 - HSM_MBZ_3_STD.slangp -> 35 fps
Nestopia / Super Mario Bros. - HSM_MBZ_3_STD.slangp -> 35 fps
Snes9x / Super Mario World - HSM_MBZ_3_STD.slangp -> 30 fps
Screenshots:
Hope it helps
For most Mega Bezel shaders I'm getting unplayable FPS with broken-up, jagged audio, using an M1 on a 4K monitor. Some of the default CRT shaders work fine, however.
Thanks for the tests, this helps a lot in understanding what you are experiencing. I really would not have expected such low performance at that viewport resolution (2392x1792), which is roughly double the pixel count of 1080p HD.
Since the Mega Bezel uses a LOT of RAM compared to other shaders, because of so many passes at viewport resolution, I wonder if it has something to do with RAM bandwidth going in and out of the graphics processor. Does the GPU part of the processor share RAM with the rest of the processor?
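To put a very rough number on that idea (the pass count and pixel format below are guesses for illustration, not measured values), here's a quick back-of-the-envelope in Swift:

```swift
// Back-of-the-envelope: many full-viewport passes, each reading the
// previous pass's output and writing a new render target.
let width = 2392, height = 1792   // viewport resolution from the tests above
let bytesPerPixel = 4             // assuming RGBA8 intermediate targets
let passes = 30                   // assumed pass count, purely illustrative
let fps = 60.0

let bytesPerFrame = Double(width * height * bytesPerPixel * passes * 2) // read + write
print(bytesPerFrame * fps / 1e9)  // ~62 GB/s of texture traffic
```

Even if the real numbers are half that, the intermediate render targets alone would be competing with the CPU cores and the display for the same memory bus.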
How do the Screen-Only presets and the Potato presets perform? (I'm not suggesting these are solutions to your issue, just wondering how they compare.)
Apple Silicon uses "unified memory", which isn't quite the same as shared memory per se. As far as I know, nobody else uses this system: basically anything on the SoC can use the same memory without paging between the different SoC components. In other words, the GPU and CPU can edit the same block without copying between them, using the same pointer (different from NVIDIA's old "unified memory" from years back). The RAM on this system is soldered onto the package via micro-BGA.
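A minimal Metal sketch of that "same pointer" behaviour (the buffer size and contents here are arbitrary placeholders):

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice(),
      let buffer = device.makeBuffer(length: 4096,
                                     options: .storageModeShared) else {
    fatalError("No Metal device available")
}

// The CPU writes straight through the buffer's pointer...
let ptr = buffer.contents().bindMemory(to: Float.self, capacity: 1024)
for i in 0..<1024 { ptr[i] = Float(i) }

// ...and an encoder can bind the very same buffer for the GPU,
// with no blit or DMA copy in between:
//   encoder.setBuffer(buffer, offset: 0, index: 0)
```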
I have 16GB
Doesn't seem much different from what any modern UMA-architecture APU/IGP has been able to do for the last couple of decades. Once either the CPU or the GPU needs to talk to anything outside the SoC, they're using the same shared data bus and bandwidth; unless there are separate pools of RAM connected to dedicated buses and memory controllers, this is how it works. If there were separate pools, things would no longer be shared, and copies would have to be made whenever the CPU and the GPU need to work on the same data. So for all intents and purposes, "shared" means they share the same memory and bus, and "unified" means the same thing.
The fact that they can use pointers to avoid some copying and improve efficiency doesn't change that it's a shared memory architecture, one that doesn't possess the advantages of a dedicated memory architecture.
One of the main constraints of these shared/unified memory architectures is the type of RAM used. For dedicated graphics we have GDDRn and HBM memory to achieve the high bandwidths required by real-time graphics rendering. For the lowest-cost applications, with the slowest and worst memory bandwidth, DDRn is often implemented.
Which does the Apple Silicon use?
DDRn usually has lower bandwidth but also lower latencies than its GDDRn counterparts, because they're optimized for different tasks. When you have to share either one with the CPU, you end up in a situation that can be less than optimal for at least one side: either you tolerate lower total available memory bandwidth, or higher latencies.
So here are some further insights into the HSM presets on an Apple M1 Mac Mini with 16 GB (yes, unified memory) - all FinalBurn Neo / 1942, as the other cores don't behave much differently:
HSM_MBZ_5_POTATO.slangp -> 280 fps (ffwd); basically all POTATO presets I tested ran at 280 fps
HSM_MBZ_4_SCREEN_ONLY_MEGATRON.slangp -> 280 fps (ffwd)
Same with HSM_MBZ_4_STD_NO-REFLECT.slangp
MBZ_4_STD-NO-REFLECT-EASYMODE.slangp -> a bit slower at 260 fps
MBZ_4_STD-NO-REFLECT-SUPER-XBR_GDV -> drops to 160 fps
MBZ_3_STD_MEGATRON.slangp -> crawls at 40 fps (and, interestingly, 97 fps in ffwd)
Note that the koko-aio shader, which I've now started using under Vulkan, also achieves 280 fps, but with all the nice additional effects that the POTATO presets lack.
Whether the memory is "shared" or "unified" may not matter for this application, but it definitely matters when modifying data through Apple's APIs. As for the type of RAM used, there are two things traditional dedicated VRAM has to do (or two bottlenecks to overcome): copy data in via DMA transfers, and quickly modify (large chunks of) data on the GPU over and over again as it traverses the graphics pipeline. It's in the latter that the more it behaves like an L3 cache, the better.
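For contrast, here's roughly what the first of those two bottlenecks (the DMA-style copy) looks like in Metal when memory is not unified; everything below is an illustrative sketch, not code from the app:

```swift
import Metal

// On a discrete GPU you typically stage data in a CPU-visible buffer
// and transfer it into GPU-private memory with a blit (DMA-style copy).
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let staging = device.makeBuffer(length: 4096, options: .storageModeShared),
      let gpuLocal = device.makeBuffer(length: 4096, options: .storageModePrivate),
      let cmd = queue.makeCommandBuffer(),
      let blit = cmd.makeBlitCommandEncoder() else {
    fatalError("Metal setup failed")
}

// staging (CPU-visible) -> gpuLocal (GPU-only):
blit.copy(from: staging, sourceOffset: 0,
          to: gpuLocal, destinationOffset: 0, size: 4096)
blit.endEncoding()
cmd.commit()

// On Apple Silicon a .storageModeShared buffer alone is often enough,
// so this whole round trip can be skipped.
```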
FWIW, according to Apple:
M1       LPDDR4X-4266   128-bit    68.3 GB/s
M1 Pro   LPDDR5-6400    256-bit   204.8 GB/s
M1 Max   LPDDR5-6400    512-bit   409.6 GB/s
M1 Ultra LPDDR5-6400   1024-bit   819.2 GB/s
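Those figures fall straight out of transfer rate times bus width; a quick sanity check (only the helper function itself is my addition):

```swift
// Peak bandwidth = transfers per second x bus width in bytes.
func peakBandwidthGBps(mtPerSec: Double, busBits: Double) -> Double {
    mtPerSec * 1e6 * (busBits / 8) / 1e9
}

print(peakBandwidthGBps(mtPerSec: 4266, busBits: 128))   // M1:       ~68.3
print(peakBandwidthGBps(mtPerSec: 6400, busBits: 256))   // M1 Pro:  ~204.8
print(peakBandwidthGBps(mtPerSec: 6400, busBits: 512))   // M1 Max:  ~409.6
print(peakBandwidthGBps(mtPerSec: 6400, busBits: 1024))  // M1 Ultra: ~819.2
```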