An input lag investigation

fastfade · 29 April 2018 13:19

Hmm good to know! I’ll check with the pause + frame advance method to find out the value for each game.

Still have the audio problem in Super Mario 3 with Nestopia though, not only on the Windows 32bit build running on my Chromebook, but I also quickly tested it on my Shield Android TV now and it’s the same problem there. It handles crt-royale shader at full speed with Nestopia and Snes9x as long as threaded video is enabled (the Shield has multiple times the GPU performance compared to the Chromebook integrated Intel GPU).

So two very different systems with the same audio issue on both. No framerate drops, both devices run stable and smooth. Second instance option on or off, and runahead 1 frame, audio is not working properly with Nestopia in Mario 3. Did anyone else try this?

Dwedit · 29 April 2018 16:05

I have no idea what’s going on with the Bsnes cores, they do get out of sync (release buttons randomly) when using the secondary core. The Mednafen BSNES core is fine however.

papermanzero · 29 April 2018 19:24

I only experienced issues with the graphics (frame jumping). I had to activate the second instance to get rid of the button release issue. By the way, mgba also has issues: sprites disappear and reappear, and some future frames are displayed at random.

The 1.7.2 release is great and the latency option is awesome. However, some cores have to be optimized to work flawlessly with that option.

OldsXCool · 30 April 2018 02:44

Actually it seems that the frame issues while using second instance on BSNES cores have been fixed in the stable release of 1.7.2. I couldn’t run it when using one of the nightly releases before the official version of 1.7.2. But now it seems to work just fine. However, I still have problems with any BSNES core crashing RetroArch when I go to close the emulator. But that’s a separate issue.

EDIT: Never mind. It got much better but BSNES still has save state warping.

papermanzero · 5 May 2018 13:41

I still have issues with bsnes. The frames are skipping and the input is wrong, independent of the second instance. I opened a bug: https://github.com/libretro/beetle-bsnes-libretro/issues/30#issue-319312429

Dwedit · 5 May 2018 15:31

Now we have the question of how to balance Smoothness, Maintaining the original frame rate, and Input Lag.

For smoothness, you want to have Vsynced frames of course. This could be done either via actual vsync, or by the proposed beamraced vsync option.

However, using Vsync does not mean you will match the original framerate. For example, the SNES runs at 60.0985 FPS, which is not 60 FPS. To fit the SNES framerate onto a 60 Hz display, you need to drop a frame about once every 10 seconds.

Then back to Input Lag. In order to be able to drop frames, you need to gradually accumulate a buffered frame over time, and this would gradually increase input lag to one extra frame right before the point where the frame is dropped.
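The frame-drop arithmetic above can be checked with a short sketch. It assumes the display runs at exactly 60.0 Hz (real monitors rarely do), and the function names are made up for illustration:

```python
def seconds_between_dropped_frames(source_fps: float, display_hz: float) -> float:
    """Time for the emulated machine to get one whole frame ahead of the display."""
    surplus_fps = source_fps - display_hz
    if surplus_fps <= 0:
        raise ValueError("no frames need dropping unless the source is faster")
    return 1.0 / surplus_fps


def surplus_frames_per_hour(source_fps: float, display_hz: float) -> float:
    """Frames that must be dropped (or time lost, if they are not) per hour."""
    return (source_fps - display_hz) * 3600.0


# SNES figure from the post; the 60.0 Hz display rate is an assumption.
drop_interval = seconds_between_dropped_frames(60.0985, 60.0)
hourly_surplus = surplus_frames_per_hour(60.0985, 60.0)
print(f"drop one frame every {drop_interval:.1f} s")
print(f"{hourly_surplus:.0f} frames per hour, about {hourly_surplus / 60.0:.1f} s of video")
```

The same surplus is what accumulates as buffered lag: the buffer grows by one frame over each drop interval before the drop resets it.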

There is a workaround to help match the original framerate on Windows: Custom Refresh Rates. However, this requires a reboot between each change of a custom refresh rate, so if you want to play SNES games (60.098 FPS) and then Genesis games (59.923 FPS), your refresh rate won’t be correct for the second system.

When the original frame rate is very far from the monitor refresh rate (50FPS on 60Hz screen), then everything pretty much goes out the window, and you are forced to do full buffering, and all the extra lag that entails.

Dwedit · 5 May 2018 16:12

Only problem is if you’re doing a speedrun race or something, then you need to match the original console timing. Losing 6 seconds over the course of an hour is pretty much fatal for a race.

RetroMiku · 8 May 2018 23:35

I’ve heard that the GroovyArcade Linux distro contains additional software or settings to reduce input lag at the OS/Driver level, so I would be interested in seeing a latency test of that OS against Windows 10 or other Linux distros.

Dwedit · 14 May 2018 13:55

So I was just testing out the D3DKMT API to get scanline position information, and seeing how it compares to the timers. Seems that yes, if the refresh rate is accurate, you can predict what scanline you’re on from timer information alone.

Then I benchmarked the different timers, and found that QueryPerformanceCounter was the most suitable timer to use, and the ancient timeGetTime from WinMM is basically worthless, as it takes the same amount of time to run as QueryPerformanceCounter, and is only precise to 1ms, +/- 0.5ms. Actually reading the scanline position is around 100 times slower than reading the timer.

Tromzy · 19 May 2018 07:19

I have a problem with The Lion King, SNES version, on Bsnes-mercury-balanced. Some frames are skipped. For example, if I jump, the end of the jumping animation is skipped.

rafan · 25 May 2018 08:18

That’s pretty cool to know. Does this mean there’s some work being done on a possible beamracing vsync option for RA? That would be really awesome

On the topic of the run-ahead feature, is there any chance that you will be taking a look at some of those cores where the save state ability hasn’t been hooked up yet?

If I may do a shameless plug, one that specifically comes to mind is the BlueMSX core. The MSX has some really neat NES style games (Konami has its origins on it) so it may have your interest. It has for example excellent versions of the Gradius series, and Metal Gear and its successor Solid Snake originated on it. It really deserves a proper save state and run-ahead feature working…

Top 10 Konami games on MSX

mdrejhon · 12 June 2018 19:56

For those interested in my raster work, I’ve posted about my Tearline Jedi Demo in the pouet.net demoscene forums. Including a new video of my real-raster Kefrens Bars demo working on Radeon/GeForce GPUs!

(Intentionally glitched near the end by dragging a window around, screwing raster timings)

That’s one “pixel row” per VSYNC OFF frameslice
Over 100 frameslices per refresh cycle
8000 tearlines per second

Generating raster pixel rows in real time! Each raster “pixel row” is a completely separate GPU framebuffer. VSYNC OFF with tearlines separating each pixel row in this demo. Using the same “Kefrens Bars” raster mathematics as used in the Amiga Copper days.

At 8000 frameslices per second, the Kefrens Bars coarse “pixel rows” are hitting the graphics output port 1/8000sec after the Present() API call. Any lag that happens afterwards is display lag / cable codec lag / etc.

It’s way overkill for emulators, which do just fine with higher resolution frame slices, but it demonstrates raster-realtime graphics of Atari TIA lore. Raster real time pixel rows crammed through overkill VSYNC OFF frameslicing!

Works on Radeon/GeForce. Compiles on PC/Mac. Probably the world’s first real-raster Kefrens Bars (Alcatraz Bars) on modern GPUs. Not simulated. Real rasters. It even glitches with intentional raster timing errors (launching software in background):

That’s why my source code is not yet released – I am debating whether to submit this to a demoscene venue (after the graphics are improved) before opensourcing. It will be opensourced, but demoscene submission rules require holding off opensourcing until afterwards.

Dwedit · 13 June 2018 15:00

You got your demoscene program, but how about the source code of a beamracing hello world program?

mdrejhon · 15 June 2018 23:45

It is the same thing. The separate modules (the draggable tearline, the bouncing color bars, the Kefrens Bars) are only approximately 100 lines each after blackboxing a few thousand lines into a core easy-to-use beam racing “library” (classes for now, but it could be ported/rolled into a library later).

As for the demoscene style modules (which are actually great beam racing sandboxes for beginners too, being C#): there are modules that do different things, such as the mouse-dragged tearline, which is approximately 100 lines.

Due to demoscene rules of most of the exhibits, I can’t release any /portion/ of the code until it’s exhibited, even if I fork the core part.

If I don’t submit to a demoscene, then I can release pretty much immediately. If I do then I have to release code afterwards.

I’ll make a decision during July (which then declares the release date of Tearline Jedi) – keep tuned.

More than 75% of people I asked told me to save it for demoscene event first, so that is leaning that way. It’s an audacious abuse of high-framerate VSYNC OFF to real time rasters so has already surprised a few – and even more when I told them it’s just an easy Nerf language (C#) running off an opensource cross platform game engine (MonoGame C#) doing it instead of timing-critical assembly. Turning beam racing into child’s play, essentially.

It would be lovely to see this get displayed on the big screen at Assembly 2018 in front of that large audience, but let’s see.

However, early private invites to the git repo for co-development are fine. If you want to help me improve this demo for a demoscene submission, contact me at [email protected] … This is allowed according to the rules if you privately want to see the source code now (as long as you don’t use it in other software until it’s exhibited).

Ideas & alternate solutions are welcome… Generally, I want to broadcast this idea a little farther and wider. Appreciated!

DI2edd · 15 June 2018 09:32

So, if I understand this correctly, it would benefit enormously from a single buffering pipeline, right? Because then you wouldn’t have to call Present() or SwapBuffers() or whatever thousands of times a second.

Actually, I’d like to see if there is a difference in performance with something like Linux KMS/DRM, which allows for true single buffering compared to Windows.

mdrejhon · 15 June 2018 23:54

Yes, it would benefit enormously with a front buffer – you could just simply render single pixel rows directly to the front buffer a few scanlines ahead of the real raster. The beam racing margin could be adjustable in that case.

But without access to that, we still can at least coarsely simulate front buffer rendering via frameslicing.

You don’t necessarily need thousands of frameslices per second – only Kefrens Bars requires that – it’s only for “Atari TIA style” realtime generation of pixel rows. That, indeed, does require thousands of Presents() a second for realtime generated “pixel rows”.

However, that is not the beamracing workflow for emulators.

From the emulator POV, it is very scalable.

1 frameslice per refresh cycle - 60 Presents() or glutSwapBuffers() per second of full screenfuls
2 frameslices per refresh cycle - 120 Presents() or glutSwapBuffers() per second of 1/2 screenfuls
3 frameslices per refresh cycle - 180 Presents() or glutSwapBuffers() per second of 1/3 screenfuls
4 frameslices per refresh cycle - 240 Presents() or glutSwapBuffers() per second of 1/4 screenfuls

etc.

So 10 frameslices per refresh cycle means 600 Presents() or glutSwapBuffers() per second of 1/10 screenfuls. Input lag, if using the tight jittermargin technique, is between 1 and 2 frameslices worth. So 600 frameslices per second gives as little as 1/10th of a frame of input lag.
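The scaling list above can be reproduced with a small sketch. It assumes a 60 Hz display and takes the "between 1 and 2 frameslices" lag bound from the post at face value; `frameslice_budget` is a made-up helper name:

```python
def frameslice_budget(slices_per_refresh: int, refresh_hz: float = 60.0):
    """Return (presents per second, min lag in ms, max lag in ms)."""
    presents_per_second = slices_per_refresh * refresh_hz
    slice_ms = 1000.0 / presents_per_second  # duration of one frameslice
    # Tight jittermargin: lag is between 1 and 2 frameslices worth of time.
    return presents_per_second, slice_ms, 2.0 * slice_ms


for slices in (1, 2, 3, 4, 10):
    presents, lag_min, lag_max = frameslice_budget(slices)
    print(f"{slices:>2} slices: {presents:.0f} presents/s, lag {lag_min:.2f} to {lag_max:.2f} ms")
```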

All raster-timed of course (obviously, the 1-frameslice case is simply just racing the VBlank…). Since you’d know where the realraster is, you can just render only the frameslice, rather than the whole framebuffer, and present it to the appropriate area of the screen.

I might be able to raise my frameslice rate to over 10,000 frameslices per second that way, but the MonoGame engine doesn’t let me get that low level. However, Direct3D/OpenGL may have some arguments that allow you to blit your offscreen buffer only to the specific region of the onscreen buffer, saving the memory bandwidth of uselessly blitting the areas that are outside of the current raster. This may be something for future experimentation.

For Tearline Jedi, I’m keeping it lowest common denominator – all it requires is a tearlines-supported API – but there are indeed many optimizations possible.

While RunAhead is really nice, this is a useful additional tool in the lag-reduction toolbox for the “right tool for right job” situations. It reproduces the original latency more faithfully, and at low frameslice counts it is more Arduino-friendly than the RunAhead technique. This technique is already being used by certain Android VR apps for 4-frameslice beam racing.

Twinaphex · 22 June 2018 10:14

I could be wrong, but what I think @Dwedit means is that he would just like to start working on an implementation for RetroArch for this idea instead of stalling this out for months on end and having to wait on some demo to drop first. I can kinda understand where he is coming from too if that is the case.

mdrejhon · 4 July 2018 18:25

Fair comment (Liked).

Lemme see how we can accelerate things. I’ll spend a couple hours to write this proposal post to see if this is a good plan:

Some considerations:

The demo testing will help vet out problems/bugs.
The code is in C# which is not the same language as LibRetro.
Porting the core libraries to C or C++ was going to be a later project, or a job for a volunteer.

So, need to figure out which is faster:

Wait for me to release source code (or gain private access to my git repo)
Use it as a private sandbox to learn frameslice beamracing
Write LibRetro frameslice beamracing from scratch

Or

Bootstrap by learning from WinUAE codebase which already has frameslice beamracing
Use it as an educational sandbox
Write LibRetro frameslice beamracing from scratch

Or

Wait (longer) for the C/C++ ports of the raster calculator modules
Use it directly within LibRetro frameslice beamracing.

Whichever of these approaches is taken, some elements need to be written from scratch as LibRetro prerequisites, independently of all the above. Let me see if I can suggest a blueprint of how to proceed…

Recommended Hook

Add the per-raster callback function called “retro_set_raster_poll”
The arguments are identical to “retro_set_video_refresh”
Do it to one emulator module at a time (begin with the easiest one).

It calls the raster poll for every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) are provided. This allows centralization of frameslice beamracing in the quickest and simplest way.
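To make the proposed hook concrete, here is a toy model of it in Python. Nothing here is real libretro code: `ToyCore` and `ToyFrontend` are hypothetical stand-ins. The callback takes (data, width, height, pitch), mirroring the arguments of the existing video refresh callback, and the frontend infers progress by counting calls.

```python
class ToyCore:
    """Stand-in for an emulator core; not a real libretro core."""

    def __init__(self, width: int = 8, height: int = 6):
        self.width, self.height = width, height
        self.pitch = width  # one byte per pixel in this toy
        self.framebuffer = bytearray(self.pitch * height)
        self.raster_poll = None

    def set_raster_poll(self, callback):
        # Hypothetical analogue of the proposed retro_set_raster_poll.
        self.raster_poll = callback

    def run_frame(self):
        for y in range(self.height):
            start = y * self.pitch
            # "Plot" one emulator scanline (non-zero pixels mark it as drawn).
            self.framebuffer[start:start + self.width] = bytes([y + 1]) * self.width
            if self.raster_poll is not None:
                # Hand over the incomplete framebuffer after every scanline.
                self.raster_poll(self.framebuffer, self.width, self.height, self.pitch)


class ToyFrontend:
    def __init__(self):
        self.lines_done = 0
        self.snapshots = []

    def on_raster_poll(self, data, width, height, pitch):
        self.lines_done += 1
        # Count how many rows of the partial framebuffer are plotted so far.
        plotted = sum(1 for y in range(height) if data[y * pitch] != 0)
        self.snapshots.append(plotted)


core = ToyCore()
frontend = ToyFrontend()
core.set_raster_poll(frontend.on_raster_poll)
core.run_frame()
print(frontend.snapshots)
```

A real frontend would reset its counter at the start of every emulated frame; the toy runs a single frame so it skips that.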

Getting the VSYNC timestamps

This technique is only needed for the register-less method, to listen for VSYNC timestamps while in VSYNC OFF mode, and to poll the raster line:

Get your primary display adaptor URL such as \\.\DISPLAY1 … For me in C#, I use Screen.PrimaryScreen.DeviceName to get this, but in C/C++ you can use EnumDisplayDevices() …
Next, call D3DKMTOpenAdapterFromHdc() with this info to open the hAdaptor handle
For listening to VSYNC timestamps, run a thread with D3DKMTWaitForVerticalBlankEvent() on this hAdaptor handle. Then immediately record the timestamp. This timestamp represents the end of a refresh cycle and beginning of VBI. That’s your VSYNC callback signal.

Other platforms have various methods of getting a VSYNC event hook (e.g. Mac CVDisplayLinkOutputCallback), which roughly corresponds to the Mac’s blanking interval. If you are using the registerless method and generic precision clocks (e.g. RDTSC wrappers), the various methods of getting VSYNC timestamps can potentially be the only #ifdefs in your cross platform beam racing. The rest is platform-independent.

Getting the current raster scan line number

For raster calculation you can do one of the two:

(A) Raster-register-less method: Use QueryPerformanceCounter to profile the time between refresh cycles. You can use the known fractional refresh rate (from QueryDisplayConfig) to bootstrap this “best-estimate” refresh rate calculation, and refine it in realtime. Calculating the raster position is simply a matter of relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be the display scanning out). NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm down to 1-scanline-or-less raster jitter) and ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use the jittermargin technique to hide VSYNC timestamp inaccuracies.

(B) Raster-register-method: Use D3DKMTGetScanLine to get your GPU’s current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.

NOTE: If you need to retrieve the “hAdaptor” parameter for D3DKMTGetScanLine, get your adaptor URL such as \\.\DISPLAY1 via EnumDisplayDevices() … Then call D3DKMTOpenAdapterFromHdc() with this adaptor URL in order to open the hAdaptor handle, which you can then finally pass to D3DKMTGetScanLine. That works with Vulkan/OpenGL/D3D/9/10/11/12+ … D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need to know is the scanline number. So who gives a hoot about the “D3DKMT” prefix; it works fine for beamracing with OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don’t need Admin privileges to do this specific API call from userspace.)
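Method (A) above can be sketched platform-neutrally as follows. This is a minimal illustration, not anyone's shipped implementation: timestamps are plain floats in seconds (on Windows they would come from QueryPerformanceCounter), the 5% VBI figure is the rough default from the post, and the 50% outlier cutoff is an assumption of this sketch.

```python
from typing import Optional, Sequence


def refine_refresh_period(vsync_times: Sequence[float], nominal_period: float) -> float:
    """Average recent VSYNC intervals, ignoring outliers such as missed VSYNCs."""
    intervals = [b - a for a, b in zip(vsync_times, vsync_times[1:])]
    # Assumption: anything more than 50% off the nominal period is an outlier.
    good = [iv for iv in intervals if abs(iv - nominal_period) <= 0.5 * nominal_period]
    return sum(good) / len(good) if good else nominal_period


def estimate_scanline(now: float, last_vsync: float, period: float,
                      active_lines: int = 1080,
                      vbi_fraction: float = 0.05) -> Optional[int]:
    """Estimate the scanline being scanned out, or None while inside the VBI."""
    phase = ((now - last_vsync) % period) / period  # 0.0 .. 1.0 through the cycle
    if phase < vbi_fraction:
        return None  # the VSYNC timestamp marks the start of the VBI
    active_phase = (phase - vbi_fraction) / (1.0 - vbi_fraction)
    return min(int(active_phase * active_lines), active_lines - 1)


stamps = [0.0, 1 / 60, 2 / 60, 4 / 60, 5 / 60]  # one VSYNC was missed
period = refine_refresh_period(stamps, 1 / 60)
line = estimate_scanline(now=5 / 60 + period * 0.5255, last_vsync=5 / 60, period=period)
print(period, line)
```

Feeding only the trailing second of timestamps into `refine_refresh_period` gives the trailing 1-second average described above.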

Improved VBI size monitoring

You don’t need raster-exact precision for basic frameslice beamracing, but knowing the VBI size makes frameslice beamracing more accurate, since VBI size varies so much from platform to platform and resolution to resolution. Often it just varies by a few percent, and most sub-millisecond inaccuracies are easily hidden within the jittermargin technique.

But, if you’ve programmed on retro platforms, you are probably familiar with the VBI (blanking interval) – essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timing tweaks can make the VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks – fast scan refresh cycles with long VBIs in between). For cross platform frameslice beamracing it’s OK to assume ~5% for the VBI, but there are many tricks for finding out the VBI size.

QueryDisplayConfig() on Windows will tell you the Vertical Total. (easiest)
Or monitoring the ratio of .INVBlank = true versus .INVBlank = false … (via D3DKMTGetScanLine) by monitoring the flag changes (wait a few microseconds between polls, or 1 scanline delay – D3DKMTGetScanLine is an ‘expensive’ API call)
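Both ways of sizing the VBI reduce to simple ratios. The sketch below uses hypothetical helper names; the 1125-line Vertical Total is the common 1080p60 timing and is only an example value, so query the real one on your system. The polling variant assumes the samples are evenly spaced in time.

```python
def vbi_fraction_from_vertical_total(vertical_total: int, active_lines: int) -> float:
    """Share of each refresh cycle spent in blanking, from the Vertical Total."""
    return (vertical_total - active_lines) / vertical_total


def vbi_fraction_from_polls(in_vblank_samples) -> float:
    """Share of evenly spaced polls that reported being inside the VBI."""
    samples = list(in_vblank_samples)
    return sum(1 for flag in samples if flag) / len(samples)


# Example value only: 1080 active lines out of a Vertical Total of 1125.
print(vbi_fraction_from_vertical_total(1125, 1080))
# One poll in twenty landed inside the VBI.
print(vbi_fraction_from_polls([True] + [False] * 19))
```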

Turning The Above Data into Real Frameslice Beamracing

For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)

1. Have a configuration parameter for the number of frameslices (e.g. 10 frameslices per refresh cycle).
2. Let’s assume 10 frameslices for this exercise.
3. An actual screen of 1080p means 108 real pixel rows per frameslice.
4. An emulator screen of 240p means 24 emulator pixel rows per frameslice.
5. Your emulator module calls the centralized raster poll (retro_set_raster_poll) right after every emulator scan line. The centralized code (retro_set_raster_poll) counts the number of emulator pixel rows completed to fill a frameslice. The central code will do either (5a) or (5b):

(5a) Return immediately to the emulator module if a full new frameslice has not yet been appended to the existing offscreen emulator framebuffer (don’t do anything to the partially completed framebuffer). Update a counter, do nothing else, return immediately.

(5b) However, once a full frameslice worth has built up since the last frameslice presented, it’s time to present the next frameslice. Don’t return right away. Instead, immediately do an intentional CPU busyloop until the realraster reaches roughly 2 frameslice-heights above your emulator raster (relative screen-height wise). So if your emulator framebuffer is filled up to the bottom edge of where frameslice #4 is, then busyloop until the realraster hits the top edge of frameslice #3. Then immediately Present() or glutSwapBuffers() upon completing the busyloop. Then Flush() right away.

NOTE: The tearline (invisible if the graphics at the raster are unchanged) will sometimes be a few pixels below the scan line number (the amount of time for a memory blit, which is memory bandwidth dependent). You can compensate for it, or you can just hide any inaccuracy in the jittermargin.

NOTE2: This is simply the recommended beamrace margin to begin experimenting with: a 2 frameslice beamracing margin is very jitter-margin friendly.
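The counting-and-waiting logic in the steps above can be sketched as a small class. All names are hypothetical: `read_real_raster` and `present` are injected callables standing in for the scanline query and for Present()/glutSwapBuffers() plus Flush(), so the sketch runs without a GPU. A negative wait target (the first frameslices of a frame) is simply treated as "present immediately", which is a simplification.

```python
class FramesliceBeamracer:
    def __init__(self, read_real_raster, present, emu_lines=240,
                 real_lines=1080, slices=10, margin_slices=2):
        self.read_real_raster = read_real_raster  # returns current real scanline
        self.present = present                    # presents (and flushes) the frame
        self.emu_rows_per_slice = emu_lines // slices    # 24 for 240p / 10
        self.real_rows_per_slice = real_lines // slices  # 108 for 1080p / 10
        self.margin_slices = margin_slices
        self.emu_rows_done = 0

    def start_frame(self):
        self.emu_rows_done = 0

    def target_real_row(self, slices_done: int) -> int:
        """Real scanline to wait for: margin_slices above the emulator raster."""
        return (slices_done - self.margin_slices) * self.real_rows_per_slice

    def on_raster_poll(self) -> bool:
        """Call once per emulator scanline; returns True if a present happened."""
        self.emu_rows_done += 1
        if self.emu_rows_done % self.emu_rows_per_slice != 0:
            return False  # (5a): frameslice not complete yet, return at once
        slices_done = self.emu_rows_done // self.emu_rows_per_slice
        target = self.target_real_row(slices_done)
        while self.read_real_raster() < target:  # (5b): busyloop on the raster
            pass
        self.present()
        return True


# Fake hardware for a dry run: the "raster" advances 50 lines per read.
fake_raster = iter(range(0, 100000, 50))
presented = []
racer = FramesliceBeamracer(lambda: next(fake_raster), lambda: presented.append(1))
racer.start_frame()
for _ in range(240):
    racer.on_raster_poll()
print(len(presented), racer.target_real_row(4))
```

With frameslice #4 complete, the target works out to row 216, the top edge of frameslice #3 when slices are 108 rows tall.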

Note: The 120Hz scanout diagram is from a different post of mine. Substitute an emulator refresh rate matching the real refresh rate, i.e. a monitor set to 60 Hz instead. This diagram is to help raster veterans conceptualize how modern-day tearlines relate to raster position as a time-based offset from VBI.

Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display’s realraster is showing only the latest completed portions of your emulator’s framebuffer. You’re simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of real raster. No tearlines show up because the already-refreshed-part is duplicate (unchanged) where the realraster is. It thusly looks identical to VSYNC ON.

Precision Assumptions:

Scaling doesn’t have to be exact.
The two frameslice offset gives you a one-frameslice-ahead jitter margin
You can vary the height of consecutive frameslices if you want, slightly, or lots, or for rounding errors.
No artifacts show because the frameslice seams are well into the jitter margin.

Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization.

Adverse Conditions To Detect To Automatically disable beamracing

Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify common conditions, such as making sure all of these are met:

Rotation matches (scan direction same) = true
Supported refresh rate = true
Module has a supported raster hook = true
Emulator performance is sufficient = true

Exiting beamracing can be simply switching to “racing the VBI” (doing a Present() between refresh cycles), so you’re just simulating traditional VSYNC ON via VSYNC OFF via that manual VSYNC’ing. This is like 1-frameslice beamracing (next frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically. A Surface Tablet gets rotated, a module gets switched, refresh rate gets changed mid-game, etc…
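A minimal sketch of that on-the-fly fallback, assuming the four conditions listed above are already available as booleans (how each one is detected is platform-specific and not shown; the names are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class BeamraceConditions:
    rotation_matches: bool
    refresh_rate_supported: bool
    core_has_raster_hook: bool
    performance_sufficient: bool


def choose_frameslices(conditions: BeamraceConditions, configured_slices: int) -> int:
    """Use the configured count when every condition holds, else race the VBI."""
    all_ok = (conditions.rotation_matches
              and conditions.refresh_rate_supported
              and conditions.core_has_raster_hook
              and conditions.performance_sufficient)
    # 1 frameslice means a single Present() between refresh cycles.
    return configured_slices if all_ok else 1


print(choose_frameslices(BeamraceConditions(True, True, True, True), 10))
print(choose_frameslices(BeamraceConditions(False, True, True, True), 10))
```

Re-evaluating this once per refresh cycle would let a rotated tablet or a mid-game refresh rate change drop back to 1-frameslice mode and return later.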

General Best Practices

Debugging raster problems can be frustrating, so here’s accumulated knowledge from myself/Calamity/Toni Wilen/Unwinder/etc. These are big timesaver tips:

Raster error manifests itself as tearline jitter.
If jitter is within the raster jittermargin, no tearing or artifacts show up.
It’s an amazing performance profiling tool; tearline jitter makes your performance fluctuations very visible. In debug mode, use color-coded tints for your frameslices, to help make normally-hidden raster jitter more visible (WinUAE uses this technique).
Raster error is more severe at the top edge than the bottom edge. This is because the GPU is busier during this region (e.g. scheduled Windows compositing thread, stuff that runs every VSYNC event in the Windows Kernel, etc). It’s minor, but it means you need to make sure your beam racing margin accommodates this.
GPU power management. If your emulator is very light on a powerful GPU, your GPU’s fluctuating power management will amplify raster error. This may mean that having too few frameslices will amplify tearline jitter. Fixes include (A) configure more frameslices (B) simply detect when the GPU is too lightly loaded and make it busy one way or another (e.g. automatically use more frameslices). The rule of thumb is don’t let the GPU idle for more than a millisecond if you want scanline-exact rasters. Or you can simply use a bigger jittermargin to hide raster jitter.
If you’re using D3DKMTGetScanLine… do not busyloop on it because it stresses the GPU. Do a CPU busyloop of a few microseconds before polling the raster register again.
Do a Flush() before your busyloop before your precision-timed Present(). This massively increases accuracy of frameslice beamracing. But it can decrease performance.
Thread-switching on some older CPUs can cause RDTSC or QueryPerformanceCounter to tick backwards unexpectedly. So keep QueryPerformanceCounter polls on the same CPU thread with a BeginThreadAffinity. You probably already know this from elsewhere in the emulator, but it is mentioned here as being relevant to beamracing.
Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the emulator’s previous refresh cycle framebuffer. That way, there’s no blank/black area underneath the emulator raster. This will greatly reduce the visibility of glitches during beamrace fails (falling outside of the jitter margin – too far behind / too far ahead) – no tearing will appear unless within 1 frameslice of the realraster, or 1 refresh cycle behind. That is a humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical – e.g. 2-frameslice beamracing (bottom-half screen Present() while still scanning out the top half, and top-half screen Present() while scanning out the bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless, looking identical to VSYNC ON, for many minutes before you might see a brief instantaneous tearline from a momentary computer freeze, which instantly disappears when the beamrace gets back in sync.
Some platforms support microsecond-accurate sleeping, which you can use instead of busylooping. Some platforms can also set the granularity of the sleep (there’s an undocumented Windows API call for this). As a compromise, some of us just do a normal thread sleep until a millisecond prior, then do a busyloop to align to the raster.
Don’t worry about mid-scanline splits (e.g. HSYNC timings). We don’t have to worry about such sheer accuracy. The GPU transceiver reads full pixel rows at a time. Being late for a HSYNC simply means the tearline moves down by 1 pixel. Still within your raster jitter margin. We can jitter quite badly when using a forgiving jitter margin – (e.g. 100 pixels amplitude raster jitter will never look different from VSYNC ON). Precision requirement is horizontal scanrate (e.g. 67KHz means 1/67000sec precision needed for scanline-exact tearlines – which is way overkill for 10-frameslice beamracing which only needs 1/600sec precision at 60Hz).
Use multimonitor. Debugging is way easier with 2 monitors. Use your primary in exclusive full screen mode, with the IDE on a 2nd monitor. (Not all 3D frameworks behave well with that, but if you’re already debugging emulators, you’ve probably made this debugging workflow compatible already anyway). You can do things like write debug data to a console window (e.g. raster scanline numbers) when debugging pesky raster issues.
Some digital display outputs exhibit micropacketization behavior (DisplayPort at lower resolutions especially, where multiple rows of pixels seem to squeeze into the same packet – my suspicion). So your raster jitter might vibrate in 2 or 4 scan line multiples rather than single-scanline multiples. This may or may not happen more often with interleaved data (DisplayPort cable handling 2 displays or other PCI-X data) but they are still pretty raster-accurate otherwise: the raster inaccuracies are sub-millisecond and fall far within the jitter margin. Advanced algorithms such as DSC (Display Stream Compression in new DisplayPort implementations) can amplify raster jitter a bit. But don’t worry; all known micro-packetization inaccuracies fall well within the jittermargin technique, so no problem. I only mention this in case you find raster-jitter differences between different video outputs.
Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice #9 (rasterplotting on top of the previous refresh cycle’s framebuffer), you gain a full wraparound jittermargin (you see, step #9 allows you to Present() the previous refresh cycle on the bottom half of the screen, while still rendering the top half…). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, your jitter margin is a full wraparound to the previous refresh cycle. The bounds are pageflip too late (more than 1 refresh cycle ago) or pageflip too soon (into the same frameslice still not completed scanning-out onto the display). Between these two bounds is one full refresh cycle minus one frameslice! So don’t worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where the margin between realraster and emuraster can randomly vary) in this case… It still looks like VSYNC ON perfectly until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep the beam racing margin tight (you could make the beamrace margin adjustable as a config value, if desired – though I just recommend “aim the Present() at 2 frameslice margin” for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines in the jittermargin makes frameslice beam-racing very forgiving of transient background activity.
Get familiar with how it scales up/down well to powerful and underpowered platforms. Yes, it works on Raspberry PI. Yes, it works on Android. While high-frameslice-rate beamracing requires a powerful GPU, especially with HLSL filters, low-frameslice beamracing makes it easier to run cycle-exact emulation at a very low latency on less powerful hardware - the emulator can merrily emulate at 1:1 speed (no surge execution needed) spending more time on power-consuming cycle-exactness or ability to run on slower mobile GPUs. You’re simply early-presenting your existing incomplete offscreen emulator framebuffer (as it gets progressively-more-complete). Just adjust your frameslice count to an equilibrium for your specific platform. 4 is super easy on the latest Androids and Raspberry PI (Basically 4 frameslice beam racing for 1/4th frame subrefresh input lag – still damn impressive for a PI or Android) while only adding about 10% overhead to the emulator.
If you are on a platform with front buffer rendering (single buffer rendering), count yourself lucky. You can simply rasterplot new pixel rows directly into the front buffer instead of keeping the buffer offscreen (as you already are)! And plot on top of existing graphics (overwrite the previous refresh cycle) for a jitter margin of a full refresh cycle minus 1-2 pixel rows! Just provide a config parameter for the beamrace margin (vertical screen height percentage difference between emuraster and realraster), to adjust the tightness of beamracing. You can support the frameslicing VSYNC OFF technique and the frontbuffer technique with the same suggested API, retro_set_raster_poll – it makes it futureproof for future beamracing workflows.
Yes, it works with curved scanlines in HLSL/filter type algorithms. Simply adjust your beamracing margin to prevent the horizontally straight realraster from touching the top parts of curved emurasters. Usually a few pixel rows will do the job. You can add a scanlines-offset-adjustment parameter or a frameslice-count-offset adjustment parameter.
Temporarily turn off debug output when programming/debugging real world beam racing. When running in debug mode, create your own built-in graphics console overlay, not a separate console window – don’t use debug console-writing to the IDE or a separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer, and try to buffer your debug-text-writing until your blanking interval, then display it as a block of text at the top of the screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen will cause far fewer glitches than trying to run a 2nd separate window of text. IDE debug overheads and shell window overheads can cause massive beam-racing glitches if you try to output debug text – some debug output commands can cause a >16ms stall. I suspect that some IDEs are programmed in a garbage-collected language and sometimes the act of writing console output causes a garbage-collect event to occur. Or there are some other really nasty operating-system / IDE environment overheads. So if you’re running in debug mode while debugging raster glitches, temporarily turn off the usual debug output mechanism, and output instead as a graphics-text overlay on your existing 3D framebuffer, even if it means redundant re-drawing of a line of debugging text at the top edge of the screen every frame.
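
As a rough sketch of the jitter-margin arithmetic from the best practices above (Python used only for illustration, not taken from any emulator source): the function names are made up for this example, and the millisecond conversion assumes the vertical total equals the visible height, i.e. it ignores the blanking interval.

```python
def jitter_margin_scanlines(screen_height: int, frameslices: int) -> int:
    """Full-wraparound jitter margin: one refresh cycle minus one frameslice."""
    slice_height = screen_height // frameslices
    return screen_height - slice_height


def jitter_margin_ms(screen_height: int, frameslices: int, refresh_hz: float) -> float:
    """The same margin expressed as time.

    Assumption: vertical total == visible height (blanking interval ignored),
    so this slightly misstates the real figure on actual display timings.
    """
    refresh_ms = 1000.0 / refresh_hz
    margin = jitter_margin_scanlines(screen_height, frameslices)
    return refresh_ms * margin / screen_height
```

With 10 frameslices at 1080p, `jitter_margin_scanlines(1080, 10)` gives the 972 scanlines quoted above; dropping to 4 frameslices shrinks the margin to 810 scanlines because each slice is taller.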

Hopefully these best practices reduce the amount of hairpulling during frameslice beamracing.

Special Notes

Special Note about Rotation: Emulator devices should already report their screen orientation (portrait, landscape), which generally also defines scan direction. QueryDisplayConfig() will tell you the real screen orientation. Default orientation is always top-to-bottom scan on all PC/Mac GPUs. 90 degree counterclockwise display rotation changes scan direction into left-to-right. If emulating Galaxian, this is quite fine: if you’re rotating your monitor (left-right scan) and emulating Galaxian (left-right scan), then beamracing works.
Special Note about Unsupported Refresh Rates: Begin with KISS and worry about 50Hz/60Hz only at first. Start easy. Then iterate by adding support for other refresh rates, like multiples. 120Hz is simply cherrypicking every other refresh cycle to beam race. For the in-between refresh cycles, just leave the existing frame up (the already completed frame) until the refresh cycle that you want to beamrace is about to begin. In reality, there are very few unbeamraceable refresh rates – even beamracing 60fps onto 75Hz is simply beamracing cherrypicked refresh cycles (it’ll still stutter like 60fps@75Hz VSYNC ON though).
Advanced Note about VRR Beam Racing: Before beam racing variable refresh rate modes (e.g. enabling GSYNC or FreeSync and then beamracing that), wait until you’ve mastered all the above. So for now, disable VRR when implementing frameslice beamracing for the first time. Add VRR compatibility as a last step once you’ve gotten everything else working reasonably well. It’s easy to do once you understand it, but the concept of VRR beamracing is a bit tricky to grasp at first. VRR+VSYNC OFF supports beamracing on VRR refresh cycles. The main consideration is that the first Present() begins the manually-triggered refresh cycle (.InVBlank becomes false and ScanLine starts incrementing), and you can then frameslice beamrace that normally, like an individual fixed-Hz refresh cycle. Now, one additional very special, unusual consideration is the uncontrolled VRR repeat-refresh. You will need to do emergency catchup beamraces on VRR displays if a display decides to do an uncommanded refresh cycle (e.g. when a display+GPU decides to do a repeat-refresh cycle). These uncommanded refresh cycles occur automatically when the framerate goes below the VRR range (e.g. under 30fps on a 30Hz-144Hz VRR display). Most VRR displays will repeat-refresh automatically until they have fully displayed an untorn refresh cycle. If this happens and you’ve already begun emulating a new emulator refresh cycle, you have to immediately start your beamrace early (rather than at the wanted precise time). So if you do a frameslice beamrace of a VRR refresh cycle, the GPU will automatically and immediately send a repeat-refresh to the display. There might be an API call to suppress this behavior, but we haven’t found one. This behavior is unwanted, and it makes beamracing 60fps onto a 75Hz FreeSync display difficult to do stutter-free.
But it works fine for 144Hz VRR displays - we find it’s easy to be stutterfree when the VRR max is at least twice the emulator Hz, since we don’t care about those automatic-repeat-refresh cycles that aren’t colliding with the timing of the next beamrace.
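
The refresh-rate notes above can be sketched roughly as follows. This is only an illustration: the selection rule (beam race the first display refresh cycle that starts at or after the moment each emulated frame is due) is an assumption made for this example, and both function names are hypothetical.

```python
from fractions import Fraction
from math import ceil
from typing import List


def beamraced_cycles(emu_hz: int, display_hz: int, emu_frames: int) -> List[int]:
    """Indices of the display refresh cycles that get beam raced.

    Assumed rule: emulated frame i is due at i / emu_hz seconds, and we pick
    the first display refresh cycle starting at or after that time. All other
    cycles simply keep showing the already completed frame.
    """
    if display_hz < emu_hz:
        raise ValueError("display refresh rate must be >= emulator refresh rate")
    ratio = Fraction(display_hz, emu_hz)  # exact, avoids float rounding
    return [ceil(i * ratio) for i in range(emu_frames)]


def vrr_repeat_refresh_is_harmless(vrr_max_hz: int, emu_hz: int) -> bool:
    """Rule of thumb from the note above: VRR max at least twice the emulator Hz."""
    return vrr_max_hz >= 2 * emu_hz
```

For 60fps on 120Hz this picks cycles 0, 2, 4, 6 (every other refresh). For 60fps on 75Hz it picks 0, 2, 3, 4, 5, 7, and the uneven gaps are the stutter mentioned above.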

Is this a sufficient QuickStart for rapidly getting started with RetroArch frameslice beamracing?

At the very least, I hope the 2 hours I spent writing this post for you can help you achieve experimental 60Hz beamracing within 2 or 3 days of programming.

(Details may take longer, e.g. debugging VRR beamrace support – but 60Hz frameslice beamracing is typically “easier-than-expected” to add, according to two other emulator authors – Toni Wilen of WinUAE told me that.)

I’ll be able to provide more examples, suggestions, and snippets of source code (without violating demo rules – and besides, this way is probably faster and more C/C++ useful anyway) here, or if you prefer email, contact me at [email protected] … I got some C/C++ test code from Jerry of Duckware, the inventor of vsynctester.com, that has a working example of D3DKMTGetScanLine() in .cpp modules, if you’re still having difficulty with the instructions above.

Moving Forward

Before utilizing any existing code (e.g. WinUAE or Tearline Jedi or anything else), I think the first priority is to blueprint it out and decide how to extend the RetroArch API. I propose adding retro_set_raster_poll as described… what do you think? Something with the least pain to add. The raster poll technique is probably a move we have to make regardless.

The hooking technique will have a huge impact on how we decide to frameslice-beamrace, and how flexible it can be made.
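
To make the hooking idea concrete, here is a toy model of a raster poll callback, written in Python purely as a sketch. Everything in it is hypothetical: `ToyCore`, `ToyFrontend` and `set_raster_poll` are stand-ins invented for this example, not the proposed retro_set_raster_poll signature or any existing libretro API. A real frontend would also wait for the real raster to reach the chosen beamrace margin before presenting, which is left out here.

```python
from typing import Callable, List, Optional


class ToyCore:
    """Stand-in for an emulator core that renders one scanline at a time."""

    def __init__(self, height: int) -> None:
        self.height = height
        self.framebuffer: List[int] = [0] * height
        self._raster_poll: Optional[Callable[[int], None]] = None

    def set_raster_poll(self, callback: Callable[[int], None]) -> None:
        # Hypothetical registration call, loosely modelled on the proposal.
        self._raster_poll = callback

    def run_frame(self, frame_no: int) -> None:
        for line in range(self.height):
            self.framebuffer[line] = frame_no  # "render" this scanline
            if self._raster_poll is not None:
                self._raster_poll(line + 1)  # number of scanlines completed


class ToyFrontend:
    """Stand-in frontend that presents whenever a frameslice completes."""

    def __init__(self, height: int, frameslices: int) -> None:
        self.slice_height = height // frameslices
        self.presents: List[int] = []  # emuraster position at each present

    def on_raster_poll(self, lines_done: int) -> None:
        if lines_done % self.slice_height == 0:
            # A real implementation would busy-wait on the real raster and
            # call Present() here; this sketch only records the event.
            self.presents.append(lines_done)
```

Usage: create a `ToyCore(240)` and a `ToyFrontend(240, 4)`, register `frontend.on_raster_poll` through `set_raster_poll`, and run one frame; the frontend then records a present after every 60 emulated scanlines. The point of the design is that the core never needs to know whether the frontend is frameslicing, writing to a front buffer, or ignoring the poll entirely.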

Did my post help? Need some code examples by email?

battaglia01 · 27 June 2018 01:47

Hello - first time here, quick read through this and I see you’ve done some really good work.

My main thought from reading just a little here is that a lot of the latency seems to be tied up, in some sense, in things dependent on the frame rate.

That is, suppose you slow the emulator to half speed: some of the latency sources will remain unaffected (USB polling, kernel overhead, etc) whereas other sources will, surprisingly, double in latency (libretro slowdown).

A pretty easy way to see this is to use the aforementioned “frame advance” technique, where it can be seen that some of the latency sources will take 2-4 frames – no matter how slow the frames are.

People seem to be taking this for granted, but if you think about it for a second, this behavior is really strange and bizarre:

Running the emulator at a lower framerate consumes less computational resources, yet leads to more latency.
Running the emulator at a higher framerate consumes more computational resources, yet leads to less latency – until you reach the max speed your hardware can handle.

This is a great way to “blow up” some of the possible explanations proposed for input latency in the past. If your system is able to tolerate running the emulator at a frame rate that is even a little faster than usual – yet this net increase in computational resources somehow brings total latency down – then that says something profound about what is and isn’t really causing the latency.
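
A minimal sketch of the model implied here, assuming latency splits into a fixed part plus a part that is a constant number of frames. The function name and the numbers in the usage note are illustrative assumptions, not measurements.

```python
def total_latency_ms(fixed_ms: float, frames_of_lag: float, fps: float) -> float:
    """Fixed latency (USB polling, kernel overhead, ...) plus frame-bound latency.

    The frame-bound part is a constant number of frames, so its duration in
    milliseconds scales with the frame time (1000 / fps).
    """
    return fixed_ms + frames_of_lag * 1000.0 / fps
```

With a made-up 5 ms of fixed latency and 3 frames of frame-bound lag, this gives 55 ms at 60 fps, 105 ms at 30 fps and 30 ms at 120 fps: the slower the emulator runs, the more latency it has, even though it is doing less work per second.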

I have some more thoughts in this regard, but I’m curious what thoughts other people have about this, as this is really surprising to me.

battaglia01 · 27 June 2018 05:52

Also, I hadn’t seen the beamracing post above before making mine - very cool.

@mdrejhon: so if I understand correctly, rather than presenting the framebuffer all at once, you separate it into chunks that you present in a series of smaller flushes, right?

I’m trying to understand how this doesn’t lead to tearing. If we have 10 frame slices, is the thinking that the real raster is then updated 600 times per second total? It is clear that there will be intermediate states where the raster will be half-displaying the past frame and half-displaying the current one, but is the thinking that since we’re changing things so fast, that assuming no jitter, this will not perceptually manifest as tearing?

It’s pretty simple for me to see how this would reduce total latency by something like ~0.5 frame: the top scanline no longer has to wait until the entire buffer is finished to present (~1 frame difference), whereas the bottom scanline is the same (~0 frame difference), so on average you get a 0.5 frame difference.
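
The ~0.5 frame estimate can be checked with a simplified model, sketched below. It assumes each row becomes presentable when its frameslice finishes rather than when the whole frame finishes, and it ignores the beamrace margin and scanout time; the function name is made up for this example.

```python
from fractions import Fraction


def average_saving_frames(frameslices: int) -> Fraction:
    """Average latency saved per row, in frames, versus presenting whole frames.

    A row in slice s (0 = top) is presented when slice s completes, i.e. after
    (s + 1) / n of the frame, instead of after the full frame. All slices have
    the same number of rows, so averaging over slices equals averaging over rows.
    """
    n = frameslices
    savings = [Fraction(n - (s + 1), n) for s in range(n)]
    return sum(savings) / n
```

Under this model 10 frameslices save 9/20 of a frame on average (0.45), a single slice saves nothing, and the saving approaches 0.5 frame as the slice count grows.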

My second thought is that you would get a further reduction if some main part of the framebuffer updating routine is currently done synchronously (is it?). This would mean that all sorts of stuff is blocked until a complete framebuffer is filled (like polling the input). Having 10x more partial framebuffer calls would let you do things like poll the input 10x per frame rather than just 1x, which would give you, more or less, another half a frame of latency reduction. Coupled with the above, you get a total latency reduction of roughly 1 frame less than doing the entire framebuffer synchronously. Is that the thinking?
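
The input-polling part of that estimate can be sketched the same way. This assumes evenly spaced polls and an input event arriving at a uniformly random time, so the average wait until the next poll is half the poll interval; the function names are invented for this example.

```python
from fractions import Fraction


def average_poll_wait_frames(polls_per_frame: int) -> Fraction:
    """Average wait, in frames, from an input event to the next poll.

    Assumption: polls are evenly spaced and the event time is uniformly
    distributed, so the expected wait is half of one poll interval.
    """
    return Fraction(1, 2 * polls_per_frame)


def presents_per_second(frameslices: int, refresh_hz: int) -> int:
    """Number of partial presents per second when every frameslice is presented."""
    return frameslices * refresh_hz
```

Polling once per frame gives an average wait of 1/2 frame and polling 10 times gives 1/20, so the reduction is 0.45 frame, close to the "another half a frame" figure. Ten frameslices at 60Hz also works out to the 600 presents per second asked about earlier.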