The bounty was only posted less than a week ago; it takes time to get takers! (Sometimes a few months.)
It’s better to think of it in terms of the horizontal scan rate:
There’s no difference for the first scanline right after Present(). VSYNC OFF is essentially lagless for the first scanline output right underneath the tearline in the graphics output.
Tests confirm that Present()-to-photons on a CRT is almost non-measurable for the first scanline during VSYNC OFF: API to light hitting eyeballs. Just like the original console’s original beam racing! After all, I am the founder of Blur Busters and inventor of TestUFO, so I understand input lag!
Scanlines are transmitted out of a graphics output (including DisplayPort, DVI, or HDMI) at time intervals matching the current horizontal scan rate (e.g. 67.5 kHz for a 1080p 60 Hz HDMI signal).
So, the higher the Present() rate (frameslicing), the more closely the lag matches the original console or machine. With a tightly optimized jitter margin, the maximum lag averages one frameslice’s worth (the time interval between two Present() or glutSwapBuffers() calls during VSYNC OFF).
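To make the scaling concrete, here is a quick back-of-envelope sketch (the numbers are illustrative examples from this post, not taken from any specific emulator implementation):

```python
# Worst-case added latency of frameslice beam racing is roughly the
# interval between two consecutive Present() calls during VSYNC OFF.
def max_frameslice_lag_ms(frameslices_per_second: float) -> float:
    return 1000.0 / frameslices_per_second

# A coarse config of 4 slices per refresh at 60 Hz = 240 frameslices/sec:
print(max_frameslice_lag_ms(240))   # ~4.17 ms, still well under a 16.7 ms refresh

# A fine-grained 8000 frameslices/sec config (Kefrens Bars rasterdemo rate):
print(max_frameslice_lag_ms(8000))  # 0.125 ms
```

The same one-line formula covers the whole range from Raspberry Pi-class coarse slicing to high-end fine slicing.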
Yes, that said, it can scale up/down simply via a configurable frameslice count (centrally).
Depending on how the frameslice coarseness is configured, I’ve seen it add anywhere from only 1% extra GPU load (e.g. no shaders/filters/HLSL, low frameslice count, power-efficient on a powerful GPU) all the way to maxing out the GPU at 100% (power-hungry 8000 frameslices/second). It’s surprisingly flexible how much GPU power you want to use up.
Less powerful GPUs like Raspberry Pi/Android might go for 4 frameslices, while GTX Titans can approach 10,000 frameslices per second (the Kefrens Bars rasterdemo uses 8000 frameslices/second). Even 4 frameslices still achieves sub-refresh-cycle latency.
NTSC scanrate is approximately 15,734 scanlines per second, so if we’re presenting at about 1,500 frameslices per second, we’ll have a max input lag of approximately 10 NTSC scanlines’ worth (about 10/15734ths of a second).
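Expressed as code, the scanline-equivalence math is just a ratio (using the standard NTSC-M horizontal scan rate of ~15,734 Hz):

```python
NTSC_SCANRATE_HZ = 15734.26  # NTSC-M horizontal scan rate (lines per second)

def lag_in_scanlines(frameslices_per_second: float) -> float:
    # Worst-case lag is one frameslice interval; express it as how many
    # NTSC scanlines scan out of the graphics output during that interval.
    return NTSC_SCANRATE_HZ / frameslices_per_second

print(round(lag_in_scanlines(1500)))  # ~10 scanlines of worst-case lag
```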
When enabling frameslice beam racing simultaneously with RetroArch Native CRT Support, we can replicate original arcade machines’ input lag pretty closely (no lag advantage or disadvantage) for proper original lagfeel, with no surge-execution distortions for mid-refresh-cycle input reads. A game can essentially stream scanlines (a frameslice’s worth) out of the graphics output while emulating at 1:1 speed. Just like the original machine. This is essentially what beam-raced frameslicing does. Frameslices can be 1 pixel row tall, one full screen height’s worth, or any height in between (e.g. 1/4 screen height).
Now, if we have faster GPUs that can output single-scanline frameslices (~15,734 frameslices per second, matching NTSC scanrate), we can pretty much hit the console’s original latency (within one scanline of lag, anyway). Excluding any signal-tech differences (e.g. digital-to-analog conversion latency, but that can be sub-millisecond).
The beauty is we don’t have to have one-scanline-tall frameslices; we can go coarse multi-scanline frameslices instead. Even frameslices 1/4th of a screen height work fine on slower 8-to-10-year-old GPUs and are very doable on mobile GPUs. And the timing of the frameslices can vary safely, as long as the frameslices fit between the realraster (above) and emuraster (below), producing a tearingless VSYNC OFF mode (a lagless VSYNC ON look).
It scales down (slower GPUs with coarse frameslices) and scales up (faster GPUs with fine frameslices, potentially as small as 1-scanline frameslices), and input lag can actually approach exactly that of the original machine for all possible input reads (mid-screen and mid-raster too), to sub-millisecond identicalness for analog outputs. Whenever and with whatever timing any input reads occurred relative to VBI, it’s preserved.
(Note: A “frameslice” is multiple scanlines (rows of pixels) between two metaphorical tearlines, but they are invisible in the jittermargin technique; for more info, see the GitHub entry.)
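As a rough illustration of the jittermargin idea, here is a toy simulation (hypothetical names, a sketch of the concept only, not RetroArch or WinUAE code): the emulated raster must stay ahead of (below) the real raster, and a frameslice is safe to present as long as the real raster trails it within the margin.

```python
# Toy simulation of beam-raced frameslicing with a jitter margin:
# present a frameslice only while the emulated raster leads the real
# raster, so the (metaphorical) tearline always lands in already-
# identical pixels and stays invisible.
REFRESH_HZ = 60
TOTAL_SCANLINES = 525           # NTSC total lines per refresh cycle
SLICES_PER_REFRESH = 10         # coarseness: 10 frameslices per refresh
SLICE_HEIGHT = TOTAL_SCANLINES // SLICES_PER_REFRESH

def real_raster(t_seconds: float) -> int:
    """Scanline the display output is currently scanning (0-based)."""
    frame_pos = (t_seconds * REFRESH_HZ) % 1.0
    return int(frame_pos * TOTAL_SCANLINES)

def safe_to_present(emu_raster: int, t_seconds: float,
                    margin: int = SLICE_HEIGHT) -> bool:
    """True while the emulated raster is below the real raster (ahead of
    the scanout beam) by no more than one jitter margin's worth."""
    return 0 < emu_raster - real_raster(t_seconds) <= margin
```

The real implementation races the actual scanout position (e.g. via a raster-poll API or a time offset from VBLANK), but the safety condition is the same: keep the emulated raster between the real raster and the bottom of the margin.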