An input lag investigation

mdrejhon · 28 June 2018 14:32

The input lag chain is a very complex topic.

To simplify, let’s focus only on the path from emulator through graphics output.

Traditionally, an unoptimized emulator on unoptimized drivers can add lag at each of these stages:

Input read: consider whether the emulator does preemptive real input reads before rendering the emulator frame, or does input reads in realtime while doing the render (simulated raster scanout).

Render emulator frame (varies, up to 1/60sec lag, depending on how intensive it is, and on whether or not the emulator speeds up execution of an individual rendered frame to deliver it sooner. The input read can be early in the emulator frame, versus late in the emulator frame)
Deliver to graphics card (varies, up to 1 frame lag) - Present() blocks until there is room in the frame queue. That’s buffer backpressure lag!
Any frame queues used in the graphics card (varies: 0, 1 or 2 frames of lag). The graphics driver delivers any prerendered frames in sequence as consecutive individual refresh cycles.
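
As a rough worked example of how the stages above add up, here is a minimal sketch. The upper bounds come from the list; the lower bounds of zero and the names `STAGES` and `lag_range_ms` are assumptions made for illustration only.

```python
# Per-stage lag ranges in frames (min, max). Upper bounds are taken from the
# list above; the zero lower bounds are an assumption for illustration.
REFRESH_HZ = 60.0
FRAME_MS = 1000.0 / REFRESH_HZ

STAGES = {
    "emulator render": (0.0, 1.0),
    "Present() backpressure": (0.0, 1.0),
    "driver frame queue": (0.0, 2.0),
}

def lag_range_ms(stages=STAGES, frame_ms=FRAME_MS):
    """Return (best, worst) total lag in milliseconds for the whole chain."""
    best = sum(low for low, _ in stages.values()) * frame_ms
    worst = sum(high for _, high in stages.values()) * frame_ms
    return best, worst

best_ms, worst_ms = lag_range_ms()
print(f"best {best_ms:.1f} ms, worst {worst_ms:.1f} ms")
```

At 60 Hz the worst case is four frames, about 66.7 ms, before the display adds anything.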

You can make it efficient and tight (1 frame lag) but in reality, it can be awful. Some very old Blur Busters input lag tests of Battlefield 4 had over 60ms input lag even on a display that had less than 10ms input lag, and even when a frame rendered in only 15ms. Conversely, CS:GO reliably achieved approximately 20ms or so. Those tests were done almost five years ago. There are huge variances between software, even for VSYNC ON, OFF, GSYNC… Software plays a role in how much lag is added and how the sync workflows are treated (that’s why you hear of various tricks such as “input delaying” to move the input read closer to output and reduce lag.)

Usually maximum prerendered frames is 1, and that is necessary for compatibility with lots of things such as SLI which must multiplex frames from multiple cards into the same frame queue. It also massively improves frame pacing. A few years ago, there was a controversy with the disappearance of the “0” setting in NVInspector for Max Prerendered Frames.

You can use tricks to reduce a lot of lag in this lag chain, but VSYNC ON lag in games can vary humongously between different apps. It’s simply the time interval between the input read (which necessarily occurs before rendering in most 3D games) and the pixel hitting the output jack (the point A to B we are limiting scope to for simplicity).

Beam raced frameslicing does input read, render AND output in essentially realtime. Just like the original machine did. Faithfully. With it, there can be a mere 1 millisecond between an input read and the actual reacted pixels hitting the graphics output. For any game that does continual input reads mid-scanout, the photons of that can actually hit your eyes in subframe time. Like a mid-screen input read for bottom-of-screen pinball flippers.

Twinaphex · 20 February 2020 18:03

Beam raced frameslicing does input read, render AND output in essentially realtime. Just like the original machine did. Faithfully. With it, there can be a mere 1 millisecond between an input read and the actual reacted pixels hitting the graphics output. For any game that does continual input reads mid-scanout, the photons of that can actually hit your eyes in subframe time. Like a mid-screen input read for bottom-of-screen pinball flippers.

You still need to change a lot in every single emulator so that it renders the slices and even polls more often than once per refresh, and is that even accurate? I’m pretty sure it’s not in every single case.

battaglia01 · 28 June 2018 18:47

I’ll respond more to @mdrejhon in a sec, but as a quick response to the above, how much of this would be with the individual cores and how much would be within the libretro API?

battaglia01 · 28 June 2018 23:24

@mdrejhon: thanks for that. As a rough pass I think I get the idea of why VSYNC can sometimes lead to multi-frame delay. I do think I’m continuing to get bogged down a little bit w/ the terminology here though.

Right now, the way I think of input timing latency is as follows:

1. The input is a delta function, and we are trying to figure out the total latency (or group delay) in the “impulse response.”
2. The signal path consists of a string of delay lines, one after the other, each of which adds some time delay to the signal.
3. The total time delay is the sum of the time delays of each component.
4. Rather than all of the delays being set in stone, the delay of each component is a random variable according to some probability distribution. We know the range of values each can take, the probability of each, and the mean and variance. For instance, a 100 Hz USB poll interval is a uniform distribution on 0-10ms, which has a mean of 5ms and a stdev of ~3ms.
5. The components are “approximately independent” of one another, at least given reasonable running conditions. Meaning the USB poll interval position does not correlate with, for instance, the refresh interval position, or whatever. Both are equally random, or if there is any correlation, it’s negligible.
6. Because of #5, the expected value of the total latency is the sum of the expected value of the latency of each component.
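
The model above can be checked with a few lines of arithmetic. This is a minimal sketch, not a measurement: the second component (a 60 Hz refresh phase treated as uniform on 0 to 16.7ms) is an assumed example, and `uniform_stats` and `chain_stats` are illustrative names. One side note: the means add whether or not the components are independent; independence is what lets the variances add.

```python
import math

def uniform_stats(low_ms, high_ms):
    """Mean and standard deviation of a uniform delay on [low_ms, high_ms]."""
    mean = (low_ms + high_ms) / 2.0
    stdev = (high_ms - low_ms) / math.sqrt(12.0)
    return mean, stdev

def chain_stats(components):
    """Total mean and stdev for a chain of independent uniform delays."""
    stats = [uniform_stats(low, high) for low, high in components]
    total_mean = sum(mean for mean, _ in stats)
    # Variances (not stdevs) add, and only if the components are independent.
    total_stdev = math.sqrt(sum(stdev ** 2 for _, stdev in stats))
    return total_mean, total_stdev

# 100 Hz USB poll from the post: uniform on 0-10 ms.
usb_mean, usb_stdev = uniform_stats(0.0, 10.0)

# Assumed example chain: the USB poll plus a 60 Hz refresh phase.
total_mean, total_stdev = chain_stats([(0.0, 10.0), (0.0, 1000.0 / 60.0)])

print(f"USB poll: mean {usb_mean:.2f} ms, stdev {usb_stdev:.2f} ms")
print(f"chain:    mean {total_mean:.2f} ms, stdev {total_stdev:.2f} ms")
```

The USB figure comes out at a mean of 5ms and a stdev of about 2.9ms, which matches the "~3ms" quoted in the list.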

#5 is probably where the gray area is. Some components might correlate with one another… sometimes… only under certain fundamental conditions… and it’s hard to tell where. That seems to be the basic problem.

If we have two components that tend to correlate significantly, so that a better latency on one suggests a higher or lower probability of a better latency on the other, then we can chunk them into a single component. Ultimately we can always arrive at some chunks of components that do not correlate with one another in any significant way, given at least reasonably normal working conditions.

Let’s just start with that, if that makes sense.

mdrejhon · 29 June 2018 04:11

Actually, for at least one module, it’s easier than you think.

WinUAE told me it was a quick modification to add basic 60 Hz support.

This is because for “raster accurate” emulation modules (e.g. Nintendo and Super Nintendo emulation):

The emulator module is already plotting one scan line at a time into an offscreen buffer. It’s already happening with the NES module.
The emulator module is already (usually) doing real time input reads while plotting scan lines. It’s already happening with the NES module.
The new raster poll API simply lets the centralized beamracer “peek” at the ALREADY EXISTING module’s ALREADY EXISTING OFFSCREEN FRAMEBUFFER, and grab a frameslice from it.

The centralized code will do the peeking, and the centralized code will do the grabbing of the frameslice itself. The raster poll is simply giving the central code opportunities to do early-peeks of the emulator’s existing offscreen framebuffer, every time a new emulator scan line is written to it.
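
The division of labour described above can be sketched as follows. This is only an illustration of the idea: `BeamRacer`, `raster_poll`, `present_slice` and `emulate_frame` are hypothetical names, not part of the libretro API, and the sketch assumes the scanline count divides evenly into frameslices.

```python
class BeamRacer:
    """Central code: peeks at the core's existing offscreen framebuffer."""

    def __init__(self, total_scanlines, frameslices_per_frame, present_slice):
        # Assumes total_scanlines is a multiple of frameslices_per_frame.
        self.slice_height = total_scanlines // frameslices_per_frame
        self.present_slice = present_slice

    def raster_poll(self, framebuffer, scanline):
        """Called by the core after it finishes plotting `scanline` (0-based)."""
        lines_done = scanline + 1
        if lines_done % self.slice_height == 0:
            first = lines_done - self.slice_height
            # The central code grabs the slice; the core never sees this step.
            self.present_slice(first, framebuffer[first:lines_done])

def emulate_frame(framebuffer, racer):
    """Stand-in for a line-exact core that already plots one row at a time."""
    for y in range(len(framebuffer)):
        framebuffer[y] = f"row {y}"        # the core's existing scanline plot
        racer.raster_poll(framebuffer, y)  # the only line added to the core

presented = []
racer = BeamRacer(240, 10, lambda first, rows: presented.append((first, len(rows))))
emulate_frame([None] * 240, racer)
print(presented[:3])
```

The core only gains the single `raster_poll` call per scanline; deciding when a frameslice is complete and what to do with it stays in the central code.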

For at least one of the easiest Retroarch modules, it looks like only a 10 line modification.

All the complexity is centralized (probably ~1000 lines of code, 3-4 days of programming work). Please re-read my proposal. That’s where the work is cut out for RetroArch.

More difficult cores will take a lot more time, but once the core libretro is made beamrace compatible, then the beamrace support can be added to only one module at a time. And from what I’ve seen, the easiest module will only need a simple hook (10 lines) to turn it into a successful beamracer.

Emulator authors – over the last two decades – have done an amazing job refining realtime beamracing on the emulator offscreen buffer already. So it’s not much work left to glue the remaining step. So to the authors of the “easy” modules, thank you so much for making it so easy to beamrace for real!

It’s the last piece of the puzzle that most emulator programmers do not understand; the “black box” between Present() and photons – but people like me do. That 1% is complex to understand and this is why I am writing big posts to explain that 1% needed to finish the “full beamrace chain”.

And, even if it’s easier than expected with some modules (NES)…

…It will also be more difficult than expected with other modules (who knows which ones). It depends on how much of the beamracing chain they’ve already completed.

The fact is that both extremes exist.

The beauty is that once it’s implemented in libretro, it can be implemented one module at a time, one by one, beginning with the easiest modules – taking our merry time.

Once the easy module is done, it gives everyone the “aha” moment, and makes some people understand frameslice beam racing much better. (The remaining 1% step needed to finally pull an emulator’s existing internal beamracing out to the real world display).

For some modules, over 99% of the beamracing work is already done. 20 years of beamracing development has done that already, but never beyond the Present() API.

The major complexity will be making libretro compatible. If there’s a lot of layering (e.g. lack of a VSYNC OFF mode, and a lot of black box layers), it has to be refactored somewhat. Basically, VSYNC OFF support needs to be added to LibRetro, in order for frameslice beamracing to work. It might or might not be a royal headache.

But on the NES module side (at least), it’s quite minor changes there for that particular module since it already does internal beamraced input reads and internal beamraced line-plots into its internal framebuffer.

For the time split between “the core, and the easiest module” – I guesstimate over 95% of programming time will be focussed on the centralized code, and 5% of the time spent on the easiest module. Once done, the bridges can be crossed for the remainder of the modules.

The hardest module might need lots of code – and/or rewriting – to be compatible, but the easiest modules will essentially only need 10 lines of modifications.

The already cycle-exact and raster-exact modules will obviously be the easiest modules, especially if they’re simply (as the NES module is) already rasterplotting one line at a time internally already to an internal frame buffer. Those types will be easy to do frameslice beamracing.

The emulator modules don’t even need to know what the heck a frameslice is, if one re-reads my proposal.

In my proposal, all the emulator module is doing is letting the core code (centralized raster poll code) do early-peeks at the existing offscreen beamraced buffer that most 8-bit and 16-bit emulators already maintain, in order to be compatible with retro-era raster interrupts.

mdrejhon · 29 June 2018 04:33

I’m going to cover this from the opposite side first.

This is an easier argument for me to make, because there are fewer variables. It makes the latency math simpler.

I’ve got a 1000fps high speed camera. With a test program, I’ve successfully got API to photons in just 3 milliseconds on my fastest LCD display. That is Present() to photons hitting the camera sensor. That includes LCD GtG. That includes DVI/DisplayPort latency. That includes monitor processing latency. I’m able to get this for top edge, center, and bottom edge of screen.

I can probably get <1ms with a CRT and an older graphics card with a direct adaptorless VGA output.

But let’s simplify. So, now we already know the baseline absolute-best proof from my high speed camera, and it is proven “realtime” API to photons, by all practical raster extent.

I’ve also done brief tests that showed 4ms-5ms from mouse click to photons, for some extreme blank-screen VSYNC OFF tests. Now, we know that a display listed on DisplayLag.com can have a latency difference between top/center/bottom (e.g. 2ms, 9ms, 17ms). Obviously for simplicity, most sites only report average latency (VBI to screen middle), which is often half a refresh cycle. That is why you never see numbers less than 8.3ms on sites such as DisplayLag.com for a 60Hz display (1/2 of 1/60sec = 8.33333ms). That’s simply a stopwatch from VBI-to-raster. With beamracing, the lag is vertically uniform (e.g. 2ms, 2ms, 2ms TOP/CENTER/BOTTOM) for Present-to-Photons during VSYNC OFF frameslice beamracing. (There are micro lag gradients within frameslices – caused by the granularity of a frameslice versus the one pixel row at a time scanout of graphics output – but that can be filtered by the tape loop metaphor for consistent unvarying subframe emulator pixel to photons time, as a fixed screen height difference between emu raster and real raster – all easily centralizable inside the central code of the raster poll API, it’s just simply a busysleep on RDTSC or QueryPerformanceCounter)
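
To make the two ideas in that paragraph concrete, here is a small sketch of the mid-screen stopwatch arithmetic and of a busy-sleep on a high-resolution counter. It uses Python's `time.perf_counter` as a stand-in for RDTSC or QueryPerformanceCounter; `raster_time_offset` and `busy_wait_until` are illustrative names, and treating the refresh as evenly divided across 262 scanlines (blanking included) is a simplifying assumption.

```python
import time

def raster_time_offset(scanline, total_scanlines, refresh_hz):
    """Seconds after the start of the refresh at which `scanline` is scanned out,
    assuming the refresh period is evenly divided across all scanlines."""
    return (scanline / total_scanlines) / refresh_hz

def busy_wait_until(target_s):
    """Spin on the high-resolution counter until `target_s` is reached.
    Burns CPU, but avoids the coarse wake-up granularity of a normal sleep."""
    while time.perf_counter() < target_s:
        pass

# Mid-screen on a 60 Hz display: half a refresh cycle after the VBI.
mid_screen_ms = 1000.0 * raster_time_offset(131, 262, 60.0)

# Hold the emulator raster a fixed 2 ms behind "now" as a stand-in for a margin.
start = time.perf_counter()
busy_wait_until(start + 0.002)
elapsed_ms = 1000.0 * (time.perf_counter() - start)

print(f"mid-screen offset {mid_screen_ms:.2f} ms, waited {elapsed_ms:.2f} ms")
```

The mid-screen offset is the 8.33ms figure from the paragraph above; the busy-wait is how a fixed gap between emu raster and real raster could be held in the central code.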

Briefly going offtopic, but currently the least-laggy LCDs via DisplayPort/DVI tend to be ~3ms for API-to-photons if we're focussing on minimum lag (top lag, or beamraced pixel-for-pixel lag) -- subtract about 8.3ms from the number you see on DisplayLag.com and that's your attainable beamraced input lag. Certainly there are less laggy displays and more laggy displays, but some LCDs are almost as fast as CRTs in response (e.g. just digital transceiver lag & GtG lag, with a few scanlines of buffered micropacket lag). Although we're not worried about panel scanout, the fact is some of them have realtime synchronous cable-to-panel scanout abilities (also see www.blurbusters.com/lightboost/video for an older example high speed video of how an LCD scans out -- and how some blur-reduction strobe backlights work (LightBoost, ULMB)). In non-strobed operation, it's essentially a fast-moving GtG fade zone chasing behind the currently-being-refreshed pixel rows, being refreshed practically on the fly directly from the cable (with only line-buffer processing for overdrive -- unlike old LCDs that often fully framebuffered the refresh cycle first).

So all we’re worried about is increases to latency to this absolute-best baseline.

My graphics card can do up to 8000 frameslices per second (Kefrens Bars demo).

Mouse poll 1000Hz adds an average of 0.5ms latency (the midpoint average of 0ms…1ms latency). There is some experimentation being done with 2000Hz mice and overclocked 8000Hz mice, so it’s theoretically possible to get lower – USB latency of 0.125ms has been successfully achieved with mouse overclocking.

Emulator frameslice granularity at 2400 frameslices per second (40 frameslices per refresh cycle, HLSL filters disabled, GTX 1080 Ti extreme case) with a 1.5 frameslice average beamrace margin = 1.5/2400sec latency = 0.6ms latency.

So mouse poll latency 0.5ms and beamrace margin latency 0.6ms = 1.1ms lag for input-read-to-pixel-transmitting-on-wire. It could be even less obviously given my computer’s performance. But this is already incredibly small.

While doable on i7’s with powerful GPUs, it is overkill for many. A lot are happy with 10-frameslice beamracing (600 frameslices per second). 2 frameslice lag equals 2/600sec or 1/300sec or 3.333ms for the common 10 frameslice WinUAE setting (I’m excluding the 0.5ms from 1000Hz USB input poll, obviously).

It continues to scale down to more lenient margins, like 4-frameslice or 6-frameslice beamracing on slower platforms (e.g. PI, Android, etc). There’s more lag for that, but still subframe lag compared to any other possible non-RunAhead lag reduction approaches. 4 frameslices with 1.5 slice beamrace margin is still only ~6ms lag – incredibly low for a Raspberry PI, and I suspect 10 frameslices are doable on the newer mobile GPUs. At 600 frameslices per second, that gets real close to more exactly reproducing faithful-original latencies.
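
The latency figures in the last few paragraphs all follow from the same two formulas. Here is a minimal sketch that reproduces them; `beamrace_margin_ms` and `poll_average_ms` are names chosen for illustration, and a 60 Hz refresh is assumed throughout.

```python
def beamrace_margin_ms(slices_per_refresh, margin_slices, refresh_hz=60.0):
    """Latency added by keeping the real raster `margin_slices` behind the emulator."""
    slices_per_second = slices_per_refresh * refresh_hz
    return 1000.0 * margin_slices / slices_per_second

def poll_average_ms(poll_hz):
    """Average wait for the next input poll: half of one poll interval."""
    return 1000.0 / poll_hz / 2.0

fast = beamrace_margin_ms(40, 1.5)      # 2400 slices/sec -> 0.625 ms
winuae = beamrace_margin_ms(10, 2.0)    # 600 slices/sec  -> 3.333 ms
slow = beamrace_margin_ms(4, 1.5)       # 240 slices/sec  -> 6.25 ms
mouse = poll_average_ms(1000)           # 0.5 ms

print(f"40 slices, 1.5 margin: {fast:.3f} ms (+ {mouse} ms poll = {fast + mouse:.3f} ms)")
print(f"10 slices, 2.0 margin: {winuae:.3f} ms")
print(f" 4 slices, 1.5 margin: {slow:.3f} ms")
```

The unrounded 40-frameslice total is 1.125ms, which the posts above round to 0.6ms + 0.5ms = 1.1ms.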

battaglia01 · 30 June 2018 00:56

@mdrejhon: Very cool. I can kind of see how the beam racing thing covers a lot of the different sources of latency inherent in the signal.

I’m still confused on the basic phenomenon of how this doesn’t lead to tearing. Suppose, for instance, you’re playing some game, and you have half of the frame rendered, and it just so happens that half of the character sprite is rendered. Then, midway through the frame, the user pushes some input button, causing the character to move. So input is processed mid-frame.

Do we process the input here, so that the next half of the frame is rendered in accordance with the character moving? And so on, even if they push another button? Basically, do we process multiple sequential inputs within the same frame, even if it’s already been rendered?

If you do, then the thing is, the sprite and position have now changed. If you totally change gears and start rendering the next frame, you would get, I think, a tearline, or at least a visual mess.

On the other hand, if you do just continue to process the current frame the same way as before, then at least under the hood, the input poll routine can get a bit of a head start on things for the next frame. So then your multiple combinations of buttons are partly processed during the current frame, but all get lumped together in the next frame.

It seems something like that, as I think this through. Partly the issue might be that there’s some subtlety in the way Present() works that I don’t quite get.

mdrejhon · 1 July 2018 20:23

That’s the thing…

There’s no such thing as “half a frame rendered”.

Many emulators – for NES, SNES, Commodore 64, Apple, several 80s/90s-era MAME modules – render only one pixel row at a time (line-exact emulators) or a single pixel at a time (cycle-exact emulators).

Sure, the original 8-bit software may have filled half a frame buffer, but the emulator is simulating a virtual equivalent of a CRT electron gun already! (Some of them at single-pixel cycle-exact granularity; others at scanline-at-a-time granularity).

So the emulator is already serializing one pixel row at a time. Plotting them to the existing offscreen buffer.

For an NTSC CRT, that is equivalent to roughly 15,734 scanlines per second (Horizontal Scan Rate ≈ 15.7 kHz; 15,625 is the PAL figure), so that’s one new scanline plotted every ~1/15734th of a second. So the offscreen framebuffer is getting a new pixel row on average every 1/15734th of a second. Some emulators execute synchronously, by putting a busyloop where needed to scanline-pace it – other emulators will surge-execute 1/60th sec worth of emulation (faster than the original machine) to deliver a full framebufferful in traditional PC-based “full frame buffer at a time” workflows. Regardless, the input reads in the original game code vary from game to game: some games always do input reads at the beginning of a refresh cycle, some at the end, and some do input reads mid-screen. But whatever the original game did, it gets preserved when synchronizing emuraster with realraster.
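
As a cross-check on the scanline rate, the horizontal scan rate is just lines per frame times frames per second. The sketch below assumes the standard broadcast figures (525 lines at 30000/1001 frames per second for NTSC color, 625 lines at 25 frames per second for PAL), which are not stated in the post; `line_rate_hz` is an illustrative name.

```python
from fractions import Fraction

def line_rate_hz(lines_per_frame, frames_per_second):
    """Horizontal scan rate: scanlines per frame times frames per second."""
    return lines_per_frame * frames_per_second

# Assumed broadcast standards (interlaced frame rates, not field rates).
ntsc = line_rate_hz(525, Fraction(30000, 1001))
pal = line_rate_hz(625, Fraction(25))

for name, rate in (("NTSC", ntsc), ("PAL", pal)):
    period_us = 1_000_000 / float(rate)
    print(f"{name}: {float(rate):.2f} lines/sec, one scanline every {period_us:.2f} us")
```

Under these assumptions 15,625 lines per second is the PAL rate, while NTSC color works out to roughly 15,734. Either way the emulator plots a new row about every 64 microseconds.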

That’s because they have to preserve original beam-race behaviours like raster interrupts. That’s why those particular modules tend to be very beamrace-friendly to the real-world.

Because they’re already doing that, 99% of the work is already essentially done.

The large amount of writing I do in this thread is doing the remaining 1% of the work synchronizing the emulator raster (line-at-a-time) to the real-world raster – which is something that most people don’t realize is now already possible. But, it is indeed a somewhat complex-to-grasp 1% that requires good understanding of the way things used to be done originally.

Does the above help explain why there is no tearing?

battaglia01 · 3 July 2018 17:15

@mdrejhon: ok, yes, it does. Sorry for the delay in responding, got busy for a bit here…

I now understand why there is no tearing. So the original game was in charge of protecting against that while doing the beamracing. So, we assume there’s something like that going on.

Doing beamracing means we get as close to the original game as possible. Since the original game is doing some kind of beamrace mid-frame tearing protection, the game itself will add a tiny bit of lag between when you press a button and when it appears on the TV, because it needs to not mess up the current frame.

If we then run the above game in an emulator, then not only do we have the original game’s input latency, but we now also have the additional VSync latency, where things are delayed by yet another frame (or two). So it all adds up to increased cumulative latency.

It seems like there are a few case splits here in the way that original games deal with input latency - Yoshi’s Island seems to be particularly slow, for instance, whereas other games might be a lot faster. But, there are lots of ways to split the latency chain up into modules, and going with beamracing affects all equally, so that’s why this is good to implement. Right?

This is a weird way to put it, but I’m trying to make sure I’m balancing the latency “checkbook” correctly. Does the above make sense?

mdrejhon · 4 July 2018 18:05

“beamrace mid-frame tearing protection” should be phrased as “beamrace by design”.

Remember… Atari 2600 had no framebuffer memory at all. Zero, none, nada, zilch! They had to buffer a single scanline at a time – generate new graphics in realtime, every scanline.

You see, the only way to do graphics on an Atari 2600 in the 1970s-1980s was beamracing out of necessity.

So the Atari 2600 essentially beamraced several thousand simple “linebuffers” a second. Doing that on a puny 1 MHz CPU is no less than a miraculous programming feat. Raster feats continued for a couple decades afterwards, at least for other graphics special-effects like scroll zones or sprite multiplication.

Later on, even when games gained framebuffers, some special effects (e.g. 16 sprites instead of 8, or a stationary scorebar below a scrolling zone) also required beamracing out of necessity. Basically they intentionally injected a dividing line between two different framebuffers (if you must use the word “tearline” terminology, yes, that’s essentially an ancestor to a modern tearline).

Emulators had to preserve whatever beamracing antics that the original machines did.

Amidst all of this, while not all used lagless input mechanisms, some were essentially sub-scanline lagless – input reads of a joystick controller port had virtually no latency. It was typically just a mechanical joystick, where mechanical switches completed circuits directly on separate pins of a 9-pin joystick port. Up/Down/Left/Right/Button only required 5 wires plus shared ground. Extra wires could be used for things like extra buttons, etc. Anyway, the moment a joystick button is fully pressed, the circuit is completed right there and then, which directly changes the bits of a byte at a single in-memory address.

The joystick port is often read by a register-read instruction or a PEEK command. In 6502, the machine language instruction could be “LDA $DC00” or “LDA $DC01” (the Commodore 64 joystick registers), which is essentially the assembly language / machine language equivalent of the BASIC command “LET J = PEEK(56320)” – hex DC00 equals decimal 56320. The fast joystick-peeking instruction, which executes in microseconds, may actually be an instruction embedded within raster-realtime generated graphics – or might be a few scanlines before – or might be at the beginning of the refresh cycle (blanking interval) or a few refresh cycles ago (e.g. framebuffering workflows – but remember: a lot of this was the era before framebuffers).

So latencies are sometimes microseconds between joystick and photons (CRTs can illuminate a ‘pixel’ in essentially microseconds). Or they can be tens of milliseconds (one or two refresh cycles) if the original code reads during a different part of the refresh cycle or in the blanking interval, or even many refresh cycles (e.g. early crude 3D flight simulators, like the 1982 Microsoft Flight Simulator running at only 2-3 frames per second with blocky line-drawing graphics).

Regardless, framebufferless workflows sometimes continued to be used on character-buffered platforms (grids of pre-defined graphics used as building blocks) to do things like add extra colors per row or other effects. It varies hugely how an original platform did input reads, but nothing stopped them from doing raster-realtime “input-to-photons behavior” if input reads were done mid-raster. But this is getting offtopic: emulators already (to best effort) preserve this originalness at least to the offscreen framebuffer. So we usually don’t have to do any more work on the input-reading side (even though input read granularity rounds off at polls, e.g. 1ms for 1000Hz polls). The beam raced frameslicing is a modification only to the missing 1% of the emulator “rendering” workflow (subframe raster sync between emu-raster and real-raster) necessary to replicate the original latencies to a much higher accuracy than has ever been achieved before. Regardless, the word “tearing” is a non sequitur for 8-bit programming techniques that did not use frame buffers.
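
To illustrate the joystick register read mentioned above, here is a small sketch of what a program does with the byte it gets back from address $DC00. The active-low bit layout used here (bits 0-4 for up, down, left, right and fire, with 0 meaning pressed) is the commonly documented Commodore 64 layout and is an assumption, since the post does not spell it out; `decode_joystick` is an illustrative name.

```python
JOY_PORT2_ADDR = 0xDC00  # the address behind PEEK(56320)

def decode_joystick(register_value):
    """Decode a joystick register byte, assuming the active-low layout:
    bit 0 = up, 1 = down, 2 = left, 3 = right, 4 = fire (0 means pressed)."""
    names = ("up", "down", "left", "right", "fire")
    return {
        name: ((register_value >> bit) & 1) == 0
        for bit, name in enumerate(names)
    }

# Example byte with bits 0 and 4 pulled low: stick pushed up with fire held.
state = decode_joystick(0b01101110)

print(JOY_PORT2_ADDR)  # 56320
print([name for name, pressed in state.items() if pressed])  # ['up', 'fire']
```

Because the switches change the register bits directly, a read like this returns the state of the stick at that instant, which is why the phase of the read relative to the raster is all that determines the latency.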

For a good newbie’s guide to this, see Wired Magazine’s https://www.wired.com/2009/03/racing-the-beam/ … I highly recommend that “Racing The Beam” book from Amazon. This will help prepare you for a better understanding of rasterwork.

mdrejhon · 4 July 2018 18:07

The two parts of your sentence are (mostly) unrelated to each other. It turns out to be a non sequitur. We’re just simply preserving original input-read phasing (whether it was only 1 microsecond before pixel output – or 1 second before the frame). There can technically be essentially zero lag, if the input read is made raster-realtime mid-scanline (e.g. Atari 2600) by reading the joystick controller register while generating pixels.

It is simply a function of the original game programming, nothing more, nothing less. The writing I do here is simply about bridging the beamracing from the virtual world (offscreen emulator frame buffer) to the real screen. This allows the emulator to replicate original latencies as faithfully as possible, including subtle within-frame and within-scanline time-offsets of input read relative to generating the pixels on the original machine’s original video output.

(Plus a slight amount of additional latency to create the ‘jitter margin’ to soak up computer performance imperfections… but that doesn’t interfere, as the offscreen emulator buffer is still merrily at its original latency – the jittermargin is simply slight extra latency between emuraster and realraster to soak up computer performance issues, allowing continued VSYNC ON-looking perfection in less-than-perfect performance conditions)

If you are a programmer, I suggest you purchase the book to gain a better understanding of the concept of raster programming.

Modern terms do not map cleanly onto this: “tearing” (really, tearing is a modern term more applicable to 3D framebuffers), “we assume” (incorrect: we actually know, so we don’t have to assume. Remember, I have programmed some of these old machines directly. Telling people who have done real rasterwork, whether on Atari TIA, Amiga Copper, C64 raster interrupts, etc., that their actual work is “assumptions” can be slightly offensive to them when, in actuality, they understand the old platforms), and framebuffers (because some old machines had no framebuffers, or character buffers only).

The reason that this final 1% of realtimeness has not been easily bridged is that it required three things: an expert who is simultaneously versed in (1) and (2), plus (3):

(1) Understanding how the original software and original machines worked; and
(2) Understanding the latency black box between “Present()-to-photons” (software to photons) in a full & proper temporal manner for all Hz & all VRR tech, including differences between pixels on the screen; and
(3) The technology to catch up (it did about 8-10 years ago, but see pre-requisites (1) and (2) above).

People who simultaneously understand (1), (2) and (3) are few and far between. Such an individual would also need to know how to apply this to inventing various techniques to sync between old & new.

Likewise, a rocket scientist might not know archaeology, and vice-versa. Or in a more related field, a mathematician may not be able to create a new molecule, and vice-versa, though knowledge applied across boundaries may end up creating a new breakthrough. And sometimes someone comes up with “E=mc^2” simplicity that others can understand.

Several of us (who are familiar with how framebufferless programming worked) are finding it’s a lot simpler than expected once we follow the “best practices” list several posts ago.

mdrejhon · 4 July 2018 18:24

Added a new Best-Practice number 18 to the LibRetro Proposal

(18) Temporarily turn off debug output when programming/debugging real world beam racing. When running in debug mode, create your own built-in graphics console overlay, not a separate console window – don’t use debug console-writing to the IDE or a separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer: try to buffer your debug-text-writing till your blanking interval, and then display it as a block of text at the top of the screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen will cause far fewer glitches than trying to run a 2nd separate window of text. IDE debug overheads & shell window overheads can cause massive beam-racing glitches if you try to output debug text – some debug output commands can cause a >16ms stall. I suspect that some IDEs are programmed in a garbage-collected language and sometimes the act of writing console output causes a garbage-collect event to occur. Or there are some other really nasty operating-system / IDE environment overheads. So if you’re running in debug mode while debugging raster glitches, then temporarily turn off the usual debug output mechanism, and output instead as a graphics-text overlay on your existing 3D framebuffer, even if it means redundant re-drawing of a line of debugging text at the top edge of the screen every frame.
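
One possible shape for the overlay described in (18) is sketched below. It is only an illustration: `DebugOverlay` and its methods are hypothetical names, and `draw_text` stands in for whatever text-drawing call your 3D API wrapper provides.

```python
class DebugOverlay:
    """Buffers debug text in memory and draws it inside the existing framebuffer."""

    def __init__(self, max_lines=8):
        self.pending = []   # messages logged since the last blanking interval
        self.visible = []   # lines currently shown at the top of the screen
        self.max_lines = max_lines

    def log(self, message):
        """Cheap: only appends to a list, no console or IDE output."""
        self.pending.append(message)

    def on_vblank(self):
        """Call once per refresh, during the blanking interval."""
        if self.pending:
            self.visible = (self.visible + self.pending)[-self.max_lines:]
            self.pending.clear()

    def draw(self, draw_text):
        """Redraw the overlay every frame using the caller's text routine."""
        for row, line in enumerate(self.visible):
            draw_text(0, row, line)

overlay = DebugOverlay(max_lines=2)
for message in ("slice 3 late", "slice 7 late", "margin raised"):
    overlay.log(message)        # safe to call mid-scanout
overlay.on_vblank()             # text only becomes visible at the blanking interval

drawn = []
overlay.draw(lambda x, row, text: drawn.append((row, text)))
print(drawn)
```

Logging during scanout stays a plain list append, and the only per-frame cost is redrawing a few lines of text in the framebuffer you are already presenting.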

BTW, I propose to open and contribute to a BountySource on executing this proposal and successfully enabling frameslice beamracing to at least one module.

Would anyone be interested in me opening a BountySource on this?

e-tank · 6 July 2018 11:56

as i understand it beam racing/chasing is basically a very clever and more “the right way” solution / replacement for what frame delay currently provides in retroarch, its main purpose would be to cut down on that last frames worth of latency that’s inherent when pushing full frames with vsync on the host via standard api’s (opengl, vulkan, drm, etc. and when i say last frame i mean compared to methods like using hard sync/fences in opengl, which can achieve approximately just 1 frame additional latency over original hardware)

however, wouldn’t it be a lot of work for relatively little pay off? the libretro api, retroarch and its shader pipeline, and all the libretro modules of emulators are all currently based on full frames. not only that but the core logic in many of these emulators are too. (well, to varying degrees. the point being most emulators would need their core logic tweaked as well as the libretro implementation)

it’s also not compatible with run ahead, which while not a silver bullet, provides at least as good if not better results where supported. (run ahead requires that the emulator has reasonably efficient, fully complete, and side effect free serialization support, which not all emulators do or even can provide. and it’s also game dependent) I get that there are pros and cons to each approach, but run ahead at least made a lot of sense for retroarch since it was mostly built upon features that already existed.

personally i think mednafen would be a much better testing ground for this than retroarch. for one thing it would most likely be easier and less disruptive to implement there, and for another doing so would lay the ground work for getting all the libretro mednafen cores ready to support this if and when beam chasing ever gets added to retroarch + libretro api. (note: i would advise against requesting/pestering mednafen author to implement this)

edit: fixed some terminology

edit 2: nevermind, don’t bother responding to this post, mdrejhon has already addressed pretty much everything i’ve brought up at some point or another in one of his many posts here, there were just sooo many about this i didn’t catch all of it at first. anyway, i still stand by my statement that mednafen is more suited for this

markwkidd · 7 July 2018 13:37

I’m sure there are a lot of people like me lurking in this thread who are very interested in seeing beamracing more widely available (for example by adding these features to the libretro API ). Now that things are getting closer to working code it does seem like a good time to start up a bounty.

mdrejhon · 7 July 2018 18:20

I’ll create a github issue on this & then create a BountySource

How about I start with a $100 bounty, if someone else (or several people together) can pledge a matching $100 in total?

= $200 starter bounty for libretro raster poll API addition + 1 emulator module made compatible (e.g. NES or other)

I’ll match dollar for dollar for anything less.

Negotiable: Can be willing to match more funds beyond $100, this is just a starting point.

mdrejhon · 7 July 2018 18:49

Thanks for the tip about mednafen. However, RetroArch as far as I know, is much more well known (including variants like RetroPie, etc).

Yes, addressed all concerns:

The practical payoff is actually at least slightly bigger than you think. …Beamracing makes low lag workable on Android/PI devices too underpowered for RunAhead. It also saves more lag than a VRR display does (yet beamracing is compatible with VRR mode too). Slow emulators on slow platforms will often take 1/60 sec to render, then 1/60sec of VSYNC ON buffer delay – two frames of lag on slow platforms (VSYNC ON backpressure latency, unless you do tricks to achieve next-refresh-cycle latency reliably and consistently). It also allows performance-intensive cycle-exact emulators to reduce latency, by streaming pixels realtime to the display like the original machine did, without all the attendant pre-delays. …It’s also more preservationist friendly latency-wise, and much more faithful to the original machine’s latency behavior.
Frameslice beamracing is still compatible with RunAhead, I posted some diagrams. Although it is not as lag-saving as I thought, it can still reduce the CPU requirements of RunAhead by roughly 1 frame’s worth per 1/60sec, since the last frame no longer needs to be surge-executed, and the visible frame (last frame) can simply be realtime streamed to display at original emulator speed (aka beamracing)
Despite the “concept” complexity, it takes surprisingly little code, the difficulty is understanding the black box between Present()-to-photons on a per-pixel screen-scanout basis – nearly undocumented everywhere else except at Blur Busters. But the 18-point Best Practices list (that all of us have built up, myself, Calamity, Toni, etc) will save a lot of grief.

mdrejhon · 17 July 2018 05:47

Added GitHub Issue:

EDIT: Added $120 Cash to BountySource:

I’ll dollar-match your donations

I will dollar-match all your future donations (between now and end of September 2018). I donate another $1 every time anyone donates $1 until the BountySource hits the $360 level.

EDIT: That was FAST! BountySource now $1050 and my share maxed at $360

Twinaphex · 13 July 2018 20:10

Cool that you made a bounty for this. Hope it goes well.

GemaH · 13 July 2018 20:36

Awesome. What’s the difference between this and Runahead? As far as I know Runahead cuts down “internal game” lag frames that exist in the original hardware. Can this do the same?

hunterk · 13 July 2018 20:47

it’s completely different. It reduces display lag to very low levels, potentially similar to original hardware and/or vsync OFF without tearing.