AI Service implementation

I was unable to find documentation on that. I am looking into implementing my own AI service, where can I find help on the API?


I don’t think there’s any specific API documentation but I believe all of the AI service action happens in retroarch.c

Hm, I had hoped it wouldn’t come to that… It doesn’t look like it’s implemented in retroarch.c, but that’s one massive file, so I might have overlooked something. All of the AI Service implementation is gated behind the HAVE_TRANSLATE define, IIRC?

Yes, I think that’s correct.

I found what I need in accessibility.h

To make your own server, it must listen for a POST request, which
will consist of a JSON body, with the "image" field as a base64
encoded string of a 24bit-BMP/PNG that will be translated.
The server must output the translated image in the form of a
JSON body, with the "image" field also as a base64 encoded
24bit-BMP, or as an alpha channel png.
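Going by that comment alone, a compatible server is only a few lines of code. Here is a minimal Python sketch, not a reference implementation: the handler name and port are my own choices, and the “translation” step is a placeholder that just echoes the frame back.

```python
# Minimal sketch of a server matching the accessibility.h comment quoted
# above: accept a POST with a JSON body whose "image" field is a
# base64-encoded BMP/PNG, and answer with a JSON body of the same shape.
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_response(request_body: bytes) -> bytes:
    request = json.loads(request_body)
    frame = base64.b64decode(request["image"])  # raw BMP/PNG bytes
    # ...run OCR/translation here and render a replacement image...
    translated = frame  # placeholder: echo the frame back unchanged
    payload = {"image": base64.b64encode(translated).decode("ascii")}
    return json.dumps(payload).encode("utf-8")

class AIServiceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        response = build_response(body)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

# To run (port is arbitrary; point RetroArch's AI Service URL at it):
# HTTPServer(("localhost", 4404), AIServiceHandler).serve_forever()
```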

But the code in task_translation.c seems to indicate that the server can also respond with sound and raw text, though I’ve yet to confirm that this is actually supported. In fact, RA didn’t display anything when I tried responding with an image and a string, and nothing shows up in the log…

Back to work I guess.


I’ve made some progress, but I’ve stumbled upon a few issues. I’m posting here in the hope that those who implemented the AI service module can comment.

By the way, I know it’s not important, but I have to say it: “AI Service” is quite a misnomer. First because there’s no such thing as AI as of today, and second because the whole thing is surprisingly not related to AI at all, right? I believe “Screen Reader” would be a better description. For a long time after the feature was introduced I had no idea it could do something like this… Anyway, back on topic.

So right now my server returns an image. It works, but I have a few things to say about it.

  1. Like I said before, task_translation.c seems to handle different formats besides images. However, when I tried using “text” mode, nothing happened. It doesn’t seem to work at all, and no error is logged. I’m still not sure how it’s supposed to work; a notification pop-up, I’d assume.
  2. task_translation.c is supposed to try displaying the image using the graphical widgets and, as a fallback, write directly to the frame buffer. With graphical widgets disabled, I’ve noticed that this alternative does not work on my end; worse, it fails without any error being logged. When I started working on this I had widgets disabled, and it wasn’t easy to find out why nothing was working… Also, I’ve tried both vulkan and d3d11, and only the widget approach gives any result.
  3. Using widgets to display the result is also a problem in itself for me, as the widgets are displayed on top of the video and so completely bypass any shader. You can see for yourself in the following pictures why it’s an issue.

It can actually be far worse than that, since the widget will not be scaled by the core-provided aspect ratio.

For this to work correctly, the module must in fact write to the frame buffer, and try that first, relying on widgets as a fallback and not the other way around.

  4. As far as I can tell, this service only works with manual input. I’d like RA to automatically send a request every X seconds and automatically pause the core when a valid response is returned by the server. Probably not the hardest thing to implement, but it isn’t currently supported.
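For reference, here is what I understand the different response shapes to look like. This is a hedged sketch: the “sound” and “text” field names are inferred from my reading of task_translation.c, not from any official documentation, so treat them as assumptions.

```python
# Sketches of the three response payloads the AI service seems to accept.
# Field names "sound" and "text" are inferred from task_translation.c and
# may not be exact; "image" is the documented one from accessibility.h.
import base64
import json

def image_response(bmp: bytes) -> str:
    """Translated image, drawn over the screen by RetroArch."""
    return json.dumps({"image": base64.b64encode(bmp).decode("ascii")})

def sound_response(wav: bytes) -> str:
    """Speech audio, played back for text-to-speech."""
    return json.dumps({"sound": base64.b64encode(wav).decode("ascii")})

def text_response(s: str) -> str:
    """Raw text, apparently read out by the accessibility layer."""
    return json.dumps({"text": s})
```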

I believe the naming is just because it could be analyzed for any other purpose, not just translation.

I pinged the guy who implemented it over on discord to see if they would come over here and answer your questions better than I could, but for 4, I believe the answer is just to avoid using up people’s allotment of API calls because they got up to grab a bite to eat and forgot to pause it, or whatever.


Naturally, but for a local server there is no such issue. Perhaps an automatic option could be available only if the URL is detected as “localhost” (the server could still be running on the LAN, but… it’s a start).

Yeah, I think you could have an option for polling every X seconds, default to 0 / never and put a warning in the sublabel.

Hi Xunkar. I’m usually available on discord and I don’t view the forums normally. In terms of the api, there’s a compatible implementation available at:

I’ve highlighted the entry point of the requests, though it may be rough to go through. The idea is that the service used to return just translated images, but it can now return speech for text-to-speech functionality, or text, which is then used by retroarch’s accessibility functionality to read the text out (if you don’t have that enabled, you won’t see or hear anything in this mode).

The AI Service is named such because it is intended to be compatible with different endpoints. This is normally translating text, but can be used for text-to-speech as well. Additionally it can read and push button states to retroarch, which can be used in custom scripts like this one for accessibility:

Writing to widgets is preferred over writing to the frame buffer since it’s a more general approach and allows for more options in the display. When writing to the frame buffer:

  1. you’re limited in which cores you can write to, since writing to the frame buffer instead of the whole screen usually restricts you to software-rendered cores;
  2. the resolution is tied to the emulation resolution instead of the display resolution, so two lines of 8-pixel-high kanji text might have to be displayed as three lines of 5-pixel-high english text and be unreadable; and
  3. you’re limited to having the game paused during the translation, unless you write to the frame buffer every frame, which gets more complicated.

With widgets you avoid these problems. The older frame buffer code was kept only as a fail-over method for the widget approach.

Auto translation is possible as-is (it’s used in the custom scripts demo above) if the AI Service endpoint returns an auto field of true, but given that these implementations are calling paid APIs, as mentioned, auto translation would rack up huge API usage fees for little benefit. If you’re connecting to a local model for OCR and translation, then that’s not an issue. There is support for tesseract in vgtranslate, but I would recommend the easyocr library instead since it’s more accurate, though unlike tesseract it requires a GPU to run the model and downloading gigabytes of deep learning packages.

If you have more questions, please let me know, and if you’re available on discord, I’m easier to reach there.



Thank you for the explanations.

If I understand you correctly, the “text” mode is not supposed to display text but to provide a string to be read out by TTS when the server cannot return an actual sound file. Which leaves only image mode for what I’m trying to achieve.

The thing is, I’m not doing any form of OCR. My project aims to provide translation through different means, with a service that runs locally and is reactive enough that it could work in real time. In all likelihood it would be better implemented directly within libretro, but leveraging the AI service is far easier for me.

The biggest thing I’d need is this auto-polling feature, so no input would be required from the user. The ability to actually write to the frame buffer would be great (imagine being able to play a live translation exactly like the original game, perfectly integrated and compatible with shaders) but I guess the AI service is not designed to do that.

I’ve put up a working prototype of my service, but it only highlights the issues I have with the current implementation of the AI service. Not to throw any shade on anyone, but it strikes me as intrinsically flawed in its design, at least when it comes to supporting anything other than TTS. For something that is supposed to be an accessibility-oriented feature, it seems to be quite the opposite, and I don’t see how it can be used decently even with services already supported like vgtranslate.

When it works, it works; here’s a video demonstrating the prototype for those interested. But let me explain why this is not going anywhere. To put it simply, for any of this to work, I would need to ask users to do the following, besides setting up the service itself:

  • Enable graphical widgets (minor requirement)
  • Disable any video shader (unfortunate compromise)
  • Play in windowed mode, with a window respecting the core aspect ratio, otherwise the translated text will not be placed correctly (can’t play in fullscreen… now it’s getting very specific)
  • Press a key every time a text needs to be translated, and press it again to make the translation disappear (minor tweak that could be supported in the future).

That’s a lot of requirements. The whole service becomes much too dependent on the end user setup and fairly cumbersome. Perhaps it’s all as intended and I’m overlooking something, but those limitations are very discouraging and confusing. :thinking: