Specification for arcade ROM scanning

Premise

Within any given MAME or FB Alpha ROM set, there could be one of four distinct, totally valid zipped romsets or one of four different, equally valid 7z archives for the same title. CRC scanning romset zip and 7z files doesn’t make sense in that context – it’s too different from the ‘native’ validation approach used by MAME. MAME’s own validation is characterized below.


This thread is my attempt to start a specification for arcade ROM scanning based on the ‘native’ validation method employed by MAME and FB Alpha. I am making this effort in the interest of science.

Overview and terminology

Arcade games are packaged as zip files, most of which contain more than one individual ‘ROM’ file. In MAME and FB Alpha parlance, a ZIP file containing all of the ROM files needed to emulate one game is called a “ROM set”. Some resources refer to an individual arcade game as a ROM (the way people describe a zipped game cartridge ROM, which is actually one ROM file inside the zip) while other resources refer to an individual arcade game as a ROM set or romset.

I will follow mamedev convention and use the term romset to refer to a zip or 7z archive containing the ROM files for one game.


ROM set version and formats

Each version of an arcade emulator must be used with ROM sets that have the same exact version number. For example, MAME 0.37b5 sets are required by the mame2000 core, but will not work correctly with the mame2010 core, which requires MAME 0.139 ROM sets. MAME validates ROM sets by checking the CRCs of individual ROM files within a ROM set against its internal database. This database changes with each MAME release and can be generated by running the MAME executable with the flag -listxml.

Four Arcade Romset File Formats

Full Non-merged: All ROMs can be used standalone because each zip contains all the files needed to run that game, including any ROMs from ‘parent’ ROM sets and BIOS sets. (ClrMamePro users: access through the “Advanced” button in the Rebuild and Scanner menus, then deselect “Separate BIOS sets”.)

Non-merged ROMs: Except for romsets which require a BIOS archive, all romsets can be used standalone because each zip contains all the files needed to run that game, including any files from ‘parent ROMs’. BIOS romsets are ‘split’ from the game romsets and must be placed in the same folder as the game romset.

Split: Some romsets that are considered clones, translations, or bootlegs also require a “parent ROM” to run. The parent ROM is often the first or most common variant of a game. In some cases, however, the parent is not the most popular or best working version of the game. For example, in a Split set, pacman.zip (a clone) will not work without puckman.zip (its parent). BIOS romsets are also ‘split’ from the game romsets and must be placed in the same folder as the game romset.

Merged: Clones are merged into the parent romset zip, meaning that more than one game is stored per file. Merged romsets are not well supported in the libretro arcade emulator cores as of this time.

Finally, necessary game content is sometimes distributed in the form of an additional Sample ZIP file composed of individual audio samples, or a CHD file with game data that was originally stored on an internal hard drive, CD-ROM, DVD, laserdisc, or other media.

If RetroArch were to add native arcade scanning support to the playlist generator, the most straightforward way would be to support “Full Non-Merged” sets only. I advocate for Full Non-Merged as a standard but I’m ready to help work through the requirements for any and all of the above.


An Example

To demonstrate how this works, here is the output of unzip -v for a Full Non-Merged 1941j.zip (1941 - Counter Attack (Japan)). I believe an equivalent of the -v (verbose) listing is also commonly available in standard zip libraries, although I’m not sure what RetroArch uses for zip functionality.

Being able to examine the CRC values of the individual files within the ZIP without decompressing them is the cornerstone of this approach.

Archive:  1941j.zip
TORRENTZIPPED-4E6AC678
 Length   Method    Size  Ratio   Date   Time   CRC-32    Name
--------  ------  ------- -----   ----   ----   ------    ----
  131072  Defl:X    37719  71%  12/24/96 23:32  7fbd42ab  4136.bin
  131072  Defl:X    30694  77%  12/24/96 23:32  c6464b0b  4137.bin
  131072  Defl:X    65674  50%  12/24/96 23:32  c7781f89  4142.bin
  131072  Defl:X    64180  51%  12/24/96 23:32  440fc0b5  4143.bin
   65536  Defl:X    18355  72%  12/24/96 23:32  0f9d8527  41_09.rom
  131072  Defl:X   117804  10%  12/24/96 23:32  d1f15aeb  41_18.rom
  131072  Defl:X    82385  37%  12/24/96 23:32  15aec3a6  41_19.rom
  524288  Defl:X    87410  83%  12/24/96 23:32  4e9648ca  41_32.rom
  524288  Defl:X   269308  49%  12/24/96 23:32  ff77985a  41_gfx1.rom
  524288  Defl:X   186206  65%  12/24/96 23:32  983be58f  41_gfx3.rom
  524288  Defl:X   270331  48%  12/24/96 23:32  01d1cb11  41_gfx5.rom
  524288  Defl:X   187229  64%  12/24/96 23:32  aeaa3509  41_gfx7.rom
--------          -------  ---                            -------
 3473408          1417295  59%                            12 files
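As a rough illustration of reading those CRC values without decompressing anything, here is a minimal Python sketch using the standard zipfile module. The helper name read_zip_manifest and the demo file names are made up for this example, and this says nothing about how RetroArch itself handles zips.

```python
import io
import zipfile


def read_zip_manifest(zip_source):
    """Return {member name: (size, crc32 hex)} from the ZIP directory.

    Nothing is decompressed: zipfile takes these values from the
    archive's directory records.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return {
            info.filename: (info.file_size, f"{info.CRC:08x}")
            for info in archive.infolist()
            if not info.is_dir()
        }


# Build a tiny in-memory archive so the sketch runs without a real romset.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("demo_01.rom", b"\x00" * 1024)
    archive.writestr("demo_02.rom", b"\xff" * 2048)

manifest = read_zip_manifest(demo)
for name, (size, crc) in manifest.items():
    print(f"{size:>8}  {crc}  {name}")
```

Pointing read_zip_manifest at a real romset path instead of the in-memory demo should give the same size, CRC, and name columns as the unzip -v listing.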

1941j in the MAME 0.78 DAT file:

<game name="1941j" cloneof="1941" romof="1941">
	<description>1941 - Counter Attack (Japan)</description>
	<year>1990</year>
	<manufacturer>Capcom</manufacturer>
	<rom name="4136.bin" size="131072" crc="7fbd42ab" sha1="4e52a599e3099bf3cccabb89152c69f216fde79e"/>
	<rom name="4137.bin" size="131072" crc="c6464b0b" sha1="abef422d891d32334a858d49599f1ef7cf0db45d"/>
	<rom name="4142.bin" size="131072" crc="c7781f89" sha1="7e99c433de0c903791ae153a3cc8632042b0a90d"/>
	<rom name="4143.bin" size="131072" crc="440fc0b5" sha1="e725535533c25a2c80a45a2200bbfd0dcda5ed97"/>
	<rom name="41_09.rom" merge="41_09.rom" size="65536" crc="0f9d8527" sha1="3a00dd5772f38081fde11d8d61ba467379e2a636"/>
	<rom name="41_18.rom" merge="41_18.rom" size="131072" crc="d1f15aeb" sha1="88089383f2d54fc97026a67f067d448eee5bd0c2"/>
	<rom name="41_19.rom" merge="41_19.rom" size="131072" crc="15aec3a6" sha1="8153c03aba005bab62bf0e8b3d15ec1c346326fd"/>
	<rom name="41_32.rom" merge="41_32.rom" size="524288" crc="4e9648ca" sha1="d8e67e6e3a6dc79053e4f56cfd83431385ea7611"/>
	<rom name="41_gfx1.rom" merge="41_gfx1.rom" size="524288" crc="ff77985a" sha1="7e08df3a829bf9617470a46c79b713d4d9ebacae"/>
	<rom name="41_gfx3.rom" merge="41_gfx3.rom" size="524288" crc="983be58f" sha1="83a4decdd775f859240771269b8af3a5981b244c"/>
	<rom name="41_gfx5.rom" merge="41_gfx5.rom" size="524288" crc="01d1cb11" sha1="621e5377d1aaa9f7270d85bea1bdeef6721cdd05"/>
	<rom name="41_gfx7.rom" merge="41_gfx7.rom" size="524288" crc="aeaa3509" sha1="6124ef06d9dfdd879181856bd49853f1800c3b87"/>
</game>
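For anyone who wants to experiment, an entry like this can be read with Python's standard XML parser. This is only a sketch: the surrounding mame root element, the load_dat_entries name, and the dictionary layout are my own assumptions, and the snippet is trimmed to two ROMs.

```python
import xml.etree.ElementTree as ET

# Trimmed DAT fragment; the <mame> wrapper element is assumed for the demo.
DAT_SNIPPET = """
<mame>
  <game name="1941j" cloneof="1941" romof="1941">
    <description>1941 - Counter Attack (Japan)</description>
    <rom name="4136.bin" size="131072" crc="7fbd42ab"/>
    <rom name="41_09.rom" merge="41_09.rom" size="65536" crc="0f9d8527"/>
  </game>
</mame>
"""


def load_dat_entries(xml_text):
    """Map each game name to its description, parent, and expected ROMs."""
    entries = {}
    for game in ET.fromstring(xml_text).iter("game"):
        entries[game.get("name")] = {
            "description": game.findtext("description", default=""),
            "romof": game.get("romof"),
            "roms": [
                {
                    "name": rom.get("name"),
                    "size": int(rom.get("size", "0")),
                    # Some entries may carry no crc attribute at all.
                    "crc": (rom.get("crc") or "").lower(),
                    "merge": rom.get("merge"),
                }
                for rom in game.findall("rom")
            ],
        }
    return entries


entries = load_dat_entries(DAT_SNIPPET)
print(entries["1941j"]["romof"], len(entries["1941j"]["roms"]))
```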

Some pseudo-code for implementing Full Non-Merged arcade ROM scanning that works across MAME versions:

PlaylistScanner() {
	...
	if(ROMfile.extension == ".zip") { // or whatever other factor triggers scanning the arcade DATs

		DAT_entry                       = SearchArcadeDATs(ROMfile.name_no_extension)
		if(!DAT_entry) {
			return false // the file being scanned can't be found in the DAT
		}
		DAT_entry_canonical_contents[]  = ParseArcadeROMContents(DAT_entry)
		ZIP_contents[]                  = ParseZIPManifest(ROMfile)

		// every file the DAT expects must be present in the ZIP
		for(index = 0; index < DAT_entry_canonical_contents.length; index++) {
			if(!CompareArcadeCRCs(ZIP_contents, DAT_entry_canonical_contents[index])) {
				return false // an expected file is missing from the ZIP
			}
		}
		// every file in the ZIP must be expected by the DAT, or be a BIOS file
		for(index = 0; index < ZIP_contents.length; index++) {
			if(!CompareArcadeCRCs(DAT_entry_canonical_contents, ZIP_contents[index])) {
				if(!isBIOS(DAT_entry_canonical_contents, ZIP_contents[index])) {
					return false // the ZIP file has extra files that are not expected by the DAT
				}
			}
		}
	}
	...
}
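The two loops in the pseudocode boil down to two set differences. Here is a minimal Python sketch of that comparison, assuming the CRCs have already been pulled from the ZIP and the DAT; the function name and the bios_crcs parameter are placeholders of mine, not anything from RetroArch or MAME.

```python
def validate_full_non_merged(zip_crcs, expected_crcs, bios_crcs=frozenset()):
    """Compare a ZIP's CRC set with the DAT entry's CRC set.

    Returns (ok, missing, unexpected). Extra files whose CRC belongs to a
    BIOS set are tolerated, mirroring the isBIOS() check in the pseudocode.
    """
    zip_crcs = set(zip_crcs)
    expected_crcs = set(expected_crcs)
    missing = expected_crcs - zip_crcs
    unexpected = zip_crcs - expected_crcs - set(bios_crcs)
    return (not missing and not unexpected), missing, unexpected


# Three CRCs borrowed from the 1941j entry, used here as a shortened example.
expected = {"7fbd42ab", "c6464b0b", "0f9d8527"}
complete = {"7fbd42ab", "c6464b0b", "0f9d8527"}
incomplete = {"7fbd42ab", "0f9d8527"}

print(validate_full_non_merged(complete, expected))
print(validate_full_non_merged(incomplete, expected))
```

Matching on CRC alone ignores file names inside the ZIP; whether names should also be compared is a design decision the pseudocode leaves open.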

References

Note that the Logiqx site refers to ‘classic’ DAT format but the tags are the same as the newer XML format.

To reiterate, I think that Non-Merged ROMs should be the focus for any new scanning engine. Just like No-Intro and Redump, Non-Merged formats capture the most complete/most accurate dumps of a game possible. Non-Merged sets are also relatively common in circulation.

That said, it would not be too much more complex to provide Non-Merged and Split support. The key is parsing the romof and cloneof attributes for each title.

Notice that the DAT entry for 1941j uses the romof attribute to refer to the 1941 ROM set. Also note how many of the individual ROMs within the ROM set carry a “merge” attribute, meaning that these same ROM files can be found in the “parent” 1941 ROM set.

<game name="1941j" cloneof="1941" romof="1941">
	<description>1941 - Counter Attack (Japan)</description>
	<year>1990</year>
	<manufacturer>Capcom</manufacturer>
	<rom name="4136.bin" size="131072" crc="7fbd42ab" sha1="4e52a599e3099bf3cccabb89152c69f216fde79e"/>
	<rom name="4137.bin" size="131072" crc="c6464b0b" sha1="abef422d891d32334a858d49599f1ef7cf0db45d"/>
	<rom name="4142.bin" size="131072" crc="c7781f89" sha1="7e99c433de0c903791ae153a3cc8632042b0a90d"/>
	<rom name="4143.bin" size="131072" crc="440fc0b5" sha1="e725535533c25a2c80a45a2200bbfd0dcda5ed97"/>
	<rom name="41_09.rom" merge="41_09.rom" size="65536" crc="0f9d8527" sha1="3a00dd5772f38081fde11d8d61ba467379e2a636"/>
	<rom name="41_18.rom" merge="41_18.rom" size="131072" crc="d1f15aeb" sha1="88089383f2d54fc97026a67f067d448eee5bd0c2"/>
	<rom name="41_19.rom" merge="41_19.rom" size="131072" crc="15aec3a6" sha1="8153c03aba005bab62bf0e8b3d15ec1c346326fd"/>
	<rom name="41_32.rom" merge="41_32.rom" size="524288" crc="4e9648ca" sha1="d8e67e6e3a6dc79053e4f56cfd83431385ea7611"/>
	<rom name="41_gfx1.rom" merge="41_gfx1.rom" size="524288" crc="ff77985a" sha1="7e08df3a829bf9617470a46c79b713d4d9ebacae"/>
	<rom name="41_gfx3.rom" merge="41_gfx3.rom" size="524288" crc="983be58f" sha1="83a4decdd775f859240771269b8af3a5981b244c"/>
	<rom name="41_gfx5.rom" merge="41_gfx5.rom" size="524288" crc="01d1cb11" sha1="621e5377d1aaa9f7270d85bea1bdeef6721cdd05"/>
	<rom name="41_gfx7.rom" merge="41_gfx7.rom" size="524288" crc="aeaa3509" sha1="6124ef06d9dfdd879181856bd49853f1800c3b87"/>
</game>

Taking a look at the 1941 DAT entry shows many of the same files with the same CRC values as the “child” 1941j ROM set. In a Split set, those overlapping files would be omitted from 1941j and MAME would pull them out of 1941 at runtime.

<game name="1941">
	<description>1941 - Counter Attack (World)</description>
	<year>1990</year>
	<manufacturer>Capcom</manufacturer>
	<rom name="41_09.rom" size="65536" crc="0f9d8527" sha1="3a00dd5772f38081fde11d8d61ba467379e2a636"/>
	<rom name="41_18.rom" size="131072" crc="d1f15aeb" sha1="88089383f2d54fc97026a67f067d448eee5bd0c2"/>
	<rom name="41_19.rom" size="131072" crc="15aec3a6" sha1="8153c03aba005bab62bf0e8b3d15ec1c346326fd"/>
	<rom name="41_32.rom" size="524288" crc="4e9648ca" sha1="d8e67e6e3a6dc79053e4f56cfd83431385ea7611"/>
	<rom name="41_gfx1.rom" size="524288" crc="ff77985a" sha1="7e08df3a829bf9617470a46c79b713d4d9ebacae"/>
	<rom name="41_gfx3.rom" size="524288" crc="983be58f" sha1="83a4decdd775f859240771269b8af3a5981b244c"/>
	<rom name="41_gfx5.rom" size="524288" crc="01d1cb11" sha1="621e5377d1aaa9f7270d85bea1bdeef6721cdd05"/>
	<rom name="41_gfx7.rom" size="524288" crc="aeaa3509" sha1="6124ef06d9dfdd879181856bd49853f1800c3b87"/>
	<rom name="41e_30.rom" size="131072" crc="9deb1e75" sha1="68d9f91bef6a5c9e1bcbf286629aed6b37b4acb9"/>
	<rom name="41e_31.rom" size="131072" crc="df201112" sha1="d84f63bffeb9255cbabc02f23d7511f9b3c6a96c"/>
	<rom name="41e_35.rom" size="131072" crc="d63942b3" sha1="b4bc7d06dcefbc075d316f2d31abbd4c7a99dbae"/>
	<rom name="41e_36.rom" size="131072" crc="816a818f" sha1="3e491a30352b71ddd775142f3a80cdde480b669f"/>
</game>
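To make the Split layout concrete, here is a small Python sketch that partitions a clone's ROM list by the merge attribute. The dictionaries are a shortened, hand-copied subset of the 1941j entry, and split_set_contents is a name I made up for illustration.

```python
def split_set_contents(roms):
    """Partition a DAT entry's ROMs for a Split set.

    ROMs carrying a merge attribute are expected in the parent (or BIOS)
    ZIP; the rest must be in the clone's own ZIP.
    """
    own = [rom["name"] for rom in roms if not rom.get("merge")]
    from_parent = [rom["merge"] for rom in roms if rom.get("merge")]
    return own, from_parent


# Shortened subset of the 1941j DAT entry.
roms_1941j = [
    {"name": "4136.bin", "crc": "7fbd42ab"},
    {"name": "4137.bin", "crc": "c6464b0b"},
    {"name": "4142.bin", "crc": "c7781f89"},
    {"name": "4143.bin", "crc": "440fc0b5"},
    {"name": "41_09.rom", "merge": "41_09.rom", "crc": "0f9d8527"},
    {"name": "41_18.rom", "merge": "41_18.rom", "crc": "d1f15aeb"},
]

own, from_parent = split_set_contents(roms_1941j)
print("in 1941j.zip:", own)
print("pulled from 1941.zip:", from_parent)
```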

The pseudocode doesn’t have to change much in order to search the parent ROM as well.

PlaylistScanner() {
	...
	if(ROMfile.extension == ".zip") { // or whatever other factor triggers scanning the arcade DATs

		DAT_entry                       = SearchArcadeDATs(ROMfile.name_no_extension)
		if(!DAT_entry) {
			return false // the file being scanned can't be found in the DAT
		}
		parent_ROM                      = DAT_entry.parent_name // empty if the game has no parent
		DAT_entry_canonical_contents[]  = ParseArcadeROMContents(DAT_entry)
		ZIP_contents[]                  = ParseZIPManifest(ROMfile)
		parent_ZIP_contents[]           = parent_ROM ? ParseZIPManifest(parent_ROM) : []

		// every file the DAT expects must be present in the ZIP or in its parent ZIP
		for(index = 0; index < DAT_entry_canonical_contents.length; index++) {
			if(!CompareArcadeCRCs(ZIP_contents, DAT_entry_canonical_contents[index]) && !CompareArcadeCRCs(parent_ZIP_contents, DAT_entry_canonical_contents[index])) {
				return false // an expected file is missing from both the ZIP and its parent
			}
		}
		// every file in the ZIP must be expected by the DAT
		for(index = 0; index < ZIP_contents.length; index++) {
			if(!CompareArcadeCRCs(DAT_entry_canonical_contents, ZIP_contents[index])) {
				return false // the ZIP file has extra files that are not expected by the DAT
			}
		}
	}
	...
}
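Here is a runnable Python sketch of the parent-aware check, using tiny in-memory ZIPs with made-up contents in place of real dumps. All function names are hypothetical, and the real scanner would of course get its expected CRCs from the DAT rather than computing them.

```python
import io
import zipfile
import zlib


def make_zip(files):
    """Build an in-memory ZIP from {name: bytes} (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer


def zip_crcs(zip_source):
    """Collect member CRCs from the ZIP directory without decompressing."""
    with zipfile.ZipFile(zip_source) as archive:
        return {f"{info.CRC:08x}" for info in archive.infolist()}


def validate_with_parent(child_zip, parent_zip, expected_crcs):
    """True if every expected CRC is in the child or parent ZIP and the
    child ZIP holds nothing the DAT entry does not list."""
    expected = set(expected_crcs)
    child = zip_crcs(child_zip)
    parent = zip_crcs(parent_zip) if parent_zip is not None else set()
    missing = expected - (child | parent)
    unexpected = child - expected
    return not missing and not unexpected


# Made-up ROM contents standing in for real dumps.
shared, clone_only, parent_only = b"shared gfx", b"clone program", b"parent program"
parent_zip = make_zip({"gfx.rom": shared, "prog_world.rom": parent_only})
child_zip = make_zip({"prog_japan.rom": clone_only})
expected = {f"{zlib.crc32(data):08x}" for data in (shared, clone_only)}

print(validate_with_parent(child_zip, parent_zip, expected))  # parent ZIP present
print(validate_with_parent(child_zip, None, expected))        # parent ZIP unavailable
```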

@BarbuDreadMon and @Kivutar I’m tagging you because I know you have been discussing this in the past.

Would be great to see if we could launch these games through the # hashtag mechanism. I don’t have them handy, so I can’t test it, but it would allow using the same loading mechanisms we have in place currently.

  1. Add all the MAME XML DATs to https://github.com/robloach/libretro-dats
  2. Have it select a unique ROM from each of the game entries
  3. Create a clrmamepro DAT from it
  4. Use that DAT to launch games through the menu.

Again, I’m unsure of how MAME launches games, but would love to do some testing in the near future.

@RobLoach I’m not sure I understand your post so let me try to clarify what I wrote in case it helps us get on the same page.

If you hang out on mamedev.org or other dedicated MAME sites, you’ll find that they refer to a “ROM set” the way cartridge emulator jargon refers to a “ROM.” One single game, such as 1941j in MAME 0.78, requires 12 distinct ROMs (each a dump of a chip on the circuit board for that game) in order to play. So there is really no such thing as a “1941j ROM,” but rather there is a “1941j ROM set” with 12 required files inside it, and a different 1941 ROM set which itself requires 12 internal ROM files in order to play.

I’m stuck on #2 in your flowchart. The “game” entry in an arcade DAT lists all of the required ROMs for one single game; none of the unique ROMs listed inside that game tag is a playable game on its own. They all have to be there inside a ZIP together, with those CRCs, in order for the individual ROM set zip to be valid and complete.

The fact that 8 out of the 12 ROMs inside the 1941 ROM set are the same as 8 of the files inside the 1941j ROM set is what allows for the Split and Merged set formats to reduce the amount of hard drive space required for 1941j.

TorrentZip: The only hope of a single-file hash matching approach for playlist generation

One of the dilemmas facing arcade emulation in the bittorrent age is that you can zip the same twelve 1941j ROMs into a 1941j.zip ROM set on three different computers and you will get three ZIP files with three different file sizes and three different CRCs because ZIP implementations vary widely.

This made it difficult to use bittorrent to distribute MAME ROM collections, because even if two people had a valid Non-Merged MAME 0.78 set, they couldn’t necessarily co-seed a torrent unless they used the exact same ZIP archiver in the same environment to produce their ROM sets.

This is where TorrentZip (originally called MAMEZip) comes in. TorrentZip is a cross-platform ZIP implementation that is deterministic.

TorrentZipping the same 12 1941j ROMs into a 1941j.zip ROM set on three different computers will always produce three identical ZIP files with the same CRC.
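To illustrate why normalizing the archive makes the whole-file hash reproducible, here is a small Python sketch. It is not TorrentZip and will not produce TorrentZip-compatible bytes; it only fixes the things that usually vary (member order and timestamps). The fixed timestamp matches the one in the listing for 1941j.zip, and the case-insensitive sort is my assumption.

```python
import io
import zipfile
import zlib

FIXED_TIMESTAMP = (1996, 12, 24, 23, 32, 0)  # same date for every member


def build_normalized_zip(files):
    """Zip {name: bytes} with sorted names and a fixed timestamp.

    Illustration only: real TorrentZip has its own exact rules.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files, key=str.lower):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIMESTAMP)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, files[name])
    return buffer.getvalue()


# Same contents supplied in a different order, as two users might have them.
first = build_normalized_zip({"b.rom": b"B" * 100, "a.rom": b"A" * 100})
second = build_normalized_zip({"a.rom": b"A" * 100, "b.rom": b"B" * 100})

print(first == second)
print(f"{zlib.crc32(first):08x}", f"{zlib.crc32(second):08x}")
```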

If the decision was reached to not add a third scanning mode to the RetroArch playlist generator, the next best thing in my opinion would be to build DATs based on Non-Merged ROM sets that have been TorrentZipped. (Not coincidentally this is the most common format for distributing older MAME ROM sets right now).

Oh boy, the Arcade set discussion again :smiley:

I won’t spend too much time arguing for which format is better (I myself prefer Split sets), but will instead try to see how we can implement any form of scanning of those romsets in RetroArch.

The first piece of bad news is that the current scanner cannot, in any way, shape, or form, check multiple files inside a zip, since rdb databases don’t support any type of entity-relation. One crc/serial with metadata, that’s all you get.

From that starting point, the only solution is to get the CRC32 of the TorrentZip (as markwkidd correctly points out) holding the rom files, whether Split or Non-Merged. Merged is a no-go, since one zip would then contain several games (1941, 1941j, …).

I don’t see a technical reason why we couldn’t have both, actually: a dat file with the crcs for Split romsets and another for Non-Merged sets, both going into the same final rdb database. That way it doesn’t matter which one you use; each would obviously have a different crc, and it’s transparent for the user, provided that you remove the potential duplicates (in case a romset has no clones, both Split and Non-Merged would be the exact same size). That said, the rdb generation just takes the latest version of any metadata found in the source dats for a given crc/serial, so duplicates don’t actually matter.
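As a toy illustration of that "latest entry wins per crc" behaviour, here is a Python sketch that merges two lists of entries keyed by CRC. It is not the actual rdb build code; the field names and the CRC values are placeholders.

```python
def merge_dat_entries(*sources):
    """Merge entry lists keyed by CRC; later sources overwrite earlier ones."""
    merged = {}
    for entries in sources:
        for entry in entries:
            merged[entry["crc"].lower()] = entry
    return merged


# CRC values below are placeholders, not real archive hashes.
split_dat = [
    {"crc": "aaaaaaaa", "name": "1941 - Counter Attack (World)", "zip": "1941.zip"},
    {"crc": "bbbbbbbb", "name": "1941 - Counter Attack (Japan)", "zip": "1941j.zip"},
]
non_merged_dat = [
    {"crc": "aaaaaaaa", "name": "1941 - Counter Attack (World)", "zip": "1941.zip"},
    {"crc": "cccccccc", "name": "1941 - Counter Attack (Japan)", "zip": "1941j.zip"},
]

merged = merge_dat_entries(split_dat, non_merged_dat)
print(len(merged))
```

The parent zip appears in both sources under the same CRC and collapses to one entry, while the clone keeps one entry per format.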

The split/non-merged complexity is actually handled by the core itself, so I don’t think we have to worry too much about it in the dat. Me and TaoPlyPly wrote a python script to take an XML dat, scan a folder and output a “libretro dat”; we used it for the FBA dat. If it’s any help, I’m glad to clean it up a bit and try to make it work for MAME.

@markwkidd, first of all, I would like to thank you for the hard work with the playlist generator and with this overview of the arcade roms situation. I want to share my experience with FBA playlist since I never tried MAME at all.

Maybe the existing SCAN process should stop trying to search for MAME and FBA roms, and we could ADD another option, one FBA SCAN and one MAME SCAN, that looks only at the FILENAME (SFIII.zip / SFIIIn.zip), since that is the key FBA and MAME use to choose which game to emulate…

I did this using a MERGED set, and it returned a PLAYLIST with only parent games, since in a merged set the files are named after the parents and each file contains the files for all of the clones. If I change the filename of SFIII.ZIP to SFIIIn.zip, FBA will run the clone. The filename is the easy approach for MAME and FBA, but it needs to be separated from the existing method, which works well with the other systems.
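A filename-only scan like this could be sketched in Python as a simple lookup from the zip's base name to the description in the DAT. The function names are hypothetical, the two descriptions are the 1941 entries quoted earlier in the thread, and lowercasing the name is my assumption.

```python
from pathlib import Path

# Romset name -> description, as it would be read from a DAT.
DESCRIPTIONS = {
    "1941": "1941 - Counter Attack (World)",
    "1941j": "1941 - Counter Attack (Japan)",
}


def label_for(zip_path, descriptions=DESCRIPTIONS):
    """Look up a playlist label from the file name alone (no hashing)."""
    return descriptions.get(Path(zip_path).stem.lower())


def build_playlist(zip_paths, descriptions=DESCRIPTIONS):
    """Keep only the files whose base name matches a known romset."""
    playlist = []
    for zip_path in zip_paths:
        label = label_for(zip_path, descriptions)
        if label is not None:
            playlist.append((str(zip_path), label))
    return playlist


print(build_playlist(["roms/1941j.zip", "roms/unknown.zip"]))
```

Note that this says nothing about whether the zip's contents match the core's romset version; it only maps a name to a label.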

If I said something wrong, I appreciate feedback.

And sorry for the bad english, it’s not my language… Seeya!

@larrykoubiak1 Thank you for helping give me a sense of the background of arcade scanning in RetroArch.

My overall impression of the RetroArch playlist database is that it has very high standards. Only the versions of cartridge, CD, and DVD ROMs that are most identical to the original media are admitted to the database. I was thinking that a new scanning mode would be necessary in order to eventually reach the same standard of completeness (for example by even making sure that CHDs or Samples are in place)

But perhaps it is not practical to add a new arcade scanning method that addresses/extends RDB’s 1-game 1-matching data field structure. (I’m not familiar with this database structure, although I would read up on it if you think that’s useful).

In that case it does seem worth experimenting with a combined Split/Non-Merged DAT – both TorrentZipped I assume :slight_smile: – that provides the most comprehensive set of CRC matches possible for the existing algorithm.

I also have a script that can parse a ROM set collection and an XML DAT in order to produce a libretro-style ClrMamePro DAT. My script is in AutoHotKey.

Here’s the output it produces for a Non-Merged MAME 0.78 set that has been processed by TorrentZip: https://www.dropbox.com/s/ulv8iafwfnckxhz/MAME%200.78%20Non-Merged%20and%20TorrentZipped%20(No%20BIOS).dat?dl=0

If I understand your concept, getting from what I linked above to the goal would involve adding hash entries for the clones as they are in a Split-TorrentZip set. Is that right?

I am no expert and I realize that Retroarch wants to be keeping the checksums and what not but with arcade MAME and FBA it seems completely unreasonable. Wouldn’t it make a lot more sense to just keep the file names that MAME uses as check option?

I know from experience with MAME and FBA that there are clear-cut file names that are very specific for Merged and Split sets. I am not saying eliminate the checksums from the database, but how about just adding the file name identifier as an option for checking rather than the checksums? Just have it be a toggle in the settings.

Just an idea, and I am sure the devs have their reasons, but as far as usability goes it seems silly to expect users to track down the specific rom files when most of the “non-matching” files play fine, at least with Arcade ROMs.

Well, to address some points, it’s not so much that RetroArch has high standards; it’s just that it works with a calculated hash (crc32 in this case, to my chagrin, as it can easily collide), as all DAT files do in fact, to make sure that you have the right rom. If you’re interested in how an RDB works, the code used in RA is here: https://github.com/libretro/RetroArch/tree/master/libretro-db I’m in the process of trying to extend it a bit so I can hopefully introduce some kind of entity-relation support in it, allowing e.g. several values for a tag. Might take a while so don’t hold your breath lol.

I would also strongly disagree that it’s unreasonable to ask for specific rom files to be provided, certainly for Arcade. The reason for that is as MAME or FBA evolve, they require very different rom files for the same game, as dumps are sometimes found to be not ok, or just plain missing, and over time they evolve. That’s also the reason why you have 6 different MAME cores for libretro, they don’t support the same romset, don’t have the same requirements, etc… What you can’t do (and the emulator won’t actually let you) is use a “MAME rom” from say, version 0.136 and expect it to always run in the latest MAME, it usually won’t be the case, even though the filename is the same. Same with CHD files, they changed radically how CHD files were made somewhere around 0.150 iirc.

I will grant, though, that given that we have several MAME cores, there is quite possibly some overlap between the romsets, in which case I have no idea how the scanner will determine which playlist the game should go to. That is potentially an issue.

So long story short, if you want MAME sets to be scannable in the current RA, the only way is to transform the MAME XML dats from rom-centric to zip-centric, using the right version of roms for the set in question, and yes, that means TorrentZip sets: basically what RomVault would consider ok when scanning the set.
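To show what "zip-centric" could look like, here is a Python sketch that hashes a whole zip and prints a ClrMamePro-style entry for it. The exact field layout libretro's tooling expects is an assumption on my part, zip_centric_entry is a made-up name, and the bytes being hashed are a stand-in rather than a real romset.

```python
import zlib


def zip_centric_entry(description, zip_name, zip_bytes):
    """Format one ClrMamePro-style entry keyed on the CRC32 of the whole ZIP.

    The field layout here is approximate and may not match what the
    libretro database tools expect.
    """
    crc = f"{zlib.crc32(zip_bytes):08X}"
    return (
        "game (\n"
        f'\tname "{description}"\n'
        f'\tdescription "{description}"\n'
        f"\trom ( name {zip_name} size {len(zip_bytes)} crc {crc} )\n"
        ")\n"
    )


# Stand-in bytes; a real run would read the TorrentZipped file from disk.
fake_zip = b"stand-in for the bytes of a TorrentZipped 1941j.zip"
print(zip_centric_entry("1941 - Counter Attack (Japan)", "1941j.zip", fake_zip))
```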

Finally, I had a look at your script @markwkidd. I’m no AutoHotKey expert, but it seems to be doing the job, so you can definitely use that afaik, have at it :wink:

Or you can just build your own rdb file using dir2dat in clrmame (plus some cleanup so that libretrodb can work on it). I did this before @markwkidd’s tool existed. Of course, you still have to verify that your romsets work with the specific version of the core you want to use.

Except a dir2dat will just give you the filename, not the complete game name, which i think kinda defeats the purpose of having a database in the first place. But i agree that it is the most straightforward thing to do.

dir2dat supports that: you just need to load the required DAT files before running dir2dat.

It seemed that there was strong interest in seeing exactly how far we can take the current single-hash scanner in terms of arcade ROM support, so some of the specific discussion about what that entails moved over to a Pull Request in the libretro-database repository.

Where things stand now in the PR is that we have Non-Merged and Split DATs for each of the MAME cores up through MAME 2014. Some additional scripting is probably needed before these DATs can be converted to the RDB format though.

I’m hoping this thread can be a useful place for tracking the big picture of arcade scanning support while various investigations and ideas may play out in github.

Edit: made some corrections to the OP, although the original proposal is no longer active.