|
Post by filler on Aug 27, 2020 19:33:08 GMT
Howdy. In dumping scripts for Game Gear and PC Engine games recently I've bumped up against a few titles whose text encodings I can't crack via my usual methods. They all have one thing in common... They are published by Hudson. Specifically Sadakichi Nanaban Series Hideyoshi no Ougon (PCE), Bikkuriman World (FC), and Super Momotarou Dentetsu III (GG).
I've heard that Hudson used Huffman coding to compress scripts, so I've been looking into it. My initial thought was to use OCR to grab text from a handful of screenshots from a game, run a Huffman algorithm on it, and hope I got lucky enough that some of it matched up, letting me find text and confirm the actual bit values for the full character set(s). However, a recent video I watched made the point that this compression is specific to the input text and is useless without the original tree, which got me thinking: the Huffman tree must be included in each of these games for them to decode their own text.
Any ideas how you might find the Huffman tree in these games if one exists? Have any of you cracked Huffman coding in a game before?
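To make the "tree ships with the data" point concrete, here is a minimal Huffman build/decode sketch in Python. The symbols and frequencies are made up for illustration; a game would store the equivalent of this tree as a node table in ROM and walk it bit by bit, which is what you'd hope to spot when tracing the decoder.

```python
import heapq
from itertools import count

def build_code_table(freqs):
    """Build a Huffman tree and code table {symbol: bitstring}
    from a {symbol: frequency} dict."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, next(tiebreak), (a, b)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a symbol
            codes[node] = prefix or "0"
    walk(root, "")
    return codes, root

def decode(bits, root):
    """Walk the tree bit by bit -- essentially what a game's decoder
    does against its stored node table."""
    out, node = [], root
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):   # hit a leaf: emit and restart
            out.append(node)
            node = root
    return "".join(out)

# Made-up frequencies, just to show the round trip
freqs = {"a": 5, "b": 2, "c": 1, " ": 4}
codes, root = build_code_table(freqs)
encoded = "".join(codes[c] for c in "ab a")
assert decode(encoded, root) == "ab a"
```

Without `root` (or an equivalent stored table) the bitstream is meaningless, which is why the tree has to live somewhere in the ROM.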
|
|
|
Post by dshadoff on Aug 27, 2020 20:14:28 GMT
On the PC Engine, it's a mixed bag - especially when you're talking about HuCard games. They don't even need to follow any particular sequence for the alphabet, which makes it more difficult unless you follow the path through the code. ...And I wouldn't assume that any publisher used only one encoding (even in the same game!).
Having said all that, I wouldn't want to dissuade you from digging in, but I don't have any ready-access information to provide. Maybe turboxray or elmer?
|
|
|
Post by DarkKobold on Aug 27, 2020 22:39:02 GMT
I'm interested to see their Huffman coding - I'm assuming my next game will be more dialogue-heavy, and thus need it. I've found a few ASM samples of Huffman on the 6502, but I still imagine it's going to be quite the task to adapt one to my own needs.
|
|
|
Post by dshadoff on Aug 27, 2020 23:19:10 GMT
Coming back to this, why do you believe that it is Huffman encoding in particular? The several games I have seen used an encoding method - I don't know if there is a specific name for it - where the encoder looks back into recent history to see if there is repetition it can reference. If, say, the word "the" appears somewhere in the past 256 bytes, it can save some space by emitting a 1-byte buffer offset plus some indication of length (in this case 4 or 5, because of the spaces).
In this case, you wouldn't look for a code table as such, but for word fragments at the beginning of a block of text. But on a HuCard, you don't know whether they are using sequential JIS-style hiragana or some other method of encoding the letters (it can basically be assumed to be SJIS on a CD-ROM though).
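The lookback scheme described above is essentially LZ77 with a sliding window addressed by a one-byte offset. A hypothetical decoder sketch in Python - the token layout here is invented purely for illustration, real games each pick their own:

```python
def decompress(data):
    """Decode a toy token stream: a byte 0x00-0x7F is a literal copied
    as-is; a byte with the high bit set is followed by a length byte and
    means "copy `length` bytes starting `offset` bytes back in the
    output", where offset = (token & 0x7F) + 1.  (This layout is made
    up for illustration -- actual games vary.)"""
    out = bytearray()
    i = 0
    while i < len(data):
        t = data[i]
        if t < 0x80:                        # literal byte
            out.append(t)
            i += 1
        else:                               # back-reference token
            offset = (t & 0x7F) + 1
            length = data[i + 1]
            for _ in range(length):
                out.append(out[-offset])    # byte-at-a-time, so copies may overlap
            i += 2
    return bytes(out)

# "the cat and the dog": the second "the " is copied from 12 bytes back
src = b"the cat and " + bytes([0x80 | 11, 4]) + b"dog"
assert decompress(src) == b"the cat and the dog"
```

When hunting for this in a ROM, the tell is exactly what's described above: readable word fragments near the start of a text block, degrading into binary-looking bytes as back-references take over.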
|
|
|
Post by filler on Aug 27, 2020 23:57:41 GMT
It's just a guess. I remember reading/hearing before that Hudson used Huffman compression, but I don't know where, or whether any of these games use it for sure. It would make plenty of sense though, since it's a lossless compression method that was around at the time and suited to text. This book that Cabbage helped me find makes specific mention of Sadakichi Seven and compression, but it doesn't specify the exact compression method. It seems like you can't browse the whole book, so I haven't found exactly what the "compression tricks exposed earlier in the chapter" are, though I'm tempted to pick up this book since it seems like it would be an interesting read.
www.google.com/books/edition/The_Media_Snatcher/o-mvDwAAQBAJ?hl=en&gbpv=1&bsq=hudson%20compression

EDIT: Oh wow. I just checked and that book is available on Kindle for less than $20. I'll probably grab it now.
|
|
|
Post by dshadoff on Aug 28, 2020 0:18:37 GMT
Let us know whether there are actual details in the book.
|
|
|
Post by turboxray on Aug 28, 2020 0:44:56 GMT
|
|
|
Post by elmer on Aug 28, 2020 1:01:29 GMT
> Any ideas how you might find the Huffman tree in these games if one exists? Have any of you cracked Huffman coding in a game before?

One (or maybe two) of the old games that I wrote in the late 1980s used the standard Huffman algorithm from Mark Nelson's "Data Compression" book ... and I found that pure Huffman isn't really all that good (in my experience).

Hudson used a *variety* of compression schemes at different times, and Turboxray has probably done a lot more research than I have on Hudson's different games. There are *lots* of different text compression schemes, and *lots* of variants of each scheme. See here for a starting point.

From my POV, if you can't find the text by using the standard translator's scheme of hunt-and-peck through the ROM, then you're going to need to get a programmer involved in your task. I expect that the only *successful* way to identify the compression that has been used is to actually have a programmer trace through the code in Mednafen and find the text-printing code and the decompression code.

I'm sorry that that doesn't really help, but that's how I would approach the problem.
|
|
|
Post by filler on Aug 29, 2020 18:31:31 GMT
> From my POV, if you can't find the text by using the standard translator's scheme of hunt-and-peck through the ROM, then you're going to need to get a programmer involved in your task. I expect that the only *successful* way to identify the compression that has been used is to actually have a programmer trace through the code in Mednafen and find the text-printing code and the decompression code. I'm sorry that that doesn't really help, but that's how I would approach the problem.

Nothing to be sorry about. You're probably right. I read more of that chapter and I don't think the book includes technical details on the compression for Sadakichi Seven. It references another book called "Retrogame Archeology", but that's an $80 book, and I don't feel like spending close to $100 total to not find a concrete solution, though I'm sure it's also an interesting read. I'll probably mess around with it some more myself for now.

I did check out Esperknight's tutorial on hacking Team Innocent just to get a little insight into messing with the Mednafen debugger, but that game happens to use the BIOS font I guess, and the text is uncompressed S-JIS I believe, so that makes it a different and easier problem to solve.
|
|
|
Post by turboxray on Aug 30, 2020 21:39:15 GMT
I knew of only one PCE HuCard that used Huffman encoding for text. Everything else I've seen was either uncompressed or used tokenized compression (a pointer to another part of an existing block of text). Some CD games will compress the kana as a single byte with a small LUT and leave the rest as uncompressed two-byte SJIS kanji, and some later ones use LZ compression for blocks of text (in combination with the LUT too).
As for compressing English text, that video pretty much hits the nail on the head (though there's a better video covering the frequency of character combinations). From what I've seen, 'dictionary' compression works really well, and it's fast to decode in realtime, unlike Huffman. If the ASCII range is $00 to $7F, you have the whole upper 128 values to use as lookup indices into a set of dictionary entries. Two-character pairs ended up being the most common, so I would usually devote more than half of the 128 entries to them, a much smaller amount to 3-character entries, and a handful to 4- and 5-character entries. It's pretty straightforward to write a compressor that creates the dictionary/table for a script.
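A toy version of that table scheme, sketched in Python. The dictionary entries, the greedy matcher, and the $80+ token layout here are hypothetical illustrations of the idea, not taken from any actual Hudson game; a real tool would build the table from fragment frequency counts over the whole script:

```python
# Bytes $00-$7F are plain ASCII; bytes $80-$FF index a table of common
# multi-character fragments (up to 128 entries).  This tiny table is
# hand-picked purely for demonstration.
DICT = ["th", "he", "in", "er", " the ", "and ", "ing"]

def compress(text):
    out = bytearray()
    i = 0
    while i < len(text):
        # greedily take the longest dictionary fragment that matches here
        best = None
        for idx, frag in enumerate(DICT):
            if text.startswith(frag, i) and (best is None or len(frag) > len(DICT[best])):
                best = idx
        if best is not None:
            out.append(0x80 | best)     # dictionary token
            i += len(DICT[best])
        else:
            out.append(ord(text[i]))    # plain ASCII literal
            i += 1
    return bytes(out)

def decompress(data):
    # decoding is just a table lookup per byte -- fast enough for realtime
    return "".join(DICT[b & 0x7F] if b >= 0x80 else chr(b) for b in data)

s = "the thing and the rest"
assert decompress(compress(s)) == s
```

The decoder's speed is the point: one byte in, one table lookup out, with no bit-level work at all, which is why it beats Huffman for realtime text printing.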
|
|
|
Post by DarkKobold on Sept 2, 2020 3:48:16 GMT
This makes me wonder if a more basic version of Huffman could be used: 4 bits to encode the 15 most common characters, and 5 bits to encode everything else. Use a shift character for caps.
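One way to read that scheme is 15 four-bit codes plus the 16th nibble value used as an escape into a second, 5-bit table (so escaped characters cost 9 bits in total - a pure prefix code with 15 four-bit codes only leaves room for two 5-bit codes). A hypothetical sketch in Python with made-up character rankings, and the caps-shift character left out for brevity:

```python
COMMON = list(" etaoinshrdlucm")      # the 15 most frequent chars: nibbles 0-14
RARE   = list("fwypvbgkjqxz.,'!?-")   # everything else: 5-bit index after escape

def encode(text):
    bits = ""
    for ch in text:
        if ch in COMMON:
            bits += format(COMMON.index(ch), "04b")            # 4-bit code
        else:
            bits += "1111" + format(RARE.index(ch), "05b")     # escape nibble + 5 bits
    return bits

def decode(bits):
    out, i = [], 0
    while i + 4 <= len(bits):
        nib = int(bits[i:i + 4], 2)
        i += 4
        if nib == 15:                      # escape: next 5 bits index RARE
            out.append(RARE[int(bits[i:i + 5], 2)])
            i += 5
        else:
            out.append(COMMON[nib])
    return "".join(out)

s = "the quick fox"
assert decode(encode(s)) == s
```

As long as most of the script lands in the 4-bit table, this averages well under 8 bits per character while keeping the decoder far simpler than a full tree walk.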
|
|