|
Post by filler on Aug 27, 2020 19:33:08 GMT
Howdy. In dumping scripts for Game Gear and PC Engine games recently I've bumped up against a few titles whose text encodings I can't crack via my usual methods. They all have one thing in common... They are published by Hudson. Specifically Sadakichi Nanaban Series Hideyoshi no Ougon (PCE), Bikkuriman World (FC), and Super Momotarou Dentetsu III (GG).
I've heard that Hudson used Huffman coding to compress scripts, so I've been looking into it. My initial thought was to use OCR to grab text from a handful of screenshots from a game, run a Huffman algorithm on it, and hope I got lucky enough that some of it matched up, letting me find text and confirm the actual bit values for the full character set(s). However, a recent video I watched made the point that this compression is specific to the input text and is useless without the original tree, which got me thinking: the Huffman tree must be included in each of these games for them to decode their own text.
Any ideas how you might find the Huffman tree in these games if one exists? Have any of you cracked Huffman coding in a game before?
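To make the "tree ships with the data" point concrete, here is a minimal Huffman build/decode sketch in Python. The symbols and frequencies are made up for illustration; a game would store the equivalent of this tree as a node table in ROM and walk it bit by bit, which is what you'd hope to spot when tracing the decoder.

```python
import heapq
from itertools import count

def build_code_table(freqs):
    """Build a Huffman tree and code table {symbol: bitstring}
    from a {symbol: frequency} dict."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, next(tiebreak), (a, b)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a symbol
            codes[node] = prefix or "0"
    walk(root, "")
    return codes, root

def decode(bits, root):
    """Walk the tree bit by bit -- essentially what a game's decoder
    does against its stored node table."""
    out, node = [], root
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):   # hit a leaf: emit and restart
            out.append(node)
            node = root
    return "".join(out)

# Made-up frequencies, just to show the round trip
freqs = {"a": 5, "b": 2, "c": 1, " ": 4}
codes, root = build_code_table(freqs)
encoded = "".join(codes[c] for c in "ab a")
assert decode(encoded, root) == "ab a"
```

Without `root` (or an equivalent stored table) the bitstream is meaningless, which is why the tree has to live somewhere in the ROM.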
|
|
|
Post by dshadoff on Aug 27, 2020 20:14:28 GMT
On the PC Engine, it's a mixed bag - especially when you're talking about HuCard games. They don't even need to follow any particular sequence for the alphabet, which makes it more difficult unless you follow the path through the code. ...And I wouldn't assume that any publisher used only one encoding (even in the same game!).
Having said all that, I wouldn't want to dissuade you from digging in, but I don't have any ready-access information to provide. Maybe turboxray or elmer?
|
|
|
Post by DarkKobold on Aug 27, 2020 22:39:02 GMT
I'm interested to see their Huffman coding - I'm assuming my next game will be more dialogue-heavy, and thus need it. I've found a few ASM samples of Huffman on the 6502, but I still imagine it's going to be quite the task to adapt one to my own needs.
|
|
|
Post by dshadoff on Aug 27, 2020 23:19:10 GMT
Coming back to this, why do you believe that it is Huffman encoding in particular? The several games I have seen used an encoding method - I don't know if there is a specific name for it - where the encoder looks back into recent history to see if there is repetition it can reference. If, say, the word "the" appears somewhere in the past 256 bytes, it can save some space by emitting a 1-byte buffer offset plus some indication of length (in this case 4 or 5, because of the spaces).
In this case, you wouldn't look for a code table as such, but for word fragments at the beginning of a block of text. But on a HuCard, you don't know whether they are using sequential JIS-style hiragana or some other method of encoding the letters (it can basically be assumed to be SJIS on a CD-ROM though).
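The lookback scheme described above is essentially LZ77 with a sliding window addressed by a one-byte offset. A hypothetical decoder sketch in Python - the token layout here is invented purely for illustration, real games each pick their own:

```python
def decompress(data):
    """Decode a toy token stream: a byte 0x00-0x7F is a literal copied
    as-is; a byte with the high bit set is followed by a length byte and
    means "copy `length` bytes starting `offset` bytes back in the
    output", where offset = (token & 0x7F) + 1.  (This layout is made
    up for illustration -- actual games vary.)"""
    out = bytearray()
    i = 0
    while i < len(data):
        t = data[i]
        if t < 0x80:                        # literal byte
            out.append(t)
            i += 1
        else:                               # back-reference token
            offset = (t & 0x7F) + 1
            length = data[i + 1]
            for _ in range(length):
                out.append(out[-offset])    # byte-at-a-time, so copies may overlap
            i += 2
    return bytes(out)

# "the cat and the dog": the second "the " is copied from 12 bytes back
src = b"the cat and " + bytes([0x80 | 11, 4]) + b"dog"
assert decompress(src) == b"the cat and the dog"
```

When hunting for this in a ROM, the tell is exactly what's described above: readable word fragments near the start of a text block, degrading into binary-looking bytes as back-references take over.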
|
|
|
Post by filler on Aug 27, 2020 23:57:41 GMT
It's just a guess. I remember reading/hearing before that Hudson used Huffman compression, but I don't know where, or whether any of these games use it for sure. It would make plenty of sense though, since it's a lossless compression method that was around at the time and suited to text. This book that Cabbage helped me find makes specific mention of Sadakichi Seven and compression, but it doesn't specify the exact compression method. It seems like you can't browse the whole book, so I haven't found exactly what the "compression tricks exposed earlier in the chapter" are, though I'm tempted to pick up this book since it seems like it would be an interesting read.
www.google.com/books/edition/The_Media_Snatcher/o-mvDwAAQBAJ?hl=en&gbpv=1&bsq=hudson%20compression

EDIT: Oh wow. I just checked and that book is available on Kindle for less than $20. I'll probably grab it now.
|
|
|
Post by dshadoff on Aug 28, 2020 0:18:37 GMT
Let us know whether there are actual details in the book.
|
|
|
Post by turboxray on Aug 28, 2020 0:44:56 GMT
|
|
|
Post by elmer on Aug 28, 2020 1:01:29 GMT
> Any ideas how you might find the Huffman tree in these games if one exists? Have any of you cracked Huffman coding in a game before?

One (or maybe two) of the old games that I wrote in the late 1980s used the standard Huffman algorithm from Mark Nelson's "Data Compression" book ... and I found that pure Huffman isn't really all that good (in my experience).

Hudson used a *variety* of compression schemes at different times, and Turboxray has probably done a lot more research than I have on Hudson's different games. There are *lots* of different text compression schemes, and *lots* of variants of each scheme. See here for a starting point.

From my POV, if you can't find the text by using the standard translator's scheme of hunt-and-peck through the ROM, then you're going to need to get a programmer involved in your task. I expect that the only *successful* way to identify the compression that has been used is to actually have a programmer trace through the code in Mednafen and find the text-printing code and the decompression code.

I'm sorry that that doesn't really help, but that's how I would approach the problem.
|
|
|
Post by filler on Aug 29, 2020 18:31:31 GMT
> From my POV, if you can't find the text by using the standard translator's scheme of hunt-and-peck through the ROM, then you're going to need to get a programmer involved in your task. I expect that the only *successful* way to identify the compression that has been used is to actually have a programmer trace through the code in Mednafen and find the text-printing code and the decompression code. I'm sorry that that doesn't really help, but that's how I would approach the problem.

Nothing to be sorry about. You're probably right. I read more of that chapter and I don't think the book includes technical details on the compression for Sadakichi Seven. It references another book called "Retrogame Archeology", but that's an $80 book, and I don't feel like spending close to $100 total to not find a concrete solution, though I'm sure it's also an interesting read. I'll probably mess around with it some more myself for now.

I did check out Esperknight's tutorial on hacking Team Innocent just to get a little insight into messing with the Mednafen debugger, but that game happens to use the BIOS font I guess, and the text is uncompressed S-JIS I believe, so that makes it a different and easier problem to solve.
|
|
|
Post by turboxray on Aug 30, 2020 21:39:15 GMT
I knew of only one PCE HuCard that used Huffman encoding for text. Everything else I've seen was either uncompressed or used tokenized compression (a pointer to another part of an existing block of text). Some CD games will compress the kana as a single byte with a small LUT and leave the rest as uncompressed two-byte SJIS kanji, and some later ones use LZ compression for blocks of text (in combination with the LUT too).
As for compressing English text, that video pretty much hits the nail on the head (though there's a better video covering the frequency of character combinations). From what I've seen, 'dictionary' compression works really well, and it's fast to decode in realtime, unlike Huffman. If the ASCII range is $00 to $7F, you have the whole upper 128 values to use as lookup indices into a set of dictionary entries. Two-character pairs ended up being the most common, so I would usually devote more than half of the 128 entries to them, a much smaller amount to 3-character entries, and a handful to 4- and 5-character entries. It's pretty straightforward to write a compressor that creates the dictionary/table for a script.
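A toy version of that table scheme, sketched in Python. The dictionary entries, the greedy matcher, and the $80+ token layout here are hypothetical illustrations of the idea, not taken from any actual Hudson game; a real tool would build the table from fragment frequency counts over the whole script:

```python
# Bytes $00-$7F are plain ASCII; bytes $80-$FF index a table of common
# multi-character fragments (up to 128 entries).  This tiny table is
# hand-picked purely for demonstration.
DICT = ["th", "he", "in", "er", " the ", "and ", "ing"]

def compress(text):
    out = bytearray()
    i = 0
    while i < len(text):
        # greedily take the longest dictionary fragment that matches here
        best = None
        for idx, frag in enumerate(DICT):
            if text.startswith(frag, i) and (best is None or len(frag) > len(DICT[best])):
                best = idx
        if best is not None:
            out.append(0x80 | best)     # dictionary token
            i += len(DICT[best])
        else:
            out.append(ord(text[i]))    # plain ASCII literal
            i += 1
    return bytes(out)

def decompress(data):
    # decoding is just a table lookup per byte -- fast enough for realtime
    return "".join(DICT[b & 0x7F] if b >= 0x80 else chr(b) for b in data)

s = "the thing and the rest"
assert decompress(compress(s)) == s
```

The decoder's speed is the point: one byte in, one table lookup out, with no bit-level work at all, which is why it beats Huffman for realtime text printing.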
|
|
|
Post by DarkKobold on Sept 2, 2020 3:48:16 GMT
This makes me wonder if a more basic version of Huffman could be used: 4 bits to encode the 15 most common characters, and 5 bits to encode everything else. Use a shift character for caps.
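One way to read that scheme is 15 four-bit codes plus the 16th nibble value used as an escape into a second, 5-bit table (so escaped characters cost 9 bits in total - a pure prefix code with 15 four-bit codes only leaves room for two 5-bit codes). A hypothetical sketch in Python with made-up character rankings, and the caps-shift character left out for brevity:

```python
COMMON = list(" etaoinshrdlucm")      # the 15 most frequent chars: nibbles 0-14
RARE   = list("fwypvbgkjqxz.,'!?-")   # everything else: 5-bit index after escape

def encode(text):
    bits = ""
    for ch in text:
        if ch in COMMON:
            bits += format(COMMON.index(ch), "04b")            # 4-bit code
        else:
            bits += "1111" + format(RARE.index(ch), "05b")     # escape nibble + 5 bits
    return bits

def decode(bits):
    out, i = [], 0
    while i + 4 <= len(bits):
        nib = int(bits[i:i + 4], 2)
        i += 4
        if nib == 15:                      # escape: next 5 bits index RARE
            out.append(RARE[int(bits[i:i + 5], 2)])
            i += 5
        else:
            out.append(COMMON[nib])
    return "".join(out)

s = "the quick fox"
assert decode(encode(s)) == s
```

As long as most of the script lands in the 4-bit table, this averages well under 8 bits per character while keeping the decoder far simpler than a full tree walk.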
|
|