|
Post by turboxray on Feb 20, 2020 23:15:20 GMT
I'm working on a PCE environment that needs to divide a 16bit dividend by an 8bit constant value (24). I didn't think too much of that until I tried to make a reasonably fast routine. I mean, how difficult could it be? Apparently.. a lot more than I had realized haha.
I looked around and found some solutions, but they weren't exactly fast on the PCE (lost of 16/32bit shifts and adds). And I found a nice set of 8bit divide by range constant functions on atariage (from 1 to 32) for 8bit input, but extending to 16bit.. it wasn't setup to handle the rounding errors of a wider than 8bit range.
So I figured I'll see how bad the rounding errors were. I took the approach of x/24 = (x & 0xff00 )/24 + (x & 0x00ff)/24. But for integers, it has a +1 under rounding error every so many entries in the 16bit range. I wrote an output of the whole range to see if the rounding errors had a pattern. Sure enough, they did. (a couple of patterns but one appeared to be based on largest factor of 24, and other things that seemed to be related to it).
So I made two tables for (x & 0xff00)/24 (lsb and msb for a WORD fetch), and three tables for (x &0x00ff)/24 (two for rounding errors and one for no errors). And finally, I made a branching table (the input MSB dictates which error correction table to use for the input LSB). Each table is 256 byte entries for a total of 1.5k.
The function is pretty fast enough; 35 cycles for no error correction, and 39 cycles for max error correction. That includes the RTS. The 16bit quotient is returned in A:X. Not bad since I was looking for something in the 30-50 cycle range.
Anyone interested in this? I could make quite a few tables for non powers of two divisions. Could potentially make pairing MOD functions for them too.
|
|
|
Post by ccovell on Feb 21, 2020 0:55:10 GMT
Sounds interesting. I'd like to see it, and I'm sure you'd get comments over at 6502.org.
|
|
|
Post by turboxray on Mar 14, 2020 18:04:03 GMT
I finally added MOD (or remainder) support. It was too slow to do it manually (mul by 24 and subtract), so I wrote an extra three 256byte tables. So divby24 with remainder is 10 cycles slower. I also added just a MOD call, which is fast enough (like 17 cycles), and a divby24 with no remainder at the original speed.
I'll post something soon; the lib and tables, as well as a validate rom that compares values against the slower method.. for piece of mind.
I'm pretty happy with the performance though. I plan to make tables for constant values.
|
|
|
Post by elmer on Mar 14, 2020 19:57:48 GMT
I look forward to seeing the code, but I'm actually more curious about what-on-earth needs so many divide-by-24 calls in a single game loop that the cycle timing is that critical?
Once you get to having 1.5KB or more of tables for a specific divisor, I'd personally consider spending 2KB for the more generic fast-multiply lookup tables, and just do a fixed-point 8bit x 16bit multiply of 1/24 to get the result (or close enough).
I assume that you've already rejected that technique as a solution for your needs, so I'm looking forward to hearing what needs that divide to be so fast.
|
|
|
Post by turboxray on Mar 14, 2020 21:26:56 GMT
Here's the source and test: gitlab.com/RickSugar/divby24Caught a carry logic I thought I could eliminate, but ended up not working. So cycles went up by 2 on total counts for the div function.
|
|
|
Post by turboxray on Mar 14, 2020 21:40:56 GMT
I look forward to seeing the code, but I'm actually more curious about what-on-earth needs so many divide-by-24 calls in a single game loop that the cycle timing is that critical? Once you get to having 1.5KB or more of tables for a specific divisor, I'd personally consider spending 2KB for the more generic fast-multiply lookup tables, and just do a fixed-point 8bit x 16bit multiply of 1/24 to get the result (or close enough). I assume that you've already rejected that technique as a solution for your needs, so I'm looking forward to hearing what needs that divide to be so fast. Two reasons haha. The principle of the matter; what's the fastest approach to do a divide by a constant against a 16bit range with the smallest table assist. And two, yeah I have some bits here and there that needs it. I made a metatile system for a map that's 24x24 pixels. Collision info (against map) for player, enemies, etc. I need the quotient and remainder to be exact. I plan to build a few other tables for common constants (3, 5, 12, etc). So I mean yeah it might be more of an edge case where people need this, but I've seen requests pop up before looking for this sort of thing.
|
|
|
Post by elmer on Mar 15, 2020 2:47:08 GMT
I made a metatile system for a map that's 24x24 pixels. Collision info (against map) for player, enemies, etc. I need the quotient and remainder to be exact. Ah, so that's why you're doing it multiple times in a game loop ... that makes sense. Honestly, for map use like that, I'd personally try seeing if the multiply-by-reciprocal would work, since the fast-multiply tables are useful to have around for many circumstances, or else I'd choose to store both pixel and metatile coordinates for each entity, and so avoid the division altogether. OTOH ... I totally understand that figuring out the fastest method to do the division was the most fun soulution to implement as a programmer!
|
|
|
Post by turboxray on Mar 15, 2020 4:33:13 GMT
I made a metatile system for a map that's 24x24 pixels. Collision info (against map) for player, enemies, etc. I need the quotient and remainder to be exact. Ah, so that's why you're doing it multiple times in a game loop ... that makes sense. Honestly, for map use like that, I'd personally try seeing if the multiply-by-reciprocal would work, since the fast-multiply tables are useful to have around for many circumstances, or else I'd choose to store both pixel and metatile coordinates for each entity, and so avoid the division altogether. OTOH ... I totally understand that figuring out the fastest method to do the division was the most fun soulution to implement as a programmer! Yeah, I looked at that. 1/24 is 1/3 * 1/8. The 1/8 is nice because it's a three right shifts.. but that's on a 16bit number so not that nice. 1/3 part is going to be * 0x56 for 8bit with >> 8.. and * 0x5556 for 16bit with >> 16, with both of those shifts being very convenient haha (no shifting at all). I was trying to see if I could get away with an 8*16->24 setup for the 1/3 part but it didn't seem to work out. I didn't save anything going that route though (just the opposite), so I went back to x/24 as x * 0x1999 >> 16. No shifts and it's not bad. Using the fast-multiply method with f(x) = (x^2)/4 iterative to do 16x16->32 I think is something like 200+ cycles. I wanted to see if I could so something faster. There were also some funky solutions with combinations of << and >> and adds with magic constants, but they weren't friendly for shifting/adding on the 65x. I'm not doing a whole lot.. something like 30-50 division per frame maybe. But I spent some extra cycles making the design of the game layout dependent on a more flexible external tool process, so I wanted to recapture some of that back. Plus... I have plans for that resource hahah.
|
|
|
Post by elmer on Mar 15, 2020 4:51:17 GMT
I didn't save anything going that route though (just the opposite), so I went back to x/24 as x * 0x1999 >> 16. No shifts and it's not bad. Using the fast-multiply method with f(x) = (x^2)/4 iterative to do 16x16->32 I think is something like 200+ cycles. I wanted to see if I could so something faster. Iterative multiply? Have you not seen the 2KB table-of-squares method? It reduces to a couple of lookup+subtracts for an 8x8bit -> 16bit result. Chaining those together gives you whatever bit-precision you require, and it's perfect for multiplication-by-reciprocal math ... codebase64.org/doku.php?id=base:seriously_fast_multiplicationI'm using that in Huzak to calculate scaled sine-waves for things like vibrato. That's not because there aren't other ways of coding that feature, it's just that Deflemask seems to be using real math and multiplies rather than the old-skool 8-bit/16-bit method of pre-calcualted lookup-tables. <edit> OK, so we probably are talking about the same multiplication method, but then I don't understand why you're talking about a 16x16->32 multiply when you wouldn't need that precision (or cycles) for your coordinate/24 within the range of values that you're going to meet.
|
|
|
Post by turboxray on Mar 15, 2020 7:47:42 GMT
Yeah that's the method I'm talking about a*b = f(a+b) - f(a-b), where f(x) = (x^2)/4. Also, I don't know why I have 0x1999 there. It should be 0xAAB. Must have been something intermediate I was working on (from my notes).
Anyway, x/24 => x * 2731/65536. And the range of x is 0 to 65535. So I don't see how you're going to get 2731 * 65535 into smaller than ~28bit result, or rather 4 multiplies. It's X:Y * W:Z, where X:Y = $A:$AB. Unless I missing something hahah.
|
|