Post by turboxray on Nov 24, 2019 21:21:02 GMT
This kinda came up in another thread, and I've always found this stuff really interesting.
Across the internet you'll surely find retro groups talking up their favorite cpu, and how impressive it was blah blah blah. What's missing from system/computer processor comparison threads, is context.
Years ago I had some debates/discussions/whatever with Steve Snake and a few others of 68k vs 65x vs '816 (snes cpu). The topics mostly focused on optimized code, and length of code.
Since I was looking through some folders, I found some of the comparison examples. One that Steve brought up was velocity. He proposed that velocity on the 68k was faster because it could handle 32bit operations. Now, for the sake of argument I'm going to declare the original 68k a 32bit processor, because in reality (to the developer) it clearly is. Its ALU might be 16bit, but all operations (except MUL/DIV) have their 32bit equivalent. I make this statement because the original z80 only had a 4bit ALU, and like the 68k, its hardware macro instructions iterate over the ALU (thanks to the microcoded ISA). And we surely don't refer to the z80 as 4bit, because that would be absurd. So the 68k is a 32bit CPU with a 16bit data bus. It just makes things easier.
Okay, that said here's Steve's velocity code:
; X axis
movea.l abs.w,a0 ;16
movea.l abs.w,a1 ;16
addx.l -(a0),-(a1) ;30
; Y axis
movea.l abs.w,a0 ;16
movea.l abs.w,a1 ;16
addx.l -(a0),-(a1) ;30
The cycle times are shown to the right. A 68k absolute word address is sign-extended, and because of how Sega laid out its 64k of RAM, 32k of that RAM can be accessed using a word address (the sign extension wraps the address back to the top of the map). I.e. it shaves 4 cycles because it doesn't have to fetch a full long address. So the initial loads are 32 cycles, and the next instruction is 30 cycles. It's a read-modify-write instruction as well as auto-decrementing. This is a very nice instruction! There is a catch though.. ADDX adds with an extend bit, so as long as we make sure our addition doesn't overflow, we can use this as is. Once you load the regs, you can simply do ADDX a bunch of times.
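To make the velocity step concrete, here's a minimal Python model of what the ADDX chain computes (values and names are illustrative, not from Steve's code): a 16.16 fixed-point position updated with a single 32-bit wrapping add.

```python
# Model of the 68k step: 16.16 fixed-point position += velocity as
# one 32-bit add (what ADDX.L -(a0),-(a1) does in a single
# read-modify-write, assuming the X flag is clear).
MASK32 = 0xFFFFFFFF

def add_velocity(pos, vel):
    return (pos + vel) & MASK32   # 32-bit wraparound, like the ALU

def to_pixels(fixed):
    return fixed >> 16            # whole-pixel part of a 16.16 value

pos = 100 << 16                   # x = 100.0 px
vel = (1 << 16) + 0x8000          # 1.5 px/frame in 16.16
for _ in range(4):
    pos = add_velocity(pos, vel)
print(to_pixels(pos))  # → 106
```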
I'll be honest here. The PCE isn't going to beat this. So how close can we get with 8bit operations?
Let's start with a semi optimized version (split tables).
AddVelocity:
lda x_float,x ;5
adc x_float_inc,x ;5
sta x_float,x ;5
lda x_whole.l,x ;5
adc x_whole_inc,x ;5
sta x_whole.l,x ;5
bvs .do_whole_x_hi ;2 = 32
lda y_float,x ;5
adc y_float_inc,x ;5
sta y_float,x ;5
lda y_whole.l,x ;5
adc y_whole_inc,x ;5
sta y_whole.l,x ;5
bvs .do_whole_y_hi ;2 = 32
Comparing apples to apples, honestly, is not a fair comparison. The 68k version is using 32bit math, but the values can't be larger than a 31bit range. It's also using 32bit wide integers because that's what's fast on the 68k. Doing something like 16.8 fixed point would be a LOT slower. So it's wasting memory for the sake of speed. We can take that into account, which I did above. Two things: 1) we don't need more than 16bit whole coords for an object, AND an 8bit fractional part is more than enough precision (that's 1/256 sub pixel resolution!) for a regular game. 2) I would also argue that I don't need more than 8.8 fixed point accumulation for the delta. And yes, I purposely left off the CLC, because I don't think that in an average game that 1/256 sub pixel 'noise' is going to amount to anything. But feel free to add it back in if that makes you feel uncomfortable haha.
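Here's the split-table add sketched in plain Python, under the same 16.8 + 8.8 assumption (function and variable names are hypothetical, and the overflow-into-high-byte path that the BVS handles is omitted):

```python
# One velocity step on a 16.8 position with an 8.8 delta, done the
# split-table way: add the fractional bytes, then ripple the carry
# into the 16-bit whole part (the chained ADCs in the 6280 code).
def add_16_8(whole, frac, v_whole, v_frac, carry_in=0):
    f = frac + v_frac + carry_in      # carry_in=0 models doing a CLC first
    frac, carry = f & 0xFF, f >> 8
    whole = (whole + v_whole + carry) & 0xFFFF
    return whole, frac

# 1.5 px/frame is v_whole=1, v_frac=0x80 (0x80/256 = 0.5)
w, f = 100, 0
for _ in range(4):
    w, f = add_16_8(w, f, 1, 0x80)
print(w, f)  # → 106 0
```

Skipping the CLC just means carry_in is whatever the previous add left behind, at most a 1/256 px error per frame.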
So while the above two aren't apples to apples, I've added context which makes them relatively comparable (IMO). Recap: the 68k is doing 16.16 + 16.16 fixed point math and the PCE is doing 16.8 + 8.8, but the working assumption is that 16.8 + 8.8 is all either of them actually needs.
But the 68k still pulls ahead because you can do ADDX and then another ADDX, etc. 30 68k cycles vs 32 6280 cycles. That's damn decent, given the context of 32bit cpu vs an 8bit cpu. But I'm not done yet...
For you see, the advantage ADDX -(Ax),-(Ay) has is at the same time a disadvantage: flexibility. For it to be advantageous you need to unroll the ADDX instruction. Fair enough, right? Because if you didn't, that's no longer 30 cycles but 62 cycles! Let's say you have 256 total entries. You could unroll it to all 256 entries and that would be amazingly fast. But wait.. what if you didn't have 256 objects in those 256 entries? What if you only had say.. 38 entries? You're stuck processing all 256 entries regardless. Okay, let's say that we only need up to 40 objects on screen (that live in the window), and other off screen objects are frozen (this is very common in platform games). So you unroll that loop to 40? Well, now you have unaccounted overhead showing up elsewhere: mapping active and inactive objects in and out. That's 4 words of data that need to be swapped in or out per object. And now you have to have a mapping system (array) to keep track of open slots. And on top of that, you're still forced to calculate 40 active objects regardless of how many you have.
See, this is where the 6280 picks up the pace. The above operations don't require larger integers and such. The PCE could simply have a 256 byte array that says yes, this slot is active, or no, it's empty. Indexing, including random indexing, on the 6280 is much faster if the offset range is small. This is the magic elixir of the 65x series (it beats the 68k and z80 in this department).
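The slot-table idea can be sketched like this (an illustration, not actual engine code): one byte per entry marks it live or empty, and the update pass visits only live slots instead of a fixed unrolled count.

```python
# 256-entry activity table: 0 = empty slot, nonzero = live object.
active = bytearray(256)
active[3] = active[40] = active[200] = 1

def live_slots(table):
    # Walk the table and collect live entries, like scanning it
    # with an indexed load and a branch-if-zero on the 6280.
    return [i for i, flag in enumerate(table) if flag]

print(live_slots(active))  # → [3, 40, 200]
```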
I can add a simple optimization:
.next
dex ;2
bne .out ;2
ldx active_table,y ;5
beq .next ;2
For 11 cycles more per object velocity calculation (64+11), I now have complete flexibility and the full range of all 256 entries. I mean, there are even more ways I could optimize this and save cycles. But I guarantee you (!).. the overhead of doing the same to make the ADDX method more flexible would be higher. If I did something like a BSR/RTS on the 68k, that's 34 cycles vs 14 cycles on the 6280. Doing the above equivalent would be about 30~40+ cycles more per object velocity calculation (60+30~). It wouldn't be 60 cycles per object on the 68k; it'd be more like 90+ cycles per object vs 75 cycles on the 6280.
For a frame of reference: the Genesis 68k has 128,000 cycles per frame and the PCE has 119,000 cycles per frame. That's your budget. You blow that budget and you get slowdown or whatever. A sprite heavy game with individual sprites, not meta-sprites or follow sprites (i.e. history buffer sprites), is about 40 on screen (probably a shmup)? That's quite a lot. So 40 * 75 = 3000 cycles for the PCE. For the sake of example, let's give the Genesis back the 60 cycles per object optimization: 60 * 40 = 2400. That's 2.5% of the cpu resource per frame to do those objects on the PCE, and 1.9% per frame on the Genesis. That's a difference of 0.6%!! That is my point. What looks great in isolation tends to have trivial differences in context, or even the opposite effect.
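The budget arithmetic above, spelled out (per-object cycle costs taken from the estimates in this post):

```python
# Per-frame CPU budgets and the estimated per-object velocity costs.
pce_frame, gen_frame = 119_000, 128_000
objects = 40

pce_cost = objects * 75   # 3000 cycles on the PCE
gen_cost = objects * 60   # 2400 cycles on the Genesis

pce_pct = 100 * pce_cost / pce_frame   # ~2.5% of a frame
gen_pct = 100 * gen_cost / gen_frame   # ~1.9% of a frame
print(f"PCE {pce_pct:.1f}%  Genesis {gen_pct:.1f}%  diff {pce_pct - gen_pct:.1f}%")
```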
The point is that the 68k can do a lot with the same instructions. And when you ACTUALLY need that extra range, it's powerful. But it gives no advantage when the range is small. And on these systems, the range and load of what you need to do game logic wise is pretty small. And because the 6280 is really fast at doing small stuff, it can actually be faster than the 68k, especially when you add all this up per 'frame'.
Here's another one..
68k version:
; ((x/8) * map_height)+(y/8)
; d0= object #
; a1= X/Y coords array (fixed reg for multiple calls)
; a3= shift+mul LUT
; a4= relative starting point of map in circular buffer
; d2= temp
; a0= temp
moveq #0,a0 ;4 clear upper range for unsigned indexing
move.l 0(a1,d0.l),d1 ;18 long word: low word is X, high word is Y
movea.w d1,a0 ;4 a0=x_array[d0]<<1;
swap d1 ;4
lsr.w #$03,d1 ;12 d1=y_array[d0]/8;
add.w 0(a3,a0.l),d1 ;14 d1+=shift_mul_array[a0];
add.l a4,d1 ;8 d1+=relative_start_point
andi.w #$f7ff,d1 ;8 clip to wrap with buffer
movea.l d1,a3 ;4 a3 is now x/y position of map
addi.w #col_len,d1 ;8 add a columns worth of bytes to get into the next column
andi.w #$f7ff,d1 ;8 clip to wrap with buffer
movea.l d1,a4 ;4 a4 is now x/y+1 position of map
move.w (a4),d2 ;8
or.w (a3),d2 ;8 set Z flag according all four tiles
rts ;16
;128 cycles, 15 instructions
And with BSR overhead that's 146 cycles.
; _X= x pixel offset array of objects
; _Y= y pixel offset array of objects
; r2= map column offset
; tbl= shift+multiply LUT for column offset
; Reg X is the pointer/index to object's positions*
ldy <_Y,x ;4
lda shift_lut,y ;5 y>>3
ldy <_X,x ;4
clc ;2 = 15
adc tbl.l,y ;5
adc <r2.l ;4
sta <r0.l ;4
tax ;2 = 15
lda tbl.h,y ;5
adc <r2.h ;4
and #bffr_width ;2
sta <r0.h ;4 = 15
sax ;3
adc #col_width ;2
sta <r1.l ;4
txa ;2 = 11
adc #$00 ;2
and #bffr_width ;2
sta <r1.h ;4
lda [r0] ;7 = 15
ora [r1] ;7
ldy #$01 ;2
ora [r0],y ;7
ora [r1],y ;7
rts ;7 = 30
;101 cycles 25 instructions
With JSR, that's 108 total cycles.
What does the function do? It takes an X/Y coord and converts it to a MAP segment position, then grabs the 4 tiles that overlap for a collision detection. This routine is optimized under the assumption that the object is more than 1px in size and no larger than the map segment size (Xpx and Ypx). I.e. if the map collision segments are 8x8, then the object must be between 2x2 and 8x8, which means it could potentially touch 4 tiles/segments at a given position.
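What both routines compute can be summarized in Python (MAP_HEIGHT, the column-major layout, and all names here are assumptions for illustration):

```python
# Column-major collision map, one byte per 8x8 segment:
# index = (x/8) * map_height + (y/8). An object between 2x2 and
# 8x8 px can overlap at most four segments: this cell, the cell
# below it, and the same pair one column over.
MAP_HEIGHT = 32   # segments per column (illustrative)

def tile_index(x, y):
    return (x // 8) * MAP_HEIGHT + (y // 8)

def collide(cmap, x, y):
    i = tile_index(x, y)        # top-left segment
    j = tile_index(x + 8, y)    # next column over
    # OR all four together; nonzero means something solid was hit
    return cmap[i] | cmap[i + 1] | cmap[j] | cmap[j + 1]

cmap = bytearray(64 * MAP_HEIGHT)
cmap[tile_index(20, 40)] = 1    # one solid segment
print(collide(cmap, 16, 40))  # → 1
```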