|
Post by Arkhan on Nov 13, 2019 4:24:59 GMT
I have a question I've never looked much into, and Tom, Dave or Covell can probably answer.
Let's say you want to read a BAT entry (an entire word) out of VRAM.
How fast is that vs. storing the BAT in RAM somewhere as well, and reading from there instead of VRAM?
I am not 100% certain whether there are timing / speed-loss considerations to keep in mind when reading from VRAM with the normal code (ST0 / set $0002/$0003 / read the shit out).
Puncho found this when I asked if he happened to know given his recent bike game stuff:
And that's about the best info I've seen, but doesn't mention timings of just reading data out. I'm not writing back. Just reading.
I'm hoping the answer is "it's fine just do it", but I somehow doubt it.
|
|
|
Post by turboxray on Nov 13, 2019 5:37:48 GMT
You want to read just a single random word from VRAM? Or a bunch in a row or column? Setting up the VRAM read pointer won't have a cycle-alignment penalty, so it's just the overhead of ST0 and the two stores. I haven't looked at the chart in some years, but I believe you have 4 CPU access slots to read/write VRAM per 8-pixel DOT clock period.
If you're misaligned, you'll get up to a 1.33-cycle penalty on the LDA of $0003 (since it's the latch; $0002 access won't have this penalty), plus the normal +1 for accessing that 0x3ff block of hardware registers (i.e. 6 cycles for an LDA instead of 5). You shouldn't have any wonky timing or noticeable delays. So worst case: 6 cycles for LDx $0002, and 6 to 7.33 cycles for LDx $0003.
You're not doing a read-modify-write? I don't know if it's in that post, but I did tests with TSB and TRB against $0002/3 (they got the +1 penalty for the read and again for the write, plus any alignment delay).
|
|
|
Post by Arkhan on Nov 13, 2019 8:39:49 GMT
Where did that chart from Charles come from, anyway? I'm literally just reading one BAT cell out to get the tile number, that's all. Not modifying or rewriting anything.
Just setting up a read, masking off the palette, and subtracting the address offset to get a tile number.
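That masking step is simple enough to sketch (a rough illustration only; `tile_base` is a hypothetical offset, and the BAT layout assumed here, top 4 bits palette and low 12 bits tile index, is an assumption rather than something confirmed in this thread):

```python
def bat_to_tile(bat_entry, tile_base):
    # Assumed BAT layout: bits 15-12 = palette, bits 11-0 = tile index.
    # tile_base is a hypothetical offset for where this tileset starts.
    return (bat_entry & 0x0FFF) - tile_base

# e.g. palette 2, raw index 0x105, tileset starting at 0x100:
print(hex(bat_to_tile(0x2105, 0x100)))  # -> 0x5
```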
Reading the same BAT value out of CD-RAM doesn't seem like it would be any faster, really, when you get right down to it. You'd gain a couple of cycles by not accessing hardware registers... but you'd have to keep a list of pointers to BATs and know which one to read based on the screen you're on, so you lose cycles doing that lookup and it's basically a wash.
That's not worth it. I think reading tiles out of the BAT in VRAM will be OK.
Though, I am starting to wonder if doing a binary collision table would be better. That would definitely be faster, but again, not by much really.
I'm not an expert on CPU access slots/misalignment though. Is that kind of crap in one of the docs somewhere?
|
|
|
Post by turboxray on Nov 13, 2019 10:31:10 GMT
Here's the snippet from the patent, but Charley Mac's doc also explains it:
So basically, just think of the VDC in terms of pixels per scanline. Low res is 5.37MHz, which is 341px for the whole scanline (displayable and blanking parts too). The number of CPU cycles per scanline is ~455, which makes each 'dot' of the VDC scanline in low-res mode equivalent to ~1.33 CPU cycles. It's just showing them in groups of 8 for frame of reference. In that diagram above, every other 'dot', or pixel, is free to the CPU. Worst case is you just barely missed the pixel window, and when you go to write, the VDC asserts RDY to the CPU, causing it to stall until the slot is open. Honestly, for a single read I wouldn't even worry about it. Slot #4 shows nothing, but from what I tested it looks like a second read from the BAT.
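To sanity-check that ratio (a quick sketch, just arithmetic on the figures from this post):

```python
CPU_CYCLES_PER_SCANLINE = 455  # ~7.16 MHz CPU, cycles per scanline
DOTS_PER_SCANLINE = 341        # 5.37 MHz dot clock, low-res, incl. blanking

cycles_per_dot = CPU_CYCLES_PER_SCANLINE / DOTS_PER_SCANLINE
print(round(cycles_per_dot, 2))  # -> 1.33 CPU cycles per VDC dot
```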
You mean like in general? I'm not sure. But the VDC makes you wait if you're not aligned, and that's only on register $0003 for reading and writing. Worst case, it's like +1.33 CPU cycles total for that instruction (be it a read or a write). You could always figure in the worst case, or half that, since you have about a 50% chance of hitting it in low-res mode during active display. Are you using mkit's or your own hsync routines? 'Cause you'll also need to save the VDC reg you're going to use:
  lda #VDC_READ_ADDR     ; 2
  sta <vdc_reg           ; 4
  st0 #VDC_READ_ADDR     ; 5
  ; (get the word address into A:X)
  sta $0002              ; 6
  stx $0003              ; 7 <- rounded up to 7
  lda #VDC_READWRITE_REG ; 2
  sta <vdc_reg           ; 4
  st0 #VDC_READWRITE_REG ; 5
  lda $0002              ; 6
  lda $0003              ; 7 <- rounded up to 7
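Tallying the per-instruction cycle counts in that listing (a quick sketch; the 7s were already rounded up from the 7.33 worst case):

```python
setup = [2, 4, 5, 6, 7]  # select the read-address reg, write the word address
read  = [2, 4, 5, 6, 7]  # select the read/write reg, read the word back
print(sum(setup) + sum(read))  # -> 48 cycles of overhead for one BAT read
```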
Yeah, that's a good amount of overhead. Looks like it'd be faster if the same data existed locally. But hey... what if you need to read two consecutive BAT entries (row or column)? Then it wouldn't be so bad.
Not to complicate this further, but hblank, when it fetches sprite pixel data, uses all the slots in the 8-cycle diagram. It's a very short period of time (for 1 DOT cycle mode). That'll delay you something like 16-28 cycles? Around that range. Depends on the number of sprites fetched for that scanline. I think you have like a ~12% chance of hitting it during active display haha. The reason I was interested in that part (the hblank sprite fetch) is that I was planning on lock-syncing the CPU to the VDC with a dummy read from VRAM (for special jitter-free VCE abusing). I got the idea from C64 demos where they lock-sync the CPU with the display to race the beam and make changes mid-line.
Does using the BAT/VDC method save you some translation costs too? I don't know what kind of collision map you're doing, but I usually kept a ring buffer that basically mirrors the BAT. It was annoying for HuCard projects because it ate up some of that 8k of RAM.
|
|
|
Post by Arkhan on Nov 13, 2019 19:35:51 GMT
Ah, that's a useful chart. Unless I was just out of it reading the VDC doc at 3-4 AM, it doesn't have as much of that info. It has the chart but not the explanation.
I might be wrong about how the slotting works, but I just assumed that's how many "available actions" can be done, and if there are too many actions it has to wait, and that's part of what misalignment means.
This will be used to check tiles, and fortunately, for collisions I really only need binary collision.
But, every entity will be doing this so that can become a problem really fast if reading from the VDC is contributing that much useless overhead.
Storing binary tilemaps in CD-RAM would be even faster than using the BAT in VRAM, at the expense of having to store the tables.
The BATs already live in CD-RAM and load from there to VRAM as needed, depending on what screen you're in.
So, keeping a pointer table and accessing that, since it's already in RAM, is probably a safer bet. Getting a pointer out of an array and accessing the data there avoids the misalignment penalties and some of the setup. I can probably use the indirect indexed directly misindexed whatever the fuck mode to get at the RAM pretty easily once I bash my face into a wall enough to understand that mode again.
This isn't an interrupt routine though so I wasn't storing the VDC register. I assumed interrupts would need to do that.
Does accessing CD-RAM incur extra penalties?
|
|
|
Post by elmer on Nov 13, 2019 20:53:06 GMT
turboxray said: "Not to complicate this further, but hblank when it fetches sprite pixel data - uses all the slots in the 8 cycle diagram. It's a very short period of time (for 1 DOT cycle mode). That'll delay you something like 16-28 cycles? Around that range. Depends on the number of sprites fetched for that scanline."
Worst case ... 16 sprites on a line * 4 bitplane reads per sprite = 64 VDC read cycles * 1.33 @ lo-res = 85 CPU cycles delay maximum, which seems like it would occur if you are part way through doing a TIA to VRAM when the hblank hits. Tom, you're saying 16-28 cycles ... am I missing something?
|
|
|
Post by turboxray on Nov 13, 2019 21:44:43 GMT
elmer said: "Worst case ... 16 sprites on a line * 4 bitplane reads per sprite = 64 VDC read cycles * 1.33 @ lo-res = 85 CPU cycles delay maximum ... Tom, you're saying 16-28 cycles ... am I missing something?"
That was off the top of my head haha. But yeah, 85 cycles max for low-res mode (I realized that this morning when I was thinking about it). About the lock-sync thing: the PCE fetches all sprites on that Y line regardless of whether they're on screen or not, so off-screen sprites still get pixel data fetched. That was going to be my trick... have off-screen sprites pad the delay out to 85 cycles so the CPU would be in sync with the VDC.
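The corrected worst case works out like this (a quick arithmetic sketch using the figures from these posts):

```python
MAX_SPRITES_PER_LINE = 16
BITPLANE_READS_PER_SPRITE = 4
CYCLES_PER_DOT = 455 / 341  # ~1.33 CPU cycles per VDC cycle, low-res mode

vdc_reads = MAX_SPRITES_PER_LINE * BITPLANE_READS_PER_SPRITE
print(vdc_reads, round(vdc_reads * CYCLES_PER_DOT))  # -> 64 85
```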
|
|
|
Post by Arkhan on Nov 13, 2019 21:48:36 GMT
What I am gathering out of this is that reading from CD-RAM is the better bet, since all of that bullshit seems likely in a game where you are running in circles shooting at stuff.
|
|
|
Post by Arkhan on Nov 18, 2019 6:13:24 GMT
OK, new curveball I wasn't thinking of.
You have to map in the CD-RAM bank you want to read from, and map whatever was there back when you are done. That overhead is going to eat up more cycles than the VRAM stuff previously discussed, unless I am missing something.
I am not sure if I can reliably map the CD-RAM bank that is being displayed and have it stay without needing to be constantly remapped. I need to look into that now.
Pages ... 2-6 are the "everything seems to use these" pages, IIRC. Do we know offhand if HuC fucks around with any of those in a way where it is going to hose my stuff?
|
|
|
Post by turboxray on Nov 19, 2019 2:53:15 GMT
It's been a while, so I don't remember how HuC handles banks (where your code is at any given time, and what else is required). This was one of the issues that kind of kept me from doing enhancements for HuC (like small routines to make things faster), because eventually you run into the question of who's in charge of the mapping layout. Local ASM helps, but high-level optimization is still reliant on low-level details. /rant
Anyway. Maybe you can move stuff into a 'library' bank? IIRC, the fixed bank at MPR7 has code for calling far library functions. I mean, yeah, you've got bank saving/switching overhead, but don't you need to read more than one cell (8x8?) section of the collision map? Like for a 0-to-8-pixel-wide bounding box you need to read up to two cells, 8 to 16px wide -> up to three cells, etc. If that's the case, then the saving/mapping overhead isn't that much. I think in the last map collision routine I did, I just made it simple and read Xn by Yn cells from the map, depending on the X and Y offset of the colliding box.
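That cell count falls out of how a bounding box straddles 8px cells; here's a small one-axis sketch (a hypothetical helper, not code from anyone's actual routine):

```python
def cells_spanned(start_px, size_px, cell=8):
    # Number of map cells a span can overlap on one axis,
    # given its pixel offset within the cell grid.
    if size_px <= 0:
        return 0
    return (start_px + size_px - 1) // cell - start_px // cell + 1

# An up-to-8px-wide box can straddle two cells, up-to-16px three, etc.
print(cells_spanned(7, 8), cells_spanned(7, 16))  # -> 2 3
```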
Tangent: I always thought having the hardware bank mapped to MPR0 was a waste. Like, how often do you need to access hardware regs compared to the rest of your code? At minimum, make it a dedicated fixed const bank (ROM or RAM), and just have ISRs map in bank $FF.
|
|
|
Post by Arkhan on Nov 19, 2019 3:08:23 GMT
The MagicKit routines generally use pages 3/4 and automagically do the work for you.
IIRC, the IRQ ones use 5 and 6.
But again, they handle the map in/out for you if they trounce your stuff, so it's probably OK to assume I can leave it loaded where it is in page 6.
and nah, I don't think I will need to read more than one 8x8 section for now at least. If I do, having it mapped in always will be useful.
I noticed something though. I see that TAM/TMA in the magic kit routines use #3, #4, etc.
Does it do some voodoo?
I thought the value you used for that instruction went in powers of two, where page 3 is "4" (3rd bit), etc.
Some of the MagicKit internal shenanigans I never committed to memory or played with too hardcore, and now I am looking at some of it going "hmmmmm".
I've done this before but then I forget because I don't look at it again for a few years.
|
|
|
Post by turboxray on Nov 19, 2019 4:49:43 GMT
Yeah, mkit/pceas for some reason decided TAM/TMA would take a bank value instead of a bitmask. You're not crazy haha, the real instruction takes a bitmask (each bit corresponding to an MPR reg). They really should have just made it a macro instead of hard-wiring it as a fake operand. I guess some other assemblers do this too (a couple of instructions for the 68k are like that in some assemblers). But yeah, #0 for MPR0, #1 for MPR1, etc.
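So the translation from pceas's operand to the real encoding is just a bit shift (a trivial sketch):

```python
def tam_operand(mpr):
    # Real TAM/TMA operand: a bitmask with one bit per MPR register.
    return 1 << mpr

# pceas's "TAM #3" really assembles to a TAM with bitmask %00001000:
print(bin(tam_operand(3)))  # -> 0b1000
```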
So does main take up MPRs 5-6, and C 'user' functions take up 3-4? 7 is fixed for the near lib and ISR handlers, 0-1 for hardware and RAM. So what's in 2-3? Is that where your collision map data is?
|
|
|
Post by dshadoff on Nov 19, 2019 5:04:49 GMT
The Develo kit used the format "TAM3", so PCEAS followed suit.
|
|
|
Post by Arkhan on Nov 19, 2019 9:38:04 GMT
turboxray said: "Yeah mkit/pceas for some reason decided TAM/TMA would take a bank value instead of a bitmask. ... So what's in 2-3? Is that where your collision map data is?"
My collision data is nowhere right now; it's just in CD-RAM banks. I forget what is where. There's that chart that says what they suggest you do. Basically, pages 2-5 are "touchable" from what I understand, but it's all shades of gray, and it changes with the CD stuff. I think using 5 is safe. Squirrel also uses 5, IIRC, and it handles restoring the page, so it's probably safe to just use that.
|
|
|
Post by turboxray on Nov 20, 2019 17:31:33 GMT
So is it like ASM inside of HuC? The only reason I ask is that I remember it adding a little more complexity to the layouts (i.e. where you are in the bank currently, depending on whether you're calling from 'main' or from a user function). 'Cause with mkit, yeah, 2-5 are safely accessible, but that depends on where you're calling from. If it's from the main lib bank, then all of it is fine, but it's different when calling from a user function defined in HuC. It's been a while so I don't remember all the details, I just remember it being a little bit annoying haha.
|
|