artial
Deep Blooper
Posts: 25
Homebrew skills: Make it work
Fave PCE Shooter: Superstar soldier
Currently Playing: Micro Mages
|
Post by artial on Jan 1, 2023 18:55:54 GMT
Nearly 10 years ago I ported Another World to consoles (PS3/PS4/Vita/Wii U/3DS/XB1). This summer I thought it would be a nice challenge to port it to the PCE. I was lucky to get the sources of the 6502 C64 + 40 MHz CPU port by @majikeyric and started on the PCE version in November.
You can read about the game architecture here: fabiensanglard.net/another_world_polygons/index.html
The game uses 4 frame buffers to render polygons (quadstrips), so on the base PCE I chose 224*136 to fit in 62k of VRAM.
Here's the current state (including graphic bugs): www.youtube.com/watch?v=PE61gkh0Ng8
3 areas need attention for performance:
- frame buffer copy
- frame buffer filling
- software rendering of quads / line drawing
The last one is the trickiest to optimize. For now I want to address frame buffer copying. Currently I read VRAM and write it back to another location; it takes around 28ms to copy one framebuffer to another. Note that the target framerate is not 60hz but rather 10-20hz. VRAM DMA is much faster but cannot complete a full copy during VBlank. I'll try a mixed approach (CPU copy during display, DMA or DMA + CPU copy in parallel during VBL). Is there a way to increase VBL time with the VDC vertical registers?
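As a quick sanity check on the VRAM budget mentioned above, the arithmetic can be sketched like this. It assumes 4 bits per pixel (16 colors, as PCE tiles use) and 16-bit VRAM words; the function names are made up for illustration and are not from the port.

```python
# Rough VRAM budget for the framebuffers described in the post.
# Assumptions (not from the port's source): 4 bits per pixel, 16-bit VRAM words.
BITS_PER_PIXEL = 4
BITS_PER_WORD = 16


def framebuffer_words(width, height, bpp=BITS_PER_PIXEL):
    """Size of one framebuffer in 16-bit VRAM words."""
    return (width * height * bpp) // BITS_PER_WORD


def vram_bytes(width, height, count):
    """Total bytes used by `count` framebuffers of the given size."""
    return framebuffer_words(width, height) * 2 * count


words_per_buffer = framebuffer_words(224, 136)   # one 224*136 buffer
total_bytes = vram_bytes(224, 136, 4)            # four buffers
```

With these assumptions one 224*136 buffer comes to 7616 words, which lines up with the "7.5k words" figure quoted later in the thread, and four of them stay under 62k. Other sizes or buffer counts can be plugged into `vram_bytes` to compare layouts.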
|
|
|
Post by turboxray on Jan 1, 2023 22:19:55 GMT
artial said:
For now I want to address frame buffer copying. Currently I read VRAM and write it back to another location; it takes around 28ms to copy one framebuffer to another. Note that the target framerate is not 60hz but rather 10-20hz. VRAM DMA is much faster but cannot complete a full copy during VBlank. I'll try a mixed approach (CPU copy during display, DMA or DMA + CPU copy in parallel during VBL). Is there a way to increase VBL time with the VDC vertical registers?

Yeah, you can set how many lines are visible, and that affects the length of vblank time. Just curious, are you not doing the CPU manual copy during active display? Are you only doing it during vblank time? Also, if you're running low res for the main screen, you can boost the VDC clock during vblank to get a faster VRAM-DMA, and then set it back.
|
|
punch
Deep Blooper
Posts: 22
|
Post by punch on Jan 2, 2023 0:17:14 GMT
That's a fantastic demo artial... turboxray's tip will make things significantly faster, and I'm sure you already figured that out but you can access VRAM at any time even outside the blanking interval.
Keep up the good work!
|
|
|
Post by artial on Jan 2, 2023 6:48:58 GMT
turboxray said:
Yeah, you can set how many lines are visible, and that affects the length of vblank time. Just curious, are you not doing the CPU manual copy during active display? Are you only doing it during vblank time? Also, if you're running low res for the main screen, you can boost the VDC clock during vblank to get a faster VRAM-DMA, and then set it back.

I do the CPU manual copy during display, yes; since there are 4 framebuffers there is no tearing. It takes 28ms as a pure CPU manual copy. Now I check whether VBL has started and use DMA to copy 3k words out of the 7.5k words. That mixed copying cuts 8ms, down to 20ms. That can likely drop to between 5 and 16ms with more DMA time.
Here are the current VDC registers. What can be done to increase top blanking and start VBL earlier? Only 136 lines are needed. I would experiment, but I don't know what to set the pulse to when messing with the other settings.
.byte $0A,$02,$02 ; HSR
.byte $0B,$1F,$04 ; HDR
.byte $0C,$07,$0D ; VPR
.byte $0D,$DF,$00 ; VDW
.byte $0E,$03,$00 ; VCR
I didn't think one could increase VBL time after reading the CMD notes: "14 lines for the top blanking area (shown as light black). 242 lines for the active display area (graphics and/or overscan color). 4 lines for the bottom blanking area (shown as light black). 3 lines for the sync area (shown as pure black). This layout is fixed, and cannot be changed by the vertical control registers."
Also, to boost the VDC during VBL, do you just set the VCE clock to 10 MHz? And/or set VDC registers $0A and $0B?
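For readers following along, here is a small sketch that decodes the register table from this post. It assumes each .byte triple is register number, low byte, high byte, that the low 7 bits of HDR hold the display width in tiles minus one, and that the low 9 bits of VDW hold the last displayed line. Those field layouts come from commonly cited VDC documentation and should be double-checked; the helper names are hypothetical.

```python
# Register table as posted: (register, low byte, high byte) triples.
TABLE = bytes([
    0x0A, 0x02, 0x02,  # HSR
    0x0B, 0x1F, 0x04,  # HDR
    0x0C, 0x07, 0x0D,  # VPR
    0x0D, 0xDF, 0x00,  # VDW
    0x0E, 0x03, 0x00,  # VCR
])


def decode(table):
    """Return {register number: 16-bit value} from reg/lo/hi triples."""
    regs = {}
    for i in range(0, len(table), 3):
        reg, lo, hi = table[i:i + 3]
        regs[reg] = lo | (hi << 8)
    return regs


regs = decode(TABLE)

# Assumed field layouts (verify against VDC documentation):
active_width_px = ((regs[0x0B] & 0x7F) + 1) * 8   # HDR low bits: tiles - 1
active_lines = (regs[0x0D] & 0x1FF) + 1           # VDW: last line index
wait_lines_vcr = regs[0x0E]                       # VCR as posted
```

Under those assumptions the table describes a 256-pixel-wide, 224-line display, so lowering VDW is the obvious knob for shortening the active area.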
|
|
|
Post by turboxray on Jan 2, 2023 19:09:44 GMT
turboxray said:
Yeah, you can set how many lines are visible, and that affects the length of vblank time. Just curious, are you not doing the CPU manual copy during active display? Are you only doing it during vblank time? Also, if you're running low res for the main screen, you can boost the VDC clock during vblank to get a faster VRAM-DMA, and then set it back.
artial said:
I do the CPU manual copy during display, yes; since there are 4 framebuffers there is no tearing. It takes 28ms as a pure CPU manual copy. Now I check whether VBL has started and use DMA to copy 3k words out of the 7.5k words. That mixed copying cuts 8ms, down to 20ms. That can likely drop to between 5 and 16ms with more DMA time.

For this manual CPU copy, is this reading from VRAM and then writing it back? As in a makeshift VRAM DMA? The reason I ask is that it doesn't sound like this part of the process is a local-RAM-to-VRAM copy. If that's the case, and you're just moving stuff around in VRAM in this final copy phase, then what data are you moving around? Is it tile data or tilemap data?

So first off:
"14 lines for the top blanking area (shown as light black). 242 lines for the active display area (graphics and/or overscan color). 4 lines for the bottom blanking area (shown as light black). 3 lines for the sync area (shown as pure black). This layout is fixed, and cannot be changed by the vertical control registers."
While that is true, that is an NTSC frame defined by the VCE. It is fixed, but from the perspective of the VDC, not so much. There's the VDC frame and then there's the VCE's NTSC frame. You can have multiple VDC frames inside a single VCE frame. That's a more advanced topic, but I wanted to make that distinction.

So from the settings you posted, it looks like you have 224 active lines and the rest is vblank. Set that to whatever you want for displayable lines, but go ahead and set VCR to $ff (it's currently set to $03). VCR is the "wait and do nothing until the VCE asserts the vsync pulse" phase. It doesn't matter if the value is really large, because once the VCE asserts the vsync pulse, the VDC jumps out of this phase and into the next. So setting it too big, which quite a few PCE games do, is harmless and gives you peace of mind that you don't need to make all the values add up correctly.

Just an FYI: if the VCR value is too short and this "line wait" expires too soon relative to the VCE asserting vsync, then the VDC will continue on to the next frame. The VDC is set up to sync with external vsync and external hsync. This is done with an hsync pulse wait period and a vsync line wait period. If the VDC does not get an external sync signal (from the VCE) during that wait period, it will simply move on to the next phase of the frame drawing. Likewise, if the VDC is in the middle of drawing a line to the screen and the VCE asserts hsync, the VDC immediately jumps to the end of that line and starts the next. The exact same thing happens with vsync: wherever the VDC is at, it will end the current frame and start drawing the next. None of this affects the VCE output frame/timings. Those remain fixed. That's why I described it as a VCE frame and a VDC frame.

So given those settings, it looks like you're running in 5.37mhz mode (256px).
I was implying that you could switch into 7.16mhz mode for a 33% increase in VRAM DMA speed. Yeah, you could set it to 10.74mhz, but you would technically need to set the VRAM wait states, otherwise you'd be overclocking VRAM. I've run the VDC RAM overclock for years, on multiple systems, and never seen it glitch, but it's not guaranteed. My point is that if you set the dot clock to 10.74mhz and set the VRAM wait states for slower access, the VRAM DMA bandwidth would be the same as if you were running 256px. So, since you're using low res mode, I'd recommend just bumping the VDC clock to 7.16mhz.
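The 33% figure follows directly from the dot clock ratios. The sketch below assumes the three dot clocks are the NTSC master clock (taken here as 21.47727 MHz) divided by 4, 3 and 2, and that VRAM DMA throughput scales linearly with the dot clock; the second assumption is a simplification, and the names are illustrative only.

```python
# Assumed NTSC master clock and dividers for the three dot clock modes.
MASTER_CLOCK_HZ = 21_477_270
DIVIDERS = {"5.37": 4, "7.16": 3, "10.74": 2}


def dot_clock_hz(mode):
    """Dot clock in Hz for a mode key such as '5.37' or '7.16'."""
    return MASTER_CLOCK_HZ / DIVIDERS[mode]


def dma_speedup(from_mode, to_mode):
    """Relative DMA speed, assuming throughput scales with the dot clock."""
    return dot_clock_hz(to_mode) / dot_clock_hz(from_mode)


gain_low_to_mid = dma_speedup("5.37", "7.16")   # about 1.33
```

Going from 5.37 to 7.16 gives a ratio of 4/3, i.e. roughly a third faster. As far as commonly cited hardware notes go, the dot clock itself is selected in the VCE control register rather than in the VDC horizontal registers, which is worth verifying before relying on it.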
|
|
|
Post by artial on Jan 2, 2023 21:43:59 GMT
turboxray said:
For this manual CPU copy, is this reading from VRAM and then writing it back? As in a makeshift VRAM DMA? The reason I ask is that it doesn't sound like this part of the process is a local-RAM-to-VRAM copy. If that's the case, and you're just moving stuff around in VRAM in this final copy phase, then what data are you moving around? Is it tile data or tilemap data?

Yes, that's moving VRAM data around. Typically when the player enters a new screen, background vector graphics are drawn to framebuffer #1, which takes 500 to 1500ms, so it must be cached. Then FB1 is copied every frame, alternately to framebuffer 2 or 3, dynamic stuff is drawn on top, and the result is displayed (along with a full nametable update). It flip-flops between FB 2 and 3: copying the BG from FB1, drawing dynamic stuff, and displaying.
I set VCR to $FF and set 7mhz directly at start up with .byte $0B,$2B,$06
That didn't seem to work. The display did not change (not sure how a 32*32 nametable is supposed to be displayed in 7mhz mode), and the same data length was copied during VBL DMA as with 5mhz. I tried 10mhz overclocked (.byte $0B,$3F,$08), same result. Maybe the nametable width must be adjusted to 64 tiles.
|
|
|
Post by artial on Jan 2, 2023 22:09:37 GMT
Update: the VDR / last line displayed did the trick. Since I can set it to a low value (144), VBL is much longer and the 7.5k-word VRAM DMA executes in 6ms, plus the wait for VBL, which is a random 0 to ~9ms. Maybe if I can get the VDC to operate faster that can go down from 6 to 4-5ms, but 6ms + [0-9]ms is a huge saving over 28ms. Thanks a lot!
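The numbers in this update can be cross-checked with a rough timing model. It assumes a 263-line NTSC frame (the 14 + 242 + 4 + 3 lines quoted earlier in the thread), 1365 master clock cycles per line and a 21.47727 MHz master clock. The last two values are assumptions taken from commonly cited timing notes, not measurements from this port.

```python
# Rough frame timing model (all constants are assumptions, see lead-in).
LINES_PER_FRAME = 14 + 242 + 4 + 3      # 263 lines per NTSC frame
MASTER_CLOCK_HZ = 21_477_270
MASTER_CYCLES_PER_LINE = 1365


def line_time_ms():
    """Duration of one scanline in milliseconds."""
    return MASTER_CYCLES_PER_LINE / MASTER_CLOCK_HZ * 1000.0


def blank_time_ms(active_lines):
    """Time per frame that is not active display, in milliseconds."""
    return (LINES_PER_FRAME - active_lines) * line_time_ms()


def active_time_ms(active_lines):
    """Time per frame spent on active display, in milliseconds."""
    return active_lines * line_time_ms()


blank_224 = blank_time_ms(224)    # original setting
blank_144 = blank_time_ms(144)    # after lowering the last displayed line
wait_144 = active_time_ms(144)    # worst-case wait for the next VBL
```

In this model, dropping from 224 to 144 active lines stretches the blanking window from about 2.5ms to about 7.6ms, enough to hold the reported 6ms DMA, and the worst-case wait of about 9ms matches the "0 to ~9ms" range in the post.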
|
|
|
Post by turboxray on Jan 2, 2023 23:54:11 GMT
turboxray said:
For this manual CPU copy, is this reading from VRAM and then writing it back? As in a makeshift VRAM DMA? The reason I ask is that it doesn't sound like this part of the process is a local-RAM-to-VRAM copy. If that's the case, and you're just moving stuff around in VRAM in this final copy phase, then what data are you moving around? Is it tile data or tilemap data?
artial said:
Yes, that's moving VRAM data around. Typically when the player enters a new screen, background vector graphics are drawn to framebuffer #1, which takes 500 to 1500ms, so it must be cached. Then FB1 is copied every frame, alternately to framebuffer 2 or 3, dynamic stuff is drawn on top, and the result is displayed (along with a full nametable update). It flip-flops between FB 2 and 3: copying the BG from FB1, drawing dynamic stuff, and displaying.

If all the tile data is in different buffers in VRAM, then why even move it around? Just update the tilemap to point to different tiles. That ought to be much faster.
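To illustrate the tilemap idea, here is a minimal sketch that builds the tilemap entries for a framebuffer living at a given VRAM address. It assumes the usual entry layout (palette in the top 4 bits, tile number in the low 12 bits, with the tile number equal to the VRAM word address divided by 16) and a row-major tile order inside the framebuffer. The layout of the actual port is not stated in the thread, so the function names and the base address are hypothetical.

```python
TILE_WORDS = 16  # one 8x8 4bpp tile occupies 16 VRAM words


def bat_entry(tile_word_addr, palette=0):
    """Tilemap entry for a tile stored at the given VRAM word address."""
    if tile_word_addr % TILE_WORDS:
        raise ValueError("tile address must be 16-word aligned")
    return ((palette & 0xF) << 12) | ((tile_word_addr // TILE_WORDS) & 0xFFF)


def framebuffer_bat(base_word_addr, tiles_wide=28, tiles_high=17, palette=0):
    """Rows of tilemap entries for a row-major framebuffer (assumed layout)."""
    rows = []
    for ty in range(tiles_high):
        row = []
        for tx in range(tiles_wide):
            addr = base_word_addr + (ty * tiles_wide + tx) * TILE_WORDS
            row.append(bat_entry(addr, palette))
        rows.append(row)
    return rows


# Hypothetical framebuffer at VRAM word address $1000, 224*136 pixels.
bat = framebuffer_bat(0x1000)
entries = sum(len(row) for row in bat)
```

For a 224*136 screen that is 476 tilemap words to rewrite, compared with 7616 words of pixel data for a full copy. Whether that helps depends on how much of the frame the dynamic objects touch, since touched tiles still need their own pixel data.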
|
|
|
Post by artial on Jan 4, 2023 10:48:14 GMT
|
|
|
Post by elmer on Jan 4, 2023 14:37:10 GMT
So, the 2nd background is basically unused most of the time, and just exists to speed up transitions back to the previous scene background. Anyway, doing VDC DMA in the background to copy the 1st background to the next-frame buffer is certainly one way to go, but have you also considered whether there are benefits to just using a sprite plane for the "dynamic" screen objects, and letting the VDC composite the 2 layers in realtime?
|
|
|
Post by SignOfZeta on Jan 4, 2023 15:40:03 GMT
This looks great, keep it up!
|
|
|
Post by _jash on Jan 5, 2023 15:36:13 GMT
This is one of the coolest projects I have heard about. And it's one of my favorite games too! Can't wait to try this one out!
|
|
|
Post by artial on Jan 8, 2023 14:00:52 GMT
elmer said:
So, the 2nd background is basically unused most of the time, and just exists to speed up transitions back to the previous scene background. Anyway, doing VDC DMA in the background to copy the 1st background to the next-frame buffer is certainly one way to go, but have you also considered whether there are benefits to just using a sprite plane for the "dynamic" screen objects, and letting the VDC composite the 2 layers in realtime?

Yes, of course it would be ideal to mix BG and sprites for dynamic pixels, but I don't think I can because:
1. I'd need a 5th full screen VRAM area, so resolution would be even lower, like 192*120 instead of 224*136.
2. The game uses 16 colors, no transparency, so there is no way to fit all colors + transparency for a tile.
3. Some of the dynamic objects are clipped using BG pixels - in other words, for some pixels the BG is on top of the sprites. That priority can only be set at the tile level.
|
|
|
Post by turboxray on Jan 10, 2023 2:05:43 GMT
elmer said:
So, the 2nd background is basically unused most of the time, and just exists to speed up transitions back to the previous scene background. Anyway, doing VDC DMA in the background to copy the 1st background to the next-frame buffer is certainly one way to go, but have you also considered whether there are benefits to just using a sprite plane for the "dynamic" screen objects, and letting the VDC composite the 2 layers in realtime?
artial said:
Yes, of course it would be ideal to mix BG and sprites for dynamic pixels, but I don't think I can because:
1. I'd need a 5th full screen VRAM area, so resolution would be even lower, like 192*120 instead of 224*136.
2. The game uses 16 colors, no transparency, so there is no way to fit all colors + transparency for a tile.
3. Some of the dynamic objects are clipped using BG pixels - in other words, for some pixels the BG is on top of the sprites. That priority can only be set at the tile level.

The original engine might have used four framebuffers total, but you don't need four. You need the source BG buffer (which I get is dynamic, but that still doesn't change what I mentioned previously about rendering to sectional segments and using the tilemap to point to untouched areas instead of wasting time copying tile/pixel data. It's not unlike the sprite layer approach mentioned). And you need the two render buffers. A three-framebuffer allocation would give you the 224*160 display window of the SNES/Genesis versions. I know the site says the 2nd BG is for caching, but it's used to speed up switching back and forth between scenes and going back to a previous frame (in game, fast switching). But honestly, that's really a waste given that the display window is now clipped to 224*136. Basically you're optimizing for a minor edge case at the expense of resolution. I'd ditch the BG 'cache' buffer. I'd also store all "base" BG frames in st1/st2 opcode format.
It's already 40% faster than doing a TIA block transfer, and you could encode it further to take care of redundant rows in the pixel data (i.e. st2 #nn, st2 #nn, st2 #nn... not needing st1 #nn because the LSB in the tile data didn't change) - the BG graphics lend themselves to this really well. That'd bring you closer to a 60-70% boost (and probably quite a bit more) in speed over using TIA. That would mitigate the original issue they were trying to solve with a BG cache buffer: reducing latency when fast switching between two BG source layers.
As far as the sprite layer only being 15 colors: it's sitting on top of the BG layer, which already has 16 colors. Surely not every "object" rendered on top of the source BG framebuffer is going to be rasterized/blitted using all 16 colors. And for the minor edge cases where that might be true, just reduce by a single color. Totally worth the huge speed increase you'd get from it. On top of that, with getting rid of the BG cache buffer and using the st1/st2 opcode trick, you could even do a 3DO-style port (more detailed backgrounds).
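The st1/st2 encoding described above can be sketched as a small offline converter. It assumes the HuC6280 opcode bytes $13 (ST1 #imm), $23 (ST2 #imm) and $60 (RTS), that the VDC keeps the last written low data byte until a new one arrives, and that the caller has already selected the VRAM write register and set the write address. The function name is hypothetical and this is not code from the port.

```python
ST1 = 0x13  # ST1 #imm: write low byte of the VDC data register
ST2 = 0x23  # ST2 #imm: write high byte, which commits the word
RTS = 0x60  # return from the generated routine


def encode_words(words):
    """Turn 16-bit VRAM words into a run of ST1/ST2 instructions.

    ST1 is skipped when the low byte equals the previously written one,
    relying on the assumed behaviour that the low latch is retained.
    """
    out = bytearray()
    last_lo = None
    for word in words:
        lo = word & 0xFF
        hi = (word >> 8) & 0xFF
        if lo != last_lo:
            out += bytes([ST1, lo])
            last_lo = lo
        out += bytes([ST2, hi])
    out.append(RTS)
    return bytes(out)


# Three words; the second shares its low byte with the first.
code = encode_words([0x1234, 0x5634, 0x0000])
```

In this example the second word costs a single ST2 because its low byte repeats, which is the saving the post refers to. Flat-shaded backgrounds with long runs of identical rows would compress well this way, at the cost of ROM space for the generated code.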
|
|
Shirei89
Deep Blooper
Posts: 21
Homebrew skills: In training...
Fave PCE Shooter: Soldier Blade
Fave PCE Platformer: Saigo no Nindou
Fave PCE Game Overall: Gradius II
Fave PCE RPG: Emerald Dragon, Private Eyedol
|
Post by Shirei89 on Jan 16, 2023 9:56:26 GMT
Looking great! Sounds like this is coming along nicely. Can't wait to see how this turns out.
|
|