|
Post by elmer on Jan 26, 2019 3:16:14 GMT
The 10 mhz dotclock is thus reserved only to bg layer only games. RIP.
Sorry, but just to annoy you ... the 10MHz resolution has no problem at all in displaying all 16 sprites on a line, even in 2-clocks-per-access mode.
|
|
|
Post by elmer on Jan 30, 2019 5:15:55 GMT
I'll quote Bonknuts from the other PCEFx forums: I was looking at the offline archive of the old PCEFX Development threads, and saw that Bonknuts had already talked about how the sprite system loads pixel-data during the hblank, and that the CPU will be delayed if it tries to read/write VRAM during this time. I do miss his particular passion and technical curiosity in our little PCE community. Anyway, on top of that, the whole hblank and vblank signal timing was already documented on the lovely Japanese "Ki's Research Room" website that ccovell mentioned back in 2016.
|
|
|
Post by dshadoff on Jan 30, 2019 5:55:31 GMT
|
|
|
Post by gilbot on Jan 30, 2019 6:26:36 GMT
At least (most of?) the site is archived in web.archive.org ATM. Let's hope Geocities won't use robots.txt or whatever other measures to block these contents from being accessible by the public.
|
|
|
Post by elmer on Jan 31, 2019 21:50:02 GMT
This is what I've concluded from my sprites-per-line tests using Chris's "Screen Dimension Test" program.
Note that this is basically what the official VDC documentation predicts, but with the VCE's line timing overriding the VDC's settings for HDE/HSW/HDS.
VDC @ 5.36MHz  -> width = total # chr on line = 42 chr
VDC @ 7.16MHz  -> width = total # chr on line = 56 chr
VDC @ 10.74MHz -> width = total # chr on line = 85 chr

Sprites-per-line displayed (@ 1-clk-per-access) -> (width - 2 - (hdw + 1)) * 2
Sprites-per-line displayed (@ 2-clk-per-access) -> (width - 2 - (hdw + 1))
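For anyone who wants to play with the numbers, here is a minimal Python sketch of the formula above. It is only an illustration, not anything from the official documentation. Assumptions that are not in the post: the line length of 1365 master clocks (which happens to reproduce the 42/56/85 chr widths), the dot-clock dividers of 4/3/2, the clamp to the 0..16 range, and all of the names used. It also ignores hds, which the raw data shows can matter at large hds values.

```python
# Assumption: one scanline is 1365 master clocks, and the dot clock is
# master/4, master/3 or master/2 for the three VDC speeds in the post.
MASTER_CLOCKS_PER_LINE = 1365
MASTER_CLOCKS_PER_PIXEL = {"5.36MHz": 4, "7.16MHz": 3, "10.74MHz": 2}
MAX_SPRITES = 16  # the hardware never shows more than 16 sprites on a line


def line_width_chr(clock):
    """Total number of whole 8-pixel characters on one line."""
    return MASTER_CLOCKS_PER_LINE // (MASTER_CLOCKS_PER_PIXEL[clock] * 8)


def sprites_per_line(clock, hdw, clocks_per_access):
    """Formula from the post, clamped to 0..16 (the clamp is an assumption)."""
    slots = line_width_chr(clock) - 2 - (hdw + 1)
    if clocks_per_access == 1:
        slots *= 2
    return max(0, min(MAX_SPRITES, slots))


if __name__ == "__main__":
    for clock in MASTER_CLOCKS_PER_PIXEL:
        print(clock, "->", line_width_chr(clock), "chr")
    print(sprites_per_line("5.36MHz", 0x20, 1))
```

The hdw values are the raw register values, so hdw $1F means 32 characters displayed.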
Here is the raw data ...
VDC @ 5.36MHz, MWR=$x0
hds $02 hdw $1F -> 32 chr = 256 pxl -> 16 sprites
hds $02 hdw $20 -> 33 chr = 264 pxl -> 14 sprites
hds $02 hdw $21 -> 34 chr = 272 pxl -> 12 sprites
VDC @ 5.36MHz, MWR=$xA
hds $02 hdw $1E -> 31 chr = 248 pxl -> 9 sprites
hds $02 hdw $1F -> 32 chr = 256 pxl -> 8 sprites
hds $02 hdw $20 -> 33 chr = 264 pxl -> 7 sprites
hds $02 hdw $21 -> 34 chr = 272 pxl -> 6 sprites
hds $02 hdw $22 -> 35 chr = 280 pxl -> 5 sprites
hds $02 hdw $23 -> 36 chr = 288 pxl -> 4 sprites
...
hds $04 hdw $23 -> 36 chr = 288 pxl -> 4 sprites
hds $05 hdw $23 -> 36 chr = 288 pxl -> 4 sprites
hds $06 hdw $23 -> 36 chr = 288 pxl -> 5 sprites
hds $07 hdw $23 -> 36 chr = 288 pxl -> 6 sprites
VDC @ 7.16MHz, MWR=$x0
hds $03 hdw $2B -> 44 chr = 352 pxl -> 16 sprites
hds $03 hdw $2C -> 45 chr = 360 pxl -> 16 sprites
hds $03 hdw $2D -> 46 chr = 368 pxl -> 16 sprites
hds $03 hdw $2E -> 47 chr = 376 pxl -> 14 sprites
hds $03 hdw $2F -> 48 chr = 384 pxl -> 12 sprites
...
hds $07 hdw $2F -> 48 chr = 384 pxl -> 12 sprites
hds $08 hdw $2F -> 48 chr = 384 pxl -> 14 sprites
hds $09 hdw $2F -> 48 chr = 384 pxl -> 16 sprites
VDC @ 7.16MHz, MWR=$xA
hds $06 hdw $25 -> 38 chr = 304 pxl -> 16 sprites
hds $05 hdw $26 -> 39 chr = 312 pxl -> 15 sprites
hds $05 hdw $27 -> 40 chr = 320 pxl -> 14 sprites
hds $04 hdw $28 -> 41 chr = 328 pxl -> 13 sprites
hds $04 hdw $29 -> 42 chr = 336 pxl -> 12 sprites
hds $03 hdw $2A -> 43 chr = 344 pxl -> 11 sprites
hds $03 hdw $2B -> 44 chr = 352 pxl -> 10 sprites
hds $03 hdw $2C -> 45 chr = 360 pxl -> 9 sprites
hds $03 hdw $2D -> 46 chr = 368 pxl -> 8 sprites
...
hds $0A hdw $2B -> 44 chr = 352 pxl -> 10 sprites
hds $0B hdw $2B -> 44 chr = 352 pxl -> 10 sprites
hds $0C hdw $2B -> 44 chr = 352 pxl -> 11 sprites
hds $0D hdw $2B -> 44 chr = 352 pxl -> 12 sprites
VDC @ 10.74MHz, MWR=$xA
hds $0B hdw $3B -> 60 chr = 480 pxl -> 16 sprites
...
hds $0B hdw $3F -> 64 chr = 512 pxl -> 16 sprites
hds $0B hdw $40 -> 65 chr = 520 pxl -> 16 sprites
hds $0B hdw $41 -> 66 chr = 528 pxl -> 16 sprites
hds $0B hdw $42 -> 67 chr = 536 pxl -> 16 sprites
hds $0B hdw $43 -> 68 chr = 544 pxl -> 15 sprites
hds $0B hdw $44 -> 69 chr = 552 pxl -> 14 sprites
hds $0B hdw $45 -> 70 chr = 560 pxl -> 13 sprites
hds $0B hdw $46 -> 71 chr = 568 pxl -> 12 sprites
hds $0B hdw $47 -> 72 chr = 576 pxl -> 11 sprites
hds $0B hdw $48 -> 73 chr = 584 pxl -> 10 sprites
hds $0B hdw $49 -> 74 chr = 592 pxl -> 10 sprites
...
hds $0B hdw $48 -> 73 chr = 584 pxl -> 10 sprites
hds $0C hdw $48 -> 73 chr = 584 pxl -> 11 sprites
hds $0D hdw $48 -> 73 chr = 584 pxl -> 12 sprites
|
|
|
Post by elmer on Feb 1, 2019 6:09:23 GMT
At least (most of?) the site is archived in web.archive.org ATM. Let's hope Geocities won't use robots.txt or whatever other measures to block these contents from being accessible by the public.
Here's another easy way to make a backup ... www.petekeen.net/archiving-websites-with-wget
|
|
|
Post by dshadoff on Feb 1, 2019 18:04:42 GMT
At least (most of?) the site is archived in web.archive.org ATM. Let's hope Geocities won't use robots.txt or whatever other measures to block these contents from being accessible by the public.
Here's another easy way to make a backup ... www.petekeen.net/archiving-websites-with-wget
Good point... I kept thinking that wget was UNIX-specific, but it stands to reason that somebody would have ported it to Windows sometime in the past 20 years. There are lots of good sites on Geocities Japan that should be archived before they disappear...but in order not to derail this thread, I will start a new one in the "General Discussion" forum.
Dave
|
|
|
Post by elmer on Feb 1, 2019 19:47:56 GMT
As we discussed in the PC Engine CDROM 'BIOS' Interrupts thread, when it comes to split-screen displays and parallax scrolling, there is a delay between the VDC signalling an RCR interrupt, and when the VDC locks the scroll registers for the next line. All indications are that the VDC actually fires the RCR interrupt right at the end of the HDW display period. That means that you only have a very short window in which to write new scroll values into the VDC in order to set them before the next line starts.
BUT, if you do write an IRQ1 interrupt-handler to do things this way, then you are going to have to make sure that nothing can possibly delay your response to the interrupt, which means that you have to be VERY careful about handling simultaneous timer interrupts, or ADPCM-streaming interrupts, and you can totally forget about using the TIA instruction to quickly upload new graphics to VRAM!
The way that the hardware is supposed to be used is that you generate an RCR interrupt two screen-lines before the line that you want to change, then wait until the next line's scroll registers are locked for that line, and then you set the new scroll values for your target line.
The advantage of doing things this way is that your safe-window to write the new scroll values goes from being a couple-of-dozen cycles, into being a couple-of-hundred cycles, and so you can afford some variation in your interrupt-response timing, and thus you can safely use timer interrupts, ADPCM-streaming interrupts, and the TIA instruction to upload new graphics to VRAM ... as long as you are careful with your code-design.
That is the way that the System Card's RCR interrupt handling works, and if you look at it, it even has a couple of "bsr-to-rts" 15-cycle delays in there to make sure that it gets the timing right.
Unfortunately, the System Card's code wasn't tested very well, and while it does work correctly for 256 & 336 pixel width displays, it doesn't work properly for 240 & 320 pixel width displays (the X-scroll value is set too early!).
The critical thing to know when designing your interrupt code, is exactly how many CPU cycles occur between when the RCR interrupt triggers, and when it is safe to write the scroll registers, knowing that they have already been locked for the next line's display.
I've done some tests, and here are the results...
5.36MHz (with MWR = $x0)
Safe to write BYR @ 100 cpu cycles if width=240 hdw=$1D
Safe to write BYR @  90 cpu cycles if width=248 hdw=$1E
Safe to write BYR @  79 cpu cycles if width=256 hdw=$1F
Safe to write BYR @  67 cpu cycles if width=264 hdw=$20
7.16MHz (with MWR = $x0)
Safe to write BYR @ 106 cpu cycles if width=320 hdw=$27
Safe to write BYR @  98 cpu cycles if width=328 hdw=$28
Safe to write BYR @  90 cpu cycles if width=336 hdw=$29
Safe to write BYR @  82 cpu cycles if width=344 hdw=$2A
Safe to write BYR @  74 cpu cycles if width=352 hdw=$2B
10.74MHz (with MWR = $xA)
Safe to write BYR @ 112 cpu cycles if width=480 hdw=$3B
Safe to write BYR @ 107 cpu cycles if width=488 hdw=$3C
Safe to write BYR @ 101 cpu cycles if width=496 hdw=$3D
Safe to write BYR @  96 cpu cycles if width=504 hdw=$3E
Safe to write BYR @  91 cpu cycles if width=512 hdw=$3F
Safe to write BYR @  85 cpu cycles if width=520 hdw=$40
Safe to write BYR @  79 cpu cycles if width=528 hdw=$41
Safe to write BYR @  75 cpu cycles if width=536 hdw=$42
Safe to write BYR @  69 cpu cycles if width=544 hdw=$43
Note: The VDC's hde, hsw, & hds settings have *NO* effect!!!
Note: These cycle timings are to the write-cycle within the instruction, and not to the start of the instruction.
Note: The VDC shadows/locks the BYR register a cycle-or-two before the BXR register, so write BYR first.
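The measured delays above drop by a roughly constant amount for each extra character of width, which is what you would expect if the interrupt fires at the end of HDW. A rough Python sketch of that sanity check follows. The table is copied from the measurements above; the clock ratios (CPU at master/3, dot clock at master/4, /3 or /2) and every name in the code are my assumptions for illustration only.

```python
from fractions import Fraction

# Measured "safe to write BYR" delays from the post: {dot clock: {hdw: cycles}}
SAFE_BYR_CYCLES = {
    "5.36MHz": {0x1D: 100, 0x1E: 90, 0x1F: 79, 0x20: 67},
    "7.16MHz": {0x27: 106, 0x28: 98, 0x29: 90, 0x2A: 82, 0x2B: 74},
    "10.74MHz": {0x3B: 112, 0x3C: 107, 0x3D: 101, 0x3E: 96, 0x3F: 91,
                 0x40: 85, 0x41: 79, 0x42: 75, 0x43: 69},
}

# Assumption: master clocks per pixel for each dot clock; CPU = master / 3.
MASTER_CLOCKS_PER_PIXEL = {"5.36MHz": 4, "7.16MHz": 3, "10.74MHz": 2}


def safe_byr_delay(clock, hdw):
    """Look up the measured delay; raises KeyError for untested settings."""
    return SAFE_BYR_CYCLES[clock][hdw]


def cpu_cycles_per_chr(clock):
    """CPU cycles taken to display one 8-pixel character (theoretical)."""
    return Fraction(8 * MASTER_CLOCKS_PER_PIXEL[clock], 3)


def measured_step(clock):
    """Average measured drop in safe delay per extra character of width."""
    table = SAFE_BYR_CYCLES[clock]
    narrow, wide = min(table), max(table)
    return Fraction(table[narrow] - table[wide], wide - narrow)


if __name__ == "__main__":
    for clock in SAFE_BYR_CYCLES:
        print(clock, float(cpu_cycles_per_chr(clock)), float(measured_step(clock)))
```

A general-purpose handler would probably just index a small table like this by the current hdw value rather than compute anything at interrupt time.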
|
|
|
Post by elmer on Feb 3, 2019 20:39:38 GMT
So, where does all this RCR timing information leave us?
Let's imagine trying to create a section of the screen with a parallax scroll, where you want to change the BGX scroll value on every raster line.
That means that you need to change the BGX value approximately every 455 CPU cycles.
Let's also imagine that you are uploading a bunch of new graphics to VRAM at the same time, using a TIA instruction with a PCEFX-recommended 32-byte transfer to VRAM.
And finally, let's also imagine that you have 16-sprites-on-a-line in that area of the screen as some enemies go past, and so the VDC needs to read 64 words of pixel data for the next line's display.
Having the CPU write to VRAM while the VDC is busy reading the next line's sprite data will cause the CPU to block until the VDC is finished reading the data that it needs.
So ...
32 byte TIA to VRAM -> 241 CPU cycles
SPR DMA delays CPU -> 86 CPU cycles (64 VDC cycles @ 5MHz) == 327 CPU cycles
Safe delay to BYR in IRQ1 -> 100 cycles (with a 240-wide screen) == 427 cycles
write BYR & BXR registers -> 30 cycles == 457 cycles
But there are only 455 CPU cycles on a line!!!
This isn't a (major) problem (yet), because we actually have 455 + 100 - 2 to write the BYR before it is locked for the line that we want to change.
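The running totals above are easy to get wrong by hand, so here is the same budget as a small Python sketch. All of the cycle counts are the ones quoted in this post; the names and the way the deadline is expressed are just my illustration.

```python
CPU_CYCLES_PER_LINE = 455

# Worst-case chain of delays before the scroll registers are written,
# using the figures quoted in the post (240-wide screen).
WORST_CASE_STEPS = [
    ("32-byte TIA to VRAM", 241),
    ("sprite data fetch stalls the CPU", 86),
    ("safe delay to BYR in IRQ1", 100),
    ("write BYR & BXR registers", 30),
]


def worst_case_total(steps):
    """Sum of all delays, in CPU cycles."""
    return sum(cycles for _, cycles in steps)


def byr_deadline(safe_delay):
    """Cycles available before BYR is locked for the target line,
    using the post's figure of 455 + safe_delay - 2."""
    return CPU_CYCLES_PER_LINE + safe_delay - 2


if __name__ == "__main__":
    running = 0
    for name, cycles in WORST_CASE_STEPS:
        running += cycles
        print(f"{name:34s} {cycles:4d} == {running}")
    print("overrun past one line:", worst_case_total(WORST_CASE_STEPS) - CPU_CYCLES_PER_LINE)
    print("deadline for BYR:", byr_deadline(100))
```

So the worst case runs slightly past one line, but is still well inside the deadline for the target line.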
BUT ... we also have to possibly set a new CR value to enable/disable sprites, maybe change a color palette register, and finally, set up a new RCR line value for the next line's interrupt. And our IRQ1 handler should really handle all (reasonable) screen resolutions, so we might choose to make the "safe delay" be 112 cycles so that we can use the 480-wide screen resolution.
The timing is pretty critical in these circumstances, and the RCR code that we write is going to have to be carefully designed to do things in the right order, at the right time, to be predictable, and to not waste time ... or there will be unstable results and occasional screen glitches.
So what does the HuC RCR handler do?
32 byte TIA to VRAM -> 241 CPU cycles
SPR DMA delays CPU -> 86 CPU cycles (64 VDC cycles @ 5MHz) == 327 CPU cycles
IRQ1 begins to RCR written -> 136 cycles == 463 cycles
IRQ1 begins to BGX written -> 208 cycles == 535 cycles (1 complete line + 80 cycles)
IRQ1 begins to BGY written -> 243 cycles == 570 cycles (1 complete line + 115 cycles)
Whoops!!
Well, in this case, the RCR value is written too late to catch the interrupt on the next line, the BYR value is written too late to catch the next line in all of the resolutions, but the BXR value should make it in time except for the 256-wide mode, where it may or may not make it in time.
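To make the "too late" part concrete, here is a minimal Python sketch that works out how far into the following line each HuC register write lands in the worst case. The offsets are the ones listed above (with BGX/BGY being the BXR/BYR writes), and 112 is the largest safe-to-write-BYR delay from my earlier measurements; the names are only for illustration.

```python
CPU_CYCLES_PER_LINE = 455
WORST_CASE_ENTRY_DELAY = 241 + 86   # TIA + sprite-fetch stall, from the post
LARGEST_MEASURED_SAFE_BYR = 112     # 480-wide @ 10.74MHz, from the earlier table

# Cycles from the start of HuC's IRQ1 handler to each register write.
HUC_WRITE_OFFSETS = {"RCR": 136, "BXR": 208, "BYR": 243}


def cycles_into_next_line(register):
    """How far past one full line the write happens in the worst case."""
    total = WORST_CASE_ENTRY_DELAY + HUC_WRITE_OFFSETS[register]
    return total - CPU_CYCLES_PER_LINE


if __name__ == "__main__":
    for reg in HUC_WRITE_OFFSETS:
        print(reg, "written", cycles_into_next_line(reg), "cycles into the next line")
    late = cycles_into_next_line("BYR") > LARGEST_MEASURED_SAFE_BYR
    print("BYR misses every measured resolution:", late)
```

Compare the BXR figure against the safe-delay table for whichever width you use; the BXR lock is a cycle-or-two later than the BYR one.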
So there is the possibility of some nasty screen glitches and things just not working the way that the programmer wants, but this circumstance of having 16 sprites on a line and starting a TIA instruction at the same time is going to be rare, and make it look like a random and unexplained problem.
Looking at HuC's code, I suspect that the author just didn't realize that the VDC's sprite-pixel-data delay could potentially throw a nasty wrench into the works.
|
|
|
Post by ccovell on Feb 4, 2019 1:48:45 GMT
MagicKit also checks for IRQ1, then VSync, and finally HSync in its interrupt handling code, which I would consider ass-backwards for dealing with timing-critical code. I usually put HSync interrupt checking first in my own ASM code.
|
|
|
Post by soop on Feb 4, 2019 16:36:39 GMT
I love reading these threads. I understand a little bit, but it's fascinating reading such knowledge. Also, what did happen to bonknuts?
|
|
|
Post by spenoza on Feb 4, 2019 17:29:03 GMT
I know he was taking classes, and that might have just taken up all his time, but I would hate to lose his knowledge and participation in the scene. Anyone have contact information for him and can reach out to him to let him know we're here?
|
|
|
Post by Deleted on Feb 5, 2019 0:01:34 GMT
Touching on the missing members subject (thread derailing WOOOOOOO!), does anyone know if MooZ (the HuSDK guy) ever had an account @ PCEFX? I'm interacting with him via Github and I was wondering if he was aware/cared about the forums.
At first I thought "what kind of madman would do a TIA during interrupt triggering scanlines" but then again, I realized that any user that's trying to use the "unused cycles" of the scanline in the main thread to keep updating logic for the next frame, HuC or not, could run into that issue.
In a non-HuC project you could simply implement a NES-style VRAM transfer buffer to be run during VBlank (or VBlank + safe scanline range) only, and that's also true for HuC, BUT, you could argue that HuC gives the false impression that writing to VRAM anywhere is OK (and it kinda is most of the time...). It would be nice to have a buffer implementation for the HuC VRAM write routines.
I could say the same thing about VCE writes which are kinda worse because any write outside v/hblank will show an ugly artifact on screen. HuC could also use a VBlank VCE writer. I'm not going to mention any game names here so people don't think I'm being a dick or trashing those games, but there are many homebrews where this glitch is very visible.
Anyway, just a wishlist that will probably never be implemented, haha.
|
|
|
Post by elmer on Feb 5, 2019 2:31:54 GMT
MagicKit also checks for IRQ1, then VSync, and finally HSync in its interrupt handling code, which I would consider ass-backwards for dealing with timing-critical code.
Yep, putting your time-critical stuff first just makes sense. It doesn't seem to be too horrible to write something that meets the timing requirements on a PCE, but I'm having some trouble coming up with a general-purpose (i.e. for HuC) routine that I feel comfortable with on the SGX when using a 256-wide screen. 240-wide is OK, and 320-wide and 336-wide modes are fine too (if running in single-cycle mode and slightly overclocking the VRAM).
All of the timing difficulties go away if you split TIA instructions into 16-byte chunks instead of 32-byte chunks ... perhaps that would be the sane thing to do?
I love reading these threads.
Hey ... it gets even MORE complex and fun! On top of all the other concerns, then you also have to take timer-interrupts into account if you are trying to play back a sample while all of the other stuff is going on. A timer interrupt has higher priority than the VDC's IRQ1 interrupt, and will be serviced first if it is waiting. That adds yet more random delays to the whole process.
Using interrupts to change things is *hard* to program ... that's why Nintendo built a hardware hsync-dma capability into the SNES.
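As a rough illustration of why the 16-byte chunk idea helps, here is a small Python sketch of the worst-case budget for the two chunk sizes. The TIA cost of 17 + 7 cycles per byte is an assumption that I picked because it reproduces the 241-cycle figure for 32 bytes; the 86, 112 and 30 cycle figures come from the earlier posts, and the names are just for the example.

```python
CPU_CYCLES_PER_LINE = 455
SPRITE_FETCH_STALL = 86   # CPU cycles, 16 sprites on the line (from the thread)
SAFE_DELAY = 112          # covers up to the 480-wide mode (from the thread)
SCROLL_WRITES = 30        # write BYR & BXR (from the thread)


def tia_cycles(num_bytes):
    """Assumed cost of a TIA to VRAM: 17 cycles setup + 7 per byte."""
    return 17 + 7 * num_bytes


def spare_cycles(chunk_bytes):
    """Cycles left on the line for the rest of the handler (CR, palette,
    next RCR value); negative means the worst case spills over."""
    used = tia_cycles(chunk_bytes) + SPRITE_FETCH_STALL + SAFE_DELAY + SCROLL_WRITES
    return CPU_CYCLES_PER_LINE - used


if __name__ == "__main__":
    for size in (32, 16):
        print(f"{size}-byte TIA: {tia_cycles(size)} cycles, spare {spare_cycles(size)}")
```

With the smaller chunk there is some room left on the line for the other register writes, at the cost of more TIA instructions per upload.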
|
|
|
Post by elmer on Feb 9, 2019 20:37:46 GMT
On top of all the other concerns, then you also have to take timer-interrupts into account if you are trying to play back a sample while all of the other stuff is going on. A timer interrupt has higher priority than the VDC's IRQ1 interrupt, and will be serviced first if it is waiting. That adds yet more random delays to the whole process.
The excellent information on Ki's Research Room shows us that the minimum delay that a timer interrupt will add to our RCR processing is 17 cycles. The research there shows that a pending interrupt is not processed immediately after a CLI instruction, but after the next instruction. It also shows that a cli/sei pair of instructions is all that is needed.
So my recommendation would be to start all timer interrupt handlers with something like ...
timer_irq:      ;;;                     ; 8 (cycles for the INT)
                stz     $1403           ; 5 Acknowledge TIMER IRQ.
                cli                     ; 2 Allow HSYNC to interrupt.
                sei                     ; 2 Disable interrupts, IRQ1 begins.
                                        ; Adds 17 cycle delay to HSYNC.
|
|