|
Post by bugothecat on Feb 18, 2022 18:01:18 GMT
I am new to PC-FX coding and have already done some tests, but I want to measure large time differences and the provided timer functions are giving me trouble. First I set the timer to its highest value:
eris_timer_init(); eris_timer_set_period(65535); eris_timer_start(0);
I was calling eris_timer_read_counter() and printing the numbers, and it seems that 1) it loops from 0 to 65535 and back again, and 2) it's too fast to capture the time taken by a frame that I draw, which might be slow (like 15-20 fps).
The documentation says "Period is in ticks of CPUclk/15". This is 21.500.000 / 15 = 1.433.333 ticks in one second, but in that time the counter would wrap around 65535 many, many times, so I will have lost the exact time difference. I was coding a demo effect that was slow and wanted to read the timer before and after the effect, then divide the difference by whatever is needed to find the exact time elapsed. But if something updates at, say, 15 fps, there will be around 95000 of these ticks per frame, so the counter will already have wrapped once.
I know how to program fps counters and all that, but this timer doesn't help. Calling eris_timer_read_counter() around very small pieces of code is fine: I can get a tick count for a small algorithm that finishes early, but not for a whole frame, which is what I need to display an FPS counter. It's also not so easy to convert to seconds elapsed.
So, my second plan was to use an interrupt, and this totally failed (it just crashed). The idea was to let an interrupt increase a value every time the eris timer reached the end of its 65536 ticks and restarted. I think I used (can't find the code now) void irq_set_handler (int level, void(*fn)(void)) to point to my function (but I am not sure what level I should pass) and finally called eris_timer_start(1), where 1 means fire an IRQ when the timer finishes. I might have tried a few other things from the v810.h functions, like calling irq_enable() to enable interrupts, but I got a freeze.
Is this the only way to keep a count with this timer? If I could just get an interrupt to automatically call a function every time the eris timer ends, and if the eris timer loops again and again, I could just read that variable and work out the total time elapsed for big time differences. Unless I can do it without interrupts. I also didn't find an example in liberis using those functions.
p.s. I recently ported this one, so it would be nice to have a proper fps counter working to performance test things like that. There is more to come! www.youtube.com/watch?v=xamC5_avzew
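The arithmetic in this post can be checked on a host machine with a small sketch like the one below. It only uses the figures quoted above (21,500,000 Hz and the divide-by-15); the function names are made up for illustration, and the assumption that one full lap of the counter is 65536 ticks (rather than 65535) has not been verified against the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the post: CPU clock and the timer's fixed divider. */
#define CPU_CLOCK_HZ   21500000u
#define TIMER_DIVIDER  15u
/* Assumption: one full lap of the 16-bit counter is 65536 ticks. */
#define TICKS_PER_LAP  65536u

/* Timer ticks in one second: 21,500,000 / 15, truncated to an integer. */
uint32_t ticks_per_second(void)
{
    return CPU_CLOCK_HZ / TIMER_DIVIDER;
}

/* Timer ticks that elapse while one frame is drawn at the given frame rate. */
uint32_t ticks_per_frame(uint32_t fps)
{
    assert(fps > 0);
    return ticks_per_second() / fps;
}

/* Rebuild a wide tick count from the number of completed laps (for example,
   a variable bumped by a timer interrupt) plus the ticks into the current lap. */
uint32_t total_ticks(uint32_t laps, uint32_t ticks_into_lap)
{
    assert(ticks_into_lap < TICKS_PER_LAP);
    return laps * TICKS_PER_LAP + ticks_into_lap;
}

/* Convert ticks to whole milliseconds; the 64-bit intermediate avoids
   overflow in the multiplication. */
uint32_t ticks_to_ms(uint32_t ticks)
{
    return (uint32_t)(((uint64_t)ticks * 1000u) / ticks_per_second());
}
```

With these numbers a 15 fps frame takes 95555 ticks, which is more than one lap, so a lap count is needed for anything frame-sized. A 32-bit total wraps after roughly 50 minutes at this tick rate, which is usually fine for an FPS counter.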
|
|
|
Post by elmer on Feb 18, 2022 19:36:42 GMT
It's only a 16-bit timer, so WYSIWYG. Having the timer create an interrupt seems like a reasonable solution, and the timer is the level 1 interrupt (it's listed in FXGABOAD.DOC in the official documentation).
Sorry, but you're kinda on your own here to figure things out, because liberis is NOT a finished development environment, it's just a work-in-progress. If I were you, and trying to get the interrupts to work, I'd put breakpoints on *all* of the IRQ vectors, and see if something else is triggering an interrupt after you've enabled them.
AFAIK, GCC *should* be building a workable interrupt-handler function, as long as you declare it properly ... I spent a *lot* of time working on that for the GCC4 patches.
P.S. If you do get this to work, then I'd love to see a pull-request for an improved "hello_interrupt" example project that actually triggers an interrupt, rather than just generating untested handler functions!
|
|
|
Post by dshadoff on Feb 18, 2022 19:40:46 GMT
I haven't done any PC-FX programming myself yet, but there are two approaches I would try to use for such profiling:
1) Use Mednafen and count the master clock (should be at the lower left in the block of numbers on the main debugger screen... in hexadecimal)
2) A common trick for getting a sense of performance for large processes is to change the color of the overscan at the start and end of each process, and observe roughly what portion of a field period is consumed by it. This is particularly useful for functions whose cost varies a lot (collision detection on a variable number of projectiles, for example).
|
|
|
Post by elmer on Feb 19, 2022 21:20:53 GMT
> AFAIK, GCC *should* be building a workable interrupt-handler function, as long as you declare it properly ... I spent a *lot* of time working on that for the GCC4 patches. P.S. If you do get this to work, then I'd love to see a pull-request for an improved "hello_interrupt" example project that actually triggers an interrupt, rather than just generating untested handler functions!
Ah, once again this silly programmer gets let down by his own over-confidence! I'm taking a brain-break and looking at the PC-FX compiler again, so bugothecat you've caught me at a good time to investigate this!
There's definitely a huge bug in the interrupt-preamble that I'm generating in GCC that *must* be fixed before any of this will work. I'm kinda surprised that the VirtualBoy developers never reported this since they've been using the compiler for the last 5 years, but perhaps the VirtualBoy firmware deals with the low-level IRQ preamble itself.
|
|
|
Post by elmer on Feb 20, 2022 5:37:00 GMT
> There's definitely a huge bug in the interrupt-preamble that I'm generating in GCC that *must* be fixed before any of this will work.
Hmmm, the interrupt-preamble isn't really what's wrong, although it does limit nested interrupts, and so it should probably be changed. But there's a whole lot of other initialization stuff that you need to do, or the console will hang.
Another problem is that liberis's irq_set_handler() is written to use normal C functions, and not GCC's __attribute__ "interrupt" functions. There is a different irq_set_raw_handler() for those kinds of functions.
Anyway, I've got the "hello_interrupt" example working with the timer and the vblank IRQs now, and I'll upload it to github tomorrow after some more tests and a bit of cleanup.
|
|
|
Post by bugothecat on Feb 20, 2022 7:44:20 GMT
> Anyway, I've got the "hello_interrupt" example working with the timer and the vblank IRQs now, and I'll upload it to github tomorrow after some more tests and a bit of cleanup.
Oh great! Thanks very much, I'll check when it's committed.
|
|
|
Post by elmer on Feb 20, 2022 17:59:29 GMT
> Oh great! Thanks very much, I'll check when it's committed
OK, I've checked that in now! Note that it has shown up a couple of things ...
1) Do not use liberis's irq_set_handler() function, because it doesn't handle the frame-pointer. Use irq_set_raw_handler() and GCC's "__attribute__ ((interrupt))" instead.
2) Do not enable V810 interrupts within an interrupt handler, because the GCC4 IRQ-preamble can't deal with nested interrupts yet. That's a problem created by the changes made to allow for super-fast interrupt handlers that are leaf-functions (which can help rsync and timer interrupts).
3) There's a bug somewhere in binutils, and it's not correctly relocating symbols in the ZDA segment. Please avoid declaring "__attribute__ ((zda))" variables for the moment.
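As a rough illustration of points 1 and 2, here is a minimal sketch of a leaf-style timer handler that only bumps a lap counter, plus a reader that takes a consistent snapshot of that counter and the timer. It is written so it can be compiled and tried on a host: BUILD_FOR_PCFX, TIMER_ISR, timer_laps, read_total_ticks() and the fake reader are all hypothetical names, not part of liberis. On real hardware the handler would be registered with irq_set_raw_handler(), and it may also need to acknowledge the timer interrupt; the checked-in "hello_interrupt" example is the place to look for the actual setup and masking calls, which are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* BUILD_FOR_PCFX is a hypothetical macro: define it only when building with
   the patched V810 GCC, where the thread says the attribute is available. */
#ifdef BUILD_FOR_PCFX
#define TIMER_ISR __attribute__ ((interrupt))
#else
#define TIMER_ISR
#endif

/* Completed laps of the 16-bit timer. It is volatile because the handler
   changes it behind the main program's back. */
volatile uint32_t timer_laps = 0;

/* Leaf handler: it calls nothing and does not re-enable interrupts. */
TIMER_ISR void timer_irq_handler(void)
{
    timer_laps++;
}

/* A function that returns the ticks elapsed in the current lap. On hardware
   it would wrap eris_timer_read_counter(), converted according to whichever
   direction the counter runs (not verified here). */
typedef uint32_t (*lap_ticks_fn)(void);

/* Read laps and ticks together. If a wrap happened between the two reads of
   timer_laps, the pair is inconsistent, so try again. Assumes 65536 ticks
   per lap. */
uint32_t read_total_ticks(lap_ticks_fn ticks_into_lap)
{
    uint32_t laps_before, laps_after, ticks;
    do {
        laps_before = timer_laps;
        ticks = ticks_into_lap();
        laps_after = timer_laps;
    } while (laps_before != laps_after);
    return laps_before * 65536u + ticks;
}

#ifndef BUILD_FOR_PCFX
/* Host-only stand-in for the hardware read: the first call pretends the timer
   wrapped in the middle of the read by invoking the handler directly. */
static int fake_wrap_pending = 1;

uint32_t fake_ticks_into_lap(void)
{
    if (fake_wrap_pending) {
        fake_wrap_pending = 0;
        timer_irq_handler();
        return 5u;   /* stale value from the snapshot that gets discarded */
    }
    return 100u;
}
#endif
```

The retry loop matters because a frame-time measurement reads two things (the lap count and the counter) that the interrupt can change in between; without it, a reading taken right at a wrap could be off by a whole lap.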
|
|
|
Post by bugothecat on Feb 21, 2022 11:33:41 GMT
> > Oh great! Thanks very much, I'll check when it's committed
> OK, I've checked that in now!
Great, I just checked the changes. Makes sense to me. One thing (among others) I wasn't doing was masking the interrupts; now I see how it works. I'll test it this evening and incorporate it into my FPS counter.
Maybe one more short question, although I might know the answer. Is it true that there isn't a faster way to send a framebuffer in RAM to KRAM for software rendering (unless it's somewhere in the hardware and not implemented yet)? I use the king functions that OUT one 16-bit value at a time to KRAM to fill all the individual pixels in my test. This seemed very suboptimal to me at first, but it wasn't as slow as I'd expect for tens of thousands of OUTs per frame. Still, I wonder if the hardware supports a better way to send a big buffer at once, or whether that's the only way. I'd like to be sure I am not doing something stupid when there is a better way, assuming we want to modify the buffer and send it every frame.
My other thought is to try the sprites, but I guess one might also need to upload the sprite data in a similar manner with multiple OUTs. Or maybe there is a DMA function, but anyway, the machine is still open to exploration.
|
|
|
Post by elmer on Feb 21, 2022 15:52:21 GMT
> Maybe one more short question, although I might know the answer. Is it true that there isn't a faster way to send a framebuffer in RAM to KRAM for software rendering (unless it's somewhere in the hardware and not implemented yet)? I use the king functions that OUT one 16-bit value at a time to KRAM to fill all the individual pixels in my test. This seemed very suboptimal to me at first, but it wasn't as slow as I'd expect for tens of thousands of OUTs per frame. Still, I wonder if the hardware supports a better way to send a big buffer at once?
Doing a transfer as lots of individual C calls to a liberis function is definitely not going to be the fastest method, but there are distinct limitations in the CPU-to-KRAM bandwidth during a display frame that'll hit you whatever you do.
A quick scan of the docs certainly doesn't mention any DMA function to help, but the system has obviously been set up for you to use the V810 bitstring instructions to read/write large chunks of KRAM as fast as any DMA would (6 cycles per 16-bit half-word, with the CPU stalling if the KRAM is busy).
Like so much of current PC-FX investigation/exploration, if you want to try doing this to see how much faster it is (if any faster at all), then it will mean writing some assembly-language code. Good luck!
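To get a feel for what "6 cycles per 16-bit half-word" means for a whole frame, a back-of-the-envelope sketch like this can help. The 6-cycle figure and the 21,500,000 Hz clock come from the thread; the 256x240 frame size used in the note below is only a hypothetical example, the function names are invented, and the result is a best case that ignores any stalls while KRAM is busy.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_CLOCK_HZ         21500000u  /* clock figure used earlier in the thread */
#define CYCLES_PER_HALFWORD  6u         /* best-case figure quoted in the post */

/* Number of 16-bit half-words needed for a frame of the given size and depth. */
uint32_t frame_halfwords(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    uint32_t bits = width * height * bits_per_pixel;
    return (bits + 15u) / 16u;  /* round up to a whole half-word */
}

/* Best-case CPU cycles to push that many half-words, with no KRAM stalls. */
uint32_t transfer_cycles(uint32_t halfwords)
{
    return halfwords * CYCLES_PER_HALFWORD;
}

/* Share of one frame's CPU cycles (in whole percent, rounded down) that the
   transfer would take at the given frame rate. */
uint32_t transfer_percent_of_frame(uint32_t halfwords, uint32_t fps)
{
    assert(fps > 0);
    uint32_t cycles_per_frame = CPU_CLOCK_HZ / fps;
    return (uint32_t)(((uint64_t)transfer_cycles(halfwords) * 100u) / cycles_per_frame);
}
```

For a hypothetical 256x240 frame at 4bpp this gives 15360 half-words and 92160 cycles, about a quarter of a 60 fps frame before any stalls; at 8bpp the same frame would take about half.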
|
|
|
Post by bugothecat on Feb 21, 2022 21:09:07 GMT
Yes, I should write an assembly function that at least OUTs a whole run of data to King at once, instead of calling the eris function for each 16-bit block. I just wondered if OUT itself is too much, but maybe 6 cycles is not that bad.
Thanks! With your help I could finally have an FPS counter and clock. Also, previously I was sending double the amount of data to King for the 4bpp mode, my dumbest mistake. Fixing this alone made it faster than before. There are still plenty of things to optimize of course; next I will check how to write specific assembly functions. I might release this demo and its source soon, once I have also added the music for it. www.youtube.com/watch?v=38vC-lJhoH0
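The 4bpp mistake mentioned here comes down to how many pixels fit into each 16-bit write: four, so a frame needs a quarter as many half-words as it has pixels. A minimal host-side sketch of packing a row of 4-bit pixels is shown below. The function names are invented, and the nibble order (first pixel in the top nibble) is an assumption that has not been checked against the KING documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four 4-bit pixels into one 16-bit half-word.
   Assumption: the first pixel goes into the highest nibble. */
uint16_t pack4(const uint8_t px[4])
{
    return (uint16_t)(((px[0] & 0x0Fu) << 12) | ((px[1] & 0x0Fu) << 8) |
                      ((px[2] & 0x0Fu) << 4)  |  (px[3] & 0x0Fu));
}

/* Pack a run of 4-bit pixels (one pixel per byte) into half-words and return
   how many half-words were written. count must be a multiple of 4. */
size_t pack_row_4bpp(const uint8_t *pixels, size_t count, uint16_t *out)
{
    assert(count % 4u == 0);
    size_t n = count / 4u;
    for (size_t i = 0; i < n; i++)
        out[i] = pack4(&pixels[i * 4u]);
    return n;
}
```

Eight pixels become two half-words, so the number of writes per frame is the pixel count divided by four.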
|
|
|
Post by elmer on Feb 23, 2022 18:44:10 GMT
> Yes, I should write an assembly function that at least OUTs a whole run of data to King at once, instead of calling the eris function for each 16-bit block. I just wondered if OUT itself is too much, but maybe 6 cycles is not that bad.
Well, if you're transferring a block of data to VRAM, there's absolutely no way that using an OUT instruction in a loop is going to come close to the speed of using a bitstring instruction, as long as you're transferring enough bytes to make it worth saving/reloading r26-r29!
Even if you unwrap your LD/OUT/SHIFT/INC/CMP/OUT/CMP/BNE loop a few times, you're going to have serious difficulty matching the bitstring's 6-cycles-per-loop, which seems almost-suspiciously matched to the hardware's 6-cycle time between successive OUT instructions. Remember, that's not a 6-cycle penalty for an OUT; the write completes in 2 cycles if the write-buffer is empty.
And this is all assuming (I *think*) that the transfers are only possible if the KING microprogram gives some cycles to the CPU! There seem to be lots of variables in the equation, so the only way to tell is going to be experimentation on real hardware, because I'm pretty sure that mednafen isn't going to be 100% accurate in its emulation of all of these intertwined dependencies.
> 3) There's a bug somewhere in binutils, and it's not correctly relocating symbols in the ZDA segment. Please avoid declaring "__attribute__ ((zda))" variables for the moment.
OK, I found and fixed the underlying problem, but it's not ready to check in yet. In the meantime, it's OK to use uninitialized ZDA variables with the current binutils, but you shouldn't use initialized ZDA variables until the fixes are checked in.
|
|
|
Post by bugothecat on Feb 24, 2022 18:27:06 GMT
> > Yes, I should write an assembly function that at least OUTs a whole run of data to King at once, instead of calling the eris function for each 16-bit block. I just wondered if OUT itself is too much, but maybe 6 cycles is not that bad.
> Well, if you're transferring a block of data to VRAM, there's absolutely no way that using an OUT instruction in a loop is going to come close to the speed of using a bitstring instruction, as long as you're transferring enough bytes to make it worth saving/reloading r26-r29! Even if you unwrap your LD/OUT/SHIFT/INC/CMP/OUT/CMP/BNE loop a few times, you're going to have serious difficulty matching the bitstring's 6-cycles-per-loop, which seems almost-suspiciously matched to the hardware's 6-cycle time between successive OUT instructions.
I was looking at the docs for these bitstring instructions. But don't they write from memory to memory? I don't know much about the V810 yet, but I think OUT writes to hardware ports? Or is there a way to redirect a bitstring instruction so that it OUTs a series of data to the same hardware port instead? Maybe I missed something and wrongly concluded that they are not suitable for my task of writing to King. Or is the King RAM also accessible through regular memory writes instead of OUT?
|
|
|
Post by elmer on Feb 24, 2022 19:11:24 GMT
> Maybe I missed something and wrongly concluded that they are not suitable for my task of writing to King. Or is the King RAM also accessible through regular memory writes instead of OUT?
Take a look at the FXGA_GA and FXGABOAD documents ... some of the I/O data read and write registers are mirrored into memory regions specifically so that you can use the V810's bitstring instructions to blast data as-fast-as-possible to/from VRAM, KRAM, and the color palettes.
The VRAM/KRAM/palettes are not randomly accessible, it's just a sequential mirroring of the data read/write ports so that the bitstring instructions can be tricked into being useful.
... but made slightly-less-useful because some idiot at NEC decided that the C & Asm calling conventions would make registers r26-r29 all callee-saved so that any library routine *must* save them and reload them whenever it uses a bitstring opcode.
|
|