|
Post by dshadoff on Apr 28, 2019 18:04:23 GMT
Updates:
- The thread on the Homebrew board looks like it has closed the gap in information, and we seem to know the meaning of all the bits in the command stream, and there doesn't seem to be any room for additional commands as a result. This is a good thing.
- I have updated my Cortex M4 microcontroller state machine, and have now got it working with the protocol based on the new information.
- I have done additional measurements with reduced logging, and compared against PC Engine for complete 128KB reads: PC-Engine: 10.5 seconds Microcontroller: 5.35 seconds -> so, it should be able to keep up
- power consumption when connected to USB = 26mA, which is about 3 times more than the maximum I'd like it to be for "in operation", and at least 10 times more than it should be for "sleep"... but there are several LED's on (including a DotStar), and there's lots of time to work on improving this.
- The state machine is very simple, and should easily translate to Verilog for FPGA implementation... once somebody is inclined to do so (I probably will at some point, just for my own interest).
I still need to do the following:
- level-shift and connect to PC Engine
- verify operation in conjunction with PC Engine software
- write some additional save/load code for making the data less volatile
- reduce power consumption
I will post code when it's been tested in-circuit with PC Engine, or if somebody asks me to do so earlier.
|
|
|
Post by dshadoff on Apr 28, 2019 22:14:16 GMT
Well, I think I might be getting to the point where I could use some help from somebody who knows more than I do about electronics and how "high-speed" affects things. I connected my board up to some Adafruit 3.3V-to-5V level-shifters (because that was the easiest thing to do), and I was able to test successfully with a 5V Arduino running at its top speed (which is perhaps 30% of the PC Engine's full speed). So, partial success !

But when I connected to the PC Engine, I got intermittent results. I was able to get a successful read once or twice... when it was detected, it was able to read data (apparently OK), but most of the time it failed to be detected. I added a couple of 47K ohm pull-up resistors to the CLK and RESET lines (coming in from the PC Engine), and a 10uF tantalum capacitor across the 5V rails supplied by the PC Engine. That seemed to help a bit, but it didn't solve the problems.

A few ideas for reasons it failed are:
1) Possibly the response from the MCU is too slow within the program. I'm not convinced this is the case, but I guess I can slow down the clocking on the PC Engine program side by adding a microsecond or so. This probably isn't the case, as I was able to go faster when it was driven directly by a 3.3V MCU right next to it.
2) Maybe the level-shifters are introducing latency or some kind of slew rate issue; I am not familiar with their characteristics. They are these: www.adafruit.com/product/757 . The transfer is happening at only about 100KHz, so according to the description, this may not be the problem.
3) Perhaps the 5V supplied by the PC Engine, after travelling down the joypad cable (which I got from Console5 - it's about 51" or 1.3m long), is not solid enough to drive the needed 5V signal back... so, the 10uF capacitor helped, but a second one didn't make an additional difference. The original MB128's cable is much shorter, and the batteries were always there to add stability.
I must confess, I am a bit short of knowledge on specifics in many areas of electronics - though many of the concepts are there in "broad strokes" form.
- need to learn about the dark art of passives (when are they needed ? how to select values ?)
- need to learn how to use my new oscilloscope (they were way too expensive when I was a kid, back when I could last have used one)
I am expecting to break out the oscilloscope at some point, but I have a few things to learn there too. Any hints/suggestions ?
|
|
|
Post by dshadoff on Apr 28, 2019 22:45:33 GMT
I still need some advice on passives and so on, but the situation has improved at least 80%... I realized that I hadn't grounded two of the 74HC157's inputs (unused on the MB128 side), so the values read back during the detection routine were effectively random. They aren't random anymore, though there is still a small bit of odd behaviour.
It is getting recognized by PC Engine software now ! Getting closer !
|
|
|
Post by elmer on Apr 29, 2019 0:08:56 GMT
- I have done additional measurements with reduced logging, and compared against PC Engine for complete 128KB reads: PC-Engine: 10.5 seconds Microcontroller: 5.35 seconds -> so, it should be able to keep up Just letting you know ... after a bit of optimization, my PC-Engine code is reading 128KB from the MB128 in 6.1s.
|
|
|
Post by dshadoff on Apr 29, 2019 2:01:22 GMT
- I have done additional measurements with reduced logging, and compared against PC Engine for complete 128KB reads: PC-Engine: 10.5 seconds Microcontroller: 5.35 seconds -> so, it should be able to keep up Just letting you know ... after a bit of optimization, my PC-Engine code is reading 128KB from the MB128 in 6.1s.

Interesting... what did you cut out ? And is that representative of the code within games (i.e. from the Hudson library routines) ? I was checking the data sheet on the pulse-width timing, and the data-hold timing, and those were either in-spec or barely at-spec, depending on whether the memory refresh is being performed internally or externally (which I haven't looked into yet). But I realize that there's possibly other waste going on in-between the clock signals, so that's why I'm asking... It is possible that my microcontroller code is on the borderline. But I'm not convinced - I added a 10uF capacitor across the 3.3V rails as well, and now it's always recognized, with an occasional dropout. So, getting better... I could always move up to an ESP32, but they run a little hot...
|
|
|
Post by elmer on Apr 29, 2019 3:27:38 GMT
Interesting... what did you cut out ? And is that representative of the code within games (i.e. from the Hudson library routines) ?

I cut out most of the (seemingly) pointless delay *between* the bits. The actual data setup and hold times for each bit should be at least as long as in the original code (I believe). This works in-practice, and AFAIK, I'm not violating any timings, but please tell me if you know differently. Anyway, here you go (all of the savings are in this inner-loop) ...

;
; __ax = ptr to destination (page-aligned in MPR3)
; X = length / 256
; Y = length / 65536
;
mb1_recv_pages: stz IO_PORT     ; CLR lo, SEL lo (buttons).
                phy             ; Wait 9.

.page_loop:     phx
                cly

.byte_loop:     lda #$80

.bit_loop:      tax
                lda #2          ; CLR hi, SEL lo (reset).
                sta IO_PORT     ;
                pha             ; Wait 9.
                pla             ;
                nop             ;
                lda IO_PORT     ; Read while in reset state.
                lsr a           ;
                stz IO_PORT     ; CLR lo, SEL lo (buttons).
                txa             ; Wait 9.
                ror a
                bcc .bit_loop

                sta [__ax],y
                iny
                bne .byte_loop

                inc __ah
                bpl .next_page
                tma3
                inc a
                tam3
                lda #$60
                sta <__ah

.next_page:     plx
                dex
                bne .page_loop

                ply
                dey
                bpl mb1_recv_pages
                rts
|
|
|
Post by dshadoff on Apr 29, 2019 3:33:56 GMT
OK, one more update and I'm done for today...
I put a little more error detection in my PCE test program, to indicate where it is failing, and it turns out that it was failing to respond adequately to the mb_detect; a '0xff' was being read on the PCE side rather than 0x04. This was suspicious, because I tied two pins of the 74HC157 low on the mb128 side, so the only conclusion is that the 74HC157 was not being switched properly (or fast enough ?) upon 0xA8 recognition. Still suspicious, because this read would have been several microseconds after that.
I updated the detect() output with the correct bit values as per the other conversation about protocol (that it's related to the value in the DATA line, rather than the sequence). I also tried to move as many of the port output updates a few nanoseconds earlier, just in case. In one case, I was able to move an update one state earlier.
These things didn't have an effect - the device was being recognized 98% of the time, and appeared to be read OK in those instances, but once in a while, it was failing - especially when I tried to read many sectors in a row; it would fail about 10 sectors in.
I tried pull-down resistors, but that didn't help.
Finally, I think I've got it working - and here's how:
- I had some improvement initially when I added a 10uF capacitor across the 5V rail from the PCE
- Later (and I didn't mention this), I had more improvement when I put a 10uF capacitor across the 3.3V rail - despite the fact that it has a good regulated supply from the microcontroller, and it's only a few centimeters away.
- This time, I moved the 10uF capacitor from the 5V rail to join the other one on the 3.3V rail, and it is always being recognized now.
I'm not sure why this helped, but it must have something to do with current draw by the level-shifters.
I would have just used larger capacitors, but those are the only ones I had handy. I'll go looking for 47uF tantalums for this project, across both rails.
...I still need to test data accuracy across many read cycles, but I will write a PC Engine program to do so. And, once happy, I will need to test with other software. And probably I should try other level-shifters since these are bulky and not so cheap.
I still feel like I'm missing something on the side of the passives...
|
|
|
Post by dshadoff on Apr 29, 2019 12:08:42 GMT
I cut out most of the (seemingly) pointless delay *between* the bits. The actual data setup and hold times for each bit should be at least as long as in the original code (I believe). This works in-practice, and AFAIK, I'm not violating any timings, but please tell me if you know differently.

Hmmm... looks good. I only have two thoughts...
1) The memory device timings should be here, look at pages 5 through 8: datasheet.datasheetarchive.com/originals/distributors/Datasheets-23/DSA-447524.pdf I believe the device is in “self refresh” mode (but need to poke at the internals to be sure), and I doubt that we can assume “fast” mode is available.
2) If this works on real hardware, I would still need to test on mine. While my hardware works (and is pretty fast) in the general case, it won’t succeed if my “worst case” path fails to detect both high and low signals on the (abbreviated) clock cycle. I still need to take another look at that, but I’ll post the code here soon (maybe tonight if I get time). Of course, FPGA-like hardware would not have this concern at all.
|
|
|
Post by elmer on Apr 29, 2019 20:47:24 GMT
I was checking the data sheet on the pulse-width timing, and the data-hold timing, and those were either in-spec or barely at-spec, depending on whether the memory refresh is being performed internally or externally (which I haven't looked into yet).

Yep, I think that we can assume that the older-model MSM6389 that is used in the MB128 didn't have the "fast" mode that's in the MSM6389C. I agree that it's also likely that the device is set to "self refresh" mode, and that we should look at those timings. The critical ones (IMHO) are the 4000ns minimum cycle time, and the 3000ns minimum RWCK pulse width. I presume that we agree that the joypad's CLR line is being used to drive the RWCK pin? That seems to match how the code is written. This is the code from Private EyeDol, which I believe is considered to be a reliable game in its use of the MB128 ...

$70:B09E  mb1_recv_bit: stz IO_PORT  ;
$70:B0A1                pha          ;
$70:B0A2                pla          ;
$70:B0A3                nop          ;
$70:B0A4                lda #$02     ; CLR hi (RWCK lo, start of access cycle)
$70:B0A6                sta IO_PORT  ; 5
$70:B0A9                pha          ; 3
$70:B0AA                pla          ; 4
$70:B0AB                nop          ; 2 = 14 cycles = 1955ns
$70:B0AC                lda IO_PORT  ; 5
$70:B0AF                and #$01     ; 2 = 21 cycles = 2933ns
$70:B0B1                stz IO_PORT  ; CLR lo (RWCK hi, end of access cycle)
$70:B0B4                pha          ;
$70:B0B5                pla          ;
$70:B0B6                nop          ;
$70:B0B7                rts          ;
The important thing here is that the RWCK cycle time is actually 2933ns, and so slightly overclocked from spec, and that the access time is 1955ns, and so definitely assumes that the MSM6389 is doing better than its "worst-case" timing. Anything outside those core 7 instructions should be pretty-much irrelevant to the MSM6389, as long as the rest of the code takes at least 1000ns to meet the 4000ns minimum for the total cycle time.
|
|
|
Post by dshadoff on Apr 29, 2019 22:11:45 GMT
I agree that it's also likely that the device is set to "self refresh" mode, and that we should look at those timings. The critical ones (IMHO) are the 4000ns minimum cycle time, and the 3000ns minimum RWCK pulse width. I presume that we agree that the joypad's CLR line is being used to drive the RWCK pin? That seems to match how the code is written.

Yes, I agree, but the clock seems inverted. I’m not sure if there is propagation delay introduced by that; could be ~70-100ns if so.

This is the code from Private EyeDol, which I believe is considered to be a reliable game in its use of the MB128 ...

$70:B09E  mb1_recv_bit: stz IO_PORT  ;
$70:B0A1                pha          ;
$70:B0A2                pla          ;
$70:B0A3                nop          ;
$70:B0A4                lda #$02     ; CLR hi (RWCK lo, start of access cycle)
$70:B0A6                sta IO_PORT  ; 5
$70:B0A9                pha          ; 3
$70:B0AA                pla          ; 4
$70:B0AB                nop          ; 2 = 14 cycles = 1955ns
$70:B0AC                lda IO_PORT  ; 5
$70:B0AF                and #$01     ; 2 = 21 cycles = 2933ns
$70:B0B1                stz IO_PORT  ; CLR lo (RWCK hi, end of access cycle)
$70:B0B4                pha          ;
$70:B0B5                pla          ;
$70:B0B6                nop          ;
$70:B0B7                rts          ;
The important thing here is that the RWCK cycle time is actually 2933ns, and so slightly overclocked from spec, and that the access time is 1955ns, and so definitely assumes that the MSM6389 is doing better than its "worst-case" timing. Anything outside those core 7 instructions should be pretty-much irrelevant to the MSM6389, as long as the rest of the code takes at least 1000ns to meet the 4000ns minimum for the total cycle time.

Oh, if it uses the older MSM6389, then we should be using the correct datasheet, which is: datasheet.datasheetarchive.com/originals/distributors/Datasheets-112/DSAP0054041.pdf Don’t worry, most of the (critical) timings seem the same. The access time seems to be 3000ns from clock-edge though (re: read port), and this doesn’t take into account any propagation delays introduced by the custom NEC chip. I don’t think you’re pushing it really any further than the Hudson code did, but it has the possibility of bad reads due to being slightly beyond spec. It would be helpful if the device itself did checksumming, to compare against your own values.
|
|
|
Post by elmer on Apr 30, 2019 0:49:27 GMT
Don’t worry, most of the (critical) timings seem the same. The access time seems to be 3000ns from clock-edge though (re: read port), and this doesn’t take into account any propagation delays introduced by the custom NEC chip.

I'm so glad that you found the original part's data sheet, that's really helpful! The whole propagation delay issue is definitely a bit of a concern. Here's an idea for you, and it might even go some way towards explaining the mysterious 2 bits that are read at the end of a write command. I'm wondering if the ASIC chip actually contains its own one bit (or possibly two) shift registers connected to the MSM6389's DIN and DOUT pins? Then it would just start the read operation while receiving the last bit of the length ... and it would explain why there is supposed to be an extra read after the end of the write command (to flush the last bit out to the MSM6389). Putting in a shift register like that would isolate most of the critical timing issues away from the PC Engine, and make it the ASIC's responsibility.

I don’t think you’re pushing it really any further than the Hudson code did, but it has the possibility of bad reads due to being slightly beyond spec.

Yeah, I've counted the cycles and made sure that I'm not pushing the RWCK pulse any faster than Private EyeDol's code does. It's easy to add a couple of extra cycles of delay, if I feel paranoid later on.
|
|
|
Post by dshadoff on Apr 30, 2019 1:36:07 GMT
The whole propagation delay issue is definitely a bit of a concern. Here's an idea for you, and it might even go some way towards explaining the mysterious 2 bits that are read at the end of a write command. I'm wondering if the ASIC chip actually contains its own one bit (or possibly two) shift registers connected to the MSM6389's DIN and DOUT pins? Then it would just start the read operation while receiving the last bit of the length ... and it would explain why there is supposed to be an extra read after the end of the write command (to flush the last bit out to the MSM6389). Putting in a shift register like that would isolate most of the critical timing issues away from the PC Engine, and make it the ASIC's responsibility.

I would go so far as to say that it's almost certain that there's at least one shift register inserted - this ASIC has to separate the clock into multiple electrical signals going to the chip, as the diagram in the datasheet illustrates. Those signals are often slightly offset (or a half-cycle apart) from each other. What I'm less certain about is whether the ASIC fixes the timing/duration of the clock pulses; that's a little more difficult to do. It would need a time base generator which would be off-chip (capacitors were infamously difficult to put on-chip back then). I'm not saying this isn't the case; just that I'd need more detail before being persuaded.
|
|
|
Post by dshadoff on Apr 30, 2019 4:06:27 GMT
...And here's my code. It works, and it's relatively streamlined, though I'm sure I could shave a few cycles off here and there. MB128_emu_20190429.zip (3.67 KB)
|
|
|
Post by dshadoff on May 5, 2019 21:25:20 GMT
I tested many things today, and learned a lot (not all of it was good news... but that's what learning is all about).
I decided that it might be easier to take the 5V inputs from the PC Engine, make a regulated 3.3V supply, and level-shift everything from that demarcation point. Here's what I found:
1) I tested a standard PC Engine controller on a 3.3V based circuit, and it worked fine. The logic should be good down to 2V, but it becomes slower at those levels. 3.3V is fine though - propagation delay should go from ~10ns to ~50ns, which is still easily within expectations.
2) When I ran this using a LD1117-3.3 for voltage regulation and 2x74LVC4245's for level-shift (1 shifting down, 1 shifting up), I found that the power consumption was a lot more than I expected - about 5.3mA ... but after checking various spots in the circuit, I realized that 5mA of that consumption was the voltage regulator itself (confirmed by the datasheet). I can easily find an alternative regulator with less than 100uA drain, so that's an easy problem to fix.
3) Using the same 74HC157 to toggle inputs between joypad and M4 board, I was able to drive the M4 board on the regulated power coming down from the controller port. Total consumption was around ~29.5mA (which is higher than I would like), but it could perform simple operations properly.
4) I realized that the 74HC157 has a little more propagation delay than I would like at 3.3V - somewhere close to 50ns at 3.3V rather than the 10ns at 5V. This doesn't sound like a big deal, but it's several percent of the total timing budget for response. (The 74LVC4245's only take about 6ns to shift down, and the same to shift back up.) I will replace the 74HC157 with a 74LVC157, which will bring down that delay from ~50ns to less than 5ns.
5) Using the circuit described above, I was able to load individual sectors, but when loading streams of sectors, I would inevitably get errors at some point - errors in getting the attention of the mock-MB128 with a '0xA8'. Under the assumption that this problem was related to power management or slow signal transition, I started playing with pull-up (and later pull-down) resistors on the inbound signals, and capacitors at the power input to the individual chips... but none of this did much of anything to the overall reliability.
...I eventually had to conclude that perhaps my mock-MB128 code just wasn't able to keep up. (I probably should have considered this option earlier, but it seemed to read single sectors without issue.)
So, to test this theory, I went back to Arduino and optimized the code with '-O2'. This did help a bit with the reliability, but only slightly. Instead of failing in less than 8 sectors, it was failing at about 20 sectors in. On the bright side, this surprisingly managed to bring power consumption down by about 3 mA ! (Down to just over 26mA)
For this particular board, there are some 'experimental' features on the Arduino IDE - you can also overclock the CPU. So, I went up from 120MHz to 150MHz... and power consumption went up into the 32-33mA region, but the thing was now reliably being read. I took out all of the pull-up/pull-downs and extra capacitors one by one, and it remained reliable (at least against my test code). So all along, it seems that the code on the microcontroller was right on the borderline between keeping up and falling behind.
So, I may be able to optimize the code a bit in order to make it work, or I can run it overclocked, or I can try to supplement with some additional electronics, or replace it with an FPGA... but whatever the case, it's just not ready yet.
Since I got it working pretty well without any need for the extra passive components, I feel pretty good about the 5V-3.3V level-shift part of the circuit, and I think I'll try to put together a rev.2 of the breakout board I had started... it should make it easier to do more varied and complicated experiments on the controller port.
Dave
|
|
|
Post by dshadoff on May 11, 2019 12:58:09 GMT
A few more tidbits...
1) While I previously said that my code is too slow, that may or may not be true on a widespread basis. There are many paths through that code, and only one of them needs to be slow in order for the bitstream to fall behind. I may put some energy into identifying how fast/slow each code path is.
2) There's still some sort of a logic problem (I think) in my program. When I use the Emerald Dragon MB128 utility to format, then save the memory multiple times on a real MB128, I can replicate the contents into "Bank 00" then "Bank 01", etc. On my board, once written to "Bank 00", it seems like the Emerald Dragon utility somehow isn't getting a necessary confirmation of some sort (or I'm not writing correctly to a bank) - because each time I try to write a new copy to my mock-MB128, it keeps thinking that there is no memory in use, so let's write it to "Bank 00". This still happened even when the M4 was clocked at 180MHz...
I also looked at the PCE Mouse code, and concluded that it can't be simulated in this way by a simple microcontroller running at these speeds. Because the read strobe pulses are so short, it would need some sort of external logic noticing state transition, like a flip-flop or something. But I really don't want to add discrete parts, so I would probably make the jump to full-blown FPGA - possibly transitioning completely from MCU to FPGA (and implementing a CPU on the FPGA if need be).
There is one possible exception... The Cortex M4 contains a provision for "Configurable Custom Logic" (CCL) units, which appear to have flip-flops in them which could help. They are obscure, difficult to configure, and limited in number, but I might be able to do something with them (at the cost of making the project even more processor-specific).
On the other hand, the Lattice ICE40UP FPGAs seem to be capable of just about everything I need; I just need to start playing with Verilog, and eventually may need to bite the bullet on dealing with super-small TQFP packages.
|
|