
Conquering Clocks and Gremlins

A project log for Dreamdrive64

Open source N64 rom cart based on PicoCart64 and using 2 rp2040 mcus

Kaili Hill, 03/29/2023 at 21:05

PSRAM Challenges

The PSRAM presented numerous challenges throughout development. One issue arose from the SSI clock divider, which could only be set to even integers. When the RP2040s were originally clocked at 266MHz to meet N64 SRAM timings and the divider was set to 2 (resulting in a 133MHz QSPI), games experienced erratic errors and often failed to boot.

A 266/4 configuration (66MHz QSPI) was insufficient for achieving "stock" N64 bus speeds. On the PicoCart-Lite ("v1"), this wasn't a problem, as the 266/2 setting (133MHz QSPI) with short data lines to flash proved reliable and allowed for stock speeds.
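The divider arithmetic behind these configurations can be sketched in a few lines of C. This is only an illustration: `qspi_khz` is a hypothetical helper, not a function from the project or the Pico SDK, and it simply encodes the even-divider restriction described above.

```c
#include <stdint.h>

// Hypothetical helper (not part of the project source): returns the QSPI
// clock in kHz for a given system clock in kHz and SSI clock divider.
// Returns 0 for a zero or odd divider, since only even dividers are usable.
uint32_t qspi_khz(uint32_t sys_khz, uint32_t divider) {
    if (divider == 0 || (divider & 1u) != 0) {
        return 0;
    }
    return sys_khz / divider;
}
```

With these assumptions, `qspi_khz(266000, 2)` gives 133000 (133MHz) and `qspi_khz(266000, 4)` gives 66500 (about 66MHz). Because there is no usable divider between 2 and 4, the only way to land between those two QSPI speeds is to change the system clock itself.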

This project marked a turning point in my learning journey, as my background in software engineering had previously led me to treat hardware as a weekend project. In the past, I had the luxury of "ignoring air resistance," but this endeavor demanded a thorough consideration of such variables.

Reflections, line impedance, terminating resistors, and topology all became familiar terms as a fellow Discord member and I grappled with hardware gremlins. This helpful individual even went so far as to reroute the PSRAM data lines, add terminating resistors, and assist me in resolving hardware issues over several months.

After attempting to overclock the RP2040s to a 360/4 setting (90MHz QSPI), I achieved stock speeds with mostly stable data reliability. Another hardware revision incorporating termination resistors and the 360/4 configuration appeared to be the solution. However, when testing additional boards from my batch, I discovered that at least one of them failed to operate at these frequencies.

The Quest For Data

Some notes on how the n64 bus works. The n64 sends a 32-bit address: the upper 16 bits are sampled on the falling edge of ALEH (address latch enable high), then the lower 16 bits are sampled on the falling edge of ALEL (address latch enable low). There is then a delay before the read line goes low, which is the cart's cue to fetch the data and have 16 bits of data ready on the data lines when the read line is asserted.
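As a rough sketch of the address phase described above, the two 16-bit halves can be combined like this. The function name and parameters are made up for illustration; in the real firmware the halves arrive through the PIO program rather than as function arguments.

```c
#include <stdint.h>

// Hypothetical helper: combine the half latched on the falling edge of ALEH
// (upper 16 bits) with the half latched on the falling edge of ALEL
// (lower 16 bits) into one 32-bit bus address.
uint32_t n64_make_address(uint16_t high_half, uint16_t low_half) {
    return ((uint32_t)high_half << 16) | (uint32_t)low_half;
}
```

For example, a high half of `0x1000` followed by a low half of `0x0040` yields `0x10000040`, which falls inside the cartridge ROM range handled by the read loop.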

The time between ALEL going low and the read line going low depends on the n64 bus speed but can be as short as 1us. This gives us some time to "prefetch" a half-word of data in anticipation of the read line going low.

How long the read line stays low, and the pulse that follows to latch the read data, also depend on the n64 bus speed.

Once we have the address, we have about 1us to prefetch the first half-word. We then wait for the read line to go low.

Here is what that code looks like:

if (last_addr >= 0x10000000 && last_addr <= 0x1FBFFFFF) {
    // Domain 1, Address 2 Cartridge ROM

    // Change the banked memory chip if needed
    tempChip = ((last_addr >> 23) & 0x7) + 1;// psram_addr_to_chip(last_addr);
    if (tempChip != g_currentMemoryArrayChip) {
        g_currentMemoryArrayChip = tempChip;
        // Set the new chip
        psram_set_cs(g_currentMemoryArrayChip);
    }

    // Set the correct read address
    (&dma_hw->ch[dma_chan])->al3_read_addr_trig = (uintptr_t)(ptr16 + (((last_addr - g_addressModifierTable[g_currentMemoryArrayChip]) & 0xFFFFFF) >> 1));

    do {    
        // Wait for value from psram
        while(!!(dma_hw->ch[dma_chan].al1_ctrl & DMA_CH0_CTRL_TRIG_BUSY_BITS)) { tight_loop_contents(); } 

        // Move the value out of the buffer so we can kick off the next fetch
        next_word = dmaValue;

        // Kick off next value fetch in the background
        dma_hw->multi_channel_trigger = 1u << dma_chan;

        // Wait for pio to see read line go low or ALEH happened
        // (0x100 in FSTAT is the RX FIFO empty flag for state machine 0)
        while((pio->fstat & 0x100) != 0) tight_loop_contents();
        addr = pio->rxf[0];

        if (addr == 0) { // if read line was low
            // READ
            pio->txf[0] = next_word;
            last_addr += 2;

        } else if (addr & 0x00000001) {
            // WRITE
            // Ignore data since we're asked to write to the ROM.
            last_addr += 2;
        } else {
            // New address, ALEH is asserted
            break;
        }
    } while (1);
}
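The bank selection expression in the loop above can be checked on its own. The sketch below copies the same arithmetic into a standalone function (the name `bank_for_address` is hypothetical): address bits 23 to 25 pick one of eight 8MB banks, numbered from 1.

```c
#include <stdint.h>

// Same expression as in the read loop: bits 23..25 of the bus address
// select one of eight 8 MiB banks, and the result is numbered from 1.
uint32_t bank_for_address(uint32_t n64_addr) {
    return ((n64_addr >> 23) & 0x7u) + 1u;
}
```

Under this arithmetic `0x10000000` through `0x107FFFFF` map to chip 1, `0x10800000` starts chip 2, and `0x13FFFFFF` is the last address on chip 8. Addresses from `0x14000000` upward wrap back to chip 1.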

While this process seems simple enough, it was difficult to pin down when to kick off the next DMA fetch to maximize the n64's bus speed (e.g. as close to 0x12 as possible).

For slower rp2040 clock speeds, where a smaller divider (e.g. 200/2) gives a faster qspi bus, I found that fetching the next word gave better timings if done AFTER we set `pio->txf[0] = next_word`. The code posted is for 266/4 and comfortably hits 0x20 timings.

Here are my notes while I was testing clock/divider settings and finding the tightest timings that allowed games to be played.

300/4 -> boots 0x1540(336ns) (Moved DMA fetch)-> (0x1C40=448ns) (112ns diff)
22 * qclk = 293.333ns
13 * pclk = 42ns

210/2 -> boots 0x2040(512ns) (Moved DMA fetch)-> (0x1C40=448ns) (64ns diff)
22 * qclk = 209.524ns
64 * pclk = 302ns

180/2 -> boots 0x2E40(736ns) (Moved DMA fetch)-> (0x2240=544ns) (192ns diff)
22 * qclk = 245ns
89 * pclk = 491ns 

160/2 -> boots 0x3D40(976ns) (Moved DMA fetch)-> (0x2740=624ns) (352ns diff)
22  * qclk = 275ns
113 * pclk = 701ns 

140/2 -> boots 0x4D40(1232ns) (Moved DMA fetch)-> (0x3340=816ns) (416ns diff)
22  * qclk = 315ns
129 * pclk = 917ns  
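The arithmetic in these notes can be reproduced with a short sketch. Two things here are inferred rather than stated in the notes: every hex value above matches "upper byte times 16ns" (0x15 = 21, and 21 * 16 = 336), and the pclk figures match what is left after subtracting the 22 qclk cycles from the boot timing. Both function names are hypothetical.

```c
#include <stdint.h>

// Inferred from the notes, not from a specification: the nanosecond figure
// next to each hex value equals the upper byte multiplied by 16 ns.
uint32_t timing_value_to_ns(uint16_t value) {
    return (uint32_t)(value >> 8) * 16u;
}

// Time in nanoseconds taken by `cycles` QSPI clocks when the QSPI clock
// is sys_mhz / divider.
double qspi_cycles_ns(double sys_mhz, unsigned divider, unsigned cycles) {
    double qspi_mhz = sys_mhz / (double)divider;
    return (double)cycles * 1000.0 / qspi_mhz;
}
```

For the 300/4 row, `timing_value_to_ns(0x1540)` gives 336 and `qspi_cycles_ns(300.0, 4, 22)` gives roughly 293.33, leaving about 42ns for everything else, which lines up with the pclk estimate in that row.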

I still haven't figured out exactly where all my clock cycles are being spent when cases like 336/4 and even 330/4 should theoretically have enough time to make the latches. The pclk calculations are guesses based on the known time to fetch data from the psram chips and the tightest n64 bus timings.

I tried DMA'ing into an array using a full word instead of using 16bit DMA reads, consuming the array as the DMA wrote to it. That resulted in even slower bus patch speeds, likely due to memory contention.

When attempting to allow for increased sram read/write timings, I discovered that at 360MHz it takes the rp2040 37ns to read from a statically allocated array. I wrote a small test function that read from the array 1 million times in a loop (`word = array[0];`), timed using `time_us_32()` at the start of the loop with the difference taken once it finished. This seems like a very long time to read data from an array.
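To put that measurement in terms of clock cycles, here is a minimal sketch of the conversion. The helper names are hypothetical, and the elapsed time is passed in as a plain number so the arithmetic can be checked off-device. On the rp2040 it would be the difference between two `time_us_32()` readings.

```c
#include <stdint.h>

// Hypothetical helper: average time per loop iteration in nanoseconds,
// given the total elapsed time in microseconds.
double ns_per_iteration(uint32_t elapsed_us, uint32_t iterations) {
    return (double)elapsed_us * 1000.0 / (double)iterations;
}

// Hypothetical helper: convert a duration in nanoseconds to clock cycles
// at the given system clock in MHz.
double ns_to_cycles(double ns, double sys_mhz) {
    return ns * sys_mhz / 1000.0;
}
```

A 1 million iteration run that takes 37000us works out to 37ns per iteration, and `ns_to_cycles(37.0, 360.0)` is about 13.3 cycles. Since the loop as a whole was timed, that figure likely includes the loop's own increment, compare, and branch as well as the load.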
