Software-Managed Caching for Future Kestrel-3??

A project log for Kestrel Computer Project

The Kestrel project is all about freedom of computing and the freedom of learning using a completely open hardware and software design.

Samuel A. Falvo II • 06/17/2016 at 21:28 • 2 Comments

The current vision for the Kestrel-3 targets a 6 MIPS processor, primarily because it'll be fetching instructions over a 12.5MHz, 16-bit bus and will not have any cache hardware on-board: every 32-bit instruction fetch takes two bus transactions, which caps the processor at roughly 6.25 million instructions per second. This bottleneck exists because it (for now, at least) makes for a very simple implementation. But, eventually, I'd like to drive the CPU to faster speeds. Ideally, to 50MHz.

Presently, the CPU addresses all of external memory physically. That is, if I tell the CPU to read a byte from $0E00000000000001, it will read from $0E00000000000001 on the external bus. This will get chopped down to $0E000001 on the Backbone bus (since I only expose 32 bits of address space there), and from there, external circuitry will respond to this transaction in the usual way you're familiar with from Z-80 or 6502 circuits. No surprises there.

However, this also means things are pretty slow. 32-bit accesses need two bus transactions at a minimum, so software will run fastest if everything can be kept to 8 or 16 bits. Even then, you're incurring wait-states like crazy, since all RISC-V instructions are 32 bits wide. I will need to fetch instructions and load and store data from an internal scratchpad RAM if I want to run faster than 12.5MHz with no wait-states. The problem is, the FPGA I'm planning on purchasing has only 10KiB of internal block memory. Thus, some mechanism is required to map an address like $0E00000000000001 to something narrower, like $0C81.
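To make that concrete, here's a minimal sketch in C of what that narrowing could look like in software. Everything here is hypothetical: I'm assuming 256-byte lines, 40 slots (10KiB / 256B), and a fully associative lookup, which is trivial for software even though it would be expensive in hardware:

```c
#include <stdint.h>

#define LINE_SIZE   256u   /* assumed; anywhere from 64 to 256 bytes would do */
#define LINE_SHIFT  8
#define NUM_SLOTS   40u    /* 10KiB of block RAM divided into 256-byte slots */

/* Hypothetical per-slot metadata, maintained by the miss handler. */
struct slot {
    uint64_t tag;          /* physical address >> LINE_SHIFT */
    uint8_t  valid;
    uint8_t  dirty;
};

static struct slot slots[NUM_SLOTS];

/* Translate a wide physical address into a narrow scratchpad offset
 * (slot * LINE_SIZE + offset-within-line), or return -1 on a miss. */
static int32_t narrow(uint64_t paddr)
{
    uint64_t tag = paddr >> LINE_SHIFT;
    for (uint32_t i = 0; i < NUM_SLOTS; i++) {
        if (slots[i].valid && slots[i].tag == tag)
            return (int32_t)(i * LINE_SIZE + (paddr & (LINE_SIZE - 1u)));
    }
    return -1;  /* line fault: trap to machine mode and fill a slot */
}
```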

Traditionally, this is done with a memory management unit (MMU), most frequently using a technique called paging. General-purpose computing architectures today all seem to agree on 4KiB pages; indeed, the MMU specified in v1.7 of the RISC-V supervisor specification currently sets the page size to 4KiB. Obviously, with such a small amount of physically addressable memory visible to the CPU, we'd want something smaller, which is why hardware caches use 16-, 32-, or 64-byte "cache lines." Besides, burst transactions to external RAM are much, much faster than waiting for blocks of data from a hard drive, so these smaller transfer sizes don't cost much. However, those line sizes all assume a hardware cache controller; the only reason I'd ever consider 128- or 256-byte transfers is to help amortize the additional overhead of emulating a cache controller in software.
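A quick back-of-the-envelope run shows the amortization effect. The per-miss trap overhead below is a pure assumption; the DMA rate assumes 16 bits moved per 12.5MHz bus cycle:

```c
#include <stdio.h>

int main(void)
{
    const double trap_cycles     = 200.0; /* assumed fixed software overhead per miss */
    const double bytes_per_cycle = 2.0;   /* 16-bit bus, one transfer per cycle */

    /* Larger lines spread the fixed trap cost over more bytes moved. */
    for (int line = 16; line <= 256; line *= 2) {
        double dma_cycles = line / bytes_per_cycle;
        double per_byte   = (trap_cycles + dma_cycles) / line;
        printf("%3d-byte line: %6.2f cycles/byte\n", line, per_byte);
    }
    return 0;
}
```

With these made-up numbers, a 16-byte line costs about 13 cycles per byte while a 256-byte line costs about 1.3, which is the whole argument for the bigger transfers.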

The Lattice iCE40HX4K FPGA comes with 10KiB of RAM on-board, which isn't much; however, it seems pretty useful as local cache memory (the 68020 only had 256 bytes back in the day, and it made a measurable difference!). Hardware cache controllers are insane to get right, though, and the off-the-shelf cores I've seen seem to use up a ton of logic. Maybe I'm looking in the wrong places; but with LUTs at a premium on the iCE40 FPGAs, anything I can do to avoid a massive hardware investment is of interest to me.

So, I'm thinking of implementing caching in software by using a fine-grained memory protection unit, something that protects memory down to, I dunno, say 64 to 256 bytes.

Here's how I see things working.

First, I'd expose the 10KiB block of memory as another peripheral in the Kestrel's I/O allotment, along with a set of control registers. Further, it would only be accessible when the CPU is running in machine-mode. When the computer first boots, or whenever the CPU is running in machine-mode, the MMU is turned off (as you'd expect), meaning the CPU will encounter a ton of wait-states as it attempts to fetch instructions from external memory. I expect the CPU to maintain close to 6 MIPS performance during this cold-boot phase.
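None of this hardware exists yet, but as a sketch, the peripheral's programming model might look something like the fragment below. Every address, register, and field here is made up for illustration:

```c
#include <stdint.h>

/* Hypothetical I/O slot for the cache peripheral; machine-mode only. */
#define CACHE_BASE  0x0F00000000000000ull

typedef volatile uint64_t ioreg_t;

#define CACHE_RAM       ((volatile uint8_t *)(CACHE_BASE + 0x0000)) /* 10KiB block RAM */
#define CACHE_DMA_ADDR  (*(ioreg_t *)(CACHE_BASE + 0x4000)) /* external base address */
#define CACHE_DMA_LOCAL (*(ioreg_t *)(CACHE_BASE + 0x4008)) /* scratchpad offset */
#define CACHE_DMA_LEN   (*(ioreg_t *)(CACHE_BASE + 0x4010)) /* bytes; write starts DMA */
#define CACHE_DMA_STAT  (*(ioreg_t *)(CACHE_BASE + 0x4018)) /* bit 0: DMA busy */
#define CACHE_CTRL      (*(ioreg_t *)(CACHE_BASE + 0x4020)) /* bit 0: protection unit on */
```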

One of the steps taken by the system firmware would be to initialize the "line" fault handler. Once this is done, the MMU is enabled by de-escalating to supervisor mode.
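In RISC-V terms, that hand-off might look like the sketch below. I'm using the CSR names and MRET semantics from the later privileged specs (the v1.7 draft current at the time named some of these differently), and line_fault_handler is assumed to be an assembly stub that saves registers, calls the C handler, and executes MRET:

```c
#include <stdint.h>

extern void line_fault_handler(void);  /* assumed machine-mode trap stub */

/* Install the line-fault handler, then de-escalate to supervisor mode;
 * from that point on, every fetch outside the scratchpad traps back here. */
static void enable_soft_cache(void (*supervisor_entry)(void))
{
    asm volatile ("csrw mtvec, %0" :: "r"(line_fault_handler));
    asm volatile ("csrw mepc,  %0" :: "r"(supervisor_entry));

    /* Set mstatus.MPP (bits 12:11) to 01 = supervisor, so MRET drops down. */
    uint64_t mstatus;
    asm volatile ("csrr %0, mstatus" : "=r"(mstatus));
    mstatus = (mstatus & ~(3ull << 11)) | (1ull << 11);
    asm volatile ("csrw mstatus, %0" :: "r"(mstatus));

    /* (Turning on the protection unit via its control register goes here.) */
    asm volatile ("mret");  /* never returns; next fetch is in supervisor mode */
}
```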

At this point, the next instruction fetch will cause a cache line miss (since only the cache line handler has been initialized, not the actual cache state). The handler takes over (running in machine-mode), looks at the fault address, and uses it to load a base address and length into a DMA engine. The DMA engine then reads the required cache line into local memory. Meanwhile, the CPU can update its metadata: accessed and dirty bits, cache line mapping registers, and so on. When the metadata update and the DMA have both completed (ideally at about the same time), we return from the exception handler. Back in supervisor- or user-mode, the processor state is now set up so that the faulting instruction restarts, and it should proceed without issue and at maximum bandwidth.
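Sketched in C, the handler's body might look like this. It reuses the hypothetical slot table and DMA registers from the earlier fragments (declared as externs here for brevity), and pick_victim() is an assumed policy hook; round-robin over the unlocked slots would do:

```c
#include <stdint.h>

#define LINE_SIZE     256u
#define NUM_SLOTS     40u
#define DMA_WRITEBACK (1ull << 63)  /* hypothetical direction bit */

struct slot { uint64_t tag; uint8_t valid, dirty; };
extern struct slot slots[NUM_SLOTS];

extern volatile uint64_t CACHE_DMA_ADDR, CACHE_DMA_LOCAL,
                         CACHE_DMA_LEN, CACHE_DMA_STAT;

extern uint32_t pick_victim(void);  /* assumed replacement policy */

void handle_line_fault(void)
{
    uint64_t paddr;  /* fault address: mtval today, mbadaddr in the v1.7 draft */
    asm volatile ("csrr %0, mtval" : "=r"(paddr));

    uint64_t tag = paddr / LINE_SIZE;
    uint32_t v   = pick_victim();

    if (slots[v].valid && slots[v].dirty) {
        /* Flush the victim line back to external RAM first. */
        CACHE_DMA_ADDR  = slots[v].tag * LINE_SIZE;
        CACHE_DMA_LOCAL = v * LINE_SIZE;
        CACHE_DMA_LEN   = LINE_SIZE | DMA_WRITEBACK;
        while (CACHE_DMA_STAT & 1)
            ;  /* wait for the writeback to finish */
    }

    /* Start the fill... */
    CACHE_DMA_ADDR  = tag * LINE_SIZE;
    CACHE_DMA_LOCAL = v * LINE_SIZE;
    CACHE_DMA_LEN   = LINE_SIZE;

    /* ...and update metadata while the DMA runs in the background. */
    slots[v].tag   = tag;
    slots[v].valid = 1;
    slots[v].dirty = 0;
    /* (the hardware mapping register for slot v would be updated here too) */

    while (CACHE_DMA_STAT & 1)
        ;  /* ideally already done by the time we get here */

    /* Returning (MRET in the assembly stub) restarts the faulting instruction. */
}
```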

With 10KiB at my disposal, I can actually go one step further and pre-load 2KiB of the system firmware into the cache and "lock" it. This would make exception handling much faster, and it leaves 8KiB for "normal" cache purposes. Remembering that the 68040 had a total of 8KiB of cache (albeit split into separate 4KiB chunks for data and code), it seems reasonable that this arrangement would bring the CPU to levels of performance on par with the MC68040. According to Wikipedia, I can reasonably expect the CPU to execute about 100 instructions on average before incurring a cache miss, so if the handler overhead plus the time taken to execute 100 instructions from a hot cache comes in under about 16.7 microseconds (100 instructions at a 6 MIPS rate), it's a net win. I would be happy with this outcome if it meant I could get by with a reduced hardware investment.
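The arithmetic behind that break-even claim, as a runnable check. The 100-instructions-per-miss figure is the rule of thumb cited above, and 50 MIPS assumes one instruction per cycle at 50MHz:

```c
#include <stdio.h>

int main(void)
{
    const double instrs_per_miss = 100.0;                  /* rule of thumb */
    const double baseline_s = instrs_per_miss / 6e6;       /* ~16.7 us at 6 MIPS */
    const double hot_s      = instrs_per_miss / 50e6;      /*   2.0 us at 50 MIPS */

    /* Whatever time the hot cache saves is the budget the miss handler
     * (trap overhead plus DMA fill) may consume and still break even. */
    double budget_us = (baseline_s - hot_s) * 1e6;
    printf("handler budget per miss: %.1f us (%.0f cycles at 50MHz)\n",
           budget_us, budget_us * 50.0);
    return 0;
}
```

That works out to roughly 14.7 microseconds, or about 730 cycles at 50MHz, per miss before the software-managed cache loses to the plain wait-state design.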

I figure this would be a worthy, and low-cost, experiment to play with. It's premature to try now, of course, but I figure this would be a project for much later down the road.

Discussions

Ed S wrote 06/27/2016 at 05:43 point

Interesting, and worth a try. IIRC a rule of thumb is that the cache miss rate halves each time you double your cache size. But probably you care most about speeding up small kernels of code.


Samuel A. Falvo II wrote 06/27/2016 at 11:46 point

I'm interested in speeding up all code, as much code as I can.  However, I'm limited by the block RAM resources in the FPGA.
