CPU Work Continues...

Over my vacation, I managed to get a large portion of the CGIA components written, and I also started implementing the CPU pipeline stages. As I mentioned in my recent log, I've decided on the classic Hennessy & Patterson "5-stage pipeline" design for my RISC-V implementation.

Work on that pipeline design was coming along quite nicely until I tried to implement the JAL/JALR instructions. The problem is that I didn't want a bunch of special-purpose buses running everywhere, since they use up valuable resources inside the FPGA. If a bus was only going to be used by one or two instructions out of a complete set of about 80, I didn't feel it was justified. However, that was the POV of working through the pipeline from entry to exit (that is, designing it from instruction fetch to writeback).

Working in the opposite direction, though, is enlightening. Many of those special-purpose buses turn out to be required if you want proper pipeline behavior and want to retain the desirable characteristic of one-instruction-per-clock throughput. But it does something else that turns out to be more important, I think: it makes obvious which signals you actually need (versus what you think you need) from the previous pipeline stage(s). This simplifies the design of the pipeline as a whole. Whether or not that offsets the resource consumption of these weird, one-off buses, I couldn't tell you; my hunch is that it doesn't. But at least the operation should be correct, which, at this stage of development, is more important.
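To make the working-backwards idea concrete, here is a minimal Python sketch that derives what each pipeline latch must carry by walking the stages from writeback toward fetch. The stage names follow the 5-stage pipeline, but every signal name (`insn`, `rs1_val`, `mem_addr`, and so on) and the `Stage`/`latch_requirements` helpers are hypothetical illustrations, not the actual design.

```python
from dataclasses import dataclass


def signals(*names):
    """Build an immutable set of (hypothetical) signal names."""
    return frozenset(names)


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: frozenset  # signals this stage reads from its input latch
    produces: frozenset  # signals this stage generates on its own


def latch_requirements(stages):
    """Map each stage name to the signals its input latch must carry.

    Walks from the last stage back to the first: a stage's input latch holds
    everything the stage consumes, plus anything a later stage needs that
    this stage does not produce itself (those must be passed through).
    """
    needed_downstream = frozenset()
    requirements = {}
    for stage in reversed(stages):
        needed_here = stage.consumes | (needed_downstream - stage.produces)
        requirements[stage.name] = needed_here
        needed_downstream = needed_here
    return requirements


# Hypothetical 5-stage pipeline; signal names are illustrative only.
PIPELINE = [
    Stage("IF", signals(), signals("pc", "insn")),
    Stage("ID", signals("insn"), signals("rs1_val", "rs2_val", "imm", "rd", "op")),
    Stage("EX", signals("rs1_val", "rs2_val", "imm", "pc", "op"),
          signals("result", "mem_addr", "store_data")),
    Stage("MEM", signals("mem_addr", "store_data", "op"), signals("load_data")),
    Stage("WB", signals("rd", "result", "load_data", "op"), signals()),
]

if __name__ == "__main__":
    reqs = latch_requirements(PIPELINE)
    for stage in PIPELINE:
        print(f"{stage.name:>3} input latch: {sorted(reqs[stage.name])}")
```

In this toy model, `pc` shows up in decode's input latch even though decode never uses it. That is exactly the kind of one-off pass-through bus the execute stage's branch and jump logic forces onto earlier stages.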

I also discovered why Bcc and JAL/JALR instructions are so damn annoying to work on. Bcc instructions operate on four inputs, not the typical two. They are:

Left-hand comparison register value
Right-hand comparison register value
The branch instruction's address
The branch displacement (used only if the branch is taken)
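The four-input shape of a conditional branch can be sketched as a small Python behavioral model. This assumes RV32I semantics, 32-bit wraparound, and a displacement already sign-extended by decode; the function names are hypothetical, and this is not HDL.

```python
MASK32 = 0xFFFF_FFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x8000_0000 else value


def branch_taken(cond, lhs, rhs):
    """Evaluate an RV32I conditional branch on two register values."""
    lhs_u, rhs_u = lhs & MASK32, rhs & MASK32
    lhs_s, rhs_s = to_signed32(lhs), to_signed32(rhs)
    if cond == "BEQ":
        return lhs_u == rhs_u
    if cond == "BNE":
        return lhs_u != rhs_u
    if cond == "BLT":
        return lhs_s < rhs_s
    if cond == "BGE":
        return lhs_s >= rhs_s
    if cond == "BLTU":
        return lhs_u < rhs_u
    if cond == "BGEU":
        return lhs_u >= rhs_u
    raise ValueError(f"unknown branch condition: {cond}")


def execute_branch(cond, lhs, rhs, pc, disp):
    """Model the four inputs of a Bcc instruction in the execute stage.

    The comparator (lhs, rhs) and the target adder (pc, disp) work in
    parallel; fetch only uses the target when the branch is taken.
    Returns (taken, target).
    """
    taken = branch_taken(cond, lhs, rhs)
    target = (pc + disp) & MASK32
    return taken, target
```

The two halves never share operands: the comparator only ever sees register values, while the target adder only ever sees the PC and the displacement, so both pairs must arrive from decode at the same time.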

Thankfully, nothing is stored in any destination register. But the JAL/JALR instructions require six inputs, assuming we want to reuse the conditional branch hardware for unconditional purposes (I treat JAL(R) instructions as special cases of BEQ):

Left-hand comparison value (can be any value; doesn't matter)
Right-hand comparison value (must be the same as the left-hand value)
The jump instruction's address
The jump displacement
The destination register
The constant "4", added to the instruction's address to calculate the return address stored in the destination register
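Here is a minimal Python sketch of treating JAL as a special case of BEQ, assuming a hypothetical shared execute path in which the equality comparator, the target adder, and the return-address adder all evaluate in parallel. All names are illustrative. It covers JAL only: per the RISC-V specification, JALR computes its target as rs1 plus the sign-extended immediate with the lowest bit cleared, so the target adder would also need a way to select rs1 instead of the PC, which this sketch leaves out.

```python
MASK32 = 0xFFFF_FFFF


def execute_beq_path(lhs, rhs, pc, disp, link_const=4):
    """Hypothetical shared execute-stage path for BEQ and JAL.

    Three units evaluate in parallel: the equality comparator (lhs, rhs),
    the target adder (pc, disp) and the return-address adder (pc, link_const).
    Returns (taken, target, link).
    """
    taken = (lhs & MASK32) == (rhs & MASK32)
    target = (pc + disp) & MASK32
    link = (pc + link_const) & MASK32
    return taken, target, link


def execute_beq(rs1_val, rs2_val, pc, disp):
    """A real BEQ: compare registers; nothing is written back."""
    taken, target, _ = execute_beq_path(rs1_val, rs2_val, pc, disp)
    return taken, target, None  # None: no register write


def execute_jal(pc, disp, rd):
    """JAL as a special case of BEQ: identical comparator inputs force 'taken'."""
    taken, target, link = execute_beq_path(0, 0, pc, disp)
    return taken, target, (rd, link)  # (destination register, return address)
```

Even in this toy model, counting the arguments that reach `execute_beq_path` plus `rd` gives the six inputs listed above.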

Yikes! Fixed adders, magnitude comparators, and the ALU all operate concurrently and independently of one another, with the ALU and comparators requiring independent inputs from the decode stage. Together, they will conspire to make the execute stage the limiting factor in how fast the CPU can run, as well as the second-largest consumer of LUTs in the FPGA (the register file will be the largest by far).

So far, I have the instruction fetch, a prototypical decode, and the writeback stages written. I'm currently working on the memory stage, and will hopefully get these designs to meet in the middle once I start on the execute stage. However, requirements will back-propagate: the needs of later stages dictate the needs of earlier stages, which means I could end up rewriting some of the stages I've already written.
