
Nearly doubling the performance - 23x original TI-99/4A

A project log for TMS9900 compatible CPU core in VHDL

Retro challenge 2017/04 project to create a TMS9900 compatible CPU core. Again in a month... Failure could be an option...

Erik Piehl, 09/17/2017 at 20:44 (1 comment)

I started to see how I could optimize the CPU.

I looked at my memory interface code in the TMS9900 core and realized I had been using very conservative timings, just to make sure the memory interface would not cause problems while I was debugging the CPU. But now it is time to optimize!

My TI Basic test program:

10 for i=0 to 1000
20 print i;" ";
30 next


It takes 160 seconds on a standard TI, and 11.6 seconds on the previous version of the CPU.
I first tweaked the CPU memory interface on the read side, reducing the number of wait states.
That took me from 11.6s to 8.9s, and after further tweaking the execution time dropped to 8.2s, all just from reducing the wait states on the read side.
Next I reduced the number of wait states on the write side. This brought the execution time down to 7.7s. The impact of reducing wait states on the write side is much smaller than on the read side, since the CPU mostly reads data and seldom writes it.
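To illustrate why the read side matters more, here is a toy cost model in Python. It is only a sketch: the access counts, the 8:1 read-to-write mix, and the wait-state values are all hypothetical, since this log does not give the real numbers from the VHDL core.

```python
def total_cycles(reads, writes, read_wait, write_wait, base=1):
    """Bus cycles spent on memory accesses in this toy model.

    Each access costs `base` cycles plus its wait states.
    """
    return reads * (base + read_wait) + writes * (base + write_wait)


# Hypothetical access mix: 8 reads per write. Not measured on the real core.
READS, WRITES = 8000, 1000

# Hypothetical starting point: 2 wait states on both reads and writes.
before = total_cycles(READS, WRITES, read_wait=2, write_wait=2)

# Remove one wait state from reads only, then from writes only.
fewer_read_waits = total_cycles(READS, WRITES, read_wait=1, write_wait=2)
fewer_write_waits = total_cycles(READS, WRITES, read_wait=2, write_wait=1)

saved_by_reads = before - fewer_read_waits    # one cycle saved per read
saved_by_writes = before - fewer_write_waits  # one cycle saved per write

if __name__ == "__main__":
    print("cycles saved on the read side: ", saved_by_reads)
    print("cycles saved on the write side:", saved_by_writes)
```

In this model, removing one wait state saves exactly one cycle per access of that kind, so the saving scales with how often the CPU reads versus writes.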
After these changes I removed one extra "safety" state after each read (it was just there to make sure the bus interface has some time to settle after reads, but that is not really necessary since the main state machine adds a delay cycle anyway). That brought the time down to 7s. With these changes the execution time is only 60% of what it used to be! And the speed is now 22.9 times that of the original TI.
As a final tweak I removed one extra "safety" state that was there after each write, for the same reason as the one after reads. That reduced the run time to about 6.8s, so now the CPU runs my benchmark 23.5 times faster than the original TI.
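The speedup figures can be rechecked with a few lines of Python. The run times below are the ones measured in this log; the step labels and function names are mine, just for illustration.

```python
# Benchmark run times in seconds, taken from the measurements in this log.
ORIGINAL_TI = 160.0
STEPS = [
    ("previous core", 11.6),
    ("fewer read wait states", 8.9),
    ("further read tweaking", 8.2),
    ("fewer write wait states", 7.7),
    ("no safety state after reads", 7.0),
    ("no safety state after writes", 6.8),
]


def speedup_vs_ti(seconds):
    """How many times faster than the original TI a run time is."""
    return ORIGINAL_TI / seconds


def summarize(steps):
    """Return (label, seconds, speedup vs TI, percent of first step's time)."""
    baseline = steps[0][1]
    rows = []
    for label, seconds in steps:
        speedup = round(speedup_vs_ti(seconds), 1)
        percent = round(100 * seconds / baseline)
        rows.append((label, seconds, speedup, percent))
    return rows


if __name__ == "__main__":
    for label, seconds, speedup, percent in summarize(STEPS):
        print(f"{label:30s} {seconds:5.1f} s {speedup:5.1f}x {percent:4d}%")
```

The last column is the run time as a percentage of the previous core's 11.6s, which is where the 60% figure comes from.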

Here is Parsec running on the newly revised CPU:

When doing these tests, I really appreciated the quick re-synthesis time. It only takes my PC a couple of minutes to do the synthesis, so test iterations are fast.
I also took a look at how much FPGA capacity the current design takes: it uses 51% of the LUTs (look-up tables), so there is plenty of space left. There are also some debug features included in here, and removing those would make the design smaller.

Discussions

Ed S wrote 09/20/2017 at 18:50

Great speedup!
