I added some statistics collection to the RV32I[MA] emulator (originally created by Fabrice Bellard, then modified and shared on Hackaday by @Frank Buss) and gathered stats from some of the RISC-V benchmark tests (see https://github.com/riscv/riscv-tests/tree/master/benchmarks). With the DEBUG_EXTRA option it collects, for example, the following from the Dhrystone benchmark:
Instructions Stat:
LUI = 892
AUIPC = 7716
JAL = 11212
JALR = 12850
BEQ = 33399
BNE = 11298
BLT = 1721
BGE = 3480
BLTU = 7017
BGEU = 2248
LW = 31050
LBU = 27712
LHU = 502
SB = 4968
SH = 502
SW = 33037
ADDI = 87830
SLTIU = 1500
XORI = 1
ORI = 1
ANDI = 6151
SLLI = 10647
SRLI = 9534
SRAI = 95
ADD = 11486
SUB = 2813
SLL = 402
SLTU = 1844
SRL = 353
OR = 2459
CSRRW = 1
CSRRS = 8
LI* = 20602
Five Most Frequent:
1) ADDI = 87830 (27.05%)
2) BEQ = 33399 (10.29%)
3) SW = 33037 (10.17%)
4) LW = 31050 (9.56%)
5) LBU = 27712 (8.53%)
Memory Reading Area 80000000...80007ae2
Memory Writing Area 80001000...80007b3f
>>> Execution time: 1425296449 ns
>>> Instruction count: 324730 (IPS=227833)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)
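A "Five Most Frequent" list like the one above can be derived from a plain array of per-instruction counters. The following is only a sketch of that idea, not the emulator's actual code: the enum, `top_n` and `percent` are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode ids; the real emulator may index its counters differently. */
enum { OP_ADDI, OP_BEQ, OP_SW, OP_LW, OP_LBU, OP_COUNT };

/* Write the indices of the n largest counters into top[], largest first.
   Returns how many entries were written (at most n and at most count). */
size_t top_n(const uint64_t *counts, size_t count, size_t *top, size_t n)
{
    size_t found = 0;
    while (found < n && found < count) {
        size_t best = count;                 /* marker: nothing chosen yet */
        for (size_t i = 0; i < count; i++) {
            int already = 0;
            for (size_t j = 0; j < found; j++)
                if (top[j] == i) already = 1;
            if (!already && (best == count || counts[i] > counts[best]))
                best = i;
        }
        top[found++] = best;
    }
    return found;
}

/* Share of one counter in the total, in percent (0 if the total is 0). */
double percent(uint64_t part, uint64_t total)
{
    return total ? 100.0 * (double)part / (double)total : 0.0;
}
```

For instance, `percent(87830, 324730)` gives the 27.05% shown for ADDI once rounded to two decimals.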
Without the DEBUG_EXTRA option (no instruction or memory usage stats) and with the -O3 option (highest optimization level), the emulator can execute almost 13 million instructions per second on my relatively modern AMD64 computer running Debian Linux:
>>> Execution time: 25084843 ns
>>> Instruction count: 324730 (IPS=12945267)
>>> Jumps: 50209 (15.46%) - 18074 forwards, 32135 backwards
>>> Branching T=26147 (44.19%) F=33016 (55.81%)
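The IPS figure follows directly from the two numbers printed before it: instruction count divided by execution time, scaled from nanoseconds to seconds. A minimal sketch of that arithmetic is shown here; `ips` is a hypothetical helper, not a function from the emulator. The reported values appear consistent with truncating integer division.

```c
#include <stdint.h>

/* Instructions per second from an instruction count and a run time in
   nanoseconds. Integer division truncates the result. Returns 0 for a
   zero run time. The multiplication overflows 64 bits for counts above
   roughly 1.8e10, which is far beyond the benchmarks discussed here. */
uint64_t ips(uint64_t insn_count, uint64_t elapsed_ns)
{
    if (elapsed_ns == 0)
        return 0;
    return insn_count * UINT64_C(1000000000) / elapsed_ns;
}
```

With the Dhrystone numbers, `ips(324730, 1425296449)` yields 227833 for the DEBUG_EXTRA build and `ips(324730, 25084843)` yields 12945267 for the -O3 build, so the optimized build without stats is about 57 times faster.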
Here you can see that 15% of executed instructions are jumps (where the PC changes to something other than the usual PC+4) and that most jumps were backwards. Also, 44% of branches were taken (jump) and 56% were not taken (no jump). Below are similar stats for some other benchmarks:
median:
>>> Execution time: 1391119 ns
>>> Instruction count: 16244 (IPS=11676930)
>>> Jumps: 3552 (21.87%) - 1254 forwards, 2298 backwards
>>> Branching T=2613 (53.36%) F=2284 (46.64%)
multiply:
>>> Execution time: 4743276 ns
>>> Instruction count: 49670 (IPS=10471665)
>>> Jumps: 13808 (27.80%) - 6310 forwards, 7498 backwards
>>> Branching T=12915 (86.46%) F=2022 (13.54%)
qsort:
>>> Execution time: 19821720 ns
>>> Instruction count: 236219 (IPS=11917179)
>>> Jumps: 45487 (19.26%) - 8141 forwards, 37346 backwards
>>> Branching T=37792 (59.71%) F=25503 (40.29%)
rsort:
>>> Execution time: 31545464 ns
>>> Instruction count: 374291 (IPS=11865129)
>>> Jumps: 15239 (4.07%) - 797 forwards, 14442 backwards
>>> Branching T=14653 (73.66%) F=5239 (26.34%)
towers:
>>> Execution time: 1474786 ns
>>> Instruction count: 18656 (IPS=12649970)
>>> Jumps: 2027 (10.87%) - 762 forwards, 1265 backwards
>>> Branching T=1037 (57.20%) F=776 (42.80%)
vvadd:
>>> Execution time: 1004666 ns
>>> Instruction count: 11974 (IPS=11918388)
>>> Jumps: 1830 (15.28%) - 492 forwards, 1338 backwards
>>> Branching T=1417 (62.18%) F=862 (37.82%)
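The "Jumps" and "Branching" lines above could be gathered with a few counters updated once per executed instruction. This is a sketch under assumptions, not the emulator's real code: `struct flow_stats` and `flow_record` are hypothetical names, and a jump is taken to mean any next PC other than PC+4, as defined earlier. The Dhrystone numbers fit this model: JAL (11212) + JALR (12850) + taken branches (26147) = 50209 jumps.

```c
#include <stdint.h>

/* Hypothetical counters mirroring the ">>> Jumps" and ">>> Branching" lines. */
struct flow_stats {
    uint64_t insns;                    /* executed instructions          */
    uint64_t jumps_fwd, jumps_back;    /* PC changes other than PC+4     */
    uint64_t br_taken, br_not_taken;   /* conditional branches only      */
};

/* Call once per executed instruction.
   pc        - address of the instruction just executed
   next_pc   - address of the instruction that executes next
   is_branch - nonzero for conditional branches (BEQ, BNE, BLT, ...) */
void flow_record(struct flow_stats *s, uint32_t pc, uint32_t next_pc, int is_branch)
{
    int jumped = (next_pc != pc + 4);  /* anything other than fall-through */

    s->insns++;
    if (jumped) {
        if (next_pc > pc)
            s->jumps_fwd++;
        else
            s->jumps_back++;           /* also counts a jump to itself */
    }
    if (is_branch) {
        if (jumped)
            s->br_taken++;
        else
            s->br_not_taken++;
    }
}
```

One simplification to keep in mind: a taken branch whose target happens to be PC+4 would be counted as not taken here.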
As you can see, it is very important to pipeline jumps properly. Simple RISC hardware designs usually just waste cycles on every wrongly guessed branch (the branch penalty); what is needed instead is branch prediction, or even speculative execution of both paths, discarding the wrong one once the condition is known.
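To make "branch prediction" concrete, here is a software model of one textbook scheme: a small table of 2-bit saturating counters indexed by the branch address. The post does not say which scheme it has in mind, so this is purely an illustration; the table size and all names (`struct bht`, `bht_predict`, `bht_update`) are assumptions.

```c
#include <stdint.h>

#define BHT_BITS 6                       /* hypothetical table size: 64 entries */
#define BHT_SIZE (1u << BHT_BITS)

/* One 2-bit saturating counter per entry:
   0 or 1 predicts "not taken", 2 or 3 predicts "taken". */
struct bht {
    uint8_t ctr[BHT_SIZE];
};

static unsigned bht_index(uint32_t pc)
{
    /* Instructions are 4-byte aligned without the C extension,
       so the low two PC bits carry no information. */
    return (pc >> 2) & (BHT_SIZE - 1);
}

/* Returns 1 if the branch at pc is predicted taken, 0 otherwise. */
int bht_predict(const struct bht *t, uint32_t pc)
{
    return t->ctr[bht_index(pc)] >= 2;
}

/* Train the counter with the actual outcome once it is known. */
void bht_update(struct bht *t, uint32_t pc, int taken)
{
    uint8_t *c = &t->ctr[bht_index(pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Such a model could be fed with the branch outcomes the emulator already observes, to estimate how often a given predictor would guess wrong on each benchmark before committing to a hardware design.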
Discussions
Branches are an old issue that will not disappear soon...
Yes, and it's significant (4-28% of the executed code, as shown above). So it has to be handled properly...
Conditional moves help in some places, and precomputed loop targets help too (very widely used in DSP), but the most common cases (think "spaghetti") are a different story... I'm reconsidering many approaches for FC1...