
The code density TRAP

A project log for eForth for cheap STM8S gadgets

Turn cheap modules from AliExpress into interactive development kits!

Thomas 01/21/2017 at 10:32

Fusing code to improve code density has a high potential of making refactoring very difficult. In this project I took the liberty of fusing core code when I saw a size (sometimes also a speed) advantage, and where I didn't see possible future variation points. In some cases I un-fused code from the original STM8EF source to get desired variation points (e.g. vectored I/O). This was easy since much of the eForth source code is written in Forth, and fusing code is no problem as long as there are no fundamental changes to the design of eForth.

So, what's the point? It's a pun: I used the TRAP instruction to improve the code density of Forth user code :-)

Please let me explain by diving a bit into how the Forth VM, the virtual CPU that drives a Forth system, works. When Forth compiles a "word", it creates a thread of tokens for consecutive execution by the Inner Interpreter, the execution unit of the Forth VM.

The Inner Interpreter can be implemented in hardware (a Forth CPU), or in software. There are established coding techniques for general purpose CPUs, like the often used Direct Threaded Code (DTC), where tokens are pointers to the so-called "Code Field Address" of words.
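To make this concrete, here is a generic DTC sketch; it is not how STM8EF is coded, and labels like ENTER, DUPP, STAR and EXIT are placeholders only. The compiled body of a colon definition is just a list of code field addresses that the Inner Interpreter steps through:

; generic DTC sketch (NOT STM8EF code): compiled form of  : SQUARE  DUP * ;
SQUARE:
        CALL    ENTER          ; code field: machine code that enters the Inner Interpreter
        .dw     DUPP           ; token: code field address of DUP
        .dw     STAR           ; token: code field address of *
        .dw     EXIT           ; token: leave this thread, resume the caller's thread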

Tokens can represent ordinary Forth words, or words that manipulate the Inner Interpreter itself:

Words like DOLIT or ?BRANCH differ from ordinary Forth words in that they manipulate the Inner Interpreter, and don't just work with the stacks of the Forth VM. Therefore the implementation of this class of words depends on the coding technique used for implementing the Forth VM. Only a small number of such words is necessary, and most of them are core words.

STM8EF uses Subroutine Threaded Code (STC). Here, the Inner Interpreter is implemented by the CALL instruction of the CPU, the instruction pointer is the program counter, and the CPU's return stack is used as the Return Stack of the Forth VM. Words that need to manipulate the instruction pointer, like DOLIT, do this by changing the CPU's PC value on the return stack (the one that the CALL pushes for use by RET).

The advantage of STC is speed and simplicity of implementation. Context switching for multi-tasking is also easy to achieve (I took advantage of this). The disadvantage is code size: compared to DTC, a simple STC implementation needs at least one more byte per token, for the CALL instruction. And this is before we even use words like DOLIT and EXIT to make the Inner Interpreter change the sequence of execution.
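As a sketch of what that looks like in compiled code (the labels DOLIT and STAR are illustrative, not necessarily the exact STM8EF symbols), a definition like : SCALE 10 * ; becomes a plain sequence of CALLs:

; STC sketch: compiled form of  : SCALE  10 * ;
SCALE:
        CALL    DOLIT          ; 3 bytes: run-time code that pushes the in-line literal
        .dw     10             ; 2 bytes: the literal itself
        CALL    STAR           ; 3 bytes: *
        RET                    ; 1 byte:  EXIT, compiled as a native RET (see the table below)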

Since we're using the CPU as the Inner Interpreter, there are some more analogies between the CPU and the Forth VM that we can exploit for optimization:

| Token | Optimization measure for increasing code density | Trade-Off | Done |
| --- | --- | --- | --- |
| any | Many CPUs have a CALL instruction that uses some form of relative or segmented addressing to improve code density, and an optimizing compiler can take advantage of that. The STM8S CALLR instruction can be used if the call distance is less than 128 bytes (2 instead of 3 bytes; encodings sketched below the table). | Some code generator complexity. | |
| EXIT | EXIT can be coded as a native RET (1 instead of 3 bytes). | None | ✓ |
| BRANCH | BRANCH can be coded as a native JP (3 instead of 5 bytes). | None | ✓ |
| ?BRANCH | That's very hard, since manipulation of the Data Stack is required (DONEXT and DOLOOP even manipulate data on the Forth VM Return Stack). In Machine Forth, Chuck Moore changed the rules of the game to make this easier. | Be like Chuck Moore. | |
| DOLIT | Same as ?BRANCH, except that what DOLIT does is a simple operation of the Forth VM. | hmmm | |
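For reference, the byte counts in the table come from the STM8 instruction encodings (opcodes per the STM8 opcode map; target addresses shown symbolically):

        CALL    target         ; CD hh ll : 3 bytes, absolute call
        CALLR   target         ; AD rr    : 2 bytes, relative call, range -128..+127
        RET                    ; 81       : 1 byte,  used for EXIT
        JP      target         ; CC hh ll : 3 bytes, used for BRANCH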

Edit: I worked a bit on replacing CALL by CALLR to make one more check mark in the "Done" column in the first table. I'll be playing with the code before releasing it.

In most cases, manipulating the Forth VM stacks with native code requires many CPU instructions. This also sets limits for hand-optimizing core assembly code.
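For example, even a primitive as simple as DUP needs a handful of instructions when the data stack lives in RAM and is addressed through X (which, as the TRAP handler further down shows, is the data stack pointer in STM8EF). A sketch, not necessarily the exact STM8EF source:

; sketch of a DUP primitive: X = data stack pointer, 16 bit cells
DUPP:
        LDW     Y,X
        LDW     Y,(Y)          ; Y = value on top of the data stack
        DECW    X
        DECW    X              ; make room for one more cell
        LDW     (X),Y          ; store the copy
        RET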

If only we had more such 1:1 matches of Forth VM instructions to CPU opcodes (like EXIT-RET and BRANCH-JP)! I'm not like Chuck Moore (and I don't change easily), but ... the STM8S has one malleable instruction: TRAP.

TRAP is a "non-maskable software interrupt with highest priority". The purpose of such an instruction is to consistently manipulate the state of the CPU, e.g. in an RTOS: it interrupts the CPU's flow of execution, saves the context to the stack, runs the TRAP handler code, restores the context, and returns to where it left off. It can only be interrupted by RESET or by another TRAP (which rarely ever makes sense).
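The CPU finds the handler through the interrupt vector table: on the STM8S the vector table starts at 0x8000, and the TRAP vector is the second entry, at 0x8004. A sketch of that entry in sdas/asxxxx syntax (the actual STM8 eForth source may lay out its vector table differently):

;       vector table entry for TRAP (located at address 0x8004)
        .dw     0x8200                 ; 0x82 = INT opcode of a vector entry, high address byte 0x00
        .dw     _TRAP_Handler          ; 16 bit address of the TRAP handler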

I used TRAP to implement the DOLIT instruction: "CALL DOLIT MSB LSB" can now be coded as "TRAP MSB LSB". That's 3 bytes instead of 5 for representing literals!
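At the byte level (opcode values per the STM8 opcode map, DOLIT's address shown as hh ll), compiling the literal 42 changes like this:

; before: literal compiled as a call to DOLIT
        CALL    DOLIT          ; CD hh ll : 3 bytes
        .dw     42             ; 00 2A    : 2 bytes   -> 5 bytes total
; after: literal compiled as TRAP plus the in-line value
        TRAP                   ; 83       : 1 byte
        .dw     42             ; 00 2A    : 2 bytes   -> 3 bytes total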

Here is the code (you may need the page on the TRAP instruction in the STM8S programming manual to make sense of it):

;       TRAP handler implementing DOLIT
;       Push the inline literal following the TRAP instruction
;       Stack frame after TRAP: (1,SP)=CC, (2,SP)=A, (3,SP)=X,
;       (5,SP)=Y, (7,SP)=PCE, (8,SP)=PCH, (9,SP)=PCL
_TRAP_Handler:
        DECW    X                      ; make room for a new cell on the
        DECW    X                      ;   data stack (X = data stack pointer)
        LDW     (3,SP),X               ; XH,XL: IRET will restore the updated pointer
        EXGW    X,Y                    ; keep the new data stack pointer in Y
        LDW     X,(8,SP)               ; PC MSB/LSB: points at the in-line literal
        LDW     X,(X)                  ; fetch the 16 bit literal
        LDW     (Y),X                  ; store it in the new data stack cell
        LDW     (5,SP),X               ; YH,YL
        LDW     X,(8,SP)               ; reload the saved return address ...
        INCW    X
        INCW    X
        LDW     (8,SP),X               ; ... advanced by 2, so IRET skips the literal
        IRET

In order to use it for compiled user code, I only had to change the implementation of LITERAL (3 lines), of DODOES (3 lines), and of DOES> (1 line). After applying the new "DOLIT instruction" a couple of times in the core code, it had already paid for itself!

In applications where a lot of literals are used in the code (not variables or arrays, but literal numbers), the new approach has a clear code size advantage.

Of course, where there is light, there also is shadow. Here is a comparison of the old and the new implementation:

|  | CALL DOLIT | DOLIT with TRAP |
| --- | --- | --- |
| Execution time | 22 CPU cycles | 50 CPU cycles |
| Interrupt latency | 4 CPU cycles (CALL or RET instruction) | 50 CPU cycles w/ TRAP context switch |
| Bytes on return stack | 22 (multi-tasking + interrupt) | 9 (non-interruptible) |

At 16MHz clock frequency, 50 CPU cycles translate to just about 3µs execution time. In most applications the additional 2µs won't be noticeable, but since time critical code (e.g. interrupt driven video signal generation) might see an impact (even if it's coded in assembly or C), I'll try to make this new feature optional. The result of the stack analysis came as a surprise, since I had already planned for increasing the return stack headroom.
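Spelled out, with the cycle counts from the table above and the 16 MHz system clock:

\[
t_{\mathrm{TRAP}} = \frac{50}{16\,\mathrm{MHz}} \approx 3.1\,\mu\mathrm{s}, \qquad
\Delta t = \frac{50 - 22}{16\,\mathrm{MHz}} = 1.75\,\mu\mathrm{s} \approx 2\,\mu\mathrm{s}
\]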
