To give you an impression of how fast this thing is:

A = a0x01      # load content from address 1 into register A
B = l1         # load register B with 1 (literal)
C = A-B        # simple math
D = A>l2       # is A greater than 2?
E = D?C:B      # yes: E=C, else: E=B
F = A*E        # simple math (multiply)
G = C-B        # simple math
H = C>l2       # is C greater than 2?
I = H?G:B      # yes: I=G, else: I=B
a0x80 = F*I    # simple math and copy result to address 0x80
All of this executes in a single CPU cycle! (One cycle consists of 6 periods of the square wave, just as in the single-core equivalent.)
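
To make the data flow easier to follow, here is a rough Python model of the same ten instructions. It is only a sketch for checking the arithmetic by hand: the function name run_package and the dict-based memory are made up for illustration, the comparisons are modelled as Python booleans, and the register width and any overflow behaviour of the real hardware are not modelled at all.

```python
def run_package(memory):
    """Evaluate the ten example instructions on a dict-based memory (illustrative only)."""
    A = memory[0x01]      # A = a0x01   load content from address 1
    B = 1                 # B = l1      literal 1
    C = A - B             # C = A-B
    D = A > 2             # D = A>l2    comparison result
    E = C if D else B     # E = D?C:B
    F = A * E             # F = A*E
    G = C - B             # G = C-B
    H = C > 2             # H = C>l2
    I = G if H else B     # I = H?G:B
    memory[0x80] = F * I  # a0x80 = F*I store result at address 0x80
    return memory


mem = run_package({0x01: 5})
print(mem[0x80])  # 5 * 4 * 3 = 60
```

With 5 at address 1 the result at address 0x80 is 60; with 3 it is 6, because the second comparison (C>l2) fails and I falls back to 1. The Python version runs the steps one after another, of course. The point of the hardware is that the whole chain settles within one cycle.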

Some more detail (the ten instructions above are an example of one 160-bit package):