To demonstrate how the compilation of C to Fast compares with that of KiR to Fast in terms of code performance, we have tested a number of small benchmark programs. They include:
In figure 2 the number of instructions executed, the number of NOOPs executed, and the cycles per instruction (CPI) are given for both languages and all benchmark programs. CPI is defined for this purpose as:

Figure 2: Instructions and CPIs for Benchmark Programs
The KiR compiler produces better Fast code for Takeuchi and Isort. For all other programs the code generated by the C compiler is superior since it does more aggressive optimizations as function inlining, loop unrolling, etc. Since KiR is an untyped language, the compiler must generate runtime checks for the applicability of the higher-order functions in the sorting programs. These checks consist in matching the arities of the functions against the number of available arguments. All other runtime type checks are performed in hardware, hence they do not degrade performance.
The KiR compiler generates better code for Takeuchi, since the implementations of C and KiR return the result values of function calls in different locations. In C, the result is returned in a general purpose register. In KiR, it is returned on stack A. Therefore, the C compiler places the results of the intermediate function calls in registers which must be saved on each function call. This is not necessary in KiR, which reduces the number of instructions per function call by three. Since Takeuchi has three recursive calls in one part of the IF-THEN-ELSE clause, this small difference sums up to significantly more executed instructions.
KiR should perform better than C for programs which operate on the heap. The heap management routines are translated from C code into the intermediate language of the KiR compiler, and take advantage of the large general purpose register set. The KiR compiler has no register allocation stage, hence registers allocation is fixed for the entire program code. All necessary heap management variables are kept in registers. gcc is not able to perform this mapping, since these variables are declared globally. This also explains the greater number of memory accesses for these programs as given in figure 3. This advantage for KiR is outweighted for Treesort by the fact that KiR is untyped. For the chosen benchmark programs, gcc generates code which uses at most 13 general purpose registers. The number of general purpose registers of Fast could be reduced to 32 without loss of performance, which has also been confirmed with other benchmark programs [AH][Hut94]. For the KiR compiler, a register allocation phase would be necessary to allow for this reduction. This should result in only a small performance loss.
: Number of Stores/Loads
In order to determine the pros and cons of the multiple stack architecture of Fast for C implementations, we have compared it with another processor architecture named One. It has only one stack with 128 entries instead of four stacks with 32 entries each. All other features that are unrelated to the multiple stack system are identical to Fast. The compilation of the C programs to this architecture is denoted as C' in figure 2. C' performs better than C only for Ackermann. The reason is simple: Ackermann requires stacks for parameter passing and return addresses only. Hence it is very atypical for real programs. The execution of Ackermann generates about 50% less stack overflow/underflow traps on One than on Fast, which explains the superior performance. The other benchmark programs are slightly more realistic. They show the same behavior as other (bigger) programs, which have been tested during the adaptation of gcc to Fast [AH][Hut94]. The single stack of One is a bottleneck which prevents the re-organizer from optimizing instruction schedules. Even though the C compiler does not take advantage of all possibilities that the multiple stacks offer, Fast performs better than One for nearly all benchmark programs. The main advantage gcc does not utilize is the opportunity to release stack elements as early as possible. This important optimization is included in the KiR compiler. gcc does not have this optimization for the following reasons:
One would think that a single stack with 128 elements yields a better exploitation of the available space than four stacks with 32 elements each, since there is less fragmentation. However, multiple stacks allow for optimizations which are not possible with a single stack. For example, Takeuchi requires two local registers to store the results of intermediate function calls. These are saved in a function prologue on a stack, and restored in a function epilogue. Optimally placing their allocation and de-allocation increases performance by about 10%. Allocated stack space for parameters and local variables can be released as early as the respective values are no longer required. This optimization is at least difficult with one stack, since re-using re-claimed locations is complicated, and closing the gap within the stack by sliding down all entries above it is an expensive operation. There might be cases in which this optimization is possible due to an appropriate placement of actual parameters, local variables, return addresses and saved registers. However, in general no placement strategy will fit all possible situations, and therefore a single stack is in most cases worse than multiple stacks.