TORCH Architectural Specification

torch@chroma.Stanford.EDU
The TORCHers

Table of Contents

TORCH Architectural Specification
TORCH Architectural SpecificationTORCH Architectural Specification

TORCH Architectural Specification

1 Introduction

TORCH is an experimental superscalar processor architecture which includes the following features:

This document describes the TORCH architecture in detail. The rest of the introduction briefly describes the goals and features of TORCH. The next section is a more detailed overview which describes the features visible to the compiler or the assembly language programmer.

Superscalar processors are uniprocessor organizations capable of increasing machine performance by executing multiple scalar instructions in parallel. Since the amount of instruction-level parallelism within a basic block is small, superscalar processors must look across basic block boundaries to increase performance. In numerical code, the do-loop branches, which are a large percentage of the total branches executed, can be resolved early to expose the parallelism between many iterations. Unfortunately, many of the branches in non-numerical code are data dependent and cannot be resolved early. Thus, speculative execution, the execution of operations before previous branches, is an important source of parallelism in this type of code.

Instruction-level parallelism can be extracted statically (at compile-time) or dynamically (at run-time). Statically-scheduled superscalar processors and Very Long Instruction Word (VLIW) machines exploit instruction-level parallelism with a modest amount of hardware by exposing the machine's parallel architecture in the instruction set. For numerical applications, where branches can be determined early, compilers harness the parallelism across basic blocks by techniques such as software pipelining or trace scheduling. However, the overhead and complexity of speculative computation in compilers has prevented efficient parallelization of non-numerical code.

Dynamically-scheduled superscalar processors, on the other hand, effectively support speculative execution in hardware. By using simple buffers, these processors can efficiently commit or discard the side effects of speculative computations. Unfortunately, the additional hardware necessary to look far ahead in the dynamic instruction stream, find independent operations, and schedule these independent operations out of order is costly and complex.

We are interested in using superscalar techniques to increase the performance of non-numerical code at a reasonable cost. To accomplish this goal, we propose a superscalar architecture, which we call TORCH, that combines the strengths of static and dynamic instruction scheduling. The strength of static scheduling is the compiler's ability to efficiently schedule operations across many basic blocks; consequently, TORCH performs all instruction scheduling in the compiler. The strength of dynamic scheduling is in its ability to efficiently support speculative execution; consequently, TORCH provides hardware that allows the compiler to schedule any instruction before preceding branches, an operation we term boosting. Boosted instructions are conditionally committed upon the result of later branch instructions. Boosting, therefore, removes the scheduling constraints that result from the dependences caused by conditional branches and makes aggressive instruction scheduling in the compiler simple.

To make the conditional evaluation of boosted instructions efficient at run-time, the TORCH hardware includes two shadow structures: the shadow register file and shadow store buffer. These structures buffer the side effects of a boosted instruction until its dependent branch conditions are determined. On a correctly predicted branch, the hardware commits the appropriate values in the shadow structures. On a mispredicted branch, the hardware guarantees correct program operation by squashing all shadow values.

2 Overview of the TORCH Architecture

2.1 Hardware Overview

The initial implementation of TORCH consists of a single chip containing two integer execution units connected to a six-port register file, a floating point unit with an associated register file, and separate instruction and data caches (see Figure 1). The two integer execution units have different resources, so they are capable of executing different classes of instructions as shown in Table 3 . See section 3 for a list of instructions in each class. Note that the only duplicated functional unit is the ALU since ALU operations are the most frequently executed instructions. Load and store instructions are the next most frequent instruction class, but to dual-issue these instructions would dramatically increase the hardware cost. Instead, our goal is to design a machine that keeps a single memory port nearly 100% busy. The instruction set is nearly identical to the MIPS R2000 RISC instruction set.

Figure 1 Simplified Block Diagram of the TORCH Machine

The machine supports the standard MIPS 32-register register file. To support boosting the hardware includes a second 32-word register file called the shadow register file. Memory is byte-addressable and TORCH is a big-endian machine: byte 0 is the most-significant byte of word 0.

The integer execution units have a five-cycle pipeline similar to the MIPS R2000 pipeline:

IF = instruction fetch
RD = register fetch and instruction decode
ALU = operate on the operands
MEM = memory access
WB = register write back
The two execution units operate in lock-step; if one side stalls, so does the other.

Branch conditions are evaluated by the end of the first half of the ALU cycle so branches have a single-cycle delay slot, a topic which is discussed in detail in subsection "Branches" on page 6 . The machine has some hardware interlocks, including a load interlock (there is no load delay slot, unlike the MIPS R2000 architecture). Exceptions occur in the second half of the MEM cycle.

2.2 Packets and Instructions

The fundamental unit of a TORCH machine language program is the packet which consists of a group of instructions. Since the current implementation of the machine has two integer execution units a packet contains a pair of instructions. The instructions execute in parallel and operate on the same register file. TORCH instructions are identical to MIPS R2000 instructions except for an extra byte attached to each instruction. The extension byte contains information about dynamic NOPs and boosting as will be described later in this section. In addition, branch instructions include prediction information (which is encoded in unused portions of the MIPS opcode space). A MIPS instruction is 32 bits long, so a TORCH instruction with the extra byte is 40 bits long. For detailed information on the MIPS opcodes and their semantics see [Kane89]. Section 3 describes the differences between the MIPS and TORCH instruction encodings.

As an example, here are two legal TORCH packets written in assembly language:

addi $2, 10
subi $4, 6
sll $5, $6, 2
sw $2, 10($sp)
The first instruction in each pair is the A-side instruction. These same two packets can be abbreviated by the following schematic notation:

In this example the addition and subtraction operations execute in parallel, followed by the shift (sll) and store (sw).

The architecture does not define the results if an instruction is on the wrong side of a packet. For instance, the following packet is illegal since a store must be on the B side:

However, there is an exception: subsection "Dynamic NOPs" on page 5 describes a special case in which an A-side instruction with a dynamic NOP can appear on the B side of a packet, and similarly for a B-side instruction on the A side.

Instructions which execute in parallel must not depend on each other in any way. For example, if the two instructions in a packet write to the same register in the same cycle (a zero-cycle write-write dependency) then the result is undefined. However, the two instructions in a packet might be issued in different cycles if dynamic NOPs are present; again, see the Dynamic NOP subsection.

2.3 Instruction Numbers and Addresses

In order to improve the efficiency of the memory and fetch systems, the 40-bit instruction words are not packed into main memory or the secondary cache in the straight-forward way. Instead, instructions are collected into groups of eight and the extension bytes for the eight instructions are stored together. The 32-bit MIPS instructions follow the group of extension bytes. Here is an example for a group of eight instructions beginning at address 0:

The numbers at the left are byte addresses, each of the small boxes represents one byte, and a wide box represents a four-byte word. E0 through E7 are the extension bytes for instructions 0 through 7, respectively, and I0 through I7 are the MIPS-compatible parts of the instructions. The first packet consists of E0, I0, E1 and I1. A group of eight instructions packed in this way is called an instruction block.

The instruction blocks are repacked as individual 40-bit instructions as they are loaded into the primary cache, which is organized to store 40-bit words. The bus between the primary and secondary caches is 64 bits wide, so an instruction block is loaded into the primary cache as follows: First, the primary cache controller reads the eight extension bytes and stores them in a shift register. Next, two instructions are read and the first two extension bytes are shifted out of the shift register and stored with the corresponding instructions. Then the next two instructions are loaded, and so on. This scheme requires a minimal amount of hardware (a simple, fast unidirectional shift register for the extension bytes), in contrast with the complicated shift register/aligner which would be required if the extension bytes were stored adjacent to the corresponding instructions in memory (since an instruction could begin at any byte address).

Now since the extension byte and the MIPS portion of each instruction are not stored adjacent to each other, what is the address of an instruction? The convention used in TORCH is to address an instruction by its instruction number rather than by a memory address. In the example above the first instruction (E0-I0) has instruction number 0, the second instruction (E1-I1) has instruction number 1, and so on. The first (A-side) instruction of a packet always has an even instruction number, and the second (B-side) instruction has an odd instruction number. Branch and jump effective addresses and the address in the program counter are all instruction numbers.

This method of addressing instructions maintains compatibility with the word addresses used in a standard MIPS binary. If all of the extension bytes were removed and the remaining MIPS instructions were packed to fill in the empty spaces, then the instruction number would be the same as the word address of the instruction (assuming a 32-bit word). In the above example I0 would be at address 0, I1 would be at byte address 4 or word address 1, and so on.

From now on the phrases "instruction number" and "instruction address" will be used interchangeably, but remember that the addresses in branch instructions are not the virtual addresses of the targets.

2.4 Dynamic NOPs

The architecture described so far has a potential problem: if a program does not have enough parallelism to allow two independent instructions to be scheduled in each packet then the compiler would have to insert NOPs and the program text would get longer. As a result, the performance of the instruction cache could degrade.

TORCH solves this problem by allowing a NOP to be encoded by a single bit in the extension byte of some other (hopefully useful) instruction. If the dynamic NOP bit of an instruction is set then the instruction is delayed by one machine cycle before it is issued. Both instructions in a packet have independent dynamic NOP bits. In assembly language a dynamic NOP is indicated by preceding an instruction with "d.":

d.add $2, $1, $4
sub $1, $2, $3
The schematic notation is:

The extra vertical bar indicates an "empty" instruction preceding the add. In this example the subtract instruction is issued immediately since its dynamic NOP bit is not set. In the following machine cycle the add instruction is issued, and in the third cycle the next packet can be issued. Here is a functionally equivalent code sequence with no dynamic NOPs which takes the same amount of time to run (assuming no additional I-cache misses), but which has one extra packet:

A packet with dynamic NOPs takes two cycles to issue. The instruction (or instructions) issued in the first cycle is called the first issue of the packet, and the instruction issued in the second cycle is called the second issue. In the example above the subtract instruction is the first issue of the packet and the add is the second issue. A packet with no dynamic NOPs has only one issue.

Table 1 is a list of all legal combinations of dynamic NOPs in a packet and the equivalent packets with explicit NOPs.

Table 1 Legal Combinations of Dynamic NOPs and A-Side and B-Side Instructions

---------------------------------------------------
Source Code  Schematic Notation  Execution Sequence  
---------------------------------------------------
                                                     
                                                     
                                                     
                                                     
                                                     
                                                     
---------------------------------------------------
Note: "A" signifies any A-side instruction and "B" signifies any B-side instruction.
The equivalent packets are identical to the actual execution sequences of the packets with dynamic NOPs. Dynamic NOPs allow any straight-line sequence of arbitrary A- and B-side instructions to be encoded in packets without any explicit NOPs. Note that an A-side instruction is allowed to be the second instruction of a packet if it is dynamically NOPed, and similarly a NOPed B-side instruction can be the first instruction of a packet. These cases are allowed because the delay cycle provides enough time to "move" the delayed instruction to the correct execution unit. All cases not shown in Table [Ref: tbl:dnop-cases] are illegal and have undefined results.

2.5 Branches

TORCH supports the same branch instructions as the MIPS architecture, but they are augmented with prediction information. When a branch is predicted to be taken (that is, to branch to the target) its opcode is extended with ".t":

beq.t $1, $2, target
A branch which is predicted not taken is extended with ".n":

bne.n $1, $2, target
Branch prediction is done by the compiler.

In the MIPS architecture all branches have a single-instruction delay slot, which means that the instruction following the branch is always executed regardless of whether the branch is taken or not. TORCH branch instructions also have a delay slot, but only the issue following the branch issue is considered the delay slot. In other words, a branch takes effect after the issue following the branch and not necessarily after the first packet following the packet containing the branch. If the packet containing the branch and the following packet have no dynamic NOPs then the entire second packet is in the delay slot and will be executed regardless of the branch outcome. If, however, an instruction in the second packet has a dynamic NOP then that instruction is not in the delay slot. For example, the sequence

is equivalent to

and in both cases only the store instruction (sw) is in the delay slot of the branch (beq). As a second example, in the following sequence the delay slot has two NOPs and the subtract instructions are not in the delay slot:

Finally, in this last example only the add instruction is in the delay slot:

One of the instructions in the delay slot can be a branch or a jump:

Assuming that there are no dynamic NOPs the execution sequence of this program is:

The second jump is in the delay slot of the first jump. The first jump takes effect in the third cycle, but then the second jump takes effect in the fourth cycle and "cancels" the first jump. As a result, A5 and B5 are not executed. The exception handlers use this feature to return to the user's program (see subsection 2.8, Returning from an Exception).

Recall that the first instruction of a packet always has an even address. A branch to an odd address is allowed, but it has a special meaning. If there are no dynamic NOPs in the target packet then the second instruction of the packet (which must be a B-side instruction) is executed by itself. If, however, either instruction in the target packet has a dynamic NOP then the machine executes the second issue of the packet, which is not necessarily the second instruction. These semantics provide a way for exception handlers to return to the second issue of a packet without reexecuting the first issue (see subsection 2.8, Returning from an Exception for more details).

Here are some examples of odd-target branches:

Example 1: The target has no dynamic NOPs, so the second instruction is executed.

Example 2: The first instruction of the target has a dynamic NOP.

Example 3: Both instructions of the target have dynamic NOPs.

If the target packet of a branch or jump to an odd address has a single dynamic NOP (e.g. Example 2 above) then the machine stalls for one cycle before executing the target instruction. The stall cycle is necessary because the instruction might have to be moved to the opposite side of the machine. For instance instruction A3 of Example 2 could be a B-side instruction, but then the machine would need an extra cycle after fetching the target packet to move the instruction from the A side to the B side. There would be no stall cycle if the jump target address were t1, because in that case instruction B3 could execute immediately and instruction A3 could be moved to the correct side in time for the following cycle.

2.6 Boosted Instructions

To take advantage of the parallelism in the hardware exposed by the TORCH architecture, the compiler must reorder the instructions in the original serial program. For example, a load instruction can be moved to an earlier point in the program to fill an empty B-side slot and to hide the one-cycle latency of the load. However, one of the limits on instruction reordering is the control dependency. For example, an instruction in the taken path of a branch cannot normally be moved to a packet before the branch because the instruction should not change the state of the machine if the branch is not taken. Moving the instruction across the branch might change the logic of the program.

One of the central features of TORCH is the ability to execute an instruction before a branch it depends on, but to delay updating the sequential machine state with the results of the instruction until the branch result is known. Instructions which have been moved across a branch are called "boosted" instructions, and non-boosted instructions are called "sequential" instructions.

Boosting is implemented by keeping the results of boosted instructions in a shadow register file. When a branch occurs the contents of the shadow registers are either discarded, to squash the boosted instructions, or moved to the "sequential" (main) registers, to commit the boosted instructions. In the case of a store instruction the address and data are held in a shadow store buffer until the branch path is determined. The store buffer holds values which are waiting to be written to the data cache (or to main memory in the case of uncacheable references). Currently, up to four outstanding boosted stores can be held in the store buffer. The compiler enforces this limit, but if it is exceeded then an exception occurs(1).

Now recall that the TORCH compiler annotates each branch with the path which is most likely to be taken. Instructions are only allowed to be boosted from the branch path which is predicted to occur. Furthermore, in the current implementation an instruction can be boosted across only one branch. Boosted instructions can therefore be executed as follows. The machine executes a boosted instruction as if it were a normal instruction, but all results are saved in the shadow registers and shadow store buffer. Subsequent boosted instructions can read the results of previous boosted instructions from the shadow registers. When the next branch occurs, if it does not follow the predicted path then the machine squashes all of the boosted instructions by simply ignoring all of the shadow state. If the branch does follow the predicted path then all of the shadow registers that were written to by boosted instructions are exchanged with the corresponding sequential registers, so that the registers which formerly contained the shadow state now become sequential registers with the correct copy of the machine state (this operation is called "ponging"). Boosted stores which have been waiting in the store buffer are also allowed to begin writing to the data cache or main memory after the branch is taken.

Boosted ALU instructions always write to shadow registers, but they may read operands from either shadow registers or sequential registers depending on whether the operands are produced by a boosted instruction or a sequential instruction. In assembly language a shadow register is specified by appending ".b" to the register name:

In this case the add instruction reads its operands from the sequential register file and writes its result to the shadow register file. If the following branch is not taken, which in this case is the predicted path, then registers $2 and $2.b are ponged so that A4 and B4 (the instructions following the delay slot of the branch) see the result of the add in register $2. The example above is faster than but functionally equivalent to the following code:

The architecture does not prevent sequential instructions from reading shadow registers, but the result is unpredictable (problems can occur if any instruction causes an exception; see subsection 2.7 on Exception Handling).

Note that boosted instructions do not commit at jumps (which have no conditional). An instruction can be moved across a jump just by duplicating it (one copy of the instruction must appear in each path which leads to the target code).

The current implementation of TORCH only allows instructions to be boosted across one branch, but that is not a limitation of the architecture. Boosting across multiple branches can be implemented by associating a count with each boosted instruction. A boosted instruction with a count of two can commit only if the next two branches in the dynamic instruction stream both follow their predicted paths. However, because there is only one shadow register file only one level of boosting can be handled at a time per register. If, for example, an instruction boosted across one branch and an instruction boosted across two branches both write to the same shadow register before the next branch then the result is undefined. This is a minor limitation since the two instructions can write to different shadow registers.

Jump and branch instructions may not be boosted, except into the delay slot of another branch; in that case the boosted branch only takes effect if the first branch takes the predicted path. System calls and break instructions should not be boosted because they always cause exceptions, and as will become clear from the subsection on exception handling there is no performance gain from boosting an instruction which excepts. System coprocessor instructions which modify registers in the system coprocessor cannot be boosted since there are no shadow structures in the system coprocessor. The multiply and divide instructions use two special registers, "hi" and "lo", which do not have corresponding shadow registers. As a result, multiply and divide instructions can be boosted only if there is no sequential instruction using the hi and lo registers at the same time. Since the hi and lo registers are generally used for temporary results this constraint is not severe.

2.7 Exception Handling

TORCH supports precise interrupts---whenever an interrupt or exception occurs, there is a single instruction such that all instructions which logically precede it have completed, and no instructions which logically follow it have completed. Furthermore, a boosted instruction never causes an exception if the branch it depends on does not follow the predicted path (other than a boosted load which can cause a spurious cache miss).

Suppose a boosted instruction cannot complete due to an exceptional condition, for instance a load instruction attempts to dereference a NULL pointer. If the next branch does not follow the predicted path then the boosted instruction should never have been executed and the "boosted exception" must be ignored (in fact, this case occurs frequently during the last iteration of a tightly-coded loop which traverses a linked list). At the time the exception is detected the machine sets a bit in the status register of the system coprocessor (the "boosted exception pending" or "BEP" bit), and then continues fetching and executing instructions normally. The instruction which excepted stores garbage as its result. Of course, any boosted instructions which depend on the instruction which excepted may not compute correct results.

When the machine executes the next branch instruction it must decide how to process the boosted exception(2). If the branch does not follow the predicted path then the results of the pending boosted instructions can be discarded, and the machine ignores the exception. But if the branch does follow the predicted path then the exception must be processed, and assuming that the exception handler does not terminate the program, all of the boosted instructions executed since the last branch must be reexecuted (since some of those instructions may depend on the excepting instruction). TORCH causes these actions to be performed by clearing the boosted registers and the store buffer and then calling the boosted exception handler. The exception handler is called during the MEM cycle of the delay slot instruction, before the delay slot instruction completes.

The boosted exception handler is actually quite simple. It records the PC at the time of the exception (the delay slot PC), and then returns---not to the branch which caused the exception, but to a special boosted exception routine at a fixed location in the user's program. The delay slot PC is passed to this routine.

The boosted exception routine is generated by the TORCH compiler and executes in user mode, unlike the boosted exception handler which executes in system mode and is part of the operating system. Based on the delay slot PC, the routine jumps to a piece of code which contains a copy of the delay slot instruction and all of the boosted instructions since the last branch, but none of the other sequential instructions. However, when the boosted instructions are copied (at compile time) they are converted to sequential instructions---since the true sequential instructions have already completed, the boosted instructions can operate directly on the sequential register set. The block of copied code ends with a jump to the predicted path of the branch which caused the boosted exception handler to be called.

Now whatever happened to the exception the machine was supposed to handle? When the machine executes the sequential copy of the boosted instruction which excepted, if the exception occurs again then the machine can handle it as a normal sequential exception. The appropriate exception handler can be called immediately since the branch condition on which the boosted instruction depended has already been resolved. It is also possible that the exception may not reoccur. In either case, once the instructions in the copied code have all successfully completed, the jump at the end of the block causes a return to the correct instruction in the main program. This mechanism allows the boosted exceptions to be handled after the completion of the sequential instructions which logically precede them---and that is one of the requirements of a precise interrupt mechanism. Furthermore, the exception handlers only have to work on sequential instructions; there are no special cases for boosted instructions.

Here is an example to illustrate how the boosted exception handler works. Consider the following code:

Remember that the ".b" notation means an instruction is boosted, and the ".n" notation means the compiler predicted a branch will not be taken. The numbers at the left are instruction numbers (PCs). Now suppose the add instruction causes an overflow and the branch is not taken. Here is the sequence of events:

1. A1.b and B1 execute.
2. A2 executes and the add overflows; the BEP bit gets set.
3. A3 and the store execute (the incorrect value of $2.b is stored, but it is held in the store buffer for now).
4. The branch and B4 execute. The branch condition evaluates to false.
5. Since the BEP bit is set and the branch followed the predicted path, all boosted state is cleared (including the BEP bit and the values in the store buffer) and the boosted exception handler is called. The delay slot instructions, A5 and B5, do not complete.
6. The boosted exception handler calls the user's boosted exception routine and passes the PC of the delay slot instruction (8).
7. The user's boosted exception routine uses the delay slot PC to index a dispatch table which contains the address of the copied code. For this case, assume the compiler generated the following copied code:

Notice that the delay slot instructions from the branch come first, then the boosted instructions (with the ".b" extensions removed), and finally a jump back to the predicted path of the branch (instruction 10).
8. The user's boosted exception routine uses the address in the dispatch table (6002) to jump to the copied code.
9. A5 and B5 execute.
10. The add instruction executes and overflows again. The overflow exception handler is called, and assume that it stores some new values in $3 and $4. The exception handler then returns to instruction 6002 (via the RFE instruction).
11. A1 and the add execute successfully.
12. The jump and store execute.
13. The delay slot instructions (both NOPs) execute.
14. The machine continues execution from PC=10 (instructions A6 and B6).
So far the discussion has focused on boosted exceptions, but there are some complications for sequential exceptions too. During the MEM pipeline stage of a sequential instruction which excepts, the appropriate exception handler is called. However, before the handler gets control the machine discards all of the values in the boosted registers and the store buffer (by clearing bits which indicate which boosted registers have valid data). The boosted state must be cleared to prevent that state from being committed if the exception handler executes a branch (and most exception handlers do!) Alternatively the boosted state could be saved and restored, but significant extra hardware would be required. The bottom line is that in our implementation all boosted state is lost when a sequential instruction excepts.

When the exception handler returns by executing an RFE instruction, the machine checks state bits in the status register to see if any boosted instructions were pending before the exception. If so, the machine sets the BEP bit and continues execution. At the next branch the same sequence of events as for a boosted exception occurs---if the branch follows the predicted path then the boosted exception handler is called and the copied code is executed. In this case the copied code must be executed only to recompute the results of the boosted instructions, not to handle a true boosted exception.

Figure 2 summarizes the exception handling mechanism, including a special case if the excepting instruction is in the delay slot. The problem with a sequential exception in the delay slot is that any boosted instructions which were committed by the branch instruction must be squashed before the exception handler is called (since the delay slot instruction logically precedes the boosted instructions), and then when the exception handler returns the results of the boosted instructions must be recomputed. But this case is different then other sequential exceptions because the branch has already completed, so it would be incorrect to simply set the BEP bit. It is also necessary to keep a reecord of whether the branch is predicted incorrectly. This is because there could be a boosted instruction in the delay slot that needs to be squash when it is re-executed.

While the copied code technique does allow TORCH to support precise interrupts, it also increases the size of binaries substantially. The dispatch table used to find the address of the copied code must contain one entry for every even PC in the program code, and each entry is a 32-bit address, so the table is 2/5 the size of the program(3). A hash table could also be used, but most programs have a relatively high proportion of branches so a hash table would likely have nearly the same size as a dispatch table. On top of the space requirements of the dispatch table, all boosted instructions and some delay slot instructions must be duplicated. Although the increase in program size does increase the amount of disk space required to store binaries, we expect it to have little impact on cache and machine performance since the copied code and dispatch table are only accessed when exceptions occur.

2.8 Returning From an Exception

When an exception occurs the PC of the excepting instruction and the PC of the next instruction in the pipeline are saved in two read-only registers, EPC and EPCN respectively. After the exception has been processed the exception handler executes the following code sequence:

The RFE instruction restores the previous status register context bits (including the kernel/user mode bit), and the two jumps cause a return to the user's program. Note that the second jump is in the delay slot of the first; this sequence is necessary since the instruction at EPC might be in a delay slot, in which case the instruction at EPCN might not be the following instruction in the program text.

EPC and EPCN may contain odd addresses if the excepting instruction is in a packet which contains a dynamic NOP. The semantics of odd addresses allow the second issue of a packet to be executed without reexecuting the first issue, which may have successfully completed before the exception occurred. For example, suppose instruction A1 excepted in the following sequence:

Then the machine would set EPC to t1+1 and EPCN to t1+2. When the exception handler returned to t1+1 the machine would only execute instruction A1, since t1+1 is an odd address and the second issue of the packet is A1; B1 would not be executed again. It would be incorrect to set EPC to t1 because instruction B1 would execute and complete both before and after the exception. As a second example, if instruction B1 excepted instead of A1 then EPC would be set to t1 and EPCN to t1+1. Since A1 follows B1 in the pipeline, EPCN should point to A1 and not to A2 (which is two cycles behind B1 in the pipeline). Finally, if A2 or B2 excepted then EPC would be set to t1+2 and EPCN would be set to t1+4. No odd addresses would be necessary since A2 and B2 execute simultaneously; if one of the two excepted, then the other would not complete either. In summary, the general rule is that EPC points to the issue containing the excepting instruction and EPCN points to the next issue in the pipeline.

The break and syscall instructions are special because they always cause an exception, but they must not be reexecuted when the exception handler returns. So, the exception handler returns to EPCN instead of EPC by using the following code sequence:

However, this return sequence has the side-effect that the A-side instruction in the packet with the syscall or break will never complete unless one of the two sides has a dynamic NOP. For example, in the following code sequence instruction A1 never completes since the syscall handler returns to instruction A2:

This code sequence, however, is correct:

Since A1 is one pipeline cycle behind the syscall, EPCN points to A1 and it gets executed after the syscall returns.

Note that instructions should not be boosted into a basic block containing a system call. There is no harm in doing so, but there is no performance gain either. Since the system call causes an exception, any boosted instructions in the same basic block must be reexecuted after the next branch (if the branch follows the predicted path).

Figure 2 Flow Chart of the TORCH Exception Handling Mechanism

2.9 Floating Point Unit

In the current implementation we are focusing on the integer portion of the machine. We expect to be able to integrate an on-chip floating point unit with the design, although no special features have been incorporated to improve floating point performance.

2.10 MIPS Compatibility Mode

The TORCH machine has a compatibility mode which allows unmodified MIPS R2000/R3000 binaries to be executed. When the machine is in MIPS mode the instruction stream is interpreted as 32-bit instructions instead of blocks of 40-bit instructions, and the instructions are internally padded with an empty byte specifying that none of the TORCH features (boosting, dynamic NOPs, etc.) are in use. The instruction is then sent to both execution units, and the result is produced by whichever unit has the appropriate resources to execute the instruction (in the case of an ALU instruction the result is taken from either side). The shadow state is not used.

The operating system runs in TORCH mode. A user program can contain instructions of only one type, MIPS or TORCH. To execute a MIPS binary the operating system sets the top bit of a three-bit "mode" stack in the status register and branches to the user program. The machine returns to TORCH mode when the program causes an exception (e.g. a system call), in which case a context switch occurs and the previous bit on the mode stack determines the new execution mode.

2.11 Memory Hierarchy Organization

The data cache is a 8 kByte direct-mapped, write-allocate, write-back cache. Both virtual and physical addresses are 32 bits wide, and pages are 4 kBytes long. Cache lines are 256 bits (32 bytes) long. The tags are physical addresses and the cache is indexed with physical addresses. However, the cache access occurs in parallel with the virtual-to-physical address translation. Concurrent operation is possible because the low-order two bits of the virtual page number are required to equal the corresponding bits of the physical page frame number (since the page size is 4 kBytes the page offset is 12 bits wide, but 14 bits are required to select a byte from the cache). This requirement (called page coloring) constrains the placement of pages in physical memory. The TLB translation includes the process id (PID) of the current process, so the cache does not have to be flushed during a context switch. Figure 3 summarizes how the data cache works.

The instruction cache is an 10 kByte cache with virtual tags and virtual indexes. The cache accesses the TLB only in the case of a miss. Since the tags are virtual page coloring is not necessary.

The TORCH page size is 4 kBytes. However, main memory for program text (instructions) is allocated in units of five pages, or 20 kBytes. This allocation scheme ensures that the instructions in a 40-byte instruction block are always contiguous in main memory. Since 40 is an even divisor of 20k (but not of 4k), an integral number of instruction blocks fit into a 20 kByte chunk of memory. As a result, the hardware which fills the instruction cache does not need to handle a special case if the instruction block crosses a page boundary. There are no restrictions on the placement of program data in memory, other than the page coloring requirement.

Figure 3 Operation of the Data Cache

3 TORCH Instruction Set

Table 2 lists the instructions in each instruction class and the side of the machine on which each class can be executed.

Table 2 Instruction Classes

-----------------------------------------------------------------------------------
Class and Side     Opcode                              Description                   
-----------------------------------------------------------------------------------
ALU (both sides)  ADDI, ADDIU, ADD, ADDU, SUB, SUBU   add and subtract               
                  SLTI, SLTIU, SLT, SLTU              set on less than               
                  ANDI, ORI, XORI, AND, OR, XOR, NOR  bitwise boolean operations     
                  LUI                                 load upper immediate           
Shift (A side)     SLL, SRL, SLLV, SRLV               logical shifts                 
                   SRA, SRAV                          arithmetic shifts              
Multiply/Divide   MULT, MULTU                         integer multiply               
(A side)          DIV, DIVU                           integer divide                 
                  MFHI, MFLO, MTHI, MTLO              move from/to HI and LO regis   
                                                      ters                           
Jump/Branch       J, JR                               jump                           
 (A side)         JAL, JALR                           jump and link                  
                  BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ    branch                         
                  BLTZAL, BGEZAL                      branch and link                
Load/Store        LB, LBU, LH, LHU, LW, LWL, LWR      load                           
(B side)          SB, SH, SW, SWL, SWR                store                          
Special           SYSCALL                             system call                    
(B side)          BREAK                               breakpoint                     
Coprocessor       RFE                                 return from exception          
(B side)          TLBR, TLBWI, TLBWR, TLBP            TLB operations                 
                  LWCz, SWCz                          load/store word to/from copro  
                  MTCz, MFCz, CTCz, CFCz              cessor                         
                  COPz                                move to/from coprocessor       
                  BCzT, BCzF                          coprocessor operation          
                                                      branch on coprocessor          
-----------------------------------------------------------------------------------
Table 3 Legal A- and B-side instruction classes
----------------------------------------------
A side                B side                    
----------------------------------------------
ALU operations        ALU operations            
multiply and divide   load and store            
shift operations      system call and break     
jump and branch       system coprocessor calls  
----------------------------------------------

4 Instruction Encodings

Figure 4 illustrates the encoding of the dynamic NOP and boosting information in the extension byte. The dynamic NOP bit is set if the instruction is dynamically NOPed. The boosting level of a register may be zero (for a sequential register access) or one (for a shadow register access). Boosting levels two and three are reserved for use in future multi-level boosting implementations. A boosted instruction which does not write to any registers (such as a jump) is encoded by setting the rd boosting level to one.

Figure 4 TORCH Instruction Encoding

Table 4 shows how the branch prediction information is encoded. Note that TORCH branch instructions which are predicted taken have encodings identical to those of the correpsonding MIPS branch instructions, and TORCH branches which are predicted not taken are encoded with unused MIPS opcodes.

Table 4 Branch Encodings

----------------------------------------
Branch Opcode   Bits 31-26    Bits 20-16  
----------------------------------------
BEQ.T           000100        rt          
BEQ.N           010100        rt          
BNE.T           000101        rt          
BNE.N           010101        rt          
BLEZ.T          000110        00000       
BLEZ.N          010110        00000       
BGTZ.T          000111        00000       
BGTZ.N          010111        00000       
BLTZ.T          000001        00000       
BLTZ.N          000001        01000       
BGEZ.T          000001        00001       
BGEZ.N          000001        01001       
BLTZAL.T        000001        10000       
BLTZAL.N        000001        11000       
BGEZAL.T        000001        10001       
BGEZAL.N        000001        11001       
BCz.T           bit 22 is 0               
BCz.N           bit 22 is 1               
----------------------------------------

5 Program-Visible Registers

The TORCH architecture defines the following registers:

--------------------------------------------------------------------------
Register Name   Description                                                 
--------------------------------------------------------------------------
$0 and $0.b    Read-only registers containing zero                          
$1-$31         General-purpose sequential registers                         
$1.b-$31.b     General-purpose shadow registers                             
HI and LO      Result registers for integer multiply and divide             
Status         Status and mode bits (see Figure 5)                          
Cause          Description of last exception (identical to the MIPS R2000   
               Cause Register)                                              
EPC            PC of instruction which excepted                             
EPCN           PC of instruction following the instruction which excepted   
BadVAddr       Virtual address which caused last addressing exception       
Context         For use by TLB miss handler                                 
EntryHi         High word of TLB entry                                      
EntryLo         Low word of TLB entry                                       
Index           TLB entry index                                             
Random          random TLB entry index                                      
--------------------------------------------------------------------------
The registers associated with the TLB and address translation are identical to the corresponding registers in the R2000 architecture.

Table 5 Instruction Encodings

-------------------------------------------------------
Bit Number   Description                                 
-------------------------------------------------------
0-31         MIPS instruction encoding                   
32-33        rd (destination register) boosting level    
34-35        rt (second source register) boosting level  
36-37        rs (first source register) boosting level   
38           dynamic NOP bit                             
39           reserved for future use                     
-------------------------------------------------------
Table 6 TORCH Status Register
-------------------------------------------------------------------------------
Bit Number   Name    Description                                                 
-------------------------------------------------------------------------------
0            IEc    Interrupt enable for current context (set to 1 to enable)    
1            KUc    Kernel/user mode for current context (set to 1 for user      
                    mode)                                                        
2            IEp    Interrupt enable for previous context                        
3            KUp    Kernel/user mode for previous context                        
4            IEo    Interrupt enable for old context (prior to previous)         
5            KUo    Kernel/user mode for old context                             
6-7                 not used                                                     
8-9          Sw0-1  Software interrupt mask bits                                 
10-15        Hw0-5  Hardware interrupt mask bits                                 
16           BSc    Boosted code executed in current context (the hardware       
                    sets this bit when a boosted instruction commits and         
                    clears it when a branch commits)                             
17           BSp     Boosted code executed in previous context                   
18           BSo     Boosted code executed in old context                        
19          SBp     Squash bit in old context                                    
20          SBo      Squash bit in old context                                   
21           TS      TLB shutdown (set to 1 by processor if TLB failure is       
                    detected)                                                    
22           BEV     Bootstrap exception vector (if set to 1, the machine uses   
                    an alternate set of exception vectors)                       
23           MPc     MIPS mode for current context (set to 1 for MIPS mode)      
24           MPp     MIPS mode for previous context                              
25           MPo     MIPS mode for old context                                   
26           BEP     Boosted exception pending (the hardware sets this bit       
                    when a boosted instruction excepts and clears it when a      
                    branch commits)                                              
                                                                                 
27                   not used                                                    
28-31        Cu0-3   Cux is set to 1 by the hardware if coprocessor x is usable  
-------------------------------------------------------------------------------
Figure 5 TORCH Status Register

6 Bibliography

[Kan88] G. Kane. MIPS RISC Architecture. Prentice Hall, Engelwood Cliffs, NJ 07632, 1988.

Footnotes

(1)
In fact a boosted exception occurs and the program continues to execute. See subsection 2.7 on exceptions.
(2)
This discussion assumes that the excepting instruction is boosted across a single branch. The same mechanism works for multiple levels of boosting except that the status register contains one BEP bit for each level, so that it is possible to determine which branch should trigger processing of the exception.
(3)
Suppose the program has 100 packets (200 instructions), so it occupies 200(5 bytes) = 1kByte. The dispatch table must have one entry for each even PC (all possible packet addresses), or 100(4 bytes) = 400 bytes. The ratio is 400/1000 = 2/5.
 

Back to Top