Debugging Tools for Windows
The Intel Itanium processor has several features that do not appear on the x86 processor:
The Itanium has a large number of registers.
Many of the registers are further subdivided into categories such as static, stacked, and rotating.
When discussing register preservation conventions, the term preserved refers to a register whose value must be preserved by a function, and scratch refers to a register whose value can be modified by a function.
When using the ? (Evaluate Expression) command, registers should be prefixed with an "at" sign ( @ ). For example, you should use ? @f0 rather than ? f0. This ensures that the debugger recognizes f0 as a register, rather than as a hexadecimal number or a symbol.
However, the "at" sign is not required in the r (Registers) command. For instance, r r32 = 5 will always be interpreted correctly.
The 128 integer registers are named r0 through r127 and break down as follows:
r0 − r31 | Static general registers |
r32 − r127 | Stacked general registers |
Some of the registers have special meaning, most of which are assigned by convention rather than hardware requirement. (The only one that is a hardware requirement is r0.)
r0 | Reads as zero (writing to it will cause an access violation) |
gp | Global pointer (r1) |
ret0 − ret3 | Function return values go here (r8 through r11). |
sp | Stack pointer (r12) |
r13 | TEB |
By convention, registers r4 through r7 are preserved. You also should not change the TEB pointer.
The gp register points to your global variables. To access a global variable, you have to indirect off the gp register. The gp register is kept up-to-date when you jump from one DLL to another (by means described later).
The other static registers are scratch.
All integer registers are 64 bits, with a magic NaT bit attached to each one. NaT stands for "not a thing" and is used by speculative execution to indicate that the register values are undefined.
The static integer registers do not participate in register stacking. The behavior of the stacked registers (including their preservation rules) will be described in the Procedure Calls and the Register Stack subsection of this section.
The return value registers hold function return values and therefore can be destroyed across a function call. Notice that there are four integer return value registers at 64 bits each, for a total of 256 bits or 32 bytes of data that can be returned from a function (twice the size of a GUID).
f0 | Reads as 0.0 (writing to it will cause an access violation). |
f1 | Reads as 1.0 (writing to it will cause an access violation). |
f2 − f31 | Static floating-point registers. |
f32 − f127 | Rotating floating-point registers |
This document does not discuss floating-point code.
By convention, floating-point registers f2 through f5 and f16 through f31 are preserved across calls; the rest are scratch.
pr0 | Reads as TRUE (writes are ignored). |
pr1 − pr15 | Static predicate registers. |
pr16 − pr63 | Rotating predicate registers. |
Predicate registers act as flags. They record the result of comparison instructions, and you can test them later to perform some sort of conditional action (called "predication"). You can predicate almost any instruction, not just jump instructions. For example, the instruction
(p1) add ret0 = r32, r33
means "set register ret0 equal to register r32 plus register r33 if register p1 is TRUE; otherwise, do nothing." Allowing arbitrary predication helps performance significantly: you have fewer jump instructions and therefore less misprediction. You can also pack your instructions more compactly, because jump targets must begin on a bundle boundary, but predicated instructions can go anywhere.
The parenthesized predicate register is called the qualifying predicate (abbreviated as "qp"). Because predicate register zero is always TRUE, unconditional instructions are internally encoded as conditional on p0.
By convention, predicate registers pr6 through pr15 are scratch; the rest are preserved.
There is a special pseudo-register called pr (called preds by the debugger) that consists of the 64 predicate registers combined to form a single 64-bit value. This lets you read and write all the predicate registers so you can preserve them across a call.
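The way the predicate registers pack into a single value can be sketched in C. This is a minimal illustration, assuming that bit n of the combined value corresponds to predicate register n; the function names pred_is_set and pred_assign are hypothetical and are not debugger or processor features.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the state of predicate register n (0 through 63) inside a
   combined 64-bit preds value. Assumes bit n holds predicate register n. */
bool pred_is_set(uint64_t preds, unsigned n)
{
    assert(n < 64);
    return ((preds >> n) & 1u) != 0;
}

/* Returns a copy of preds with predicate register n forced to value. */
uint64_t pred_assign(uint64_t preds, unsigned n, bool value)
{
    uint64_t mask;

    assert(n < 64);
    mask = (uint64_t)1 << n;
    return value ? (preds | mask) : (preds & ~mask);
}
```

For a preds value of 0x8941, pred_is_set reports predicate registers 0, 6, 8, 11, and 15 as TRUE and all others as FALSE.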
The branch registers b0 through b7 are used for computed jump instructions. These are dedicated to computed jumps so the processor can optimize more efficiently for them.
By convention, the following meanings are assigned to the branch registers.
rp | Return address (b0) |
b1 − b5 | Preserved across procedure calls |
b6 − b7 | Scratch |
Of the application registers, ar.unat and ar.lc must be preserved across calls.
Each procedure declares how many parameters it has (input registers), how many private registers it needs (local registers), and the maximum number of parameters of any function it calls (output registers). The input registers plus the local registers together are called the "local region." The input registers plus the local registers plus the output registers together are called the "register stack frame".
For example, this function
int sampleFunction(int a, int b)
{
int c;
c = someFunction(0) + otherFunction(a, b);
return c;
}
would specify two input registers, some number of local registers, and two output registers.
When control enters a procedure, the alloc instruction shuffles the registers around.
Suppose the calling function uses five local registers (r32 through r36) and it wants to call sampleFunction(3, 4).
Before calling the function, the registers would be arranged like this:
 r0   ...   r31 r32     r36 r37 r38 r39 ...
+--------------+-----------+-------+-------------
| 0   ...   aaa|bbb ... ccc|  3   4|??? ...
+--------------+-----------+-------+-------------
Registers r37 and r38 are the output registers and contain the parameters to the function.
When sampleFunction gets control, it executes an alloc instruction:
alloc r34 = ar.pfs, 2, 2, 2, 0
Essentially, this means: Set up the register stack frame as follows: 2 input registers, 2 local registers, 2 output registers, and zero rotating registers. Save the previous register frame state (pfs) in register r34 so it can be restored later.
This instruction shuffles the registers like this:
 r0   ...   r31 r32 r33 r34 r35 r36 r37 ...
+--------------+---------------+-------+----------
| 0   ...   aaa|  3   4 pfs ???|??? ???|...
+--------------+---------------+-------+----------
The static registers did not change. What used to be the output registers are now the input registers; the local registers are uninitialized (except for register r34, which was explicitly set to ar.pfs by the first parameter of the alloc instruction). What used to be in the local variables got pushed onto the register stack; they are not accessible to the called function.
Note The distinction between input and local registers is purely semantic. It makes no difference for the processor. The only important thing to the encoding of the instruction is the size of the local region, so when you read the instruction in disassembly, the first number is the size of the local region and the second number is always zero. Thus, the preceding alloc instruction will be disassembled as
alloc r34 = ar.pfs, 4, 0, 2, 0
Note In reality, this register shuffling process happens in two stages. The br.call instruction is the one that renumbers the registers. The alloc instruction is the one that describes the layout of the registers on the receiving side. However, in simple cases, you can act as if the alloc instruction does it all. In complicated situations, you may want to execute multiple alloc instructions to reinterpret all your registers.
The called function now performs its operation, maybe calls some other functions, and then when it is ready to return, the registers look like this:
 r0...ret0..r31 r32 r33 r34 r35 r36 r37 ...
+--------------+---------------+-------+----------
| 0...rrr...aaa|xxx yyy zzz www|??? ???|...
+--------------+---------------+-------+----------
It has placed the function return value (rrr) into the ret0 register. All the registers in the local region contain whatever values are left over from the computation.
When control returns to the calling function, the register stack is popped and the registers look like this:
 r0...ret0..r31 r32     r36 r37 r38 r39 ...
+--------------+-----------+-------+-------------
| 0...rrr...aaa|bbb ... ccc|??? ???|??? ...
+--------------+-----------+-------+-------------
All the old registers in the local region are restored from the register stack, and the return value is now available in the ret0 register. Note that the values in the output registers have returned to garbage. You cannot rely on the values of output registers being preserved across a call.
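The renumbering walked through above can be modeled in C. This is a minimal sketch, not a description of how the hardware is built: the RegStack and FrameState types and all function names are hypothetical, the phys array stands in for the physical stacked registers, and spilling to memory is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define STACKED_REGS 96  /* r32 through r127 */

typedef struct {
    unsigned local_size;  /* input + local registers (the local region) */
    unsigned frame_size;  /* input + local + output registers */
} FrameState;

typedef struct {
    uint64_t phys[STACKED_REGS]; /* stand-in for the physical stacked registers */
    unsigned base;               /* physical index that r32 currently maps to */
    FrameState frame;            /* layout of the current frame */
} RegStack;

/* Returns a pointer to register rN (N >= 32) of the current frame. */
uint64_t *vreg(RegStack *rs, unsigned n)
{
    assert(n >= 32 && n - 32 < rs->frame.frame_size);
    assert(rs->base + (n - 32) < STACKED_REGS);
    return &rs->phys[rs->base + (n - 32)];
}

/* Models alloc: describes the layout of the current frame. */
void alloc_frame(RegStack *rs, unsigned in, unsigned local, unsigned out)
{
    rs->frame.local_size = in + local;
    rs->frame.frame_size = in + local + out;
}

/* Models br.call: the caller's first output register becomes r32.
   Returns the caller's frame state (the value alloc saves from ar.pfs). */
FrameState do_call(RegStack *rs)
{
    FrameState pfs = rs->frame;

    rs->base += pfs.local_size;
    rs->frame.frame_size = pfs.frame_size - pfs.local_size;
    rs->frame.local_size = 0;
    return pfs;
}

/* Models br.ret: pops the frame and restores the caller's layout. */
void do_return(RegStack *rs, FrameState pfs)
{
    rs->base -= pfs.local_size;
    rs->frame = pfs;
}
```

With a caller frame of five local and two output registers, storing 3 and 4 in r37 and r38 and then calling do_call followed by alloc_frame(rs, 2, 2, 2) makes those values readable as r32 and r33, and do_return makes the caller's original r32 visible again.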
The hardware supports any number of output registers, but by convention, only the first eight are used to contain function parameters. Parameters beyond eight are passed on the stack.
The register stack is itself limited in size. If you push too many registers onto the register stack, it spills into regular memory.
The gp register is used to access global variables in your module. The rules for its management are complicated:
These rules allow for the following procedure call paradigms:
So how do you set up gp for the target? A function pointer on the Itanium is really a pointer to a block of data that describes the target function. One of the items in that block of data is the value of gp that the target function expects.
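As a rough picture of that block of data, the following C sketch models a function descriptor as two 64-bit values: the address of the code, followed by the gp value that the code expects. The type and function names are hypothetical, and the addresses used with them are placeholders rather than values taken from a real module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a function descriptor. */
typedef struct {
    uint64_t entry; /* address of the function's first bundle */
    uint64_t gp;    /* gp value the function expects on entry */
} FunctionDescriptor;

/* What a caller needs before it can branch to the target. */
typedef struct {
    uint64_t branch_target; /* value to place in a branch register */
    uint64_t new_gp;        /* value to place in gp */
} CallSetup;

/* Reads both fields of the descriptor, as a caller must do
   before calling through a function pointer. */
CallSetup prepare_call(const FunctionDescriptor *fd)
{
    CallSetup setup;

    assert(fd != NULL);
    setup.branch_target = fd->entry;
    setup.new_gp = fd->gp;
    return setup;
}
```

After the call returns, the caller is responsible for putting its own gp value back, which this sketch does not model.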
Return to the sampleFunction example. Assume this is an exported function. In this case, the code to call it would be:
mov r37 = 3                 // first parameter
mov r38 = 4                 // second parameter
addl r31 = 0x108, gp ;;     // r31 -> import table entry
ld8 r30 = [r31] ;;          // r30 -> sampleFunction descriptor
ld8 r41 = [r30], 8 ;;       // r41 = actual address of function
ld8 gp = [r30]              // set up gp for sampleFunction
mov b6 = r41                // set up the branch register
br.call rp = b6 ;;          // call the function
mov gp = ...                // restore gp to original value
This means that if you try to disassemble at sampleFunction, you will end up just looking at the function descriptor rather than the function itself. The function itself begins with a dot, so if you want to see the code for sampleFunction you have to type u .sampleFunction.
dbi2 = 0 dbi3 = 0
dbi4 = 0 dbi5 = 0
dbi6 = 0 dbi7 = 0
dbd0 = 0 dbd1 = 0
dbd2 = 0 dbd3 = 0
dbd4 = 0 dbd5 = 0
dbd6 = 0 dbd7 = 0
gp = 77560000 0 r2 = 6fbffc90bf0 0
r3 = c000000000000309 0 r4 = 78190000 0
r5 = 1010aa0 0 r6 = 6fbfffde000 0
r7 = 0 0 r8 = 6fbffe8d948 0
r9 = ffffffffffffffff 0 r10 = 0 0
r11 = ffffffff 0 sp = 6fbffe8d520 0
r13 = 6fbfffdc000 0 r14 = 6fbffc90be8 0
r15 = 0 0 r16 = e0000165e11ae7f0 0
r17 = e0000165e11ae910 0 r18 = 6fbffc90bec 0
r19 = 0 0 r20 = 9804c8a70033f 0
r21 = e000000086b8ee20 0 r22 = e000000086b96040 0
r23 = 1 0 r24 = 7f05 0
r25 = 2f 0 r26 = 14e 0
r27 = 6 0 r28 = 6fb000006fb 0
r29 = 6fbff 0 r30 = 103 0
r31 = 0 0 intnats = 0
preds = 8941
b0 = 772b43b0 b1 = e0000000ffa005c0
b2 = 766e9e88 b3 = 0
b4 = 0 b5 = 0
b6 = 772ba2e0 b7 = 0
unat = 0 lc = 0
ec = 0 ccv = e000000086c577b0
dcr = 7f05 pfs = c000000000000309
bsp = 6fbffe90cc8 bspstore = 6fbffe90cc8
rsc = f rnat = 0
ipsr = 1013082a6018 iip = 772b4310
ifs = 8000000000000204 fcr = 40
eflag = 202 csd = cfbfffff00000000
ssd = cf3fffff00000000 cflag = 111
fsr = 0 fir = 0
fdr = 0
r32 = 6fbffe8d948 0 r33 = 6fb000006fb 0
r34 = 104 0 r35 = 8941 0
At the top of this display are eight instruction debug registers (dbi) and eight data debug registers (dbd).
Next come the static integer registers, followed by intnats, which consists of all the NaT bits combined to form a 32-bit integer.
Then preds, which is all the predicate registers combined into a 64-bit integer.
Next are all the branch registers.
Then there are several special-purpose registers. Of these, the following two are probably the only ones you will need to deal with:
Finally shown are the registers for the current register stack frame. You happen to be in a function that has four registers in its frame, so the debugger showed the first four stacked registers, r32 through r35. If the function used more registers, this part would have been larger.
x | 1 (byte), 2 (word), 4 (long) or 8 (quad) |
Ra, Rb, Rc... | Registers |
immn | Signed n-bit constant |
Rb/immn | Register or a constant |
Ba/addr | Branch register or an address |
cc | Condition (such as eq or ne) |
Rb<n1, n2> | n2 bits from Rb starting at position n1. |
f | Floating-point type (s, d, e) |
For example, Rb<5, 4> means "extract 4 bits from Rb starting at position 5," which is the value (Rb >> 5) & 0x0F.
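The Rb<n1, n2> notation generalizes to any position and width. The following C helper is a small sketch of that calculation; the name extract_bits is hypothetical and does not correspond to a particular instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Returns len bits of rb starting at bit position pos,
   which is the meaning of the notation Rb<pos, len>. */
uint64_t extract_bits(uint64_t rb, unsigned pos, unsigned len)
{
    uint64_t mask;

    assert(pos < 64 && len >= 1 && len <= 64 - pos);
    /* Shifting a 64-bit value by 64 is undefined in C, so handle it apart. */
    mask = (len == 64) ? ~(uint64_t)0 : (((uint64_t)1 << len) - 1);
    return (rb >> pos) & mask;
}
```

For example, extract_bits(0x1E0, 5, 4) yields 0xF, the same result as (0x1E0 >> 5) & 0x0F.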
Many instructions can be modified by suffixes called completers.
Unlike the x86 instruction set, which has a significant number of addressing modes, the Itanium instruction set has only one addressing mode: Register indirect. The notation for register indirect is "[r]" which means "the value stored at memory location r." The thing inside the brackets is always a register.
Instructions are packaged into groups of three called bundles. If you execute an .asm verbose command, the instructions that belong to a bundle will be surrounded by curly brackets. Bundles always start on 16-byte boundaries (in other words, the last digit of the hexadecimal address is zero).
You cannot jump into the middle of a bundle.
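Because bundles start on 16-byte boundaries, you can find the bundle that contains any code address by clearing the low four bits. The following C sketch shows the arithmetic; the function names are hypothetical, and the addresses used with them are only examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUNDLE_SIZE 16u /* bytes per bundle */

/* True if addr is a legal bundle address (and so a legal jump target). */
bool is_bundle_aligned(uint64_t addr)
{
    return (addr & (BUNDLE_SIZE - 1)) == 0;
}

/* Address of the bundle that contains addr. */
uint64_t bundle_start(uint64_t addr)
{
    return addr & ~(uint64_t)(BUNDLE_SIZE - 1);
}

/* Address of the bundle that follows the one at bundle_addr. */
uint64_t next_bundle(uint64_t bundle_addr)
{
    assert(is_bundle_aligned(bundle_addr));
    return bundle_addr + BUNDLE_SIZE;
}
```

An address such as 0x772b4310 ends in zero and is therefore a bundle address, whereas 0x772b4318 lies inside that bundle.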
There are restrictions on what type of instruction can be put into a particular slot in a bundle. For example, one valid bundle type is MII, which means that slot 0 (zero) is a memory access, slot 1 is an integer instruction, and slot 2 is another integer instruction.
Valid instruction types are:
Many instructions can be used in multiple slot types, in which case a completer is specified to disambiguate them. For example, there are five different nop instructions (nop.m, nop.i, nop.f, nop.b, nop.x) depending on which type of slot it was placed into.
Each valid combination of instruction categories is called a template. There are 32 different templates. Some of the templates differ only in the placement of stops.
A stop is used to indicate that instructions after the stop depend on instructions before the stop. For example, if you have the following series of instructions
add r1 = r2,r4 ;;
add r2 = r1,r3
the second instruction reads r1, which the first instruction writes, so the second instruction cannot execute until after the first has completed. Therefore, the compiler inserts a stop, represented by a double semicolon, after the first instruction.
Note that a stop does not have to go at the end of a bundle. There are some bundle templates that have stops in the middle and some that have more than one stop. For example, template 11 is M|MI|. Slot 0 (zero) is a memory access, then there is a stop after slot 0 (zero), then slot 1 is another memory access, then slot 2 is an integer instruction, then there is another stop after slot 2.
An instruction group is a sequence of instructions up to the next stop, taken branch, interrupt, or exception. The instructions within an instruction group cannot have dependencies among them. This allows the processor to execute them in parallel.
There are some exceptions to the no dependencies rule.
This description omits a number of details that are important only to compiler-writers or people hand-writing Itanium assembly. When reading disassembly, you can assume that the compiler or author generated correct code, unless you are tracking a compiler bug.
The ld instruction (load from memory) supports an .s completer, which means that execution of the instruction is speculated. (You cannot speculatively write to memory.)
An instruction that can be speculated is called a speculative instruction, rather than a speculatable instruction. Consequently, this documentation will use the word "speculated" to refer to the .s variant of the ld instruction.
A speculated load is just like a regular load, except that if an exception occurs, the processor sets the NaT bit in the destination register instead of raising the exception.
If any input to an integer computation instruction has the NaT bit set, then the result of the computation will also have the NaT bit set. If any input to a comparison instruction has the NaT bit set, then the result of the comparison is always FALSE.
You can also speculate floating-point instructions, but instead of setting the NaT bit, the entire floating-point register is set to a special NaT-like value called NaTVal. As with the integer case, NaTVal infects all subsequent computations.
Perhaps you realize that the speculated execution was not necessary. (Maybe you started going down the TRUE branch of an IF statement, only to discover that the value is FALSE.) In which case, you just ignore the registers that you changed with the speculated execution and continue on your way. (Don't look at them, of course, because they might be NaT.)
If you decide that the speculated execution was worthwhile, execute a chk.s instruction, which means "verify that this register contains an actual value. If it is a NaT, then jump to the recovery code." The recovery code typically just consists of all the speculated instructions re-executed normally, so the exception can be raised.
Aside from the instructions previously mentioned (which can handle NaTs) and a few other special instructions, attempting to use the value of a NaT register will cause an exception. This is not useful in general because you cannot tell which speculated instruction caused the exception.
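The NaT rules described in this section can be summarized with a small C model. This is only a sketch under simplifying assumptions: the Reg type and the function names are hypothetical, a null pointer stands in for a load that would raise an exception, and an assert stands in for the exception raised when a NaT value is consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A 64-bit integer register together with its NaT bit. */
typedef struct {
    uint64_t value;
    bool nat;
} Reg;

/* Models ld8.s: a faulting load sets NaT instead of raising the exception. */
Reg load_speculated(const uint64_t *addr)
{
    Reg r = { 0, false };

    if (addr == NULL)
        r.nat = true;
    else
        r.value = *addr;
    return r;
}

/* Integer computation: NaT on any input makes the result NaT. */
Reg add_regs(Reg a, Reg b)
{
    Reg r = { a.value + b.value, a.nat || b.nat };
    return r;
}

/* Comparison: NaT on any input makes the result FALSE. */
bool cmp_eq(Reg a, Reg b)
{
    return !a.nat && !b.nat && a.value == b.value;
}

/* Models chk.s: returns true if the recovery code must run. */
bool needs_recovery(Reg r)
{
    return r.nat;
}

/* Consuming a value for real; a NaT here would raise an exception. */
uint64_t use_value(Reg r)
{
    assert(!r.nat);
    return r.value;
}
```

Note that cmp_eq returns FALSE even when a NaT register is compared with itself, which is why a NaT never satisfies a comparison by accident.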
The Itanium contains special techniques to optimize in the face of aliased pointers. For example, consider this code snippet:
{
*p1 = 1;
*p2 = 2;
return *p1;
}
This function usually returns 1, but if p1 and p2 both point to the same address, then it will return 2.
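To see both outcomes, the snippet can be wrapped in a complete C function. The name store_two and the int parameter types are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper around the snippet above. */
int store_two(int *p1, int *p2)
{
    assert(p1 != NULL && p2 != NULL);
    *p1 = 1;
    *p2 = 2;      /* overwrites *p1 when p1 and p2 alias */
    return *p1;   /* must be reloaded unless aliasing is ruled out */
}
```

Calling store_two with two different variables returns 1; calling it with the same variable for both parameters returns 2. The compiler cannot simply return the constant 1 unless it can prove that the two pointers differ.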
The advanced load instruction ld.a means "load this value from memory and remember the access as successful." The processor keeps a list of these remembered accesses. If there is a subsequent write to that address, the entry is removed from the list, rendering the advanced load unsuccessful. (Other events can also remove an entry from the list; for example, if you ask the processor to remember too many advanced loads, it starts forgetting the older ones.)
Later on, you can check whether the advanced load is still valid. If it is not, it means that the value was modified and you have to reload it.
There are two types of checks. The simplest check is ld.c.nc or ld.c.clr. This says, "Check if that advanced load is still valid. If not, then reload the value." The .clr completer means that this advanced load is not important anymore, so the processor can free up the entry for recording new advanced loads; the .nc completer means that this advanced load is still valuable, so do not clear it from the table.
Here is an example that uses ld.c.clr.
ld8.a r6 = [r8] ;; // advanced load of r6 from [r8]
... // perform operation that might modify memory...
ld8.c.clr r6 = [r8] // reload r6 from [r8] if necessary
... // perform operations with the value in r6
In order to avoid a stall, the compiler requested that register r6 be loaded from the memory address specified by register r8 in advance of when it actually needed the result. Then, the compiler wanted to use the value in r6, but had not determined if, in the meantime, some pointer dereference had modified the value, thus rendering the prefetched value in r6 useless.
The ld8.c.clr instruction checks if anything has written to that address (even on another processor). If not, then the instruction does nothing. However, if something has indeed written to the address, then the instruction refetches the value (taking the memory stall).
The second type of check, chk.a, is for cases where you need more complicated recovery than just reloading the value.
Here is an example that uses chk.a.clr.
ld8.a r6 = [r8] ;; // advanced load of r6 from [r8]
add r5 = r6, r7 ;; // precompute based on advanced read
... // perform operations that might modify memory...
chk.a.clr r6, failed // memory was modified
continue:
... // perform operations with the value in r5
failed:
ld8 r6 = [r8] ;; // load the correct value
add r5 = r6, r7 // redo the precomputation
br continue // and then continue as if nothing was wrong
After doing some precomputation with the value you read from r8, you execute a chk.a.clr instruction, which checks if your advanced load is still valid. If not, you jump to failed, where you reload the value and redo the precomputation, then jump back to continue normal execution.
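The bookkeeping behind both kinds of check can be modeled in C. This is a simplified sketch, not a description of the processor's actual table: the Alat type and all function names are hypothetical, entries are keyed by destination register number, only stores made through store_value are tracked, and the limited capacity of the real table is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 128

/* One remembered advanced load per destination register. */
typedef struct {
    const uint64_t *addr[NUM_REGS]; /* address the register was loaded from */
    bool valid[NUM_REGS];           /* true while the advanced load stands */
} Alat;

/* Models ld8.a: load the value and remember the access. */
uint64_t load_advanced(Alat *t, unsigned reg, const uint64_t *addr)
{
    assert(reg < NUM_REGS && addr != NULL);
    t->addr[reg] = addr;
    t->valid[reg] = true;
    return *addr;
}

/* Models a store: any remembered load from this address becomes invalid. */
void store_value(Alat *t, uint64_t *addr, uint64_t value)
{
    unsigned i;

    *addr = value;
    for (i = 0; i < NUM_REGS; i++) {
        if (t->valid[i] && t->addr[i] == addr)
            t->valid[i] = false;
    }
}

/* Models chk.a.clr: returns true if the advanced load is still valid
   (false means "jump to the recovery code"), then clears the entry. */
bool check_advanced_clr(Alat *t, unsigned reg)
{
    bool ok;

    assert(reg < NUM_REGS);
    ok = t->valid[reg];
    t->valid[reg] = false;
    return ok;
}

/* Models ld8.c.clr: keep the current value if the advanced load is
   still valid; otherwise reload it from memory. */
uint64_t load_check_clr(Alat *t, unsigned reg, const uint64_t *addr,
                        uint64_t current)
{
    return check_advanced_clr(t, reg) ? current : *addr;
}
```

In this model, a store to an unrelated address leaves the advanced load valid, so load_check_clr returns the value already held; a store to the same address forces the reload.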
You can combine the preceding two techniques. The ld.sa instruction performs a speculated advanced load.
Register rotation is an advanced technique where the registers renumber themselves each time they go through a loop. It is not covered in this documentation.
The types of control flow you will see most of the time are jump (br.cond), call (br.call), and return (br.ret). (Conditional jumps are just regular jump instructions with a qualifying predicate attached in front.)
The jump and call instructions also have long versions (brl) if the target of the jump is really far away.
The brl instruction is actually emulated on Itanium, so do not expect it to be fast.
Each of the standard jump instructions also includes a group of completers. The first completer determines whether the jump should be predicted taken or not taken.
Prediction hardware might not be able to tell, because this instruction was never encountered before, or it was last encountered so long ago that it fell out of the cache.
The second completer specifies how aggressively the processor should prefetch after the jump. In other words, how sure are you that the prediction is correct?
Most jumps within a procedure will be marked .few, whereas unconditional subroutine calls and unconditional return instructions are usually marked .many.
Finally, there is an optional completer.
If you clear the entry, the processor will wipe out any knowledge of this jump instruction. Do this if you know the instruction will not be encountered again for a long time.
There is also a bonus instruction brp whose sole purpose is to indicate to the processor: "That computed jump instruction ahead is going to jump to here."
In its simplest form, the comparison instruction compares two values and stores the result into two predicate registers; one gets the result of the comparison, and the other gets the opposite of the result. For example,
cmp.eq p1, p2 = r32, r33
compares the two registers for equality and stores the result into p1. Meanwhile, the p2 register gets the opposite value. For example, if they were equal, then p1 would be TRUE and p2 would be FALSE.
The next most complicated comparison instruction is called the unconditional comparison and it is always used with a qualifying predicate. Here is a sample unconditional comparison:
(p3) cmp.eq.unc p1, p2 = r32, r33
If the qualifying predicate p3 is TRUE, then this acts just like a regular comparison instruction. However, if the qualifying predicate is FALSE, then both the p1 and p2 registers are set to FALSE. This is a rare case where a qualifying predicate has an effect even though it is FALSE. (Normally, if a qualifying predicate is FALSE, the entire instruction is ignored.)
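The difference between the regular and the unconditional comparison can be expressed in C. This is a minimal sketch; the function names are hypothetical, and the two bool variables passed by pointer stand in for the two target predicate registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regular comparison with qualifying predicate qp:
   when qp is FALSE, the targets are left untouched. */
void cmp_eq(bool qp, uint64_t a, uint64_t b, bool *p1, bool *p2)
{
    assert(p1 != NULL && p2 != NULL);
    if (!qp)
        return;
    *p1 = (a == b);
    *p2 = (a != b);
}

/* Unconditional comparison with qualifying predicate qp:
   when qp is FALSE, both targets are set to FALSE. */
void cmp_eq_unc(bool qp, uint64_t a, uint64_t b, bool *p1, bool *p2)
{
    assert(p1 != NULL && p2 != NULL);
    *p1 = qp && (a == b);
    *p2 = qp && (a != b);
}
```

When qp is TRUE the two functions behave identically; they differ only in what happens to the targets when qp is FALSE.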
The next most complicated comparisons are the parallel comparisons. These are used when you have a chain of "a && b && c" or "a || b || c" results. Here is a sample AND parallel comparison:
cmp.eq.and p1, p2 = r32, r33
This is expressible in C as
p1 = p1 && (r32 == r33)
p2 = p2 && (r32 == r33)
In other words, if the comparison is false, then both predicate registers are set to FALSE; otherwise, they are left alone.
The other variations of parallel comparisons are:
cmp.eq.andcm p1, p2 = r32, r33
p1 = p1 && !(r32 == r33)
p2 = p2 && !(r32 == r33)
cmp.eq.or p1, p2 = r32, r33
p1 = p1 || (r32 == r33)
p2 = p2 || (r32 == r33)
cmp.eq.orcm p1, p2 = r32, r33
p1 = p1 || !(r32 == r33)
p2 = p2 || !(r32 == r33)
and the DeMorgan operators:
cmp.eq.or.andcm p1, p2 = r32, r33
p1 = p1 || (r32 == r33)
p2 = p2 && !(r32 == r33)
cmp.eq.and.orcm p1, p2 = r32, r33
p1 = p1 && (r32 == r33)
p2 = p2 || !(r32 == r33)
For example, the expression p5 = (r4 == 0) || (r5 == r6) can be computed as follows (assuming that p5 is preinitialized to FALSE):
cmp.eq.or p5, p0 = r4, r0
cmp.eq.or p5, p0 = r5, r6
Notice that because these are both OR type comparisons, they can be combined into a single instruction group and, therefore, executed in parallel.
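The reason OR type comparisons can share an instruction group is that each one either sets its targets to TRUE or leaves them alone, so the order in which they execute does not matter. The following C sketch models this for the p5 example; the function names are hypothetical, and a throwaway variable stands in for p0, whose writes are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Models an OR type parallel comparison for equality:
   the targets become TRUE if a == b; otherwise they are left alone. */
void cmp_eq_or(uint64_t a, uint64_t b, bool *p1, bool *p2)
{
    assert(p1 != NULL && p2 != NULL);
    if (a == b) {
        *p1 = true;
        *p2 = true;
    }
}

/* Computes p5 = (r4 == 0) || (r5 == r6) with two OR comparisons. */
bool either_equal(uint64_t r4, uint64_t r5, uint64_t r6)
{
    bool p5 = false;      /* preinitialized to FALSE, as required */
    bool p0_sink = true;  /* stand-in for p0; its value is never used */

    cmp_eq_or(r4, 0, &p5, &p0_sink);
    cmp_eq_or(r5, r6, &p5, &p0_sink);
    return p5;
}
```

Swapping the two cmp_eq_or calls in either_equal gives the same result, which is the property that lets the processor execute them in parallel.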