Debugging Tools for Windows

Itanium Architecture

The Intel Itanium processor has several features that do not appear on the x86 processor:

Explicit Parallel Instruction Computation (EPIC).
A large number of registers. (Doing as much as possible in registers gets you better performance.)
Dedicated register usage (integer registers, branch registers, floating-point registers).
Conditional execution (predication) on almost all instructions.
Modulo scheduling for software pipelining.

Registers

The Itanium has a large number of registers.

128 integer registers, each with a NaT bit (r0 - r127)
128 floating-point registers (f0 - f127)
64 predicate registers (p0 - p63)
8 branch registers (b0 - b7)
An instruction pointer (the debugger calls this iip)
128 other special-purpose registers (not all of which have yet been given meanings). These are called the "application register set," or "ar" registers. (They are not covered in this documentation.)
A number of miscellaneous registers, not covered in this documentation.

Many of the registers are further subdivided into categories such as static, stacked, and rotating.

When discussing register preservation conventions, the term preserved refers to a register whose value must be preserved by a function, and scratch refers to a register whose value can be modified by a function.

When using the ? (Evaluate Expression) command, registers should be prefixed with an "at" sign ( @ ). For example, you should use ? @f0 rather than ? f0. This ensures that the debugger recognizes f0 as a register, rather than as a hexadecimal number or a symbol.

However, the "at" sign is not required in the r (Registers) command. For instance, r r32 = 5 will always be interpreted correctly.

Integer Registers

The 128 integer registers are named r0 through r127 and break down as follows:

r0 − r31	Static general registers
r32 − r127	Stacked general registers

Some of the registers have special meaning, most of which are assigned by convention rather than hardware requirement. (The only one that is a hardware requirement is r0.)

r0	Reads as zero (writing to it will cause an access violation)
gp	Global pointer (r1)
ret0 − ret3	Function return values go here (r8 through r11).
sp	Stack pointer (r12)
r13	TEB

By convention, registers r4 through r7 are preserved. You also should not change the TEB pointer.

The gp register points to your global variables. To access a global variable, you must address it indirectly through the gp register. The gp register is kept up-to-date when you jump from one DLL to another (by means described later).

The other static registers are scratch.

All integer registers are 64 bits wide, with an additional NaT bit attached to each one. NaT stands for "not a thing" and is used by speculative execution to indicate that the register's value is undefined.

The static integer registers do not participate in register stacking. The behavior of the stacked registers (including their preservation rules) is described in the Procedure Calls and the Register Stack subsection of this section.

The return value registers hold function return values and therefore can be destroyed across a function call. Notice that there are four integer return value registers at 64 bits each, for a total of 256 bits or 32 bytes of data that can be returned from a function (twice the size of a GUID).

Floating-Point Registers

f0	Reads as 0.0 (writing to it will cause an access violation).
f1	Reads as 1.0 (writing to it will cause an access violation).
f2 − f31	Static floating-point registers.
f32 − f127	Rotating floating-point registers

This document does not discuss floating-point code.

By convention, floating-point registers f2 through f5 and f16 through f31 are preserved across calls; the rest are scratch.

Predicate Registers

p0	Reads as TRUE (writes are ignored).
p1 − p15	Static predicate registers.
p16 − p63	Rotating predicate registers.

Predicate registers act as flags. They record the result of comparison instructions, and you can test them later to perform some sort of conditional action (called "predication"). You can predicate almost any instruction, not just jump instructions. For example, the instruction

(p1) add ret0 = r32, r33

means "set register ret0 equal to register r32 plus register r33 if register p1 is TRUE; otherwise, do nothing." Allowing arbitrary predication helps performance significantly: You have fewer jump instructions and therefore fewer mispredictions. You can also pack your instructions more compactly, because jump targets must begin on a bundle boundary, but predicated instructions can go anywhere.

The parenthesized predicate register is called the qualifying predicate (abbreviated as "qp"). Because predicate register zero is always TRUE, unconditional instructions are internally encoded as conditional on p0.
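As a rough illustration, predication can be modeled in C by giving every operation a Boolean guard. This is a sketch of the semantics only, not Itanium code; the function names predicated_add and predicated_max are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models "(qp) add dest = a, b" and returns the new value of dest.
   Hypothetical helper, for illustration only. */
uint64_t predicated_add(bool qp, uint64_t dest, uint64_t a, uint64_t b)
{
    return qp ? a + b : dest;   /* FALSE predicate: dest is left unchanged */
}

/* Branch-free maximum, modeling:
       cmp.gt p1, p2 = a, b ;;
       (p1) mov r = a
       (p2) mov r = b                                                      */
int64_t predicated_max(int64_t a, int64_t b)
{
    bool p1 = a > b;            /* result of the comparison    */
    bool p2 = !p1;              /* the opposite of the result  */
    int64_t r = 0;
    r = p1 ? a : r;             /* (p1) mov r = a */
    r = p2 ? b : r;             /* (p2) mov r = b */
    return r;
}
```

Exactly one of the two guarded moves in predicated_max takes effect, so no jump is needed to select the result.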

By convention, predicate registers p6 through p15 are scratch; the rest are preserved.

There is a special pseudo-register called pr (called preds by the debugger) that consists of the 64 predicate registers combined to form a single 64-bit value. This lets you read and write all the predicate registers so you can preserve them across a call.
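The combined value can be picked apart with ordinary bit operations. The following sketch assumes that bit n of the 64-bit value corresponds to predicate register p<n>; the helper names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if predicate register p<n> is TRUE in the combined value.
   Assumes bit n of preds holds p<n>. */
bool predicate_is_set(uint64_t preds, unsigned n)
{
    return n < 64 && ((preds >> n) & 1u) != 0;
}

/* Counts how many of the 64 predicate registers are TRUE. */
unsigned count_true_predicates(uint64_t preds)
{
    unsigned count = 0;
    for (unsigned n = 0; n < 64; n++)
        if (predicate_is_set(preds, n))
            count++;
    return count;
}
```

Under that assumption, the value preds = 8941 shown in the register dump later in this section has p0, p6, p8, p11, and p15 set to TRUE.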

Branch Registers

The branch registers b0 through b7 are used for computed jump instructions. These are dedicated to computed jumps so the processor can optimize more efficiently for them.

By convention, the following meanings are assigned to the branch registers.

rp	Return address (b0)
b1 − b5	Preserved across procedure calls
b6 − b7	Scratch

Application Registers

Of the application registers, ar.unat and ar.lc must be preserved across calls.

Procedure Calls and the Register Stack

Each procedure declares how many parameters it has (input registers), how many private registers it needs (local registers), and the maximum number of parameters of any function it calls (output registers). The input registers plus the local registers together are called the "local region." The input registers plus the local registers plus the output registers together are called the "register stack frame".

For example, this function

int sampleFunction(int a, int b)
{
    int c;
    c = someFunction(0) + otherFunction(a, b);
    return c;
}

would specify two input registers, some number of local registers, and two output registers.

When control enters a procedure, the alloc instruction shuffles the registers around.

Suppose the calling function uses five local registers (r32 through r36) and it wants to call sampleFunction(3, 4).

Before calling the function, the registers would be arranged like this:

static         local rgn   output
r0   ...   r31 r32     r36 r37 r38 r39 ...
+--------------+-----------+-------+-------------
| 0   ...   aaa|bbb ... ccc|  3   4|??? ...
+--------------+-----------+-------+-------------

Registers r37 and r38 are the output registers and contain the parameters to the function.

When sampleFunction gets control, it executes an alloc instruction.

alloc r34 = ar.pfs, 2, 2, 2, 0

Essentially, this means: "Set up the register stack frame with 2 input registers, 2 local registers, 2 output registers, and zero rotating registers. Save the previous register frame state (pfs) in register r34 so it can be restored later."

This instruction shuffles the registers like this:

static         local rgn       output
r0   ...   r31 r32 r33 r34 r35 r36 r37 ...
+--------------+---------------+-------+----------
| 0   ...   aaa|  3   4 pfs ???|??? ???|...
+--------------+---------------+-------+----------

The static registers did not change. What used to be the output registers are now the input registers; the local registers are uninitialized (except for register r34, which was explicitly set to ar.pfs by the first parameter of the alloc instruction). What used to be in the local variables got pushed onto the register stack; they are not accessible to the called function.

Note The distinction between input and local registers is purely semantic. It makes no difference for the processor. The only important thing to the encoding of the instruction is the size of the local region, so when you read the instruction in disassembly, the first number is the size of the local region and the second number is always zero. Thus, the preceding alloc instruction will be disassembled as

alloc r34 = ar.pfs, 4, 0, 2, 0

Note In reality, this register shuffling process happens in two stages. The br.call instruction is the one that renumbers the registers. The alloc instruction is the one that describes the layout of the registers on the receiving side. However, in simple cases, you can act as if the alloc instruction does it all. In complicated situations, you may want to execute multiple alloc instructions to reinterpret all your registers.

The called function now performs its operation, maybe calls some other functions, and then when it is ready to return, the registers look like this:

static         local rgn       output
r0...ret0..r31 r32 r33 r34 r35 r36 r37 ...
+--------------+---------------+-------+----------
| 0...rrr...aaa|xxx yyy zzz www|??? ???|...
+--------------+---------------+-------+----------

It has placed the function return value (rrr) into the ret0 register. All the registers in the local region contain whatever values are left over from the computation.

When control returns to the calling function, the register stack is popped and the registers look like this:

static         local rgn   output
r0...ret0..r31 r32     r36 r37 r38 r39 ...
+--------------+-----------+-------+-------------
| 0...rrr...aaa|bbb ... ccc|??? ???|??? ...
+--------------+-----------+-------+-------------

All the old registers in the local region are restored from the register stack, and the return value is now available in the ret0 register. Note that the values in the output registers have returned to garbage. You cannot rely on the values of output registers being preserved across a call.
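The renumbering in this walkthrough can be sketched in C as a sliding window over a physical register array. This is a deliberately simplified model: it ignores spilling to memory, wraparound, and the alloc instruction's resizing of the frame, and all names (RegStack, reg, call_enter, call_return) are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_STACKED 96            /* r32 through r127 */

typedef struct {
    uint64_t phys[PHYS_STACKED];   /* physical stacked registers */
    unsigned base;                 /* physical index that r32 currently maps to */
} RegStack;

/* Returns a pointer to logical register r<n> (n >= 32) in the current frame. */
uint64_t *reg(RegStack *rs, unsigned n)
{
    assert(n >= 32 && rs->base + (n - 32) < PHYS_STACKED);
    return &rs->phys[rs->base + (n - 32)];
}

/* Models br.call: slide the window past the caller's local region, so the
   caller's first output register becomes the callee's r32. The returned
   value stands in for the frame state that the callee saves from ar.pfs. */
unsigned call_enter(RegStack *rs, unsigned caller_local_size)
{
    unsigned saved_base = rs->base;
    rs->base += caller_local_size;
    return saved_base;
}

/* Models br.ret: restore the caller's window. */
void call_return(RegStack *rs, unsigned saved_base)
{
    rs->base = saved_base;
}
```

With a caller that has five local registers and has placed 3 and 4 in r37 and r38, calling call_enter(&rs, 5) makes those values visible as r32 and r33, and call_return makes the caller's r32 through r36 visible again.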

The hardware supports any number of output registers, but by convention, only the first eight are used to contain function parameters. Parameters beyond eight are passed on the stack.

The register stack is itself limited in size. If you push too many registers onto the register stack, it spills into regular memory.

The gp Register

The gp register is used to access global variables in your module. The rules for its management are complicated:

On entry to a function, gp is assumed to be properly initialized.
When you call a function within your module, gp must be properly initialized.
When you call a function outside your module, the gp register can be destroyed on return.
When you return from a function, gp must contain the value it did when your function started.

These rules allow for the following procedure call paradigms:

Call to function in same module: Simply call it; there is no need to set up gp. The gp register will still be valid on return.
Call to function in other module: Set up gp appropriate for the module you are calling, then restore it after the function returns.

So how do you set up gp for the target? A function pointer on the Itanium is really a pointer to a block of data that describes the target function. One of the items in that block of data is the value of gp that the target function expects.

Return to the sampleFunction example. Assume this is an exported function. In this case, the code to call it would be:

        mov     r37 = 3             // set up the function parameters
        mov     r38 = 4
        addl    r31 = 0x108, gp;;   // r31 -> import table entry
        ld8     r30 = [r31];;       // r30 -> sampleFunction descriptor
        ld8     r41 = [r30], 8;;    // r41 = actual address of function
        ld8     gp = [r30]          // set up gp for sampleFunction
        mov     b6 = r41            // set up the branch register
        br.call rp = b6;;           // call the function
        mov     gp = ...            // restore gp to original value

This means that if you try to disassemble at sampleFunction, you will end up just looking at the function descriptor rather than the function itself. The symbol for the function's code begins with a dot, so if you want to see the code for sampleFunction, you have to type u .sampleFunction.
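The two loads in the call sequence suggest the shape of the descriptor: the code address in the first 8 bytes and the gp value in the next 8. The following C sketch models only those two fields; the type and function names are invented for this example, and a real descriptor is not guaranteed to be declared this way in any header.

```c
#include <stdint.h>

/* Layout implied by the two loads in the call sequence. */
typedef struct {
    uint64_t entry_point;   /* loaded by: ld8 r41 = [r30], 8 */
    uint64_t gp;            /* loaded by: ld8 gp  = [r30]    */
} FunctionDescriptor;

_Static_assert(sizeof(FunctionDescriptor) == 16, "two 8-byte fields");

/* What the caller needs before it can execute br.call. */
typedef struct {
    uint64_t branch_target; /* goes into a branch register such as b6 */
    uint64_t gp;            /* goes into the gp register */
} CallSetup;

/* Mirrors the load sequence: read both fields before branching. */
CallSetup prepare_call(const FunctionDescriptor *fd)
{
    CallSetup setup;
    setup.branch_target = fd->entry_point;
    setup.gp = fd->gp;
    return setup;
}
```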

Debugger Register Dump

     dbi0 =                0         dbi1 =                0
     dbi2 =                0         dbi3 =                0
     dbi4 =                0         dbi5 =                0
     dbi6 =                0         dbi7 =                0
     dbd0 =                0         dbd1 =                0
     dbd2 =                0         dbd3 =                0
     dbd4 =                0         dbd5 =                0
     dbd6 =                0         dbd7 =                0

       gp =         77560000 0         r2 =      6fbffc90bf0 0
       r3 = c000000000000309 0         r4 =         78190000 0
       r5 =          1010aa0 0         r6 =      6fbfffde000 0
       r7 =                0 0         r8 =      6fbffe8d948 0
       r9 = ffffffffffffffff 0        r10 =                0 0
      r11 =         ffffffff 0         sp =      6fbffe8d520 0
      r13 =      6fbfffdc000 0        r14 =      6fbffc90be8 0
      r15 =                0 0        r16 = e0000165e11ae7f0 0
      r17 = e0000165e11ae910 0        r18 =      6fbffc90bec 0
      r19 =                0 0        r20 =    9804c8a70033f 0
      r21 = e000000086b8ee20 0        r22 = e000000086b96040 0
      r23 =                1 0        r24 =             7f05 0
      r25 =               2f 0        r26 =              14e 0
      r27 =                6 0        r28 =      6fb000006fb 0
      r29 =            6fbff 0        r30 =              103 0
      r31 =                0 0    intnats =                0

    preds =             8941

       b0 =         772b43b0           b1 = e0000000ffa005c0
       b2 =         766e9e88           b3 =                0
       b4 =                0           b5 =                0
       b6 =         772ba2e0           b7 =                0

     unat =                0           lc =                0
       ec =                0          ccv = e000000086c577b0
      dcr =             7f05          pfs = c000000000000309
      bsp =      6fbffe90cc8     bspstore =      6fbffe90cc8
      rsc =                f         rnat =                0
     ipsr =     1013082a6018          iip =         772b4310
      ifs = 8000000000000204          fcr =               40
    eflag =              202          csd = cfbfffff00000000
      ssd = cf3fffff00000000        cflag =              111
      fsr =                0          fir =                0
      fdr =                0

      r32 =      6fbffe8d948 0        r33 =      6fb000006fb 0
      r34 =              104 0        r35 =             8941 0

At the top of this display are eight instruction debug registers (dbi) and eight data debug registers (dbd).

Next come the static integer registers, followed by intnats, which consists of all the NaT bits combined to form a 32-bit integer.

Then preds, which is all the predicate registers combined into a 64-bit integer.

Next are all the branch registers.

Then there are several special-purpose registers. Of these, the following two are probably the only ones you will need to deal with:

ar.pfs describes the stack frame of the previous function.
ar.ccv is an implicit parameter for the cmpxchg instruction.

Finally shown are the registers for the current register stack frame. You happen to be in a function that has four registers in its frame, so the debugger showed the first four stacked registers, r32 through r35. If the function used more registers, this part would have been larger.
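The frame size can also be read out of the ifs and pfs values. The sketch below assumes the frame-marker layout given in the Itanium architecture manuals (frame size in bits 0 through 6, local-region size in bits 7 through 13); that layout is not described in this document, and the names are invented for this example.

```c
#include <stdint.h>

typedef struct {
    unsigned size_of_frame;    /* inputs + locals + outputs */
    unsigned size_of_locals;   /* inputs + locals (the local region) */
} FrameMarker;

/* Decodes the low bits of ifs or ar.pfs.
   Assumed layout: bits 0..6 = frame size, bits 7..13 = local-region size. */
FrameMarker decode_frame_marker(uint64_t value)
{
    FrameMarker fm;
    fm.size_of_frame  = (unsigned)(value & 0x7F);
    fm.size_of_locals = (unsigned)((value >> 7) & 0x7F);
    return fm;
}
```

Under that assumption, ifs = 8000000000000204 decodes to a frame of 4 registers, which agrees with the four stacked registers shown at the end of the dump.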

Notation

x	1 (byte), 2 (word), 4 (long) or 8 (quad)
Ra, Rb, Rc...	Registers
imm_n	Signed n-bit constant
Rb/imm_n	Register or a constant
Ba/addr	Branch register or an address
cc	Condition (such as eq or ne)
Rb<n1, n2>	n2 bits from Rb starting at position n1.
f	Floating-point type (s, d, e)

For example, Rb<5, 4> means "extract 4 bits from Rb starting at position 5," which is the value (Rb >> 5) & 0x0F.
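The general form of this notation can be written as a small C helper. The name extract_bits is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Rb<pos, len>: len bits of rb starting at bit position pos. */
uint64_t extract_bits(uint64_t rb, unsigned pos, unsigned len)
{
    assert(pos < 64 && len >= 1 && len <= 64 - pos);
    /* Shifting by 64 is undefined in C, so handle the full-width case. */
    uint64_t mask = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
    return (rb >> pos) & mask;
}
```

For instance, extract_bits(rb, 5, 4) computes the same value as (rb >> 5) & 0x0F.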

Many instructions can be modified by suffixes called completers.

Addressing Modes

Unlike the x86 instruction set, which has a significant number of addressing modes, the Itanium instruction set has only one addressing mode: Register indirect. The notation for register indirect is "[r]" which means "the value stored at memory location r." The thing inside the brackets is always a register.

Instruction Format and Pipelining

Instructions are packaged into groups of three called bundles. If you execute an .asm verbose command, the instructions that belong to a bundle will be surrounded by curly brackets. Bundles always start on 16-byte boundaries (in other words, the last digit of the hexadecimal address is zero).

You cannot jump into the middle of a bundle.

There are restrictions on what type of instruction can be put into a particular slot in a bundle. For example, one valid bundle type is MII, which means that slot 0 (zero) is a memory access, slot 1 is an integer instruction, and slot 2 is another integer instruction.
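To make the bundle structure concrete, the following C sketch splits a 16-byte bundle into its template and three slots. The bit positions (a 5-bit template followed by three 41-bit slots) come from the Itanium architecture manuals rather than from this document, and the names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

typedef struct {
    unsigned template_id;   /* selects the slot types and stop positions */
    uint64_t slot[3];       /* three 41-bit instruction slots */
} Bundle;

/* Bundles always start on 16-byte boundaries. */
bool is_bundle_address(uint64_t address)
{
    return (address & 0xF) == 0;
}

/* lo holds bits 0..63 of the bundle, hi holds bits 64..127. */
Bundle decode_bundle(uint64_t lo, uint64_t hi)
{
    Bundle b;
    b.template_id = (unsigned)(lo & 0x1F);               /* bits 0..4    */
    b.slot[0] = (lo >> 5) & SLOT_MASK;                   /* bits 5..45   */
    b.slot[1] = ((lo >> 46) | (hi << 18)) & SLOT_MASK;   /* bits 46..86  */
    b.slot[2] = (hi >> 23) & SLOT_MASK;                  /* bits 87..127 */
    return b;
}
```

Slot 1 straddles the two 64-bit halves, which is why it is assembled from the top 18 bits of lo and the bottom 23 bits of hi.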

Valid instruction types are:

M - Memory/Move instruction
I - Complex Integer/Multimedia instruction
A - Simple Integer/Logic/Multimedia instruction
F - Floating-Point instruction (Normal/SIMD)
B - Branch instruction

Many instructions can be used in multiple slot types, in which case a completer is specified to disambiguate them. For example, there are five different nop instructions (nop.m, nop.i, nop.f, nop.b, nop.x), depending on the type of slot the instruction is placed into.

Each valid combination of instruction categories is called a template. There are 32 different templates. Some of the templates differ only in the placement of stops.

A stop is used to indicate that instructions after the stop depend on instructions before the stop. For example, if you have the following series of instructions

    mov r3 = r2
    add r1 = r2,r4 ;;
    add r2 = r1,r3

there is no dependency between the first two instructions, but the third instruction cannot execute until after the first two have completed. Therefore, the compiler inserts a stop, represented by a double semicolon, after the second instruction.
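The reasoning a compiler applies here can be sketched in C. The function below handles only the basic rule (an instruction may not read or write a register written earlier in the same group) and ignores predicates, memory, and the special cases; the Insn type and place_stops name are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 128

typedef struct {
    int dest;      /* register written, 0..127 */
    int src[2];    /* registers read; -1 if unused */
} Insn;

/* Sets stop_after[i] to true when a stop is needed after instruction i,
   that is, when instruction i+1 reads or writes a register that was
   written earlier in the same group. Writing a register that was only
   read earlier in the group is allowed. */
void place_stops(const Insn *code, size_t count, bool *stop_after)
{
    bool written[NUM_REGS] = { false };

    for (size_t i = 0; i < count; i++) {
        assert(code[i].dest >= 0 && code[i].dest < NUM_REGS);
        stop_after[i] = false;

        bool conflict = written[code[i].dest];
        for (int s = 0; s < 2; s++)
            if (code[i].src[s] >= 0 && written[code[i].src[s]])
                conflict = true;

        if (conflict && i > 0) {
            stop_after[i - 1] = true;          /* close the current group */
            for (int r = 0; r < NUM_REGS; r++)
                written[r] = false;
        }
        written[code[i].dest] = true;
    }
}
```

For the three instructions above, this places a single stop after the second instruction, because the third reads r1 and r3, both of which are written in the first group.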

Note that a stop does not have to go at the end of a bundle. Some bundle templates have stops in the middle, and some have more than one stop. For example, template 11 is an M|MI| bundle: slot 0 is a memory access followed by a stop, slot 1 is another memory access, and slot 2 is an integer instruction followed by another stop.

An instruction group is a sequence of instructions up to the next stop, taken branch, interrupt, or exception. The instructions within an instruction group cannot have dependencies among them. This allows the processor to execute them in parallel.

There are some exceptions and refinements to the no-dependencies rule.

A branch instruction is allowed to depend on a predicate register and branch register set elsewhere in the group.
You can use the result of a successful ld.c without an intervening stop.
Instructions after a branch will implicitly depend on whether the branch was taken; this is acceptable. However, instructions after a branch cannot interfere with instructions before the branch.
Comparison instructions .and, .andcm, .or and .orcm are allowed to combine with others of the same type into the same target registers. (This means you cannot combine an .and with an .or.)
You are allowed to write to a register after a previous instruction reads it, with rare exceptions.
Two instructions in the group cannot both write to the same register, with the exception of combined comparison instructions as already noted.

This description omits a number of details that are important only to compiler-writers or people hand-writing Itanium assembly. When reading disassembly, you can assume that the compiler or author generated correct code, unless you are tracking a compiler bug.

Speculative Execution

The ld instruction (load from memory) supports an .s completer, which means that execution of the instruction is speculated. (You cannot speculatively write to memory.)

An instruction that can be speculated is called a speculative instruction, rather than a speculatable instruction. Consequently, this documentation will use the word "speculated" to refer to the .s variant of the ld instruction.

A speculated load is just like a regular load, except that if an exception occurs, the processor sets the NaT bit in the destination register instead of raising the exception.

If any input to an integer computation instruction has the NaT bit set, then the result of the computation will also have the NaT bit set. If any input to a comparison instruction has the NaT bit set, then the result of the comparison is always FALSE.

You can also speculate floating-point instructions, but instead of setting the NaT bit, the entire floating-point register is set to a special NaT-like value called NaTVal. As with the integer case, NaTVal infects all subsequent computations.

Perhaps you realize that the speculated execution was not necessary. (Maybe you started going down the TRUE branch of an IF statement, only to discover that the value is FALSE.) In which case, you just ignore the registers that you changed with the speculated execution and continue on your way. (Don't look at them, of course, because they might be NaT.)

If you decide that the speculated execution was worthwhile, execute a chk.s instruction, which means "verify that this register contains an actual value. If it is a NaT, then jump to the recovery code." The recovery code typically just consists of all the speculated instructions re-executed normally, so the exception can be raised.
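The behavior described in this subsection can be modeled in C by pairing each register value with a NaT flag. This is a sketch of the semantics only; a faulting access is modeled as an out-of-range array index, and all names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t value;
    bool nat;          /* "not a thing": the value is undefined */
} Reg;

/* Models ld8.s: a faulting access sets NaT instead of raising an exception. */
Reg load_speculated(const uint64_t *memory, size_t size, size_t index)
{
    Reg r = { 0, true };
    if (index < size) {
        r.value = memory[index];
        r.nat = false;
    }
    return r;
}

/* Models add: a NaT in any input propagates to the result. */
Reg reg_add(Reg a, Reg b)
{
    Reg r = { a.value + b.value, a.nat || b.nat };
    return r;
}

/* Models cmp.eq: a NaT input makes the comparison result FALSE. */
bool reg_equal(Reg a, Reg b)
{
    return !a.nat && !b.nat && a.value == b.value;
}

/* Models chk.s: true means "jump to the recovery code". */
bool needs_recovery(Reg r)
{
    return r.nat;
}
```

Because the NaT flag travels with every result computed from the load, a single check on the final register is enough to detect a failure anywhere in the speculated chain.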

Aside from the instructions previously mentioned (which can handle NaTs) and a few other special instructions, attempting to use the value of a NaT register will cause an exception. This is not useful in general because you cannot tell which speculated instruction caused the exception.

Advanced Loads

The Itanium contains special techniques to optimize in the face of aliased pointers. For example, consider this code snippet:

int MyNewFcn(int *p1, int *p2)
{
    *p1 = 1;
    *p2 = 2;
    return *p1;
}

This function usually returns 1, but if p1 and p2 both point to the same address, then it will return 2.
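The two outcomes are easy to reproduce. The following sketch repeats MyNewFcn and adds two helper functions, whose names are invented for this example, that call it with distinct and with aliased pointers.

```c
int MyNewFcn(int *p1, int *p2)
{
    *p1 = 1;
    *p2 = 2;
    return *p1;
}

/* p1 and p2 point to different variables: the store through p2
   cannot affect *p1. */
int call_with_distinct_pointers(void)
{
    int a = 0, b = 0;
    return MyNewFcn(&a, &b);
}

/* p1 and p2 point to the same variable: the store through p2
   overwrites the value stored through p1. */
int call_with_aliased_pointers(void)
{
    int a = 0;
    return MyNewFcn(&a, &a);
}
```

The first helper returns 1 and the second returns 2. Because the compiler usually cannot rule out the second case, it cannot simply return the constant 1 without rereading *p1.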

The advanced load instruction ld.a means "load this value from memory and remember the access as successful." If there is a subsequent write to that address, the access is removed from the list, rendering the advanced load unsuccessful. (Other events can also remove an entry from the list; for example, if you ask it to remember too many advanced loads, it starts forgetting the older ones.)

Later on, you can check whether the advanced load is still valid. If it is not, it means that the value was modified and you have to reload it.

There are two types of checks. The simplest check is ld.c.nc or ld.c.clr. This says, "Check if that advanced load is still valid. If not, then reload the value." The .clr completer means that this advanced load is not important anymore, so the processor can free up the entry for recording new advanced loads; the .nc completer means that this advanced load is still valuable, so do not clear it from the table.

Here is an example that uses ld.c.clr.

    ld8.a   r6 = [r8]       ;;      // read memory at r8, remember the address
    ...                             // perform operation that might modify memory...
    ld8.c.clr r6 = [r8]             // reload r6 from [r8] if necessary
    ...                             // perform operations with the value in r6

In order to avoid a stall, the compiler requested that register r6 be loaded from the memory address specified by register r8 in advance of when it actually needed the result. Then, the compiler wanted to use the value in r6, but had not determined if, in the meantime, some pointer dereference had modified the value, thus rendering the prefetched value in r6 useless.

The ld8.c.clr instruction checks if anything has written to that address (even on another processor). If not, then the instruction does nothing. However, if something has indeed written to the address, then the instruction refetches the value (taking the memory stall).
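The bookkeeping behind ld.a and ld.c.clr can be modeled in C as a small table of remembered loads. This is a sketch only: the table size, the round-robin replacement, and all names are invented for this example, and real hardware tracks addresses differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_ENTRIES 4        /* illustrative size only */

typedef struct {
    bool valid;
    int reg;                   /* destination register of the advanced load */
    const uint64_t *address;   /* address that was loaded */
} TableEntry;

typedef struct {
    TableEntry entry[TABLE_ENTRIES];
    unsigned next;             /* round-robin: older entries get forgotten */
} AdvancedLoadTable;

/* Models ld8.a: load the value and remember the access. */
uint64_t load_advanced(AdvancedLoadTable *t, int reg, const uint64_t *address)
{
    for (unsigned i = 0; i < TABLE_ENTRIES; i++)   /* one entry per register */
        if (t->entry[i].reg == reg)
            t->entry[i].valid = false;

    TableEntry *e = &t->entry[t->next];
    t->next = (t->next + 1) % TABLE_ENTRIES;
    e->valid = true;
    e->reg = reg;
    e->address = address;
    return *address;
}

/* Models a store: write memory and drop every entry for that address. */
void store(AdvancedLoadTable *t, uint64_t *address, uint64_t value)
{
    *address = value;
    for (unsigned i = 0; i < TABLE_ENTRIES; i++)
        if (t->entry[i].valid && t->entry[i].address == address)
            t->entry[i].valid = false;
}

/* Models ld8.c.clr: keep `current` if the entry survived, otherwise
   reload from memory. Either way, the entry is released. */
uint64_t load_check_clear(AdvancedLoadTable *t, int reg,
                          const uint64_t *address, uint64_t current)
{
    for (unsigned i = 0; i < TABLE_ENTRIES; i++) {
        TableEntry *e = &t->entry[i];
        if (e->valid && e->reg == reg && e->address == address) {
            e->valid = false;
            return current;        /* advanced load still valid */
        }
    }
    return *address;               /* entry gone: take the reload */
}
```

A store to an unrelated address leaves the entry in place, so the check keeps the early value; a store to the remembered address removes the entry, so the check reloads.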

The second type of check, chk.a, is for cases where you need a more complicated recovery than just reloading the value.

Here is an example that uses chk.a.clr.

    ld8.a   r6 = [r8]       ;;      // read memory at r8, remember the address
    add     r5 = r6, r7     ;;      // precompute based on advanced read
    ...                             // perform operations that might modify memory...
    chk.a.clr r6, failed            // memory was modified
continue:
    ...                             // perform operations with the value in r5

failed:
    ld8     r6 = [r8]       ;;      // load the correct value
    add     r5 = r6, r7             // redo the precomputation
    br      continue                // and then continue as if nothing was wrong

After doing some precomputation with the value you loaded from the address in r8, you execute a chk.a.clr instruction, which checks if your advanced load is still valid. If not, you jump to failed, where you reload the value and redo the precomputation, then jump back to continue normal execution.

Speculated Advanced Loads

You can combine the preceding two techniques. The ld.sa instruction performs a speculated advanced load.

Register Rotation

Register rotation is an advanced technique where the registers renumber themselves each time they go through a loop. It is not covered in this documentation.

Control Flow

The types of control flow you will see most of the time are jump (br.cond), call (br.call), and return (br.ret). (Conditional jumps are just regular jump instructions with a qualifying predicate attached in front.)

The jump and call instructions also have long versions (brl) if the target of the jump is really far away.

The brl instruction is actually emulated on Itanium, so do not expect it to be fast.

Branch Prediction

Each of the standard jump instructions also includes a group of completers. The first completer determines whether the jump should be predicted taken or not taken.

.spnt: Static predict not taken. Always predict not taken.
.sptk: Static predict taken. Always predict taken.
.dpnt: Dynamic predict not taken. Use the prediction hardware. If prediction hardware cannot tell, then predict not taken.
.dptk: Dynamic predict taken. Use the prediction hardware. If prediction hardware cannot tell, then predict taken.

Prediction hardware might not be able to tell, because this instruction was never encountered before, or it was last encountered so long ago that it fell out of the cache.

The second completer specifies how aggressively the processor should prefetch at the jump target; in other words, how sure you are that the prediction is correct.

.few: Prefetch a few instructions. Your prediction could be wrong, or it is not worth prefetching.
.many: Prefetch many instructions. You are confident that your prediction is correct.

Most jumps within a procedure will be marked .few, whereas unconditional subroutine calls and unconditional return instructions are usually marked .many.

Finally, there is an optional completer.

.clr: Clear this entry.

If you clear the entry, the processor will wipe out any knowledge of this jump instruction. Do this if you know the instruction will not be encountered again for a long time.

There is also a bonus instruction brp whose sole purpose is to indicate to the processor: "That computed jump instruction ahead is going to jump to here."

Comparisons

In its simplest form, the comparison instruction compares two values and stores the result into two predicate registers; one gets the result of the comparison, and the other gets the opposite of the result. For example,

cmp.eq p1, p2 = r32, r33

compares the two registers for equality and stores the result into p1. Meanwhile, the p2 register gets the opposite value. For example, if they were equal, then p1 would be TRUE and p2 would be FALSE.

The next most complicated comparison instruction is called the unconditional comparison and it is always used with a qualifying predicate. Here is a sample unconditional comparison:

(p3) cmp.eq.unc p1, p2 = r32, r33

If the qualifying predicate p3 is TRUE, then this acts just like a regular comparison instruction. However, if the qualifying predicate is FALSE, then both the p1 and p2 registers are set to FALSE. This is a rare case where a qualifying predicate has an effect even though it is FALSE. (Normally, if a qualifying predicate is FALSE, the entire instruction is ignored.)
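The difference between the two forms can be modeled in C. This is a sketch of the semantics only; the PredPair type and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool p1;
    bool p2;
} PredPair;

/* Models "(qp) cmp.eq p1, p2 = a, b".
   old holds the values of p1 and p2 before the instruction. */
PredPair cmp_eq(bool qp, PredPair old, uint64_t a, uint64_t b)
{
    if (!qp)
        return old;                   /* instruction ignored */
    PredPair r = { a == b, a != b };
    return r;
}

/* Models "(qp) cmp.eq.unc p1, p2 = a, b". */
PredPair cmp_eq_unc(bool qp, uint64_t a, uint64_t b)
{
    PredPair r = { false, false };    /* both targets are cleared even if qp is FALSE */
    if (qp) {
        r.p1 = (a == b);
        r.p2 = (a != b);
    }
    return r;
}
```

Note that cmp_eq_unc does not need the old predicate values at all, because it writes both targets whether or not the qualifying predicate is TRUE.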

The next most complicated comparisons are the parallel comparisons. These are used when you have a chain of "a && b && c" or "a || b || c" results. Here is a sample AND parallel comparison:

cmp.eq.and p1, p2 = r32, r33

This is expressible in C as

p1 = p1 && (r32 == r33)
p2 = p2 && (r32 == r33)

In other words, if the comparison is false, then both predicate registers are set to FALSE; otherwise, they are left alone.

The other variations of parallel comparisons are:

    cmp.eq.andcm p1, p2 = r32, r33

        p1 = p1 && !(r32 == r33)
        p2 = p2 && !(r32 == r33)

    cmp.eq.or p1, p2 = r32, r33

        p1 = p1 || (r32 == r33)
        p2 = p2 || (r32 == r33)

    cmp.eq.orcm p1, p2 = r32, r33

        p1 = p1 || !(r32 == r33)
        p2 = p2 || !(r32 == r33)

and the DeMorgan operators...

    cmp.eq.or.andcm p1, p2 = r32, r33

        p1 = p1 || (r32 == r33)
        p2 = p2 && !(r32 == r33)

    cmp.eq.and.orcm p1, p2 = r32, r33

        p1 = p1 && (r32 == r33)
        p2 = p2 || !(r32 == r33)

For example, the expression p5 = (r4 == 0) || (r5 == r6) can be computed as follows (assuming that p5 is preinitialized to FALSE):

cmp.eq.or p5, p0 = r0, r4
cmp.eq.or p5, p0 = r5, r6

Notice that because these are both OR type comparisons, they can be combined into a single instruction group and, therefore, executed in parallel.
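The reason the two comparisons can share an instruction group is visible in a C model: an OR-type comparison either writes TRUE or writes nothing, so the order in which the two execute does not matter. The function names below are invented for this example, and the second target (p0 in the assembly) is omitted because writes to it are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models "cmp.eq.or p, p0 = a, b": can only set the target to TRUE. */
void cmp_eq_or(bool *p, uint64_t a, uint64_t b)
{
    if (a == b)
        *p = true;
}

/* Models "cmp.eq.and p, p0 = a, b": can only clear the target to FALSE. */
void cmp_eq_and(bool *p, uint64_t a, uint64_t b)
{
    if (a != b)
        *p = false;
}

/* Computes p5 = (r4 == 0) || (r5 == r6) the way the two-instruction
   sequence does. */
bool either_condition(uint64_t r4, uint64_t r5, uint64_t r6)
{
    bool p5 = false;            /* must be preinitialized to FALSE */
    cmp_eq_or(&p5, 0, r4);      /* r0 always reads as zero */
    cmp_eq_or(&p5, r5, r6);
    return p5;
}
```

Swapping the two cmp_eq_or calls in either_condition gives the same result, which is the property that lets the hardware run them in parallel.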
