The New Core Improvements
Intel's "brand" new Core 2 Duo has many improvements over Pentium M. With respect to the x86 decode stage, they include -
- Improved micro-fusion
- 4-wide decode
The improved micro-fusion is the least complicated, so we will just briefly describe it here. It is composed of using a bigger XLAT PLA (see the partial decoder diagram in Part 2) that can handle more load-modify or addressed store instructions, including many SSE2/SSE3 ones. This improves Core 2's SSE performance over its predecessors, which must re-steer many SSE instructions to the first (full) decoder to be processed. In fact, Core Solo/Core Duo (Yonah) already has improved micro-fusion over Pentium M, but on a smaller degree of instructions than Core 2 Duo.
On non-SSE codes, however, the performance boost is limited.
A 4-wide decode & issue width
The biggest marketing hype of Core 2 is certainly its ability to decode and issue 4 x86 instructions per cycle, thus achieving an IPC of 4 Instructions Per Cycle (or 5 with macro-fusion)! It turns out this is the biggest misconception around Core 2. As discussion in Myth #3 of Part 1 article, a (sustained) rate of three x86 decodes per cycle is not the performance bottleneck yet. In fact, Intel's Optimization Reference Manual says in itself that
[Decoding 3.5 instructions per cycle] is higher than the performance seen in most applications.Note that this is stated under the conditions where branches, assumed once every 7 instructions, are predicted 100% correct, which is almost never the case and the sustained IPC is usually further reduced.
- 18.104.22.168 Instruction Fetch Unit (Instruction PreDecode)
Contrary to marketing slogan and common (mis-)belief, the main purpose of a 4-wide decode & issue (also macro-fusion discussed below) is really to combat the many undesirable design artifacts of P6's x86 decode engine. As seen in the end of Part 1 article, these design artifacts reduce efficiency of the 4-1-1 decoders, which under real circumstances can hardly sustain three x86 decodes per cycle. Specifically -
- Flushing decoding pipeline every 16 bytes, or about 4 to 5 x86 instructions in average.
- Flushing decoding pipeline at each complex (> 2 fused micro-op) instruction.
- Reducing instruction fetch for taken branches, especially to unaligned target address.
An additional partial decoder
For 1. and 2. in the above list, an additional partial decoder can help simply by raising the upper bound of the averaging range. For the purpose of discussion, suppose a 16-byte window contains four x86 instructions, and there is only one complex instruction among two such windows:
- A set of 4-1-1 decoders will spend 4 to 5 cycles to decode the two 16-byte instruction windows, where two cycles are spent on the window with only simple instructions, and another two or three are spent on the one with a complex instruction (depending on where the complex instruction occurs).
- A set of 4-1-1-1 decoders will spend only 3 to 4 cycles to decode the same two windows.
The Instruction Queue
The extra partial decoder, however, does not help the 3rd point in the previous list when a branch is taken, especially to an unaligned target address. Note that branch frequency is about 10-15% in normal programs (see also macro-fusion below). While many branch targets can be forced to be 16-byte aligned, it is usually not possible for small in-line loops to do so. If the entry point of the loop has address x MOD 16, then during the first cycle executing the loop, only 16 minus x fetched bytes contain effective instructions. This number does not increase no matter how many additional decoders you add to the decoding engine.
The real "weapon" the Core 2 Duo has against this branch-related inefficiency is not the 4-wide decoder, but a pre-decoded instruction queue of up to 18-deep x86 instructions. Refer to Part 1 article's first diagram on P6's Instruction Fetch Unit. There is a 16-byte wide, instruction boundary aligned Instruction Buffer sitting in-between the IFU and the decoders. Replacing this buffer with an 18 instruction-deep queue (probably 24 to 36 bytes in size) that can detect loops among the containing instructions, we get Core 2 Duo's biggest advantage with respect to x86 decode: ability to sustain continuous decode stream on short loops.
This continuous stream of x86 instructions allows Core 2 Duo's four decoders to be better utilized. The 18-instruction queue are aligned at instruction boundaries, and thus are immune to branch target (16-byte) misalignment problem. Although the 18-deep queue length easily becomes insufficient if loop unrolling, a compile-time optimization technique, is used, it is okay because unrolling a loop has the exact same effect as supplying a continuous instruction stream. More-over, the instruction queue also serves as a place where macro-fusion opportunities can be identified, as will be discussed next.
Without extensive simulation or real traces, we really can't be sure how much boost is received by Core 2 Duo from the 4-wide decode and the instruction queue. We have to make a guess; by using one extra partial decoder, the average sustained x86 decode throughput is probably increased from around 2.1 to about 2.5 macroinstructions (x86) per cycle. With the help of the instruction queue to supply uninterrupted macroinstructions in small loops, the sustained decode throughput is probably increased further to 2.7 or even close to 3.
Macro-fusion, the Myth and Truth
Debunking the Myth
Intel markets macro-fusion as the ability to increase x86 decode throughput from 4 to 5. As we have seen in the section above, the decode throughput without macro-fusion is much less than 4 and only close to 3. It turns out that macro-fusion has even less impact on improving the throughput, as is discussed here.
So what really is macro-fusion? In Intel's P6 terminology, "macro" or "macroinstruction" is used to describe an instruction in the original ISA (Instruction Set Architecture, here the x86). Thus macro-fusion is actually the exact same idea as micro-fusion, where two (or more) depending instructions with a single fan-out are collapsed into one instruction format (see the Part 2 article). The difference is on their application domain; where micro-fusion works on internal micro-ops, macro-fusion works on (x86) macrointructions. In fact, Intel's macro-fusion patent, System and Method for Fusing Instructions, filed in Dec.2000, predates its micro-fusion patent, Fusion of Processor Micro-Operations, filed in Aug.2002. It is probably due to two following reasons that the former is implemented later:
- Complexity (or difficulty)
- Limited usefulness
Why is it difficult, and what does it do?
First, we know that x86 instructions are complex and variable-length. Some x86 instructions take 6 clock cycles to only determine its length (page 2-7, Instruction PreDecode, of Intel's Optimization Reference Manual). The complexity of collapsing variable-length macroinstructions in when most cycle time is spent on decoding lengths (among other things) is undoubtedly much higher than that of fusing fixed-width micro-ops. Second, it will be even more difficult, if not impossible, to determine dependencies in real time, and fuse the depending macroinstructions together.
So instead of trying to fused all possible macroinstruction pairs, Core 2 Duo fuses only the selected macroinstructions -
- The first macroinstruction must be a TEST X, Y or a CMP X, Y where only one operand of X and Y is an immediate or a memory word.
- The second macroinstruction must be a conditional jump that checks the carry flag (CF) or zero flag (ZF).
- The macroinstructions are not working in 64-bit mode.
Bringing the Truth
Look at it closer, we realize that the purpose of macro-fusion is really not much to reduce the number of x86 instructions to be decoded, but again to reduce decode interruptions/stalls due to predicted-taken branches. Again for the purpose of discussion lets number the four x86 decoders as 0, 1, 2, and 3. A two-macroinstruction sequence can be steered to either of the following four positions: [0,1], [1,2], [2,3], [3,0]. If the conditional jump is predicted taken, then no instruction after it will be steered for decoding, and in two of the four cases (i.e., [0,1] and [3,0]) the four decoders will decode no other maroinstruction at all in the cycle. More specifically,
- Decoder slot [0,1], no other instruction decode, 0.25 probability
- Decoder slot [1,2], 1 other instruction decode, 0.25 probability
- Decoder slot [2,3], 2 other instruction decode, 0.25 probability
- Decoder slot [3,1], no other instruction decode, 0.25 probability
Thankfully, when looking at a bigger picture, the situation becomes much better. As previously stated, the frequency of conditional branch itself tops at 8-16% in the first place; in other words, in average one taken branch occurs in every 8 to 16 other instructions, or every 3 to 4 instruction fetch cycles (see the bottom of page 2-6 in Intel's Optimization Reference Manual). Suppose a taken branch occurs after 3 blocks of non-branching decodes, the 80% decoding efficiency loss at the branching block would result in less than 20% loss overall. This is why even without macro-fusion, Core 2's predecessor (Yonah) can already achieve IPC higher than 2 for some programs with only three x86 decoders.
Now lets look at what happens to the conditional branch decode when macro-fusion is added. Again, the first column is the decoder number occupied by the now fused branch macroinstruction; the second column is number of other instruction decodes; the last column is occurrence probability of the row:
- Decoder slot 0, no other instruction decode, 0.25 probability
- Decoder slot 1, 1 other instruction decode, 0.25 probability
- Decoder slot 2, 2 other instruction decode, 0.25 probability
- Decoder slot 3, 3 other instruction decode, 0.25 probability
This concludes the 3-part Decoding x86: From P6 to Core 2 series. I hope what's written here satisfy your curiosity with regard to the inner workings of modern microarchitectures, as they certainly do me over the course of my research/study on them. Please let me know if you have comments, suggestions, or even better, corrections, to the contents.