A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Friday, June 01, 2007

Decoding x86: From P6 to Core 2 - Part 3

This is Part 3 of a 3-part series. To fully appreciate what's written here, the Part 1 and Part 2 articles (or a comparable understanding) are prerequisites.

The New Core Improvements

Intel's "brand" new Core 2 Duo has many improvements over Pentium M. With respect to the x86 decode stage, they include -
  1. Improved micro-fusion
  2. 4-wide decode
  3. Macro-fusion
All of these have been described and repeated numerous times by many on-line review sites. Here we will look at them again in more technical and analytical detail.

The improved micro-fusion is the least complicated, so we will just briefly describe it here. It consists of a bigger XLAT PLA (see the partial decoder diagram in Part 2) that can handle more load-modify or addressed store instructions, including many SSE2/SSE3 ones. This improves Core 2's SSE performance over its predecessors, which must re-steer many SSE instructions to the first (full) decoder to be processed. In fact, Core Solo/Core Duo (Yonah) already has improved micro-fusion over Pentium M, but it covers a smaller range of instructions than Core 2 Duo.

On non-SSE codes, however, the performance boost is limited.

A 4-wide decode & issue width

The biggest marketing hype of Core 2 is certainly its ability to decode and issue 4 x86 instructions per cycle, thus achieving an IPC (Instructions Per Cycle) of 4 (or 5 with macro-fusion)! It turns out this is the biggest misconception around Core 2. As discussed in Myth #3 of the Part 1 article, a (sustained) rate of three x86 decodes per cycle is not the performance bottleneck yet. In fact, Intel's Optimization Reference Manual itself says that
[Decoding 3.5 instructions per cycle] is higher than the performance seen in most applications.
- 2.1.2.2 Instruction Fetch Unit (Instruction PreDecode)
Note that this is stated under the condition that branches, assumed to occur once every 7 instructions, are predicted 100% correctly. That is almost never the case, so the sustained IPC is usually lower still.

Contrary to the marketing slogan and common (mis-)belief, the main purpose of 4-wide decode & issue (and of macro-fusion, discussed below) is really to combat the many undesirable design artifacts of P6's x86 decode engine. As seen at the end of the Part 1 article, these design artifacts reduce the efficiency of the 4-1-1 decoders, which under real circumstances can hardly sustain three x86 decodes per cycle. Specifically -
  1. Flushing decoding pipeline every 16 bytes, or about 4 to 5 x86 instructions on average.
  2. Flushing decoding pipeline at each complex (> 2 fused micro-op) instruction.
  3. Reducing instruction fetch for taken branches, especially to unaligned target address.

An additional partial decoder

For 1. and 2. in the above list, an additional partial decoder can help simply by raising the upper bound of the averaging range. For the purpose of discussion, suppose a 16-byte window contains four x86 instructions, and there is only one complex instruction among two such windows:
  • A set of 4-1-1 decoders will spend 4 to 5 cycles to decode the two 16-byte instruction windows, where two cycles are spent on the window with only simple instructions, and another two or three are spent on the one with a complex instruction (depending on where the complex instruction occurs).
  • A set of 4-1-1-1 decoders will spend only 3 to 4 cycles to decode the same two windows.
By lifting the roof of the best-case capability, a wider x86 decode engine can increase the average decode throughput. Note that even under the ideal condition where branch-related stalls do not occur, the sustained decodes per cycle is still less than 2.7 (8 instructions in 3+ cycles), far from the value 4 or 5 as advertised by Intel.
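The cycle counts in the two bullets above can be reproduced with a small sketch. It assumes a simplified decoder model that is only implied by the text: a complex instruction decodes alone in its own cycle, and a decode group never crosses a 16-byte window boundary. The function names and the 'S'/'C' encoding are hypothetical, chosen only for illustration.

```python
def window_cycles(window, width):
    """Cycles to decode one 16-byte window under the assumed model.

    window: string of 'S' (simple) and 'C' (complex) instructions.
    width:  number of decoders (3 for 4-1-1, 4 for 4-1-1-1).
    """
    cycles = 0
    i = 0
    while i < len(window):
        cycles += 1
        if window[i] == "C":
            i += 1  # assumption: a complex instruction takes the whole cycle
            continue
        taken = 0
        # Fill the decoders with consecutive simple instructions.
        while i < len(window) and window[i] == "S" and taken < width:
            i += 1
            taken += 1
    return cycles


def total_range(width):
    """Min and max cycles for one all-simple window plus one window
    holding a single complex instruction at any of the four positions."""
    simple = window_cycles("SSSS", width)
    complex_windows = ["CSSS", "SCSS", "SSCS", "SSSC"]
    totals = [simple + window_cycles(w, width) for w in complex_windows]
    return min(totals), max(totals)


print(total_range(3))  # 4-1-1 decoders   -> (4, 5)
print(total_range(4))  # 4-1-1-1 decoders -> (3, 4)
```

Under this model the best case for the wider engine is 8 instructions in 3 cycles, which is where the "less than 2.7" bound comes from.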

The Instruction Queue

The extra partial decoder, however, does not help the 3rd point in the previous list when a branch is taken, especially to an unaligned target address. Note that branch frequency is about 10-15% in normal programs (see also macro-fusion below). While many branch targets can be forced to be 16-byte aligned, it is usually not possible for small in-line loops to do so. If the entry point of the loop has address x MOD 16, then during the first cycle executing the loop, only 16 minus x fetched bytes contain effective instructions. This number does not increase no matter how many additional decoders you add to the decoding engine.
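As a quick illustration of the "16 minus x" rule, the sketch below computes how many bytes of the first fetched block are useful for a given branch target. It assumes fetches are aligned 16-byte blocks, as described in the text; the function name and the example addresses are made up for illustration.

```python
FETCH_BLOCK = 16  # bytes per aligned instruction fetch, per the text


def useful_bytes_first_fetch(target_address):
    """Bytes of the first fetched block that lie at or after the branch target."""
    offset = target_address % FETCH_BLOCK  # the "x" in the text
    return FETCH_BLOCK - offset


print(useful_bytes_first_fetch(0x1000))  # aligned target: 16
print(useful_bytes_first_fetch(0x100D))  # x = 13: only 3 useful bytes
```

With only 3 useful bytes, at most one or two short instructions are available in that cycle, regardless of how many decoders exist.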

The real "weapon" the Core 2 Duo has against this branch-related inefficiency is not the 4-wide decoder, but a pre-decoded instruction queue up to 18 x86 instructions deep. Refer to the Part 1 article's first diagram on P6's Instruction Fetch Unit. There is a 16-byte wide, instruction-boundary-aligned Instruction Buffer sitting in between the IFU and the decoders. Replacing this buffer with an 18-instruction-deep queue (probably 24 to 36 bytes in size) that can detect loops among the instructions it contains, we get Core 2 Duo's biggest advantage with respect to x86 decode: the ability to sustain a continuous decode stream on short loops.

This continuous stream of x86 instructions allows Core 2 Duo's four decoders to be better utilized. The entries of the 18-instruction queue are aligned at instruction boundaries, and thus are immune to the branch target (16-byte) misalignment problem. Although the 18-deep queue easily becomes insufficient if loop unrolling, a compile-time optimization technique, is used, this is okay because unrolling a loop has the exact same effect as supplying a continuous instruction stream. Moreover, the instruction queue also serves as a place where macro-fusion opportunities can be identified, as will be discussed next.

Without extensive simulation or real traces, we really can't be sure how much boost is received by Core 2 Duo from the 4-wide decode and the instruction queue. We have to make a guess; by using one extra partial decoder, the average sustained x86 decode throughput is probably increased from around 2.1 to about 2.5 macroinstructions (x86) per cycle. With the help of the instruction queue to supply uninterrupted macroinstructions in small loops, the sustained decode throughput is probably increased further to 2.7 or even close to 3.

Macro-fusion, the Myth and Truth

Debunking the Myth

Intel markets macro-fusion as the ability to increase x86 decode throughput from 4 to 5. As we have seen in the section above, the decode throughput without macro-fusion is much less than 4 and only close to 3. It turns out that macro-fusion has even less impact on improving the throughput, as is discussed here.

So what really is macro-fusion? In Intel's P6 terminology, "macro" or "macroinstruction" is used to describe an instruction in the original ISA (Instruction Set Architecture, here the x86). Thus macro-fusion is actually the exact same idea as micro-fusion, where two (or more) dependent instructions with a single fan-out are collapsed into one instruction format (see the Part 2 article). The difference is in their application domain: where micro-fusion works on internal micro-ops, macro-fusion works on (x86) macroinstructions. In fact, Intel's macro-fusion patent, System and Method for Fusing Instructions, filed in Dec. 2000, predates its micro-fusion patent, Fusion of Processor Micro-Operations, filed in Aug. 2002. It is probably for the following two reasons that the former was implemented later:
  1. Complexity (or difficulty)
  2. Limited usefulness

Why is it difficult, and what does it do?

First, we know that x86 instructions are complex and variable-length. Some x86 instructions take 6 clock cycles just to determine their length (page 2-7, Instruction PreDecode, of Intel's Optimization Reference Manual). The complexity of collapsing variable-length macroinstructions when most cycle time is spent on decoding lengths (among other things) is undoubtedly much higher than that of fusing fixed-width micro-ops. Second, it will be even more difficult, if not impossible, to determine dependencies in real time and fuse the dependent macroinstructions together.

So instead of trying to fuse all possible macroinstruction pairs, Core 2 Duo fuses only selected macroinstructions -
  • The first macroinstruction must be a TEST X, Y or a CMP X, Y where only one operand of X and Y is an immediate or a memory word.
  • The second macroinstruction must be a conditional jump that checks the carry flag (CF) or zero flag (ZF).
  • The macroinstructions are not executing in 64-bit mode.
These test/compare and jump are often used in integer programs composed of iterative algorithms. According to a 2007 SPEC Benchmark Workshop paper, "Characterization of Performance of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor," the frequency of macro-fused operations in SPEC2006 CPU ranges from 0-16% in integer codes and just 0-8% in floating-point codes. In other words, in the best case, macro-fusion would reduce the number of macroinstructions from 100% to 92% for integer and just 96% for floating-point execution, hardly the whopping 20-25% reduction as described by Intel's marketing department (and the numerous on-line repeaters).
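The arithmetic behind the 92% and 96% figures can be sketched as follows. It assumes the reading implied by the numbers above: the quoted frequency is the share of macroinstructions that take part in fused pairs, and each pair of two collapses into one, so half of that share disappears. The function name is hypothetical.

```python
def remaining_instruction_fraction(fused_fraction):
    """Fraction of macroinstructions left after fusion.

    fused_fraction: share of macroinstructions that are part of a fused
    TEST/CMP + Jcc pair (assumed interpretation of the quoted figures).
    """
    # Two instructions become one, so only half of the share is removed.
    return 1.0 - fused_fraction / 2.0


for label, share in [("integer best case", 0.16), ("floating-point best case", 0.08)]:
    print(f"{label}: {remaining_instruction_fraction(share):.0%} of instructions remain")
```

Even at the top of the reported range, the instruction count shrinks by only 8%, well short of a 20-25% reduction.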

Bringing the Truth

Looking at it more closely, we realize that the purpose of macro-fusion is really not so much to reduce the number of x86 instructions to be decoded, but again to reduce decode interruptions/stalls due to predicted-taken branches. Again for the purpose of discussion, let's number the four x86 decoders as 0, 1, 2, and 3. A two-macroinstruction sequence can be steered to any of the following four positions: [0,1], [1,2], [2,3], [3,0]. If the conditional jump is predicted taken, then no instruction after it will be steered for decoding, and in two of the four cases (i.e., [0,1] and [3,0]) the four decoders will decode no other macroinstruction at all in the cycle. More specifically,
  • Decoder slot [0,1], no other instruction decode, 0.25 probability
  • Decoder slot [1,2], 1 other instruction decode, 0.25 probability
  • Decoder slot [2,3], 2 other instruction decodes, 0.25 probability
  • Decoder slot [3,0], no other instruction decode, 0.25 probability
The average number of other decodes is thus (1+2)*.25 = 0.75, or about 19% efficiency when the 4 decoders work on a block of macroinstructions containing conditional branches. Note that this assumes ideal conditions otherwise, including perfect branch prediction, all simple instructions, and no 16-byte instruction misalignment. In a realistic environment, the separate test-and-jump macroinstructions will probably reduce decode efficiency even more.
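The 0.75 average can be checked by enumerating the four placements. This is a minimal sketch of the accounting used above, which counts only the instructions decoded alongside the taken jump in the same cycle; the function name is hypothetical.

```python
from fractions import Fraction

DECODERS = 4  # decoder slots 0..3, as numbered in the text


def other_decodes_unfused(first_slot):
    """Other instructions decoded in the cycle that holds the taken jump.

    first_slot: decoder slot of the TEST/CMP; the jump takes the next slot.
    """
    jump_slot = (first_slot + 1) % DECODERS
    if jump_slot == 0:
        # TEST/CMP sat in slot 3, so the jump starts the next cycle alone.
        return 0
    # Slots before the TEST/CMP hold older instructions; slots after the
    # predicted-taken jump stay empty.
    return first_slot


cases = [other_decodes_unfused(s) for s in range(DECODERS)]
average = Fraction(sum(cases), DECODERS)
efficiency = average / DECODERS

print(cases)              # [0, 1, 2, 0]
print(float(average))     # 0.75
print(float(efficiency))  # 0.1875
```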

Thankfully, when looking at the bigger picture, the situation becomes much better. As previously stated, the frequency of conditional branches itself tops out at 8-16% in the first place; in other words, on average one taken branch occurs in every 8 to 16 other instructions, or every 3 to 4 instruction fetch cycles (see the bottom of page 2-6 in Intel's Optimization Reference Manual). Suppose a taken branch occurs after 3 blocks of non-branching decodes; the 80% decoding efficiency loss at the branching block would then result in less than 20% loss overall. This is why even without macro-fusion, Core 2's predecessor (Yonah) can already achieve an IPC higher than 2 for some programs with only three x86 decoders.

Now let's look at what happens to the conditional branch decode when macro-fusion is added. Again, each row lists the decoder slot occupied by the now-fused branch macroinstruction, the number of other instruction decodes, and the probability of that case:
  • Decoder slot 0, no other instruction decode, 0.25 probability
  • Decoder slot 1, 1 other instruction decode, 0.25 probability
  • Decoder slot 2, 2 other instruction decodes, 0.25 probability
  • Decoder slot 3, 3 other instruction decodes, 0.25 probability
The average number of other decodes becomes (1+2+3)*.25 = 1.5, or about 38% efficiency of the 4 decoders, doubling that of the case without macro-fusion. The overall decoding efficiency loss reduces from less than 20% to less than 10%. A 10% increase in decoding efficiency will certainly be appreciated by the rest of the core, lifting the roof of sustained IPC to 3 or maybe even higher for SPEC95-like programs (note that according to Intel's manual, Core 2's macroinstruction length pre-decoder is designed to sustain a 3.5 decode throughput in the worst case).
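The doubling can be verified with a few lines. This sketch uses the same accounting as the lists in this section (other instructions decoded in the cycle that holds the taken branch); the unfused case counts are copied from the earlier list, and the variable names are hypothetical.

```python
from fractions import Fraction

DECODERS = 4  # decoder slots 0..3

# Fused branch in slot k: slots 0..k-1 hold older instructions.
fused_cases = list(range(DECODERS))  # [0, 1, 2, 3]
# Unfused TEST/CMP + Jcc pair at [0,1], [1,2], [2,3], [3,0].
unfused_cases = [0, 1, 2, 0]

fused_avg = Fraction(sum(fused_cases), DECODERS)
unfused_avg = Fraction(sum(unfused_cases), DECODERS)

print(float(fused_avg))             # 1.5 other decodes on average
print(float(fused_avg / DECODERS))  # 0.375, i.e. about 38% efficiency
print(fused_avg / unfused_avg)      # 2, fusion doubles the unfused figure
```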

This concludes the 3-part Decoding x86: From P6 to Core 2 series. I hope what's written here satisfies your curiosity with regard to the inner workings of modern microarchitectures, as it certainly did mine over the course of my research/study on them. Please let me know if you have comments, suggestions, or even better, corrections, to the contents.

7 comments:

backup said...

Why are 8 core Opterons faster than 8 Barcelona?

abinstein said...

I don't think you made this comment in the right place, but supposing this is a question from a reader, I'll answer it right here anyway.

If you look at 8-core 2.66GHz Clovertown you'll find that -
Clovertown 8-core @2.66GHz
* SPECint_rate tops around 80
* SPECfp_rate tops below 60

On the contrary, from the pictures -
Barcelona 8-core @2.6GHz
* SPECint_rate higher than 100
* SPECfp_rate at 90

In other words, the overclocker.com has no clue at all with regard to what they have seen. They are bad technologists, amateur at best, and are not even good journalists, making FUDs without the slightest understanding.

Anonymous said...

After all of the hooplah AMD has made, shouldn't you expect 8 Barcelona cores to be at least amazing? Even if it is just 5-15% better, even a quad K8 should be trouble for Clovertown, but that isn't going to save them in the desktop arena.

Why is it ok for you to say "below" 60 and be credible yet when I say "above" I am a fanboy? I am just informing you of the latest numbers.

Why do you believe AMD's numbers, which have not been officially submitted, but I can't believe Intel's? I don't think Intel made the numbers out of thin air, and of course you can't have a score posted as soon as you obtain it.

According to this Barcelona doesn't even reach 104. Before you say anything, the top horizontal bar represents 104, not the imaginary line extending from the 104 itself (3D graph trickery). Your 2.7% is incorrect, and it is 2.308%, an irrelevant extrapolation. Would you prefer a 0.5GHz "efficient" CPU or one twice as inefficient yet 3x the clock (same thermal)?

Anonymous said...

It's fine for you to say "below" whatever, then I say "above" something and I'm desperate...

Take a look here.
The test date is "March", but using whatever parameters to find it in a search result, the publish date is April. Give it some time. Your latest response is much more sensible, to suggest Intel using hyper memory rather than poofing it up.

Barcelona 108, what? That is K8. AMD's own score's shows it not reaching (assuming you can read a 3D graph) 104/92 (keeping their lead in FP and giving it up before Harpertown with the X5365 that I already mentioned). Clock for clock extrapolations, I thought that this inane measurement has faded in favor of real measures such as raw performance and performance per watt.

I'm still waiting for you to comment on 8 Barcelona cores being slower than 8 K8 on a raw basis and what this would mean for the meat of AMD cpu sales. Maybe they are sandbagging or maybe Barcelona is just a tweaked quad K8. Things look even more dire with news that even Cray can't get enough to make any meaningful revenue this year.

abinstein said...

"Your latest response is much more sensible, to suggest Intel using hyper memory rather than poofing it up."

The fact remains that Intel has responded with an apples-to-oranges comparison; whereas Barcelona is designed to be a drop-in replacement for Opteron, the hyped-up Clovertown will need a memory or even chipset/motherboard upgrade to perform better.


"Barcelona 108, what? That is K8. AMD's own score's shows it not reaching (assuming you can read a 3D graph) 104/92"

The fact remains that even a SPECint_rate of 104 from Barcelona, with ~3% slower clock rate, is better than the unverified 99.9 from Clovertown.


"I'm still waiting for you to comment on 8 Barcelona cores being slower than 8 K8 on a raw basis and what this would mean for the meat of AMD cpu sales."

The only reason that you are "still waiting" is because you can't read. As I have already explained, the comparison that overclocker.com made and you happily believed is invalid, because it is trying to imply per-core performance by comparing a 4-socket platform to a 2-socket one, with unknown comparability of their prices, power consumptions, memory speeds, operating systems, chipset and platform complexity.

And we all know single-socket Clovertown performs terribly poorly compared to dual-socket Woodcrest. Dual-socket Clovertown (SPECint_rate,SPECfp_rate) runs even slower, 10% in integer and 30+% in floating-point, than quad-socket Opterons (SPECint_rate,SPECfp_rate) today.

To equip yourself with better knowledge of how performance scales with the number of cores, please read a few of my previous blog articles before commenting further. My tolerance toward pestering & clueless Intel fanboys is getting thinner.


"Things look even more dire with news that even Cray can't get enough to make any meaningful revenue this year."

You first try to imply Barcelona won't sell well due to its 1.9x scalability (compared to Clovertown's sub-1.7x), then you say AMD can't sell enough of them to satisfy one customer.

Such choice of arguments, with sole purpose of spreading FUDs that even conflict among each other, is indeed very Intel-ish...

Anonymous said...

So you are saying that nothing can be extrapolated from 8 K8 vs 8 K10. That is all I wonder. More performance hints being thrown around that you'll probably deny.

How did I contradict? Where do you see evidence of Barcelona scaling? The factor of scaling does not matter as much as end performance. So while i'm sure that Barcelona will prove a worthy upgrade for 2 and 4 socket (which AMD seems it can't even supply), much performance seems to be a result of more cores than anything else, surely an ugly situation when the brunt of volume is with 2 cores and mobile and desktop.

abinstein said...

"So you are saying that nothing can be extrapolated from 8 K8 vs 8 K10."

Wrong, I am saying that you can't extrapolate per-core performance from a quad-socket system to a dual-socket one. You can't extrapolate per-core performance from Woodcrest to Clovertown, except downward.


"Where do you see evidence of Barcelona scaling?"

Where do you not see? You must be blind. Dual-socket K10 performs almost as well as quad-socket K8. Can you say the same on Clovertown?


"The factor of scaling does not matter as much as end performance."

The scaling is on performance. What you intend to say is probably that throughput does not matter as much as delay. This is true only for enthusiast desktop, but totally wrong for servers.


"So while i'm sure that Barcelona will prove a worthy upgrade for 2 and 4 socket (which AMD seems it can't even supply)", much performance seems to be a result of more cores than anything else

What makes Barcelona not great for single-socket servers to reach almost 2x throughput? I mean Clovertown can't do that, and all those fools who bought into Intel's MCM quad-core a year earlier are stuck with a system of only 80% or less performance.

And again, you are trying to imply per-core performance by comparing 2-socket and 4-socket systems, which is completely wrong. Phenom X2 will have better performance than Athlon X2 just by the addition of 2MB L3. There are other improvements.


"surely an ugly situation when the brunt of volume is with 2 cores and mobile and desktop."

My recommendation to all desktop buyers is to buy a low/mid-level AM2 system (e.g., Athlon64 X2 4200+/4800+) today, and be ready to upgrade to quad-core a year or two later. It's great value that no Intel system can match. Note that A64 X2 4800+ beats Core 2 Duo E6400 in most tests, but costs only two-thirds the price of the latter. You can even go as high as Athlon64 X2 5200+ for the same price as the E6400.

In other words, the only Intel systems worth buying at this moment are Core 2 Duo E6600 or above, which cost more than $220 just for the processor and which most users don't need anyway. When you factor in quad-core upgradability, however, not a single Core 2 Duo is worth considering.
