A Journey in Modern Computer Architectures: 06/2007

Semiconductor chip yield is one of the most guarded secret in the industry. There is no way an outsider could have "guess-timated" an accurate value, except by chance (which is also extremely low). Yet sometimes observations can be made from some big, obvious facts. In this article we will make some (strictly) back-of-envelope calculations of AMD and Intel's yields on dual-core processors, and make implications on native quad-core "manufacturability" from both companies.

The observation and assumptions

For the purpose of discussion, we will make the following assumptions -

In mid-late 2006, Intel occupies ~80% market share with three 300mm 65nm fabs, while AMD occupies ~20% with only 90nm FAB30 (200mm) and FAB36 (300mm).
AMD's FAB30 has the same wafer throughput as Intel's 65nm fabs. FAB36, while under 90-to-65nm transition and having low utilization, further increases production volume by 50%.
Intel's main production in 3Q 2006 is Core 2 Duo and dual-core Netburst, with Core 2 Quad volume small enough to be negligible to our discussion (e.g., 10% or less).
The most significant factor except those described above are the yield of the fabs, and AMD's FSB36 has about the same dual-core K8 yield as its 90nm counterparts.

The assumptions above may be utterly wrong, or they may be good enough for the "back-of-envelope" purpose. My point here is not to commend their validity; rather it is to make clear that the arguments below will hold true only if these assumptions do. Note that Intel's last 65nm fab (Ireland) of the three started production output by July 2006, while AMD's FAB36 started 65nm output in the later part of 4Q that year. Thus at least for 3Q 2006 the above market-share and relative technology differences are known to be true.

The calculations

Potentially, three 300mm 65nm fabs would have 3*2*2 = 12x capacity of one 200mm 90nm fab, if the yields of all fabs are the same. Thus, counting into AMD's FAB36, Intel would've had 12x/1.5 = 8x capacity of AMD with the same (dual-core processor) yield. However, Intel's market share during the period is only 4x that of AMD's. There is thus a 2x discrepancy between Intel's potential capacity (8x of AMD's) and its true capacity (4x of AMD's), which is presumably affected by a lower yield of its fabs. In other words, to reach the expected market share, AMD's FAB30 and FAB36 would have yields twice as good as Intel's 65nm 300mm fabs.

Apparently, this conclusion is not possible. A factor of two in terms of yield is too large, and Intel simply can't be that bad in manufacturing. A few factors may have affected the estimation accuracy here:

Intel's 65nm fabs may have lower wafer throughput or utilization than AMD's FAB30 and FAB36 combined, particularly the Ireland fab which was ramping for just 4 months, and D1D which is also used for 45nm research & development.
Intel may be making much more Core 2 Quad, which effectively cuts production volume in half (two Core 2 Duo dies make one Core 2 Quad).

Taking into consideration of the two factors above, we'll adjust the estimated yield difference from 2x down (quite arbitrarily) to 1.5x. Note that a high percentage of Netburst-related products from Intel actually makes the discrepancy larger, since Netburst chips are smaller in size per die, much matured, and should have better yield than the cutting-edge Core 2 Duo.

The Implication

So how does this 1.5x yield difference affect "native" quad-core manufacturing? Suppose AMD's dual-core K8 yield is 81%; Intel's Core 2 Duo yield would be just 54% (1/1.5x). By 1st-order estimate, AMD's native quad-core would have a yield of roughly 65% (0.81*0.81), whereas Intel's would have 29% (0.54*0.54). In other words, out of 100 quad-core dies, AMD is able to make 65 functional quad-core processors, while Intel only 29, less than 50% of its smaller competitor. It is not difficult to see why AMD is going native but Intel won't until late 2008.

Lets for the purpose of discussion turn the parameters further in Intel's favor, and assume it has just 1.25x lower yield (instead of 1.5x) from AMD's. If we again suppose dual-core K8 has yield 81%, then Core 2 Duo would have almost 65%, making Intel's MCM quad-core approach as productive as AMD's native quad-core approach. What we see here is that a yield just a quarter better than the competitor could've made a huge difference in terms of native quad-core manufacturability. In fact, Not only is Intel late to native quad-cores, it was also late to native dual-cores for about 6 months even with a better technology (65nm vs. 90nm).

The conclusion is clear, that Intel is telling the truth that it can't make native quad-core cost-effectively. For AMD, it might be very hard, but probably still doable, based on a simple capacity observation and this back-of-envelope calculation.

The arguments

Some people have argued the precision of the above estimates. Their arguments can basically be divided into the following points:

Intel's D1D is also making 45nm transition in late 2006, thus should have less than maximum output.
Intel's Ireland fab, ramping only 4 months from Jun'06, won't achieve max capacity in Oct'06.
Intel's shipping more dual-core processors in 4Q06 than AMD. Specifically, just over 50% of Intel processors are dual-cores, while only 30% of AMD's are.
AMD's FAB36, making 300mm wafers and started revenue shipping in Apr'06, should've been making as much silicon as FAB30.
By late 3Q06, AMD would also have Chartered's output at hand.
Intel's 65nm doesn't actually result in 2x capacity of 90nm, more like 1.7x (1/0.6). As well, Intel's 300mm wafer would result in ~2.25x usable silicon area of 200mm ones.

It's important to note that all these are considered higher-order factors. A slight difference in terms of max wafer throughput per fab (ranging anywhere from 20k to 60k) could've dwarfed any of above. Still, for the sake of discussion, lets still try some more precise estimates from these points.

The first point, it turns out, is wrong. As D1D's making 45nm outputs, its 65nm capacity is moved to the neighboring D1C, which is outputting 65nm chips right after Ireland and purposely/completely ignored by me above. The second point would be valid and reduce 65nm Ireland fab's capacity to some 30% of its max.

The third point above is also true; however it fails to recognize that most of Intel's single-core processors (Celerons and Pentium M's) are made at its 90nm fabs, whereas all of AMD's single-core and dual-core processors are made out of FAB30 & 36 (excluding Chartered), a factor the 17% difference in dual-core ratio isn't even able to compensate.

The fifth and sixth points are minor. Chartered's flex capacity would account for up to 20% of AMD's silicon output, and even less in Oct'06, not 3 months after its first revenue shipping for AMD. Assume Chartered is supplying 15% of AMD's silicon output, it'll effectively make output from AMD's own fabs 85% of the total, or changing the actual Intel-to-AMD output ratio from 4x to 4.7x.

The forth point, which we'll discuss last here, seems quite valid from page 5 of this AMD Jun'06 analyst day presentation (see picture above). At late 3Q06, the 300mm "wafer outs" from FAB36 seems to be 0.4x of the max 200mm from FAB30, equivalent to 0.4*2.25 = 0.9x FAB30's silicon area. Surely this is a great increase of AMD's potential capacity. Unfortunately, it turns out such argument is unfounded and mislead by a graph without y-axis unit and meant to be illustrative only.

If we read the text on page 4 of that presentation (again see picture above), FAB36 is expected to output 25k wafers per month (wpm) by Q4 2007, which will be the total 300mm wafer output at that point (FAB38 won't have wafer outs until Q1 2008). We also know FAB30 is outputting about 30k wpm in Q3 2006. Now go to page 5 again and look! How can green line's 25k wpm (end of 4Q07) be some 60% higher than red line's 30k wpm in 2006? It is absolutely not possible unless the "wafer outs" y-axis actually means wafer area outs, and the 25k 300mm wpm from FAB36 is effectively doubled to 50k, some 66% higher than the 30k 200mm wpm from FAB30.

It turns out my original estimate of FAB36 reaching 50% capacity of FAB30 is actually a bit optimistic. The true number should be calculated as such: (0.8/3.4 * 25000)*2.25/30000 = 0.44, where 0.8 comes from green line at end of 3Q06, 3.4 from end of 4Q07 (both FAB36 only), 25000 is expected green line wpm at end of 4Q07, 2.25 translates 300mm wpm to effective 200mm wpm, and the final 30000 is FAB30 wpm (red line at end of 3Q06).

Overall, a definitely more precise/probably more accurate estimate is the following:

Intel's potential capacity: (2+0.3)*(1/0.6)*2.25 = 8.6
AMD's potential capacity: 1+0.44 = 1.44
Potential capacity ratio: 8.6/1.44 = 6.0x

Intel's actual output: less than 80% market share (excluding 90nm production)
AMD's actual output: 20%*0.85 = 17% (Chartered effects)
Actual output ratio: 80%/17% = 4.7x

Discrepancy between potential and actual output: 6.1/4.7 = 1.28, or almost 28% difference in microprocessor yield, well between the 50% and 25% estimates I made above.

This is the Part 3 of a 3 part series. To fully appreciate what's written here, the Part 1 and Part 2 articles (or comparable understandings) are prerequisites.

The New Core Improvements

Intel's "brand" new Core 2 Duo has many improvements over Pentium M. With respect to the x86 decode stage, they include -

Improved micro-fusion
4-wide decode
Macro-fusion

All of these have been numerously described and repeated by many on-line review sites. Here again we will look at them in more technical and analytical detail.

The improved micro-fusion is the least complicated, so we will just briefly describe it here. It is composed of using a bigger XLAT PLA (see the partial decoder diagram in Part 2) that can handle more load-modify or addressed store instructions, including many SSE2/SSE3 ones. This improves Core 2's SSE performance over its predecessors, which must re-steer many SSE instructions to the first (full) decoder to be processed. In fact, Core Solo/Core Duo (Yonah) already has improved micro-fusion over Pentium M, but on a smaller degree of instructions than Core 2 Duo.

On non-SSE codes, however, the performance boost is limited.

A 4-wide decode & issue width

The biggest marketing hype of Core 2 is certainly its ability to decode and issue 4 x86 instructions per cycle, thus achieving an IPC of 4 Instructions Per Cycle (or 5 with macro-fusion)! It turns out this is the biggest misconception around Core 2. As discussion in Myth #3 of Part 1 article, a (sustained) rate of three x86 decodes per cycle is not the performance bottleneck yet. In fact, Intel's Optimization Reference Manual says in itself that

[Decoding 3.5 instructions per cycle] is higher than the performance seen in most applications.
- 2.1.2.2 Instruction Fetch Unit (Instruction PreDecode)

Note that this is stated under the conditions where branches, assumed once every 7 instructions, are predicted 100% correct, which is almost never the case and the sustained IPC is usually further reduced.

Contrary to marketing slogan and common (mis-)belief, the main purpose of a 4-wide decode & issue (also macro-fusion discussed below) is really to combat the many undesirable design artifacts of P6's x86 decode engine. As seen in the end of Part 1 article, these design artifacts reduce efficiency of the 4-1-1 decoders, which under real circumstances can hardly sustain three x86 decodes per cycle. Specifically -

Flushing decoding pipeline every 16 bytes, or about 4 to 5 x86 instructions in average.
Flushing decoding pipeline at each complex (> 2 fused micro-op) instruction.
Reducing instruction fetch for taken branches, especially to unaligned target address.

An additional partial decoder

For 1. and 2. in the above list, an additional partial decoder can help simply by raising the upper bound of the averaging range. For the purpose of discussion, suppose a 16-byte window contains four x86 instructions, and there is only one complex instruction among two such windows:

A set of 4-1-1 decoders will spend 4 to 5 cycles to decode the two 16-byte instruction windows, where two cycles are spent on the window with only simple instructions, and another two or three are spent on the one with a complex instruction (depending on where the complex instruction occurs).
A set of 4-1-1-1 decoders will spend only 3 to 4 cycles to decode the same two windows.

By lifting the roof of the best-case capability, a wider x86 decode engine can increase the average decode throughput. Note that even under the ideal condition where branch-related stalls do not occur, the sustained decodes per cycle is still less than 2.7 (8 instructions in 3+ cycles), far from the value 4 or 5 as advertised by Intel.

The Instruction Queue

The extra partial decoder, however, does not help the 3rd point in the previous list when a branch is taken, especially to an unaligned target address. Note that branch frequency is about 10-15% in normal programs (see also macro-fusion below). While many branch targets can be forced to be 16-byte aligned, it is usually not possible for small in-line loops to do so. If the entry point of the loop has address x MOD 16, then during the first cycle executing the loop, only 16 minus x fetched bytes contain effective instructions. This number does not increase no matter how many additional decoders you add to the decoding engine.

The real "weapon" the Core 2 Duo has against this branch-related inefficiency is not the 4-wide decoder, but a pre-decoded instruction queue of up to 18-deep x86 instructions. Refer to Part 1 article's first diagram on P6's Instruction Fetch Unit. There is a 16-byte wide, instruction boundary aligned Instruction Buffer sitting in-between the IFU and the decoders. Replacing this buffer with an 18 instruction-deep queue (probably 24 to 36 bytes in size) that can detect loops among the containing instructions, we get Core 2 Duo's biggest advantage with respect to x86 decode: ability to sustain continuous decode stream on short loops.

This continuous stream of x86 instructions allows Core 2 Duo's four decoders to be better utilized. The 18-instruction queue are aligned at instruction boundaries, and thus are immune to branch target (16-byte) misalignment problem. Although the 18-deep queue length easily becomes insufficient if loop unrolling, a compile-time optimization technique, is used, it is okay because unrolling a loop has the exact same effect as supplying a continuous instruction stream. More-over, the instruction queue also serves as a place where macro-fusion opportunities can be identified, as will be discussed next.

Without extensive simulation or real traces, we really can't be sure how much boost is received by Core 2 Duo from the 4-wide decode and the instruction queue. We have to make a guess; by using one extra partial decoder, the average sustained x86 decode throughput is probably increased from around 2.1 to about 2.5 macroinstructions (x86) per cycle. With the help of the instruction queue to supply uninterrupted macroinstructions in small loops, the sustained decode throughput is probably increased further to 2.7 or even close to 3.

Macro-fusion, the Myth and Truth

Debunking the Myth

Intel markets macro-fusion as the ability to increase x86 decode throughput from 4 to 5. As we have seen in the section above, the decode throughput without macro-fusion is much less than 4 and only close to 3. It turns out that macro-fusion has even less impact on improving the throughput, as is discussed here.

So what really is macro-fusion? In Intel's P6 terminology, "macro" or "macroinstruction" is used to describe an instruction in the original ISA (Instruction Set Architecture, here the x86). Thus macro-fusion is actually the exact same idea as micro-fusion, where two (or more) depending instructions with a single fan-out are collapsed into one instruction format (see the Part 2 article). The difference is on their application domain; where micro-fusion works on internal micro-ops, macro-fusion works on (x86) macrointructions. In fact, Intel's macro-fusion patent, System and Method for Fusing Instructions, filed in Dec.2000, predates its micro-fusion patent, Fusion of Processor Micro-Operations, filed in Aug.2002. It is probably due to two following reasons that the former is implemented later:

Complexity (or difficulty)
Limited usefulness

Why is it difficult, and what does it do?

First, we know that x86 instructions are complex and variable-length. Some x86 instructions take 6 clock cycles to only determine its length (page 2-7, Instruction PreDecode, of Intel's Optimization Reference Manual). The complexity of collapsing variable-length macroinstructions in when most cycle time is spent on decoding lengths (among other things) is undoubtedly much higher than that of fusing fixed-width micro-ops. Second, it will be even more difficult, if not impossible, to determine dependencies in real time, and fuse the depending macroinstructions together.

So instead of trying to fused all possible macroinstruction pairs, Core 2 Duo fuses only the selected macroinstructions -

The first macroinstruction must be a TEST X, Y or a CMP X, Y where only one operand of X and Y is an immediate or a memory word.
The second macroinstruction must be a conditional jump that checks the carry flag (CF) or zero flag (ZF).
The macroinstructions are not working in 64-bit mode.

These test/compare and jump are often used in integer programs composed of iterative algorithms. According to a 2007 SPEC Benchmark Workshop paper, "Characterization of Performance of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor," the frequency of macro-fused operations in SPEC2006 CPU ranges from 0-16% in integer codes and just 0-8% in floating-point codes. In other words, in the best case, macro-fusion would reduce the number of macroinstructions from 100% to 92% for integer and just 96% for floating-point execution, hardly the whopping 20-25% reduction as described by Intel's marketing department (and the numerous on-line repeaters).

Bringing the Truth

Look at it closer, we realize that the purpose of macro-fusion is really not much to reduce the number of x86 instructions to be decoded, but again to reduce decode interruptions/stalls due to predicted-taken branches. Again for the purpose of discussion lets number the four x86 decoders as 0, 1, 2, and 3. A two-macroinstruction sequence can be steered to either of the following four positions: [0,1], [1,2], [2,3], [3,0]. If the conditional jump is predicted taken, then no instruction after it will be steered for decoding, and in two of the four cases (i.e., [0,1] and [3,0]) the four decoders will decode no other maroinstruction at all in the cycle. More specifically,

Decoder slot [0,1], no other instruction decode, 0.25 probability
Decoder slot [1,2], 1 other instruction decode, 0.25 probability
Decoder slot [2,3], 2 other instruction decode, 0.25 probability
Decoder slot [3,1], no other instruction decode, 0.25 probability

The average number of other decodes is thus (1+2)*.25 = 0.75, or about 19% efficiency when the 4 decoders work on a block of macroinstructions containing conditional branches. Note that this is assuming all ideal cases otherwise, including perfect branch prediction, all simple instructions, and no 16-byte instruction misalignment. In reality, the separate test-and-jump macroinstructions under realistic environment will probably reduce decode efficiency even more.

Thankfully, when looking at a bigger picture, the situation becomes much better. As previously stated, the frequency of conditional branch itself tops at 8-16% in the first place; in other words, in average one taken branch occurs in every 8 to 16 other instructions, or every 3 to 4 instruction fetch cycles (see the bottom of page 2-6 in Intel's Optimization Reference Manual). Suppose a taken branch occurs after 3 blocks of non-branching decodes, the 80% decoding efficiency loss at the branching block would result in less than 20% loss overall. This is why even without macro-fusion, Core 2's predecessor (Yonah) can already achieve IPC higher than 2 for some programs with only three x86 decoders.

Now lets look at what happens to the conditional branch decode when macro-fusion is added. Again, the first column is the decoder number occupied by the now fused branch macroinstruction; the second column is number of other instruction decodes; the last column is occurrence probability of the row:

Decoder slot 0, no other instruction decode, 0.25 probability
Decoder slot 1, 1 other instruction decode, 0.25 probability
Decoder slot 2, 2 other instruction decode, 0.25 probability
Decoder slot 3, 3 other instruction decode, 0.25 probability

The average number of other decodes becomes (1+2+3)*.25 = 1.5, or about 38% efficiency of the 4 decoders, doubling that of the case without macro-fusion. The overall decoding efficiency loss reduces from less than 20% to less than 10%. A 10% increase in decoding efficiency will certainly be appreciated by the rest of the core, lifting the roof of sustained IPC to 3 or maybe even higher for SPEC95 like programs (note that according to Intel's manual, Core 2's macroinstruction length pre-decoder is designed to sustain a 3.5 decode throughput in the worst case).

This concludes the 3-part Decoding x86: From P6 to Core 2 series. I hope what's written here satisfy your curiosity with regard to the inner workings of modern microarchitectures, as they certainly do me over the course of my research/study on them. Please let me know if you have comments, suggestions, or even better, corrections, to the contents.

A Journey in Modern Computer Architectures

Sunday, June 10, 2007

Back-of-envelope calculation of native quad-core production

Friday, June 01, 2007

Decoding x86: From P6 to Core 2 - Part 3

About Me

Blog Archive

Labels