A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Saturday, September 22, 2007

AMD's latest x86 extension: SSE5 - Part 2

Series Index -

In this part we will compare Intel's SSSE3 and SSE4.x with AMD's SSE5. More specifically, we will look at how one can (or cannot) use SSE5 to accomplish the same tasks performed by SSSE3 and SSE4.x. The central question we're trying to answer here is whether AMD's SSE5 is strictly an extension to Intel's SSE4, or in some sense a replacement for SSSE3 and SSE4.x (which none of AMD's current processors - including Barcelona and Phenom - supports).

Syntactical Similarity

The original 8086/8087 had one-byte opcodes (if we ignore the ModRM bits used by the 8087 and a handful of other instructions such as bit rotations). The one remaining opcode byte that was usefully unused turned out to be 0Fh; had it been used, it would have meant POP CS, an instruction left out because it would create some interesting program-flow-control problems. Using 0Fh as an escape byte followed by a second byte, a number of two-byte opcode instructions were added by the 80286, 80386, 80486, Pentium, MMX, 3DNow!, and SSE/2/3/4a.

After the addition of SSE4a from AMD, the only free two-byte opcodes left are the following: 0F0{4,A,C}h, 0F2{4-7}h, 0F3{6-F}h, 0F7{A,B}h, and 0FA{6,7}h. Why is this important? Because these points in the two-byte opcode space are the only entries where the x86 ISA can be further extended. Obviously, the two dozen or so entries are not enough for any large-scale extension.
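As a sanity check, the free entries listed above can be tallied in a few lines of Python (byte values transcribed directly from the list):

```python
# Tally of the free two-byte opcode entries (0Fh escape byte followed
# by the second byte shown), as enumerated in the text above.
free_second_bytes = (
    [0x04, 0x0A, 0x0C] +        # 0F0{4,A,C}h
    list(range(0x24, 0x28)) +   # 0F2{4-7}h
    list(range(0x36, 0x40)) +   # 0F3{6-F}h
    [0x7A, 0x7B] +              # 0F7{A,B}h
    [0xA6, 0xA7]                # 0FA{6,7}h
)
print(len(free_second_bytes))  # 21 entries - the "two dozen or so"
```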

In order to further extend the instruction set in a significant way, the opcode itself must be extended from two-byte to three-byte. This is where SSSE3/SSE4.x and SSE5 bear the most similarity: they all consist (mainly) of instructions with three opcode bytes. Intel carved out 0F38xxh and 0F3Axxh for SSSE3 and SSE4.x, whereas AMD took 0F24xxh, 0F25xxh, 0F7Axxh and 0F7Bxxh for SSE5.

Syntactical Differences

However, the syntactical similarity between Intel's and AMD's extensions pretty much ends right here. As we saw in Part 1 of this series, SSE5 instruction encoding is regular and orthogonal: the 3rd opcode byte (Opcode3) always has 5 bits for opcode extension, 1 bit for operand ordering, and 2 bits for operand size.
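For illustration, extracting those three fields from an Opcode3 byte might look like the following sketch. The exact bit positions (opcode in the high 5 bits, ordering in bit 2, size in the low 2 bits) are assumed here for the example; AMD's SSE5 documentation is the authoritative reference for the actual layout.

```python
# Hypothetical Opcode3 field split: 5-bit opcode extension, 1-bit
# operand ordering, 2-bit operand size (bit positions assumed).
def decode_opcode3(byte3: int):
    opcode = (byte3 >> 3) & 0x1F    # 5-bit opcode extension
    ordering = (byte3 >> 2) & 0x1   # 1-bit operand ordering
    size = byte3 & 0x3              # 2-bit operand size
    return opcode, ordering, size

print(decode_opcode3(0xAE))  # (21, 1, 2)
```

The point is not the particular bit positions, but that one shift-and-mask per field suffices - no lookup table keyed on the whole byte, as an irregular encoding would require.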

On the other hand, the encoding of SSSE3 and SSE4.x instructions may well appear arbitrary to anyone outside Intel. For example, look at the following SSSE3 instructions:

PSIGNB - 0F380h 1000b ... PABSB - 0F381h 1100b
PSIGNW - 0F380h 1001b ... PABSW - 0F381h 1101b
PSIGND - 0F380h 1010b ... PABSD - 0F381h 1110b

It may seem from the above that the right-most bits encode the operand size - 00b for byte, 01b for word, and 10b for dword. However, take another look at the following SSSE3 instructions:

PSHUFB - 0F380h 0000b ... PMADDUBSW - 0F380h 0100b
PHADDW - 0F380h 0001b ...... PHSUBW - 0F380h 0101b
PHADDD - 0F380h 0010b ...... PHSUBD - 0F380h 0110b

For some (probably legitimate) reason, Intel's designers decided not to include horizontal byte additions and subtractions; instead they (most "exceptionally") squeezed in a byte-shuffle instruction and a specialized multiply-add instruction. We see that 30 years on, people at Intel still design instructions exactly the way they did 30 years ago: in a way that doesn't make sense.

Even worse cases are seen in SSE4.x. The following example shows the encodings used for packed MAX and packed MIN instructions:

PMAXSB - 0F383h 1100b ... PMINSB - 0F383h 1000b
PMAXSD - 0F383h 1101b ... PMINSD - 0F383h 1001b
PMAXUW - 0F383h 1110b ... PMINUW - 0F383h 1010b
PMAXUD - 0F383h 1111b ... PMINUD - 0F383h 1011b

Note how the different operand types and operand sizes are squeezed cozily into consecutive opcode byte values without much sense. For some mystical reason, the unsigned word operations are placed quite arbitrarily right next to the signed dword operations. But wait... what happened to P{MAX|MIN}SW and P{MAX|MIN}UB? Well, they already exist as SSE2 instructions with opcodes 0FE{E|A}h and 0FD{E|A}h, respectively. As this example shows, the irregularity of SSE4.x is also inherited from the poor design of SSE2.

From a software programmer's point of view, the irregularity really doesn't matter as long as the compiler can generate these opcodes automatically. But such irregularity is no joy for circuit designers to implement. This is probably why Intel, presumably not incompetent, chose such a poor style for SSEx: to make it as difficult as possible for anyone else (most prominently AMD) to offer compatible decoding. In the end, not only Intel's competitors but also its customers suffer from the bad choices: had Intel designed the original SSE/SSE2 the same way AMD designed SSE5, we would have a much more complete & efficient set of x86 SIMD instructions that makes sense! (Now, does Intel promote open & fair competition that benefits consumers? Or does it aim at nothing but screwing up its competitors, sometimes together with its customers?)

At any rate, as we've seen above, the encoding of SSE5 is different from that of SSSE3/SSE4.x, and thus the former does not exclude the latter. In other words, it is possible for a processor to offer both SSE5 and SSSE3/SSE4.x (much like 3DNow! and MMX). What about their functionality, then? Below we'll look at each SSSE3 and SSE4.x instruction and see how its functionality can or cannot be accomplished with SSE5.

Functional Comparison to SSSE3

For SSSE3 instructions:
  • PHADDx and PHSUBx
    • Horizontally add/subtract word & dword in both source and destination sub-operands and pack them into destination.
    • Each PHADDx/PHSUBx in SSE5 operates on only one 128-bit packed source.
  • PMADDx
    • Multiply destination and source sub-operands, horizontally add the results, and store them back to destination.
    • PMADx in SSE5 offers more powerful multiply-add intrinsics.
    • No byte-to-word multiply-add in SSE5, though.
  • PSHUFB
    • Shuffle bytes in destination according to source.
    • Special & weaker case of the first half of PPERM in SSE5.
  • PALIGNR
    • Shift concatenated destination & source bytes back into destination.
    • Special & weaker case of the first half of PPERM in SSE5.
  • PSIGNx
    • Retain, negate, or zero each sub-operand in destination according to whether the corresponding sub-operand in source is positive, negative, or zero, respectively.
    • No direct implementation in SSE5.
  • PABSx
    • Store the unsigned absolute values of source sub-operands into destination sub-operands.
    • No direct implementation in SSE5.
  • PMULHRSW
    • Multiply 16-bit sub-operands of destination and source and store the rounded high-order 16-bit results back to destination.
    • No direct implementation in SSE5.

It can be seen that most SSSE3 instructions are not directly implemented in SSE5, with the possible exceptions of PSHUFB, PALIGNR, and PHADDx/PHSUBx. However, these SSSE3 instructions can still be useful as lower-latency, lower-instruction-count shortcuts to the more generic & powerful SSE5 counterparts. From this point of view, future AMD processors will probably still benefit from implementing SSSE3 together with SSE5.

Functional Comparison to SSE4.x

For SSE4.1 instructions:
  • PMULLD
    • Multiply 32-bit sub-operands of destination and source and store the low-order 32-bit results back to destination.
    • Can be done by two PMULDQ (SSE2) followed by a PPERM.
  • DPPS and DPPD
    • Horizontally dot-product single/double precision floating-point sub-operands in destination and source and selectively store results to destination sub-operand fields.
    • The FMADx instructions in SSE5 offer more powerful & flexible floating-point dot-product intrinsics.
  • MOVNTDQA
    • Non-temporal dword load from WC memory type into an internal buffer of the processor, without storing to the cache hierarchy.
    • Specific to Intel processor implementation.
    • PREFETCHNTA in Opteron & later works for the same purpose.
  • BLENDx and PBLENDx
    • Conditionally copy sub-operands from source into destination.
    • Special and weaker cases of PERMPx and PPERM in SSE5.
  • PMAXx and PMINx
    • Packed max and min operations of destination and source.
    • Can be accomplished by a PCOMx followed by a PPERM in SSE5.
  • PEXTRx and EXTRACTPS
    • Extract sub-operands from an XMM register (source) to memory or a general-purpose register (destination).
    • Special and weaker case of PERMPx for memory destinations.
    • No direct implementation for GPR destinations in SSE5.
  • INSERTPS
    • Optionally copy sub-operands from source to destination.
    • Special and weaker case of PERMPx in SSE5.
  • PMOVx
    • Sign- or zero-extend source sub-operands to destination.
    • Special and weaker case of PPERM with a proper mux/logical argument.
  • PCMPEQQ
    • Packed compare-equal between destination and source and store results back to destination.
    • Special and weaker case of PCOMQ in SSE5.
  • MPSADBW
    • Compute "sum of absolute byte-difference" between one 4-byte group in source and eight 4-byte groups in destination and store the eight results back to destination.
    • No direct implementation in SSE5.
  • PHMINPOSUW
    • Find the minimum word horizontally in source and put its value in DEST[15:0] and its index in DEST[18:16].
    • No direct implementation in SSE5.
  • PACKUSDW
    • Convert signed dword to unsigned word with saturation.
    • No direct implementation in SSE5.
  • PTEST and ROUNDx
    • Logical zero test, packed precision rounding.
    • Copied directly to SSE5.

For SSE4.2 instructions:
  • PCMPGTQ
    • Packed compare for greater-than.
    • Special & weaker case of PCOMQ in SSE5.
  • String match, CRC32
    • No direct implementation in SSE5.
  • POPCNT
    • Copied directly from AMD's POPCNT.

Several pieces of evidence above suggest that it's probably not very likely for a future AMD processor to implement SSE4.1 & SSE4.2 in addition to SSE5. First, some of the instructions are copied directly from SSE4.1 to SSE5 (PTEST and ROUNDx); had AMD wanted to implement SSE4.1 before SSE5, it would have been unnecessary to copy these instructions. Second, those SSE4.x instructions that do not have superior SSE5 counterparts are either extremely specialized (MPSADBW, PHMINPOSUW, string match & CRC32) or can be accomplished more flexibly by two or fewer SSE5 instructions.

We can also see how Intel's designers worked very hard to squeeze functionality into the poor syntax of SSE4.x, resulting in a poor extension design. One example is the BLENDx/PBLENDx instructions. Instead of using a proper SSE5-like 3-way syntax, the variable selector in SSE4.1 is set implicitly to XMM0, not only requiring additional register shuffling but also limiting the blend patterns available at any moment to just one.

Another example is the DPPS/DPPD instructions, where the dot product is performed partially vertically and partially horizontally. To make these instructions useful, the two source vectors must be arranged into alternating positions: (A0, B0), (A1, B1), (A2, B2), ... Not only can such arrangement be costly by itself, but after the operation one of the arranged source vectors is destroyed (replaced by the dot-product result).

Concluding Part 2.

Comparing SSE5 with SSSE3/SSE4.x, it seems that after years of being dragged along by Intel's poor extension designs, AMD has finally decided to take its own next step in a better way. As discussed above, it's probably more advantageous for AMD to implement SSSE3 together with SSE5, and less so to implement SSE4.1 & SSE4.2.

However, as we know, commercial software in general and benchmarks in particular, especially in the desktop enthusiast market, are heavily influenced by the bigger company. Thus, if it turns out that SSE4.x is used excessively to benchmark processor performance, it is still possible that AMD will implement it in its future processors. But let's hope, for all customers' sake, that this does not happen, and that future x86 extensions follow more of AMD's SSE5 than Intel's SSE4.x.

Friday, September 21, 2007

AMD's latest x86 extension: SSE5 - Part 1

Series Index -

The SSE5 announcement made by AMD earlier this month is something big. In fact, in terms of instruction scope and architectural design, it is bigger than SSE3, SSSE3, and SSE4 combined. If we think of AMD64 as completely revamping x86-based general-purpose computing (as generally conceived by the industry), then we can also think of SSE5 as completely revamping x86-based SIMD acceleration. In my opinion, the leaps made by AMD in both AMD64 and SSE5 firmly establish the company as the leader in x86 computing architectures, leaving Intel gasping far behind.

The SSE5 Superiority

There are a few things that make SSE5 a "superior" kind of SIMD (Single-Instruction Multiple-Data) extension, different from all the previous SSE{1-4}:
  • SSE5 is a generic SIMD extension that aims to accelerate not just multimedia but also HPC and security applications.
    • In contrast, previous SSEx, especially SSE3 and later, were designed specifically with media processing in mind.
    • The CRC and string match instructions of SSE4.2 are too specialized to be generally useful.
  • SSE5 instructions can operate on up to three distinct memory/register operands.
    • It allows true 3-operand operations, where the destination operand is different from any of the two source operands.
    • It allows 3-way 4-operand operations, where the destination operand is the same as one of the three source operands.
  • SSE5 includes powerful and generic Vector Conditional Moves (both integer and floating-point).
    • Only four instructions (mnemonics) are added: PCMOV for generic bits, PPERM for integer bytes/(d,q)words, and PERMPS/PERMPD for single/double-precision floating-point values.
    • Powerful enough to move data from any part of the 128-bit source memory/register to any part of the 128-bit destination register, plus optional logical post-operations.
  • SSE5 includes both integer arithmetic & logic, and floating-point arithmetic & compare instructions.
    • For integer arithmetic, it includes both true vertical Multiply-Accumulate and flexible horizontal Adds/Subs.
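As a rough illustration of how far the vector conditional moves go, here is a minimal Python sketch of PCMOV's bit-granular select, modeling 128-bit registers as Python integers (the optional logical post-operations of PPERM/PERMPx are omitted):

```python
# PCMOV sketch: each result bit comes from src1 where the selector bit
# is 1, and from src2 where it is 0. Registers modeled as 128-bit ints.
MASK128 = (1 << 128) - 1

def pcmov(src1: int, src2: int, selector: int) -> int:
    return ((src1 & selector) | (src2 & ~selector)) & MASK128

# Select the high nibble from src1 and the low nibble from src2.
print(hex(pcmov(0xFF, 0x00, 0xF0)))  # 0xf0
```

Because the selector is itself a full-width register operand, the predicate can be computed at run time - which is exactly why the 3-operand syntax matters.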

An Analytical View of SSE5 Instruction Format

All of the above shows one thing: SSE5 is a well-planned, thoroughly articulated, and carefully designed ISA extension. The amazing thing is that the designers at AMD accomplished all this by simply adding a single DREX byte between the SIB and Displacement bytes, as shown in the figure below (taken from page 2 of AMD's SSE5 documentation):
A question naturally arises: will the additional DREX byte further increase instruction lengths? Fortunately, not by a single bit. According to the official document linked above, SSE5 instructions that use the DREX byte can not only take 3 distinct operands but also access all 16 XMM registers without the AMD64 REX prefix; in fact, the use of the DREX byte in an SSE5 instruction excludes the use of the REX prefix. SSE5 instruction lengths are just as long as needed and as short as they can be. (We will talk more about possible further extensions to AMD64 REX and SSE5 DREX in a later part.)

Another great merit of SSE5 instruction encoding is that it is simple and regular. Note the "Opcode3" byte in the picture above, the main byte that distinguishes among different SSE5 instructions: its encoding is astonishingly simple: 5 bits for opcode, 1 bit for operand ordering, and 2 bits for operand size. The result is an orthogonal instruction encoding - you only need to look at an opcode field by itself to know what it means. In contrast, the 3rd opcode bytes of Intel's SSSE3 and SSE4 instructions seem as if picked by a spoiled child purposely to screw up anyone else's implementation. (We will talk more about the comparison between AMD's SSE5 and Intel's SSSE3/SSE4 in a later part.)

Types of SSE5 Instructions

There are several major types of instructions in SSE5:
  1. Various integer and floating-point multiply-accumulate (MAC) instructions.
  2. Vector conditional move (CMOV) and permutation (PERM) instructions.
  3. Vector compare and predicate generation instructions.
  4. Packed integer horizontal add and subtract.
  5. Vectorized rounding, precision control, and 16-bit FP conversion.
A single PTEST instruction in Type 3 and four ROUNDx instructions in Type 5 above are copied directly from Intel's SSE4.1; together with the other Type 4 and Type 5 instructions, these are the SSE5 instructions that do not contain the DREX byte. All the other Type 1-3 SSE5 instructions utilize the DREX byte to specify a 3rd distinct (destination) operand and to offer access to registers XMM8-XMM15 (without & excluding the REX prefix).

In particular, the Type 1 (MAC) and Type 2 (CMOV/PERM) instructions are 3-way 4-operand operations, with the destination set to either source 1 or source 3. The fact that 3-way operation is allowed - even with the destination equal to one of the sources - is instrumental in enabling flexible MAC and CMOV/PERM instructions. In the case of MAC, two multiplicands and an accumulator must be specified; in the case of CMOV/PERM, two sources and a conditional predicate must be given. Without the ability to address 3 distinct operands, these two types of acceleration are either impossible or done awkwardly (more on Intel's SSE4.1 way of doing it in a later part of this series).

What makes these two types of instructions, MAC and CMOV/PERM, which require 3 distinct operands, so special? As previously said, the four conditional move & permutation instructions allow predicated transfer of data from any part of the source registers/memory to any part of the destination register, followed by one of seven optional post-operations. Just how many instructions are there in SSE/SSE2/SSE3 that perform similar, simpler tasks partially? Here is a quick list:
  • MOVQ
Of course this does not mean the four SSE5 instructions will replace all the MOVs in SSE/SSE2 above, which remain useful for their simplicity (only 2 operands required) and possibly lower latency (no post-operation needed). However, it does illustrate how powerful and useful the PERM instructions in SSE5 can be - just imagine how hard it is to implement these operations in an SSE2-like style.

The MAC instructions turn out to be among the "most wanted" instruction accelerations. As shown in "Design Issues in Division and Other Floating-Point Operations" by Oberman et al. in IEEE Transactions on Computers, 1997, nearly 50% of floating-point multiplication results are consumed by a dependent addition or subtraction. See the picture below, taken directly from the paper:
In other words, by fusing a multiplication with its dependent addition or subtraction, we can eliminate the separate add/subtract instruction that follows nearly half of all multiplications. Until SSE5, it was impossible in x86 to truly fuse a multiplication with a dependent add or subtract and take advantage of such acceleration.
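To see concretely what fusion buys, consider a scalar sketch of a dot product - exactly the pattern the Oberman data describes, where every multiply result is immediately consumed by an add. The function names are illustrative, not actual intrinsics:

```python
# Without MAC: each product costs a multiply plus a dependent add.
def dot_separate(a, b):
    total = 0.0
    for x, y in zip(a, b):
        p = x * y          # multiply instruction
        total = total + p  # dependent add instruction
    return total

# With MAC: one fused multiply-accumulate per element (the shape of
# an SSE5 FMADDx-style operation), halving the loop-body op count.
def dot_fused(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total = x * y + total  # single fused operation
    return total

print(dot_separate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```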

Concluding Part 1.

As shown above, SSE5 from AMD is indeed something very different from the previous x86 SIMD extensions from Intel. Some people even went so far as to call it "AMD64-2" and the "top development" of the year; such enthusiasm, of course, is undue.

As of now, AMD is still gathering community feedback and asking for community support on the SSE5 initiative. Apparently, SSE5 is still in development; it's a great proposal, but clearly not finalized (yet). Also, the SSE5 instructions by themselves do not match the breadth and depth of AMD64, which not only expanded the x86 address space but also semantically changed the workings of the ISA. SSE5, on the other hand, neither touches nor alters any bit of x86-64 outside its extension scope. However, as we will discuss in a later part, the direction pointed to by SSE5 can be used to further extend x86-64 in a more general and generic way, rivaling the original AMD64.

Monday, September 10, 2007

Scalability counts!

As I have said in this article, Intel's new Core 2 line of processors has good cores but a poor system architecture. The poor scalability of the FSB means that Core 2, without extensive, expensive, and power-hungry chipset support, is only suitable for low-end personal enjoyment.

Take a look at this AnandTech benchmark. I'd note foremost that AnandTech is hardly an AMD-favoring online "journal"; thus we can expect its report to be at worst Intel-biased and at best neutral (which I'm hoping for here). At any rate, the benchmark picture is reproduced below:

The comparison between Barcelona (Opteron 2350 2.0GHz) and Clovertown (Xeon E5345 2.33GHz) couldn't be clearer: FSB is an outdated system architecture for today's high-end computing, and scalability does matter for server & workstation grade performance. While AMD's quad-core Opteron at 2.0GHz is slower than Intel's quad-core Xeon at 2.33GHz in the single-socket test, the situation is reversed in a dual-socket setup, the one used by most workstations and entry-level servers.

The same phenomenon is also observed on this page, where AMD's quad-core Opteron, at a 17% lower clock rate, performs increasingly better than Intel's quad-core Xeon as the number of cores increases (picture reproduced below). Again, when it comes to server & workstation performance, scalability counts.

Friday, August 03, 2007

Not Everything about Memory is Bandwidth

The False Common Belief

There is a common belief among PC enthusiasts that bandwidth - million transfers per second or megabytes per second - is the most important thing a good memory system should aim for. This belief is so deep-rooted that even the professionals (i.e., AMD & Intel) have begun to calibrate & market their products based on memory bandwidth values.

For example, take a look at this Barcelona architecture July update article. The first graph on that page, which seems to be from an AMD presentation and is conveniently duplicated below, suggests that all the memory enhancements in AMD's Barcelona (K10) over its predecessor (K8) are about "Increasing Memory Bandwidth".

The question is, do they really increase memory bandwidth? Let's take a look at the bullet points in the graph, from bottom to top.
  • The prefetchers. Prefetching does not increase memory bandwidth. On the contrary, it reduces available memory bandwidth by increasing memory bus utilization (search "increase in bus utilization" on the page).
  • Optimized Paging and Write Bursting. Both increase memory bus efficiency, which does not increase the bandwidth per se, although it helps improve bandwidth effectiveness.
  • Larger Memory Buffer. A larger buffer can improve store-to-load forwarding and increase the size of write bursting. The buffer itself, however, does not increase memory bandwidth at all.
  • Independent Memory Channels. This certainly has no effect on memory bandwidth. Each of the two independent channels is half the width, resulting in the same overall bandwidth.
Thus, out of six bullet points, only two are even marginally related to memory bandwidth. The bottom line: Barcelona still uses the same memory technology (DDR2) and the same memory bus width (128-bit), so there is simply no more peak bandwidth to be had!

However, one would be even more wrong to think Barcelona's memory subsystem is not improved over its predecessor, because all the points above are nevertheless improvements - not in memory bandwidth, but in memory latency. Intelligent memory prefetching can hide memory latency, as shown in the Intel article page linked above. Reduced read/write transitions due to write bursting and the larger memory buffer can both reduce memory latency considerably. The independent memory channels also reduce latency when multiple memory transactions are in flight simultaneously - especially important for multi-core processing. In short, the memory subsystem of Barcelona is improved for lower latency, not higher bandwidth.

Why Does Barcelona Improve Latency More Than Bandwidth?

There are a few reasons why a general computing platform based on multiple levels of cache benefits more from lower memory latency. This is contrary to specialized signal or graphics processors, where instruction branches (changes in instruction flow) and data dependencies (store-to-load forwarding) are few and rare. This fact is aptly described in the following "Pitfall" on page 501 of Computer Architecture: A Quantitative Approach, 3rd ed., Section 5.16, by Hennessy and Patterson:
  • Pitfall Emphasizing memory bandwidth in DRAMs versus memory latency. PCs do most memory access through a two-level cache hierarchy, so it is unclear how much benefit is gained from high bandwidth without also improving memory latency.
In other words, for general-purpose processors such as Athlon, Core 2 Duo, Opteron, and Xeon, what helps performance is not just the bandwidth but, more importantly, the effective latency of the memory subsystem. This pitfall is promptly followed by its dual on the next page of the book, which in turn explains why most signal and graphics processors that require high memory bandwidth do not need multiple levels of cache like the general-purpose CPUs:
  • Pitfall Delivering high memory bandwidth in a cache-based system. Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that needs it.
Memory Bandwidth Estimate for High-End Quad-Core CPUs

Still, one may ask: is the memory bandwidth offered by, say, a single DDR2-800 channel really enough for modern processors? It turns out that, at least for Intel's upcoming Penryn and AMD's Barcelona, it should be. To estimate the maximum required memory bandwidth, we assume a 3.33GHz quad-core processor with a 4MB cache sustaining 3 IPC (instructions per cycle). Such a processor should be close to the top-performing models from both AMD and Intel by the middle of next year. (See also the micro/macro-fusion article for Core 2's actual/sustainable IPC.)

First let's look at the data bandwidth. A 3.33GHz, 3 IPC core would execute up to 10G I/s (giga-instructions per second). Suppose 1 out of 3 instructions performs a load or store, consistent with the fact that both Core 2 and Barcelona have 6-issue (micro-op) engines and perform up to 2 loads or stores per cycle. Thus,

10G I/s * 0.333 LS/I = 3.33G LS/s (giga-load/store per second, per core)

Multiplying this number by 4 cores, the total is 13.33G LS/s. According to Figure 5.10 on page 416 of Computer Architecture: A Quantitative Approach, a 4MB cache has a miss rate of about 1%. Let's make it 2% to be conservative. Thus the number of memory accesses going to the memory bus is

13.33G LS/s * 2% MA/LS = 0.267G MA/s (giga-memory accesses per second)

Each memory access is at most 16 bytes, but most likely 8 bytes or fewer on average. This makes the worst-case memory bandwidth requirement 0.267G*16 = 4.27GB/s, and the average case 2.14GB/s. Note that a single channel of DDR2 memory can support up to 6.4GB/s, much more than the numbers above.

Now let's calculate the instruction bandwidth. Again, for 4 cores at 3.33GHz and 3 IPC, there are 40G I/s. However, instructions usually enjoy exceptional cache behavior. According to Figure 5.8 on page 406 of the same textbook, a 64KB instruction cache has less than 1 miss per 1000 instructions. Assuming (again quite conservatively) each instruction takes 5 bytes, the total memory bandwidth for fetching instructions is

40G I/s * 0.1% * 5 B/I = 0.2GB/s

Thus even under conservative estimation, the instruction-fetch bandwidth is negligible compared to the data load/store bandwidth. The conclusion is clear: the memory bandwidth of just a single DDR2-800 channel (6.4GB/s) is more than enough for the highest-end quad-core processors of the next 10 months. The problem is not bandwidth, but latency.
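For reference, the whole estimate above can be reproduced in a few lines (all figures and assumptions exactly as stated in the text):

```python
# Back-of-the-envelope bandwidth estimate: 4 cores at 3.33GHz sustaining
# 3 IPC, 1 load/store per 3 instructions, 2% data-cache miss rate,
# 16B worst-case / 8B average access size, and 1 instruction-cache miss
# per 1000 instructions at 5 bytes per instruction.
cores, freq_ghz, ipc = 4, 3.33, 3

ls_rate = freq_ghz * ipc * (1 / 3) * cores     # giga-load/stores per second
mem_accesses = ls_rate * 0.02                  # giga-memory-accesses per second
data_bw_worst = mem_accesses * 16              # GB/s, 16-byte accesses
data_bw_avg = mem_accesses * 8                 # GB/s, 8-byte accesses
instr_bw = cores * freq_ghz * ipc * 0.001 * 5  # GB/s for instruction fetch

print(round(data_bw_worst, 2), round(data_bw_avg, 2), round(instr_bw, 2))
```

All three numbers land comfortably below the 6.4GB/s a single DDR2-800 channel provides.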

Update 8/7/07 - Please take a look at the AMD presentation on Barcelona, page 8, where the quad-core Barcelona is shown to utilize just 25% of the total bandwidth of 10.6GB/s. Supposing this was obtained from a 2.0GHz K10 (the one that was demoed and apparently benchmarked by AMD), then, scaling up linearly, a fictional 3.3GHz K10 would reach about 41% utilization, or about 4.3GB/s. Notice how close this number is to my estimate above.

What About Core 2's Insatiable Appetite for FSB Speed?

A natural question is: if 6.4GB/s is more than enough for the highest-performing quad-core x86-64 processors of the next year or so, why is Intel raising the FSB (front-side bus) speed above 1066MT/s (million transfers per second) so aggressively, to 1333MT/s and even 1600MT/s? Isn't 1066MT/s already offering more than 6.4GB/s of bandwidth?

The reasons are two-fold:
  1. For Core 2 Quad, the FSB is used not just for memory accesses but also for I/O and inter-core communication. Since data transfer on the FSB is most efficient with long trains of back-to-back bytes, such transfer-type transitions can greatly reduce effective bandwidth.
  2. Raising the FSB speed not only increases peak bandwidth but also (more importantly) reduces transfer delay. A 400MHz (1600MT/s) bus cuts the data transfer time of a 266MHz (1066MT/s) bus by one third.
In other words, due to the obsolete design of Intel's front-side bus, the sheer value of peak memory bandwidth becomes insufficient to predict the memory subsystem's performance: a nominally 10.6GB/s bus (1333MT/s * 8B/T) isn't even able to satisfy the needs of a quad-core processor (3.33GHz, 3 IPC) requiring no more than 5GB/s of continuous data access to/from memory.

The Importance of Latency Reduction

To show that latency reduction is the more important reason to raise FSB speed, we will compare a dual-core system with two quad-core systems: one with a 2x wider memory bus, the other with a 1.5x faster memory bus. We will show that the faster bus is more effective at bringing down the average memory access time, which is the major factor affecting a computer's IPC. More specifically, using the dual-core system as the reference, suppose the following:
  1. The quad-core #1 system has the same bus speed but 2x the bus width (e.g., 128-bit vs. 64-bit). In other words, it has the same data transfer delay and 100% more peak memory bandwidth than the dual-core system.
  2. The quad-core #2 system has the same bus width but 1.5x the bus speed (e.g., 400MHz 1600MT/s vs. 266MHz 1066MT/s). In other words, it has 33% less data transfer delay and, consequently, 50% more peak memory bandwidth than the dual-core.
  3. The memory bus is time-slotted and serves the cores in round-robin. For the dual-core and quad-core #1 systems, each memory access slot is 60ns. For the quad-core #2 system, each slot is 40ns.
  4. Memory bandwidth utilization is 50% on the dual-core (2 out of 4 slots are occupied) and the quad-core #1 (4 out of 8 slots are occupied). It is 66.7% on quad-core #2 (4 out of 6 slots are occupied), calculated by 50% * (2x cores) / (1.5x bandwidth).
Note that the assumptions above are simplistic and optimistic. They do not take into account the reduced effective bandwidth & efficiency due to I/O and inter-core communications. When these two effects are taken into account, the quad-core systems will perform even worse than estimated below.

Let's first calculate the average memory access latency of the dual-core system. When either core makes a memory request, it finds a 3/4 = 75% chance that the memory bus is free, and a 25% chance that it has to wait an additional 60ns for access. The effective latency is

60ns * 75% + (60ns+60ns) * 25% = 75ns

Thus on average, each memory access takes just 75ns to complete.

Now let's calculate the effective latency for the quad-core #1 system. When an arbitrary core makes a memory request, it finds only a 5/8 = 62.5% chance that the memory bus is free, and a 37.5% chance that it has to wait. The waiting time, however, is more complicated in this case, because there are C(8,3) = 56 ways the slots can be occupied. Skipping some mathematical derivations, the result is

60ns * (4+3+2+1*5)/8 = 105ns, 6 out of 56 cases
60ns * (3+2+2+1*5)/8 = 90ns, 30 out of 56 cases
60ns * (2+2+2+1*5)/8 = 82.5ns, 20 out of 56 cases

=> (105ns * 6/56) + (90ns * 30/56) + (82.5ns * 20/56) = 88.9ns

Thus even when we double the memory bandwidth and keep the same bus utilization, a quad-core system still incurs 18.5% higher access latency than a dual-core system. Note that this is even in a case where memory utilization is as low as 50%; at higher utilization, the latency increase will only be worse. The conclusion is clear: increasing memory bandwidth alone is not enough to scale up memory performance for multi-core general-purpose processing.
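The weighted average above can be verified numerically; the three case groups and their 6/30/20 counts out of C(8,3) = 56 are taken directly from the derivation above:

```python
# Quad-core #1 average latency: weight the three case groups by their
# share of the C(8,3) = 56 slot arrangements.
slot = 60.0
cases = [
    (slot * (4 + 3 + 2 + 1 * 5) / 8,  6),   # 105.0 ns in 6 of 56 cases
    (slot * (3 + 2 + 2 + 1 * 5) / 8, 30),   #  90.0 ns in 30 of 56 cases
    (slot * (2 + 2 + 2 + 1 * 5) / 8, 20),   #  82.5 ns in 20 of 56 cases
]
total_cases = sum(n for _, n in cases)                    # 56
avg = sum(lat * n for lat, n in cases) / total_cases
print(round(avg, 1))  # 88.9
```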

Now let's look at the quad-core #2 system, where the data transfer delay is reduced 33%, but the memory bus width is the same and bus utilization is increased to 66.7%. When an arbitrary core makes a memory request, there is just a 3/6 = 50% chance the memory bus is free, and a 50% chance it has to wait. The waiting time again is complicated, as there are C(6,3) = 20 ways the occupied slots can be arranged. Skipping again some mathematical derivations, we get

40ns * (4+3+2+1*3)/6 = 80ns, 4 out of 20 cases
40ns * (3+2+2+1*3)/6 = 66.7ns, 12 out of 20 cases
40ns * (2+2+2+1*3)/6 = 60ns, 4 out of 20 cases

=> (80ns * 4/20) + (66.7ns * 12/20) + (60ns * 4/20) = 68ns

The average memory access latency here is almost 10% lower than the dual-core case and 24% lower than the quad-core #1. The effect of higher memory bus utilization is completely offset by a lower data transfer delay. Again, for general-purpose multi-core processing, reducing memory access delay is much more important than increasing memory peak bandwidth.
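The same check for the quad-core #2 numbers, using the 40ns slot and the 4/12/4 case counts out of C(6,3) = 20 given above:

```python
# Quad-core #2 average latency over the C(6,3) = 20 slot arrangements.
slot = 40.0
cases = [
    (slot * (4 + 3 + 2 + 1 * 3) / 6,  4),   # 80.0 ns in 4 of 20 cases
    (slot * (3 + 2 + 2 + 1 * 3) / 6, 12),   # ~66.7 ns in 12 of 20 cases
    (slot * (2 + 2 + 2 + 1 * 3) / 6,  4),   # 60.0 ns in 4 of 20 cases
]
avg = sum(lat * n for lat, n in cases) / sum(n for _, n in cases)
print(round(avg, 1))        # 68.0
print(round(1 - avg / 75, 3))  # 0.093 -> ~9% below the dual-core's 75ns
```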

Conclusion and Remark

Let's go back to the original presentation (supposedly AMD's). Why does it say "increase memory bandwidth" all over the page? Probably because most people simply don't know better, and an article like this one is probably necessary - and perhaps not even sufficient - to change that. We seem to see AMD engineers trying hard to take the delicate bandwidth-latency relationship, twist it, and force it down into a form easily understood (yet probably not believed) by ordinary minds.

However, bandwidth is definitely not useless. It really depends on the workload. For streaming processing such as graphics and signal processing, bandwidth and throughput are everything, and latency becomes mostly irrelevant. You won't care whether a DVD frame is played to you 100 milliseconds after it was read out of a blu-ray disc, as long as the next frame comes within 15 milliseconds (70fps) or so. Yet 100 milliseconds is 300 million cycles of a 3GHz processor! For streaming applications, we certainly want continuous flow of high-bandwidth data, yet have millions of cycles of latency to spare.

Sunday, June 10, 2007

Back-of-envelope calculation of native quad-core production

Semiconductor chip yield is one of the most closely guarded secrets in the industry. There is no way an outsider could "guess-timate" an accurate value, except by chance (whose probability is also extremely low). Yet sometimes observations can be made from some big, obvious facts. In this article we will make some (strictly) back-of-envelope calculations of AMD's and Intel's yields on dual-core processors, and draw implications for native quad-core "manufacturability" at both companies.

The observation and assumptions

For the purpose of discussion, we will make the following assumptions -
  1. In mid-late 2006, Intel occupies ~80% market share with three 300mm 65nm fabs, while AMD occupies ~20% with only 90nm FAB30 (200mm) and FAB36 (300mm).
  2. AMD's FAB30 has the same wafer throughput as Intel's 65nm fabs. FAB36, while under 90-to-65nm transition and having low utilization, further increases production volume by 50%.
  3. Intel's main production in 3Q 2006 is Core 2 Duo and dual-core Netburst, with Core 2 Quad volume small enough to be negligible to our discussion (e.g., 10% or less).
  4. The most significant factor besides those described above is the yield of the fabs, and AMD's FAB36 has about the same dual-core K8 yield as its 90nm counterpart.
The assumptions above may be utterly wrong, or they may be good enough for the "back-of-envelope" purpose. My point here is not to defend their validity; rather it is to make clear that the arguments below hold true only if these assumptions do. Note that Intel's last 65nm fab of the three (Ireland) started production output by July 2006, while AMD's FAB36 started 65nm output in the later part of 4Q that year. Thus at least for 3Q 2006 the above market-share and relative technology differences are known to be true.

The calculations

Potentially, three 300mm 65nm fabs would have 3*2*2 = 12x the capacity of one 200mm 90nm fab, if the yields of all fabs were the same. Thus, counting in AMD's FAB36, Intel would've had 12x/1.5 = 8x the capacity of AMD at the same (dual-core processor) yield. However, Intel's market share during the period was only 4x AMD's. There is thus a 2x discrepancy between Intel's potential capacity (8x of AMD's) and its true capacity (4x of AMD's), which is presumably explained by a lower yield at its fabs. In other words, to reach the observed market share, AMD's FAB30 and FAB36 would need yields twice as good as Intel's 65nm 300mm fabs.
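The back-of-envelope chain above is short enough to script; every constant below is one of the stated assumptions, not new data:

```python
# Potential vs. actual capacity ratio, per the article's rough factors:
# three fabs, ~2x from 300mm-vs-200mm wafers, ~2x from 65nm-vs-90nm density.
intel_potential = 3 * 2 * 2          # 12x one 200mm 90nm fab
amd_potential = 1.5                  # FAB30 plus FAB36's extra 50%
potential_ratio = intel_potential / amd_potential   # 8.0x

actual_ratio = 0.80 / 0.20           # 4x market share
implied_yield_gap = potential_ratio / actual_ratio
print(potential_ratio, implied_yield_gap)  # 8.0 2.0
```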

Apparently, this conclusion cannot be right. A factor of two in yield is too large; Intel simply can't be that bad at manufacturing. A few factors may have affected the estimation accuracy here:
  • Intel's 65nm fabs may have lower wafer throughput or utilization than AMD's FAB30 and FAB36 combined, particularly the Ireland fab which was ramping for just 4 months, and D1D which is also used for 45nm research & development.
  • Intel may be making much more Core 2 Quad, which effectively cuts production volume in half (two Core 2 Duo dies make one Core 2 Quad).
Taking the two factors above into consideration, we'll adjust the estimated yield difference (quite arbitrarily) from 2x down to 1.5x. Note that a high percentage of Netburst-related products from Intel actually makes the discrepancy larger, since Netburst chips are smaller per die, much more mature, and should have better yield than the cutting-edge Core 2 Duo.

The Implication

So how does this 1.5x yield difference affect "native" quad-core manufacturing? Suppose AMD's dual-core K8 yield is 81%; Intel's Core 2 Duo yield would then be just 54% (1/1.5x). To a first-order estimate, AMD's native quad-core would have a yield of roughly 65% (0.81*0.81), whereas Intel's would be 29% (0.54*0.54). In other words, out of 100 quad-core dies, AMD would get 65 functional quad-core processors, while Intel would get only 29 - less than half of its smaller competitor's. It is not difficult to see why AMD is going native but Intel won't until late 2008.
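The first-order yield estimate can be sketched as follows; the 81% starting yield and the 1.5x gap are the assumed figures from above:

```python
# First-order estimate: a native quad-core die is roughly two dual-core-sized
# halves that must both be defect-free, so its yield is about the square of
# the dual-core yield.
amd_dual = 0.81
intel_dual = amd_dual / 1.5          # 0.54, per the assumed 1.5x yield gap
amd_quad = amd_dual ** 2             # ~0.66 (the article rounds to 65%)
intel_quad = intel_dual ** 2         # ~0.29
print(round(intel_dual, 2), round(amd_quad, 2), round(intel_quad, 2))  # 0.54 0.66 0.29
```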

Let's, for the purpose of discussion, turn the parameters further in Intel's favor, and assume it has just 1.25x lower yield (instead of 1.5x) than AMD. If we again suppose dual-core K8 yields 81%, then Core 2 Duo would yield almost 65%, making Intel's MCM quad-core approach as productive as AMD's native quad-core approach. What we see here is that a yield just a quarter better than the competitor's could make a huge difference in native quad-core manufacturability. In fact, not only is Intel late to native quad-cores, it was also late to native dual-cores by about 6 months, even with a better process technology (65nm vs. 90nm).

The conclusion is clear: Intel is telling the truth that it can't make native quad-cores cost-effectively. For AMD, it might be very hard, but probably still doable, based on a simple capacity observation and this back-of-envelope calculation.

The arguments

Some people have questioned the precision of the above estimates. Their arguments basically fall into the following points:
  1. Intel's D1D is also making 45nm transition in late 2006, thus should have less than maximum output.
  2. Intel's Ireland fab, ramping only 4 months from Jun'06, won't achieve max capacity in Oct'06.
  3. Intel shipped a higher proportion of dual-core processors in 4Q06 than AMD. Specifically, just over 50% of Intel's processors were dual-core, while only 30% of AMD's were.
  4. AMD's FAB36, making 300mm wafers and started revenue shipping in Apr'06, should've been making as much silicon as FAB30.
  5. By late 3Q06, AMD would also have Chartered's output at hand.
  6. Intel's 65nm doesn't actually result in 2x capacity of 90nm, more like 1.7x (1/0.6). As well, Intel's 300mm wafer would result in ~2.25x usable silicon area of 200mm ones.
It's important to note that all these are higher-order factors. A slight difference in max wafer throughput per fab (ranging anywhere from 20k to 60k) could dwarf any of the above. Still, for the sake of discussion, let's try some more precise estimates from these points.

The first point, it turns out, is wrong. As D1D makes 45nm output, its 65nm capacity is moved to the neighboring D1C, which began outputting 65nm chips right after Ireland and was purposely/completely ignored by me above. The second point is valid and would reduce the 65nm Ireland fab's capacity to some 30% of its max.

The third point above is also true; however, it fails to recognize that most of Intel's single-core processors (Celerons and Pentium Ms) are made at its 90nm fabs, whereas all of AMD's single-core and dual-core processors come out of FAB30 & 36 (excluding Chartered) - a factor the 17% difference in dual-core ratio isn't even able to compensate for.

The fifth and sixth points are minor. Chartered's flex capacity would account for up to 20% of AMD's silicon output, and even less in Oct'06, not 3 months after its first revenue shipping for AMD. Assuming Chartered supplies 15% of AMD's silicon output, output from AMD's own fabs is effectively 85% of the total, changing the actual Intel-to-AMD output ratio from 4x to 4.7x.

The fourth point, which we'll discuss last, seems quite valid from page 5 of this AMD Jun'06 analyst day presentation (see picture above). At late 3Q06, the 300mm "wafer outs" from FAB36 appear to be 0.4x the max 200mm wafer outs from FAB30, equivalent to 0.4*2.25 = 0.9x FAB30's silicon area. Surely this would be a great increase of AMD's potential capacity. Unfortunately, it turns out this argument is unfounded, misled by a graph that has no y-axis unit and is meant to be illustrative only.

If we read the text on page 4 of that presentation (again see picture above), FAB36 is expected to output 25k wafers per month (wpm) by Q4 2007, which will be the total 300mm wafer output at that point (FAB38 won't have wafer outs until Q1 2008). We also know FAB30 is outputting about 30k wpm in Q3 2006. Now go to page 5 again and look! How can the green line's 25k wpm (end of 4Q07) be some 60% higher than the red line's 30k wpm in 2006? It is absolutely not possible unless the "wafer outs" y-axis actually means wafer area out, in which case the 25k 300mm wpm from FAB36 is effectively doubled to ~50k, some 66% higher than the 30k 200mm wpm from FAB30.

It turns out my original estimate of FAB36 reaching 50% the capacity of FAB30 was actually a bit optimistic. The true number should be calculated as follows: (0.8/3.4 * 25000)*2.25/30000 = 0.44, where 0.8 is the green line's value at end of 3Q06, 3.4 its value at end of 4Q07 (both FAB36 only), 25000 the expected green-line wpm at end of 4Q07, 2.25 the factor translating 300mm wpm to effective 200mm wpm, and the final 30000 is FAB30's wpm (red line at end of 3Q06).
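The formula can be checked directly; the 0.8, 3.4, 25000 and 30000 figures are the graph readings described above:

```python
# FAB36's effective capacity as a fraction of FAB30.
green_3q06 = 0.8         # green line (FAB36) at end of 3Q06
green_4q07 = 3.4         # green line at end of 4Q07, = 25k wpm
fab36_wpm = green_3q06 / green_4q07 * 25000    # scale 3Q06 level to wpm
ratio = fab36_wpm * 2.25 / 30000               # 300mm -> 200mm-equivalent, vs FAB30's 30k
print(round(ratio, 2))  # 0.44
```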

Overall, a definitely more precise/probably more accurate estimate is the following:

Intel's potential capacity: (2+0.3)*(1/0.6)*2.25 = 8.6
AMD's potential capacity: 1+0.44 = 1.44
Potential capacity ratio: 8.6/1.44 = 6.0x

Intel's actual output: less than 80% market share (excluding 90nm production)
AMD's actual output: 20%*0.85 = 17% (Chartered effects)
Actual output ratio: 80%/17% = 4.7x

Discrepancy between potential and actual output: 6.0/4.7 = 1.28, or almost a 28% difference in microprocessor yield - well between the 50% and 25% estimates made above.
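Putting the refined chain together in one place (note the unrounded quotient is ~1.27; dividing the pre-rounded 6.0 by 4.7 gives the 1.28 quoted above):

```python
# The refined potential-vs-actual estimate, end to end.
intel_potential = (2 + 0.3) * (1 / 0.6) * 2.25   # ~8.6
amd_potential = 1 + 0.44                         # 1.44
potential_ratio = intel_potential / amd_potential  # ~6.0

actual_ratio = 0.80 / (0.20 * 0.85)              # Chartered supplies 15% of AMD's output
discrepancy = potential_ratio / actual_ratio
print(round(potential_ratio, 1), round(actual_ratio, 1), round(discrepancy, 2))  # 6.0 4.7 1.27
```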

Friday, June 01, 2007

Decoding x86: From P6 to Core 2 - Part 3

This is Part 3 of a 3-part series. To fully appreciate what's written here, the Part 1 and Part 2 articles (or comparable understanding) are prerequisites.

The New Core Improvements

Intel's "brand" new Core 2 Duo has many improvements over Pentium M. With respect to the x86 decode stage, they include -
  1. Improved micro-fusion
  2. 4-wide decode
  3. Macro-fusion
All of these have been numerously described and repeated by many on-line review sites. Here again we will look at them in more technical and analytical detail.

The improved micro-fusion is the least complicated, so we will only briefly describe it here. It amounts to a bigger XLAT PLA (see the partial decoder diagram in Part 2) that can handle more load-modify or addressed-store instructions, including many SSE2/SSE3 ones. This improves Core 2's SSE performance over its predecessors, which must re-steer many SSE instructions to the first (full) decoder for processing. In fact, Core Solo/Core Duo (Yonah) already had improved micro-fusion over Pentium M, but covering fewer instructions than Core 2 Duo.

On non-SSE code, however, the performance boost is limited.

A 4-wide decode & issue width

The biggest marketing hype around Core 2 is certainly its ability to decode and issue 4 x86 instructions per cycle, thus achieving an IPC of 4 (or 5 with macro-fusion)! It turns out this is the biggest misconception about Core 2. As discussed in Myth #3 of the Part 1 article, a (sustained) rate of three x86 decodes per cycle is not yet the performance bottleneck. In fact, Intel's Optimization Reference Manual itself says that
[Decoding 3.5 instructions per cycle] is higher than the performance seen in most applications.
- Instruction Fetch Unit (Instruction PreDecode)
Note that this is stated under conditions where branches, assumed to occur once every 7 instructions, are predicted 100% correctly - which is almost never the case - so the sustained IPC is usually further reduced.

Contrary to marketing slogans and common (mis-)belief, the main purpose of a 4-wide decode & issue (and of macro-fusion, discussed below) is really to combat the many undesirable design artifacts of P6's x86 decode engine. As seen at the end of the Part 1 article, these design artifacts reduce the efficiency of the 4-1-1 decoders, which under real circumstances can hardly sustain three x86 decodes per cycle. Specifically -
  1. Flushing the decoding pipeline every 16 bytes, or about every 4 to 5 x86 instructions on average.
  2. Flushing the decoding pipeline at each complex (> 2 fused micro-ops) instruction.
  3. Reduced instruction fetch for taken branches, especially to unaligned target addresses.

An additional partial decoder

For 1. and 2. in the above list, an additional partial decoder can help simply by raising the upper bound of the averaging range. For the purpose of discussion, suppose a 16-byte window contains four x86 instructions, and there is only one complex instruction among two such windows:
  • A set of 4-1-1 decoders will spend 4 to 5 cycles to decode the two 16-byte instruction windows, where two cycles are spent on the window with only simple instructions, and another two or three are spent on the one with a complex instruction (depending on where the complex instruction occurs).
  • A set of 4-1-1-1 decoders will spend only 3 to 4 cycles to decode the same two windows.
By lifting the roof of the best-case capability, a wider x86 decode engine increases the average decode throughput. Note that even under the ideal condition where branch-related stalls do not occur, the sustained decode rate is still less than 2.7 per cycle (8 instructions in 3+ cycles), far from the 4 or 5 advertised by Intel.
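A quick sketch of the averaging argument, assuming the cycle counts argued above (8 instructions across the two 16-byte windows):

```python
# Sustained decode throughput for the two-window example: 8 instructions
# divided by the cycles each decoder configuration needs (per the argument
# above; this is a sketch, not a cycle-accurate model).
def decodes_per_cycle(cycles, instructions=8):
    return instructions / cycles

print(decodes_per_cycle(5), decodes_per_cycle(4))          # 4-1-1:   1.6 2.0
print(decodes_per_cycle(4), round(decodes_per_cycle(3), 2))  # 4-1-1-1: 2.0 2.67
```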

The Instruction Queue

The extra partial decoder, however, does not help with the 3rd point in the previous list, when a branch is taken, especially to an unaligned target address. Note that branch frequency is about 10-15% in normal programs (see also macro-fusion below). While many branch targets can be forced to 16-byte alignment, this is usually not possible for small in-line loops. If the entry point of the loop has address x MOD 16, then during the first cycle executing the loop, only 16 minus x of the fetched bytes contain effective instructions. This number does not increase no matter how many additional decoders you add to the decoding engine.

The real "weapon" Core 2 Duo has against this branch-related inefficiency is not the 4-wide decoder, but a pre-decoded instruction queue holding up to 18 x86 instructions. Refer to the Part 1 article's first diagram of P6's Instruction Fetch Unit. There is a 16-byte wide, instruction-boundary-aligned Instruction Buffer sitting between the IFU and the decoders. By replacing this buffer with an 18-instruction-deep queue (probably 24 to 36 bytes in size) that can detect loops among the contained instructions, we get Core 2 Duo's biggest advantage with respect to x86 decode: the ability to sustain a continuous decode stream on short loops.

This continuous stream of x86 instructions allows Core 2 Duo's four decoders to be better utilized. The 18-instruction queue is aligned at instruction boundaries, and is thus immune to the branch-target (16-byte) misalignment problem. Although the 18-deep queue easily becomes insufficient when loop unrolling, a compile-time optimization technique, is used, that is okay, because unrolling a loop has the exact same effect as supplying a continuous instruction stream. Moreover, the instruction queue also serves as the place where macro-fusion opportunities are identified, as will be discussed next.

Without extensive simulation or real traces, we really can't be sure how much boost Core 2 Duo receives from the 4-wide decode and the instruction queue. We have to make a guess: with one extra partial decoder, the average sustained x86 decode throughput is probably increased from around 2.1 to about 2.5 macroinstructions (x86) per cycle. With the help of the instruction queue supplying uninterrupted macroinstructions in small loops, the sustained decode throughput probably increases further to 2.7 or even close to 3.

Macro-fusion, the Myth and Truth

Debunking the Myth

Intel markets macro-fusion as the ability to increase x86 decode throughput from 4 to 5. As we have seen in the section above, the decode throughput without macro-fusion is much less than 4, only close to 3. It turns out that macro-fusion has even less impact on improving throughput, as discussed here.

So what really is macro-fusion? In Intel's P6 terminology, "macro" or "macroinstruction" describes an instruction in the original ISA (Instruction Set Architecture - here, x86). Thus macro-fusion is actually the exact same idea as micro-fusion, where two (or more) dependent instructions with a single fan-out are collapsed into one instruction format (see the Part 2 article). The difference is their application domain: where micro-fusion works on internal micro-ops, macro-fusion works on (x86) macroinstructions. In fact, Intel's macro-fusion patent, System and Method for Fusing Instructions, filed in Dec. 2000, predates its micro-fusion patent, Fusion of Processor Micro-Operations, filed in Aug. 2002. It is probably for the two following reasons that the former was implemented later:
  1. Complexity (or difficulty)
  2. Limited usefulness

Why is it difficult, and what does it do?

First, we know that x86 instructions are complex and variable-length. Some x86 instructions take 6 clock cycles just to determine their length (page 2-7, Instruction PreDecode, of Intel's Optimization Reference Manual). The complexity of collapsing variable-length macroinstructions, when most of the cycle time is already spent decoding lengths (among other things), is undoubtedly much higher than that of fusing fixed-width micro-ops. Second, it would be even more difficult, if not impossible, to determine dependencies in real time and fuse the dependent macroinstructions together.

So instead of trying to fuse all possible macroinstruction pairs, Core 2 Duo fuses only selected macroinstructions -
  • The first macroinstruction must be a TEST X, Y or a CMP X, Y where only one operand of X and Y is an immediate or a memory word.
  • The second macroinstruction must be a conditional jump that checks the carry flag (CF) or zero flag (ZF).
  • The macroinstructions must not be executing in 64-bit mode.
Such test/compare-and-jump pairs are often used in integer programs composed of iterative algorithms. According to a 2007 SPEC Benchmark Workshop paper, "Characterization of Performance of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor," the frequency of macro-fused operations in SPEC CPU2006 ranges from 0-16% in integer codes and just 0-8% in floating-point codes. In other words, in the best case, macro-fusion reduces the number of macroinstructions from 100% to 92% for integer and just 96% for floating-point execution - hardly the whopping 20-25% reduction described by Intel's marketing department (and the numerous on-line repeaters).
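One way to read those percentages, matching the arithmetic above (an assumption about how fused operations are counted, not something the paper spells out here): if a fraction f of all macroinstructions participate in fusion, each fused pair removes one instruction, so the decoded count shrinks by f/2:

```python
# Best-case macroinstruction count after macro-fusion, as a fraction of
# the original count, for a fused-participation fraction f.
def remaining_fraction(fused):
    return 1 - fused / 2

print(round(remaining_fraction(0.16), 2))  # 0.92 (integer best case)
print(round(remaining_fraction(0.08), 2))  # 0.96 (floating-point best case)
```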

Bringing the Truth

Looking at it more closely, we realize that the purpose of macro-fusion is really not so much to reduce the number of x86 instructions to be decoded, but again to reduce decode interruptions/stalls due to predicted-taken branches. Again for the purpose of discussion, let's number the four x86 decoders 0, 1, 2, and 3. A two-macroinstruction sequence can be steered to any of the following four positions: [0,1], [1,2], [2,3], [3,0]. If the conditional jump is predicted taken, then no instruction after it will be steered for decoding, and in two of the four cases (i.e., [0,1] and [3,0]) the four decoders will decode no other macroinstruction at all in that cycle. More specifically,
  • Decoder slot [0,1], no other instruction decode, 0.25 probability
  • Decoder slot [1,2], 1 other instruction decode, 0.25 probability
  • Decoder slot [2,3], 2 other instruction decode, 0.25 probability
  • Decoder slot [3,0], no other instruction decode, 0.25 probability
The average number of other decodes is thus (1+2)*0.25 = 0.75, or about 19% efficiency when the 4 decoders work on a block of macroinstructions containing a conditional branch. Note that this assumes everything else is ideal, including perfect branch prediction, all simple instructions, and no 16-byte instruction misalignment. In reality, separate test-and-jump macroinstructions under realistic conditions will probably reduce decode efficiency even more.

Thankfully, when looking at the bigger picture, the situation becomes much better. As previously stated, the frequency of conditional branches tops out at 8-16% in the first place; in other words, on average one taken branch occurs every 8 to 16 other instructions, or every 3 to 4 instruction fetch cycles (see the bottom of page 2-6 in Intel's Optimization Reference Manual). Suppose a taken branch occurs after 3 blocks of non-branching decodes; the 80% decoding-efficiency loss at the branching block then results in less than 20% loss overall. This is why even without macro-fusion, Core 2's predecessor (Yonah) could already achieve an IPC higher than 2 on some programs with only three x86 decoders.

Now let's look at what happens to the conditional branch decode when macro-fusion is added. Again, the first column is the decoder slot occupied by the now-fused branch macroinstruction; the second column is the number of other instruction decodes; the last column is the occurrence probability of the row:
  • Decoder slot 0, no other instruction decode, 0.25 probability
  • Decoder slot 1, 1 other instruction decode, 0.25 probability
  • Decoder slot 2, 2 other instruction decode, 0.25 probability
  • Decoder slot 3, 3 other instruction decode, 0.25 probability
The average number of other decodes becomes (1+2+3)*0.25 = 1.5, or about 38% efficiency of the 4 decoders, double the case without macro-fusion. The overall decoding-efficiency loss shrinks from less than 20% to less than 10%. A 10% increase in decoding efficiency will certainly be appreciated by the rest of the core, lifting the roof of sustained IPC to 3 or maybe even higher for SPEC95-like programs (note that according to Intel's manual, Core 2's macroinstruction length pre-decoder is designed to sustain a 3.5 decode throughput in the worst case).
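The two slot-position tables above reduce to a few lines of arithmetic:

```python
# Average number of non-branch decodes in the cycle that handles a
# predicted-taken conditional branch, over the four equally likely alignments.
other_decodes_unfused = [0, 1, 2, 0]   # cmp+jcc pair at [0,1],[1,2],[2,3],[3,0]
other_decodes_fused   = [0, 1, 2, 3]   # fused cmp+jcc in slot 0, 1, 2 or 3

avg_unfused = sum(other_decodes_unfused) / 4   # 0.75 -> ~19% of 4 decoders
avg_fused = sum(other_decodes_fused) / 4       # 1.5  -> ~38% of 4 decoders
print(avg_unfused, avg_fused)  # 0.75 1.5
```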

This concludes the 3-part Decoding x86: From P6 to Core 2 series. I hope what's written here satisfies your curiosity about the inner workings of modern microarchitectures, as it certainly did mine over the course of my research/study. Please let me know if you have comments, suggestions, or - even better - corrections to the contents.

Tuesday, May 29, 2007

Decoding x86: From P6 to Core 2 - Part 2

This is Part 2 of a 3-part series. To fully appreciate what's written here, the Part 1 article (or comparable understanding) is a prerequisite.

The New Advancements

Three major advancements have been made from the original P6 x86 decode over the years: micro-op fusion (Pentium M), macro-fusion (Core 2), and an increased 4-wide decode (also Core 2). In this Part 2 article, I will go over the micro-op fusion in more detail, and in the next Part 3, I will go further into Core 2's additions.

While these advancements have all been "explained" numerous times on the Internet, and marketed massively by Intel, I must say that many of those explanations and claims are either wrong or misleading. People get second-hand info from Intel's marketing guys and possibly even some designers, and they tend to spice it up with extra sauce, partly from imagination and partly from "educated" [sic] guesses.

One big problem I see in many of those on-line "analyses" is that they never get to the bottom of the techniques - why they were implemented, and what makes them as compelling as they are. Instead, most of those analyses just repeat whatever glossy terms they got from Intel and gloss over the technical reasoning. Not that the technical reasoning matters more to end users, but without proper reference to it, the "analyses" will surely degrade into mere marketing repeaters for Intel. These wrong ideas also tend to have bad consequences for the industry - think of Pentium 4 and the megahertz hype that came with it.

In the following, I will try to look at the true motives and benefits of these techniques from a technical point of view. I will try to answer the 3W1H questions for each: Where does it come from, What does it do, How does it work, and Why is it designed so. As stated in the previous Part 1 article, all analyses here are based on publicly available information. Without inside knowledge from Intel, I cannot be certain of being 100% error-free. But the good thing about technical reasoning is that, with enough evidence, you can reason for or against it, instead of simply believing whatever marketing claims come your way.

* Micro-op fusion - its RISC roots

The idea behind micro-op fusion, or micro-fusion, came in the early '90s, as a way to improve RISC processor performance where true data dependencies exist. Unsurprisingly, it did not come from Intel. In a 1992 paper, "Architectural Effects on Dual Instruction Issue With Interlock Collapsing ALUs," Malik et al. of IBM devised a scheme to issue two dependent instructions at once to a 3-to-1 ALU. The technique, called instruction collapsing, was then extended and improved by numerous researchers and designers.

Intel did not come to the game until 2000/2001, quite late (Pentium M was released in 2003), and apparently just grabbed the existing idea and filed a patent on it. The company did bring something new to the table: a cool name, fusion. It simply sounds better to fuse work than to collapse instructions, doesn't it? In fact, the micro-fusion in Intel's design is very rudimentary compared to what had been proposed 6-8 years earlier in the RISC community; we will talk about this shortly.

Let's first look at the original instruction-collapsing techniques. Because a RISC ISA generally consists of simple instructions, true dependency detection among these instructions becomes a big issue when collapsing them together. However, if one can dynamically find the dependencies - as all modern out-of-order dispatch logic can - then not only two but even more instructions can be collapsed together. The reported performance improvement ranged from 7% to 20% on 2- to 4-issue processors.

* A cheaper and simplified approach

Now turn to Intel's micro-op fusion. What does it do? Magic, as most cheering websites would have it? Surely not -
  • It only works on x86 read-then-modify and operate-then-store instructions, where no dependency check is needed between the two micro-ops to be fused.
  • It works only on x86 decode and issue stages, so no speculative execution is performed.
  • It doesn't change or affect the ALUs, so the same number of execution units is still needed for one fused micro-op as two non-fused micro-ops.
What is actually added is an extra XLAT PLA for each partial x86 decoder (see the diagram above, and also the Part 1 article of this series), so that the partial x86 decoders can handle those load/store instructions that generate two micro-ops. Naturally, the performance increase won't be spectacular; the early report from Intel put it between just 2% and 5%. This is actually not a bad result, given that the technique itself is pretty localized (to the x86 decode and micro-op format), and the main point of micro-fusion is not to remove dependencies or increase execution width anyway, as will be discussed later.

* An additional PLA plus a condensed format

So how does micro-fusion work? An x86 read-then-modify instruction, for example, consists of two dependent micro-ops in one "strand" (i.e., single fan-out): 1) calculate the load address, 2) modify the loaded result. Micro-fusion binds these two operations into one format -
  1. Putting the two micro-ops into one fused format, which now has two opcode fields and three operand fields. (Yup, that's it, or what else have you expected?)
  2. Putting the operand fields of the first opcode into the fused micro-op, and putting only the non-dependent operand field of the second opcode into the fused micro-op.
  3. Linking the dependent operand of the second opcode to the output of the first opcode.
The fused micro-op is really two separate micro-ops combined in a condensed form. When the fused micro-op is issued, it occupies only one (wider) reservation station (RS) slot. Since it only has one fan-out (execution result), it occupies only one reorder buffer (ROB) slot, too. However, the two opcodes are still sent to separate execution units, so the execute bandwidth is not increased (nor reduced, by the way).
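A conceptual sketch of such a condensed format (the field names and the two-opcode/three-operand layout here are illustrative only, not Intel's actual micro-op encoding):

```python
# Conceptual sketch only - illustrative field names, not Intel's encoding.
# A fused micro-op carries two opcode fields and three operand fields; the
# second opcode's dependent input is implicitly the first opcode's result,
# so the fused micro-op has a single fan-out and takes one RS/ROB slot.
from dataclasses import dataclass

@dataclass
class FusedMicroOp:
    opcode1: str    # e.g., the load-address calculation
    opcode2: str    # e.g., the modify operation
    src1: str       # operands of the first opcode
    src2: str
    src3: str       # the non-dependent operand of the second opcode

# Hypothetical example: "ADD EAX, [EBX+8]" fused as load-address + add.
uop = FusedMicroOp("load_addr", "add", "EBX", "8", "EAX")
print(uop.opcode1, uop.opcode2)  # load_addr add
```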

* It works just fine - not great, just fine

So why does it work? Micro-fusion works because it relieves, to some degree, the x86 decode of the 4-1-1 complexity constraint. For those x86 instructions that take one argument directly from a memory location, this technique will -
  1. Increase x86 decode bandwidth from 1 to 3.
  2. Reduce RS usage by 50%.
  3. Reduce ROB usage by 50%
What it costs to implement micro-op fusion is just a minor increase in micro-op format complexity and an additional XLAT PLA for each partial decoder. So, after all, it's probably a good deal - a smart way to increase P6 performance. It just doesn't work the miracles many amateur sites have claimed, judging from the published literature, and there's not much of Intel's own intellectual credit in it.

Sunday, May 27, 2007

Decoding x86: From P6 to Core 2 - Part 1

In this series of articles I will take a close look at the x86 instruction decode of Intel's P6 processor family, which includes Pentium-Pro/II/III/M, Core, and Core 2. I will first explain the design in some detail, then relate the marketing terms such as micro-op fusion, macro fusion and 4-wide decoding with what is actually happening inside the processor, down to its microarchitectures and processing algorithms.

All analyses here are based on publicly available information, such as Intel's software optimization manuals, patents and papers. What is added is some knowledge and understanding of computer microarchitecture and circuit design. In all probability, the analyses here should clarify and correct far more myths out there than they introduce errors.

The x86-to-RISC Decoding Problem

Over the years, Intel has advocated the use of CISC over RISC instruction sets. However, with great irony - if we actually believed Intel's apparent stance in the RISC/CISC argument - its P6 microarchitecture is really designed to be more "RISC Inside" than "Intel Inside." In order to reach both higher clock rates and better IPC (instructions per clock), the complex x86 instructions must first be decoded into a simple, fixed-width RISC format (micro-ops) before being sent for execution. This way, the number of pipeline stages an instruction must go through and the delay of the longest pipeline stage can be optimized for the common average case rather than the rare worst-case instructions.

All sounds good, right? Except there are three (rather big) problems:
  1. The variable-length x86 instructions, which are almost always misaligned in the instruction cache, are hard to decode in parallel (i.e., multiple decodes per clock cycle).
  2. The many addressing modes and operand sizes of even the simplest x86 instructions require complex and slow translation from x86 to the internal RISC format.
  3. The high complexity of some x86 instructions makes worst-case decoders highly complex and inefficient.
Only by recognizing the problems of x86 decode and the difficulty of solving them can we fully appreciate the design choices Intel made in the P6 front-end, as described in the three techniques below.

Technique #1: Pipeline the instruction length, prefix and opcode decodes

An x86 instruction can have 1-3 opcode bytes, 0-10 operand bytes, plus up to 14 prefix bytes, all within a 15-byte total length limit. When stored in the instruction cache, it is almost never aligned to the cache line, which unfortunately is the unit the processor core uses to read from the cache. To solve the variable-length misalignment problem, P6's Instruction Fetch Unit (IFU) decodes the lengths, the prefixes, and the actual instruction opcodes in a pipelined fashion (see also the picture below):

Instruction Fetch Unit and steering mechanism

  • When IFU fetches a 32-byte cache line of instructions, it decodes the instruction lengths and marks the first opcode byte and the last instruction byte of every instruction in the window. The 32 bytes are put into a pre-decode buffer together with the markings.
  • The 32 bytes are scanned, and 16 bytes starting from the first instruction are sent via a rotator to the instruction buffer (now aligned to the instruction boundary), from which they proceed along two paths.
  • On one path, all 16 bytes are sent to the prefix decoders, where the first 3 prefix vectors are identified and sent to help instruction decode below.
  • On the other path and at the same time, 3 blocks of the same 16 bytes are steered to the 3 decoders in parallel, one block for each consecutive instruction.
Steering variable-length instructions is a complex task. The instruction bytes must be scanned sequentially to locate up to 3 opcodes and their operands, then packed and sent to the 3 decoders. Each decoder might receive up to 11 bytes (the maximum x86 instruction length without prefixes), or it might receive just one.

By determining the instruction boundaries early and pipelining the prefix decode away from instruction decode, the steering task can be made simpler and faster. To simplify matters further, only the first (full) decoder will accept 11 bytes; the other two (partial) decoders accept only up to 8 bytes of opcodes and operands, as will be discussed further below.
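The length-marking step can be illustrated with a toy model. The function below is hypothetical - real hardware derives the lengths from the raw instruction bytes in parallel, whereas here we take them as given to keep the sketch short:

```python
def mark_boundaries(lengths, window=32):
    """Mark the first (opcode) byte and the last byte of each
    instruction inside a fetch window, given pre-decoded lengths."""
    starts, ends = set(), set()
    pos = 0
    for n in lengths:
        if pos >= window:
            break                          # rest belongs to the next window
        starts.add(pos)
        ends.add(min(pos + n, window) - 1) # clamp at the window edge
        pos += n
    return starts, ends

# five instructions of lengths 3, 1, 7, 2 and 5 bytes
starts, ends = mark_boundaries([3, 1, 7, 2, 5])
```

These start/end markers are exactly what gets stored alongside the 32 bytes in the pre-decode buffer, so the later steering stage can locate instructions without re-scanning the bytes.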

Technique #2: Decode the opcodes and operands separately

After a decoder (full or partial) receives the opcode and operand bytes, it must try to decode them into a RISC format efficiently. This is accomplished by again decoding the opcodes and the operands in separate paths, as illustrated by the partial decoder diagram below:

Partial x86 decoder

  • From the steering circuit, 3 opcode bytes are picked up and sent to a translation programmable logic array (PLA) for control micro-op decode. The decoded control signals and micro-op template are put into a control uop (cuop) register.
  • All the opcode and operand bytes, together with the prefix vectors from the prefix decoders, are sent in parallel to a field extractor. The field extractor extracts the alias information that further describes the control micro-ops into a macro-alias register.
  • The two registers, cuop and macro-alias, are then combined by an alias multiplexer to produce the final alias-resolved micro-op (auop).
By decoding opcodes into templates and extracting operand information separately, the opcode decoder's PLA can be minimized and made flexible. Flexibility is important: as we will see, the full decoder (shown in Technique #3 below) is really the partial decoder plus 3 more XLAT PLA pipelines and a microcode engine. The flexibility also made it possible to implement micro-op fusion by adding an extra XLAT PLA, as will be discussed in the Part 2 article later.
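The template-plus-alias split can be sketched as follows. The template and field names here are invented for illustration and do not reflect Intel's actual encodings:

```python
def alias_mux(cuop_template, macro_alias):
    """Substitute each symbolic placeholder in the control-uop template
    with the operand field extracted by the field extractor."""
    return [macro_alias.get(field, field) for field in cuop_template]

# XLAT PLA output for a register-register ADD: op plus symbolic fields
cuop = ["ADD", "SRC1", "SRC2", "DEST"]
# field extractor output for 'ADD eax, ebx'
alias = {"SRC1": "eax", "SRC2": "ebx", "DEST": "eax"}

auop = alias_mux(cuop, alias)   # the alias-resolved micro-op
```

Note why this shrinks the PLA: it only needs one template per opcode, not one entry per opcode-operand combination - the operand variations are handled entirely by the (simpler) field extractor.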

Technique #3: Differentiate decoders to Make the Common Case Fast

In a typical x86 program, more than 2/3 of the instructions are simple enough to be represented by a single (non-fused) micro-op. Most of the remaining 1/3 can be decoded into 4 micro-ops or fewer, with a (very) few taking more to execute. Recognizing these facts, especially the 2:1 simple-to-complex ratio, the P6 design divides its decoders into the well-known 4-1-1 structure, giving only one decoder full capability:

Full x86 decoder

  • The first decoder has four translate PLAs, decoding an instruction to up to 4 control uops in one clock cycle (see the full decoder diagram right above).
  • The first decoder also has a micro-code engine to decode the few really complex instructions over multiple clock cycles, generating up to 3 control uops per cycle (notice the three 2:1 MUXes in the above diagram).
  • The second and third decoders, as explained in Technique #2, have only one PLA each and can decode only single-uop x86 instructions, one per clock cycle.
  • Each decoder is equipped with its own macro-alias field extractor, although the first decoder's can be bigger in size.
When the micro-code engine is used, the 2nd and 3rd decoders are stalled to preserve in-order issue. By differentiating the decoders and putting the performance emphasis on the common simple-instruction case, instruction decode and issue complexity can be minimized, and a higher clock rate can be reached. RISC design rule #1: make the common case fast.
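The 4-1-1 steering policy can be sketched as a toy model (it ignores the micro-code engine and the fetch-window limits, and uses made-up uop counts):

```python
def steer_411(uop_counts):
    """Yield the group of instruction indices decoded in each cycle,
    given each instruction's micro-op count, under the 4-1-1 rule."""
    i, n = 0, len(uop_counts)
    while i < n:
        group = [i]            # decoder 0 (full) takes any instruction
        i += 1
        for _ in range(2):     # decoders 1 and 2 (partial)
            if i < n and uop_counts[i] == 1:
                group.append(i)
                i += 1
            else:
                break          # complex instruction must wait for decoder 0
        yield group

# complexity pattern: simple, simple, complex, simple, simple
cycles = list(steer_411([1, 1, 3, 1, 1]))
```

Here the complex third instruction ends the first cycle early, so only instructions 0 and 1 decode in cycle one, and 2, 3, 4 decode in cycle two - exactly the "re-steering" penalty discussed under the myths below.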

The Myths, Part 1

The Internet, being the world's greatest information exchange, inevitably also becomes its largest rumor farm and myth factory. There have been numerous very wrong ideas about the P6 microarchitecture as a whole and its decoding front-end in particular. In the "The Myths" sections I will try to correct some of these misconceptions.

Since this Part 1 article only covers the basic x86 decoding mechanisms, the related myths are also more basic and less astonishing - the described decoding mechanisms are over 10 years old, after all. Nevertheless, it is still better to get things right than wrong.

Myth #1: It is better to have more full decoders

An attempt to make multiple fully capable decoders work in parallel is likely to cost much and gain little, not only because it would be very inefficient (resulting in a slower clock rate and higher power usage), but also because it would cause trouble for the micro-op issue logic, which then must dynamically determine how many micro-ops are generated by each decoder and route them through an (M*D)-to-N fabric, from D decoders of M micro-ops each to an issue queue of length N.

With twice as many simple instructions as complex ones in a typical program, an additional full decoder is not worth it unless two more partial decoders are added as well. This ratio is pushed even higher by the introduction of micro-op fusion and the use of powerful SIMD instructions, though those came later.

Myth #2: It is better to utilize the full decoder as much as possible

Even though the full decoder can generate up to 4 micro-ops per clock cycle in parallel with the partial decoders, the issue queue of the P6 microarchitecture can only issue 3 micro-ops (or 4 in the case of Core 2) in any cycle. What this means is that the micro-op issue (and execution) logic cannot "digest" a continuous flow of x86 instructions with 4-1-1 uop complexity (with micro-op fusion, the pattern becomes effectively 4-2-2 - see Part 2 for more detail).

In other words, the pipeline (more precisely, the issue queue) will stall even when you use those moderately complex instructions that decode in one clock cycle only sparsely (e.g., less than 30% of the time). A corollary of this is that, in general, it is beneficial to replace a complex instruction with 3 simple ones (or 4 in the case of Core 2). The lesson: CISC does not scale. Even though you are writing/compiling for the CISC x86 ISA, you still want to make your assembly code as RISC-like as possible to get higher performance.
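A back-of-the-envelope model shows the mismatch. The function below is illustrative and assumes issue, not decode, is the only bottleneck:

```python
def steady_state_ipc(uop_pattern, issue_width=3):
    """Instructions per cycle for a repeating uop-complexity pattern,
    once the issue queue is the bottleneck."""
    uops = sum(uop_pattern)
    cycles = -(-uops // issue_width)   # ceiling division
    return len(uop_pattern) / cycles

# a repeating 4-1-1 pattern decodes 6 uops/cycle but issues only 3,
# so the 3 instructions take 2 cycles to retire
ipc_mixed = steady_state_ipc([4, 1, 1])
ipc_simple = steady_state_ipc([1, 1, 1])
```

With the 4-1-1 pattern the model yields only 1.5 IPC, versus 3.0 IPC for all-simple instructions - which is the sense in which replacing one complex instruction with three simple ones pays off.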

Myth #3: The same decoding width implies the same level of performance

To be sure, the 4-1-1 decoding engine was not the performance bottleneck up until the days of the Pentium M, when micro-op fusion was introduced. Even with micro-op fusion, which supposedly doubles the capability of the partial decoders, Intel reported less than a 5% performance increase over non-fused x86 decoding. The fact is, the IPC (instructions per clock) of all x86 processor cores, including those bearing the "Core 2" mark, has never exceeded 3. A Pentium III running SPEC95 has an IPC of roughly 0.6 to 0.9. Assuming a 30% increase with each newer generation (which is optimistic, to say the least), the Pentium M would have an IPC of roughly 0.8 to 1.2, Core between 1.0 and 1.5, and Core 2 between 1.3 and 2.0. In other words, the ability to decode 3 instructions per cycle has, in theory, been quite sufficient up to this moment.

Of course, nothing in the real world runs the theoretical way. Aside from the fact that there are many other things in a processor core that slow down execution, P6's 3-wide (or 4-wide in the case of Core 2) x86 decode rarely sustains 3 decodes per cycle, even with a low complex-to-simple instruction ratio. The reasons -

First, complex instructions must be positioned at the first decoder. Since the 3 (or 4 in the case of Core 2) x86-to-RISC decoders work in program order, if the first decoder is unfortunately occupied by a simple instruction while a complex instruction arrives at the 2nd position, then during that clock cycle only one simple instruction is decoded. The steering circuit will "re-steer" the complex instruction from the 2nd position to the 1st on the next cycle.

Second, the decoders are flushed every 16 instruction bytes (or 24 in the case of Core 2). Look at the IFU diagram at the beginning of this article: in every clock cycle, 3 instructions from a 16-byte window are steered to the decoders. On average an x86 instruction takes about 3.5 bytes (with high variance, though), so the 16-byte window is likely not consumed in one clock cycle. In that case, during the next cycle, the steering circuit will try to steer the next 3 instructions from the same 16-byte window to their respective decoders. But wait - what happens if there are fewer than 3 instructions left? Well, then fewer than 3 decoders have work to do that cycle!
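The window effect can be quantified with a toy model (it assumes, for simplicity, that no instruction straddles a window boundary):

```python
def decodes_per_cycle(instr_lengths, window=16, width=3):
    """Group instructions into decode cycles: a new cycle starts when
    the decoder width is reached or the fetch window is exhausted."""
    cycles, group, consumed = [], [], 0
    for n in instr_lengths:
        if consumed + n > window:        # window exhausted: refill next cycle
            if group:
                cycles.append(group)
            group, consumed = [], 0
        group.append(n)
        consumed += n
        if len(group) == width:          # all 3 decoders used this cycle
            cycles.append(group)
            group = []
    if group:
        cycles.append(group)
    return cycles

# eight 4-byte instructions: each 16-byte window holds only 4 of them,
# so decode alternates between 3-wide and 1-wide cycles
groups = decodes_per_cycle([4] * 8)
```

With 4-byte instructions the model decodes 3, 1, 3, 1 instructions over four cycles - an average of just 2 per cycle, well short of the nominal 3-wide decode.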

Third, taken branches always interrupt and cut short the decoding. This is similar to the reason above, except that here the latter decoders sit idle not because the end of the 16-byte window has been reached, but because the rest of the instruction bytes in the window are not (predicted to be) executed. This happens even with 100% branch prediction accuracy. The problem is even more serious when the target address falls near the end of its 16-byte window. For example, if the branch target instruction has byte address 14 MOD 16, then only one instruction can be fetched (inside the first 16-byte window) after the branch is taken.
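The alignment arithmetic in that example is simple enough to state in a few lines:

```python
def bytes_after_target(target_addr, window=16):
    """Bytes available between a branch target and the end of its
    16-byte fetch window - all that can decode in the first cycle."""
    return window - (target_addr % window)

# target at byte address 14 MOD 16: only 2 bytes remain in the window,
# enough for at most one short instruction
leftover = bytes_after_target(14)
aligned = bytes_after_target(0)   # aligned target: the full 16 bytes
```

This is why compilers commonly pad hot branch targets toward 16-byte alignment, trading code size for decode bandwidth on the cycle after a taken branch.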

Note that these limits are artifacts of P6's x86 decode design; they cannot be remedied by microarchitectural improvements elsewhere. It is for these reasons that micro-op fusion, macro fusion, or an additional partial decoder was needed in the later generations of the P6 processor family to even get close to the theoretical 3-issue limit. We will, however, wait until Part 2 (and possibly Part 3) to delve deeper into those.