A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Saturday, September 22, 2007

AMD's latest x86 extension: SSE5 - Part 2

Series Index -

In this part we will compare Intel's SSSE3 and SSE4.x with AMD's SSE5. More specifically, we will look at how one can (or cannot) use SSE5 to accomplish the same tasks performed by SSSE3 and SSE4.x. The central question we're trying to answer here is whether AMD's SSE5 is strictly an extension to Intel's SSE4, or in some sense a replacement for SSSE3 and SSE4.x (which none of AMD's current processors - including Barcelona and Phenom - supports).

Syntactical Similarity

The original 8086/8087 used one-byte opcodes (if we ignore the ModRM bits used by the 8087 and a handful of other instructions such as bit rotations). One opcode byte that remained usefully unused was 0Fh: on the original 8086 it decoded as POP CS, an instruction dropped from later processors because popping CS creates some interesting program-flow-control problems. Using 0Fh as an escape byte followed by a second byte, a number of two-byte opcode instructions were added by the 80{2|3|4}86, Pentium, MMX, 3DNow!, and SSE/2/3/4a.

After the addition of SSE4a from AMD, the only free two-byte opcodes left are the following: 0F0{4,A,C}h, 0F2{4-7}h, 0F3{6-F}h, 0F7{A,B}h, and 0FA{6,7}h. Why is this important? Because these points in the two-byte opcode space are the only entries where the x86 ISA can be further extended. Obviously, the two dozen or so entries are not enough for any large-scale extension.

In order to further extend the instruction set in a significant way, the opcode itself must be extended from two-byte to three-byte. This is where SSSE3/SSE4.x and SSE5 bear the most similarity: they all consist (mainly) of instructions with three opcode bytes. Intel carved out 0F38xxh and 0F3Axxh for SSSE3 and SSE4.x, whereas AMD took 0F24xxh, 0F25xxh, 0F7Axxh and 0F7Bxxh for SSE5.
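To make the escape-byte scheme concrete, here is a tiny Python sketch (my own illustration, not a real decoder - the function name and simplifications are mine) that classifies which opcode map the leading bytes of an instruction select:

```python
# Hypothetical sketch: classify the opcode map of an x86 instruction
# from its leading opcode bytes, following the escape-byte scheme
# described in the text (prefixes are ignored for simplicity).
def opcode_map(code: bytes) -> str:
    """Return which opcode map the byte sequence selects."""
    if code[0] != 0x0F:
        return "one-byte map"                     # original 8086-style opcodes
    if code[1] in (0x38, 0x3A):                   # Intel SSSE3/SSE4.x escapes
        return f"three-byte map 0F{code[1]:02X}"
    if code[1] in (0x24, 0x25, 0x7A, 0x7B):       # AMD SSE5 escapes
        return f"three-byte map 0F{code[1]:02X} (SSE5)"
    return "two-byte map 0F"

print(opcode_map(bytes([0x0F, 0x38, 0x08])))      # an SSSE3 encoding
print(opcode_map(bytes([0x0F, 0x24, 0x00])))      # an SSE5 encoding
```

This is only a classifier of the maps named above, of course - a real decoder must also handle prefixes and the remaining two-byte extension points.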

Syntactical Differences

However, the syntactical similarity between Intel's and AMD's extensions pretty much ends right here. As we've seen in Part 1 of this series, SSE5 instruction encoding is regular and orthogonal: the 3rd opcode byte (Opcode3) always has 5 bits for opcode extension, 1 bit for operand ordering, and 2 bits for operand size.
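For illustration, that Opcode3 layout can be modeled with a few shifts and masks. Note that only the 5/1/2 field widths come from the text; placing the fields high-to-low within the byte is my assumption here:

```python
def decode_opcode3(b: int) -> dict:
    """Split an SSE5-style Opcode3 byte into its three fields.

    Field widths (5-bit opcode extension, 1-bit operand ordering,
    2-bit operand size) follow the article; the high-to-low bit
    placement is an assumption for illustration only.
    """
    return {
        "opcode":   (b >> 3) & 0b11111,  # bits 7..3: opcode extension
        "ordering": (b >> 2) & 0b1,      # bit 2: operand ordering
        "size":     b & 0b11,            # bits 1..0: operand size
    }

print(decode_opcode3(0b10110110))
```

The point is that every field is decoded independently - no lookup table of special cases is needed, unlike the SSSE3/SSE4.x encodings discussed next.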

On the other hand, the encoding of SSSE3 and SSE4.x instructions may well have been arbitrary for anyone outside Intel. For example, look at the following SSSE3 instructions:

PSIGNB - 0F380h 1000b ... PABSB - 0F381h 1100b
PSIGNW - 0F380h 1001b ... PABSW - 0F381h 1101b
PSIGND - 0F380h 1010b ... PABSD - 0F381h 1110b

It may seem from the above that the right-most bits encode the operand size - 00b for byte, 01b for word, and 10b for dword. However, take another look at the following SSSE3 instructions:

PSHUFB - 0F380h 0000b ... PMADDUBSW - 0F380h 0100b
PHADDW - 0F380h 0001b ...... PHSUBW - 0F380h 0101b
PHADDD - 0F380h 0010b ...... PHSUBD - 0F380h 0110b

For some (probably legitimate) reason, Intel's designers decided not to include horizontal byte additions and subtractions; instead they (most "exceptionally") squeezed in a byte-shuffle instruction and a specialized multiply-add instruction. We see that 30 years later, people at Intel still design instructions exactly the way they did 30 years ago: in a way that doesn't make sense.
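To pin down what a horizontal add actually does, here is a minimal Python model of PHADDW over plain lists of 16-bit lanes (my own sketch - 128-bit packing, wrap-around, and saturating variants are simplified away):

```python
def phaddw(dst, src):
    """Model of SSSE3 PHADDW: adjacent lane pairs within each operand
    are summed - destination pairs first, then source pairs - and the
    sums are packed into the destination."""
    return [dst[i] + dst[i + 1] for i in range(0, len(dst), 2)] + \
           [src[i] + src[i + 1] for i in range(0, len(src), 2)]

print(phaddw([1, 2, 3, 4], [10, 20, 30, 40]))  # [3, 7, 30, 70]
```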

Even worse cases are seen in SSE4.x. The following example shows the encodings used for packed MAX and packed MIN instructions:

PMAXSB - 0F383h 1100b ... PMINSB - 0F383h 1000b
PMAXSD - 0F383h 1101b ... PMINSD - 0F383h 1001b
PMAXUW - 0F383h 1110b ... PMINUW - 0F383h 1010b
PMAXUD - 0F383h 1111b ... PMINUD - 0F383h 1011b

Note how the different operand types and operand sizes are squeezed cozily into consecutive opcode byte values without much sense. For some mystical reason, the unsigned word operations are placed quite arbitrarily right next to the signed dword operations. But wait... what happened to P{MAX|MIN}SW and P{MAX|MIN}UB? Well, they already exist as SSE2 instructions with opcodes 0FE{E|A}h and 0FD{E|A}h, respectively. As this example shows, the irregularity of SSE4.x is also inherited from the poor design of SSE2.
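As an aside, packed max/min needs no dedicated opcode at all once a generic compare and a conditional select are available - essentially the two-instruction SSE5 sequence this part discusses. A rough Python model (the names and the per-lane boolean-mask simplification are mine):

```python
def pcom_gt(a, b):
    """Model of a generic packed compare: all-ones mask lane where a > b."""
    return [0xFF if x > y else 0x00 for x, y in zip(a, b)]

def select(mask, a, b):
    """PCMOV-style per-lane select: take a where the mask is set, else b."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

# Packed signed max built from compare + select:
a, b = [3, -5, 7], [4, -9, 7]
print(select(pcom_gt(a, b), a, b))  # [4, -5, 7]
```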

From a software programmer's point of view, the irregularity really doesn't matter as long as the compiler can generate these opcodes automatically. But no circuit designer loves implementing such an irregular extension. This is probably why Intel, which we may assume is not incompetent, chose such a poor style for designing SSEx - to make it as difficult as possible for anyone else (most prominently AMD) to offer compatible decoding. In the end, not only Intel's competitors but also its customers suffer from the bad choices: had Intel designed the original SSE/SSE2 the same way AMD designed SSE5, we would've had a much more complete & efficient set of x86 SIMD instructions that makes sense! (Now, does Intel promote open & fair competition that benefits consumers? Or does it aim at nothing but screwing up its competitors, sometimes together with its customers?)

At any rate, as we've seen above, the encoding of SSE5 is different from SSSE3/SSE4.x, and thus the former does not exclude the latter. In other words, it is possible for a processor to offer both SSE5 and SSSE3/SSE4.x (much like 3DNow! and MMX). What about their functionalities, then? Below we'll look at each SSSE3 and SSE4.x instruction and see how its functionality can or cannot be accomplished by SSE5.

Functional Comparison to SSSE3

For SSSE3 instructions:
  • PHADDx and PHSUBx
    • Horizontally add/subtract word & dword in both source and destination sub-operands and pack them into destination.
    • Each PHADDx/PHSUBx in SSE5 operates on only one 128-bit packed source.
  • PMADDx
    • Multiply destination and source sub-operands, horizontally add the results, and store them back to destination.
    • PMADx in SSE5 offers more powerful multiply-add intrinsics.
    • No byte-to-word multiply-add in SSE5, though.
  • PSHUFB
    • Shuffle bytes in destination according to source.
    • Special & weaker case of the first half of PPERM in SSE5.
  • PALIGNR
    • Shift concatenated destination & source bytes back into destination.
    • Special & weaker case of the first half of PPERM in SSE5.
  • PSIGNx
    • Retain, negate, or zero sub-operands in destination if the corresponding sub-operands in source are positive, negative, or zero, respectively.
    • No direct implementation in SSE5.
  • PABSx
    • Store the unsigned absolute values of source sub-operands into destination sub-operands.
    • No direct implementation in SSE5.
  • PMULHRSW
    • Multiply 16-bit sub-operands of destination and source and store the rounded high-order 16-bit results back to destination.
    • No direct implementation in SSE5.

It can be seen that most SSSE3 instructions are not directly implemented in SSE5, with the possible exceptions of PSHUFB, PALIGNR, and PHADDx/PHSUBx. However, these latter SSSE3 instructions can still be useful as lower-latency, lower-instruction-count shortcuts to the more generic & powerful SSE5 counterparts. Thus, from this point of view, future AMD processors will probably still benefit from implementing SSSE3 together with SSE5.
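The claim that PSHUFB is a special case of a generic byte permute can be sketched in Python; `byte_permute` below is my own stand-in for the first half of PPERM, not AMD's actual semantics:

```python
def byte_permute(src, selectors):
    """Generic byte permute in the spirit of SSE5's PPERM first half:
    each selector picks one source byte by index. PSHUFB is the special
    case where a set high bit in the selector zeroes the lane instead."""
    return [0 if sel & 0x80 else src[sel & 0x0F] for sel in selectors]

# PSHUFB-style use: reverse the first four bytes, zero the rest.
print(byte_permute(list(range(16)), [3, 2, 1, 0] + [0x80] * 12))
```

A full PPERM would additionally draw from a second source and apply an optional logical post-operation per byte - which is exactly why PSHUFB is the "weaker" case.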

Functional Comparison to SSE4.x

For SSE4.1 instructions:
  • PMULLD
    • Multiply 32-bit sub-operands of destination and source and store the low-order 32-bit results back to destination.
    • Can be done by two PMULDQ (SSE2) followed by a PPERM.
  • DPPS and DPPD
    • Horizontally dot-product single/double-precision floating-point sub-operands in destination and source and selectively store results to destination sub-operand fields.
    • FMADx in SSE5 offers more powerful & flexible floating-point dot-product intrinsics.
  • MOVNTDQA
    • Non-temporal dword load from WC memory type into an internal buffer of the processor, without storing to the cache hierarchy.
    • Specific to Intel's processor implementation.
    • PREFETCHNTA in Opteron & later serves the same purpose.
  • BLENDx and PBLENDx
    • Conditionally copy sub-operands from source into destination.
    • Special and weaker cases of PERMPx and PPERM in SSE5.
  • PMAXx and PMINx
    • Packed max and min operations on destination and source.
    • Can be accomplished by a PCOMx followed by a PPERM in SSE5.
  • PEXTRx
    • Extract sub-operands from an XMM register (source) to memory or a general-purpose register (destination).
    • Special and weaker case of PERMPx for a memory destination.
    • No direct implementation for a GPR destination in SSE5.
  • PINSRx
    • Optionally copy sub-operands from source to destination.
    • Special and weaker case of PERMPx in SSE5.
  • PMOVx
    • Sign- or zero-extend source sub-operands to destination.
    • Special and weaker case of PPERM with a proper mux/logical argument.
  • PCMPEQQ
    • Packed compare-equal between destination and source, storing results back to destination.
    • Special and weaker case of PCOMQ in SSE5.
  • MPSADBW
    • Compute "sum of absolute byte-differences" between one 4-byte group in source and eight 4-byte groups in destination and store the eight results back to destination.
    • No direct implementation in SSE5.
  • PHMINPOSUW
    • Find the minimum word horizontally in source and put its value in DEST[15:0] and its index in DEST[18:16].
    • No direct implementation in SSE5.
  • PACKUSDW
    • Convert signed dword to unsigned word with saturation.
    • No direct implementation in SSE5.
  • PTEST and ROUNDx
    • Logical zero test, packed precision rounding.
    • Copied directly into SSE5.
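To make the MPSADBW description above concrete, here is a simplified Python model over plain byte lists (my own sketch - register packing and the full 128-bit behavior are glossed over):

```python
def mpsadbw(dst, src, offset):
    """Model of SSE4.1 MPSADBW: sum of absolute differences between one
    4-byte group of the source (selected by offset) and eight sliding
    4-byte groups of the destination; returns the eight word results."""
    group = src[offset * 4: offset * 4 + 4]
    return [sum(abs(dst[i + j] - group[j]) for j in range(4))
            for i in range(8)]

print(mpsadbw(list(range(12)), [0] * 4, 0))
```

It's easy to see from the sliding-window structure why this instruction is so specialized (it exists mainly for video motion estimation) and why SSE5 leaves it out.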

For SSE4.2 instructions:
  • PCMPGTQ
    • Packed compare for greater-than.
    • Special & weaker case of PCOMQ in SSE5.
  • String match, CRC32
    • No direct implementation in SSE5.
  • POPCNT
    • Copied directly from AMD's POPCNT.

Several pieces of evidence above suggest that it's probably not very likely for a future AMD processor to implement SSE4.1 & SSE4.2 in addition to SSE5. First, some of the instructions are copied directly from SSE4.1 into SSE5 (PTEST and ROUNDx); had AMD wanted to implement SSE4.1 before SSE5, it would've been unnecessary to copy these instructions. Second, those SSE4.x instructions that do not have superior SSE5 counterparts are either extremely specialized (MPSADBW, PHMINPOSUW, string match & CRC32), or can be accomplished more flexibly by two or fewer SSE5 instructions.

We can also see how Intel's designers worked very hard to squeeze functionality into the poor syntax of SSE4.x, resulting in a poor extension design. One example is the BLENDx/PBLENDx instructions. Instead of using a proper SSE5-like 3-way syntax, the variable selector in SSE4.1 is set implicitly to XMM0, not only requiring additional register shuffling but also limiting the number of blend selectors available to just one at any moment.
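A quick Python model of the variable blend shows why the implicit selector hurts; in the real SSE4.1 instruction the `mask` argument below is hard-wired to XMM0, whereas a 3-operand form (as in SSE5's PCMOV/PPERM) could take it from any register:

```python
def blendv(dst, src, mask):
    """Model of an SSE4.1 PBLENDVB-style variable blend: lanes of src
    replace lanes of dst wherever the mask lane's top bit is set."""
    return [s if m & 0x80 else d for d, s, m in zip(dst, src, mask)]

print(blendv([1, 2, 3], [9, 8, 7], [0x80, 0x00, 0xFF]))  # [9, 2, 7]
```

Because the selector is implicit, any code juggling two different blend patterns must keep reloading XMM0 between blends - pure register-shuffling overhead that a 3-way encoding avoids.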

Another example is the DPPS/DPPD instructions, where the dot product is performed partially vertically and partially horizontally. To make these instructions useful, the two source vectors must be arranged into alternating positions: (A0, B0), (A1, B1), (A2, B2), ... Not only can such an arrangement be costly by itself, but after the operation one of the arranged source vectors is also destroyed (replaced by the dot-product result).
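A Python model of DPPS makes the "partially vertical, partially horizontal" structure visible - the high nibble of the immediate selects which per-lane products enter the sum (the vertical part), and the low nibble selects where the sum is broadcast (the horizontal part). This is a simplified scalar-list sketch of my own:

```python
def dpps(dst, src, imm8):
    """Model of SSE4.1 DPPS: multiply the lanes selected by the high
    nibble of imm8, sum the products, then broadcast the sum into the
    result lanes selected by the low nibble (other lanes become 0.0)."""
    total = sum(d * s for i, (d, s) in enumerate(zip(dst, src))
                if imm8 & (0x10 << i))
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]

print(dpps([1, 2, 3, 4], [5, 6, 7, 8], 0xFF))  # full dot product, all lanes
```

Note that the result overwrites `dst` in the real instruction - which is exactly the "one source vector gets destroyed" problem described above.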

Concluding Part 2.

Comparing SSE5 with SSSE3/SSE4, it seems that after years of being dragged along by Intel's poor extension designs, AMD has finally decided to take its own next step in a better way. As I've discussed above, it's probably more advantageous for AMD to implement SSSE3 together with SSE5, and less so to implement SSE4.1 & SSE4.2.

However, as we know, commercial software in general and benchmarks in particular, especially in the desktop enthusiast market, are heavily influenced by the bigger company; thus, if it turns out SSE4.x is used excessively to benchmark processor performance, it is still possible that AMD will implement it in its future processors. But let's hope, for all customers' sake, that this does not happen, and that future x86 extensions follow more of AMD's SSE5 than Intel's SSE4.x.

Friday, September 21, 2007

AMD's latest x86 extension: SSE5 - Part 1

Series Index -

The SSE5 announcement made by AMD earlier this month is something big. In fact, in terms of instruction scope and architectural design, it is bigger than SSE3, SSSE3, and SSE4 combined. If we think of AMD64 as completely revamping x86-based general-purpose computing (as generally conceived by the industry), then we can also think of SSE5 as completely revamping x86-based SIMD acceleration. In my opinion, the leaps made by AMD in both AMD64 and SSE5 firmly establish the company as the leader in x86 computing architectures, leaving Intel gasping far behind.

The SSE5 Superiority

There are a few things that make SSE5 a "superior" kind of SIMD (Single-Instruction Multiple-Data) instruction set, different from all the previous SSE{1-4}:
  • SSE5 is a generic SIMD extension that aims to accelerate not just multimedia but also HPC and security applications.
    • In contrast, previous SSEx, especially SSE3 and later, were designed specifically with media processing in mind.
    • The CRC and string match instructions of SSE4.2 are too specialized to be generally useful.
  • SSE5 instructions can operate on up to three distinct memory/register operands.
    • It allows true 3-operand operations, where the destination operand is different from either of the two source operands.
    • It allows 3-way 4-operand operations, where the destination operand is the same as one of the three source operands.
  • SSE5 includes powerful and generic Vector Conditional Moves (both integer and floating-point).
    • Only four instructions (mnemonics) are added: PCMOV for generic bits, PPERM for integer bytes/(d,q)words, and PERMPS/PERMPD for single/double-precision floating-point values.
    • Powerful enough to move data from any part of the 128-bit source memory/register to any part of the 128-bit destination register, plus optional logical post-operations.
  • SSE5 includes both integer arithmetic & logic, and floating-point arithmetic & compare instructions.
    • For integer arithmetic, it includes both true vertical Multiply-Accumulate and flexible horizontal Adds/Subs.
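The multiply-accumulate idea fits in one line of Python - per-lane a*b + acc, which is what an SSE5 FMADx-style instruction computes (a sketch of my own; lane types, packing, and rounding behavior are omitted):

```python
def fmadd(acc, a, b):
    """Model of a vector multiply-accumulate in the SSE5 FMADx style:
    each result lane is a*b + acc, computed lane by lane."""
    return [x * y + z for x, y, z in zip(a, b, acc)]

print(fmadd([1, 1], [2, 3], [4, 5]))  # [9, 16]
```

Without a 3-operand (or 3-way 4-operand) encoding, the same computation takes a separate multiply, a register copy to save one source, and an add - three instructions instead of one.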

An Analytical View of SSE5 Instruction Format

All of the above shows one thing: SSE5 is a well-planned, thoroughly articulated, and carefully designed ISA extension. The amazing thing is that the designers at AMD accomplished all this by simply adding a single DREX byte between the SIB and Displacement bytes, as shown in the figure below (taken from page 2 of AMD's SSE5 documentation):
A question naturally arises: will the additional DREX byte increase instruction lengths further? Fortunately, not by a single bit. According to the official document linked above, those SSE5 instructions that use the DREX byte can not only take 3 distinct operands but also access all 16 XMM registers without the AMD64 REX prefix; in fact, the use of the DREX byte in an SSE5 instruction excludes the use of the REX prefix. SSE5 instruction lengths are just as long as needed and as short as they can be. (We will talk more about possible further extensions to AMD64 REX and SSE5 DREX in a later part.)

Another great merit of SSE5 instruction encoding is that it is simple and regular. Note the "Opcode3" byte in the picture above, the main byte that distinguishes among different SSE5 instructions: its encoding is astonishingly simple - 5 bits for opcode, 1 bit for operand ordering, and 2 bits for operand size. The result is an orthogonal instruction encoding - you only need to look at an opcode field by itself to know what it means. In contrast, the 3rd opcode bytes of Intel's SSSE3 and SSE4 instructions seem as if they were picked by a spoiled child purposely to screw up anyone else's implementation. (We will talk more about the comparison between AMD's SSE5 and Intel's SSSE3/SSE4 in a later part.)

Types of SSE5 Instructions

There are several major types of instructions in SSE5:
  1. Various integer and floating-point multiply-accumulate (MAC) instructions.
  2. Vector conditional move (CMOV) and permutation (PERM) instructions.
  3. Vector compare and predicate generation instructions.
  4. Packed integer horizontal add and subtract.
  5. Vectorized rounding, precision control, and 16-bit FP conversion.
A single PTEST instruction in Type 3 and four ROUNDx instructions in Type 5 above are copied directly from Intel's SSE4.1; together with the other Type 4 and Type 5 instructions, these are the SSE5 instructions that do not contain the DREX byte. All the other Type 1-3 SSE5 instructions utilize the DREX byte to specify a 3rd distinct (destination) operand and to offer access to registers XMM8-XMM15 (without, and excluding, the REX prefix).

In particular, the Type 1 (MAC) and Type 2 (CMOV/PERM) instructions are 3-way 4-operand operations, with the destination set to either source 1 or source 3. The fact that 3-way operation is allowed - even with the destination equal to one of the sources - is instrumental in enabling flexible MAC and CMOV/PERM instructions. In the case of MAC, two multiplicands and an accumulator must be specified; in the case of CMOV/PERM, two sources and a conditional predicate must be given. Without the ability to address 3 distinct operands, these two types of acceleration are either impossible or done awkwardly (more on Intel's SSE4.1 way of doing it in a later part of this series).
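The PCMOV case is the easiest to model: a pure bitwise select, with the predicate as the third distinct operand (a byte-wise Python sketch of my own):

```python
def pcmov(src1, src2, selector):
    """Model of SSE5 PCMOV as a bitwise select: each result bit comes
    from src1 where the selector bit is 1, and from src2 where it is 0.
    Operands are modeled as lists of bytes."""
    return [((a & s) | (b & ~s)) & 0xFF
            for a, b, s in zip(src1, src2, selector)]

print(pcmov([0xAA], [0x55], [0x0F]))  # low nibble from src1, high from src2
```

Because the selector is bit-granular rather than lane-granular, a single PCMOV subsumes every fixed-width conditional move as a special case.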

What makes these two types of instructions, MAC and CMOV/PERM, which happily require 3 distinct operands, so special? As previously said, the four conditional move & permutation instructions allow predicated transfer of data from any part of the source registers/memory to any part of the destination register, followed by one of seven optional post-operations. Just how many instructions are there in SSE/SSE2/SSE3 that perform similar, simpler tasks partially? Here is a quick list:
  • MOVQ
Of course this does not mean the four instructions in SSE5 will replace all the MOVs in SSE/SSE2 above, which are still useful for their simplicity (only 2 operands required) and possibly lower latency (no post-operation needed). However, it does illustrate how powerful and useful the PERM instructions in SSE5 can be - just imagine how hard it would be to implement these operations in an SSE2-like style.

The MAC instructions turn out to be among the "most wanted" instruction accelerations. As shown in "Design Issues in Division and Other Floating-Point Operations" by Oberman et al. in IEEE Transactions on Computers, 1997, nearly 50% of floating-point multiplication results are consumed by a dependent addition or subtraction. See the picture below, taken directly from the paper:
In other words, by combining a multiplication with a dependent addition/subtraction, we can eliminate roughly 50% of the instructions that follow multiplications. Until SSE5, it was impossible to truly fuse a multiplication with a dependent add or subtract and take advantage of such acceleration.

Concluding Part 1.

As shown above, AMD's SSE5 is indeed something very different from the previous x86 SIMD extensions from Intel. Some people even went so far as to call it "AMD64-2" and the "top development" of the year; such enthusiasm, of course, is undue.

As of now, AMD is still gathering community feedback and asking for community support for the SSE5 initiative. Apparently, SSE5 is still in development; it's a great proposal, but clearly not fully developed (yet). Also, the SSE5 instructions by themselves do not match the breadth and depth of AMD64, which not only expands the x86 address space but also semantically changes the workings of the ISA. SSE5, on the other hand, neither touches nor alters any bit of x86-64 outside its extending scope. However, as we will discuss in a later part, the direction pointed to by SSE5 can be used to further extend x86-64 in a more general and generic way, rivaling the original AMD64.

Monday, September 10, 2007

Scalability counts!

As I have said in this article, Intel's new Core 2 line of processors has good cores but a poor system architecture. The poor scalability of the FSB means that Core 2, without extensive, expensive, and power-hungry chipset support, is only suitable for low-end personal enjoyment.

Take a look at this AnandTech benchmark. I'd note foremost that AnandTech is hardly an AMD-favoring online "journal"; thus we can expect its report to be at worst Intel-biased and at best neutral (which I'm hoping for here). At any rate, the benchmark picture is reproduced below:

The comparison between Barcelona (Opteron 2350, 2.0GHz) and Clovertown (Xeon E5345, 2.33GHz) couldn't be clearer: the FSB is an outdated system architecture for today's high-end computing, and scalability does matter for server & workstation-grade performance. While AMD's quad-core Opteron at 2.0GHz is slower than Intel's quad-core Xeon at 2.33GHz in the single-socket test, the situation is reversed in a dual-socket setup, the one used by most workstations and entry-level servers.

The same phenomenon is also observed on this page, where AMD's quad-core Opteron, at a 17% lower clock rate, performs increasingly better than Intel's quad-core Xeon as the number of cores grows (picture reproduced below). Again, when it comes to server & workstation performance, scalability counts.