A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Saturday, September 22, 2007

AMD's latest x86 extension: SSE5 - Part 2

In this part we will compare Intel's SSSE3 and SSE4.x with AMD's SSE5. More specifically, we will look at how one can (or cannot) use SSE5 to accomplish the same tasks performed by SSSE3 and SSE4.x. The central question we're trying to answer here is whether AMD's SSE5 is strictly an extension to Intel's SSE4, or in some sense a replacement for SSSE3 and SSE4.x (which none of AMD's current processors, including Barcelona and Phenom, support).

Syntactical Similarity

The original 8086/8087 used one-byte opcodes (if we ignore the ModRM bits used by the 8087 and by a handful of others such as bit rotations). One opcode byte that was conveniently left unused turned out to be 0Fh; had it been used, it would have meant POP CS, which was not there because it would create some interesting program flow control problems. Using 0Fh as an escape byte followed by a second byte, a number of two-byte opcode instructions were added by the 80{2|3|4}86, Pentium, MMX, 3DNow!, and SSE/2/3/4a.

After the addition of SSE4a from AMD, the only free two-byte opcodes left are the following: 0F0{4,A,C}h, 0F2{4-7}h, 0F3{6-F}h, 0F7{A,B}h, and 0FA{6,7}h. Why is this important? Because these points in the two-byte opcode space are the only entries where the x86 ISA can be further extended. Obviously, two dozen or so entries are not enough for any large-scale extension.

In order to further extend the instruction set in a significant way, the opcode itself must be extended from two-byte to three-byte. This is where SSSE3/SSE4.x and SSE5 bear the most similarity: they all consist (mainly) of instructions with three opcode bytes. Intel carved out 0F38xxh and 0F3Axxh for SSSE3 and SSE4.x, whereas AMD took 0F24xxh, 0F25xxh, 0F7Axxh and 0F7Bxxh for SSE5.
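As a rough sketch of the escape-byte scheme described above, the following Python snippet sorts an opcode byte sequence into the one-byte, two-byte, and three-byte maps. The function name `classify_opcode` and the returned labels are made up for illustration, and the input is assumed to have legacy prefixes already stripped; only the escape values themselves come from the text.

```python
# Second-byte values (after the 0Fh escape) that open a three-byte map,
# exactly as listed in the text above.
INTEL_3BYTE = {0x38, 0x3A}              # SSSE3 / SSE4.x
SSE5_3BYTE = {0x24, 0x25, 0x7A, 0x7B}   # SSE5


def classify_opcode(code: bytes) -> str:
    """Classify the opcode bytes of one instruction (prefixes already removed)."""
    if not code:
        raise ValueError("empty opcode")
    if code[0] != 0x0F:
        return "one-byte"
    if len(code) < 2:
        raise ValueError("truncated opcode after 0Fh escape")
    if code[1] in INTEL_3BYTE:
        return "three-byte (SSSE3/SSE4.x map)"
    if code[1] in SSE5_3BYTE:
        return "three-byte (SSE5 map)"
    return "two-byte"


if __name__ == "__main__":
    for sample in (b"\x90", b"\x0f\xee", b"\x0f\x38\x00", b"\x0f\x24\x00"):
        print(sample.hex(), "->", classify_opcode(sample))
```

The sketch only looks at the escape bytes; it does not check that a third opcode byte is actually present or that it names a defined instruction.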

Syntactical Differences

However, the syntactical similarity between Intel's and AMD's extensions pretty much ends right here. As we've seen in Part 1 of this series, SSE5 instruction encoding is regular and orthogonal: the 3rd opcode byte (Opcode3) always has 5 bits for opcode extension, 1 bit for operand ordering, and 2 bits for operand size.

On the other hand, the encoding of SSSE3 and SSE4.x instructions may well appear arbitrary to anyone outside Intel. For example, look at the following SSSE3 instructions:
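To make the regular layout concrete, here is a minimal Python sketch that splits an Opcode3 byte into the three fields named above. The field widths come from the text; the bit positions (extension in the top five bits, ordering bit next, size in the two low bits) are an assumption made for this sketch, and the names `Opcode3` and `decode_opcode3` are hypothetical.

```python
from typing import NamedTuple


class Opcode3(NamedTuple):
    opcode_ext: int    # 5-bit opcode extension
    operand_order: int # 1-bit operand ordering
    operand_size: int  # 2-bit operand size


def decode_opcode3(byte: int) -> Opcode3:
    """Split an SSE5 Opcode3 byte into its fields.

    ASSUMPTION: bits 7:3 = opcode extension, bit 2 = operand ordering,
    bits 1:0 = operand size. Only the widths (5/1/2) are stated in the text.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("Opcode3 must fit in one byte")
    return Opcode3(byte >> 3, (byte >> 2) & 0b1, byte & 0b11)


if __name__ == "__main__":
    print(decode_opcode3(0b10110_1_10))
```

The point is simply that one fixed set of shifts and masks works for every instruction in the extension, which is what makes the decoder regular.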

PSIGNB - 0F380h 1000b ... PABSB - 0F381h 1100b
PSIGNW - 0F380h 1001b ... PABSW - 0F381h 1101b
PSIGND - 0F380h 1010b ... PABSD - 0F381h 1110b

It may seem from the above that the two right-most bits encode the operand size: 00b for byte, 01b for word, and 10b for dword. However, take another look at the following SSSE3 instructions:

PSHUFB - 0F380h 0000b ... PMADDUBSW - 0F380h 0100b
PHADDW - 0F380h 0001b ...... PHSUBW - 0F380h 0101b
PHADDD - 0F380h 0010b ...... PHSUBD - 0F380h 0110b

For some (probably legitimate) reason, Intel's designers decided not to include horizontal byte additions and subtractions; instead they (most "exceptionally") squeezed in a byte-shuffle instruction and a specialized multiply-add instruction. We see that 30 years later, people at Intel still design instructions exactly the same way as they did 30 years ago: in a way that doesn't make sense.
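The irregularity is easy to see if we group the SSSE3 opcodes listed above by the bits to the left of the two low "size" bits. The following Python sketch does just that; the third-byte values are the ones shown in the two listings, while the helper name `group_by_upper_bits` is made up for illustration.

```python
from collections import defaultdict

# Third opcode byte (after 0F 38) of the SSSE3 instructions listed above.
SSSE3_OPCODE3 = {
    "PSHUFB": 0x00, "PHADDW": 0x01, "PHADDD": 0x02,
    "PMADDUBSW": 0x04, "PHSUBW": 0x05, "PHSUBD": 0x06,
    "PSIGNB": 0x08, "PSIGNW": 0x09, "PSIGND": 0x0A,
    "PABSB": 0x1C, "PABSW": 0x1D, "PABSD": 0x1E,
}


def group_by_upper_bits(table):
    """Group mnemonics that differ only in the two low opcode bits."""
    groups = defaultdict(list)
    for name, byte in sorted(table.items(), key=lambda item: item[1]):
        groups[byte >> 2].append(name)
    return dict(groups)


if __name__ == "__main__":
    for upper, names in group_by_upper_bits(SSSE3_OPCODE3).items():
        print(f"{upper:06b}xx: {', '.join(names)}")
```

The PSIGNx and PABSx groups each hold one operation in three sizes, whereas the first two groups mix PSHUFB and PMADDUBSW in with the horizontal add/subtract families.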

Even worse cases are seen in SSE4.x. The following example shows the encodings used for packed MAX and packed MIN instructions:

PMAXSB - 0F383h 1100b ... PMINSB - 0F383h 1000b
PMAXSD - 0F383h 1101b ... PMINSD - 0F383h 1001b
PMAXUW - 0F383h 1110b ... PMINUW - 0F383h 1010b
PMAXUD - 0F383h 1111b ... PMINUD - 0F383h 1011b

Note how the different operand types and operand sizes are squeezed cozily into consecutive opcode byte values without much sense. For some mystical reason, the unsigned word operations are put quite arbitrarily right next to the signed dword operations. But wait... what happened to P{MAX|MIN}SW and P{MAX|MIN}UB? Well, they are already SSE2 instructions with opcodes 0FE{E|A}h and 0FD{E|A}h, respectively. As can be seen in this example, the irregularity of SSE4.x is partly inherited from the poor design of SSE2.

From a software programmer's point of view, the irregularity really doesn't matter as long as the compiler can generate these opcodes automatically. But such an irregular extension is no circuit designer's favorite thing to implement. This is probably why Intel, assuming it is not incompetent, chose to design SSEx in such a poor style: to make it as difficult as possible for anyone else (most prominently AMD) to offer compatible decoding. In the end, not only Intel's competitors but also its customers suffer from the bad choices: had Intel designed the original SSE/SSE2 the same way as AMD designed SSE5, we would've had a much more complete & efficient set of x86 SIMD instructions that makes sense! (Now, does Intel promote open & fair competition that benefits the consumers? Or does it aim at nothing but to screw up its competitors, sometimes together with its customers?)

At any rate, as we've seen above, the encoding of SSE5 is different from that of SSSE3/SSE4.x, and thus the former does not exclude the latter. In other words, it is possible for a processor to offer both SSE5 and SSSE3/SSE4.x (much like 3DNow! and MMX). What about their functionalities, then? Below we'll look at each SSSE3 and SSE4.x instruction and see how its functionality can or cannot be accomplished by SSE5.
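For the packed MIN/MAX group, a quick Python sketch can pull apart the low three bits of the opcodes listed above. The opcode values are taken from the listing; the helper name `split_fields` is hypothetical.

```python
# Third opcode byte (after 0F 38) of the SSE4.1 packed min/max instructions.
SSE41_MINMAX = {
    "PMINSB": 0x38, "PMINSD": 0x39, "PMINUW": 0x3A, "PMINUD": 0x3B,
    "PMAXSB": 0x3C, "PMAXSD": 0x3D, "PMAXUW": 0x3E, "PMAXUD": 0x3F,
}


def split_fields(name):
    """Return (operation, low-two-bit slot, type suffix) for one mnemonic."""
    byte = SSE41_MINMAX[name]
    operation = "MAX" if byte & 0b100 else "MIN"
    return operation, byte & 0b11, name[-2:]


if __name__ == "__main__":
    for mnemonic in SSE41_MINMAX:
        print(mnemonic, split_fields(mnemonic))
```

Bit 2 cleanly separates MIN from MAX, but the remaining two bits enumerate SB, SD, UW, UD in that order, so neither bit can be read as "signedness" or "size" on its own.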

Functional Comparison to SSSE3

For SSSE3 instructions:
  • PHADDx/PHSUBx
    • Horizontally add/subtract word & dword in both source and destination sub-operands and pack them into destination.
    • Each PHADDx/PHSUBx in SSE5 operates on only one 128-bit packed source.
  • PMADDx
    • Multiply destination and source sub-operands, horizontally add the results, and store them back to destination.
    • PMADx in SSE5 offers more powerful multiply-add intrinsics.
    • No byte-to-word multiply-add in SSE5, though.
  • PSHUFB
    • Shuffle bytes in destination according to source.
    • Special & weaker cases of the first-half of PPERM in SSE5.
  • PALIGNR
    • Shift concatenated destination & source bytes back into destination.
    • Special & weaker cases of the first-half of PPERM in SSE5.
  • PSIGNx
    • Retain, negate, or set to zero sub-operands in destination if the corresponding sub-operands in source are positive, negative, or zero, respectively.
    • No direct implementation in SSE5.
  • PABSx
    • Store the unsigned absolute values of source sub-operands into destination sub-operands.
    • No direct implementation in SSE5.
  • PMULHRSW
    • Multiply 16-bit sub-operands of destination and source and store the rounded high-order 16-bit results back to destination.
    • No direct implementation in SSE5.
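For reference, the three SSSE3 operations listed above as having no direct SSE5 implementation can be modelled element by element. This is only a scalar Python sketch of the 16-bit forms, with registers represented as plain lists of signed integers; the function names mirror the mnemonics but are otherwise made up.

```python
def to_s16(value):
    """Wrap an integer into the signed 16-bit range."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def psignw(dest, src):
    """Keep, negate, or zero each dest word according to the sign of src."""
    out = []
    for d, s in zip(dest, src):
        if s > 0:
            out.append(to_s16(d))
        elif s < 0:
            out.append(to_s16(-d))
        else:
            out.append(0)
    return out


def pabsw(src):
    """Unsigned absolute value of each word (so -32768 becomes 32768)."""
    return [abs(s) & 0xFFFF for s in src]


def pmulhrsw(dest, src):
    """Rounded high half of the signed 16x16 product (Q15 multiply)."""
    return [to_s16((((d * s) >> 14) + 1) >> 1) for d, s in zip(dest, src)]


if __name__ == "__main__":
    print(psignw([5, 5, 5], [3, -2, 0]))
    print(pabsw([-3, 7, -32768]))
    print(pmulhrsw([0x4000], [0x4000]))
```

Each of these is cheap as a single instruction but takes several generic operations (compare, select, negate, shift) to emulate, which is the sense in which they remain useful shortcuts.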

It can be seen that most SSSE3 instructions are not directly implemented in SSE5, with the possible exceptions of PSHUFB, PALIGNR, and PHADDx/PHSUBx. However, even these SSSE3 instructions can still be useful as lower-latency, lower-instruction-count shortcuts to their more generic & powerful SSE5 counterparts. Thus, from this point of view, future AMD processors will probably still benefit from implementing SSSE3 together with SSE5.
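To illustrate why PSHUFB and PALIGNR count as special cases of a generic byte permute, here is a small Python sketch. `byte_permute` is a deliberately simplified stand-in for a two-source permute such as PPERM: it only supports "copy byte" and "write zero", and the ordering of the two sources in the 32-byte pool is an assumption of this sketch, not a statement about the real instruction. Registers are lists of 16 bytes with index 0 as the least significant byte.

```python
ZERO = None  # selector value meaning "write 0x00" in this simplified model


def byte_permute(src_a, src_b, selectors):
    """Simplified two-source permute: 0-15 pick from src_a, 16-31 from src_b."""
    pool = list(src_a) + list(src_b)
    return [0 if sel is ZERO else pool[sel] for sel in selectors]


def pshufb(dest, mask):
    """SSSE3 PSHUFB: bit 7 of a mask byte zeroes the result, else index by low 4 bits."""
    return [0 if m & 0x80 else dest[m & 0x0F] for m in mask]


def pshufb_via_permute(dest, mask):
    selectors = [ZERO if m & 0x80 else (m & 0x0F) for m in mask]
    return byte_permute(dest, dest, selectors)


def palignr(dest, src, imm):
    """SSSE3 PALIGNR: shift dest:src right by imm bytes, keep the low 16 bytes."""
    concat = list(src) + list(dest)
    return [concat[i + imm] if i + imm < 32 else 0 for i in range(16)]


def palignr_via_permute(dest, src, imm):
    selectors = [i + imm if i + imm < 32 else ZERO for i in range(16)]
    return byte_permute(src, dest, selectors)


if __name__ == "__main__":
    d, s = list(range(16)), list(range(100, 116))
    print(palignr(d, s, 4))
    print(palignr_via_permute(d, s, 4))
```

In both cases the dedicated instruction just fixes the selector pattern in advance, which is why it can be a cheaper shortcut than loading a full permute control vector.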

Functional Comparison to SSE4.x

For SSE4.1 instructions:
  • PMULLD
    • Multiply 32-bit sub-operands of destination and source and store the low-order 32-bit results back to destination.
    • Can be done by two PMULUDQ (SSE2) followed by a PPERM.
  • DPPS/DPPD
    • Horizontally dot-product single/double precision floating-point sub-operands in destination and source and selectively store results to destination sub-operand fields.
    • FMADx in SSE5 offer more powerful & flexible floating-point dot product intrinsics.
  • MOVNTDQA
    • Non-temporal double-quadword load from WC memory type into an internal buffer of processor, without storing to the cache hierarchy.
    • Specific to Intel processor implementation.
    • PREFETCHNTA in Opteron & later works for the same purpose.
  • BLENDx and PBLENDx
    • Conditionally copy sub-operands from source into destination.
    • Special and weaker cases of PERMPx and PPERM in SSE5.
  • PMAXx and PMINx
    • Packed max and min operations of destination and source.
    • Can be accomplished by a PCOMx followed by a PPERM in SSE5.
  • EXTRACTPS and PEXTRx
    • Extract sub-operands from an XMM register (source) to memory or a general-purpose register (destination).
    • Special and weaker case of PERMPx for memory destination.
    • No direct implementation for GPR destination in SSE5.
  • INSERTPS and PINSRx
    • Optionally copy sub-operands from source to destination.
    • Special and weaker case of PERMPx in SSE5.
  • PMOVx
    • Sign- or zero-extend source sub-operands to destination.
    • Special and weaker case of PPERM with a proper mux/logical argument.
  • PCMPEQQ
    • Packed compare-equal between destination and source and store results back to destination.
    • Special and weaker case of PCOMQ in SSE5.
  • MPSADBW
    • Compute "sum of absolute byte-difference" between one 4-byte group in source and eight 4-byte groups in destination and store the eight results back to destination.
    • No direct implementation in SSE5.
  • PHMINPOSUW
    • Find the minimum word horizontally in source and put its value in DEST[15:0] and its index in DEST[18:16].
    • No direct implementation in SSE5.
  • PACKUSDW
    • Convert signed dword to unsigned word with saturation.
    • No direct implementation in SSE5.
  • PTEST and ROUNDx
    • Logical zero test, packed precision rounding.
    • Copied directly to SSE5.
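As an illustration of the "two multiplies plus a permute" recipe mentioned for the 32-bit low multiply, here is a Python sketch. It models the SSE2 unsigned multiply PMULUDQ (which multiplies dwords 0 and 2 into two 64-bit products) and then gathers the low halves; this works for signed inputs too, because the low 32 bits of a product do not depend on signedness. Registers are lists of four unsigned 32-bit values, and the shift and gather steps are written as plain list operations rather than actual shuffle or permute instructions.

```python
M32 = 0xFFFFFFFF


def pmuludq(a, b):
    """Model of SSE2 PMULUDQ: unsigned 32x32->64 products of dwords 0 and 2."""
    return [(a[0] & M32) * (b[0] & M32), (a[2] & M32) * (b[2] & M32)]


def pmulld_reference(a, b):
    """Low 32 bits of each of the four element-wise products."""
    return [(x * y) & M32 for x, y in zip(a, b)]


def pmulld_emulated(a, b):
    even = pmuludq(a, b)                    # products of elements 0 and 2
    a_odd = [a[1], 0, a[3], 0]              # move elements 1 and 3 into even slots
    b_odd = [b[1], 0, b[3], 0]
    odd = pmuludq(a_odd, b_odd)             # products of elements 1 and 3
    # Final permute step: interleave the low dwords of the four products.
    return [even[0] & M32, odd[0] & M32, even[1] & M32, odd[1] & M32]


if __name__ == "__main__":
    x = [1, 0xFFFFFFFF, 3, 0x80000000]
    y = [2, 2, 0xFFFFFFFF, 3]
    print([hex(v) for v in pmulld_emulated(x, y)])
```

On real hardware the "move odd elements into even slots" step costs extra shift instructions, which is why the emulation is noticeably longer than the single SSE4.1 instruction.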

For SSE4.2 instructions:
  • PCMPGTQ
    • Packed compare for greater than.
    • Special & weaker case of PCOMQ in SSE5.
  • String match, CRC32
    • No direct implementation in SSE5.
  • POPCNT
    • Copied directly from AMD's POPCNT.

A few pieces of evidence from above suggest that a future AMD processor is not very likely to implement SSE4.1 & SSE4.2 in addition to SSE5. First, some of the instructions are copied directly from SSE4.1 to SSE5 (PTEST and ROUNDx); had AMD wanted to implement SSE4.1 before SSE5, it would've been unnecessary to copy these instructions. Second, those instructions in SSE4.x that do not have superior SSE5 counterparts are either extremely specialized (MPSADBW, PHMINPOSUW, string match & CRC32), or can be accomplished more flexibly by two or fewer SSE5 instructions.
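The "two or fewer instructions" claim can be sketched for packed max/min: a compare produces a per-element mask, and a select (standing in here for a permute or conditional move) picks one input or the other. This is a Python model with hypothetical helper names, not actual SSE5 instruction semantics.

```python
def compare_gt(a, b):
    """Per-element compare: True where a > b (stands in for a packed compare)."""
    return [x > y for x, y in zip(a, b)]


def select(mask, when_true, when_false):
    """Per-element select driven by the compare mask."""
    return [t if m else f for m, t, f in zip(mask, when_true, when_false)]


def packed_max(a, b):
    return select(compare_gt(a, b), a, b)


def packed_min(a, b):
    return select(compare_gt(a, b), b, a)


if __name__ == "__main__":
    print(packed_max([1, -5, 7, 0], [2, -6, 7, -1]))
    print(packed_min([1, -5, 7, 0], [2, -6, 7, -1]))
```

The same two-step pattern covers every element type and signedness by changing only the compare, which is the flexibility argument made above.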

We can also see how hard Intel's designers worked to squeeze functionality into the poor syntax of SSE4.x, resulting in a poor extension design. One example is the BLENDx/PBLENDx instructions. Instead of using the proper SSE5-like 3-way syntax, the variable selector in SSE4.1 is set implicitly to XMM0, not only requiring additional register shuffling but also limiting the number of permutation types to only one at any moment.

Another example is the DPPS/DPPD instructions, where the dot-product is performed partially vertically and partially horizontally. To make these instructions useful, the two source vectors must be arranged to alternate positions: (A0, B0), (A1, B1), (A2, B2), ... Not only can such an arrangement be costly by itself, but after the operation one of the arranged source vectors is also destroyed (replaced by the dot-product result).
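The cost of an implicit selector can be sketched with a toy register file in Python. `blendv_implicit` models an SSE4.1-style variable blend that always reads its selector from XMM0 (taking the source element where the selector's top bit is set); `blendv_explicit` is a hypothetical three-way form where the instruction names the selector register. Both function names and the dictionary-based register file are made up for illustration.

```python
MSB = 0x80000000


def blendv_implicit(regs, dest, src):
    """Variable blend with the selector hard-wired to xmm0."""
    selector = regs["xmm0"]
    return [s if m & MSB else d
            for m, d, s in zip(selector, regs[dest], regs[src])]


def blendv_explicit(regs, src1, src2, selector):
    """Hypothetical three-way blend: any register may hold the selector."""
    return [b if m & MSB else a
            for m, a, b in zip(regs[selector], regs[src1], regs[src2])]


if __name__ == "__main__":
    regs = {
        "xmm0": [MSB, 0, MSB, 0],
        "xmm1": [1, 2, 3, 4],
        "xmm2": [5, 6, 7, 8],
        "xmm3": [0, MSB, 0, MSB],
    }
    print(blendv_implicit(regs, "xmm1", "xmm2"))
    # A second pattern needs no register move in the explicit form:
    print(blendv_explicit(regs, "xmm1", "xmm2", "xmm3"))
```

With the implicit form, using the pattern held in xmm3 would first require copying it into xmm0, displacing whatever selector was there.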
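For readers who want the DPPS data flow spelled out, here is a small Python model of the single-precision form: the high four bits of the immediate choose which element-wise products enter the sum, and the low four bits choose which destination elements receive it (the rest are zeroed). The sketch adds the products in simple left-to-right order, so its rounding may differ from real hardware; the function name `dpps` just mirrors the mnemonic.

```python
def dpps(dest, src, imm8):
    """Model of DPPS on four-element vectors; returns the new dest value."""
    total = 0.0
    for i in range(4):
        if imm8 & (0x10 << i):          # high nibble: include product i?
            total += dest[i] * src[i]   # vertical multiply, horizontal add
    # Low nibble: which result elements get the sum; others become 0.0.
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    print(dpps(a, b, 0xF1))   # full dot product into element 0
    print(dpps(a, b, 0x7F))   # three-term dot product broadcast to all elements
```

Note that the returned list replaces the destination vector, which is the "one source is destroyed" behaviour mentioned above.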

Concluding Part 2.

Comparing SSE5 with SSSE3/SSE4, it seems that after years of being dragged along by Intel's poor extension designs, AMD has finally decided to take its own next step in a better way. As I've discussed above, it's probably more advantageous for AMD to implement SSSE3 together with SSE5, and less so to implement SSE4.1 & SSE4.2.

However, as we know, commercial software in general and benchmarks in particular, especially in the desktop enthusiast market, are heavily influenced by the bigger company. Thus, if it turns out that SSE4.x is used extensively to benchmark processor performance, it is still possible for AMD to implement it in its future processors. But let's hope for all customers' sake that this is not going to happen, and that future x86 extensions will follow more of AMD's SSE5 than Intel's SSE4.x.


Yuhong Bao said...

"Intel carved out 0F38xxh and 0F3Axxh for SSSE3 and SSE4.x, whereas AMD took 0F24xxh, 0F25xxh, 0F7Axxh and 0F7Bxxh for SSE5."
Unfortunately, these conflict with the Cyrix SMM instructions, which are still used today in the AMD Geode processor (Cyrix was sold to National in 1997, and later that particular division was sold to AMD in 2003). Looking at the Geode LX datasheet, 0F 3A is used for RDM, 0F 38 is used for SMINT, 0F 7A is used for SVLDT, 0F 7B is used for RSLDT. Look at the sandpile.org opcode map and you will see more opcode conflicts:

abinstein said...

Thanks for the info. It's nice to know.

AussieFX said...

Where are Ho-Ho, chuckula and the rest of Roborats fanclub? Why aren't they commenting on this post?

Oh that's right they don't understand it. :)

Nice summation abi.

abinstein said...

Thanks AussieFX. I'd take it as a compliment. :)

However, with the advent of AVX and AMD's decision to embrace it due to customer/marketing demand, the contents of this article, especially the x86 instruction encoding part, become somewhat irrelevant. (The semantics part might still bear some meaningful value, though.)
