A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Friday, September 21, 2007

AMD's latest x86 extension: SSE5 - Part 1

Series Index -

The SSE5 announcement made by AMD earlier this month is something big. In fact, in terms of instruction scope and architectural design, it is bigger SSE3, SSSE3, and SSE4 combined. If we think of AMD64 as completely revamping x86-based general-purpose computing (as generally conceived by the industry), then we can also think of SSE5 as completely revamping x86-based SIMD acceleration. In my opinion, the leaps made by AMD in both AMD64 and SSE5 firmly assert the company as the leader in x86 computing architectures, leaving Intel gasping far behind.

The SSE5 Superiority

There are a few things that make SSE5 a "superior" kind of SIMD (Single-Instruction Multiple-Data) instructions different from all the previous SSE{1-4}:
  • SSE5 is a generic SIMD extension that aims to accelerate not just multimedia but also HPC and security applications.
    • In contrast, previous SSEx, especially SSE3 and later, were designed specifically with media processing in mind.
    • The CRC and string match instructions of SSE4.2 are too specialized to be generally useful.
  • SSE5 instructions can operate on up to three distinct memory/register operands.
    • It allows true 3-operand operations, where the destination operand is different from any of the two source operands.
    • It allows 3-way 4-operand operations, where the destination operand is the same as one of the three source operands.
  • SSE5 includes powerful and generic Vector Conditional Moves (both integer and floating-point).
    • Only four instructions (mnemonics) are added: PCMOV for generic bits, PPERM for integer bytes/(d,q)words, PERMPD/PERMPS for single/double-precision floating points.
    • Powerful enough to move data from any part of the 128-bit source memory/register to any part of the 128-bit destination register, plus optional logical post-operations.
  • SSE5 includes both integer arithmetic & logic, and floating-point arithmetic & compare instructions.
    • For integer arithmetics, it includes both true vertical Multiply-Accumulate and flexible horizontal Adds/Subs.

An Analytical View of SSE5 Instruction Format

All above show one thing: SSE5 is a well-planned, thoroughly articulated, and carefully designed ISA extension. The amazing thing is that the designers at AMD accomplish all these by simply adding a single DREX byte in-between the SIB and Displacement bytes, as shown in the figure below (taken from page 2 of AMD's SSE5 documentation):
A question naturally arises: will the additional DREX byte further increase instruction lengths? Fortunately, not a single bit. According to the official document linked above, those SSE5 instructions that use the DREX byte can not only take 3 distinctive operands but also access all 16 XMM registers without the AMD64 REX prefix; in fact, the use of the DREX byte in an SSE5 instruction excludes the use of the REX prefix. SSE5 instruction lengths are just as long as needed and as short as it can be. (We will talk more about possible further extensions to AMD64 REX and SSE5 DREX in a later part.)

Another great merit of SSE5 instruction encoding is that it is simple and regular. Note the "Opcode3" byte in the above picture, the main byte that distinguishes among different SSE5 instructions: its encoding is astonishingly simple: 5 bits for opcode, 1 bit for operand ordering, and 2 bits for operand size. The result is an orthogonal instruction encoding - you only need to look at an opcode field by itself to know what it means. In contrast, the 3rd opcodes of Intel's SSSE3 and SSE4 instructions seem like picked by spoiled child to purposely screw up any implementation. (We will talk more about comparison between AMD's SSE5 and Intel's SSSE3/SSE4 in a later part.)

Types of SSE5 Instructions

There are several major types of instructions in SSE5:
  1. Various integer and floating-point multiply-accumulate (MAC) instructions.
  2. Vector conditional move (CMOV) and permutation (PERM) instructions.
  3. Vector compare and predicate generation instructions.
  4. Packed integer horizontal add and subtract.
  5. Vectorized rounding, precision control, and 16-bit FP conversion.
A single PTEST instructions in Type 3 and four ROUNDx instructions in Type 5 above are copied directly from Intel's SSE4.1; together with other Type 4 and Type 5 instructions these are the SSE5 instructions that do not contain the DREX byte. All the other Type 1-3 SSE5 instructions utilize the DREX byte to specify a 3rd distinctive (destination) operand and to offer access to XMM8-XMM16 registers (without & excluding the REX prefix).

In particular, the Type 1 (MAC) and Type 2 (CMOV/PERM) instructions are 3-way 4-operand operations, with destination is set to either source 1 or source 3. The fact that 3-way operation is allowed - even with destination equal to one of the sources - is instrumental in enabling flexible MAC and CMOV/PERM instructions. In the case of MAC, two multipliers and an accumulator must be specified; in the case of CMOV/PERM, two sources and a conditional predicate must be given. Without the ability to address 3 distinctive operands, these two types of accelerations are either impossible or done awkwardly (more on Intel's SSE4.1-way of doing it in a later part of this series).

What makes these two types of instructions, MAC and CMOV/PERM, which happily require 3 distinctive operands, so special? As previously said, the four conditional move & permutation instructions allow predicated transfer of data from any part of the source registers/memory to any part of the destination register, followed by one of seven optional operations. Just how many instructions are there in SSE/SSE2/SSE3 to perform similar and simpler tasks partially? Here is a quick list:
  • MOVQ
Of course this does not mean the four instructions in SSE5 will replace all the MOVs in SSE/SSE2 above, which are still useful for their simplicity (only 2 operands required) and possibly lower latency (no post-operation needed). However, it does illustrate how powerful and useful the PERM instructions in SSE5 can be - just imagine how hard it is to implement these operations in an SSE2-like style.

The MAC instructions turns out to be one of the "most-wanted" instruction accelerations. As shown in "Design issue in division and other floating point operations" by Oberman et al. in IEEE ToC, 1997, nearly 50% of floating-point multiplication results are consumed by a depending addition or subtraction. See the picture below, directly grabbed from the paper:
In other words, by combining multiplication with a depending addition/subtraction, we can eliminate 50% instructions following all multiplications. Until SSE5, it was impossible to truly fuse a multiplication with a depending add or subtract and take advantage of such acceleration.

Concluding Part 1.

As shown above, the SSE5 from AMD is indeed something very different from the previous x86 SIMD extensions from Intel. Some people even went so far to call it "AMD64-2", and the "top development" of the year; such enthusiasm, of course, is unduly.

Until now, AMD is still gathering community feedback and asking for community support on the SSE5 initiative. Apparently, SSE5 is still in development; it's a great proposal, but clearly not developed (yet). Also, the SSE5 instructions by themselves do not match the breadth and depth of AMD64, which not only expands x86 addressing space but also semantically changes the working of the ISA. SSE5, on the other hand, doesn't touch nor alter any bit of the x86-64 outside of its extending scope. However, as we will discuss in a later part, the direction pointed to by SSE5 can be used to further extend x86-64 in a more general and generic way rivaling the original AMD64.

1 comment:

GutterRat said...
This comment has been removed by the author.
Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.