A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Sunday, May 27, 2007

Decoding x86: From P6 to Core 2 - Part 1

In this series of articles I will take a close look at the x86 instruction decode of Intel's P6 processor family, which includes Pentium-Pro/II/III/M, Core, and Core 2. I will first explain the design in some detail, then relate the marketing terms such as micro-op fusion, macro fusion and 4-wide decoding with what is actually happening inside the processor, down to its microarchitectures and processing algorithms.

All analyses here are based on publicly available information, such as Intel's software optimization manuals, patents and papers. What is added is some knowledge and understanding in computer microarchitectures and circuit designs. With great probably the analyses here should clarify/correct much more myths out there than it introduce any error.

The x86-to-RISC Decoding Problem

Over the years, Intel has advocated the use of CISC over RISC instruction set. However, with great irony -if we actually believed Intel's apparent stance toward the RISC/CISC argument- its P6 microarchitecture is really designed to be more "RISC Inside" than "Intel Inside." In order to reach both higher clock rates and better IPC (instruction per clock), the complex x86 instructions had to be first decoded into simple, fixed-width RISC format (micro-ops) before sent for execution. By this way, the number of pipeline cycles an instruction must go through and the delay of the longest pipeline stage can be optimized for the common average-case rather than the rare worst-case instructions.

All sound good, right? Except there are three (rather big) problems:
  1. The variable-length x86 instructions, which are almost always misaligned in the instruction cache, are hard to decode in parallel (i.e., multiple decodes per clock cycle).
  2. The many addressing modes and operand sizes of even the simplest x86 instruction require complex and slow translation from x86 to internal RISC.
  3. The high complexity of some x86 instructions make worst-case decoders highly complex and inefficient.
Only by recognizing the problems of x86 decode and the difficulty to solve them can we fully appreciate the design choices that Intel made into the P6 front-end, as described in the three techniques below.

Technique #1: Pipeline the instruction length, prefix and opcode decodes

An x86 instruction can have 1-3 opcode bytes, 0-10 operand bytes, plus up to 14 prefix bytes, all but not exceeding a 15-byte length limit. When stored in the instruction cache, it is almost never aligned to the cache line, which unfortunately is the unit that processor cores use to read from the cache. To solve the variable-length misalignment problem, P6's Instruction Fetch Unit (IFU) decodes the length, prefix, and the actual instruction opcodes in a pipelined fashion (see also the picture below):

Instruction Fetch Unit and steering mechanism

  • When IFU fetches a 32-byte cache line of instructions, it decodes the instruction lengths and marks the first opcode byte and the last instruction byte of every instruction in the window. The 32 bytes are put into a pre-decode buffer together with the markings.
  • The 32 bytes are scanned and 16 bytes starting from the first instruction are sent via a rotator to the instruction buffer (now aligned to the instruction boundary), from which they proceed on to two paths.
  • On one path, all 16 bytes are sent to the prefix decoders, where the first 3 prefix vectors are identified and sent to help instruction decode below.
  • On the other path and at the same time, 3 blocks of the same 16 bytes are steered to the 3 decoders in parallel, one block for each consecutive instruction.
Steering variable-length instructions is a complex task. The instruction bytes must be scanned sequentially to locate up to 3 opcodes and their operands, then packed and sent to the 3 decoders. Each decoder might accept up to 11 bytes (max x86 instruction length without the prefix), or it might receive just one.

By determining the instruction boundaries early and pipeline the prefix decode away from instruction decode, the steering task can be made simpler and faster. To further simplify the matter, only the first (full) decoder will accept 11 bytes; the other two (partial) decoders will accept only up to 8 bytes of opcodes and operands, as will be further discussed below.

Technique #2: Decode the opcodes and operands separately

After a decoder (full or partial) receives the opcode and operand bytes, it must try to decode them into a RISC format efficiently. This is accomplished by again decoding the opcodes and the operands in separate paths, as illustrated by the partial decoder diagram below:

Partial x86 decoder

  • From the steering circuit, 3 opcode bytes are picked up and sent to a translation programmable logic array (PLA) for control micro-op decode. The decoded control signals and micro-op template are put into a control uop register.
  • All the opcode and operands bytes, together with the prefix vector from the prefix decoders, are also sent to a field extractor in parallel. The field extractor extracts the alias information which further describes the control micro-ops into a macro-alias register.
  • The two registers, cuop and macro-alias, are then combined by an alias multiplexer to get the final alias-resolve micro-op (aoup) code.
By decoding opcodes into templates and extracting operands information separately, the opcode decoder's PLA can be minimized and made flexible. Flexibility is important, as we will see the full decoder (shown in Technique #3 below) is really the partial decoder plus 3 XLAT PLA pipelines and one microcode engine. The flexibility also made it possible to implement micro-op fusion by adding an extra XLAT PLA, as will be discussed in the Part 2 article later.

Technique #3: Differentiate decoders to Make the Common Case Fast

In a typical x86 program, more than 2/3 of the instructions are simple enough to be represented by a single (non-fused) micro-op. Most of the other 1/3 can be decoded into 4 micro-ops or less, with a (very) few taking more to execute. Recognizing these facts, especially the 2:1 simple-to-complex ratio, the P6 design divides its decoders into the well-known 4-1-1 structure, giving only one decoder full capability:

Full x86 decoder

  • The first decoder has four translate PLAs, decoding an instruction to up to 4 control uops in one clock cycle (see the full decoder diagram right above).
  • The first decoder also has a micro-code engine to decode the few really complex instructions multiple number of clock cycles, generating 3 control uops per cycle (notice the three 2:1 MUXes in the above diagram).
  • The second and third decoders, as explained in Technique #2, have only one PLA and can decode only one single-uop x86 instructions per clock cycle.
  • Each decoder is equipped with its own macro-alias field extractor, although the first decoder's can be bigger in size.
When the micro-code engine is used, the 2nd and 3rd decoders are stalled from progress to preserve in-order issue. By differentiating the decoders and put performance emphasis on the common simple-instruction cases, instruction decode and issue complexity can be minimized, and higher clock rate can be reached. RISC design rule #1: Make the common-case fast.

The Myths, Part 1

Internet being the greatest information exchange inevitably becomes also the largest rumor farm and myth factory of the world. There have been numerous very wrong ideas about the P6 microarchitecture as a whole and the decoding front-end in particular. In "The Myths" section I will try to correct some of these misconceptions.

Since this Part 1 article only talks about the basic x86 decoding mechanisms, the related myths are also more basic and less astonishing. The described decoding mechanisms are over 10 years old, after all. Nevertheless, it is still better to get things right than wrong.

Myth #1: It is better to have more full decoders

An attempt to make fully capable decoders work in parallel is likely to spend more and gain little, not only because it will be very inefficient (resulting in slower clock rate and higher power usage), but also because it will cause trouble to the micro-op issue logic, which then must dynamically find out how many micro-ops are generated from each decoder, and route them in an (M*D)-to-N fabric from D decoders of M micro-ops to a issue queue of length N.

With twice as many simple instructions than complex ones in a typical program, an additional full decoder will not be worth it unless two more partial decoders are added. This ratio is increased even more with the introduction of micro-op fusion and the use of powerful SIMD instructions, although these are the later things to come.

Myth #2: It is better to utilize the full decoder as much as possible

Even though the full decoder can generate up to 4 micro-ops per clock cycle in parallel with the partial decoders, the issue queue of the P6 microarchitecture can only issue 3 micro-ops (or 4 in the case of Core 2) during any cycle. What this says is that the micro-op issue (and execution) logic will not be able to "digest" a continuous flow of x86 instructions with 4-1-1 uop complexity (with micro-op fusion, the pattern becomes selectively 4-2-2 - see Part 2 for more detail).

In other words, the pipeline (more precisely, the issue queue) will stall even when you sparsely (e.g., less than 30%) use those moderately complex instructions that can be decoded in one clock cycle. A corollary of this is that, in general, it is beneficial to replace a complex instruction by 3 simple ones (or 4 in the case of Core 2). The lesson: CISC does not scale. Even though you are writing/compiling to a CISC x86 ISA, you still want to make your assembly codes as much RISC-like as possible to get higher performance.

Myth #3: The same decoding width implies the same level of performance

To be sure, the 4-1-1 decoding engine is not the performance bottleneck up until the days of Pentium M, when micro-op fusion was introduced. Even with micro-op fusion, which supposedly doubles capability of the partial decoders, Intel reported less than 5% performance increase over the none-fused x86 decoding. The fact is, the IPC (instruction per clock) of all x86 processor cores, including the ones that bear the "Core 2" mark, have never exceeded 3. Pentium III running SPEC95 has IPC roughly between 0.6 and 0.9. Assuming 30% increase with each newer generation (which is quite optimistic to say the least), Pentium M would have IPC roughly between 0.8 and 1.2, Core would have it between 1.0 and 1.5, and Core 2 between 1.3 and 2.0. In other words, theoretically the ability to decode 3 instructions per cycle is quite sufficient up till this moment.

Of course nothing in the real world runs in a theoretical way. Aside from the fact that there are many other things in a processor core to slow down execution, P6's 3-wide (or 4-wide in the case of Core 2) x86 decode rarely sustains 3 decodes per cycle, even with low complex-to-simple instruction ratio. The reasons -

First, the complex instructions must be well positioned to the first decoder. Since the 3 (or 4 in the case of Core 2) x86-to-RISC decoders work in program order, if unfortunately the first decoder is occupied by a simple instruction while a complex instruction comes to the 2nd place, then during that clock cycle only one simple instruction will be decoded. The steering circuit will "re-steer" the complex instruction from the 2nd place to the 1st on the next cycle.

Second, the decoders are flushed every 16 instruction bytes (or 24 in the case of Core 2). Look at the IFU diagram at the beginning of this article, in every clock cycle 3 instructions from a 16-byte window are steered to the decoders. In average an x86 instruction takes about 3.5 bytes (the variance is high, though), so it is likely that the 16-byte window is not consumed in one clock cycle. If this is the case, then during the next cycle, the steer circuit will try to steer the next 3 instructions from the same 16-byte window to their respective decoders. But wait, what happens if there are less than 3 instructions left? Well, then less than 3 decoders have work to do in the cycle!

Third, taken branches always interrupt and stop short the decoding. This is similar to the reason above, except that here the latter decoders are not working not because the end of the 16-byte window is reached, but because the rest of the instruction bytes in the window are not (predicted) to be executed. This happens even under 100% branch prediction accuracy. The problem here is even more serious when the target address is unaligned to a byte-address of MOD 16. For example, if the branch target instruction has byte address 14 MOD 16, then only one instruction is fetched (inside the first 16-byte window) after the branch is taken.

We will note that these are caused by P6's x86 decode design artifacts; they cannot be improved by any microarchitecture improvement elsewhere. It is because of these reasons that we need micro-op fusion, macro fusion, or an additional partial decoder in the later generations of the P6 processor family to even get close to the theoretical 3-issue limit. We will however wait until Part 2 (and possibly Part 3) to dwell deeper into those.


Celso said...

Outstanding article.
I just can't remember last time I read such a rich and complete article like this around the internet...
Pretty good, great job!
Please keep posting articles like this... there are so much crap out there...
Celso - Brazil.

cycheng said...

Great gob !!
By the way, it should be "up to 4 prefix bytes", not "14"
(According to Intel Manual - V2A, Ch2 Instruction Format, Figure 2-1)

Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.