A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Tuesday, May 29, 2007

Decoding x86: From P6 to Core 2 - Part 2

This is Part 2 of a three-article series. To fully appreciate what's written here, the Part 1 article (or comparable understanding) is a prerequisite.

The New Advancements

Three major advancements have been made to the original P6 x86 decode over the years: micro-op fusion (Pentium M), macro-fusion (Core 2), and a widened 4-wide decode (also Core 2). In this Part 2 article I will go over micro-op fusion in more detail, and in the coming Part 3 I will go further into Core 2's additions.

While these advancements have all been "explained" numerous times on the Internet, as well as marketed massively by Intel, I must say that many of those explanations and claims are either wrong or misleading. People get second-hand info from Intel's marketing guys and possibly even some designers, and they tend to spice it up with extra sauce, partly from imagination and partly from "educated" guesses.

One big problem that I see in many of those on-line "analyses" is that they never get to the bottom of the techniques - why they were implemented, and what makes them compelling as they are. Instead, most of those analyses just repeat whatever glossy terms they got from Intel and gloss over the technical reasoning. Not that the technical reasoning matters much to end users, but without proper reference to it, the "analyses" will surely degrade into mere marketing repeaters for Intel. These wrong ideas also tend to have bad consequences for the industry - think of the Pentium 4 and the megahertz hype that came with it.

In the following, I will try to look at the true motives and benefits of these techniques from a technical point of view. I will try to answer the 3W1H questions for each: Where does it come from, What does it do, How does it work, and Why is it designed so. As stated in the previous Part 1 article, all analyses here are based on publicly available information. Without inside knowledge from Intel, however, I cannot be certain of being 100% error-free. But the good thing about technical reasoning is that, with enough evidence, you can reason for or against it, instead of believing whatever marketing crap comes across your way.

* Micro-op fusion - its RISC roots

The idea behind micro-op fusion, or micro-fusion, came in the early '90s as a way to improve RISC processor performance where true data dependencies exist. Unsurprisingly, it did not come from Intel. In a 1992 paper, "Architectural Effects on Dual Instruction Issue With Interlock Collapsing ALUs," Malik et al. from IBM devised a scheme to issue two dependent instructions at once to a 3-to-1 ALU. The technique, called instruction collapsing, was then extended and improved by numerous researchers and designers.

Intel came to the game quite late, around 2000/2001 (the Pentium M was released in 2003), and apparently just grabbed the existing idea and filed a patent on it. The company did bring something new to the table: a cool name, fusion. It really sounds better to fuse work than to collapse instructions, doesn't it? In fact, the micro-fusion in Intel's design is very rudimentary compared to what had been proposed 6-8 years earlier in the RISC community; we will talk about this shortly.

Let's first look at the original "instruction collapsing" techniques. Because a RISC ISA generally consists of simple instructions, true dependency detection among these instructions becomes a big issue when collapsing them together. However, if one can dynamically find out the dependencies - as all modern out-of-order dispatch logic does - one can then collapse not just two but even more instructions together. The reported performance improvement ranged from 7% to 20% on 2- to 4-issue processors.
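To make the idea concrete, here is a minimal sketch of how a 3-to-1 interlock-collapsing ALU folds two dependent adds into one issue slot. The three-operand instruction format and register names below are my own illustration, not taken from the Malik et al. paper itself:

```python
# Two dependent RISC instructions:
#     add r1, r2, r3      ; r1 = r2 + r3
#     add r4, r1, r5      ; r4 = r1 + r5  (true dependency on r1)
# A 3-to-1 interlock-collapsing ALU computes both in a single issue slot
# by feeding three source operands through two adder levels.

def collapse_adds(first, second, regs):
    """Execute a collapsed pair of dependent adds on a 3-to-1 ALU."""
    d1, s1, s2 = first
    d2, dep, s3 = second
    assert dep == d1, "second op must consume the first op's result"
    inner = regs[s1] + regs[s2]     # first adder level
    regs[d1] = inner                # the intermediate result is still written
    regs[d2] = inner + regs[s3]     # second adder level, same cycle
    return regs

regs = {"r2": 10, "r3": 20, "r5": 5}
collapse_adds(("r1", "r2", "r3"), ("r4", "r1", "r5"), regs)
print(regs["r1"], regs["r4"])       # 30 35
```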

* A cheaper and simplified approach


Now turn to Intel's micro-op fusion. What does it do? Magic, as many gushing websites have cheered? Surely not -
  • It only works on x86 read-then-modify and operate-then-store instructions, where no dependency check is needed between the two micro-ops to be fused.
  • It works only in the x86 decode and issue stages, so no speculative execution is performed.
  • It doesn't change or affect the ALUs, so the same number of execution units is still needed for one fused micro-op as for two non-fused micro-ops.
What is actually added is an additional XLAT PLA for each partial x86 decoder (see the diagram above, and also the Part 1 article of this series), so that a partial x86 decoder can handle those load/store instructions that generate two micro-ops. Naturally, the performance increase won't be spectacular; Intel's early reports put it at just 2% to 5%. This is actually not a bad result, given that the technique itself is pretty localized (to the x86 decode and the micro-op format), and the main point of micro-fusion is not to remove dependencies or to increase execution width anyway, as will be discussed later.
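As a rough illustration of what that extra PLA buys at decode - a sketch under my own assumptions, where the table contents, instruction classes, and micro-op names are invented for illustration, not Intel's actual encodings:

```python
# A partial decoder is, in essence, a lookup from x86 instruction class
# to a single micro-op. The added XLAT PLA extends that lookup to the
# load-op and op-store classes, emitting one *fused* micro-op for them
# instead of stalling and handing the instruction to the full decoder.

PARTIAL_XLAT = {                      # original partial-decoder PLA
    "reg-reg ALU": ("ALU",),
}
FUSION_XLAT = {                       # the additional XLAT PLA
    "load-op ALU": ("LOAD+ALU",),     # read-then-modify, fused
    "op-store":    ("STA+STD",),      # store address + store data, fused
}

def partial_decode(insn_class, has_fusion_pla):
    table = {**PARTIAL_XLAT, **(FUSION_XLAT if has_fusion_pla else {})}
    return table.get(insn_class, "stall: redirect to the full decoder")

print(partial_decode("load-op ALU", has_fusion_pla=False))  # stall/redirect
print(partial_decode("load-op ALU", has_fusion_pla=True))   # ('LOAD+ALU',)
```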

* An additional PLA plus a condensed format


So how does micro-fusion work? An x86 read-then-modify instruction, for example, consists of two dependent micro-ops in one "strand" (i.e., single fan-out): 1) calculate the load address, 2) modify the loaded result. Micro-fusion binds these two operations together into one format -
  1. Putting the two micro-ops into one fused format, which now has two opcode fields and three operand fields. (Yup, that's it - what else did you expect?)
  2. Putting the operand fields of the first opcode into the fused micro-op, but only the non-dependent operand field of the second opcode.
  3. Linking the dependent operand of the second opcode to the output of the first opcode.
The fused micro-op is really two separate micro-ops combined in a condensed form. When the fused micro-op is issued, it occupies only one (wider) reservation station (RS) slot. Since it has only one fan-out (execution result), it occupies only one reorder buffer (ROB) slot, too. However, the two opcodes are still sent to separate execution units, so the execution bandwidth is not increased (nor reduced, by the way).
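In code form, the condensed format might look like the sketch below. The field names and layout here are my own guesses for illustration; the real micro-op encoding is not public:

```python
from dataclasses import dataclass

@dataclass
class FusedUop:
    """Two opcode fields sharing three operand fields and one fan-out."""
    op1: str    # e.g. "LOAD": produces the internal (linked) result
    op2: str    # e.g. "ADD":  consumes that result
    src1: str   # operand of op1 (address base register, etc.)
    src2: str   # operand of op1 (displacement, index, ...)
    src3: str   # the non-dependent operand of op2
    dest: str   # the single architectural result

    def unfuse(self):
        """At dispatch, the two opcodes still go to separate units."""
        uop1 = (self.op1, "linked", self.src1, self.src2)
        uop2 = (self.op2, self.dest, "linked", self.src3)  # input linked to uop1
        return [uop1, uop2]

# "add eax, [ebx+8]" as one fused micro-op: one RS slot, one ROB slot.
f = FusedUop(op1="LOAD", op2="ADD", src1="ebx", src2="+8", src3="eax", dest="eax")
print(f.unfuse())   # still two operations at execute, as before fusion
```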

* It works just fine - not great, just fine

So why does it work? Micro-fusion works because it relieves, to some degree, the x86 decode of the 4-1-1 complexity constraint. On those x86 instructions that take one argument directly from memory, this technique will (see the sketch after this list) -
  1. Increase x86 decode bandwidth from 1 to 3.
  2. Reduce RS usage by 50%.
  3. Reduce ROB usage by 50%.
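A back-of-the-envelope model makes the bandwidth gain visible. This is my own simplification: each decoder handles at most one x86 instruction per cycle, the full decoder emits up to 4 micro-ops, and each partial decoder emits 1:

```python
def cycles_to_decode(uop_counts, partial_limit=1):
    """Cycles to decode a stream of x86 insns under a 4-1-1 constraint."""
    assert all(c <= 4 for c in uop_counts), "longer insns go to the MS ROM"
    cycles, i = 0, 0
    while i < len(uop_counts):
        cycles += 1
        for slot_limit in (4, partial_limit, partial_limit):   # 4-1-1
            if i < len(uop_counts) and uop_counts[i] <= slot_limit:
                i += 1          # this decoder accepts the instruction
            else:
                break           # must wait to realign with the full decoder
    return cycles

loadops_unfused = [2] * 12      # twelve load-op insns, 2 micro-ops each
loadops_fused   = [1] * 12      # the same stream with micro-fusion
print(cycles_to_decode(loadops_unfused))   # 12: only the full decoder helps
print(cycles_to_decode(loadops_fused))     #  4: all three decoders help
```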
What it costs to implement micro-op fusion is just a minor increase in micro-op format complexity and an additional XLAT PLA for each partial decoder. So after all, it's probably a good deal and a smart way to increase P6 performance. It's just that, according to the published literature, it doesn't work the miracles many amateur sites have claimed, and there isn't much of Intel's own intellectual credit in it.
