A Journey in Modern Computer Architectures: 05/2007

Tuesday, May 29, 2007

Decoding x86: From P6 to Core 2 - Part 2

This is Part 2 of a three-article series. To fully appreciate what's written here, the Part 1 article (or a comparable understanding) is a prerequisite.

The New Advancements

Three major advancements have been made from the original P6 x86 decode over the years: micro-op fusion (Pentium M), macro-fusion (Core 2), and an increased 4-wide decode (also Core 2). In this Part 2 article, I will go over the micro-op fusion in more detail, and in the next Part 3, I will go further into Core 2's additions.

While these advancements have all been "explained" numerous times on the Internet, as well as marketed massively by Intel, I must say that many of those explanations and claims are either wrong or misleading. People got second-hand info from Intel's marketing guys and possibly even some designers, and they tend to spice it up with extra sauce, partly from imagination and partly from "educated" [sic] guesses.

One big problem that I saw in many of those on-line "analyses" is that they never get to the bottom of the techniques, such as why they were implemented and what makes them compelling as they are. Instead, most of those analyses just repeat whatever glossy terms they got from Intel and gloss over the technical reasoning. Not that the technical reasoning is any more important to end users, but without proper reference to it, the "analyses" will most surely degrade into mere repeaters of Intel's marketing. These wrong ideas also tend to have bad consequences for the industry - think of Pentium 4 and the megahertz hype that came with it.

In the following, I will try to look at the true motives and benefits of these techniques from a technical point of view. I will try to answer the 3W1H questions for each: Where does it come from, What does it do, How does it work, and Why is it designed so. As stated in the previous Part 1 article, all analyses here are based on publicly available information. Without inside knowledge from Intel, however, I cannot be certain of being 100% error-free. But the good thing about technical reasoning is that, with enough evidence, you can reason for or against it yourself, instead of choosing to believe whatever marketing crap comes your way.

* Micro-op fusion - its RISC roots

The idea behind micro-op fusion, or micro-fusion, came about in the early '90s to improve RISC processor performance where true data dependency exists. Unsurprisingly, it did not come from Intel. In a 1992 paper, "Architectural Effects on Dual Instruction Issue With Interlock Collapsing ALUs," Malik et al. from IBM devised a scheme to issue two dependent instructions at once to a 3-to-1 ALU. The technique, called instruction collapsing, was then extended and improved by numerous researchers and designers.

Intel came to the game quite late, around 2000/2001 (Pentium M was released in 2003), and apparently just grabbed the existing idea and filed a patent on it. The company did bring something new to the table: a cool name, fusion. It really sounds better to fuse micro-ops than to collapse instructions, doesn't it? In fact, the micro-fusion of Intel's design is very rudimentary compared to what had been proposed 6-8 years earlier in the RISC community; we will talk about this shortly.

Let's first look at the original "instruction collapse" techniques. Because a RISC ISA generally consists of simple instructions, true dependency detection among these instructions becomes a big issue when collapsing them together. However, if one can dynamically find out the dependencies - as all modern out-of-order dispatch logic can - one can then "collapse" not only two but even more instructions together. The reported performance improvement ranged from 7% to 20% on 2- to 4-issue processors.

* A cheaper and simplified approach

Now turn to Intel's micro-op fusion. What does it do? Magic, as most tail-wagging websites have cheered? Surely not -

It only works on x86 read-then-modify and operate-then-store instructions, where no dependency check is needed between the two micro-ops to be fused.
It works only on x86 decode and issue stages, so no speculative execution is performed.
It doesn't change or affect the ALUs, so the same number of execution units is still needed for one fused micro-op as two non-fused micro-ops.

What is actually added is an extra XLAT PLA for each partial x86 decoder (see the diagram above, and also the Part 1 article of this series), so that partial x86 decode can handle those load/store instructions that generate two micro-ops. Naturally, the performance increase won't be spectacular, and the early report from Intel is just between 2% and 5%. This is actually not that bad a result, given that the technique itself is pretty localized (to the x86 decode and micro-op format), and the main point of micro-fusion is not to remove dependency or to increase execution width anyway, as will be discussed later.

* An additional PLA plus a condensed format

So how does micro-fusion work? An x86 read-then-modify instruction, for example, consists of two dependent micro-ops in one "strand" (i.e., single fan-out): 1) calculate load address, 2) modify loaded result. Micro-fusion binds these two operations together into one format -

Putting the two micro-ops into one fused format, which now has two opcode fields and three operand fields. (Yup, that's it; what else did you expect?)
Putting the operand fields of the first opcode into the fused micro-op. Putting only the non-dependent operand field of the second opcode into the fused micro-op.
Linking the depending operand of the second opcode to the output of the first opcode.

The fused micro-op is really two separate micro-ops combined in a condensed form. When the fused micro-op is issued, it occupies only one (wider) reservation station (RS) slot. Since it only has one fan-out (execution result), it occupies only one reorder buffer (ROB) slot, too. However, the two opcodes are still sent to separate execution units, so the execute bandwidth is not increased (nor reduced, by the way).
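The binding steps above can be sketched in a few lines of Python. This is only an illustration of the format described here, not Intel's actual encoding: the class names, the field names, and the temporary register `tmp0` are all made up for the example, and the sketch assumes each micro-op has exactly two source operands.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MicroOp:
    """A plain (non-fused) micro-op: one opcode, two sources, one result."""
    opcode: str
    srcs: Tuple[str, str]
    dst: str


@dataclass(frozen=True)
class FusedMicroOp:
    """Hypothetical fused format: two opcode fields, three operand fields."""
    opcode1: str
    opcode2: str
    operands: Tuple[str, str, str]
    dst: str  # single fan-out, hence a single ROB slot


def fuse(first: MicroOp, second: MicroOp) -> FusedMicroOp:
    """Fuse two micro-ops where `second` consumes the result of `first`."""
    if first.dst not in second.srcs:
        raise ValueError("second micro-op does not depend on the first")
    # Keep only the operand of the second opcode that does NOT come from
    # the first one; the dependent operand is linked implicitly.
    independent = [s for s in second.srcs if s != first.dst]
    if len(independent) != 1:
        raise ValueError("expected exactly one independent operand")
    return FusedMicroOp(first.opcode, second.opcode,
                        (first.srcs[0], first.srcs[1], independent[0]),
                        second.dst)


# Something like `add eax, [ebx+8]`, split into a load and an add.
load = MicroOp("load", ("ebx", "8"), "tmp0")
add = MicroOp("add", ("eax", "tmp0"), "eax")
print(fuse(load, add))
```

Note that the fused record still carries both opcodes, which is why two execution units are still needed to execute it.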

* It works just fine - not great, just fine

So why does it work? Micro-fusion works because it relieves, to some degree, the x86 decode of the 4-1-1 complexity constraint. On those x86 instructions that get one argument directly from memory, this technique will -

Increase x86 decode bandwidth from 1 to 3.
Reduce RS usage by 50%.
Reduce ROB usage by 50%.

What it costs to implement micro-op fusion is just a minor increase in micro-op format complexity and an additional XLAT PLA for each partial decoder. So after all, it's probably a good deal, or a smart way to increase the P6 performance. It's just that, according to the published literature, it doesn't work miracles as many amateur sites have claimed, and not much of the intellectual credit for it belongs to Intel.
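To see where the "1 to 3" and "50%" figures come from, here is a toy model of the 4-1-1 decoder group. It is a sketch under simplifying assumptions of my own: instructions are decoded strictly in program order, each instruction is described only by its micro-op count, and microcode, fetch windows and issue limits are ignored. The function name `decode_cycles` is hypothetical.

```python
def decode_cycles(uop_counts, widths=(4, 1, 1)):
    """Count the cycles needed to decode instructions in program order.

    uop_counts[i] is the number of micro-ops instruction i decodes into;
    widths[k] is the most micro-ops decoder k can emit in one cycle.
    """
    if any(c < 1 or c > widths[0] for c in uop_counts):
        raise ValueError("toy model handles 1..widths[0] micro-ops only")
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        for cap in widths:
            if i >= len(uop_counts) or uop_counts[i] > cap:
                break  # later decoders idle; instruction waits for decoder 0
            i += 1
    return cycles


# Six read-then-modify instructions in a row.
unfused = [2] * 6  # load + op: two micro-ops each, only decoder 0 qualifies
fused = [1] * 6    # one fused micro-op each, any decoder qualifies

print(decode_cycles(unfused), sum(unfused))  # cycles, RS/ROB entries used
print(decode_cycles(fused), sum(fused))
```

In this model the unfused stream decodes one instruction per cycle and occupies 12 entries, while the fused stream decodes three per cycle and occupies 6.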

Sunday, May 27, 2007

Decoding x86: From P6 to Core 2 - Part 1

In this series of articles I will take a close look at the x86 instruction decode of Intel's P6 processor family, which includes Pentium-Pro/II/III/M, Core, and Core 2. I will first explain the design in some detail, then relate the marketing terms such as micro-op fusion, macro fusion and 4-wide decoding with what is actually happening inside the processor, down to its microarchitectures and processing algorithms.

All analyses here are based on publicly available information, such as Intel's software optimization manuals, patents and papers. What is added is some knowledge and understanding of computer microarchitectures and circuit designs. With great probability, the analyses here should clarify/correct many more myths out there than they introduce errors.

The x86-to-RISC Decoding Problem

Over the years, Intel has advocated the use of CISC over RISC instruction sets. However, with great irony - if we actually believed Intel's apparent stance toward the RISC/CISC argument - its P6 microarchitecture is really designed to be more "RISC Inside" than "Intel Inside." In order to reach both higher clock rates and better IPC (instructions per clock), the complex x86 instructions had to be first decoded into a simple, fixed-width RISC format (micro-ops) before being sent for execution. This way, the number of pipeline cycles an instruction must go through and the delay of the longest pipeline stage can be optimized for the common average-case rather than the rare worst-case instructions.

All sound good, right? Except there are three (rather big) problems:

The variable-length x86 instructions, which are almost always misaligned in the instruction cache, are hard to decode in parallel (i.e., multiple decodes per clock cycle).
The many addressing modes and operand sizes of even the simplest x86 instruction require complex and slow translation from x86 to internal RISC.
The high complexity of some x86 instructions makes worst-case decoders highly complex and inefficient.

Only by recognizing the problems of x86 decode and the difficulty of solving them can we fully appreciate the design choices that Intel made in the P6 front-end, as described in the three techniques below.

Technique #1: Pipeline the instruction length, prefix and opcode decodes

An x86 instruction can have 1-3 opcode bytes, 0-10 operand bytes, plus up to 14 prefix bytes, as long as the total does not exceed the 15-byte length limit. When stored in the instruction cache, it is almost never aligned to the cache line, which unfortunately is the unit that processor cores use to read from the cache. To solve the variable-length misalignment problem, P6's Instruction Fetch Unit (IFU) decodes the length, prefix, and the actual instruction opcodes in a pipelined fashion (see also the picture below):

Instruction Fetch Unit and steering mechanism

When IFU fetches a 32-byte cache line of instructions, it decodes the instruction lengths and marks the first opcode byte and the last instruction byte of every instruction in the window. The 32 bytes are put into a pre-decode buffer together with the markings.
The 32 bytes are scanned and 16 bytes starting from the first instruction are sent via a rotator to the instruction buffer (now aligned to the instruction boundary), from which they proceed on to two paths.
On one path, all 16 bytes are sent to the prefix decoders, where the first 3 prefix vectors are identified and sent to help instruction decode below.
On the other path and at the same time, 3 blocks of the same 16 bytes are steered to the 3 decoders in parallel, one block for each consecutive instruction.

Steering variable-length instructions is a complex task. The instruction bytes must be scanned sequentially to locate up to 3 opcodes and their operands, then packed and sent to the 3 decoders. Each decoder might accept up to 11 bytes (max x86 instruction length without the prefix), or it might receive just one.

By determining the instruction boundaries early and pipelining the prefix decode away from instruction decode, the steering task can be made simpler and faster. To further simplify the matter, only the first (full) decoder will accept 11 bytes; the other two (partial) decoders will accept only up to 8 bytes of opcodes and operands, as will be further discussed below.

Technique #2: Decode the opcodes and operands separately

After a decoder (full or partial) receives the opcode and operand bytes, it must try to decode them into a RISC format efficiently. This is accomplished by again decoding the opcodes and the operands in separate paths, as illustrated by the partial decoder diagram below:

Partial x86 decoder

From the steering circuit, 3 opcode bytes are picked up and sent to a translation programmable logic array (PLA) for control micro-op decode. The decoded control signals and micro-op template are put into a control uop register.
All the opcode and operand bytes, together with the prefix vector from the prefix decoders, are also sent to a field extractor in parallel. The field extractor extracts the alias information, which further describes the control micro-ops, into a macro-alias register.
The two registers, cuop and macro-alias, are then combined by an alias multiplexer to get the final alias-resolved micro-op (auop) code.

By decoding opcodes into templates and extracting operand information separately, the opcode decoder's PLA can be minimized and made flexible. Flexibility is important because, as we will see, the full decoder (shown in Technique #3 below) is really the partial decoder plus 3 XLAT PLA pipelines and one microcode engine. The flexibility also made it possible to implement micro-op fusion by adding an extra XLAT PLA, as will be discussed in the Part 2 article later.
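The split between template lookup and field extraction can be mimicked in software. The sketch below is my own illustration, not the real P6 logic: the table `XLAT_PLA`, the alias names `RM`/`REG`, and the tuple layout of the result are hypothetical, and only register-to-register forms of three real opcodes (0x01 ADD, 0x29 SUB, 0x89 MOV, each taking r/m32, r32) are handled.

```python
# Hypothetical "translate PLA": opcode byte -> micro-op template.
# A template is (operation, destination alias, source aliases...).
XLAT_PLA = {
    0x01: ("add", "RM", "RM", "REG"),  # ADD r/m32, r32
    0x29: ("sub", "RM", "RM", "REG"),  # SUB r/m32, r32
    0x89: ("mov", "RM", "REG"),        # MOV r/m32, r32
}

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def extract_fields(modrm):
    """Field extractor: pull the alias data out of the ModRM byte."""
    if modrm >> 6 != 0b11:
        raise ValueError("sketch handles register-direct ModRM only")
    return {"REG": REG32[(modrm >> 3) & 7], "RM": REG32[modrm & 7]}


def decode(instr_bytes):
    """Combine the template (cuop) with the alias data to get an auop."""
    cuop = XLAT_PLA[instr_bytes[0]]               # opcode path
    macro_alias = extract_fields(instr_bytes[1])  # operand path
    operation, *aliases = cuop
    # The "alias multiplexer": replace each symbolic alias by a register.
    return (operation,) + tuple(macro_alias[a] for a in aliases)


print(decode(bytes([0x01, 0xD8])))  # add eax, ebx
print(decode(bytes([0x89, 0xC8])))  # mov eax, ecx
```

The table stays small because it is indexed by opcode alone; the register numbers never enter it, which is the point of decoding the two paths separately.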

Technique #3: Differentiate decoders to Make the Common Case Fast

In a typical x86 program, more than 2/3 of the instructions are simple enough to be represented by a single (non-fused) micro-op. Most of the other 1/3 can be decoded into 4 micro-ops or less, with a (very) few taking more to execute. Recognizing these facts, especially the 2:1 simple-to-complex ratio, the P6 design divides its decoders into the well-known 4-1-1 structure, giving only one decoder full capability:

Full x86 decoder

The first decoder has four translate PLAs, decoding an instruction to up to 4 control uops in one clock cycle (see the full decoder diagram right above).
The first decoder also has a micro-code engine to decode the few really complex instructions over multiple clock cycles, generating 3 control uops per cycle (notice the three 2:1 MUXes in the above diagram).
The second and third decoders, as explained in Technique #2, have only one PLA and can decode only one single-uop x86 instruction per clock cycle.
Each decoder is equipped with its own macro-alias field extractor, although the first decoder's can be bigger in size.

When the micro-code engine is used, the 2nd and 3rd decoders are stalled to preserve in-order issue. By differentiating the decoders and putting the performance emphasis on the common simple-instruction cases, instruction decode and issue complexity can be minimized, and a higher clock rate can be reached. RISC design rule #1: Make the common case fast.

The Myths, Part 1

The Internet, being the greatest information exchange, inevitably also becomes the largest rumor farm and myth factory in the world. There have been numerous very wrong ideas about the P6 microarchitecture as a whole and the decoding front-end in particular. In "The Myths" section I will try to correct some of these misconceptions.

Since this Part 1 article only talks about the basic x86 decoding mechanisms, the related myths are also more basic and less astonishing. The described decoding mechanisms are over 10 years old, after all. Nevertheless, it is still better to get things right than wrong.

Myth #1: It is better to have more full decoders

An attempt to make fully capable decoders work in parallel is likely to cost more and gain little, not only because it will be very inefficient (resulting in slower clock rate and higher power usage), but also because it will cause trouble for the micro-op issue logic, which then must dynamically find out how many micro-ops are generated from each decoder, and route them in an (M*D)-to-N fabric from D decoders of M micro-ops to an issue queue of length N.

With twice as many simple instructions as complex ones in a typical program, an additional full decoder will not be worth it unless two more partial decoders are added. This ratio is increased even more with the introduction of micro-op fusion and the use of powerful SIMD instructions, although these came later.

Myth #2: It is better to utilize the full decoder as much as possible

Even though the full decoder can generate up to 4 micro-ops per clock cycle in parallel with the partial decoders, the issue queue of the P6 microarchitecture can only issue 3 micro-ops (or 4 in the case of Core 2) during any cycle. What this says is that the micro-op issue (and execution) logic will not be able to "digest" a continuous flow of x86 instructions with 4-1-1 uop complexity (with micro-op fusion, the pattern becomes selectively 4-2-2 - see Part 2 for more detail).

In other words, the pipeline (more precisely, the issue queue) will stall even when you sparsely (e.g., less than 30%) use those moderately complex instructions that can be decoded in one clock cycle. A corollary of this is that, in general, it is beneficial to replace a complex instruction by 3 simple ones (or 4 in the case of Core 2). The lesson: CISC does not scale. Even though you are writing/compiling to a CISC x86 ISA, you still want to make your assembly code as RISC-like as possible to get higher performance.

Myth #3: The same decoding width implies the same level of performance

To be sure, the 4-1-1 decoding engine was not the performance bottleneck up until the days of Pentium M, when micro-op fusion was introduced. Even with micro-op fusion, which supposedly doubles the capability of the partial decoders, Intel reported less than 5% performance increase over non-fused x86 decoding. The fact is, the IPC (instructions per clock) of all x86 processor cores, including the ones that bear the "Core 2" mark, has never exceeded 3. Pentium III running SPEC95 has an IPC roughly between 0.6 and 0.9. Assuming a 30% increase with each newer generation (which is quite optimistic to say the least), Pentium M would have an IPC roughly between 0.8 and 1.2, Core between 1.0 and 1.5, and Core 2 between 1.3 and 2.0. In other words, theoretically the ability to decode 3 instructions per cycle is quite sufficient up till this moment.
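The generation-by-generation estimate is just compound growth, and can be reproduced as follows. The 0.6-0.9 starting range and the 30% per-generation gain are the assumptions stated above; the helper name `scale` is made up for this sketch.

```python
def scale(ipc_range, gain, generations):
    """Apply the same relative gain for a number of generations."""
    low, high = ipc_range
    factor = (1.0 + gain) ** generations
    return low * factor, high * factor


pentium3 = (0.6, 0.9)  # rough SPEC95 IPC range quoted above
for gen, name in enumerate(["Pentium III", "Pentium M", "Core", "Core 2"]):
    low, high = scale(pentium3, 0.30, gen)
    print(f"{name:12s} {low:.2f} - {high:.2f}")
```

Even after three optimistic 30% steps, the upper bound stays below 2.0, well under the 3 instructions per cycle that the decoders can deliver in theory.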

Of course nothing in the real world runs in a theoretical way. Aside from the fact that there are many other things in a processor core to slow down execution, P6's 3-wide (or 4-wide in the case of Core 2) x86 decode rarely sustains 3 decodes per cycle, even with low complex-to-simple instruction ratio. The reasons -

First, the complex instructions must be well positioned to the first decoder. Since the 3 (or 4 in the case of Core 2) x86-to-RISC decoders work in program order, if unfortunately the first decoder is occupied by a simple instruction while a complex instruction comes to the 2nd place, then during that clock cycle only one simple instruction will be decoded. The steering circuit will "re-steer" the complex instruction from the 2nd place to the 1st on the next cycle.

Second, the decoders are flushed every 16 instruction bytes (or 24 in the case of Core 2). Looking at the IFU diagram at the beginning of this article, in every clock cycle 3 instructions from a 16-byte window are steered to the decoders. On average an x86 instruction takes about 3.5 bytes (the variance is high, though), so it is likely that the 16-byte window is not consumed in one clock cycle. If this is the case, then during the next cycle, the steering circuit will try to steer the next 3 instructions from the same 16-byte window to their respective decoders. But wait, what happens if there are fewer than 3 instructions left? Well, then fewer than 3 decoders have work to do in that cycle!

Third, taken branches always interrupt and cut short the decoding. This is similar to the reason above, except that here the later decoders sit idle not because the end of the 16-byte window is reached, but because the rest of the instruction bytes in the window are not (predicted) to be executed. This happens even under 100% branch prediction accuracy. The problem is even more serious when the target address is not aligned to a byte address of 0 MOD 16. For example, if the branch target instruction has byte address 14 MOD 16, then only one instruction is fetched (inside the first 16-byte window) after the branch is taken.

We will note that these are artifacts of P6's x86 decode design; they cannot be improved by any microarchitecture improvement elsewhere. It is because of these reasons that we need micro-op fusion, macro fusion, or an additional partial decoder in the later generations of the P6 processor family to even get close to the theoretical 3-issue limit. We will however wait until Part 2 (and possibly Part 3) to delve deeper into those.
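The second and third effects can be illustrated with a small model. It is a sketch with assumptions of my own: the 16-byte windows are aligned to addresses that are multiples of 16, an instruction belongs to the window in which it starts, every instruction is simple enough for any decoder, and a taken branch is modeled only by the start address of its target. The function name `decode_schedule` is hypothetical.

```python
def decode_schedule(lengths, start=0, window=16, width=3):
    """Return how many instructions are decoded in each cycle.

    lengths: instruction lengths in bytes, in program order
    start:   byte address of the first instruction (e.g. a branch target)
    """
    schedule = []
    addr = start
    i = 0
    while i < len(lengths):
        current = addr // window  # the window being drained this cycle
        taken = 0
        while (i < len(lengths) and taken < width
               and addr // window == current):
            addr += lengths[i]
            i += 1
            taken += 1
        schedule.append(taken)
    return schedule


# Twelve 4-byte instructions starting at a window boundary.
print(decode_schedule([4] * 12))           # 3, 1, 3, 1, ...
# Seven 4-byte instructions after a taken branch to address 14 MOD 16.
print(decode_schedule([4] * 7, start=14))  # first cycle decodes only one
```

With 4-byte instructions the model averages only 2 decodes per cycle, since the fourth instruction in each window always ends up alone.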

Friday, May 25, 2007

The PoV-Ray benchmark and AMD's Barcelona demo

AMD recently showed off a 4-socket quad-core Barcelona (K10) which almost doubles the speed of a 4-socket dual-core Opteron (K8) on PoV-Ray. More precisely, the rendering speed of the 16-core K10 system is just 1.87 times the speed of the 8-core K8 system, both running at the same processor frequency.

To some degree, this is well below people's expectations of Barcelona/K10, especially given AMD's official claim that Barcelona should "blow Clovertown away."

First, we know PoV-Ray is very scalable with respect to number of cores: a 4-socket 8-core Opteron system today already doubles PoV-Ray speed of a 2-socket 4-core Opteron system (see 453.povray - 130 vs. 66.3). So what's the big deal if K10 runs 1.87 times as fast with twice the number of cores?
Second, according to SPECfp PoV-Ray scores, a 2-socket Clovertown system at 2.66GHz is more than twice as fast as a 2-socket Opteron system at 3.0GHz (again, 453.povray - 145 vs. 69.4). How is Barcelona going to blow Clovertown away if it doesn't even double the speed of today's dual-core Opteron?

The first question turns out to be easy to answer: the point of the demo is not just (nearly) twice the performance, but also within the same power/thermal envelope. In other words, the quad-core K10 is going to be a perfect drop-in replacement for today's dual-core K8. The same does not hold with Intel's Clovertown/Xeon. According to this GamePC measurement, to upgrade a Xeon system from dual-core to quad-core under the same thermal/power envelope, one must lower the processor's clock rate by 30% (2.66GHz -> 1.86GHz, or 2.33GHz -> 1.6GHz), which generally implies a 15-20% loss of performance.

However, this still doesn't answer the second question. Shouldn't K10 with 2x the number of cores be more than 2x the speed of K8, due to the many per-core improvements we've heard of inside Barcelona/K10?

To answer this question, we have to look more closely at the benchmark: PoV-Ray.

We know AMD was using PoV-Ray 3.7 beta in the Barcelona demo, because previous versions do not support SMP. Now, there are two executables in the PoV-Ray 3.7 beta package: one compiled with x87 instructions, and one with SSE2. Which one did AMD use? If it was the SSE2, then why didn't it show any per-core improvement? If it was the x87, then why did AMD purposely choose a slower program to demo its next-generation processor?

It turns out that none of these questions is appropriate, because (1) PoV-Ray's usage of SSE2 is not really SSE (Streaming SIMD Extensions) at all, but double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, whereas on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10.

First, there is no actual usage of vectorized (or packed) instructions in PoV-Ray SSE. The only packed instructions I see in the binary are register conversions between x87 and SSE2 formats. PoV-Ray SSE basically treats SSE2 as a faster [sic] x87 engine which can access xmm registers randomly (rather than stack-based as in x87). For example, a simple double-precision division in PoV-Ray SSE is performed by the following instruction sequence:

Convert the divisor from single to double (CVTSS2SD)
Perform double-precision scalar division using DIVSD
Convert the result from two double values to two single values (CVTPD2PS).

This offers a considerable advantage for Intel's Core 2, because SSE2 DIVSD (18 cycles) in Core 2 is much faster than x87 FDIV (36 cycles), and the conversion instructions are also quite fast (4 cycles). Overall, for Core 2, the above sequence will save ~30% of the cycles (4+18+4=26 vs. 36) of an x87 division. On the other hand, this sequence is very inefficient for K8, where SSE2 DIVSD is as fast as x87 FDIV (~20 cycles), but conversions are much slower (8 cycles). Overall, for K8, the sequence runs ~80% slower (8+20+8=36 vs. 20 cycles) than an x87 division.

Roughly estimating, about 1/4 to 1/3 of the numerical instructions in PoV-Ray SSE undergo such a convert-calculate-convert process, and you see CVTxx2yy instructions all over the place in these parts of the code. Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the right way to do vectorized acceleration. It gives Core 2 a performance boost only due to Core 2's design artifact where such conversions are cheap/fast. Still, PoV-Ray SSE manages to run slightly faster than PoV-Ray x87 on K8, probably due to the ability to access registers randomly, which results in better superscalar and out-of-order execution.
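The cycle arithmetic can be checked with a few lines of Python. The latencies are the approximate figures quoted above, and simply adding them assumes the three instructions execute strictly back to back, as the paragraph does; the table and function names are made up for this sketch.

```python
# Approximate latencies in cycles, as quoted in the text above.
LATENCY = {
    "Core 2": {"cvt": 4, "divsd": 18, "fdiv": 36},
    "K8":     {"cvt": 8, "divsd": 20, "fdiv": 20},
}


def compare(cpu):
    """Return (SSE2 sequence cycles, x87 cycles, relative change)."""
    lat = LATENCY[cpu]
    sse = lat["cvt"] + lat["divsd"] + lat["cvt"]  # convert, divide, convert
    x87 = lat["fdiv"]
    return sse, x87, (sse - x87) / x87


for cpu in LATENCY:
    sse, x87, change = compare(cpu)
    print(f"{cpu}: {sse} vs. {x87} cycles ({change:+.0%})")
```

The same instruction sequence is thus a win of roughly 28% on Core 2 and a loss of 80% on K8 in this serialized model.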

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.

So now it looks all reasonable that we see such "disappointing" results from the K10/Barcelona PoV-Ray demo. Except for one question that naturally comes up: why did AMD choose PoV-Ray for the demonstration in the first place? Sure, PoV-Ray is very scalable to multiple cores, but there are many other applications that scale as well, aren't there? Maybe AMD wants to run a program that has something to display, such as a cool 3D image? Maybe AMD wants to show K10 can scale even on an unfriendly workload? Or maybe the guys responsible for the demonstration are just incapable of finding a good benchmark? Or maybe PoV-Ray is already the best case AMD can find, and Barcelona/K10 is going to disappoint? We simply won't know the real answer until the actual release of this greatly anticipated chip.

Tuesday, May 22, 2007

Core 2 Duo: That Can Hardly Be More Optimized

In this article I aim to find out how well Core 2 Duo and K8 perform on the single-process benchmarks, SPECint and SPECfp (i.e., no "rate" here), as some interesting observations come up again. A look at the facts is never short of revelation.

The systems are shown in the following table. K8int/K8fp denote K8 Opteron scores for integer and floating point benchmarks, respectively. Similarly, C2int/C2fp denote Core 2 Duo scores. Both SPEC CPU2000 and SPEC CPU2006 are compared. The main criteria in choosing these SPEC submissions are -

There are at least two processor speeds with otherwise identical configurations.
They use 64-bit operating systems and compilers.
All have comparable memory across different architectures (DDR2-667 to be exact).

Unfortunately, I couldn't find a K8 and a C2D using the same operating system and compiler. Thus strictly speaking the absolute values in these tests are not comparable across system families. We will relax ourselves a bit here but keep this fact in mind.

Below are the SPEC CPU2000 results. All points are the "base" scores -

Below are the SPEC CPU2006 results. The points with a postfix 'B' character on their labels represent "base" scores; the other points represent the "peak" scores -

In terms of SPEC CPU2000, Core 2 Duo completely outclasses Opteron/K8 on both INT (50%) and FP (30%) scores. Interestingly, this vast advantage shrinks greatly on SPEC CPU2006, where the leads become less than 30% for INT and almost none for FP. One explanation is that Core 2 Duo, released 3 years after Opteron, was optimized by design for the benchmarks, at a time when only SPEC CPU2000 was available. Another explanation is that the newer SPEC CPU2006 does not benefit as much from a large L2 cache (up to at least 4MB) and thus favors K8's integrated memory controller more. Yet another explanation is that the newer benchmark codes are more complex and thus less predictable by simple heuristics, where Core 2 probably does better/more than K8.

No matter what the reasons are (probably a bit from all three and more), one message is clear: for single-process integer codes, Core 2 Duo beats K8 Opteron hands-down. For floating point, it's a close match, and one should look at the type of program one runs to make a choice.

The really interesting observation lies in the "peak" versus "base" values of the benchmark results. For Core 2 Duo, peak offers just a 4% boost on INT and 3% on FP. On the other hand, for K8 Opteron, peak offers an 8% boost on INT and almost 30% on FP. It seems the microarchitecture of Core 2 Duo is so well optimized that there is little room for more software optimization, whereas K8 Opteron can still benefit from better compilation. This is certainly a plus for Core 2 Duo, because nobody likes to spend 2x the time to compile an optimized executable.

Comparing the SPEC and SPEC_rate results, we clearly see that while Core 2 Duo has a much better core implementation, its memory architecture trails K8 and drags down its throughput scalability. The FSB bottleneck can even be seen from the Core 2 Duo lines in the second graph above, where the two left-most point sets (with 1066MHz FSB) are much lower than the others (with 1333MHz FSB). Again, as I said, with Core 2 Duo, Intel goes back to its roots to improve, market on, and profit from personal/home (versus big server/high performance) computing.

Saturday, May 19, 2007

More scaling - where a picture speaks a thousand words

One reader of my previous article asked why I didn't use dual-socket Core 2 Duo for the scaling comparison. The reason is simple: I couldn't find a single pair of SPEC 2006 results where a single-socket and a dual-socket Core 2 Duo machine use the same CPU clock rate, memory technology, compiler, and operating system, so that a scientifically valid comparison can be made.

In this article I will relax a bit and not require exact matches among the candidate systems. I will use four x86_64 system models to show the "number of cores" and "clock rate" scaling of both Intel Core 2 Duo and AMD Opteron (K8).

Below are the system settings and their SPEC2006_rate scores. I use "ds" for dual-socket dual-core (4 cores), "qc" for single-socket quad-core (4 cores), and "dc" for single-socket dual-core (2 cores) -

Nothing is better than a picture to illustrate complicated data. Below is the performance graph of these systems. Green lines are for Fujitsu/AMD; blue lines for Fujitsu/Intel; red lines for Acer/Intel -

I couldn't resist the temptation, so below is a list of observations that I have to make:

First, with 2 cores, Core 2 Duo is undoubtedly the winner on both SPECint_rate and SPECfp_rate. With 4 cores, however, K8 becomes the better choice for SPECfp. The more powerful a system is, the more advantage K8 has, due to its better "number of core" scalability.

Second, Intel's FSB (front-side bus) is a bottleneck for 4 cores, even at 1066MHz. This is obvious from the left-most points of 4-core Core 2 systems (C2ds and C2qc), where the scores are lower than the rest of the clock scaling trend. Looking at the system settings, these lower-than-expected performances come precisely from the 1066MHz FSB (vs. 1333MHz).

Third, the MCM quad-core could be a good cost/power saver for single-socket home users and low-end servers. It almost matches dual-socket Opteron on integer performance, although its floating-point performance still leaves something to be desired.

Fourth, the MCM quad-core does not scale well at/beyond 2.67GHz. You may cry, look, the 2.67GHz C2Q even has lower SPECfp_rate than the 2.40GHz C2Q! There must be something wrong with the Fujitsu systems! Unfortunately, no. As of May 2007, all reported 2.67GHz C2Q SPECfp_rate I can find are "lower than expected." (The highest among them is 33.9 - less than 1% higher - but it uses FB-DIMM, different from the other systems presented here). This is probably why Intel is so late in introducing a higher-clocked Core 2 Quad - if they are not (much) better, why bother?

Fifth, the "clock rate" scaling of K8 performance is slowing down at 2.8GHz, especially for SPECfp_rate. Since all Fujitsu Primergy RX330 systems are identical except for the CPU clock rate, the only explanation is that the larger processor-memory speed gap makes higher CPU frequency less effective. Core 2 does not experience the same slowdown, probably due to its larger cache and better load/store circuits.

Sixth, doubling the L2 cache size helps Core 2 Duo about as much as one speed grade (0.16GHz). This is seen from the "jump" in the single-socket Core 2 Duo performance (C2dc), where the left two points with 2MB L2 are one step lower than the right three points with 4MB L2.

Tuesday, May 15, 2007

Multi-core scalability (lacking) of Intel Core 2 Duo

It's been over 9 months since Intel released the Core 2 Duo processors. Praise for this processor and its multi-chip module (MCM) quad-core brother, Core 2 Quad, floats around the Internet. With this line of processors, Intel is going back to its roots - marketing to and profiting from personal (vs. high-performance or big-server) computing. In other words, while Core 2 Duo/Quad works great for home projects (video encoding, playing games, etc.), it does not scale well to larger, heavier-duty setups.

Enough talk; let's see some proof from the industry-standard SPECint_rate and SPECfp_rate benchmarks. We will only look at the base scores from the new SPEC 2006 benchmark suite.

First we look at how well Core 2 Quad scales from Core 2 Duo:

SPECint_rate:
Intel Xeon 3060 2.4GHz, 2 cores/1 chip, 1066MHz FSB - 26.0
Intel Xeon X3220 2.4GHz, 4 cores/1 chip, 1066MHz FSB - 43.4 (1.67x)

SPECfp_rate:
Intel Xeon 3060 2.4GHz, 2 cores/1 chip, 1066MHz FSB - 22.4
Intel Xeon X3220 2.4GHz, 4 cores/1 chip, 1066MHz FSB - 33.5 (1.50x)

The above shows that, if you buy a Core 2 Quad, you really get just 3.3 cores' worth of performance for average integer workloads, and only 3 cores' worth for floating-point. In other words, the architecture already lacks scalability at quad-core.

In contrast, let's look at how AMD's Opteron (K8) scales to multi-core:
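The "effective cores" figures follow from simple arithmetic: divide the 4-core score by the 2-core score to get the scaling factor, then multiply by the 2 baseline cores. Here is a minimal Python sketch using the scores quoted above. The function name `effective_cores` is my own, and it assumes the 2-core system counts as 2 full cores:

```python
def effective_cores(base_score, base_cores, scaled_score):
    """Return (scaling factor, equivalent number of baseline-quality cores)."""
    factor = scaled_score / base_score
    return factor, factor * base_cores

# Scores quoted above: Xeon 3060 (2 cores) vs. Xeon X3220 (4 cores), both 2.4GHz.
results = {
    "integer": effective_cores(26.0, 2, 43.4),
    "floating-point": effective_cores(22.4, 2, 33.5),
}

for name, (factor, cores) in results.items():
    print(f"{name}: {factor:.2f}x, about {cores:.1f} effective cores")
```

The same function works for any pair of scores, as long as both systems differ only in core count.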

SPECint_rate:
AMD Opteron 854 2.8GHz, 2 cores/2 chips - 22.3
AMD Opteron 854 2.8GHz, 4 cores/4 chips - 41.4 (1.86x)

AMD Opteron 2210 1.8GHz, 2 cores/1 chip - 17.3
AMD Opteron 2210 1.8GHz, 4 cores/2 chips - 34.3 (1.98x)

SPECfp_rate:
AMD Opteron 854 2.8GHz, 2 cores/2 chips - 24.1
AMD Opteron 854 2.8GHz, 4 cores/4 chips - 45.6 (1.89x)

AMD Opteron 2210 1.8GHz, 2 cores/1 chip - 17.6
AMD Opteron 2210 1.8GHz, 4 cores/2 chips - 34.8 (1.98x)

What we see here is that, for a total of 4 cores per system, not only dual-core Opterons but even single-core Opterons connected by cHT links scale much better than two Core 2 Duos sitting on an MCM. Note that the absolute numbers in the different cases above are not directly comparable to each other, since they use different CPU clock rates, memory technologies, operating systems, and compilers.
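One way to put all of these 2-to-4-core pairs on the same footing is scaling efficiency: the measured speedup divided by the ideal speedup (2x for a doubling of cores). Because it is a ratio within each pair, it sidesteps the caveat that absolute scores are not comparable across systems. The sketch below is only an illustration; the labels and the `efficiency` helper are my own, and the pairs are listed in the order they are quoted above:

```python
# (label, 2-core score, 4-core score), in the order the pairs are quoted above.
pairs = [
    ("Xeon 3060 -> X3220 (MCM)", 26.0, 43.4),
    ("Xeon 3060 -> X3220 (MCM)", 22.4, 33.5),
    ("Opteron 854, 2 -> 4 chips", 22.3, 41.4),
    ("Opteron 2210, 1 -> 2 chips", 17.3, 34.3),
    ("Opteron 854, 2 -> 4 chips", 24.1, 45.6),
    ("Opteron 2210, 1 -> 2 chips", 17.6, 34.8),
]

CORE_RATIO = 4 / 2  # every pair doubles the core count

def efficiency(base, scaled, core_ratio=CORE_RATIO):
    """Fraction of ideal (linear) scaling actually achieved."""
    return (scaled / base) / core_ratio

efficiencies = [(label, efficiency(b, s)) for label, b, s in pairs]
for label, eff in efficiencies:
    print(f"{label:28s} {eff:6.1%}")
```

An efficiency of 100% would mean perfectly linear scaling; every Opteron pair lands above every Core 2 pair on this measure.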

Now let's look at how well Core 2 Duo scales to a multi-core, multi-processor setup:

SPECint_rate:
Intel Xeon X5355 2.67GHz, 4 cores/1 chip, 1333MHz FSB - 45.9
Intel Xeon X5355 2.67GHz, 8 cores/2 chips, 1333MHz FSB - 78.0 (1.70x)

SPECfp_rate:
Intel Xeon X5355 2.67GHz, 4 cores/1 chip, 1333MHz FSB - 33.9
Intel Xeon X5355 2.67GHz, 8 cores/2 chips, 1333MHz FSB - 56.3 (1.66x)

Again, the scalability is very lacking; you get only 6.8 and 6.6 cores from an 8-core setup for integer and floating-point codes, respectively.

In contrast, let's look at how Opteron scales from 4 cores to 8. This time we use only the dual-core Opteron processors for comparison:

SPECint_rate:
AMD Opteron 2222SE 3.0GHz, 4 cores/2 chips - 44.6
AMD Opteron 2222SE 3.0GHz, 8 cores/4 chips - 84.4 (1.89x)

SPECfp_rate:
AMD Opteron 2222SE 3.0GHz, 4 cores/2 chips - 47.3
AMD Opteron 2222SE 3.0GHz, 8 cores/4 chips - 89.8 (1.90x)

Unsurprisingly, for a total of 8 cores per system, the dual-core Opterons also scale much better than the quad-core Xeons.

What is interesting above is that, for Core 2 Duo, the 4-to-8-core scaling is actually better than the 2-to-4-core one. This is probably due to the fact that the 8-core system has a 25% faster FSB (1333MHz vs. 1066MHz), plus a chipset intelligent enough to separate traffic to/from the two quad-core processors (rather than a dumb MCM connection as the Core 2 Quad has internally). This also shows that (1) Intel's FSB design is the bottleneck of multi-core scaling even at quad-core, and (2) the MCM quad-core is an even worse approach for scaling performance to multi-core.
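A rough back-of-the-envelope calculation makes the FSB argument more concrete. The sketch below estimates the theoretical peak FSB bandwidth available per core. It rests on assumptions that are mine, not measurements: a 64-bit (8-byte) FSB data path, the quoted "MHz" figures taken as millions of transfers per second, all cores on a chip sharing one FSB, and each X5355 socket having its own FSB to the chipset. It also ignores any limit in the memory controller behind the bus:

```python
FSB_WIDTH_BYTES = 8  # assumption: 64-bit FSB data path

def peak_fsb_gb_per_s(mega_transfers_per_s):
    """Theoretical peak bandwidth of one FSB in GB/s (10^9 bytes/s)."""
    return mega_transfers_per_s * 1e6 * FSB_WIDTH_BYTES / 1e9

def per_core_gb_per_s(mega_transfers_per_s, cores_per_bus):
    """Peak FSB bandwidth divided evenly among the cores sharing the bus."""
    return peak_fsb_gb_per_s(mega_transfers_per_s) / cores_per_bus

configs = {
    "Xeon 3060, 2 cores on one 1066 FSB": per_core_gb_per_s(1066, 2),
    "Xeon X3220, 4 cores on one 1066 FSB": per_core_gb_per_s(1066, 4),
    # assumption: each X5355 socket has its own 1333 FSB to the chipset
    "Xeon X5355, 4 cores per 1333 FSB": per_core_gb_per_s(1333, 4),
}

for name, bandwidth in configs.items():
    print(f"{name}: {bandwidth:.2f} GB/s per core")
```

Under these assumptions, going from the 3060 to the X3220 halves the per-core share of the bus, while adding a second X5355 socket leaves it unchanged. That would be consistent with the 4-to-8-core step scaling better than the 2-to-4-core one.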

In Conclusion - Intel's Core 2 Duo could well be the fastest processor for home computers (or dual-core, single-processor servers) that cost a bit more money in exchange for faster video encoding and AI-intensive gaming. On the other hand, the hard numbers above show that for servers that scale to 4 cores or more, today's dual-core Opteron is a far better choice. This is probably due to Opteron's Direct-Connect architecture and integrated memory controller, both of which AMD implemented in 2003 and which Intel will follow in its next major processor release (Nehalem) in late 2008.
