A personal record of understanding, deciphering, speculating on, and predicting the development of modern microarchitecture designs.

Saturday, May 19, 2007

More scaling - where a picture speaks a thousand words

A reader of my previous article asked why I didn't use a dual-socket Core 2 Duo for the scaling comparison. The reason is simple: I couldn't find a single pair of SPEC 2006 results where a single-socket and a dual-socket Core 2 Duo machine use the same CPU clock rate, memory technology, compiler, and operating system, so that a scientifically valid comparison could be made.
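For the curious, the matching I attempted looks roughly like the sketch below. It assumes a hypothetical CSV export of SPEC CPU2006 submissions with made-up column names (cpu_name, cpu_mhz, memory, compiler, os, sockets) - the real results come as web pages, so treat this purely as an illustration of the matching criteria:

    import csv
    from collections import defaultdict

    MATCH_KEYS = ("cpu_mhz", "memory", "compiler", "os")

    def find_matched_pairs(path):
        """Group Core 2 Duo results by (clock, memory, compiler, OS) and
        report any group containing both a 1-socket and a 2-socket system."""
        groups = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if "Core 2 Duo" not in row["cpu_name"]:
                    continue
                groups[tuple(row[k] for k in MATCH_KEYS)].append(row)
        for key, rows in groups.items():
            if {"1", "2"} <= {row["sockets"] for row in rows}:
                yield key, rows  # a scientifically valid comparison pair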

In this article I will relax a bit and not require exact matches among the candidate systems. I will use four x86_64 system models to show the "number of cores" and "clock rate" scaling of both Intel Core 2 Duo and AMD Opteron (K8).

Below are the system settings and their SPEC2006_rate scores. I use "ds" for dual-socket dual-core (4 cores), "qc" for single-socket quad-core (4 cores), and "dc" for single-socket dual-core (2 cores) -


Nothing is better than a picture to illustrate complicated data. Below is the performance graph of these systems. Green lines are for Fujitsu/AMD; blue lines for Fujitsu/Intel; red lines for Acer/Intel -


I couldn't resist the temptation; below are the observations I have to make:

First, with 2 cores, Core 2 Duo is undoubtedly the winner on both SPECint_rate and SPECfp_rate. With 4 cores, however, K8 becomes the better choice for SPECfp. The more powerful a system is, the more advantage K8 has, due to its better "number of cores" scalability.
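To make "number of cores" scalability concrete, here is the simple metric I have in mind, as a sketch with placeholder scores for illustration only (not the published numbers):

    def core_scaling_efficiency(score_2c, score_4c):
        """Throughput scaling from 2 to 4 cores; 1.0 means perfect doubling."""
        return (score_4c / score_2c) / (4 / 2)

    # Placeholder scores, purely illustrative:
    k8_eff = core_scaling_efficiency(score_2c=14.0, score_4c=26.6)  # ~0.95
    c2_eff = core_scaling_efficiency(score_2c=17.0, score_4c=28.9)  # ~0.85
    # A chip can start lower at 2 cores yet win at 4 if its efficiency
    # stays closer to 1.0 -- the pattern the graph shows for K8 on SPECfp.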

Second, Intel's FSB (front-side bus) is a bottleneck for 4 cores, even at 1066MHz. This is obvious from the left-most points of the 4-core Core 2 systems (C2ds and C2qc), where the scores fall below the rest of the clock scaling trend. Looking at the system settings, these lower-than-expected scores come precisely from the 1066MHz FSB (vs. 1333MHz).

Third, the MCM quad-core could be a good cost- and power-saving choice for single-socket home users and low-end servers. It almost matches the dual-socket Opteron on integer performance, although its floating-point performance still leaves something to be desired.

Fourth, the MCM quad-core does not scale well at/beyond 2.67GHz. You may cry: look, the 2.67GHz C2Q even has a lower SPECfp_rate than the 2.40GHz C2Q! There must be something wrong with the Fujitsu systems! Unfortunately, no. As of May 2007, all reported 2.67GHz C2Q SPECfp_rate results I can find are "lower than expected." (The highest among them is 33.9 - less than 1% higher - but it uses FB-DIMM, different from the other systems presented here.) This is probably why Intel is so late in introducing a higher-clocked Core 2 Quad - if they are not (much) better, why bother?

Fifth, the "clock rate" scaling of K8 performance is slowing down at 2.8GHz, especially for SPECfp_rate. Since all Fujitsu Primergy RX330 systems are identical except the CPU clock rate, the only explanation is that the larger processor-memory speed gap makes higher CPU frequency less effective. Core 2 does not experience the same slow down probably due to its larger cache and a better load/store circuits.

Sixth, doubling the L2 cache size helps Core 2 Duo by about as much as one speed grade (0.16GHz). This is seen from the "jump" in the single-socket Core 2 Duo performances (C2dc), where the left two points with 2MB L2 are one step lower than the right three points with 4MB L2.
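The equivalence is easy to check against the graph. A back-of-the-envelope sketch, with illustrative numbers only (read the real slope off the trend line yourself):

    def clock_equivalent_ghz(score_jump, slope_per_ghz):
        """Express the 2MB->4MB score jump as an equivalent clock increase."""
        return score_jump / slope_per_ghz

    # If the L2 jump is worth ~1 point and the trend gains ~6.25 points/GHz:
    print(clock_equivalent_ghz(1.0, 6.25))  # 0.16 -- about one speed grade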

10 comments:

Anonymous said...

You have shown that AMD does well in synthetic testing. Can you do the same for real programs?

abinstein said...

"You have shown that AMD does well in synthetic testing. Can you do the same for real programs?"

Actually, SPEC consists of real programs solving real problems. Also, I don't think AMD K8 is doing well in SPEC, at least not in a single-socket setup. Even on a dual-socket machine, K8 loses to Core 2 on SPECint programs.

Anonymous said...

Do servers run SPEC? I don't think so. Of course AMD always boasts about FP, a bunch of meaningless simulations requiring tons of bandwidth. Even on INT: analyzing proteins, playing chess, simulating a quantum computer? I think SPEC gets too much credit. So AMD scales on SPEC; can you show it in any actual app?

abinstein said...

"Do servers run SPEC? I don't think so."

It is one of SPEC's goals to cover processing-intensive usage as widely as possible. The focus is intentionally on processor/memory architecture and the compiler. Unless you can successfully argue that Intel's compilers have poor "number of cores" scalability, a processor/memory architecture that scales better on SPEC_rate by definition scales better on the individual programs that SPEC covers.

If you have insights on some server application whose processor-intensive part is not covered by SPEC, please do contribute.

"Of course AMD always boast about FP, a bunch of meaningless simulations requiring tons of bandwidth."

FP capability interests the scientific, engineering, financial & manufacturing communities. It could well be meaningless to you. :p

"Even on INT, analyzing proteins, playing chess, simulating a quantum computer?"

From my somewhat crude understanding, protein analysis is essentially string matching; chess playing is searching & min-max optimization; quantum computer simulation is probability and complex state transitions. These are some of the most important computer science algorithms, and they are what modern computers were designed to perform well.

Of course, they are not 3D gaming, video transcoding, word processing, or photo editing. These latter personal/home applications are generally more easily improved by speeding up special (vectorized) instructions.

"I think SPEC gets too much credit. So AMD scales on SPEC, can you show it in any actual app?"

To be more accurate, AMD's K8 microarchitecture scales better than Intel's Core 2 Duo in terms of "number of cores" versus SPEC throughput (i.e., SPEC[int|fp]_rate).

Please note that SPEC_rate is measured by running multiple independent instances of the SPEC programs side-by-side. It is similar to measuring a server or workstation's ability to handle multiple concurrent processes, and it should be representative of the processor's memory architecture and context switch capability.
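In other words, the methodology looks roughly like the sketch below, where ./benchmark stands in for a SPEC program (the real harness also normalizes against a reference machine, which I omit here):

    import subprocess, time

    def measure_rate(command, copies):
        """Run `copies` independent instances side-by-side and time the batch."""
        start = time.time()
        procs = [subprocess.Popen(command) for _ in range(copies)]
        for p in procs:
            p.wait()
        # Throughput keeps growing with copies only while shared resources
        # (FSB, memory controller, caches) keep up -- the scaling at issue here.
        return copies / (time.time() - start)

    # e.g. measure_rate(["./benchmark"], copies=4)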

Ho Ho said...

What about analyzing the results of SPEC JBB2005? This is much closer to being a server application than specint/fp. Also, as a server application, it is a good way to measure multithreaded scaling - much better than simply running several copies of the same program.

Anonymous said...

I wonder how K10 will scale - while HT definitely is better for scaling, perhaps it is easier to scale on lower-performing processors? It may be easier to scale as the other components are potentially less likely to be bottlenecks when your CPU performance is low; as your CPU performance goes up and you continue to scale, it would seem you are more likely to run into another bottleneck.

abinstein said...

"What about analyzing the results of SPEC JBB2005? This is much closer to being a server application than specint/fp."

My focus here is the scalability of CPU and memory architectures. SPEC JBB2005 is a Java/system-level test that involves far more than just the processor and memory. It is certainly interesting to look at that level, but it's much more difficult to compare just the processors with these tests. Also, since the whole system is involved, the pricing and architecture of the whole system must be taken into account, which generates a far more complicated picture than just the scalability of the processors.

Multithreading is yet another variable. Again, it would be interesting to see how multithreaded programs run on different processors, especially for workstations where program run time is more important.

abinstein said...

"It may be easier to scale as the other components are potentially less likely to be bottlenecks when your CPU performance is low."

The statement above is generally correct. However, the fact that a dual-socket, 8-core Core 2 Quad scales better than the single-socket MCM Core 2 Quad shows perfectly that absolute performance is not yet limiting throughput scalability here.

Also, the throughput of a quad-socket, 16-core Opteron system scales almost as well as that of a dual-socket, 8-core one. This is simply because Opteron's Direct Connect Architecture allows memory usage to be localized, with efficient accesses through the integrated memory controller. In terms of processor-intensive throughput, DCA and IMC are simply superior to FSB and MCM at 4 cores and up.
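A toy model of the contrast, assuming round bandwidth figures (a 1333MHz FSB moves roughly 10.6GB/s, and dual-channel DDR2-667 gives each Opteron socket about the same):

    def shared_mch_total_gbs(mch_gbs, sockets):
        # FSB systems funnel all memory traffic through one memory
        # controller hub, so aggregate bandwidth stays flat with sockets.
        return mch_gbs

    def imc_total_gbs(per_socket_gbs, sockets):
        # Each Opteron socket brings its own controller and local DRAM.
        return per_socket_gbs * sockets

    for s in (1, 2, 4):
        print(s, shared_mch_total_gbs(10.6, s), imc_total_gbs(10.6, s))
    # sockets: 1 -> 10.6 vs 10.6; 2 -> 10.6 vs 21.2; 4 -> 10.6 vs 42.4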

The reason I stress throughput is that today's multi-core processors are designed for higher throughput rather than lower (single-process) run time. Core 2 X6800 (released several months ago) would easily outperform the upcoming Q6700 on any single-process program, even many multithreaded ones.

Ho Ho said...

abinstein
"My focus here is the scalability of CPU and memory architectures. SPEC JBB2005 is a Java/system level test which has far more implications than just on the processor and memory."

Then compare systems with comparable specs, something like these, perhaps:

4P netburst vs 4P 8220SE vs 2P quadcore Core2

You are seeing correctly: even old Netbursts scale linearly in Java server business applications. Not everyone writes code as bad as Cinebench's.

abinstein said...

Ho Ho -
"Then compare the systems with comparable specs, something like those, perhaps:

4P netburst vs 4P 8220SE vs 2P quadcore Core2"


They are using different memory, different disk drives, different JVM/commands, and you call them comparable?

You seem to lack the ability and sense to comprehend the simple fact that jbb neither reflects nor claims to reflect processor capability (nor that of any individual component).

The reason is simple: it does not have a "base" value. Everything is utterly optimized in a customized fashion. In general, this benchmark favors hardware multithreading (long I/O delay) and large cache (long-living thread contexts); in particular, it is also sensitive to JVM implementation and the command line options.

"You are seeing correctly, even old netbursts scale linearly in Java server business applications."

What I correctly see is that each Netburst chip has 4 hardware threads (2 cores/chip * 2 threads/core); thus the bops of each per-chip instance scale up to 4 warehouses.

On the other hand, the 8-core Core 2 system running 4 instances of the JVM has 2 cores per instance, and its bops scale up to 2 warehouses. It is similar for the K8 system.
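The arithmetic, spelled out (the thread counts are those of the systems above):

    def peak_warehouses_per_instance(cores, threads_per_core, jvm_instances):
        """Hardware threads available to each JVM instance -- where its
        bops curve stops climbing with added warehouses."""
        return cores * threads_per_core // jvm_instances

    print(peak_warehouses_per_instance(2, 2, 1))  # Netburst chip: 4
    print(peak_warehouses_per_instance(8, 1, 4))  # 8-core Core 2, 4 JVMs: 2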

"Not everyone write as bad code as in Cinebench."

You got it totally wrong. The "scaling" of bops w.r.t. the number of warehouses is very different from scaling w.r.t. the number of cores. It reflects nothing but the number of threads the Java server and the underlying hardware can run in parallel. The fact that Netburst HyperThreading can scale performance up 50% proves the benchmark is not limited by processing power; HyperThreading only adds/retains thread contexts to tolerate I/O delay, it does not add any processing power to the processor.

IMO, jbb is like a marketing tool for manufacturers to sell systems that have large caches but are otherwise inefficient or slow. You don't suppose 16 Netburst cores with 360k bops really have more processing power than 32 Itanium2 cores with 300k bops, do you?

Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.