A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Tuesday, May 22, 2007

Core 2 Duo: That Can Hardly Be More Optimized

In this article I aim to find out how well Core 2 Duo and K8 perform on the single-process benchmarks, SPECint and SPECfp (i.e., no "rate" here), as some interesting observations come up again. A look at the facts is never short of revelation.

The systems are shown in the following table. K8int/K8fp denote K8 Opteron scores for integer and floating point benchmarks, respectively. Similarly, C2int/C2fp denote Core 2 Duo scores. Both SPEC CPU2000 and SPEC CPU2006 are compared. The main criteria in choosing these SPEC submissions are -
  1. There are at least two processor speeds with otherwise identical configurations.
  2. They use 64-bit operating systems and compilers.
  3. All have comparable memory across different architectures (DDR2-667 to be exact).

Unfortunately, I couldn't find a K8 and a C2D using the same operating system and compiler. Thus, strictly speaking, the absolute values in these tests are not comparable across system families. We will relax this requirement a bit here but keep the fact in mind.

Below are the SPEC CPU2000 results. All points are the "base" scores -

Below are the SPEC CPU2006 results. The points with a 'B' suffix on their labels represent "base" scores; the other points represent "peak" scores -

In terms of SPEC CPU2000, Core 2 Duo completely outclasses Opteron/K8 on both INT (50%) and FP (30%) scores. Interestingly, this large advantage shrinks considerably on SPEC CPU2006, where the lead drops below 30% for INT and to almost nothing for FP. One explanation is that Core 2 Duo, released 3 years after Opteron, was optimized by design for the benchmarks, at a time when only SPEC CPU2000 was available. Another explanation is that the newer SPEC CPU2006 does not benefit as much from a large L2 cache (up to at least 4MB) and thus favors K8's integrated memory controller more. Yet another explanation is that the newer benchmark codes are more complex and thus less predictable by the simple heuristics that Core 2 probably applies better, or more often, than K8.

Whatever the reasons (probably a bit of all three and more), one message is clear: for single-process integer codes, Core 2 Duo beats K8 Opteron hands-down. For floating point, it's a close match, and one should look at the type of program one runs before stating a preference.

The really interesting observation lies in the "peak" versus "base" values of the benchmark results. For Core 2 Duo, peak offers a boost of just 4% on INT and 3% on FP. For K8 Opteron, on the other hand, peak offers a boost of 8% on INT and almost 30% on FP. It seems the microarchitecture of Core 2 Duo is so optimizing that there is little room for more software optimization, whereas K8 Opteron still can benefit from better compilation. This is certainly a plus for Core 2 Duo, because nobody likes to spend 2x time to compile an optimized executable.
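The peak-over-base percentages quoted above are simple ratios. As a minimal sketch of that arithmetic, the helper below (percent_gain is a name made up here) computes how far a peak score sits above its base score. The scores in the example are placeholders chosen only to reproduce a 4% and a 30% gain; they are not taken from any SPEC submission.

```python
def percent_gain(base: float, peak: float) -> float:
    """Return the percentage by which `peak` exceeds `base`."""
    if base <= 0:
        raise ValueError("base score must be positive")
    return (peak / base - 1.0) * 100.0


# Placeholder (base, peak) pairs, NOT real SPEC results.
examples = {
    "system A, INT": (20.0, 20.8),
    "system B, FP": (10.0, 13.0),
}

for label, (base, peak) in examples.items():
    print(f"{label}: peak is {percent_gain(base, peak):.1f}% above base")
```

The same ratio works for the cross-architecture leads discussed earlier, with one system's score taking the place of "base" and the other's taking the place of "peak".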

Comparing the SPEC and SPEC_rate results, we clearly see that while Core 2 Duo has a much better core implementation, its memory architecture trails K8's and drags down its throughput scalability. The FSB bottleneck can even be seen in the Core 2 Duo lines in the second graph above, where the two left-most point sets (with 1066MHz FSB) are much lower than the others (with 1333MHz FSB). Again, as I said, with Core 2 Duo, Intel goes back to its roots to improve, market, and profit from personal/home (versus big server/high performance) computing.

2 comments:

Ho Ho said...

"It seems the microarchitecture of Core 2 Duo is so optimizing that there is little room for more software optimization, whereas K8 Opteron still can benefit from better compilation."

Are you trying to show how little you know about ICC and compiling in general? If yes then you succeeded.

For your information ICC defaults to generating CPU specific code. That means there is very little you can do to get any extra speedboost from playing with compiler settings. 9.0 targeted Pentium 4 by default, I'm not sure about newer compiler versions. That means when you didn't specify it via compiler parameters you can't even run the programs on P3 or lower, assuming that compiler did manage to use some CPU specific instructions.

With GCC I've seen up to 30% speed increase in some FP heavy code just by tuning the compiler parameters from -O0 to -Os (-O3 was slower). Of course this is very rare and in most cases there is relatively small difference in speed.

"This is certainly a plus for Core 2 Duo, because nobody likes to spend 2x time to compile an optimized executable."

Even though ICC compiles stuff at several times slower pace than GCC compiling speed doesn't mean anything at all. I can compile every single application installed on my Gentoo box in less than 24h with my e4300@2.9GHz. glibc takes the longest with around 30 minute compile time.


Also, you made some huge mistakes in the reply you made to the post in the other thread, I'll reply to it later tonight if I get time.

abinstein said...

Ho Ho -

"Are you trying to show how little you know about ICC and compiling in general? If yes then you succeeded."

I'll ask you to watch your language - I will let your silly rudeness pass this time, but not next. This is a professional site and I post comments very selectively. That means no flames, no bias, and no further arguing about this policy.


"For your information ICC defaults to generating CPU specific code. That means there is very little you can do to get any extra speedboost from playing with compiler settings."

Apparently you don't know what I was speaking of in the article. Maybe I assumed too much from the readers and didn't explain the context clearly enough.

All SPEC base scores are compiled with the (best) optimization flags. The difference between a "peak" score and a "base" score is that the peak is compiled in two passes, where run-time information (profile) from the first pass is used in compilation of the second.

The fact that icc does not improve C2D performance from "base" to "peak" shows C2D's default optimization (hardware speculation, prediction, etc.) is already very good for SPEC workloads, so run-time profiling information does not offer further help.
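For readers unfamiliar with two-pass (profile-guided) compilation, here is a rough sketch of the three steps involved. It uses GCC's -fprofile-generate and -fprofile-use flags purely as an illustration; the SPEC submissions discussed here used their own compilers and flag sets, which are not reproduced. The function name and the file names (bench.c, bench, train.in) are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def pgo_build_steps(source: str, binary: str, training_args: list[str]) -> list[list[str]]:
    """Return the command lines for a two-pass, profile-guided GCC build."""
    return [
        # Pass 1: build an instrumented binary that records a profile when run.
        ["gcc", "-O2", "-fprofile-generate", source, "-o", binary],
        # Training run: execute the instrumented binary on a sample workload.
        ["./" + binary, *training_args],
        # Pass 2: rebuild, letting the compiler use the recorded profile.
        ["gcc", "-O2", "-fprofile-use", source, "-o", binary],
    ]


# Hypothetical file names, for illustration only.
for step in pgo_build_steps("bench.c", "bench", ["train.in"]):
    print(shlex.join(step))
```

The extra training run in the middle is what separates this from simply compiling twice.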


"Even though ICC compiles stuff at several times slower pace than GCC compiling speed doesn't mean anything at all."

Again, you are getting completely the wrong idea. I said 2x the compile time because it takes two-pass profiled compilation for a dual-socket K8 to reach the same performance level as a C2D.

No matter which compiler you are using, compiling twice (once for profile generation, once for actual measurement) takes about 2x the amount of time. You also pay for an extra run in order to generate the profile.
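To put the "about 2x" figure in numbers, here is a back-of-the-envelope sketch. It assumes each compile pass costs roughly the same as a plain build, which is a simplification, and the timings are made-up placeholders. The function name is hypothetical.

```python
def two_pass_cost_ratio(compile_s: float, training_run_s: float) -> float:
    """Ratio of two-pass (profiled) build time to a single plain build.

    Assumes both compile passes take `compile_s` seconds each and that the
    training run between them takes `training_run_s` seconds.
    """
    if compile_s <= 0:
        raise ValueError("compile time must be positive")
    two_pass_total = 2 * compile_s + training_run_s
    return two_pass_total / compile_s


# Placeholder timings in seconds, not measurements.
print(two_pass_cost_ratio(600.0, 0.0))    # compile passes only
print(two_pass_cost_ratio(600.0, 300.0))  # including the training run
```

With a negligible training run the ratio is exactly 2; any real training run pushes it above that.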

Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.