To some degree, this is totally below people's expectation on Barcelona/K10, especially according to AMD's official claim Barcelona should "blow Clovertown away."
- First, we know PoV-Ray is very scalable with respect to number of cores: a 4-socket 8-core Opteron system today already doubles PoV-Ray speed of a 2-socket 4-core Opteron system (see 453.povray - 130 vs. 66.3). So what's the big deal if K10 runs 1.87 times as fast with twice the number of cores?
- Second, according to SPECfp PoV-Ray scores, a 2-socket Clovertown system at 2.66GHz is more than twice as fast as a 2-socket Opteron system at 3.0GHz (again, 453.povray - 145 vs. 69.4). How is Barcelona going to blow Clovertown away if it doesn't even double the speed of today's dual-core Opteron?
The first question turns out to be easy to answer: the point of the demo is not just (nearly) twice the performance, but also within the same power/thermal envelope. In other words, the quad-core K10 is going to be a perfect drop-in replacement for today's dual-core K8. The same does not hold with Intel's Clovertown/Xeon. According to this GamePC measurement, to upgrade a Xeon system from dual-core to quad-core under the same thermal/power envelope, one must lower the processor's clock rate by 30% (2.66GHz -> 1.86GHz, or 2.33GHz -> 1.6GHz), which generally implies a 15-20% loss of performance.
However, this still doesn't answer the second question. Shouldn't K10 with 2x the number of cores be more than 2x the speed of K8, due to the many per-core improvements we've heard of inside Barcelona/K10?
To answer this question, we have to look more closely at the benchmark: PoV-Ray.
We know AMD was using PoV-Ray 3.7 beta in the Barcelona demo, because previous versions do not support SMP. Now, there are two executables in the PoV-Ray 3.7 beta package: one compiled with x87 instructions, and one with SSE2. Which one did AMD use? If it was the SSE2, then why didn't it show any per-core improvement? If it was the x87, then why did AMD purposely choose a slower program to demo its next-generation processor?
It turns out that none of these questions is appropriate. Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10.
First, there is no actual usage of vectorized (or packed) instructions in PoV-Ray SSE. The only packed instructions I see from the binary are register conversions between x87 and SSE2 formats. PoV-Ray SSE basically treat the SSE2 as a faster [sic] x87 engine which can access xmm registers randomly (rather than stack-based in x87). For example, a simple double-precision division in PoV-Ray SSE is performed by the following instruction sequence:
- Convert the divisor from single to double (CVTSS2SD)
- Perform double-precision scalar division using DIVSD
- Convert the result from two double values to two single values (CTVPD2PS).
Roughly estimating, about 1/4 to 1/3 of the numerical instructions in the PoV-Ray SSE undergo such convert-calculate-convert process, where you see CVTxx2yy instructions all over the places in these parts of the code. Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the good/right way to do vectorized acceleration. It gives Core 2 a performance boost only due to Core 2's design artifact where such conversions are cheap/fast. Still, PoV-Ray SSE manages to run slightly faster than PoV-Ray x87 on K8 probably due to the ability to access register randomly, which results in better superscalar and out-of-order executions.
Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.
So now it looks all reasonable that we see such "disappointing" results from the K10/Barcelona PoV-Ray demo. Except one question that naturally comes up: why did AMD choose PoV-Ray for the demonstration in the first place? Sure, PoV-Ray is very scalable to multiple cores, but there are many other applications that scale as well, aren't there? Maybe AMD wants to run a program that has something to display, such as a cool 3D image? Maybe AMD wants to show K10 can scale even on an unfriendly workload? Or maybe the guys responsible of the demonstration are just incapable of finding a good benchmark? Or maybe PoV-Ray is already the best case AMD can find, and Barcelona/K10 is going to disappoint? We simply won't know the real answer until the actual release of this greatly anticipated chip.