A Journey in Modern Computer Architectures: The PoV-Ray benchmark and AMD's Barcelona demo

Friday, May 25, 2007

AMD recently showed off a 4-socket quad-core Barcelona (K10) system that almost doubles the speed of a 4-socket dual-core Opteron (K8) system on PoV-Ray. More precisely, the 16-core K10 system renders just 1.87 times as fast as the 8-core K8 system, with both running at the same processor frequency.
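As a quick sanity check on the 1.87x figure, here is a minimal Python sketch. The CPU times (62 s vs. 116 s) and the approximate pixel rates ("over 4000" vs. "a little lower" than 2200 pixels per second) are the numbers quoted in the comments below this post. The variable and function names are my own, purely for illustration.

```python
# Figures from the demo video as quoted in the comments below the post.
# The pixel rates are only approximate, so the second ratio is a rough bound.
K8_CPU_SECONDS = 116.0    # 8-core K8 system
K10_CPU_SECONDS = 62.0    # 16-core K10 system
K8_PIXELS_PER_SEC = 2200.0
K10_PIXELS_PER_SEC = 4000.0


def speedup_from_times(slow_seconds, fast_seconds):
    """Speedup when the same job finishes in less time."""
    return slow_seconds / fast_seconds


def speedup_from_rates(fast_rate, slow_rate):
    """Speedup when throughput (work per second) is reported instead."""
    return fast_rate / slow_rate


time_speedup = speedup_from_times(K8_CPU_SECONDS, K10_CPU_SECONDS)
rate_speedup = speedup_from_rates(K10_PIXELS_PER_SEC, K8_PIXELS_PER_SEC)

print(f"speedup from CPU time:    {time_speedup:.2f}x")
print(f"speedup from pixel rates: {rate_speedup:.2f}x")
```

The two ratios differ slightly because the pixel rates were only stated loosely in the video; the CPU-time ratio is the one used in this post.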

To some degree, this falls well below people's expectations for Barcelona/K10, especially given AMD's official claim that Barcelona should "blow Clovertown away."

First, we know PoV-Ray scales very well with the number of cores: a 4-socket 8-core Opteron system today already nearly doubles the PoV-Ray speed of a 2-socket 4-core Opteron system (see 453.povray - 130 vs. 66.3). So what's the big deal if K10 runs 1.87 times as fast with twice the number of cores?
Second, according to SPECfp PoV-Ray scores, a 2-socket Clovertown system at 2.66GHz is more than twice as fast as a 2-socket Opteron system at 3.0GHz (again, 453.povray - 145 vs. 69.4). How is Barcelona going to blow Clovertown away if it doesn't even double the speed of today's dual-core Opteron?
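The ratios behind these two points can be checked with a few lines of Python. This is only a sketch using the 453.povray scores quoted above and the 1.87x demo figure. The dictionary keys and the "efficiency" helper (speedup divided by the increase in core count) are my own illustrative names, not anything defined by SPEC.

```python
# 453.povray scores as quoted in the two points above.
SCORES = {
    "opteron_4s_8core": 130.0,
    "opteron_2s_4core": 66.3,
    "clovertown_2s_2.66GHz": 145.0,
    "opteron_2s_3.0GHz": 69.4,
}


def ratio(a, b):
    """How many times faster system a is than system b."""
    return SCORES[a] / SCORES[b]


def efficiency(speedup, core_ratio):
    """Fraction of ideal linear scaling achieved for a given core increase."""
    return speedup / core_ratio


k8_scaling = ratio("opteron_4s_8core", "opteron_2s_4core")
clovertown_vs_opteron = ratio("clovertown_2s_2.66GHz", "opteron_2s_3.0GHz")
demo_speedup = 1.87  # 16-core K10 vs. 8-core K8, from the demo

print(f"K8 8-core vs. 4-core:          {k8_scaling:.2f}x "
      f"(efficiency {efficiency(k8_scaling, 2):.0%})")
print(f"Clovertown vs. Opteron (2S):   {clovertown_vs_opteron:.2f}x")
print(f"K10 16-core vs. K8 8-core:     {demo_speedup:.2f}x "
      f"(efficiency {efficiency(demo_speedup, 2):.0%})")
```

Read this way, the demo's 1.87x for a doubling of cores is slightly below what K8 alone already achieves when its core count doubles, which is exactly what prompts the first question.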

The first question turns out to be easy to answer: the point of the demo is not just (nearly) twice the performance, but twice the performance within the same power/thermal envelope. In other words, the quad-core K10 is going to be a perfect drop-in replacement for today's dual-core K8. The same does not hold with Intel's Clovertown/Xeon. According to this GamePC measurement, to upgrade a Xeon system from dual-core to quad-core under the same thermal/power envelope, one must lower the processor's clock rate by 30% (2.66GHz -> 1.86GHz, or 2.33GHz -> 1.6GHz), which generally implies a 15-20% loss of performance.
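The "30%" clock reduction follows directly from the two frequency pairs above. A small Python sketch, with a helper name of my own choosing:

```python
def clock_reduction(old_ghz, new_ghz):
    """Fractional drop in clock rate when moving from old_ghz to new_ghz."""
    return 1.0 - new_ghz / old_ghz


# Dual-core -> quad-core Xeon pairs quoted above (same thermal/power envelope).
pairs = [(2.66, 1.86), (2.33, 1.6)]
reductions = [clock_reduction(old, new) for old, new in pairs]

for (old, new), drop in zip(pairs, reductions):
    print(f"{old}GHz -> {new}GHz: {drop:.1%} lower clock")
```

This only reproduces the frequency arithmetic; the 15-20% performance figure depends on the workload and is not derived here.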

However, this still doesn't answer the second question. Shouldn't K10 with 2x the number of cores be more than 2x the speed of K8, due to the many per-core improvements we've heard of inside Barcelona/K10?

To answer this question, we have to look more closely at the benchmark: PoV-Ray.

We know AMD was using the PoV-Ray 3.7 beta in the Barcelona demo, because previous versions do not support SMP. Now, there are two executables in the PoV-Ray 3.7 beta package: one compiled with x87 instructions, and one with SSE2. Which one did AMD use? If it was the SSE2 build, then why didn't it show any per-core improvement? If it was the x87 build, then why did AMD purposely choose a slower program to demo its next-generation processor?

It turns out that neither of these questions is appropriate, because (1) PoV-Ray's usage of SSE2 is not SSE (Streaming SIMD Extensions) at all, but really double-precision FP with random register access; and (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, whereas on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10.

First, there is no actual usage of vectorized (or packed) arithmetic instructions in PoV-Ray SSE. The only packed instructions I see in the binary are conversions between single- and double-precision formats. PoV-Ray SSE basically treats SSE2 as a faster x87 engine whose xmm registers can be accessed randomly (rather than through a register stack as in x87). For example, a simple double-precision division in PoV-Ray SSE is performed by the following instruction sequence:

Convert the divisor from single to double (CVTSS2SD)
Perform double-precision scalar division using DIVSD
Convert the result from two double values to two single values (CVTPD2PS).

This offers a considerable advantage for Intel's Core 2, because SSE2 DIVSD (18 cycles) on Core 2 is much faster than x87 FDIV (36 cycles), and the conversion instructions are also quite fast (4 cycles). Overall, for Core 2, the above sequence saves roughly 28% of the cycles (4+18+4=26 vs. 36) compared with an x87 division. On the other hand, this sequence is very inefficient on K8, where SSE2 DIVSD is as fast as x87 FDIV (~20 cycles), but conversions are much slower (8 cycles). Overall, for K8, the sequence takes ~80% more cycles (8+20+8=36 vs. 20 cycles) than an x87 division.
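The cycle arithmetic above can be written out as a small Python sketch. The latencies are the approximate figures quoted in this post, not measurements. The sketch assumes the three instructions form a serial dependency chain, so their latencies simply add; real out-of-order cores may overlap some of this work. The table and function names are mine, for illustration only.

```python
# Approximate instruction latencies (in cycles) as quoted in this post.
LATENCY = {
    "core2": {"fdiv_x87": 36, "divsd": 18, "cvt": 4},
    "k8":    {"fdiv_x87": 20, "divsd": 20, "cvt": 8},
}


def sse_sequence_cycles(cpu):
    """Latency of convert -> DIVSD -> convert, assuming a serial chain."""
    lat = LATENCY[cpu]
    return lat["cvt"] + lat["divsd"] + lat["cvt"]


def relative_cost(cpu):
    """Cycles of the SSE2 sequence relative to a single x87 FDIV."""
    return sse_sequence_cycles(cpu) / LATENCY[cpu]["fdiv_x87"]


for cpu in LATENCY:
    cycles = sse_sequence_cycles(cpu)
    x87 = LATENCY[cpu]["fdiv_x87"]
    print(f"{cpu}: {cycles} cycles vs. {x87} for x87 "
          f"({relative_cost(cpu):.2f}x the x87 cost)")
```

A relative cost below 1.0 means the convert-calculate-convert sequence beats x87 on that core (Core 2); above 1.0 means it loses (K8).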

Roughly estimating, about 1/4 to 1/3 of the numerical instructions in PoV-Ray SSE undergo this convert-calculate-convert process; CVTxx2yy instructions appear all over the place in these parts of the code. Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the good/right way to do vectorized acceleration. It gives Core 2 a performance boost only because of a Core 2 design artifact that makes such conversions cheap. Still, PoV-Ray SSE manages to run slightly faster than PoV-Ray x87 on K8, probably thanks to its ability to access registers randomly, which allows better superscalar and out-of-order execution.

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.

So now it all looks reasonable that we see such "disappointing" results from the K10/Barcelona PoV-Ray demo, except for one question that naturally comes up: why did AMD choose PoV-Ray for the demonstration in the first place? Sure, PoV-Ray scales very well to multiple cores, but there are many other applications that scale as well, aren't there? Maybe AMD wants to run a program that has something to display, such as a cool 3D image? Maybe AMD wants to show K10 can scale even on an unfriendly workload? Or maybe the guys responsible for the demonstration are just incapable of finding a good benchmark? Or maybe PoV-Ray is already the best case AMD can find, and Barcelona/K10 is going to disappoint? We simply won't know the real answer until the actual release of this greatly anticipated chip.

8 comments:

Anonymous said...: 4000/2200 is 82% not 87%.; 25 May, 2007 08:21
Anonymous said...: "Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work inside the K10 design."

If an 8 core K8 system is 96% faster than a 4 core K8 system, how is 16 K10 cores with only 82% improvement over 8 K8 showing improvement? Yes you mention that AMD downgraded some functions, but unless you have more data, where is the improvement?; 25 May, 2007 08:29
Scientia from AMDZone said...: Good analysis. However, with popup comments and anonymous posting enabled I probably won't be commenting here very often.; 25 May, 2007 09:21
Ho Ho said...: From the video I understood that Barcelona was HE product and the dualcores were not. That means a minimum of 95W TDP for it. As thermals were equal that makes Barcelonas 95W TDP also. Comparing against Intel 50W ones that isn't exactly low power usage in my opinion.

"According to this GamePC measurement, to upgrade a Xeon system from dual-core to quad-core under the same thermal/power envelope, one must lower the processor's clock rate by 30% (2.66GHz -> 1.86GHz, or 2.33GHz -> 1.6GHz), which generally implies a 15-20% loss of performance"

First, they were not using LV quadcores in that comparison. AMD did use its high efficiency version of Barcelona in the povray benchmark. Intel LV quadcores are 50W TDP, non-LV versions are 80-150W with the 2.66GHz being 120W. Those compared dualcores mostly had 65W TDP, much lower than the quadcores they compared against.

Secondly, in vast majority of CPU limited applications replacing 2.66GHz dualcores with 1.86GHz quadcores is much closer to 40% performance increase than 30% decrease. If your application is bandwidth limited and not CPU limited there is no need to replace the CPUs, wouldn't you agree?

"(1) PoV-Ray's usage of SSE2 is not SSE (Streaming SIMD Extensions) at all, but really double-precision FP with random register access;"

Random elements in one register or simply random SIMD registers?

"(2) PoV-Ray SSE is optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87"

Why do you think it is more optimized for the Core2? From what I know work on PovRay 3.7 started long before Conroe was released.
Where did you get those SSE vs x87 performance numbers?
Also as you yourself just said then even that bad SIMD instruction usage is still better than x87 on both architectures. It is just sad that AMD SIMD units are not that good, even though it suggests developers to prefer SIMD instructions to x87.

"Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the good/right way to do vectorized acceleration"

I agree and highly doubt this is so. Don't take this personally but I don't think you understood that disassembled code that well to see what is going on in the hotspots. I personally highly doubt the people behind PovRay would write so inefficient code. Packet tracing is trivial and there is no need to convert stuff like that constantly all over the place. I can somewhat understand the shader code but not the tracing and intersection parts. What parts did you analyze?

"We simply won't know the real answer until the actual release of this greatly anticipated chip."

I agree. Too bad AMD doesn't explain the performance themselves and doesn't give any other benchmark results for people to analyze, not to mention whole systems.; 25 May, 2007 10:57
abinstein said...: "4000/2200 is 82% not 87%."

I calculated 1.87x from the ratio of CPU time, 62 sec vs. 116 sec.

In the video the guy didn't speak of the pixels per second very accurately. He said "over 4000" and "a little lower 2200".; 25 May, 2007 12:24
abinstein said...: "From the video I understood that Barcelona was HE product and the dualcores were not. That means a minimum of 95W TDP for it."

No, you misunderstand. Both systems, dual-core Opteron and quad-core Barcelona, are HE systems, which has processor TDP 65W.

"It is just sad that AMD SIMD units are not that good, even though it suggests developers to prefer SIMD instructions to x87."

No, you are not reading the article. PoV-Ray SSE doesn't utilize Stream SIMD Execution, at least not in a way I can see. Its performance is thus independent of how good a SIMD engine the running processor has.

"Why do you think it is more optimized for the Core2? From what I know work on PovRay 3.7 started long before Conroe was released."

I probably jumped the gun. It might not be that PoV-Ray is optimized for Core 2, but simply that gcc's SSE code generation resulted in such a convert-calculate-convert pattern. Please see this mail archive.

If it's really the "fault" of gcc, then the problem here we see is even greater. Not PoV-Ray but potentially every other dumbed-recompiled "SSE" versions of programs is favoring Core 2's design artifact.; 25 May, 2007 12:47
Ho Ho said...: "No, you misunderstand. Both systems, dual-core Opteron and quad-core Barcelona, are HE systems, which has processor TDP 65W."

I watched the video twice. In no place I heard the man saying they were using HE dualcores, he was only talking about the quads. At what minute/second did he say it?

"PoV-Ray SSE doesn't utilize Stream SIMD Execution, at least not in a way I can see"

Just for fun, could you show me the same things you see? I mean the code dump you were analyzing. You can provide a link to it or send an email, I don't care.

"Please see this mail archive."

There were some interesting posts there indeed.

"Not PoV-Ray but potentially every other dumbed-recompiled "SSE" versions of programs is favoring Core 2's design artifact."

How can you call a better designed ALU a design artifact?; 25 May, 2007 14:52
abinstein said...: "In no place I heard the man saying they were using HE dualcores, he was only talking about the quads."

Quad-core HE Barcelona has 65W TDP.

"could you show me the same things you see? I mean the code dump you were analyzing."

I'm not sure the licensing of the software I use allows me to do so. But at any rate you can get PoV-Ray SSE and search for CVTPD2PS yourself. It doesn't require arcane knowledge to count. :-)

"How can you call a better designed ALU a design artifact?"

Core 2's FP ALU isn't better than K8's. It converts between x87 and SSE formats faster, but that's hardly a definition for being better. In fact, dual-core K8 at 2.8GHz has about the same SPECfp_rate as dual-core Core 2 at 3.0GHz. You still call the latter "better"?

The PoV-Ray SSE (or the compiler that generates the binary) actually exploited the single fact that CVTxx2yy instructions are slow/depreciated in K8/K10 to make Core 2 look better than it is.; 25 May, 2007 15:30
