In recent years, with the widespread use of consumer-grade "benchmarks", we have seen many of what I'd call "folklore comparisons" of different PC platforms, most recently of AMD's Barcelona (K10) and Intel's Xeon (Core2) microarchitectures. Specifically, many of these folklore comparisons made by on-line reviews spread Intel-favoring misinformation to "justify" Core2's "theoretical ILP advantage" on paper. In this article I will look more closely at both sides of this argument: is it justified to attribute Core2's better performance in some cases to a supposed "advantage" in its microarchitecture, and does such an advantage actually exist?
Misinformation on L1 Data Cache Bandwidth
The first example starts with a "test" AnandTech performed regarding K10 vs Core2 L1 data cache bandwidth:
Lavalys Everest L1 Bandwidth

| Processor | Read (MB/s) | Write (MB/s) | Copy (MB/s) | Bytes/cycle (Read) | Latency (ns) |
|---|---|---|---|---|---|
| Opteron 2350 2 GHz | 32117 | 16082 | 23935 | 16.06 | 1.5 |
| Xeon 5160 3.0 GHz | 47860 | 47746 | 95475 | 15.95 | 1.0 |
| Xeon E5345 2.33 GHz | 37226 | 37134 | 74268 | 15.96 | 1.3 |
| Opteron 2224 SE | 51127 | 25601 | 44080 | 15.98 | 0.9 |
| Opteron 8218HE 2.6 GHz | 41541 | 20801 | 35815 | 15.98 | 1.1 |
From the values above it would appear that, per clock cycle, K10 can load as much data as Core2 (16 bytes) but store only half as much (8 bytes). As pointed out by scientia at AMDZone, these numbers do not seem correct. In fact, according to this presentation slide, AMD's Barcelona (K10) processors should theoretically be able to perform two 128-bit (16-byte) loads per clock cycle, or twice Core2's L1 data cache bandwidth. Why the contradiction?
Rebuttal and Explanation
What we have here is a perfect example of the misinformation that comes out of such "folklore comparisons": AnandTech ran a synthetic benchmark tool without really understanding what it was doing. A synthetic benchmark will underestimate the Opteron's L1 bandwidth because, with sequential accesses and small strides, it stresses only one of the two ports of K10's L1 cache.
Recall that K10's L1 data cache (L1D) is 2-way set associative with two 128-bit ports. Internally each port is connected to one bus going to the Load-Store Unit (LSU). This arrangement is described both verbally (2nd paragraph, page 223, A.5.2) and graphically (Figure 11, page 230, A.13) in the Software Optimization Guide for AMD Family 10 Processors.
Optimally, in every clock cycle two 128-bit words, one from each port, can be read from the L1D to the LSU and forwarded to the execution units. What probably happens with synthetic benchmarks is that, due to their fine-grained assembly-level "optimization," they generate unrealistic code that favors one microarchitecture over another. On K10 it is in fact possible to force data accesses into the same cache way or the same cache bank; such accesses can only utilize one of the two available ports and thus achieve half the optimal bandwidth.
In practice, do we always find data accesses going to the same port? Of course not. Clearly the kind of tests AnandTech ran reflects little if any realistic processor performance; they are at best echoes of uneducated folklore opinions or, worse, plain FUD. On the other hand, we won't find all data accesses going to different ports either, so a theoretical calculation of K10's maximum L1D bandwidth (twice as high as Core2's) is also overly optimistic and unrealistic. Suppose instead that 50% of cache accesses are spread across the two ports: on average it then takes 3 cycles to service 4 accesses (one cycle for the 2 that pair up, plus two cycles for the 2 that conflict, one each). The average throughput is 4/3 ≈ 1.33 accesses per clock cycle, i.e., K10 would achieve about 67% (1.33/2) of its theoretical maximum L1D bandwidth. That works out to roughly 33% higher than Core 2 for reads and, if K10 stores move at most 8 bytes per port per cycle as the Everest numbers above suggest, roughly 33% lower for writes; the arithmetic is spelled out below.
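A quick back-of-the-envelope tally of those two percentages (note that the 8-bytes-per-cycle store figure is read off the Everest results above, not taken from AMD's documentation):

```
K10 at 50% port spread: 4 accesses / 3 cycles ≈ 1.33 accesses per cycle
  reads:  1.33 × 16 B ≈ 21.3 B/cycle   vs. Core2's 16 B/cycle   → ~33% higher
  writes: 1.33 ×  8 B ≈ 10.7 B/cycle   vs. Core2's 16 B/cycle   → ~33% lower
```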
Proof and Conclusion
To prove that this theory is true, I wrote a program in C with gcc's SSE intrinsics to test the bandwidths myself. Skipping other I/O & maintenance parts, the kernel of the code looks like this:
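A minimal sketch of such a kernel, here the SSE PAND variant (the buffer size, the eight-way unroll, and all identifiers are illustrative choices, not necessarily those of the original program), might look like this:

```c
#include <emmintrin.h>                 /* SSE2 intrinsics */

#define BUF_BYTES 16384                /* small enough to stay resident in the 64KB L1D */
#define NWORDS    (BUF_BYTES / 16)     /* number of 128-bit words in the buffer */

static __m128i buf[NWORDS] __attribute__((aligned(16)));

/* SSE PAND test: each inner iteration issues eight 128-bit loads (plus PANDs).
 * Two independent accumulators keep the PAND dependency chain from capping
 * the loop at one load per cycle.                                            */
void test_pand(long reps)
{
    __m128i a = _mm_set1_epi32(-1);
    __m128i b = _mm_set1_epi32(-1);
    long r;
    int  i;

    for (r = 0; r < reps; r++) {
        for (i = 0; i < NWORDS; i += 8) {
            a = _mm_and_si128(a, _mm_load_si128(&buf[i + 0]));
            b = _mm_and_si128(b, _mm_load_si128(&buf[i + 1]));
            a = _mm_and_si128(a, _mm_load_si128(&buf[i + 2]));
            b = _mm_and_si128(b, _mm_load_si128(&buf[i + 3]));
            a = _mm_and_si128(a, _mm_load_si128(&buf[i + 4]));
            b = _mm_and_si128(b, _mm_load_si128(&buf[i + 5]));
            a = _mm_and_si128(a, _mm_load_si128(&buf[i + 6]));
            b = _mm_and_si128(b, _mm_load_si128(&buf[i + 7]));
        }
    }
    /* The results are deliberately left unused; as noted below, gcc 3.4.4 -O2
     * still emits the loads, while gcc 4.2 optimizes them away.               */
    (void)a; (void)b;
}
```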
The above code was compiled by gcc 3.4.4 with -O2 and -march=k8, and I checked the generated assembly to make sure the code really does what it is supposed to do. (Note: when compiled with gcc 4.2, the compiler is smart enough to see that the loop does no useful work and generates no SSE load instructions at all; in that case the loop runs at 1 iteration per cycle, skipping all 8 loads, limited by the conditional-branch bubble.)
The program has three tests: SSE store, SSE load, and SSE PAND; only the SSE PAND kernel is shown above, but the other two are very similar. Running on a Phenom 9500 @ 2.2GHz, the achieved bandwidths for store, load, and PAND are 28GB/s, 46GB/s, and 46GB/s, respectively. This translates to about 1.3 × 16-byte reads per cycle and 0.8 × 16-byte writes per cycle. So without any special treatment, at the C-source level, I already get 30% higher L1D read bandwidth and 60% higher L1D write bandwidth than AnandTech's "test" results; furthermore, the theoretical estimates I offered in the previous section turn out to be fairly close.
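For reference, converting those figures into bytes per clock cycle and comparing against the 2GHz Opteron 2350 row of the Everest table above:

```
load/PAND: 46 GB/s ÷ 2.2 GHz ≈ 20.9 B/cycle ≈ 1.3 × 16 B   (Everest: 16.06 B/cycle → ~30% lower)
store:     28 GB/s ÷ 2.2 GHz ≈ 12.7 B/cycle ≈ 0.8 × 16 B   (Everest: 16082 MB/s ÷ 2 GHz ≈ 8 B/cycle → ~60% lower)
```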
Out of curiosity I ran the same program on a Core 2-based 2.0GHz Xeon machine. It turns out that it only achieves about 85% of the maximum bandwidth there, i.e., less than 14 bytes of reads and writes per clock cycle. Thus, per clock cycle, the L1D read bandwidth on Core 2 is only about two-thirds of that on K10, whereas its write bandwidth is just 8% higher. Frankly I have no idea how Lavalys Everest writes its benchmarking code to produce such vastly different results, but anyone with a clear mind shouldn't care how some synthetic binary runs; what matters is what one can achieve on the platform at the source level, without dirtying one's hands with assembly optimizations (or de-optimizations, in the non-Intel cases).
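The same conversion for the Core 2 numbers, compared against the Phenom results above:

```
Core 2:  ~0.85 × 16 B ≈ 13.6 B/cycle for both reads and writes
  reads:  13.6 / 20.9 ≈ 0.65  → about two-thirds of K10's read bandwidth
  writes: 13.6 / 12.7 ≈ 1.07  → about 8% higher than K10's write bandwidth
```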