A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Friday, August 03, 2007

Not Everything about Memory is Bandwidth

The False Common Belief

There is a common belief among PC enthusiasts that bandwidth, measured in million transfers per second or megabytes per second, is the most important thing a good memory system should aim for. This belief is so deep-rooted that even the professionals (i.e., AMD & Intel) have begun to calibrate and market their products based on memory bandwidth values.

For example, take a look at this Barcelona architecture July update article. The first graph on that page, apparently from an AMD presentation and duplicated below for convenience, suggests that all the memory enhancements in AMD's Barcelona (K10) over its predecessor (K8) are about "Increasing Memory Bandwidth".


The question is, do they really increase memory bandwidth? Let's take a look at the bullet points in the graph, from bottom to top.
  • The prefetchers. Prefetching does not increase memory bandwidth. On the contrary, it reduces available memory bandwidth by increasing memory bus utilization (search "increase in bus utilization" on the page).
  • Optimized Paging and Write Bursting. Both increase memory bus efficiency, which does not increase the bandwidth per se, although it helps improve bandwidth effectiveness.
  • Larger Memory Buffer. A larger buffer can improve store-to-load forwarding and increase the size of write bursting. The buffer itself, however, does not increase memory bandwidth at all.
  • Independent Memory Channels. This certainly has no effect on memory bandwidth. Each of the two independent channels is half the width, resulting in the same overall bandwidth.
Thus, out of six bullet points, only two are even marginally related to memory bandwidth. The bottom line: Barcelona still uses the same memory technology (DDR2) and the same memory bus width (128-bit), so there is simply no additional peak bandwidth to be had!

However, it would be even more wrong to conclude that Barcelona's memory subsystem is not improved over its predecessor's, because all the points above are genuine improvements, just not in memory bandwidth but in memory latency. Intelligent prefetching can hide memory latency, as shown in the Intel article page linked above. Fewer read/write transitions, thanks to write bursting and the larger memory buffer, can reduce memory latency considerably. The independent memory channels also reduce latency when multiple memory transactions are in flight simultaneously, which is especially important for multi-core processing. In short, the memory subsystem of Barcelona is improved for lower latency, not higher bandwidth.

Why Does Barcelona Improve Latency More Than Bandwidth?

There are a few reasons why a general-purpose computing platform built on multiple levels of cache benefits more from lower memory latency. This is in contrast to specialized signal or graphics processors, where instruction branches (changes in instruction flow) and data dependencies (store-to-load forwarding) are few and far between. This fact is aptly described in the following "Pitfall" on page 501 of Computer Architecture: A Quantitative Approach, 3rd ed., Section 5.16, by Hennessy and Patterson:
  • Pitfall: Emphasizing memory bandwidth in DRAMs versus memory latency. PCs do most memory access through a two-level cache hierarchy, so it is unclear how much benefit is gained from high bandwidth without also improving memory latency.
In other words, for general-purpose processors such as Athlon, Core 2 Duo, Opteron, and Xeon, what helps performance is not just the bandwidth but, more importantly, the effective latency of the memory subsystem. This pitfall is promptly followed by its dual on the next page of the book, which in turn explains why most signal and graphics processors that require high memory bandwidth do not need multiple levels of cache the way general-purpose CPUs do:
  • Pitfall: Delivering high memory bandwidth in a cache-based system. Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that needs it.
Memory Bandwidth Estimate for High-End Quad-Core CPUs

Still, one may ask: is the memory bandwidth offered by, say, a DDR2-800 channel really enough for modern processors? It turns out that, at least for Intel's Penryn and AMD's Barcelona to come, it should be. To estimate the maximum required memory bandwidth, we assume a 3.33GHz quad-core processor with 4MB cache sustaining 3 IPC (instructions per cycle). Such a processor should be close to the top-performing models from both AMD and Intel by the middle of next year. (See also the micro/macro-fusion article for Core 2's actual/sustainable IPC.)

First, let's look at the data bandwidth. A 3.33GHz, 3 IPC core would execute up to 10G I/s (giga-instructions per second). Suppose 1 out of 3 instructions contains a load or store, which is consistent with the fact that both Core 2 and Barcelona have 6-issue (micro-op) engines and perform up to 2 loads or stores per cycle. Thus,

10G I/s * 0.333 LS/I = 3.33G LS/s (giga-load/store per second, per core)

Multiplying this number by 4 cores, the total is 13.33G LS/s. According to Figure 5.10 of Computer Architecture AQA on page 416, a 4MB cache has a miss rate of about 1%. Let's make it 2% to be conservative. Thus the number of memory accesses going to the memory bus is

13.33G LS/s * 2% MA/LS = 0.267G MA/s (giga-memory accesses per second)

Each memory access is at most 16 bytes, but most likely 8 bytes or fewer on average. This makes the worst-case memory bandwidth requirement 0.267G*16 = 4.27GB/s, and the average case 2.14GB/s. Note that a single channel of DDR2-800 memory can supply up to 6.4GB/s, much more than the numbers above.
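
To make the arithmetic easy to check, here is a small back-of-the-envelope sketch in Python. The constants are simply the assumptions stated above (clock, IPC, core count, load/store ratio, miss rate, bytes per access), not measured values.

# Back-of-the-envelope data-bandwidth estimate, using the assumptions above.
clock_hz    = 3.33e9   # assumed core clock
ipc         = 3        # assumed sustained instructions per cycle
cores       = 4
ls_per_inst = 1 / 3    # roughly 1 load/store per 3 instructions
miss_rate   = 0.02     # 2% of loads/stores miss the 4MB cache (conservative)
bytes_worst = 16       # assumed worst-case bytes fetched per memory access
bytes_avg   = 8        # assumed average bytes fetched per memory access

ls_per_sec   = clock_hz * ipc * ls_per_inst * cores   # about 13.3G LS/s
mem_accesses = ls_per_sec * miss_rate                 # about 0.27G MA/s
print(f"worst case: {mem_accesses * bytes_worst / 1e9:.2f} GB/s")   # about 4.3 GB/s
print(f"average case: {mem_accesses * bytes_avg / 1e9:.2f} GB/s")   # about 2.1 GB/s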

Now let's calculate the instruction bandwidth. Again, for 4 cores at 3.33GHz and 3 IPC, there are 40G I/s. However, instructions usually enjoy exceptionally good cache behavior. According to Figure 5.8 on page 406 of the same textbook, a 64KB instruction cache has fewer than 1 miss per 1000 instructions. Assuming (again quite conservatively) that each instruction takes 5 bytes, the total memory bandwidth for fetching instructions is

40G I/s * 0.1% * 5 B/I = 0.2GB/s
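
The same kind of sketch, with the instruction-side assumptions stated above:

# Instruction-fetch bandwidth estimate, using the assumptions above.
inst_per_sec   = 40e9    # 4 cores * 3.33GHz * 3 IPC, rounded as in the text
i_miss_rate    = 0.001   # fewer than 1 miss per 1000 instructions (64KB I-cache)
bytes_per_inst = 5       # assumed average instruction length

print(f"instruction fetch: {inst_per_sec * i_miss_rate * bytes_per_inst / 1e9:.2f} GB/s")  # 0.20 GB/s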

Thus, even under conservative assumptions, the instruction-fetch bandwidth is negligible compared to the data load/store bandwidth. The conclusion is clear: the memory bandwidth of just a single DDR2-800 channel (6.4GB/s) is more than enough for the highest-end quad-core processors of the next 10 months. The problem, however, is not bandwidth, but latency.

Update 8/7/07 - Please take a look at the AMD presentation on Barcelona, page 8, where quad-core Barcelona is shown to utilize just 25% of the total bandwidth of 10.6GB/s. Supposing this figure was obtained from a 2.0GHz K10 (the speed that was demo'd and apparently benchmarked by AMD), then, scaling up linearly, a hypothetical 3.3GHz K10 would reach about 41% utilization, or about 4.3GB/s. Notice how close this number is to my estimate above.
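
The scaling here is nothing more than linear extrapolation; a tiny sketch, with the 2.0GHz and 3.3GHz clocks assumed as above:

# Linear extrapolation of the reported 25% bus utilization from 2.0GHz to 3.3GHz.
reported_util = 0.25    # utilization shown in the AMD presentation
total_bw_gbs  = 10.6    # total bandwidth quoted in the presentation
scaled_util   = reported_util * 3.3 / 2.0
print(f"{scaled_util:.0%} utilization, {scaled_util * total_bw_gbs:.1f} GB/s")
# about 41% utilization and about 4.4 GB/s (4.3 if the utilization is rounded first)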

What About Core 2's Insatiable Appetite for FSB Speed?

A natural question is: if 6.4GB/s is more than enough for the highest-performing quad-core x86-64 processors in the next year or so, why is Intel so aggressively raising the FSB (front-side bus) speed above 1066MT/s (million transfers per second) to 1333MT/s and even 1600MT/s? Isn't 1066MT/s already offering more than 6.4GB/s of bandwidth?

The reasons are two-fold:
  1. For Core 2 Quad, the FSB is used not just for memory accesses, but also for I/O and inter-core communication. Since data transfer on the FSB is most efficient in long back-to-back bursts, such transfer-type transitions can greatly reduce effective bandwidth.
  2. Raising the FSB speed not only increases peak bandwidth, but also (more importantly) reduces transfer delay. A 400MHz (1600MT/s) bus cuts the data transfer time of a 266MHz (1066MT/s) bus by one third.
In other words, due to the obsolete design of Intel's front-side bus, the sheer value of peak memory bandwidth is a poor predictor of the memory subsystem's performance: a nominally 10.6GB/s bus (1333MT/s * 8B/T) may not even satisfy a quad-core processor (3.33GHz, 3 IPC) that needs no more than 5GB/s of continuous data traffic to and from memory.
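
The second point above is easy to quantify; a quick sketch, assuming the FSB's 64-bit (8-byte) data width:

# Peak bandwidth and relative transfer time of a 64-bit FSB at three speeds.
bytes_per_transfer = 8
for mts in (1066, 1333, 1600):
    peak_gbs = mts * 1e6 * bytes_per_transfer / 1e9
    rel_time = 1066 / mts   # data transfer time relative to a 1066MT/s bus
    print(f"{mts}MT/s: peak {peak_gbs:.1f} GB/s, transfer time {rel_time:.2f}x that of 1066MT/s")
# 1600MT/s shows about 0.67x the transfer time, i.e. one third less,
# while 1333MT/s gives the roughly 10.6-10.7 GB/s peak quoted above.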

The Importance of Latency Reduction

To show that latency reduction is the more important reason to raise FSB speed, we will compare a dual-core system with two quad-core systems, one with a 2x wider memory bus and the other with a 1.5x faster memory bus. We will show that the faster FSB is more effective at bringing down the average memory access time, which is the major factor affecting a computer's IPC. More specifically, using the dual-core system as the reference, suppose the following:
  1. The quad-core #1 system has the same bus speed but 2x the bus width (e.g., 128-bit vs. 64-bit). In other words, it has the same data transfer delay and 100% more peak memory bandwidth than the dual-core system.
  2. The quad-core #2 system has the same bus width but 1.5x the bus speed (e.g., 400MHz 1600MT/s vs. 266MHz 1066MT/s). In other words, it has 33% less data transfer delay and, consequently, 50% more peak memory bandwidth than the dual-core.
  3. The memory bus is time-slotted and serves the cores in round-robin. For the dual-core and quad-core #1 systems, each memory access slot is 60ns. For the quad-core #2 system, each slot is 40ns.
  4. Memory bandwidth utilization is 50% on the dual-core (2 out of 4 slots are occupied) and the quad-core #1 (4 out of 8 slots are occupied). It is 66.7% on quad-core #2 (4 out of 6 slots are occupied), calculated by 50% * (2x cores) / (1.5x bandwidth).
Note that the assumptions above are simplistic and optimistic: they do not take into account the reduced effective bandwidth and efficiency due to I/O and inter-core communication. When these two effects are taken into account, the quad-core systems will perform considerably worse than estimated below.

Let's first calculate the average memory access latency of the dual-core system. When either core makes a memory request, there is a 3/4 = 75% chance that the memory bus is free, and a 25% chance that it has to wait an additional 60ns for access. The effective latency is

60ns * 75% + (60ns+60ns) * 25% = 75ns

Thus, on average, each memory access takes just 75ns to complete.

Now let's calculate the effective latency of the quad-core #1 system. When an arbitrary core makes a memory request, there is only a 5/8 = 62.5% chance that the memory bus is free, and a 37.5% chance that it has to wait. The waiting time, however, is more complicated in this case, because there are C(8,3) = 56 ways the other three requests can occupy the slots. Skipping some mathematical derivations, the results for the three possible slot patterns (the three occupied slots all contiguous, exactly two contiguous, and none contiguous) are

60ns * (4+3+2+1*5)/8 = 105ns, 6 out of 56 cases
60ns * (3+2+2+1*5)/8 = 90ns, 30 out of 56 cases
60ns * (2+2+2+1*5)/8 = 82.5ns, 20 out of 56 cases

=> (105ns * 6/56) + (90ns * 30/56) + (82.5ns * 20/56) = 88.9ns

Thus, even when we double the memory bandwidth and keep the same bus utilization, a quad-core system still incurs 18.5% higher access latency than a dual-core system. Note that this is in a case where memory utilization is as low as 50%; at higher utilization, the latency increase will only be worse. The conclusion is clear: increasing memory bandwidth alone is not enough to scale up memory performance for multi-core general-purpose processing.

Now let's look at the quad-core #2 system, where the data transfer delay is reduced by 33%, the memory width stays the same, and bus utilization rises to 66.7%. When an arbitrary core makes a memory request, there is just a 3/6 = 50% chance that the memory bus is free, and a 50% chance that it has to wait. The waiting time again depends on how the slots are occupied, with C(6,3) = 20 possible cases. Skipping the derivations once more, we get

40ns * (4+3+2+1*3)/6 = 80ns, 4 out of 20 cases
40ns * (3+2+2+1*3)/6 = 66.7ns, 12 out of 20 cases
40ns * (2+2+2+1*3)/6 = 60ns, 4 out of 20 cases

=> (80ns * 4/20) + (66.7ns * 12/20) + (60ns * 4/20) = 68ns

The average memory access latency here is almost 10% lower than that of the dual-core system and 24% lower than that of quad-core #1. The effect of the higher memory bus utilization is completely offset by the lower data transfer delay. Again, for general-purpose multi-core processing, reducing memory access delay is much more important than increasing peak memory bandwidth.
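
For readers who want to check the skipped derivations, here is a small Python sketch of the slotted round-robin bus model as used above. It assumes the other cores' requests occupy some subset of slots in a linear window, that all arrangements and arrival slots are equally likely, and that an arriving request waits out any run of consecutive occupied slots at its arrival point before using one slot itself. Under these assumptions it reproduces the 75ns, 88.9ns and 68ns figures.

from itertools import combinations

def avg_latency_ns(n_slots, n_busy, slot_ns):
    """Average completion time of a new request under the slotted bus model.

    n_slots: slots in the observation window
    n_busy:  slots already occupied by the other cores' requests
    slot_ns: duration of one memory access slot
    """
    total, cases = 0.0, 0
    for occupied in combinations(range(n_slots), n_busy):
        busy = set(occupied)
        for arrival in range(n_slots):
            run = 0                       # consecutive busy slots the request must wait out
            while arrival + run in busy:
                run += 1
            total += (run + 1) * slot_ns  # waiting time plus the request's own slot
            cases += 1
    return total / cases

print(avg_latency_ns(4, 1, 60))  # dual-core:    75.0 ns
print(avg_latency_ns(8, 3, 60))  # quad-core #1: ~88.9 ns
print(avg_latency_ns(6, 3, 40))  # quad-core #2: 68.0 ns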

Conclusion and Remark

Let's go back to the original (supposedly AMD) presentation. Why does it say "increase memory bandwidth" all over the page? Probably because most people simply don't know better, and making them know better would take an article like this one, which is probably necessary yet still not sufficient. One can almost see AMD's engineers struggling to squeeze the delicate bandwidth-latency relationship into a form easily understood (yet probably not believed) by ordinary minds.

However, bandwidth is definitely not useless; it really depends on the workload. For streaming workloads such as graphics and signal processing, bandwidth and throughput are everything, and latency becomes mostly irrelevant. You won't care whether a video frame is displayed 100 milliseconds after it was read from a Blu-ray disc, as long as the next frame arrives within 15 milliseconds (about 70fps) or so. Yet 100 milliseconds is 300 million cycles of a 3GHz processor! For streaming applications, we certainly want a continuous flow of high-bandwidth data, yet we have millions of cycles of latency to spare.

5 comments:

Unknown said...

Hi Abe,

Nice article! Just to let you know that the first link is broken:

Barcelona architecture July update:
"www.elitebastards.com/cms/index.php?option=com_content&task=view&id=437&Itemid=29&limit=1&limitstart=2"

should be:
"www.elitebastards.com/cms/index.php?option=com_content&task=view&id=437&Itemid=29&limit=1&limitstart=2"

Cheers

Lem (from AMDZone)

abinstein said...

Thanks, Lem, :). The link is fixed now.

Ho Ho said...

abinstein
"Each memory access is at most 16-byte, but mostly likely 8-byte or less in average"

Since when can CPUs fetch half or a quarter of a cache line instead of full cache lines? Or are you counting on having data sequential in RAM? What about non-aligned data?

I might comment on other stuff later if I get time and I have nothing better to do.

Anonymous said...

Maybe he can read!

abinstein said...

"Maybe he can read!"

Right, maybe he can read, but can't think or learn better.

One of the properties of a good cache is to handle several cache misses at the same time. In other words, two misses on the same cache line can be served by only one line fetch.

So let's say your code operates on a 64-bit word at address offset 0x33a0 and another 64-bit word at 0x33a8. These two consecutive & independent loads generate two misses at the L2 cache, which has 16-byte/128-bit cache lines. How many bytes will be fetched from memory for the two misses? 16 or 32?

In any case, the maximum memory bandwidth required per miss is bounded by 16 bytes. The average is likely 8 bytes or fewer, given that many programs make plenty of 2-byte and even 1-byte accesses, often with good locality.
