A Journey in Modern Computer Architectures: 04/2011

-
In the 1st quarter of 2011, Microsoft's Windows revenue dropped 4% from last year, mainly due to 40% decline in netbook sales. CFO Peter Klein says that tablets "played a part" in this decline. It's no wonder that companies no longer make a big fuzz about netbooks like they did in 2008 when Intel Atom was first out. After two short years of coming to life, netbook is already conceived as slow, insufficient and uncool by the consumers, ready to be replaced by the next technology.

AMD Fusion is what I feel an exciting and revolutionary technology which combines a capable multi-core CPU and a general-purpose GPU into a cost and power efficient package. We have an HP dm1z based on the AMD E-350, and I have to say it is an excellent notebook. It's capable, thin-and-light, runs cool and has long battery life (7+ hours in actual usage).

I find it ridiculous, though, that HP dm1z is marketed as a netbook. It's not a netbook. Neither HP nor AMD should have defined it as a netbook. My wife plays many cool games on it, edits photos on it, views HD movies on it, and basically performs every task an advanced PC user would on it, many tasks cannot be satisfactorily performed on a netbook (she holds an M.S. degree in computer science). It's no surprise, since netbook, by definition, is only for the "net," not games nor computing in general. An Atom based 10" laptop with in-order cores may be a netbook; an ARM based 8" laptop with 2GB RAM may be a netbook. But these AMD Fusion laptops have full-grown notebook computing capabilities (good), with size and power consumption similar to a netbook (better!).

It is no wonder that AMD does not refer to their Fusion APUs as for netbook. But sadly it doesn't matter. When Intel released their CULV, people tried to define the slow single-core Celeron based laptop with 2GB RAM as "notebook;" but after AMD launched their Fusion APUs, with two 64-bit out-of-order cores at 1.6GHz accessing 4GB RAM, most people seem to become dense about the many distinctions from netbook. Seriously, if the AMD Fusion "netbook" plays cool games with DX11, runs Office suite and photo editors smoothly, plays HD movies and even runs virtual machines at close-to-native speed, then it is a full notebook. If it's thin and light, then it's an ultramobile notebook. The reasonable conclusion: Fusion APU is not another netbook chip, but a perfect replacement for those ultramobile processors that would otherwise cost you $1000 each.

I feel kind of sad for AMD, for they seem to live in a world which is mostly agnostic about how good their technologies are. But perhaps that is how people like me can by these powerful little notebooks at a great price?

NOTE: If you only want to know whether AMD would K.O. Intel or the other way around, or if you believed technical discussions are nonsense while Internet rumors are gold, then please stay away. OTOH, if you like computer architecture and feel excited about state-of-the-art designs, please enjoy and let me know what you think (thanks)!

Updates --
* 4/13/2011 Updated with discussion on load-store unit and memory disambiguation.
* 4/12/2011 Updated with highlights on shared frontend and changes to other memory resource.

Prelude

AMD recently released the software optimization guide for its upcoming & most anticipated family 15h (Bulldozer) processors. In this article we take a high-level comparative look at the newly released document.

The new processor family features a revolutionary "cluster multi-threading" (CMT) architecture, where a processor consists of multiple modules, each being a cluster of two cores sharing the same instruction frontend, floating-point unit and level-2 cache. Newly supported ISA extensions include the 128-bit SSE4 and 128 & 256-bit AVX, XOP and FMA4.

Despite these major differences, the Bulldozer is fundamentally a continuation of the previous processor design from AMD. It is perhaps more useful to first mention some similarities between Bulldozer and the previous family 10h (K10) processors before going into detail of the differences:

The same (or very similar) macro-op and micro-op based instruction decode is utilized.
Similar register file superforwarding.
Same L1I cache, very similar L3 cache and system interconnect architecture are used.
Similar pick-pack instruction decode in a 32-byte window.
Loads and stores seem still performed in the load-store unit working as a backend to the integer core and FPU, rather than being scheduled directly in reservation stations.
The shared FPU design in Bulldozer has its root deep in the separated integer and FPU schedulers in K10.
Same or very similar microarchitecture for indirect branch (512-entry target array) and return address (24-entry return address stack) prediction.

That said, below we discuss some (not all!) of the major microarchitecture differences introduced in Bulldozer: shared frontend, execution pipelines, L1D and L2/L3 caches, and memory access resources.

Highlights on the shared frontend:

Two 32-byte instruction fetch windows (one for each core? 1.6.4)
Fetch window tracking structure (to manage fetches for both cores? 2.6)
Hybrid (tournament) branch prediction with global and local branch predictors
2-level BTB with 512+5120 entries, upped from 1-level 2048 entries
Instructions decoded from a 32-byte window or two 16-byte windows (for both cores? 2.7)
Introduce branch fusion

Instruction fetch and branching are greatly improved in Bulldozer. A more sophisticated conditional branch prediction is employed, utilizing a local predictor, a global predictor and a tournament selector. The branch target buffer (BTB) is increased to 2.5+ times larger.

Note that although a single frontend serves two cores, the same branch prediction information can be shared by both cores if they execute the same program. Even if the two cores run different programs, sharing the same instruction fetch and branch prediction resources can have benefit in latency hiding, especially for non-optimized and densely branching codes.

When stars align (instruction allocation is optimized and the code has pre-decode information), the frontend can decode up to 4 macro-ops from a 32-byte window per cycle for one core. Otherwise, a 16-byte window is scanned to find the boundaries for supposedly < 4 decodes per cycle. It is unclear whether in such cases one 16-byte window can be scanned for each core, thus still maintaining 32-byte decode (for both cores) per cycle. Note that it takes at least 2x time to scan a instruction window twice as large, but two instruction windows of same size can always be scanned concurrently by parallel resources, if available.

The branch fusion seems similar to Intel's macro-op fusion. It has limited applicability but would make Bulldozer more competitive for running Intel-optimized codes.

Highlights on the execution pipelines:

4-way microarchitecture design
Integer core has two EX and two AGLU pipelines, plus an LSU (2.10.2)
Floating-point unit (FPU) has two FMAC and two IMMX pipelines (2.11)

Up to 4 macro-ops per clock cycle can be issued from the (shared) frontend to either of the two cores. Within each core, up to 4 macro-ops per clock cycle can be sent to an integer or the floating-point scheduler.

The integer scheduler can dispatch up to 4 micro-ops per cycle, one to each of the 4 pipelines. Almost all ALU operations are handled by the 2 EX pipelines, except some LEA instructions which also utilize AGU. Thus the integer core can execute only up to 2 x86 instructions per clock cycle, resulting in a maximum integer IPC of 2.0 (in units of x86 instructions). Note however this estimate does not include the computing throughput of the integer SIMD pipelines in the FPU.

The FPU scheduler can dispatch up to four 128-bit operations with the following combinations: (1) any of {FMUL, FADD, FMAC, FCVT, IMAC}; and (2) any of {FMUL, FADD, FMAC, Shuffle, Permute}; and (3) any of {AVX, MMX, ISSE}; and (4) any of {AVX, MMX, ISSE, FSTORE}.

From a layman's viewpoint, the shared FPU seems to offer only half the throughput of two K10 cores for independent FMUL and FADD operations. However, in previous Opteron, vectorized loads and stores also share the FMUL and FADD pipelines; in Bulldozer, vectorized loads are either "free" or handled by the IMMX pipelines. Note that when FPU is throughput bottleneck, each arithmetic operation should be paired with on average one load or store. A perhaps more significant overhead saving comes from various vectorized register moves which can now be dispatched concurrently to separate IMMX pipelines. Thus the shared FPU in Bulldozer is actually a very balanced design.

Changes to L1 data cache: (2.5.2)

Size reduced from 64kB to 16kB
Associativity increased from 2-way to 4-way
Number of banks increased from 8 to 16 banks
Load-to-use latency increased from 3 to 4 cycles
Access policy changed from write-back to write-through

The L1D cache seems to go through an almost complete overhaul in Bulldozer. In previous AMD Opteron the L1D cache is virtually indexed and physically tagged; this allows the cache size to be greater than (page_size)*(associativity) without the homonym and synonym problems. On the other hand, this also means every cache hit must be subject to TLB hit.

In Bulldozer, the L1D cache size is (page_size)*(associativity) = 4kB * 4 = 16kB. As such, it is possible that the L1D cache is now virtually tagged which would put the DTLB access out of the critical loop. While this limits the maximum cache size to 16kB, it can offer clock rate and power advantage.

Limiting the cache size, however, does not solve the synonym problem where two cores in a Bulldozer module map different virtual address to the same physical address. Inconsistency can occur when the two cores update contents in their (virtually tagged) data cache separately. This problem, however, can be solved by writing through to the physically tagged shared L2D cache.

Changes to L2 and L3 caches:

L2 cache is now a "mostly inclusive" cache (2.5.3)
L2 cache latency increases to 18 ~ 20 cycles from previous 12 (=9+3) cycles
L3 cache is logically partitioned into sub-caches each up to 2MB (2.5.4)

The "mostly inclusive" property of the L2 cache in Bulldozer is a direct consequence of the write-through policy of the L1D cache. Any cache line that has been modified in an L1D cache will also have a copy in the L2 cache. On the other hand, when there is L1D/L2 cache miss and L3 cache hit, a cache line is copied from L3 cache directly to L1D cache (same behavior as in K10), making the L2 cache not fully inclusive. Similar behavior applies to the memory prefetch instructions which copy cache lines directly to L1D. On the other hand, "cold" data are probably loaded to both L1D and L2 caches to take advantage of the sharing of L2 by both cores (different from K10), which could explain the "mostly" inclusive description to the L2 cache.

The L2 cache latency in K10 is 9 cycles beyond the (3-cycle) L1 cache access, or a total 12 cycles. In Bulldozer, the L2 cache latency is increased to 18 ~ 20 cycles; the greater value is probably for writes, or for L1D TLB miss. The increased latency shows Bulldozer core designed more as thinner and faster (higher clock rate) than wider and shorter (higher ILP).

On load-store unit and memory disambiguation:

40-entry load queue and 24-entry store queue in LSU

The load-store unit (LSU) seems to be very similar to the one in K10. Both utilizes two queues, one primarily for pending loads and one exclusively for pending stores. There have been claims that Bulldozer offers better out-of-order loads to stores than K10. From the high-level point of view of the LSU, the only "major" difference is perhaps the use of virtual address for tagging the L1D cache in Bulldozer(?), but physical address in K10. Tagging L1D with virtual addresses may allow pending stores to retire sooner to L1D without being subject to any TLB miss latency, thus resolving store-to-load dependency faster. Otherwise, according to Section 6.3 of the software optimization guides, the same restrictions on store-to-load forwarding apply to both Bulldozer and K10.

There has been many claims (mostly from people outside of AMD?) that Bulldozer must offer some "memory disambiguation" similar to Core 2 or Nehalem. From the organization of Bulldozer's integer and load-store pipelines, which resemble K10 more than Core 2, AMD would have to use very different memory disambiguation mechanisms than Intel. The concept of memory disambiguation is actually simple: a memory access can be ambiguous when its target address is unknown. Once the address is known, then disambiguation (within the same process) can be performed by simply comparing the addresses.

Suppose there is a store to an address A specified by a memory reference M. If M is not in cache, then the store can be pending for a long time waiting for A (at address M) to come. During that time, all later (independent) loads are ambiguous because any of their addresses could be the same as A (which is yet unknown). Similarly, there can be memory access ambiguity for stores following a load from A, or stores following a store to A.

One disambiguation that can be done is to predict which of the later memroy accesses are to addresses that overlap with A. All those that are predicted not overlapping proceed speculatively, and have their results (and all those they affected) squashed if later A is found to overlap with their access addresses. Note, however, that such disambiguation cannot be performed by the LSU if the LSU receives load-store requests with known addresses. It seems to be the case in both K10 and Bulldozer where the LSU works as a backend to the reservation stations.

Is it worth it to allow ambiguous memory access requests to be sent speculatively to Bulldozer's LSU? I think it requires detailed analysis and simulation to know for sure. The software optimization guide does not tell us whether such a design is used in Bulldozer. (Note that a more "severe" type of memory disambiguation may be needed for Intel Nehalem where two processes can share the same LSU, where different virtual memory mapping can create extra memory reference ambiguity.)

Changes to other memory resources (hardware prefetch and write combining):

Hardware pretech to both L1 and L2 (prefetch instruction still to L1 only, 6.5)
Stride L1 prefetcher with up to 12 pretech patterns
"Region" L2 prefetcher for up to 4096 streams or patterns
4KB 4-way WCC plus a (single?) 64-byte 4-entry WCB (?) WCB (A.5)

Due to the much smaller size of L1D in Bulldozer, it is reasonable to expect hardware prefetch to be less aggressive at L1D. Instead, part of the "aggressiveness" is transferred to the large and shared L2 cache. Although less aggressive, the prefetch mechanism is much more sophisticated, keeping multiple (12) prefetch patterns active at the same time.

A special design in Bulldozer is the addition of a 4KB 4-way associative write coalescing cache (WCC) for aggregating write-back (WB) memory writes (before committing them to L2?). This special "write cache" is inclusive with the L2 cache, and has its contents universally visible. It is unclear whether there is one WCC per core or one per module, although the former seems more plausible.

One of the design goals of WCC is probably to improve inter-core data transfer. Previously in K10, if core1 needs to send something to core2, the cache line containing the data must be (a) modified in core1's L1D, (b) evicted from core1's L1D to its L2, then (c) transferred from core1's L2 to core2's L1D. In Bulldozer, since every write to L1D also writes through to the WCC, steps (b) can be omitted and step (c) can be performed together with updating the L2 cache. Even less overhead is incurred if the data transfer occurs between two cores in the same module that share the L2 cache.

The WCC also acts as a write buffer for the write combining buffer (WCB) for streaming loads and write combine memory type. This can have other implications on the memory ordering requirement by the AMD64 execution model, which we will not touch upon here.

Bulldozer seems to have less write-combining resource per core for streaming stores and write combining memory type than K10. Performance "caveat" was mentioned for streaming store instructions in Section 6.5 of the software optimization guide, where writing >1 streams of data with streaming stores results in much less performance compared with K10. It appears, although unclear, that Bulldozer has a (single?) 64-byte 4-entry (sharing the 64 bytes? each having 64 bytes?) write combining buffer (per core?). K10 and even the later K8 revisions have 4 independent 64-byte WCBs per core. One explanation is that modern processors have more cores and thus fewer occasions to store multiple independent data streams per core. With only one stream of streaming stores, the performance in Bulldozer is still comparable to that in K10.

On the other hand, by beefing up the write-combining resource for write-back & temporal stores with the WCC, common memory writes are made much more efficient. Make the common case fast -- a rule of thumb in microarchitecture design!

~~

A Journey in Modern Computer Architectures

Thursday, April 28, 2011

The Battle between Netbook and AMD Fusion

Friday, April 15, 2011

AMD tapes out 28nm Wichita, Intel shows new Atom and peeks 32nm shrink

Wednesday, April 13, 2011

Supid is as stupid does

Sunday, April 10, 2011

First look at AMD Family 15h (Bulldozer) Software Optimization Guide

About Me

Blog Archive

Labels