A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Sunday, April 10, 2011

First look at AMD Family 15h (Bulldozer) Software Optimization Guide

NOTE: If you only want to know whether AMD would K.O. Intel or the other way around, or if you believe technical discussions are nonsense while Internet rumors are gold, then please stay away. OTOH, if you like computer architecture and feel excited about state-of-the-art designs, please enjoy and let me know what you think (thanks)!


Updates --
* 4/13/2011 Updated with discussion on load-store unit and memory disambiguation.
* 4/12/2011 Updated with highlights on shared frontend and changes to other memory resources.


Prelude

AMD recently released the software optimization guide for its upcoming & most anticipated family 15h (Bulldozer) processors. In this article we take a high-level comparative look at the newly released document.

The new processor family features a revolutionary "cluster multi-threading" (CMT) architecture, where a processor consists of multiple modules, each being a cluster of two cores sharing the same instruction frontend, floating-point unit and level-2 cache. Newly supported ISA extensions include the 128-bit SSE4 and 128 & 256-bit AVX, XOP and FMA4.

Despite these major differences, Bulldozer is fundamentally a continuation of AMD's previous processor designs. It is perhaps more useful to first mention some similarities between Bulldozer and the previous family 10h (K10) processors before going into the details of the differences:
  • The same (or very similar) macro-op and micro-op based instruction decode is utilized.
  • Similar register file superforwarding.
  • Same L1I cache, very similar L3 cache and system interconnect architecture are used.
  • Similar pick-pack instruction decode in a 32-byte window.
  • Loads and stores still seem to be performed in the load-store unit, which works as a backend to the integer core and FPU, rather than being scheduled directly in reservation stations.
  • The shared FPU design in Bulldozer has its roots in the separate integer and FPU schedulers of K10.
  • Same or very similar microarchitecture for indirect branch (512-entry target array) and return address (24-entry return address stack) prediction.
That said, below we discuss some (not all!) of the major microarchitecture differences introduced in Bulldozer: shared frontend, execution pipelines, L1D and L2/L3 caches, and memory access resources. 


Highlights on the shared frontend:
  • Two 32-byte instruction fetch windows (one for each core? 1.6.4)
  • Fetch window tracking structure (to manage fetches for both cores? 2.6)
  • Hybrid (tournament) branch prediction with global and local branch predictors
  • 2-level BTB with 512+5120 entries, upped from 1-level 2048 entries
  • Instructions decoded from a 32-byte window or two 16-byte windows (for both cores? 2.7)
  • Introduce branch fusion
Instruction fetch and branching are greatly improved in Bulldozer. A more sophisticated conditional branch prediction scheme is employed, utilizing a local predictor, a global predictor and a tournament selector. The branch target buffer (BTB) grows to more than 2.5 times its previous size.

Note that although a single frontend serves two cores, the same branch prediction information can be shared by both cores if they execute the same program. Even if the two cores run different programs, sharing the same instruction fetch and branch prediction resources can still benefit latency hiding, especially for non-optimized and densely branching codes.

When the stars align (instruction allocation is optimized and the code has pre-decode information), the frontend can decode up to 4 macro-ops from a 32-byte window per cycle for one core. Otherwise, a 16-byte window is scanned to find instruction boundaries, presumably yielding fewer than 4 decodes per cycle. It is unclear whether in such cases one 16-byte window can be scanned for each core, thus still maintaining 32 bytes of decode (across both cores) per cycle. Note that it takes at least twice the time to scan an instruction window twice as large, but two instruction windows of the same size can always be scanned concurrently by parallel resources, if available.

The branch fusion seems similar to Intel's macro-op fusion. It has limited applicability but would make Bulldozer more competitive for running Intel-optimized codes.


Highlights on the execution pipelines:
  • 4-way microarchitecture design
  • Integer core has two EX and two AGLU pipelines, plus an LSU (2.10.2)
  • Floating-point unit (FPU) has two FMAC and two IMMX pipelines (2.11)
Up to 4 macro-ops per clock cycle can be issued from the (shared) frontend to either of the two cores. Within each core, up to 4 macro-ops per clock cycle can be sent to either the integer scheduler or the floating-point scheduler.

The integer scheduler can dispatch up to 4 micro-ops per cycle, one to each of the 4 pipelines. Almost all ALU operations are handled by the 2 EX pipelines, except for some LEA instructions which also utilize the AGLU pipelines. Thus the integer core can execute only up to 2 x86 instructions per clock cycle, for a maximum integer IPC of 2.0 (in units of x86 instructions). Note, however, that this estimate does not include the computing throughput of the integer SIMD pipelines in the FPU.

The FPU scheduler can dispatch up to four 128-bit operations with the following combinations: (1) any of {FMUL, FADD, FMAC, FCVT, IMAC}; and (2) any of {FMUL, FADD, FMAC, Shuffle, Permute}; and (3) any of {AVX, MMX, ISSE}; and (4) any of {AVX, MMX, ISSE, FSTORE}.

From a layman's viewpoint, the shared FPU seems to offer only half the throughput of two K10 cores for independent FMUL and FADD operations. However, in the previous Opteron, vectorized loads and stores also share the FMUL and FADD pipelines; in Bulldozer, vectorized loads are either "free" or handled by the IMMX pipelines. Note that when the FPU is the throughput bottleneck, each arithmetic operation should be paired with, on average, one load or store. A perhaps more significant overhead saving comes from the various vectorized register moves, which can now be dispatched concurrently to separate IMMX pipelines. Thus the shared FPU in Bulldozer is actually a very balanced design.


Changes to L1 data cache: (2.5.2)
  • Size reduced from 64kB to 16kB
  • Associativity increased from 2-way to 4-way
  • Number of banks increased from 8 to 16 banks
  • Load-to-use latency increased from 3 to 4 cycles
  • Access policy changed from write-back to write-through
The L1D cache seems to have gone through an almost complete overhaul in Bulldozer. In the previous AMD Opteron, the L1D cache is virtually indexed and physically tagged; this allows the cache size to be greater than (page_size)*(associativity) without the homonym and synonym problems. On the other hand, this also means every cache hit must be confirmed by a TLB lookup.

In Bulldozer, the L1D cache size is (page_size)*(associativity) = 4kB * 4 = 16kB. As such, it is possible that the L1D cache is now virtually tagged, which would take the DTLB access off the critical path. While this limits the maximum cache size to 16kB, it can offer clock rate and power advantages.

Limiting the cache size, however, does not solve the synonym problem, where the two cores in a Bulldozer module map different virtual addresses to the same physical address. Inconsistency can occur when the two cores update the contents of their (virtually tagged) data caches separately. This problem, however, can be solved by writing through to the physically tagged, shared L2 cache.


Changes to L2 and L3 caches:
  • L2 cache is now a "mostly inclusive" cache (2.5.3)
  • L2 cache latency increases to 18 ~ 20 cycles from previous 12 (=9+3) cycles
  • L3 cache is logically partitioned into sub-caches each up to 2MB (2.5.4)
The "mostly inclusive" property of the L2 cache in Bulldozer is a direct consequence of the write-through policy of the L1D cache. Any cache line that has been modified in an L1D cache will also have a copy in the L2 cache. On the other hand, when there is an L1D/L2 cache miss and an L3 cache hit, a cache line is copied from the L3 cache directly to the L1D cache (the same behavior as in K10), making the L2 cache not fully inclusive. Similar behavior applies to the memory prefetch instructions, which copy cache lines directly to L1D. "Cold" data, on the other hand, are probably loaded into both the L1D and L2 caches to take advantage of the sharing of L2 by both cores (different from K10), which could explain the "mostly" inclusive description of the L2 cache.

The L2 cache latency in K10 is 9 cycles beyond the (3-cycle) L1 cache access, or a total of 12 cycles. In Bulldozer, the L2 cache latency is increased to 18 ~ 20 cycles; the greater value is probably for writes, or for an L1D TLB miss. The increased latency suggests the Bulldozer core is designed to be thinner and faster (higher clock rate) rather than wider and shorter (higher ILP).


On load-store unit and memory disambiguation:
  • 40-entry load queue and 24-entry store queue in LSU
The load-store unit (LSU) seems to be very similar to the one in K10. Both utilize two queues, one primarily for pending loads and one exclusively for pending stores. There have been claims that Bulldozer reorders loads around stores better than K10 does. From the high-level point of view of the LSU, the only "major" difference is perhaps the use of virtual addresses for tagging the L1D cache in Bulldozer(?), versus physical addresses in K10. Tagging L1D with virtual addresses may allow pending stores to retire sooner to L1D without being subject to any TLB-miss latency, thus resolving store-to-load dependencies faster. Otherwise, according to Section 6.3 of the software optimization guides, the same restrictions on store-to-load forwarding apply to both Bulldozer and K10.

There have been many claims (mostly from people outside of AMD?) that Bulldozer must offer some "memory disambiguation" similar to Core 2 or Nehalem. Given the organization of Bulldozer's integer and load-store pipelines, which resemble K10 more than Core 2, AMD would have to use very different memory disambiguation mechanisms than Intel. The concept of memory disambiguation is actually simple: a memory access can be ambiguous when its target address is unknown. Once the address is known, disambiguation (within the same process) can be performed by simply comparing addresses.

Suppose there is a store to an address A that is specified by a memory reference M. If M is not in cache, then the store can be pending for a long time, waiting for A (at address M) to arrive. During that time, all later (independent) loads are ambiguous, because any of their addresses could be the same as A (which is yet unknown). Similarly, there can be memory access ambiguity for stores following a load from A, or stores following a store to A.

One disambiguation that can be done is to predict which of the later memory accesses are to addresses that overlap with A. All those that are predicted not to overlap proceed speculatively, and have their results (and everything they affected) squashed if A is later found to overlap with their access addresses. Note, however, that such disambiguation cannot be performed by the LSU if the LSU only receives load-store requests whose addresses are already known. That seems to be the case in both K10 and Bulldozer, where the LSU works as a backend to the reservation stations.

Is it worth allowing ambiguous memory access requests to be sent speculatively to Bulldozer's LSU? I think it requires detailed analysis and simulation to know for sure. The software optimization guide does not tell us whether such a design is used in Bulldozer. (Note that a more "severe" type of memory disambiguation may be needed for Intel's Nehalem, where two hardware threads can share the same LSU and different virtual memory mappings can create extra memory reference ambiguity.)


Changes to other memory resources (hardware prefetch and write combining):
  • Hardware prefetch to both L1 and L2 (prefetch instructions still to L1 only, 6.5)
  • Stride L1 prefetcher with up to 12 prefetch patterns
  • "Region" L2 prefetcher for up to 4096 streams or patterns
  • 4KB 4-way WCC plus a (single?) 64-byte 4-entry WCB (?) (A.5)
Due to the much smaller size of L1D in Bulldozer, it is reasonable to expect hardware prefetch to be less aggressive at L1D. Instead, part of the "aggressiveness" is transferred to the large and shared L2 cache. Although less aggressive, the prefetch mechanism is much more sophisticated, keeping multiple (12) prefetch patterns active at the same time.

A special design in Bulldozer is the addition of a 4KB 4-way associative write coalescing cache (WCC) for aggregating write-back (WB) memory writes (before committing them to L2?). This special "write cache" is inclusive with the L2 cache, and its contents are universally visible. It is unclear whether there is one WCC per core or one per module, although the former seems more plausible.

One of the design goals of the WCC is probably to improve inter-core data transfer. Previously in K10, if core1 needed to send something to core2, the cache line containing the data had to be (a) modified in core1's L1D, (b) evicted from core1's L1D to its L2, and then (c) transferred from core1's L2 to core2's L1D. In Bulldozer, since every write to L1D also writes through to the WCC, step (b) can be omitted and step (c) can be performed together with updating the L2 cache. Even less overhead is incurred if the data transfer occurs between two cores in the same module that share the L2 cache.

The WCC also acts as a write buffer in front of the write combining buffer (WCB) used by streaming stores and the write-combining memory type. This can have other implications on the memory ordering requirements of the AMD64 execution model, which we will not touch upon here.

Bulldozer seems to have fewer write-combining resources per core for streaming stores and the write-combining memory type than K10. A performance "caveat" is mentioned for streaming store instructions in Section 6.5 of the software optimization guide: writing more than one stream of data with streaming stores yields much lower performance than on K10. It appears, although it is unclear, that Bulldozer has a (single?) 64-byte 4-entry (sharing the 64 bytes? each having 64 bytes?) write combining buffer (per core?). K10 and even late K8 revisions have 4 independent 64-byte WCBs per core. One explanation is that modern processors have more cores and thus fewer occasions to store multiple independent data streams per core. With only one stream of streaming stores, performance in Bulldozer is still comparable to that of K10.

On the other hand, by beefing up the write-combining resources for write-back & temporal stores with the WCC, common memory writes are made much more efficient. Make the common case fast -- a rule of thumb in microarchitecture design!

~~

10 comments:

Anonymous said...

In other words and performance-wise, seems like Bulldozer sucks if it's comparable to the old K10 core which is no match to Intel's old nehalem core.

What a disappointment.

abinstein said...

I don't think you can derive that conclusion from my article. Can you tell us where you got that idea?

Personally I think Bulldozer is going to be a strong, balanced and power efficient general purpose processor. It is optimized for high clock rate, adequate ILP, balanced integer/floating-point/vector throughput, and multi-processing.

I think it's more than competitive to Intel's Sandy Bridge.

nick black said...

superb extraction work -- thanks tremendously for such a detailed article!

nick black said...

what a great year this has been already for μarchitecture. between sandy bridge, bulldozer, tegra, and the upcoming ivy bridge and kepler (all of these more than incremental updates or scalings of their predecessors!), it's the most exciting time i can remember. thread-level parallelism's sudden ubiquity, combined with the power wall's grim visage, topped off with NUMA considerations yield a thrilling ride for architects and programmers alike. delightful!

abinstein said...

@nick: Thanks for your comment. Indeed, these are exciting years. The microprocessor industry is maturing (in a good way) where many of the research ideas proposed in the past 2 decades are now put into practice.

Higher clock rate, integration and ILP have been the driving force of microarchitecture for the past decade. Both Bulldozer and Sandy Bridge, however, seem at the "inflection point" of this trend already (if not have passed it). Going forward, I think more specialization with hybrid or heterogeneous architectures is going to offer performance where performance is needed and reduce power where not.

Anonymous said...

[mmarq]

actually i believe that Intel scheme.. he!(well not exactly sucks .. but).. its not the best that could be implemented.

Intel method is a "memory renaming" scheme and has separated load and store queues unified in a single buffer. The separation (guess) its logic is not exactly "physical".

And its so because the method has a relative good performance with "smaller windows" than other more complex ones.. and kind of its dependable, but just scales "horribly". Much simpler Address prediction, stride based, could be about the same with less hardware cost.

Join in store to load in the same structure, like Intel, and its like a register mov.. but snooping a larger complex buffer for memory operands can only incur on heavier latencies(clock cycle suffers).

Intel is good, but this definitely is not their best performance.. only AMD had none at the time..

A good test
A Comparative Survey of Load Speculation Architectures
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.8092&rep=rep1&type=pdf

shows [page 20] what i've been saying.. a stride based mechanism (not only for prefetch) for "address" and "value" could beat Intel hands down specially if you join in dependence prediction... and at the same smaller windows

Its ALL this in BD I.. i don't think so, but since stride prefetch and stride "Address prediction" are inexorably inter-related i'm convinced that it has this late one.. along with a better dependency mechanisms(even a crude dependency relief) and it could beat Intel, at the same, with less hardware cost.

abinstein said...

@mmarq: Thanks for the information. I think the type of "memory disambiguation" that my article was talking about is more similar to the dependence prediction in the paper that you referenced.

Anonymous said...

[mmarq]
Yes, having already a complex "buffer" for memory operations its not difficult to imagine that Intel might employ some form of "memory dependency predictor" too.

But the point is that "stride address prediction" with a "memory dependency prediction" will provide the same benefits, if not better, than "memory renaming" with "memory dependency prediction"... AND certainly with less hardware cost and more clock cycle friendliness.

Is this on BD I... we don't know, neither we know in the case of Intel.. i think!..

But AMD has OLD patents about "memory dependency prediction"

Store load forward predictor untraining - Patent 6651161
http://www.freepatentsonline.com/6651161.html

Store load forward predictor training - Patent 6694424
http://www.freepatentsonline.com/6694424.html

Processor with dependence mechanism to predict whether a load is dependent on older store - Patent 7415597
http://www.freepatentsonline.com/7415597.html

Anonymous said...

[mmarq]

BTW, Intel uses "primarily" since the first "core" design and specially from "core2" a method that it labels "dynamic memory disambiguation" or "dynamic memory renaming"... so its a "memory renaming scheme".. primarily.. of course it doesn't invalidate adding more mechanism on top of that

This paper gives a glimpse
Dynamic Memory Disambiguation Using the Memory Conflict Buffer
http://impact.crhc.illinois.edu/ftp/conference/asplos-94-buffer.pdf

this paper from scientists that belong to the "haifa" team that designed "core2" discusses all those methods
Dynamic Techniques for Load and Load-Use Scheduling
http://www.weblearn.hs-bremen.de/risse/RST/docs/Intel/roth-loadsched.pdf

pwrntspd said...

Honestly right now i just want to see how it clocks. It should perform at least as well as K10, but what sorts of speeds are we looking at?

Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.