A Journey in Modern Computer Architectures: 2010

Friday, December 31, 2010

AMD Bobcat Fusion APU -- A Big Deal?

AMD has been enthusiastic and optimistic about its upcoming Fusion Accelerated Processing Unit (APU) based on the Bobcat cores, set for launch at next year's (really less than one week from now) International Consumer Electronics Show (CES) in Las Vegas. It even made a supposedly humorous video on YouTube, showing its main competitor spying on, and being astonished by, AMD's "Fusion technology".

Is the Fusion APU really a big deal and, if so, in what sense? Will it really revolutionize personal computing as claimed by AMD?

The Facts

We already know the performance bound of these Fusion APUs, straight from AMD: compared to current CPU designs, the Bobcat core will achieve 90% of the performance in 50% of the die area. So a 1.6GHz Bobcat core will have performance comparable to a 1.4GHz Turion, definitely not a stellar specification. In fact, the APU's performance has been previewed and shown to be comparable to Intel's CULV CPU + nVidia's ION GPU.
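The arithmetic behind that comparison can be sketched roughly as follows. This assumes (my reading, not something AMD states in this form) that the 90% figure is per-clock performance relative to a current core, and that performance scales linearly with clock rate; the function name is just for illustration.

```python
def equivalent_clock_ghz(clock_ghz, relative_per_clock_perf):
    """Clock at which the reference core would match the given core.

    Assumes performance scales linearly with clock rate (a simplification).
    """
    return clock_ghz * relative_per_clock_perf


# 90% of the performance per clock, per AMD's claim quoted above
bobcat_equiv = equivalent_clock_ghz(1.6, 0.90)
print(f"1.6 GHz Bobcat ~ {bobcat_equiv:.2f} GHz reference core")

# 90% of the performance in 50% of the area
perf_per_area = 0.90 / 0.50
print(f"Relative performance per unit die area: {perf_per_area:.1f}x")
```

Under those assumptions the result lands near the 1.4GHz figure, and the same claim implies roughly 1.8x the performance per unit of die area.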

The more impressive part is perhaps that the APU has both the CPU and GPU sitting on the same die, sharing the same system interface and 18W power envelope. Thus from the performance perspective, the APU is much better than Intel's Atom processor (which powers most of the current low-cost netbooks), while from the power and cost perspective, the APU is much better than Intel CULV + nVidia ION. So the whole point of these Fusion APUs is really not about better performance (in both processing speed and power), but to reach a "better" power-performance tradeoff, i.e., performance-per-watt.

The Advantage

But is this power-performance tradeoff the real "advantage" of the Fusion APU, one that is unreachable by other players? I highly doubt it. For example, if one combines Intel's Yonah and nVidia's ION2 and manufactures them on Intel or TSMC 32nm, the same level of performance-per-watt could very well be reached.

However, even if Intel and nVidia work together, such a product probably won't make money for Intel, due to all the redesign effort required and the erosion of Intel's existing products. So IMHO one critical "advantage" that AMD has with APUs is that AMD's current market share in low-power laptops is so small that it doesn't have to worry about cannibalization when releasing cheaper products. Intel OTOH doesn't want to replace its existing laptops with lower-performance, cheaper ones. Instead it designed Atom to target the smartphone and tablet markets. It makes sure there's a significant performance gap between Atom and Core i3 so the two markets are well separated.

The Extra

Hardware is only part of the story. By combining CPU and GPU closely together, every laptop based on AMD's Fusion APU becomes DirectCompute and OpenCL capable. Such "universal" GPGPU availability makes GPGPU acceleration a viable choice for software developers, which in turn makes these Fusion APUs better products (since more programs will be optimized for the CPU+GPU package). OpenCL, which came along sometime in 2008, is an industry standard that replaced the original ATI Stream. Kernel programming in OpenCL is also very similar to that in nVidia CUDA, making OpenCL a fine choice for developers who are looking to use, or are already taking advantage of, GPGPU.

The "Better" Product?

However, even with GPGPU acceleration, an 18W APU still won't achieve stellar performance. Do you really believe an 18W TDP can translate into a personal supercomputer, artificial intelligence and an immersive 3D interface? What would be more interesting instead is the Fusion APU with the Bulldozer CPU core and the "Southern Islands" GPU. That plus OpenCL could really be revolutionary in terms of software acceleration. But that plan, first disclosed by AMD in 2007, has been delayed until at least 2012/2013. Instead, Bobcat-based low-end Fusion APUs have come to fill the void for the next 1 to 2 years.

While the current Fusion APU was not in AMD's original plan, with some irony it is probably a "better" product than originally planned. Why? Because believe it or not, most laptop users really don't need higher CPU performance! Most people will be quite happy with a dual-core 1.6GHz computer which they use mostly for e-mail and web surfing. The good graphics offered by these APUs are just a sweetener.

So what AMD does with the Bobcat-based APU is push down CPU+GPU prices and power budgets so laptop makers can give us better other stuff, such as longer battery life, a better webcam, and faster WiFi/3G/4G. And although this Fusion APU will reduce CPU+GPU ASPs and will hurt high-end laptop sales, AMD has little to lose in those areas anyway. :-)

Thursday, September 02, 2010

The IPC Myths

While Instructions Per Cycle (IPC) is an important metric for program optimization, it has been misused in many contexts. Below are a few common examples:

IPC can be used to describe how good a CPU is.
IPC is roughly proportional to the pipeline width of the CPU.
IPC of modern CPUs is high (>>1).
Amdahl's law says a CPU with higher IPC will have higher single-threaded performance.
...

Myth #1: IPC described as a single value

A common problem of all the "statements" above is that they all refer to IPC as if it were some intrinsic property determined by the CPU microarchitecture. In fact, IPC is determined not just by the CPU, but even more by the program, from the algorithm down to instruction scheduling. For example, it is entirely possible for CPU1 to have higher IPC than CPU2 when running program A, but lower IPC when running program B.

Thus, saying "CPU1 has higher (or lower) IPC than CPU2" has to be inaccurate, especially when the two processors have different microarchitectures.
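A toy illustration of that point, with IPC values invented purely for the example (they are not measurements of any real processor, and the names `ipc` and `higher_ipc_cpu` are mine):

```python
# Hypothetical IPC values per (CPU, program); made up for illustration only.
ipc = {
    "CPU1": {"A": 1.6, "B": 0.7},
    "CPU2": {"A": 1.2, "B": 0.9},
}


def higher_ipc_cpu(program):
    """Return the CPU with the higher IPC on the given program."""
    return max(ipc, key=lambda cpu: ipc[cpu][program])


for program in ("A", "B"):
    print(f"Program {program}: {higher_ipc_cpu(program)} has the higher IPC")
```

With numbers like these, the answer to "which CPU has the higher IPC?" flips with the workload, so a single IPC figure per CPU would hide exactly the information that matters.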

Myth #2: Higher IPC means better

Many people believe higher IPC means higher (single-thread) performance. This is as wrong as when people thought a higher clock rate meant higher performance. Still, many believe higher IPC is better because the CPU can run as fast with a slower clock rate. This seems an over-reaction to the Pentium 4, which had a very high clock rate but moderate performance compared to Athlon64/Opteron.

The problem with this type of thinking is that the relation between IPC and clock rate is really a tradeoff. Like any tradeoff relation, you don't get optimal results by sliding towards either edge. With microarchitecture and circuit-level advancements, clock rate, IPC, or both can be increased. Which one to improve should depend on the design and application of the processor, and it's definitely not always (not even usually) IPC.
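The tradeoff can be made concrete with the usual first-order relation, runtime = instructions / (IPC x clock rate). The sketch below uses hypothetical design points (the instruction count, IPC values and clock rates are all assumptions chosen for round numbers):

```python
def runtime_seconds(instructions, ipc, clock_hz):
    """First-order runtime model: instructions / (IPC * clock rate)."""
    return instructions / (ipc * clock_hz)


INSTRUCTIONS = 3e9  # hypothetical dynamic instruction count

# Hypothetical "wide but slow" and "narrow but fast" designs
wide_slow = runtime_seconds(INSTRUCTIONS, ipc=1.5, clock_hz=2.0e9)
narrow_fast = runtime_seconds(INSTRUCTIONS, ipc=1.0, clock_hz=3.0e9)

print(f"wide/slow:   {wide_slow:.2f} s")
print(f"narrow/fast: {narrow_fast:.2f} s")
```

In this model both designs finish in the same time; only the product of IPC and clock rate matters, which is why neither factor alone says which design is "better".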

Myth #3: IPC is proportional to CPU pipeline width

We see many arguments like the ones below on the Internet:

Core 2 can issue up to 4 x86 instructions per cycle, so it should have an IPC close to 4.
Nehalem brings [this or that feature] to circumvent the decode limit, so its IPC is 25% or 33% higher than Core 2.
K10 (AMD Family 10h) can only decode 3 x86 instructions per cycle, so its IPC has "bottleneck" at the instruction decode.

None of these statements is correct. It's not that the conclusions of these statements are absolutely false, but that their reasoning does not hold water. The best we can say about them is that without profiling or cycle-accurate simulation, we simply don't know.

In the case of Core 2 and Nehalem, we actually know for sure that the statements above are false. IPC of Core 2 Duo running SPEC CPU2006 was measured in this paper. The values were between 0.4 and 1.8 among all sub-benchmarks, with an average of only around 1.0, nowhere near its 4-way decoder width.

If we compare actual SPECint measurements of Core 2 (22.6) with Nehalem (25.1 or 27.8), we see that Nehalem has 11% to 23% higher single-thread performance, a figure that includes the effect of a potential 20% turbo frequency boost. Thus Nehalem's IPC for SPECint is at most ~20% higher than Core 2's, and most likely much less once the turbo mode effect is excluded. In other words, if Core 2's IPC for SPECint sub-benchmarks was 0.4~1.8, then Nehalem's should be between 0.5~2.1. Both are far below what is implied by their 4-way pipelines or any sexy-sounding marketing features.
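For readers who want to check the arithmetic, here is a small sketch using only the numbers quoted above. The ~20% cap on the IPC gain is the rounded upper bound used in the text; the variable names are mine.

```python
core2_specint = 22.6
nehalem_specint = (25.1, 27.8)

# Relative single-thread gain of Nehalem over Core 2 (turbo effect included)
gains = [score / core2_specint - 1 for score in nehalem_specint]
print("Gains: " + ", ".join(f"{g:.0%}" for g in gains))

# Scale Core 2's measured IPC range by the ~20% upper bound on the IPC gain
ipc_gain_cap = 0.20
core2_ipc_range = (0.4, 1.8)
nehalem_ipc_range = tuple(x * (1 + ipc_gain_cap) for x in core2_ipc_range)
print(f"Nehalem IPC upper-bound range: "
      f"{nehalem_ipc_range[0]:.2f} to {nehalem_ipc_range[1]:.2f}")
```

The scaled range comes out close to the 0.5~2.1 quoted above, and still well under 4.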

Myth #4: Amdahl's law favors CPU designed for higher IPC

This is the strangest argument that I have seen on the Internet, because it is completely the opposite of the truth. The main thing that Amdahl's law says is that performance improvement is intrinsically limited by the available parallelism in a program. In the context of single-threaded programs, this means that performance at the same clock rate is limited by the Instruction-Level Parallelism (ILP) available in the program.

Some people see that "limited by the ILP" part and immediately relate it to a CPU designed for higher IPC. The problem here is that, according to Amdahl's law, the ILP is limited by the program, not the CPU. In other words, if your program has low ILP, it will not run fast no matter how high an IPC the CPU was designed for. Thus in fact Amdahl's law favors a CPU designed for higher clock rate but lower IPC than the available ILP in the program.
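A minimal sketch of the textbook form of Amdahl's law, speedup = 1 / ((1 - p) + p / s), helps show why. Mapping it onto ILP this way is a simplification on my part: `p` stands in for the fraction of work that can exploit extra width and `s` for the width, and the value 0.5 is hypothetical.

```python
def amdahl_speedup(parallel_fraction, speedup_of_parallel_part):
    """Textbook Amdahl's law: overall speedup when only a fraction of
    the work benefits from a speedup of the given factor."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / speedup_of_parallel_part)


p = 0.5  # hypothetical fraction of the program that can use extra width
for width in (2, 4, 8, 16):
    print(f"width {width:2d}: speedup {amdahl_speedup(p, width):.2f}x")
```

With `p = 0.5` the speedup can never reach 2x no matter how wide the machine gets, and going from width 4 to width 8 adds very little. The limit is set by the program, which is the point made above.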

Furthermore, the available ILP in a program is also a strong function of the window size and the branch prediction accuracy. Both are very difficult to increase in the uber-complex microarchitectures of modern CPUs. That is why features such as SIMD (SSE and AVX), SMT, and turbo frequency are used in Nehalem to improve ~~single-thread~~ processor performance. None of them increases IPC of the CPU.

Conclusion

IPC is very useful when one wants to optimize a program for a particular system. It is one of the most important metrics that profiling produces. But like any metric, generalizing its implications outside of its intended usage context is usually meaningless and even misleading.

Wednesday, May 19, 2010

GPGPU and its battle of nVidia vs ATI

GPGPU seems to be really taking off. I came across a new YouTube video showing IBM's new mainstream server using nVidia Tesla graphics cards for compute-intensive acceleration.

For the past few years, AMD/ATI enthusiasts have assumed that Radeon is AMD's crown jewel. The truth might be just the opposite.

At a workshop in a recent conference, an nVidia researcher compared CUDA and OpenCL. His argument (whether true or not) was simple: OpenCL is more of a device-driver-level language. He "proved" it by showing the same program written in CUDA and in OpenCL side-by-side. The CUDA one took about 2 slides. The OpenCL one took about 10. If you are a researcher/programmer/engineer, which one will you use?

It may be true that ATI Evergreen gives higher performance per dollar for games, but nVidia Fermi seems to give better GPGPU performance on average. Evergreen has more parallelism and higher theoretical flops, but Fermi is easier to program and to get real speedup from. Hundreds of universities worldwide are teaching students how to optimize their programs for Fermi. This is a formidable rival and I don't share at all the optimism of many ATI enthusiasts.

I believe in a year or two we will see the market for GPGPU surpassing that of enthusiast graphics. Very few people in the world care about fps when playing games. On the other hand, everyone benefits from GPGPU. I feel that AMD/ATI is too conservative in pushing for GPGPU. Most of their laptop/desktop chipsets still use r700 or even r600 based IGPs, from which it is very hard to get good GPGPU performance, if any at all. Every time I see a laptop with an HD42xx IGP I feel disgusted. They are selling that 2-year-old stuff, which doesn't let users take proper advantage of OpenCL. They sell it for cheap, but is that a good thing? Do they also want to sell r800 IGPs for cheap 2 years from now?

Then there's OpenCL, which everyone's heard of but few are interested in. In my humble opinion, AMD should hire a few programmers to fully integrate OpenCL into their Catalyst driver, so that every computer with an ATI GPU can use it after a simple driver install. In contrast, perhaps for fear of Microsoft or for whatever reason, AMD/ATI want end users to manually install and upgrade every OpenCL release to match the installed Catalyst driver. If I have to go to so much trouble, then why don't I just install CUDA, which more people are using anyway?

And if I'm going with CUDA, then I will not only buy Tesla for my workstation, but also GeForce for my desktop & laptop; there will be less incentive for me to buy an AMD CPU as well. That is really a good way to keep me away from being AMD's customer, isn't it?

Tuesday, April 27, 2010

Stating the facts or bad-mouthing his former employer?

Over at the MacRumors forum, someone claiming to be a former AMD employee recently started to criticize AMD and the people working there. His posts can be seen in the following links: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 (thanks to dm7000s at AMDZone for collecting these links).

In my opinion, it is both "interesting" and "fishy" to see someone do this to his former employer. On one hand, he may be revealing some real problems inside (parts of) the company which other people (either inside or outside AMD) wouldn't know or recognize. On the other hand, he may be right about things that he claims to know, but due to his bitterness wrong about the conclusions.

I believe in this case, it is the latter. At any rate, let's go through some of his points below and see, assuming these are all facts, how true or false they can be:

AMD has not been financially successful since "K8". This may not be due to any problem of AMD's, but to Intel's monopoly tactics. One should ask why AMD was "financially successful" during the K8 days in the first place. Was it because only those who designed K8 knew what they were doing? Was it because they hand-crafted every transistor? Or was it really because both Itanium and Netburst terribly sucked in real-world tests? I'd argue it's only the last.

AMD has been losing key employees. Losing employees is tough for any company. Yet sometimes a company has to lose weight while it is evolving and before it can start growing again. The question is not whether someone did something grand, but whether he will do something grander. What would have been the grander next step after K8? Could AMD have beaten Intel by making an over-complicated "K9" with SMT and turbo mode and everything else? I'd argue that with the required design and verification efforts, no "key employee" could have made this happen in a timely and cost-effective manner.

AMD is not hand-instantiating designs ~~transistors~~ anymore. Anyone (who is an electrical engineer) can hand-craft transistors. At the end of the day it is primarily a labor-intensive task. If you're a CTO and you expect your company to hand-craft transistors better than a competitor ten times your size, then you're not being realistic. You won't win. And with the "unfair" agreements between Intel and AMD prior to their 2009 settlement, AMD was simply forbidden to reap the same amount of profit by selling hand-crafted processors of higher performance. I was informed by a kind reader, who seems to know what was going on inside AMD, that hand-crafting transistors is emphatically what AMD did not do for K8. Unlike Intel, which does a lot of custom design, AMD used standard cells for the ALUs and most other components. However, AMD did a lot of circuit placement and routing by hand, with a superb physical design (implementation) team. Somewhat related to the previous comment, I was also told that many in that team have left AMD over the past few years.

AMD did not make any new architecture after K8. Architecture is a flimsy thing. At its heart, K8 is no more than K7 plus extra 64-bit registers and an integrated NB. The way instructions are broken down into macro-ops and micro-ops, the basic organization of the ROB, and the separate INT and FP schedulers are all the same between K7 and K8. The HyperTransport-based NB, the exclusive L1/L2 cache and the improved TLB gave K8 solid performance. But so did the improvements made to K10, like the shared L3, unganged memory, probe filter and greater scalability. K8's NB was designed to have up to 8P in a single system; few wanted that over the years. Today K10-based Magny Cours processors allow 48 cores with perhaps tighter inter-core communication. I bet Intel very much wants to do the same.

In conclusion.... is that guy simply stating the facts, or is he bad-mouthing his former employer? Personally, I think what he did was immature and immoral, even if what he said were facts. I was told that there have been some political struggles inside AMD during the post-K8 years, and I also believe that such politics must have brought with it some waste of time and money as well as loss of talent. But still.... in my humble opinion, that's no good excuse for picking on your former employer and starting a public brawl.

:mrgreen:
