A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.

Thursday, September 02, 2010

The IPC Myths

While Instruction Per Cycle (IPC) is an important metric for program optimization, it has been misused in many contexts. Below are a few common examples:
  • IPC can be used to describes how good a CPU is.
  • IPC is roughly proportional to pipeline width of the CPU.
  • IPC of modern CPUs are high (>>1).
  • Amdahl's law says CPU with higher IPC will have higher single-threaded performance.
  • ... 


Myth #1: IPC described as a single value

A common problem of all the "statements" above is that they all refer to IPC as if it is some intrinsic property determined by the CPU microarchitecture. In fact, IPC is a property determined not just by the CPU, but more by the program from algorithm down to instruction scheduling. For example, it is very possible for a CPU1 to have higher IPC than CPU2 running program A, but lower IPC running program B.

Thus, saying "CPU1 has higher (or lower) IPC than CPU2" has to be inaccurate, especially when the two processors have different microarchitectures.

Myth #2: Higher IPC means better

Many people believe higher IPC means higher (single-thread) performance. This is as wrong as when people thought higher clock rate means higher performance. Still, many believe higher IPC is better because the CPU can run as fast with slower clock rate. This seems an over-reaction to the Pentium 4, which had very high clock rate but moderate performance compared to Athlon64/Opteron.

The problem with this type of thinking is that the relation between IPC and clock rate is really a tradeoff. Like any tradeoff relation, you don't get optimal results by sliding towards either edge. With microarchitecture and circuit-level advancements, both clock rate and/or IPC can be increased. Which one to improve should depend on the design and application of the processor, and it's definitely not always (not even usually) IPC.

Myth #3: IPC is proportional to CPU pipeline width

We see many arguments like below on the Internet--
  • Core 2 can issue up to 4 x86 instructions per cycle, so it should have an IPC close to 4.
  • Nehalem brings [this or that features] to circumvent the decode limit, so it's IPC is 25% or 33% higher than Core 2.
  • K10 (AMD Family 10h) can only decode 3 x86 instructions per cycle, so its IPC has "bottleneck" at the instruction decode.
None of these statements is correct. It's not that the conclusion of these statements are absolutely false, but that their reasoning does not hold water. The best we can say about them is that without profiling or cycle-accurate simulation, we simply don't know.

In the case of Core 2 and Nehalem, we actually know for sure that the statements above are false. IPC of Core 2 Duo running SPEC CPU2006 was measured in this paper. The values were between 0.4 to 1.8 among all sub-benchmarks, with average only around 1.0, no where near its 4-way decoder width.

If we compare actual SPECint measurements of Core 2 (22.6) with Nehalem (25.1 or 27.8), we see that Nehalem has 11% to 23% higher single-thread performance after taking into account potentially 20% turbo frequency. Thus Nehalem's IPC for SPECint is at most ~20% higher than Core 2, and most likely much less when exclude the turbo mode effect. In other words, if Core 2's IPC for SPECint sub-benchmarks were 0.4~1.8, then Nehalem's should be between 0.5~2.1. Both are far below what is implied by their 4-way pipelines or any sexy-sound marketing features.

Myth #4: Amdahl's law favors CPU designed for higher IPC

This is the strangest argument that I have seen on the Internet, because it is completely the opposite of truth. The main thing that Amdahl's law says is that performance improvement is intrinsically limited by the available parallelism in a program. In the context of single-threaded programs, this means that performance at the same clock rate is limited by the Instruction-Level Parallelism (ILP) available in the program.

Some people see that "limited by the ILP" part and immediately relate it to a CPU designed for higher IPC. The problem here is that, according to Amdahl's law, the ILP is limited by the program, not the CPU. In other words, if your program has low ILP, it will not run fast no matter how high an IPC the CPU was designed for. Thus in fact Amdahl's law favors a CPU designed for higher clock rate but lower IPC than the available ILP in the program.

Furthermore, the available ILP in a program is also a strong function of the window size and the branch prediction accuracy. Both are very difficult to increase in the uber-complex microarchitectures of modern CPUs. That is why features such as SIMD (SSE and AVX), SMT, and turbo frequency are used in Nehalem to improve single-thread processor performance. None of them increases IPC of the CPU.

Conclusion

IPC is very useful when one wants to optimize his program for a particular system. It is one of the most important metrics that profiling produces. But like any metric, generalizing its implication outside of its intended usage context is usually meaningless and even misleading.
Please Note: Anonymous comments will be read and respected only when they are on-topic and polite. Thanks.