<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8459791</id><updated>2012-02-16T11:09:05.564-08:00</updated><category term='Core 2'/><category term='Bulldozer'/><category term='memory'/><category term='General'/><category term='SPEC CPU'/><category term='Quad Core'/><category term='K10'/><category term='ISA'/><category term='K8'/><title type='text'>A Journey in Modern Computer Architectures</title><subtitle type='html'>A personal record of understanding, deciphering, speculating and predicting the development of modern microarchitecture designs.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>25</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8459791.post-8586005534669449411</id><published>2011-08-13T08:13:00.000-07:00</published><updated>2011-08-13T08:16:54.337-07:00</updated><title type='text'>HP DV6z Quad 
Ed</title><content type='html'>Got an HP DV6z with AMD A8-3500M (Llano) recently.&amp;nbsp;I'm extremely happy with the 1920x1080 display ($150 upgrade). It is not glossy, but&amp;nbsp;very colorful and bright. If you get an HP DV6z with Llano, BE SURE to spend that $150 for a high resolution display!&lt;br /&gt;&lt;br /&gt;dv6z Quad Ed&lt;br /&gt;• steel gray&lt;br /&gt;• Genuine Windows 7 Home Premium 64-bit&lt;br /&gt;• AMD Quad-Core A8-3500M Accelerated Processor (2.4GHz/1.5GHz, 4MB L2 Cache)&lt;br /&gt;• AMD Radeon(TM) Discrete-Class Graphics [HDMI, VGA]&lt;br /&gt;• FREE Upgrade to 6GB DDR3 System Memory (2 Dimm)&lt;br /&gt;• 640GB 7200RPM Hard Drive with HP ProtectSmart Hard Drive Protection&lt;br /&gt;• 9 Cell Lithium Ion Battery&lt;br /&gt;• 15.6" diagonal Full HD HP Anti-glare LED Display (1920 x 1080)&lt;br /&gt;• FREE Upgrade to Blu-ray player &amp;amp; SuperMulti DVD burner&lt;br /&gt;• HP TrueVision HD Webcam with Integrated Digital Microphone and HP SimplePass Fingerprint Reader&lt;br /&gt;• 802.11b/g/n WLAN and Bluetooth(R)&lt;br /&gt;&lt;br /&gt;Battery life&amp;nbsp; (9-cell) is about 5hr heavy use with Linux VM compiling in background. With idle/light use battery life could be up to 7hr. Not terrific but quite good enough for the 15.6" +&amp;nbsp;1080p screen.&lt;br /&gt;&lt;br /&gt;It's relatively heavy, but on the other hand, it is fully loaded. 
I'd say it'll make a great desktop replacement, easily beating most laptops at a comparable price ($800~$1000).&lt;br /&gt;&lt;br /&gt;I haven't really optimized the laptop yet but here are a few photos:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-vTLcKqCmsMY/TkaUnRuK88I/AAAAAAAAAIs/5txXt1H98b8/s1600/DSCN1158.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-vTLcKqCmsMY/TkaUnRuK88I/AAAAAAAAAIs/5txXt1H98b8/s320/DSCN1158.JPG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-j4oBzlOHSVI/TkaUqu42-tI/AAAAAAAAAIw/6TjFMlK74Qo/s1600/DSCN1154.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-j4oBzlOHSVI/TkaUqu42-tI/AAAAAAAAAIw/6TjFMlK74Qo/s320/DSCN1154.JPG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-8586005534669449411?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/8586005534669449411/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=8586005534669449411' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8586005534669449411'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8586005534669449411'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/08/hp-dv6z-quad-ed.html' title='HP DV6z Quad 
Ed'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-vTLcKqCmsMY/TkaUnRuK88I/AAAAAAAAAIs/5txXt1H98b8/s72-c/DSCN1158.JPG' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-7348046313338639909</id><published>2011-06-09T11:02:00.000-07:00</published><updated>2011-09-16T13:23:15.964-07:00</updated><title type='text'>Experience in Acer Iconia W500 Windows 7 Tablet</title><content type='html'>- &lt;br /&gt;I recently got one of those Iconia W500 tablets from Acer, running the full version of Windows 7 Home Premium.&lt;br /&gt;&lt;br /&gt;It's really a nice build. I had some doubts earlier when I read some "bad" reviews about this tablet on the Internet. Then I went to a BestBuy to take a look myself at the Iconia &lt;b&gt;A&lt;/b&gt;500 tablet, which I believe uses the same build but with Intel's Atom processors. Note that the Iconia &lt;b&gt;W&lt;/b&gt;500, which I eventually bought, is based on AMD's Fusion C-50 processor, which is a lot faster and supports the new DX11 graphics and OpenCL. I realized that if the A500 looks and feels nice, then the W500 can only be better, since they use the same tablet frame &amp;amp; accessories. Apparently some reviewers have sensory problems.&lt;br /&gt;&lt;br /&gt;Anyway, it is a nice little tablet with a detachable full-size keyboard.&lt;br /&gt;&lt;br /&gt;Some reviews made it sound difficult or even impossible to "fold" the tablet and keyboard. In fact it could not be easier: just fold it. The tablet part will detach ("slide") itself out of the keyboard base. 
I could even do this with the tablet powered on, with no problem.&lt;br /&gt;&lt;br /&gt;The dual-core C-50 is noticeably faster than the single-core 1.6GHz Turion + X300 graphics (the laptop being replaced by this tablet). The C-50 also runs much cooler. &lt;b&gt;Real usage&lt;/b&gt; battery life is ~5.5 hours idle, ~4.5 hours with light use (+wifi), and just over 3 hours of web-based gaming. Not as good as the iPad, but comparable to Zacate based laptops with larger batteries.&lt;br /&gt;&lt;br /&gt;I don't find it very useful to take the keyboard on the road. But it's again very easy, and the keyboard is lighter than the slim Dell keyboard I use on my desktop.&lt;br /&gt;&lt;br /&gt;It has two disadvantages as a tablet:&lt;br /&gt;&amp;nbsp;- Windows 7 is NOT a good tablet OS.&lt;br /&gt;&amp;nbsp;- It could use more of the motion sensors and actuators found in the iPad and Xoom.&lt;br /&gt;&lt;br /&gt;But perhaps it's not fair to compare the W500&amp;nbsp; with the iPad or Xoom. The Acer Iconia W500 is more of a professional tablet, for people to make presentations, take notes, communicate with colleagues and all that. The iPad and Xoom, on the other hand, are more like consumer toys. 
This is not to say you can't play on the Iconia W500 (you can certainly play many great games better than on any other tablet out there), or that you can't work on an iPad or Xoom, but the general feel of Windows 7 and that of iOS or Android are just different.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-7348046313338639909?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/7348046313338639909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=7348046313338639909' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/7348046313338639909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/7348046313338639909'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/06/experience-in-acer-iconia-w500-windows.html' title='Experience in Acer Iconia W500 Windows 7 Tablet'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5097662958036269415</id><published>2011-04-28T17:53:00.000-07:00</published><updated>2011-04-28T18:30:51.379-07:00</updated><title type='text'>The Battle between Netbook and AMD Fusion</title><content type='html'>- &lt;br /&gt;In the 1st quarter of 2011, Microsoft's Windows revenue &lt;a href="http://online.wsj.com/article/SB10001424052748704330404576291412535674954.html"&gt;dropped 4% from last year&lt;/a&gt;, mainly due to a 40% decline in netbook sales. 
CFO Peter Klein says that tablets "played a part" in this decline. It's no wonder that companies no longer make a big fuss about netbooks like they did in 2008 when Intel Atom first came out. After two short years of life, the netbook is already perceived by consumers as slow, insufficient and uncool, ready to be replaced by the next technology.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx"&gt;AMD Fusion&lt;/a&gt; is, I feel, an exciting and revolutionary technology that combines a capable multi-core CPU and a general-purpose GPU into a cost and power efficient package. We have an HP dm1z based on the AMD E-350, and I have to say it is an excellent notebook. It's capable, thin-and-light, runs cool and has long battery life (7+ hours in actual usage).&lt;br /&gt;&lt;br /&gt;I find it ridiculous, though, that the HP dm1z is &lt;a href="http://www.shopping.hp.com/webapp/series/category/notebooks/dm1z_series/3/computer_store"&gt;marketed as a netbook&lt;/a&gt;. &lt;b&gt;It's &lt;i&gt;not&lt;/i&gt; a netbook.&lt;/b&gt; Neither HP nor AMD should have defined it as a netbook. My wife plays many cool games on it, edits photos on it, views HD movies on it, and basically performs every task an &lt;i&gt;advanced&lt;/i&gt; PC user would on it, many of which cannot be satisfactorily performed on a netbook (she holds an M.S. degree in computer science). It's no surprise, since a &lt;i&gt;net&lt;/i&gt;book, by definition, is only for the "net," not for games or computing in general. An Atom based 10" laptop with in-order cores may be a netbook; an ARM based 8" laptop with 2GB RAM may be a netbook. But these AMD Fusion laptops have full-grown notebook computing capabilities (good), with size and power consumption similar to a netbook (better!).&lt;br /&gt;&lt;br /&gt;It is no wonder that AMD &lt;a href="http://blogs.amd.com/home/tag/netbook/"&gt;does &lt;i&gt;not&lt;/i&gt; refer to their Fusion APUs as being for netbooks&lt;/a&gt;. 
But sadly it doesn't matter. When Intel released their CULV, people &lt;a href="http://news.cnet.com/8301-17938_105-10312430-1.html"&gt;tried to define&lt;/a&gt; the slow single-core Celeron based laptop with 2GB RAM as "notebook;" but after AMD launched their Fusion APUs, with two 64-bit out-of-order cores at 1.6GHz accessing 4GB RAM, most people seem to &lt;a href="http://search.yahoo.com/search?p=netbook+amd"&gt;become dense about&lt;/a&gt; the many distinctions from netbook. Seriously, if the AMD Fusion "netbook" plays cool games with DX11, runs Office suite and photo editors smoothly, plays HD movies and even &lt;a href="http://www.windowsfordevices.com/c/a/News/HP-Pavilion-dm1z-review/"&gt;runs virtual machines&lt;/a&gt; at close-to-native speed, then &lt;b&gt;it is a full notebook&lt;/b&gt;. If it's thin and light, then it's an &lt;b&gt;ultramobile notebook&lt;/b&gt;. The reasonable conclusion: Fusion APU is &lt;i&gt;not&lt;/i&gt; another netbook chip, but a perfect replacement for those ultramobile processors that would otherwise cost you $1000 each.&lt;br /&gt;&lt;br /&gt;I feel kind of sad for AMD, for they seem to live in a world which is mostly agnostic about how good their technologies are. 
But perhaps that is how people like me can buy these powerful little notebooks at a great price?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5097662958036269415?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5097662958036269415/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5097662958036269415' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5097662958036269415'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5097662958036269415'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/04/battle-between-netbook-and-amd-fusion.html' title='The Battle between Netbook and AMD Fusion'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5467606835227516032</id><published>2011-04-15T10:46:00.000-07:00</published><updated>2011-04-16T08:00:21.909-07:00</updated><title type='text'>AMD tapes out 28nm Wichita, Intel shows new Atom and peeks 32nm shrink</title><content type='html'>- &lt;br /&gt;It was reported a few days ago that &lt;a href="http://news.softpedia.com/news/AMD-Tapes-Out-the-28nm-Wichita-APUs-194452.shtml"&gt;AMD taped out Wichita&lt;/a&gt;, their 28nm shrink on the &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101109132554_AMD_Unwraps_Desktop_Laptop_Plans_for_2012_Komodo_Trinity_Krishna_and_Wichita.html"&gt;APU roadmap&lt;/a&gt;.&lt;br 
/&gt;&lt;br /&gt;Currently, there are two types of APUs on the market. Zacate @18W targets ultraportable notebooks (such as &lt;a href="http://www.shopping.hp.com/webapp/series/category/notebooks/dm1z_series/3/computer_store"&gt;HP dm1z&lt;/a&gt; and &lt;a href="http://www.engadget.com/2011/04/14/msis-fusion-powered-x370-laptop-gets-579-price-tag-hits-amazo/"&gt;MSI X370&lt;/a&gt;) and small form factor desktops, while Ontario @9W targets tablets and (fanless) set-top and embedded boxes. According to AMD, these APUs are very small (&amp;lt; 0.8 cm^2) compared to ordinary processors. Shrinking them from 40nm to 28nm would bring the die size to below 0.5 cm^2. It was also said that TSMC will apply HKMG on their 28nm process, which should offer some significant power reduction or performance boost.&lt;br /&gt;&lt;br /&gt;I think there are two ways that AMD can bring the APUs forward. Wichita (with 1--2 Bobcat cores) could have similar performance to Ontario/Zacate with lower power consumption for the tablet market. Krishna (2--4 Bobcat cores) could have the same 9W/18W TDP with more cores and even higher performance than Zacate for the future ultraportable notebook market. These &lt;b&gt;ultraportable notebooks&lt;/b&gt; should prove themselves &lt;b&gt;with flexibility and performance clearly above the tablets&lt;/b&gt;. The &lt;b&gt;netbook&lt;/b&gt;, on the other hand, is perhaps a market to be &lt;b&gt;gradually replaced by tablets&lt;/b&gt;. It would be a mistake, in my opinion, for AMD to invest in and build up products for the netbook market now.&lt;br /&gt;&lt;br /&gt;Currently, AMD's APUs offer &lt;a href="http://hothardware.com/Reviews/AMD-Zacate-E350-Processor-Performance-Preview/?page=6"&gt;much better performance&lt;/a&gt; than Intel's dual-core Atom. 
It's no surprise that with the tape out of AMD's next APU, Intel is eager to announce the &lt;a href="http://www.bgr.com/2011/04/11/intel-outs-new-atom-z760-processor-for-tablets/"&gt;new Atom&lt;/a&gt; and disclose its &lt;a href="http://news.softpedia.com/news/Intel-to-Detail-Its-Upcoming-Cedar-Trail-Atom-CPU-at-IDF-Beijing-2011-194648.shtml"&gt;32nm shrink&lt;/a&gt; at IDF in Beijing this year.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5467606835227516032?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5467606835227516032/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5467606835227516032' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5467606835227516032'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5467606835227516032'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/04/amd-tapes-out-28nm-wichita-intel-shows.html' title='AMD tapes out 28nm Wichita, Intel shows new Atom and peeks 32nm shrink'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-6969618866741653858</id><published>2011-04-13T17:14:00.000-07:00</published><updated>2011-04-28T18:33:04.775-07:00</updated><title type='text'>Stupid is as stupid does</title><content type='html'>Some people on Internet forums are overly concerned with fanboyism arguments, such as asking 
for &lt;a href="http://www.semiaccurate.com/forums/showpost.php?p=107595&amp;amp;postcount=60"&gt;&lt;i&gt;non&lt;/i&gt;-evident expertise&lt;/a&gt; or &lt;a href="http://www.semiaccurate.com/forums/showpost.php?p=107725&amp;amp;postcount=63"&gt;superficial credentials&lt;/a&gt; when presented with technical discussions. To me such behavior represents a degeneration of human intelligence, because they don't even attempt to read, think, and debate with their brains before they pass judgment. To them, anyone who speaks of things beyond their comprehension must be insane, and they wouldn't even voluntarily improve themselves with information that is novel to them.&lt;br /&gt;&lt;br /&gt;Stupid is as stupid does. If something is wrong,&amp;nbsp; then no matter what credential or expertise its performer claims, it is still wrong. When person X corrects the spelling of person Y, what X needs is a dictionary, not an argument that "because you're in primary school but I am the President." Bottom line: to say something is wrong, one needs to explain why; and he &lt;a href="http://abinstein.blogspot.com/2011/04/first-look-at-amd-family-15h-bulldozer.html"&gt;needs to understand the thing&lt;/a&gt; himself and make sure his understanding is valid first.&lt;br /&gt;&lt;br /&gt;As I have discussed previously in my blog articles, it continues to amaze me how people can consciously believe marketing crap about the mythical &lt;a href="http://abinstein.blogspot.com/2010/09/ipc-myths.html"&gt;IPC&lt;/a&gt; or &lt;a href="http://abinstein.blogspot.com/2008/04/two-sides-of-mirror-on-k10-vs-core2.html"&gt;bandwidth&lt;/a&gt; "numbers" of specific processors or benchmarks. Really, stupid is as stupid does. 
It doesn't matter where those numbers come from, when we have hard evidence or sound analysis proving them to be either wrong or irrelevant.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-6969618866741653858?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/6969618866741653858/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=6969618866741653858' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/6969618866741653858'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/6969618866741653858'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/04/supid-is-as-stupid-does.html' title='Stupid is as stupid does'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-7878161005541344753</id><published>2011-04-10T17:40:00.000-07:00</published><updated>2011-08-30T22:25:00.407-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='K10'/><category scheme='http://www.blogger.com/atom/ns#' term='Bulldozer'/><title type='text'>First look at AMD Family 15h (Bulldozer) Software Optimization Guide</title><content type='html'>&lt;span style="font-size: x-small;"&gt;&lt;u&gt;NOTE: If you only want to know whether AMD would K.O. 
Intel or the other way around, or if you believe technical discussions are nonsense while Internet rumors are gold, then please stay away. OTOH, if you like computer architecture and feel excited about state-of-the-art designs, please enjoy and let me know what you think (thanks)!&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Updates -- &lt;br /&gt;* 4/13/2011 Updated with discussion &lt;i&gt;on load-store unit and memory disambiguation&lt;/i&gt;.&lt;br /&gt;* 4/12/2011 Updated with &lt;i&gt;highlights on shared frontend&lt;/i&gt; and &lt;i&gt;changes to other memory resource&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Prelude&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;AMD recently released &lt;a href="http://support.amd.com/us/Processor_TechDocs/47414.pdf"&gt;the software optimization guide&lt;/a&gt; for its upcoming &amp;amp; most anticipated &lt;b&gt;family 15h &lt;/b&gt;&lt;b&gt;(Bulldozer) &lt;/b&gt;&lt;b&gt;processors&lt;/b&gt;. In this article we take a high-level comparative look at the newly released document.&lt;br /&gt;&lt;br /&gt;The new processor family features a revolutionary "cluster multi-threading" (CMT) architecture, where a processor consists of multiple &lt;i&gt;module&lt;/i&gt;s, each being a cluster of two cores sharing the same instruction frontend, floating-point unit and level-2 cache. Newly supported ISA extensions include the 128-bit SSE4 and 128 &amp;amp; 256-bit AVX, XOP and FMA4.&lt;br /&gt;&lt;br /&gt;Despite these major differences, Bulldozer is fundamentally a continuation of the previous processor design from AMD. 
It is perhaps more useful to first mention &lt;b&gt;some similarities between Bulldozer and the previous family 10h (K10) processors&lt;/b&gt; before going into the details of the differences:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The same (or very similar) &lt;i&gt;macro-op&lt;/i&gt; and &lt;i&gt;micro-op&lt;/i&gt; based instruction decode is utilized.&lt;/li&gt;&lt;li&gt;Similar register file &lt;i&gt;superforwarding&lt;/i&gt;.&lt;/li&gt;&lt;li&gt;Same L1I cache, very similar L3 cache and system interconnect architecture are used.&lt;/li&gt;&lt;li&gt;Similar pick-pack instruction decode in a 32-byte window.&lt;/li&gt;&lt;li&gt;Loads and stores still seem to be performed in the load-store unit working as a backend to the integer core and FPU, rather than being scheduled directly in reservation stations.&lt;/li&gt;&lt;li&gt; The shared FPU design in Bulldozer has its roots deep in the separated integer and FPU schedulers in K10.&lt;/li&gt;&lt;li&gt; Same or very similar microarchitecture for indirect branch (512-entry target array) and return address (24-entry return address stack) prediction.&lt;/li&gt;&lt;/ul&gt;That said, below we discuss some (not all!) 
of the major microarchitecture differences introduced in Bulldozer: shared frontend, execution pipelines, L1D and L2/L3 caches, and memory access resources.&lt;b&gt;&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Highlights on the shared frontend:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Two 32-byte instruction fetch windows (&lt;span style="font-size: x-small;"&gt;one for each core?&lt;/span&gt; &lt;u&gt;1.6.4&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;Fetch window tracking structure (&lt;span style="font-size: x-small;"&gt;to manage fetches for both cores?&lt;/span&gt; &lt;u&gt;2.6&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;Hybrid (tournament) branch prediction with global and local branch predictors&lt;/li&gt;&lt;li&gt;2-level BTB with 512+5120 entries, upped from 1-level 2048 entries&lt;/li&gt;&lt;li&gt;Instructions decoded from a 32-byte window or two 16-byte windows (&lt;span style="font-size: x-small;"&gt;for both cores?&lt;/span&gt; &lt;u&gt;2.7&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;Introduce branch fusion&lt;/li&gt;&lt;/ul&gt;Instruction fetch and branching are greatly improved in Bulldozer. A more sophisticated conditional branch prediction is employed, utilizing a local predictor, a global predictor and a tournament selector. The branch target buffer (BTB) is increased to 2.5+ times larger.&lt;br /&gt;&lt;br /&gt;Note that although a single frontend serves two cores, the same branch prediction information can be shared by both cores if they execute the same program. Even if the two cores run different programs, sharing the same instruction fetch and branch prediction resources can have benefit in latency hiding, especially for non-optimized and densely branching codes.&lt;br /&gt;&lt;br /&gt;When stars align (instruction allocation is optimized and the code has pre-decode information), the frontend can decode up to 4 macro-ops from a 32-byte window per cycle for one core. 
Otherwise, a 16-byte window is scanned to find the boundaries for supposedly &amp;lt; 4 decodes per cycle. It is unclear whether in such cases one 16-byte window can be scanned for each core, thus still maintaining 32-byte decode (for both cores) per cycle. Note that it takes at least twice as long to scan an instruction window twice as large, but two instruction windows of the same size can always be scanned concurrently by parallel resources, if available.&lt;br /&gt;&lt;br /&gt;The &lt;i&gt;branch fusion&lt;/i&gt; seems similar to Intel's macro-op fusion. It has limited applicability but would make Bulldozer more competitive for running Intel-optimized codes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Highlights on the execution pipelines:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;4-way microarchitecture design&lt;/li&gt;&lt;li&gt;Integer core has two EX &lt;i&gt;and&lt;/i&gt; two AGLU pipelines, plus an LSU (&lt;u&gt;2.10.2&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;Floating-point unit (FPU) has two FMAC &lt;i&gt;and&lt;/i&gt; two IMMX pipelines (&lt;u&gt;2.11&lt;/u&gt;)&lt;/li&gt;&lt;/ul&gt;Up to 4 &lt;i&gt;macro-op&lt;/i&gt;s per clock cycle can be issued from the (shared) frontend to either of the two cores. Within each core, up to 4 macro-ops per clock cycle can be sent to the integer or the floating-point scheduler.&lt;br /&gt;&lt;br /&gt;The integer scheduler can dispatch up to 4 &lt;i&gt;micro-op&lt;/i&gt;s per cycle, one to each of the 4 pipelines. Almost all ALU operations are handled by the 2 EX pipelines, except some LEA instructions which also utilize the AGLU pipelines. Thus the integer core can execute only up to 2 x86 instructions per clock cycle, resulting in a &lt;i&gt;maximum&lt;/i&gt; integer IPC of 2.0 (in units of x86 instructions). 
Note however this estimate does not include the computing throughput of the integer SIMD pipelines in the FPU.&lt;br /&gt;&lt;br /&gt;The FPU scheduler can dispatch up to four 128-bit operations with the following combinations: (1) any of {FMUL, FADD, FMAC, FCVT, IMAC}; &lt;i&gt;and&lt;/i&gt; (2) any of {FMUL, FADD, FMAC, Shuffle, Permute}; &lt;i&gt;and&lt;/i&gt; (3) any of {AVX, MMX, ISSE}; &lt;i&gt;and&lt;/i&gt; (4) any of {AVX, MMX, ISSE, FSTORE}.&lt;br /&gt;&lt;br /&gt;From a layman's viewpoint, the shared FPU seems to offer only half the throughput of two K10 cores for independent FMUL and FADD operations. However, in previous Opteron, vectorized loads and stores also share the FMUL and FADD pipelines; in Bulldozer, vectorized loads are either "free" or handled by the IMMX pipelines. Note that &lt;b&gt;when FPU is &lt;i&gt;throughput bottleneck&lt;/i&gt;, each arithmetic operation should &lt;/b&gt;&lt;b&gt;be paired with &lt;/b&gt;&lt;b&gt;on average &lt;/b&gt;&lt;b&gt;one load or store&lt;/b&gt;. A perhaps more significant overhead saving comes from various vectorized register moves which can now be dispatched concurrently to separate IMMX pipelines. Thus the shared FPU in Bulldozer is actually a very balanced design.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Changes to L1 data cache: (&lt;u&gt;2.5.2&lt;/u&gt;)&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Size reduced from 64kB to 16kB&lt;/li&gt;&lt;li&gt;Associativity increased from 2-way to 4-way&lt;/li&gt;&lt;li&gt;Number of banks increased from 8 to 16 banks&lt;/li&gt;&lt;li&gt;Load-to-use latency increased from 3 to 4 cycles&lt;/li&gt;&lt;li&gt;Access policy changed from write-back to write-through&lt;/li&gt;&lt;/ul&gt;The L1D cache seems to go through an almost complete overhaul in Bulldozer. 
In previous AMD Opterons the L1D cache is virtually indexed and &lt;i&gt;physically&lt;/i&gt; tagged; this allows the cache size to be greater than (page_size)*(associativity) without the homonym and synonym problems. On the other hand, this also means every cache hit is subject to a TLB hit.&lt;br /&gt;&lt;br /&gt;In Bulldozer, the L1D cache size is (page_size)*(associativity) = 4kB * 4 = 16kB. As such, it is &lt;i&gt;possible&lt;/i&gt; that the L1D cache is now &lt;i&gt;virtually&lt;/i&gt; tagged, which would take the DTLB access out of the critical loop. While this limits the maximum cache size to 16kB, it can &lt;b&gt;offer clock rate and power advantages&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;Limiting the cache size, however, does not solve the synonym problem where two cores in a Bulldozer module map different virtual addresses to the same physical address. Inconsistency can occur when the two cores update contents in their (virtually tagged) data caches separately. This problem, however, can be &lt;b&gt;solved by writing through to the physically tagged shared L2 cache&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Changes to L2 and L3 caches:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;L2 cache is now a "mostly inclusive" cache (&lt;u&gt;2.5.3&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;L2 cache latency increases to 18 ~ 20 cycles from previous 12 (=9+3) cycles&lt;/li&gt;&lt;li&gt;L3 cache is logically partitioned into sub-caches each up to 2MB (&lt;u&gt;2.5.4&lt;/u&gt;)&lt;/li&gt;&lt;/ul&gt;The "mostly inclusive" property of the L2 cache in Bulldozer is a direct consequence of the write-through policy of the L1D cache. Any cache line that has been modified in an L1D cache will also have a copy in the L2 cache. On the other hand, when there is an L1D/L2 cache miss and an L3 cache hit, a cache line is copied from the L3 cache directly to the L1D cache (same behavior as in K10), making the L2 cache not fully inclusive. 
Similar behavior applies to the memory prefetch instructions, which copy cache lines directly to L1D. On the other hand, "cold" data are probably loaded to both L1D and L2 caches to take advantage of the sharing of L2 by both cores (different from K10), which could explain the "mostly" inclusive description of the L2 cache.&lt;br /&gt;&lt;br /&gt;The L2 cache latency in K10 is 9 cycles beyond the (3-cycle) L1 cache access, or a total of 12 cycles. In Bulldozer, the L2 cache latency is increased to 18 ~ 20 cycles; the greater value is probably for writes, or for an L1D TLB miss. The increased latency suggests that the Bulldozer core is designed to be thinner and faster (higher clock rate) rather than wider and shorter (higher ILP).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;On load-store unit and memory disambiguation:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;40-entry load queue and 24-entry store queue in LSU&lt;/li&gt;&lt;/ul&gt;The load-store unit (LSU) seems to be very similar to the one in K10. Both utilize two queues, one primarily for pending loads and one exclusively for pending stores. There have been claims that Bulldozer offers better out-of-order handling of loads and stores than K10. From the high-level point of view of the LSU, the only "major" difference is perhaps the use of virtual addresses for tagging the L1D cache in Bulldozer&lt;span style="font-size: x-small;"&gt;(?)&lt;/span&gt;, versus physical addresses in K10. Tagging L1D with virtual addresses may allow pending stores to retire sooner to L1D without being subject to any TLB miss latency, thus resolving store-to-load dependencies faster. 
Otherwise, according to Section &lt;u&gt;6.3&lt;/u&gt; of the software optimization guides, &lt;b&gt;the same restrictions on store-to-load forwarding apply to both Bulldozer and K10&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;There have been many claims &lt;span style="font-size: x-small;"&gt;(mostly from people outside of AMD?)&lt;/span&gt; that Bulldozer must offer some "memory disambiguation" similar to Core 2 or Nehalem. Judging from the organization of Bulldozer's integer and load-store pipelines, which resemble K10 more than Core 2, AMD would have to use very different memory disambiguation mechanisms than Intel. &lt;b&gt;The concept of memory disambiguation is actually simple: a memory access can be ambiguous when its target address is unknown&lt;/b&gt;. Once the address is known, disambiguation (within the same process) can be performed by simply comparing the addresses.&lt;br /&gt;&lt;br /&gt;Suppose there is a store to an address &lt;i&gt;A&lt;/i&gt; specified by a memory reference &lt;i&gt;M&lt;/i&gt;. If &lt;i&gt;M&lt;/i&gt; is not in cache, then the store can be pending for a long time waiting for &lt;i&gt;A&lt;/i&gt; (at address &lt;i&gt;M&lt;/i&gt;) to arrive. During that time, all later (independent) loads are ambiguous because any of their addresses could be the same as &lt;i&gt;A&lt;/i&gt; (which is not yet known). Similarly, there can be memory access ambiguity for stores following a load from &lt;i&gt;A&lt;/i&gt;, or stores following a store to &lt;i&gt;A&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;One form of disambiguation is to predict which of the later memory accesses are to addresses that overlap with &lt;i&gt;A&lt;/i&gt;. All those that are predicted not to overlap proceed speculatively, and have their results (and all those they affected) squashed if &lt;i&gt;A&lt;/i&gt; is later found to overlap with their access addresses. 
Note, however, that &lt;b&gt;such disambiguation cannot be performed by the LSU if the LSU only receives load-store requests with known addresses&lt;/b&gt;. This seems to be the case in both K10 and Bulldozer, where the LSU works as a backend to the reservation stations.&lt;br /&gt;&lt;br /&gt;Is it worth it to allow ambiguous memory access requests to be sent speculatively to Bulldozer's LSU? I think it requires detailed analysis and simulation to know for sure. The software optimization guide does not tell us whether such a design is used in Bulldozer. (Note that a more "severe" type of memory disambiguation may be needed for Intel Nehalem, where two processes can share the same LSU and different virtual memory mappings can create extra memory reference ambiguity.)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Changes to other memory resources (hardware prefetch and write combining):&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Hardware prefetch to both L1 and L2 (prefetch instruction still to L1 only, &lt;u&gt;6.5&lt;/u&gt;)&lt;/li&gt;&lt;li&gt;Stride L1 prefetcher with up to 12 prefetch patterns&lt;/li&gt;&lt;li&gt;"Region" L2 prefetcher for up to 4096 streams or patterns&lt;/li&gt;&lt;li&gt;4KB 4-way WCC plus a (single?) 64-byte 4-entry (?) WCB (&lt;u&gt;A.5&lt;/u&gt;)&lt;/li&gt;&lt;/ul&gt;Due to the much smaller size of L1D in Bulldozer, it is reasonable to expect hardware prefetch to be less aggressive at L1D. Instead, part of the "aggressiveness" is transferred to the large and shared L2 cache. Although less aggressive, the prefetch mechanism is much more sophisticated, keeping multiple (12) prefetch patterns active at the same time.&lt;br /&gt;&lt;br /&gt;A special design in Bulldozer is the addition of a 4KB 4-way associative write coalescing cache (WCC) for aggregating write-back (WB) memory writes &lt;span style="font-size: x-small;"&gt;(before committing them to L2?)&lt;/span&gt;. 
This special "write cache" is inclusive with the L2 cache, and has its contents universally visible. It is unclear whether there is one WCC per core or one per module, although the former seems more plausible.&lt;br /&gt;&lt;br /&gt;One of the design goals of the WCC is probably to &lt;b&gt;improve inter-core data transfer&lt;/b&gt;. Previously in K10, if core1 needs to send something to core2, the cache line containing the data must be (a) modified in core1's L1D, (b) evicted from core1's L1D to its L2, then (c) transferred from core1's L2 to core2's L1D. In Bulldozer, since every write to L1D also writes through to the WCC, step (b) can be omitted and step (c) can be performed together with updating the L2 cache. Even less overhead is incurred if the data transfer occurs between two cores in the same module that share the L2 cache.&lt;br /&gt;&lt;br /&gt;The WCC also acts as a write buffer for the write combining buffer (WCB) for streaming loads and the write-combining memory type. This can have other implications for the memory ordering requirements of the AMD64 execution model, which we will not touch upon here.&lt;br /&gt;&lt;br /&gt;Bulldozer seems to have fewer write-combining resources per core for streaming stores and the write-combining memory type than K10. A performance "caveat" is mentioned for streaming store instructions in Section &lt;u&gt;6.5&lt;/u&gt; of the software optimization guide, where writing &amp;gt;1 streams of data with streaming stores results in much lower performance compared with K10. It appears, although it is unclear, that Bulldozer has &lt;b&gt;a &lt;span style="font-size: x-small;"&gt;(single?)&lt;/span&gt; 64-byte 4-entry &lt;span style="font-size: x-small;"&gt;(sharing the 64 bytes? each having 64 bytes?)&lt;/span&gt; write combining buffer &lt;span style="font-size: x-small;"&gt;(per core?)&lt;/span&gt;&lt;/b&gt;. K10 and even the later K8 revisions have 4 independent 64-byte WCBs per core. 
One explanation is that modern processors have more cores and thus fewer occasions to store multiple independent data streams per core. With only one stream of streaming stores, the performance in Bulldozer is still comparable to that in K10.&lt;br /&gt;&lt;br /&gt;On the other hand, by beefing up the write-combining resource for write-back &amp;amp; temporal stores with the WCC, common memory writes are made much more efficient. Make the common case fast -- a rule of thumb in microarchitecture design!&lt;br /&gt;&lt;br /&gt;~~&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-7878161005541344753?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/7878161005541344753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=7878161005541344753' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/7878161005541344753'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/7878161005541344753'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/04/first-look-at-amd-family-15h-bulldozer.html' title='First look at AMD Family 15h (Bulldozer) Software Optimization Guide'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-652409107483333974</id><published>2011-03-23T02:44:00.000-07:00</published><updated>2011-03-23T02:44:01.688-07:00</updated><title type='text'>IE9 running fish 
tank with 250 fish on HP6455b</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-ezYY5beLgGQ/TYnA2N31SPI/AAAAAAAAAIc/ruOh4BYmQ0A/s1600/hp6455b_250fish.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="290" src="https://lh3.googleusercontent.com/-ezYY5beLgGQ/TYnA2N31SPI/AAAAAAAAAIc/ruOh4BYmQ0A/s320/hp6455b_250fish.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-652409107483333974?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/652409107483333974/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=652409107483333974' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/652409107483333974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/652409107483333974'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/03/ie9-running-fish-tank-with-250-fish-on.html' title='IE9 running fish tank with 250 fish on HP6455b'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh3.googleusercontent.com/-ezYY5beLgGQ/TYnA2N31SPI/AAAAAAAAAIc/ruOh4BYmQ0A/s72-c/hp6455b_250fish.jpg' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5116129080712179413</id><published>2011-02-08T17:42:00.000-08:00</published><updated>2011-04-28T18:39:01.625-07:00</updated><title type='text'>Battery life of HP ProBook 6455b w/ AMD Phenom II N620</title><content type='html'>The common perception has been that AMD-based notebooks are power hungry and rarely get over 3hr of battery life. That perception is wrong.&lt;br /&gt;&lt;br /&gt;Evidence? The HP ProBook 6455b with a 2.8 GHz dual-core Phenom II N620 and a 6-cell battery has up to 3hr 50min of battery life under normal usage (Internet surfing with wifi and 70% screen brightness), as shown in the following screenshots taken roughly 50 minutes apart:&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: left;"&gt;&lt;a href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/TVHraDSr66I/AAAAAAAAAIM/dnENCXPJM6E/s1600/battery1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/TVHraDSr66I/AAAAAAAAAIM/dnENCXPJM6E/s1600/battery1.jpg" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&amp;nbsp;===&amp;gt;&amp;nbsp; &lt;/td&gt; &lt;td style="text-align: right;"&gt;&lt;a href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/TVHz6jPXkKI/AAAAAAAAAIY/7KNQNLsxbF8/s1600/battery2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/TVHz6jPXkKI/AAAAAAAAAIY/7KNQNLsxbF8/s1600/battery2.jpg" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Before starting to write this blog article&lt;/td&gt; &lt;td&gt;&lt;/td&gt; &lt;td class="tr-caption" style="text-align: center;"&gt;After writing this blog article&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div 
class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;While 3.8hr of battery life isn't stellar, keep in mind that this is a 14.1" notebook with a 35W TDP 2.8 GHz dual-core CPU and a DirectX 11 capable GPU. I believe not many sub-$1000 Core 2 Duo or Core i3 laptops reach that battery life, and they usually have much lower-end GPUs.&lt;br /&gt;&lt;br /&gt;(The notebook &lt;i&gt;claims&lt;/i&gt; to have over 5hr of battery life, which is the case when the CPU is idle at 1.0 GHz, the screen is dimmed to the lowest setting and wifi is turned off. I find such battery life numbers pointless, though.)&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5116129080712179413?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5116129080712179413/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5116129080712179413' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5116129080712179413'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5116129080712179413'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2011/02/battery-life-of-hp-probook-6455b-w-amd.html' title='Battery life of HP ProBook 6455b w/ AMD Phenom II N620'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_oGCeAi-2i3Q/TVHraDSr66I/AAAAAAAAAIM/dnENCXPJM6E/s72-c/battery1.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5363612983202777193</id><published>2010-12-31T16:45:00.000-08:00</published><updated>2011-05-10T15:05:08.859-07:00</updated><title type='text'>AMD Bobcat Fusion APU -- A Big Deal?</title><content type='html'>AMD has been enthusiastic and optimistic about its upcoming &lt;a href="http://blogs.amd.com/fusion/2010/09/06/direct-from-berlin-and-ifa-2010-guten-tag-kleine-fusion/"&gt;Fusion Accelerated Processing Unit (APU)&lt;/a&gt; based on the Bobcat cores set for launch at next year's (really less than one week from now) International Consumer Electronics Show (CES) in Las Vegas. It even makes a &lt;a href="http://www.youtube.com/watch?v=qUrXyDlfdXQ"&gt;supposedly humorous video&lt;/a&gt; on YouTube, showing its main competitor spying on and astonished by AMD's "Fusion technology".&lt;br /&gt;&lt;br /&gt;Is the Fusion APU really a big deal and, if so, in what sense? Will it really revolutionize personal computing as &lt;a href="http://blogs.amd.com/home/2010/12/30/joel-mchale-sexy-pcs-ces-vegas-be-there/"&gt;claimed by AMD&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Facts&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;We already know the performance bound of these Fusion APUs, straight from AMD: compared to current CPU designs, the Bobcat core will achieve &lt;a href="http://techreport.com/articles.x/19531"&gt;90% performance with 50% die area&lt;/a&gt;. So a 1.6GHz Bobcat core will have performance comparable to a 1.4GHz Turion, definitely not a stellar specification. 
In fact, the APU's performance has been &lt;a href="http://www.pcper.com/article.php?aid=1039&amp;amp;type=expert&amp;amp;pid=6"&gt;previewed and shown to be comparable&lt;/a&gt; to Intel's CULV CPU + nVidia's ION GPU.&lt;br /&gt;&lt;br /&gt;The more impressive part is perhaps that the APU has both the CPU and GPU sitting on the same die, sharing the same system interface and 18W power envelope. Thus from the performance perspective, the APU is much better than Intel's Atom processor (which powers most of the current low-cost netbooks), while from the power and cost perspective, &lt;a href="http://www.pcper.com/article.php?aid=1039&amp;amp;type=expert&amp;amp;pid=8"&gt;the APU is much better&lt;/a&gt; than Intel CULV + nVidia ION. So the whole point of these Fusion APUs is really not about better performance (in both processing speed and power), but to reach a "better" power-performance tradeoff, &lt;i&gt;i.e.&lt;/i&gt;, &lt;i&gt;performance-per-watt&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Advantage &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;But is this power-performance tradeoff the real "advantage" of the Fusion APU, one that is unreachable by other players? I highly doubt it. For example, if one combines Intel's Yonah and nVidia's ION2 and manufactures them on Intel or TSMC 32nm, the same level of performance-per-watt could very well be reached.&lt;br /&gt;&lt;br /&gt;However, even if Intel and nVidia work together, such a product probably won't make money for Intel due to all the redesign effort required and the erosion of Intel's existing products. So IMHO one critical "advantage" that AMD has with APUs is that AMD's current market share in low-power laptops is so small that it doesn't need to worry about cannibalization when releasing cheaper products. Intel OTOH doesn't want to replace its existing laptops with lower-performance, cheaper ones. Instead they designed Atom to target the smartphone and tablet markets. 
They make sure there's a significant performance gap between Atom and Core i3 so the two markets are well separated.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Extra &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Hardware is only part of the story. Because the CPU and GPU are combined closely together, every laptop based on AMD's Fusion APU becomes DirectCompute and OpenCL capable. Such "universal" GPGPU availability makes GPGPU acceleration a viable choice for software developers, which in turn makes these Fusion APUs better products (since more programs will be optimized for the CPU+GPU package). OpenCL, which came along sometime in 2008, is an industry standard that replaced the original ATI Stream. Kernel programming in OpenCL is also very similar to that in nVidia CUDA, making OpenCL a fine choice for developers who are looking for or already taking advantage of GPGPU.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The "Better" Product?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;However, even with GPGPU acceleration, an 18W APU still won't achieve stellar performance. Do you really believe the 18W TDP can translate to a personal supercomputer, artificial intelligence and an immersive 3D interface? What would be more interesting instead is the Fusion APU with the Bulldozer CPU core and the "Southern Island" GPU. That plus OpenCL could really be revolutionary in terms of software acceleration. But that plan, first disclosed by AMD in 2007, has been delayed until at least 2012/2013. Instead, Bobcat-based low-end Fusion APUs came to fill the void for the next 1 to 2 years.&lt;br /&gt;&lt;br /&gt;While the current Fusion APU was not in AMD's original plan, with some irony it is probably a "better" product than originally planned. Why? Because believe it or not, most laptop users really don't need higher CPU performance! Most people will be quite happy with a dual-core 1.6GHz computer which they use mostly for e-mail and web surfing. 
The good graphics offered by these APUs is just a sweetening plus.&lt;br /&gt;&lt;br /&gt;So what AMD does with the Bobcat-based APU is to depress CPU+GPU prices and power budgets so laptop makers can give us better &lt;i&gt;other stuff&lt;/i&gt;, such as longer battery life, better webcam, and faster WiFi/3G/4G. And although this Fusion APU will reduce CPU+GPU ASPs and will hurt high-end laptop sales, AMD has little to lose in those areas anyway. :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5363612983202777193?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5363612983202777193/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5363612983202777193' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5363612983202777193'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5363612983202777193'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2010/12/amd-bobcat-apu-big-deal.html' title='AMD Bobcat Fusion APU -- A Big Deal?'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-4174239704851111363</id><published>2010-09-02T17:02:00.000-07:00</published><updated>2011-05-05T10:08:00.122-07:00</updated><title type='text'>The IPC Myths</title><content type='html'>While Instruction Per Cycle (IPC) is an important metric for program optimization, it has been 
misused in many contexts. Below are a few common examples:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;IPC can be used to describe how good a CPU is.&lt;/li&gt;&lt;li&gt;IPC is roughly proportional to the pipeline width of the CPU.&lt;/li&gt;&lt;li&gt;IPC of modern CPUs is high (&amp;gt;&amp;gt;1).&lt;/li&gt;&lt;li&gt;Amdahl's law says a CPU with higher IPC will have higher single-threaded performance.&lt;/li&gt;&lt;li&gt;...&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Myth #1: IPC described as a single value&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A common problem of all the "statements" above is that they all refer to IPC as if it were some intrinsic property determined by the CPU microarchitecture. In fact, IPC is a property determined not just by the CPU, but more by the program, from algorithm down to instruction scheduling. For example, it is very possible for CPU1 to have higher IPC than CPU2 running program A, but lower IPC running program B.&lt;br /&gt;&lt;br /&gt;Thus, saying "CPU1 has higher (or lower) IPC than CPU2" &lt;i&gt;has to&lt;/i&gt; be inaccurate, especially when the two processors have different microarchitectures.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Myth #2: Higher IPC means better&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Many people believe higher IPC means higher (single-thread) performance. This is as wrong as when people thought higher clock rate means higher performance. Still, many believe higher IPC is better because the CPU can run as fast with a slower clock rate. This seems an over-reaction to the Pentium 4, which had a very high clock rate but moderate performance compared to Athlon64/Opteron.&lt;br /&gt;&lt;br /&gt;The problem with this type of thinking is that the relation between IPC and clock rate is really a &lt;i&gt;tradeoff&lt;/i&gt;. Like any tradeoff relation, you don't get optimal results by sliding towards either edge. With microarchitecture and circuit-level advancements, clock rate and/or IPC can be increased. 
Which one to improve should depend on the design and application of the processor, and it's definitely not always (not even usually) IPC.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Myth #3: IPC is proportional to CPU pipeline width&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;We see many arguments like the ones below on the Internet--&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Core 2 can issue up to 4 x86 instructions per cycle, so it should have an IPC close to 4.&lt;/li&gt;&lt;li&gt;Nehalem brings [this or that feature] to circumvent the decode limit, so its IPC is 25% or 33% higher than Core 2. &lt;/li&gt;&lt;li&gt;K10 (AMD Family 10h) can only decode 3 x86 instructions per cycle, so its IPC has a "bottleneck" at the instruction decode.&lt;/li&gt;&lt;/ul&gt;None of these statements is correct. It's not that the conclusions of these statements are absolutely false, but that their reasoning does not hold water. The best we can say about them is that without profiling or cycle-accurate simulation, we simply don't know.&lt;br /&gt;&lt;br /&gt;In the case of Core 2 and Nehalem, we actually know for sure that the statements above are false. IPC of Core 2 Duo running SPEC CPU2006 was &lt;a href="http://www.ece.lsu.edu/lpeng/papers/isast08.pdf"&gt;measured in this paper&lt;/a&gt;. The values were between 0.4 and 1.8 among all sub-benchmarks, with an average of only around 1.0, nowhere near its 4-way decoder width.&lt;br /&gt;&lt;br /&gt;If we compare actual SPECint measurements of Core 2 (&lt;a href="http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20081109-05885.html"&gt;22.6&lt;/a&gt;) with Nehalem (&lt;a href="http://www.spec.org/cpu2006/results/res2009q4/cpu2006-20091201-09117.html"&gt;25.1&lt;/a&gt; or &lt;a href="http://www.spec.org/cpu2006/results/res2009q4/cpu2006-20091026-08991.html"&gt;27.8&lt;/a&gt;), we see that Nehalem has 11% to 23% higher single-thread performance &lt;i&gt;after&lt;/i&gt; taking into account potentially 20% turbo frequency. 
Thus Nehalem's IPC for SPECint is &lt;i&gt;at most&lt;/i&gt; ~20% higher than Core 2, and most likely much less when the turbo mode effect is excluded. In other words, if Core 2's IPC for SPECint sub-benchmarks were 0.4~1.8, then Nehalem's should be between 0.5~2.1. Both are far below what is implied by their 4-way pipelines or any sexy-sounding marketing features.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Myth #4: Amdahl's law favors CPU designed for higher IPC&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is the strangest argument that I have seen on the Internet, because it is completely the opposite of the truth. The main thing that Amdahl's law says is that &lt;i&gt;performance improvement is intrinsically limited by the &lt;b&gt;available parallelism in a program&lt;/b&gt;&lt;/i&gt;. In the context of single-threaded programs, this means that performance at the same clock rate is limited by the Instruction-Level Parallelism (ILP) &lt;b&gt;available in the program&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;Some people see that "limited by the ILP" part and immediately relate it to a CPU designed for higher IPC. The problem here is that, according to Amdahl's law, the ILP is limited by the &lt;i&gt;program&lt;/i&gt;, not the CPU. In other words, if your &lt;i&gt;program&lt;/i&gt; has low ILP, it will not run fast no matter how high an IPC the CPU was designed for. Thus in fact Amdahl's law favors a CPU designed for &lt;i&gt;higher clock rate&lt;/i&gt; but &lt;i&gt;lower IPC&lt;/i&gt; than the available ILP in the program.&lt;br /&gt;&lt;br /&gt;Furthermore, the available ILP in a program is also a strong function of the window size and the branch prediction accuracy. Both are very difficult to increase in the uber-complex microarchitectures of modern CPUs. That is why features such as SIMD (SSE and AVX), SMT, and turbo frequency are used in Nehalem to improve &lt;strike&gt;single-thread&lt;/strike&gt; processor performance. 
None of them increases IPC of the CPU.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;IPC is very useful when one wants to optimize his program for a particular system. It is one of the most important metrics that profiling produces. But like any metric, generalizing its implication outside of its intended usage context is usually meaningless and even misleading.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-4174239704851111363?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/4174239704851111363/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=4174239704851111363' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/4174239704851111363'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/4174239704851111363'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2010/09/ipc-myths.html' title='The IPC Myths'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-8959840806195118188</id><published>2010-05-19T10:01:00.000-07:00</published><updated>2011-04-28T18:52:05.933-07:00</updated><title type='text'>GPGPU and its battle of nVidia vs ATI</title><content type='html'>GPGPU seems to be really taking off. 
I came across a new YouTube video showing &lt;a href="http://www.youtube.com/user/nvidiatesla#p/a/u/0/T18j1dg9Bno"&gt;IBM's new mainstream server using nVidia Tesla graphics cards&lt;/a&gt; for compute-intensive acceleration.&lt;br /&gt;&lt;br /&gt;For the past few years, AMD/ATI enthusiasts have assumed that Radeon is AMD's crown jewel. The truth might be just the opposite.&lt;br /&gt;&lt;br /&gt;At a workshop in a recent conference, an nVidia researcher compared CUDA and OpenCL. His argument (whether true or not) was simple: OpenCL is more a device driver level language. He "proved" it by showing the same program written in CUDA and in OpenCL side-by-side. The CUDA one took about 2 slides. The OpenCL one took about 10. If you are a researcher/programmer/engineer, which one will you use?&lt;br /&gt;&lt;br /&gt;It may be true that ATI Evergreen gives higher performance per dollar for games, but nVidia Fermi seems to give better GPGPU performance on average. Evergreen has more parallelism and higher theoretical flops, but Fermi is easier to program and to get real speedup from. Hundreds of universities worldwide are teaching students how to optimize their programs for Fermi. This is a formidable rival and I don't share at all the optimism of many ATI enthusiasts.&lt;br /&gt;&lt;br /&gt;I believe in a year or two we will see the market of GPGPU surpassing that of enthusiast graphics. Very few people in the world care about fps when playing games. On the other hand, everyone benefits from GPGPU. I feel that AMD/ATI is too conservative in pushing for GPGPU. Most of their laptop/desktop chipsets still use r700 or even r600 based IGPs, from which it is very hard to get good GPGPU performance, if any at all. Every time I see a laptop with an HD42xx IGP I feel disgusted. They are selling that 2-year-old stuff which doesn't let users take proper advantage of OpenCL. They sell them for cheap, but is it a good thing? 
Do they also want to sell r800 IGPs for cheap 2 years from now?&lt;br /&gt;&lt;br /&gt;Then there is OpenCL, which everyone has heard of but few are interested in. In my humble opinion, AMD should hire a few programmers to fully integrate OpenCL into their Catalyst driver, so that every computer with an ATI GPU can use it after a simple driver install. In contrast, perhaps for fear of Microsoft or whatever reason, AMD/ATI want end users to manually install &lt;i&gt;and&lt;/i&gt; upgrade every OpenCL release to match the installed Catalyst driver. If I have to go to so much trouble, why don't I just install CUDA, which more people are using anyway?&lt;br /&gt;&lt;br /&gt;And if I'm going with CUDA, then I will not only buy Tesla for my workstation, but also GeForce for my desktop &amp;amp; laptop; there will be less incentive for me to buy an AMD CPU as well. That is really a good way to keep me away from being AMD's customer, isn't it?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-8959840806195118188?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/8959840806195118188/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=8959840806195118188' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8959840806195118188'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8959840806195118188'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2010/05/gpgpu-war-between-atievergreen-and.html' title='GPGPU and its battle of nVidia vs ATI'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5792450318096706215</id><published>2010-04-27T08:16:00.000-07:00</published><updated>2010-05-19T15:43:42.592-07:00</updated><title type='text'>Stating the facts or bad-mouthing his former employer?</title><content type='html'>Over at the MacRumors forum, someone claiming to be a former AMD employee recently started to criticize AMD and the people working there. His posts can be seen at the following links: &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9683616&amp;amp;postcount=25"&gt;1&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9683667&amp;amp;postcount=39"&gt;2&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9683707&amp;amp;postcount=48"&gt;3&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9683789&amp;amp;postcount=68"&gt;4&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9684327&amp;amp;postcount=147"&gt;5&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9685563&amp;amp;postcount=242"&gt;6&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9698440&amp;amp;postcount=535"&gt;7&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9706695&amp;amp;postcount=559"&gt;8&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9708648&amp;amp;postcount=562"&gt;9&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9710599&amp;amp;postcount=565"&gt;10&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9734533&amp;amp;postcount=590"&gt;11&lt;/a&gt;, &lt;a class="postlink" 
href="http://forums.macrumors.com/showpost.php?p=9736602&amp;amp;postcount=598"&gt;12&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9745462&amp;amp;postcount=612"&gt;14&lt;/a&gt;, &lt;a class="postlink" href="http://forums.macrumors.com/showpost.php?p=9746191&amp;amp;postcount=620"&gt;15&lt;/a&gt; (thanks to &lt;a href="http://www.amdzone.com/phpbb3/viewtopic.php?f=52&amp;amp;t=137432&amp;amp;start=325#p181204"&gt;dm7000s at AMDZone&lt;/a&gt; for collecting these links).&lt;br /&gt;&lt;br /&gt;In my opinion, it is both "interesting" and "fishy" to see someone do this to his former employer. On one hand, he may be revealing some real problems inside (parts of) the company which other people (either inside or outside AMD) wouldn't know or recognize. On the other hand, he may be right about things that he claims to know, but due to his bitterness wrong about the conclusions.&lt;br /&gt;&lt;br /&gt;I believe in this case, it is the latter. At any rate, let's go through some of his points below and see, assuming his claims are all facts, how well his conclusions hold up:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;AMD has not been financially successful since "K8"&lt;/i&gt;. This may be due not to any problem of AMD's, but to Intel's monopoly tactics. One should ask: why was AMD "financially successful" during the K8 days in the first place? Was it because only those who designed K8 knew what they were doing? Was it because they hand-crafted every transistor? Or was it really because both Itanium and Netburst terribly sucked in real-world tests? I'd argue it's only the last.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;AMD has been losing key employees&lt;/i&gt;. Losing employees is tough for any company. Yet, sometimes a company &lt;span style="font-style: italic;"&gt;has&lt;/span&gt; to lose weight when it is evolving and before it can start growing again. The question is not whether someone did something grand, but whether he will do something grander. 
What would have been the grander next step after K8? Could AMD have beaten Intel by making an over-complicated "K9" with SMT and turbo mode and everything else? I'd argue that, given the required design and verification effort, no "key employee" could have made this happen in a timely and cost-effective manner.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;AMD is not hand-instantiating designs &lt;strike&gt;transistors&lt;/strike&gt; anymore&lt;/i&gt;. &lt;strike&gt;Anyone (who is a electrical engineer) can hand craft transistors. It is at the end of the day primarily a labor intensive task. If you're a CTO and you expect your company to hand craft transistors better than your 10x oversized competitor, then you're not being realistic. You won't win. And with the "unfair" agreements between Intel and AMD prior to their 2009 settlement, AMD was simply forbidden to reap the same amount of profit by selling hand-crafted processors of higher performance.&lt;/strike&gt;&amp;nbsp; I was informed by a kind reader, who seems to know what was going on inside AMD, that hand-crafting transistors is &lt;i&gt;emphatically&lt;/i&gt; what AMD &lt;i&gt;did not&lt;/i&gt; do for K8. Unlike Intel, which does a lot of custom design, AMD used standard cells for the ALUs and most other components. However, AMD did a lot of circuit placement and routing by hand, with a superb physical design (implementation) team. Somewhat related to the previous comment, I was also told that many in that team have left AMD over the past few years.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;AMD did not make any new architecture after K8&lt;/i&gt;. Architecture is a flimsy thing. At its heart, K8 is no more than K7 plus extra 64-bit registers and an integrated NB. The way instructions are broken down to macro-ops and micro-ops, the basic organization of the ROB and the separate INT and FP schedulers are all the same between K7 and K8. The HyperTransport-based NB, the exclusive L1/L2 cache, and the improved TLB gave K8 solid performance. 
But so did the improvements made to K10, like the shared L3, unganged memory, probe filter, and greater scalability. K8's NB was designed to support up to 8P in a single system; few wanted that over the years. Today K10-based Magny Cours processors allow 48 cores with perhaps tighter inter-core communication. I bet Intel very much wants to do the same.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In conclusion.... is that guy simply stating the facts, or is he bad-mouthing his former employer? Personally, I think what he did was immature and immoral, even if what he said were facts. &lt;b&gt;I was told that there have been some political struggles inside AMD during the post-K8 years, and I also believe that such politics must've brought with it some&amp;nbsp;waste of time and money as well as loss of talent.&lt;/b&gt; But still.... in my humble opinion, that's no good excuse for picking on your former employer and starting a public brawl.&lt;br /&gt;&lt;br /&gt;&lt;img alt=":mrgreen:" src="http://www.amdzone.com/phpbb3/images/smilies/icon_mrgreen.gif" title="Mr. 
Green" /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5792450318096706215?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5792450318096706215/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5792450318096706215' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5792450318096706215'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5792450318096706215'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2010/04/stating-facts-of-bad-mouthing-his.html' title='Stating the facts or bad-mouthing his former employer?'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-788455632886528573</id><published>2008-04-24T20:11:00.000-07:00</published><updated>2008-09-17T23:31:26.264-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><category scheme='http://www.blogger.com/atom/ns#' term='K10'/><title type='text'>Internet Misinformation - On K10 vs Core2 Bandwidths</title><content type='html'>The Internet is filled with all kinds of information. For the most part this is a very good thing: who doesn't like new things and thoughts? 
However, if one is not careful enough, such free information can easily become &lt;span style="font-style: italic;"&gt;mis&lt;/span&gt;information, especially regarding technical things where either the author doesn't really know what he is writing, or he actually knows but deliberately misleads his readers for marketing or financial reasons.&lt;br /&gt;&lt;br /&gt;In recent years, with the spread of consumer-grade "benchmarks," we have seen many of what I'd call "folklore comparisons" of different PC platforms, most recently AMD's Barcelona (K10) and Intel's Xeon (Core2) microarchitectures. Specifically, many of these folklore comparisons made by on-line reviews give Intel-favoring misinformation to "justify" Core2's "theoretical ILP advantage" on paper. In this article I will look more closely at both sides of this argument: &lt;span style="font-style: italic;"&gt;Is it or is it not justified to attribute Core2's better performance in some cases&lt;/span&gt;&lt;span style="font-style: italic;"&gt; to the supposed "advantage" in its microarchitecture, and does such an advantage actually exist?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Misinformation on L1 Data Cache Bandwidth&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The first example starts with a "test" AnandTech performed regarding K10 vs Core2 L1 data cache bandwidth:&lt;br /&gt;&lt;br /&gt;&lt;table border="1" cellpadding="0" cellspacing="0" width="600"&gt;&lt;tbody&gt;&lt;tr bgcolor="#016a96"&gt;&lt;td colspan="6" style="text-align: center;"&gt;&lt;b&gt;Lavalys Everest L1 Bandwidth&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Read (MB/s)&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Write (MB/s)&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Copy (MB/s)&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Bytes/cycle 
(Read)&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Latency (ns)&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Opteron 2350 2 GHz&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;32117&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;16082&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;23935&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;16.06&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;1.5&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Xeon 5160 3.0&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;47860&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;47746&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;95475&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;15.95&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;1&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Xeon E5345 2.33&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;37226&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;37134&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;74268&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;15.96&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;1.3&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Opteron 2224 SE&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;51127&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;25601&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;44080&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;15.98&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;0.9&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="text-align: left;" bg=""&gt;&lt;td&gt;&lt;span&gt;&lt;b&gt;Opteron 8218HE 2.6 
GHz&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;41541&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;20801&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;35815&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;15.98&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span&gt;1.1&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;From the values above it would appear that, per clock cycle, K10 can load as much data as Core2 (16 bytes) but store only half as much (8 bytes). As pointed out by scientia at AMDZone, these numbers do not seem correct. In fact, according to &lt;a href="http://pc.watch.impress.co.jp/docs/2006/1013/kaigai05.pdf"&gt;this presentation slide&lt;/a&gt;, AMD's Barcelona (K10) processors could theoretically perform &lt;span style="font-style: italic;"&gt;two&lt;/span&gt; 128-bit (16-byte) loads per clock cycle, or twice the Core2's L1 data cache bandwidth. Why the contradiction?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Rebuttal and Explanation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;What is shown here is a perfect example of the misinformation that comes out of such a "folklore comparison": AnandTech used synthetic benchmark tools without really knowing what they do. A synthetic benchmark would underestimate Opteron's L1 bandwidth because, with sequential accesses and small strides, it stresses only one of the two ports of K10's L1 cache.&lt;br /&gt;&lt;br /&gt;Recall that K10's L1 data cache (L1D) is 2-way set associative with two 128-bit ports. Internally each port is connected to one bus going to the Load-Store Unit (LSU). 
This arrangement is described both verbally (2nd paragraph, page 223, A.5.2) and graphically (Figure 11, page 230, A.13) in the &lt;a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf"&gt;Software Optimization Guide for AMD Family 10h Processors&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/SBV4wDTt1EI/AAAAAAAAAE8/QQX4L5ijTTQ/s1600-h/LSU.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/SBV4wDTt1EI/AAAAAAAAAE8/QQX4L5ijTTQ/s400/LSU.jpg" alt="" id="BLOGGER_PHOTO_ID_5194190512158790722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Ideally, in every clock cycle, two 128-bit words, &lt;span style="font-style: italic;"&gt;one from each port&lt;/span&gt;, can be read from L1D to LSU and forwarded to the execution units. What happens with synthetic benchmarks is probably that, due to their fine-grained assembly-level "optimization," they end up running unrealistic code that favors one microarchitecture over another. On K10, it is in fact possible to force data accesses to&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;span style="font-style: italic;"&gt;the same cache way&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;the same cache bank&lt;/span&gt;. Such accesses will only utilize one of the two available ports and result in half the optimal bandwidth.&lt;br /&gt;&lt;br /&gt;In practice, do we always find data accesses to the same port? Of course not. Clearly the type of test that AnandTech ran reflects little, if any, realistic processor performance. Such tests are at best echoes of uneducated folklore opinions, or at worst plain FUD. 
On the other hand, we also won't find all data accesses going to different ports, so a theoretical calculation of K10's maximum L1D bandwidth (that it is twice as high as Core2's) is also overly optimistic and unrealistic. In reality, if 50% of cache accesses are spread across the two ports, then on average it would take 3 cycles to read 4 words (one cycle to read 2 words, two cycles to read 1 word each). The average bandwidth would be 4/3 = 1.33 reads or writes per clock cycle. In other words, K10 would achieve about 67% (1.33/2) of its theoretical max L1D bandwidth, which would be 33% higher than Core 2 for reads and 33% lower for writes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Proof and Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To prove that this theory is true, I wrote a program in C with gcc's SSE intrinsics to test the bandwidths myself. Skipping other I/O &amp;amp; maintenance parts, the kernel of the code looks like this:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/SBU7STTt1BI/AAAAAAAAAEk/qv75nxPcsNc/s1600-h/source.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/SBU7STTt1BI/AAAAAAAAAEk/qv75nxPcsNc/s400/source.gif" alt="" id="BLOGGER_PHOTO_ID_5194122930848388114" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The above code was compiled with gcc 3.4.4 using -O2 and -march=k8. The generated assembly is also included below for assurance that the code really does what it's supposed to do. (Note: When trying it on gcc 4.2, the compiler is smart enough to know that the loop doesn't actually do any useful work, and does not generate any SSE load instruction. 
In this case the code above achieves 1 iteration (skipping 8 loads) per cycle, limited by the conditional branch bubble.)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/SBU7fTTt1DI/AAAAAAAAAE0/D3CLkmzMr-g/s1600-h/assembly.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/SBU7fTTt1DI/AAAAAAAAAE0/D3CLkmzMr-g/s400/assembly.gif" alt="" id="BLOGGER_PHOTO_ID_5194123154186687538" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The program has three tests: SSE store, SSE load, and SSE PAND; only SSE PAND is shown above but the other two are very similar. When &lt;span style="font-weight: bold;"&gt;running on a Phenom 9500 @2.2GHz&lt;/span&gt;, the achieved bandwidths for store, load, and PAND are 28GB/s, 46GB/s, and 46GB/s, respectively. This &lt;span style="font-weight: bold;"&gt;translates to about 1.3x 16B reads/cycle and 0.8x 16B writes/cycle&lt;/span&gt; (46GB/s divided by 2.2GHz is about 20.9 bytes per cycle, or 1.3 x 16B; 28GB/s divided by 2.2GHz is about 12.7 bytes per cycle, or 0.8 x 16B). So without any special treatment, at the C-source level, I could already get 30% better L1D read bandwidth and 60% better L1D write bandwidth than AnandTech's "test" results; furthermore, the theoretical estimates that I offered in the previous section were actually fairly close.&lt;br /&gt;&lt;br /&gt;Out of curiosity I took the same program to run on a Core 2-based 2.0GHz Xeon machine. It turns out that it only achieves ~85% max bandwidth there, &lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, less than 14B reads and writes per clock cycle. 
&lt;span style="font-weight: bold;"&gt;Thus per clock cycle, the L1D read bandwidth on Core 2 is only 2/3 of that on K10, whereas the write bandwidth is just 8% higher.&lt;/span&gt; Frankly I have no idea how Lavalys Everest writes its benchmarking code to produce such vastly different results; but really, anyone with a clear mind should care not about how some synthetic binary runs, but about what &lt;span style="font-style: italic;"&gt;he&lt;/span&gt; can achieve on the platform &lt;span style="font-style: italic;"&gt;at the source level&lt;/span&gt;, without dirtying his hands with assembly optimizations (or de-optimizations for the non-Intel cases).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-788455632886528573?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/788455632886528573/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=788455632886528573' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/788455632886528573'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/788455632886528573'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2008/04/two-sides-of-mirror-on-k10-vs-core2.html' title='Internet Misinformation - On K10 vs Core2 Bandwidths'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_oGCeAi-2i3Q/SBV4wDTt1EI/AAAAAAAAAE8/QQX4L5ijTTQ/s72-c/LSU.jpg' 
height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5880600419536294921</id><published>2007-09-22T13:26:00.000-07:00</published><updated>2007-09-25T02:18:36.530-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ISA'/><title type='text'>AMD's latest x86 extension: SSE5 - Part 2</title><content type='html'>Series Index -&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://abinstein.blogspot.com/2007/09/amds-latest-greatest-x86-extension-sse5.html"&gt;Part 1. An Overview&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Part 2. Face-Off with SSSE3 and SSE4.x&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In this part we will compare Intel's SSSE3 and SSE4.x with AMD's SSE5. More specifically, we will look at how one can (or cannot) use SSE5 to accomplish the same tasks performed by SSSE3 and SSE4.x. The central question we're trying to answer here is whether the SSE5 from AMD is strictly an &lt;span style="font-style: italic;"&gt;extension&lt;/span&gt; to Intel's SSE4, or in some sense a &lt;span style="font-style: italic;"&gt;replacement&lt;/span&gt; for SSSE3 and SSE4.x (which none of AMD's current processors - including Barcelona and Phenom - supports).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Syntactical Similarity&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The original 8086/8087 have one-byte opcode instructions (if we ignore the ModRM bits used for 8087 and a handful of others such as bit rotations). One remaining opcode byte that was conveniently unused turned out to be &lt;code&gt;0Fh&lt;/code&gt;; had it been used, it would've had the meaning of &lt;code&gt;POP CS&lt;/code&gt;, which was not there because it &lt;a href="http://www.arl.wustl.edu/%7Elockwood/class/cs306/books/artofasm/Chapter_6/CH06-1.html#HEADING1-160"&gt;would create some interesting program flow control problems&lt;/a&gt;. 
Using &lt;code&gt;0Fh&lt;/code&gt; as an escape byte followed by a second byte, the 80{2|3|4}86, Pentium, MMX, 3DNow!, and SSE/2/3/4a added a number of two-byte opcode instructions.&lt;br /&gt;&lt;br /&gt;After the addition of SSE4a from AMD, the only free two-byte opcodes left are the following: &lt;code&gt;&lt;span style="font-weight: bold;"&gt;0F0&lt;/span&gt;{&lt;span style="font-weight: bold;"&gt;4&lt;/span&gt;,&lt;span style="font-weight: bold;"&gt;A&lt;/span&gt;,&lt;span style="font-weight: bold;"&gt;C&lt;/span&gt;}h&lt;/code&gt;, &lt;code&gt;&lt;span style="font-weight: bold;"&gt;0F2&lt;/span&gt;{&lt;span style="font-weight: bold;"&gt;4-7&lt;/span&gt;}h&lt;/code&gt;, &lt;code&gt;&lt;span style="font-weight: bold;"&gt;0F3&lt;/span&gt;{&lt;span style="font-weight: bold;"&gt;6-F&lt;/span&gt;}h&lt;/code&gt;, &lt;code&gt;&lt;span style="font-weight: bold;"&gt;0F7&lt;/span&gt;{&lt;span style="font-weight: bold;"&gt;A&lt;/span&gt;,&lt;span style="font-weight: bold;"&gt;B&lt;/span&gt;}h&lt;/code&gt;, and &lt;code&gt;&lt;span style="font-weight: bold;"&gt;0FA&lt;/span&gt;{&lt;span style="font-weight: bold;"&gt;6&lt;/span&gt;,&lt;span style="font-weight: bold;"&gt;7&lt;/span&gt;}h&lt;/code&gt;. Why is this important? Because these points in the two-byte opcode space are the &lt;span style="font-style: italic;"&gt;only entries&lt;/span&gt; where the x86 ISA can be further extended. Obviously, the two dozen or so entries are not enough for any large-scale extension.&lt;br /&gt;&lt;br /&gt;In order to further extend the instruction set in a significant way, the opcode itself must be extended from two bytes to three. This is where SSSE3/SSE4.x and SSE5 bear the most similarity: &lt;span style="font-weight: bold;"&gt;they all consist (mainly) of instructions with &lt;span style="font-style: italic;"&gt;three&lt;/span&gt; opcode bytes&lt;/span&gt;. 
Intel carved out &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;38&lt;/span&gt;xxh&lt;/code&gt; and &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;3A&lt;/span&gt;xxh&lt;/code&gt; for SSSE3 and SSE4.x, whereas AMD took &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;24&lt;/span&gt;xxh&lt;/code&gt;, &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;25&lt;/span&gt;xxh&lt;/code&gt;, &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;7A&lt;/span&gt;xxh&lt;/code&gt; and &lt;code&gt;0F&lt;span style="font-weight: bold;"&gt;7B&lt;/span&gt;xxh&lt;/code&gt; for SSE5.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Syntactical Differences&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;However, the syntactical similarity between Intel's and AMD's extensions pretty much ends right here. As we've seen in &lt;a href="http://abinstein.blogspot.com/2007/09/amds-latest-greatest-x86-extension-sse5.html"&gt;Part 1.&lt;/a&gt; of this series, &lt;span style="font-weight: bold;"&gt;SSE5 instruction encoding is &lt;span style="font-style: italic;"&gt;regular&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;orthogonal&lt;/span&gt;&lt;/span&gt;: the 3rd opcode byte (Opcode3) always has &lt;span style="font-weight: bold;"&gt;5 bits&lt;/span&gt; for opcode extension, &lt;span style="font-weight: bold;"&gt;1 bit&lt;/span&gt; for operand ordering, and &lt;span style="font-weight: bold;"&gt;2 bits&lt;/span&gt; for operand size.&lt;br /&gt;&lt;br /&gt;On the other hand, the encoding of SSSE3 and SSE4.x instructions may well have been arbitrary for anyone outside Intel. For example, look at the following SSSE3 instructions:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;PSIGN&lt;span style="font-weight: bold;"&gt;B&lt;/span&gt; - 0F380h 10&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b ... 
PABS&lt;span style="font-weight: bold;"&gt;B&lt;/span&gt; - 0F381h 11&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PSIGN&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt; - 0F380h 10&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;PABS&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt; - 0F381h 11&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PSIGN&lt;span style="font-weight: bold;"&gt;D&lt;/span&gt; - 0F380h 10&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;PABS&lt;span style="font-weight: bold;"&gt;D&lt;/span&gt; - 0F381h 11&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;It may &lt;span style="font-style: italic;"&gt;seem&lt;/span&gt; from the above that the two right-most bits encode the operand size - &lt;code&gt;00b&lt;/code&gt; for &lt;span style="font-style: italic;"&gt;byte&lt;/span&gt;, &lt;code&gt;01b&lt;/code&gt; for &lt;span style="font-style: italic;"&gt;word&lt;/span&gt;, and &lt;code&gt;10b&lt;/code&gt; for &lt;span style="font-style: italic;"&gt;dword&lt;/span&gt;. However, take another look at the following SSSE3 instructions:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-weight: bold;"&gt;PSHUFB&lt;/span&gt; - 0F380h 00&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;&lt;span style="font-weight: bold;"&gt;PMADDUBSW&lt;/span&gt; - 0F380h 01&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PHADD&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt; - 0F380h 00&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;...... 
&lt;/code&gt;&lt;code&gt;PHSUB&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt; - 0F380h 01&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PHADD&lt;span style="font-weight: bold;"&gt;D&lt;/span&gt; - 0F380h 00&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;...... &lt;/code&gt;&lt;code&gt;PHSUB&lt;span style="font-weight: bold;"&gt;D&lt;/span&gt; - 0F380h 01&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;For some (probably legitimate) reason, Intel designers decided not to include horizontal &lt;span style="font-style: italic;"&gt;byte&lt;/span&gt; additions and subtractions; instead they (most "exceptionally") squeezed in a byte-shuffle instruction and a specialized multiply-add instruction. We see that 30 years later, people at Intel still design instructions exactly the same way as 30 years ago: in a way that &lt;span style="font-style: italic;"&gt;doesn't make sense&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Even worse cases are seen in SSE4.x. The following example shows the encodings used for packed MAX and packed MIN instructions:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;PMAX&lt;span style="font-weight: bold;"&gt;SB&lt;/span&gt; - 0F383h 11&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;PMIN&lt;span style="font-weight: bold;"&gt;SB&lt;/span&gt; - 0F383h 10&lt;span style="font-weight: bold;"&gt;00&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PMAX&lt;span style="font-weight: bold;"&gt;SD&lt;/span&gt; - 0F383h 11&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... 
&lt;/code&gt;&lt;code&gt;PMIN&lt;span style="font-weight: bold;"&gt;SD&lt;/span&gt; - 0F383h 10&lt;span style="font-weight: bold;"&gt;01&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PMAX&lt;span style="font-weight: bold;"&gt;UW&lt;/span&gt; - 0F383h 11&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;PMIN&lt;span style="font-weight: bold;"&gt;UW&lt;/span&gt; - 0F383h 10&lt;span style="font-weight: bold;"&gt;10&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;code&gt;PMAX&lt;span style="font-weight: bold;"&gt;UD&lt;/span&gt; - 0F383h 11&lt;span style="font-weight: bold;"&gt;11&lt;/span&gt;b &lt;/code&gt;&lt;code&gt;... &lt;/code&gt;&lt;code&gt;PMIN&lt;span style="font-weight: bold;"&gt;UD&lt;/span&gt; - 0F383h 10&lt;span style="font-weight: bold;"&gt;11&lt;/span&gt;b&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Note how the different operand types and operand sizes are squeezed cozily into consecutive opcode byte values without much sense. For some mystical reason, the &lt;span style="font-style: italic;"&gt;unsigned word&lt;/span&gt; operations are put quite arbitrarily right next to the &lt;span style="font-style: italic;"&gt;signed dword&lt;/span&gt; operations. But wait... what happens to &lt;code&gt;P{MAX|MIN}SW&lt;/code&gt; and &lt;code&gt;P{MAX|MIN}UB&lt;/code&gt;? Well, they already are &lt;span style="font-style: italic;"&gt;SSE2&lt;/span&gt; instructions with opcode &lt;code&gt;0FE{E|A}h&lt;/code&gt; and &lt;code&gt;0FD{E|A}h&lt;/code&gt;, respectively. As can be seen in this example, the irregularity of SSE4.x also inherits from the poor design of SSE2.&lt;br /&gt;&lt;br /&gt;From a software programmer's point of view, the irregularity really doesn't matter as long as the compiler can generate these opcodes automatically. But such extension irregularity is nothing a circuit designer would love to implement. 
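&lt;br /&gt;&lt;br /&gt;To make the irregularity concrete, here is a minimal Python sketch that groups the third opcode bytes quoted above (all in the 0F 38 xx map) by their two right-most bits. The table layout and the names &lt;code&gt;OPCODES&lt;/code&gt;, &lt;code&gt;low_two_bits&lt;/code&gt; and &lt;code&gt;group_by_low_bits&lt;/code&gt; are hypothetical, chosen only for illustration; nothing here is taken from Intel's or AMD's documentation beyond the byte values themselves:&lt;br /&gt;
&lt;pre&gt;
```python
# Third opcode byte (0F 38 xx) for the SSSE3/SSE4.1 instructions listed above.
# The dictionary and helper names are illustrative assumptions.
OPCODES = {
    "PSIGNB": 0x08, "PSIGNW": 0x09, "PSIGND": 0x0A,
    "PABSB": 0x1C, "PABSW": 0x1D, "PABSD": 0x1E,
    "PSHUFB": 0x00, "PHADDW": 0x01, "PHADDD": 0x02,
    "PMADDUBSW": 0x04, "PHSUBW": 0x05, "PHSUBD": 0x06,
    "PMINSB": 0x38, "PMINSD": 0x39, "PMINUW": 0x3A, "PMINUD": 0x3B,
    "PMAXSB": 0x3C, "PMAXSD": 0x3D, "PMAXUW": 0x3E, "PMAXUD": 0x3F,
}


def low_two_bits(opcode):
    """Return the two right-most bits of an opcode byte as an integer 0..3."""
    return opcode % 4


def group_by_low_bits(opcodes):
    """Map each two-bit pattern to the sorted mnemonics that carry it."""
    groups = {0: [], 1: [], 2: [], 3: []}
    for name in sorted(opcodes):
        groups[low_two_bits(opcodes[name])].append(name)
    return groups


GROUPS = group_by_low_bits(OPCODES)
for bits in sorted(GROUPS):
    # Print the pattern in binary followed by every mnemonic that uses it.
    print(format(bits, "02b"), ", ".join(GROUPS[bits]))
```
&lt;/pre&gt;
In this grouping, &lt;code&gt;01&lt;/code&gt; holds word operations (PSIGNW, PHADDW) together with signed-dword ones (PMAXSD, PMINSD), so a decoder cannot derive the element size from those bits and has to special-case each opcode.&lt;br /&gt;&lt;br /&gt;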
&lt;span style="font-style: italic;"&gt;This&lt;/span&gt; is probably why Intel, assumed &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; incompetent, chose to design SSEx in such poor style - &lt;span style="font-weight: bold;"&gt;to make it as difficult as possible for anyone else (most prominently AMD) to offer compatible decoding&lt;/span&gt;. In the end, not only Intel's competitors but also its customers suffer from the bad choices: had Intel designed the original SSE/SSE2 the same way as AMD does SSE5, we would've had a much more complete &amp;amp; efficient set of x86 SIMD instructions &lt;span style="font-style: italic;"&gt;that makes sense! &lt;/span&gt;(Now, does Intel promote open &amp;amp; fair competition that benefits the consumers? Or does it aim at nothing but to screw up its competitors, sometimes together with its customers?)&lt;br /&gt;&lt;br /&gt;At any rate, as we've seen above, the encoding of SSE5 is &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; from SSSE3/SSE4.x and thus the former does &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; exclude the latter. In other words, it is possible for a processor to offer &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; SSE5 &lt;span style="font-style: italic;"&gt;and&lt;/span&gt; SSSE3/SSE4.x (much like 3DNow! and MMX). What about their functionalities, then? 
Below we'll look at each SSSE3 and SSE4.x instruction and see how its functionalities can or cannot be accomplished by SSE5.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Functional Comparison to SSSE3&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For SSSE3 instructions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;PHADDx/PHSUBx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Horizontally add/subtract word &amp;amp; dword in &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; source &lt;span style="font-style: italic;"&gt;and &lt;/span&gt;destination sub-operands and pack them into destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Each PHADDx/PHSUBx in SSE5 operates on only &lt;span style="font-style: italic;"&gt;one&lt;/span&gt; 128-bit packed source.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PMADDx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Multiply destination and source sub-operands, horizontally add the results, and store them back to destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;PMADx in SSE5 offers more powerful multiply-add intrinsics&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;No byte-to-word multiply-add in SSE5, though.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PSHUFB&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Shuffle bytes in destination according to source.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special &amp;amp; weaker cases of the first-half of PPERM in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PALIGNR&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Shift concatenated destination &amp;amp; source bytes back into destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special &amp;amp; weaker cases of the first-half of PPERM in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PSIGNx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Retain, negate, or set zero sub-operands in destination if corresponding sub-operands in source are positive, negative, or zero, respectively.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct 
implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PABSx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Store the unsigned absolute values of source sub-operands into destination sub-operands.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PMULHRSW&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Multiply 16-bit sub-operands of destination and source and store the &lt;span style="font-style: italic;"&gt;rounded&lt;/span&gt; high-order 16-bit results back to destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;It can be seen that most SSSE3 instructions are &lt;span style="font-style: italic;"&gt;not directly implemented&lt;/span&gt; in SSE5, with possibly the exceptions of PSHUFB, PALIGNR, and PHADDx/PHSUBx. However, these latter SSSE3 instructions &lt;span style="font-style: italic;"&gt;can&lt;/span&gt; still be useful as lower-latency, lower-instruction count shortcuts to the more generic &amp;amp; powerful SSE5 counterparts. 
Thus from this point of view, &lt;span style="font-weight: bold;"&gt;future AMD processors will&lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt; probably&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; still benefit from implementing SSSE3 &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;together with&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; SSE5&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Functional Comparison to SSE4.x&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For SSE4.1 instructions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;PMULLD&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Multiply 32-bit sub-operands of destination and source and store the low-order 32-bit results back to destination.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Can be done by two PMULUDQ (SSE2) followed by a PPERM.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;DPPS/DPPD&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Horizontally dot-product single/double precision floating-point sub-operands in destination and source and selectively store results to destination sub-operand fields.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;FMADx in SSE5 offer more powerful &amp;amp; flexible floating-point dot product intrinsics.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;MOVNTDQA&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Non-temporal double-quadword load from WC memory type into an internal buffer of processor, &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; storing to the cache hierarchy.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Specific to Intel processor implementation.&lt;/li&gt;&lt;li&gt;PREFETCHNTA in Opteron &amp;amp; later works for the same purpose.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;BLENDx and PBLENDx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Conditionally copy sub-operands from source into destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special and weaker cases of PERMPx and PPERM in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PMAXx and 
PMINx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Packed max and min operations of destination and source&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Can be accomplished by a PCOMx followed by a PPERM in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;EXTRACTPS/PEXTRx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Extract sub-operands from an XMM register (source) to memory or a general-purpose register (destination).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special and weaker case of PERMPx for memory destination.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;No direct implementation for GPR destination in SSE5.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;INSERTPS/PINSRx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Optionally copy sub-operands from source to destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special and weaker case of PERMPx in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PMOVx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Sign- or zero-extend source sub-operands to destination.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special and weaker case of PPERM with a proper mux/logical argument. 
&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PCMPEQQ&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Packed compare-equal between destination and source and store results back to destination.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special and weaker case of PCOMQ in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;MPSADBW&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Compute "sum of absolute byte-difference" between one 4-byte group in source and eight 4-byte groups in destination and store the eight results back to destination&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;No direct implementation in SSE5.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PHMINPOSUW&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Find the minimum word horizontally in source and put its value in DEST[15:0] and its index in DEST[18:16]&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PACKUSDW&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Convert signed dword to unsigned word with saturation.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Complements PACKSSDW/PACKUSWB/PACKSSWB in SSE2&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;PTEST, ROUNDx&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Logical zero test, packed precision rounding.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Copied directly to SSE5.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;For SSE4.2 instructions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;PCMPGTQ&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Packed compare for greater than&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Special &amp;amp; weaker case of PCOMQ in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;String match, CRC32&lt;/li&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;No direct implementation in SSE5.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;POPCNT&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Copied directly from AMD's POPCNT.&lt;br 
/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;A few pieces of evidence from above show that &lt;span style="font-weight: bold;"&gt;it's probably &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;not&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; very likely for a future AMD processor to implement SSE4.1 &amp;amp; SSE4.2 &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;in addition to&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; SSE5&lt;/span&gt;. First, some of the instructions are copied directly from SSE4.1 to SSE5 (PTEST and ROUNDx); had AMD wanted to implement SSE4.1 before SSE5, it would've been &lt;span style="font-style: italic;"&gt;un&lt;/span&gt;necessary to copy these instructions. Second, those instructions in SSE4.x that do not have superior SSE5 counterparts are either extremely specialized (MPSADBW, PHMINPOSUW, string match &amp;amp; CRC32), or able to be accomplished more flexibly by two or fewer SSE5 instructions.&lt;br /&gt;&lt;br /&gt;We can also see how Intel designers work very hard to squeeze functionalities into the poor syntax of SSE4.x, resulting in a poor extension design. One example is the BLENDx/PBLENDx instructions. Instead of using the proper SSE5-like 3-way syntax, the variable selector in SSE4.1 is set &lt;span style="font-style: italic;"&gt;implicitly&lt;/span&gt; to XMM0, not only requiring additional register shuffling but also limiting the number of permutation types to only 1 at any moment.&lt;br /&gt;&lt;br /&gt;Another example is the DPPS/DPPD instructions, where the dot-product is performed &lt;span style="font-style: italic;"&gt;partially &lt;/span&gt;vertically and &lt;span style="font-style: italic;"&gt;partially&lt;/span&gt; horizontally. To make these instructions useful the two source vectors must be arranged to alternate positions: (A0, B0), (A1, B1), (A2, B2), ... 
Not only can such an arrangement be costly by itself, but after the operation one of the arranged source vectors is also &lt;span style="font-style: italic;"&gt;destroyed&lt;/span&gt; (replaced by the dot-product result).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Concluding Part 2.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Comparing SSE5 with SSSE3/SSE4, it seems that after years of being dragged along by Intel's poor extension designs, AMD has finally decided to make its own next step in a better way. As I've discussed above, it's probably more advantageous for AMD to implement SSSE3 together with SSE5, and less so to implement SSE4.1 &amp;amp; SSE4.2.&lt;br /&gt;&lt;br /&gt;However, as we know, commercial software in general and benchmarks in particular, especially on the desktop enthusiast market, are heavily influenced by the bigger company; thus if it turns out that SSE4.x instructions are used extensively to benchmark processor performance, then it is still possible for AMD to implement them in its future processors. 
But let's hope for all customers' sake this is not going to happen, and future x86 extension will follow more of AMD's SSE5 than Intel's SSE4.x.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5880600419536294921?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5880600419536294921/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5880600419536294921' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5880600419536294921'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5880600419536294921'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/09/amds-latest-x86-extension-sse5-part-2.html' title='AMD&apos;s latest x86 extension: SSE5 - Part 2'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-4969549532666424600</id><published>2007-09-21T11:11:00.000-07:00</published><updated>2007-09-25T02:18:13.699-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ISA'/><title type='text'>AMD's latest x86 extension: SSE5 - Part 1</title><content type='html'>Series Index -&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Part 1. An Overview&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://abinstein.blogspot.com/2007/09/amds-latest-x86-extension-sse5-part-2.html"&gt;Part 2. 
Face-Off with SSSE3 and SSE4.x&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The &lt;a href="http://developer.amd.com/sse5.jsp"&gt;SSE5 announcement&lt;/a&gt; made by AMD earlier this month is something big. In fact, in terms of instruction scope and architectural design, it is bigger than SSE3, SSSE3, and SSE4 &lt;span style="font-style: italic;"&gt;combined&lt;/span&gt;. If we think of AMD64 as completely revamping x86-based general-purpose computing (as generally conceived by the industry), then we can also think of SSE5 as completely revamping x86-based SIMD acceleration. In my opinion, the leaps made by AMD in both AMD64 and SSE5 firmly assert the company as the leader in x86 computing architectures, leaving Intel gasping far behind.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The SSE5 Superiority&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are a few things that make SSE5 a "superior" kind of SIMD (Single-Instruction Multiple-Data) instructions different from all the previous SSE{1-4}:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;SSE5 is a &lt;span style="font-weight: bold;"&gt;generic SIMD extension that aims to &lt;/span&gt;&lt;a style="font-weight: bold;" href="http://www.ddj.com/hpc-high-performance-computing/201803067"&gt;accelerate not just multimedia but also HPC and security applications&lt;/a&gt;.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;In contrast, previous SSEx, especially SSE3 and later, were designed specifically with media processing in mind.&lt;/li&gt;&lt;li&gt;The CRC and string match instructions of SSE4.2 are too specialized to be generally useful.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;SSE5 instructions can &lt;span style="font-weight: bold;"&gt;operate on up to three distinct memory/register operands&lt;/span&gt;.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;It allows true 3-operand operations, where the destination operand is &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; from any of the two source operands.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;It allows 
3-way 4-operand operations, where the destination operand is the same as one of the &lt;span style="font-style: italic;"&gt;three&lt;/span&gt; source operands.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;SSE5 includes &lt;span style="font-weight: bold;"&gt;powerful and generic Vector Conditional Moves (both integer and floating-point)&lt;/span&gt;.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Only four instructions (mnemonics) are added: PCMOV for generic bits, PPERM for integer bytes/(d,q)words, PERMPS/PERMPD for single/double-precision floating-point values.&lt;/li&gt;&lt;li&gt;Powerful enough to move data from any part of the 128-bit source memory/register to any part of the 128-bit destination register, &lt;span style="font-style: italic;"&gt;plus&lt;/span&gt; optional logical post-operations.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;SSE5 includes &lt;span style="font-weight: bold;"&gt;both integer arithmetic &amp;amp; logic, and floating-point arithmetic &amp;amp; compare instructions&lt;/span&gt;.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;For integer arithmetic, it includes both true vertical Multiply-Accumulate and flexible horizontal Adds/Subs.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;An Analytical View of SSE5 Instruction Format&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;All of the above shows one thing: SSE5 is a well-planned, thoroughly articulated, and carefully designed ISA extension. 
The amazing thing is that the designers at AMD accomplished all of this by simply adding &lt;span style="font-style: italic;"&gt;a single DREX byte&lt;/span&gt; in-between the SIB and Displacement bytes, as shown in the figure below (taken from page 2 of &lt;a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/43479.pdf"&gt;AMD's SSE5 documentation&lt;/a&gt;):&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RvR-qT-_NxI/AAAAAAAAAD8/qzGliWdiOI0/s1600-h/sse5_format.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RvR-qT-_NxI/AAAAAAAAAD8/qzGliWdiOI0/s400/sse5_format.gif" alt="" id="BLOGGER_PHOTO_ID_5112850742356424466" border="0" /&gt;&lt;/a&gt;A question naturally arises: will the additional DREX byte further increase instruction lengths? Fortunately, &lt;span style="font-style: italic;"&gt;not a single bit&lt;/span&gt;. According to the official document linked above, &lt;span style="font-weight: bold;"&gt;those &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;SSE5 instructions that use the DREX byte can not only take 3 distinct operands but also access all 16 XMM registers &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; the AMD64 REX prefix&lt;/span&gt;; in fact, the use of the DREX byte in an SSE5 instruction &lt;span style="font-style: italic;"&gt;excludes&lt;/span&gt; the use of the REX prefix. SSE5 instruction lengths are just as long as needed and as short as they can be. (We will talk more about possible further extensions to AMD64 REX and SSE5 DREX in a later part.)&lt;br /&gt;&lt;br /&gt;Another great merit of SSE5 instruction encoding is that it is simple and regular. 
Note the "Opcode3" byte in the above picture, the main byte that distinguishes among different SSE5 instructions: &lt;span style="font-weight: bold;"&gt;its encoding is astonishingly simple: 5 bits for opcode, 1 bit for operand ordering, and 2 bits for operand size&lt;/span&gt;. The result is &lt;span style="font-weight: bold;"&gt;an &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;orthogonal&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; instruction encoding&lt;/span&gt; - you only need to look at an opcode field by itself to know what it means. &lt;span style="font-style: italic;"&gt;In contrast&lt;/span&gt;, the 3rd opcodes of Intel's SSSE3 and SSE4 instructions seem as if picked by a spoiled child to purposely screw up any implementation. (We will talk more about the comparison between AMD's SSE5 and Intel's SSSE3/SSE4 &lt;a href="http://abinstein.blogspot.com/2007/09/amds-latest-x86-extension-sse5-part-2.html"&gt;in a later part&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Types of SSE5 Instructions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are several major types of instructions in SSE5:&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;ol&gt;&lt;li&gt;Various integer and floating-point multiply-accumulate (MAC) instructions.&lt;/li&gt;&lt;li&gt;Vector conditional move (CMOV) and permutation (PERM) instructions.&lt;/li&gt;&lt;li&gt;Vector compare and predicate generation instructions.&lt;/li&gt;&lt;li&gt;Packed integer horizontal add and subtract.&lt;/li&gt;&lt;li&gt;Vectorized rounding, precision control, and 16-bit FP conversion.&lt;/li&gt;&lt;/ol&gt;&lt;/span&gt;A single PTEST instruction in Type 3 and four ROUNDx instructions in Type 5 above are copied directly from Intel's SSE4.1; together with other Type 4 and Type 5 instructions these are the SSE5 instructions that do &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; contain the DREX byte. 
All the other Type 1-3 SSE5 instructions utilize the DREX byte to specify a 3rd distinct (destination) operand and to offer access to XMM8-XMM15 registers (without &amp;amp; excluding the REX prefix).&lt;br /&gt;&lt;br /&gt;In particular, the Type 1 (MAC) and Type 2 (CMOV/PERM) instructions are 3-way &lt;span style="font-style: italic;"&gt;4-operand &lt;/span&gt;operations, with the destination set to either source 1 or source 3. &lt;span style="font-weight: bold;"&gt;The fact that 3-way operation is allowed - even with destination equal to one of the sources - is instrumental in enabling flexible MAC and CMOV/PERM instructions.&lt;/span&gt; In the case of MAC, two multipliers and an accumulator must be specified; in the case of CMOV/PERM, two sources and a conditional predicate must be given. Without the ability to address 3 distinct operands, these two types of accelerations are either impossible or done awkwardly (more on Intel's SSE4.1-way of doing it in a &lt;a href="http://abinstein.blogspot.com/2007/09/amds-latest-x86-extension-sse5-part-2.html"&gt;later part&lt;/a&gt; of this series).&lt;br /&gt;&lt;br /&gt;What makes these two types of instructions, MAC and CMOV/PERM, which happily require 3 distinct operands, so special? As previously said, the &lt;span style="font-style: italic;"&gt;four&lt;/span&gt; conditional move &amp;amp; permutation instructions allow predicated transfer of data from &lt;span style="font-style: italic;"&gt;any part&lt;/span&gt; of the source registers/memory to &lt;span style="font-style: italic;"&gt;any part&lt;/span&gt; of the destination register, &lt;span style="font-style: italic;"&gt;followed by&lt;/span&gt; one of seven optional operations. Just how many instructions are there in SSE/SSE2/SSE3 to perform similar and simpler tasks &lt;span style="font-style: italic;"&gt;partially&lt;/span&gt;? 
Here is a quick list:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;MOVAPD&lt;/li&gt;&lt;li&gt;MOVAPS&lt;/li&gt;&lt;li&gt;MOVDDUP&lt;/li&gt;&lt;li&gt;MOVSHDUP&lt;/li&gt;&lt;li&gt;MOVSLDUP&lt;br /&gt;&lt;/li&gt;&lt;li&gt;MOVDQA&lt;/li&gt;&lt;li&gt;MOVDQU&lt;/li&gt;&lt;li&gt;MOVHLPS&lt;/li&gt;&lt;li&gt;MOVLHPS&lt;/li&gt;&lt;li&gt;MOVQ&lt;/li&gt;&lt;li&gt;MOVSD&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Of course this does &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; mean the four instructions in SSE5 will replace all the MOVs in SSE/SSE2 above, which are still useful for their simplicity (only 2 operands required) and possibly lower latency (no post-operation needed). However, it &lt;span style="font-style: italic;"&gt;does&lt;/span&gt; illustrate how powerful and useful the PERM instructions in SSE5 can be - just imagine how hard it is to implement these operations in an SSE2-like style.&lt;br /&gt;&lt;br /&gt;The MAC instructions turn out to be one of the "&lt;span style="font-style: italic;"&gt;most-wanted&lt;/span&gt;" instruction accelerations. As shown in "&lt;a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=565590"&gt;Design issues in division and other floating-point operations&lt;/a&gt;" by Oberman et al. in IEEE ToC, 1997, nearly 50% of floating-point multiplication results are consumed by a dependent addition or subtraction. 
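&lt;br /&gt;&lt;br /&gt;The multiply-then-add pattern in question can be sketched in a few lines of Python. This is only a counting model, assuming that each multiply, each add, and each multiply-accumulate costs exactly one instruction; the function names are hypothetical, and Python itself does not fuse anything:&lt;br /&gt;
&lt;pre&gt;
```python
def dot_separate(xs, ys):
    """Dot product with a separate multiply and a dependent add per element."""
    acc = 0.0
    instructions = 0
    for x, y in zip(xs, ys):
        product = x * y        # multiply
        acc = acc + product    # add that depends on the multiply result
        instructions += 2
    return acc, instructions


def dot_mac(xs, ys):
    """Same dot product, modelling each step as one multiply-accumulate."""
    acc = 0.0
    instructions = 0
    for x, y in zip(xs, ys):
        acc = x * y + acc      # counted as a single MAC instruction
        instructions += 1
    return acc, instructions


xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]
print(dot_separate(xs, ys))    # (32.0, 6)
print(dot_mac(xs, ys))         # (32.0, 3)
```
&lt;/pre&gt;
For this three-element input the first version counts 6 instructions and the second counts 3. The numerical result is identical here, although a real multiply-accumulate may round differently from a separate multiply and add.&lt;br /&gt;&lt;br /&gt;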
See the picture below, directly grabbed from the paper:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RvTS4T-_NyI/AAAAAAAAAEE/vwLW2cYoAfw/s1600-h/fp_muladd.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RvTS4T-_NyI/AAAAAAAAAEE/vwLW2cYoAfw/s400/fp_muladd.jpg" alt="" id="BLOGGER_PHOTO_ID_5112943341851326242" border="0" /&gt;&lt;/a&gt;In other words, by combining multiplication with a dependent addition/subtraction, we can eliminate 50% of the instructions following all multiplications. Until SSE5, it was impossible to truly fuse a multiplication with a dependent add or subtract and take advantage of such acceleration.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Concluding Part 1.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As shown above, the SSE5 from AMD is indeed something very different from the previous x86 SIMD extensions from Intel. &lt;a href="http://scientiasblog.blogspot.com/2007/09/top-developments-of-2007.html"&gt;Some people&lt;/a&gt; even went so far as to call it "AMD64-2", and the "top development" of the year; such enthusiasm, of course, is undue.&lt;br /&gt;&lt;br /&gt;For now, AMD is still gathering community feedback and asking for community support on the SSE5 initiative. Apparently, SSE5 is still in development; it's a great proposal, but clearly not developed (yet). Also, the SSE5 instructions by themselves do not match the breadth and depth of AMD64, which not only expands x86 addressing space but also semantically changes the working of the ISA. SSE5, on the other hand, neither touches nor alters any bit of the x86-64 outside of its extending scope. 
However, as we will discuss in a later part, the direction pointed to by SSE5 &lt;span style="font-style: italic;"&gt;can&lt;/span&gt; be used to further extend x86-64 in a more general and generic way rivaling the original AMD64.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-4969549532666424600?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/4969549532666424600/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=4969549532666424600' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/4969549532666424600'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/4969549532666424600'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/09/amds-latest-greatest-x86-extension-sse5.html' title='AMD&apos;s latest x86 extension: SSE5 - Part 1'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_oGCeAi-2i3Q/RvR-qT-_NxI/AAAAAAAAAD8/qzGliWdiOI0/s72-c/sse5_format.gif' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-45811453800814031</id><published>2007-09-10T10:27:00.000-07:00</published><updated>2007-09-25T02:19:07.290-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='General'/><category scheme='http://www.blogger.com/atom/ns#' term='K10'/><title type='text'>Scalability 
counts!</title><content type='html'>As I have said &lt;a href="http://abinstein.blogspot.com/2007/05/scalability-or-lack-of-it-of-intels.html"&gt;in this article&lt;/a&gt;, Intel's new Core 2 line of processors has good cores but poor system architecture. The poor scalability of FSB means that Core 2, without extensive, expensive, and power-hungry chipset support, is only suitable for low-end personal enjoyment.&lt;br /&gt;&lt;br /&gt;Take a look at this &lt;a href="http://anandtech.com/IT/showdoc.aspx?i=3091&amp;p=7"&gt;AnandTech benchmark&lt;/a&gt;. I'd note foremost that AnandTech is hardly an AMD-favoring on-line "journal"; thus we can expect its report to be at worst Intel-biased and at best neutral (which I'm hoping for here). At any rate, the benchmark picture is reproduced below:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/RuV_TrFEePI/AAAAAAAAADs/vTKbVX_YZco/s1600-h/15537.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/RuV_TrFEePI/AAAAAAAAADs/vTKbVX_YZco/s400/15537.png" alt="" id="BLOGGER_PHOTO_ID_5108629328279927026" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The comparison between Barcelona (Opteron 2350 2.0GHz) and Clovertown (Xeon E5345 2.33GHz) couldn't be clearer: FSB is an outdated system architecture for today's high-end computing, and &lt;span style="font-weight: bold;"&gt;scalability does matter&lt;/span&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;/span&gt;for server &amp; workstation grade performance. 
While AMD's quad-core Opteron at 2.0GHz is slower than Intel's quad-core Xeon at 2.33GHz on the single-socket test, the situation is reversed on a dual-socket setup, the configuration used by most workstations and entry-level servers.&lt;br /&gt;&lt;br /&gt;The same phenomenon is also &lt;a href="http://anandtech.com/IT/showdoc.aspx?i=3091&amp;amp;p=10"&gt;observed on this page&lt;/a&gt;, where AMD's quad-core Opteron, even though the Xeon is clocked about 17% higher, performs increasingly better than Intel's quad-core Xeon as the number of cores grows (picture reproduced below). Again, when it comes to server &amp; workstation performance,&lt;span style="font-weight: bold;"&gt; scalability counts&lt;/span&gt;.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_oGCeAi-2i3Q/RuWC4LFEeQI/AAAAAAAAAD0/7B6g8tYUVac/s1600-h/BarcaWinrar.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_oGCeAi-2i3Q/RuWC4LFEeQI/AAAAAAAAAD0/7B6g8tYUVac/s400/BarcaWinrar.gif" alt="" id="BLOGGER_PHOTO_ID_5108633253880035586" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-45811453800814031?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/45811453800814031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=45811453800814031' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/45811453800814031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/45811453800814031'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/09/scalability-counts.html' 
title='Scalability counts!'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_oGCeAi-2i3Q/RuV_TrFEePI/AAAAAAAAADs/vTKbVX_YZco/s72-c/15537.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-2759752207109239346</id><published>2007-08-03T10:53:00.000-07:00</published><updated>2007-09-25T03:02:58.764-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><category scheme='http://www.blogger.com/atom/ns#' term='General'/><category scheme='http://www.blogger.com/atom/ns#' term='memory'/><category scheme='http://www.blogger.com/atom/ns#' term='K10'/><title type='text'>Not Everything about Memory is Bandwidth</title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;The False Common Belief&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is this common belief among PC enthusiasts that bandwidth, or million transfers per second or megabytes per second, is the most important thing that a good memory system should aim for. Such a belief is so deep-rooted that even the professionals (&lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, AMD &amp; Intel) began to calibrate &amp;amp; market their products based on the memory bandwidth values.&lt;br /&gt;&lt;br /&gt;For example, take a look at this &lt;a href="http://www.elitebastards.com/cms/index.php?option=com_content&amp;task=view&amp;amp;amp;amp;id=437&amp;Itemid=29&amp;amp;limit=1&amp;limitstart=2"&gt;Barcelona architecture July update&lt;/a&gt; article. 
The first graph on that page, which seems to be an AMD presentation and is conveniently duplicated below, seems to suggest that all the memory enhancements in AMD's Barcelona (K10) over its predecessor (K8) are about "Increasing Memory Bandwidth".&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RrNuikx3CzI/AAAAAAAAACs/wD5IGi5KY3k/s1600-h/mem_controller.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RrNuikx3CzI/AAAAAAAAACs/wD5IGi5KY3k/s400/mem_controller.jpg" alt="" id="BLOGGER_PHOTO_ID_5094537143753575218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The question is, &lt;span style="font-weight: bold;font-size:100%;" &gt;do they really increase memory bandwidth&lt;/span&gt;? Let's take a look at the bullet points in the graph, from bottom to top.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;The prefetchers&lt;/span&gt;. Prefetching does not increase memory bandwidth. On the contrary, it reduces available memory bandwidth &lt;a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/xeon/optimization/298229.htm?page=4"&gt;by increasing memory bus utilization&lt;/a&gt; (search "increase in bus utilization" on the page).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Optimized Paging&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;Write Bursting&lt;/span&gt;. They both increase memory bus efficiency, which does not increase the bandwidth &lt;span style="font-style: italic;"&gt;per se&lt;/span&gt;, although it helps make better use of the bandwidth that is there.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Larger Memory Buffer&lt;/span&gt;. A larger buffer can improve store-to-load forwarding and increase the size of write bursting. 
The buffer itself, however, does not increase memory bandwidth at all.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Independent Memory Channels&lt;/span&gt;. This certainly has no effect on memory bandwidth. Each of the two independent channels is half the width, resulting in the same overall bandwidth.&lt;/li&gt;&lt;/ul&gt;Thus, &lt;span style="font-style: italic;"&gt;out of six bullet points, only two are marginally related to memory bandwidth&lt;/span&gt;. The bottom line: Barcelona still uses the same memory technology (DDR2) and the same memory bus width (128-bit), so there is no additional peak bandwidth to be gained!&lt;br /&gt;&lt;br /&gt;However, one would be even more wrong to think Barcelona's memory subsystem is not improved over its predecessor, because &lt;span style="font-style: italic;"&gt;all the points above are nevertheless improvements; they just reduce memory latency rather than increase memory bandwidth&lt;/span&gt;. Intelligent memory prefetching can &lt;span style="font-style: italic;"&gt;hide memory latency&lt;/span&gt;, as shown in the &lt;a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/xeon/optimization/298229.htm?page=4"&gt;Intel article page linked above&lt;/a&gt;. Fewer read/write transitions, thanks to write bursting and the larger memory buffer, can &lt;span style="font-style: italic;"&gt;reduce memory latency&lt;/span&gt; considerably. The independent memory channels also &lt;span style="font-style: italic;"&gt;reduce latency&lt;/span&gt; when multiple memory transactions are in flight simultaneously - especially important for multi-core processing. 
In short, the memory subsystem of Barcelona is improved for lower latency, not higher bandwidth.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Why Does Barcelona Improve Latency More Than Bandwidth?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are a few reasons why a general-purpose computing platform based on multiple levels of cache benefits more from lower memory latency. This is in contrast to specialized signal processing or graphics processors, where instruction branches (changes in instruction flow) and data dependencies (store-to-load forwarding) are few and rare. This fact is aptly described in the following "Pitfall" on page 501 of &lt;span style="font-style: italic;"&gt;Computer Architecture: A Quantitative Approach, 3rd ed.&lt;/span&gt;, Section 5.16, by Hennessy and Patterson:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Pitfall&lt;/span&gt; &lt;span style="font-style: italic;"&gt;Emphasizing memory bandwidth in DRAMs versus memory latency&lt;/span&gt;. PCs do most memory access through a two-level cache hierarchy, so it is unclear how much benefit is gained from high bandwidth without also improving memory latency.&lt;/li&gt;&lt;/ul&gt;In other words, for general-purpose processors such as Athlon, Core 2 Duo, Opteron, and Xeon, what helps performance is &lt;span style="font-style: italic;"&gt;not just&lt;/span&gt; the bandwidth, &lt;span style="font-style: italic;"&gt;but more importantly&lt;/span&gt; the effective latency of their memory subsystem. 
This pitfall is promptly followed by its dual on the next page of the book, which explains why most signal and graphics processors, which do require high memory bandwidth, do not need multiple levels of cache like general-purpose CPUs:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Pitfall&lt;/span&gt; &lt;span style="font-style: italic;"&gt;Delivering high memory bandwidth in a cache-based system.&lt;/span&gt; Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that needs it.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Memory Bandwidth Estimate for &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;High-End Quad-Core CPUs&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Still one may ask, &lt;span&gt;is the memory bandwidth offered by, say, a DDR2-800 channel really enough for modern processors&lt;/span&gt;? It turns out that, at least for Intel's Penryn and AMD's Barcelona to come, it should be. To estimate the maximum required memory bandwidth, we &lt;span style="font-style: italic;"&gt;assume &lt;/span&gt;&lt;span style="font-style: italic;"&gt;a 3.33GHz quad-core processor with 4MB cache sustaining 3 IPC (instructions per cycle)&lt;/span&gt;. Such a processor should be close to the top performing models from both AMD and Intel by the middle of next year. (See also the &lt;a href="http://abinstein.blogspot.com/2007/06/decoding-x86-from-p6-to-core-2-part-3.html"&gt;micro/macro-fusion article&lt;/a&gt; for Core 2's actual/sustainable IPC.)&lt;br /&gt;&lt;br /&gt;First let's look at the &lt;span style="font-style: italic;"&gt;data bandwidth&lt;/span&gt;. A 3.33GHz, 3 IPC processor would execute up to 10G I/s (giga-instructions per second). Suppose 1 out of 3 instructions has a load or store, which is consistent with the fact that both Core 2 and Barcelona have 6-issue (micro-op) engines and perform up to 2 loads or stores per cycle. 
Thus,&lt;br /&gt;&lt;br /&gt;10G I/s * 0.333 LS/I = 3.33G LS/s (giga-load/store per second, per core)&lt;br /&gt;&lt;br /&gt;Multiplying this number by 4 cores, the total is 13.33G LS/s. According to Figure 5.10 of &lt;span style="font-style: italic;"&gt;Computer Architecture AQA&lt;/span&gt; on page 416, a 4MB cache has a miss rate of about 1%. Let's make it 2% to be conservative. Thus the number of memory accesses going to the memory bus is&lt;br /&gt;&lt;br /&gt;13.33G LS/s * 2% MA/LS = 0.267G MA/s (giga-memory accesses per second)&lt;br /&gt;&lt;br /&gt;Each memory access is at most 16 bytes, but most likely 8 bytes or less on average. This makes the &lt;span style="font-style: italic;"&gt;worst-case memory bandwidth requirement 0.267G*16 = &lt;span style="font-weight: bold;"&gt;4.27GB/s&lt;/span&gt;&lt;/span&gt;, and &lt;span style="font-style: italic;"&gt;the average-case 0.267G*8 = &lt;span style="font-weight: bold;"&gt;2.13GB/s&lt;/span&gt;&lt;/span&gt;. Note that a single channel of DDR2-800 memory can support up to 6.4GB/s, much more than the numbers above.&lt;br /&gt;&lt;br /&gt;Now let's calculate the &lt;span style="font-style: italic;"&gt;instruction bandwidth&lt;/span&gt;. Again, for 4 cores at 3.33GHz and 3 IPC, there are 40G I/s. However, instructions usually cache exceptionally well. According to Figure 5.8 on page 406 of the same textbook, a 64KB instruction cache has less than 1 miss per 1000 instructions. Assuming (again quite conservatively) that each instruction takes 5 bytes, the total memory bandwidth for fetching instructions is&lt;br /&gt;&lt;br /&gt;40G I/s * 0.1% * 5 B/I = 0.2GB/s&lt;br /&gt;&lt;br /&gt;Thus even under a conservative estimate, the instruction-fetch bandwidth is negligible compared to the data load/store bandwidth. 
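&lt;br /&gt;&lt;br /&gt;For readers who want to play with the assumptions, the estimate above can be sketched in a few lines of Python. This is only a back-of-envelope helper: the constant and function names are made up for illustration, and the inputs are exactly the figures assumed in the text (10G I/s per core, one load or store per three instructions, a 2% data miss rate, 1 instruction miss per 1000 instructions, 5 bytes per instruction).&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
# Back-of-envelope memory bandwidth estimate using only the figures
# assumed in the text. All names here are illustrative.

INSTR_PER_SEC_PER_CORE = 10e9  # 3.33GHz at 3 IPC, rounded as in the text
CORES = 4
LS_PER_INSTR = 1 / 3           # one load or store per three instructions
DATA_MISS_RATE = 0.02          # conservative miss rate for a 4MB cache
INSTR_MISS_RATE = 0.001        # 1 miss per 1000 instructions
BYTES_PER_INSTR = 5            # conservative average instruction size
DDR2_800_CHANNEL = 6.4e9       # bytes per second for one channel


def data_bandwidth(bytes_per_access):
    """Bytes per second that data cache misses put on the memory bus."""
    instr_per_sec = INSTR_PER_SEC_PER_CORE * CORES
    misses_per_sec = instr_per_sec * LS_PER_INSTR * DATA_MISS_RATE
    return misses_per_sec * bytes_per_access


def instruction_bandwidth():
    """Bytes per second fetched from memory on instruction misses."""
    instr_per_sec = INSTR_PER_SEC_PER_CORE * CORES
    return instr_per_sec * INSTR_MISS_RATE * BYTES_PER_INSTR


worst_case = data_bandwidth(16)    # every access is 16 bytes
average_case = data_bandwidth(8)   # a more typical 8 bytes per access
instr_fetch = instruction_bandwidth()
channel_share = (worst_case + instr_fetch) / DDR2_800_CHANNEL

print(f"worst-case data: {worst_case / 1e9:.2f} GB/s")
print(f"average data:    {average_case / 1e9:.2f} GB/s")
print(f"instructions:    {instr_fetch / 1e9:.2f} GB/s")
print(f"share of one DDR2-800 channel: {channel_share:.0%}")
```
&lt;/pre&gt;Changing the miss rate or the bytes per access shows how sensitive the result is to each assumption.&lt;br /&gt;&lt;br /&gt;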
The conclusion is clear: &lt;span style="font-style: italic;"&gt;the memory bandwidth of just one single DDR2-800 channel (6.4GB/s) is more than enough for the highest-end quad-core processor for the next 10 months.&lt;/span&gt; The problem, however, is not bandwidth, but latency.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Update 8/7/07&lt;/span&gt; - Please take a look at the &lt;a href="http://se.sun.com/virtualisering/pdf/AMD_Quad_Core-Leif_Nordlund.pdf"&gt;AMD presentation on Barcelona&lt;/a&gt;, page 8, where quad-core Barcelona is shown to utilize just 25% of the total bandwidth of 10.6GB/s. Suppose this is obtained from a 2.0GHz K10 (one that was demo'd and apparently benchmarked by AMD); then, scaling up linearly, a fictional 3.3GHz K10 would reach about 41% utilization, or about 4.3GB/s. Notice how close this number is to my estimate above.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;What About Core 2's Insatiable Appetite for FSB Speed?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A natural question is, if 6.4GB/s is more than enough for the highest-performing quad-core x86-64 processors in the next year or so, why is Intel raising the FSB (front-side-bus) speed above 1066MT/s (million-transfers per second) so aggressively to 1333MT/s and even 1600MT/s? Isn't 1066MT/s already offering more than 6.4GB/s bandwidth?&lt;br /&gt;&lt;br /&gt;The reasons are two-fold:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;For Core 2 Quad, the FSB is not just used for memory accesses, but also for I/O and inter-core communications. Since data transfer on the FSB is most efficient with long trains of back-to-back bytes, switching among these transfer types can greatly reduce effective bandwidth.&lt;/li&gt;&lt;li&gt;Raising the FSB speed not only increases peak bandwidth, but also (more importantly) reduces transfer delay. 
A 400MHz (1600MT/s) bus cuts the data transfer time by one third relative to a 266MHz (1066MT/s) bus.&lt;/li&gt;&lt;/ol&gt;In other words, due to the obsolete design of Intel's front-side-bus, the sheer value of peak memory bandwidth becomes insufficient to predict the memory subsystem's performance: a potentially 10.6GB/s bus (1333MT/s * 8B/T) isn't even able to satisfy the need of a quad-core processor (3.33GHz, 3 IPC) requiring no more than 5 GB/s of continuous data access to/from the memory.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Importance of Latency Reduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To show how latency reduction is the more important reason to raise FSB speed, we will compare a dual-core system with two quad-core systems, one with a &lt;span style="font-style: italic;"&gt;2x wider&lt;/span&gt; memory bus, the other with a &lt;span style="font-style: italic;"&gt;1.5x faster&lt;/span&gt; memory bus. We will show that the faster FSB is more effective in bringing down the average memory access time, which is the major factor affecting a computer's IPC. More specifically, using the dual-core system as reference, suppose the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The quad-core #1 system has the same bus speed but 2x the bus width (&lt;span style="font-style: italic;"&gt;e.g.&lt;/span&gt;, 128-bit vs. 64-bit). In other words, it has the same data transfer delay and 100% more peak memory bandwidth than the dual-core system.&lt;/li&gt;&lt;li&gt;The quad-core #2 system has the same bus width but 1.5x the bus speed (&lt;span style="font-style: italic;"&gt;e.g.&lt;/span&gt;, 400MHz 1600MT/s vs. 266MHz 1066MT/s). In other words, it has 33% less data transfer delay and, consequently, 50% more peak memory bandwidth than the dual-core.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The memory bus is time-slotted and serves the cores in round-robin. For the dual-core and quad-core #1 systems, each memory access slot is 60ns. 
For the quad-core #2 system, each slot is 40ns.&lt;/li&gt;&lt;li&gt;Memory bandwidth utilization is 50% on the dual-core (2 out of 4 slots are occupied) and the quad-core #1 (4 out of 8 slots are occupied). It is 66.7% on quad-core #2 (4 out of 6 slots are occupied), calculated by 50% * (2x cores) / (1.5x bandwidth).&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Note that the assumptions above are simplistic and optimistic. They do not take into account the reduced effective bandwidth &amp; efficiency due to I/O and inter-core communications. When these two effects are taken into account, the quad-core systems will perform much worse than estimated below.&lt;br /&gt;&lt;br /&gt;Let's first calculate the average memory access latency of the dual-core system. When either core makes a memory request, there is a 3/4=75% chance that the memory bus is free, and a 25% chance that it has to wait an additional 60ns for access. The effective latency is&lt;br /&gt;&lt;br /&gt;60ns * 75% + (60ns+60ns) * 25% = 75ns&lt;br /&gt;&lt;br /&gt;Thus on average, each memory access takes just 75ns to complete.&lt;br /&gt;&lt;br /&gt;Now let's calculate the effective latency for the quad-core #1 system. When an arbitrary core makes a memory request, there is only a 5/8=62.5% chance that the memory bus is free, and a 37.5% chance that it has to wait. The waiting time, however, is more complicated in this case, because there are C(8,3) = 56 ways in which the slots can be occupied. 
Skipping some mathematical derivations, the result is&lt;br /&gt;&lt;br /&gt;60ns * (4+3+2+1*5)/8 = 105ns, 6 out of 56 cases&lt;br /&gt;60ns * (3+2+2+1*5)/8 = 90ns, 30 out of 56 cases&lt;br /&gt;60ns * (2+2+2+1*5)/8 = 82.5ns, 20 out of 56 cases&lt;br /&gt;&lt;br /&gt;=&gt; (105ns * 6/56) + (90ns * 30/56)  + (82.5ns * 20/56) = 88.9ns&lt;br /&gt;&lt;br /&gt;Thus even when we &lt;span style="font-style: italic;"&gt;double the memory bandwidth&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;keep the same bus utilization&lt;/span&gt;, a quad-core system still incurs 18.5% higher access latency than a dual-core system. Note that this is the case even when memory utilization is as low as 50%. For higher utilization, the latency increase will only be worse. The conclusion is clear: increasing memory bandwidth is not enough to scale up memory performance for multi-core general-purpose processing.&lt;br /&gt;&lt;br /&gt;Now let's look at the quad-core #2 system, where data transfer delay is reduced by 33%, but the memory bus width is the same and bus utilization is increased to 66.7%. When an arbitrary core makes a memory request, there is just a 3/6=50% chance that the memory bus is free, and a 50% chance that it has to wait. The waiting time again is complicated, as there are C(6,3) = 20 ways in which the slots can be occupied. Skipping again some mathematical derivations, we get&lt;br /&gt;&lt;br /&gt;40ns * (4+3+2+1*3)/6 = 80ns, 4 out of 20 cases&lt;br /&gt;40ns * (3+2+2+1*3)/6 = 66.7ns, 12 out of 20 cases&lt;br /&gt;40ns * (2+2+2+1*3)/6 = 60ns, 4 out of 20 cases&lt;br /&gt;&lt;br /&gt;=&gt; (80ns * 4/20) + (66.7ns * 12/20) + (60ns * 4/20) = 68ns&lt;br /&gt;&lt;br /&gt;The average memory access latency here is &lt;span style="font-style: italic;"&gt;almost 10% lower than the dual-core case and 24% lower than the quad-core #1&lt;/span&gt;. The effect of higher memory bus utilization is completely offset by a lower data transfer delay. 
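&lt;br /&gt;&lt;br /&gt;The weighted averages above are easy to re-check mechanically. The short Python sketch below does not derive the occupancy cases (those derivations are skipped in the text as well); it simply takes the per-case numerators and case counts quoted above and recombines them. The function names are hypothetical, chosen only for this illustration.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
# Recombine the per-case latencies quoted in the text into expected
# access latencies. The case numerators and counts are copied from the
# text; nothing here derives them. Names are illustrative only.
from fractions import Fraction


def case_latency(slot_ns, numerators, slots):
    """Latency for one occupancy pattern: slot time multiplied by the
    sum of the quoted numerators, divided by the number of slots."""
    return slot_ns * Fraction(sum(numerators), slots)


def expected_latency(cases):
    """Probability-weighted mean over a list of (latency, count) pairs."""
    total = sum(count for _, count in cases)
    return sum(lat * Fraction(count, total) for lat, count in cases)


# Dual-core: 75% of requests take one slot, 25% take two.
dual = case_latency(60, [2, 1, 1, 1], 4)

# Quad-core #1: 60ns slots, 8 slots, 56 occupancy cases.
quad1 = expected_latency([
    (case_latency(60, [4, 3, 2] + [1] * 5, 8), 6),
    (case_latency(60, [3, 2, 2] + [1] * 5, 8), 30),
    (case_latency(60, [2, 2, 2] + [1] * 5, 8), 20),
])

# Quad-core #2: 40ns slots, 6 slots, 20 occupancy cases.
quad2 = expected_latency([
    (case_latency(40, [4, 3, 2] + [1] * 3, 6), 4),
    (case_latency(40, [3, 2, 2] + [1] * 3, 6), 12),
    (case_latency(40, [2, 2, 2] + [1] * 3, 6), 4),
])

for name, value in [("dual", dual), ("quad #1", quad1), ("quad #2", quad2)]:
    print(f"{name}: {float(value):.1f} ns")
print(f"quad #1 vs dual: {float(quad1 / dual - 1):+.1%}")
print(f"quad #2 vs dual: {float(quad2 / dual - 1):+.1%}")
```
&lt;/pre&gt;Exact fractions are used so that the 66.7ns case does not pick up rounding error before the final average.&lt;br /&gt;&lt;br /&gt;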
Again, for general-purpose multi-core processing, reducing memory access delay is much more important than increasing memory peak bandwidth.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Conclusion and Remark&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Let's go back to the original (supposedly AMD's) presentation. Why does it say "increase memory bandwidth" all over the page? Probably because most people simply don't know better, and to change that, an article like this one is probably necessary and not even sufficient. We seem to see AMD engineers trying hard to twist the delicate bandwidth-latency relationship, pushing and forcing it down to a form easily understood (yet probably not believed) by ordinary minds.&lt;br /&gt;&lt;br /&gt;However, bandwidth is definitely not useless. It really depends on the workload. For streaming processing such as graphics and signal processing, bandwidth and throughput are everything, and latency becomes mostly irrelevant. You won't care whether a video frame is played to you 100 milliseconds after it was read off a Blu-ray disc, as long as the next frame comes within 15 milliseconds (about 67fps) or so. Yet 100 milliseconds is 300 million cycles of a 3GHz processor! 
For streaming applications, we certainly want a continuous flow of high-bandwidth data, yet we have millions of cycles of latency to spare.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-2759752207109239346?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/2759752207109239346/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=2759752207109239346' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/2759752207109239346'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/2759752207109239346'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/08/not-everything-about-memory-is.html' title='Not Everything about Memory is Bandwidth'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_oGCeAi-2i3Q/RrNuikx3CzI/AAAAAAAAACs/wD5IGi5KY3k/s72-c/mem_controller.jpg' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-7959628657552245616</id><published>2007-06-10T19:40:00.000-07:00</published><updated>2007-06-13T23:49:01.529-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Quad Core'/><title type='text'>Back-of-envelope calculation of native quad-core production</title><content type='html'>Semiconductor chip yield is one of the most guarded secrets in the industry. 
There is no way an outsider could "guess-timate" an accurate value, except by chance (and that chance is extremely low). Yet sometimes observations can be made from some big, obvious facts. In this article we will make some (strictly) back-of-envelope calculations of AMD's and Intel's yields on dual-core processors, and draw implications for &lt;span style="font-style: italic;"&gt;native quad-core "manufacturability"&lt;/span&gt; at both companies.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The observation and assumptions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the purpose of discussion, we will make the following assumptions -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;In mid-late 2006, Intel holds ~80% market share with &lt;a href="http://www.intel.com/technology/silicon/65nm-cross-over.htm"&gt;three 300mm 65nm fabs&lt;/a&gt;, while AMD holds ~20% with only 90nm FAB30 (200mm) and FAB36 (300mm).&lt;/li&gt;&lt;li&gt;AMD's FAB30 has the same wafer throughput as each of Intel's 65nm fabs. FAB36, while under 90-to-65nm transition and having low utilization, further increases production volume by 50%.&lt;/li&gt;&lt;li&gt;Intel's main production in 3Q 2006 is Core 2 Duo and dual-core Netburst, with Core 2 Quad volume small enough to be negligible to our discussion (e.g., 10% or less).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The most significant factor other than those described above is the yield of the fabs, and AMD's FAB36 has about the same dual-core K8 yield as its 90nm counterparts.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The assumptions above may be utterly wrong, or they may be good enough for the "back-of-envelope" purpose. My point here is not to defend their validity; rather it is to make clear that the arguments below will hold true only if these assumptions do. Note that Intel's last 65nm fab (Ireland) of the three started production output by July 2006, while AMD's FAB36 started 65nm output in the later part of 4Q that year. 
Thus at least for 3Q 2006 the above market-share and relative technology differences are known to be true.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The calculations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Potentially, three 300mm 65nm fabs would have 3*2*2 = 12x the capacity of one 200mm 90nm fab, if the yields of all fabs are the same. Thus, counting in AMD's FAB36, Intel would've had 12x/1.5 = 8x the capacity of AMD with the same (dual-core processor) yield. However, Intel's market share during the period is only 4x that of AMD's. There is thus a 2x discrepancy between Intel's potential capacity (8x of AMD's) and its true capacity (4x of AMD's), which is presumably caused by a lower yield at its fabs. In other words, to reach the observed market share, AMD's FAB30 and FAB36 would have to have yields twice as good as Intel's 65nm 300mm fabs.&lt;br /&gt;&lt;br /&gt;Obviously, this conclusion is implausible. A factor of two in terms of yield is too large, and Intel simply can't be that bad at manufacturing. A few factors may have affected the estimation accuracy here:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Intel's 65nm fabs may have lower wafer throughput or utilization than AMD's FAB30 and FAB36 combined, particularly the Ireland fab which was ramping for just 4 months, and D1D which is also used for 45nm research &amp; development.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Intel may be making much more Core 2 Quad, which effectively cuts production volume in half (two Core 2 Duo dies make one Core 2 Quad).&lt;/li&gt;&lt;/ul&gt;Taking the two factors above into consideration, we'll adjust the estimated yield difference from 2x down (quite arbitrarily) to 1.5x. 
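&lt;br /&gt;&lt;br /&gt;The capacity arithmetic above fits in a few lines of Python, which makes it easy to see how each assumption moves the result. This is only a sketch under the assumptions listed earlier; the variable names are invented for illustration, and the 2x factors are the same rounded figures used in the text.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
# First-order capacity comparison, using the rounded factors from the
# text. All names are illustrative; the inputs are assumptions.

INTEL_FABS = 3            # three 300mm 65nm fabs
WAFER_SIZE_GAIN = 2       # 300mm versus 200mm, rounded as in the text
PROCESS_GAIN = 2          # 65nm versus 90nm, rounded as in the text
AMD_CAPACITY = 1.5        # FAB30 as the unit, plus 50% from FAB36
INTEL_SHARE_PCT = 80      # approximate market share, percent
AMD_SHARE_PCT = 20

# Intel capacity in units of one 200mm 90nm fab, at equal yield.
intel_capacity = INTEL_FABS * WAFER_SIZE_GAIN * PROCESS_GAIN

# How much more Intel could ship than AMD if yields were equal.
potential_ratio = intel_capacity / AMD_CAPACITY

# How much more Intel actually ships, judged by market share.
actual_ratio = INTEL_SHARE_PCT / AMD_SHARE_PCT

# The gap between the two is attributed entirely to yield.
implied_yield_gap = potential_ratio / actual_ratio

print(f"potential capacity ratio: {potential_ratio:.1f}x")
print(f"actual output ratio:      {actual_ratio:.1f}x")
print(f"implied yield gap:        {implied_yield_gap:.2f}x")
```
&lt;/pre&gt;Lowering the number of fully ramped Intel fabs, or the process gain, shrinks the implied gap quickly, which is why the 2x figure should not be taken at face value.&lt;br /&gt;&lt;br /&gt;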
Note that a high percentage of Netburst-related products from Intel actually makes the discrepancy larger, since Netburst chips are smaller in size per die, far more mature, and should have better yield than the cutting-edge Core 2 Duo.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Implication&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So how does this 1.5x yield difference affect "native" quad-core manufacturing? Suppose AMD's dual-core K8 yield is 81%; Intel's Core 2 Duo yield would be just 54% (81%/1.5). By a first-order estimate, AMD's native quad-core would have a yield of roughly 65% (0.81*0.81), whereas Intel's would have 29% (0.54*0.54). In other words, out of 100 quad-core dies, AMD is able to make 65 functional quad-core processors, while Intel makes only 29, less than 50% of its smaller competitor. It is not difficult to see why AMD is going native but Intel won't until late 2008.&lt;br /&gt;&lt;br /&gt;Let's for the purpose of discussion turn the parameters further in Intel's favor, and assume it has just 1.25x lower yield (instead of 1.5x) than AMD's. If we again suppose dual-core K8 has a yield of 81%, then Core 2 Duo would have almost 65%, &lt;span style="font-style: italic;"&gt;making Intel's MCM quad-core approach as productive as AMD's native quad-core approach&lt;/span&gt;. What we see here is that &lt;span style="font-weight: bold;"&gt;a yield just a quarter better than the competitor's could've made a huge difference in terms of native quad-core manufacturability&lt;/span&gt;. In fact, not only is Intel late to native quad-cores, it was also about 6 months late to native dual-cores even with a better technology (65nm vs. 90nm).&lt;br /&gt;&lt;br /&gt;The conclusion is clear: Intel is telling the truth that it can't make native quad-core cost-effectively. 
For AMD, it might be very hard, but probably still doable, based on a simple capacity observation and this back-of-envelope calculation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The arguments&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some people have questioned the precision of the above estimates. Their arguments can basically be divided into the following points:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Intel's D1D is also making the 45nm transition in late 2006, and thus should have less than maximum output.&lt;/li&gt;&lt;li&gt;Intel's &lt;a href="http://www.intel.com/pressroom/archive/releases/20060622corp.htm"&gt;Ireland fab, ramping only 4 months&lt;/a&gt; from Jun'06, won't achieve max capacity in Oct'06.&lt;/li&gt;&lt;li&gt;Intel is shipping more dual-core processors in 4Q06 than AMD. Specifically, just &lt;a href="http://seekingalpha.com/article/24326"&gt;over 50% of Intel processors&lt;/a&gt; are dual-cores, while &lt;a href="http://seekingalpha.com/article/24929"&gt;only 30% of AMD's&lt;/a&gt; are.&lt;/li&gt;&lt;li&gt;AMD's FAB36, which makes 300mm wafers and &lt;a href="http://www.anandtech.com/printarticle.aspx?i=2734"&gt;started revenue shipping in Apr'06&lt;/a&gt;, should've been making as much silicon as FAB30.&lt;/li&gt;&lt;li&gt;By late 3Q06, AMD would also &lt;a href="http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543_7605%7E110532,00.html"&gt;have Chartered's output&lt;/a&gt; at hand.&lt;/li&gt;&lt;li&gt;Intel's 65nm doesn't actually result in 2x the capacity of 90nm, but more like 1.7x (1/0.6). As well, Intel's 300mm wafers would yield ~2.25x the usable silicon area of 200mm ones.&lt;/li&gt;&lt;/ol&gt;It's important to note that all these are considered higher-order factors. A slight difference in terms of max wafer throughput per fab (ranging anywhere from 20k to 60k) could've dwarfed any of the above. 
Still, for the sake of discussion, let's try some more precise estimates based on these points.&lt;br /&gt;&lt;br /&gt;The first point, it turns out, is wrong. As D1D is making 45nm outputs, its 65nm capacity is &lt;a href="http://www.tgdaily.com/content/view/31069/118/"&gt;moved to the neighboring D1C&lt;/a&gt;, which started &lt;a href="http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2592"&gt;outputting 65nm chips right after Ireland&lt;/a&gt; and which I deliberately ignored above. The second point is valid and would reduce the 65nm Ireland fab's capacity to some 30% of its maximum.&lt;br /&gt;&lt;br /&gt;The third point above is also true; however, it fails to recognize that most of Intel's single-core processors (Celerons and Pentium M's) are made at its 90nm fabs, whereas &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; of AMD's single-core &lt;span style="font-style: italic;"&gt;and&lt;/span&gt; dual-core processors are made out of FAB30 &amp; 36 (excluding Chartered), a factor that the 17% difference in dual-core ratio cannot even compensate for.&lt;br /&gt;&lt;br /&gt;The fifth and sixth points are minor. Chartered's flex capacity would account for up to 20% of AMD's silicon output, and even less in Oct'06, not 3 months after its first revenue shipping for AMD. 
Assuming Chartered is supplying 15% of AMD's silicon output, the output from AMD's own fabs is effectively 85% of the total, which changes the actual Intel-to-AMD output ratio from 4x to 4.7x.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RnCrjrViFnI/AAAAAAAAACk/mzk9bGD2b-E/s1600-h/AMDslides.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RnCrjrViFnI/AAAAAAAAACk/mzk9bGD2b-E/s400/AMDslides.gif" alt="" id="BLOGGER_PHOTO_ID_5075745409463359090" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The fourth point, which we'll discuss last here, &lt;span style="font-style: italic;"&gt;seems&lt;/span&gt; quite valid from page 5 of this &lt;a href="http://www.amd.com/us-en/assets/content_type/DownloadableAssets/DarylOstranderAMDAnalystDay.pdf"&gt;AMD Jun'06 analyst day&lt;/a&gt; presentation (see picture above). In late 3Q06, the 300mm "wafer outs" from FAB36 seem to be 0.4x of the max 200mm from FAB30, equivalent to 0.4*2.25 = 0.9x FAB30's silicon area. Surely this is a great increase in AMD's potential capacity. Unfortunately, it turns out that this argument is unfounded, misled by a graph that has no y-axis unit and is meant to be illustrative only.&lt;br /&gt;&lt;br /&gt;If we read the text on page 4 of that presentation (again see picture above), FAB36 is expected to output 25k wafers per month (wpm) by Q4 2007, which will be the total 300mm wafer output at that point (FAB38 won't have wafer outs until Q1 2008). We also know &lt;a href="http://www.anandtech.com/printarticle.aspx?i=2734"&gt;FAB30 is outputting about 30k wpm&lt;/a&gt; in Q3 2006. Now go to page 5 again and look! How can the green line's &lt;span style="font-weight: bold;"&gt;25k wpm&lt;/span&gt; (end of 4Q07) be some 60% higher than the red line's &lt;span style="font-weight: bold;"&gt;30k wpm&lt;/span&gt; in 2006? 
It is absolutely not possible unless the "wafer outs" y-axis actually means wafer &lt;span style="font-style: italic;"&gt;area&lt;/span&gt; outs, and the 25k 300mm wpm from FAB36 is effectively doubled to 50k, some 66% higher than the 30k 200mm wpm from FAB30.&lt;br /&gt;&lt;br /&gt;It turns out my original estimate of FAB36 reaching 50% capacity of FAB30 is actually a bit optimistic. The true number should be calculated as follows: (0.8/3.4 * 25000)*2.25/30000 = 0.44, where 0.8 comes from the green line at the end of 3Q06, 3.4 from the end of 4Q07 (both FAB36 only), 25000 is the expected green line wpm at the end of 4Q07, 2.25 translates 300mm wpm to effective 200mm wpm, and the final 30000 is FAB30 wpm (red line at the end of 3Q06).&lt;br /&gt;&lt;br /&gt;Overall, a definitely more precise and probably more accurate estimate is the following:&lt;br /&gt;&lt;br /&gt;Intel's potential capacity: (2+0.3)*(1/0.6)*2.25 = 8.6&lt;br /&gt;AMD's potential capacity: 1+0.44 = 1.44&lt;br /&gt;Potential capacity ratio: 8.6/1.44 = 6.0x&lt;br /&gt;&lt;br /&gt;Intel's actual output: less than 80% market share (excluding 90nm production)&lt;br /&gt;AMD's actual output: 20%*0.85 = 17% (Chartered effects)&lt;br /&gt;Actual output ratio: 80%/17% = 4.7x&lt;br /&gt;&lt;br /&gt;Discrepancy between potential and actual output: 6.0/4.7 = 1.28, or almost 28% difference in microprocessor yield, well between the 50% and 25% estimates I made above.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-7959628657552245616?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/7959628657552245616/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=7959628657552245616' title='29 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8459791/posts/default/7959628657552245616'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/7959628657552245616'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/06/back-of-envelope-yield-calculation-of.html' title='Back-of-envelope calculation of native quad-core production'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_oGCeAi-2i3Q/RnCrjrViFnI/AAAAAAAAACk/mzk9bGD2b-E/s72-c/AMDslides.gif' height='72' width='72'/><thr:total>29</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5633641575838094464</id><published>2007-06-01T22:50:00.000-07:00</published><updated>2011-04-28T19:43:10.872-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><title type='text'>Decoding x86: From P6 to Core 2 - Part 3</title><content type='html'>This is the Part 3 of a 3 part series. To fully appreciate what's written here, the &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html"&gt;Part 1&lt;/a&gt; and &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2.html"&gt;Part 2 &lt;/a&gt;articles (or comparable understandings) are prerequisites.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%;"&gt;&lt;span style="font-weight: bold;"&gt;The New Core Improvements&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Intel's "brand" new Core 2 Duo has many improvements over Pentium M. 
With respect to the x86 decode stage, they include -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Improved micro-fusion&lt;br /&gt;&lt;/li&gt;&lt;li&gt;4-wide decode&lt;/li&gt;&lt;li&gt;Macro-fusion&lt;/li&gt;&lt;/ol&gt;All of these have been described and repeated many times by on-line review sites. Here again we will look at them in more technical and analytical detail.&lt;br /&gt;&lt;br /&gt;The improved micro-fusion is the least complicated, so we will just briefly describe it here. It consists of a bigger XLAT PLA (see the partial decoder diagram in &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2.html"&gt;Part 2&lt;/a&gt;) that can handle more load-modify or addressed store instructions, including many SSE2/SSE3 ones. This improves Core 2's SSE performance over its predecessors, which must re-steer many SSE instructions to the first (full) decoder to be processed. In fact, Core Solo/Core Duo (Yonah) already has improved micro-fusion over Pentium M, but for a smaller set of instructions than Core 2 Duo.&lt;br /&gt;&lt;br /&gt;On non-SSE codes, however, the performance boost is limited.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%;"&gt;&lt;span style="font-weight: bold;"&gt;A 4-wide decode &amp;amp; issue width&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The biggest marketing hype of Core 2 is certainly its &lt;span style="font-style: italic;"&gt;ability to decode and issue 4 x86 instructions per cycle&lt;/span&gt;, thus &lt;span style="font-style: italic;"&gt;achieving an IPC (Instructions Per Cycle) of 4 (or 5 with macro-fusion)!&lt;/span&gt; It turns out this is the biggest misconception around Core 2. As discussed in Myth #3 of the &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html"&gt;Part 1 article&lt;/a&gt;, a (sustained) rate of three x86 decodes per cycle is not the performance bottleneck yet. 
In fact, Intel's Optimization Reference Manual itself says that&lt;br /&gt;&lt;blockquote&gt;[Decoding 3.5 instructions per cycle] is higher than the performance seen in most applications.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;- 2.1.2.2 Instruction Fetch Unit (Instruction PreDecode)&lt;/span&gt;&lt;/blockquote&gt;Note that this is stated under the condition that branches, assumed to occur once every 7 instructions, are predicted 100% correctly, which is almost never the case, so the sustained IPC is usually further reduced.&lt;br /&gt;&lt;br /&gt;Contrary to the marketing slogan and common (mis-)belief, the main purpose of a 4-wide decode &amp;amp; issue (and of macro-fusion, discussed below) is really to combat the many undesirable design artifacts of P6's x86 decode engine. As seen at the end of the &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html"&gt;Part 1 article&lt;/a&gt;, these design artifacts reduce the efficiency of the 4-1-1 decoders, which under real circumstances can hardly sustain three x86 decodes per cycle. Specifically -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Flushing the decoding pipeline every 16 bytes, or about 4 to 5 x86 instructions on average.&lt;/li&gt;&lt;li&gt;Flushing the decoding pipeline at each complex (&amp;gt; 2 fused micro-op) instruction.&lt;/li&gt;&lt;li&gt;Reducing instruction fetch for taken branches, especially to an unaligned target address.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;An additional partial decoder&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For 1. and 2. in the above list, an additional partial decoder can help simply by raising the upper bound of the averaging range. 
For the purpose of discussion, suppose a 16-byte window contains four x86 instructions, and there is only one complex instruction among two such windows:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A set of 4-1-1 decoders will spend 4 to 5 cycles to decode the two 16-byte instruction windows, where two cycles are spent on the window with only simple instructions, and another two or three are spent on the one with a complex instruction (depending on where the complex instruction occurs).&lt;/li&gt;&lt;li&gt;A set of 4-1-1-1 decoders will spend only 3 to 4 cycles to decode the same two windows.&lt;/li&gt;&lt;/ul&gt;By lifting the roof of the best-case capability, a wider x86 decode engine can increase the average decode throughput. Note that even under the ideal condition where branch-related stalls do not occur, the sustained decode rate is still less than 2.7 per cycle (8 instructions in 3+ cycles), far from the value of 4 or 5 advertised by Intel.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 100%;"&gt;&lt;span style="font-weight: bold;"&gt;The Instruction Queue&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The extra partial decoder, however, does not help the 3rd point in the previous list when a branch is taken, especially to an unaligned target address. Note that branch frequency is about 10-15% in normal programs (see also macro-fusion below). While many branch targets can be forced to be 16-byte aligned, this is usually not possible for small in-line loops. If the entry point of the loop is at offset &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; (its address MOD 16), then during the first cycle executing the loop, only 16 minus &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; fetched bytes contain effective instructions. 
This number does not increase no matter how many additional decoders you add to the decoding engine.&lt;br /&gt;&lt;br /&gt;The real "weapon" the Core 2 Duo has against this branch-related inefficiency is not the 4-wide decoder, but &lt;span style="font-style: italic;"&gt;a pre-decoded instruction queue up to 18 x86 instructions deep&lt;/span&gt;. Refer to the first diagram of the &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html"&gt;Part 1 article&lt;/a&gt;, on P6's Instruction Fetch Unit. There is a 16-byte wide, instruction boundary aligned Instruction Buffer sitting in between the IFU and the decoders. Replacing this buffer with an 18 instruction-deep queue (probably 24 to 36 bytes in size) that can detect loops among the instructions it contains, we get Core 2 Duo's biggest advantage with respect to x86 decode: &lt;span style="font-style: italic;"&gt;the ability to sustain a continuous decode stream on short loops&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;This continuous stream of x86 instructions allows Core 2 Duo's four decoders to be better utilized. The 18-instruction queue is aligned at instruction boundaries, and thus is immune to the branch target (16-byte) misalignment problem. Although the 18-deep queue length easily becomes insufficient if loop unrolling, a compile-time optimization technique, is used, this is okay because unrolling a loop has the exact same effect as supplying a continuous instruction stream. Moreover, the instruction queue also serves as a place where macro-fusion opportunities can be identified, as will be discussed next.&lt;br /&gt;&lt;br /&gt;Without extensive simulation or real traces, we really can't be sure how much of a boost Core 2 Duo receives from the 4-wide decode and the instruction queue. 
We have to make a guess: by using one extra partial decoder, the &lt;span style="font-style: italic;"&gt;average&lt;/span&gt; &lt;span style="font-style: italic;"&gt;sustained&lt;/span&gt; x86 decode throughput is probably increased from around 2.1 to about 2.5 macroinstructions (x86) per cycle. With the help of the instruction queue to supply uninterrupted macroinstructions in small loops, the sustained decode throughput is probably increased further to 2.7 or even close to 3.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%;"&gt;&lt;span style="font-weight: bold;"&gt;Macro-fusion, the Myth and Truth&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Debunking the Myth&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Intel markets macro-fusion as the ability to increase x86 decode throughput from 4 to 5. As we have seen in the section above, the decode throughput &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; macro-fusion is much less than 4 and only close to 3. It turns out that macro-fusion has even less impact on improving the throughput, as is discussed here.&lt;br /&gt;&lt;br /&gt;So what really is macro-fusion? In Intel's P6 terminology, "macro" or "macroinstruction" is used to describe an instruction in the original ISA (Instruction Set Architecture, here the x86). Thus &lt;span style="font-style: italic;"&gt;macro-fusion is actually the exact same idea as micro-fusion, where two (or more) dependent instructions with a single fan-out are collapsed into one instruction format&lt;/span&gt; (see the &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2.html"&gt;Part 2 article&lt;/a&gt;). The difference is in their application domain: where micro-fusion works on internal micro-ops, macro-fusion works on (x86) macroinstructions. 
In fact, Intel's macro-fusion patent, &lt;span style="font-style: italic;"&gt;System and Method for Fusing Instructions&lt;/span&gt;, filed in Dec.2000, predates its micro-fusion patent, &lt;span style="font-style: italic;"&gt;Fusion of Processor Micro-Operations&lt;/span&gt;, filed in Aug.2002. It is probably for the following two reasons that the former was implemented later:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Complexity (or difficulty)&lt;/li&gt;&lt;li&gt;Limited usefulness&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Why is it difficult, and what does it do&lt;/span&gt;?&lt;br /&gt;&lt;br /&gt;First, we know that x86 instructions are complex and variable-length. Some x86 instructions take 6 clock cycles just to determine their length (page 2-7, Instruction PreDecode, of Intel's Optimization Reference Manual). The complexity of collapsing variable-length macroinstructions when most cycle time is spent on decoding lengths (among other things) is undoubtedly much higher than that of fusing fixed-width micro-ops. Second, it will be even more difficult, if not impossible, to determine dependencies in real time and fuse the dependent macroinstructions together.&lt;br /&gt;&lt;br /&gt;So instead of trying to fuse all possible macroinstruction pairs, Core 2 Duo fuses only selected macroinstruction pairs -&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The first macroinstruction must be a TEST X, Y or a CMP X, Y where only one operand of X and Y is an immediate or a memory word.&lt;/li&gt;&lt;li&gt;The second macroinstruction must be a conditional jump that checks the carry flag (CF) or zero flag (ZF).&lt;/li&gt;&lt;li&gt;The macroinstructions are not executing in 64-bit mode.&lt;/li&gt;&lt;/ul&gt;These test/compare-and-jump pairs are often used in integer programs composed of iterative algorithms. 
According to a &lt;a href="http://www.spec.org/workshops/2007/austin/"&gt;2007 SPEC Benchmark Workshop&lt;/a&gt; paper, "&lt;i&gt;&lt;span style="color: black;"&gt;&lt;a href="http://www.spec.org/workshops/2007/austin/presentations.html#Bird" linkindex="150" set="yes"&gt;&lt;i&gt;Characterization of Performance of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor&lt;/i&gt;&lt;/a&gt;&lt;/span&gt;&lt;/i&gt;," the frequency of macro-fused operations in SPEC CPU2006 ranges from 0-16% in integer codes and just 0-8% in floating-point codes. In other words, in the best case, macro-fusion would reduce the number of macroinstructions from 100% to 92% for integer and just 96% for floating-point execution, hardly the whopping 20-25% reduction described by Intel's marketing department (and the numerous on-line repeaters).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Bringing the Truth&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Looking closer, we realize that the purpose of macro-fusion is not so much to reduce the number of x86 instructions to be decoded, but &lt;span style="font-style: italic;"&gt;again to reduce decode interruptions/stalls due to predicted-taken branches&lt;/span&gt;. Again for the purpose of discussion, let's number the four x86 decoders as 0, 1, 2, and 3. A two-macroinstruction sequence can be steered to any of the following four positions: [0,1], [1,2], [2,3], [3,0]. If the conditional jump is predicted taken, then no instruction after it will be steered for decoding, and in two of the four cases (&lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, [0,1] and [3,0]) the four decoders will decode no other macroinstruction at all in the cycle. 
More specifically,&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Decoder slot [0,1], no other instruction decode, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot [1,2], 1 other instruction decode, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot [2,3], 2 other instruction decodes, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot [3,0], no other instruction decode, 0.25 probability&lt;/li&gt;&lt;/ul&gt;The average number of &lt;span style="font-style: italic;"&gt;other&lt;/span&gt; decodes is thus (1+2)*.25 = 0.75, or about 19% efficiency when the 4 decoders work on a block of macroinstructions containing conditional branches. Note that this assumes ideal conditions otherwise, including perfect branch prediction, all simple instructions, and no 16-byte instruction misalignment. In reality, separate test-and-jump macroinstructions will probably reduce decode efficiency even more.&lt;br /&gt;&lt;br /&gt;Thankfully, when looking at the bigger picture, the situation becomes much better. As previously stated, the frequency of conditional branches itself tops out at 8-16% in the first place; in other words, on average one taken branch occurs in every 8 to 16 other instructions, or every 3 to 4 instruction fetch cycles (see the bottom of page 2-6 in Intel's Optimization Reference Manual). Suppose a taken branch occurs after 3 blocks of non-branching decodes; the 80% decoding efficiency loss at the branching block would then result in less than 20% loss overall. This is why even without macro-fusion, Core 2's predecessor (Yonah) can already achieve an IPC higher than 2 for some programs with only three x86 decoders.&lt;br /&gt;&lt;br /&gt;Now let's look at what happens to the conditional branch decode when macro-fusion is added. 
Again, the first column is the decoder number occupied by the now fused branch macroinstruction; the second column is the number of other instruction decodes; the last column is the occurrence probability of the row:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Decoder slot 0, no other instruction decode, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot 1, 1 other instruction decode, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot 2, 2 other instruction decodes, 0.25 probability&lt;/li&gt;&lt;li&gt;Decoder slot 3, 3 other instruction decodes, 0.25 probability&lt;/li&gt;&lt;/ul&gt;The average number of other decodes becomes (1+2+3)*.25 = 1.5, or about 38% efficiency of the 4 decoders, doubling that of the case without macro-fusion. The overall decoding efficiency loss reduces from less than 20% to less than 10%. A 10% increase in decoding efficiency will certainly be appreciated by the rest of the core, lifting the roof of &lt;span style="font-style: italic;"&gt;sustained&lt;/span&gt; IPC to 3 or maybe even higher for SPEC95-like programs (note that according to Intel's manual, Core 2's macroinstruction length pre-decoder is designed to sustain a 3.5 decode throughput in the worst case).&lt;br /&gt;&lt;br /&gt;This concludes the 3-part Decoding x86: From P6 to Core 2 series. I hope what's written here satisfies your curiosity with regard to the inner workings of modern microarchitectures, as it certainly did mine over the course of my research/study on them. 
Please let me know if you have comments, suggestions, or even better, corrections, to the contents.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5633641575838094464?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5633641575838094464/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5633641575838094464' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5633641575838094464'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5633641575838094464'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/06/decoding-x86-from-p6-to-core-2-part-3.html' title='Decoding x86: From P6 to Core 2 - Part 3'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-2510122580391086813</id><published>2007-05-29T14:29:00.000-07:00</published><updated>2007-05-31T22:21:56.750-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><title type='text'>Decoding x86: From P6 to Core 2 - Part 2</title><content type='html'>This is the Part 2 of a 3 article series. 
To fully appreciate what's written here, &lt;a href="http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html"&gt;the Part 1 article&lt;/a&gt; (or comparable understanding) is a prerequisite.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;The New Advancements&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Three major advancements have been made to the original P6 x86 decode over the years: &lt;span style="font-style: italic;"&gt;micro-op fusion&lt;/span&gt; (Pentium M), &lt;span style="font-style: italic;"&gt;macro-fusion&lt;/span&gt; (Core 2), and an increased &lt;span style="font-style: italic;"&gt;4-wide decode&lt;/span&gt; (also Core 2). In this Part 2 article, I will go over micro-op fusion in more detail, and in the next Part 3, I will go further into Core 2's additions.&lt;br /&gt;&lt;br /&gt;While these advancements have all been "explained" numerous times on the Internet, as well as marketed massively by Intel, I must say that many of those explanations and claims are either wrong or misleading. People got second-hand info from Intel's marketing guys and possibly even some designers, and they tend to spice it up with extra sauce, partly from imagination and partly from "educated" [sic] guesses.&lt;br /&gt;&lt;br /&gt;One big problem that I saw in many of those on-line "analyses" is that &lt;span style="font-style: italic;"&gt;they never get to the bottom of the techniques, such as why they were implemented and what makes them compelling as they are&lt;/span&gt;. Instead, most of those analyses just repeat whatever glossy terms they got from Intel and gloss over the technical reasoning. 
Not that this technical reasoning is any more important to end users, but without proper reference to it, the "analyses" will most surely degrade to mere marketing repeaters of the Intel Co. These wrong ideas also tend to have bad consequences for the industry - think of Pentium 4 and the megahertz hype that came with it.&lt;br /&gt;&lt;br /&gt;In the following, I will try to look at the true motives and benefits of these techniques from a technical point of view. I will try to answer the 3W1H questions for each: &lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;here does it come from, &lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;hat does it do, &lt;span style="font-weight: bold;"&gt;H&lt;/span&gt;ow does it work, and &lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;hy is it designed so. &lt;span style="font-size:85%;"&gt;As stated in the previous Part 1 article, all analyses here are based on publicly available information. Without inside knowledge from Intel, however, I cannot be certain of being 100% error-free. But the good thing about technical reasoning is that, with enough evidence, you can also &lt;span style="font-style: italic;"&gt;reason&lt;/span&gt; for or against it, instead of choosing to believe whatever marketing crap comes your way.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;* Micro-op fusion - its RISC roots&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The idea behind micro-op fusion, or micro-fusion, came about in the early '90s to improve RISC processor performance where true data dependency exists. Unsurprisingly, it did not come from Intel. In a 1992 paper, "&lt;span style="font-style: italic;"&gt;Architectural Effects on Dual Instruction Issue With Interlock Collapsing ALUs&lt;/span&gt;," Malik &lt;span style="font-style: italic;"&gt;et al.&lt;/span&gt; from IBM devised a scheme to issue two dependent instructions at once to a 3-to-1 ALU. 
The technique, called &lt;span style="font-style: italic;"&gt;instruction collapsing&lt;/span&gt;, was then extended and improved by numerous researchers and designers.&lt;br /&gt;&lt;br /&gt;Intel came to the game quite late, around 2000/2001 (Pentium M was released in 2003), and apparently just grabbed the existing idea and filed a patent on it. The company did bring one new thing to the table: a cool name, &lt;span style="font-style: italic;"&gt;fusion&lt;/span&gt;. It really sounds better to &lt;span style="font-style: italic;"&gt;fuse&lt;/span&gt; than to &lt;span style="font-style: italic;"&gt;collapse&lt;/span&gt; instructions, doesn't it? In fact, the micro-fusion of Intel's design is very rudimentary compared to what had been proposed 6-8 years earlier in the RISC community; we will talk about this shortly.&lt;br /&gt;&lt;br /&gt;Let's first look at the original "instruction collapse" techniques. Because a RISC ISA generally consists of simple instructions, true dependency detection among these instructions becomes a big issue when collapsing them together. However, if one can dynamically find the dependencies (as all modern out-of-order dispatch logic can), one can then "collapse" not only two but also more instructions together. The reported performance improvement ranged from 7% to 20% on 2 to 4-issue processors.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;* A cheaper and simplified approach&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/Rl-R5JE0duI/AAAAAAAAACM/HT1zm_NzmXc/s1600-h/x86_micro-fusion_p6.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/Rl-R5JE0duI/AAAAAAAAACM/HT1zm_NzmXc/s400/x86_micro-fusion_p6.gif" alt="" id="BLOGGER_PHOTO_ID_5070932116317173474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Now turn to Intel's micro-op fusion. 
What does it do? Magic, as most tail-wagging websites have cheered? Surely not -&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It only works on x86 read-then-modify and operate-then-store instructions, where no dependency check is needed between the two micro-ops to be fused.&lt;/li&gt;&lt;li&gt;It works only on x86 decode and issue stages, so no speculative execution is performed.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It doesn't change or affect the ALUs, so the same number of execution units is still needed for one fused micro-op as for two non-fused micro-ops.&lt;/li&gt;&lt;/ul&gt;What is actually added is an XLAT PLA for each partial x86 decoder (see the diagram above, and also the Part 1 article of this series), so that partial x86 decode can handle those load/store instructions that generate two micro-ops. Naturally, the performance increase won't be spectacular, and the early report from Intel is just between 2% and 5%. This is actually not that bad a result, given that the technique itself is pretty localized (to the x86 decode and micro-op format), and the main point of micro-fusion is not to remove dependencies or to increase execution width anyway, as will be discussed later.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;* An additional PLA plus a condensed format&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_oGCeAi-2i3Q/Rl-Z6pE0dvI/AAAAAAAAACU/Uwb0rZD64-E/s1600-h/x86_micro-fusion_format.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_oGCeAi-2i3Q/Rl-Z6pE0dvI/AAAAAAAAACU/Uwb0rZD64-E/s400/x86_micro-fusion_format.gif" alt="" id="BLOGGER_PHOTO_ID_5070940938179999474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;So how does micro-fusion work? 
An x86 read-then-modify instruction, for example, consists of two dependent micro-ops in one "strand" (&lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, single fan-out): 1) calculate load address, 2) modify loaded result. Micro-fusion binds these two operations together into one format by -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Putting the two micro-ops into one fused format, which now has two opcode fields and three operand fields. (Yup, that's it, or what else did you expect?)&lt;/li&gt;&lt;li&gt;Putting the operand fields of the first opcode into the fused micro-op. Putting only the non-dependent operand field of the second opcode into the fused micro-op.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Linking the dependent operand of the second opcode to the output of the first opcode.&lt;/li&gt;&lt;/ol&gt;The fused micro-op is really two separate micro-ops combined in a condensed form. When the fused micro-op is issued, it occupies only one (wider) reservation station (RS) slot. Since it only has one fan-out (execution result), it occupies only one reorder buffer (ROB) slot, too. However, the two opcodes are still sent to separate execution units, so the execute bandwidth is not increased (nor reduced, by the way).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;* It works just fine - not great, just &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;fine&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So why does it work? Micro-fusion works because it relieves, to some degree, the x86 decoders of the 4-1-1 complexity constraint. 
&lt;span style="font-weight: bold;"&gt;On those x86 instructions that get one argument directly from memory locations&lt;/span&gt;, this technique will -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Increase x86 decode bandwidth from 1 to 3.&lt;/li&gt;&lt;li&gt;Reduce RS usage by 50%.&lt;/li&gt;&lt;li&gt;Reduce ROB usage by 50%.&lt;/li&gt;&lt;/ol&gt;All it costs to implement micro-op fusion is a minor increase in micro-op format complexity and an additional XLAT PLA for each partial decoder. So after all, it's probably a good deal and a smart way to increase P6 performance. It is just that, according to the published literature, it doesn't work miracles as many amateur sites have claimed, and not much of the intellectual credit for it belongs to Intel.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-2510122580391086813?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/2510122580391086813/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=2510122580391086813' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/2510122580391086813'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/2510122580391086813'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2.html' title='Decoding x86: From P6 to Core 2 - Part 2'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/_oGCeAi-2i3Q/Rl-R5JE0duI/AAAAAAAAACM/HT1zm_NzmXc/s72-c/x86_micro-fusion_p6.gif' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5088930965106384404</id><published>2007-05-27T19:57:00.000-07:00</published><updated>2011-04-28T19:21:26.339-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><title type='text'>Decoding x86: From P6 to Core 2 - Part 1</title><content type='html'>In this series of articles I will take a close look at the &lt;span style="font-style: italic;"&gt;x86 instruction decode&lt;/span&gt; of Intel's P6 processor family, which includes Pentium-Pro/II/III/M, Core, and Core 2. I will first explain the design in some detail, then relate marketing terms such as &lt;span style="font-style: italic;"&gt;micro-op fusion&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;macro fusion&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;4-wide decoding&lt;/span&gt; to what is actually happening inside the processor, down to its microarchitecture and processing algorithms.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 85%;"&gt;All analyses here are based on publicly available information, such as Intel's software optimization manuals, patents and papers. What is added is some knowledge and understanding of computer microarchitecture and circuit design. W&lt;/span&gt;&lt;span style="font-size: 100%;"&gt;ith great probability the analyses here should clarify or correct many more myths out there than they introduce errors.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%; font-weight: bold;"&gt;The x86-to-RISC Decoding Problem&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Over the years, Intel has advocated the use of CISC over RISC instruction sets. 
However, with great irony -if we actually believed Intel's apparent stance toward the RISC/CISC argument- its P6 microarchitecture is really designed to be more "RISC Inside" than "Intel Inside." In order to reach both higher clock rates and better IPC (instructions per clock), the complex x86 instructions had to be first decoded into a simple, fixed-width RISC format (micro-ops) before being sent for execution. &lt;span style="font-style: italic;"&gt;This way, the number of pipeline cycles an instruction must go through and the delay of the longest pipeline stage can be optimized for the common average-case rather than the rare worst-case instructions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;All sounds good, right? Except there are three (rather big) problems:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The variable-length x86 instructions, which are almost always misaligned in the instruction cache, are hard to decode in parallel (&lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, multiple decodes per clock cycle).&lt;/li&gt;&lt;li&gt;The many addressing modes and operand sizes of even the simplest x86 instruction require complex and slow translation from x86 to internal RISC.&lt;/li&gt;&lt;li&gt;The high complexity of some x86 instructions makes worst-case decoders highly complex and inefficient.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Only by recognizing the problems of x86 decode and the difficulty of solving them can we fully appreciate the design choices that Intel made in the P6 front-end, as described in the three techniques below.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 100%;"&gt;&lt;span style="font-weight: bold;"&gt;Technique #1: Pipeline the instruction length, prefix and opcode decodes&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;An x86 instruction can have 1-3 opcode bytes, 0-10 operand bytes, plus up to 14 prefix bytes, all without exceeding a 15-byte length limit. 
When stored in the instruction cache, it is almost never aligned to the cache line, which unfortunately is the unit that processor cores use to read from the cache. To solve the variable-length misalignment problem, P6's Instruction Fetch Unit (IFU) decodes the length, prefix, and the actual instruction opcodes in a pipelined fashion (see also the picture below):&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;span style="font-size: 130%;"&gt;&lt;u&gt;Instruction Fetch Unit and steering mechanism&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rl2iAM4lV7I/AAAAAAAAAB8/NjzdQ4Ld5Ms/s1600-h/x86_predecode_p6.gif" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5070386879831300018" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rl2iAM4lV7I/AAAAAAAAAB8/NjzdQ4Ld5Ms/s400/x86_predecode_p6.gif" style="cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /&gt;&lt;/a&gt;&lt;a href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rl2hHM4lV6I/AAAAAAAAAB0/yPY6k_R9kIo/s1600-h/x86_predecode_p6.gif" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;When IFU fetches a 32-byte cache line of instructions, it decodes the instruction lengths and marks the first opcode byte and the last instruction byte of every instruction in the window. 
The 32 bytes are put into a pre-decode buffer together with the markings.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The 32 bytes are scanned and 16 bytes starting from the first instruction are sent via a rotator to the instruction buffer (now aligned to the instruction boundary), from which they proceed on to two paths.&lt;/li&gt;&lt;li&gt;On one path, all 16 bytes are sent to the prefix decoders, where the first 3 prefix vectors are identified and sent to help instruction decode below.&lt;/li&gt;&lt;li&gt;On the other path and at the same time, 3 blocks of the same 16 bytes are steered to the 3 decoders in parallel, one block for each consecutive instruction.&lt;/li&gt;&lt;/ul&gt;Steering variable-length instructions is a complex task. The instruction bytes must be scanned sequentially to locate up to 3 opcodes and their operands, then packed and sent to the 3 decoders. Each decoder &lt;span style="font-style: italic;"&gt;might&lt;/span&gt; accept up to 11 bytes (max x86 instruction length &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; the prefix), or it might receive just one.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;By determining the instruction boundaries early and pipelining the prefix decode away from instruction decode, the steering task can be made simpler and faster.&lt;/span&gt; To further simplify the matter, only the first (full) decoder will accept 11 bytes; the other two (partial) decoders will accept only up to 8 bytes of opcodes and operands, as will be further discussed below.&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size: 100%;"&gt;&lt;span style="font-weight: bold;"&gt;Technique #2: Decode the opcodes and operands separately&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After a decoder (full or partial) receives the opcode and operand bytes, it must try to decode them into a RISC format efficiently. 
This is accomplished by again decoding the opcodes and the operands in separate paths, as illustrated by the partial decoder diagram below:&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;span style="font-size: 130%;"&gt;&lt;u&gt;Partial x86 decoder&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_oGCeAi-2i3Q/Rl2kw84lV8I/AAAAAAAAACE/glxq05WXJGc/s1600-h/x86_partial_decoder_p6.gif" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5070389916373178306" src="http://4.bp.blogspot.com/_oGCeAi-2i3Q/Rl2kw84lV8I/AAAAAAAAACE/glxq05WXJGc/s400/x86_partial_decoder_p6.gif" style="cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;From the steering circuit, 3 opcode bytes are picked up and sent to a translation programmable logic array (PLA) for control micro-op decode. The decoded control signals and micro-op template are put into a control uop register.&lt;/li&gt;&lt;li&gt;All the opcode and operand bytes, together with the prefix vector from the prefix decoders, are also sent to a field extractor in parallel. The field extractor extracts the alias information which further describes the control micro-ops into a macro-alias register.&lt;/li&gt;&lt;li&gt;The two registers, cuop and macro-alias, are then combined by an alias multiplexer to get the final alias-resolved micro-op (auop) code.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-style: italic;"&gt;By decoding opcodes into templates and extracting operand information separately, the opcode decoder's PLA can be minimized and made flexible.&lt;/span&gt; Flexibility is important, as we will see the full decoder (shown in Technique #3 below) is really the partial decoder plus 3 XLAT PLA pipelines and one microcode engine. 
The flexibility also made it possible to implement micro-op fusion by adding an extra XLAT PLA, as will be discussed in the Part 2 article later.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 100%; font-weight: bold;"&gt;Technique #3: Differentiate decoders to Make the Common Case Fast&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a typical x86 program, more than 2/3 of the instructions are simple enough to be represented by a single (non-fused) micro-op. Most of the other 1/3 can be decoded into 4 micro-ops or fewer, with a (very) few taking more to execute. Recognizing these facts, especially the 2:1 simple-to-complex ratio, the P6 design divides its decoders into the well-known 4-1-1 structure, giving only one decoder full capability:&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;span style="font-size: 130%;"&gt;&lt;u&gt;Full x86 decoder&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/RlvWRs4lV5I/AAAAAAAAABs/AxQ_KHfpBVU/s1600-h/x86_full_decoder_p6.gif" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5069881405130233746" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/RlvWRs4lV5I/AAAAAAAAABs/AxQ_KHfpBVU/s400/x86_full_decoder_p6.gif" style="cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The first decoder has four translate PLAs, decoding an instruction to up to 4 control uops in one clock cycle (see the full decoder diagram right above).&lt;/li&gt;&lt;li&gt;The first decoder also has a micro-code engine to decode the few really complex instructions over multiple clock cycles, generating 3 control uops per cycle (notice the three 2:1 MUXes in the above diagram).&lt;/li&gt;&lt;li&gt;The second and third decoders, as explained in Technique #2, have only one PLA and can decode only one single-uop x86 instruction per clock cycle.&lt;/li&gt;&lt;li&gt;Each decoder is equipped 
with its own macro-alias field extractor, although the first decoder's can be bigger in size.&lt;/li&gt;&lt;/ul&gt;When the micro-code engine is used, the 2nd and 3rd decoders are stalled to preserve in-order issue.&lt;span style="font-style: italic;"&gt; By differentiating the decoders and putting performance emphasis on the common simple-instruction cases, instruction decode and issue complexity can be minimized, and a higher clock rate can be reached.&lt;/span&gt; RISC design rule #1: &lt;span style="font-style: italic;"&gt;Make the common-case fast&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%; font-weight: bold;"&gt;The Myths, Part 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Internet, being the greatest information exchange, inevitably also becomes the largest rumor farm and myth factory in the world. There have been numerous very wrong ideas about the P6 microarchitecture as a whole and the decoding front-end in particular. In "The Myths" section I will try to correct some of these misconceptions.&lt;br /&gt;&lt;br /&gt;Since this Part 1 article only talks about the basic x86 decoding mechanisms, the related myths are also more basic and less astonishing. The described decoding mechanisms are over 10 years old, after all. 
Nevertheless, it is still better to get things right than wrong.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #1: It is better to have more full decoders&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;An attempt to make fully capable decoders work in parallel is likely to spend more and gain little, not only because it will be very inefficient (resulting in slower clock rate and higher power usage), but also because it will cause trouble to the micro-op issue logic, which then must dynamically find out how many micro-ops are generated from each decoder, and &lt;span style="font-style: italic;"&gt;route&lt;/span&gt; them in an (M*D)-to-N fabric from D decoders of M micro-ops to an issue queue of length N.&lt;br /&gt;&lt;br /&gt;With twice as many simple instructions as complex ones in a typical program, an additional full decoder will not be worth it unless two more partial decoders are added. This ratio is increased even more with the introduction of micro-op fusion and the use of powerful SIMD instructions, although these came later.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #2: It is better to utilize the full decoder as much as possible&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Even though the full decoder can generate up to 4 micro-ops per clock cycle &lt;span style="font-style: italic;"&gt;in parallel with&lt;/span&gt; the partial decoders, the issue queue of the P6 microarchitecture can only issue 3 micro-ops (or 4 in the case of Core 2) during any cycle. 
What this says is that the micro-op issue (and execution) logic will not be able to "digest" a continuous flow of x86 instructions with 4-1-1 uop complexity (with micro-op fusion, the pattern becomes &lt;span style="font-style: italic;"&gt;selectively&lt;/span&gt; 4-2-2 - see Part 2 for more detail).&lt;br /&gt;&lt;br /&gt;In other words, the pipeline (more precisely, the issue queue) will stall even when you sparsely (&lt;span style="font-style: italic;"&gt;e.g.&lt;/span&gt;, less than 30%) use those moderately complex instructions that can be decoded in one clock cycle. A corollary of this is that, in general, it is beneficial to replace a complex instruction by 3 simple ones (or 4 in the case of Core 2). The lesson: CISC does not scale. Even though you are writing/compiling to a CISC x86 ISA, you still want to make your assembly code as RISC-like as possible to get higher performance.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #3: The same decoding width implies the same level of performance&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To be sure, the 4-1-1 decoding engine was not the performance bottleneck until the days of Pentium M, when micro-op fusion was introduced. Even with micro-op fusion, which supposedly &lt;span style="font-style: italic;"&gt;doubles&lt;/span&gt; the capability of the partial decoders, Intel reported less than 5% performance increase over the non-fused x86 decoding. The fact is, the IPC (instructions per clock) of all x86 processor cores, including the ones that bear the "Core 2" mark, has never exceeded 3. Pentium III running SPEC95 has IPC roughly between 0.6 and 0.9. Assuming a 30% increase with each newer generation (which is quite optimistic to say the least), Pentium M would have IPC roughly between 0.8 and 1.2, Core would have it between 1.0 and 1.5, and Core 2 between 1.3 and 2.0. 
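&lt;br /&gt;&lt;br /&gt;The estimate above is plain compounding, and a short Python sketch reproduces it. The function name and the rounding to one decimal place are my own choices for illustration; the 0.6 to 0.9 starting range and the 30% gain per generation are the assumptions stated in the text, not measurements.&lt;br /&gt;&lt;pre&gt;
```python
def project_ipc(low, high, gain, generations):
    """Compound an IPC range by a fixed relative gain per generation."""
    ranges = [(low, high)]
    for _ in range(generations):
        low, high = low * (1 + gain), high * (1 + gain)
        ranges.append((low, high))
    return ranges


# Assumed inputs, taken from the paragraph above: Pentium III at 0.6 to 0.9
# IPC, then three generations at an (optimistic) 30% gain each.
names = ["Pentium III", "Pentium M", "Core", "Core 2"]
for name, (lo, hi) in zip(names, project_ipc(0.6, 0.9, 0.30, 3)):
    print(f"{name}: {lo:.1f} to {hi:.1f}")
```
&lt;/pre&gt;Rounded to one decimal, the projected ranges agree with the figures quoted above, and even the optimistic upper bound for Core 2 stays at about 2.&lt;br /&gt;&lt;br /&gt;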
In other words, theoretically the ability to decode 3 instructions per cycle is quite sufficient up till this moment.&lt;br /&gt;&lt;br /&gt;Of course nothing in the real world runs in a theoretical way. Aside from the fact that there are many other things in a processor core to slow down execution, P6's 3-wide (or 4-wide in the case of Core 2) x86 decode &lt;span style="font-style: italic;"&gt;rarely &lt;/span&gt;&lt;span style="font-style: italic;"&gt;sustains&lt;/span&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;3 decodes per cycle, even with a low complex-to-simple instruction ratio. The reasons -&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;First, the complex instructions must be well positioned to the first decoder.&lt;/span&gt; Since the 3 (or 4 in the case of Core 2) x86-to-RISC decoders work in program order, if unfortunately the first decoder is occupied by a simple instruction while a complex instruction comes to the 2nd place, then during that clock cycle only one simple instruction will be decoded. The steering circuit will "re-steer" the complex instruction from the 2nd place to the 1st on the next cycle.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Second, the decoders are flushed every 16 instruction bytes (or 24 in the case of Core 2).&lt;/span&gt; Look at the IFU diagram at the beginning of this article: in every clock cycle 3 instructions from a 16-byte window are steered to the decoders. On average an x86 instruction takes about 3.5 bytes (the variance is high, though), so it is likely that the 16-byte window is not consumed in one clock cycle. If this is the case, then during the next cycle, the steering circuit will try to steer the next 3 instructions from the same 16-byte window to their respective decoders. But wait, what happens if there are fewer than 3 instructions left? 
Well, then fewer than 3 decoders have work to do in that cycle!&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Third, taken branches &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;always&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; interrupt and stop short the decoding.&lt;/span&gt; This is similar to the reason above, except that here the latter decoders are idle not because the end of the 16-byte window is reached, but because the rest of the instruction bytes in the window are not (predicted to be) executed. This happens even under 100% branch prediction accuracy. The problem here is even more serious when the target address is not aligned to a 16-byte boundary. For example, if the branch target instruction has byte address 14 MOD 16, then only one instruction is fetched (inside the first 16-byte window) after the branch is taken.&lt;br /&gt;&lt;br /&gt;Note that these are artifacts of P6's x86 decode design; they cannot be improved by any microarchitecture improvement &lt;span style="font-style: italic;"&gt;elsewhere&lt;/span&gt;. It is because of these reasons that we need micro-op fusion, macro fusion, or an additional partial decoder in the later generations of the P6 processor family to even get close to the theoretical 3-issue limit. 
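&lt;br /&gt;&lt;br /&gt;The first of the three reasons, the positioning of complex instructions, can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: the function name is hypothetical, every instruction is taken to decode into 1 to 4 micro-ops, and the 16-byte window, taken branches, the microcode engine and the issue-width limit are all ignored. It is not a description of Intel's actual steering logic.&lt;br /&gt;&lt;pre&gt;
```python
def decode_cycles(uop_counts, width=3):
    """Count decode cycles for a toy 4-1-1 style front end.

    uop_counts lists the micro-ops per instruction in program order
    (assumed to be 1 to 4 each). Slot 0 is the full decoder and takes
    any instruction; the later slots take single-uop instructions only.
    A complex instruction that arrives at a later slot ends the cycle
    and is re-steered to slot 0 on the next one.
    """
    cycles = 0
    i = 0
    while i != len(uop_counts):
        cycles += 1
        slot = 0
        while slot != width and i != len(uop_counts):
            if slot == 0 or uop_counts[i] == 1:
                i += 1        # instruction accepted by this decoder
                slot += 1
            else:
                break         # complex instruction must wait for slot 0
    return cycles


# The same mix of six instructions in two different orders.
print(decode_cycles([2, 1, 1, 2, 1, 1]))  # complex ones arrive at slot 0
print(decode_cycles([1, 2, 1, 1, 2, 1]))  # complex ones arrive at slot 1
```
&lt;/pre&gt;In this model the first ordering decodes in two cycles and the second needs three, although both contain exactly the same instructions.&lt;br /&gt;&lt;br /&gt;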
We will however wait until Part 2 (and possibly Part 3) to delve deeper into those.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5088930965106384404?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5088930965106384404/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5088930965106384404' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5088930965106384404'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5088930965106384404'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/decoding-x86-from-p6-to-core-2-and.html' title='Decoding x86: From P6 to Core 2 - Part 1'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rl2iAM4lV7I/AAAAAAAAAB8/NjzdQ4Ld5Ms/s72-c/x86_predecode_p6.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-8947975687627499458</id><published>2007-05-25T04:01:00.000-07:00</published><updated>2007-05-28T03:03:19.451-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='K10'/><title type='text'>The PoV-Ray benchmark and AMD's Barcelona demo</title><content type='html'>AMD recently &lt;a href="http://www.youtube.com/watch?v=VGiv9Dtrc5Q"&gt;showed off a 4-socket quad-core Barcelona (K10)&lt;/a&gt; which &lt;span style="font-style: 
italic;"&gt;almost&lt;/span&gt; doubles the speed of a 4-socket dual-core Opteron (K8) on PoV-Ray. More precisely, the rendering speed of the 16-core K10 system is just 1.87 times the speed of the 8-core K8 system, both running at the same processor frequency.&lt;br /&gt;&lt;br /&gt;To some degree, this is totally &lt;span style="font-style: italic;"&gt;below&lt;/span&gt; people's expectation on Barcelona/K10, especially according to AMD's official claim Barcelona should "blow Clovertown away."&lt;br /&gt;&lt;ul&gt;&lt;li&gt;First, we know PoV-Ray is very scalable with respect to number of cores: &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00788.html"&gt;a 4-socket 8-core Opteron system&lt;/a&gt; today already doubles PoV-Ray speed of &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00791.html"&gt;a 2-socket 4-core Opteron system &lt;/a&gt;(see 453.povray - 130 vs. 66.3). So what's the big deal if K10 runs 1.87 times as fast with twice the number of cores?&lt;/li&gt;&lt;li&gt;Second, according to SPECfp PoV-Ray scores, &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070217-00520.html"&gt;a 2-socket Clovertown system at 2.66GHz&lt;/a&gt; is &lt;span style="font-style: italic;"&gt;more than twice&lt;/span&gt; as fast as &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070423-00903.html"&gt;a 2-socket Opteron system at 3.0GHz&lt;/a&gt; (again, 453.povray - 145 vs. 69.4). 
How is Barcelona going to blow Clovertown away if it doesn't even double the speed of today's dual-core Opteron?&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The first question turns out to be easy to answer: &lt;span style="font-weight: bold;"&gt;the point of the demo is not just (nearly) twice the performance, but also &lt;/span&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;within the same power/thermal envelope&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;.&lt;/span&gt; In other words, the quad-core K10 is going to be a perfect drop-in replacement for today's dual-core K8. &lt;span style="font-style: italic;"&gt;The same does not hold with Intel's Clovertown/Xeon.&lt;/span&gt;  According to &lt;a href="http://www.gamepc.com/labs/view_content.asp?id=qcxeon&amp;page=3&amp;amp;cookie%5Ftest=1"&gt;this GamePC measurement&lt;/a&gt;, to upgrade a Xeon system from dual-core to quad-core under the same thermal/power envelope, one must lower the processor's clock rate by 30% (2.66GHz -&gt; 1.86GHz, or 2.33GHz -&gt; 1.6GHz), which generally implies a 15-20% loss of performance.&lt;br /&gt;&lt;br /&gt;However, this still doesn't answer the second question. Shouldn't K10 with 2x the number of cores be &lt;span style="font-style: italic;"&gt;more than 2x&lt;/span&gt; the speed of K8, due to the many per-core improvements we've heard of inside Barcelona/K10?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;To answer this question, we have to look more closely at the benchmark: PoV-Ray.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We know AMD was using PoV-Ray 3.7 beta in the Barcelona demo, because previous versions do not support SMP. Now, there are two executables in the PoV-Ray 3.7 beta package: one compiled with x87 instructions, and one with SSE2. Which one did AMD use? If it was the SSE2, then why didn't it show any per-core improvement? 
If it was the x87, then why did AMD purposely choose a slower program to demo its next-generation processor?&lt;br /&gt;&lt;br /&gt;It turns out that &lt;span style="font-style: italic;"&gt;none of these questions is appropriate&lt;/span&gt;. Because - (1) PoV-Ray's usage of SSE2 is not SSE (Streaming&lt;span style="font-style: italic;"&gt; SIMD&lt;/span&gt; Extensions) at all, but really double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87. This is also&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;span style="font-style: italic;"&gt;not&lt;/span&gt; going to change with K10.&lt;br /&gt;&lt;br /&gt;First, there is no &lt;span style="font-style: italic;"&gt;actual&lt;/span&gt; usage of vectorized (or &lt;span style="font-style: italic;"&gt;packed&lt;/span&gt;) instructions in PoV-Ray SSE. The only packed instructions I see from the binary are register conversions between x87 and SSE2 formats. PoV-Ray SSE basically treats SSE2 as a faster [sic] x87 engine which can access xmm registers randomly (rather than stack-based as in x87). For example, a simple double-precision division in PoV-Ray SSE is performed by the following instruction sequence:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Convert the divisor from single to double (CVTSS2SD)&lt;/li&gt;&lt;li&gt;Perform double-precision scalar division using DIVSD&lt;/li&gt;&lt;li&gt;Convert the result from two double values to two single values (CVTPD2PS).&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;This offers considerable advantage for Intel's Core 2, because SSE2 DIVSD (18 cycles) in Core 2 is much faster than x87 FDIV (36 cycles), and the conversion instructions are also quite fast (4 cycles). Overall, for Core 2, the above sequence will save ~30% of the cycles (4+18+4=26 vs. 36) of an x87 division. 
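&lt;br /&gt;&lt;br /&gt;The tally can be written out as a small Python sketch, using only the cycle counts quoted in this article (the K8 figures are the ones given in the text that follows). The function names are hypothetical, and adding the three latencies assumes the convert, divide and convert steps run strictly back to back, which is a simplification.&lt;br /&gt;&lt;pre&gt;
```python
def sequence_cycles(convert, divide):
    """Latency of convert, divide, convert executed back to back."""
    return convert + divide + convert


def ratio_to_x87(convert, divide, x87_divide):
    """Cycles of the SSE2 sequence relative to a single x87 FDIV."""
    return sequence_cycles(convert, divide) / x87_divide


# Cycle counts as quoted in the article, not measured here.
core2 = ratio_to_x87(convert=4, divide=18, x87_divide=36)
k8 = ratio_to_x87(convert=8, divide=20, x87_divide=20)
print(f"Core 2: {core2:.2f}x the cycles of an x87 division")
print(f"K8: {k8:.2f}x the cycles of an x87 division")
```
&lt;/pre&gt;A ratio below 1 means the SSE2 sequence is the faster one, as on Core 2 (26 vs. 36 cycles); a ratio above 1 means it is slower than plain x87.&lt;br /&gt;&lt;br /&gt;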
On the other hand, this sequence is very inefficient for K8, where SSE2 DIVSD is as fast as x87 FDIV (~20 cycles), but conversions are much slower (8 cycles). Overall, for K8, the sequence runs ~80% slower (8+20+8=36 vs. 20 cycles) than an x87 division.&lt;br /&gt;&lt;br /&gt;Roughly estimating, about 1/4 to 1/3 of the numerical instructions in PoV-Ray SSE undergo such a convert-calculate-convert process, where you see CVTxx2yy instructions all over the place in these parts of the code. Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the good/right way to do vectorized acceleration. It gives Core 2 a performance boost only due to Core 2's design artifact where such conversions are cheap/fast. Still, PoV-Ray SSE manages to run slightly faster than PoV-Ray x87 on K8, probably due to the ability to access registers randomly, which results in better superscalar and out-of-order execution.&lt;br /&gt;&lt;br /&gt;Second, comparing &lt;a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf"&gt;the K10 instruction latency&lt;/a&gt; with &lt;a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF"&gt;the K8 instruction latency&lt;/a&gt;, we find that K10 has little, if any, improvement on &lt;span style="font-style: italic;"&gt;scalar&lt;/span&gt; SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains &lt;span style="font-style: italic;"&gt;unfriendly&lt;/span&gt; to both the K8 and K10 microarchitectures. 
Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there &lt;span style="font-style: italic;"&gt;are&lt;/span&gt; some core improvements at work elsewhere inside the K10 design.&lt;br /&gt;&lt;br /&gt;So now it all looks reasonable that we see such "disappointing" results from the K10/Barcelona PoV-Ray demo. Except for one question that naturally comes up: why did AMD choose PoV-Ray for the demonstration in the first place? Sure, PoV-Ray is very scalable to multiple cores, but there are many other applications that scale as well, aren't there? Maybe AMD wants to run a program that has something to &lt;span style="font-style: italic;"&gt;display&lt;/span&gt;, such as a cool 3D image? Maybe AMD wants to show K10 can scale even on an unfriendly workload? Or maybe the guys responsible for the demonstration are just incapable of finding a good benchmark? Or maybe PoV-Ray is already the best case AMD can find, and Barcelona/K10 &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; going to disappoint? 
We simply won't know the real answer until the actual release of this greatly anticipated chip.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-8947975687627499458?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/8947975687627499458/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=8947975687627499458' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8947975687627499458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/8947975687627499458'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/pov-ray-benchmark-and-amds.html' title='The PoV-Ray benchmark and AMD&apos;s Barcelona demo'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-594603654533761882</id><published>2007-05-22T02:15:00.000-07:00</published><updated>2007-05-31T16:07:35.667-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><category scheme='http://www.blogger.com/atom/ns#' term='K8'/><category scheme='http://www.blogger.com/atom/ns#' term='SPEC CPU'/><title type='text'>Core 2 Duo: That Can Hardly Be More Optimized</title><content type='html'>In this article I aim to find out how well Core 2 Duo and K8 perform on the single-processed benchmarks, SPECint and SPECfp (&lt;span style="font-style: italic;"&gt;i.e.&lt;/span&gt;, no "rate" here), as some 
interesting observations come up again. A look at the facts is never short of revelation.&lt;br /&gt;&lt;br /&gt;The systems are shown in the following table. K8int/K8fp denote K8 Opteron scores for integer and floating point benchmarks, respectively. Similarly, C2int/C2fp denote Core 2 Duo scores. Both SPEC CPU2000 and SPEC CPU2006 are compared. The main criteria in choosing these SPEC submissions are -&lt;br /&gt;&lt;ol&gt;&lt;li&gt;There are at least two processor speeds with otherwise identical configurations.&lt;/li&gt;&lt;li&gt;They use 64-bit operating systems and compilers.&lt;/li&gt;&lt;li&gt;All have comparable memory across different architectures (DDR2-667 to be exact).&lt;/li&gt;&lt;/ol&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/RlK3u84lVyI/AAAAAAAAAA0/wuFFXMx1Gtw/s1600-h/clock_scaling2_text.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/RlK3u84lVyI/AAAAAAAAAA0/wuFFXMx1Gtw/s400/clock_scaling2_text.gif" alt="" id="BLOGGER_PHOTO_ID_5067314547990550306" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Unfortunately, I couldn't find a K8 and a C2D using the same operating system and compiler. Thus strictly speaking the absolute values in these tests are not comparable across system families. We will relax ourselves a bit here but keep this fact in mind.&lt;br /&gt;&lt;br /&gt;Below are the SPEC CPU2000 results. 
All points are the "base" scores -&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RlK58c4lV1I/AAAAAAAAABM/kjWdIbswjKg/s1600-h/clock_scaling2_grapha.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/RlK58c4lV1I/AAAAAAAAABM/kjWdIbswjKg/s400/clock_scaling2_grapha.gif" alt="" id="BLOGGER_PHOTO_ID_5067316978942039890" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Below are the SPEC CPU2006 results. The points whose labels end with a 'B' represent "base" scores; the other points represent the "peak" scores -&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oGCeAi-2i3Q/RlK5_M4lV2I/AAAAAAAAABU/3XCGhgNIpns/s1600-h/clock_scaling2_graphb.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oGCeAi-2i3Q/RlK5_M4lV2I/AAAAAAAAABU/3XCGhgNIpns/s400/clock_scaling2_graphb.gif" alt="" id="BLOGGER_PHOTO_ID_5067317026186680162" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In terms of SPEC CPU2000, Core 2 Duo completely outclasses Opteron/K8 on both INT (50%) and FP (30%) scores. Interestingly, this large advantage shrinks considerably on SPEC CPU2006, where the lead drops below 30% for INT and to almost nothing for FP. One explanation is that Core 2 Duo, released 3 years after Opteron, was optimized &lt;span style="font-style: italic;"&gt;by design&lt;/span&gt; for the benchmarks, at a time when only SPEC CPU2000 was available. Another explanation is that the newer SPEC CPU2006 does not benefit as much from a large L2 cache (up to at least 4MB) and thus favors K8's integrated memory controller more. 
Yet another explanation is that the newer benchmark codes are more complex and thus less amenable to the simple prediction heuristics that Core 2 probably uses more, or more effectively, than K8.&lt;br /&gt;&lt;br /&gt;Whatever the reasons (probably a bit of all three and more), one message is clear: for single-process integer codes, Core 2 Duo beats K8 Opteron hands down. For floating point, it's a close match, and one should look at the type of programs one runs before stating a preference.&lt;br /&gt;&lt;br /&gt;The really interesting observation lies in the "peak" versus "base" values of the benchmark results. For Core 2 Duo, peak offers just a 4% boost on INT and 3% on FP. On the other hand, for K8 Opteron, peak offers an 8% boost on INT and almost 30% on FP. It seems the microarchitecture of Core 2 Duo optimizes so much in hardware that there is little room left for software optimization, whereas K8 Opteron can still benefit from better compilation. This is certainly a plus for Core 2 Duo, because nobody likes to spend twice the time compiling an optimized executable.&lt;br /&gt;&lt;br /&gt;Comparing the SPEC and SPEC_rate results, we clearly see that while Core 2 Duo has a much better core implementation, its memory architecture trails K8's and drags down its throughput scalability. The FSB bottleneck can even be seen in the Core 2 Duo lines in the second graph above, where the two left-most point sets (with 1066MHz FSB) are much lower than the others (with 1333MHz FSB). 
Again, as I said, with Core 2 Duo, Intel goes back to its roots: improving, marketing to, and profiting from personal/home (versus big-server/high-performance) computing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-594603654533761882?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/594603654533761882/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=594603654533761882' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/594603654533761882'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/594603654533761882'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/core-2-duo-optimizing-uarch-couldnt-be.html' title='Core 2 Duo: That Can Hardly Be More Optimized'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_oGCeAi-2i3Q/RlK3u84lVyI/AAAAAAAAAA0/wuFFXMx1Gtw/s72-c/clock_scaling2_text.gif' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-3855479771963244342</id><published>2007-05-19T07:24:00.000-07:00</published><updated>2007-05-31T16:09:21.188-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><category scheme='http://www.blogger.com/atom/ns#' term='K8'/><category scheme='http://www.blogger.com/atom/ns#' term='SPEC CPU'/><title type='text'>More scaling - where 
a picture speaks a thousand words</title><content type='html'>One reader of my previous article asked why I didn't use a dual-socket Core 2 Duo for the scaling comparison. The reason is simple: I couldn't find a single pair of SPEC 2006 results where a single-socket and a dual-socket Core 2 Duo machine use the same CPU clock rate, memory technology, compiler, and operating system, so that a scientifically valid comparison can be made.&lt;br /&gt;&lt;br /&gt;In this article I will relax a bit and not require exact matches among the candidate systems. I will use four x86_64 system models to show the "number of cores" and "clock rate" scaling of both Intel Core 2 Duo and AMD Opteron (K8).&lt;br /&gt;&lt;br /&gt;Below are the system settings and their SPEC2006_rate scores. I use "ds" for dual-socket dual-core (4 cores), "qc" for single-socket quad-core (4 cores), and "dc" for single-socket dual-core (2 cores) -&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rk9BOs4lVxI/AAAAAAAAAAs/FFYDUyUydFg/s1600-h/clock_scaling_text.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rk9BOs4lVxI/AAAAAAAAAAs/FFYDUyUydFg/s400/clock_scaling_text.gif" alt="" id="BLOGGER_PHOTO_ID_5066339826637559570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Nothing is better than a picture to illustrate complicated data. Below is the performance graph of these systems. 
Green lines are for Fujitsu/AMD; blue lines for Fujitsu/Intel; red lines for Acer/Intel -&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oGCeAi-2i3Q/Rk9BLM4lVwI/AAAAAAAAAAk/JcReXQ_KgQE/s1600-h/clock_scaling_graph.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_oGCeAi-2i3Q/Rk9BLM4lVwI/AAAAAAAAAAk/JcReXQ_KgQE/s400/clock_scaling_graph.gif" alt="" id="BLOGGER_PHOTO_ID_5066339766508017410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;I couldn't resist the temptation, so below is a list of observations that I have to make:&lt;br /&gt;&lt;br /&gt;First, with 2 cores, Core 2 Duo is undoubtedly the winner on both SPECint_rate and SPECfp_rate. With 4 cores, however, K8 becomes the better choice for SPECfp. The more powerful a system is, the more advantage K8 has, due to its better "number of cores" scalability.&lt;br /&gt;&lt;br /&gt;Second, Intel's FSB (front-side bus) &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; a bottleneck for 4 cores, even at 1066MHz. This is obvious from the left-most points of the 4-core Core 2 systems (C2ds and C2qc), where the scores fall below the rest of the clock scaling trend. Looking at the system settings, these lower-than-expected scores come precisely from the systems with the 1066MHz FSB (vs. 1333MHz).&lt;br /&gt;&lt;br /&gt;Third, the MCM quad-core could be a good cost/power saver for single-socket home users and low-end servers. It almost matches dual-socket Opteron on integer performance, although its floating-point performance still leaves something to be desired.&lt;br /&gt;&lt;br /&gt;Fourth, the MCM quad-core does not scale well at/beyond 2.67GHz. You may cry, look, the 2.67GHz C2Q even has lower SPECfp_rate than the 2.40GHz C2Q! There must be something wrong with the Fujitsu systems! Unfortunately, no. As of May 2007, all reported 2.67GHz C2Q SPECfp_rate scores I can find are "lower than expected." 
(The highest among them is 33.9 - less than 1% higher - but it uses FB-DIMM, unlike the other systems presented here). This is probably why Intel is so late in introducing a higher-clocked Core 2 Quad - if they are not (much) better, why bother?&lt;br /&gt;&lt;br /&gt;Fifth, the "clock rate" scaling of K8 performance is slowing down at 2.8GHz, especially for SPECfp_rate. Since all Fujitsu Primergy RX330 systems are identical except for the CPU clock rate, the only explanation is that the larger processor-memory speed gap makes higher CPU frequency less effective. Core 2 does not experience the same slowdown, probably due to its larger cache and better load/store circuits.&lt;br /&gt;&lt;br /&gt;Sixth, doubling the L2 cache size helps Core 2 Duo about as much as one speed grade (0.16GHz). This is seen from the "jump" in the single-socket Core 2 Duo performance (C2dc), where the left two points with 2MB L2 are one step lower than the right three points with 4MB L2.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-3855479771963244342?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/3855479771963244342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=3855479771963244342' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/3855479771963244342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/3855479771963244342'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/more-scaling-where-picture-speaks.html' title='More scaling - where a picture speaks a thousand 
words'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_oGCeAi-2i3Q/Rk9BOs4lVxI/AAAAAAAAAAs/FFYDUyUydFg/s72-c/clock_scaling_text.gif' height='72' width='72'/><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8459791.post-5200396591822597565</id><published>2007-05-15T12:45:00.000-07:00</published><updated>2007-05-31T16:08:48.827-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Core 2'/><category scheme='http://www.blogger.com/atom/ns#' term='K8'/><category scheme='http://www.blogger.com/atom/ns#' term='SPEC CPU'/><title type='text'>Multi-core scalability (lacking) of Intel Core 2 Duo</title><content type='html'>It's been over 9 months since Intel released the Core 2 Duo processors. Praise for this processor and its multi-chip module (MCM) quad-core brother, Core 2 Quad, floats around the Internet. With this line of processors, Intel is going back to its roots - marketing to and profiting from personal (vs. high-performance or big-server) computing. In other words, while Core 2 Duo/Quad works great for home projects (video encoding, playing games, etc.), it does not scale well to larger, heavier-duty setups.&lt;br /&gt;&lt;br /&gt;Enough talk; let's see some evidence from the industry-standard SPECint_rate and SPECfp_rate benchmarks. 
We will only look at the base scores from the new SPEC 2006 benchmark suite.&lt;br /&gt;&lt;br /&gt;First we look at how well Core 2 Quad scales from Core 2 Duo:&lt;br /&gt;&lt;ul&gt;[SPECint_rate_base2006]&lt;br /&gt;&lt;li&gt;Intel Xeon 3060 2.4GHz, 2 cores/1 chip, 1066MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070329-00696.html"&gt;26.0&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Intel Xeon X3220 2.4GHz, 4 cores/1 chip, 1066MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070122-00261.html"&gt;43.4&lt;/a&gt; (1.67x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;[SPECfp_rate_base2006]&lt;br /&gt;&lt;li&gt;Intel Xeon 3060 2.4GHz, 2 cores/1 chip, 1066MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070329-00694.html"&gt;22.4&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Intel Xeon X3220 2.4GHz, 4 cores/1 chip, 1066MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070122-00259.html"&gt;33.5&lt;/a&gt; (1.50x)&lt;/li&gt;&lt;/ul&gt;The above shows that, if you buy a Core 2 &lt;span style="font-style: italic;"&gt;Quad&lt;/span&gt;, you really get just 3.3 cores' worth of performance (2 cores x 1.67) for average integer workloads, and only 3 cores' worth (2 cores x 1.50) for floating point. 
In other words, the architecture already lacks scalability at four cores.&lt;br /&gt;&lt;br /&gt;In contrast, let's look at how AMD's Opteron (K8) scales to multi-core:&lt;br /&gt;&lt;ul&gt;[SPECint_rate_base2006]&lt;br /&gt;&lt;li&gt;AMD Opteron 854 2.8GHz, 2 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2006q3/cpu2006-20060513-00033.html"&gt;22.3&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 854 2.8GHz, 4 cores/4 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2006q3/cpu2006-20060513-00034.html"&gt;41.4&lt;/a&gt; (1.86x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;AMD Opteron 2210 1.8GHz, 2 cores/1 chip - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070416-00871.html"&gt;17.3&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 2210 1.8GHz, 4 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070416-00872.html"&gt;34.3&lt;/a&gt; (1.98x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;[SPECfp_rate_base2006] &lt;li&gt;AMD Opteron 854 2.8GHz, 2 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2006q3/cpu2006-20060513-00029.html"&gt;24.1&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 854 2.8GHz, 4 cores/4 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2006q3/cpu2006-20060513-00030.html"&gt;45.6&lt;/a&gt; (1.89x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;AMD Opteron 2210 1.8GHz, 2 cores/1 chip - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070416-00876.html"&gt;17.6&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 2210 1.8GHz, 4 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070416-00875.html"&gt;34.8&lt;/a&gt; (1.98x)&lt;/li&gt;&lt;/ul&gt;What we see here is that, for a total of 4 cores per system, not only dual-core Opterons but even single-core Opterons connected by cHT links scale much better than two Core 2 Duos sitting on an MCM. 
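&lt;br /&gt;&lt;br /&gt;The scaling factors quoted in the lists above are simple ratios of rate scores. As a minimal sketch (the Python helper names below are mine and purely illustrative, not part of any SPEC tooling; the scores are copied from the results listed above), they can be reproduced like this:&lt;br /&gt;&lt;pre&gt;
```python
# Rate scores copied from the SPEC CPU2006 results listed above.
# Each entry: (label, cores in smaller system, smaller score, larger score).
# The larger system always has twice the cores of the smaller one.
RESULTS = [
    ("Xeon 3060 to X3220, int", 2, 26.0, 43.4),
    ("Xeon 3060 to X3220, fp", 2, 22.4, 33.5),
    ("Opteron 854, int", 2, 22.3, 41.4),
    ("Opteron 854, fp", 2, 24.1, 45.6),
    ("Opteron 2210, int", 2, 17.3, 34.3),
    ("Opteron 2210, fp", 2, 17.6, 34.8),
]


def scaling_factor(score_small, score_large):
    """Ratio of the larger system's rate score to the smaller one's."""
    return score_large / score_small


def effective_cores(cores_small, score_small, score_large):
    """Cores' worth of throughput delivered by the larger system,
    measured in units of the smaller system's per-core throughput."""
    return cores_small * scaling_factor(score_small, score_large)


if __name__ == "__main__":
    for label, cores, small, large in RESULTS:
        factor = scaling_factor(small, large)
        eff = effective_cores(cores, small, large)
        print(f"{label}: {factor:.2f}x, about {eff:.1f} effective cores")
```
&lt;/pre&gt;Effective cores here simply means the smaller system's core count multiplied by the scaling factor, which is how the "3.3 cores" figure for Core 2 Quad arises. Perfect scaling would give a factor of 2.00 in every row.&lt;br /&gt;&lt;br /&gt;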
Note that the absolute numbers in the different cases above are not directly comparable to each other, since they use different CPU clock rates, memory technologies, operating systems, and compilers.&lt;br /&gt;&lt;br /&gt;Now let's look at how well Core 2 Duo scales to a multi-core, multi-processor setup:&lt;br /&gt;&lt;ul&gt;[SPECint_rate_base2006]&lt;br /&gt;&lt;li&gt;Intel Xeon X5355 2.67GHz, 4 cores/1 chip, 1333MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070216-00491.html"&gt;45.9&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Intel Xeon X5355 2.67GHz, 8 cores/2 chips, 1333MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070216-00492.html"&gt;78.0&lt;/a&gt; (1.70x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;[SPECfp_rate_base2006]&lt;br /&gt;&lt;li&gt;Intel Xeon X5355 2.67GHz, 4 cores/1 chip, 1333MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070220-00561.html"&gt;33.9&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Intel Xeon X5355 2.67GHz, 8 cores/2 chips, 1333MHz FSB - &lt;a href="http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20070217-00520.html"&gt;56.3&lt;/a&gt; (1.66x)&lt;/li&gt;&lt;/ul&gt;Again, the scalability is lacking; you get only 6.8 and 6.6 cores' worth of performance from an 8-core setup for integer and floating-point codes, respectively.&lt;br /&gt;&lt;br /&gt;In contrast, let's look at how Opteron scales from 4 cores to 8. 
This time we use only the dual-core Opteron processors for comparison:&lt;br /&gt;&lt;ul&gt;[SPECint_rate_base2006]&lt;br /&gt;&lt;li&gt;AMD Opteron 2222SE 3.0GHz, 4 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00789.html"&gt;44.6&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 2222SE 3.0GHz, 8 cores/4 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00796.html"&gt;84.4&lt;/a&gt; (1.89x)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;[SPECfp_rate_base2006]&lt;br /&gt;&lt;li&gt;AMD Opteron 2222SE 3.0GHz, 4 cores/2 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00791.html"&gt;47.3&lt;/a&gt;&lt;/li&gt;&lt;li&gt;AMD Opteron 2222SE 3.0GHz, 8 cores/4 chips - &lt;a href="http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070403-00788.html"&gt;89.8&lt;/a&gt; (1.90x)&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-style: italic;"&gt;Not&lt;/span&gt; surprisingly, for a total of 8 cores per system, the &lt;span style="font-style: italic;"&gt;dual-core&lt;/span&gt; Opterons also scale much better than the &lt;span style="font-style: italic;"&gt;quad-core&lt;/span&gt; Xeons.&lt;br /&gt;&lt;br /&gt;What is interesting above is that, for Core 2 Duo, the 4-to-8-core scaling is actually better than the 2-to-4-core one. This is probably due to the fact that the 8-core system has a 25% faster FSB (1333MHz vs. 1066MHz), plus a chipset intelligent enough to separate traffic to/from the two quad-core processors (rather than a dumb MCM connection as the Core 2 Quad has internally). 
This also shows that (1) Intel's FSB design is the bottleneck of multi-core scaling even at quad-core, and (2) the MCM quad-core is an even worse approach for scaling performance to multi-core.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;In Conclusion&lt;/span&gt; - Intel's Core 2 Duo could well be the fastest processor for home computers (or dual-core, single-processor servers), for those willing to pay a bit more for faster video encoding and AI-intensive gaming. On the other hand, the hard evidence above shows that &lt;span style="font-style: italic;"&gt;for servers that scale to 4 cores or higher, today's dual-core Opteron &lt;/span&gt;&lt;span style="font-style: italic;"&gt;is a far better choice&lt;/span&gt;. This is probably due to Opteron's Direct-Connect architecture and integrated memory controller, both of which AMD implemented in 2003 and both of which Intel will adopt in its next major processor release (Nehalem) in late 2008.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8459791-5200396591822597565?l=abinstein.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abinstein.blogspot.com/feeds/5200396591822597565/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8459791&amp;postID=5200396591822597565' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5200396591822597565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8459791/posts/default/5200396591822597565'/><link rel='alternate' type='text/html' href='http://abinstein.blogspot.com/2007/05/scalability-or-lack-of-it-of-intels.html' title='Multi-core scalability (lacking) of Intel Core 2 
Duo'/><author><name>abinstein</name><uri>http://www.blogger.com/profile/09589312866039619976</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry></feed>
