Pentium 4: Round 2 - The Ass Kicking Continues
Copyright (C) 2001 by Darek Mihocka
President and Founder, Emulators Inc.
Originally posted March 3 2001
Benchmarks Round 2: March 3 2001
In my initial analysis of the Pentium 4 in December, I looked at the design of the Pentium 4 and how it differed from past Intel processors, revealed the numerous flaws in its implementation, and tackled the question of whether you and I, Joe Blow Consumers, should go and buy a Pentium 4 system. The answer, if you've read the preceding 30 or so pages, was a resounding NO! But for my comparison against the Pentium 4 I used slightly out-of-date systems based on older Pentium III and Athlon chips which were purchased earlier last year or even the previous year. As unfair an advantage as that gave the Pentium 4, it still did poorly.
Pentium 4 vs. its high-end peers
For this round of benchmarks, I looked at higher end systems and current chips which are comparable to the Pentium 4. Forget last summer's hardware. Assuming you have the same $4000 to $4500 budget to spend that a Pentium 4 system costs, can you do better? Can you do better running a top of the line Pentium III chip? Can you do better running a top of the line Athlon? That's what I set out to find out in January.
Since the launch of the Pentium 4 in November, two other powerful chips have come out. Intel's own Pentium III line has been updated with the 1 GHz Pentium III Xeon chip. The Xeon, which very few other reviews even bother to mention, is a slightly faster version of the stock Pentium III chip that features a full clock speed on-chip L2 cache. At $999 a pop, it's almost double the price of a stock 1 GHz Pentium III chip.
The other chip news worth noting is the release of DDR (double data rate) memory chips and the subsequent release of DDR enabled Athlon chipsets and motherboards by VIA and other manufacturers. DDR is a variation on the standard SDRAM memory widely used today in all Pentium II, Celeron, Pentium III, and Athlon computers. While SDRAM currently runs at a bus speed of 133 MHz (this is referred to as PC133 memory), DDR memory is clocked at double that rate, 266 MHz. This is the computer industry's response to Rambus (or RDRAM), which Intel switched to for the Pentium 4 and which most other manufacturers have wisely stayed away from.
DDR vs RAMBUS
Needless to say, DDR increases the rate at which the processor can transfer large blocks of memory, the same thing Rambus is designed to do. I can tell you now, without even having to think hard, that DDR will become the new memory standard. Just as PC100 was the standard for the original Pentium III and Athlon processors, and PC133 is the current standard for today's Pentium III and Athlon processors, both Intel and AMD are now pushing DDR. AMD already supports it in the latest Athlon processors (which support the same 266 MHz bus speed), and Intel is rushing to produce a DDR based Pentium 4 chipset. When both Intel and AMD (and also chipset maker VIA) agree on something, you know it will happen.
My advice is don't stock up on too much more PC100 or PC133 RAM, as by the next time you go to buy a computer it will be obsolete. PC133 is very inexpensive at this time, which is one drawback to jumping on DDR or Rambus right away. Currently a 128M PC133 module sells for under $60. Rambus 128M modules still sell for about $240, four times the PC133 price, even though RDRAM has been out for a while. The Nintendo 64 game system was using it 3 years ago, and most high end Pentium III machines have used it for the past year. Yet DDR memory, just released, is already priced at $180 per 128M, which, while still triple the price of PC133 memory, is already undercutting RDRAM and dooming Rambus to obscurity.
The new test systems
So, armed with the same $10000 budget I used to buy two Pentium 4 machines back in November, I set out to put together one each of a 1.0 GHz Pentium III Xeon system and a 1.2 GHz DDR Athlon system, the top of each processor line.
As these parts were still not in easy supply locally here in Seattle by January, I mail ordered both systems. For the Pentium III Xeon system I went to Dell (yes, Dell). Despite their Intel-only policy, I still regard them as the best PC manufacturer and ordered a dual 1.0 GHz Pentium III Xeon system with 512 megabytes of PC800 RDRAM. This weighed in at right around $6000. A pricey system, but keep in mind this included an extra $999 for the second processor and would have totaled $5000 for a single processor system.
For the Athlon system, I went to XI Computer in California and had them build a 1.2 GHz Athlon system with 512 megabytes of PC2100 DDR memory. For some reason, DDR memory is no longer referred to as PC266, it's called PC2100. Go figure. This system weighed in fully loaded at under $2800!
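The PC2100 name makes more sense once you look at peak bandwidth instead of clock speed. The sketch below multiplies the transfer rate by the bus width for each memory type. The bus widths (8 bytes for SDRAM and DDR modules, 2 bytes per RDRAM channel) are standard figures assumed here, not numbers taken from the systems tested, and the function name is purely illustrative.

```python
# Peak theoretical bandwidth = transfers per second * bytes per transfer.
# Bus widths below are assumed standard values, not measurements.
MODULES = {
    # name: (million transfers per second, bus width in bytes)
    "PC133 SDRAM": (133, 8),
    "PC2100 DDR": (266, 8),
    "PC800 RDRAM (one channel)": (800, 2),
}

def peak_bandwidth_mb_s(mega_transfers, bytes_per_transfer):
    """Return peak theoretical bandwidth in megabytes per second."""
    return mega_transfers * bytes_per_transfer

for name, (rate, width) in MODULES.items():
    print(f"{name}: {peak_bandwidth_mb_s(rate, width)} MB/s")
```

The DDR figure comes out at a little over 2100 megabytes per second, which is where the "2100" comes from, while the "800" in PC800 is still a transfer rate. These are peak numbers only and say nothing about latency.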
To keep things fair with the original Pentium 4 systems, all systems included 60 gig hard disks, internal 250 meg ZIP drives, CD-ROM and/or DVD-ROM drives, USB, etc. I kept the specs of the new systems as close as possible to the 1.5 GHz Pentium 4 systems originally tested, sticking with 512 megabytes of memory in each, sticking with the ZIP drives, and even putting the same costly PC800 RDRAM in the Xeon system instead of opting for the cheaper PC600 RDRAM.
With the money left over I upgraded the original 900 MHz AMD Athlon machine used in December's tests to a new ASUS A7M266 motherboard and DDR memory. Everything else was kept the same (the 900 MHz Thunderbird processor, the hard disk, the DVD drive, the video card, the network card, etc).
So, three new systems to tackle the Pentium 4 and to answer the question: can you do better than blowing $4300 on a 1.5 GHz Pentium 4 system?
By the way, I want to mention that none of the new systems was overclocked. In general I avoided overclocking so as to show the true capabilities of the stock machines, not to see how much liquid nitrogen I could use to speed up a system. Overclocking is an entirely different topic.
Also, the choice of using 512 megabytes of memory in each system may sound like overkill, but was done for two reasons: to keep the playing field fair between the different systems, and to give Windows ample memory to run everything in physical RAM and avoid swapping. We are testing CPU speed here, not hard disk speed, so random timing variations due to swap file activity need to be minimized. A typical Windows Millennium or Windows 2000 home system only requires 64 to 128 megabytes but will end up swapping to disk a lot more.
Test #1: MPEG revisited
The first test was to repeat the MPEG encoding benchmark first run in December. After re-running the test on the two existing Pentium III systems to make sure the results came out the same as before, I ran the tests on the three new systems. Note, the 900 MHz Athlon DDR system is the same system used in December, except with the motherboard upgrade and the DDR memory upgrade. I also tested the effects of running the dual processor systems with both a single processor and in dual processor mode.
By the way, the two dual processor systems are running Windows 2000 Professional, while all the other machines, including the Pentium 4 and Athlon machines, are running Windows Millennium.
The table below shows the original results posted in December, with the new systems and benchmarks added in italics.
|Chip speed and type||Elapsed time (seconds)||Clock cycles (billions)|
|1.2 GHz AMD Athlon DDR||413||496|
|1.0 GHz Pentium III Xeon (single)||473||473|
|1.5 GHz Pentium 4||484||726|
|1.0 GHz Pentium III Xeon (dual)||520||520|
|900 MHz AMD Athlon DDR||535||482|
|900 MHz AMD Athlon||544||490|
|670 MHz Pentium III (single)||680||456|
|670 MHz Pentium III (dual)||743||498|
|650 MHz Pentium III||757||492|
|533 MHz Celeron||858||457|
|600 MHz AMD Athlon||922||553|
|500 MHz Pentium III||946||473|
|600 MHz Crusoe||1369||821|
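The clock cycle column in the table above is simply the elapsed time multiplied by the clock speed. Here is a minimal sketch of that arithmetic for a few of the rows. The function and variable names are made up for illustration, and the nominal clock speed is assumed to be the actual clock speed.

```python
# Rows copied from the table above: (system, clock speed in MHz, elapsed seconds).
RESULTS = [
    ("1.2 GHz AMD Athlon DDR", 1200, 413),
    ("1.0 GHz Pentium III Xeon (single)", 1000, 473),
    ("1.5 GHz Pentium 4", 1500, 484),
    ("600 MHz Crusoe", 600, 1369),
]

def clock_cycles_billions(clock_mhz, elapsed_seconds):
    """Total cycles consumed, in billions (MHz * seconds gives millions of cycles)."""
    return clock_mhz * elapsed_seconds / 1000

for system, mhz, seconds in RESULTS:
    print(f"{system}: {clock_cycles_billions(mhz, seconds):.0f} billion cycles")
```

A lower cycle count means the chip does more work per clock, which is why this column is a better measure of efficiency than elapsed time alone.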
Interestingly, the 900 MHz Athlon did not show any appreciable speed increase with the addition of the faster memory, barely a 2% improvement. One has to remember that in order to take full advantage of DDR memory one needs to use a DDR capable processor. The 6 month old Athlon does not run at a 266 MHz bus speed but rather only at 200 MHz. Therefore the memory bandwidth is only 50% higher (200 MHz vs. 133 MHz), not 100% higher. This shows that the MPEG encoder is not limited by memory bandwidth, as the Athlon's 384K of combined on-chip cache handles most memory requests. The 8 billion clock cycles that were saved do come from the memory improvement though, as the same 900 MHz processor was used in both tests and thus CPU instruction timings didn't change.
The 1.2 GHz Athlon system wins by a good 17% margin over the Pentium 4, giving a full 30% increase over the 900 MHz system. This is very good, as the 33% clock speed increase (going from 900 MHz to 1200 MHz) is almost linearly matched by the 30% performance increase. It means that no new bottlenecks were hit. In fact in unofficial testing I've successfully overclocked the system to over 1300 MHz with no problems.
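For anyone who wants to check the percentages, here is a rough sketch that treats "percent faster" as the ratio of elapsed times minus one, and compares it against the clock speed increase. The helper name is hypothetical, and the timings are the MPEG results from the table.

```python
def percent_faster(old_seconds, new_seconds):
    """How much faster the new run is than the old one, in percent."""
    return (old_seconds / new_seconds - 1) * 100

ATHLON_900_DDR = 535    # elapsed seconds, MPEG test
ATHLON_1200_DDR = 413
PENTIUM4_1500 = 484

clock_gain = (1200 / 900 - 1) * 100
perf_gain = percent_faster(ATHLON_900_DDR, ATHLON_1200_DDR)

print(f"Clock speed increase:         {clock_gain:.0f}%")
print(f"Performance increase:         {perf_gain:.0f}%")
print(f"Share of clock gain realized: {perf_gain / clock_gain:.0%}")
print(f"Margin over the Pentium 4:    {percent_faster(PENTIUM4_1500, ATHLON_1200_DDR):.0f}%")
```

A share close to 100% means performance is scaling almost linearly with clock speed. A much lower share would point to some other bottleneck, such as memory.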
The Pentium III Xeon, similarly with its 320K of on-chip cache, gives an almost linear speed increase over the 650 MHz and 670 MHz systems, 45% and 43% respectively. This is almost as good as expected, although a little disappointing given that the Xeon chip (with its faster on-chip cache) is supposed to be faster than a stock Pentium III. In this case, it seems to not matter. This suggests either that the Xeon is over-hyped, or that some internal limit of the Pentium III architecture was reached, which would not surprise me given that Intel had problems getting Pentium III chips to go over 1.0 GHz.
When the second processor in the 670 MHz and 1.0 GHz systems was disabled, the speed actually increased. This is because Windows 2000 runs different kernels depending on whether a single processor is being used or if multiple processors are being used. The multi-processor kernel has additional overhead on certain Windows calls, and thus when running a single application you may actually notice a speed decrease! One of the dirty little secrets of dual processor computers is that they are not always faster!
Also interesting to note before I move on to the next test, the range of speed between the fastest system (the 1.2 GHz Athlon) and the lowest end Celeron system is only about 2 to 1. In other words, you can do half as well with a $500 bottom-of-the-line PC as with a high-end PC costing 6 to 12 times as much. Yet another reason not to hand over your wallet every time you walk into a computer store and are bombarded with "new and faster" computer systems.
Test #2: Prime95 revisited
The next test is a repeat of the floating point tests using the PRIME95 utility from www.mersenne.org. Due to a bit of inattention on my part in December, my timings for the Pentium 4 and Athlon systems were only accurate to within about 10%, so this time I did the test a little differently so that the results could be read directly from the screen instead of requiring a clock.
When PRIME95 runs to determine if a given number is prime, it executes millions of iterations of floating point operations. The numbers are assigned by a central server, which issues a unique test number to each PRIME95 client. The large test numbers being issued require about 12 million iterations to determine whether the number is prime or not.
The PRIME95 program can be set to display timing information after a fixed number of iterations. Setting it to produce output say, every 100 iterations, produces an output window that looks like this:
By looking at either the "Per iteration time" or the "clocks" numbers, you can determine how quickly and efficiently the program is running on a given processor. As you can see the iterations on average take the same number of clock cycles, so first I let each system get into a steady output state as shown above to verify that there were no random variations in the per-iteration timings. I also verified that each machine was working on a number that required roughly the 12 million iterations.
For each processor then I recorded the per iteration time, the total clocks per time interval, and the total iterations of the number being worked on. As the slower systems (such as the 500 MHz Pentium III) had been assigned significantly smaller iteration counts, I did not include those machines in this round of testing in order to keep things as fair as possible.
The table below summarizes the single CPU speeds of the 5 fastest systems, each working on roughly a 12 million iteration number:
|Chip speed and type||Total iterations||Per iteration time (seconds)||Clock cycles (millions)|
|1.2 GHz AMD Athlon DDR||12406517||.117||140|
|900 MHz AMD Athlon DDR||11941261||.136||122|
|1.5 GHz Pentium 4||12391663||.136||204|
|1.0 GHz Pentium III Xeon||12223447||.181||181|
|670 MHz Pentium III||12141847||.232||155|
As with the results before, the 1.5 GHz Pentium 4 roughly ties the speed of the 900 MHz Athlon. With its crippled processor core, it is the brute force clock speed of the Pentium 4 that keeps it in the game here. As calculated in terms of clock cycles per iteration the Pentium 4 is the least efficient of the 5 chips.
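To put the per-iteration times in perspective, the sketch below estimates how long each system would need to finish its assigned number. It assumes the per-iteration time in the table holds steady for the entire run and that the machine does nothing else, so treat the output as a ballpark only. The names are illustrative.

```python
SECONDS_PER_DAY = 86400

# Rows copied from the table above: (system, total iterations, seconds per iteration).
RUNS = [
    ("1.2 GHz AMD Athlon DDR", 12406517, 0.117),
    ("900 MHz AMD Athlon DDR", 11941261, 0.136),
    ("1.5 GHz Pentium 4", 12391663, 0.136),
    ("1.0 GHz Pentium III Xeon", 12223447, 0.181),
    ("670 MHz Pentium III", 12141847, 0.232),
]

def estimated_days(total_iterations, seconds_per_iteration):
    """Estimated wall-clock days to complete every iteration at a constant rate."""
    return total_iterations * seconds_per_iteration / SECONDS_PER_DAY

for system, iterations, per_iteration in RUNS:
    print(f"{system}: about {estimated_days(iterations, per_iteration):.1f} days")
```

Because each machine was assigned a slightly different iteration count, the per-iteration time remains the fairer basis for comparison.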
What should be noted is that neither the Pentium III nor the Athlon scales well here. Instead of a 50% speed increase by the Xeon, it only fares 28% better than the slower Pentium III (.232 vs. .181 seconds per iteration). Instead of a 33% increase in the Athlon, there is only a 16% improvement. This indicates a bottleneck has been hit as clock speed was increased. Given the large working set of PRIME95 (roughly on the order of 10 megabytes), which greatly exceeds the size of the on-chip caches, the bottleneck here is memory. Even with high speed memory, increasing processor clock speed increases the cost, measured in clock cycles, of fetching data from main memory. All 5 processors are stalling trying to keep data flowing in and out of the processor.
One can try to calculate the cost of the memory overhead. Each processor, running exactly the same code, will reference memory roughly the same number of times. Given that all 5 processors have 256K L2 caches, they will in theory each miss the cache the same number of times and have to reference main memory the same number of times. Now this estimate is not exact due to different caching techniques and properties of the different memories. However, we can do this more accurately for the same family of processors, i.e. comparing Athlons against each other and Pentiums against each other.
Athlons first. Using simple math, one can find the memory overhead of the Athlon to be about .057 seconds per iteration. That breaks down to .060 seconds of computation and .057 seconds of memory stall for the 1.2 GHz Athlon, and .079 seconds and .057 seconds for the 900 MHz Athlon. Note that .060/.079 is roughly the 3/4 ratio in execution time you'd expect between a 1.2 GHz and 900 MHz processor. This means that each Athlon processor spends about 72 million clock cycles doing useful work, and the rest is wasted waiting for memory operations to complete. As the processor gets faster, the cost of waiting on memory gets proportionally higher. The 900 MHz Athlon only wastes 50 million cycles on memory, but the 1.2 GHz Athlon about 68 million, an increase slightly higher than the proportional increase in clock speed. This makes sense, since memory runs at a fixed speed and so a faster processor will wait more cycles for a memory access, AND it will run out of things to do sooner and stall sooner during the memory operation. It should not surprise you that close to half the processor's time is spent stalled on memory. A lot of Windows code can end up doing this. This is a growing problem as processors have been increasing in speed faster than memory chips have been able to keep up.
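The "simple math" can be sketched as two equations in two unknowns. The model below is an assumption made for illustration: both Athlons execute the same number of compute cycles per iteration and both lose the same fixed amount of wall-clock time to memory, so that time per iteration = cycles / clock speed + stall. The function name is hypothetical and this is not necessarily the exact calculation used above.

```python
def solve_memory_model(fast_mhz, fast_seconds, slow_mhz, slow_seconds):
    """Solve seconds = cycles / clock + stall for two chips of the same family.

    Assumes both chips share the same compute cycle count and the same
    fixed memory stall time. Returns (compute cycles in millions,
    memory stall in seconds).
    """
    compute_mcycles = (slow_seconds - fast_seconds) / (1 / slow_mhz - 1 / fast_mhz)
    stall_seconds = fast_seconds - compute_mcycles / fast_mhz
    return compute_mcycles, stall_seconds

# Per-iteration times for the 1.2 GHz and 900 MHz Athlon DDR systems.
compute_mcycles, stall_seconds = solve_memory_model(1200, 0.117, 900, 0.136)

print(f"Compute cycles per iteration: {compute_mcycles:.1f} million")
print(f"Memory stall per iteration:   {stall_seconds:.3f} seconds")
for mhz in (1200, 900):
    print(f"{mhz} MHz compute time: {compute_mcycles / mhz:.3f} s, "
          f"stall cycles: {stall_seconds * mhz:.0f} million")
```

Solved exactly from the three-digit timings, this model gives roughly 68 million compute cycles and about .060 seconds of stall per iteration, close to but not identical with the rounded figures quoted above. With inputs this coarse, a difference of a few thousandths of a second is within the noise, and the conclusion that about half of each iteration is spent waiting on memory holds either way.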
The Pentium situation is tricky because while the Xeon and Pentium 4 both use the same RDRAM memory, their clock cycle counts for executing the code will be different. The Pentium III and Xeon will both spend the same number of clock cycles executing the code, but they use different memory speeds. We have 3 equations with 4 unknowns to solve for. This can be solved by introducing a fourth Pentium system or changing the clock speed of one of the existing systems. This I'll leave for someone else to play around with.
Test #3: Emulation tests
The most dramatic demonstration of the failure of the Pentium 4 comes when running emulation code. As I already alluded to in December, even against "slower" 600 to 900 MHz processors the Pentium 4 lost running various emulation benchmarks. Now faced with even faster Pentium III and Athlon challengers, it gets totally blown out of the water. A high bandwidth memory bus is of no use when the internals of the processor are crippled.
To keep things fair, the emulators used were December releases of our SoftMac and Gemulator products. The initial tests were conducted on versions 8.01 of those emulators, but after a couple of weeks of tweaking in December with the Pentium 4 machines, on December 31st we released the 8.02 versions of those products which ran faster on the Pentium 4 than the 8.01 releases. Most of our customers should have downloaded the 8.02 releases by now and can try these benchmarks themselves.
The Gemulator tests were done using Atari TOS 2.06, running in 640x400 monochrome video mode, 4 megabytes of memory. The SoftMac tests were done using Mac OS 8.1 and the Quadra 650 BIOS, 640x480 256-color mode, 512K of disk cache, 24 megabytes of memory.
Those results are now posted on our Benchmarks page.
If it isn't clear already, the Pentium 4 is a terrible choice for PC users. It is a severely crippled processor that does not live up to its original design specifications. It makes inefficient use of available transistors and chip space. It places a higher burden on software developers to optimize code, contrary to the trends being set by AMD and Transmeta processors. It reverts to 10 year old techniques which Intel abandoned, apparently having forgotten why.
Intel needs to heavily beef up the L1 cache size, add the missing L3 cache, add more decoders, raise the transfer rate from the trace cache to the core, lower the cost of shift operations, and add additional FPU and MMX execution units. Once these changes are made, and only then, will the Pentium 4 be a viable choice for computer users.