Pentium 4: Round 1 - Intel blows the lead
Copyright (C) 2000 by Darek Mihocka
President and Founder, Emulators Inc.
Originally posted December 27 2000


Processor Basics - the various generations of processors over the past 20 years

Generation 1 - 8086 and 68000

In the beginning, the computer dark ages of two decades ago, there was the 8086 chip, Intel's first 16-bit processor, which delivered 8 16-bit registers and could manipulate 16 bits of data at a time. It could also address 16 bits of address space at a time (or 64K, much like the Atari 800 and Apple II of the same time period). Using a trick known as segment registers, a program could simultaneously address 4 such 64K segments at a time and have a total of 1 megabyte of addressable memory in the computer. Thus was born the famous 640K RAM limitation of DOS, since the remaining 384K was used for hardware and video.
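
To make the segment register trick concrete, here is a minimal sketch in C of how a real-mode 8086 address is formed. The function name physical_address is made up for illustration; the shift by 4 bits and the wrap at 1 megabyte are how the 8086 is generally documented to behave.

```c
#include <stdint.h>

/* Real-mode 8086 address calculation: the 16-bit segment is shifted left
   by 4 bits (multiplied by 16) and added to the 16-bit offset, giving a
   20-bit physical address. The name physical_address is illustrative. */
uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    uint32_t addr = ((uint32_t)segment << 4) + offset;

    /* The 8086 has only 20 address lines, so the sum wraps at 1 megabyte. */
    return addr & 0xFFFFFu;
}
```

For example, segment 0xA000 with offset 0 gives 0xA0000, which is exactly the 640K mark.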

A lower cost and slower variant, the 8088, was used in early PCs, providing only an 8-bit bus externally to limit the number of pins on the chip and reduce costs. Contrary to what I incorrectly stated here before, the 8086 was not used in the original IBM PC. It was actually the lower cost 8088.

The original Motorola 68000 chip, while containing 16 32-bit registers and being essentially a 32-bit processor, used a similar trick of having only 16 external data pins and 24 external address pins to reduce the pin count on the chip. An even smaller 68008 chip addressed only 20 bits of address space externally and had the same 1 megabyte memory limitation as the 8086.

While these first generation processors from Intel and Motorola ran at speeds of 4 to 8 MHz, they each required multiple clock cycles to execute any given machine language instruction. This is because these processors lacked any of the modern features we know today such as caches and pipelines. A typical instruction took 4 to 8 cycles to execute, really giving the chips an equivalent speed of 1 MIPS (i.e. 1 million instructions per second).
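
The arithmetic behind that 1 MIPS figure is simple enough to write down. This is a rough sketch only; the function name approx_mips is an assumption for illustration, and real chips vary from instruction to instruction.

```c
/* Approximate instruction throughput in MIPS: clock rate in MHz divided by
   the average number of clock cycles each instruction takes.
   The name approx_mips is illustrative, not from any library. */
double approx_mips(double clock_mhz, double cycles_per_instruction)
{
    return clock_mhz / cycles_per_instruction;
}
```

An 8 MHz chip averaging 8 cycles per instruction, or a 4 MHz chip averaging 4, both work out to 1 MIPS.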

Back in the 80's NEC (yes, the monitor manufacturer) put out a couple of x86 clones, the V20 which was a clone of the 8088, and the V30 which was a clone of the 8086. Both chips offered slightly faster performance than the Intel parts. To be honest I have no clue what ever became of those processors. I've seen reference to the V40 and V50 but not much else.

Generation 2 - 80286 and 68020

By 1984, Intel released the 80286 chip used in the IBM AT and clones. The 80286 introduced the concept of protect mode, a way of protecting memory so that multiple programs could run at the same time and not step on each other. This was the base chip that OS/2 was designed for and which was also used by Windows/286. The 286 ran at 8 to 16 MHz, offering over double the speed of the original 8086 and could address 16 megabytes of memory.

Motorola meanwhile developed the 68020, the true 32-bit version of the 68000, with a full 32-bit data bus and 32-bit address bus capable of addressing 4 gigabytes of memory.

By the way, both companies did release a "1" version of each processor - the 80186 and 68010 - but these were minor enhancements over the 8086 and 68000 and not widely used in home computers.

Generation 3 - 80386 and 68030

The world of home computers didn't really become interesting until late 1986 when Intel released its 3rd generation chip - the 80386, or simply the 386. This chip, although almost 15 years old now, is the base on which OS/2 2.0, Windows 95, and the original Windows NT run. It was Intel's first true 32-bit x86 chip, extending the registers to a full 32 bits in size and increasing addressable memory to 4 gigabytes. In effect, Intel caught up to the 68020 in a big way, by also adding things like paging (which is the basis of virtual memory) and support for true multi-tasking and mode switching between 16-bit and 32-bit modes.

The 386 is really the chip, I feel, that put Intel in the lead over Motorola for good. It opened the door to things like OS/2 and Windows NT and Linux - truly pre-emptive, multi-tasking, memory protected operating systems. It was a 286 on steroids, so much more powerful, so much faster, so much more capable than the 286, that at over $20,000 a machine, people were dying to get their hands on them. I remember reading the review of the first Compaq 386 machine, again, a $20,000+ machine that today you can buy for $50, and the reviewer would basically kill to get one.

What made the 386 so special? Well, Intel did a number of things right. First they made the chip more orthogonal. What that means is that they extended the machine language instructions so that in 32-bit mode, almost any of the 8 32-bit registers could be used for anything - storing data, addressing memory, or performing arithmetic operations. Compare this to the 8086 and 80286 whose 16-bit instructions could only use certain registers for certain operations. The orthogonality of the 386 registers made up for the extra registers in the Motorola chips, which specifically had 8 registers which could be used for data and 8 for addressing memory. While you could use an address register to hold data or use data registers to address memory, it was more costly in terms of clock cycles.

The 386 allowed the average programmer to do away with segment registers and 640K limitations. In 386 protect mode, which is what most Windows, OS/2, and Linux programs run in today, a program has the freedom to address up to 4 gigabytes of memory. Even when such memory is not present, the chip's paging feature allows the OS to implement virtual memory by swapping memory to hard disk, what most people know as the swap file.

Another innovation of the 386 chip was the code cache, the ability of the chip to buffer up to 256 bytes of code on the chip itself and eliminate costly memory reads. This is especially useful in tight loops that are smaller than 256 bytes of code.

Motorola countered with the 68030 chip, a similar chip which added built-in paging and virtual memory support, memory protection, and a 256 byte code cache. The 68030 also added a pipeline, a way of executing parts of multiple instructions at the same time, to overlap instructions, in order to speed up execution.

Both the 386 and 68030 ran at speeds ranging from 16 MHz to well above 40 MHz, easily bringing the speed of the chips to over 10 MIPS. Both chips still required multiple clock cycles to execute even the simplest machine language instructions, but were still an order of magnitude faster than their first generation counterparts. Microsoft quickly developed Windows/386 (and later OS/2 and Windows NT) for the 386, and Apple added virtual memory support to Mac OS.

Both chips also introduced something known as a barrel shifter, a circuit in the chip which can shift or rotate any 32-bit number in one clock cycle, something used often by many different machine language instructions.

The 386 chip is famous for unseating IBM as the leading PC developer and for causing the breakup with Microsoft. IBM looked at the 386, decided it was too powerful for the average user, and decided not to use it in PCs and not to write operating systems for it. Instead it chose to keep using the 286 and to support the 286 in OS/2. Microsoft on the other hand developed Windows/386 with improved multitasking, Compaq and other clone makers did use the 386 to deliver the horsepower needed to run such a graphical operating system, and the rest is history. By the time IBM woke up, it was too late. Microsoft won. Compaq, Dell, and Gateway won.

Generation 4 - 486 and 68040

This generation is famous for integrating the floating point co-processor, previously a separate external chip, into the main processor. This generation also refined the existing technology to run faster. The pipelines on the Intel 486 and Motorola 68040 were improved to in effect give the appearance of 1 clock cycle per instruction execution. 20 MIPS. 25 MIPS. 33 MIPS. Double or triple the speed of the previous generation with virtually no change in instruction set! As far as the typical programmer or computer user is concerned, the 386 and 486, or 68030 and 68040, were the same chips, except that the 4th generation ran quicker than the 3rd. And speed was the selling point and the main reason you upgraded to these chips.

These chips exploited speed in a number of ways. First, the caches were increased in size to 8K, and made to handle both code and data. Suddenly relatively large amounts of data (several thousand bytes) could be manipulated without incurring the costly penalty of accessing main memory. Great for mathematical calculations and other such applications. This is why many operating systems today and many video games don't support anything prior to the 4th generation. Mac OS 8 and many Macintosh games require a 68040. Windows 98, Windows NT 4.0, and most Windows software today requires at least a 486. The caches made that huge a difference in speed! Remember this for later!

With the ability to read memory in a single clock cycle now came the ability to execute instructions in a single clock cycle. By decoding one instruction while finishing the execution of the previous instruction, both the 486 and 68040 could give the appearance of executing 1 instruction per cycle. Any given instruction still takes multiple clock cycles to execute, but by overlapping several instructions at once at different stages of execution, you get the appearance of one instruction per cycle. This is the job of the pipeline.

Keeping the pipeline full is of extreme importance! If you have to stop and wait for memory (i.e. the data or code being executed isn't in the cache) or you execute a complex instruction such as a square root, you introduce a bubble into the pipeline - an empty step where no useful work is being done. This is also known as a stall. Stalls are bad. Remember that.
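
To see how much bubbles cost, here is a back-of-the-envelope sketch in C. The function name pipeline_cycles and its parameters are assumptions made up for illustration; the model simply says that once the pipeline is full, one instruction completes per cycle, and every stall cycle is added on top.

```c
/* Cycles needed to run `instructions` through a pipeline of `stages` stages
   (stages must be at least 1), with `stall_cycles` bubbles inserted along
   the way. Filling the pipeline costs (stages - 1) cycles; after that, one
   instruction completes per cycle unless a bubble gets in the way.
   This is a simplified illustrative model, not a description of any
   specific chip. */
unsigned long pipeline_cycles(unsigned long instructions,
                              unsigned stages,
                              unsigned long stall_cycles)
{
    if (instructions == 0)
        return 0;
    return (stages - 1) + instructions + stall_cycles;
}
```

With a hypothetical 5 stage pipeline, 1000 instructions with no stalls take 1004 cycles, very nearly one per instruction. Add 500 cycles of stalls and the same code takes 1504 cycles, about 50% longer.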

One of the great skills of writing assembly language code, or writing a compiler, is knowing how to arrange the machine language instructions in such an order so that the steps you ask the processor to perform are done as efficiently as possible.

The rules for optimizing code on the 486 and 68040 are fairly simple:

The techniques used in the 4th generation are very similar to techniques used by RISC (reduced instruction set) processors. The concept is to use as simple instructions as possible. Use several simple instructions in place of one complex instruction. For example, to multiply by 2 simply add a value to itself instead of forcing the chip to use its multiply circuitry. Multiply and divide take many clock cycles, which is fine when multiplying by a large number. But if you simply need to double a number, it is faster to tell the chip to add two numbers than to multiply two numbers.
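
In C the doubling example looks like the sketch below. The two function names are made up for illustration. Keep in mind that an optimizing compiler will usually make this substitution for you, so the point is what ends up in the machine code, not which form you type.

```c
/* Doubling by multiplication: taken literally, this asks the chip to use
   its multiply circuitry. (Illustrative name.) */
int double_with_multiply(int x)
{
    return x * 2;
}

/* Doubling by addition: a single simple add of the value to itself.
   (Illustrative name.) */
int double_with_add(int x)
{
    return x + x;
}
```

Both return the same result for any value that does not overflow.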

Another reason to follow the optimization rules is because both the 486 and 68040 introduced the concept of clock doubling, or in general, using a clock multiplier to run the processor internally at several times the speed of the main computer clock. The computer may run at say, 33 MHz, the bus speed, but a typical 486 or 68040 chip is actually running at 66 MHz internally and delivering a whopping 66 MIPS of speed.

The year is now 1990. Windows 3.0 and Macintosh System 7 are about to be released.

Generation 5 - the Pentium and PowerPC

With the first decade and the first 4 generations of chips now in the bag, both Motorola and Intel looked for new ways to squeeze speed out of their chips. Brick walls were being hit in terms of speed. For one, memory chips weren't keeping up with the rapidly increasing speed of processors. Even today, most memory chips are barely 10 or 20 times faster than the memory chips used in computers two decades ago, yet processor speeds are up by a factor of a thousand!

Worse, the remaining hardware in the PC, things like video cards and sound cards and hard disks and modems, run at fixed clock speeds of 8 MHz or 33 MHz or some sub multiple of bus speed. Basically, any time the processor has to reference external memory or hardware, it stalls. The faster the clock multiplier, the more instructions that execute each bus cycle, and the higher the chances of a stall.

This is why for example, upgrading from a 33 MHz 486 to a 66 MHz 486 only offers about a 50% speed increase in general, and similarly when upgrading from the 68030 to the clock doubled 68040.

It's been said many times by many people, but by now you should have realized that CLOCK SPEED IS NOT EVERYTHING!!

What can affect speed far more than mere clock speed is the rate at which the chip can process instructions. The 4th generation brought the chip down to one instruction per clock cycle. The 5th generation developed the concept of superscalar execution. That is, executing more than one instruction per clock cycle by executing instructions in parallel.

Intel and Motorola chose different paths to achieve this. After an aborted 68050 chip and short lived 68060 chip, Motorola abandoned its 68K line of processors and designed a new chip based on IBM's POWER RISC chip. A RISC processor (or Reduced Instruction Set) does away with complicated machine language instructions which can take multiple clock cycles to execute, and replaces them with simpler instructions which execute in fewer cycles. The advantage of this is the chip achieves a higher throughput in terms of instructions per second or instructions per clock cycle, but the down side is it usually takes more instructions to do the same thing as on a CISC (or Complex Instruction Set) processor.

The theory with RISC processors, which has long since proven to be bullshit, was that by making the instructions simpler the chip could be clocked at a higher clock speed. But this in turn only made up for the fact that more instructions were now required to implement any particular algorithm, and worse, the code grew bigger and thus used up more memory. In reality a RISC processor is no more or less powerful than a CISC processor.

Intel engineers realized this and continued the x86 product line by introducing the Pentium chip, a superscalar version of the 486. The original Pentium was for all intents and purposes a faster 486, executing up to 2 instructions per clock cycle, compared to the 1 instruction per cycle limit of the 486. Once again, CLOCK SPEED IS NOT EVERYTHING.

By executing multiple instructions at the same time, the design of the processor gets more complicated. No longer is it a serial operation. While earlier processors essentially followed this process:

a superscalar processor now has additional steps to worry about

The extra checks are necessary to make sure that the code executes in the correct order. If two ADD operations follow one another, and the second ADD depends on the result of the first, the two ADD operations cannot execute in parallel. They must execute in serial order.
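
The kind of check involved can be sketched in a few lines of C. Everything here is a simplified assumption for illustration: the struct insn layout, the register numbering, and the function name can_pair are made up, and the real Pentium pairing rules include more conditions than these two.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction: one destination register and up to
   two source registers, identified by small integers. -1 means "unused". */
struct insn {
    int dest;
    int src1;
    int src2;
};

/* Returns true if `second` may execute in the same cycle as `first`.
   It may not if it reads the register that `first` writes (it needs the
   result of `first`), or if both write the same register. */
bool can_pair(const struct insn *first, const struct insn *second)
{
    if (first->dest < 0)
        return true;                  /* first writes nothing */
    if (second->src1 == first->dest || second->src2 == first->dest)
        return false;                 /* second depends on first's result */
    if (second->dest == first->dest)
        return false;                 /* both write the same register */
    return true;
}
```

Two ADDs where the second reads the register the first one writes fail the check and must run one after the other. Two ADDs on unrelated registers pass it.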

Intel gave special names to the two "pipes" that instructions execute in - the U pipe and the V pipe. The U pipe is the main path of execution. The V pipe executes "paired" instructions, that is, the second instruction sent from the decoder and which is determined not to conflict with the first instruction.

Since the concept of superscalar execution was new to most programmers, and to Microsoft's compilers, the original Pentium chip only delivered about 20% faster speed than a 486 at the same speed. Not 100% faster speed as expected. But faster nevertheless. The problem was very simply that most code was written serially.

Code written today on the other hand does execute much faster, since compilers now generate code that "schedules" instructions correctly. That is, it interleaves pairs of independent instructions so that most of the time two instructions execute each clock cycle.

The original PowerPC 601 chip similarly had the ability to execute two instructions per cycle, an arithmetic instruction paired with a branch instruction. The PowerPC 603 and later versions of the PowerPC added additional arithmetic execution units in order to execute 2 math instructions per cycle.

With the ability to execute twice as much code as before comes greater demand on memory. Twice as many instructions need to be fed into the processor, and potentially twice as much data memory is processed.

Intel and Motorola found that as clock speed was being increased in the processors, performance didn't scale, even on older chips. A 66 MHz 486 only delivered 50% more speed than a 33 MHz 486. Why?

The reason again has to do with memory speed. When you double the speed of a processor, the speed of main memory stays the same. That means that a cache miss, which forces the processor to read main memory, now takes TWICE the number of clock cycles. With today's fast processors, a memory read can literally take 100 or more clock cycles. That means 100, or worse, 200 instructions not being executed.
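
A crude model in C shows why doubling the clock does not double the speed. The function names, the 10% miss rate, and the 150 nanosecond memory time used in the note after the code are all assumptions picked for illustration, not measurements.

```c
/* Number of clock cycles lost to one main memory access that takes
   `memory_ns` nanoseconds, at a clock speed of `clock_mhz`. */
double miss_penalty_cycles(double clock_mhz, double memory_ns)
{
    return memory_ns * clock_mhz / 1000.0;
}

/* Average time in nanoseconds per instruction, assuming one clock cycle
   per instruction plus a fraction `miss_rate` of instructions that also
   wait `memory_ns` for main memory. A deliberately crude model. */
double ns_per_instruction(double clock_mhz, double miss_rate, double memory_ns)
{
    double cycle_ns = 1000.0 / clock_mhz;
    return cycle_ns + miss_rate * memory_ns;
}

/* Speedup from raising the clock from old_mhz to new_mhz while the memory
   stays exactly the same. */
double clock_speedup(double old_mhz, double new_mhz,
                     double miss_rate, double memory_ns)
{
    return ns_per_instruction(old_mhz, miss_rate, memory_ns) /
           ns_per_instruction(new_mhz, miss_rate, memory_ns);
}
```

Plugging in 33 MHz and 66 MHz with a made-up 10% miss rate and 150 ns memory, the model gives a speedup of roughly 1.5 rather than 2, and the same memory read costs twice as many clock cycles at 66 MHz as at 33 MHz. With a miss rate of zero the speedup is the full 2.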

The way Intel and Motorola attacked this problem was to increase the size of the L1 cache, the very high speed on-chip level one cache. For example, the original 486 had an 8K cache. The newer 100 MHz 486 chips had a 16K cache.

But 8K or 16K is nothing compared to the megabytes that a processor can suck in every second. So computers started to include a second level cache, the L2 cache, which was made up of slightly slower but larger memory. Typically 256K. The L2 cache is still on the order of 10 times faster than main memory, and allows most code to operate at near to full speed.

When the L2 cache is disabled (which most PC users can do in the BIOS setup), or when it is left out completely, as Apple did in the original Power Macintosh 6100, performance suffers.

Generation 6 - the P6 architecture and PowerPC G3/G4

By 1996 as processor speeds hit 200 MHz, more brick walls were being hit. Programmers simply weren't optimizing their code and as processor speeds increased, the processors simply spent more time waiting on memory or waiting for instructions to finish executing. Intel and Motorola adopted a whole new set of tricks in their 6th generation of processors. Tricks such as "register renaming", "out of order execution", and "predication".

In other words, if the programmer won't fix the code, the chip will do it for him. The Intel P6 architecture, first released in 1996 in the Pentium Pro processor, is at the heart of all of Intel's current processors - the Pentium II, the Celeron, and the Pentium III. Even AMD's Athlon processor uses the same tricks.

What they did is as follows:

From an engineering standpoint, the enhancements in the 6th generation processors are truly amazing. Through the use of brute force (larger caches and faster clock speed), parallel execution (multiple execution units and 3 decoders), and clever interlocking circuitry to allow out-of-order execution, Intel has been able to stick with the same basic architecture for 5 years now, catapulting CPU throughput from the 100 to 150 MHz range in 1995 to over 1 GHz today. Most code, even poorly written unoptimized code, executes at a throughput of over 1 instruction per clock cycle, or roughly 1000 MIPS on today's fastest Pentium III processors.

The PowerPC G3 and G4 chips use much the same tricks (after all, all these silicon engineers went to the same schools and read the same technical papers) which is why the G3 runs faster than a similarly clocked 603 or 604 chip.


Limitations of the Pentium III - why bother with a new design?

AMD, calling the Athlon a "7th generation" processor, something I don't fully agree with since they really didn't have a 6th generation processor, took the basic ideas behind the Pentium II/III and PowerPC G3 and used them to implement the Athlon. Having the benefit of seeing the original Pentium Pro's faults, they fixed many of the bottlenecks of the P6 design which even today limit the full speed of the Pentium III.

These are the same problems that Intel of course is trying to address in the Pentium 4. To understand why Intel needed to design the Pentium 4, it helps to understand why the AMD Athlon is a faster chip and what AMD did right, and that is what I shall discuss in this section.

Not counting the unbuffered segment register problem in the original Pentium Pro (which was fixed in the far more popular Pentium II chip), what are the bottlenecks? What can possibly slow down the processor when instructions are being executed out-of-order 3 at a time!?!?

Well, keep in mind that a chain is only as strong as its weakest link. In the case of the processor, each stage can be considered a link in a chain. The main memory. The L2 cache. The L1 cache. The decoder. The scheduler which takes decoded micro-ops and feeds them into the various execution units. The two main bottlenecks in the P6 architecture are the 4-1-1 limitation of the decoder, and the dreaded partial register stall.

If you read the Pentium III optimization document, you will see reference to the 4-1-1 rule for decoding instructions. When the Pentium III (for example) fetches code, it pulls in up to three instructions through the decoders each clock cycle. Decoder 1 can decode any machine language instruction. Decoders 2 and 3 can decode only simple, RISC-like instructions that break down into 1 micro-op. A micro-op is a basic operation performed inside the processor. For example, adding two registers takes one micro-op. Adding a memory location to a register requires two micro-ops: a load from memory, then an add. It uses two execution units inside the processor, the load/store unit on one clock cycle, and then an ALU on the next clock cycle. Micro-ops translate roughly into clock cycles per instruction but don't think of it that way. Since several instructions are being executed in parallel and out of order, the concept of clock cycles per instruction becomes rather fuzzy.
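
The 4-1-1 rule is easy to play with in a toy model. The C sketch below is a simplification under stated assumptions: the function name decode_cycles_411 is made up, every instruction is assumed to decode into 1 to 4 micro-ops, and everything else that happens in the real front end (instruction fetch alignment, microcoded instructions, and so on) is ignored.

```c
#include <stddef.h>

/* Simplified model of the P6 "4-1-1" decode rule. uops[i] is the number of
   micro-ops instruction i decodes into (assumed to be between 1 and 4).
   Each cycle, decoder 1 accepts the next instruction whatever its size;
   decoders 2 and 3 accept the instructions that follow only if each is a
   single micro-op instruction. Returns the number of decode cycles. */
size_t decode_cycles_411(const int *uops, size_t count)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < count) {
        i++;                               /* decoder 1 takes any instruction */
        if (i < count && uops[i] == 1) {
            i++;                           /* decoder 2 takes a simple one */
            if (i < count && uops[i] == 1)
                i++;                       /* decoder 3 takes a simple one */
        }
        cycles++;
    }
    return cycles;
}
```

In this model six register-to-register adds (1 micro-op each) decode in 2 cycles, while six memory-to-register adds (2 micro-ops each) take 6 cycles because only decoder 1 can handle them. Ordering matters too: the pattern 2-1-1-2-1-1 decodes in 2 cycles, but 1-1-2-1-1-2 takes 3.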

Instead, think of it like this. What is the limitation of each link? How frequently does that link get hit? Main memory, for example, may not be accessed for thousands of clock cycles at a time. So while accessing main memory may cost 100 clock cycles, that penalty is taken infrequently thanks to the buffering performed by the L1 and L2 caches. Only when dealing with large amounts of memory at a time, such as when processing a multi-megabyte bitmap, does it start to hurt.
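
That "cost times how often you pay it" way of thinking can be written as a small C function. This is a sketch of the usual textbook average access time calculation, not anything taken from Intel's documents; the function name and the hit rates in the note after the code are assumptions for illustration.

```c
/* Average cost in clock cycles of one memory access through a two-level
   cache. Hit rates are fractions between 0 and 1; l2_hit_rate is the
   fraction of L1 misses that the L2 cache satisfies. Every access pays
   the L1 cost, L1 misses also pay the L2 cost, and L2 misses also pay
   the main memory cost. */
double average_access_cycles(double l1_hit_rate, double l1_cycles,
                             double l2_hit_rate, double l2_cycles,
                             double memory_cycles)
{
    double l1_miss = 1.0 - l1_hit_rate;
    double l2_miss = 1.0 - l2_hit_rate;

    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles);
}
```

With made-up figures of a 95% L1 hit rate at 1 cycle, a 90% L2 hit rate at 10 cycles, and 100 cycles for main memory, the average works out to about 2 cycles per access. If nothing hits in either cache, as when streaming through a huge bitmap, the same formula gives 111 cycles.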

Intel and AMD have addressed this problem in two ways. First, over the years they have gradually increased the speed of the "front side bus", the data path between main memory and the processor, to work at faster and faster clock speeds. From 66 MHz in the Celeron and Pentium II, to 100 and 133 MHz in the Pentium III, to 200 MHz in the AMD Athlon. Second, Intel produces a version of the Pentium II and III called the "Xeon", which contains up to 2 megabytes of L2 cache. The Xeon is used frequently in servers as it supports 8-way multi-processing, but on the desktop the Xeon does offer considerable speed advantages over the standard Pentium III when large amounts of data are involved. The PowerPC G4 has up to 1 megabyte of L2 cache, which explains why a slower clock speed Power Mac G4 blows away a Pentium III in applications such as Photoshop.

Basically, the larger the working set of an application, that is, the amount of code and data in use at any given time, the larger the L2 cache needs to be. To keep costs low, Intel and AMD have both actually DECREASED the sizes of their L2 caches in newer versions of the Pentium III and Athlon, which I believe is a mistake.

The top level cache, the L1 cache, is the most crucial, since it is accessed first for any memory operation. The L1 cache uses extremely high speed memory (which has to keep up with the internal speed of the processor), so it is very expensive to put on chip and tends to be relatively small. Again, from 8K in the 486 to 128K in the Athlon. But as my tests have shown, the larger the L1 cache, the better.

The next step is the decoder, and this is one of the two major flaws of the P6 family. The 4-1-1 rule prevents more than one "complex" instruction from being decoded each clock cycle. Much like the U-V pairing rules for the original Pentium, Intel's documents contain tables showing how many micro-ops are required by every machine language instruction and they give guidelines on how to group instructions.

Unlike main memory, the decoder is always in use. Every clock cycle, it decodes 1, 2, or 3 instructions of machine language code. This limits the throughput of the processor to at most 3 times the clock speed. For example, a 1 GHz Pentium III can execute at most 3 billion instructions per second, or 3000 MIPS. In reality, most programmers and most compilers write code that is less than optimal, and which is usually grouped for the complex-simple-complex-simple pairing rules of the original Pentium. As a result, the typical throughput of a P6 family processor is more like double the clock speed. For example, 2000 MIPS for a 1 GHz processor.

By sticking to simpler instruction forms and simpler instructions in general (which in turn decode to fewer micro-ops) a machine language programmer can achieve close to the 3x MIPS limit imposed by the decoder. In fact, this simple technique (along with elimination of the partial register stalls) is the reason our SoftMac 2000 Macintosh emulator runs so much faster than other emulators, and why in the summer of 2000 when I re-wrote the FUSION PC emulator I was able to achieve about a 50% speed increase in the speed of that emulator in only a few days worth of work. By simply understanding how the decoder works and writing code appropriately, one can achieve near optimal speeds of the processor.

Once again, let me repeat: CLOCK SPEED IS NOT EVERYTHING! So many people stupidly go out and buy a new computer every year expecting faster clock speed to solve their problems, when the main problem is not clock speed. The problem is poorly written code, uneducated programmers, and out of date compilers (that's YOU Microsoft) that target obsolete processors. How many people still run Microsoft Office 95? Ok, do a DUMPBIN on WINWORD.EXE or EXCEL.EXE to get the version number of the compiler tools. That product was written in an old version of Visual C++ which targets now obsolete 486 processors. Do the same thing with Office 97 or Office 2000. Newer tools that target P6. Wonder why your Office 97 runs faster than your Office 95 on the same Pentium III computer? Ditto for Windows 98 over Windows 95. Windows 2000 over Windows 98. Etc. etc. The newer the compiler tools, the better optimized the code is for today's processors.

The next bottleneck is the actual execution units - the guts of the processor. They determine how many micro-ops of a given type can execute in one clock cycle. For example, the P6 family can load or store one memory location per clock cycle. It can execute one floating point instruction per clock cycle because there is only one FPU. This means that even the most optimized code, that caches perfectly, decodes perfectly, can still hit a bottleneck simply because too many instructions of the same type are trying to execute. Again, one needs to mix instructions - integer and floating point and branch - to make best use of the processor.

Finally that dreaded partial register stall! The one serious bug in the P6 design that can cause legacy code to run slower. By "legacy code" I mean code written for a previous version of the processor. See, until now, every generation so far improved on the design of previous generations. No matter what, you were almost 100% guaranteed that a newer processor, even running at the same clock speed as a previous processor, would deliver more speed. Why a 68040 is faster than a 68030. Why a Pentium is faster than a 486.

Not so with generation 6. While every other optimization in the P6 family pretty much boosts performance without requiring the programmer to rewrite one single line of code, even the 4-1-1 decode rule, the register renaming optimization has one fatal flaw that kills performance: partial register stalls! A partial register stall is when a partial register (that is, the AL, AH, and AX parts of the EAX register, the BL, BH, and BX parts of the EBX register, etc) gets renamed to different internal registers because the processor believes the uses are mutually exclusive.

For example, a C compiler will typically read an 8-bit or 16-bit integer from memory into the AL or AX register. It will then perform some operation on that integer, for example, incrementing it or testing a value. A typical C code sequence to test a byte for zero goes something like this:

int foo(unsigned char ch)
{
return (ch == 0) ? 1 : -1;
}

Microsoft's compilers for years have used a "clever" little trick with conditional expressions, and that is to use a compare instruction to set the carry flag based on the result of an expression, then to use the x86 SBB instruction to set a register to all 1's or 0's. Once set, the register can be masked and manipulated to generate any two desired resulting values. MMX code makes heavy use of this trick as well, although MMX registers are not subject to the partial register stall.

Anyway, when you compile the above code using Microsoft's Visual C++ 4.2 compiler with full Pentium optimizations (-O2 -G5) you get the following code:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 80 7c 24 04 01 cmp BYTE PTR _ch$[esp-4], 1
00005 1b c0 sbb eax, eax
00007 83 e0 02 and eax, 2
0000a 48 dec eax

0000b c3 ret 0
_foo ENDP
_TEXT ENDS
END

and when compiled with Microsoft's latest Visual C++ 6.0 SP4 compiler you get code like this:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 44 24 04 mov al, BYTE PTR _ch$[esp-4]
00004 f6 d8 neg al
00006 1b c0 sbb eax, eax
00008 83 e0 fe and eax, -2 ; fffffffeH
0000b 40 inc eax

0000c c3 ret 0
_foo ENDP
_TEXT ENDS
END

Notice in both cases the use of the SBB instruction to set EAX to either $FFFFFFFF or $00000000. Internally the processor reads the EAX register, subtracts it from itself, then writes out the value back to EAX. (Yes, it is stupid that when a processor subtracts a register from itself that it would read the register first, but I have verified that it does). In the VC 4.2 case, the processor may or may not stall because we don't know how far back the EAX register was last updated and whether all or part of it was updated.

But interestingly, with the latest 6.0 compiler, even using the -G6 (optimize for P6 family) flag, a partial register stall results. AL is written to, then all of EAX is used by the SBB instruction. This is perfectly valid code, and runs perfectly fine on the 486, Pentium classic, and AMD processors, but suffers a partial register stall on any of the P6 processors. On the Pentium Pro the stall costs about 12 clock cycles, and on the newer Pentium III about 4 clock cycles.

Why does the partial register stall occur? Because internally the AL register and the EAX registers get mapped to two different internal registers. The processor does not discover the mistake until the second micro-op is about to execute, at which point it needs to stop and re-execute the instruction properly. This results in the pipeline being flushed and the processor having to decode the instructions a second time.

How to solve the problem? Well, Intel DID tell developers how to avoid the problem. Most didn't listen. The way you work around a partial register stall is to clear a register, either using an XOR operation on itself, a SUB on itself, or moving the value 0 into the register. (Ironically, SBB which is almost identical to SUB, does not do the trick!) Using one of these three tricks will flag the register as being clear, i.e. zero. This allows the second use of the register to be mapped to the same internal register. No stall.
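
The rule can be modeled with a toy checker in C. To be clear, this is a sketch of the rule as described in this section, not of the processor's actual renaming hardware: the enum, the function name count_partial_stalls, and the idea of tracking a single register with two flags are all assumptions made up for illustration.

```c
#include <stddef.h>

/* Hypothetical operations on a single 32-bit register such as EAX. */
enum reg_op {
    OP_WRITE_FULL,     /* e.g. mov eax, ...   writes all 32 bits        */
    OP_WRITE_PARTIAL,  /* e.g. mov al, ...    writes only AL, AH or AX  */
    OP_CLEAR,          /* xor eax,eax / sub eax,eax / mov eax,0         */
    OP_READ_FULL       /* e.g. sbb eax,eax    reads all 32 bits         */
};

/* Counts full-width reads that come after a partial write which was not
   preceded by a clearing operation, the pattern described above as
   causing a partial register stall. */
size_t count_partial_stalls(const enum reg_op *ops, size_t count)
{
    int cleared = 0;          /* register is flagged as clear (zero)      */
    int partial_pending = 0;  /* an unprotected partial write is pending  */
    size_t stalls = 0;

    for (size_t i = 0; i < count; i++) {
        switch (ops[i]) {
        case OP_WRITE_FULL:
            cleared = 0;
            partial_pending = 0;
            break;
        case OP_CLEAR:
            cleared = 1;
            partial_pending = 0;
            break;
        case OP_WRITE_PARTIAL:
            if (!cleared)
                partial_pending = 1;
            break;
        case OP_READ_FULL:
            if (partial_pending) {
                stalls++;
                partial_pending = 0;  /* the stall resolves the mismatch */
            }
            break;
        }
    }
    return stalls;
}
```

Feed it "write AL, write AL, read EAX, write EAX" (the shape of a mov al / neg al / sbb eax,eax sequence) and it reports one stall. Put a clear in front of the partial write and it reports none.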

So what is the correct code? Something like this is correct (generated with the Visual C++ 7.0 beta):

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 4c 24 04 mov cl, BYTE PTR _ch$[esp-4]
00004 33 c0 xor eax, eax
00006 84 c9 test cl, cl
00008 0f 94 c0 sete al
0000b 8d 44 00 ff lea eax, DWORD PTR [eax+eax-1]

0000f c3 ret 0
_foo ENDP
_TEXT ENDS
END

Until every single Windows application out there gets re-compiled with Visual C++ 7.0, or gets hand coded in assembly language, your brand spanking new Pentium III processor will not run as fast as it can. And even then, the fix comes at the expense of code size and memory usage. Note the extra XOR instruction needed to prevent the partial register stall on the SETE instruction. You eliminate one bottleneck, you end up increasing another.
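For readers who don't read x86 listings daily, here is a rough C sketch of what the XOR / SETE / LEA sequence computes. Only the body of foo comes from the source comment in the listing; the name foo_branchless and the local variable eax are made up for illustration, and C cannot show the stall itself, only the arithmetic.

```c
/* foo: the original expression from the listing's source comment. */
int foo(char ch)
{
    return (ch == 0) ? 1 : -1;
}

/* foo_branchless: hypothetical name, mirrors the generated instructions.
   xor eax,eax ; sete al   -> eax = (ch == 0), i.e. 0 or 1
   lea eax,[eax+eax-1]     -> eax = 2*eax - 1, i.e. -1 or 1 */
int foo_branchless(char ch)
{
    int eax = (ch == 0);    /* whole 32-bit value is 0 or 1 */
    return eax + eax - 1;   /* 1 when ch is zero, -1 otherwise */
}
```

The XOR is what makes the "whole 32-bit value" comment true on the real chip: it clears all of EAX before SETE writes only AL.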


Pentium 4 - Generation 7 or complete stupidity?

I had been studying Intel's publicly available white papers on the Pentium 4 for the good part of 6 months prior to its release. While the new architecture looked promising on paper, the actual implementation of the Pentium 4 is a castrated version of the ideal chip that Intel set out to design. Intel selectively left out important implementation details of the Pentium 4, which they finally revealed with the posting of the Intel Pentium 4 Processor Optimization manual on their web site.

In an attempt to cover up their design defects, Intel has been forced to carefully word their optimization document. I encourage all software developers and technically literate computer users to download the Pentium 4 optimization manual mentioned above and, for comparison, to also download and study the Pentium III manuals as well as the AMD Athlon manual. It does not take a rocket scientist to read and compare the three sets of manuals to realize what the design flaws in the Pentium 4 are.

Let's get to the meat of it. WHY THE PENTIUM 4 SUCKS. If you've read this far I expect you have downloaded the Intel and AMD manuals I mentioned above, you've read them, and you have a good understanding of how the Pentium III, AMD Athlon, and Pentium 4 work internally. If not, start over!

You've read my previous section on the cool tricks introduced in the 6th generation processors (Pentium II, AMD Athlon, PowerPC G3) and the kinds of bottlenecks that can slow down the code.

As I mentioned, AMD to their well deserved credit attacked all these problems head on in the Athlon by detecting and eliminating the partial register stall, by relaxing limitations on the decoder and instruction grouping, and by making the L1 caches larger than ever.

So, after 5 years of deep thought, billions of dollars in R&D, months of delays, hype beyond belief, how did Intel respond?

In what can only be considered a monumental lapse in judgment, Intel went ahead and threw out the many tried and tested ideas implemented in both the PowerPC and AMD Athlon processor families and literally took a step back 5 years to the days of the 486.

It seems that Intel is taking the approach similar to that of their upcoming Itanium chip - that the chip should do less optimization work and that the programmer should be responsible for that work. An idea not unfamiliar to RISC chip programmers, but Intel really went a little too far. They literally stripped the processor bare and tried to use brute force clock speed to make up for it!

Except the idea doesn't work. Benchmark after benchmark after benchmark shows the 1.5 GHz Pentium 4 chip running slower than a 900 MHz Athlon, and in some cases slower than a 533 MHz Celeron, even as slow as a 200 MHz Pentium in rare cases.

Intel did throw a few new ideas into the Pentium 4. The two ALUs (arithmetic logic units), which perform adds and other simple integer operations, run at twice the clock speed of the processor. In other words, the 1.5 GHz Pentium 4 is in theory capable of executing 6 billion integer arithmetic operations per second (2 ALUs, each handling 2 operations per clock cycle). As I'll explain below, the true limit is actually much lower and not any better than what you can get out of a Pentium III or Athlon already.

Another new idea is the concept of a "trace cache", which is basically a code cache for decoded micro-ops. The Pentium 4 does not have a conventional L1 code cache. Rather than decode the instructions in a loop over and over again, it caches the output of the decoder, storing the raw micro-ops. This sounds like a good idea at first, but again, in reality it does not prove any better than simply having an 8K code cache, and certainly falls short of the Athlon's 64K code cache.

The Benchmarks

As Tom's Hardware site documented last month, the Pentium 4 lost miserably against the AMD Athlon at MPEG video encoding. Only after Intel engineers personally modified the code did the Pentium 4 suddenly win the benchmarks. A side effect of this is that the benchmarks on the Athlon improved considerably as well, indicating that the code was very poorly written in the first place.

However, this brings up the point again that Intel now expects software developers to completely rewrite their code in order to see performance gains on the Pentium 4. And we don't all have the luxury of having an Intel engineer showing up on our doorstep to re-write our code for us. With thousands of Windows applications out there, not to mention the growing number of Linux applications out for the PC, and sadly out of date compiler tools, does Intel seriously expect millions of lines of computer code to be rewritten just for the Pentium 4?

I downloaded an MPEG encoder and ran it through my own 150 megabyte sample video file. I used 8 similarly configured Windows Millennium computers as the test machines, which had the following processors and memory sizes and roughly sorted by cost:

The encoding times (in seconds) for the same sample piece of video were as follows:

Chip speed and type     Elapsed time (seconds)     Clock cycles (billions)
1.5 GHz Pentium 4       484                        726
900 MHz AMD Athlon      544                        490
670 MHz Pentium III     743                        498
650 MHz Pentium III     757                        492
533 MHz Celeron         858                        457
600 MHz AMD Athlon      922                        553
500 MHz Pentium III     946                        473
600 MHz Crusoe          1369                       821

The 1.5 GHz Pentium 4 won of course, but barely over the 900 MHz AMD Athlon at about 1/3 the price and 60% of the clock speed. Worse, the Pentium 4 fails to even cut the processing time in half compared to the much slower clocked Pentium III and Celeron systems. The Pentium 4 is barely twice as fast at this benchmark as a 500 MHz Pentium III.

This benchmark illustrates several important concepts I discussed earlier, especially when we calculate the total number of clock cycles executed on each processor. By counting the total number of cycles, it equalizes the differences in clock speed between the various systems.

First, CLOCK SPEED IS NOT EVERYTHING!! Just because one processor runs at a faster clock speed than another, does not mean you will get proportionally faster performance.

Second, the Pentium 4 seems to require almost 50% more clock cycles than the Athlon or Pentium III, indicating that either floating point operations each take more cycles on the Pentium 4, that the Pentium 4 does not execute as many floating point instructions in parallel as the Athlon, or that the Pentium 4 is being throttled by the cache or decoder. An MPEG encode does deal with a lot of data (in this case, 150 megabytes of data) and my guess is the small L1 cache on the Pentium 4 hurts it here. Intel's optimization document addresses this issue, referring to techniques known as "cache blocking" and "strip mining" to minimize the cache usage by working on small portions of data at a time. Again, implementing that requires a code rewrite.
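The clock cycle column in the MPEG results is nothing more than elapsed time multiplied by clock speed. A minimal C sketch of that arithmetic follows; the function name cycles_billions is hypothetical and the rounding to the nearest billion is my assumption about how the column was produced.

```c
/* Total clock cycles in billions = elapsed seconds * clock speed.
   Clock speed is passed in MHz so the arithmetic stays in integers. */
long cycles_billions(long seconds, long mhz)
{
    /* seconds * MHz gives millions of cycles; divide by 1000,
       rounding to nearest, to get billions. */
    return (seconds * mhz + 500) / 1000;
}
```

Feeding in the measured times gives 726 for the 1.5 GHz Pentium 4 (484 seconds) and 490 for the 900 MHz Athlon (544 seconds). The ratio 726 / 490 is roughly 1.48, which is where the "almost 50% more clock cycles" figure comes from.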

For another floating point test, I ran the widely used Prime95 utility for finding Mersenne primes (see http://www.mersenne.org). Setting up the same machines I launched PRIME95 on each machine at the same time and had them begin calculating the primality of roughly the same length number several million digits long. The number being worked on requires about 24 hours of processing time. After about an hour of running time, it was clear that the Pentium 4 was neck and neck with the 900 MHz Athlon. After several hours, still tied. After 24 hours of running time both the Pentium 4 and 900 MHz Athlon completed, while the others were still part way through processing the number. I recorded the relative progress of each, with the Pentium 4 and 900 MHz Athlon being shown as completed and roughly tied:

Chip speed and type     Relative speed
1.5 GHz Pentium 4       >100% (tied)
900 MHz AMD Athlon      >100% (tied)
670 MHz Pentium III     90%
650 MHz Pentium III     90%
533 MHz Celeron         60%
600 MHz AMD Athlon      99%
500 MHz Pentium III     45%
600 MHz Crusoe          60%

Here, the clear floating point dominance of the AMD Athlon over the Pentium III and Pentium 4 is evident. Since the source code to PRIME95 can be freely downloaded, I looked at it. It contains a lot of hand coded assembly code, and more importantly, a LOT OF FLOATING POINT instructions. The Athlon, with its ability to execute 3 floating point instructions per clock cycle, even at 60% the clock speed, just about keeps up with the Pentium 4. At 600 MHz speeds, the Athlon blows away Pentium III chips running over 10% faster.

In a third floating point test, running the SoftMac 2000 emulator and then running a heavily floating point based benchmark on the Mac OS, the Pentium 4 fails to keep up with even the 600 MHz chips, losing badly (82 seconds vs. 49 seconds) against the 670 MHz Pentium III and losing worse (82 seconds vs. 36 seconds) against the 900 MHz AMD Athlon.

Running other tests using various emulators, I found that in general the Pentium 4 runs emulators such as SoftMac 2000 SLOWER in most cases than the 650 MHz Pentium III and 600 MHz AMD Athlon.

A small tangent about the Transmeta Crusoe

I should stop and mention a few things about the biggest surprise to me from COMDEX Las Vegas, and that being not the Pentium 4 chip but the Transmeta Crusoe chip.

See, the folks over at Transmeta have their own ideas how to build processors, especially for portable devices that have to restrict their power use. After about 5 years of secret development, these guys came up with a chip that works slightly differently from the Intel and AMD chips.

Rather than waste millions of transistors on a chip for out-of-order schedulers and other fancy tricks, they decided to strip all this out of the chip and eliminate about 90% of the chip's power consumption. Instead, a piece of software performs the code optimizations at run time. This is essentially the concept behind a dynamically recompiling emulator, and the concept behind a JIT (just in time) compiler such as what is found in Java.

What Transmeta has done is taken this a step further to JIT the entire Windows operating system at run time, rather than say, a tiny 100K Java applet. And they pulled it off. Using a 600 MHz chip that performs software based optimizations, I find the Picturebook consistently performing at about the speed of a 300 MHz Pentium class processor, as perfectly demonstrated by the MPEG results above.

In addition to that, the Crusoe chip has this peculiar side effect of running faster as time goes by. i.e. as more time elapses, the chip's JIT compiler appears to optimize the code further, and I've actually noticed this running the SoftMac emulator. As I repeat benchmarks under the Mac OS, they get slightly faster each run.

This shows up in the PRIME95 benchmark quite clearly, where after 24 hours of run time, the Crusoe keeps right up with a 533 MHz Celeron chip - almost keeping up clock cycle for clock cycle with Intel's chip!

As I said a few weeks ago, hats off to the geniuses at Transmeta for pulling off such an amazing feat of emulation. This idea of software-assisted execution may well in fact be the solution for Intel's woes in future generations of chips, as it takes the burden of code optimization off the hands of millions of software developers and puts it back in the chip without requiring millions of extra transistors.


Analyzing the results - why the Pentium 4 fails to deliver

But back to the Pentium 4 and figuring out why it sucks. I finally pulled out my big gun: a custom CPU analyzer utility which I use to analyze various processors. It measures things like the sizes and speeds of the caches, and it executes hundreds of different sample code sequences in order to measure the throughput of each piece of code on each processor. These code sequences consist of code that is commonly emitted by Microsoft's Visual C++ compiler and code that is commonly found in emulation code. I've used this utility for years to hand tune my emulators to various processors and it's served me well.

After just a few minutes on the Pentium 4 it gave me the results I needed. I then read over Intel's Pentium 4 documents again and corroborated my results, in order to finally determine the fatal design flaws of the Pentium 4.

MISTAKE #1 - Small L1 data cache - I couldn't believe it myself when I first saw the results, but Intel's own statements confirm it. The Pentium 4 has a grossly under-sized 8K L1 data cache. That was the size of the L1 cache back in the 486, more than TEN YEARS AGO. The L1 cache is the most important block of memory in the whole computer. It is the first level of memory that the processor goes to and it is the memory that the processor spends most of its time accessing. Intel learned back in the 486 days that 8K of cache was grossly inadequate, raising the size of the cache from 8K to 16K in later versions of the 486 and to 32K (16K code, 16K data) in the P6 family. AMD went a step further with their 128K L1 cache in the Athlon and Duron processors.

Going back to 8K is just plain wrong. At a 1024x768 screen resolution and a 32-bit color depth, 2 scan lines of video consume 8K of data. Simply manipulating more than two scan lines of video data at a time will overflow the L1 cache on the Pentium 4.
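The scan line arithmetic is easy to check. Here is a small C sketch, assuming uncompressed pixels packed back to back with no padding; the function name scanline_bytes is hypothetical.

```c
/* Bytes needed to hold a number of scan lines of uncompressed video. */
unsigned long scanline_bytes(unsigned long width_pixels,
                             unsigned long bits_per_pixel,
                             unsigned long lines)
{
    return width_pixels * (bits_per_pixel / 8) * lines;
}
```

At 1024 pixels wide and 32 bits per pixel, two scan lines come to 8192 bytes, exactly the size of the Pentium 4's L1 data cache, and a third line pushes it to 12288 bytes.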

My testing shows that while the Pentium 4 has extremely fast memory access for working sets of data up to 8K in size, at 16K and 32K sizes it is no faster than a 650 MHz Pentium III. The Pentium III's L1 cache, even though running at a much slower clock speed, keeps up with the Pentium 4's L2 cache. The 900 MHz Athlon's 64K data cache in fact outperforms the Pentium 4's L2 cache. Therefore, when manipulating sound or video data, the AMD Athlon can manipulate 8 times as much data at full L1 cache speed as the Pentium 4 can.

MISTAKE #2 - No L3 cache - Intel originally specified a 1 megabyte L3 cache for the Pentium 4. This third level cache, much like a G4's back side cache or the large L2 cache in the Pentium III and Athlon, provides an extra level of fast memory to help keep the chip from having to access slow main memory. The L3 cache is completely removed in the released Pentium 4 chip.

How significant a cut is this? Well, consider that Intel DOES make versions of the Pentium III that have 1 and 2 megabytes of L2 cache - the Pentium III Xeon. While more expensive than the regular Pentium III chip, ask anyone with a Xeon if they'd trade it in for a regular Pentium III. My testing shows that at working sets between 256K and 2M, a 700 MHz Xeon processor easily outperforms the Pentium 4 at memory operations. How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop. And axing the L3 cache is a main reason why the Pentium 4 is not the G4 killer it could have been.

MISTAKE #3 - Decoder is crippled - In another step back to 486 days of 10 years ago, Intel took a rather simple approach to the U-V pairing and 4-1-1 grouping limitations of past decoders. They simply eliminated the extra decoders and went back to a single decoder. Only one machine language instruction can be decoded per clock cycle. The idea behind this twisted logic being that the trace cache eliminates the need to decode an instruction every clock cycle. True, IF and only if the code being executed has already been decoded and cached in the trace cache.

But guess what my friends? When a new piece of code is called that is not in the trace cache (or in a traditional code cache), the processor must reach into the L2 cache or into main memory to pull in 64 bytes of memory. Then it has to decode that 64 bytes of code. Well, a typical x86 instruction is about 3 bytes in size, therefore 64 bytes of memory is equivalent to about 21 machine language instructions. Assuming all 64 bytes of code executes, how long will it take a Pentium 4 to decode all of the instructions? 21 clock cycles. How long will it thus take that piece of code to execute? More than 21 clock cycles. Now, compare this to the Pentium III or Athlon. How long will those chips need to decode the bytes? Roughly 7 to 11 cycles.
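The decode estimate can be written down as a back-of-the-envelope C sketch. The function name decode_cycles is hypothetical, and the model (a fixed average instruction size and a decoder that sustains a fixed number of instructions per cycle) is a deliberate simplification, not a description of how any of these chips really schedule decodes.

```c
/* Rough number of cycles needed to decode a block of code, assuming a
   fixed average instruction size and a fixed decode rate per cycle. */
unsigned decode_cycles(unsigned block_bytes, unsigned avg_insn_bytes,
                       unsigned insns_per_cycle)
{
    unsigned insns = block_bytes / avg_insn_bytes;           /* 64 / 3 = 21 */
    return (insns + insns_per_cycle - 1) / insns_per_cycle;  /* round up */
}
```

With 64 bytes and 3 bytes per instruction, one instruction per cycle gives 21 cycles. Sustaining three per cycle gives 7, and sustaining only two gives 11, which brackets the 7 to 11 cycle range quoted for the Pentium III and Athlon.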

MISTAKE #4 - Trace cache throughput too low - Remember my analogy about the weak link in the chain. We've already found that the decoder can only feed 1 instruction worth of micro-ops to the trace cache. Then, reading Intel's specs some more, we can see that the trace cache itself can only feed at most 3 micro-ops to the execution units per clock cycle.

The trace cache feeds these micro-ops to the processor core which then executes them in one or more dedicated execution units. Intel's Pentium 4 overview mentions that the Pentium 4 processor core contains 7 execution units:

Together, these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and write to memory, a floating point operation, and an MMX operation.

Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time! While on the Pentium III we have the situation that the decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle, the Pentium 4 is crippled to the point of decoding one instruction per cycle and feeding at most 3 micro-ops to the core per clock cycle.

For well optimized code, code which follows the 4-1-1 rule and runs optimally on both Pentium III and AMD Athlon processors, the Pentium 4 is virtually guaranteed to run slower at the same clock speed. I verified this with some common code sequences. No wonder the 900 MHz Athlon keeps beating the Pentium 4 in the benchmarks.

MISTAKE #5 - Wrong distribution of execution units - This is a direct result of mistake #4, and that is that the breakdown of the execution units themselves is completely wrong.

Think about it. 5 of the 7 execution units are dedicated to handling the integer registers, the 8 "classic" registers EAX EBX ECX EDX ESI EDI EBP and ESP. Yet as is already clear, the Pentium 4 does a horrific job of executing legacy code.

Intel's own documents put heavy emphasis on the use of the new MMX registers, both 64-bit and 128-bit MMX registers introduced in the P6 family. Yet only one single execution unit handles MMX. And if you read Intel's specs in more detail, they state that the unit can only accept a micro-op every second clock cycle. In other words, the 1.5 GHz Pentium 4 can at most execute 750 million floating point operations or MMX operations per second. But MMX is one of the things Intel hypes up!

So why cripple the very feature you're trying to hype?

In a related act of stupidity, Intel put 3 integer ALUs in the core, two of which operate at double the chip speed. So between them, the three ALUs can accept up to 5 micro-ops per clock cycle. But we've already learned that the trace cache can provide at most 3. So one or more integer ALUs sit idle each clock cycle. It is impossible to even feed 4 micro-ops into the two double-speed units. So why did Intel waste transistors to implement a redundant ALU, but then cut corners by eliminating a much more needed second floating point unit?

(Editorial: with 20/20 hindsight from the year 2006, it is now obvious that the extra ALU units were there to handle Hyper-Threading so as to be able to feed micro-ops from two separate instruction streams into the core at once.)

MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back to the days of the 486, even the days of the 286, by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium, 68020, 68030, 68040, and PowerPC chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.

How bad is this mistake? For emulation code, it's absolutely devastating. Shift operations are used for table lookups, for bit extractions, for byte swapping, and for any number of other operations. For some reason, Intel's engineers just could not spare a few extra transistors to keep shifts fast, yet they waste transistors on idle double speed ALUs.

Intel's own documentation is now contradictory. On the one hand, Intel has for years advocated the use of shift and add operations to avoid costly multiply operations. For example, to multiply by 10, it is quicker on the 486 and Pentium to use shifts to quickly multiply by 2 and 8 and then add the results. However, on the Pentium 4 this trick of shift and add can take as long as 6 or 7 clock cycles, which negates much of the benefit over using a multiply.

This appears to have something to do with the fact that the original Pentium 4 design called for there to be two address generation units, which are circuits to quickly calculate addresses for memory operations. In previous chips, the AGU contained a barrel shifter to quickly handle indexed table lookups, which the Pentium 4 now handles using the much slower ALU. The "add and shift" trick was usually accomplished by the AGU by a programming trick using the LEA (load effective address) instruction. This trick is now rendered useless thanks to Intel cutting out the part.

MISTAKE #7 - Fixed the partial register stall with a worse solution - While it is true that the partial register stall is finally a thing of the past in the Pentium 4, Intel's solution is less than elegant. It is not only worse than AMD's solution, but actually worse than the problem it tries to fix. Accessing certain partial registers now involves the shift/rotate unit, meaning that a simple 8-bit register read or write can take longer than accessing L1 cache memory! It's backwards!

MISTAKE #8 - Instructions take more clock cycles to complete - The end result of all the cost cutting and silicon chopping is that typical code sequences now take more clock cycles to execute than on the P6 architecture. Intel relies on the much faster clock speed of the Pentium 4 to overcome this problem, but this only works against the Pentium III and slower Intel processors. Against the AMD Athlon, it loses badly.

As I mentioned above, typical code sequences generated by C++ compilers now take more clock cycles to execute. This is due in part to the brain dead decisions to decode only one instruction per clock cycle and to feed only 3 micro-ops to the core per clock cycle. It is also partly due to the longer pipeline used in the Pentium 4: flow control operations (such as branches, jumps, and calls) take longer because it takes longer to fill the processor pipeline.

For example, an indirect call through a general purpose register, common when making member function calls in C++, now takes about 40 clock cycles on the Pentium 4.  Compare this to only 10 to 14 cycles on P6 family and AMD Athlon processors. Even at the faster clock speed, the Pentium 4 function calls are slower overall. Similarly, Windows API calls, which call indirectly through an import table, are now slower. Several Windows APIs that I tested literally took 2 to 3 times the number of clock cycles to execute on the Pentium 4. This is because not only do all the internal function calls within Windows take longer, but you have to remember that Windows 2000 and Windows Millennium are compiled using C++ compilers that optimize for Pentium III and Athlon processors. So as I mentioned at the beginning, until such time as most Windows code is recompiled using as-yet-non-existent Pentium 4 optimized C++ compilers, the performance of Windows applications will be terrible on the Pentium 4 processor.

A specific code example showing the difficulties of optimizing for Pentium 4

I'd like to show a simple machine language code example which demonstrates how tricky it can be optimizing code for the Pentium 4. This came about from an email discussion I was having with another programmer and how tricky it is to write something as trivial as a simple multiply by 10.

As I mentioned before, there is a special unit in a processor called the "barrel shifter" which has been found in all 32-bit processors for over 15 years. The 386, 486, and all past versions of the Pentium have it, as do all versions of the PowerPC, and even the 68020, 68030, and 68040 have it. The barrel shifter performs left and right shift and rotate operations, which are as basic an operation to a processor as addition and subtraction.

There is another unit called an address generation unit (or AGU). These nifty little circuits can add up to three numbers together in a single operation. This is useful for accessing a memory element in an array, for calculating the address of something in an array, or as a pure mathematical calculation involving 3 values.

Starting with the 486 and 68040 processors more than 10 years ago, shifting and address calculation became one clock cycle operations, no slower than an add, subtract, OR, XOR, or AND operation.

For example, if you have an array of integers and you want to access the integer indexed by the EAX register, you would use an instruction such as

   MOV EBX,[array+EAX*4]

to read the integer. "array" is a 32-bit numeric constant that specifies the address of the array, while the EAX*4 means to scale the value in EAX by 4 (using a quick shift to the left) and then to add that to the address of the array. If it is an array of floating point numbers, use a scaling value of 8 instead of 4, since a floating point number (a "double" in C language) occupies 8 bytes of memory.

If the address of the array is not constant, for example, it is allocated at run time, or is a multi-dimensional array, then you can use a second register in place of the constant. For example, if the address of the array is stored in the EBP register, then the instruction becomes:

   MOV EBX,[EBP+EAX*4]

This is one way local variables in a C function are accessed. If you simply want to calculate the address of the memory instead of reading it, you use the LEA (Load Effective Address) instruction, which stores the address itself into EBX instead of the value found at that address.

In general, the MOV and LEA instructions can use addressing modes of the form [base + index*scale + displacement] where "base" is any of the 8 32-bit integer registers, "index" is any other 32-bit register (and can include the base register), "scale" is a scaling factor of either 2, 4, or 8 (or none for a default scaling of 1), and displacement is a 32-bit integer which contains an address or an offset. Not all of the 4 addressing components need to be used.
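As a rough illustration of the [base + index*scale + displacement] form, here is a C sketch of the arithmetic the address calculation performs. The function name effective_address is hypothetical, and the sketch models only the 32-bit sum, not segmentation or any encoding restrictions.

```c
#include <stdint.h>

/* Effective address as formed by a MOV or LEA addressing mode:
   base + index*scale + displacement, wrapping at 32 bits.
   scale is expected to be 1, 2, 4, or 8. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, uint32_t displacement)
{
    return (uint32_t)(base + index * scale + displacement);
}
```

For example, with a base of 0x1000 and an index of 3, a scale of 4 lands on 0x100C, the fourth 32-bit integer of an array starting at 0x1000. With a scale of 1 the same circuit is just a three-number adder.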

With the LEA instruction, the x86 processor can now perform a 3-number add, with something like a C expression "a = b + c + 10;" translating into EAX = EBX+ECX+10 and being coded into one instruction:

   LEA EAX,[EBX+ECX+10]

Notice that no memory is actually referenced. LEA is used merely to calculate values by performing the addition of a base register (EBX) with an index register (ECX) with some constant displacement (10). This is what the address generation unit (AGU) does, allowing the processor to quickly calculate addresses of array elements, screen pixel locations, and do some basic arithmetic in one clock cycle. Without this trick you would have to break it up into multiple instructions:

   MOV EAX,10
   ADD EAX,EBX
   ADD EAX,ECX

This not only requires more code bytes but runs slower since the three instructions may not all decode in one cycle, and the operations happen serially not in parallel. LEA makes it a breeze to evaluate simple expressions quickly and allows x86 to keep up with RISC processors at doing such basic operations.

On all of the processors I mentioned above (68K, PowerPC, x86) the addressing modes can scale by a factor of 2, 4, or 8 with no overhead thanks to the fast shifter. This scaling trick can be used to quickly multiply by small constants, say, the count of bytes in a scan line when drawing to video memory or the number of bytes in a column of a two dimensional array. Multiplying by a constant is not a rare occurrence by any means.

If you look at the #6 and #7 mistakes listed above, I'll show how something as simple as multiplying a register by 10 becomes slower on the Pentium 4. In my description of that problem I used the example of 10, stating that to multiply by 10 you can multiply by 2, multiply by 8, then add the results. To keep the explanation simple, I did not go into the actual machine language details of how that would be done. But based on his and other people's email feedback I shall demonstrate that now and you will see just how much more difficult it is to optimize code for the Pentium 4 because of Intel's decision to cut a basic feature that has been around for 15 years.

The slowest way on most if not all x86 processors to multiply an integer is to use a variant of the MUL instruction, such as a signed integer multiply:

   IMUL EAX,10

This can take 5, 10, 20, or even more clock cycles depending on the exact processor. Ditto with the Motorola chips. Generic 32-bit multiplies take a long time because to implement a multiplier requires thousands of transistors and delays between the various outputs of those transistors. The multiplier circuit also produces a 64-bit product. Most compilers and programmers merely end up throwing away the upper 32 bits of the result. A waste.
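To see what "throwing away the upper 32 bits" means in practice, here is a short C sketch. The function names full_product and low_product are hypothetical; the sketch only shows the arithmetic and says nothing about how many cycles the hardware multiplier takes.

```c
#include <stdint.h>

/* The full 64-bit product of two 32-bit values. */
uint64_t full_product(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* What a typical 32-bit expression such as x*10 keeps:
   only the low 32 bits of that product. */
uint32_t low_product(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}
```

Multiplying 0x10000 by 0x10000 gives a full product of 0x100000000, of which the low 32 bits are zero.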

Let's see what Microsoft Visual C++ 6.0 compiler does. Feel free to try this yourself. With no optimizations the compiler produces exactly that code for an expression such as x*10:

    IMUL EAX,10

With full optimizations it generates the following code (due to the simplicity of this expression the compiler generates the same code regardless of whether you use -O1 -O2 -G4 -G5 or -G6):

    LEA EAX,[EAX+EAX*4]
    SHL EAX,1

This multiplies by 10 by first multiplying by 5, then multiplying by 2. Three operations are involved, a shift, an add, and a shift. The first shift and add execute in one cycle thanks to the AGU, so on most x86 processors (excluding the Pentium 4) this will take 2 clock cycles. Even less if out-of-order execution pairs these instructions up with other instructions. But in general, 2 cycles.

Why can this not execute in a single clock cycle? Because of the data dependency between the two instructions. SHL (shift left) cannot do its job until the results of the LEA are known. On the Pentium 4, the slow shift unit makes this code take 6 clock cycles to execute, three times slower than expected. A fine example of how today's compilers (and the Windows code YOU are running on your PC right now) are not ready for Pentium 4.
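Written out in C, the LEA and SHL pair looks roughly like this. The function name mul10_lea_shl and the local variable eax are hypothetical, and C obviously cannot show the cycle counts, only the data dependency between the two steps.

```c
#include <stdint.h>

/* LEA EAX,[EAX+EAX*4] followed by SHL EAX,1, expressed in C. */
uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t eax = x + (x << 2);   /* LEA: x + x*4 = x*5 */
    return eax << 1;               /* SHL: needs the LEA result, gives x*10 */
}
```

The second line cannot start until the first has produced eax, which is the dependency that keeps this from being a one cycle operation.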

I have a theory of why they use this code sequence, but I'll get to that in a bit. Can we do better? The code sequence which the other programmer suggested was to do it like this:

    LEA ECX,[EAX+EAX]
    LEA EAX,[ECX+ECX*4]

This is valid (ECX gets EAX*2, then EAX gets ECX+ECX*4 = (EAX*2) + (EAX*4*2) = EAX*10) and takes advantage of the fact that a multiply by two can either be encoded as an addition of a base register to an index register, or a multiplication of the index register. Since non-scaling addressing existed in 16-bit processors, using the addition form produces shorter code than the *2 form. So this code is pretty good, and on a Pentium III sure enough still takes exactly the same 2 cycles to execute as Microsoft's code.

On the Pentium 4 this now drops to 3 cycles since he has eliminated an entire shift operation. But, he forgot that on older processors such as Pentium MMX and earlier, there is a one cycle delay in the AGU, and so this code takes 4 cycles on older processors compared to 3 cycles for Microsoft's code.

Oh oh.

He also makes the novice mistake of thinking that "spreading the load" to a second register will somehow speed up the execution. As if using ECX as a temporary register will eliminate the data dependency between instructions. Of course it won't! In fact, the following code, which only uses EAX, runs at exactly the same speed (including on the Pentium 4 and Pentium MMX) as his code:

    LEA EAX,[EAX+EAX]
    LEA EAX,[EAX+EAX*4]

When writing code, you want to minimize the number of "visible" registers used. On older processors, the delay in the AGU kills any advantage that a temporary register may get you in the pipeline. On the P6 family, Athlon, and Pentium 4, the register renaming automatically assigns a second register if necessary.

So to the beginner machine language programmer: limit your use of registers when evaluating an expression unless there is a real gain to be had. In this case, yes, there is an optimization that specifically helps the Pentium 4 if a second register is used. And this is where my "multiply by 2, multiply by 8, and add the results" technique comes in.

Notice that in both the Microsoft code and in the other programmer's code, reversing the two instructions does not eliminate the data dependency. But what if you did this, making use of a temporary register:

    LEA ECX,[EAX+EAX]
    LEA EAX,[ECX+EAX*8]

To the average programmer, this looks like the same thing. After all, x*2 + x*8 = (x*2)*5 = x*10. What's different???

The difference here is when you think about what a crippled Pentium 4 now has to do with no AGU unit present. How did Intel engineers work around their "budget cut"?

What they did is to break down address calculations into the basic add and shift operations, or micro-ops, and then they feed those into the various execution units. [EAX+EAX] now becomes a simple add micro-op which can be executed by any of the ALUs. The EAX*8 now becomes a shift operation executed by the slow ALU. And the final addition is another quick add. Add, shift, add. Isn't that the SAME thing the previous example does?

On every other x86 processor that I've tested, these two sequences of code are essentially the same thing and execute in exactly the same number of clock cycles. On Pentium III, on Pentium MMX, on Athlon, same speed. But on Pentium 4 my code will execute slightly faster in most scenarios because the shift unit is freed up a cycle sooner. By eliminating the AGU and breaking up the address calculation, the processor can now take advantage of out-of-order execution and start the shift operation on the same clock cycle as the first addition. Then once the shift is complete, the second add is executed. Remember, out-of-order execution does not only mean that the individual instructions are executed out-of-order, but parts of the instructions (the micro-ops) are as well. This is true on P6, on Athlon, and on Pentium 4.
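The difference is easier to see when each LEA is written out as the add and shift micro-ops described above. The following C sketch is only a model of that decomposition. The function and variable names are invented for illustration, and real micro-op scheduling is more involved, but it shows which steps depend on which.

```c
#include <assert.h>
#include <stdint.h>

/* Model of LEA ECX,[EAX+EAX] / LEA EAX,[ECX+ECX*4]:
   every step needs the result of the previous one. */
uint32_t times10_chained(uint32_t x)
{
    uint32_t doubled = x + x;         /* add */
    uint32_t scaled  = doubled << 2;  /* shift: waits for the add */
    return doubled + scaled;          /* add: waits for the shift */
}

/* Model of LEA ECX,[EAX+EAX] / LEA EAX,[ECX+EAX*8]:
   the shift reads only the original x. */
uint32_t times10_parallel(uint32_t x)
{
    uint32_t doubled = x + x;         /* add */
    uint32_t scaled  = x << 3;        /* shift: independent of the add */
    return doubled + scaled;          /* add: waits for both inputs */
}

/* Both models must produce x*10 (modulo 2^32). */
void check_times10_models(void)
{
    static const uint32_t samples[] = { 0u, 1u, 7u, 12345u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(times10_chained(samples[i]) == samples[i] * 10u);
        assert(times10_parallel(samples[i]) == samples[i] * 10u);
    }
}
```

In the chained version the three operations form a single dependency chain. In the second version the shift and the first add share no inputs other than x, which is what lets the slow shift start immediately.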

So Intel engineers worked around their decision to cut the two AGUs by taking advantage of out-of-order execution to make up some of the loss. However, this comes at the expense of requiring compiler writers to change their code and hurt speed on older processors. It also hurts C++ code that tends to use a lot of double indirection and table lookups, which results in delays caused by back-to-back shifts being sent through a single slow shifter.

And that is why I have to wholeheartedly disagree with Intel's decision to cut not even one but both units. It breaks the 15-year-old pattern of fast shift and add operations that programmers have relied on.

Look at the Motorola PowerPC chip. It has a whole arsenal of fast shift instructions, that can perform bitfield extractions, bitfield insertions, left shifts, right shifts, rotates, and bit masking operations, all in a single clock cycle. This is perfectly suited for C and C++ code.

So let's go back to why Microsoft's compiler emits an LEA and a shift instead of two LEA instructions. Again, it has to do with the fact that address generation had some extra overhead on earlier processors such as the Pentium MMX and 486, and since those processors do not support out-of-order execution, it is sometimes quicker to use addition and shift operations directly. Taking Microsoft's code and rewriting it to:

    LEA EAX,[EAX+EAX*4]
    ADD EAX,EAX

produces code that runs as quickly as the original example on all past x86 processors, with the benefit of running faster on the Pentium 4. This code sequence is not optimal on Pentium 4 but is the 2nd best choice. This would be the code that as a compiler writer or assembly language programmer I would choose as the best overall code that would give ideal or near to ideal speeds on all 32 bit x86 processors. If I was optimizing specifically for Pentium 4 and could punt 486 and Pentium MMX a little, I would use the two LEA sequence to implement the (x*2)+(x*8) expression.

This trivial and somewhat contrived example shows just how even the most basic code sequences used in Windows code today need to be re-examined and changes made to compilers. It's going to be messy. Intel could save all of us a lot of trouble (and speed up today's off-the-shelf software) by making one of two simple modifications to future Pentium 4 processors with respect to this whole Mistake #6 and #7 shifting thing:

Linux compilers need some work too

Another reader, David Ford, took the initiative to gather his own findings about the "multiply by 10" code optimization discussion above. He wanted to know how good the code produced by the Linux GNU C compiler was. He used version 2.95.2. He found, using 7 different sequences of compiler optimization switches, that in most of the 7 cases the GNU compiler produced the LEA EAX,[EAX+EAX*4], ADD EAX,EAX sequence of code, which I listed as being a good overall code sequence for all x86 processors but not quite optimal for Pentium 4.

And another reader by the name of Alun Carr from Ireland emailed me his most recent results of using the GNU C compiler to compile the code example involving the C/C++ language question mark colon (?:) operator. This is similar to an "if else" in other languages for evaluating an expression that can result in one of two values. The code sample is this:

    int foo(unsigned char ch)
    {
        return (ch == 0) ? 1 : -1;
    }

I'll let Alun explain the results he got:

Using gcc 2.95.2 (the Mingw port to Windows; the code generator is the same for i386 Linux to the best of my knowledge) I get the assembler output given below when targeting i486, pentium, and pentiumpro architectures (GNU assembler uses the AT&T mnemonics). Apart from alignment, the i486 and Pentium code is identical, and seems perfectly sensible to me (but maybe you can get faster execution by use of 'tricks' with other instructions). The Pentium Pro code is somewhat different, and strikes me as inefficient (I've appended to it what I think the correct code sequence should be).

Listings:

Non-architecture-specific compiler flags used (architecture flags are given for each listing):

-O3 -fomit-frame-pointer -freg-struct-return -malign-double -mwide-multiply -finline-functions -S


i486 (-march=i486 -mcpu=i486)
=============================

.file "testgcc.c"
gcc2_compiled.:
___gnu_compiled_c:
.text
.align 16
.globl _foo
.def _foo; .scl 2; .type 32; .endef
_foo:
movl $-1,%eax
cmpb $0,4(%esp)
jne L3
movl $1,%eax
L3:
ret


Pentium (-march=pentium -mcpu=pentium)
======================================

.file "testgcc.c"
gcc2_compiled.:
___gnu_compiled_c:
.text
.align 4
.globl _foo
.def _foo; .scl 2; .type 32; .endef
_foo:
movl $-1,%eax
cmpb $0,4(%esp)
jne L3
movl $1,%eax
L3:
ret


Pentium Pro (-march=pentiumpro -mcpu=pentiumpro)
================================================

.file "testgcc.c"
gcc2_compiled.:
___gnu_compiled_c:
.text
.align 4
.globl _foo
.def _foo; .scl 2; .type 32; .endef
_foo:
movl $-1,%edx
movl $1,%eax
cmpb $0,4(%esp)
cmove %eax,%edx
movl %edx,%eax
ret


Shouldn't the Pentium Pro code be:
==================================

movl $1,%edx
movl $-1,%eax
cmpb $0,4(%esp)
cmove %edx,%eax
ret
 

As I explained to Alun, the problem with all of these code sequences is that they're all fairly large for what they do, over 16 bytes each. This comes from the fact that in 32-bit mode, the Pentium has no quick and dirty instructions to load simple constants and wastes 5 bytes of code to load the value 1. Most other processors do have quick short ways to load small constants. To work around this problem some compilers will generate clever code sequences to generate simple constants such as 1 and -1. For example, clearing a register and incrementing it requires only 3 bytes of code to load the value 1.
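The byte counts are easy to verify from the encodings. The sketch below lists the machine-code bytes for loading 1 and -1 into EAX both ways, assuming 32-bit mode (where INC EAX and DEC EAX are single-byte instructions). The array names are illustrative only, and the DEC variant for -1 is my own extension of the same trick.

```c
#include <assert.h>
#include <stddef.h>

/* MOV EAX,1: opcode B8 followed by a full 32-bit immediate. */
const unsigned char mov_eax_1[] = { 0xB8, 0x01, 0x00, 0x00, 0x00 };

/* XOR EAX,EAX (31 C0) then INC EAX (40): clear, then increment. */
const unsigned char xor_inc_eax[] = { 0x31, 0xC0, 0x40 };

/* MOV EAX,-1: the same 5-byte form with an all-ones immediate. */
const unsigned char mov_eax_m1[] = { 0xB8, 0xFF, 0xFF, 0xFF, 0xFF };

/* XOR EAX,EAX (31 C0) then DEC EAX (48): clear, then decrement. */
const unsigned char xor_dec_eax[] = { 0x31, 0xC0, 0x48 };

/* Bytes saved per constant by the short sequence over the MOV. */
const size_t bytes_saved_for_1  = sizeof mov_eax_1  - sizeof xor_inc_eax;
const size_t bytes_saved_for_m1 = sizeof mov_eax_m1 - sizeof xor_dec_eax;

void check_constant_sizes(void)
{
    assert(sizeof xor_inc_eax < sizeof mov_eax_1);
    assert(sizeof xor_dec_eax < sizeof mov_eax_m1);
}
```

The short forms trade code size for an extra instruction and an extra dependency, so smaller is not automatically faster.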

You can compare the 3 code sequences above (the 486 / Pentium code, the Pentium Pro (P6) code, and the ideal code) with the 3 code sequences I presented in December which were generated using the Microsoft Visual C++ 4.2, 6.0, and 7.0 (beta 1) compilers. GNU, having its roots in RISC processors (which do have short ways to load small constants) goes the route of explicitly loading the constants 1 and -1.

Microsoft goes the route of generating the constants on the fly. Running all 6 code sequences I found, no surprise, that the VC 4.2 sequence is faster on all processors than the VC 6.0 sequence. This is because it is one instruction shorter (which helps the Pentium 4 decode it faster), doesn't create the partial register stall (which hurts the Pentium III), and has one less data dependency (which helps all the processors). The Visual C++ 7.0 code sequence which I said was the best, sure enough, is the fastest on all three processors.
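As an illustration of what generating the constants on the fly can look like, here is a C sketch of one branch-free way to compute the same result. It is not the output of any of the compilers discussed. The name foo_branchless is hypothetical, and the formulation is only an assumption about the general style of such sequences.

```c
#include <assert.h>

/* The function from the example, choosing between two explicit constants. */
int foo(unsigned char ch)
{
    return (ch == 0) ? 1 : -1;
}

/* Hypothetical branch-free variant: (ch == 0) evaluates to 1 or 0,
   so doubling it and subtracting 1 yields 1 or -1 with no jump. */
int foo_branchless(unsigned char ch)
{
    return ((ch == 0) << 1) - 1;
}

/* Exhaustively compare the two versions over every possible input. */
void check_foo(void)
{
    for (int c = 0; c <= 255; c++)
        assert(foo_branchless((unsigned char)c) == foo((unsigned char)c));
}
```

Since the argument is a single byte, all 256 inputs can be compared, which is what check_foo does.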

Continue to Round 2



