NO EXECUTE! A weekly look at personal computer technology issues.

(c) 2007 by Darek Mihocka, founder, Emulators.com.

October 8 2007


[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Next]

The Achilles Heel of x86

I will dive right into code and benchmarks today. To support my argument that the x86 hardware architecture should be simplified, I will demonstrate how a large number of x86 opcodes could be removed from the instruction set without affecting the performance or capabilities of future Intel microprocessors.

Last week I posed the question of how one would increment an integer in memory. This is not a trick question. Obviously it is a simple matter to do in C or C++, but what is the actual code that the C or C++ compiler should emit? As easy as this problem sounds, the solution touches on many of the fundamental performance and reliability issues that affect real world code.

If you are a software developer and have either gcc or Microsoft Visual Studio handy, please download and compile this test code: nx06_inc64.c. Everybody else, read along and look at the results, which are summarized below.

The "inc64" test is very straightforward:

  • In C, define an array of two 32-bit integers and an array of two 64-bit integers.
  • Initialize each of the four integers to some initial value.
  • Loop approximately one billion times, increment each of the four integers, then print the final values to confirm that the operations completed.
  • Compile the code.
  • Run it and time it.
  • Calculate the number of clock cycles per iteration.

I compiled the code with the latest Visual Studio 2008 "Orcas" Beta 2 release available on MSDN (http://msdn.microsoft.com/). I targeted both 32-bit x86 and 64-bit x64 executables, then repeated this experiment multiple times, tweaking the compile options to compile for speed, for size, and to inline the inner loop. I repeated this with the released version of Visual Studio 2005, as well as with a decade-old release of Visual Studio 98 (a.k.a. VC6). I also rebooted the Mac Pro into Mac OS X and recompiled the test code with Xcode and gcc 4. Looking at the generated code produced by each compiler with each set of compiler optimizations, I distilled the code down to a handful of unique code sequences. The various releases of Visual Studio tended to stick to one of two code patterns, while gcc took an entirely different approach.

I also synthesized several more code sequences by hand, which I tested in my custom "CPU_TEST" framework, which I developed years ago to do accurate architecture-level micro-benchmarks of various microprocessors. This is, for example, one of the tools I used to pick apart and debunk the Pentium 4 speed claims seven years ago.

When doing benchmarking, it is important to test on more than one machine and to use systems based on both AMD and Intel products. I used my recent batch of Apple personal computers - a Core Duo based iMac, a Core 2 Duo based Macbook, and a quad-core Mac Pro - as well as some Dell, Sony, and home-built PCs based on the Athlon, Pentium M, Pentium III, and Pentium 4, the same systems recently used for Gemulator 9 benchmarking.

Let me first go over the possible ways one can increment a memory location using the x86 instruction set. Being CISC, or a "complex instruction set", x86 supports both register-to-register arithmetic operations as well as arithmetic operations directly to memory. RISC, or "reduced instruction set", architectures generally only allow arithmetic operations on registers, and require separate memory load and memory store operations to transfer the memory data into registers.

A given arithmetic operation, such as ADD, can be performed in various ways. The table below lists various assembly language representations of various ADD instructions, along with a description of what that form of the instruction does.

Table 6-1 - addressing forms

    ADD EDI,EBX     adds the 32-bit contents of the EBX register to the EDI register and stores the result back into EDI
    ADD EDI,3       adds the 32-bit constant 3 to the EDI register and stores the result back into EDI
    INC EDI         a special form of ADD which implicitly adds the 32-bit constant 1 to the EDI register
    ADD [EDI],EBX   adds the 32-bit value in EBX to the 32-bit value in memory pointed to by the EDI register and stores the result back into that memory location
    ADD EBX,[EDI]   adds the 32-bit value in memory pointed to by the EDI register to EBX and stores the result back into the EBX register
    ADD [EDI],3     adds the 32-bit constant 3 to the 32-bit value in memory pointed to by the EDI register and stores the result back into that memory location
    INC [EDI]       a special form of ADD which implicitly adds the 32-bit constant 1 to the memory location pointed to by the EDI register

The "increment" instruction INC exists for two reasons. First, in the days of MS-DOS and the 8086 when memory was extremely limited, it was advantageous to encode very common operations using shorter bytecode sequences. The INC instruction is shorter to encode than an ADD because the constant 1 is implicit and is not encoded into the bytecode. To increment a register, Intel even defined shorter encodings which are only one-byte long that are used to increment each of the 8 general purpose registers. By the way, as memory is much more plentiful today, these are the one-byte encodings which AMD removed in AMD64 in order to serve as REX prefixes.

For similar reasons of minimizing code size in the early days of 8086, Intel provided the ability to add directly to memory in a single instruction. This avoids a three-instruction sequence such as this, which is known as "RISC-ified code":

    MOV EAX,[EDI]
    ADD EAX,1
    MOV [EDI],EAX

and allows one to replace it with a single instruction, called a "read-modify-write operation", either this:

    ADD [EDI],1

or this:

    INC [EDI]

INC has one more side effect: unlike ADD, SUB, XOR, and other arithmetic operations, INC only affects five of the six arithmetic flags: Overflow, Parity, Zero, Sign, and Adjust. The sixth flag, Carry, is not affected by INC (or DEC) but is modified by almost all other arithmetic operations.

The reason for this weird behavior of INC (and DEC) is that these instructions are primarily used in loop operations such as C/C++ "for" loops. INC and DEC can be used to quickly increment (or decrement) a register which serves as the index in C/C++ loop code such as this:

    for (i = 0; i < 100; i++) { ... }

Should that loop be performing a multi-precision arithmetic operation, such as a large addition, you do not want your loop index to affect the Carry flag. And thus INC and DEC behave the way they do. Interestingly this design decision from the early 1980's went on to plague the Pentium 4 architecture 20 years later as I will show.

Multi-precision arithmetic is necessary when performing 32-bit or 64-bit arithmetic in a 16-bit mode (such as in MS-DOS), or when performing 64-bit operations in 32-bit mode. Another form of ADD, called the "Add With Carry" or ADC instruction, performs an ADD but also adds in the existing Carry flag. So for example, to increment a 64-bit value that is stored in the EDX:EAX register pair, one could execute this code sequence:

    ADD EAX,1
    ADC EDX,0

This same "add-with-carry" concept exist in other architecture. On 68000 / 68040 for example, one would use a code sequence such as this:

    MOVEQ   #0,D2
    ADDQ.L  #1,D0
    ADDX.L  D2,D1

And on 32-bit PowerPC, a code sequence such as this:

    ADDIC   R3,R3,1
    ADDZE   R4,R4

The common act of incrementing a 32-bit value (which requires only a single ADD or INC operation) and of incrementing a 64-bit value (which requires an ADD/ADC instruction sequence) is what the "inc64" test measures the performance of.
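
The carry propagation is easy to see in plain C. This hypothetical helper (my own illustration, not code from the test) increments a 64-bit value split across two 32-bit halves and mirrors the ADD/ADC pair exactly:

    /* increment the 64-bit value whose halves sit in *lo and *hi,
       just as a 32-bit compiler would for the EDX:EAX register pair */
    void inc64pair(unsigned int *lo, unsigned int *hi)
    {
        *lo += 1;            /* ADD EAX,1 - wraps to zero on carry-out */
        *hi += (*lo == 0);   /* ADC EDX,0 - absorbs the carry          */
    }
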

Interestingly, in examining the output of gcc (the C/C++ compiler used by Linux and Mac OS X) and Microsoft's various C/C++ compilers for Windows over the years, I was able to get the compilers to emit all three of the possible code sequences listed above - the RISC-ified code, the read-modify-write ADD, and the read-modify-write INC - as well as some others. One optimization trick that compilers perform is called enregistering of constants. The compiler notices, for example, that the code uses the constant 1 a lot, so it loads the value 1 into a register. This, for example, is also a valid way to increment a memory location:

    MOV EAX,[EDI]
    MOV EDX,1
    ADD EAX,EDX
    MOV [EDI],EAX

such that if the constant 1 is used shortly thereafter, the "MOV EDX,1" instruction does not need to be repeated. Register-to-register operations are generally shorter to encode than constant-to-register operations.

Using my CPU_TEST framework, I tested the four code sequences I just mentioned as well as eight more for a total of twelve different implementations of the "increment two 32-bit integers and two 64-bit integers" loop. Ideally, a basic 32-bit microprocessor such as the 486 should execute such a code sequence in 6 clock cycles - one clock cycle for each 32-bit addition operation required. A superscalar microprocessor such as the Pentium III should in theory do it in under 6 cycles, since the operations can be parallelized. Of course there are other factors that slow down performance - the loop overhead of the test, and latency involved in reading and writing the data to main memory (or in reality, the L1 data cache).

As you will see in the table below, historically efficient architectures such as the AMD Athlon XP and the Intel Core 2 can do very well on this test, performing each loop iteration in only 8 clock cycles. Historically inefficient architectures such as Pentium 4, not surprisingly, can take upwards of 30 clock cycles. Pentium III and its derivatives such as Pentium M and Core Duo of course land somewhere in between.

What is important to note is that for any given piece of compiled code, no one single code sequence is optimal on all architectures. I am repeating what I have said before, which is that it is impossible for a software developer to ship optimal code. You cannot statically compile C or C++ code into an executable that will run as fast as possible on all microprocessors today or on ones to be released in the future. What was a cool optimization trick on a Pentium III could entirely blow up on you on a Pentium 4, as did in fact happen when the Pentium 4 was released.

The culprit I am trying to expose here is that read-modify-write operations are usually more expensive than RISC-ified code sequences, and significantly so. On the AMD architectures measured, as well as on Intel's Core Duo and every Pentium 4, the difference is easily measurable. On the Pentium III and the latest Core 2 it is close to a tie.

Interestingly, this is where Microsoft and the open source community make entirely different design decisions. Microsoft's compilers, up to and including the latest Visual Studio 2008 "Orcas" Beta 2 release, stick with the 1980's MS-DOS methodology of emitting read-modify-write operations. The gcc 4 compiler, which is included in the Xcode 3 tools as part of Apple's Mac OS X "Leopard" development kit, emits explicitly RISC-ified code.

I had each compiler make two builds - one which specified the compiler switch to "optimize for smallest code", and one to "optimize for fastest code". The gcc compiler produced almost identical code regardless of which optimization switch I used, whether -Os to optimize for size (listing Os), or -O3 to optimize for speed (listing O3). Other than code alignment they are the same.

Visual Studio 2008 takes the read-modify-write approach, using a read-modify-write INC with a constant when -O1 is used to optimize for size (listing O1), and switches to read-modify-write ADD instructions with enregistered constants when -O2 is used to optimize for speed (listing O2).

In order to test the code on all eight computers, even ones that are not running Mac OS X, I created a total of 12 synthetic assembly language code tests which I added to my CPU_TEST framework. I then ran each of these various micro-benchmarks on the eight different computers. I have summarized the results in the table below, but also see my ASM source code (listing) and raw output (text):

Table 6-2 - synthetic benchmark timings

Clock cycles per loop iteration, per machine, in this order: 1000 MHz Pentium III, 1533 MHz Athlon XP 1800+, 2000 MHz Pentium 4 Xeon, 2000 MHz Core Duo, 2000 MHz Core 2 Duo, 2400 MHz AMD Opteron, 2533 MHz Pentium 4 Prescott, 2666 MHz Core 2 Xeon. The four read-modify-write arithmetic tests are marked with *.

    CPU_TEST "24a"   RISC-ified using INC EAX for 32-bit, ADD/ADC with constant for 64-bit
                     12   8  15  10  11   9  15   8
    CPU_TEST "24b"   RISC-ified using ADD EAX,1 for 32-bit, ADD/ADC with constant for 64-bit
                     13  10  14  10  10   9  15   8
    CPU_TEST "24c"   RISC-ified using ADD reg-reg, loads and stores interleaved
                     13   9  14  10  10   9  20   8
    CPU_TEST "24d"   same as 24c, but not interleaved
                     15  10  14   9  11   9  20   9
    CPU_TEST "24e"   same as 24c, but memory references via register indirection
                     13  10  14  10  11   9  21   9
    CPU_TEST "24f"   same as 24d, but memory references via register indirection
                     13  10  14   9  10  10  20   9
    CPU_TEST "24g" * read-modify-write, INC for 32-bit, ADD/ADC with constant for 64-bit
                     12  13  27  16  11  16  27  10
    CPU_TEST "24h" * read-modify-write, all ADD or ADD/ADC with constant
                     12  13  24  17   9  16  25   8
    CPU_TEST "24i" * read-modify-write, INC for 32-bit, ADD/ADC using enregistered constants
                     13  14  26  15  11  16  27   9
    CPU_TEST "24j" * read-modify-write, all ADD or ADD/ADC using enregistered constants
                     13  14  24  15   9  16  25   8
    CPU_TEST "24k"   ADD/ADC from memory to register using enregistered constants
                     16  10  29   9  11   9  29   9
    CPU_TEST "24l"   same as 24k without reloading register constants (hypothetical case which requires 3-operand ADD reg-mem-imm)
                     13  10  24   9  12   9  25  10

I have marked with an asterisk the four tests - 24g through 24j - which use read-modify-write arithmetic ADD operations to memory. Notice that across all eight test machines, these are generally the slowest code sequences, except on Core 2 where they are about tied for best. On no architecture is a read-modify-write operation superior to the technique of RISC-ification, which is measured by tests 24a through 24f.

In theory, the read-modify-write instruction sequences should be no slower than the explicitly RISC-ified three-instruction sequences. I would think that internally the CPU decoder breaks down a read-modify-write instruction into the same micro-op sequence that the three-instruction sequence breaks down to. This appears to be the case on Pentium III and Core 2 systems, but is not true on Athlon, Opteron, Core Duo, or Pentium 4.

Notice also that on half the platforms - the Pentium 4 and Core 2 systems - the use of read-modify-write INC instead of read-modify-write ADD instructions is more costly. This is a documented performance hazard of the INC instruction: because it does not update the Carry flag in EFLAGS, it is effectively turned into a consumer of the Carry flag, and thus creates an extra data dependency in the pipeline which the ADD instruction does not have. Intel documented this performance issue for the Pentium 4, and it has obviously carried through into the design of the Core 2. When Microsoft released the Visual Studio 2003 package, it added a -G7 switch to the C/C++ compiler which substitutes ADD instructions in place of INC instructions exactly for this reason.

Test 24k is a rather inefficient code sequence which involves loading a constant (one or zero as necessary) into a register, followed by a memory-to-register ADD instruction, and a store back to memory. You should not actually write code like that unless you can preload the constant much earlier. I was curious to see whether a memory-to-register operation followed by a memory store is faster or the same speed as a single read-modify-write operation.

To factor out the overhead of reloading the constant into a register, I wrote test 24l without those reloads. While this is technically incorrect code, it mimics the effect of AMD or Intel providing a 3-operand form of the ADD instruction, similar to the existing 3-operand form of IMUL, which takes a memory location and a constant and stores the result into a register. Interestingly, on the AMD architectures this is in fact about the same cost as a read-modify-write, while on all the Intel architectures it is the same or slightly faster than read-modify-write.

Eliminating these last two bogus tests, the duplicate INC and ADD RISC-ified cases which are about equal in performance, the read-modify-write INC operations which are clearly slower than their ADD counterparts, and the redundant tests which use enregistered constants effectively leaves just two variants of the code as candidates for the most optimal implementation: test 24a (RISC-ified code, and closest to what gcc emits) and test 24j (a read-modify-write code sequence, and closest to what Visual Studio 2008 emits).
 


Microsoft's Dilemma

I will summarize by repeating the table above using only the relevant test results corresponding to these two optimal code sequence candidates as well as the third sequence that corresponds to what Visual Studio 2008 emits when compiling for size:

Table 6-3 - summary of synthetic benchmarks

Clock cycles per loop iteration, per machine, in the same machine order as Table 6-2.

    CPU_TEST "24a" (gcc -O3 / -Os code)
                     RISC-ified using INC EAX for 32-bit, ADD/ADC with constant for 64-bit
                     12   8  15  10  11   9  15   8
    CPU_TEST "24g" (VS2008 -O1 code)
                     read-modify-write, INC for 32-bit, ADD/ADC with constant for 64-bit
                     12  13  27  16  11  16  27  10
    CPU_TEST "24j" (VS2008 -O2 code)
                     read-modify-write, all ADD or ADD/ADC using enregistered constants
                     13  14  24  15   9  16  25   8

I am trying to convince you that read-modify-write instructions are an Achilles Heel of the x86 instruction set, both in terms of performance and reliability. What the above results show us is that specific micro-benchmarked code sequences appear to favor RISC-ification over the use of CISC-like read-modify-write instructions. We need to try this on slightly more real-world code. How closely do these micro-benchmark results match the real output of the C compilers and performance of that compiled code?

Since I only have one Mac Pro set up with gcc running on Mac OS X, ideally there is some other compiler out there that targets Windows and generates the gcc-style code. Yes, I can use gcc of course, but let's make it more interesting and stick with Microsoft's compilers, as these are the compilers used to build most of the popular Windows applications out there.

I therefore rebuilt the INC64.C code using the C compilers found in Visual Studio 2005 (a.k.a. VC8), Visual Studio 2003 (a.k.a. VC7.1), and Visual Studio 98 (a.k.a. VC6). While Visual Studio 2005 has much the same compiler switches and code generation as Visual Studio 2008, Visual Studio 2003 and earlier support compiler switches to target specific architectures: -G3 to target the 386, -G4 to target the 486, -G5 to target the Pentium / Pentium MMX, -G6 to target the Pentium Pro / Pentium II / Pentium III, and -G7 to target the Pentium 4. By using these switches to select different target architectures in addition to the usual -O1 and -O2 optimization switches, a broader set of compiled code is produced.

Interestingly enough, Microsoft's older versions of the compilers did in fact emit the gcc-style RISC-ified code! I will focus on the compiled output of Visual Studio 2003 (VC7.1) as it is the most recent of the older versions of Visual Studio that supports these switches of selecting a target architecture.

Combining -O1 with any of the -G3 through -G7 switches produced the same code, which used read-modify-write code sequences almost identical to those produced by Visual Studio 2005 and 2008. The -O2 cases are more interesting, as these produce RISC-ified code in all cases. -G3, -G4, -G5, and -G6 produce code that uses the INC instruction (listing O26), while the -G7 switch produces the same code but using the ADD instruction (listing O27).
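
For anyone reproducing this, the interesting VC7.1 builds boil down to command lines along these lines (the source file is the same one linked above; the parenthetical notes are mine):

    cl -O1 -G6 nx06_inc64.c        (read-modify-write code; -G3 through -G7 all give the same)
    cl -O2 -G6 nx06_inc64.c        (RISC-ified code using INC - listing O26)
    cl -O2 -G7 nx06_inc64.c        (RISC-ified code using ADD - listing O27)
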

The table below has my measured execution times in seconds of the various compiled code sequences, along with the calculated clock cycle count of each iteration of the test code.

Table 6-4 - compiled inc64.c timings

Execution time in seconds, with clock cycles per iteration in parentheses, in the same machine order as Table 6-2.

    INC64 compiled with VS2008, -O1        14.0 (13)   9.1 (13)  14.3 (26)   8.1 (15)   7.3 (14)   7.6 (17)  11.7 (27)   5.3 (13)
    INC64 compiled with VS2008, -O2        12.9 (12)   9.8 (14)  13.2 (24)   7.6 (14)   4.7 (9)    7.6 (17)  10.7 (25)   3.3 (8)
    INC64 compiled with VS2003, -O2 -G6    14.0 (13)   6.2 (9)    7.9 (15)   5.5 (10)   6.8 (12)   4.4 (10)   6.4 (15)   4.3 (11)
    INC64 compiled with VS2003, -O2 -G7    12.9 (12)   6.9 (10)   8.0 (15)   5.6 (10)   6.8 (12)   4.2 (9)    6.6 (16)   4.4 (11)
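
For anyone checking the arithmetic, the cycle counts in parentheses derive directly from the times: cycles per iteration = execution time * clock rate / iteration count. Assuming the loop executes 2^30 (roughly 1.07 billion) times, the 14.0 second VS2008 -O1 result on the 1000 MHz Pentium III works out to 14.0 * 1,000,000,000 / 2^30, or about 13.0, which is exactly the 13 cycles shown.
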

As you can see, the compiled test code gives almost identical clock cycle timings to my synthetic micro-benchmarks. On all but the latest Intel Core 2 microprocessors, the code emitted by gcc and by older versions of Microsoft Visual Studio was optimal for the personal computers that most people have. Starting with Visual Studio 2005 and now with Visual Studio 2008, Microsoft seems to have shifted its code generation to specifically favor only the Core 2!

Is this a wise move on the part of Microsoft or an incredible blunder? Could it be that Windows Vista and all of Microsoft's current applications such as Office 2007 are compiled with code sequences that specifically do poorly on Pentium 4 but do well on Core 2? Should Microsoft ship C and C++ compilers that target machines that most Windows users do not yet own, or should it target the machines that hundreds of millions of Windows users do own today?

It's a tough call. Apple has obviously chosen a compiler that goes the RISC-ification route and emits code that historically has been more optimal, even if slightly larger, than code that happens to do well on today's CPU du jour, the Intel Core 2. Microsoft, on the other hand, appears to have dropped the compiler switches that at least gave developers a choice, forcing a single code generation policy on all new code. What if I work in the IT department of some company and want to build internal applications which are specifically optimized for whatever AMD or Intel architectures the company happens to use? With Visual Studio 2005 and Visual Studio 2008, you have fewer options.

Is it any wonder then that Windows Vista appears to run very well on today's latest and greatest Core 2 based systems, but seems to do more poorly when upgrading, say, an older Pentium 4 system from Windows XP? Not only is the older system slower overall, but you are upgrading it to code that is no longer optimal for that system. Windows Vista upgrades are thus a double whammy for AMD and Pentium 4 owners.

These results should convince you (and my former colleagues at Microsoft) that since one cannot statically compile and ship the "best" code, what one needs is a form of "managed native code": code that is delivered as an x86 bytecode but, like Java or MSIL, is dynamically translated at run time into native code that is optimal for the specific microprocessor it is executing on. Such optimizations may exist in Java virtual machines, but no such option exists today for the vast majority of Windows, Linux, and Mac applications, which are compiled and shipped as pure native code.

The x86 instruction set as it exists today is just too complicated and allows for too many redundant variations of the same code sequence. How can developers even hope to tackle multi-threaded applications running on multi-core machines when the fundamental problem of optimizing even single-threaded code is an ever changing moving target?

It is therefore critical that AMD and Intel put aside their differences and work out a solution to this problem. I have already pointed out that hardware virtualization does little to solve the security crisis plaguing personal computers today. Now you should be convinced that merely adding hardware virtualization around buggy inefficient code is not a long term solution to single-threaded performance, multi-threaded performance, or to security. Virtualization that is implemented on simplified hardware running binary translation-based virtual machines is the answer.
 


VX64 Continued

Last week I proposed a hypothetical alternative to AMD's 64-bit extensions, a virtual 64-bit instruction set I called VX64, which I urge AMD and Intel to consider as a replacement for the half-baked AMD64/Intel64 technology out there now. I demonstrated how simple changes to the existing Mod-R/M-S-I-B bytes in x86 could have avoided the new REX prefix bytes introduced in AMD64. I demonstrated how this change would allow VX64 to support 16 general purpose registers even using today's 32-bit x86 instruction opcodes. And I hinted at how further simplifications of the opcode encodings themselves could help the hardware decode instructions more efficiently using fewer transistors.

I have already argued that software developers have rejected MS-DOS and OS/2-style segments, that thread-local storage as implemented today in Windows using segment prefixes is not efficient, and that since AMD mostly removed segments from 64-bit mode, the concept of segments and 48-bit pointers should be removed completely. Eliminating segments also eliminates the six segment prefix bytes, the ten PUSH/POP segment instructions, far CALL, far JMP, and the two far RET instructions, freeing up sixteen more primary opcodes for other uses.

Today I have shown the redundancy and general inefficiency of read-modify-write instructions. Therefore I propose that such instructions be struck from the VX64 instruction set completely. Such instructions represent about three dozen, or about 15% of the entire primary opcode space. If you think about it, floating point, MMX, and SSE extensions have never added any form of read-modify-write operations. Read-modify-write is strictly an artifact of 1980's efforts to reduce code size.

Additionally, since it is preferable to avoid the INC and DEC instructions - a fact backed up by today's benchmark results - the one-byte encodings of INC and DEC can be removed (falling back, as in AMD64, on the existing two-byte versions). This frees up sixteen more primary opcodes. Unlike in AMD64, these recycled INC and DEC opcodes will not be repurposed as REX prefixes.

Other short forms of extended instructions - the sixteen one-byte opcode versions of the conditional jumps, the sixteen PUSH/POP register opcodes, the one-byte RET, and the four "MOV AX" opcodes - are also gone. That is 37 more primary opcodes freed up.

Unlike what AMD did, I would not simply eliminate rarely used one-byte instructions such as AAA, DAA, BOUND, SALC, PUSHA, and POPA. I would move those into the extended opcode space, which still has a lot of unused opcodes. Other one-byte instructions which AMD did keep - CLI, STI, LEAVE, HLT, and the six string instructions - I would similarly make into extended opcodes.

I have now freed up 9 of the 16 prefixes and well over 120 primary opcodes, yet hardly impacted one's ability to write an efficient Windows application. I have cut down redundancy by eliminating duplicate encodings of the same instructions. Of the remaining 7 prefixes, let's eliminate the concept of the REX 64-bit data size prefix completely and adopt the rule that 64-bit data operations are assigned new opcodes. That leaves six prefixes:

  • 0xF0 - LOCK prefix, locks the memory bus to allow atomic read-modify-write operations
  • 0xF2 - REPNE prefix, used by "string" opcodes 0xA4 through 0xAF.
  • 0xF3 - REP/REPE prefix, similarly used by string opcodes
  • 0x66 - data prefix - this turns a 16-bit data operation into a 32-bit operation and vice versa
  • 0x67 - address prefix - this changes the meaning of the Mod-R/M byte between the 16-bit addressing and 32-bit addressing versions.
  • 0x0F - the extended opcode prefix

Punt the LOCK prefix. Since read-modify-write operations are already gone, the only instructions that are left which can legally use a LOCK prefix are XADD and variants of CMPXCHG. Those two instructions are used by thread synchronization primitives and by lock-free code, and as it happens, my micro-benchmarks indicate that starting with the Pentium 4 these instructions already behave as if a LOCK prefix is present.
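
To illustrate what those primitives look like from C, here is a generic sketch (my illustration, not code from this article) of the classic compare-exchange retry loop using the Win32 interlocked API, which compiles down to CMPXCHG on x86:

    #include <windows.h>

    /* atomically increment *p via a compare-exchange retry loop */
    LONG AtomicIncrement(LONG volatile *p)
    {
        LONG oldval, newval;
        do {
            oldval = *p;             /* read the current value           */
            newval = oldval + 1;     /* compute the desired new value    */
            /* CMPXCHG: store newval only if *p still equals oldval */
        } while (InterlockedCompareExchange(p, newval, oldval) != oldval);
        return newval;
    }

The OS of course exposes InterlockedIncrement directly for this particular case; the loop form is shown because it generalizes to any lock-free read-modify-write update.
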

REPNE/REPE - these two prefix bytes are used to modify the six string instructions, as well as to convert MMX instructions into SSE instructions. These prefixed instructions will now be assigned new opcodes in the extended opcode space. REPNE and REPE are now gone.

The 16-bit data size prefix 0x66 is similarly removed, and instructions that relied on it are assigned new extended opcodes.

The address prefix 0x67 is replaced by the new Mod-R-S-I-B mechanism. Gone.

The only prefix left now is the 0x0F extended opcode prefix. It would be a HUGE simplification to the hardware (and to binary translation software) to get rid of this one last prefix. Ten years ago this would have been possible, as the above cuts free up so much of the primary opcode space that it would have been possible to merge in all of the extended opcodes as they existed in 1997, those being mostly MMX instructions. Unfortunately, the recent explosion of SSE, SSE2, SSE3, and future SSE4/SSE5 instructions leaves us with more than 256 unique instruction opcodes, and thus needing more than 8 bits to encode an opcode.

One more radical step is needed, and that is to eliminate many of the memory-to-register forms of instructions. Looking back at Table 6-2, tests 24k and 24l show that memory-to-register forms of arithmetic instructions are really no more efficient than pure RISC-ification. Even if a 3-operand form of common arithmetic operations existed, it would be redundant. So for most data instructions I am eliminating these instruction forms, which today are encoded in the Mod-R/M-S-I-B bytes and in VX64 would be encoded in the Mod-R-S-I-B bytes. Instructions that do retain memory-to-register operation get split into two opcodes - a register-to-register opcode that requires no Mod-R-S-I-B bytes, and a memory-to-register form that does. Other than the basic MOV instructions to load and store registers to memory, very few primary opcodes are now needed for the few such instructions that remain. This shrinks the opcode pressure down to the point where all instructions fit into an 8-bit opcode space. Prefix 0x0F, the final prefix byte, is gone!

Without prefixes, the decoder now knows that the primary opcode always resides in the very first byte of the instruction being decoded. The Mod-R-S-I-B bytes, if present, are always found in the second and third bytes of the instruction. The I-B byte itself only contains register indexes, and therefore does not affect the total instruction size. We have thus simplified instruction decoding to the point where the first two bytes of an instruction (in fact, as I pointed out last week, the first 12 bits) uniquely determine both the validity of the opcode and the total instruction size.

This can all be done while retaining almost complete backward compatibility with existing IA32 and AMD64 assembly source code and C/C++ compiler output. If you look again at the Visual Studio 2003 compiler output from the previous section, every single instruction in those listings is still supported by this VX64 instruction set. All that changes is the way that the opcodes are assigned and the way that the instructions are decoded.

Whereas the instruction format of legacy IA32 and AMD64 is horribly complex (the AMD64 documentation requires a full 35 pages to explain it!), is full of redundant encodings, and is computationally not verifiable (due to the 2-to-the-power-120 possible instruction encodings), the rules governing VX64 instruction encoding lead to just a handful of possible instruction formats. VX64 instructions are either 4 bytes or 8 bytes in size. Period. Notice that by eliminating read-modify-write instructions, this also eliminates the possibility of the 12-byte instructions which I hinted last week would be eliminated. The common instruction form of an immediate-to-memory read-modify-write operation is not allowed, and therefore VX64 contains no instruction encodings that contain both a 32-bit address displacement and a 32-bit immediate. Instructions can have one or the other but not both.

By limiting instructions to at most 8 bytes, the hardware decoder is guaranteed to always decode a minimum of two instructions per 16-byte block of code. By guaranteeing that the first 12 bits of an instruction fully define the opcode and instruction size, it is now feasible to have 4 decoders decode each 16-byte block of code in parallel. Each decoder uses the 8192-bit ROM lookup table to determine the validity and length of its 4-byte block of code, and checks with the decoder to its left to see whether the previous instruction spilled into its block. Compare this against the current scheme, where between 1 and 16 valid instructions may be decoded in each 16-byte block of code, and an instruction can begin at any arbitrary byte.
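
As a sketch of how trivial the length decode becomes, here is a hypothetical C model of that lookup. The table holds 4096 entries of 2 significant bits each, which is exactly the 8192-bit ROM mentioned above, indexed by the first 12 bits of the instruction (the table contents are of course undefined until opcodes are actually assigned):

    /* hypothetical VX64 length decoder: entry values are 0 = invalid
       opcode, 1 = 4-byte instruction, 2 = 8-byte instruction */
    static unsigned char vx64_rom[4096];   /* 4096 entries x 2 bits = 8192 bits */

    /* return the instruction length in bytes, or 0 for an invalid opcode */
    int vx64_length(const unsigned char *code)
    {
        unsigned int first12 = ((unsigned int)code[0] << 4) | (code[1] >> 4);
        return vx64_rom[first12] * 4;
    }
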

VX64 instructions fall into very specific templates, similar to how RISC instructions are encoded, while still retaining the flexibility and richness of the x86 instruction set. The basic no-operand instruction (such as STC or LEAVE), or an instruction that operates on a single register (e.g. INC EAX), takes on this form: the first 12 bits of the instruction define an opcode, the next 4 bits of the second byte define a register index (whether GPR, MMX, XMM, control register, or floating point), and the last two bytes are available for instruction-specific immediate operands:

Table 6-5 - basic VX64 instruction layout

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - minor opcode (4 bits)
                 bits 3..0 - reg field (4 bits)
    bytes 2..3   optional 8- or 16-bit immediate

Alternately, the third byte can be used to encode a 3-operand register/register operation with optional 8-bit immediate operand. This can be used to encode existing x86 arithmetic instructions such as:

  • ADD ECX,EDX
  • SHLD EAX,EDX,13
  • IMUL EDI,ESI,4

by actually encoding them as 3-operand instructions and replicating the same register index for the destination register as the first source register:

  • ADD ECX,ECX,EDX
  • SHLD EAX,EAX,EDX,13
  • IMUL EDI,EDI,ESI,4

By not replicating the register, true 3-operand instructions can be formed, which in turn would eliminate a lot of the register-to-register instructions needed today. The larger minimum instruction size of VX64 is thus offset by a reduction in the total number of instructions needed compared to legacy x86.

Table 6-6 - 3-operand register VX64 instruction layout

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - minor opcode (4 bits)
                 bits 3..0 - destination register index, reg field (4 bits)
    byte 2       bits 7..4 - source register 2, rs2 field (4 bits)
                 bits 3..0 - source register 1, rs1 field (4 bits)
    byte 3       optional 8-bit immediate
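
To make the packing concrete, here is a hypothetical encoder for this format (field placement per Table 6-6; the OPC_ADD and REG_ECX/REG_EDX names in the usage comment are placeholders, since actual opcode values are not yet assigned):

    /* pack a 3-operand register instruction into a 4-byte VX64 word */
    unsigned int vx64_encode3(unsigned int primary, unsigned int minor,
                              unsigned int rd, unsigned int rs1,
                              unsigned int rs2, unsigned int imm8)
    {
        return (primary & 0xFF)                          /* byte 0 */
             | (((minor & 15) << 4 | (rd  & 15)) << 8)   /* byte 1 */
             | (((rs2   & 15) << 4 | (rs1 & 15)) << 16)  /* byte 2 */
             | ((imm8 & 0xFF) << 24);                    /* byte 3 */
    }

    /* legacy ADD ECX,EDX becomes ADD ECX,ECX,EDX by replicating ECX:
       vx64_encode3(OPC_ADD, 0, REG_ECX, REG_ECX, REG_EDX, 0)        */
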

An instruction which operates on an address could use the Mod-R-S-I-B encoding described last week. This would look something like this:

Table 6-7 - possible Mod-R-S-I-B based VX64 instruction layout

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - Mod-S field (4 bits)
                 bits 3..0 - reg field (4 bits)
    byte 2       bits 7..4 - ri field (4 bits)
                 bits 3..0 - rb field (4 bits)
    byte 3       ?

However, there is a problem with the Mod-R-S-I-B scheme which I described last week. As I hinted, the current x86 definition of the Mod-R/M-S-I-B bytes is not fully compatible with the rules I have set forth. The stack pointer register ESP cannot be scaled, and therefore there are invalid instruction encodings that cannot be detected until the S-I-B byte is decoded. We can't have that in VX64. The Mod field itself also allows for an optional 1, 2, or 4 bytes of address displacement to follow the S-I-B byte, which does not fit the rule that instructions must be multiples of 4 bytes in size. I also want to incorporate 64-bit addressing into this scheme without resorting to REX prefixes or address prefix bytes of any sort.

Therefore, I am punting the legacy x86 "mod" and "scale" fields and replacing them with new "Mod" and "Scale" fields which are 4 bits and 8 bits in size respectively. This allows for up to 16 different addressing modes, and as you see in this list of the 16 combinations, allows one to incorporate PC-relative (a.k.a. RIP-relative) addressing without hacks on the R13 register:

  • 0000 - [base + d8]
  • 0001 - [index*scale]
  • 0010 - [index*scale + d32]
  • 0011 - [d48]
  • 0100 - [base + index + d8]
  • 0101 - [base + index*scale]
  • 0110 - [base + index*scale + d32]
  • 0111 - [reg + d48]
  • 1000 - [PC + base + d8]
  • 1001 - [PC + index*scale]
  • 1010 - [PC + index*scale + d32]
  • 1011 - [PC + d48]
  • 1100 - [PC + base + index + d8]
  • 1101 - [PC + base + index*scale]
  • 1110 - [PC + base + index*scale + d32]
  • 1111 - [PC + reg + d48]

These 16 addressing modes cover all of the addressing modes possible in IA32 and AMD64 today and are encoded in a very orthogonal manner which is easy to decode. The high bit is the PC-relative addressing bit. The low two bits give the size of the signed displacement - 8 bits, none, 32 bits, or 48 bits. Notice that I define at most a 48-bit displacement. If you read the AMD64 and Intel64 specs as they are defined today, "64-bit" addressing is actually a hoax: the current specifications define at most 48 bits of usable address space. So while 64-bit Windows and Linux applications bloat their pointer sizes to 8 bytes, 25% of every pointer is wasted. There is therefore no need to ever have a 64-bit displacement.

As I pointed out over the past couple of weeks, it is generally unnecessary to even have "64-bit" addressing, as most applications can do just fine with 32-bit pointers. In a pure 32-bit addressing implementation of VX64, the four Mod field values ending in 11 would simply be invalid. This scheme as I have now defined it adheres to the rule that the validity and size of an instruction can be determined fully by only the first 12 bits of the instruction.

The fourth byte of the instruction is used either as an 8-bit displacement or as the "Scale" field. One thing that has bothered me about existing x86 implementations is that scaling is limited to two bits, and thus only scaling factors of 1, 2, 4, or 8 are possible. I want to devote an entire byte to the scaling, such that the index can be shifted by up to 63 bits in either direction and either added or subtracted. I want this for two reasons: first, to allow easier indexing into arrays whose elements are larger than 8 bytes, and second, to allow for bit manipulation tricks using the LEA instruction. As such, I am defining the 8-bit "Scale" field as follows:

  • bit 7 - 0 = add index, 1 = subtract index
  • bit 6 - 0 = scale by shifting left, 1 = scale by shifting right
  • bits 5..0 - scaling factor of 0 to 63 bits

To allow for [base + d32] addressing modes, the redundant scaling factor of (shift right by 0 bits) acts as an index suppression.
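
Putting the Mod and Scale definitions together, here is a sketch (mine, under the definitions above) of how software or hardware would evaluate the index term, using mode 0110, [base + index*scale + d32], as the example:

    /* hypothetical evaluation of the VX64 Scale byte:
       bit 7: 0 = add the index, 1 = subtract it
       bit 6: 0 = shift the index left, 1 = shift it right
       bits 5..0: shift count; "shift right by 0" suppresses the index */
    long long vx64_index_term(long long index, unsigned char scale)
    {
        int count = scale & 63;
        long long term;

        if ((scale & 0x40) && count == 0)
            return 0;                          /* index suppressed: [base + d32] */
        term = (scale & 0x40) ? (index >> count) : (index << count);
        return (scale & 0x80) ? -term : term;
    }

    /* mode 0110:  ea = base + vx64_index_term(index, scale_byte) + d32 */
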

The new addressing scheme therefore gives us four more instruction formats:

Table 6-8 - VX64 instruction layout for addressing with 8-bit displacement

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - Mod field (4 bits)
                 bits 3..0 - reg field (4 bits)
    byte 2       bits 7..4 - ri field (4 bits)
                 bits 3..0 - rb field (4 bits)
    byte 3       d8

Table 6-9 - VX64 instruction layout for addressing with 48-bit displacement

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - Mod field (4 bits)
                 bits 3..0 - reg field (4 bits)
    bytes 2..7   d48 field (48 bits)

Table 6-10 - VX64 instruction layout for addressing with scaled indexing

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - Mod field (4 bits)
                 bits 3..0 - reg field (4 bits)
    byte 2       bits 7..4 - ri field (4 bits)
                 bits 3..0 - rb field (4 bits)
    byte 3       (Scale)  bit 7     - add/subtract index
                          bit 6     - left/right shift
                          bits 5..0 - scale shift count (6 bits)

Table 6-11 - VX64 instruction layout for addressing with scaled indexing and 32-bit displacement

    byte 0       bits 7..0 - primary opcode (8 bits)
    byte 1       bits 7..4 - Mod field (4 bits)
                 bits 3..0 - reg field (4 bits)
    byte 2       bits 7..4 - ri field (4 bits)
                 bits 3..0 - rb field (4 bits)
    byte 3       (Scale)  bit 7     - add/subtract index
                          bit 6     - left/right shift
                          bits 5..0 - scale shift count (6 bits)
    bytes 4..7   d32 (32 bits)

Although I have not given you the actual instruction set encodings for VX64, roughly 1/4 of the primary opcodes - 64 out of 256 - are needed to encode the existing x86 load, store, and memory-to-register instructions which survive the cut. The other 192 primary opcodes are used to encode up to 3072 12-bit opcodes (192 times 16 minor opcodes) for non-memory instructions, including conditional branches, 1-, 2-, 3-, and 4-operand register operations, string instructions, floating point operations, MMX and SSE instructions, and various miscellaneous instructions.

By defining the instruction encodings in a manner that easily allows for over 3000 unique instructions, new SSE instructions can be added without worry of running out of opcodes any time soon, and without having to resort to complex schemes such as prefix bytes and variable sized opcodes. And yet there is still a large compatible intersection of existing x86 instructions and these proposed VX64 instructions and addressing modes, which would allow software developers to write to a common subset of the two instruction sets that is assembly language compatible with both.

And that is it! Using only these six instruction formats, all possible 4-byte and 8-byte VX64 instructions using all possible addressing modes can be defined. Now is this not a lot easier than the mess that AMD and Intel have gotten themselves into?


Recapping the Past Four Weeks

In summary, AMD and Intel blew a golden opportunity over the past few years with AMD64 and Intel64 to simplify the x86 instruction set, to simplify the encoding and decoding of 64-bit applications, and quite possibly even to shrink the size of 64-bit code to as small as or smaller than today's legacy 32-bit applications. This is the mess that you pay for when you buy an overpriced new 64-bit microprocessor and a bloated 64-bit operating system. And this is what you can thank every time your computer blue screens or becomes infected with malware.

Software companies like Microsoft and VMware, whose own engineers are publishing papers pointing out the stupidity of today's hardware design, instead of having the sense to push back against these inefficient hardware changes, choose to blindly embrace every new over-hyped hardware technology that AMD and Intel throw at us. And in the process each of these companies makes money hand over fist selling upgrades that consumers didn't need - from "VT-enabled" hypervisors that run slower than their older counterparts, to new releases of Visual Studio that produce worse code than their older counterparts, to 64-bit operating systems that just consume more memory. Microsoft itself has now joined companies like McAfee and Symantec in profiting from selling consumers worthless anti-virus and malware detection utilities. Consumers are being robbed of their money. It is like paying somebody to tell you after the fact that your car has been broken into. All of these companies are digging in the wrong place and not solving the problems that actually need to be solved.

The software industry needs to discard the broken x86 instruction set of today and adopt a simpler, backward compatible, verifiable 64-bit x86 instruction set to move forward - namely, the VX64 instruction set I just described. Virtual machines based on binary translation are the vehicle to transition software to such a platform, and in doing so can provide a safer, more secure, and potentially even more efficient environment than today's compilers, operating systems, and hardware virtualization provide.

In turn, the hardware industry can discard its overly complex and buggy x86 hardware design and switch to supporting a very lightweight hardware design which is optimized for hosting virtual machines in a flat real-mode memory model, where memory protection and security are handled by the virtualization software itself.

Next week, I will show you the results of my own research related to comparing hardware memory protection against software memory protection and prove to you that even on today's crippled hardware it is possible to write a virtual machine that achieves very efficient software memory protection.


Thank you to everyone that has emailed me feedback so far. If you have comments, email me directly at darek@emulators.com.

[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Next]