NO EXECUTE! A weekly look at personal computer technology issues.

(c) 2007 by Darek Mihocka, founder, Emulators.com.

November 12 2007


[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Part 8]  [Part 9]  [Part 10]  [Next]

Hardware Virtualization Comes To The BIOS

Before I start, I just want to remind everybody that today is the day that the One Laptop Per Child organization is making available its "XO" laptops for sale to North American consumers. As I mentioned before, I think the OLPC is a terrific concept and is a wonderful step in making low-cost yet powerful and rugged laptops the standard in the marketplace. I placed my order today and I would urge you to do the same at http://www.laptopgiving.org/en/index.php.

Along the same lines, there is a new race in virtualization to make desktop and laptop computers boot as fast as possible. Ironically, fast booting was once the standard: in the early 1990's, both MS-DOS and even Windows 3.x booted in just seconds. With the advent of 32-bit Windows, Linux, and Mac OS X, humanity has slowly been losing the boot speed race, for many of the technical reasons I've ranted about over the past couple of months.

An idea that has come and gone several times in the past is being brought back in a new form. In the past few weeks, the spotlight on hardware virtualization has shifted from VMware's server products to hypervisors found in the BIOS itself, with two new products being announced - the "SplashTop" (http://www.splashtop.com/index.php) finding its way into ASUS motherboards, and the similar "Hyperspace" being developed by BIOS maker Phoenix.

The allure of these two announcements is the concept of "instant boot", i.e. that your PC appears to boot up and is ready to do something useful in 2 or 3 seconds, and not the 2 to 3 minutes that most Windows users have to suffer through. While this seems very clever and sexy at first, it is actually quite straightforward, and anyone who is familiar with "un-hibernating" an OS or resuming a frozen virtual machine will easily understand how it works.

There are two tricks behind these products: first, that the BIOS itself appears to contain the x86 hypervisor, thus saving the time needed to bootstrap something like VMware Server from the hard disk. Second, the first OS being loaded is a stripped down Linux so that the user can be productive in a matter of seconds, while the big fat Windows partition loads in the background at normal speed. From what I've understood of the press releases, nothing here actually speeds up the booting of Windows, only that Linux is loaded quickly first.

Although I champion the cause of running everything in virtualization and moving the hypervisor down "as low as possible", i.e. to the realm of the BIOS or the hard disk loader, I am a little disappointed that ASUS and Phoenix are taking the easy road of developing this with hardware virtualization in mind. Unfortunately for consumers, this "new" technology of course requires that you buy new motherboards and CPUs with hardware virtualization support. Neither SplashTop nor Hyperspace is a solution for the hundreds of millions of existing PCs out there. For the rest of us, we still need to rely on the Windows "hibernate" feature, which, in the case of my Sony VAIO notebook, is actually slower at un-hibernating Windows Vista than just rebooting Windows Vista from scratch!

But, as with the XO laptop, this is a refreshing concept coming into the spotlight. ASUS and Phoenix need to join my effort to develop a purely software-based binary translation hypervisor built on emulation techniques, not hardware virtualization techniques. Something like Bochs, as I am working on now, could be adapted to run directly from inside a BIOS, and this fits very nicely with my argument that hypervisors need to be written for a 64-bit flat programming model, since things like paging and virtual memory are not yet enabled while in the BIOS.

I see the arrival of the XO laptop and the arrival of SplashTop/Hyperspace as good signs that the world is headed toward my "killer app" scenario which I described in my October 15 post: owning or renting laptop computers that come pre-installed with nothing more than a hypervisor and some method of loading a virtual machine, such as from the Internet. This technology would share many of the same implementation details as the "instant on" concept as well. But as I have been pointing out, this killer app needs to be implemented using emulation so as to be compatible with the hundreds of millions of existing PCs that can be repurposed for this use instead of ending up in a third world landfill. Always forcing people to buy new hardware to get some new sexy feature is not a viable long term solution, and computer manufacturers and computer users alike need to get out of this wasteful mindset.


Turbocharging Bochs

In addition to working on the new emulation tricks to speed up Gemulator and SoftMac, I have also spent the past couple of months applying some basic optimization techniques to the open source Bochs emulator, specifically working on the recently released Bochs 2.3.5 sources (http://bochs.sourceforge.net/). As I mentioned in my October 22 posting, I have cut the boot time of Windows XP running under Bochs to well under 3 minutes.

As I had hoped, the speedup from my Bochs optimizations is also dramatic on older hardware, such as my old home-built 1000 MHz Pentium III. Older microprocessors are not as forgiving as today's Core 2, since they lack the huge L2 caches and improved predictors which make the Core 2 so powerful. It is important when doing optimization work not to fall into the trap of optimizing for today's "latest and greatest" hardware, which is fastest already, but rather, as I do, to pay close attention to older hardware which needs the performance improvement the most.

This week I will give you a taste of some of the Bochs performance improvements I've made and show common optimization mistakes that C++ programmers make when they don't really take into account the true architecture of modern microprocessors, and instead fall into the trap of measuring performance by lines of C++ code or number of C++ operations. At some point in the future I will do the same for PearPC, the open source Mac OS X emulator, which suffers from some rather horrific performance traps.

The times given are for three different computers - my quad-core 2.66 GHz Core 2 based Mac Pro, the 2.53 GHz Pentium 4 I used for the Gemulator and SoftMac benchmarks last month, and the 1.0 GHz Pentium III based home built machine.

WARNING! I will be discussing sequences of C++ code derived from the Bochs 2.3.5 source code which is covered by LGPL (the GNU Library license). If you have an allergy to GPL or to looking at open source code, I am afraid you will need to stop reading now.

Before anything else, I want to thank Stanislav Shwartsman, who is a regular contributor to the Bochs project, for helping me get up to speed on the Bochs code. Not only has he been great to bounce ideas and diffs off of, but if you sync to the current Bochs sources you will already get many of these speedups, as Stanislav has merged them in.

Before starting any kind of optimization work, you need to be able to reliably and reproducibly measure what it is that you are trying to optimize. In over 25 years of looking at other people's code I frequently come across code that just doesn't give the performance that the author intended. This is true for both software and hardware. Performance needs to be thought about and designed into the entire software stack, right down into the gates of the microprocessor. A key factor to doing optimization right is to choose the right scenarios that you are hoping to optimize.

In the case of the Pentium 4, its designers planned for CPU clock speeds to reach about 10 GHz by mid-decade. I remember Intel engineers boasting back in 2000 how Pentium 4 was designed to scale easily up to 10 GHz, and those predictions even made their way into a recent Stephen Hawking book. The Pentium 4 by design did poorly at running existing code, and Intel's own developer documentation even stated that the Pentium 4 was expected to be about 20% slower than a Pentium III at the same clock speed. That design decision was made with the gamble that the performance would be made up by faster clock speeds: 2 GHz, 3 GHz, 10 GHz. The Pentium 4 line quickly made its way into the 3+ GHz range, but things stopped suddenly at 3.8 GHz when power dissipation issues made it impossible to push the clock speed any further.

Similarly, software itself needs to be written with particular usage scenarios in mind or it will miss the mark. It is too easy to believe that today's compilers are so smart that one doesn't need to understand computer architecture, or assembly language, or to look at the compiled code generated by the compiler. Unfortunately, compilers are fairly simple even today, and programmers do need to know the low-level fundamentals in order to write efficient code. People aren't stupid; it's just that colleges put little emphasis on such matters. I cringe at the thought of today's computer science students still learning about linked lists and hash tables, yet not being taught the performance reasons why those data structures may greatly slow down a program.

With emulators and virtual machines, there are several fairly standard textbook techniques of writing an interpreter or jitter. This is the type of code you see in a lot of open source emulators, and it tends to be slow. It is slow not necessarily because emulation has to be slow (it does not!), or even because an interpreter written in C or C++ needs to be slow. There are dozens if not hundreds of minor variants possible with any piece of textbook code, and finding the right variant requires precise benchmarking.

With the subject of "instant on" computing in the news lately, I will focus on two main Bochs benchmarks today: 1 - the time it takes to boot Windows XP in a Bochs virtual machine, and, 2 - a benchmark near and dear to my heart - the performance of my Gemulator 2008 beta emulating an Atari ST inside of a Bochs virtual machine. I have other Windows applications that I benchmark, but I have chosen these two as they represent two very different classes of code. The Atari ST emulator executes a very small core set of code which makes up the 68040 interpreter and is extremely CPU-bound, while the XP boot tends to be more memory-bound, executes a much larger variety of code, and touches a much broader range of x86 instructions.

As I do with other emulators that I work on, I prepared a stable and repeatable set of tests for Bochs. Using a fresh 2 gigabyte Bochs disk image, I booted the Windows XP Service Pack 2 setup CD-ROM in Bochs 2.3.5 and did a clean install of XP SP2. I used the release version of Bochs 2.3.5, and stuck to mostly the default settings in the Bochs 2.3.5 configuration file. The XP virtual machine was configured to have 512 megabytes of RAM, two floppy disks, the 2 gigabyte hard disk, and 1024x768 32-bit VGA graphics with a 15 Hz (70 ms) refresh rate. I chose a somewhat slower refresh rate in order to not drown the benchmark results in VGA refresh times, which were showing up somewhat significantly on my older test PCs. 15 Hz is slow enough to make video refresh a small percentage of CPU cycles without resulting in video that is overly jerky, as can happen in some emulators.

In playing around with Bochs, I found that the way that it simulates guest time (i.e. the passage of time as the code being emulated inside the virtual machine sees it) is based on the number of guest CPU instructions emulated. This makes the repeated testing of a virtual machine fairly deterministic, but results in time compression and time dilation inside the guest. For example, once Windows XP boots up inside of Bochs, if I bring up the Time and Date control panel applet, the guest time first ticks along very slowly, and then, once Windows XP is fully booted and the CPU emulation hits the idle loop in Windows, time suddenly accelerates to about 10 times normal speed. This not only looks silly, it makes it impossible to even double-click, since the emulated Windows XP sees the two mouse clicks as being several seconds apart.

Until I had figured out this time compression, Windows XP was booting up in Bochs in about 3 minutes and 20 seconds, or 200 seconds. My first attempt to remedy this was to change the simulation of the x86 HLT instruction (which is used in the Windows idle code) to behave as a NOP, thus keeping the emulated x86 busy spinning. This still resulted in guest time running at the wrong rate, albeit a consistent wrong rate.
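
The change itself is trivial. A rough sketch of the idea is below; the real handler in Bochs does more bookkeeping than this, so treat it as illustrative rather than the actual patch:

void BX_CPU_C::HLT(bxInstruction_c *i)
{
  // Illustrative only: instead of putting the virtual CPU into its halted
  // state (which lets Bochs compress time until the next interrupt), simply
  // return, treating HLT as a NOP so the emulated CPU keeps spinning through
  // the Windows idle loop.
}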

This required setting another option in Bochs, the "ips" or instructions-per-second setting. Bochs needs a rough estimate of how many x86 instructions are simulated in each second of real time. The Bochs documentation suggests setting this to about 10 million, which I found to be about the right setting for a Pentium III or Pentium 4 host, but on my Mac Pro, which has a 2.66 GHz Core 2, I had to set the ips value to 25 million. Another option I then set is "clock=slowdown", which inserts delays into the guest virtual machine to keep its time from running faster than real time.
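
For reference, both settings live in the bochsrc configuration file. A minimal sketch of the relevant lines follows; the exact directive syntax varies between Bochs versions, so treat this as illustrative:

# rough estimate of guest instructions simulated per real second
# (about 10 million for a Pentium III/4 host, 25 million for a 2.66 GHz Core 2)
cpu: count=1, ips=25000000

# insert delays so guest time cannot run faster than real time
clock: sync=slowdown, time0=local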

Making these settings changes increased the Windows XP boot time in the virtual machine from about 200 seconds to about 230 seconds, or 3 minutes 50 seconds. Despite the slowdown, I felt that this was a more accurate benchmark that more closely simulates what Windows XP would be doing on a real machine, even though it made my optimization job 30 seconds more difficult!

I benchmarked the Windows XP boot process at 6 different reference points, which I will call stages A through F. Stage A is the point where the Windows XP splash screen appears on the monitor. Stage B is where boot-time apps kick in, in this case a quick boot-time defrag tool which does nothing (since the drive is already defragmented). Stage C is the point where the screen shows "Windows is starting up...". Stage D is the "Welcome" screen. Stage E is where the desktop background renders. Stage F is the final rendering of the Start Menu, Taskbar, and desktop icons. I consider stage F to be the end of boot.

Once booted, I copied a set of Windows test programs to the virtual hard disk, activated Windows, cleaned up the hard disk, and defragmented it. This is now the master hard disk image which I boot from each time. This avoids a common mistake in benchmarking of not keeping conditions identical. For example, when benchmarking an emulator that boots from disk images, you have to restore the disk image to the same state after each benchmarking pass. Don't install Google Desktop Search into the virtual machine and then expect it to benchmark the same while search indexing is running. It seems silly, but I have seen professional programmers do this and then scratch their heads wondering why the benchmarks are all off.


Bochs 2.3.5 Benchmark Results


I am going to compare four builds of Bochs based on the Bochs 2.3.5 sources:

Build A: This is the Bochs 2.3.5 reference build. I did not build this one; it is the actual BOCHS.EXE file that is installed from the SourceForge release point. I believe it is built with GCC.

Build B: This is my private build of the released Bochs 2.3.5 sources, made with the VC7.1 C++ compiler found in Microsoft Visual Studio 2003, compiled using the -O2 (optimize for speed) compiler switch.

Build C: This is a build with the various optimizations that I'll discuss, also built with the VC7.1 C++ compiler using the same compiler switches.

Build D: This is a Profile Guided Optimization build made with the Beta 2 release of Visual Studio 2008. The Windows XP boot was used as the training scenario.

Some of you may not be familiar with PGO (Profile Guided Optimizations). This is a feature found in Intel's C/C++ compiler, in Visual Studio 2005 and later, and in many other compiler frameworks. What PGO does is instrument a program so as to collect runtime information about which functions were executed, how often, and also which conditional branches were taken and how often. The compiler then uses the PGO data to produce an "optimized" executable which in theory is faster than the default build. As I will show, this may not always be the case.
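
For those who have never used it, the PGO workflow in Visual Studio boils down to three steps: build an instrumented executable, run the training scenario, then relink using the collected profile data. A rough command-line sketch is below (the switches are the standard VC++ ones; the file names are made up for illustration):

rem 1. compile with whole-program optimization and link an instrumented build
cl /O2 /GL /c *.cpp
link /LTCG:PGINSTRUMENT /OUT:bochs.exe *.obj

rem 2. run the training scenario - in this case, booting the Windows XP disk image
bochs.exe -f bochsrc-xpboot.txt

rem 3. relink using the profile counts collected during the training run
link /LTCG:PGOPTIMIZE /OUT:bochs.exe *.obj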

These are the timings, in seconds, for reaching the various stages of Windows XP boot on the four Bochs builds, as well as Virtual PC 2007 and QEMU 0.90 for comparison:

Mac Pro (Core 2 2.66 GHz) timings:

                                            VPC 2007   QEMU 0.90   Build A   Build B   Build C   Build D
Stage A - splash screen                          7          3         11        11         9         8
Stage B - defrag                                 -          7         26        26        24        25
Stage C - "Windows is starting up..."           17         16         67        68        63        64
Stage D - "Welcome"                             20         23         95        91        86        99
Stage E - desktop wallpaper renders             34         33        221       220       154       197
Stage F - Start button and taskbar render       34         36        236       230       161       209

(Build A = baseline Bochs 2.3.5 release build, Build B = Bochs 2.3.5 recompiled with VC7.1 C++, Build C = Bochs 2.3.5 with new optimizations, VC7.1, Build D = PGO recompile of Build C)

Note: Virtual PC is using a separate disk image which was similarly set up from scratch using XP SP2 but lacks the defrag utility. QEMU 0.90 is booting the same disk image that is used with Bochs.

Although Virtual PC uses hardware virtualization, QEMU uses dynamic recompilation (binary translation), and Bochs is purely interpreted, Bochs is now less than 5 times slower than either VPC or QEMU. This is accomplished with four relatively simple types of modifications which account for most of the speedup:

  • removal of inline assembly for determining the guest EFLAGS state
  • unpacking of data structure from bitfields to bytes
  • using profile data to improve branch prediction rates of effective address calculation and conditional branches
  • making "lazy flags" evaluation less "lazy"

Inline Assembly Isn't Always The Answer

The first mystery that puzzled me was why build B - a recompile of the official Bochs 2.3.5 sources using Visual Studio 2003 - was benchmarking faster than the reference build using GCC. A few weeks ago I demonstrated how Visual Studio, if anything, produces less optimal code than GCC, so I would not have expected it to outperform GCC.

The first tool in a developer's bag of tricks is to make sure that you understand the code that you are running, which means verifying that the program is in fact running the code that you think you compiled. In this case, it was not! I assumed that the inline assembly found in the source file cpu\hostasm.h was being used, as this contains optimizations specific to x86 hosts. With Visual Studio's compilers, add the compiler switch -FAsc to generate listing files which show the original C/C++ source and the compiled x86 assembly code.

I soon realized that the inline asm was #ifdef-ed by the symbol __i386__, which Visual Studio's compiler does not define. As a result, the VC7.1-built executable was using C++ based routines. In looking at the inline asm, I realized that it wasn't very optimal anyway. Here is an example of that inline asm:

// This file defines some convience inline functions which do the
// dirty work of asm() statements for arithmetic instructions on x86 hosts.
// Essentially these speed up eflags processing since the value of the
// eflags register can be read directly on x86 hosts, after the
// arithmetic operations.

// gcc on i386
#if (defined(__i386__) && defined(__GNUC__) && BX_SupportHostAsms)

#define BX_HostAsm_Add16
static inline void
asmAdd16(Bit16u &sum_16, Bit16u op1_16, Bit16u op2_16, Bit32u &flags32)
{
asm (
"addw %3, %1 \n\t"
"pushfl \n\t"
"popl %0"
: "=g" (flags32), "=r" (sum_16)
: "1" (op1_16), "g" (op2_16)
: "cc"
);
}
 

What this code is trying to do is to work around a very common problem in emulation - how to simulate the "condition flags" of the guest CPU. The condition flags are a set of bits which are typically updated after arithmetic operations such as ADD, multiplication, shifts, comparisons, and bit operations. On most microprocessors there are several standard bits:

  •     the Zero Flag (ZF) which indicates that the result of the most recent arithmetic operation was zero,
  •     the Sign Flag (SF), sometimes also called the Negative Flag (NF), which indicates that the most recent result was a signed negative number,
  •     the Carry flag (CF), which indicates that the result generated an unsigned overflow, and thus the true result cannot be represented, and,
  •     the Overflow flag (OF), which indicates that the result generated a signed overflow, and similarly the true result cannot be represented.

The Intel x86 and AMD64 architectures also define two other arithmetic flags:

  • the Adjust Flag (AF), also called Auxiliary Carry, which is the unsigned carry out of the low 4 bits (from bit 3 into bit 4). This flag is used to detect overflow in decimal (BCD) operations, and,
  • the Parity Flag (PF), which is set to indicate that the low byte of the result contains an even number of 1 bits.
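
As an aside, not all of these flags require reading the host's EFLAGS at all. Several fall straight out of the result, as this little sketch of an 8-bit ADD shows (written here purely for illustration using the Bochs-style Bit8u/Bit16u typedefs; it is not the Bochs code):

static void add8_flags(Bit8u op1, Bit8u op2, Bit8u &result,
                       bx_bool &zf, bx_bool &sf, bx_bool &cf, bx_bool &of)
{
    Bit16u wide = (Bit16u) op1 + (Bit16u) op2;   // do the math one size wider
    result = (Bit8u) wide;

    zf = (result == 0);                          // Zero: result is zero
    sf = (result & 0x80) != 0;                   // Sign: top bit of the result
    cf = (wide & 0x100) != 0;                    // Carry: unsigned overflow out of bit 7
    of = ((op1 ^ result) & (op2 ^ result) & 0x80) != 0;
                                                 // Overflow: both operands have the
                                                 // same sign but the result differs
}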

The first four flags are pretty standard on many computer architectures, and generally are simulated by performing the arithmetic operation on the host CPU, reading the host CPU's arithmetic condition flags, and recording those flags in the simulated guest CPU's state. This is precisely what Bochs is doing in the above code, performing a 16-bit ADD operation, then using the x86 "Push Flags" instruction PUSHF to push the host EFLAGS register to the stack, and popping that data into an integer register. x86 and AMD64, unlike most instruction sets, do not have an instruction to directly read the six arithmetic flags. The closest x86 has is the LAHF instruction ("Load AH with Flags"), which unfortunately reads only five of the flags, not Overflow.

LAHF is actually the instruction used by Gemulator and SoftMac to simulate the condition flags of 68000 and 68040 instructions, as many common instructions are defined specifically to always clear the Overflow flag. As such, LAHF can be used to read the flags that matter. Unfortunately, for reasons unknown, AMD removed the LAHF instruction from the 64-bit instruction set in 2002, and developers of virtual machines needed to look to other techniques to simulate condition flags without using LAHF. PUSHF is such a workaround, with the penalty of being significantly slower than LAHF, on some CPUs costing 10 clock cycles or more due to the back-to-back stack operations and the data dependency in the code sequence.
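
To make the comparison concrete, here is what an LAHF-based version of the asmAdd16 helper above might look like, in the same GCC inline-asm style. This is only a sketch: it returns the flags in LAHF's AH layout rather than the EFLAGS layout, and of course it cannot capture the Overflow flag:

static inline void
asmAdd16_lahf(Bit16u &sum_16, Bit16u op1_16, Bit16u op2_16, Bit32u &flags8)
{
asm (
"addw %3, %1 \n\t"
"lahf \n\t"          /* AH = SF:ZF:0:AF:0:PF:1:CF */
"movzbl %%ah, %0"    /* no stack traffic, shorter dependency chain */
: "=a" (flags8), "=r" (sum_16)
: "1" (op1_16), "g" (op2_16)
: "cc"
);
}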

Eliminating sequential data dependency is a fundamental optimization technique in achieving instruction level parallelism, or "ILP" for short. ILP is a measure of how many instructions a CPU core can execute simultaneously during the same clock cycle. Traditional microprocessors such as the Motorola 68000 and Intel 386 are able to only decode, execute, and "retire" one instruction at a time, at a cost of several clock cycles per instruction. There is no instruction level parallelism in those architectures.

Architectures such as those found in the Motorola 68040 and Intel 486 operate on several instructions during one clock cycle, simultaneously fetching and decoding one instruction, starting execution of another, and retiring yet another. These are known as pipelined architectures, because the execution core resembles a pipeline or assembly line of instructions being processed, but the overall throughput does not exceed one instruction per clock cycle. They still do not achieve instruction level parallelism.

With the introduction of the Intel Pentium Pro, Intel delivered the "P6" architecture which is a true superscalar pipeline, capable of fetching three instructions per clock cycle, executing three instructions per clock cycle, and retiring three instructions per clock cycle. That is ILP. Written correctly, code can literally do three things at once. The Core 2 improves on that by widening the pipeline to allow up to 4 instructions to execute per clock cycle.

ILP works best when the code does not have long data-dependent chains of instructions. In the above code, ADD, PUSHF, POP, and the subsequent store into the flags32 variable are such a data dependency chain, made worse by the fact that there is a memory write followed immediately by a memory load which is dependent on that write, requiring the hardware to perform store-to-load forwarding, which on some implementations has extra penalties of its own.
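
To see why dependency chains matter so much, consider two ways of summing an array. The first forms one long chain in which every addition must wait for the previous one; the second keeps two independent accumulators that a superscalar core can advance during the same clock cycle. This is a contrived illustration, not Bochs code:

unsigned sum_chained(const unsigned *p, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i];                  // every add depends on the previous add
    return sum;
}

unsigned sum_paired(const unsigned *p, int n)
{
    unsigned sum0 = 0, sum1 = 0;
    for (int i = 0; i + 1 < n; i += 2)
    {
        sum0 += p[i];                 // these two adds do not depend on each
        sum1 += p[i + 1];             // other and can execute in parallel
    }
    if (n & 1)
        sum0 += p[n - 1];             // pick up the odd element, if any
    return sum0 + sum1;
}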

It actually turns out that the C++ code which was used in its place, the so-called lazy flags evaluation code, gives slightly faster results. I'll discuss this in detail next week.


Do Not Overpack

Travelling lightly is good advice when going on vacation, and it is good advice inside of an interpreter, too. If you look at the file cpu\cpu.cpp, you will see the Bochs main interpreter loop. Recall how a few weeks ago I gave an example of a simple CPU interpreter loop, one that is structured like this:

void Emulate6502()
{
    register short unsigned int PC, SP, addr;
    register unsigned char A, X, Y, P;
    unsigned char memory[65536];

    memset(memory, 0, 65536);
    load_rom();

    /* set initial power-on values */
    A = X = Y = P = 0;
    SP = 0x1FF;
    PC = peekw(0xFFFC);

    for(;;)
        {
        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            break;

        case opIncX:
            X++;
            break;

        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            break;

...
            }
        }
}

You would not actually write an interpreter in this manner for various efficiency reasons, but the general bytecode interpreter inside of a virtual machine, even inside a Java virtual machine, has such an inner execution loop. Bochs is no exception, using an inner loop to first fetch and decode the next instruction, then to calculate the effective address of that instruction (if it references memory), and then to call a handler which simulates that one instruction. The handler functions are equivalent to each of the "case" statements in the sample above.

To pass information about the current instruction between those three parts of the loop, Bochs uses a common data structure to represent the decoded state of the current instruction. This is defined as the bxInstruction_c class in the source file cpu\cpu.h. In Bochs 2.3.5, a portion of that class looks something like this:

void (BX_CPU_C::*ResolveModrm)(bxInstruction_c *) BX_CPP_AttrRegparmN(1);
void (BX_CPU_C::*execute)(bxInstruction_c *);

// 26..23 ilen (0..15). Leave this one on top so no mask is needed.
// 22..22 mod==c0 (modrm)
// 21..13 b1 (9bits of opcode; 1byte-op=0..255, 2byte-op=256..511.
// 12..12 BxRepeatableZF (pass-thru from fetchdecode attributes)
// 11..11 BxRepeatable (pass-thru from fetchdecode attributes)
// 10...9 repUsed (0=none, 2=0xF2, 3=0xF3).
// 8...8 extend8bit
// 7...7 as64
// 6...6 os64
// 5...5 as32
// 4...4 os32
// 3...3 (unused)
// 2...0 seg
Bit32u metaInfo;

// 27..20 modRM (modrm)
// 19..16 index (sib)
// 15..12 base (sib)
// 11...8 nnn (modrm)
// 7...6 mod (modrm)
// 5...4 scale (sib)
// 3...0 rm (modrm)
Bit32u modRMData;

The decoder fills in the execute field with a function pointer to a handler function which will simulate a given instruction. The execute field cannot be NULL, it must point to a valid function.

The ResolveModrm field contains another function pointer, this one which points to a function which will calculate the instruction's memory effective address. This field can be NULL, since most register operations do not operate on an effective address.

Other details about the instruction, such as the opcode and sub-opcode, the memory addressing mode, the addressing size, etc. are encoded as bitfields in two 32-bit integers called metaInfo and modRMData. These troubled me!

Looking at the code in cpu\fetchdecode.cpp that sets those various fields and bitfields, and then looking at the code in the various handlers that reads those fields, I realized there was a lot of unnecessary bit twiddling going on. If you think about it, the fetch and decode stage reads x86 bytecode and extracts various bitfields into integers. Those integer values are then repacked into bitfields in metaInfo or modRMData. The handlers then have to immediately unpack those bitfields back into integers. This is wasteful to say the least, especially considering that the Pentium 4 is fairly inefficient at bitfield operations.

My solution replaced the various bitfields, none of which is bigger than 8 bits in size, with a series of 8-bit integer fields. This avoids the redundant packing and unpacking of bitfields and the associated shifting and masking operations.
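
The shape of the change looks something like this; the field and accessor names below are illustrative rather than the exact ones that ended up in the Bochs tree:

// Before: the decoder packs the SIB base into bits 15..12 of modRMData,
// and every handler that needs it pays for a shift and a mask:
//
//     unsigned sibBase() const { return (modRMData >> 12) & 0xf; }

// After: the decoder stores each field once, as a plain byte, and the
// handlers simply read it back with no shifting or masking:
Bit8u modRM;
Bit8u mod;
Bit8u nnn;
Bit8u rm;
Bit8u scale;
Bit8u index;
Bit8u base;

unsigned sibBase() const { return base; }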


Branch Prediction and PGO

In the 1980's and early 1990's, CPU pipelines were short, throughput was at best one instruction per clock cycle, and code optimization was relatively easy to do. A memory load, a memory store, a branch, a jump, a call, or an arithmetic operation all cost about the same - perhaps one clock cycle, two, three, in rare cases 6 or 8. Things like branch predictors and L1/L2 caches did not even exist on most microprocessors, and so programmers didn't have to code for them.

As clock speeds got faster, caches were added to hide the increasing latency of reading memory, which today can be 200 or 300 clock cycles for a "missed" read. To compensate for these "pipeline bubbles", pipelines were made longer so as to be able to work on more instructions. Should one instruction take a long delay to read memory, other instructions could proceed through the CPU pipeline, provided of course there were no data dependencies on the stalled instruction.

Unfortunately, the CPU can only cheat so much. It eventually fetches and decodes a flow control instruction with no specific target address - an indirect jump, an indirect function call, a function return, a conditional branch, or more rarely a trap - where the address of the next instruction is not immediately known. In the case of indirect jumps and function calls/returns, the pipeline usually has to empty such that the address of the indirection can be calculated. In the case of a conditional branch, the pipeline similarly has to empty so that the result of some arithmetic operation, and consequently the arithmetic flags that it updates, can be known with certainty. Only then can a conditional branch either get taken or fall through.

To work around these frequent stalls, engineers developed the concept of speculative execution, which relies on branch prediction to "guess" at the address of the next instruction. If the guess turns out to be correct, or predicted, the pipeline is kept full with valid instructions. If the guess turns out to be wrong, or mispredicted, the CPU pipeline is thrown out and instructions start getting fetched and decoded from the correct address.

A branch misprediction is a costly event, wasting approximately 10 to 20 clock cycles, or in ILP terms, 30 or more instructions that could have completed during that time. Literally, one branch misprediction can be costlier than executing even 20 integer arithmetic operations.

In recent years, programmers have started coming up with branchless code sequences, those which implement something like a C++ "if/else" statement without using flow control. I will get into this concept more next week, but for now, I want you to understand that the worst kind of branches are those with close to a 50-50 probability of getting taken or not taken. Branches that lean heavily in the taken or in the not-taken direction are easier for the hardware to predict.

Also, indirect function calls are easier to predict when the number of possible call targets is small, or even one. For example, if you initialize a function pointer at boot time and never change it again, such as selecting between one of two different square root functions based on some CPU feature, chances are good that the CPU will soon predict the correct function.
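
A contrived example of such a well-behaved indirect call (all names invented for illustration):

double sqrt_x87(double x);                 // hypothetical implementations
double sqrt_sse2(double x);

static double (*fast_sqrt)(double);        // selected once at startup

void init_math(bool has_sse2)
{
    // the pointer never changes again, so the indirect calls through it
    // quickly become perfectly predicted
    fast_sqrt = has_sse2 ? sqrt_sse2 : sqrt_x87;
}

double hypotenuse(double a, double b)
{
    return fast_sqrt(a * a + b * b);       // indirect call with a single target
}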

What a programmer, even a C++ or Java programmer, therefore wants to be on the lookout for are branches and indirections that will be difficult for the hardware to predict. You have to assume the worst in those cases and plan on taking a 10 or 20 clock cycle stall every time such code is executed.

Two such glaring examples are sitting right in the main execution loop in Bochs, and they have to do with those two function pointers we just saw - execute and ResolveModrm. Each function pointer is referenced during each iteration of the guest instruction loop. Both function pointers can end up pointing at dozens of different functions. ResolveModrm points to one of several different address generation functions found in the source files cpu\resolve16.cpp, cpu\resolve32.cpp, and cpu\resolve64.cpp. execute points at one of the many x86 opcode handlers throughout the code. Both function pointers can change on each iteration of the loop and can be highly unpredictable.

The code in question is in cpu\cpu.cpp and looks like this:

    if (i->ResolveModrm)
    {
    BX_CPU_CALL_METHODR(i->ResolveModrm, (i));
    }

    BX_CPU_CALL_METHOD(i->execute, (i));

The function pointed to by the i->ResolveModrm pointer is only called if that pointer is non-zero. On the other hand, the i->execute pointer is always called. There is a significant difference here in terms of branch misprediction, in that the address resolution requires two predictions - one to predict whether the pointer is non-NULL, and one to then predict the target address of that pointer. In the worst case, you can end up having two branch mispredictions. The execute pointer, on the other hand, has a worst case of just one branch misprediction.

One thing to consider doing then is to avoid the "if" statement completely and rather always call i->ResolveModrm by replacing the NULL cases with a pointer to a dummy function that simply returns, or what is called a stub function. Which is faster? To check a pointer for being NULL and thus introduce a second branch prediction, or to always call the function even if it is just a stub?
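
The stub approach would look something like this (a sketch only; the actual names chosen in the Bochs code may differ):

// a do-nothing effective address "resolver" for register-only instructions
void BX_CPU_C::ResolveNone(bxInstruction_c *i)
{
  // nothing to compute
}

// in the decoder, instead of leaving i->ResolveModrm NULL for register
// operations, point it at the stub:
//     i->ResolveModrm = &BX_CPU_C::ResolveNone;

// which lets the main loop drop the "if" and always make the call:
BX_CPU_CALL_METHODR(i->ResolveModrm, (i));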

To make this kind of coding decision, you need data, real data based on running real scenarios. A basic question is, what percentage of time is the pointer NULL?

To answer this question, I turned to the Profile Guided Optimization (PGO) feature in Visual Studio. This allows me to run Bochs in a profiler and actually get counts and taken/not-taken percentages for each and every branch in the program. For the "if" statement in question, the results were clear:

; 251 : if (i->ResolveModrm)
000ec 8b 03 mov eax, DWORD PTR [ebx]
000ee 85 c0 test eax, eax
000f0 0f 85 4b 02 00 00 jne $LN1201@cpu_loop
; taken 365376736(36%), not-taken 623926243(63%)

Out of approximately one billion x86 instructions simulated, about 36% of the time the pointer is NULL, indicating that about 2 in 3 x86 instructions contain a memory address that needs to be calculated, which thankfully makes sense. The rest of the time, any one of dozens of functions gets called. Each of those address resolving functions looks something like these two in cpu\resolve32.cpp:

void BX_CPU_C::Resolve32Mod0Rm7(bxInstruction_c *i)
{
RMAddr(i) = EDI;
}

void BX_CPU_C::Resolve32Mod1or2Rm0(bxInstruction_c *i)
{
RMAddr(i) = EAX + i->displ32u();
}

Each of the dozens of functions is very similar. Some use register EAX, some use ECX, some use EDX, and so on. Some add a displacement to the base register and some do not. I had a hunch that the vast majority of address resolution is in the form [reg] or [reg+displacement], and in fact the profile data showed that 87% of function resolution was of these two forms. And considering that [reg] is really the same as the address [reg+0], it is all the same form. So I decided to trade a branch misprediction for a memory load and an arithmetic ADD operation, by combining 16 such functions into a single function:

void BX_CPU_C::Resolve32DispBase(bxInstruction_c *i)
{
Bit32u base = BX_READ_32BIT_REG(i->sibBase());

RMAddr(i) = base + i->displ32u();
}

This now means that 36% of the time the pointer points at the stub, and 87% of the remaining time, or about 55% of the time overall, this new Resolve32DispBase function is called. Therefore, by using a stub function and merging 16 similar functions into one, the "if" and the indirect call can be replaced by a single indirect function call whose dominant target is about as predictable as the original "if" statement was. In other words, one of the two branch mispredictions is just about eliminated.

The moral here is, don't mix "if" statements and indirect function calls if you can combine them into one.

A similar example of such double branch misprediction is in the handlers which simulate the guest x86 conditional branches. The Bochs 2.3.5 code is written so that all conditional branches go to a common JCC handler, such as the one in ctrl_xfer32.cpp. Within that handler is then immediately a switch statement which jumps to one of 16 condition handlers. Only two cases, the Jump Zero and Jump Not Zero instructions (JZ and JNZ), jump directly to a unique handler. The better thing to do was to eliminate the common JCC handler and its switch statement completely, and to go with 16 unique handlers for the 16 branch conditions.
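
In other words, rather than one JCC handler containing a 16-way switch on the condition code, the decoder dispatches straight to a dedicated handler per condition, something along these lines (a sketch using Bochs-style names, not the literal code):

void BX_CPU_C::JZ_Jd(bxInstruction_c *i)
{
  if (get_ZF()) {
    branch_near32(EIP + (Bit32s) i->Id());   // take the branch
  }
}

void BX_CPU_C::JNBE_Jd(bxInstruction_c *i)
{
  if (! get_CF() && ! get_ZF()) {            // "above": neither carry nor zero
    branch_near32(EIP + (Bit32s) i->Id());
  }
}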


PGO's Ironic Twist

So far so good; the clock cycles are just falling off the benchmarks. But I was puzzled that Build D, the PGO-optimized version of Build C, was consistently slower in the benchmarks. After all, it is in theory compiled based on the scenario profiling results, so I would have expected it to be the fastest build of all!

The answer is quite ironic, but for now I will leave this as the puzzle for you to solve for next week, when I will also go into the details of lazy flags evaluation.


Keep those comments and ideas coming by emailing me at darek@emulators.com.

[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Part 8]  [Part 9]  [Part 10]  [Next]