NO EXECUTE! July 25, 2010

Part 33

"Beyond debugging, migration to emulated environments opens the door to cross-platform migration of virtual machines. A virtual machine running on a slow ARM chip in a mobile device could be migrated to a fast desktop or server machine. Code running in abstract machines such as the JVM or .NET CLR could have the JIT cache invalidated and be recompiled from bytecode for the new native platform. Currently, virtual-to-emulated (V2E) and emulated-to-virtualized (E2V) migrations using QEMU are still experimental. They are likely to become a significant feature of Xen in the future, however."

- The Definitive Guide to the Xen Hypervisor, David Chisnall, Prentice Hall, 2008

Prophetic words! This quotation from a book on virtualization published two years ago alludes to the near future, where someday virtual machines will migrate freely across devices without worry about which chip vendor's hardware is being used. I could not have written a better introduction to this week's discussion on QEMU. Since I have discussed Virtual PC, PowerPC, ARM, and Bochs recently, let's now take that closer look at QEMU.

A Closer Look At QEMU

As I mentioned in previous postings, three years ago I first looked at QEMU version 0.9 hosted on Windows and quickly dismissed it as junk. It was too slow (in some cases even slower than the Bochs interpreter) and too buggy. My interest in QEMU was ignited again last summer when I happened across a new Windows port of QEMU called, appropriately enough, WinQEMU, hosted at this SourceForge page: http://sourceforge.net/projects/winqemu/

I found WinQEMU 0.10.2 to be measurably faster than the previous 0.9 build I had tried two years earlier. As I have alluded to this summer, I have been following QEMU, enlisting in the sources for what have now been the 0.12.4 and 0.12.5 releases, and have seen enough improvement to put my work on Bochs on hold for a while in order to evaluate QEMU more deeply.

As fortune would have it, the latest point release of QEMU, version 0.12.5, was released just days ago on the QEMU web sites:

http://www.qemu.org/

http://www.qemu.com/

For QEMU novices, I would suggest grabbing these 0.12.5 sources as your starting point. Until there is a significant new point release in the future, I will also continue to use the 0.12.5 release as my reference for future discussions.

I recently exchanged a few emails with Yan Wen, the author of WinQEMU, and he assures me that he is porting version 0.12.4 (hopefully 0.12.5 now) to Windows and will update the WinQEMU release within a month or two.

More adventurous readers can enlist into QEMU directly. To do this on Linux (and I have successfully done this on both Fedora 12/13 and Debian 5.0.5 releases), you will need to make sure that you install the gcc and g++ packages (which you would already have for building Bochs), the git or git-core package for installing the Git source control system, the make package, and the SDL development package (on Debian you will want the libsdl1.2-dev package).

On Fedora 12 and 13, use the "yum" command to install packages, such as:

yum -y install gcc

while on Debian, use the similar "apt-get" command:

apt-get install gcc

A few other packages I find very convenient to install include "rdesktop", which permits a Linux machine to Remote Desktop into a Windows machine, and "ntfs-3g", which gives read-write file access to any Windows NTFS partition which may reside on the same machine. Since I dual-boot regularly between Windows 7 and Linux on several of my machines, I like to keep all my virtual machine disk images, test programs, and source on common NTFS partitions, which I can then easily access from both Windows 7 and Linux.

For convenience in building both Bochs and QEMU and other open source projects you might enlist in, I suggest also installing the packages "cvs" and "subversion" (which are two other popular source control systems), "xterm" to make sure that the X11 graphical subsystem is installed, "bochs" and "qemu" themselves in order to install prerequisites such as BIOS images and device models, "wine" which allows running some Windows applications directly from Linux, and "bximage" for creating Bochs disk image files.

Generally, the configure script of any given project will point out any components which you are missing and will prompt you to install the appropriate missing packages.

Evaluating QEMU

For today I am going to look at three particular builds of QEMU - that terrible 0.9 release from 2007, the WinQEMU 0.10.2 release from 2009, and the current 0.12.5 release which I have built on both 64-bit PowerPC G5 and 64-bit x86-64 Fedora Linux machines from the tip-of-tree sources.

Bochs, QEMU, the Xen hypervisor, and the KVM virtualization built in to current Linux releases are all related. Bochs and QEMU started out sharing very similar BIOS and VGA BIOS sources as well as device models - Bochs being the purely interpreted x86 virtual machine, and QEMU being the "jit" (dynamic binary translation) based x86 virtual machine. Over the years the two have diverged, and QEMU now also supports emulating 68040, PowerPC, ARM, MIPS, SPARC, and other architectures.

More recently, the QEMU framework has been used as the basis for VT-based hypervisors such as KVM, Xen, and VirtualBox. While I disagree with the VT approach of course, the nice thing about using QEMU as a starting point is that virtual machine disk images can easily be reused on different virtualization products. For my testing, I have several Windows 2000, Windows XP, Windows 7, and Linux disk images which I originally created in Bochs and which I also test with the various QEMU releases and even KVM. The virtual machines do not have to be re-built from scratch for each hypervisor; rather, those base images I create in Bochs can be used as-is everywhere.

For my testing I mainly used my oldest Windows 2000 disk image due to its small size, quick boot time, and because it is the least cluttered of my images. The Windows 2000 disk image holds a complete Windows 2000 Workstation Service Pack 4 install, plus Microsoft .NET 2.0 framework, Firefox 3.6, Visual Studio 98, and all of my test programs including the Gemulator 9 sources.

Why Visual Studio 98? Because that is the compiler I have used for over a decade to build the Gemulator product on Windows and because it has a ridiculously small disk footprint (roughly 18 megabytes for the command line build tools, header files, and libraries, and an additional 36 megabytes for the VS98 IDE). Sticking to the same build tools gives me a consistent way to compare benchmarks over the years without worrying that I am introducing variability due to compiler changes and thus code quality changes. For the same reasons, I tend to do most of my Bochs and QEMU testing using the older Windows 2000 and XP images instead of constantly changing what I am testing month to month.

My regular readers will be familiar with the other tests which I used to evaluate QEMU, as I have referenced several of them and/or provided source code in past postings of NO EXECUTE!:

CPU_TEST - my framework of hundreds of small (and mostly assembly language) x86 and Windows micro-benchmarks
HDTEST32 - a utility I wrote to measure raw unbuffered disk write throughput using various block write sizes
T1FAST, T1SLOW, SIMP - small Visual Studio compiled C test programs I wrote which measure common function calling and integer code patterns
MEMBAND - a utility I wrote to measure memory copy bandwidth based on block copy size and misalignment between source and destination blocks
LPIPE - a utility I wrote which measures the speed of sending data via Windows pipes
CPU-Z - a third party utility from http://www.cpuid.com/ which displays processor information such as model and x86 instruction capabilities
FRACTAL - my first 32-bit Windows program from about 15 years ago, calculates and displays a fractal image
TESTFLOAT - a third party utility from http://www.jhauser.us/arithmetic/index.html which verifies x87 floating point correctness
111 - a custom test I wrote to check for common virtual machine memory implementation errors
ND32 - a custom set of micro-benchmarks which evaluate branch prediction performance
And a few others, such as managed C# variants of SIMP and some scripted Microsoft Office tests.

When I test and benchmark Bochs, or QEMU, or KVM, I run these tests as if I was evaluating a native x86 processor directly. When I show results below, I will be clear as to which virtual machine product was running on which host x86 or PowerPC hardware, and make it clear in the few cases where I am running a test directly on bare metal.

The Read-Modify-Write Bug

In my 2008 paper "Virtualization Without Direct Execution", I provided CPU_TEST T1FAST and T1SLOW benchmark results to show how QEMU 0.9's performance was barely faster than that of Bochs 2.3.7 after Stanislav and I had gone through and cleaned up the Bochs x86 interpreter engine. What I didn't go into much detail about were some rather blatant and serious x86 correctness errors in QEMU.

The most serious in my opinion is one which is ridiculously easy to reproduce and results in incorrect data being written to memory. The scenario is this:

allocate some memory using the Windows VirtualAlloc API (which I discussed in detail back in Part 4)
perform a memory read-modify-write operation such as an addition operation to an integer in memory
read the integer to verify that the value in memory is correct

It is a pretty simple test, which I originally wrote over 5 years ago to verify a bug I had discovered in Virtual PC 7, which unfortunately I reported to Microsoft too late to get into that final Virtual PC 7.02 release. The core code of the test consists of two inlined assembly language instructions:

__asm stc ; set Carry flag
__asm adc dword ptr [eax],1 ; 0 + 1 + Carry should equal 2

The STC instruction (set carry) ensures that the x86 Carry Flag is in a known state; for this test I set the Carry. The ADC (Add with Carry) instruction performs a 3-input addition, taking the value in memory pointed to by register EAX, adding the constant 1 to it, and also adding the Carry Flag. The result is then written to memory, and the new value of the Carry Flag reflects the result.
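To make the three-input addition concrete, here is a minimal C sketch of what ADC computes for a 32-bit operand. It models only the arithmetic result and the carry out, not the other flags, and the function name adc32 and its signature are hypothetical, chosen purely for illustration.

```c
#include <stdint.h>

/* Model of a 32-bit x86 ADC: returns dest + src + carry_in and stores the
   resulting Carry Flag (the carry out of bit 31) in *carry_out.
   adc32 is a hypothetical helper name used only for this illustration. */
uint32_t adc32(uint32_t dest, uint32_t src, int carry_in, int *carry_out)
{
    /* Do the addition in 64 bits so the carry out of bit 31 is visible. */
    uint64_t wide = (uint64_t)dest + (uint64_t)src + (uint64_t)(carry_in != 0);
    *carry_out = (int)(wide >> 32);
    return (uint32_t)wide;
}
```

With a zero-filled dword and the Carry set, adc32(0, 1, 1, &cf) returns 2 and leaves cf at 0, while adc32(0xFFFFFFFF, 0, 1, &cf) wraps around to 0 and sets cf to 1.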

Now, in the most trivial case, when the test program is freshly launched and any allocated memory is filled with zeroes, the value written to memory will be 0 + 1 + 1 = 2. If this was a global variable in memory, it would start with 0 and end up with the value 2. Easy. How could a virtual machine possibly blow such a simple piece of code?

Well, when this code sequence is run on Virtual PC 7.02, the value written to memory ends up being 1. Not 2, 1.

This is flat out a correctness error in Virtual PC's x86 integer and memory emulation which could result in anything from something as simple as a program malfunctioning to something as serious as a security exploit. If the value being written to memory is, say, a pointer, and the value of the pointer is now corrupted, a crash or data corruption could easily occur.

Any wonder then that many third party Windows programs simply fail to run in Virtual PC for Mac? There is the likely culprit.

How does this possibly happen?!?!?!?! The key lies in understanding how Windows initializes memory and looking at some extra debugging information I have instrumented into my test program.

Windows, like Linux, runs user mode applications using virtual memory address translation. A pointer in a Windows program does not point to physical memory, rather it is a pointer into virtual address space which is then translated by the TLB and page tables to the actual physical address. In a virtual machine such as Virtual PC or QEMU, this translation may be done in software. Read posting Part 8 for some background into how this is done.

On top of that, Windows and Linux are lazy about even allocating the physical memory until it is actually used, which is sort of the whole point of virtual memory. A Windows program can allocate 100 gigabytes of virtual memory even on a computer with perhaps two gigabytes of actual physical RAM. The operating system allocates the mapping of virtual to physical memory as needed, swapping pages of physical memory out to the pagefile to give the illusion of there being 100 gigabytes of memory on a two-gigabyte machine.

A further optimization which Windows performs is to not even assign the physical memory page until a given page of virtual memory is actually written to. The write operation is the key, because for as long as you merely allocate memory and read from it, you will read zero values. The operating system plays page table tricks to map all of the pages of a block of virtual memory to a common 4K zero page. For example, that 100-gigabyte allocation would consist of over 26 million pages, but Windows merely needs to set those 26 million page table entries to all point to the exact same 4K page that contains nothing but zeroes and which is marked as a read-only page.
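The "over 26 million pages" figure is easy to check. The following small sketch assumes that a gigabyte means 2^30 bytes and uses the 4K x86 page size mentioned above; pages_needed is a hypothetical helper name.

```c
#include <stdint.h>

/* Number of pages needed to cover `bytes` of virtual memory, rounding up.
   pages_needed is a hypothetical helper; 4096 is the 4K x86 page size. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Calling pages_needed(100ULL << 30, 4096) gives 26214400, so a 100-gigabyte allocation needs a little over 26 million page table entries.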

When a write finally occurs, the x86 processor throws an Access Violation exception, since the page table has that zero page marked as read-only. At this point, the Windows kernel then allocates the new physical page (possibly swapping one out to the pagefile to make room), changes the page table entry in the program's page table to now make that 4K block of virtual memory translate to the newly allocated physical page, and then restarts the faulting x86 instruction to complete the write.

As a further optimization, Windows may not even allocate the page table entries themselves until they are referenced. In other words, page tables themselves are swapped in and out of the pagefile, so just the mere act of reading newly allocated virtual memory can cause an Access Violation as well. Windows uses this trick to lazy allocate page table memory, and it is during this read fault that Windows will initialize the page table entry to point to that common zero page.

With this knowledge, you can see that the ADC instruction could cause two faults, a read Access Violation fault, followed by a write Access Violation fault. This is in fact exactly what both QEMU 0.9 and Virtual PC 7.02 generate - a read fault and a write fault. It is also wrong!

It is wrong because a real x86 processor implements read-modify-write memory operations as a write snoop to the memory bus. Since the ADC instruction knows that it will perform both a read and a write, ADC (like other arithmetic instructions capable of read-modify-write operations such as ADD, SUB, INC, DEC, XOR, AND, etc.) asks the memory bus for exclusive write access to that memory location. This is so that other cores can flush any cached (and soon to be stale) copies of that memory. Real x86 hardware, whether Intel or AMD and regardless of the processor model, generates only the write fault. The Windows kernel then sees that fault and bundles all of its actions together - allocate a page table entry, allocate a physical page, map the entry to the page, and mark it writable - in one single round trip to the kernel. When the ADC instruction is then restarted, the memory is writable and the write succeeds.

Where Virtual PC 7.02 screws up is that it seems to treat the read-modify-write operation as three separate operations consisting of the read from memory, the addition operation, and the write. What would then happen is this:

The initial read faults, Windows maps the memory pointed to by EAX to the common zero page.
The block of jitted code corresponding to the ADC instruction is executed again from the beginning, this time the read succeeds and reads the value 0.
The addition is performed, getting a result of 2 and clearing the Carry Flag (since 0+1+1 does not cause arithmetic overflow).
The write of 2 is attempted. This causes a write exception, Windows now goes and maps the page table entry to a fresh physical page.
The block of jitted code corresponding to the ADC instruction is executed yet again from the beginning, repeating the read (which reads zero), repeating the addition, which now gives 0+1+0=1 due to the Carry Flag already being cleared, and writing the value of 1.

Memory gets corrupted and the program fails. Virtual PC 7.02 has two gross errors:

it treats a read-modify-write operation as two distinct memory operations, and,
it updates the guest x86 register state such as the Carry Flag before it knows that the whole guest instruction will succeed.
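The failing sequence can be replayed in a few lines of C. This is a toy model under stated assumptions, not code taken from Virtual PC, QEMU, or the test program: a single page and a single dword stand in for guest memory, and struct guest, fresh_guest, os_handle_fault, adc_split, and adc_atomic are all hypothetical names.

```c
#include <stdint.h>

/* Hypothetical states for the single guest page that EAX points into. */
enum page_state { NOT_MAPPED, ZERO_READ_ONLY, WRITABLE };

struct guest {
    enum page_state page;
    uint32_t mem;       /* dword at [EAX]; a fresh page always reads as 0 */
    int carry;          /* guest Carry Flag */
    int read_faults;    /* faults the guest OS saw for reads */
    int write_faults;   /* faults the guest OS saw for writes */
};

/* Guest state right after STC, before the page has ever been touched. */
struct guest fresh_guest(void)
{
    struct guest g = { NOT_MAPPED, 0, 1, 0, 0 };
    return g;
}

/* Stand-in for the guest OS: a read fault maps the shared zero page,
   a write fault maps a fresh writable page (still all zeroes). */
void os_handle_fault(struct guest *g, int is_write)
{
    if (is_write) { g->write_faults++; g->page = WRITABLE; }
    else          { g->read_faults++;  g->page = ZERO_READ_ONLY; }
}

/* Flawed model: read and write fault separately, the Carry Flag is
   committed before the write, and every fault restarts the instruction. */
uint32_t adc_split(struct guest *g, uint32_t imm)
{
    for (;;) {
        if (g->page == NOT_MAPPED) { os_handle_fault(g, 0); continue; }
        uint64_t sum = (uint64_t)g->mem + imm + (uint64_t)g->carry;
        g->carry = (int)(sum >> 32);      /* flag committed too early */
        if (g->page != WRITABLE) { os_handle_fault(g, 1); continue; }
        g->mem = (uint32_t)sum;
        return g->mem;
    }
}

/* Hardware-like model: one write-intent access, flags committed last. */
uint32_t adc_atomic(struct guest *g, uint32_t imm)
{
    if (g->page != WRITABLE) os_handle_fault(g, 1);
    uint64_t sum = (uint64_t)g->mem + imm + (uint64_t)g->carry;
    g->mem = (uint32_t)sum;
    g->carry = (int)(sum >> 32);
    return g->mem;
}
```

Starting from fresh_guest(), adc_split(&g, 1) leaves 1 in memory after one read fault and one write fault, while adc_atomic(&g, 1) leaves 2 after a single write fault.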

QEMU 0.9 suffers from the first error as well, generating distinct read and write faults, and thus failing that portion of the test. Sadly, WinQEMU 0.10.2 as well as the latest QEMU 0.12.5 both fail this first simple test.

QEMU does correctly write the value 2 in this case, which at least indicates that it is buffering writing out the arithmetic flags state until after the write has succeeded. Memory at least is not corrupted by QEMU, but the incorrect fault on read could be detected by code running inside of the QEMU virtual machine.

But, it gets worse. If one recalls the AMD Phenom TLB bug, all sorts of ugly things happen on real hardware (causing all sorts of potential race conditions) when a memory access spans two pages of memory. This can happen when a multi-byte memory access, such as the ADC example above, targets an address at the last byte of a page, which thus causes two virtual pages of memory, two physical pages of memory, two L1 cache lines, and two page table entries to be accessed. The AMD and Intel manuals are full of errata discussing these kinds of problems due to the potential race conditions caused by having to sequence a lot of data operations between caches and memory. The well publicized 2008 bug in the AMD Phenom is an example of such a serious hardware error.
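Whether a given access spans a boundary is a one-line check. The sketch below assumes power-of-two block sizes (4096 for a page, 64 for a typical cache line) and a non-zero access size; spans_boundary is a hypothetical helper name.

```c
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at `addr` crosses a
   boundary of `block` bytes. `block` must be a power of two and `size`
   is assumed to be non-zero. spans_boundary is a hypothetical name. */
int spans_boundary(uint32_t addr, uint32_t size, uint32_t block)
{
    uint32_t offset = addr & (block - 1u);   /* position inside the block */
    return offset + size > block;
}
```

A 4-byte access at the last byte of a page, for example spans_boundary(0x00010FFF, 4, 4096), returns 1, while the same access at offset 0xFFC stays within the page and returns 0.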

So what happens in QEMU and Virtual PC when such an unaligned access is attempted? Using the exact same code sequence of STC and ADC but simply changing the value of EAX to point to the last byte of a page in the allocated memory block interestingly results in the value of 3 in Virtual PC 7.02 and 4 in QEMU 0.9. Yikes!

The problem here again is in not handling the read-modify-write operation as real x86 hardware would. To see how a value of 3 could come about, one has to look deep into the PowerPC manual to discover that the behaviour of misaligned writes across a page boundary is undefined. In other words, just don't do it! And I am guessing Virtual PC does it. What appears to happen is that the PowerPC performs a partial write, writing the value of 1 to memory before faulting on the write to the second page. This would cause additional faults to occur, re-reading the value of 1 and feeding that as the input into the re-execution of the ADC code. Whatever the exact sequence of events, the result should not be 3!

QEMU 0.9, using a software TLB implementation, appears to handle the buffering of the flags correctly, but either through a partial write error or by re-executing the faulting ADC instruction too many times eventually writes a value of 4. Big big mistake.

As in the previous example, WinQEMU 0.10.2 and QEMU 0.12.5 do write the correct answer of 2, but with the extra read faults being generated.

Interestingly all versions of Bochs which I have tested do work correctly. They generate only the write faults, and always write the value 2. This is because in Bochs, read-modify-write operations check access permissions before anything else, before the memory access is attempted, before the arithmetic operation is performed.

This simple change seems obvious to make in QEMU and would help eliminate any kind of memory related ordering bugs.
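A minimal sketch of that check-first ordering follows, assuming a toy two-page guest RAM with one writable bit per page. This is not Bochs source code; struct vm, adc32_checked, PAGE_SIZE, and NUM_PAGES are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 2u

/* Toy guest: two pages of RAM, a writable bit per page, and a Carry Flag. */
struct vm {
    uint8_t ram[NUM_PAGES * PAGE_SIZE];
    int writable[NUM_PAGES];
    int carry;
};

/* ADC dword [addr], imm with permissions checked before anything else.
   Returns 0 on success. Returns -1 (one write fault, or an address outside
   the toy RAM) with memory and the Carry Flag left completely untouched. */
int adc32_checked(struct vm *vm, uint32_t addr, uint32_t imm)
{
    if (addr > NUM_PAGES * PAGE_SIZE - 4u)
        return -1;                         /* outside the toy RAM */

    /* Step 1: verify write permission on every page the access touches. */
    uint32_t first = addr / PAGE_SIZE;
    uint32_t last = (addr + 3u) / PAGE_SIZE;
    for (uint32_t p = first; p <= last; p++)
        if (!vm->writable[p])
            return -1;                     /* fault before any side effect */

    /* Step 2: only now read, add, write back, and commit the flag. */
    uint32_t old;
    memcpy(&old, &vm->ram[addr], sizeof old);
    uint64_t sum = (uint64_t)old + imm + (uint64_t)vm->carry;
    uint32_t result = (uint32_t)sum;
    memcpy(&vm->ram[addr], &result, sizeof result);
    vm->carry = (int)(sum >> 32);
    return 0;
}
```

Because nothing is read, written, or committed until both pages have passed the permission check, a faulting instruction can be restarted any number of times without a partial write or a stale Carry Flag leaking into the result.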

Floating Point Compatibility

The next area where both Virtual PC 7.02 and QEMU fail miserably is in the emulation of 80-bit x87 floating point instructions. Virtual PC 7.02 faces a slightly different problem from QEMU and Bochs. In those emulators, x87 floating point is handled by the SoftFloat library - http://www.jhauser.us/arithmetic/index.html - an open source IEEE floating point library implemented purely in portable C using 32-bit and 64-bit integers. SoftFloat is great, because it allows any emulator, regardless of whether floating point hardware is available on a given host processor, to implement proper 32-bit, 64-bit, and 80-bit floating point operations. As I mentioned above, SoftFloat comes with a test utility called TestFloat, which contains a series of unit tests for various floating point operations and compares them with the results of the native floating point hardware.

TestFloat runs perfectly with no errors on Bochs, as well as on any recent AMD or Intel processor. However, on Virtual PC 7.02 and on QEMU, many of the tests fail due to rounding errors, floating point status flags errors, or just plain incorrect numeric results. Virtual PC I can understand, as it does not appear to use the library, opting instead to use the native PowerPC floating point hardware. Since PowerPC does not support 80-bit floats, it is not surprising that just about all of the 80-bit floating point tests in TestFloat fail on Virtual PC.

What is not clear to me is why those same tests fail in QEMU. QEMU is using the SoftFloat library, and therefore it is unacceptable that it should fail while Bochs works correctly. From what I can tell, QEMU takes shortcuts in not updating the x87 status flags for Underflow and Inexact results. While these flags are ignored by most floating point code, it is simply incorrect (and detectable) to report them incorrectly. In my opinion, this is rather low-hanging fruit that should be fixed in QEMU.

Performance on Real World Build Scenario

Obvious correctness errors aside, the most important thing QEMU needs to focus on is performance. Back in December 2008 in Part 27, I analyzed the performance of the then relatively new Intel Atom and Intel Core i7 processors. I used a couple of tests, a real-world Visual Studio build scenario of building my Gemulator 9.0 emulator, and a set of synthetic micro-benchmarks from my CPU_TEST suite. I recently repeated these tests on a variety of modern 2 GHz class processor hosts, measuring either native performance or the performance when running inside of a KVM, QEMU, or Virtual PC 7.02 virtual machine.

A set of results of the Visual Studio build of Gemulator 9.0 sources is summarized below, with some data repeated from the December 2008 posting, showing the build time, the host clock cycles, and the environment of the build, "native" to indicate the test was run natively in Windows, and everything else running the Windows 2000 virtual machine disk image. The various virtual machine host environments were KVM running on Fedora Linux 13, QEMU 0.9, WinQEMU 0.10.2, the latest QEMU 0.12.5, Virtual PC 7.02 for Mac, or my Windows build of Bochs 2.4.5.

Desktop computer specs (clock speed, CPU)	Gemulator 9 build time (seconds, lower is better)	Total clock cycles (billions)	Execution environment
3460 MHz Core i5	15.6	53.97	native
3460 MHz Core i5	265	916.9	WinQEMU
2666 MHz Core i7	20.0	53.32	native
2666 MHz Core 2 (Mac Pro)	22.1	58.92	native
2666 MHz Core 2 (Mac Pro)	910	2426	Bochs 2.4.5
2260 MHz Centrino 2 Penryn	23.7	53.56	native
2260 MHz Centrino 2 Penryn	37.9	85.64	KVM
2666 MHz AMD Phenom	24.8	66.12	native
2400 MHz AMD Phenom	35.4	85.0	KVM
2400 MHz Core 2 Q6600	24.0	60.0	native
2400 MHz Core 2 Q6600	329	789.6	QEMU 0.12.5
2400 MHz Core 2 Q6600	383	919.2	WinQEMU 0.10.2
2400 MHz Core 2 Q6600	477	1144.8	QEMU 0.9
2500 MHz PowerMac G5	126	315	VPC 7.02
2500 MHz PowerMac G5	505	1262.5	QEMU 0.12.5
2000 MHz PowerMac G5	995	1990	QEMU 0.12.5
1250 MHz Mac Mini G4	2092	2615	QEMU 0.12.5

The data may look a little confusing at first, but let me walk you through it. From the Core 2, Core i5, and Core i7 native results, the bottom line is that the build of Gemulator 9 requires about 53 to 60 billion host clock cycles. Not surprising given the similarity of those architectures. Regardless of the clock speed, the absolute amount of work required is about the same.
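The "total clock cycles" column is simply build time multiplied by clock speed. A small sketch of that arithmetic, with cycles_billions as a hypothetical helper name:

```c
/* Total host clock cycles, in billions, for a run lasting `seconds` on a
   host clocked at `mhz`. seconds * MHz gives millions of cycles, so divide
   by 1000 to get billions. cycles_billions is a hypothetical helper name. */
double cycles_billions(double seconds, double mhz)
{
    return seconds * mhz / 1000.0;
}
```

For the native Core i5 row, cycles_billions(15.6, 3460.0) comes to about 53.98 billion cycles, and for Virtual PC on the 2.5 GHz G5, cycles_billions(126.0, 2500.0) gives 315.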

What is interesting is to compare those results against the virtualized times. On the two systems that I have Fedora 13 running with KVM virtualization (my Penryn and Phenom boxes), the amount of absolute work rises to about 85 billion cycles. This indicates that KVM introduces approximately a 40% to 60% performance overhead for its virtualization. Not quite the zero-overhead cost of hardware virtualization as perpetuated. Real world workloads, which require emulating disk I/O, interrupts, exceptions, ring transitions, and hardware, do experience a slowdown even using VT, something that VMware pointed out four years ago in their excellent paper comparing jitting and VT: http://www.vmware.com/pdf/asplos235_adams.pdf. I first pointed readers at that VMware paper almost three years ago back in Part 3, and as VMware found back then, VT does not quite live up to the hype.

Bottom line: KVM, real workloads, 50% slowdown give or take.
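The overhead range can be reproduced from the cycle counts. The sketch below assumes that the 40% to 60% figure comes from comparing the roughly 85 billion KVM cycles against the 53 to 60 billion cycle native range quoted earlier; overhead_percent is a hypothetical helper name.

```c
/* Extra work, as a percentage, of a virtualized run relative to a native
   run, both measured in clock cycles. overhead_percent is a hypothetical
   helper name. */
double overhead_percent(double virt_cycles, double native_cycles)
{
    return (virt_cycles / native_cycles - 1.0) * 100.0;
}
```

Using the Penryn numbers, overhead_percent(85.64, 53.56) is just under 60%, and measuring 85.0 billion cycles against the 60.0 billion cycle native figure gives a little under 42%.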

Next datapoint, look at the 2.4 GHz Core 2 Q6600 numbers. The Q6600 is the quad-core Core 2 I discussed two years ago back in Part 19. Two years later, it is still one of my favourite processors, in part because I can easily over-clock it to 3.4 GHz when needed. For these tests I ran it at its specified 2.4 GHz speed, running virtual machines on both 64-bit Windows and 64-bit Debian 5.0.5 Linux. One thing that is very obvious is the efficiency increase of QEMU over the past 3 years from version 0.9 to 0.10.2 to 0.12.5, as you can see the build times dropping from 477 seconds down to 329 seconds. Keep in mind this was tested on exactly the same Windows 2000 disk image, reverted each time to a known snapshot for each of the three QEMU versions. The real-world efficiency of QEMU has truly improved by a good 50% over these past few versions. However, today this still leaves it at about, oh, 13 times slower than native execution, or almost an order of magnitude slower than running under KVM.

Another takeaway from this data is the set of PowerPC numbers, which I ran in emulation using either Virtual PC 7.02 or the latest QEMU 0.12.5 build. Virtual PC requires about 315 billion cycles, equating to about a 5x slowdown over native x86 performance. QEMU requires at least four times the time, or roughly about a 20x slowdown. Interestingly the slowdown gets even worse using the slower G5 processor, as the 2.5 GHz chip contains a 1MB L2 cache, while the 2.0 GHz chip contains only a 512K L2 cache. The size of the L2 cache matters a lot! The G4 is slower yet, both in absolute time and in absolute clock cycles. So as I was saying last posting, Virtual PC was actually getting quite fast on the G5 right around the time that Microsoft decided to discontinue the product. Pity.

It is interesting that Virtual PC actually holds its own against QEMU. It sets the bar for how fast QEMU could run, and that bar is quite a bit faster than where QEMU is today. Four times faster on PowerPC is possible by the mere existence proof of Virtual PC. This is not really too outlandish an expectation, given that various x86 jit frameworks such as Intel's PIN, DynamoRIO, Mojo, and recent work published at the 2010 CGO conference in Toronto all suggest that x86-to-x86 dynamic binary translation of real world applications can be accomplished with under 2x slowdown, in some cases with as low as 20% slowdown. Given the observed 50% slowdown seen in KVM, this suggests that it is possible (i.e. it is technically plausible) to make QEMU perform at about the same performance level as hardware virtualization.

The challenge is in figuring out the root causes of the existing performance bottlenecks.

		Guest OS	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	-
		VM*	VPC702	Q.12.5	Q.12.5	Q.12.5	Q.12.5	Q.12.5	Q.10.2	Q.10.2	Q.9.0	Q.12.5	KVM	KVM	native
		Host OS	OS X	Fedora12	Fedora12	Fedora12	Fedora13	Debian5	Win7	Win7	Win7	Win7	Fedora13	Fedora13	Win7
		Host CPU	PPC G5	PPC G5	PPC G5	PPC G4	Phenom	Core2	Core2	Core2/M	Core2	Corei5	Core 2/M	Phenom	Core 2/M
		Clock	2500	2500	2000	1250	2400	2400	2400	2400	2400	3460	2400	2400	2400

Test:	Units
T1FAST	seconds		2.1 .. 4.2**	3.9	4.8	17.3	2.8	2.2	2.9	3.1	4.6	2.5	0.31	0.28	0.29
T1SLOW	seconds		2.7 .. 4.9**	6.1	7.7	20.9	3.9	3.2	4.8	4.8	8.9	3.7	0.31	0.28	0.29
SIMP	seconds		1.8	9.6	12.2	23.6	8.4	6.7	8.1	8.5	8.9	6.6	0.8	0.7	0.8
Office script	seconds		16.7	77.8	120.8	255.0	59.4	60.9	67.7	72.8	68.8	47.3	5.9	4.3	3.6
LPIPE	seconds		10.6	43.5	82.0	164.5	38.4	41.7	46.7	56.1	46.2	33.6	8.1	1.6	2.9
CPU-Z	text		HUNG	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	PII/SSE3	Core2/SSSE3	Phenom/SSE4A	Core2/SSSE3
HDTEST32	MB/sec		35	4	4	1	6	8	22	7	41	48	6	6	35
111	result		FAIL(2)	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL(2)	FAIL	PASS	PASS	PASS
TESTFLOAT	result		FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	FAIL	PASS	PASS	PASS
MEMBAND (a,u)	clocks		1, 4	26, 55	33, 74	62, 166	21, 43	14, 40	17, 40	18, 40	20, 33	18, 29	<1, 3	<1, 1	<1, 3
FRACTAL	seconds		7.8	34.3	31.0	63.1	26.7	30.1	34.9	44.3	41.5	24.4	22.0	16.1	4.4
Build Gemulator 9	seconds		126	505	995	2092	357	329	387	432	477	265	37.9	35.4	25
C# short	seconds		3.7	12.8	25.1	40.1	9.4	6.6	8.3	9.2	15.9	9.6	0.9	0.5	0.4
C# long	seconds		30.8	59.9	105.0	176.4	46.5	34.6	45.7	53.5	88.5	40.3	6.4	4.0	3.6

		Guest OS	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K	Win2K
		VM*	VPC702	Q.12.5	Q.12.5	Q.12.5	Q.10.2	Q.9.0	KVM	Bochs
		Host OS	OS X	Fedora12	Fedora13	Debian5	Win7	Win7	Fedora13	Win7
		Host CPU	PPC G5	PPC G5	Phenom	Core2	Core2/M	Core2	Phenom	Core2
		Clock	2500	2500	2400	2400	2400	2400	2400	2666

Test:	Units
test 1 int add	clocks		2	4	2	2	6	6	1	14
test 1 int adc	clocks		5	48	25	30	36	28	1	17
test 1 mem indir	clocks		3	17	13	15	20	20	3	39
test 2 int adc++	clocks		2	44	22	11	27	8	0.5	18
test 5 zero mem	clocks		2	10	14	6	8	16	1	52
test 6 and0 mem	clocks		2	13	18	11	14	10	1.5	54
test 7 divide	clocks		115	250	77	69	73	124	44	77
test 3 os pg flt	clocks		33783	96153	85714	88889	104348	114285	3120	126952
test 12 PeekMsg	clocks		5995	10504	6997	11822	15384	11009	469	27770
test 15 sbb r r	clocks		8	56	17	15	23	20	1	37
test 17 read rtc	clocks		68	136	94	99	106	92	63	85
test 23 shld imm	clocks		4	36	15	15	18	5	2	20
test 29 call eax	clocks		174	210	102	113	197	421	8	122
test 29 call mispred	clocks		174	208	102	114	195	188	11	107
test A15 span 64	clocks		39	103	34	35	39	34	1	129
test A15 span 4K	clocks		38	141	107	104	96	91	1	222
test A19c FXSAVE	clocks		FAIL	808	529	475	998	606	65	653
test LAHF/SAHF	clocks		53	310	100	79	95	100	6	131
test 23b self mod	clocks		203	15625	10126	9795	13333	415	468	537
test 23d self mod	clocks		*	62500	41379	43636	52173	14201	1336	1052
nd32 x86 native loop	clocks		17	32	26	21	*	*	7	*
nd32 simulated	clocks		1596	2270	1467	1646	*	*	90	*
nd32 x86 sim ver c	clocks		1072	2106	1418	1560	*	*	131	*
nd32 x86 sim ver d	clocks		3056	3968	2264	2216	*	*	215	*
nd32 x86 sim ver l	clocks		1384	2566	1580	1888	*	*	49	*
