ARM64 Boot Camp: The Genius of ARM64


updated January 16 2024


The Genius of ARM64

This tutorial will discuss what makes ARM64 different from other RISC architectures and why I believe it is such a nice improvement over x86 and other previous CPU architectures used by Windows.

First, some terminology clarity: although the term "x86" generally covers all AMD and Intel processors which are derived from the original 8086, I will try to use the term "x86" specifically to refer to the 32-bit AMD/Intel instruction set: the x86 that powered OS/2, Windows 95, Windows NT, the original Linux, and Windows XP. Intel officially calls its version of x86 "IA-32", but I don't think I've ever heard any engineer refer to it that way, so I won't use that term. Contrary to what common sense would dictate, "IA64" is not Intel's 64-bit version of x86 (that's called "Intel 64") but rather refers to the completely different and recently failed Itanium architecture. So "x86" is what I am using to refer to the 32-bit AMD/Intel instruction set.

When AMD published their 64-bit x86 extensions in 2002, they sometimes referred to them as "x86-64" and other times as "AMD64". I will use the more common term "x64" to denote that specific 64-bit AMD/Intel instruction set, which is commonly used today for Windows 11. x86 and x64 are about 90% identical, but as you will see there are subtle differences in how x86 and x64 are encoded as opcode bytes, and so the distinction matters.

Similarly, "ARM" can mean a whole lot of things which all derived from that original 32-bit ARM instruction set from the 1980's. These days there are pretty much three ARM instruction set variants in common use on Windows and iOS/macOS and Android:

- ARM32 - or "Aarch32", the granddaddy consisting of fixed-size 32-bit wide instruction opcodes, 16 general purpose 32-bit wide integer registers (which include the stack pointer and program counter), and then in later versions 16 additional 128-bit wide "NEON" vector registers. In the manual these are listed as the "A32", "A1", or "A2" opcode encodings.

- Thumb2 - or "T32", a condensed encoding of ARM32 which consists of a mix of 16-bit wide and 32-bit wide instruction opcodes, which access the same register state as ARM32. Certain 16-bit encodings are limited to only accessing 8 registers. One unique aspect of Thumb2 is that the instruction code stream is effectively variable length, so unlike ARM32, instructions do not always start on a 4-byte boundary - only a 2-byte boundary is guaranteed - and therefore 32-bit wide opcodes can in fact span cache lines and page boundaries. I've written 3 different Thumb2 emulators over the past 10 years and let me tell you the variable size nature of Thumb2 is a nightmare to emulate for various technical reasons.

- ARM64 - or "Aarch64", the latest and greatest instruction set used by Android, iOS, macOS, and Windows on ARM. Like ARM32 it consists of fixed-size 32-bit wide instruction opcodes, 32 general purpose 64-bit wide integer registers plus a dedicated 64-bit program counter and dedicated 64-bit stack pointer, and 32 128-bit wide NEON vector registers.
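To illustrate why the Thumb2 stream is effectively variable length, here is a minimal C sketch of the length check a decoder has to perform on every instruction. It relies on the Thumb2 rule that a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit opcode; the function names are made up for this example and are not taken from any real emulator.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: true when this halfword is the first half of a
   32-bit Thumb2 opcode, i.e. bits [15:11] are 0b11101, 0b11110 or 0b11111. */
bool thumb2_is_32bit(uint16_t first_halfword)
{
    uint16_t top5 = (uint16_t)(first_halfword >> 11);
    return top5 == 0x1D || top5 == 0x1E || top5 == 0x1F;
}

/* Length in bytes of the Thumb2 instruction that starts with this halfword.
   ARM32 and ARM64 never need this check: every opcode is 4 bytes. */
int thumb2_insn_length(uint16_t first_halfword)
{
    return thumb2_is_32bit(first_halfword) ? 4 : 2;
}
```

Because the length is only known after reading the first halfword, the second halfword of a 32-bit opcode may sit on the next cache line or page, which is the emulation headache described in the Thumb2 bullet.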

The opcodes in all three modes share similar concepts: there is generally a destination register (Rd), a first source register or address base register (Rn), and a second source or address index register (Rm). Some instructions take a fourth register, which is denoted as Ra or Rt depending on the use.

In general, most instructions exist across all three modes. So for example, if I wish to add two 32-bit registers and store the result in a third register, the assembly language in Thumb2 or ARM32 would be something like:

ADD R0, R1, R2 ; R0 = R1 + R2

while in ARM64 mode, "W" is the prefix for a 32-bit register and "X" is the prefix for a 64-bit register, so:

ADD W0, W1, W2 ; 32-bit W0 = W1 + W2

ADD X0, X1, X2 ; 64-bit X0 = X1 + X2

Although Thumb2, ARM32, and ARM64 all have 32-bit wide instruction opcodes, the encodings are _not_ the same between the three modes. A given 32-bit code pattern decodes into entirely different opcodes depending on which mode (Thumb2, ARM32, or ARM64) the core is in. The bit positions of Rd Rn Rm are in different places in each of the three encodings. This is very different from AMD/Intel, where as I said about 90% of instruction opcodes are encoded identically whether in 32-bit x86 mode or 64-bit x64 mode. In x86/x64, the encoding of a 32-bit ADD such as:

ADD ECX, EDX ; 32-bit ECX = ECX + EDX

is _identical_ whether the code is executing as x86 or x64. Is this convenient? Sure, it makes it easy to write a decoder or instruction encoder that handles both x86 and x64. But as you will see below, this convenience comes with costly drawbacks!

It does get even funkier with Thumb2, because you can actually mix ARM32 and Thumb2 instruction encodings in the same binary, even on the same code page, even in the same function if you wanted to! Because of the guaranteed 2-byte instruction alignment, the lowest bit of the program counter is effectively a "Thumb2/ARM32" flag. Every indirect jump (BX instruction) can cause a mode switch, so for example a function pointer or a switch statement could actually dispatch to a mix of Thumb2 and ARM32 code blocks. I mentioned this was a nightmare to emulate when I was bringing up an Android emulator at Amazon a decade ago, because depending on how you branch to a code block the same code can be decoded two different ways. Ugh! Yes, one can technically mix and match x86 and x64 in the same binary on Windows using far jumps - some games use this for anti-cheat - but this is rare and difficult to do. On Linux/Android, by contrast, the gcc and clang toolsets support mixing, and it's practically required since some static libs are compiled as Thumb2 and some as ARM32.
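As a rough sketch of what an emulator has to do on every BX, the following C fragment splits a branch target into a fetch address and a decode mode using the low bit. The struct and function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical result type for this example. */
typedef struct {
    uint32_t pc;     /* address of the next instruction to fetch */
    bool     thumb;  /* true = decode as Thumb2, false = decode as ARM32 */
} BranchTarget;

/* Models the interworking rule for BX: bit 0 of the register value selects
   the instruction set and is not part of the fetch address. */
BranchTarget resolve_bx_target(uint32_t reg_value)
{
    BranchTarget t;
    t.thumb = (reg_value & 1u) != 0;
    t.pc    = reg_value & ~1u;
    return t;
}
```

Two pointers that differ only in bit 0 therefore refer to the same bytes but ask for two different decodings, which is why a translation cache in an emulator would need to key on the mode as well as the address.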

Fortunately, Microsoft deprecated ARM32 mode from its devices starting back in Windows RT in 2012, allowing only code written for Thumb2 making it impossible to mix modes. More recently, both Thumb2 and ARM32 have been deprecated from the ARM spec (making them optional instead of mandatory) so newer ARM processors such as the Apple M1 and M2 have already hit the market which _only_ implement ARM64 64-bit mode, period.

This is one of many genius moves that ARM has made, and much like Motorola in the 1980's with the 680x0 processors, ARM edits out and removes instructions that are no longer practical. The world has gone 64-bit, so there is no reason to force CPU manufacturers to add silicon for supporting legacy 1990's era instructions. This of course is the complete opposite of the AMD/Intel strategy of still supporting 16-bit, 32-bit, and 64-bit modes all in silicon.

Another genius move that distinguishes ARM from AMD/Intel is that ARM was not afraid to completely re-encode the ARM64 instruction set. They didn't just append new encodings on top of ARM32 (the way AMD appended new REX prefixes on top of x86 to create x64).

For example, in ARM32, _every_ instruction opcode wastes 4 bits (of the 32-bit opcode) to encode one of 16 predicate conditions, i.e. in ARM32 _every_ instruction is conditional. You can conditionally execute an ADD depending on whether the Carry flag is set. You can conditionally load from memory only if the Zero flag is set. Sounds bizarre, but this actually made sense in the 1990's and 2000's when branch predictors were terrible and CPU cores were in-order, so unrolling code into long blocks of predicated instructions is how C compilers implemented things like "if/then" statements or ternary operators. You will see this in x86/x64 as well through the use of conditional move (CMOVcc) and conditional set (SETcc) instructions, but this has fallen out of favor as branch predictors have improved significantly in the past decade. There is a famous post from Linus highly discouraging the use of conditional moves in Linux for those reasons: https://yarchive.net/comp/linux/cmov.html

So when designing ARM64, ARM to their credit removed these 4 bits from each opcode and instead repurposed them: one bit (bit 31) is now used to signify a 64-bit instruction, another bit (bit 4) is an extra index bit on the Rd destination register, and similarly the other two bits add an extra index bit to each of the Rn and Rm source registers. This is how, without bloating code, ARM64 can support 32 GPRs and 32 NEON registers, while ARM32/Thumb2 supported only 16 each.
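To make the bit layout concrete, here is a minimal C sketch that pulls the width bit and the three 5-bit register indices out of an ARM64 register-form ALU opcode such as ADD. The field positions (bit 31 for the width, bits 4:0 for Rd, bits 9:5 for Rn, bits 20:16 for Rm) follow the A64 encoding for this instruction class; the struct and function names are invented for this example. The first test value, 0b130108, is the "add w8,w8,w19" opcode from the compiler listing in this article.

```c
#include <stdint.h>

/* Hypothetical container for the fields discussed above. */
typedef struct {
    unsigned sf;  /* bit 31: 1 = 64-bit (X registers), 0 = 32-bit (W registers) */
    unsigned rd;  /* bits 4:0   destination register index, 0..31 */
    unsigned rn;  /* bits 9:5   first source register index, 0..31 */
    unsigned rm;  /* bits 20:16 second source register index, 0..31 */
} A64Fields;

/* Extracts the fields from a register-form data-processing opcode such as
   ADD Rd, Rn, Rm. Other instruction classes place their fields differently. */
A64Fields a64_decode_fields(uint32_t insn)
{
    A64Fields f;
    f.sf = (insn >> 31) & 1u;
    f.rd = insn & 0x1Fu;
    f.rn = (insn >> 5) & 0x1Fu;
    f.rm = (insn >> 16) & 0x1Fu;
    return f;
}
```

Decoding 0x0b130108 gives sf = 0, Rd = 8, Rn = 8, Rm = 19, i.e. a 32-bit add of W8 and W19 into W8; setting bit 31 of the same pattern turns it into the 64-bit X-register form with no change in opcode size.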

In contrast, AMD's 64-bit extensions _guarantee_ that binary code size bloats when porting from x86 to x64!! Let's see why this is. Let's look at a trivial C function, something like computing the Nth Fibonacci number:

#include <stdint.h>
#include <stdio.h>

int32_t FibT32(int N)
{
    int32_t A = 0;
    int32_t B = 1;

    for (int i = 1; i <= N; i++)
    {
        A = A + B;

        // Swap A B using the traditional temp variable method

        int32_t T = B;
        B = A;
        A = T;
    }

    printf("fib index %2d = %5d\n", N, A);
    return A;
}

I'm not a fan of the "create a temporary variable to swap two other variables" approach to swapping; I prefer the in-place method with no additional source temps:

int32_t FibA32(int N)
{
    int32_t A = 0;
    int32_t B = 1;

    for (int i = 1; i <= N; i++)
    {
        A = A + B;

        // Swap A B in-place

        A = A - B;
        B = B + A;
        A = B - A;
    }

    printf("fib index %2d = %5d\n", N, A);
    return A;
}

This actually produces the exact same compiled code, despite the lack of a temporary variable and the extra addition and subtraction operations. The compiler is smart enough to recognize the variable swap. If you are not familiar with this trick, there are several other ways to write a swap code sequence. For example, you can use XOR to compute a bitwise delta between A and B instead of an arithmetic delta, as explained below:

    // Swap A B in-place using XOR instead of ADD/SUB

    A = A ^ B;
    B = B ^ A;
    A = B ^ A;

Whichever way you write it, the first line of code computes a delta between A and B - an arithmetic delta in the case of subtraction, and a logical bitwise delta in the case of XOR. The second line then applies the delta to variable B to set it to the original value of A. And then the third line sets variable A to the original value of B.
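For readers who want to convince themselves that the three swap styles are equivalent, here is a small self-contained C sketch. It is an illustration rather than the code from the listings above: the Pair type and function names are made up, and it uses unsigned 32-bit values so that the intermediate arithmetic delta is free to wrap around without running into signed-overflow rules.

```c
#include <stdint.h>

/* Hypothetical value pair used only for this example. */
typedef struct {
    uint32_t a;
    uint32_t b;
} Pair;

/* Traditional swap through a temporary. */
Pair swap_temp(Pair p)
{
    uint32_t t = p.b;
    p.b = p.a;
    p.a = t;
    return p;
}

/* In-place swap using an arithmetic delta (wraps modulo 2^32). */
Pair swap_addsub(Pair p)
{
    p.a = p.a - p.b;   /* a now holds the delta (a - b) */
    p.b = p.b + p.a;   /* b + (a - b) = original a */
    p.a = p.b - p.a;   /* a - (a - b) = original b */
    return p;
}

/* In-place swap using a bitwise delta. */
Pair swap_xor(Pair p)
{
    p.a = p.a ^ p.b;   /* a now holds the bitwise delta */
    p.b = p.b ^ p.a;   /* b ^ delta = original a */
    p.a = p.b ^ p.a;   /* a ^ delta = original b */
    return p;
}
```

All three functions return the pair with its fields exchanged, including for values such as 0xFFFFFFFF and 1 where the arithmetic delta wraps.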

So let's see what the 32-bit x86, 64-bit x64, and 64-bit ARM64 code looks like for the inner loop of this function (each instruction line shows the offset, the instruction code bytes, and then the disassembled instruction):

32-bit x86 (code size: 11 bytes)

    $LL4@FibA32:
    ; 30 : {
    ; 31 : A = A + B;
    ; 35 : A = A - B;
      00011 8b c1           mov eax, ecx
    ; 36 : B = B + A;
      00013 03 ce           add ecx, esi
    ; 37 : A = B - A;
      00015 8b f0           mov esi, eax
      00017 83 ea 01        sub edx, 1
      0001a 75 f5           jne SHORT $LL4@FibA32

64-bit x64 (code size: 12 bytes)

    $LL4@FibA32:
    ; 30 : {
    ; 31 : A = A + B;
    ; 35 : A = A - B;
      00014 8b c2           mov eax, edx
    ; 36 : B = B + A;
      00016 03 d3           add edx, ebx
    ; 37 : A = B - A;
      00018 8b d8           mov ebx, eax
      0001a 49 83 e8 01     sub r8, 1
      0001e 75 f4           jne SHORT $LL4@FibA32

64-bit ARM64 (code size: 20 bytes)

    0001c |$LL13@FibA32|
    ; 30 : {
    ; 31 : A = A + B;
    ; 35 : A = A - B;
      0001c 2a0803ea        mov w10,w8
    ; 36 : B = B + A;
      00020 0b130108        add w8,w8,w19
    ; 37 : A = B - A;
      00024 2a0a03f3        mov w19,w10
      00028 51000529        sub w9,w9,#1
      0002c 35000009        cbnz w9,|$LL13@FibA32|

Notice what is in common:

- all three instruction sets can encode this loop in 5 instructions: MOV ADD MOV SUB and conditional branch/jump not zero

- the compiler automatically allocated the temporary swap register (EAX in the case of x86/x64, and W10 in the case of ARM64)

- the opcode encoding for x86 is almost identical to that for x64 as expected. MOV is 8B, ADD is 03, SUB is 83, JNE is 75.

You can see that ARM64 is not radically different from what you already know from x86. But now you can see some differences:

- the x64 code size is 1 byte larger due to an extra 49 prefix on the SUB R8 instruction. This 49 prefix is called a "REX prefix". A REX prefix is very similar to what I described above, where ARM64 took 4 opcode bits to represent the high bit of each register index and to indicate whether the operation is 64-bit or 32-bit. That's exactly what a REX prefix encodes; unfortunately x86 had no such spare bits, so this is solved by emitting an additional prefix byte to carry these 4 bits. Even if just _one_ of the bits needs to be set you have to waste an entire byte (in this case 49 sets two of them: REX.W to select the 64-bit operation and REX.B to indicate that R8 is a high register).

- ARM64 supports 3-operand instructions including ADD, SUB, AND, OR, XOR (which ARM names "EOR"), shift, rotate, and other ALU operations. x86 and x64 have historically only supported 2-operand instructions, which means that ALU operations such as ADD and SUB are always destructive in nature, i.e. the first operand is a source register and is then overwritten as the destination register. In this case it doesn't matter because the ARM64 sequence chose to use W8 as both a destination and a source.

- the ARM64 sequence is the longest at 20 bytes, which would seem to suggest that ARM64 is always much fatter and more bloated than either x86 or x64.
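To see what the REX prefix from the first bullet actually carries, here is a minimal C sketch that splits a REX byte into its four flag bits. In 64-bit mode a REX prefix is any byte in the range 0x40 to 0x4F, with the low nibble holding the W, R, X, and B bits; the type and function names below are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical view of the low nibble of a REX prefix byte. */
typedef struct {
    bool w;  /* bit 3: 64-bit operand size */
    bool r;  /* bit 2: high bit of the ModRM reg field */
    bool x;  /* bit 1: high bit of the SIB index field */
    bool b;  /* bit 0: high bit of the ModRM rm / SIB base field */
} RexBits;

/* In 64-bit mode, bytes 0x40..0x4F are REX prefixes. */
bool is_rex_prefix(uint8_t byte)
{
    return (byte & 0xF0u) == 0x40u;
}

/* Splits a REX prefix byte into its four flag bits. */
RexBits decode_rex(uint8_t byte)
{
    RexBits rex;
    rex.w = (byte & 0x08u) != 0;
    rex.r = (byte & 0x04u) != 0;
    rex.x = (byte & 0x02u) != 0;
    rex.b = (byte & 0x01u) != 0;
    return rex;
}
```

Decoding the 49 byte from "49 83 e8 01 sub r8, 1" yields W = 1 and B = 1, while the 44 byte that shows up in "44 8b c1 mov r8d, ecx" in the full x64 listing yields only R = 1. Either way the cost is a whole extra byte, whereas ARM64 carries the equivalent bits inside the fixed 32-bit opcode.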

But let's zoom out and look at the entire compiled function for all three instruction sets instead of just looking at the inner loop:

32-bit x86 (code size: 48 bytes)

    _FibA32 PROC ; COMDAT
    ; 25 : {
      00000 56                  push esi
    ; 26 : int32_t A = 0;
      00001 33 f6               xor esi, esi
      00003 57                  push edi
    ; 28 :
    ; 29 : for (int i = 1; i <= N; i++)
      00004 8b 7c 24 0c         mov edi, DWORD PTR _N$[esp+4]
      00008 8d 4e 01            lea ecx, DWORD PTR [esi+1]
      0000b 3b f9               cmp edi, ecx
      0000d 7c 0d               jl SHORT $LN3@FibA32
    ; 27 : int32_t B = 1;
      0000f 8b d7               mov edx, edi
    $LL4@FibA32:
    ; 30 : {
    ; 31 : A = A + B;
    ; 32 :
    ; 33 : // Swap A B in-place
    ; 34 :
    ; 35 : A = A - B;
      00011 8b c1               mov eax, ecx
    ; 36 : B = B + A;
      00013 03 ce               add ecx, esi
    ; 37 : A = B - A;
      00015 8b f0               mov esi, eax
      00017 83 ea 01            sub edx, 1
      0001a 75 f5               jne SHORT $LL4@FibA32
    $LN3@FibA32:
    ; 38 : }
    ; 39 :
    ; 40 : printf("fib index %2d = %5d\n", N, A);
      0001c 56                  push esi
      0001d 57                  push edi
      0001e 68 00 00 00 00      push OFFSET ??_C@_0BF@FLIKFCPO@fib?5index?5?$CF2d?5?$DN?5?$CF5d?6@
      00023 e8 00 00 00 00      call _printf
      00028 83 c4 0c            add esp, 12 ; 0000000cH
    ; 41 : return A;
      0002b 8b c6               mov eax, esi
      0002d 5f                  pop edi
      0002e 5e                  pop esi
    ; 42 : }
      0002f c3                  ret 0
    _FibA32 ENDP

64-bit x64 (code size: 57 bytes)

    FibA32 PROC ; COMDAT
    ; 25 : {
    $LN12:
      00000 40 53               push rbx
      00002 48 83 ec 20         sub rsp, 32 ; 00000020H
    ; 26 : int32_t A = 0;
      00006 33 db               xor ebx, ebx
    ; 27 : int32_t B = 1;
      00008 ba 01 00 00 00      mov edx, 1
    ; 28 :
    ; 29 : for (int i = 1; i <= N; i++)
      0000d 3b ca               cmp ecx, edx
      0000f 7c 0f               jl SHORT $LN3@FibA32
    ; 27 : int32_t B = 1;
      00011 44 8b c1            mov r8d, ecx
    $LL4@FibA32:
    ; 30 : {
    ; 31 : A = A + B;
    ; 32 :
    ; 33 : // Swap A B in-place
    ; 34 :
    ; 35 : A = A - B;
      00014 8b c2               mov eax, edx
    ; 36 : B = B + A;
      00016 03 d3               add edx, ebx
    ; 37 : A = B - A;
      00018 8b d8               mov ebx, eax
      0001a 49 83 e8 01         sub r8, 1
      0001e 75 f4               jne SHORT $LL4@FibA32
    $LN3@FibA32:
    ; 38 : }
    ; 39 :
    ; 40 : printf("fib index %2d = %5d\n", N, A);
      00020 8b d1               mov edx, ecx
      00022 44 8b c3            mov r8d, ebx
      00025 48 8d 0d 00 00 00 00 lea rcx, OFFSET FLAT:??_C@_0BF@FLIKFCPO@fib?5index?5?$CF2d?5?$DN?5?$CF5d?6@
      0002c e8 00 00 00 00      call printf
    ; 41 : return A;
      00031 8b c3               mov eax, ebx
    ; 42 : }
      00033 48 83 c4 20         add rsp, 32 ; 00000020H
      00037 5b                  pop rbx
      00038 c3                  ret 0
    FibA32 ENDP

64-bit ARM64 (code size: 84 bytes)

    00000 |FibA32| PROC
    ; 25 : {
    00000 |$LN18|
      00000 d10043ff            sub sp,sp,#0x10
      00004 a9007bf3            stp x19,lr,[sp]
    ; 26 : int32_t A = 0;
      00008 52800013            mov w19,#0
    ; 27 : int32_t B = 1;
      0000c 52800028            mov w8,#1
    ; 28 :
    ; 29 : for (int i = 1; i <= N; i++)
      00010 7100041f            cmp w0,#1
      00014 5400000b            blt |$LN14@FibA32|
    ; 27 : int32_t B = 1;
      00018 2a0003e9            mov w9,w0
    0001c |$LL13@FibA32|
    ; 30 : {
    ; 31 : A = A + B;
    ; 32 :
    ; 33 : // Swap A B in-place
    ; 34 :
    ; 35 : A = A - B;
      0001c 2a0803ea            mov w10,w8
    ; 36 : B = B + A;
      00020 0b130108            add w8,w8,w19
    ; 37 : A = B - A;
      00024 2a0a03f3            mov w19,w10
      00028 51000529            sub w9,w9,#1
      0002c 35000009            cbnz w9,|$LL13@FibA32|
    00030 |$LN14@FibA32|
    ; 38 : }
    ; 39 :
    ; 40 : printf("fib index %2d = %5d\n", N, A);
      00030 2a0003e1            mov w1,w0
      00034 90000008            adrp x8,|??_C@_0BF@FLIKFCPO@fib?5index?5?$CF2d?5?$DN?5?$CF5d?6@|
      00038 91000100            add x0,x8,|??_C@_0BF@FLIKFCPO@fib?5index?5?$CF2d?5?$DN?5?$CF5d?6@|
      0003c 2a1303e2            mov w2,w19
      00040 94000000            bl printf
    ; 41 : return A;
      00044 2a1303e0            mov w0,w19
      00048 a9407bf3            ldp x19,lr,[sp]
      0004c 910043ff            add sp,sp,#0x10
      00050 d65f03c0            ret
    ENDP ; |FibA32|

ARM64 is still the largest, although proportionally the gap between x64 and ARM64 is not as large: 84 vs. 57 bytes for the whole function, compared with 20 vs. 12 bytes for the inner loop. I also see some missed compiler optimization opportunities from Visual Studio - notice the SUB+STP and LDP+ADD instruction pairs could each be replaced with single pre-index and post-index forms of the STP and LDP instructions. I will file a bug!

With the full function disassemblies we see the true differences between instruction sets - some architectural, some due to legacy Windows ABI calling convention requirements.

Notice that x86 has an "ADD ESP,12" after the call to printf to pop the stack. This is because the default calling convention for 32-bit x86 is "__cdecl" which requires pushing function arguments to the stack and then popping them at the call site. This is an old artifact of x86 dating back to MS-DOS days and is difficult to get rid of at this point. What this means is that x86 touches memory the most. There are 3 memory operations required to enter the function (PUSH PUSH followed by MOV to read the incoming function argument) and 3 memory operations to leave the function (POP POP RET). Additionally the call to printf() involves 4 memory operations (PUSH PUSH PUSH CALL).

x64 uses "__fastcall" calling convention which passes integer arguments in up to 4 general purpose registers, which reduces the 3+3+4 = 10 memory accesses in x86 down to 1+2+1 = 4 (PUSH and POP RET for the function entry/exit, and CALL for the printf() call).

ARM64 has the fewest memory operations - a single STP (a push) on entry and a single LDP (pop) on exit. ARM does _not_ push or pop a return address to the stack when executing the "BL" and "RET" instructions, so frequent calls to small leaf functions are much cheaper on ARM64 than on x86 or x64. This optimization also exists in most previous RISC architectures such as PowerPC. ARM64 implicitly uses the latest "__vectorcall" calling convention, where both integer arguments _and_ floating point arguments are passed in registers. Although x86 and x64 have supported __vectorcall since Visual Studio 2013, most apps do not make use of it.

Compare this to, say, the Spotify app in the Windows Store: if you attach a debugger and single-step through the code, you will quickly realize that the code is still compiled using plain old 32-bit __cdecl and is full of memory pushes and pops!!

In fact to this day, even the latest Visual C++ x86 compiler still has trouble with the __vectorcall keyword, which might explain its slow adoption more than 10 years since the keyword's introduction. I filed this x86 __vectorcall compilation bug almost a year ago and am still waiting for the fix to get released through Visual Studio public channels (although it looks like they have identified and fixed it internally so that's good news). A related bug which I filed at the same time, which only affects the quality and performance of x86 code, is similarly unfixed.

Fellow developers, please click on the two bugs just mentioned above and "upvote them". And while you're at it, please upvote this bug too.

Thankfully, when porting code to native ARM64, whether your original source code specifies __cdecl, __fastcall, __stdcall (another legacy calling convention frequently seen in 32-bit x86 apps), or __vectorcall, they are all identical on ARM64, i.e. ARM64 native calling convention is implicitly __vectorcall and thus will always use up to 8 integer registers and 8 vector registers for incoming function arguments. Data structures - if they are a nice power-of-2 size such as 16 bytes in size - will also be passed in registers instead of on the stack as can happen with x86 and x64. The lower overhead of function calls is one of the ways ARM64 outperforms legacy CISC architectures.

What about legacy RISC architectures? ARM64 does improve over previous RISC architectures such as PowerPC and MIPS. Here is a worksheet which summarizes the similarities and differences of 6 CPU architectures as seen from application-level "user mode", ranging from the 6502 all the way through the decades to the current ARM64 NEON instruction set. What I am showing here are the most common user-mode registers which are available to application software (I am omitting some user-mode registers such as timestamp counters and control registers):

6502 (RISC, 1970's)

- Integer registers: Accumulator (8 bits); Index register X (8 bits); Index register Y (8 bits)
- Special registers: Stack pointer index, added to address 0x100 (8 bits); Program Counter (16 bits); Processor flags/control register (8 bits)
- Arithmetic flags (contained in the P register), 1 bit each: Negative (sign bit of result); Zero; Overflow (signed overflow); Carry (unsigned overflow)

Motorola 68000, 68010, 68020 (CISC, 1980's)

- Integer registers: D0 - D7: Data registers, for ALU operations (8 * 32 bits); A0 - A6: Address registers, for memory operations (7 * 32 bits); Stack Pointer (32 bits)
- Special registers: USP: User Stack Pointer, kernel mode only (32 bits); Program Counter (32 bits); CCR / SR: Condition Code Register, and Status Register which is kernel mode only on 68010+ (16 bits)
- Arithmetic flags (contained in the CCR register), 1 bit each: Negative (sign bit of result); Zero; Overflow (signed overflow); Carry (unsigned overflow); Extended Carry (sticky Carry bit)

MIPS R4000 (RISC, 1990's)

- Integer registers: Zero Register (64 bits); R1-R30: General Purpose Registers (30 * 64 bits); R31: Link Register (64 bits)
- Special registers: Multiply/Divide high register (64 bits); Multiply/Divide low register (64 bits); Program Counter (64 bits)
- Arithmetic flags: There are none!

PowerPC G5 (RISC, 2000's)

- Integer registers: GPR0: General Purpose / Zero Register (64 bits); GPR1: Stack Pointer (64 bits); GPR2-31: General Purpose Registers (30 * 64 bits)
- Special registers: Link Register (64 bits); Program Counter (64 bits); CTR: Count Register (64 bits); CR[0:7]: Condition Register, 8 sets of 4 flags (32 bits); XER: Exception Register (32 bits)
- Arithmetic flags (contained in XER and each of CR0..CR7): Less Than (8 * 1 bit); Greater Than (8 * 1 bit); Equal / Zero (8 * 1 bit); Summary Overflow, a sticky overflow (9 * 1 bit); Overflow, signed overflow (1 bit); Carry, unsigned overflow (1 bit)
- Vector unit registers: VR0 - VR31: Vector Registers (32 * 128 bits); VSCR: Status/Control Register (32 bits)

x64 with SSE (CISC, 2000's)

- Integer registers: RAX-RDI: General Purpose Registers (7 * 64 bits); RSP: Stack Pointer (64 bits); R8-R15: General Purpose Registers (8 * 64 bits)
- Special registers: Program Counter (64 bits); RFLAGS: Condition Code Register (64 bits)
- Arithmetic flags (contained in the RFLAGS register), 1 bit each: Negative (sign bit of result); Zero; Overflow (signed overflow); Carry (unsigned overflow); Adjust (low nibble auxiliary Carry); Parity (even parity of low 8 bits of result)
- Vector unit registers: XMM0 - XMM15: Vector Registers (16 * 128 bits); MXCSR: Control/Status Register (32 bits)

ARM64 (Armv8-A Aarch64) (RISC, 2010's)

- Integer registers: X0 - X29: General Purpose Registers (30 * 64 bits); X30: Link Register (64 bits)
- Special registers: Stack Pointer, must be 16-byte aligned (64 bits); Program Counter (64 bits); APSR: Application Program Status Register (64 bits); XZR: Zero Register (64 bits)
- Arithmetic flags (contained in the APSR), 1 bit each: Negative (sign bit of result); Zero; Overflow (signed overflow); Carry (unsigned overflow)
- Vector unit registers: V0 - V31 (32 * 128 bits); FPSR: Floating Point Status Register; FPCR: Floating Point Control Register

Note specifically that the architectures which I have seen and used the most in my career - 6502, 680x0, Intel x86/x64, and ARM64 - share the common characteristic of having 4 very similar if not identical arithmetic flags: Sign/Negative, Zero, Overflow, and Carry. This similarity came in handy when I was developing Xformer to emulate the 6502-based Apple II and Atari 800 on the 68000-based Atari ST, developing Gemulator to then emulate 680x0 on Intel x86, and most recently in the work in Windows to emulate x86/x64 on ARM64. Having arithmetic flags similar to the past 45 years of legacy is advantageous for ARM64 as far as being able to emulate past architectures.

PowerPC and MIPS went _very_ different routes as far as implementing flags - MIPS got rid of them entirely, while PowerPC created 8 sets of them! When I worked on the Xbox 360 and was part of the team emulating x86 on the PowerPC, the arithmetic flags differences were a source of pain and inefficiency. Remember, arithmetic flags are how most CPUs perform conditional branches - what in source code are "if" "else" "while" "until" statements. Typically some sort of arithmetic computation is performed - an addition, a decrement, a compare - which then sets the NZVC flags to indicate the result of that operation, and then some sort of conditional branch is taken based on those condition flags.
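As a rough illustration of what "setting the NZVC flags" means, here is a minimal C sketch that computes the four flags for a 32-bit addition in software. This is the kind of bookkeeping an emulator is forced to do by hand when the host CPU's flags do not match the guest's. The struct and function names are invented for this example, and it covers addition only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical result of a flag-setting 32-bit add. */
typedef struct {
    uint32_t result;
    bool n;  /* Negative: sign bit of the result */
    bool z;  /* Zero: result is zero */
    bool c;  /* Carry: unsigned overflow out of bit 31 */
    bool v;  /* Overflow: signed overflow */
} AddFlags;

AddFlags add32_nzcv(uint32_t a, uint32_t b)
{
    AddFlags f;
    f.result = a + b;                     /* wraps modulo 2^32 */
    f.n = (f.result >> 31) != 0;
    f.z = f.result == 0;
    f.c = f.result < a;                   /* wrapped around, so a carry occurred */
    /* Signed overflow: both inputs have the same sign and the result's
       sign differs from it. */
    f.v = ((~(a ^ b) & (a ^ f.result)) >> 31) != 0;
    return f;
}
```

For example, adding 1 to 0x7FFFFFFF sets Negative and Overflow but not Carry, while adding 1 to 0xFFFFFFFF sets Zero and Carry but not Overflow. A conditional branch then simply tests some combination of these four bits.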

The designers of PowerPC created 9 sets of arithmetic flag bits (!!!) - one that contains a master copy of the Carry and Overflow flags, and 8 sets of 4-bit sub-registers CR0-CR7 which each contain "Equal", "Less Than", "Greater Than" and "Overflow" flags. It's a little bizarre in that you can have multiple sets of comparisons and conditional branches in flight. In reality I never ended up making use of more than two of the CRx at the same time.

The designers of MIPS decided to combine the setting of conditions and branching into single instructions, so the computed arithmetic flags are never "seen" architecturally by the program running on MIPS; they are hidden micro-architectural state. MIPS is limited to branching based on whether a given register is zero or not zero; and similar to PowerPC it has branches on inequalities. The lack of arithmetic flags means that if you wanted to emulate MIPS on ARM64 it would be very easy to do! Nintendo 64 emulator anyone?

I've summarized below how MIPS, Intel, and ARM perform common branches seen in compiled C/C++ code. Note that MIPS being a very old school RISC design has "delay slots" which generally require sticking a NOP instruction after each conditional branch. So the clever flagless branches in MIPS do not necessarily save code size, and are not as versatile as either x64 or ARM64. You can see ARM64 has three types of conditional branches - branch if a bit is zero or not zero, branch if an entire register is zero or not zero, and branch based on the arithmetic flags. ARM64 has the advantage over x86/x64 in that it can perform more types of conditional branches without destroying flags - which is a very big advantage for emulation where you need to run some housekeeping code internal to the emulator without destroying the live flags which are holding the x86/x64 state. x86/x64 only support a single "branch if CX register is zero" instruction which is not as general as either MIPS or ARM64.

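Comparison of conditional branches on different architectures

Branch always (unconditional) - sample C source code: if (X == X)

- MIPS: BEQ R1,R1,label then NOP (delay slot)
- x64: JMP label
- ARM64: B label

Branch if two values are equal - sample C source code: if (X == Y)

- MIPS: BEQ R1,R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,RDX then JE label
- x64 (flagless): SUB RCX,RDX then JRCXZ label
- ARM64 (flag altering): CMP X1,X2 then BEQ label
- ARM64 (flagless): SUB X3,X1,X2 then CBZ X3,label

Branch if two values are not equal - sample C source code: if (X != Y)

- MIPS: BNE R1,R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,RDX then JNE label
- x64 (flagless): SUB RCX,RDX then JRCXZ skip then JMP label
- ARM64 (flag altering): CMP X1,X2 then BNE label
- ARM64 (flagless): SUB X3,X1,X2 then CBNZ X3,label

Branch if positive - sample C source code: if ((int64)X >= 0)

- MIPS: BGEZ R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,0 then JGE label
- ARM64 (flag altering): CMP X1,#0 then BGE label
- ARM64 (flagless): TBZ X1,#63,label

Branch if positive and non-zero - sample C source code: if ((int64)X > 0)

- MIPS: BGTZ R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,0 then JG label
- ARM64 (flag altering): CMP X1,#0 then BGT label

Branch if negative or zero - sample C source code: if ((int64)X <= 0)

- MIPS: BLEZ R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,0 then JLE label
- ARM64 (flag altering): CMP X1,#0 then BLE label

Branch if negative - sample C source code: if ((int64)X < 0)

- MIPS: BLTZ R2,label then NOP (delay slot)
- x64 (flag altering): CMP RCX,0 then JL label
- ARM64 (flag altering): CMP X1,#0 then BLT label
- ARM64 (flagless): TBNZ X1,#63,label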
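To spell out what the flagless ARM64 branches test, here is a small C model of the conditions behind CBZ, CBNZ, TBZ, and TBNZ. It is only a sketch of the branch conditions, not of any real emulator; the function names are made up, and each one returns whether the corresponding branch would be taken.

```c
#include <stdint.h>
#include <stdbool.h>

/* CBZ / CBNZ: branch on an entire register being zero or not zero. */
bool cbz_taken(uint64_t reg)  { return reg == 0; }
bool cbnz_taken(uint64_t reg) { return reg != 0; }

/* TBZ / TBNZ: branch on a single bit (0..63) being zero or not zero. */
bool tbz_taken(uint64_t reg, unsigned bit)
{
    return ((reg >> (bit & 63u)) & 1u) == 0;
}

bool tbnz_taken(uint64_t reg, unsigned bit)
{
    return ((reg >> (bit & 63u)) & 1u) != 0;
}
```

Testing bit 63 is the same as testing the sign of a 64-bit value, so tbz_taken(x, 63) corresponds to (int64)X >= 0 and tbnz_taken(x, 63) to (int64)X < 0, and none of the four needs to read or write the NZVC flags.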

So in summary, I have just demonstrated how ARM64 pulls at least 10 "genius moves" compared to past Windows architectures:

- ARM64 did not try to extend existing ARM32 opcode encodings for convenience, but instead started with a fresh slate of opcode encodings.

- ARM64 trimmed useless features from ARM32 (such as predicated conditional execution) instead of carrying it over.

- ARM64 supports 32 registers (each of integer and vector) instead of just 8 or 16 of each as with MMX, SSE2, and AVX2.

- ARM64 has the same fixed 4-byte opcode size whether an instruction is operating on 32 bits or 64 bits, low register numbers or higher register numbers. There is no penalty as with x64's REX prefixes or Thumb2's variable sized instructions.

- ARM64 implements 3-operand ALU and vector operations, which can eliminate a lot of unnecessary MOV instructions that x86 and x64 require in order to preserve source registers.

- ARM64 allows for 8 integer function arguments to be passed in registers, bypassing pushes to the stack. x86, x64, and ARM32 support between 2 and 4 register arguments only.

- ARM64 function calls are implicitly __vectorcall which means no passing on stack of vectors and floating point values as can happen on x86 and x64.

- ARM64 supports pushing and popping 2 registers at a time, which saves code size and also reduces dependencies on the stack pointer and the total number of memory operations.

- ARM64 function calls do not touch memory; instead the return address is stored in register 30, called the Link Register, which makes function calls more efficient.

- the fixed size 4-byte 32-bit wide opcode makes for easier instruction decoding in hardware allowing 8 instructions to decode at once. Instructions never span a cache line or page boundary and a decoder can thus decode multiple instructions quite easily.

In short, ARM64 is both familiar and very similar to earlier instruction set architectures, but has so much more simplicity that it eliminates a lot of ugly edge cases for hardware implementation.
