10/29/2023
My thoughts on the paper
Back in 1999, VMware introduced their first product, VMware Workstation. That was a hosted VMM that ran on top of an existing OS. In 2001, VMware released
ESX Server, which ran on bare metal. Both products used dynamic binary translation, as described in this paper.
Interestingly enough, this paper didn't come out until 2006. The goal here was to compare VMware's software-based VMM with a new hardware-assisted VMM built on
Intel's and AMD's (at the time) new x86 architecture extensions, which made classic trap-and-emulate virtualization possible.
I love the authors' reference to the RISC vs. CISC debate at the end of this paper. Much like how RISC architectures challenged the idea that
doing things in hardware is always faster than doing them in software, this paper showed that VMware's software-based virtualization technique, binary translation,
outperformed VMMs that were based on Intel and AMD's first-generation architecture extensions for virtualization support.
My notes
2. Classic virtualization
- Classically virtualizable - an architecture that can be virtualized purely with trap-and-emulate,
which is a technique described in Popek and Goldberg's famous virtualization requirements paper,
and which I've taken notes on here.
The x86 is not virtualizable through trap-and-emulate.
- A trap-and-emulate VM executes guest operations directly, but at a reduced privilege level. The VMM intercepts traps from
the de-privileged guest and emulates the trapping instruction against the virtual machine state.
- Shadow structures - privileged state that the VMM derives from guest-level "primary structures" (e.g. the FLAGS register).
The VMM traps guest accesses that modify this state and applies the updates to the shadow structure instead.
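To make the primary/shadow split concrete for myself, here's a minimal sketch of trap-and-emulate over a shadow structure. All the names (`VCPU`, `guest_write_flags`, the IF mask) are my own invention - only the flow (guest write traps, VMM emulates against VM state, shadow copy stays derived from the primary) comes from the paper.

```python
class VCPU:
    """Toy model: a primary structure the guest sees, and a shadow
    structure the VMM derives from it and actually enforces."""

    IF_MASK = 0x0200  # interrupt-enable bit in FLAGS

    def __init__(self):
        self.primary_flags = 0  # guest-visible "primary structure"
        self.shadow_flags = 0   # privileged state the VMM trusts

    def guest_write_flags(self, value):
        # The guest runs de-privileged, so this privileged write traps
        # into the VMM instead of hitting real hardware state...
        self._vmm_emulate_write_flags(value)

    def _vmm_emulate_write_flags(self, value):
        # ...and the VMM emulates it against virtual machine state: the
        # guest sees its write in the primary structure, while the VMM
        # re-derives the shadow copy it uses for real decisions.
        self.primary_flags = value
        self.shadow_flags = value & self.IF_MASK
```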
- Shadow paging is a very interesting topic - I'll cover this in a separate blog post.
3. Software virtualization
- x86 obstacles to virtualization: visibility of privileged state (e.g. the guest can read the %cs register and see its privilege level), and privileged instructions like popf that
don't trap in user mode but instead silently drop their privileged effects.
- Binary translation allows us to virtualize x86 while still maintaining high performance, by ensuring that most guest instructions execute without VMM intervention.
- So what is binary translation?
- A binary translator translates x86 binary code to a safe subset of the instruction set. It is dynamic - translation occurs at runtime, interleaved
with execution of the generated code. It is adaptive - it can adjust the translated code in response to guest behavior to increase efficiency.
- The translator parses each guest instruction into an intermediate representation (IR) object. These are accumulated into a translation unit (TU),
which ends at a terminating control-flow instruction. Each translator invocation consumes one TU and produces a compiled code fragment (CCF) -
the binary output that actually gets executed. Most instructions are translated identically (IDENT), but there are exceptions. CCFs are stored in the translation cache (TC).
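The TU → CCF → TC flow can be sketched as a translate-and-cache loop. This is a toy: the instruction handling is invented (real x86 decoding is elided, and a real TU stops at control flow - this one just runs to the end of the buffer); only the cache-miss-then-translate structure mirrors the paper.

```python
# Translation cache: guest start address -> compiled code fragment (CCF).
translation_cache = {}

def translate_tu(guest_code, pc):
    """Consume one translation unit starting at pc, emit a CCF."""
    ccf = []
    while pc < len(guest_code):
        insn = guest_code[pc]
        if insn != "cli":
            # Most instructions get an identical (IDENT) translation.
            ccf.append(insn)
        else:
            # Privileged instructions get a NON-IDENT translation, here
            # the in-memory update from the notes above.
            ccf.append("vcpu.flags.IF = 0")
        pc += 1
    return ccf

def run(guest_code, pc=0):
    if pc not in translation_cache:            # TC miss: translate now,
        translation_cache[pc] = translate_tu(guest_code, pc)
    return translation_cache[pc]               # then execute the CCF (elided)
```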
- Instructions that require special translation (NON-IDENT): PC-relative addressing, direct control flow, indirect control flow, and privileged instructions.
- Some privileged instructions, like cli (clear interrupts), are faster after translation! Instead of trapping from user mode into the kernel, we can
perform a simple in-memory operation like `vcpu.flags.IF = 0`, since we "shadow" the on-CPU privileged state.
- The translator doesn't attempt to "improve" the translated code.
- Inherently, the translator captures an execution trace of the guest code, which gives the TC code good cache locality
if subsequent executions follow similar paths through the guest code as the first.
- Most virtual registers are bound to their physical counterparts during execution of TC code.
- Memory accesses are common, so their translation must be efficient while still preventing access to the VMM.
The VMM keeps its own memory mapped in the upper portion of the address space, and segmentation is used to separate the guest portion (low)
from the VMM portion (high). If the guest tries to touch VMM memory, a protection fault occurs.
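A toy version of that segment-limit check, just to pin down the idea. The split address (`VMM_BASE`) and function name are assumptions of mine; the point is only that guest segments are truncated so any reference into the VMM's high range faults instead of proceeding.

```python
VMM_BASE = 0xC000_0000  # assumed split: VMM lives in the top of the address space

def guest_access(addr):
    """Toy segment-limit check: guest segments are truncated so they never
    reach the VMM range; a reference past the limit raises a fault that
    the VMM can catch, instead of letting the guest read VMM memory."""
    if addr >= VMM_BASE:
        raise MemoryError("protection fault: guest touched the VMM range")
    return addr  # access proceeds as-is
```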
- Translated code gains access to the VMM address space through the %gs segment register, which maps the VMM segment.
If the translator finds guest code that itself references %gs, it rewrites the reference so the guest's use of %gs doesn't collide with the VMM's.
- Adaptive translation - the translator monitors instructions that trap frequently (like memory accesses to protected data such as page tables) and replaces them
with call-outs that avoid the fault overhead. A call-out is a transfer of control from the translated code to the VMM.
I tried to think about a specific case of when and how this would work. My first thought was an instruction that triggers a divide-by-zero fault,
but that's a guest-visible exception the VMM has to deliver to the guest anyway - the traps worth adapting away are the ones the VMM itself induces, like writes to pages it is tracing.
The authors also mention that adaptive translation is used to handle instructions that access I/O devices and the VMM's address range. At first I thought
those would have been caught during initial translation, but the translator can't generally tell statically which addresses an ordinary load or store will touch - it only turns out to hit an I/O device or the VMM's range at runtime.
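Here's the count-and-patch idea from adaptive translation as a toy sketch. The threshold, names, and keying by TC index are all invented (a real implementation tracks actual translated-code locations); only the "after enough faults, retranslate the instruction into a call-out" behavior comes from the paper.

```python
FAULT_THRESHOLD = 3     # invented: retranslate after this many faults
fault_counts = {}       # TC index -> number of faults observed so far

def on_fault(tc, index, callout):
    """Called by the (elided) fault handler when the translated
    instruction at tc[index] traps. Once it has trapped often enough,
    patch it into a call-out that reaches the VMM without faulting."""
    fault_counts[index] = fault_counts.get(index, 0) + 1
    if fault_counts[index] >= FAULT_THRESHOLD:
        tc[index] = callout  # replace the faulting form in the TC
```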
4. Hardware virtualization
- Intel and AMD released architecture extensions which support classic x86 virtualization.
- An in-memory data structure, the virtual machine control block (VMCB), combines control state with a subset of the state of a guest virtual CPU.
A new guest mode supports direct execution of guest code. An instruction, vmrun, loads guest state from the VMCB into the hardware and continues execution in guest mode.
Guest execution continues until a condition chosen by the VMM (expressed in the control bits of the VMCB) is reached; at that point,
the guest's hardware state is saved back into the VMCB, and we exit to "host mode".
Since the host can access the guest state, it can emulate whatever trapped, and then resume the guest.
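The run/exit/emulate/resume cycle above amounts to a loop. A minimal sketch, with `vmrun` and `emulate` passed in as stand-ins for the hardware instruction and the VMM's emulation code, and the exit codes invented:

```python
def run_guest(vmcb, vmrun, emulate):
    """Toy hardware-VMM execution loop.

    vmrun(vmcb) stands in for the hardware instruction: it loads guest
    state from the VMCB, executes in guest mode until a VMM-selected
    exit condition fires, saves guest state back into the VMCB, and
    returns an exit code in host mode.
    """
    while True:
        exit_code = vmrun(vmcb)
        if exit_code == "HLT":
            return vmcb            # guest halted: hand final state back
        emulate(vmcb, exit_code)   # host mode: emulate the trapped op,
                                   # then loop to resume the guest
```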
- The hardware does not exit back to host mode on every system call: guest mode carries its own privileged state in the VMCB, so a syscall can switch the guest from user mode to kernel mode entirely in hardware. This is good for performance.
- VMware's VMM that uses these hardware extensions still relies on software MMU virtualization (shadow page tables), and most emulation code is shared with the software VMM.
- The performance of the hardware-based VMM depends on the frequency of exits and the cost of hardware transitions between guest and host modes.
5. Qualitative comparison
- Binary translation pros: better trap elimination thanks to adaptive translation; emulation routines can use predecoded guest instructions;
and instead of transferring control to the VMM for everything, simple emulation can stay inline in the translation cache.
I think this leads to better memory locality. It doesn't necessarily reduce traps, since call-outs don't trap - they just transfer control to the VMM.
- Hardware VMM pros: better code density (since there is no translation);
easier exception handling (the BT VMM has to do odd things to recover state when exceptions occur in non-IDENT instructions);
and system calls run without VMM intervention - the hardware interacts with the VMCB automatically.
6. Experiments
- Compute-intensive workloads run at similar speeds on the hardware and software VMMs.
As more privileged instructions occur, both VMMs suffer, but the software VMM
typically outperforms the hardware VMM because it can adapt to frequently exiting workloads.
- The only place where the hardware VMM outshone the software VMM was on workloads
with few I/O exits relative to the number of system calls performed.
I/O forces exits out of guest mode, but system calls do not. In the software VMM,
both require call-outs to VMM code.
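A toy way to see why that workload shape favors the hardware VMM - just counting VMM round-trips per VMM, with invented names. (Note the per-event costs differ too: per the paper, a hardware exit is pricier than a software call-out, so the hardware VMM only wins when the syscall savings dominate.)

```python
def hw_vmm_exits(io_exits, syscalls):
    # Hardware VMM: system calls complete entirely in guest mode;
    # only the I/O operations force guest -> host exits.
    return io_exits

def sw_vmm_callouts(io_exits, syscalls):
    # Software VMM: both I/O and system calls leave translated
    # code via call-outs into the VMM.
    return io_exits + syscalls
```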
7. Software and hardware opportunities
- The main difficulty for the hardware VMM in the experiments was MMU virtualization.
Maintaining shadow page tables is expensive!
- Hardware support for MMU virtualization via nested page tables would eliminate page-table tracing overhead and
VMM intervention during guest context switches (the hardware walks both the guest and nested page tables itself, so the guest can switch its own page tables without trapping).
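The two-dimensional walk that nested paging performs can be sketched with flat dicts standing in for page tables (real walks are multi-level; the names here are mine):

```python
def nested_translate(gva, guest_pt, nested_pt):
    """Toy nested page walk: the guest's page table maps guest-virtual
    to guest-physical addresses, and the VMM's nested page table maps
    guest-physical to host-physical. Hardware composes the two, so the
    VMM never has to keep a shadow page table in sync with the guest's."""
    gpa = guest_pt[gva]    # guest-controlled mapping
    hpa = nested_pt[gpa]   # VMM-controlled mapping, walked by hardware
    return hpa
```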
9. Conclusion
- Hardware extensions make VMM design easier, but they don't (yet) improve performance over binary translation and software-based VMMs.