Mini-SVM: Hello World, but it's a hypervisor instead (1/N)
17.06.2021

Mini-SVM

This is a hypervisor project with no specific end goal: just learning and experimentation.

Where to find it?

At the github repo: https://github.com/martinradev/mini-svm

What is a Hypervisor

It's 2021 and virtualization features, like Intel VMX and AMD SVM, have been around for a while. A hypervisor utilizes such features to create a constrained environment, called a Virtual Machine, where software can operate at almost native performance. The software inside can execute almost any instruction (mov, add, ...) as if it were regular software, but with some exceptions. Special instructions like syscall and lgdt, IO operations, MSR writes, etc. can be intercepted and emulated by the hypervisor. Special events like exceptions or traps can also be captured by the hypervisor.

Why write my own hypervisor and blog about it

Sure, there are other hypervisors out there, like KVM for one, but they are fairly complex and one can often get lost in all of the infrastructure, debugging, optimizations and security hardening present there. IMO, starting with a new or smaller project to solve a substantially smaller problem is a better way to understand the fine details of using some technology. This can also lead to discovering missed opportunities, existing bugs or security issues in the larger project, simply because its developers were more focused on the bigger picture: getting it operational and making customers happy.

AMD Secure Virtual Machine (SVM) was chosen because I have an AMD CPU and I'm more familiar with this technology than with Intel VMX. However, there aren't significant differences between the two.

Mini-SVM Hello World

The target for this post is to set up basic state and be able to execute simple code inside the VM, like:

mov rax, 0x1337000
vmmcall

This is only a few hundred lines of not-so-complex code.

Debug environment

Launching a virtual machine requires executing at least a few privileged instructions, which means that the hypervisor has to be a kernel module. This creates the issue that buggy code may cause various undesirable events like a system crash, a CPU shutdown or filesystem corruption. Instead, we can use a feature of KVM called nested virtualization: the host operating system launches a Level-1 VM running another operating system, which launches a Level-2 VM by using Mini-SVM. If our kernel module somehow misbehaves, this crashes the Level-1 VM but not the host operating system. To check if nested virtualization is supported, one can verify /sys/module/kvm_amd/parameters/nested on Linux.

For the Level-1 VM's kernel, I built a recent Linux with KVM and KVM_AMD as part of the kernel image. For the Level-1 VM's user space, I built a rootfs with busybox utilities. A hack to easily get new files into the VM is to just pass -hda file to QEMU and then, inside the VM, copy the virtual hard disk's raw contents to a file by executing something like dd if=/dev/sda of=/mini_svm.ko bs=1M count=16. This saves one from re-packing the rootfs or appending files to it many times over.

Finally, when launching the Level-1 VM, the option -cpu host,+svm should be passed to QEMU.

SVM structures and boilerplate

The hypervisor requires some boilerplate code to set up the initial state, namely model-specific registers and virtualization-related structures. The host OS needs to enable virtualization by setting the EFER_SVME bit in the EFER MSR and needs to allocate a physical page to hold the host state on a context switch. The physical address of this page is written to the HSAVE_PA MSR.
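As a rough sketch, assuming a Linux kernel module, these two MSR writes boil down to the following (the function and variable names here are mine, not necessarily the repo's):

#include <linux/gfp.h>
#include <asm/io.h>
#include <asm/msr.h>

#define EFER_SVME_BIT   (1ULL << 12)
#define MSR_VM_HSAVE_PA 0xC0010117U

static unsigned long hsave_page;

static int mini_svm_enable(void)
{
    u64 efer;

    /* Physical page that holds the host state across vmrun/vmexit. */
    hsave_page = get_zeroed_page(GFP_KERNEL);
    if (!hsave_page)
        return -ENOMEM;

    /* Enable SVM by setting EFER.SVME. */
    rdmsrl(MSR_EFER, efer);
    wrmsrl(MSR_EFER, efer | EFER_SVME_BIT);

    /* Tell the CPU where to save the host state. */
    wrmsrl(MSR_VM_HSAVE_PA, virt_to_phys((void *)hsave_page));
    return 0;
}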

When launching the VM via vmrun, the hypervisor needs to provide the physical address of the Virtual Machine Control Block (VMCB) structure. The layout of the structure can be viewed in Appendix B of the AMD64 Architecture Programmer’s Manual (Volume 2). It's a dauntingly large structure mixing fields of different sizes, including bit fields. To avoid the effort of carefully defining each field and manually adjusting padding, I decided to write a python script to perform the manual work. The script, found at src/generate_vmcb.py, contains definitions for the fixed offsets of the fields relevant to me, which it parses to output a C header file. To guarantee that the structure's layout is not completely broken, the script also outputs static_assert checks for a few members.
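To give an idea, the generated header looks roughly like this (a trimmed, hypothetical excerpt; the offsets come from the manual, the field names are mine):

/* Hypothetical excerpt of the generated VMCB header. */
struct vmcb {
    u8  pad0[0x0C];
    u32 intercept_vec3;   /* 0x00C: INTR, HLT, IOIO, MSR, ... */
    u32 intercept_vec4;   /* 0x010: VMRUN, VMMCALL, ... */
    u8  pad1[0x58 - 0x14];
    u32 guest_asid;       /* 0x058 */
    u8  pad2[0x70 - 0x5C];
    u64 exitcode;         /* 0x070 */
    u64 exitinfo1;        /* 0x078 */
    u64 exitinfo2;        /* 0x080 */
    u64 exitintinfo;      /* 0x088 */
    u64 np_enable;        /* 0x090: bit 0 enables nested paging */
    u8  pad3[0xB0 - 0x98];
    u64 ncr3;             /* 0x0B0 */
    u8  pad4[0x400 - 0xB8];
    u8  save_area[0x400]; /* the save area starts at offset 0x400 */
} __attribute__((packed));

static_assert(offsetof(struct vmcb, exitcode) == 0x70, "bad exitcode offset");
static_assert(offsetof(struct vmcb, ncr3) == 0xB0, "bad nCR3 offset");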

The VMCB structure is split into two areas: the control area and the save area. The first contains various fields for performing control operations, and the second holds part of the VM's saved state. Through the control area, the hypervisor can indicate which instructions are intercepted, when to flush TLBs, the VM's ASID, etc. The save area contains special registers like the segment registers, RIP, RAX, RSP, etc.

In summary, to launch a VM one just needs to:

  1. enable SVM by writing to the EFER MSR
  2. allocate a physical page for the HSAVE area and write its phys address to HSAVE_PA MSR
  3. allocate a physical page for the VMCB and initialize some of its fields
  4. execute the sequence to run the VM: mov rax, VMCB_PA; clgi; vmrun; stgi;

The VM will immediately crash but that's a good start! I found it useful for debugging to enable nested virtualization and use the top-level hypervisor output to verify what's wrong. KVM would report which VMCB fields are erroneously populated. For example, interception of the vmrun instruction always has to be enabled.
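With a header like the one above, the essential control-area setup boils down to a few lines (a sketch; per the manual, the exit code equals the intercept vector's base plus the bit index):

/* Intercept HLT: bit 24 of vector 3, exit code 0x78. */
vmcb->intercept_vec3 |= (1U << 24);
/* Intercepting VMRUN (bit 0 of vector 4) is mandatory. */
vmcb->intercept_vec4 |= (1U << 0);
/* Intercept VMMCALL: bit 1 of vector 4, exit code 0x81. */
vmcb->intercept_vec4 |= (1U << 1);
/* ASID 0 is reserved for the host, so the guest gets ASID 1. */
vmcb->guest_asid = 1;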

For the purpose of fast iteration, all the state is exposed to user space. This includes the VM's physical address space, register state and the VMCB. The first benefit of this approach is that most of the hypervisor code can be implemented in user space, and in any language. The second benefit is that the privileged code remains fairly small which makes it portable to other operating systems like Windows.

Managing address spaces and memory

Arguably the most complex part of KVM is the memory manager. However, I do not need all the bells and whistles, and I would also prefer to keep it minimalistic.

When translating a Guest Virtual Address (GVA) to a physical address, two translations happen: first from a GVA to a Guest Physical Address (GPA), and then from a GPA to a Host Physical Address (HPA). The first translation happens via the Guest Page Table (GPT) and second via the Nested Page Table (NPT). The page table structures on x86-64 are not complex, but I want to keep things even simpler: they are as shallow as possible and everything is Read-Write-Executable.

For constructing the GPT, I need to populate only two levels with one entry each: the PML4 and the PDP. By choosing so, I need to allocate just two physical pages, put just one entry in each and voila: the VM has access to 1 GiB of virtual address space! This additionally creates an identity mapping: the GVA matches the GPA. To set the GPT, I only need to write its GPA into the CR3 register in the VMCB save area.
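In code, the whole GPT construction is roughly this (a sketch: gpt_pml4 and gpt_pdp are the two zeroed guest pages, the *_gpa variables hold their guest physical addresses, and save_area stands for the VMCB save area with hypothetical field names):

#define PTE_P  (1ULL << 0)  /* present */
#define PTE_RW (1ULL << 1)  /* writable */
#define PTE_PS (1ULL << 7)  /* in a PDP entry: map a 1 GiB page */

/* One PML4 entry pointing to the PDP ... */
gpt_pml4[0] = gpt_pdp_gpa | PTE_P | PTE_RW;
/* ... and one PDP entry identity-mapping GVA 0..1 GiB to GPA 0..1 GiB. */
gpt_pdp[0] = 0x0ULL | PTE_P | PTE_RW | PTE_PS;

/* The GPT root goes into the guest's CR3. */
save_area->cr3 = gpt_pml4_gpa;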

For the NPT, I allocated enough 4 KiB pages to fill the pre-determined guest physical memory. Unfortunately, this does mean that the page hierarchy becomes deep (PML4, PDP, PD, PT), but it also means that we could easily perform page tracking of the VM in the future if we need to. The other benefit is that the HV can almost always allocate memory for the VM. If we were to use only a PML4 and a PDP, the HV would have to allocate 1 GiB of physically-contiguous memory, which can be difficult. To set the NPT, I need to write the HPA of the page table to the nCR3 field of the VMCB.
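Here is a sketch of inserting one 4 KiB GPA-to-HPA mapping; next_table() is a hypothetical helper that allocates a zeroed table page on demand and returns its kernel virtual address. Note that NPT walks are treated as user-mode accesses, so the entries need the User bit on top of Present and Writable:

#define NPT_FLAGS (PTE_P | PTE_RW | (1ULL << 2)) /* P, RW, User */

static void npt_map_4k(u64 *npt_pml4, u64 gpa, u64 hpa)
{
    u64 *pdp = next_table(npt_pml4, (gpa >> 39) & 0x1FF);
    u64 *pd  = next_table(pdp, (gpa >> 30) & 0x1FF);
    u64 *pt  = next_table(pd, (gpa >> 21) & 0x1FF);

    pt[(gpa >> 12) & 0x1FF] = hpa | NPT_FLAGS;
}

/* Point the VMCB at the NPT and enable nested paging. */
vmcb->ncr3 = virt_to_phys(npt_pml4);
vmcb->np_enable = 1;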

This boils down to around just 250 lines of simple C code. Compare that to the 7k LoC for the KVM memory manager.

Entering the VM

If you have written a toy x86 OS before, you may have had to write some boilerplate assembly to transition from Real Mode to Protected Mode to Long Mode. Well, we do not need to write such boilerplate code, since the HV can directly set up the control registers and segments in the VMCB. Understanding x86 segments is not even necessary, since one can just modify KVM to dump the VMCB for a Linux VM. Copying the segments' hidden state is just fine.
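For reference, a long-mode save-area setup looks roughly like this (a sketch mirroring such a VMCB dump; the field names are mine, and the 12-bit attrib value packs the segment's type, S, DPL, P, L, DB and G bits):

/* 64-bit code segment: type 0xB, S=1, P=1, L=1 -> attrib 0x29B. */
save->cs.selector = 0x08;
save->cs.attrib   = 0x029B;
save->cs.base     = 0x0;
save->cs.limit    = 0xFFFFFFFF;

/* Long mode: PAE paging with LME/LMA; SVME must remain set in the
 * guest EFER or vmrun fails its consistency checks. */
save->cr0  = (1ULL << 0) | (1ULL << 31);                 /* PE | PG */
save->cr4  = (1ULL << 5);                                /* PAE */
save->efer = (1ULL << 8) | (1ULL << 10) | (1ULL << 12);  /* LME | LMA | SVME */

/* Entry point of the VM program (here assumed to be GVA 0). */
save->rip = 0x0;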

Ok, state is set up, now we can enter. But not quite! Earlier I mentioned that one needs to execute the sequence mov rax, VMCB_PA; clgi; vmrun; stgi; to enter the VM, but this is insufficient. The general-purpose registers, like RBX, RCX, ..., are shared between the HV and the VM. If they are not taken care of, the HV would continue running with the VM's register values after a VM exit and just crash. Thus, the HV needs to save its register state, enter the VM, manually save the VM's registers upon VM exit, and finally restore its own registers. I suppose this is the reason the AMD folks decided to have RAX, RIP and RSP be automatically saved upon vmrun and restored upon VM exit. In particular, the stack can hold pointers to structures which can be used for storing the VM's state and for restoring the complete HV state.

In short, the code for entering and exiting the VM is (AT&T-style pseudo-assembly, where the destination operand comes last and ranges like rbx ... r15 elide the full sequences):

mini_svm_run(rdi = VMCB_PA, rsi = VM regs HVA):
    push rbx ... r15                         ; save most host registers
    push rsi                                 ; save the VM regs pointer
    push rdi                                 ; save the VMCB PA

    mov rsi, rax                             ; rax = rsi, the base of the VM regs
    mov 0x0(rax) ... 0x68(rax), rbx ... r15  ; load the VM's registers
    pop rax                                  ; load the VMCB PA into rax

    clgi
    vmrun
    stgi

    pop rax                                  ; rax now points to the VM regs structure
    mov rbx ... r15, 0x0(rax) ... 0x68(rax)  ; save the VM's registers into the structure
    pop r15 ... rbx                          ; restore the host registers
    ret

This is sufficient to enter the VM multiple times.

State management on multi-core CPUs

Simple as they are, the boilerplate code and the vmrun code have a bug which becomes obvious on a multi-core system: the HV-written MSRs are per logical core, and the code would break if the HV gets preempted and rescheduled on another core.

There are a couple of options that come to mind:

  1. Do the MSR setup and vmrun "atomically" by disabling preemption.
  2. Write the MSRs on each logical core.
  3. Hack around it by having the setup code and HV runtime run on the same core.

I chose option 3 and achieved it by using taskset: taskset 0x1 insmod mini-svm.ko and later taskset 0x1 ./hv_program.
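For reference, option 1 would look something like this in the kernel module (a sketch reusing the hypothetical names from the earlier snippets):

preempt_disable();
mini_svm_enable();             /* write EFER.SVME and HSAVE_PA on this core */
mini_svm_run(vmcb_pa, regs);   /* the assembly routine from above */
preempt_enable();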

Handling VM exits

The final step before writing a "Hello World" VM is to handle VM exits to some extent. VM exits are caused by events which need to be handled by the HV: exceptions, intercepted instructions, request for shutdown, etc. The VMCB structure contains four fields which can be used for decoding a vmexit: exitcode, exitinfo1, exitinfo2, exitintinfo.

For simple instruction interception, we can just use the exitcode to figure out which instruction was executed: cpuid, vmmcall, rdtsc, etc. We can read the VM's registers, update them according to the emulated instruction, and then resume the VM. However, the new RIP for the VM has to be set after emulating the instruction. The reason for this is that the instruction was decoded and emulated by the HV, but never committed. Thus, the RIP register still contains the address of the emulated instruction. Conveniently, the VMCB contains a field, named nRIP, which holds the next instruction pointer. Doing rip = nRIP works.
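Putting it together, a minimal exit handler is just a switch on the exit code (a sketch with the hypothetical names from earlier; 0x81 is the VMMCALL exit code, 0x78 the HLT one):

switch (vmcb->exitcode) {
case 0x81: /* VMMCALL */
    /* The VM's RAX lives in the VMCB save area, not in the regs structure. */
    printf("RAX is %llx. As char: %c\n", save->rax, (char)save->rax);
    save->rip = vmcb->nrip;  /* commit RIP past the emulated vmmcall */
    break;
case 0x78: /* HLT */
    vm_alive = false;
    break;
default:
    /* Unexpected exit: dump the VMCB and give up. */
    vm_alive = false;
    break;
}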

Writing a Hello World VM

Now that everything is finally set up, we need to write a tiny program to execute as the VM. The program must be simple enough and standalone, without any image headers, dynamic libraries, etc. While we could certainly encode the machine instructions directly, C is also a possibility.

Here's the example VM program written in C:

void _start() {
    const char msg[] = "Hello World!";
    for (unsigned i = 0; i < sizeof(msg); ++i) {
        asm volatile(
            "xorl %%eax, %%eax\n\t"
            "movb %0, %%al\n\t"
            "vmmcall\n\t"
            :
            : "r"(msg[i])
            : "%rax", "%rbx", "%rcx", "%rdx");
    }
    asm volatile("hlt\n\t");
}

It will loop through each character of the message, load the character into the AL register and execute a vmmcall. To build the program, one can first compile it into an object file:

gcc vm-program.c -O3 -fno-pie -m64 -c -nostdlib -o vm-program.o

and then request the linker to output a raw binary:

ld -m elf_x86_64 --oformat=binary -T linker.ld vm-program.o -o vm-program -nostdlib

The linker script just instructs that the .text section should be loaded at some address. TBH, the linker script is probably not necessary at this point.

Now, let's run the HV user space program. The HV program does the communication with our kernel module via ioctls, populates the VM's state and starts running the VM. Each vmmcall instruction gets intercepted and the value of RAX gets printed out. Upon intercepting the hlt instruction, the VM gets killed.
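The skeleton of the HV program is then a simple run loop over the ioctl interface (a hypothetical sketch; the actual device name and ioctl numbers are in the repo):

int fd = open("/dev/mini_svm", O_RDWR);
/* ... mmap the VM's physical memory, regs and VMCB; copy in vm-program ... */
for (;;) {
    ioctl(fd, MINI_SVM_IOCTL_RUN, 0);    /* enter the VM until the next exit */
    if (!handle_exit(vmcb, save, regs))  /* the switch from the previous section */
        break;
}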

Output:

# ./hv_program
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 48. As char: H
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 65. As char: e
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 6c. As char: l
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 6c. As char: l
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 6f. As char: o
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 20. As char:
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 57. As char: W
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 6f. As char: o
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 72. As char: r
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 6c. As char: l
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 64. As char: d
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 21. As char: !
exitcode: 81. Name: MINI_SVM_EXITCODE_VMEXIT_VMMCALL RAX is 0. As char:
exitcode: 78. Name: MINI_SVM_EXITCODE_VMEXIT_HLT

There it is! We have a small HV based on AMD SVM and even a tiny program which prints out "Hello World".

Final words

This is the first post of a series. I do have some interesting ideas I would like to explore with this HV. For now, booting an OS is not one of them.

See you around :)