QEMU Accelerator Technical Documentation
The QEMU Accelerator (KQEMU) is a driver that allows a user application to run x86 code in a Virtual Machine (VM). The code can be either user or kernel code, in 64, 32 or 16 bit protected mode. KQEMU is similar in essence to the Linux vm86() system call, but it adds new concepts to improve memory handling.
KQEMU has been ported to several host OSes (currently Linux, Windows, FreeBSD, Solaris). It can execute code from many guest OSes (e.g. Linux, Windows 2000/XP) even if the host CPU does not support hardware virtualization.
In this document, we assume that the reader has a good knowledge of the x86 processor and of the problems associated with the virtualization of x86 code.
We describe version 1.3.0 of the Linux implementation. The implementations on other OSes use the same calls, so they can be understood by reading the Linux API specification.
KQEMU manipulates three kinds of addresses: physical addresses (addresses in the physical address space of the VM), RAM addresses (offsets into the RAM area allocated to the VM by the client), and user addresses (virtual addresses in the client process).
KQEMU has a physical page table which is used to associate a RAM address or a device I/O address range to a given physical page. It also tells if a given RAM address is visible as read-only memory. The same RAM address can be mapped at several different physical addresses. Only 4 GB of physical address space is supported in the current KQEMU implementation. Hence the bits of order >= 32 of the physical addresses are ignored.
It is very important for the VM to be able to tell if a given RAM page has been modified. It can be used to optimize VGA refreshes, to flush a dynamic translator cache (when used with QEMU), to handle live migration or to optimize MMU emulation.
In KQEMU, each RAM page has an associated dirty byte in the array init_params.ram_dirty. The dirty byte is set to 0xff when the corresponding RAM page is modified. That way, at most 8 clients can manage a dirty bit in each page. KQEMU reserves the dirty bit 0x04 for its internal use.
The client must notify KQEMU if some entries of the array init_params.ram_dirty were modified from 0xff to a different value. The addresses of the corresponding RAM pages are stored by the client in the array init_params.ram_pages_to_update.
The client must also notify KQEMU if a RAM page has been modified, independently of the init_params.ram_dirty state. This is done with the init_params.modified_ram_pages array.
Symmetrically, KQEMU notifies the client if a RAM page has been modified, using the init_params.modified_ram_pages array. The client can use this information, for example, to invalidate a dynamic translation cache.
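As an illustration, the following sketch shows how a client might use one of the remaining dirty bits: it scans init_params.ram_dirty, reacts to the pages its bit marks as dirty, then queues them in init_params.ram_pages_to_update for the next KQEMU_EXEC call. The bit value CLIENT_DIRTY_BIT and the surrounding logic are assumptions made for the example, not part of the KQEMU API.

#include <stdint.h>

/* Hypothetical dirty bit managed by the client; it must not be 0x04,
 * which KQEMU reserves for its own use. */
#define CLIENT_DIRTY_BIT 0x01

/* Scan the dirty bytes, let the client react to the pages it finds dirty,
 * clear the client bit (the byte is then no longer 0xff) and queue the RAM
 * addresses of these pages so that the next KQEMU_EXEC call can pick them
 * up through cpu_state.nb_ram_pages_to_update. */
static unsigned int scan_dirty_pages(uint8_t *ram_dirty, uint64_t ram_size,
                                     uint64_t *ram_pages_to_update,
                                     unsigned int max_pages)
{
    uint64_t nb_pages = ram_size >> 12;   /* 4 KB pages */
    unsigned int n = 0;

    for (uint64_t i = 0; i < nb_pages && n < max_pages; i++) {
        if (ram_dirty[i] & CLIENT_DIRTY_BIT) {
            /* ... refresh the VGA display, invalidate translated code, ... */
            ram_dirty[i] &= ~CLIENT_DIRTY_BIT;
            ram_pages_to_update[n++] = i << 12;   /* RAM address of the page */
        }
    }
    return n;   /* value to store in cpu_state.nb_ram_pages_to_update */
}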
A user client wishing to create a new virtual machine must open the device `/dev/kqemu'. There is no hard limit on the number of virtual machines that can be created and run at the same time, except for the available memory.
KQEMU_GET_VERSION ioctl
It returns the KQEMU API version as an int. The client must use it to determine if it is compatible with the KQEMU driver.
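A minimal sketch of this first step is shown below. It assumes the ioctl numbers and structures come from the KQEMU header (referred to here as kqemu.h), that the expected API version is available as a KQEMU_VERSION constant, and that the version is returned through the ioctl argument; these details are assumptions of the example.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "kqemu.h"   /* assumed to define the ioctl numbers and structures */

int main(void)
{
    int fd = open("/dev/kqemu", O_RDWR);
    if (fd < 0) {
        perror("open /dev/kqemu");
        return 1;
    }

    int version = 0;
    if (ioctl(fd, KQEMU_GET_VERSION, &version) < 0) {
        perror("KQEMU_GET_VERSION");
        close(fd);
        return 1;
    }
    /* KQEMU_VERSION is assumed to be the API version the client was built against. */
    if (version != KQEMU_VERSION) {
        fprintf(stderr, "kqemu: API version mismatch (driver %d, client %d)\n",
                version, KQEMU_VERSION);
        close(fd);
        return 1;
    }
    /* ... KQEMU_INIT, KQEMU_SET_PHYS_MEM, KQEMU_EXEC ... */
    close(fd);
    return 0;
}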
KQEMU_INIT ioctl
Input parameter: struct kqemu_init init_params
It must be called once to initialize the VM. The following structure is used as input parameter:
struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};
The pointers ram_base, ram_dirty, pages_to_flush, ram_pages_to_update and modified_ram_pages must be page aligned and must point to user allocated memory.
On Linux, due to a kernel bug related to memory swapping, the corresponding memory must be mmapped from a file. We plan to remove this restriction in a future implementation.
ram_size must be a multiple of 4K and is the quantity of RAM allocated to the VM.
ram_base is a pointer to the VM RAM. It must contain at least ram_size bytes.
ram_dirty is a pointer to a byte array of length ram_size/4096. Each byte indicates if the corresponding VM RAM page has been modified (see section 2.2 RAM page dirtiness).
pages_to_flush is a pointer to an array of KQEMU_MAX_PAGES_TO_FLUSH longs. It is used to indicate which TLB entries must be flushed before executing code in the VM.
ram_pages_to_update is a pointer to an array of KQEMU_MAX_RAM_PAGES_TO_UPDATE longs. It is used to notify the VM that some RAM pages have been dirtied.
modified_ram_pages is a pointer to an array of KQEMU_MAX_MODIFIED_RAM_PAGES longs. It is used to notify the VM or the client that RAM pages have been modified.
The value 0 is returned if the ioctl succeeded.
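The sketch below shows one plausible initialization sequence under these constraints. The temporary file names, the sizing of the arrays and the initial dirty state are assumptions of the example; only the structure fields and the KQEMU_MAX_* constants come from the API described above.

#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include "kqemu.h"   /* assumed to provide struct kqemu_init and the ioctl numbers */

/* Map 'size' bytes of page aligned memory backed by a file, as required
 * by the Linux implementation (see the note above). The path is purely
 * illustrative. */
static void *alloc_user_area(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

static int vm_init(int kqemu_fd, uint64_t ram_size, struct kqemu_init *init)
{
    memset(init, 0, sizeof(*init));
    init->ram_size = ram_size;                        /* multiple of 4 KB */
    init->ram_base = alloc_user_area("/tmp/kqemu.ram", ram_size);
    init->ram_dirty = alloc_user_area("/tmp/kqemu.dirty", ram_size / 4096);
    init->pages_to_flush =
        alloc_user_area("/tmp/kqemu.flush",
                        KQEMU_MAX_PAGES_TO_FLUSH * sizeof(uint64_t));
    init->ram_pages_to_update =
        alloc_user_area("/tmp/kqemu.update",
                        KQEMU_MAX_RAM_PAGES_TO_UPDATE * sizeof(uint64_t));
    init->modified_ram_pages =
        alloc_user_area("/tmp/kqemu.modified",
                        KQEMU_MAX_MODIFIED_RAM_PAGES * sizeof(uint64_t));
    if (!init->ram_base || !init->ram_dirty || !init->pages_to_flush ||
        !init->ram_pages_to_update || !init->modified_ram_pages)
        return -1;
    /* Start with every page marked dirty (illustrative choice). */
    memset(init->ram_dirty, 0xff, ram_size / 4096);
    return ioctl(kqemu_fd, KQEMU_INIT, init);         /* 0 on success */
}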
KQEMU_SET_PHYS_MEM ioctl
The following structure is used as input parameter:
struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};
This ioctl modifies the internal KQEMU physical to RAM mappings. After the ioctl is executed, the physical address range [phys_addr; phys_addr + size[ is mapped to the RAM addresses [ram_addr; ram_addr + size[ if io_index is KQEMU_IO_MEM_RAM or KQEMU_IO_MEM_ROM. If KQEMU_IO_MEM_ROM is used, writes to the RAM are ignored.
When io_index is KQEMU_IO_MEM_UNASSIGNED, the physical memory range corresponds to a device I/O region. When a memory access is done to it, KQEMU_EXEC returns with cpu_state.retval set to KQEMU_RET_SOFTMMU.
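As an example, a client could describe a simple physical memory layout as sketched below. The layout itself (RAM at physical address 0 plus one unassigned device page) is purely illustrative.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include "kqemu.h"   /* assumed struct kqemu_phys_mem and KQEMU_IO_MEM_* constants */

/* Map the guest physical range [phys_addr; phys_addr + size[ either to
 * VM RAM (offset ram_addr in the area given to KQEMU_INIT) or to an
 * unassigned/device region. Only the low 32 bits of phys_addr are
 * significant in the current implementation. */
static int set_phys_mem(int kqemu_fd, uint64_t phys_addr, uint64_t size,
                        uint64_t ram_addr, uint32_t io_index)
{
    struct kqemu_phys_mem m;
    memset(&m, 0, sizeof(m));
    m.phys_addr = phys_addr;
    m.size = size;
    m.ram_addr = ram_addr;
    m.io_index = io_index;
    return ioctl(kqemu_fd, KQEMU_SET_PHYS_MEM, &m);
}

/* Hypothetical layout: 128 MB of RAM at physical address 0, and a 4 KB
 * device I/O page at 0xfee00000 whose accesses will make KQEMU_EXEC
 * return with cpu_state.retval set to KQEMU_RET_SOFTMMU. */
static int setup_phys_layout(int kqemu_fd)
{
    if (set_phys_mem(kqemu_fd, 0, 128 << 20, 0, KQEMU_IO_MEM_RAM) < 0)
        return -1;
    return set_phys_mem(kqemu_fd, 0xfee00000, 4096, 0, KQEMU_IO_MEM_UNASSIGNED);
}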
KQEMU_MODIFY_RAM_PAGE ioctl
Input parameter: int nb_pages
Notify the VM that nb_pages RAM pages were modified. The corresponding RAM page addresses are written by the client in the init_params.modified_ram_pages array given with the KQEMU_INIT ioctl.
Note: This ioctl currently does nothing, but clients must use it for later compatibility.
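A sketch of the client side of this notification is given below. Whether the page count is passed by value or by pointer depends on the ioctl definition in the KQEMU header; a pointer is assumed here, and the batching policy is illustrative.

#include <stdint.h>
#include <sys/ioctl.h>
#include "kqemu.h"   /* assumed ioctl numbers and KQEMU_MAX_MODIFIED_RAM_PAGES */

/* Tell KQEMU that 'nb_pages' RAM pages were modified behind its back
 * (for example by device DMA emulated in the client). The page addresses
 * are first written into the modified_ram_pages array registered with
 * KQEMU_INIT. */
static int notify_modified_ram_pages(int kqemu_fd,
                                     uint64_t *modified_ram_pages,
                                     const uint64_t *pages, int nb_pages)
{
    if (nb_pages > KQEMU_MAX_MODIFIED_RAM_PAGES)
        return -1;                      /* caller must split the batch */
    for (int i = 0; i < nb_pages; i++)
        modified_ram_pages[i] = pages[i];
    /* Currently a no-op in the driver, but required for later compatibility. */
    return ioctl(kqemu_fd, KQEMU_MODIFY_RAM_PAGE, &nb_pages);
}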
KQEMU_EXEC ioctl
Input/Output parameter: struct kqemu_cpu_state cpu_state
Structure definitions:
struct kqemu_segment_cache {
    uint16_t selector;
    uint16_t padding1;
    uint32_t flags;
    uint64_t base;
    uint32_t limit;
    uint32_t padding2;
};

struct kqemu_cpu_state {
    uint64_t regs[16];
    uint64_t eip;
    uint64_t eflags;
    struct kqemu_segment_cache segs[6]; /* selector values */
    struct kqemu_segment_cache ldt;
    struct kqemu_segment_cache tr;
    struct kqemu_segment_cache gdt; /* only base and limit are used */
    struct kqemu_segment_cache idt; /* only base and limit are used */
    uint64_t cr0;
    uint64_t cr2;
    uint64_t cr3;
    uint64_t cr4;
    uint64_t a20_mask;
    /* sysenter registers */
    uint64_t sysenter_cs;
    uint64_t sysenter_esp;
    uint64_t sysenter_eip;
    uint64_t efer;
    uint64_t star;
    uint64_t lstar;
    uint64_t cstar;
    uint64_t fmask;
    uint64_t kernelgsbase;
    uint64_t tsc_offset;
    uint64_t dr0;
    uint64_t dr1;
    uint64_t dr2;
    uint64_t dr3;
    uint64_t dr6;
    uint64_t dr7;
    uint8_t cpl;
    uint8_t user_only;
    uint16_t padding1;
    uint32_t error_code; /* error_code when exiting with an exception */
    uint64_t next_eip; /* next eip value when exiting with an interrupt */
    uint32_t nb_pages_to_flush;
    int32_t retval;
    uint32_t nb_ram_pages_to_update;
    uint32_t nb_modified_ram_pages;
};
Execute x86 instructions in the VM context. The full x86 CPU state is defined in this structure. It contains in particular the value of the 8 (or 16 for x86_64) general purpose registers, the contents of the segment caches, the RIP and EFLAGS values, etc...
If cpu_state.user_only is 1, a user only emulation is done. cpu_state.cpl must be 3 in that case.
KQEMU_EXEC does the following:
- It updates the dirty state of the cpu_state.nb_ram_pages_to_update RAM pages listed in the array init_params.ram_pages_to_update. If cpu_state.nb_ram_pages_to_update has the value KQEMU_RAM_PAGES_UPDATE_ALL, all the RAM pages may have been dirtied and the array init_params.ram_pages_to_update is ignored.
- It takes into account that the cpu_state.nb_modified_ram_pages RAM pages listed in the array init_params.modified_ram_pages were modified by the client.
- It flushes the virtual CPU TLB entries for the addresses listed in the array init_params.pages_to_flush, of length cpu_state.nb_pages_to_flush. If cpu_state.nb_pages_to_flush is KQEMU_FLUSH_ALL, all the TLBs are flushed and the array init_params.pages_to_flush is ignored.
- It loads the virtual CPU state from cpu_state and runs the VM code.
- When execution stops, it saves the virtual CPU state back into cpu_state.
- It stores the reason why the execution stopped in cpu_state.retval.
- It sets cpu_state.nb_pages_to_flush and init_params.pages_to_flush to notify the client that some virtual CPU TLBs were flushed. The client can use this notification to synchronize its own virtual TLBs with KQEMU.
- It sets cpu_state.nb_ram_pages_to_update to 1 if some RAM dirty bytes were transitioned from dirty (0xff) to a non-dirty value. Otherwise, cpu_state.nb_ram_pages_to_update is set to 0.
- It sets cpu_state.nb_modified_ram_pages and init_params.modified_ram_pages to notify the client that some RAM pages were modified.
cpu_state.retval indicates the reason why the execution was stopped (a client loop handling these values is sketched after the list):
KQEMU_RET_EXCEPTION | n
cpu_state.error_code contains the exception error code if one is needed. It should be noted that in user only emulation, KQEMU handles no exceptions by itself.
KQEMU_RET_INT | n
cpu_state.next_eip contains the value of RIP after the instruction raising the interrupt. cpu_state.eip contains the value of RIP at the instruction raising the interrupt.
KQEMU_RET_SOFTMMU
The execution stopped on a memory access that KQEMU cannot handle itself, for example an access to a device I/O region declared with KQEMU_IO_MEM_UNASSIGNED (see the KQEMU_SET_PHYS_MEM ioctl).
KQEMU_RET_INTR
KQEMU_RET_SYSCALL
cpu_state.next_eip contains the value of RIP after the instruction. cpu_state.eip contains the RIP of the instruction.
KQEMU_RET_ABORT
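To tie the pieces together, the sketch below shows a minimal client loop around KQEMU_EXEC. The handler functions are placeholders to be supplied by the client, the way the exception or interrupt number n is extracted from retval is an assumed encoding, and the handling of KQEMU_RET_INTR is an assumption of the example.

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include "kqemu.h"   /* assumed struct kqemu_cpu_state, ioctl numbers and KQEMU_RET_* values */

/* Placeholder handlers supplied by the client (QEMU in practice). */
void raise_guest_exception(struct kqemu_cpu_state *s, int intno);
void raise_guest_interrupt(struct kqemu_cpu_state *s, int intno);
void emulate_one_instruction(struct kqemu_cpu_state *s);

static int run_vm(int kqemu_fd, struct kqemu_cpu_state *s)
{
    for (;;) {
        /* Before each run, the client fills nb_pages_to_flush,
         * nb_ram_pages_to_update and nb_modified_ram_pages
         * (0 if nothing changed since the previous KQEMU_EXEC). */
        if (ioctl(kqemu_fd, KQEMU_EXEC, s) < 0)
            return -1;

        int ret = s->retval;
        switch (ret & ~0xff) {          /* assumed encoding of the "| n" forms */
        case KQEMU_RET_EXCEPTION:
            raise_guest_exception(s, ret & 0xff);   /* error code in s->error_code */
            continue;
        case KQEMU_RET_INT:
            raise_guest_interrupt(s, ret & 0xff);   /* s->eip / s->next_eip as documented */
            continue;
        }
        switch (ret) {
        case KQEMU_RET_SOFTMMU:
            emulate_one_instruction(s); /* e.g. a device I/O access */
            continue;
        case KQEMU_RET_INTR:
            continue;                   /* assumed here: simply re-enter the VM */
        case KQEMU_RET_SYSCALL:
            /* user-only emulation: dispatch the syscall at s->eip / s->next_eip */
            return 0;
        case KQEMU_RET_ABORT:
        default:
            fprintf(stderr, "kqemu: aborting, retval=0x%x\n", ret);
            return -1;
        }
    }
}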
The main priorities when implementing KQEMU were simplicity and security. Unlike other virtualization systems, it does not do any dynamic translation or code patching.
Note 1: KQEMU does not currently use the hardware virtualization features of newer x86 CPUs. We expect that the limitations would be different in that case.
Note 2: KQEMU supports both x86 and x86_64 CPUs.
Before entering the VM, the following conditions must be satisfied:
If EFLAGS.IF is set, the following assumptions are made on the executing code:
If EFLAGS.IF is reset, the code is interpreted, so the VM code can be accurately executed. Some instructions trap to the user space emulator because the interpreter does not handle them. A limitation of the interpreter is that segment limits are currently not always tested.
The VM code is always run with CPL = 3 on the host, so the VM code has no more privilege than regular user code.
The MMU is used to protect the memory used by the KQEMU monitor. That way, no segment limit patching is necessary. Moreover, the guest OS is free to use any virtual address, in particular the ones near the start or the end of the virtual address space. The price to pay is that CR3 must be modified at every emulated system call because different page tables are needed for user and kernel modes.
This document was generated on 30 May 2008 using texi2html 1.56k.