Anonymous inodes and file structures
Previously, when we discussed QEMU, we said that the Linux kernel allocates file structures and sets up their f_op and anonymous inodes. Let's look into the kvm_main.c
file:
static struct file_operations kvm_chardev_ops = {
	.unlocked_ioctl = kvm_dev_ioctl,
	.compat_ioctl	= kvm_dev_ioctl,
	.llseek		= noop_llseek,
};

kvm_dev_ioctl()
	switch (ioctl) {
	case KVM_GET_API_VERSION:
		if (arg)
			goto out;
		r = KVM_API_VERSION;
		break;
	case KVM_CREATE_VM:
		r = kvm_dev_ioctl_create_vm(arg);
		break;
	case KVM_CHECK_EXTENSION:
		r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
		break;
	case KVM_GET_VCPU_MMAP_SIZE:
	...
	}
Just like kvm_chardev_ops
, there exist kvm_vm_fops
and kvm_vcpu_fops
:
static struct file_operations kvm_vm_fops = {
	.release	= kvm_vm_release,
	.unlocked_ioctl = kvm_vm_ioctl,
	...
	.llseek		= noop_llseek,
};

static struct file_operations kvm_vcpu_fops = {
	.release	= kvm_vcpu_release,
	.unlocked_ioctl = kvm_vcpu_ioctl,
	...
	.mmap		= kvm_vcpu_mmap,
	.llseek		= noop_llseek,
};
An inode allocation may be seen as follows:
anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR | O_CLOEXEC);
Data structures
From the perspective of the KVM kernel modules, each virtual machine is represented by a kvm
structure:
include/linux/kvm_host.h:

struct kvm {
	...
	struct mm_struct *mm; /* userspace tied to this vm */
	...
	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
	...
	struct kvm_io_bus *buses[KVM_NR_BUSES];
	...
	struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
	...
};
As you can see in the preceding code, the kvm
structure contains an array of pointers to kvm_vcpu
structures, which are the counterparts of the CPUX86State
structures in the QEMU-KVM user space. A kvm_vcpu
structure consists of a common part and an x86 architecture-specific part, which includes the register content:
struct kvm_vcpu {
	...
	struct kvm *kvm;
	int cpu;
	...
	int vcpu_id;
	...
	struct kvm_run *run;
	...
	struct kvm_vcpu_arch arch;
	...
};
The x86 architecture-specific part of the kvm_vcpu
structure contains fields to which the guest register state can be saved after a VM exit and from which the guest register state can be loaded before a VM entry:
arch/x86/include/asm/kvm_host.h:

struct kvm_vcpu_arch {
	...
	unsigned long regs[NR_VCPU_REGS];
	unsigned long cr0;
	unsigned long cr0_guest_owned_bits;
	...
	struct kvm_lapic *apic; /* kernel irqchip context */
	...
	struct kvm_mmu mmu;
	...
	struct kvm_pio_request pio;
	void *pio_data;
	...
	/* emulate context */
	struct x86_emulate_ctxt emulate_ctxt;
	...
	int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
	...
};
As you can see in the preceding code, kvm_vcpu
has an associated kvm_run
structure, which (along with pio_data
) is used for communication between the QEMU userspace and the KVM kernel module, as mentioned earlier. For example, in the context of a VMEXIT, to satisfy the emulation of virtual hardware access, KVM has to return to the QEMU userspace process; KVM stores the information in the kvm_run
structure for QEMU to fetch:
/usr/include/linux/kvm.h:
/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
struct kvm_run {
	/* in */
	__u8 request_interrupt_window;
	__u8 padding1[7];
	/* out */
	__u32 exit_reason;
	__u8 ready_for_interrupt_injection;
	__u8 if_flag;
	__u8 padding2[2];
	...
	union {
		/* KVM_EXIT_UNKNOWN */
		struct {
			__u64 hardware_exit_reason;
		} hw;
		/* KVM_EXIT_FAIL_ENTRY */
		struct {
			__u64 hardware_entry_failure_reason;
		} fail_entry;
		/* KVM_EXIT_EXCEPTION */
		struct {
			__u32 exception;
			__u32 error_code;
		} ex;
		/* KVM_EXIT_IO */
		struct {
#define KVM_EXIT_IO_IN  0
#define KVM_EXIT_IO_OUT 1
			__u8 direction;
			__u8 size; /* bytes */
			__u16 port;
			__u32 count;
			__u64 data_offset; /* relative to kvm_run start */
		} io;
		...
	};
};
The kvm_run
struct is an important data structure; as you can see in the preceding code, the union
contains many exit reasons, such as KVM_EXIT_FAIL_ENTRY
, KVM_EXIT_IO
, and so on.
When we discussed hardware virtualization extensions, we touched upon VMCS and VMCB. These are important data structures for hardware-accelerated virtualization, and they help especially in VMEXIT scenarios. Not every operation can be allowed for guests; at the same time, it is also difficult if the hypervisor does everything on behalf of the guest. Virtual machine control structures such as the VMCS or VMCB control this behavior: some operations are allowed for guests, such as changing some bits in shadowed control registers, while others are not. This provides fine-grained control over what guests can and cannot do. The VMCS control structures also provide control over interrupt delivery and exceptions. Previously, we said that the exit reason for a VMEXIT is recorded inside the VMCS; it also contains some data about the exit. For example, if a write access to a control register caused the exit, information about the source and destination registers is recorded there.
Let us see some of the important data structures before we dive into the vCPU execution flow.
The Intel-specific implementation is in vmx.c
and the AMD-specific implementation is in svm.c
, depending on the hardware we have. As you will see, kvm_vcpu
is part of vcpu_vmx
. The kvm_vcpu
structure is divided into a common part and an architecture-specific part: the common part contains data shared by all supported architectures, while the architecture-specific part (for x86, this includes the guest's saved general-purpose registers) contains data specific to a particular architecture. As discussed earlier, the kvm_vcpu
structure's kvm_run
and pio_data
are shared with userspace.
The vcpu_vmx
and vcpu_svm
structures (mentioned next) each embed a kvm_vcpu
structure, which consists of a common part and an x86-architecture-specific part (struct kvm_vcpu_arch
); in addition, they point to the vmcs
and vmcb
structures, respectively.
The vcpu_vmx
or vcpu_svm
structures are allocated by the following code path:
kvm_arch_vcpu_create()
	-> kvm_x86_ops->vcpu_create
		-> vcpu_create()	[.vcpu_create = svm_create_vcpu,
					 .vcpu_create = vmx_create_vcpu]
Please note that the VMCS or VMCB stores guest configuration specifics, such as machine control bits and processor register settings. I would suggest you examine the structure definitions in the source. These data structures are also used by the hypervisor to define events to monitor while the guest is executing; such events can be intercepted. Note that these control structures reside in host memory. At the time of a VMEXIT, the guest state is saved in the VMCS. As mentioned earlier, the VMREAD instruction reads a specified field from the VMCS, and the VMWRITE instruction writes a specified field to it. Also note that there is one VMCS or VMCB per vCPU; the vCPU state is recorded in these control structures.