VM Architecture: Understanding Virtual Machine Design
Deep dives into virtual machine architecture: CPU virtualization, memory management, I/O systems, and isolation mechanisms.
Virtual machine architecture spans multiple layers, from CPU instructions through memory translation to I/O virtualization. Engineers who understand these layers can make better decisions about VM configuration, troubleshoot performance anomalies faster, and design more appropriate infrastructure for demanding workloads.
CPU Virtualization
Modern CPUs expose hardware extensions for virtualization: Intel VT-x and AMD-V. These extensions add new processor operating modes, which Intel calls VMX root (hypervisor) and VMX non-root (guest), that allow guest code to execute directly on hardware rather than being interpreted or binary-translated. AMD-V provides equivalent host and guest modes.
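On Linux, the presence of these extensions can be checked from userspace. A minimal sketch, assuming the standard `/proc/cpuinfo` feature-flag format (`vmx` for Intel VT-x, `svm` for AMD-V):

```python
# Detect hardware virtualization support by scanning /proc/cpuinfo
# for the vmx (Intel VT-x) or svm (AMD-V) CPU feature flags.
def virtualization_support(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                if "vmx" in flags:
                    return "Intel VT-x"
                if "svm" in flags:
                    return "AMD-V"
    return None  # no hardware virtualization extensions visible

if __name__ == "__main__":
    print(virtualization_support() or "no hardware virtualization extensions")
```

Note that the flag can be present but disabled in firmware; hypervisors report that case separately (e.g., KVM refuses to load).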
Key mechanisms in CPU virtualization:
- VM Entries and VM Exits — Transitions between hypervisor and guest. VM exits occur when the guest executes a privileged instruction, performs I/O the hypervisor must emulate, or when an external interrupt arrives and control must return to the hypervisor.
- VMCS / VMCB — The Virtual Machine Control Structure (Intel) or Virtual Machine Control Block (AMD). A per-VM data structure holding guest/host state and VM execution control fields.
- APIC Virtualization — Hardware support for virtualizing interrupt delivery without hypervisor intervention on every interrupt.
- Nested Virtualization — The ability to run a hypervisor inside a VM (used in cloud environments, development, and testing).
Practical implication: Every VM exit has cost. Workloads with high system call frequency or heavy interrupt loads generate more exits and experience higher virtualization overhead.
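One way to see this in practice is to watch the host's aggregate exit counters. A sketch, assuming the legacy KVM debugfs layout where `/sys/kernel/debug/kvm/` exposes one file per counter (newer kernels moved to a binary stats API; reading debugfs requires root):

```python
# Sample aggregate KVM exit counters from debugfs and compute an
# exits-per-second rate. Counter file names (exits, io_exits, ...)
# follow the legacy KVM debugfs convention; availability varies by
# kernel version.
import os
import time

COUNTER_NAMES = ("exits", "io_exits", "mmio_exits", "irq_exits", "halt_exits")

def read_kvm_counters(base="/sys/kernel/debug/kvm"):
    counters = {}
    for name in COUNTER_NAMES:
        path = os.path.join(base, name)
        if os.path.exists(path):
            with open(path) as f:
                counters[name] = int(f.read().strip())
    return counters

def exit_rate(interval=1.0, base="/sys/kernel/debug/kvm"):
    """Delta of each counter over a sampling interval, per second."""
    before = read_kvm_counters(base)
    time.sleep(interval)
    after = read_kvm_counters(base)
    return {k: (after[k] - before[k]) / interval for k in before}
```

A syscall- or interrupt-heavy guest will show a visibly higher `exits` rate than a compute-bound one on the same host.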
Memory Virtualization
Memory virtualization requires translating three distinct address spaces:
- Guest Virtual Address (GVA) — As seen by processes inside the guest.
- Guest Physical Address (GPA) — As seen by the guest kernel.
- Host Physical Address (HPA) — Actual hardware memory.
The hypervisor bridges GPA to HPA through either:
- Extended Page Tables (EPT) — Intel hardware mechanism. The CPU walks both guest page tables and EPT tables, resolving addresses without hypervisor intervention.
- Nested Page Tables (NPT) — AMD equivalent.
Without hardware-assisted translation, hypervisors used Shadow Page Tables — maintained entirely in software, with severe performance costs and complex implementation.
Memory management techniques:
| Technique | Mechanism | Trade-off |
|---|---|---|
| Memory Ballooning | Cooperative driver reclaims guest memory | Slow, requires cooperation |
| KSM / Page Sharing | Deduplicate identical pages | CPU cost to scan; risk of side channels |
| Transparent Huge Pages | Use 2MB/1GB pages | TLB efficiency; fragmentation trade-off |
| Memory Reservation | Guarantee physical memory to a VM | No overcommit possible |
| Memory Overcommit | Promise VMs more memory in total than physically exists | Risk of balloon/swap storms |
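The effect of KSM page sharing is observable on a Linux host through the counters KSM exports under `/sys/kernel/mm/ksm/`. A minimal sketch, assuming 4 KiB base pages and the counter semantics from the kernel's KSM documentation (`pages_sharing` counts duplicate references folded into shared pages, i.e., pages actually saved):

```python
# Estimate memory saved by KSM page deduplication from its sysfs
# counters. Path and counter names follow the mainline Linux layout;
# requires KSM to be enabled (run=1) to show non-zero values.
import os

PAGE_SIZE = 4096  # assumption: 4 KiB base pages

def ksm_savings(base="/sys/kernel/mm/ksm"):
    def read(name):
        with open(os.path.join(base, name)) as f:
            return int(f.read().strip())

    shared = read("pages_shared")    # unique deduplicated pages kept
    sharing = read("pages_sharing")  # extra references folded into them
    return {
        "pages_shared": shared,
        "pages_sharing": sharing,
        "bytes_saved": sharing * PAGE_SIZE,
    }
```

A high `pages_sharing`-to-`pages_shared` ratio (many identical guests) is where KSM pays for its scan CPU cost.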
NUMA topology is critical: when a VM’s memory is allocated from a NUMA node different from its vCPUs, remote memory accesses add 40–100ns of latency per operation, which compounds significantly in memory-intensive workloads.
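The kernel exposes its NUMA cost model directly, which is useful when deciding how to place a VM. A sketch reading the distance matrix from sysfs, assuming the standard `/sys/devices/system/node/nodeN/distance` layout (10 means local; larger values mean proportionally slower remote access):

```python
# Read the kernel's NUMA distance matrix from sysfs. Row N lists the
# relative access cost from node N to every node; the diagonal is 10
# (local) by convention.
import glob
import os

def numa_distances(base="/sys/devices/system/node"):
    matrix = {}
    for node_dir in sorted(glob.glob(os.path.join(base, "node[0-9]*"))):
        node = int(os.path.basename(node_dir)[len("node"):])
        with open(os.path.join(node_dir, "distance")) as f:
            matrix[node] = [int(x) for x in f.read().split()]
    return matrix
```

A VM whose vCPUs and memory land on nodes with distance 21 instead of 10 is, by the kernel's own model, paying roughly double the relative access cost on every remote reference.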
I/O Virtualization
I/O is the largest source of overhead in most VM environments. Three approaches exist:
1. Full Emulation: The hypervisor emulates an entire device (e.g., an IDE disk or RTL8139 NIC) entirely in software. Every I/O request traps to the hypervisor, is processed, and returns. Simple but slow, typically 2–5x overhead versus native.
2. Paravirtualized I/O (VIRTIO): The guest driver knows it is running in a VM and uses a split-driver model:
- Guest-side frontend driver (virtio-blk, virtio-net, etc.) puts descriptors into a shared ring buffer.
- Host-side backend driver picks them up and services them.
VIRTIO dramatically reduces trap frequency by batching requests in the shared ring and exiting only for notifications. Modern VIRTIO-based networking can reach line rate on 10GbE, and virtio-blk/virtio-scsi narrow the latency gap to local storage considerably, though passthrough remains faster for latency-critical NVMe workloads.
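The split-driver idea can be illustrated with a toy ring. This is a deliberately simplified sketch: real virtqueues use separate available/used rings, index wrapping in shared memory, and doorbell/interrupt notifications, none of which are modeled here:

```python
# Toy model of the virtio split-driver pattern: a guest-side frontend
# posts request descriptors into a fixed-size shared ring, and a
# host-side backend drains and completes them in FIFO order.
class DescriptorRing:
    def __init__(self, size=256):
        self.size = size
        self.slots = [None] * size
        self.avail_idx = 0  # advanced by the frontend (guest)
        self.used_idx = 0   # advanced by the backend (host)

    def frontend_post(self, request):
        if self.avail_idx - self.used_idx >= self.size:
            raise BufferError("ring full; frontend must wait for completions")
        self.slots[self.avail_idx % self.size] = request
        self.avail_idx += 1  # in real virtio: one shared-memory index write

    def backend_poll(self):
        completed = []
        while self.used_idx < self.avail_idx:
            completed.append(self.slots[self.used_idx % self.size])
            self.used_idx += 1
        return completed

ring = DescriptorRing(size=4)
ring.frontend_post({"op": "read", "sector": 0})
ring.frontend_post({"op": "write", "sector": 8})
print(ring.backend_poll())  # both requests drained in order
```

The performance win comes from the indices living in shared memory: the frontend can post many descriptors with plain stores and notify the backend once, instead of trapping per request as full emulation does.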
3. Direct Device Assignment (Passthrough): Using the IOMMU (Intel VT-d / AMD-Vi), the hypervisor assigns a physical device directly to a single VM. The guest has exclusive, direct hardware access with no mediation overhead.
- PCIe passthrough: Full device to one VM (e.g., GPU, NVMe, NIC).
- SR-IOV: A single physical device presents multiple Virtual Functions, each assignable to a separate VM.
SR-IOV is preferred for network and storage at scale: it lets many VMs share one physical device with no per-I/O hypervisor involvement on the data path.
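On Linux, VFs are typically created through the device's sysfs attributes. A sketch, assuming the standard `sriov_numvfs`/`sriov_totalvfs` interface; the interface name `eth0` is an example, and the write requires root plus an SR-IOV capable device and driver:

```python
# Enable N virtual functions on an SR-IOV capable NIC by writing to
# its sysfs sriov_numvfs attribute, after checking the device limit
# advertised in sriov_totalvfs.
import os

def enable_vfs(ifname, num_vfs, sys_root="/sys/class/net"):
    dev_dir = os.path.join(sys_root, ifname, "device")
    with open(os.path.join(dev_dir, "sriov_totalvfs")) as f:
        total = int(f.read().strip())
    if num_vfs > total:
        raise ValueError(f"device supports at most {total} VFs")
    with open(os.path.join(dev_dir, "sriov_numvfs"), "w") as f:
        f.write(str(num_vfs))
```

Each VF then appears as its own PCI function, which the hypervisor can hand to a VM via the same passthrough path as a full device.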
vCPU Scheduling
vCPU scheduling is fundamentally different from OS process scheduling:
- A vCPU is a thread from the hypervisor’s perspective.
- Each physical CPU (pCPU) can run one vCPU at a time.
- Overcommit ratios (e.g., 4:1 vCPU:pCPU) mean some vCPUs must wait.
Common scheduling concerns:
- vCPU co-scheduling: For SMP VMs, all vCPUs ideally make progress together. VMware ESXi implements relaxed co-scheduling for this; KVM relies on Linux's CFS, which schedules each vCPU thread independently, so scheduling skew between sibling vCPUs shows up as stall time inside the guest.
- CPU affinity: Pinning vCPUs to specific physical cores eliminates scheduler noise — useful for latency-sensitive workloads.
- Overcommit overhead: Beyond ~4:1 overcommit, steal time rises sharply and guest OS timing drifts.
- NUMA awareness: vCPUs should be scheduled on cores in the same NUMA node as the VM’s memory.
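Overcommit pressure is visible from inside the guest as steal time. A sketch computing the steal percentage from the aggregate `cpu` line of `/proc/stat`, whose fields are, in order: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice (in USER_HZ ticks):

```python
# Compute the guest's cumulative CPU steal percentage from the first
# ("cpu") line of /proc/stat. Steal is field 8: time this vCPU was
# runnable but the hypervisor ran something else on the pCPU.
def steal_percent(stat_path="/proc/stat"):
    with open(stat_path) as f:
        fields = f.readline().split()
    ticks = [int(x) for x in fields[1:]]
    total = sum(ticks)
    steal = ticks[7] if len(ticks) > 7 else 0
    return 100.0 * steal / total if total else 0.0
```

For trending, sample this twice and diff the counters; a sustained steal percentage in the double digits is a strong signal that the host is overcommitted.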
Isolation Mechanisms
Isolation is why VMs are trusted in multi-tenant environments:
- Memory isolation: EPT/NPT mappings are controlled by the hypervisor. A guest cannot access GPA ranges outside its own mapping.
- I/O isolation: The IOMMU enforces that direct-assigned devices can only access the memory regions their VM owns — preventing DMA-based attacks.
- CPU isolation: VMs execute in VMX non-root mode. Attempting Ring 0 operations triggers a VM exit to the hypervisor.
- Network isolation: Hypervisor-managed virtual switches enforce VLAN tagging and access control lists between VMs.
Confidential computing (Intel TDX, AMD SEV-SNP) extends this model by encrypting VM memory with keys the hypervisor cannot access, protecting VM data even from an administrator or compromised hypervisor.
Reading Path
- How Virtual Machines Work — End to End
- KVM vs VMware: Architecture and Performance
- VM Performance Tuning: CPU, Memory, Storage
- VM Use Cases — how architecture translates to deployment patterns