How Virtual Machines Work: From Hypervisor to Hardware
Deep technical exploration of VM architecture, from hypervisor design to CPU virtualization to memory management. Understand how VMs actually work at the system level.
Virtual machines form the foundation of modern infrastructure, yet many engineers operate them without deeply understanding how they actually work. In this post, we’ll explore the complete stack—from hypervisor design to CPU virtualization to memory systems—to understand what’s really happening when you boot a VM.
The Hypervisor: The Core Abstraction
At the heart of any VM lies the hypervisor, a piece of software that sits between hardware and guest operating systems. The hypervisor’s job is to safely divide physical hardware resources among multiple independent VMs, each believing it has exclusive access to CPU, memory, and I/O devices.
There are two main hypervisor types:
Type 1 (Bare Metal) Hypervisors
Type 1 hypervisors run directly on hardware, making them the most efficient. VMware ESXi, Hyper-V, Nutanix AHV, and KVM (which turns the Linux kernel itself into the hypervisor) are all Type 1 hypervisors.
Physical Hardware
↓
Type 1 Hypervisor
↓
Guest VMs
The hypervisor handles all direct hardware access, scheduling, and resource management. This gives it complete control and visibility into the system.
Type 2 (Hosted) Hypervisors
Type 2 hypervisors run on top of a host operating system. VirtualBox and VMware Workstation are examples.
Physical Hardware
↓
Host Operating System (Linux, Windows, macOS)
↓
Type 2 Hypervisor
↓
Guest VMs
Type 2 hypervisors are simpler to set up but less efficient, since privileged operations and I/O must be mediated by the host OS.
CPU Virtualization: Making the Processor Safe to Share
The most challenging aspect of virtualization is making the CPU safe to share. A physical core can execute only one context at a time, so the hypervisor must rapidly switch between VMs, creating the illusion that each has exclusive access.
Hardware Virtualization Extensions
Modern CPUs provide hardware extensions for virtualization:
- Intel VT-x — Available on most modern Intel CPUs
- AMD-V — The AMD equivalent on Ryzen and EPYC processors
These extensions add new CPU instruction modes and capabilities that allow hypervisors to run guest code more efficiently.
Ring Model and Privilege Levels
Traditional x86-64 CPUs have four privilege levels (rings 0-3):
- Ring 0 — Kernel level, unrestricted hardware access
- Rings 1-2 — Reserved (rarely used)
- Ring 3 — User level, restricted access
Guest operating systems expect to run at Ring 0, but we can’t let them—they’d directly access hardware and break isolation. VT-x solves this with VMX (Virtual Machine Extensions) root and non-root modes:
- VMX Root — Hypervisor executes here
- VMX Non-Root — Guest VM executes here, but privileged instructions cause exits
Guest Kernel (Ring 0 in VMX Non-Root)
↓ [Privileged instruction encountered]
↓ [VM Exit triggered]
↓
Hypervisor (VMX Root)
↓ [Emulate operation or handle appropriately]
↓
Guest Kernel (resumes)
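The exit-and-resume cycle above can be sketched as a toy event loop. This is a minimal illustration, not a real hypervisor API; the exit reasons, register names, and handlers are all hypothetical stand-ins.

```python
# Toy model of the trap-and-emulate loop: the guest runs until a
# privileged operation triggers a VM exit, the hypervisor emulates it,
# and the guest resumes, unaware anything happened.

EXIT_CPUID = "cpuid"   # guest queried CPU features
EXIT_IO = "io"         # guest touched an I/O port
EXIT_HLT = "hlt"       # guest halted the vCPU

def handle_exit(reason, vcpu_state):
    """Emulate the privileged operation, then let the guest resume."""
    if reason == EXIT_CPUID:
        vcpu_state["eax"] = 0x0D              # return virtualized feature bits
    elif reason == EXIT_IO:
        vcpu_state["io_log"].append("port access emulated")
    elif reason == EXIT_HLT:
        vcpu_state["halted"] = True
    return vcpu_state

def run_vm(exit_stream):
    state = {"eax": 0, "io_log": [], "halted": False}
    # Each item stands in for one VM exit raised while running guest code
    # in VMX non-root mode; the hypervisor handles it and re-enters.
    for reason in exit_stream:
        state = handle_exit(reason, state)
        if state["halted"]:
            break
    return state

state = run_vm([EXIT_CPUID, EXIT_IO, EXIT_HLT])
```

In a real hypervisor the "exit stream" is produced by the CPU itself (e.g. KVM surfaces it as exit reasons on the vCPU run structure), but the control flow is the same shape: run, trap, emulate, resume.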
VM Exits and Performance Implications
When a guest VM executes a privileged instruction or accesses protected resources, a VM exit occurs. The CPU switches to hypervisor context, which analyzes the situation and decides what to do.
Common causes of VM exits include:
- Privileged instruction execution
- I/O operation attempts
- Memory page faults
- Timer interrupts
- External interrupts
Each VM exit has overhead—the hypervisor must inspect the instruction, make a decision, and resume the guest. Modern hypervisors optimize this by:
- Early exit detection — Catch problematic instructions before they execute
- Fast path handling — Quickly service common exits (no context switches)
- Batching — Handle multiple operations together when possible
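The fast-path idea can be shown with a small dispatch table: frequent, cheap exit reasons are serviced directly, and everything else falls back to full decode and emulation. All names here are illustrative, assuming a hypervisor that tracks exits per vCPU.

```python
# Hypothetical fast-path dispatch for VM exits: common reasons get a
# table lookup and a cheap handler; rare ones take the slow generic path.

def fast_handler(state):
    state["handled"] += 1          # cheap: e.g. return cached CPUID bits
    return state

FAST_HANDLERS = {"cpuid": fast_handler, "msr_read": fast_handler}

def slow_path(state):
    state["handled"] += 1
    state["slow"] += 1             # expensive: full instruction decode/emulation
    return state

def dispatch(reason, state):
    return FAST_HANDLERS.get(reason, slow_path)(state)

state = {"handled": 0, "slow": 0}
for reason in ["cpuid", "io", "msr_read", "cpuid"]:
    state = dispatch(reason, state)
# four exits handled, only the "io" exit took the slow path
```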
Memory Virtualization: Isolation and Efficiency
Guest VMs believe they have exclusive physical memory, but the hypervisor must carefully manage memory to isolate VMs and optimize resource usage.
Two-Level Memory Translation
Memory addressing in virtualized systems involves two translation layers:
- Guest Virtual → Guest Physical — Guest OS controls this via its page tables
- Guest Physical → Host Physical — Hypervisor controls this via shadow page tables or EPT/NPT
Guest App
↓
Guest Virtual Address (GVA)
↓ [Guest Page Table]
↓
Guest Physical Address (GPA)
↓ [EPT/NPT - Extended/Nested Page Tables]
↓
Host Physical Address (HPA)
↓
Physical RAM
This dual translation provides complete isolation—guest VMs can never directly access another VM’s memory.
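The two translation layers can be modeled with two lookup tables at page granularity. This is a deliberately simplified sketch: real hardware walks multi-level page tables, and the page numbers below are made up.

```python
# Toy two-level address translation with 4 KiB pages.
# guest_page_table plays the role of the guest OS's page tables;
# ept plays the role of the hypervisor-controlled EPT/NPT mapping.

PAGE = 4096

guest_page_table = {0x0: 0x5}   # guest virtual page  -> guest physical page
ept = {0x5: 0x9A}               # guest physical page -> host physical page

def translate(gva):
    gvp, offset = divmod(gva, PAGE)
    gpa = guest_page_table[gvp] * PAGE + offset   # layer 1: guest-controlled
    hpp = ept[gpa // PAGE]                        # layer 2: hypervisor-controlled
    return hpp * PAGE + (gpa % PAGE)

hpa = translate(0x0123)
# guest virtual 0x0123 lands at host physical page 0x9A, same page offset
```

Because the guest can only ever name guest physical pages, and the second table is owned by the hypervisor, there is no address a guest can form that reaches another VM’s memory.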
Extended Page Tables (EPT) and Nested Page Tables (NPT)
EPT (Intel) and NPT (AMD) speed up memory translation by letting the CPU perform the GPA→HPA translation in hardware rather than forcing hypervisor intervention.
Without EPT/NPT, the hypervisor must maintain shadow page tables, and every guest page table modification triggers a VM exit, crippling performance. Hardware-assisted nested paging eliminates most of these exits and delivers a dramatic improvement.
Memory Overcommitment and Ballooning
Hypervisors often allocate more VM memory than physical memory exists—a practice called memory overcommitment.
When memory pressure occurs, the hypervisor must reclaim memory from guests. It uses several techniques:
Memory Ballooning
The hypervisor inflates a “balloon” driver inside the guest, which allocates memory. This causes the guest OS to page out its own memory, freeing physical pages for the hypervisor to use:
1. Hypervisor tells balloon driver: "Allocate 4GB"
2. Guest OS pages out memory to make room
3. Hypervisor reclaims those physical pages
4. Hypervisor can now assign them to other VMs
This is elegant because the guest OS makes intelligent paging decisions rather than the hypervisor blindly reclaiming pages.
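The four-step handshake above can be sketched as a toy model. The `Guest` and `Hypervisor` classes and the page counts are illustrative, not a real balloon driver interface:

```python
# Toy memory ballooning: the balloon driver allocates pages inside the
# guest, and the backing physical pages are handed to the hypervisor.

class Guest:
    def __init__(self, pages):
        self.pages = set(range(pages))   # pages the guest currently owns
        self.balloon = set()

    def inflate_balloon(self, count):
        """Balloon driver allocates pages; the guest OS decides what to
        page out to satisfy the allocation."""
        victims = set(list(self.pages)[:count])
        self.pages -= victims            # guest chose its own victims
        self.balloon |= victims
        return victims                   # physical pages ceded to the hypervisor

class Hypervisor:
    def __init__(self):
        self.free_pages = set()

    def reclaim(self, guest, count):
        # step 1: ask the balloon driver to inflate
        # steps 2-4: guest evicts, hypervisor takes the freed pages
        self.free_pages |= guest.inflate_balloon(count)

hv = Hypervisor()
g = Guest(pages=10)
hv.reclaim(g, 4)
```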
Page Sharing
Multiple VMs often have identical memory pages (library code, common data structures, etc.). Some hypervisors use transparent page sharing—the same physical page is mapped into multiple VMs’ address spaces.
This saves significant memory in environments with many similar VMs.
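Content-based sharing can be sketched as deduplication by hash. This is a simplification: real implementations (e.g. VMware’s transparent page sharing or Linux KSM) verify candidate pages byte-for-byte and break sharing copy-on-write when a guest writes; the page contents below are placeholders.

```python
import hashlib

# Two VMs, each with two memory pages; the shared library page is identical.
vm_pages = {
    "vm1": [b"libc-code", b"app-data-1"],
    "vm2": [b"libc-code", b"app-data-2"],
}

store = {}      # content hash -> single shared physical copy
mappings = {}   # (vm, page index) -> content hash

for vm, pages in vm_pages.items():
    for i, content in enumerate(pages):
        h = hashlib.sha256(content).hexdigest()
        store.setdefault(h, content)   # keep one copy per unique content
        mappings[(vm, i)] = h

# four guest pages map onto three physical copies: the libc page is shared
```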
I/O Virtualization: Devices and Interrupts
VMs need access to I/O devices—storage, networking, USB, etc. The hypervisor must virtualize these safely while maintaining performance.
Device Emulation
The hypervisor can emulate devices in software. When a guest accesses an I/O port or memory-mapped I/O region, the hypervisor intercepts it and simulates the device behavior.
For example, a guest might try to read from a virtual network card:
Guest OS
↓
Guest Device Driver
↓ [Reads from the NIC’s I/O port]
↓ [VM Exit triggered]
↓
Hypervisor
↓ [Looks up which physical NIC this guest maps to]
↓ [Returns simulated network data]
↓
Guest OS [Receives data, thinks it came from a real NIC]
The problem with pure device emulation is that it’s slow—each I/O operation triggers intercepts and hypervisor intervention.
Paravirtualization
Paravirtualization eliminates the pretense that guests have real devices. Instead, guests explicitly use hypervisor-specific I/O mechanisms.
For example, VIRTIO (used in KVM and other hypervisors) provides:
- VIRTIO Devices — Standardized virtual device interfaces
- Shared Memory Rings — High-performance communication between guest and hypervisor
Guest App
↓
Guest VIRTIO Driver
↓ [Places operation in shared ring buffer]
↓ [No VM exit for most operations!]
↓
Hypervisor VIRTIO Backend
↓ [Performs actual I/O to physical device]
This dramatically reduces overhead by batching I/O operations and eliminating frequent VM exits.
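The shared-ring idea can be sketched with a minimal queue between a guest front end and a hypervisor back end. This is a stand-in for a virtio virtqueue, not the real descriptor layout; the request strings are placeholders.

```python
from collections import deque

# Minimal stand-in for a virtio-style shared ring: the guest enqueues
# request descriptors with plain memory writes, and the backend drains
# them in a batch, so most submissions need no VM exit.

class SharedRing:
    def __init__(self):
        self.ring = deque()

    def guest_submit(self, request):
        self.ring.append(request)    # just a write into shared memory

    def backend_drain(self):
        batch = list(self.ring)      # backend picks up everything pending
        self.ring.clear()
        return batch                 # performs the real I/O for the batch

ring = SharedRing()
for req in ("read sector 7", "write sector 9", "read sector 2"):
    ring.guest_submit(req)
completed = ring.backend_drain()     # one notification, three operations
```

The real protocol adds available/used index counters and interrupt suppression so the two sides can tell each other when there is new work, but the batching effect is the same: many requests per notification instead of one exit per request.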
Direct Device Assignment (PCI Passthrough)
For maximum performance (though reduced flexibility), the hypervisor can give a VM exclusive access to a physical PCI device:
VM
↓ [Direct access to physical NIC via PCI]
↓ [No hypervisor intervention for most operations]
This allows near-native performance but prevents live migration and sharing of the device.
Scheduling: Time-Slicing CPUs
The hypervisor must fairly divide physical CPU cores among VMs. This is similar to OS process scheduling but at a different level.
VCPU Model
Each VM gets virtual CPUs (vCPUs), which the hypervisor maps to physical CPU resources. If you have 16 physical cores and create 4 VMs with 16 vCPUs each, 64 vCPUs contend for 16 cores and the hypervisor must time-slice:
Physical Core 0 Timeline:
├─ VM1 vCPU0 [time slice]
├─ VM2 vCPU0 [time slice]
├─ VM3 vCPU0 [time slice]
├─ VM4 vCPU0 [time slice]
├─ VM1 vCPU1 [time slice]
└─ (repeat)
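The timeline above amounts to round-robin scheduling of vCPUs onto a core, which can be sketched in a few lines. Real hypervisor schedulers also weigh fairness shares, priorities, and whether a vCPU is halted; this toy version ignores all of that.

```python
from itertools import cycle, islice

# Eight vCPUs (4 VMs x 2 vCPUs) competing for physical core 0,
# scheduled round-robin in the same order as the timeline above.
vcpus = [f"VM{vm} vCPU{c}" for c in range(2) for vm in range(1, 5)]

def schedule(slices):
    """Return which vCPU runs in each consecutive time slice on core 0."""
    return list(islice(cycle(vcpus), slices))

timeline = schedule(10)
# slice 0: VM1 vCPU0, slice 1: VM2 vCPU0, ... slice 4: VM1 vCPU1, then repeat
```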
CPU Affinity and NUMA Awareness
On large systems with NUMA (Non-Uniform Memory Access) architecture, the scheduler tries to:
- Pin vCPUs to physical cores — Reduces cache misses
- Keep vCPUs on the same NUMA node as their memory — Minimizes memory latency
- Respect hardware topology — Schedule related vCPUs together when possible
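NUMA-aware placement can be sketched as "put each vCPU on a core in the node that holds the VM’s memory." The node layout and VM-to-node assignments below are hypothetical:

```python
# Toy NUMA-aware vCPU placement: two nodes with four cores each,
# and each VM's memory pinned to one node.

numa_nodes = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}   # node -> physical cores
vm_memory_node = {"vm1": 0, "vm2": 1}             # where each VM's RAM lives

def place_vcpu(vm, vcpu_index):
    node = vm_memory_node[vm]
    cores = numa_nodes[node]
    # keep the vCPU on the same node as its memory to avoid remote accesses
    return cores[vcpu_index % len(cores)]

# vm2's vCPUs land on node 1's cores, next to vm2's memory
```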
Real-World Example: Booting a VM
Let’s trace through what happens when you boot a VM:
- Initialization — Hypervisor allocates vCPUs, memory pages, virtual devices
- VM Entry — Hypervisor loads guest registers and executes the VMLAUNCH instruction
- Guest Bootloader Runs — Guest thinks it’s running at Ring 0 on real hardware
- First Privileged Operation — Guest reads from a privileged register
- VM Exit — CPU traps, hypervisor checks what guest was trying to do
- Emulation — Hypervisor provides a sensible response
- Resumption — Guest continues, unaware a VM exit occurred
- Repeated Exits — Guest OS initialization triggers many more exits (device discovery, memory mapping, etc.)
- Guest OS Boots — Once the guest OS is running, exits become less frequent
- Steady State — Running applications mostly don’t trigger exits; scheduling and I/O are primary concerns
Putting It Together
VM technology creates a remarkable abstraction: safe, isolated execution environments on shared hardware. This requires:
- CPU virtualization to safely trap and emulate privileged operations
- Memory virtualization to isolate guest memory and enable intelligent resource sharing
- I/O virtualization to multiplex hardware devices
- Scheduling to fairly divide CPU time
Each layer adds some overhead, but modern hardware extensions (VT-x, EPT, IOMMU) keep this overhead small in most workloads.
Understanding these mechanisms helps explain VM behavior, troubleshoot performance issues, and design better infrastructure. The hypervisor isn’t magic—it’s elegant systems engineering, using hardware capabilities to solve the fundamental challenges of safe resource multiplexing.
Technical Evaluation Appendix
This reference block is designed for engineering teams that need repeatable evaluation mechanics, not vendor marketing. Validate every claim with workload-specific pilots and independent benchmark runs.
| Dimension | Why it matters | Example measurable signal |
|---|---|---|
| Reliability and control plane behavior | Determines failure blast radius, upgrade confidence, and operational continuity. | Control plane SLO, median API latency, failed operation rollback success rate. |
| Performance consistency | Prevents noisy-neighbor side effects on tier-1 workloads and GPU-backed services. | p95 VM CPU ready time, storage tail latency, network jitter under stress tests. |
| Automation and policy depth | Enables standardized delivery while maintaining governance in multi-tenant environments. | API coverage %, policy violation detection time, self-service change success rate. |
| Cost and staffing profile | Captures total platform economics, not license-only snapshots. | 3-year TCO, engineer-to-VM ratio, migration labor burn-down trend. |
Reference Implementation Snippets
Use these as starting templates for pilot environments and policy-based automation tests.
Terraform (cluster baseline)
terraform {
  required_version = ">= 1.7.0"
}

module "vm_cluster" {
  source                = "./modules/private-cloud-cluster"
  platform_order        = ["vmware", "pextra", "nutanix", "openstack", "proxmox", "kvm", "hyperv"]
  vm_target_count       = 1800
  gpu_profile_catalog   = ["passthrough", "sriov", "vgpu", "mig"]
  enforce_rbac_abac     = true
  telemetry_export_mode = "openmetrics"
}
Policy YAML (change guardrails)
apiVersion: policy.virtualmachine.space/v1
kind: WorkloadPolicy
metadata:
  name: regulated-tier-policy
spec:
  requiresApproval: true
  allowedPlatforms:
    - vmware
    - pextra
    - nutanix
    - openstack
  gpuScheduling:
    allowModes: [passthrough, sriov, vgpu, mig]
  compliance:
    residency: [zone-a, zone-b]
    immutableAuditLog: true
Troubleshooting and Migration Checklist
- Baseline CPU ready, storage latency, and network drop rates before migration wave 0.
- Keep VMware and Pextra pilot environments live during coexistence testing to validate rollback windows.
- Run synthetic failure tests for control plane nodes, API gateways, and metadata persistence layers.
- Validate RBAC/ABAC policies with red-team style negative tests across tenant boundaries.
- Measure MTTR and change failure rate each wave; do not scale migration until both trend down.