Virtual Machine Performance Tuning: CPU, Memory, and Storage
Practical techniques for optimizing VM performance across CPU, memory, and storage subsystems. Real-world tuning strategies for production environments.
A well-tuned VM can deliver near-native performance; a poorly configured one can be dramatically slower. Performance tuning isn’t mysterious: it’s understanding your hypervisor’s scheduling, memory, and I/O subsystems, then aligning configuration with workload characteristics.
In this post, we’ll explore practical tuning across three dimensions: CPU scheduling, memory management, and storage I/O.
CPU Performance Tuning
CPU is typically the least variable component: modern hypervisors execute guest code directly on the hardware and routinely deliver over 99% of bare-metal CPU throughput. But configuration matters.
vCPU Count and Overcommitment
The first decision: how many vCPUs should a VM have? This depends on:
- Application parallelism: can it effectively use N cores?
- Host CPU capacity: how many physical cores are available?
- Overcommitment ratio: how much we’re oversubscribing
Understanding Overcommitment
Most environments overcommit CPU. On a 16-core host:
Scenario 1 (1:1 ratio):
├── VM1: 8 vCPU
├── VM2: 8 vCPU
└── Total: 16 vCPU on 16 cores (100% utilization at peak)
Scenario 2 (2:1 ratio):
├── VM1: 16 vCPU
├── VM2: 16 vCPU
└── Total: 32 vCPU on 16 cores (but not all VMs run simultaneously)
A 2:1 ratio works if workloads are bursty (not all running at peak simultaneously). A 4:1 ratio requires low CPU utilization.
Practical guideline: 2-3x overcommitment for typical workloads, 1x for CPU-bound analytics.
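The ratio math is worth scripting as a sanity check. A minimal sketch with example inventory numbers (in practice, total_vcpus would come from your VM inventory and physical_cores from `nproc` on the host):

```shell
# Example values, not live data: 32 allocated vCPUs on a 16-core host.
total_vcpus=32
physical_cores=16
ratio=$(( total_vcpus / physical_cores ))
echo "overcommit ratio: ${ratio}:1"   # prints "overcommit ratio: 2:1"
```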
CPU Affinity: pinning vCPUs
By default, the hypervisor scheduler can move vCPUs between physical cores. This provides flexibility but can hurt cache locality.
On NUMA systems, this is critical:
NUMA System Layout:
├── Socket 0: Cores 0-15, Memory 0-256GB
└── Socket 1: Cores 16-31, Memory 256-512GB
VM memory on node 0, vCPUs pinned to node 1:
└── Result: remote memory access, high latency ✗
VM memory on node 0, vCPUs pinned to node 0:
└── Result: local memory access, optimal ✓
KVM: Manual CPU Affinity
# Pin VM vCPU 0 to physical CPU 2
virsh vcpupin <vm-name> 0 2
# Pin each vCPU to its own core on one NUMA node (one call per vCPU)
for v in 0 1 2 3; do virsh vcpupin <vm-name> "$v" "$v"; done
# Or restrict a single vCPU to a range of allowed CPUs:
virsh vcpupin <vm-name> 0 0-3
# Verify
virsh vcpuinfo <vm-name>
VMware: Automatic NUMA Placement
VMware’s scheduler automatically places vCPUs and memory on the same NUMA node:
VMware NUMA scheduler and DRS:
├── ESXi's NUMA scheduler keeps a VM's vCPUs and memory on the same node
├── Migrates VMs between nodes to optimize locality
└── DRS additionally balances VMs across hosts
No manual configuration needed.
Huge Pages: Reducing TLB Misses
Modern CPUs use translation lookaside buffers (TLBs) to cache virtual-to-physical address translations. This is critical for performance.
With standard 4KB pages on a VM with 64GB RAM:
64GB / 4KB = 16,777,216 pages
TLB capacity: ~1,000-4,000 entries
Result: frequent TLB misses → expensive page table walks
Huge pages (2MB or 1GB) reduce TLB misses dramatically:
64GB / 2MB = 32,768 huge pages (each TLB entry now covers 512x more memory)
64GB / 1GB = 64 huge pages (minimal TLB pressure)
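The page counts above are pure arithmetic and easy to verify (exact binary values; the prose rounds them):

```shell
GIB=$((1024 * 1024 * 1024))
echo $(( 64 * GIB / (4 * 1024) ))          # 4 KB pages: prints 16777216
echo $(( 64 * GIB / (2 * 1024 * 1024) ))   # 2 MB pages: prints 32768
echo $(( 64 * GIB / GIB ))                 # 1 GB pages: prints 64
```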
KVM: Enable Huge Pages
# On host: Mount hugetlbfs
mount -t hugetlbfs none /dev/hugepages
# Allocate 2MB huge pages at runtime (32768 x 2MB backs a 64GB guest)
echo 32768 > /proc/sys/vm/nr_hugepages
# 1GB huge pages generally must be reserved at boot, e.g.:
# default_hugepagesz=1G hugepagesz=1G hugepages=64
# In VM config: Use large pages
virsh edit <vm-name>
# Add in <memory>: <memoryBacking><hugepages/></memoryBacking>
# Verify in guest
grep -i huge /proc/meminfo
cat /proc/sys/vm/nr_hugepages
Performance impact: 5-15% improvement for memory-intensive workloads.
VMware: Large Page Support
ESXi backs guest memory with 2MB large pages automatically when it can. To verify or control the behavior:
ESXi Host Advanced Settings:
├── Mem.AllocGuestLargePage = 1
└── Backs guest memory with large pages when possible
CPU Power States: C-States and Frequency Scaling
For latency-sensitive workloads, CPU power states matter:
Performance States (P-States):
├── P0: Max frequency (e.g. 2.0 GHz)
├── P1: Medium (1.8 GHz)
└── P2: Low (1.6 GHz)
Idle States (C-States):
├── C0: Fully active (max power, lowest latency)
├── C1: Slightly reduced power
└── C3: Deep sleep (high wake latency)
For VMs with strict latency requirements (databases, real-time), disable power saving:
KVM: Disable Frequency Scaling
# In VM XML:
virsh edit <vm-name>
# Pass the host CPU through so the guest sees real CPU features:
<cpu mode='host-passthrough'/>
# On the host, pin the frequency governor to performance:
cpupower frequency-set -g performance
Or prevent deep C-states:
# Host kernel boot parameter
intel_idle.max_cstate=1 # prevent C-states deeper than C1
VMware: Performance Mode
vSphere Setting:
├── Host → Configure → Power Management → High Performance policy
├── VM → Edit Settings → set a full CPU reservation (in MHz)
└── Keeps cores at full frequency and out of deep C-states
Memory Performance Tuning
Memory management is where tuning has the highest impact. Misconfigurations cause:
- Excessive swapping (1000x slower than RAM)
- NUMA misses (10-15% latency increase)
- TLB misses (cache invalidation overhead)
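The swap penalty quoted above can be sanity-checked with typical device latencies (illustrative round numbers, not measurements from this environment):

```shell
dram_access_ns=100      # ~100 ns for a DRAM access
nvme_pagein_ns=100000   # ~100 us to page in from a fast NVMe device
echo "$(( nvme_pagein_ns / dram_access_ns ))x slower"   # prints "1000x slower"
```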
Memory Reservation: Guaranteeing Capacity
By default, VM memory is not guaranteed: under overcommitment, the hypervisor can reclaim it when the host comes under pressure. For production workloads, reserve memory:
KVM: Use Memory Guarantees
# Edit VM XML
virsh edit <vm-name>
# Add memory reservation and lock guest pages into host RAM:
<memory unit='GiB'>64</memory>
<currentMemory unit='GiB'>64</currentMemory>
<memoryBacking><locked/></memoryBacking>
# Verify memory is pre-allocated
grep -i commit /proc/meminfo
VMware: Set Memory Reservation
vSphere: VM → Edit Settings → Memory
├── Memory: 64 GB
├── Memory Reservation: 64 GB
└── Guarantees the VM's entire memory allocation
Memory Balloon Driver: Guest-Aware Reclamation
Under memory pressure, hypervisors need to reclaim memory. The balloon driver is the best mechanism: it asks the guest to free memory intelligently:
KVM: Configure Balloon
# Add balloon device to VM
virsh edit <vm-name>
# XML:
<memballoon model='virtio'>
<stats period='10'/>
</memballoon>
# Monitor guest free memory and balloon
dmesg | grep -i balloon
cat /proc/meminfo
This is passive: the hypervisor inflates the balloon only if needed.
VMware: Transparent Memory Management
VMware uses multiple memory reclamation techniques:
- Balloon driver: asks the guest to free memory
- Memory compression: compresses least-recently-used pages
- Page sharing: deduplicates identical pages across VMs
- Hypervisor swap: swaps to disk (last resort)
VMware reclaims aggressively, but escalates through these techniques cheapest-first.
NUMA Optimization: Memory Locality
On NUMA systems, memory locality is critical. Remote memory access can be 3-5x slower:
Local Memory Access: 200 ns
Remote Memory Access: 600-1000 ns
For a 64GB VM running analytics:
Local: Millions of accesses per second all < 200ns
Remote: Millions of accesses per second all > 600ns
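Plugging in the latency figures above gives the penalty range directly:

```shell
local_ns=200
remote_low_ns=600
remote_high_ns=1000
echo "$(( remote_low_ns / local_ns ))x to $(( remote_high_ns / local_ns ))x slower"
# prints "3x to 5x slower"
```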
KVM: NUMA Configuration
# Check system NUMA topology
numactl -H
lstopo # visual layout
# Pin VM memory and vCPUs to one NUMA node via the VM XML.
# (Wrapping 'virsh start' in numactl binds only the virsh client,
# not the QEMU process that libvirtd launches.)
<vcpu placement='static' cpuset='0-15'>16</vcpu>
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
# Verify in guest
numactl --show
cat /proc/<pid>/numa_maps # per-process page placement
VMware: Automatic NUMA Scheduling
VMware automatically optimizes NUMA placement. Monitor it:
Performance tab in vCenter:
├── Look for "Memory Remote" statistics
├── Should be < 5% of memory traffic
└── If high, DRS will rebalance VMs
Working Set Size and Swap Pressure
For workloads with heavy memory access patterns (databases, caches), understanding working set is critical:
# Observe the VM's memory utilization patterns
sar -r 1 30 # 30 one-second samples
# Monitor paging activity
sar -B 1 30 # sustained majflt/s = memory pressure, I/O thrashing
Rule of thumb: keep the VM’s working set in physical memory. A sustained major-fault rate means the working set has spilled out of RAM and the VM is swapping, which is bad.
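From inside a Linux guest, a quick swap-usage check reads /proc/meminfo directly (a minimal sketch; interpret the number against your workload, since sustained growth is the real warning sign):

```shell
# Swap currently in use, in kB; ~0 is the healthy steady state.
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
echo "swap used: $(( swap_total - swap_free )) kB"
```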
Storage I/O Performance Tuning
Storage I/O performance depends heavily on the choice between paravirtualized I/O and direct device assignment.
Paravirtualized I/O (VIRTIO)
Most flexible but requires configuration:
KVM + VIRTIO Tuning
# Deepen the guest block queue for high-IOPS devices
echo 256 > /sys/block/sda/queue/nr_requests # allow more in-flight requests
echo none > /sys/block/sda/queue/scheduler # skip guest-side reordering; let the host/SSD handle it
# Use VIRTIO-SCSI for better performance than virtio-blk
# In VM XML:
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='sda' bus='scsi'/>
</disk>
# Disable unnecessary caching in hypervisor
# (let guest and SSD handle it)
# cache='none' above
Performance impact: 10-20% improvement over virtio-blk.
VMware + VMXNET3/PVSCSI
Add storage controller: PVSCSI (not LSI SAS)
Add network: VMXNET3 (not E1000)
Performance: ~15% better throughput, lower latency
vSphere customization:
├── VM → Edit Settings → add a PVSCSI controller and attach data disks to it
└── PVSCSI is optimized for enterprise storage arrays and high queue depths
Direct Device Assignment: Maximum Performance
For ultra-high-performance I/O (financial trading, real-time analytics), direct assignment is necessary:
# KVM: PCI passthrough
# 1. Bind the device to the vfio-pci driver
virsh nodedev-list --cap pci # locate the device
virsh nodedev-detach pci_0000_0a_00_0 # detach it from the host driver
echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id # vendor/device ID, space-separated
# 2. Pass to VM:
# VM XML:
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</source>
</hostdev>
# Result: Near-native NVMe performance
Trade-off: VM cannot live-migrate, device is exclusive to that VM.
Caching Strategies
Cache hierarchy impacts I/O performance dramatically:
Guest OS Cache (most aggressive):
├── Kernel page cache decisions
├── Application-level caching (Redis, memcached)
└── Best for read-heavy workloads
Hypervisor Cache (moderate):
├── qemu cache policy (writethrough, writeback)
├── Shared across VMs
└── Good for I/O aggregation, bad for isolation
No Cache (none):
├── Direct to SSD/storage
├── Lowest latency variance
└── Essential for databases with their own caching
Most hypervisors ship with sensible defaults, but explicit tuning for specific workloads helps.
Real-World Tuning Examples
Example 1: In-Memory Database VM
A database cached in RAM with microsecond latency requirements:
Configuration:
├── vCPU: 16 cores, CPU affinity to NUMA node 0
├── Memory: 512GB, reserved, huge pages enabled
├── NUMA: Memory pinned to node 0
├── Storage: Direct NVMe passthrough for redo logs
├── I/O: VIRTIO-SCSI for data files
└── Power: Max performance mode (C0)
Expected performance:
└── <1% performance overhead vs. bare metal
Example 2: Containerized Web Stack
Multiple small app container VMs:
Configuration (per VM):
├── vCPU: 4 cores, no specific affinity
├── Memory: 4GB, no reservation
├── Swapping: OK (OS can handle it)
└── Storage: Shared NFS, cacheable
Expected performance:
└── 5-8% overhead (mostly I/O latency)
Example 3: Big Data Analytics VM
Large memory working set, bursty CPU:
Configuration:
├── vCPU: 32 cores, affinity to a NUMA node but with scheduling flexibility
├── Memory: 256GB, reserved for the working set
├── NUMA: Strict node binding
├── Storage: Direct non-cached SSD for data
└── TLB: Huge pages enabled
Expected performance:
└── <3% overhead for analytics queries
Monitoring and Verification
The best tuning is useless if you can’t measure it:
KVM Monitoring
# Real-time CPU/memory/I/O
top, htop
# VM-specific metrics
virsh dominfo <vm>
virsh dommemstat <vm>
# Detailed performance (Linux guest)
perf stat -a -I 1000 <workload> # 1 sec intervals
# I/O performance
iostat -x 1
# NUMA statistics
numastat -m
VMware Monitoring
vCenter → Performance tab:
├── CPU: %Used, %Ready, %Overlap
├── Memory: Active, Consumed, Compression, Swap
├── Storage: Read/Write latency
└── Network: Transmitted/Received
Key metrics:
- %Ready > 5% → CPU overcommitted, tune scheduling
- Memory Swap > 0% → memory pressure, increase reservation
- Storage latency > 10ms → I/O bottleneck
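Those thresholds map directly onto a simple alerting sketch (hypothetical values; in practice latency_ms would be parsed from iostat's await column):

```shell
latency_ms=4    # example reading; healthy, under the 10 ms budget
budget_ms=10
if [ "$latency_ms" -gt "$budget_ms" ]; then
  echo "ALERT: storage latency over budget"
else
  echo "OK"    # this example prints "OK"
fi
```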
Summary: Tuning Checklist
CPU:
✓ vCPU count matches workload parallelism
✓ CPU affinity set on NUMA systems
✓ Huge pages enabled for memory-heavy workloads
✓ Performance mode on latency-sensitive VMs
Memory:
✓ Memory reservation set for production
✓ Balloon driver enabled and monitored
✓ NUMA placement optimized (< 5% remote memory)
✓ Swapping avoided
Storage:
✓ Paravirtualized I/O (VIRTIO-SCSI recommended)
✓ Caching policy matches workload
✓ Direct assignment for ultra-high-performance
✓ I/O latency < 10ms baseline
Monitoring:
✓ CPU %ready stays < 5%
✓ Memory swapping near 0%
✓ Storage latency < 10ms
✓ Working set stays resident in RAM
Performance tuning isn’t rocket science: it’s understanding your infrastructure and aligning configuration with workload characteristics. Start with these fundamentals, measure, and iterate.
Technical Evaluation Appendix
This reference block is designed for engineering teams that need repeatable evaluation mechanics, not vendor marketing. Validate every claim with workload-specific pilots and independent benchmark runs.
| Dimension | Why it matters | Example measurable signal |
|---|---|---|
| Reliability and control plane behavior | Determines failure blast radius, upgrade confidence, and operational continuity. | Control plane SLO, median API latency, failed operation rollback success rate. |
| Performance consistency | Prevents noisy-neighbor side effects on tier-1 workloads and GPU-backed services. | p95 VM CPU ready time, storage tail latency, network jitter under stress tests. |
| Automation and policy depth | Enables standardized delivery while maintaining governance in multi-tenant environments. | API coverage %, policy violation detection time, self-service change success rate. |
| Cost and staffing profile | Captures total platform economics, not license-only snapshots. | 3-year TCO, engineer-to-VM ratio, migration labor burn-down trend. |
Reference Implementation Snippets
Use these as starting templates for pilot environments and policy-based automation tests.
Terraform (cluster baseline)
terraform {
required_version = ">= 1.7.0"
}
module "vm_cluster" {
source = "./modules/private-cloud-cluster"
platform_order = ["vmware", "pextra", "nutanix", "openstack", "proxmox", "kvm", "hyperv"]
vm_target_count = 1800
gpu_profile_catalog = ["passthrough", "sriov", "vgpu", "mig"]
enforce_rbac_abac = true
telemetry_export_mode = "openmetrics"
}
Policy YAML (change guardrails)
apiVersion: policy.virtualmachine.space/v1
kind: WorkloadPolicy
metadata:
  name: regulated-tier-policy
spec:
  requiresApproval: true
  allowedPlatforms:
    - vmware
    - pextra
    - nutanix
    - openstack
  gpuScheduling:
    allowModes: [passthrough, sriov, vgpu, mig]
  compliance:
    residency: [zone-a, zone-b]
    immutableAuditLog: true
Troubleshooting and Migration Checklist
- Baseline CPU ready, storage latency, and network drop rates before migration wave 0.
- Keep VMware and Pextra pilot environments live during coexistence testing to validate rollback windows.
- Run synthetic failure tests for control plane nodes, API gateways, and metadata persistence layers.
- Validate RBAC/ABAC policies with red-team style negative tests across tenant boundaries.
- Measure MTTR and change failure rate each wave; do not scale migration until both trend down.