
Virtual Machine Performance Tuning: CPU, Memory, and Storage

Practical techniques for optimizing VM performance across CPU, memory, and storage subsystems. Real-world tuning strategies for production environments.

A well-tuned VM can deliver near-native performance; a poorly configured one can be dramatically slower. Performance tuning isn’t mysterious: it’s understanding your hypervisor’s scheduling, memory, and I/O subsystems, then aligning configuration with workload characteristics.

In this post, we’ll explore practical tuning across three dimensions: CPU scheduling, memory management, and storage I/O.

CPU Performance Tuning

CPU is typically the most predictable component: with hardware virtualization, modern hypervisors deliver near-native CPU throughput (often >99% of bare metal). But configuration still matters.

vCPU Count and Overcommitment

The first decision: how many vCPUs should a VM have? This depends on:

  1. Application parallelism - can it effectively use N cores?
  2. Host CPU capacity - how many physical cores are available?
  3. Overcommitment ratio - how far we’re oversubscribing CPU

Understanding Overcommitment

Most environments overcommit CPU. On a 16-core host:

Scenario 1 (1:1 ratio):
├─ VM1: 8 vCPU
├─ VM2: 8 vCPU
└─ Total: 16 vCPU on 16 cores (100% utilization at peak)

Scenario 2 (2:1 ratio):
├─ VM1: 16 vCPU
├─ VM2: 16 vCPU
└─ Total: 32 vCPU on 16 cores (but not all VMs run simultaneously)

A 2:1 ratio works if workloads are bursty (not all running at peak simultaneously). A 4:1 ratio requires low CPU utilization.

Practical guideline: 2-3x overcommitment for typical workloads, 1x for CPU-bound analytics.
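As a sanity check on the ratios above, here is a minimal shell sketch. The two-VM inventory is hypothetical; on a libvirt host you could derive the per-VM vCPU counts from `virsh dominfo` instead:

```shell
#!/bin/sh
# Hypothetical inventory: vCPUs allocated to each VM on one host
vm_vcpus="16 16"        # Scenario 2: two VMs with 16 vCPUs each
physical_cores=16

total_vcpus=0
for v in $vm_vcpus; do
  total_vcpus=$((total_vcpus + v))
done

# Integer overcommit ratio (vCPU : pCPU)
ratio=$((total_vcpus / physical_cores))
echo "${total_vcpus} vCPUs on ${physical_cores} cores -> ${ratio}:1"
```

Track this ratio per host as VMs are added; it drifts upward silently.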

CPU Affinity: Pinning vCPUs

By default, the hypervisor scheduler can move vCPUs between physical cores. This provides flexibility but can hurt cache locality.

On NUMA systems, this is critical:

NUMA System Layout:
├─ Socket 0: Cores 0-15, Memory 0-256GB
└─ Socket 1: Cores 16-31, Memory 256-512GB

VM Memory in Node 0, vCPU pinned to Node 1:
└─ Result: Remote memory access, high latency ❌

VM Memory in Node 0, vCPU pinned to Node 0:
└─ Result: Local memory access, optimal ✓

KVM: Manual CPU Affinity

# Pin VM vCPU 0 to physical CPU 2
virsh vcpupin <vm-name> 0 2

# Keep each vCPU within one NUMA node's cores
virsh vcpupin <vm-name> 0 0-3   # vCPU 0 -> any of CPUs 0-3
virsh vcpupin <vm-name> 1 0-3   # vCPU 1 -> any of CPUs 0-3

# Verify
virsh vcpuinfo <vm-name>

VMware: Automatic NUMA Placement

VMware’s scheduler automatically places vCPUs and memory on the same NUMA node:

VMware NUMA scheduler + DRS:
├─ The ESXi NUMA scheduler keeps a VM's vCPUs and memory on one node
├─ Migrates VMs between nodes when locality degrades
└─ DRS (Distributed Resource Scheduler) balances load across hosts

No manual configuration needed.

Huge Pages: Reducing TLB Misses

Modern CPUs use translation lookaside buffers (TLBs) to cache virtual→physical address translations. This is critical for performance.

With standard 4KB pages on a VM with 64GB RAM:

64GB / 4KB = 16,777,216 pages

TLB capacity: ~1,000-4,000 entries

Result: frequent TLB misses, each forcing an expensive page-table walk

Huge pages (2MB or 1GB) reduce TLB misses dramatically:

64GB / 2MB = 32,768 huge pages (512x fewer TLB entries!)
64GB / 1GB = 64 huge pages (minimal TLB pressure)
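The arithmetic can be verified with a short shell sketch for a 64 GiB guest:

```shell
#!/bin/sh
# Page counts for a 64 GiB guest at each page size
mem_gib=64

pages_4k=$(( mem_gib * 1024 * 1024 / 4 ))   # 4 KiB pages
pages_2m=$(( mem_gib * 1024 / 2 ))          # 2 MiB pages
pages_1g=$(( mem_gib ))                     # 1 GiB pages

echo "4KiB pages: ${pages_4k}"
echo "2MiB pages: ${pages_2m} ($((pages_4k / pages_2m))x fewer mappings)"
echo "1GiB pages: ${pages_1g}"
```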

KVM: Enable Huge Pages

# On host: Mount hugetlbfs
mount -t hugetlbfs none /dev/hugepages

# On host: Allocate 64 x 1GB huge pages (enough to back a 64GB guest)
echo 64 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# In VM config: back guest RAM with huge pages
virsh edit <vm-name>
# Add as a sibling of <memory>: <memoryBacking><hugepages/></memoryBacking>

# Verify on the host
grep -i huge /proc/meminfo
cat /proc/sys/vm/nr_hugepages

Performance impact: 5-15% improvement for memory-intensive workloads.

VMware: Large Page Support

ESXi handles this automatically:

vSphere behavior:
├─ ESXi backs guest memory with 2MB large pages by default
└─ Large pages may be broken up under memory overcommitment to enable page sharing

CPU Power States: C-States and Frequency Scaling

For latency-sensitive workloads, CPU power states matter:

Performance States (P-States):
├─ P0: Max frequency (2.0 GHz)
├─ P1: Medium (1.8 GHz)
└─ P2: Low (1.6 GHz)

Idle States (C-States):
├─ C0: Fully active (max power, lowest latency)
├─ C1: Slightly reduced power
└─ C3: Deep sleep (high wake latency)

For VMs with strict latency requirements (databases, real-time), disable power saving:

KVM: Force the Performance Governor

# On the host: pin CPU frequency at maximum
cpupower frequency-set -g performance

# Expose the host CPU model (and its invariant TSC) to the guest:
virsh edit <vm-name>

<cpu mode='host-passthrough'>
  <feature policy='require' name='invtsc'/>
</cpu>

Or prevent deep C-states:

# Host kernel parameter (Intel)
intel_idle.max_cstate=1  # Prevent C-states deeper than C1

VMware: High Performance Power Policy

vSphere Setting:
├─ Host → Configure → Hardware → Power Management
├─ Set the policy to "High Performance"
└─ Disables deep C-states and frequency scaling on the host

Memory Performance Tuning

Memory management is where tuning has the highest impact. Misconfiguration causes:

  • Excessive swapping (orders of magnitude slower than RAM)
  • Remote NUMA access (3-5x higher memory latency)
  • TLB pressure (expensive page-table walks)

Memory Reservation: Guaranteeing Capacity

By default, a VM’s memory is not guaranteed: under host memory pressure, the hypervisor can reclaim it through ballooning or swapping. For production workloads, reserve memory:

KVM: Use Memory Guarantees

# Edit VM XML
virsh edit <vm-name>

# Fix the allocation (no ballooning headroom) and lock it in host RAM:
<memory unit='GiB'>64</memory>
<currentMemory unit='GiB'>64</currentMemory>
<memoryBacking><locked/></memoryBacking>

# Verify host memory commitment
grep -i commit /proc/meminfo

VMware: Set Memory Reservation

vSphere: VM → Edit Settings → Memory
├─ Memory: 64 GB
├─ Memory Reservation: 64 GB
└─ Guarantees the VM's entire memory allocation

Memory Balloon Driver: Guest-Aware Reclamation

Under memory pressure, hypervisors need to reclaim memory. The balloon driver is the best mechanism: it asks the guest to free memory intelligently:

KVM: Configure Balloon

# Add balloon device to VM
virsh edit <vm-name>

# XML:
<memballoon model='virtio'>
  <stats period='10'/>
</memballoon>

# Monitor guest free memory and balloon
dmesg | grep -i balloon
cat /proc/meminfo

This is passive: the hypervisor inflates the balloon only if needed.

VMware: Transparent Memory Management

VMware uses multiple memory reclamation techniques:

  1. Balloon driver - asks the guest to free memory
  2. Memory compression - compresses LRU pages
  3. Page sharing - deduplicates identical pages
  4. Hypervisor swap - swaps to disk (last resort)

VMware applies these in order of increasing cost, falling back to hypervisor swap only as a last resort.

NUMA Optimization: Memory Locality

On NUMA systems, memory locality is critical. Remote memory access can be 3-5x slower:

Local memory access: ~200 ns
Remote memory access: ~600-1000 ns

For a 64GB analytics VM issuing millions of memory accesses per second,
sustained remote access multiplies effective memory latency by 3-5x.
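A quick sketch of the penalty implied by those latency figures:

```shell
#!/bin/sh
# Latencies from the layout above (nanoseconds)
local_ns=200
remote_lo_ns=600
remote_hi_ns=1000

lo=$((remote_lo_ns / local_ns))
hi=$((remote_hi_ns / local_ns))
extra_ns=$((remote_lo_ns - local_ns))

echo "remote access is ${lo}x-${hi}x local latency"
echo "each remote miss adds at least ${extra_ns} ns of stall"
```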

KVM: NUMA Configuration

# Check system NUMA topology
numactl -H
lstopo  # visual layout

# Pin VM CPUs and memory to a single NUMA node in the VM XML
# (wrapping virsh in numactl has no effect: libvirtd spawns QEMU itself)
<vcpu placement='static' cpuset='0-15'>16</vcpu>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>

# Verify placement of the QEMU process on the host
numastat -p <qemu-pid>
cat /proc/<qemu-pid>/numa_maps

VMware: Automatic NUMA Scheduling

VMware automatically optimizes NUMA placement. Monitor it:

Performance tab in vCenter:
├─ Look for "Memory Remote" statistics
├─ Should be < 5% of memory traffic
└─ If high, the NUMA scheduler (and DRS across hosts) will rebalance

Working Set Size and Swap Pressure

For workloads with heavy memory access patterns (databases, caches), understanding working set is critical:

# Identify VM's working set size
sar -r 1 30   # watch memory utilization over time

# Monitor paging inside the guest
sar -B 1 30   # sustained majflt/s means the working set exceeds RAM

Rule of thumb: keep the VM’s working set resident in physical memory. Sustained major page faults plus nonzero swap activity means you’re thrashing, and performance falls off a cliff.
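One way to script the swap check: diff the `pswpin`/`pswpout` counters in `/proc/vmstat` over an interval. The two samples below are canned (invented values) so the sketch is self-contained; on a live host you would capture them a few seconds apart:

```shell
#!/bin/sh
# Canned /proc/vmstat samples, taken a few seconds apart (values invented)
sample1="pswpin 1000
pswpout 5000"
sample2="pswpin 1000
pswpout 5800"

# Extract a counter by name from a sample
get() { printf '%s\n' "$1" | awk -v k="$2" '$1 == k { print $2 }'; }

swapin_delta=$(( $(get "$sample2" pswpin) - $(get "$sample1" pswpin) ))
swapout_delta=$(( $(get "$sample2" pswpout) - $(get "$sample1" pswpout) ))

if [ "$swapin_delta" -gt 0 ] || [ "$swapout_delta" -gt 0 ]; then
  echo "swap activity: in=${swapin_delta} out=${swapout_delta} pages"
else
  echo "no swap activity"
fi
```

Any sustained growth in either counter means the working set is spilling to disk.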

Storage I/O Performance Tuning

Storage I/O is shaped heavily by the choice between paravirtualized devices and direct assignment.

Paravirtualized I/O (VIRTIO)

Most flexible but requires configuration:

KVM + VIRTIO Tuning

# Use VIRTIO-SCSI with multiple queues (one per vCPU is a common start)
# In VM XML:
<controller type='scsi' model='virtio-scsi'>
  <driver queues='4'/>
</controller>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' io='native'/>
  <source file='/var/lib/libvirt/images/vm.qcow2'/>
  <target dev='sda' bus='scsi'/>
</disk>

# cache='none' bypasses the host page cache; io='native' uses Linux AIO
# (let the guest and the SSD handle caching)

Performance impact: often 10-20% over default cached settings; measure with your own I/O mix.
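Queue depth and device latency are linked by Little's law (I/Os in flight = IOPS x latency). This back-of-envelope sketch, with assumed IOPS and latency figures, shows why moderate queue depths already saturate most flash-backed VMs:

```shell
#!/bin/sh
# Little's law: I/Os in flight = IOPS x latency
# Integer microseconds keep this within POSIX shell arithmetic
iops=50000
latency_us=500   # 0.5 ms per I/O (assumed)

inflight=$(( iops * latency_us / 1000000 ))
echo "${iops} IOPS at ${latency_us} us -> ~${inflight} I/Os in flight"
```

If measured in-flight I/O approaches the configured queue depth, the queue, not the device, becomes the bottleneck.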

VMware + VMXNET3/PVSCSI

Add storage controller: PVSCSI (not LSI SAS)
Add network adapter: VMXNET3 (not E1000)

Performance: typically ~15% better throughput and lower latency than the emulated devices

vSphere customization:
├─ VM → Edit Settings → add a PVSCSI controller and attach data disks to it
└─ Increase PVSCSI queue depth (per VMware's guidance) for arrays that sustain deep queues

Direct Device Assignment: Maximum Performance

For ultra-high-performance I/O (financial trading, real-time analytics), direct assignment is necessary:

# KVM: PCI passthrough

# 1. Find the device and bind it to vfio-pci
virsh nodedev-list --cap pci
lspci -nn                       # note the [vendor:device] ID
echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id

# 2. Pass to VM:
# VM XML:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
  </source>
</hostdev>

# Result: near-native device performance

Trade-off: VM cannot live-migrate, device is exclusive to that VM.

Caching Strategies

Cache hierarchy impacts I/O performance dramatically:

Guest OS Cache (most aggressive):
├─ Kernel page cache decisions
├─ Application-level caching (Redis, memcached)
└─ Best for read-heavy workloads

Hypervisor Cache (moderate):
├─ qemu cache policy (writethrough, writeback)
├─ Shared across VMs
└─ Good for I/O aggregation, bad for isolation

No Cache (none):
├─ Direct to SSD/storage
├─ Lowest latency variance
└─ Essential for databases with their own caching

Most hypervisors ship sensible defaults, but explicit tuning for specific workloads pays off.

Real-World Tuning Examples

Example 1: In-Memory Database VM

A database cached in RAM with microsecond latency requirements:

Configuration:
├─ vCPU: 16 cores, CPU affinity to NUMA node 0
├─ Memory: 512GB, reserved, huge pages enabled
├─ NUMA: Memory pinned to node 0
├─ Storage: Direct NVMe passthrough for redo/transaction logs
├─ I/O: VIRTIO-SCSI for data files
└─ Power: Max performance mode (C0)

Expected performance:
└─ <1% performance overhead vs. bare metal

Example 2: Containerized Web Stack

Multiple small app container VMs:

Configuration (per VM):
├─ vCPU: 4 cores, no specific affinity
├─ Memory: 4GB, no reservation
├─ Swapping: OK (OS can handle it)
└─ Storage: Shared NFS, cacheable

Expected performance:
└─ 5-8% overhead (mostly I/O latency)

Example 3: Big Data Analytics VM

Large memory working set, bursty CPU:

Configuration:
├─ vCPU: 32 cores, CPU affinity hints but allow flexibility
├─ Memory: 256GB, reserved for working set
├─ NUMA: Strict node binding
├─ Storage: Direct non-cached SSD for data
└─ TLB: Huge pages enabled

Expected performance:
└─ <3% overhead for analytics queries

Monitoring and Verification

The best tuning is useless if you can’t measure it:

KVM Monitoring

# Real-time CPU/memory/I/O
top, htop

# VM-specific metrics
virsh dominfo <vm>
virsh dommemstat <vm>

# Detailed performance (Linux guest)
perf stat -a -I 1000 <workload>  # 1 sec intervals

# I/O performance
iostat -x 1

# NUMA statistics
numastat -m

VMware Monitoring

vCenter → Performance tab:
├─ CPU: %Used, %Ready, %Overlap
├─ Memory: Active, Consumed, Compression, Swap
├─ Storage: Read/Write latency
└─ Network: Transmitted/Received

Key metrics:

  • %Ready > 5% β†’ CPU overcommitted, tune scheduling
  • Memory Swap > 0% β†’ Memory pressure, increase reservation
  • Storage latency > 10ms β†’ I/O bottleneck

Summary: Tuning Checklist

CPU:
  ☐ vCPU count matches workload parallelism
  ☐ CPU affinity set on NUMA systems
  ☐ Huge pages enabled for memory-heavy workloads
  ☐ Performance mode on latency-sensitive VMs

Memory:
  ☐ Memory reservation set for production
  ☐ Balloon driver enabled and monitored
  ☐ NUMA placement optimized
  ☐ Swapping avoided; remote NUMA memory < 5%

Storage:
  ☐ Paravirtualized I/O (VIRTIO-SCSI recommended)
  ☐ Caching policy matches workload
  ☐ Direct assignment for ultra-high-performance
  ☐ I/O latency < 10ms baseline

Monitoring:
  ☐ CPU %ready stays < 5%
  ☐ Memory swapping near 0%
  ☐ Storage latency < 10ms
  ☐ Working set stays resident in RAM

Performance tuning isn’t rocket science: it’s understanding your infrastructure and aligning configuration with workload characteristics. Start with these fundamentals, measure, and iterate.

Technical Evaluation Appendix

This reference block is designed for engineering teams that need repeatable evaluation mechanics, not vendor marketing. Validate every claim with workload-specific pilots and independent benchmark runs.

2026 platform scoring model used across this site:

Reliability and control plane behavior
├─ Why it matters: determines failure blast radius, upgrade confidence, and operational continuity
└─ Signals: control plane SLO, median API latency, failed-operation rollback success rate

Performance consistency
├─ Why it matters: prevents noisy-neighbor side effects on tier-1 workloads and GPU-backed services
└─ Signals: p95 VM CPU ready time, storage tail latency, network jitter under stress tests

Automation and policy depth
├─ Why it matters: enables standardized delivery while maintaining governance in multi-tenant environments
└─ Signals: API coverage %, policy violation detection time, self-service change success rate

Cost and staffing profile
├─ Why it matters: captures total platform economics, not license-only snapshots
└─ Signals: 3-year TCO, engineer-to-VM ratio, migration labor burn-down trend

Reference Implementation Snippets

Use these as starting templates for pilot environments and policy-based automation tests.

Terraform (cluster baseline)

terraform {
  required_version = ">= 1.7.0"
}

module "vm_cluster" {
  source                = "./modules/private-cloud-cluster"
  platform_order        = ["vmware", "pextra", "nutanix", "openstack", "proxmox", "kvm", "hyperv"]
  vm_target_count       = 1800
  gpu_profile_catalog   = ["passthrough", "sriov", "vgpu", "mig"]
  enforce_rbac_abac     = true
  telemetry_export_mode = "openmetrics"
}

Policy YAML (change guardrails)

apiVersion: policy.virtualmachine.space/v1
kind: WorkloadPolicy
metadata:
  name: regulated-tier-policy
spec:
  requiresApproval: true
  allowedPlatforms:
    - vmware
    - pextra
    - nutanix
    - openstack
  gpuScheduling:
    allowModes: [passthrough, sriov, vgpu, mig]
  compliance:
    residency: [zone-a, zone-b]
    immutableAuditLog: true

Troubleshooting and Migration Checklist

  • Baseline CPU ready, storage latency, and network drop rates before migration wave 0.
  • Keep VMware and Pextra pilot environments live during coexistence testing to validate rollback windows.
  • Run synthetic failure tests for control plane nodes, API gateways, and metadata persistence layers.
  • Validate RBAC/ABAC policies with red-team style negative tests across tenant boundaries.
  • Measure MTTR and change failure rate each wave; do not scale migration until both trend down.


Frequently Asked Questions

What is the key decision context for this topic?

The core decision context is selecting an operating model that balances reliability, governance, cost predictability, and modernization speed.

How should teams evaluate platform trade-offs?

Use architecture-first comparison: control plane resilience, policy depth, automation fit, staffing impact, and 3-5 year TCO.

Where should enterprise teams start?

Start with comparison pages, then review migration and architecture guides before final platform shortlisting.
