Optimize EKS: Ultimate Migration Guide from Cluster Autoscaler to Karpenter

Many organizations using Amazon EKS still rely on Cluster Autoscaler, which often leads to over-provisioning and rising infrastructure costs. This guide walks you through a zero-downtime migration to Karpenter, a smarter and more efficient autoscaler for Kubernetes.

Cluster Autoscaler (CA) has long been the default choice for node scaling on EKS. While it is a proven tool, it often results in underutilized infrastructure, slower scaling, and higher costs.

Here’s why migrating to Karpenter is a game changer:

Faster Scaling
Karpenter provisions new nodes in 60–90 seconds, compared to 3–5 minutes with Cluster Autoscaler.

Intelligent Instance Selection
Automatically chooses the most cost-effective and performance-optimized instance types, including support for Spot Instances.

Cost Optimization
By leveraging Spot capacity and better bin-packing, Karpenter helps cut infrastructure costs by 40–60%.

Improved Resource Utilization
Smarter scheduling = fewer idle resources and better cluster efficiency.

Where Cluster Autoscaler falls short:

  • No dynamic instance type selection – Relies on Auto Scaling Groups with predefined instance types that must be configured manually in advance, whereas Karpenter selects dynamically from hundreds of available instance types based on workload requirements
  • Limited Spot Instance flexibility – Doesn’t support intelligent failover from Spot to On-Demand instances, only hardcoded percentages
  • ASG-dependent scaling constraints – Min/Max scaling is tied to Auto Scaling Group configurations
  • Slow node provisioning – Adds nodes slowly, reacting to unschedulable pods.
  • Poor resource utilization – Leads to over-provisioned clusters and wasted costs due to inflexible instance selection.

Before making any infrastructure changes, it’s critical to understand the current state of your EKS cluster. A successful migration to Karpenter starts with identifying inefficiencies and uncovering opportunities for optimization.

# Node overview
kubectl get nodes -o wide
kubectl top nodes

# Pod resource requests vs actual usage
kubectl top pods --all-namespaces
kubectl describe nodes | grep -A5 "Allocated resources"
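
To compare what workloads request against what they actually use (per the kubectl top output above), a plain kubectl query is often enough. A rough sketch using custom-columns; multi-container pods print comma-separated values:

# Pod CPU/memory requests across the cluster
kubectl get pods --all-namespaces -o custom-columns=\
NAMESPACE:.metadata.namespace,NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory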

What to check:

  • Cluster resource usage
    Review total node count, CPU and memory utilization across your workloads.
  • Application requests vs. actual usage
    Most apps request more CPU/memory than needed, leading to inefficiency.
  • Instance type efficiency
    Assess if current instance types fit your workload characteristics.
  • Scaling behavior
    Look at how long scaling takes and how well it matches demand.
  • Cost visibility
    Identify high-cost patterns or unused capacity.

Tools to use:

  • Kubernetes Resource Recommender (KRR)
  • CloudWatch Metrics
  • AWS Cost Explorer or AWS Cost Management Tools
  • Grafana and Prometheus

Our Real Customer Use Case:

In our recent implementation with a production EKS cluster, we discovered several critical inefficiencies:

Key Findings:

  • 2–3x over-provisioned CPU/memory
  • Node utilization < 20%
  • Current node count: 22 nodes
  • Cost dominated by on-demand instances
  • Cluster Autoscaler too slow to meet demand
  • Instance types: m6a.xlarge
  • Over-provisioned applications

 

Phase 1: Resource Optimization

 

Before deploying Karpenter, we focused on optimizing workloads to improve efficiency, availability, and readiness for consolidation:

  • Health checks
    Add readiness, liveness, and startup probes.
  • High availability
    Use multiple replicas and appropriate autoscaling metrics.
  • Resource right-sizing
    Apply KRR recommendations; test changes gradually.
  • PodDisruptionBudgets (PDBs)
    Temporarily relax constraints to allow node consolidation.

Best Practices Before Implementation

# Add to all services
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 5

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12

High Availability Setup:

# Minimum 2 replicas for production
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 65
  targetMemoryUtilizationPercentage: 70

# For RabbitMQ-based scaling
autoscaling:
  type: rabbitmq
  rabbitMqEnv: CELERY_BROKER_URL
  queues:
    - length: 20
      name: celery-queue

Fix PodDisruptionBudgets

# Temporarily allow all pods of a workload to be evicted so nodes can drain
minAvailable: 0%
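
In manifest form, a temporarily relaxed PDB might look like the sketch below (the name and label are placeholders; restore your normal values once migration is complete):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 0%     # relaxed during migration so nodes can be drained
  selector:
    matchLabels:
      app: backend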

Resource Optimization with KRR

KRR analyzes historical pod usage (CPU and memory) and provides right-sizing recommendations based on filters like specific hours or days to reflect realistic workload patterns.

Install and Run KRR Analysis:

# Install KRR
pip install krr

# Run analysis on the production cluster (336 hours = 14 days of history)
krr simple --cluster your-cluster-name --allow-hpa --history_duration=336

In the images, memory and CPU requests are compared to the recommended values; red highlights indicate requests that are significantly higher than necessary.

Key KRR Findings Example:

  • backend services: most request 500m CPU but actually use around 50m
  • celery-workers: Memory over-allocated by 60%
  • Recommendation: Reduce overall resource requests by 40–60%

Apply Resource Changes

Always set memory requests and limits to ensure stable pod scheduling and avoid OOM kills. CPU requests matter for scheduling, but CPU limits are generally discouraged: they cause throttling and tend to degrade performance under load more than they help.

Setting appropriate resource requests is your responsibility as an application owner. Use data to guide decisions based on actual usage patterns.

Example optimization:

# Before
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    memory: 1Gi

# After (KRR optimized)
resources:
  requests:
    cpu: 50m
    memory: 280Mi
  limits:
    memory: 340Mi

Apply and test in dev/staging environments before production.

Expected results:

  • Node count reduction: 40–60%
  • Cluster utilization increase: from ~15% to 40%+
  • Same performance, lower cost
  • Cost savings: ~50%

Load Testing in Dev and Staging Environments

To validate our optimizations and ensure reliable scaling behavior, we ran synthetic load tests in the staging environment. This simulated high traffic and stressed the cluster, helping us confirm that resource right-sizing, autoscaling policies, and HA settings were functioning as expected under pressure.
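
The specific load-testing tool isn't prescribed here; as one hedged example, a simple HTTP load generator such as hey can drive traffic against a placeholder staging endpoint while you watch the HPA and node counts react:

# 10 minutes of sustained load from 100 concurrent workers
hey -z 10m -c 100 https://staging.example.com/health

# In parallel, watch autoscaling and node churn
kubectl get hpa --all-namespaces -w
kubectl get nodes -w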

Phase 2: Deploy Karpenter

In this phase the Karpenter controller runs alongside Cluster Autoscaler. It covers the infrastructure prerequisites, deploying the Karpenter controller, and creating the NodePool configuration.

Requirements:

  • Tagged subnets and security groups (see the tagging sketch after this list)
  • IAM role with EC2 permissions
  • Proper VPC configuration
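
Karpenter discovers subnets and security groups by tag. A minimal sketch using the conventional karpenter.sh/discovery tag; the resource IDs, cluster name, and role name are placeholders:

# Tag the subnets and security groups Karpenter should use
aws ec2 create-tags \
  --resources subnet-0abc123 sg-0abc123 \
  --tags Key=karpenter.sh/discovery,Value=your-cluster-name

An EC2NodeClass (Karpenter v1 API) then selects those resources by the same tag:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest                  # Amazon Linux 2023 AMIs
  role: KarpenterNodeRole-your-cluster      # IAM role assumed by provisioned nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: your-cluster-name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: your-cluster-name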

Deployment strategy:

  • Run Karpenter alongside Cluster Autoscaler
  • Use Helm or Terraform (a Helm sketch follows this list)
  • Karpenter nodes outside ASGs
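
If you go the Helm route, a hedged sketch based on the official Karpenter chart (pin the version to your target release; the cluster name, account ID, and role name are placeholders):

# Install the Karpenter controller alongside Cluster Autoscaler
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --version "1.0.6" \
  --set settings.clusterName=your-cluster-name \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::111122223333:role/KarpenterControllerRole \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --wait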

NodePool/Provisioner setup (example sketch below):

  • Define instance types, capacity types, zones
  • Enable consolidation
  • Separate general and critical workloads
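
A minimal NodePool sketch for the general application pool (Karpenter v1 API; instance types, zones, and the CPU limit are illustrative and should be tuned to your workloads):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: app-node-pool
spec:
  template:
    metadata:
      labels:
        type: app-node-pool              # matched by the pod affinity shown later
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6a.large", "m6a.xlarge", "m6a.2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    cpu: "200"                           # cap on total CPU this pool may provision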

What is Consolidation?

Consolidation is Karpenter's automatic cost-optimization feature: it continuously monitors node utilization and intelligently moves workloads from underutilized nodes to more efficient placements. When nodes become empty or underutilized, Karpenter automatically terminates them, reducing infrastructure costs without manual intervention. PodDisruptionBudgets can block consolidation by preventing pod movement, which is why temporarily relaxing PDBs during migration allows Karpenter to optimize node usage more effectively.
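
Consolidation is enabled per NodePool through its disruption settings. A hedged sketch for the Karpenter v1 API (older v1beta1 releases use WhenUnderutilized / WhenEmpty policy values instead):

# Added under spec of the NodePool you want Karpenter to consolidate
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m                 # how long to wait after pod churn before consolidating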

Phase 3: Controlled Migration

Two-tier node strategy:

  • Critical infrastructure → On-demand nodes
  • Application workloads → Spot or mixed capacity

Use nodeAffinity rules and labels to direct workloads.

Managing Spot vs. On-Demand Deployments

To gradually introduce Spot capacity, we defined separate NodePool (or Provisioner) resources for on-demand and spot instances. Then, we used pod-level node affinity and tolerations to control where each workload could be scheduled.

This allowed us to test Spot reliability with non-critical workloads while keeping core services on on-demand nodes.

This approach gave us confidence in Spot’s performance before expanding its usage cluster-wide.
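
On the provisioning side, that separation can be expressed as a second NodePool restricted to on-demand capacity (same Karpenter v1 assumptions; the name and CPU limit are illustrative). Workloads then opt in through the selectors and affinity below.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical-on-demand
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]          # no Spot for critical infrastructure
  limits:
    cpu: "64"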

Infrastructure Services (On-Demand):

nodeSelector:
  karpenter.sh/capacity-type: "on-demand"

Application Services (Mixed/Spot-Preferred):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: type
              operator: In
              values: ["app-node-pool"]

Using Taints and Tolerations for Workload Isolation

To prevent basic deployments from accidentally scheduling on expensive on-demand nodes, implement taints and tolerations as an additional layer of control. Taint your on-demand nodes with a "dedicated=critical:NoSchedule" taint so that only pods with the corresponding toleration can be scheduled on them. Basic application deployments without tolerations are automatically kept off on-demand nodes and directed to Spot instances. This enforces cost discipline: only explicitly configured critical workloads can access expensive on-demand capacity, while regular applications land on cost-effective Spot nodes. The taint/toleration strategy works alongside node affinity to create multiple layers of workload placement control.
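
A hedged sketch of the pairing: the taint lives on the on-demand NodePool, and only critical workloads carry the matching toleration (field paths follow the standard Kubernetes and Karpenter v1 schemas):

# On the on-demand NodePool, under spec.template.spec
taints:
  - key: dedicated
    value: critical
    effect: NoSchedule

# On critical Deployments, under spec.template.spec
tolerations:
  - key: dedicated
    operator: Equal
    value: critical
    effect: NoSchedule
nodeSelector:
  karpenter.sh/capacity-type: "on-demand"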

Migration approaches:

  • Option 1: Natural migration
    Let workloads reschedule slowly.
  • Option 2: Rolling restart
    Force redeployments with zero downtime.

# Force immediate migration (5-10 minutes)

kubectl get deployments -n apps -o name | xargs -I {} kubectl rollout restart {} -n apps

Pods moving from CA nodes to Karpenter nodes

Monitor during migration:

  • Pod placement
  • App performance
  • Spot interruptions and rescheduling behavior

# Watch pod distribution

kubectl get pods -o wide --all-namespaces | grep -E "(karpenter|cluster-autoscaler)"

# Node status

kubectl get nodes -l type=app-node-pool

Phase 4: Complete Transition

Handle system components:

  • Temporarily relax PDBs
  • Drain and move system pods
  • Restore PDBs afterward

Node cleanup tasks:

  • Delete ASG nodes
  • Remove Cluster Autoscaler
  • Clean up related AWS resources
  • Remove migration-specific taints/affinities

Drain CA Nodes: Validation and Monitoring

# Cordon and drain all CA nodes
kubectl cordon -l eks.amazonaws.com/nodegroup
kubectl drain -l eks.amazonaws.com/nodegroup --ignore-daemonsets --delete-emptydir-data --timeout=600s

Final Cleanup:

# Scale down Cluster Autoscaler
kubectl scale deployment aws-cluster-autoscaler -n kube-system --replicas=0

# Restore system PDBs
kubectl patch pdb aws-cluster-autoscaler -n kube-system -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'

What to measure:

  • Node startup time
  • CPU/memory utilization
  • Spot usage and fallback behavior
  • Infrastructure cost change

Performance Metrics

# Check cluster utilization

kubectl top nodes

kubectl get nodes

# Verify application health

kubectl get pods --all-namespaces | grep -v Running

Key Success Metrics:

  • Node count: 22 → 8–12 nodes
  • Utilization: 10% → 40–60%
  • Scaling time: 5 minutes → 90 seconds
  • Cost reduction: ~50%
  • Zero downtime achieved

Long-term monitoring:

  • Set alerts for scheduling/provisioning errors
  • Watch controller logs (see the commands after this list)
  • Track costs and scaling patterns
  • Maintain a change log
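
A few commands that help with day-2 monitoring (the label selector assumes the official Helm chart's app.kubernetes.io/name=karpenter label and a kube-system install):

# Follow controller logs for provisioning and consolidation decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f --tail=100

# Inspect Karpenter's capacity objects
kubectl get nodepools
kubectl get nodeclaims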

Best Practices

  • Test thoroughly in staging
  • Migrate in phases
  • Add proper health checks
  • Use at least 2 replicas for key services
  • Monitor throughout migration

Cost Analysis Details

We tracked cost impact throughout the migration to measure ROI clearly.

Cost Breakdown:

  • Before: ~$9,000/month (22 nodes × ~$410 on-demand)
  • After: ~$3,800/month (12 nodes, ~70% Spot)
  • Savings: ~58% reduction in monthly infra costs
  • ROI Timeline: Break-even in less than 4 weeks (including engineering time and testing)

These savings came primarily from reducing overprovisioning, replacing underutilized nodes, and shifting most workloads to Spot capacity.

Post-migration optimization:

  • Tune provisioning configs
  • Adjust spot/on-demand weighting
  • Standardize provisioning rules

Advanced workload strategies:

  • Use taints/tolerations for workload separation
  • Add burstable pools for spikes
  • Combine with HPA/VPA

Scaling across environments:

  • Replicate setup across all clusters
  • Create CI pipelines for config changes
  • Build dashboards for monitoring and cost

Migrating from Cluster Autoscaler to Karpenter is not just a technical upgrade—it’s a strategic shift. Benefits include lower infra costs, faster and more intelligent scaling, simpler declarative provisioning, and higher utilization with less waste. The recommended approach is to optimize workloads first, deploy Karpenter in parallel, migrate with control, and monitor and tune continuously. Spot vs. on-demand usage can be gradually introduced using node affinity to safely place critical workloads on stable instances while testing others on Spot. With careful planning, the migration yields immediate ROI and a more scalable, cost-efficient Kubernetes platform for the future.

Bar Zviely
DevOps Engineer