The potential of pod migrations in Kubernetes
How they can improve cluster utilization and prevent stateful workloads from losing progress
Kubernetes is used not only for stateless applications but also for long-running, stateful containers in High Performance Computing (HPC), Machine Learning, and other areas. When such a container fails, Kubernetes has no way to rescue it: its state is often not persisted on disk, so StatefulSets can't help either. Likewise, there is no way to optimize cluster resource utilization by rescheduling these workloads (see the descheduler project). That matters most for workloads with unknown resource requirements, although resource overprovisioning is widespread across the industry in general. Experts estimate the average utilization at between 6% and 12% [1]. Fortunately, initial investigations suggest that container migrations are feasible with reasonable downtime and could bring immense increases in cluster resource efficiency.
How to migrate a pod?
Before the advent of containers, VM migrations were a well-established solution for moving application state across machines. Luckily, the CRIU project has made it possible to checkpoint and restore containers as well. Over the past months, I set up a Kubernetes cluster with pod migration support based on prior work, explored migration performance using high-speed file servers on Azure, and investigated the feasibility and potential of migration for a commercial HPC service running on Kubernetes. You can find a tutorial for the cluster setup here and a short pod migration demo here.
Case study results and the potential of migration
The research findings show that migration time is mainly influenced by memory load: it takes just a few seconds for containers using a few gigabytes of memory and still less than 2 minutes for 50 GB. A migration controller observed node memory consumption and preemptively migrated workloads when a resource shortage was imminent. Different intervention heuristics were investigated, and the best-performing one used a memory threshold to trigger migration. By selecting the pods with the steepest memory slope for migration, the heuristic prevented forceful pod evictions due to resource over-commitment in most cases. The analysis was performed with production cluster data on a Kubernetes scheduler simulator. For the investigated HPC service, the simulations showed that cluster utilization could be improved by a factor of up to 4 compared to status quo production scenarios. You can see the migration controller in action here:
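To make the threshold heuristic more concrete, here is a minimal Python sketch of how such a decision could look. It is not the controller used in the case study. The names `PodMemory`, `memory_slope`, and `pick_pod_to_migrate`, the 90% default threshold, and the least-squares slope estimate are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# One sample is (timestamp in seconds, memory usage in bytes).
Sample = Tuple[float, float]


@dataclass
class PodMemory:
    """Hypothetical container for a pod's recent memory samples."""
    name: str
    samples: Sequence[Sample]


def memory_slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory over time, in bytes per second.

    Returns 0.0 when there are fewer than two samples or when all
    samples share the same timestamp.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return num / denom


def pick_pod_to_migrate(
    pods: Sequence[PodMemory],
    node_capacity: float,
    threshold: float = 0.9,  # assumed value, not taken from the study
) -> Optional[str]:
    """Return the name of the pod to migrate, or None if no action is needed.

    Migration is triggered once the latest memory usage of all pods
    reaches `threshold * node_capacity`. The pod whose memory grows
    fastest is then selected.
    """
    candidates = [p for p in pods if p.samples]
    if not candidates:
        return None
    used = sum(p.samples[-1][1] for p in candidates)
    if used < threshold * node_capacity:
        return None
    return max(candidates, key=lambda p: memory_slope(p.samples)).name
```

A real controller would read these samples from node metrics and then trigger the checkpoint and restore. The sketch only covers the selection step.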
The status quo and conclusion
Research on stateful pod migration in Kubernetes is rare, but prior proofs of concept have shown that it is practically feasible. However, they did not evaluate its applicability or benefits for real-world problems. There is demand for such a feature [3, 4], and a checkpointing solution (without restore) was recently integrated into Kubernetes v1.25 as an alpha feature. Container migration is a great option for moving stateful pods without application knowledge. It is a workload-agnostic approach whose downtime is fairly predictable from the memory load of the container. While some applications have features to persist state to disk, not all do, and where such features exist, the effect of a migration that relies on them (e.g. downtime, checkpointable state) is often hard to predict. The technical setup is still the biggest hurdle to overcome, but there is an easier-to-use project called KubeVirt that allows running a VM inside a pod, which can be live migrated through a Kubernetes API extension object.
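For readers who want to try the built-in checkpointing mentioned above, the kubelet exposes it as a POST endpoint of the form `/checkpoint/{namespace}/{pod}/{container}` on its API port (10250 by default). The Python sketch below only builds such a request. The helper names `checkpoint_url` and `checkpoint_request` are made up for this example, and it assumes that the `ContainerCheckpoint` feature gate is enabled, that the container runtime supports checkpointing, and that you add the client credentials your kubelet requires before sending anything.

```python
import urllib.request
from urllib.parse import quote


def checkpoint_url(node_host: str, namespace: str, pod: str,
                   container: str, port: int = 10250) -> str:
    """Build the kubelet checkpoint URL for one container.

    `node_host` is the address of the node that runs the pod.
    """
    parts = [quote(p, safe="") for p in (namespace, pod, container)]
    return f"https://{node_host}:{port}/checkpoint/" + "/".join(parts)


def checkpoint_request(node_host: str, namespace: str, pod: str,
                       container: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for a checkpoint.

    Sending it requires kubelet authentication, for example a client
    certificate configured through an ssl.SSLContext. That part is
    cluster specific and left out here.
    """
    url = checkpoint_url(node_host, namespace, pod, container)
    return urllib.request.Request(url, method="POST")
```

Keep in mind that this only creates a checkpoint archive on the node. Restoring it on another node is exactly the part that still needs a custom setup.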
If you have a use case for pod migration or want to know more about my findings, please reach out! I would like to make stateful pod migration more accessible!