Kubernetes — Automatically Cleaning Up Evicted Pods Is So Easy It May Surprise You!

Paul Dally
5 min read · Feb 28, 2022

Pods can be evicted for a variety of reasons.

Examples of this include (but are not limited to):

  • VerticalPodAutoscaler scaling a Pod
  • When a Node is under resource pressure (CPU, memory, etc.)
  • When a higher priority Pod preempts a lower priority Pod
  • When a Pod’s emptyDir usage exceeds the configured sizeLimit

You may not be configuring sizeLimit on your emptyDirs, but you probably should, and if you do, you will invariably have evicted Pods to deal with at some point.
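For example, here is a minimal sketch of a Pod that sets a sizeLimit on its emptyDir (the names, image and the 1Gi limit are purely illustrative, not from the working example):

apiVersion: v1
kind: Pod
metadata:
  name: scratch-example                # hypothetical name
spec:
  containers:
  - name: app
    image: busybox                     # illustrative image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi                   # the kubelet evicts the Pod if usage exceeds this limit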

Regardless of how a Pod gets evicted, too many evicted Pods can negatively impact the stability, performance and usability of your Kubernetes cluster! You need to be able to clean up those evicted Pods to keep your cluster running optimally.

Default Pod garbage collection settings

Kubernetes will eventually garbage collect evicted Pods. Unfortunately, in a “vanilla” Kubernetes distribution it will wait until there are 12,500 terminated Pods before doing so.

This behavior can often be modified with the --terminated-pod-gc-threshold setting on the kube-controller-manager. Different Kubernetes platforms may have different default values. Unfortunately, some of the most common Kubernetes platforms, most notably the more popular hosted ones, do not allow their default value to be overridden.
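If you do control the control plane (for example, on a kubeadm-based cluster), the flag can be set in the kube-controller-manager static Pod manifest. A sketch of the relevant fragment, with an illustrative threshold of 500 instead of the default 12,500:

# Fragment of /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm-style)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --terminated-pod-gc-threshold=500    # garbage collect once 500 terminated Pods exist
    # ...other flags unchanged...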

Why the standard Pod garbage collection may not be adequate

As noted previously, an excess of evicted Pods could impact the performance of your cluster, with the impact getting more severe as latency between your nodes increases. In a stretch (or wide) cluster, especially one running across multiple regions, it is that much more important to keep the number of evicted Pods to a reasonable level.

In some configurations, evicted Pods may even cause IP address starvation, because they continue to hold the IP addresses they were assigned until they are deleted. Having an excess of evicted Pods can also make managing your applications or cluster more difficult; after all, who wants kubectl get pod to return 2 running Pods amongst 12,499 evicted Pods? Not me…

Another consideration is that garbage collection of terminated Pods is done on a cluster-wide basis, and terminated/evicted Pods may not be evenly distributed amongst your namespaces. If 10 namespaces all have evicted Pods, garbage collection might remove all of the evicted Pods from 7 of them simply because they happened to be the oldest, thereby removing the evidence required to diagnose and correct the underlying issue. It would be far preferable to have an easy way to automatically remove evicted Pods on a namespace-by-namespace basis.

So what can we do?

There are plenty of articles and documents out there that illustrate ad-hoc, manual commands that can be run by an administrator. This might be reasonable in an emergency situation, but ongoing reactive manual intervention is a terrible way to run a production cluster! Furthermore, deleting all of the evicted Pods removes the very information that developers may need to determine how to fix certain underlying issues that might be causing the Pod to be evicted in the first place.

Let’s kick things up a couple of notches, and automate the cleanup of evicted Pods on a per-namespace basis, allowing non-administrators to deploy cleanup functionality with their application for their specific Namespace. Let’s also allow administrators to deploy a similar cluster-wide capability. Either way, we’d like to be able to leave some of the evicted Pods around, deleting all but the most recent handful — so that users can investigate and resolve the underlying cause of the Pod eviction.

Solution 1 — cleanup a single Namespace

If you are not a cluster admin, and you can’t get your cluster admins to do anything on a cluster-wide basis, you can at least implement a solution for your Namespace(s).

The idea is to create a CronJob that periodically runs a Pod that deletes evicted Pods based on whatever logic you like. The Pods run under a ServiceAccount that is bound to a Role (via a RoleBinding) giving the ServiceAccount permission to get, list and ultimately delete Pods. The image for our CronJob includes jq, the kubectl binary and some simple scripts that you can modify to fit your specific requirements.
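The RBAC objects might look something like this (the evicted-pod-cleanup names are hypothetical; adjust them to your own conventions):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: evicted-pod-cleanup
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: evicted-pod-cleanup
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: evicted-pod-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: evicted-pod-cleanup
subjects:
- kind: ServiceAccount
  name: evicted-pod-cleanup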

You don’t need cluster admin privileges or a ClusterRole in the “same Namespace” scenario, as you will only be deleting Pods in your own Namespace. The script invoked by spec.jobTemplate.spec.template.spec.containers[0].command (and baked into the image) looks like this:

#!/bin/bash
# Get the list of Pods, then select the items that have
# been evicted, sort by the startTime (ascending), then
# select all but the most recent 3. Then pass just the
# names of those Pods to kubectl to be deleted
kubectl get pod -o=json | jq '[.items[] | select(.status.reason=="Evicted")] | sort_by(.status.startTime) | .[0:-3]' | jq -r '.[] .metadata.name' | xargs --no-run-if-empty kubectl delete pod

The CronJob can run at whatever frequency you like (in the working example source code I run it every 5 minutes to quickly illustrate the functionality, but a longer interval between executions would likely be fine).
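The CronJob itself might look roughly like this (the name, image and script path are hypothetical placeholders for whatever your build produces; only the schedule matches the working example):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: evicted-pod-cleanup            # hypothetical name
spec:
  schedule: "*/5 * * * *"              # every 5 minutes, as in the working example
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: evicted-pod-cleanup   # the ServiceAccount bound above
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: your-registry/evicted-pod-cleanup:latest   # image containing kubectl, jq and the script
            command: ["/scripts/cleanup-evicted-pods.sh"]     # hypothetical script path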

Solution 2 — cleanup all Namespaces

Modifying solution 1 so that a single CronJob can clean up all Namespaces is pretty simple. It does, however, require a ClusterRole (with the same permissions as solution 1, plus the ability to list Namespaces) and a ClusterRoleBinding, which will likely require administrative access on your cluster to deploy.
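A sketch of the cluster-scoped RBAC (again, the evicted-pod-cleanup names and the Namespace the CronJob runs in are hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: evicted-pod-cleanup
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: evicted-pod-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: evicted-pod-cleanup
subjects:
- kind: ServiceAccount
  name: evicted-pod-cleanup
  namespace: cleanup-tools             # Namespace where the CronJob runs (hypothetical)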

This approach is far more efficient and easier to maintain centrally, since you only need one instance rather than one per Namespace, but if you can’t convince your cluster administrator to take it on, then solution 1 may be your only option.

The script required for solution 2 is very similar to that of solution 1, and looks like this:

#!/bin/bash
# Loop through all Namespaces
for namespace in $(kubectl get namespace -o=json | jq -r '.items[] .metadata.name') ; do
    echo "Processing namespace ${namespace}"

    # Get the list of Pods, then select the items that have
    # been evicted, sort by the startTime (ascending), then
    # select all but the most recent 3. Then pass just the
    # names of those Pods to kubectl to be deleted
    kubectl -n ${namespace} get pod -o=json | jq '[.items[] | select(.status.reason=="Evicted")] | sort_by(.status.startTime) | .[0:-3]' | jq -r '.[] .metadata.name' | xargs --no-run-if-empty kubectl -n ${namespace} delete pod
done

You could easily exclude certain Namespaces if you wanted.

If you are not an admin but own multiple Namespaces, you could even create a hybrid of solution 1 and solution 2, modifying the script to loop over only specific Namespaces. Of course, you would need to deploy a Role and RoleBinding in each of those Namespaces to grant the ServiceAccount in the Namespace running the CronJob access to the Pods in all of them. That would take some work, but wouldn’t be that difficult.
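For instance, each additional Namespace would need something along these lines (the Namespace and ServiceAccount names are hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: evicted-pod-cleanup
  namespace: other-team-namespace      # one of the additional Namespaces (hypothetical)
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: evicted-pod-cleanup
  namespace: other-team-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: evicted-pod-cleanup
subjects:
- kind: ServiceAccount
  name: evicted-pod-cleanup
  namespace: cronjob-namespace         # the Namespace where the CronJob actually runs (hypothetical)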

See? Whether you are a cluster admin or a developer, you now have a fully hands-off method to proactively clean up evicted Pods that will keep your application and cluster running optimally. Really easy!

The source code for this article can be found on GitHub.
