Kubernetes — Preventing HorizontalPodAutoscalers from Running Wild

Paul Dally
6 min read · Jan 20, 2022


When using Kubernetes, you might sometimes find that the number of Pods actually running isn’t the same as the number of replicas that you have configured on your Deployment (for example).

When the number of Pods is less than you expect, the explanation might be a ResourceQuota or LimitRange constraint, which you can diagnose by looking at the ReplicaSet associated with your Deployment. It might also be due to insufficient cluster resources. There are a number of other potential causes as well.

A more surprising scenario, however, is when the number of Pods is more than you expect. Sometimes the number of Pods can be a lot more than you expect.

One of the most common scenarios where this occurs is when using a HorizontalPodAutoscaler. The Kubernetes documentation tells us that a HorizontalPodAutoscaler “automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand. Horizontal scaling means that the response to increased load is to deploy more Pods… If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.”

If you’ve forgotten that you’ve deployed a HorizontalPodAutoscaler and your application is indeed experiencing increased load, then hooray!… everything may be operating according to design. But sometimes the number of Pods starts increasing even when the application is not under load. What could cause this?

One common reason is that you haven’t paid attention to the HorizontalPodAutoscaler settings (or to the “startup” or “resting” resource requirements of your Pods). A Pod doesn’t need to be experiencing increased load to consume resources — for example, container startup might require a burst of CPU, or a Java application may see increased memory usage over time simply due to growth of the heap resulting from caching. Another related aspect might be your startupProbe/readinessProbe (or the lack thereof).

Consider the following Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: helloworld
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: helloworld
    spec:
      containers:
      - name: hello-world
        image: helloworld-webserver:1.0
        # Cause CPU spike
        command: ["/bin/sh"]
        args: ["-c", "timeout 3m sha1sum /dev/zero | timeout 3m sha1sum /dev/zero; sleep infinity"]
        resources:
          requests:
            cpu: 15m
            memory: 32Mi
          limits:
            cpu: 50m
            memory: 64Mi

and the following HPA:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: helloworld-hpa
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helloworld

When this application starts, there is just one Pod:

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 2s

A few moments later, however, more Pods are created:

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-2xlkf 1/1 Running 0 22s
helloworld-76b6d6f4fc-7fzfj 1/1 Running 0 17s
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 33s
helloworld-76b6d6f4fc-lmq2n 1/1 Running 0 17s

And then more Pods (up to the maxReplicas specified in the HorizontalPodAutoscaler, in this case 10):

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-2xlkf 1/1 Running 0 2m1s
helloworld-76b6d6f4fc-7fzfj 1/1 Running 0 116s
helloworld-76b6d6f4fc-7gqxw 1/1 Running 0 16s
helloworld-76b6d6f4fc-8qww4 1/1 Running 0 31s
helloworld-76b6d6f4fc-9w8ct 1/1 Running 0 16s
helloworld-76b6d6f4fc-btwr6 1/1 Running 0 31s
helloworld-76b6d6f4fc-dhrz9 1/1 Running 0 31s
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 2m12s
helloworld-76b6d6f4fc-lmq2n 1/1 Running 0 116s
helloworld-76b6d6f4fc-xk97k 1/1 Running 0 31s

This is, of course, because the args of the podSpec cause the Pod to madly consume CPU for 3 minutes. This specific example is a bit contrived, but the scenario is not uncommon: Pods often have startup processing that needs to occur. Once the startup processing is done, the CPU usage drops and, eventually, the number of Pods starts to be reduced. This, however, can take some time (by default, the downscale stabilization period is 5 minutes). In our example above, the Pods spike their CPU usage for 3 minutes and then use essentially no CPU thereafter.

As a result, about 10 minutes after deploying, the HorizontalPodAutoscaler starts downscaling:

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-2xlkf 1/1 Running 0 10m
helloworld-76b6d6f4fc-7fzfj 1/1 Terminating 0 10m
helloworld-76b6d6f4fc-7gqxw 1/1 Terminating 0 8m45s
helloworld-76b6d6f4fc-8qww4 1/1 Terminating 0 9m
helloworld-76b6d6f4fc-9w8ct 1/1 Terminating 0 8m45s
helloworld-76b6d6f4fc-btwr6 1/1 Terminating 0 9m
helloworld-76b6d6f4fc-dhrz9 1/1 Terminating 0 9m
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 10m
helloworld-76b6d6f4fc-lmq2n 1/1 Running 0 10m
helloworld-76b6d6f4fc-xk97k 1/1 Terminating 0 9m

About a minute later, another Pod is terminated:

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-2xlkf 1/1 Running 0 11m
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 11m
helloworld-76b6d6f4fc-lmq2n 1/1 Terminating 0 11m

And a few minutes later, we are finally scaling back to minReplicas:

>kubectl -n default get pod
NAME READY STATUS RESTARTS AGE
helloworld-76b6d6f4fc-2xlkf 1/1 Terminating 0 14m
helloworld-76b6d6f4fc-gg8xv 1/1 Running 0 14m
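If the default five-minute scale-down stabilization window doesn’t suit your workload, newer API versions let you tune it. The following is a minimal sketch using the autoscaling/v2 behavior field; the 60-second window and 80% CPU target are illustrative values, not recommendations:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helloworld-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helloworld
  minReplicas: 1
  maxReplicas: 10
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60  # default is 300 (5 minutes)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80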

One of the most important things that allowed this to happen is the failure to implement a quality readinessProbe or startupProbe. According to the Kubernetes documentation, “When scaling on CPU, if any pod has yet to become ready (it’s still initializing, or possibly is unhealthy) or the most recent metric point for the pod was before it became ready, that pod is set aside” when considering whether usage requires scaling. If you aren’t already defining quality probes (and you really should be), hopefully this convinces you to start doing so.
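As a minimal sketch, assuming the helloworld-webserver image exposes an HTTP health endpoint on port 8080 (both the path and port here are assumptions — adjust them for your application), the container spec might gain probes like these:

      containers:
      - name: hello-world
        image: helloworld-webserver:1.0
        startupProbe:
          httpGet:
            path: /healthz   # hypothetical endpoint
            port: 8080
          periodSeconds: 5
          failureThreshold: 36   # allows up to 3 minutes for startup
        readinessProbe:
          httpGet:
            path: /healthz   # hypothetical endpoint
            port: 8080
          periodSeconds: 10

With probes like these in place, the Pods would not be considered ready during their startup CPU spike, and the HorizontalPodAutoscaler would set them aside rather than scaling on their startup usage.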

Newer versions of HorizontalPodAutoscaler can scale on a variety of metrics, not just CPU. If, for example, you implemented a HorizontalPodAutoscaler with a memory threshold of 60%, a resources.requests.memory of 1Gi, and started the Java application with an Xmx argument of 900M to cap the heap, what would happen? This article isn’t meant to be a full description of how the Java heap works, but in essence Java would gradually allow the heap to grow and would likely allow it to exceed 600M without triggering a full GC. At that point, even though the application may not be experiencing any meaningful load at all, the HorizontalPodAutoscaler would trigger an increase in the replicas.
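A sketch of that theoretical configuration, using the autoscaling/v2 API with the 60% target and the 1Gi request from the example above (60% of a 1Gi request is roughly 614Mi, so a heap allowed to drift toward 900M crosses the threshold long before the application is busy):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helloworld-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helloworld
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60   # percentage of resources.requests.memory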

Note: HorizontalPodAutoscalers use request and not limit for determining utilization. Where limit is higher than request, you can (and will often want to) specify utilization levels greater than 100% in your HPA.
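For example (hypothetical values), with the requests and limits below, a CPU utilization target of 150% corresponds to roughly 75m of actual usage, still well under the 200m limit:

# Container resources (hypothetical values):
resources:
  requests:
    cpu: 50m
  limits:
    cpu: 200m

# HPA metric target (autoscaling/v2):
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 150   # 150% of the 50m request = 75m of CPU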

If you provided an Xms argument to specify the minimum size of the heap, you could potentially find yourself in even more trouble. Remember that heap isn’t the only type of memory that a Java application can consume; Metaspace is another. If you specified an Xms of 500M and your application consumes 100M of Metaspace at startup, the JVM is already committing roughly 600M per Pod, and once other native memory (thread stacks, code cache, and so on) is added, every Pod might immediately breach the 60% memory threshold that we configured in our theoretical example. That would cause the HorizontalPodAutoscaler to scale the Pods to the maxReplicas level, except in this case they would never scale back down.
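A sketch of how that combination might look in the podSpec, assuming the JVM in the image honors the JAVA_TOOL_OPTIONS environment variable (the values mirror the theoretical example above and are not recommendations):

        env:
        - name: JAVA_TOOL_OPTIONS
          value: "-Xms500m -Xmx900m"   # 500M initial heap, 900M max heap
        resources:
          requests:
            memory: 1Gi   # 500M heap + ~100M Metaspace + other native memory
                          # sits at or above 60% of this request from startup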

It should also be noted that if you have a ResourceQuota on your Namespace, it makes no sense to configure your HPA with a maxReplicas greater than what the ResourceQuota would allow. If your podSpec has a request.cpu of 500m and your ResourceQuota allows up to 1000m of request.cpu, don’t set maxReplicas on the HPA to 10; only two Pods can ever be admitted.
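A minimal sketch of such a quota, to make the arithmetic concrete:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: helloworld-quota   # hypothetical name
spec:
  hard:
    requests.cpu: "1"   # 1000m of total CPU requests in the Namespace

With a 500m request per Pod, any attempt by the HPA to scale beyond 2 replicas would simply result in Pod creations being rejected by quota admission.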

The key takeaways?

  1. Implement good startupProbes/readinessProbes
  2. Make sure the resources on the podSpec, the application configuration, and any HorizontalPodAutoscalers are consistent with each other
  3. HPAs use request, not limit, for utilization targets; you can (and often should) specify targets greater than 100%
  4. Don’t specify an unreasonably high value for maxReplicas (and make sure that your ResourceQuota will allow for that number of replicas)

Paul Dally

AVP, IT Foundation Platforms Architecture at Sun Life Financial. Views & opinions expressed are my own, not necessarily those of Sun Life