Kubernetes — Probes (Not Just For Alien Abductions)

Paul Dally
5 min read · Jan 21, 2022


Kubernetes currently allows 3 types of probes on podSpecs:

  • livenessProbe
  • readinessProbe
  • startupProbe
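
All three are declared on a container in the podSpec. As a minimal sketch of where each one sits (the image name, endpoint path and port below are placeholders, not anything prescribed by Kubernetes):

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:1.0   # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:            # while this is failing, liveness/readiness probes are not run
      httpGet:
        path: /healthz       # placeholder endpoint
        port: 8080
    livenessProbe:           # a failure here restarts the container
      httpGet:
        path: /healthz
        port: 8080
    readinessProbe:          # a failure here removes the Pod from Service load balancers
      httpGet:
        path: /healthz
        port: 8080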

The official Kubernetes documentation for these probes is quite good, so please read it thoroughly.

I’ve found, however, that many developers seem to have outstanding questions about probes even after reading the documentation — when to use what type of probe, how to choose reasonable configuration values, what the probe should do, etc. Please feel free to raise any questions that you might have in the comments below.

Here are some thoughts that will hopefully help:

  1. You really need to understand what each type of probe does — if you haven’t at least skimmed the documentation, please do it (Maybe even right now… It’s ok, I’ll wait…). Great! I’m glad you are back! Now you know that if a livenessProbe fails, Kubernetes will restart the container, a failed readinessProbe will (usually) include or exclude a Pod from Service load balancers, and a failed startupProbe disables liveness and readiness checks, so that those probes don’t interfere with application startup. One important thing to notice is where the documentation says one use of readinessProbes relates to Service load balancers… there are other uses as well, for example, relating to HorizontalPodAutoscalers.
  2. You should almost always have a livenessProbe — if you do not have a livenessProbe, Kubernetes will assume that as long as the process started by the container’s command does not terminate, the container is healthy. I can think of dozens of scenarios in which a process can be running and yet not be able to effectively do what it is supposed to do (deadlock, leaks of memory or file handles, application coding errors, etc.). With perhaps the exception of Pods for a Job/CronJob (and even then, a livenessProbe can certainly be useful), I almost always recommend a livenessProbe.
  3. You should have a readinessProbe almost as often as you have a livenessProbe — essentially any time you might want a container to stop taking load for a while without necessarily restarting it. For example, performance may degrade on a particular container while it is processing a particularly complex request or computation. You might want it to stop taking traffic until performance improves (presumably when the complex request or computation is complete). Having the livenessProbe restart the container would interrupt the request, but a readinessProbe would simply remove the Pod from the Service load balancer pool until the probe starts passing again, at which point the container would be added back to the pool. Of course, this all depends on having the readinessProbe execute something that provides an accurate representation of container performance. Readiness also matters when using an HPA, because an impaired Pod may not be using resources in the same way as healthy Pods, which can skew the calculations that your HPA does when determining whether to scale or not.
  4. Containers that take a potentially long but variable amount of time to start should implement a startupProbe rather than using a long initialDelaySeconds on the readiness/liveness probes. Imagine a container that takes anywhere from 1 to 5 minutes to start. If you did not use a startupProbe, the initialDelaySeconds on the liveness and/or readiness probes would need to be 300 to account for the longest startup time, which means that even when the Pod starts more quickly, it will not start taking traffic until the full 5 minutes has elapsed. If you combine a startupProbe with your liveness/readiness probes, the liveness/readiness probes begin as soon as startup is complete (see the sample podSpec after this list).
  5. Usually readinessProbes and livenessProbes should not be defined with identical values — you probably want your readinessProbe to trigger more quickly than the livenessProbe, and you may want the readinessProbe and the livenessProbe to execute different actions. It all goes back to understanding what the probes do — given that a livenessProbe restarts the container, and given that Kubernetes will mark the Pod as not ready while the restart is underway, it is redundant to have an additional readinessProbe failing just to mark the Pod as not ready. For all intents and purposes, you would just be adding traffic to your Pod and using capacity that would be better spent servicing real traffic. The sample podSpec after this list shows one way the two probes might be staggered.
  6. You should try to design probes to avoid “pointless failure”, but also avoid making the probe “too complex” — imagine a scenario where an application is using an API provided by a 3rd party or an external database. If that API or database is not available, then the application does not work — so all probes should fail, right? … Right…? Well… no, probably not. If any of these probes fail, the container will at the very least be considered not ready, and at worst it will restart. In this scenario, that is worse than doing nothing. A restart cannot fix the problem, but may actually introduce problems (the Pod will stop taking load, driving up load on other replicas, which will probably shortly be restarting themselves, and all of this will play out over and over again until the real underlying issue is resolved). Transitioning the Pod to not ready may trigger a HorizontalPodAutoscaler, since all load is being distributed over a smaller number of Pods and this may drive up utilization metrics, as well as unnecessarily consuming cluster resources. While the container is restarting, the Pod will be removed from Service load balancers — and since all of the other Pods in the ReplicaSet are running the same livenessProbe, they will likely be removed from the Service load balancers as well, leaving none available — at which point you won’t be able to serve up a reasonable error response to your clients, which will likely cause a) a bad experience and b) more difficulty in diagnosing the real issue. There is not necessarily a hard and fast rule about including external dependencies, and you may need to be selective, treating some classes of errors as failures of the probe and other errors as a success (at least from the probe’s perspective); a sketch of one possible division of responsibility appears after this list. You want to avoid a situation where your Pods are constantly restarting when an external dependency is impaired, but to achieve that you need to balance the complexity of your probes against the likelihood that restarting a Pod will resolve the issue. As is often the case, the answer to the question of what you should do in your probe is “it depends”. Which leads to the next point…
  7. Your probes should be complementary to your other monitoring — you should not be relying only on your probes to determine the health of your application. Consider monitoring the characteristics of your application’s runtime behavior (for example, “are we seeing normal amounts of requests?” or “are there abnormal quantities of errors in the logs?”, etc.) rather than just hitting the same health checks that your probes are configured with from another monitoring engine. In situations where it would be (or may be) “pointless to fail”, this complementary monitoring allows the failure scenario to be detected, analyzed and mitigated, and can guide enhancements to the probes to improve their efficacy going forward.
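
To make points 2 through 5 concrete, here is a sketch of how the three probes might be combined on a container that takes anywhere from 1 to 5 minutes to start. The /livez and /readyz paths, the port and all of the numbers are illustrative assumptions, not recommendations for any particular workload:

containers:
- name: app
  image: registry.example.com/my-app:1.0   # placeholder image
  startupProbe:
    httpGet:
      path: /livez                 # hypothetical lightweight, process-local endpoint
      port: 8080
    periodSeconds: 10
    failureThreshold: 30           # 30 x 10s allows up to 300s for startup, but liveness and
                                   # readiness begin as soon as this probe first succeeds
  livenessProbe:
    httpGet:
      path: /livez
      port: 8080
    periodSeconds: 10
    failureThreshold: 6            # roughly 60s of sustained failure before a restart
    timeoutSeconds: 2
  readinessProbe:
    httpGet:
      path: /readyz                # hypothetical endpoint reflecting ability to take traffic
      port: 8080
    periodSeconds: 5
    failureThreshold: 2            # roughly 10s of failure pulls the Pod from the Service load
                                   # balancer pool, well before the livenessProbe would restart it
    timeoutSeconds: 2

With these illustrative numbers, a Pod that happens to start in 70 seconds begins receiving traffic at around the 70-second mark instead of waiting out a 300-second initialDelaySeconds, and a struggling container stops receiving traffic well before it would be restarted.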
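
Point 6 is mostly about what the probe endpoints actually check, rather than how the probes are declared, but the division of responsibility can still be sketched. In this hypothetical setup, /livez performs only process-local checks and never touches the external API or database (a restart cannot fix an external outage), while /readyz may selectively consider dependencies:

livenessProbe:
  httpGet:
    path: /livez       # hypothetical endpoint: process-local checks only, e.g. "can I still
    port: 8080         # serve a trivial request?"; deliberately ignores external dependencies
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /readyz      # hypothetical endpoint: may selectively reflect dependency health,
    port: 8080         # e.g. report not-ready on errors that genuinely prevent useful work,
                       # but stay ready through a brief, transient dependency blip
  periodSeconds: 5
  failureThreshold: 3

Whether /readyz should consider external dependencies at all is the “it depends” part; the point is simply that the decision is made per error class in the endpoint’s logic, not by wiring every dependency into every probe.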


Written by Paul Dally

AVP, IT Foundation Platforms Architecture at Sun Life Financial. Views & opinions expressed are my own, not necessarily those of Sun Life
