Kubernetes — Help! My Nodes Won’t Download the Right Image
Tags used for running images should (almost always) be immutable!
When running an image, you should almost always use a tag that is immutable. Image repositories can usually be configured so that tags are immutable; attempts to push an image with the same tag as a pre-existing image will then fail, which helps prevent inadvertent reuse of a particular tag. If you can't configure image tag immutability (for example, because you both run and build from the same image), you can implement practices in your image build pipeline to achieve a similar effect, for example by using a semantic version in the tags (and never, ever reusing a patch number):
- name: hello-world
  image: myregistry.mycompany.com/demo/myimage:v1.0.1
or, if you must reuse patch numbers for whatever reason, by suffixing image tags with a date/time stamp or some other differentiator, for example:
- name: hello-world
  image: myregistry.mycompany.com/demo/myimage:v1.0.1-20230901120000
Another option is a separate repository, or even a separate registry, for each environment, along with a promotion process that moves images from lower environments to higher environments without rebuilding them. The part about not rebuilding the image to promote it is important. Whenever you rebuild an image, even from the exact same Dockerfile and the same dependencies added with COPY/ADD, there is a real risk that it will differ from the previous build, because builds often pull in external dependencies you don't control, such as package repositories or files retrieved in RUN statements via curl or wget.
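As an illustration, promotion without a rebuild can be as simple as re-tagging the existing image into the higher environment's repository (the dev/prod repository names here are illustrative, not from any standard):

```shell
# Promote an already-built image from the dev repo to the prod repo
# without rebuilding it: the bytes pushed are exactly the bytes tested.
docker pull myregistry.mycompany.com/dev/myimage:v1.0.1
docker tag  myregistry.mycompany.com/dev/myimage:v1.0.1 \
            myregistry.mycompany.com/prod/myimage:v1.0.1
docker push myregistry.mycompany.com/prod/myimage:v1.0.1
```

Tools like crane or skopeo can perform the same copy registry-to-registry without pulling the image locally.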
There are a number of reasons to never overwrite an image tag that you are using to run an existing container.
To ensure that images are not inadvertently used where/when not intended
Consider the following scenario (this scenario is simplified to illustrate the point, please don’t interpret this as a recommendation that your software development lifecycle be limited to just development and production…).
On Monday morning, a new application is deployed to development, based on a Deployment that looks like this:
- name: hello-world
  image: myregistry.mycompany.com/demo/myimage:v1
The developer used the v1 tag in the podSpec thinking that this way they can avoid updating the podSpec every time the image is built. Big productivity win! Right? Not so fast…
Testing seems to go well, and on Monday afternoon the application is deployed to production.
However, on Tuesday morning a minor bug is reported that wasn’t caught in testing. The developer finds the source of the issue, and builds a new image — but the repo is not configured for immutability and the developer chooses to reuse the v1 tag for the new image. Unfortunately, along with the fix the developer has introduced a serious regression bug. On Tuesday afternoon, the regression bug is discovered in testing in the development environment — so no harm no foul so far, since no deployments have been applied to production, right?
Wednesday morning, however, one of the replicas in production starts experiencing odd performance problems, and the decision is made to delete the Pod to force the Deployment to recreate it, in effect, “rebooting” the containers within the Pod. The new Pod happens to be scheduled by Kubernetes on a different worker Node than the deleted Pod was scheduled on.
Wednesday afternoon sees client complaints start pouring in — the regression bug that was only in development before has now been inadvertently “released” into the production environment due to the Pod deletion and subsequent recreation. Wednesday is going to be a bad day!
It should be noted that the scenario described above is not the only way something like this can happen. Other situations that can give rise to similar behavior include:
- the usage of HorizontalPodAutoscaler
- restart or removal of a Node in the cluster
- capacity constraints on a Node in the cluster
Because the podSpec for both development and production references the same image tag, you have more or less lost control of when changes are “promoted” to production. If the developer had used :v1.0.0 and :v1.0.1 for the image tags, the new image would not have been unexpectedly used.
To enable (easier/more reliable) backout
Our developer now starts thinking about how to revert the Pod to its previous image. Unfortunately, the :v1 image in the registry is now the "new" v1 image; the "old" image is no longer available. Registry garbage collection has deleted it (or more precisely, any layers of the old image that are now unused) because no image tag references them any more. The developer realizes that they may have to rebuild the original image if they want to back out, which is at best an inconvenience and at worst may introduce further regression issues and inconsistency.
If the developer had used :v1.0.0 and :v1.0.1 for the image tags, they would have easily been able to revert back from v1.0.1 to v1.0.0 simply by changing the image tag in the podSpec.
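With distinct tags, the rollback is a one-line change to the podSpec, or equivalently via kubectl (assuming the Deployment and container are both named hello-world, as in the example):

```shell
# Roll back by pointing the Deployment at the previous immutable tag;
# Kubernetes performs a normal rollout back to v1.0.0.
kubectl set image deployment/hello-world \
  hello-world=myregistry.mycompany.com/demo/myimage:v1.0.0
```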
To ensure that rollouts are understood by Kubernetes to be different from what was previously running
Our developer decides that rather than attempt to “backout” and rebuild the original image it would be better to proceed with a fix to the regression issue that was introduced.
On Wednesday night, they find the cause of that issue and build a fixed version of myregistry.mycompany.com/demo/myimage:v1.
On Thursday morning they redeploy the deployment.yaml to production. Unfortunately, nothing happens! In fact, when examined with kubectl get deployment, the Deployment in Kubernetes still shows an age of roughly three days. This is because, from Kubernetes' perspective, the Deployment did not actually change: it is identical to the Deployment that Kubernetes is currently running. If the image tag had changed in the Deployment's configuration, or even if an environment variable had been added, or some other change had been made to the configuration that could have a bearing on the operation of the Pods, then Kubernetes would have conducted a rollout.
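You can see this symptom directly, assuming the Deployment is named hello-world:

```shell
# kubectl reports that nothing changed, so no rollout is triggered
kubectl apply -f deployment.yaml
# deployment.apps/hello-world unchanged

# and the Deployment's AGE just keeps counting up
kubectl get deployment hello-world
```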
Our developer now decides to manually delete the Pods, since that seemed to cause the images to be pulled previously.
As an aside, I should note that there are lots of reasons that deleting Pods manually is an inconvenient and potentially error-prone activity that can result in unexpected outages if not done carefully. Our developer, however, is not the careful sort and deletes the Pod manually again.
After deletion, the replacement Pod happens to be rescheduled by Kubernetes again on the same worker Node, and unfortunately the issues caused by the regression bug remain. Why isn’t Kubernetes using the “newest” v1?
The answer to this latest dilemma is the default imagePullPolicy. When an imagePullPolicy is not provided in the podSpec, Kubernetes’ behavior is:
- if the tag for the container image is :latest, imagePullPolicy is automatically set to Always;
- if you don't specify a tag for the container image, imagePullPolicy is automatically set to Always;
- if you specify a tag for the container image that isn't :latest, imagePullPolicy is automatically set to IfNotPresent.

The value of imagePullPolicy of the container is set when the object is first created, and is not updated if the image's tag later changes. For example, if you create a Deployment with an image whose tag is not :latest, and later update that Deployment's image to a :latest tag, the imagePullPolicy field will not change to Always. You must manually change the pull policy of any object after its initial creation.
In our case, imagePullPolicy was automatically set to IfNotPresent. Because Kubernetes scheduled the replacement Pod on the same worker Node it had previously been running on, and because myregistry.mycompany.com/demo/myimage:v1 already existed on that Node, the updated version of the image in the registry was not pulled again: Kubernetes already had a version (albeit an undesirable one) of that image present.
One approach to work around this could be to specify an imagePullPolicy of Always, although there would potentially be performance implications to this that would become more pronounced with larger images. Kubernetes would then always pull the image, even if it was already present on the worker Node — and depending on the size of the image and your network speed, this could add 30 seconds, 60 seconds, perhaps even more to Pod startup time.
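If you do go this route, the policy is set per container in the podSpec (image name reused from the scenario above):

```yaml
spec:
  containers:
  - name: hello-world
    image: myregistry.mycompany.com/demo/myimage:v1
    # Pull on every Pod start, even if the image is already cached on the Node.
    imagePullPolicy: Always
```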
And in any case, in a pressure situation, do you really want to have to figure out "which :v1 am I really running?" I would strongly suggest that your answer should be "no".
Whether your tags are immutable by configuration, or just immutable by your practices, avoid the sorts of issues described above and use immutable tags when running your Pods.
When Can Mutable Tags be used in podSpecs?
Previously I said that tags used in podSpecs should (almost always) be immutable. What is the exception to the rule?
Perhaps there are no exceptions to this rule in your application. In general, the answer is "whenever you are confident that it won't ever matter which image is actually run".
Some use cases, like an initContainer that only does chmod or chown on some files, or other similarly trivial operations, might be candidates. For these very simple, "core" sorts of tasks, the syntax, flags, and functionality are sufficiently likely to be consistent between any and all versions of the image to justify using a tag like "latest" or some other mutable tag.
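For example, a permissions-fixing initContainer along these lines (the names, user ID, and paths are illustrative) is about as low-risk as mutable tags get:

```yaml
initContainers:
- name: fix-permissions
  # busybox's chown/chmod behavior is stable enough across versions
  # that a mutable tag is arguably tolerable here.
  image: busybox:latest
  command: ["sh", "-c", "chown -R 1000:1000 /data && chmod -R u+rwX /data"]
  volumeMounts:
  - name: data
    mountPath: /data
```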
For sure, by always picking up the "latest" version of an image (not necessarily the tag "latest"; perhaps something like "v8" or some other non-semantic tag), you may see some benefit related to security vulnerabilities. Be careful though: you might assume that something like wget or curl or some other common command ought to behave consistently between any and all versions of an image, but I've seen numerous, meaningful differences in the functionality of wget and curl across different images.
Tags used for building images can make sense to be mutable!
The issues above don't necessarily mean that ALL repos/tags should be immutable, though. When tagging an image that will be used in the FROM statement of another Dockerfile, it can make sense to use a tag that will be overwritten periodically.
Consider the following scenario:
On Friday morning, a developer builds a base image and tags it as myregistry.mycompany.com/demo/mybase:1.0.0. On Friday afternoon, another developer builds a derivative image with a Dockerfile that looks like this:
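A minimal sketch of such a Dockerfile, assuming only the base image from the scenario (the COPY and CMD lines are illustrative):

```dockerfile
# Pinned to an exact version of the base image
FROM myregistry.mycompany.com/demo/mybase:1.0.0
COPY app/ /opt/app/
CMD ["/opt/app/run.sh"]
```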
On Saturday morning, the original developer discovers that the image they built has a security vulnerability, which they fix by building myregistry.mycompany.com/demo/mybase:1.0.1.
On Saturday afternoon, unaware that a vulnerability has been discovered and fixed, the second developer rebuilds their image to add another feature. Unfortunately, their newly-built image still includes the security vulnerability that the original developer has since patched, since the derivative image still includes :1.0.0 of the base image.
What does a better process look like? Imagine if the original developer had built image myregistry.mycompany.com/demo/mybase:1.0.0, tested it robustly, and then tagged that same image as myregistry.mycompany.com/demo/mybase:v1-release, and the v1-release tag had been used by the developer of the derivative image in their Dockerfile.
When the developer of the base image builds 1.0.1 of their image and, again after robust testing, tags it as v1-release, the developer of the derivative image automatically picks up the latest fixes the next time they rebuild, and everyone is happier. Robust testing is key, of course, but robust (hopefully automated) testing should be an expectation after every image build.
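Concretely, the promotion step is just a re-tag of the already-tested image, not a rebuild (commands assume push access to the registry):

```shell
# After 1.0.1 passes robust testing, move the v1-release alias to it.
docker pull myregistry.mycompany.com/demo/mybase:1.0.1
docker tag  myregistry.mycompany.com/demo/mybase:1.0.1 \
            myregistry.mycompany.com/demo/mybase:v1-release
docker push myregistry.mycompany.com/demo/mybase:v1-release
```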
But This Means More Work For Me…. Doesn’t It?
Well… yes, a little. You have to modify your deployment.yaml (or more likely your kustomization.yaml for kustomize, or values.yaml for helm, or whatever applies in your specific case) before each deployment. But isn't that (much) better than the operational issues described above?
Furthermore, this work can/should be automated as part of your build process. For example, when doing a build, you could increment your patch number with a simple script or in Maven or Gradle or whatever it is that you use. You could then use the resulting version as the tag for the image build/push, but you could also programmatically modify your deployment.yaml (or kustomization.yaml or values.yaml or whatever) so that the image: in the podSpec matches the tag used for the image build/push.
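A minimal sketch of the version-bump step, assuming a plain MAJOR.MINOR.PATCH scheme (the helper name and the build/kustomize steps in the comments are illustrative):

```shell
# Bump the patch number of a semantic version, e.g. 1.0.1 -> 1.0.2.
bump_patch() {
  major=$(echo "$1" | cut -d. -f1)
  minor=$(echo "$1" | cut -d. -f2)
  patch=$(echo "$1" | cut -d. -f3)
  echo "${major}.${minor}.$((patch + 1))"
}

NEW_TAG=$(bump_patch "1.0.1")
echo "$NEW_TAG"   # prints 1.0.2

# The same value then feeds both the image push and the manifest, e.g.:
#   docker build -t myregistry.mycompany.com/demo/myimage:v${NEW_TAG} .
#   docker push myregistry.mycompany.com/demo/myimage:v${NEW_TAG}
#   (cd overlays/dev && kustomize edit set image \
#       myimage=myregistry.mycompany.com/demo/myimage:v${NEW_TAG})
```

The same tag value ends up in the registry and in the podSpec, so the two can never drift apart.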
If you have different kustomization.yaml or values.yaml files for each environment, you can potentially replace the tag in those files for all environments, and leverage your branching approach in your source code management system to ensure that the podSpec references the right tag. There are many variations on build/deploy pipelines, so you will need to consider what will work best for you, but I'm sure the effort will be well spent!