Practice troubleshooting Kubernetes clusters using log data
Knowing how to interpret application logs is essential for successful Kubernetes troubleshooting. Get hands-on practice with this walkthrough from Kubernetes expert Chad Crowell.
Kubernetes' complexity is well known, which might be why questions about how to troubleshoot Kubernetes issues account for nearly a third of the content on the Certified Kubernetes Administrator exam.
The CKA is a Kubernetes certification offered by the Cloud Native Computing Foundation that demonstrates an IT professional's ability to set up, configure and manage Kubernetes clusters in production environments. To obtain the CKA certification, candidates must pass a two-hour, performance-based exam that requires test takers to apply their Kubernetes knowledge in realistic scenarios.
Because the CKA test involves solving problems directly in the command line, hands-on practice is key to success. To help candidates prepare, the CKA study guide Acing the Certified Kubernetes Administrator Exam pairs explanations of key Kubernetes concepts and terms with example problems similar to those found on the real test.
In Chapter 8 ("Troubleshooting Kubernetes"), author and Kubernetes expert Chad M. Crowell explains how to identify and respond to a range of Kubernetes issues, including cluster events, worker node failures and networking problems. The excerpt below walks the reader through a typical CKA exam task that involves troubleshooting an issue with a Kubernetes pod.
In this example problem, you'll follow along as Crowell explains how to create a MySQL deployment with the kubectl command-line tool and then use application logs to diagnose an issue in your Kubernetes cluster. After deciding how to troubleshoot the problem, you'll implement the fix and check that all pods are running and healthy.
For a deeper dive into Kubernetes and additional practice problems, check out the rest of Acing the Certified Kubernetes Administrator Exam. To learn more about the CKA certification, read TechTarget Editorial's interview with Crowell, where he offers advice for studying for the CKA exam and explains the benefits of getting a Kubernetes certification.
This chapter covers
- How to monitor and view logs in Kubernetes
- How to determine high CPU or RAM usage in Kubernetes
- How to resolve common cluster issues in Kubernetes
- How to analyze network traffic to identify communication issues
As this is the biggest topic (30%) on the CKA exam, we're going to cover troubleshooting in detail in this chapter. Troubleshooting means fixing issues with applications, control plane components, worker nodes, and the underlying network. When running applications in Kubernetes, problems will arise such as issues with pods, services, and deployments.
This chapter will help you understand the logs that a container might output in the process of debugging and getting the application back to a healthy state. If the problem is not the application, it may be the underlying node, the underlying operating system, or a communication problem on the network. On the exam, you'll be expected to know the differences between an application failure, a cluster-level problem, and a network problem and how to troubleshoot and come to a resolution in the shortest amount of time.
Note: Each exercise in this chapter begins with an action that deliberately "breaks" the cluster so that you have something to troubleshoot. On the exam, the cluster or cluster object will already be broken, so you don't need to treat that initial action as a prerequisite for the exam.
Understanding application logs
One of the ways Kubernetes administrators find out why a problem is occurring in a cluster is by viewing the logs. Application logs give you more verbose information about what's going on inside a containerized application running in a pod. Container engines (e.g., containerd) are designed to support logging: containers usually write all their output to the standard output (STDOUT) and standard error (STDERR) streams, and that output is saved to files located in the directory /var/log/containers.
The CKA exam will test you on your ability to troubleshoot errors from within a pod. Since pod errors and container errors are synonymous, this simplifies retrieving the logs from any application in Kubernetes. An example of a question in this domain of the CKA exam is:
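To make the log location concrete, here is a minimal sketch of a helper that lists the log files for a given pod on the node. The function name list_container_logs is hypothetical, and the file-naming pattern in the comment is an assumption about how the kubelet lays out this directory; check it against your own node.

```shell
# List the log files for pods whose name starts with a given prefix.
# Run this on the node itself (for kind: inside the kind-control-plane container).
# Assumption: files are named <pod>_<namespace>_<container>-<container-id>.log
list_container_logs() {
  prefix="$1"
  log_dir="${2:-/var/log/containers}"   # the second argument is optional
  # List the directory and keep only the entries that start with the prefix
  ls "$log_dir" | grep "^${prefix}"
}
```

For example, list_container_logs mysql should print the file names for any pod whose name begins with "mysql", which you can then read with tail or less.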
Exam Task
In cluster "ik8s", in a namespace named "db08328", create a deployment with the kubectl command-line (imperatively) named "mysql" with the image "mysql:8". List the pods in the "db08328" namespace to see if the pod is running. If the pod is not running, view the logs to determine why the pod is not in a healthy state. Once you've collected the necessary log information, make the necessary changes to the pod in order to fix the pod and get the pod back up in a running healthy state.
If you don't already have access to an existing Kubernetes cluster, you can create one with kind, as explained in appendix A. You will only need a single-node cluster, so follow the instructions in section 1.1.1 of appendix A. Once you have a shell to the control plane node using the command docker exec -it kind-control-plane bash, go ahead and set your alias for kubectl as well as tab completion. This will help you avoid typos and get you used to using tab completion for the exam. You can find the instructions to do this at the end of appendix B, but here again are the commands to run (in order):
apt update && apt install -y bash-completion
echo 'source <(kubectl completion bash)' >> ~/.bashrc
echo 'source /usr/share/bash-completion/bash_completion' >> ~/.bashrc
echo 'alias k=kubectl' >> ~/.bashrc
echo 'complete -o default -F __start_kubectl k' >> ~/.bashrc
source ~/.bashrc
On exam day, these will already be configured, so don't worry about having to memorize the commands. You will be able to use the "k" alias and tab completion as soon as you start the exam.
After these commands are run, let's create the namespace per the instructions with the command k create ns db08328. You can follow that up by listing all namespaces with the command k get ns. The output will look similar to this:
root@kind-control-plane:/# k create ns db08328
namespace/db08328 created
root@kind-control-plane:/# k get ns
NAME STATUS AGE
db08328 Active 4s
default Active 11m
kube-node-lease Active 11m
kube-public Active 11m
kube-system Active 11m
local-path-storage Active 11m
Now that you have the correct namespace, you can change your context to the "db08328" namespace, so you don't have to keep typing the namespace name with each and every command. You can change your context with the command k config set-context --current --namespace db08328. The output will look like this:
root@kind-control-plane:/# k config set-context --current --namespace db08328
Context "kubernetes-admin@kind" modified.
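If you want to double-check that the namespace change took effect, one option is to read it back from the kubeconfig. The following is a minimal sketch; the function name current_namespace is hypothetical, and it calls kubectl directly rather than the "k" alias so that it also works in scripts.

```shell
# Print the namespace set on the current kubectl context.
# Prints nothing if no namespace has been set (kubectl then uses "default").
current_namespace() {
  kubectl config view --minify --output 'jsonpath={..namespace}'
}
```

After the set-context command above, running current_namespace should print db08328.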
Now that you've set the context to the namespace in which you'll be creating the deployment, let's create the deployment named "mysql" with the command k create deploy mysql --image mysql. After this, you can list the pods with the command k get po.
Exam Tip
Notice that you don't have to use the "-n" option to specify your namespace each time. Be warned that this could get confusing on the exam, because with each task you are also setting the context. Be mindful that you'll effectively be running a context command twice per task; it may therefore be easier in some cases to type out the namespace each time, depending on how many namespaces you have to work in with each task.
The output of the deployment creation and the listing of the pods should look like the following:
root@kind-control-plane:/# k create deploy mysql --image mysql
deployment.apps/mysql created
root@kind-control-plane:/# k get po -w
NAME READY STATUS RESTARTS AGE
mysql-68f7776797-w92l6 0/1 CrashLoopBackOff 1 (10s ago) 7m28s
The result in this case is that the pod has a status of "CrashLoopBackOff". There are many statuses that a pod can have, including "OOMKilled", "ErrImagePull", "ImagePullBackOff", "FailedScheduling", "CreateContainerError", and more.
You can view these failed statuses in Table 8.1.
Table 8.1. Common pod failure statuses and their meanings.
Status | Meaning |
CrashLoopBackOff | The pod is trying to start, crashing, then restarting in a loop. |
ImagePullBackOff | A pod cannot start up because it can't find the specified image locally or in the remote container registry. It will continue to try with an increasing back-off delay. |
ErrImagePull | A pod fails to start up because the image cannot be found or cannot be pulled due to authorization. |
CreateContainerConfigError | A container within the pod will not start due to missing components that are required to run. |
RunContainerError | Running the container within a pod fails due to issues with the container runtime or the container's entrypoint. |
FailedScheduling | A pod is unable to be scheduled to a node, either because nodes are marked as unschedulable or because a taint is applied. |
NonZeroExitCode | The container within a pod exits unexpectedly due to an application error or a missing file or directory. |
OOMKilled | A pod was scheduled, but the memory limit assigned to it has been exceeded. |
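When a namespace holds many pods, it can help to filter the listing down to the ones showing a failure status like those in the table. The sketch below is one way to do that; the function name unhealthy_pods is hypothetical, and it assumes the default column layout of k get po (NAME, READY, STATUS, ...).

```shell
# Read "kubectl get pods" output on stdin and print the name and status
# of every pod whose STATUS column is not Running or Completed.
unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

You would use it as kubectl get po | unhealthy_pods. Note that it only looks at the STATUS column, so a pod that is "Running" but not ready (0/1) would not be reported.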
A "CrashLoopBackOff" means that the pod keeps starting, crashing, restarting, and then crashing again, hence the term "crash loop". We can see why this is happening by viewing the container logs with the command k logs mysql-68f7776797-w92l6. This is where tab completion comes in handy, because you can start typing mysql, press the tab key, and it will complete the rest for you. Tab completion will be enabled on the exam, but if you'd like to set this up in your own cluster, see appendix B. The output of the command k logs mysql-68f7776797-w92l6 will look like this:
root@kind-control-plane:/# k logs mysql-68f7776797-w92l6
2022-12-04 16:51:13+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.31-1.el8 started.
2022-12-04 16:51:13+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2022-12-04 16:51:13+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.31-1.el8 started.
2022-12-04 16:51:13+00:00 [ERROR] [Entrypoint]: Database is uninitialized and password option is not specified
You need to specify one of the following as an environment variable:
- MYSQL_ROOT_PASSWORD
- MYSQL_ALLOW_EMPTY_PASSWORD
- MYSQL_RANDOM_ROOT_PASSWORD
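With longer logs, the interesting part is usually the first error and whatever follows it. As a small sketch, the helper below trims log output down to that portion; the function name from_first_error is hypothetical, and it assumes errors are tagged with the literal text [ERROR], as in the MySQL output above.

```shell
# Read log output on stdin and print everything from the first
# line containing "[ERROR]" through the end of the input.
from_first_error() {
  # Set a flag on the first matching line; print every line once the flag is set
  awk '/\[ERROR\]/ { found = 1 } found'
}
```

For a pod in a crash loop, you can combine this with the --previous flag of kubectl logs, which shows the logs of the previous (crashed) container instance: kubectl logs mysql-68f7776797-w92l6 --previous | from_first_error.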
The log output tells us exactly what we wanted to know: the database password needs to be set as an environment variable inside the container. The output even lists the environment variable names to choose from. If you recall, in chapter 7 we created a mysql deployment in which we set the password as an environment variable, so let's refer back to it and use the same technique to solve the problem that we have here. Looking back at figure 7.12 specifically, we can see that the environment variable is set in line with the name of the container image, so let's apply this to our currently running deployment with the command k edit deploy mysql. If you've started with a fresh kind cluster, you'll first need to run the command apt update; apt install vim so that you can edit the deployment using the vim text editor. Once the deployment is open, you can make the following additions to the YAML, as you'll see depicted here.
spec:
  containers:
  - env:
    - name: MYSQL_ROOT_PASSWORD
      value: password
    image: mysql
    imagePullPolicy: Always
    name: mysql
    resources: {}
Once you've done this, save and quit editing the mysql deployment by typing :wq and pressing Enter. This will take you back to the command prompt, where you can run the command k get po to see if the pod is now in a running state. The output should look like this:
root@kind-control-plane:/# k get po
NAME READY STATUS RESTARTS AGE
mysql-5dcb7797f7-6spvc 1/1 Running 0 12m
Sure enough, the pod now has a status of "Running", which is exactly what we need to get the pod back up in a running healthy state and complete the exam task.
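As an alternative to editing the YAML by hand, which the excerpt does not use, kubectl also offers the set env subcommand for adding an environment variable to a deployment, and rollout status for waiting until the change has rolled out. The sketch below wraps both; the function names set_deploy_env and wait_for_rollout and the 120-second default timeout are hypothetical choices, not part of the original exercise.

```shell
# Set one or more environment variables on a deployment.
# Usage: set_deploy_env <deployment> NAME=value [NAME=value ...]
set_deploy_env() {
  deploy="$1"
  shift   # the remaining arguments are the NAME=value pairs
  kubectl set env "deployment/${deploy}" "$@"
}

# Block until the deployment's rollout finishes or the timeout expires.
# Usage: wait_for_rollout <deployment> [timeout]
wait_for_rollout() {
  kubectl rollout status "deployment/$1" --timeout="${2:-120s}"
}
```

For this task, that would be set_deploy_env mysql MYSQL_ROOT_PASSWORD=password followed by wait_for_rollout mysql. Both act on the namespace of the current context, so they assume the context was set to "db08328" as described earlier.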