Troubleshooting

Debugging Clusters

Cluster health checks, node troubleshooting, control plane logs, and failure modes.

Cluster Health Check

# cluster info and component endpoints
kubectl cluster-info

# list all nodes and their status
kubectl get nodes -o wide

# component status (deprecated but still seen in CKA)
kubectl get componentstatuses

# dump cluster state for detailed debugging
kubectl cluster-info dump --output-directory=/tmp/cluster-dump

Node Troubleshooting

Node Conditions

ConditionDescription
ReadyTrue if the node is healthy and ready to accept pods
NotReadyNode is not healthy — kubelet not running, network issues, etc.
MemoryPressureNode is running low on memory
DiskPressureNode is running low on disk space
PIDPressureToo many processes running on the node
NetworkUnavailableNetwork is not configured correctly on the node
# check node conditions
kubectl describe node <node-name> | grep -A 10 "Conditions"

Debugging a NotReady Node

Step-by-step approach:

# 1. identify NotReady nodes
kubectl get nodes

# 2. describe the node — check Conditions, Taints, Events
kubectl describe node <node-name>

# 3. SSH to the node and check kubelet
ssh <node-name>
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 50 --no-pager

# 4. check container runtime
sudo systemctl status containerd
sudo crictl ps

# 5. check kubelet certificates
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates

# 6. check kubelet config
cat /var/lib/kubelet/config.yaml

Node Resource Issues

# check resource usage across nodes
kubectl top nodes

# check allocated vs available resources on a node
kubectl describe node <node-name>
# look at "Allocated resources" section — shows requests and limits

# find pods consuming the most resources
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

Control Plane Troubleshooting

Static Pod Manifests

Control plane components run as static pods managed by kubelet. Manifests are located at:

/etc/kubernetes/manifests/
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
├── kube-scheduler.yaml
└── etcd.yaml
# check static pod manifests
ls /etc/kubernetes/manifests/

# view a specific manifest
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# after editing a manifest, kubelet auto-restarts the component
# monitor with:
crictl ps | grep kube-apiserver

Control Plane Logs

ComponentLog Locationjournalctl
kube-apiserver/var/log/kube-apiserver.logcrictl logs <container-id>
kube-scheduler/var/log/kube-scheduler.logcrictl logs <container-id>
kube-controller-manager/var/log/kube-controller-manager.logcrictl logs <container-id>
etcd/var/log/etcd.logcrictl logs <container-id>
kubeletjournalctl -u kubelet
# find control plane container IDs
crictl ps | grep kube-apiserver
crictl ps | grep kube-scheduler
crictl ps | grep kube-controller-manager
crictl ps | grep etcd

# view logs for a control plane component
crictl logs <container-id>

# kubelet logs (runs as a systemd service, not a container)
sudo journalctl -u kubelet -n 100 --no-pager
sudo journalctl -u kubelet -f    # follow live

# check control plane pods via kubectl
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node-name>

Common Control Plane Issues

API server not starting:

  • Check the static pod manifest for syntax errors: cat /etc/kubernetes/manifests/kube-apiserver.yaml
  • Check certificates: openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
  • Check etcd connectivity: crictl logs <etcd-container-id>
  • Check kubelet logs: journalctl -u kubelet | grep apiserver

Scheduler not scheduling pods:

  • Check scheduler logs: kubectl logs -n kube-system kube-scheduler-<node-name>
  • Check leader election: kubectl get endpoints kube-scheduler -n kube-system -o yaml
  • Verify scheduler is running: kubectl get pods -n kube-system | grep scheduler

Controller manager issues:

  • Check logs: kubectl logs -n kube-system kube-controller-manager-<node-name>
  • Check RBAC permissions
  • Verify leader election: kubectl get endpoints kube-controller-manager -n kube-system -o yaml

Cluster Failure Modes

FailureImpact
API server downCannot create/update/delete resources; existing pods continue running
etcd data lossCluster state lost; requires backup restore (etcdctl snapshot restore)
Node disconnectedNode marked NotReady after ~40s; pods evicted after ~5min
Network partitionSplit-brain scenarios; nodes may disagree on cluster state
Scheduler downNew pods stay Pending; existing pods unaffected
Controller manager downNo reconciliation (replicas not maintained, no garbage collection)

Worker Node Operations

# mark node as unschedulable (no new pods, existing pods stay)
kubectl cordon <node-name>

# mark node as schedulable again
kubectl uncordon <node-name>

# drain node — evict all pods and mark unschedulable
kubectl drain <node-name> --ignore-daemonsets

# drain with additional flags
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force                        # force eviction of standalone pods

# check taints on a node
kubectl describe node <node-name> | grep -i taint

# add a taint
kubectl taint nodes <node-name> key=value:NoSchedule

# remove a taint
kubectl taint nodes <node-name> key=value:NoSchedule-

Useful Commands

# quick cluster health check
kubectl get nodes
kubectl get pods -n kube-system
kubectl cluster-info

# check all events cluster-wide
kubectl get events -A --sort-by='.lastTimestamp'

# check node conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,REASON:.status.conditions[-1].reason

# find pods on a specific node
kubectl get pods -A --field-selector spec.nodeName=<node-name>

# check kubelet status on a node
ssh <node-name> "sudo systemctl status kubelet"

# restart kubelet
ssh <node-name> "sudo systemctl restart kubelet"

# check certificates expiration
kubeadm certs check-expiration

# renew certificates
kubeadm certs renew all