Troubleshooting Longhorn and DNS Networking
Lately I’ve been trying out Longhorn as the persistent storage engine for stateful k8s workloads. Overall I am really impressed by Longhorn: installing it on a cluster is a breeze and creating replicated volumes works great.
After a routine reboot of one worker node (which had been properly drained and cordoned beforehand), any pod with an attached Longhorn volume refused to start.
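For reference, cordoning and draining a node before a reboot typically looks something like this (the exact flags depend on the workloads running on the node):
## stop new pods from being scheduled onto the node
kubectl cordon worker-1
## evict the running pods (DaemonSet pods stay put, emptyDir data is discarded)
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
## ... reboot the node, then allow scheduling again
kubectl uncordon worker-1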
tl;dr
- Pod longhorn-csi-plugin-* failed to start due to connection timeouts towards the longhorn-backend service.
- Checking DNS on the worker node revealed a network issue that broke DNS resolution and inter-node communication.
- Restarting the flannel daemonset fixed the network on the node and resolved the issue.
Troubleshooting
Inspecting the events of one of the affected pods showed the following errors:
kubectl describe pod -n default pod-a-79d65ffcd4-zp9lm
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned default/pod-a-79d65ffcd4-zp9lm to worker-1
Warning FailedMount 6m41s kubelet Unable to attach or mount volumes: unmounted volumes=[pod-a-data], unattached volumes=[default-token-jqfxl pod-a-data]: timed out waiting for the condition
Warning FailedMount 2m36s (x6 over 12m) kubelet MountVolume.MountDevice failed for volume "pvc-e8d3b090-41ad-469e-b208-609e693c0fcf" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name driver.longhorn.io not found in the list of registered CSI drivers
Warning FailedMount 2m5s (x4 over 11m) kubelet Unable to attach or mount volumes: unmounted volumes=[pod-a-data], unattached volumes=[pod-a-data default-token-jqfxl]: timed out waiting for the condition
...
There was a good clue in there: driver name driver.longhorn.io not found in the list of registered CSI drivers.
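A quick way to see which CSI drivers the kubelet on a given node has registered is to look at that node's CSINode object; something like this should do:
kubectl get csinode worker-1 -o jsonpath='{.spec.drivers[*].name}'
If driver.longhorn.io is missing from that list, the kubelet on the node has no working registration for the Longhorn driver.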
The Longhorn CSI drivers are managed by specific workloads that come bundled with the deployment of Longhorn. The logical next step was to check the status of all the pods in the longhorn-system namespace:
kubectl get pods -n longhorn-system
NAME READY STATUS RESTARTS AGE IP NODE
...
instance-manager-r-bc118f38 1/1 Running 0 13m 10.244.3.130 worker-1
longhorn-csi-plugin-6468k 2/2 Running 0 7d9h 10.244.2.14 worker-2
longhorn-csi-plugin-nwsz8 1/2 CrashLoopBackOff 11 7h52m 10.244.3.126 worker-1
longhorn-csi-plugin-vblqk 2/2 Running 0 7d9h 10.244.5.18 worker-3
longhorn-driver-deployer-f6d46898b-mbpl9 1/1 Running 0 14m 10.244.2.83 worker-2
...
Pod longhorn-csi-plugin-nwsz8 was not looking too healthy with its CrashLoopBackOff state. This pod was also deployed to node worker-1, which would explain the driver.longhorn.io not found error. Further inspection of the container logs revealed:
kubectl logs -n longhorn-system longhorn-csi-plugin-nwsz8 node-driver-registrar
I0202 01:21:06.106826 41062 main.go:110] Version: v1.2.0-lh1-0-g03d523f9
I0202 01:21:06.106872 41062 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0202 01:21:06.106880 41062 connection.go:151] Connecting to unix:///csi/csi.sock
W0202 01:21:16.107162 41062 connection.go:170] Still connecting to unix:///csi/csi.sock
W0202 01:21:26.107140 41062 connection.go:170] Still connecting to unix:///csi/csi.sock
Container node-driver-registrar was running but was unable to connect to the CSI socket on worker-1. Inspecting the log of the second container:
kubectl logs -f -n longhorn-system longhorn-csi-plugin-nwsz8 longhorn-csi-plugin
time="2021-02-02T08:15:04Z" level=info msg="CSI Driver: driver.longhorn.io csiVersion: 0.3.0, manager URL http://longhorn-backend:9500/v1"
time="2021-02-02T08:15:14Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get \"http://longhorn-backend:9500/v1\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Aha! A Client.Timeout when the longhorn-csi-plugin container tried to connect to the manager via the service longhorn-backend:9500.
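Before pointing fingers at the network, it is worth ruling out the simpler case that the service is gone or has no endpoints; a check along these lines would cover it:
kubectl get svc -n longhorn-system longhorn-backend
kubectl get endpoints -n longhorn-system longhorn-backend
If the endpoints list were empty, the timeout would have a much more mundane explanation than a broken network.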
Hmm, usually when I see a timeout I think firewall or DNS. As there had not been any recent changes to the firewall rules, DNS seemed the more likely suspect.
First I had a look in the kube-system namespace to confirm that the kube-dns service was present and that the backing coredns pods were running. Everything looked fine there.
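That verification amounts to something like the following (assuming the standard k8s-app=kube-dns label that CoreDNS carries in kubeadm clusters):
kubectl get svc -n kube-system kube-dns
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide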
The next step was to verify whether node worker-1 could resolve the longhorn-backend service in the longhorn-system namespace, by performing a DNS lookup directly from the node:
ssh worker-1
## Use the clusterIP of kube-dns service
## in kube-system namespace as the resolver
dig @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local
; <<>> DiG 9.11.25-RedHat-9.11.25-2.fc33 <<>> @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
Uh-oh, worker-1 couldn’t even connect to the kube-dns service. Now a networking issue seemed more likely.
Earlier, when I looked at the coredns pods in kube-system, I could see that all pods for the flannel CNI were also running fine.
But just to be on the safe side I restarted the kube-flannel-ds daemonset in kube-system:
kubectl rollout restart daemonset -n kube-system kube-flannel-ds
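To know when the restart has actually rolled out to all nodes, you can wait on the rollout status:
kubectl rollout status daemonset -n kube-system kube-flannel-ds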
Repeating the DNS lookup on worker-1:
dig @10.96.0.10 longhorn-backend.longhorn-system.svc.cluster.local
...
;; ANSWER SECTION:
longhorn-backend.longhorn-system.svc.cluster.local. 30 IN A 10.96.5.142
...
Great! worker-1 was now able to connect to the DNS service and resolve longhorn-backend.longhorn-system.svc.cluster.local to an actual IP. This seemed to do the trick: pod longhorn-csi-plugin-nwsz8 started running properly and pod-a-79d65ffcd4-zp9lm could successfully attach its Longhorn volume.
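Confirming the recovery boils down to watching the pods come back on the node, for example:
kubectl get pods -n longhorn-system -o wide | grep worker-1
kubectl get pod -n default pod-a-79d65ffcd4-zp9lm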
Ok, to be fair it was the flannel CNI… but still: it’s always DNS.
What was the actual root cause?
My suspicion is that flannel failed to renew the subnet lease on the worker node. Unfortunately the flannel logs were purged after subsequent reboots, so I can’t confirm this.
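If it happens again, the first things I would capture are the flannel subnet lease on the node and the logs of the flannel pod (assuming the standard app=flannel label from the upstream manifest):
ssh worker-1
## the subnet flannel leased for this node, written by the flannel daemon
cat /run/flannel/subnet.env

## from a machine with kubectl access, grab the flannel logs before they rotate away
kubectl logs -n kube-system -l app=flannel --prefix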
As this k8s cluster is more of a build-up/tear-down kind of deal than something mission-critical, I don’t have the biggest motivation to try to replicate the issue right now and perform a thorough root cause analysis. But who knows: if the same issue starts happening frequently, I might dedicate some time and update this post.