Kubernetes "foo"

So I found out something fun today, and I thought I'd share my experience with anyone else who may run into the same "issue" with Kubernetes (although the real heroes live in IRC land; the Atomic folks have been amazingly helpful).

It's well known that the Kubernetes master is treated as a unicorn, so what happens when that master, or your whole Kubernetes cluster, suffers a power hit? It's definitely not a great situation to be in, and there's not much you can do about a power hit when you're running this stuff out of your house, right?

I noticed that scaling out additional nodes, or scaling them back, produced some really strange results (both in Kubernetes and, of course, Cockpit). I quickly found out that my host had suffered a power cycle. Thank you, Duke Energy! At first glance it seems like you're in purgatory: you can't really create new pods, but you can't seem to delete the old ones either. It appears it could be an issue with etcd or Kubernetes, but let's dig deeper to find out.

You can try to find these running "ghost" containers by issuing a kubectl get pods --namespace=default (the namespace is default in my lab environment at the moment), but nothing comes back. Meanwhile, the Cockpit interface shows pods getting spun up over and over. So who's right? Both, actually, and it can make you feel pretty powerless.

Deeper investigation showed that the containers were in fact being spun up by the kubectl command, but they were not being properly reported back to Kubernetes.

The folks over on the Red Hat #atomic IRC channel really helped me out. They said that they personally remove the containers using Docker. Realizing what I was doing (basically creating more work for myself later), I stopped trying to scale out more containers, since every new one would have to be deleted out of Docker directly anyway. Not fun...especially if you were in a "production" environment where you could have thousands of active containers. Hint: this is where good label use comes into play!
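
To show why labels (and the naming convention behind them) matter here: the kubelet names each Docker container k8s_<container>.<hash>_<pod>_<namespace>_..., so even when kubectl is blind you can still pick out one pod's containers on the Docker side. A crude sketch, assuming that naming layout (inferred from my own listings below, not from any documented contract):

```shell
# Print the IDs of containers (from "docker ps -a" output on stdin)
# whose kubelet-generated NAME mentions the given pod-name fragment.
# Assumption: names look like k8s_<container>.<hash>_<pod>_<namespace>_...
containers_for_pod() {
  awk -v pod="$1" 'NR > 1 && $NF ~ ("^k8s_.*_" pod) { print $1 }'
}

# Usage on a node (requires docker; destructive, lab use only):
#   sudo docker ps -a | containers_for_pod my-nginx | xargs -r sudo docker rm -f
```

This is only whitespace-column parsing of human-oriented output, so treat it as a lab convenience, not production tooling.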

So here are the procedures you'll need to perform if you ever run into this, but only in a lab environment. I will update this over the weekend after some additional testing (read on at the bottom to find out more), because my thought is that I may be able to just delete the running kubernetes container on the master, and that may reset the declarative state for the kubernetes cluster. I will have to dig deeper, but alas, this is a lab (which I'm using for a later demonstration), so time is of the essence.

Making Kubernetes Useful Again

  1. On each node, stop the following services:
    sudo systemctl stop kube-proxy kubelet flanneld

  2. Next, make sure that the Docker service is started, so that you can work directly with the running containers on the Atomic Kubernetes node:
    sudo systemctl start docker

  3. Find out which containers are running. For extra credit, also find out which images are downloaded to each of the nodes in the k8s cluster:

    [centos@fedminion02 ~]$ sudo docker ps -a
    CONTAINER ID        IMAGE                                  COMMAND                CREATED             STATUS                      PORTS               NAMES
    4274b04c3c6e        nginx                                  "nginx -g 'daemon of   46 minutes ago      Exited (0) 46 minutes ago                       k8s_nginx.d7d3eb2f_my-nginx-zvq6a_default_11523c39-5c88-11e5-833d-fa163e6fd979_33fb7971
    871cc3668e87        gcr.io/google_containers/pause:0.8.0   "/pause"               46 minutes ago      Exited (0) 46 minutes ago                       k8s_POD.ef28e851_my-nginx-zvq6a_default_11523c39-5c88-11e5-833d-fa163e6fd979_fc2dfa29
    fb95a7269f8c        nginx                                  "nginx -g 'daemon of   50 minutes ago      Exited (0) 50 minutes ago                       k8s_nginx.d7d3eb2f_my-nginx-dvped_default_8ba8e71f-5c87-11e5-833d-fa163e6fd979_30e16003
    edb89618ab45        gcr.io/google_containers/pause:0.8.0   "/pause"               50 minutes ago      Exited (0) 50 minutes ago                       k8s_POD.ef28e851_my-nginx-dvped_default_8ba8e71f-5c87-11e5-833d-fa163e6fd979_9731a746
    5734b49d6f02        nginx                                  "nginx -g 'daemon of   20 hours ago        Exited (0) 50 minutes ago                       k8s_nginx.d7d3eb2f_my-nginx-9qdsz_default_1af37908-5bef-11e5-833d-fa163e6fd979_6975d0cb
    b4e969c99ba9        gcr.io/google_containers/pause:0.8.0   "/pause"               20 hours ago        Exited (0) 50 minutes ago                       k8s_POD.ef28e851_my-nginx-9qdsz_default_1af37908-5bef-11e5-833d-fa163e6fd979_0138d224
    36ce4d655d68        nginx                                  "nginx -g 'daemon of   20 hours ago        Exited (0) 20 hours ago                         k8s_nginx.d7d3eb2f_my-nginx-pjkk9_default_16578ce0-5bef-11e5-833d-fa163e6fd979_ae73356d
    2eb60a208f14        gcr.io/google_containers/pause:0.8.0   "/pause"               20 hours ago        Exited (0) 20 hours ago                         k8s_POD.ef28e851_my-nginx-pjkk9_default_16578ce0-5bef-11e5-833d-fa163e6fd979_607d9c01
    d7ff987e7240        nginx                                  "nginx -g 'daemon of   20 hours ago        Exited (0) 20 hours ago                         k8s_nginx.d7d3eb2f_my-nginx-xsfpf_default_0cbbc5e4-5bef-11e5-833d-fa163e6fd979_ca445d38
    7d814893c059        gcr.io/google_containers/pause:0.8.0   "/pause"               20 hours ago        Exited (0) 20 hours ago                         k8s_POD.ef28e851_my-nginx-xsfpf_default_0cbbc5e4-5bef-11e5-833d-fa163e6fd979_100aa082
    860f0e4e3f8b        nginx                                  "nginx -g 'daemon of   21 hours ago        Exited (0) 20 hours ago                         k8s_nginx.d7d3eb2f_my-nginx-bpfed_default_3e0d7c10-5be5-11e5-833d-fa163e6fd979_5c30503c
    f3d238100f6d        gcr.io/google_containers/pause:0.8.0   "/pause"               21 hours ago        Exited (0) 20 hours ago                         k8s_POD.ef28e851_my-nginx-bpfed_default_3e0d7c10-5be5-11e5-833d-fa163e6fd979_c8d23490
    [centos@fedminion02 ~]$ 
    [centos@fedminion03 ~]$ sudo docker images -a
    REPOSITORY                       TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
    docker.io/nginx                  latest              0b354d33906d        6 days ago          132.8 MB
    <none>                           <none>              cc72ca035672        6 days ago          132.8 MB
    <none>                           <none>              bfd040d9aed8        6 days ago          132.8 MB
    <none>                           <none>              b0d40bad0b2b        6 days ago          132.8 MB
    <none>                           <none>              4f7e3e7c2546        6 days ago          132.8 MB
    <none>                           <none>              3a6a1b71b6e0        6 days ago          132.8 MB
    <none>                           <none>              93f81de2705a        6 days ago          125.1 MB
    <none>                           <none>              4ac684e3f295        6 days ago          125.1 MB
    <none>                           <none>              d6c6bbd63f57        6 days ago          125.1 MB
    <none>                           <none>              426ac73b867e        6 days ago          125.1 MB
    <none>                           <none>              8c00acfb0175        8 days ago          125.1 MB
    <none>                           <none>              843e2bded498        8 days ago          125.1 MB
    gcr.io/google_containers/pause   0.8.0               2c40b0526b63        5 months ago        241.7 kB
    <none>                           <none>              56ba5533a2db        5 months ago        241.7 kB
    <none>                           <none>              511136ea3c5a        2 years ago         0 B
    [centos@fedminion03 ~]$ 
    
  4. So I deleted them:

    [centos@fedminion03 ~]$ sudo docker rm -f 9ef242e9f4fe 3f7902f141f8 9005e5560b46 6f645a44c13b f206cdae56ab 8e4614e3b3e1 69c1970f2899 31e362ce4fc5 d2246a05f4ab  134aef09ba5f
    9ef242e9f4fe
    3f7902f141f8
    9005e5560b46
    6f645a44c13b
    f206cdae56ab
    8e4614e3b3e1
    69c1970f2899
    31e362ce4fc5
    d2246a05f4ab
    134aef09ba5f
    [centos@fedminion03 ~]$
    [centos@fedminion03 ~]$ sudo docker rmi -f 0b354d33906d cc72ca035672 bfd040d9aed8 b0d40bad0b2b 4f7e3e7c2546 3a6a1b71b6e0 93f81de2705a 4ac684e3f295 d6c6bbd63f57 426ac73b867e 8c00acfb0175 843e2bded498 2c40b0526b63 56ba5533a2db 511136ea3c5a
    Untagged: docker.io/nginx:latest
    Deleted: 0b354d33906d30ac52d2817ea770ddce18c7531e58b5b3ca0ae78873f5d2e207
    Deleted: cc72ca0356723a8008c5e8fe0118700aee1153930e16877560db58a0121757f0
    Deleted: bfd040d9aed8c694585cfa02bb711f8a93f9fd1d6c8920b89b0030b2017c0b5b
    Deleted: b0d40bad0b2b521174cf3ee2f4e74d889561a7c5e2c9c85167978de239bc1c97
    Deleted: 4f7e3e7c25464989f847bb9a1d1f3d4c479d12137b44ed5c8b8d83bcf67f4d4b
    Deleted: 3a6a1b71b6e021df3abda9f223b016a8bc018c27785530d4af69ada52cac404a
    Deleted: 93f81de2705a3caf7ffd22347c5768b5169c6d037e3867825a4ab7cdd9a7de99
    Deleted: 4ac684e3f295443f69e92c60768b1052ded4aac3a54466c2debf66de35a479e9
    Deleted: d6c6bbd63f57240b4d7a3714f032004b1ed0aa1b9761778f9a2a8632e0c5efd1
    Deleted: 426ac73b867e7b17208d55c2bcc6ba9bd81ae4aff1ad79106e63fc94d783f27f
    Deleted: 8c00acfb017549e44d28098762c3e6296872a1ca9b90385855f1019d84bb0dac
    Deleted: 843e2bded49837e4846422f3a82a67be3ccc46c3e636e03d8d946c57564468ba
    Deleted: 2c40b0526b6358710fd09e7b8c022429268cc61703b4777e528ac9d469a07ca1
    Deleted: 56ba5533a2dbf18b017ed12d99c6c83485f7146ed0eb3a2e9966c27fc5a5dd7b
    Deleted: 511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158
    [centos@fedminion03 ~]$ 
    

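If this ever had to be done on more than a couple of nodes, steps 3 and 4 could be scripted rather than done by copy-pasting IDs. A rough sketch, assuming the docker output layouts shown above (the k8s_ container-name prefix and the <none> repository markers are assumptions based on my listings, not a documented contract):

```shell
# Collect IDs of kubelet-managed containers ("k8s_" name prefix) from
# "docker ps -a" output read on stdin.
k8s_container_ids() {
  awk 'NR > 1 && $NF ~ /^k8s_/ { print $1 }'
}

# Collect IDs of dangling images (REPOSITORY column "<none>") from
# "docker images -a" output read on stdin.
dangling_image_ids() {
  awk 'NR > 1 && $1 == "<none>" { print $3 }'
}

# Usage on a node (requires docker; destructive, lab use only):
#   sudo docker ps -a | k8s_container_ids | xargs -r sudo docker rm -f
#   sudo docker images -a | dangling_image_ids | xargs -r sudo docker rmi -f
```

Again, this scrapes human-oriented output columns, so it's a lab shortcut, not something I'd point at a real cluster.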
Final Thoughts

I really hope that some @kubernetesio folks comment on this one. I believe a better way of handling this might be to erase only the kubernetes container on the master and reboot the minions. From what I understand, the kubernetes container helps maintain state and proxies information between the master and the minions, so it would make sense that this could resolve the issue. I will attempt to explore this option in my home lab later this week. Here's a sample of what I would be testing:

[fedora@fedmaster ~]$ kubectl get services
NAME         LABELS                                    SELECTOR   IP(S)        PORT(S)   AGE  
kubernetes   component=apiserver,provider=kubernetes   <none>     10.109.0.1   443/TCP   23h  
[fedora@fedmaster ~]$ kubectl delete service kubernetes

As with many folks, I am still learning Kubernetes. It's an extremely impressive project, and I'm lucky to be exploring it. I'm also thankful for these strange little scenarios, because they'll only help me learn more, and I love learning new things!
