Patterns and anti-patterns for a reliable Kubernetes infra deployment

Kubernetes has become the most popular container orchestration solution, and many companies have moved to a microservice architecture deployed on Kubernetes.
In this transition, one of the most important things to get right is a reliable Kubernetes infrastructure deployment. Yes, I know that with managed Kubernetes offerings like EKS, AKS, GKE, Rancher, and others, deploying Kubernetes infrastructure has become much easier. However, keep in mind that reliability does not come for free with these solutions: you are still responsible for your infrastructure's design and reliability.


How to make my Kubernetes cluster more reliable?

1- Infrastructure as code deployment
Ok, wait, I know that these days most companies already have their infrastructure deployed as code. But does the code itself help reliability? It depends… As an SRE, I have seen many companies with their infrastructure deployed as code, yet a lot of untraced changes were still applied to the infra outside of the IaC deployment. Best practice says that nobody should be able to change anything on the infra except the IaC deployment pipeline. Assume you are upgrading your cluster to a newer version and you do it via your Terraform code. It goes smoothly, your cluster gets updated, and then BOOM, some applications on the cluster are broken! Why? Someone did a resource change in the UI without going through code, and now your cluster has a shortage of resources.
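
Below is a minimal sketch of such a pipeline, using GitHub Actions syntax as an example. The repository layout (Terraform code under `infra/`), the workflow name, and the apply strategy are assumptions for illustration; the point is that every change shows up as a reviewable plan and is applied only by the pipeline after merge.

```yaml
# Hypothetical "single path to production" pipeline for the infra code.
# Layout and names are assumptions; adapt to your own CI system.
name: infra-terraform
on:
  pull_request:        # every change is reviewed as a plan first
  push:
    branches: [main]   # and applied only after merge
jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=tfplan
      - if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
```

If console or kubectl access cannot be removed entirely, at least restrict it to read-only, so that every actual change still has to go through the pipeline.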

2- Do not mix application deployment with infrastructure deployment
Terraform and other IaC tools have become very popular recently, and many teams use them to deploy both infrastructure and applications. I agree that it is a great strategy, but be careful… The fact that you can deploy infrastructure in a pipeline doesn't mean that infra and application deployments should happen all at once in a single pipeline. What is the issue? First, if you are trying to adopt GitOps best practices, you need different approval processes for applications and infra. Second, application code changes much faster than the underlying infrastructure, so don't mix them.
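
As a sketch of what that separation can look like, here are two pipelines with different triggers and approval gates, again in GitHub Actions syntax and shown as one multi-document YAML for brevity. The paths, environment name, and deploy commands are placeholders.

```yaml
---
# .github/workflows/infra.yaml — slow-moving, gated by a manual approval environment
name: infra
on:
  push:
    branches: [main]
    paths: ["infra/**"]            # only infrastructure changes trigger this pipeline
jobs:
  terraform:
    runs-on: ubuntu-latest
    environment: infra-approval    # separate, stricter approval process
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infra init && terraform -chdir=infra apply -auto-approve
---
# .github/workflows/app.yaml — fast-moving application releases
name: app
on:
  push:
    branches: [main]
    paths: ["app/**", "charts/**"] # application and chart changes only
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: helm upgrade --install my-app ./charts/my-app   # hypothetical chart
```

Each pipeline can now evolve, fail, and be rolled back independently of the other.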

3- GitOps for infrastructure
If you have already implemented some or all of the GitOps best practices, you know that, like DevOps and SRE, it is not a one-size-fits-all solution for every team. But it is definitely one of the key patterns for a reliable Kubernetes infrastructure. Why? Because it is more secure and efficient, it is much faster, and it may reduce your total costs. It gives you consistency, which results in reliability.
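
As a concrete example, here is a minimal Argo CD `Application` that keeps a cluster add-on directory in Git as the single source of truth. Argo CD is just one option (Flux works equally well); the repo URL, path, and project name are placeholders.

```yaml
# Minimal GitOps sketch: Argo CD continuously reconciles the cluster add-ons
# defined in Git. Repo URL and paths are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/infra.git   # hypothetical infra repo
    targetRevision: main
    path: clusters/prod/addons
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift, so Git stays the single source of truth
```

The `selfHeal` option is what gives you the consistency mentioned above: manual changes are reverted automatically instead of silently accumulating.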

4- Cloud-agnostic deployment
Being cloud-agnostic greatly increases the reliability of your Kubernetes infra. If you are aiming for a highly reliable service, you need a highly reliable infrastructure underneath it, so you shouldn't depend on a single cloud provider. The best practice here is to deploy the core services that need to be fail-safe on cloud-agnostic infrastructure. Deploying cloud-agnostic infrastructure is not yet easy with IaC tools, as you need to rewrite the code for each provider. In addition, if you are using a single cloud provider, there is still room to add more reliability; for that, refer to each cloud provider's documentation.

5- Monitoring and alerting for the underlying infra
Implementing a proper monitoring solution for the infrastructure itself is another must. As an SRE, I have seen multiple times that teams have a decent monitoring solution in place, but they don't cover all the needed metrics of the underlying infrastructure for their Kubernetes nodes. For example, if you deploy your infra on on-premises machines, you need to collect your RAID controller metrics. I assume you have already worked with at least one monitoring tool, and as you might know, if you want to gather all the metrics from Kubernetes nodes, a common pattern is deploying a DaemonSet so that you have at least one monitoring pod on each node. For this purpose, I can suggest the Prometheus operator. There is a proper Helm chart for it, which makes it easy to deploy and use. In addition, if there are still hardware metrics to gather that are not covered by node-exporter (which is deployed as part of the Prometheus operator stack), you can create your own exporter. Run the exporter with privileged access and host-to-container mounts, then scrape it via the Prometheus operator, as in the sketch below.
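
Here is a sketch of that pattern: a hypothetical RAID-controller exporter running as a privileged DaemonSet with host mounts, plus a PodMonitor so the Prometheus operator scrapes it on every node. The image name, port, and labels are assumptions; the Helm chart itself is the community `prometheus-community/kube-prometheus-stack`.

```yaml
# Hypothetical hardware exporter as a DaemonSet (one pod per node).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: raid-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels: {app: raid-exporter}
  template:
    metadata:
      labels: {app: raid-exporter}
    spec:
      containers:
        - name: exporter
          image: example.com/raid-exporter:0.1.0   # hypothetical exporter image
          ports:
            - name: metrics
              containerPort: 9101
          securityContext:
            privileged: true                       # needed to talk to the RAID controller
          volumeMounts:                            # host-to-container mounts
            - {name: dev, mountPath: /host/dev}
            - {name: sys, mountPath: /host/sys}
      volumes:
        - name: dev
          hostPath: {path: /dev}
        - name: sys
          hostPath: {path: /sys}
---
# PodMonitor so the Prometheus operator scrapes every exporter pod.
# Depending on your chart values, it may also need the chart's release label.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: raid-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels: {app: raid-exporter}
  podMetricsEndpoints:
    - port: metrics
```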

6- Logging for the underlying infra
Like monitoring, having a comprehensive logging solution for the Kubernetes infrastructure is also mandatory if you want to improve reliability in the end. Again, proper logging, and especially covering the systemd/journald logs of the underlying nodes, is very helpful and, in my opinion, needs to be in place. For this purpose, I'd suggest using a logging operator, and one of the most useful ones is the logging-operator from Banzai Cloud. It is easy to deploy with its Helm chart and easy to maintain. For shipping node logs like _kubelet_ or _audit_ logs, or most importantly the _systemd_ journal, you can use its host tailer, as sketched below.
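
As a sketch, this is roughly what a host tailer for the systemd journal and an audit log file can look like. It is based on the logging-operator extensions API; double-check the API version and field names against the documentation of the chart version you deploy, as they may differ between releases.

```yaml
# Sketch of a logging-operator HostTailer shipping node-level logs.
# Field names may vary between logging-operator versions; treat this as illustrative.
apiVersion: logging-extensions.banzaicloud.io/v1alpha1
kind: HostTailer
metadata:
  name: node-logs
  namespace: logging
spec:
  systemdTailers:
    - name: kubelet-journal
      systemdFilter: kubelet.service              # ship only the kubelet unit from the journal
      maxEntries: 100
  fileTailers:
    - name: audit-log
      path: /var/log/kubernetes/audit/audit.log   # hypothetical audit log path
```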

[Image from the Banzai Cloud official page]

Last, but not least

Kubernetes is very popular these days, and what I mentioned above covers only part of the areas you need to work through to make your Kubernetes infrastructure more reliable. It is definitely not a prescription for every type of infrastructure. I tried to keep it as general as I can, but there are still many cases you might need to investigate, experiment with, and learn from yourself. Also, keep in mind that there is no perfect infrastructure. If you put a magnifier on any big company's infrastructure, you will definitely find issues and breaking points, so don't aim for a 100% fault-free infrastructure, but try to push your infra as close to fault-free as you can.

Who am I?

I am Ehsan, a passionate site reliability engineer and cloud solutions architect working for Techspire Netherlands. I am dedicated to helping businesses smooth out their system engineering and security operations and to improving the availability, scalability, and QoS of their services, using infrastructure-as-code concepts, a wide range of monitoring and logging tools, deep dives into Linux-based operating systems, and my knowledge of networking concepts and big-data analytical tools.
