Prometheus Multi-Cluster monitoring using Prometheus Agent Mode

Prometheus is the most popular solution for monitoring Kubernetes nowadays. It gives SRE/DevOps teams deep insight into the performance and reliability of services in Kubernetes by storing queryable metrics for the applications running in a cluster. At the same time, many companies are switching to multi-cluster/multi-cloud setups, following cloud-agnostic best practices. For a while I was busy finding a practical way to monitor multiple Kubernetes clusters, and this post is about what I found to be the best solution to implement.

Photo by Clark Cruz from Pexels

Prometheus agent mode


Starting with v2.32.0, Prometheus introduced a new feature called agent mode, which you can enable by passing the --enable-feature=agent flag. But first, let’s take a look at the traditional ways of monitoring multiple clusters with Prometheus, and keep in mind that the top-level Prometheus that gathers the metrics of all clusters is called the Prometheus global view.
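As a quick sketch, agent mode is just a feature flag on the regular Prometheus binary. The config filename and storage path below are assumptions for illustration:

```
# Run the standard Prometheus binary in agent mode (v2.32.0+).
# In this mode --storage.agent.path is used instead of --storage.tsdb.path.
prometheus \
  --enable-feature=agent \
  --config.file=prometheus-agent.yml \
  --storage.agent.path=/var/lib/prometheus/agent-data
```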

Prometheus already introduced other ways to support the global view case, each with its own pros and cons.

  • Prometheus Federation
    This was the first aggregation feature introduced by Prometheus. In this case, a global-view Prometheus server scrapes a subset of metrics from each child Prometheus. In short, if you want to federate metrics from one server to another, you configure your global Prometheus server to scrape the /federate endpoint of the child server. Read more about Prometheus federation on the official website.
  • Prometheus Remote Read
    This feature allows a Prometheus server to select metrics from other Prometheus servers using their own query mechanism. The best-known example of this pattern is Thanos as the global-view service: you run PromQL queries against Thanos, and Thanos fetches the required metrics from the remote locations.
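To make the federation option concrete, here is a minimal sketch of a global-view scrape job against a child server’s /federate endpoint. The job name, series matcher, and target address are all assumptions:

```
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep the child's original labels
    metrics_path: '/federate'
    params:
      'match[]':                  # series selectors to pull from the child
        - '{job="kubernetes-nodes"}'
    static_configs:
      - targets:
          - 'child-prometheus.example.com:9090'   # assumed child address
```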

Now let’s see how Prometheus agent mode, in combination with the Prometheus remote write feature, comes to the rescue. I should mention that, combined with remote write, you can also use other solutions such as Thanos, VictoriaMetrics, and Grafana. But if you are looking for a Prometheus-only implementation, agent mode is the answer.

Enabling agent mode gives you a lightweight Prometheus with querying, local storage, and alerting disabled, optimized purely for remote write. In place of the full TSDB (time series database), it keeps only a customized WAL (write-ahead log), which makes it much lighter as well. Apart from that, the scraping logic, service discovery, and related configuration remain the same. If I had to describe it, I’d say it is sort of (but not exactly) a smart exporter that gathers all the metrics in a cluster and pushes them to a global-view Prometheus.
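In practice that means an agent’s configuration file looks like an ordinary Prometheus config plus a remote_write section, while the querying- and alerting-related sections are not used in this mode. A minimal sketch, where the job and receiver URL are assumptions:

```
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'kubernetes-pods'     # same scraping/discovery config as a full server
    kubernetes_sd_configs:
      - role: pod

remote_write:
  - url: 'http://global-prometheus.example.com:9090/api/v1/write'  # assumed receiver
```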

Prometheus global view using the remote write feature

What makes the Prometheus agent so interesting to me? The fact that it’s built into Prometheus: the same scraping APIs, the same UI, the same configs, and the same discovery methods. The agent mode WAL doesn’t keep data after it has been written to the remote endpoint. It is worth mentioning that if the agent cannot reach the remote endpoint, it keeps the data on disk for a short period of time until remote write becomes accessible again.

Prometheus remote write in short

To enable the remote write receiver, add the --web.enable-remote-write-receiver flag to your global-view Prometheus. You can then send data from the agent-mode Prometheus by adding the following to its config file:

    remote_write:
      - url: 'http://<prometheus>:9090/api/v1/write'
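Putting both sides together: the global view runs with the receiver flag enabled, and the agent’s remote_write block can optionally tune buffering and retries. The values below are illustrative assumptions, not recommendations:

```
# Global-view side: start Prometheus with the receiver enabled
#   prometheus --web.enable-remote-write-receiver ...

# Agent side: prometheus.yml
remote_write:
  - url: 'http://<prometheus>:9090/api/v1/write'
    queue_config:
      capacity: 2500        # samples buffered per shard before blocking
      max_shards: 10        # upper bound on parallel senders
      max_backoff: 30s      # retry backoff while the receiver is unreachable
```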

In which cases is Prometheus agent mode useful?

As I already mentioned, it is very useful when you want to monitor multiple clusters in a single pane of glass. But that is not the only case: these days many companies are moving toward edge computing. In the era of IoT, self-driving cars, and similar models, you may deploy a Kubernetes cluster on a resource-bounded device. Imagine you work for an internet provider that runs Kubernetes clusters on its edge modems to support various services. Let’s assume you are using K3s, a lightweight Kubernetes distribution, and you want to collect metrics from the K3s cluster and the applications running there. In this case, you have several requirements to meet.

  • Deploy a global view that gathers all the modems’ metrics in a single pane of glass.
  • Sometimes, due to an update, you might lose the network connection for a short period of time, and you don’t want to lose your metrics.
  • All these modems are resource-constrained, so you don’t want to run a large application on the devices.
  • You need a central alerting system connected to your ticketing system that raises a ticket for further investigation whenever an issue occurs.
  • You want the solution itself to be maintainable, so you can easily update it without reinventing the wheel.
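The requirements above map naturally onto a small agent deployment in each edge cluster. A minimal sketch of such a Deployment, where the namespace, resource limits, volume choice, and ConfigMap name are all assumptions:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-agent
  namespace: monitoring               # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-agent
  template:
    metadata:
      labels:
        app: prometheus-agent
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.32.0   # first release with agent mode
          args:
            - '--enable-feature=agent'
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.agent.path=/prometheus'
          resources:
            limits:                   # small footprint for resource-bounded devices
              memory: 256Mi
              cpu: 200m
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus  # buffers samples if the uplink drops
      volumes:
        - name: config
          configMap:
            name: prometheus-agent-config   # assumed ConfigMap holding the agent config
        - name: data
          emptyDir: {}
```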

Another good example of switching to an agent-mode global-view implementation is when you want to monitor clusters that do not expose any public endpoints for security reasons, but are still allowed egress traffic most of the time.

So what’s the next step?

I hope you find this post interesting and useful for the high-level design of your multi-cluster/multi-cloud Prometheus implementation, as many companies are looking for a single pane of glass for their monitoring. In the next post I will try to build an actual example with the Kubernetes manifests needed to deploy a Prometheus agent and a global view.

Who am I?

I am Ehsan, a passionate site reliability engineer and cloud solutions architect working for Techspire Netherlands. I am dedicated to helping businesses smooth out their system engineering and security operations, improving the availability, scalability, and QoS of their services using infrastructure-as-code concepts, a wide range of monitoring and logging tools, deep dives into Linux-based operating systems, and my knowledge of networking concepts and big-data analytical tools.