So, we've all been there. You go to your trusty grafana, search for some sweet metrics that you implemented and WHAM! Prometheus returns us a 503, a trusty way of saying I'm not ready, and I'm probably going to die soon. And since we're running in kubernetes I'm going to die soon, again and again. And you're getting reports from your colleagues that prometheus is not responding. And you can't ignore them anymore.
All right, lets check what's happening to the little guy.
kubectl get pods -n monitoring
prometheus-prometheus-kube-prometheus-prometheus-0 1/2 Running 4 5m
It seems like it's stuck in the running state, where the container is not yet ready. Let's describe the deployment, to check out what's happening.
State: Running │
Started: Wed, 12 Jan 2022 15:12:49 +0100 │
Last State: Terminated │
Reason: OOMKilled │
Exit Code: 137 │
Started: Tue, 11 Jan 2022 17:14:41 +0100 │
Finished: Wed, 12 Jan 2022 15:12:47 +0100 │
So we see that the prometheus is in a running state waiting for the readiness probe to trigger, probably working on recovering from Write Ahead Log (WAL). This could be an issue where prometheus is recovering from an error, or a restart and does not have enough memory to write everything in the WAL. We could be running into an issue where we set the request/limits memory lower than the prometheus requires, and the kube scheduler keeps killing prometheus for wanting more memory.
For this case, we could give it more memory to work to see if it recovers. We should also analyze why the prometheus WAL is getting clogged up.
In essence, we want to check what has changed so that we suddenly have a high memory spike in our sweet, sweet environment.
A lot of prometheus issues revolve around cardinality. Memory spikes that break your deployment? Cardinality. Prometheus dragging its feet like it's Monday after the log4j (the second one ofc) zero day security breach? Cardinality. Not getting that raise since you worked hard the past 16 years without wavering? You bet your ass it's cardinality. So, as you can see much of life's problems can be accredited to cardinality.
In short cardinality of your metrics is the combination of all label values per metric.
For example, if our metric
http_request_total had a label response code, and let's say we support 8 status codes, our cardinality starts off at 8.
For good measure we want to record the HTTP verb for the request. We support
GET POST PUT HEAD which would put the cardinality to 4*8=32.
Now, if someone adds a URL to the metric label (!!VERY BAD IDEA!!, but bare with me now) and we have 2 active pages, we'd have a cardinality of 2*4*8=64.
But, imagine someone starts scraping your website for potential vulnerabilities. Imagine all the URLs that will appear, most likely only once.
The point to this story is, be very mindful of how you use labels and cardinality in prometheus, since that will indeed have great impact on your prometheus performance.
Since this has never happened to me (never-ever) I found the following solution to be handy.
Since we can't get prometheus up and running to utilize PromQL to detect the potential issues, we have to find another way to detect high cardinality.
Therefore, we might want to get our hands dirty with some
kubectl exec -it -n monitoring pods/prometheus-prometheus-kube-prometheus-prometheus-0 -- sh, and run the prometheus
tsdb analysis too.
/prometheus $ promtool tsdb analyze .
Which produced the result.
> Block ID: 01FT8E8YY4THHZ2S7C3G04GJMG
> Duration: 1h59m59.997s
> Series: 564171
> Label names: 285
> Postings (unique label pairs): 21139
> Postings entries (total label pairs): 6423664
> Highest cardinality metric names:
> 11340 haproxy_server_http_responses_total
We see the potential issue here, where the
haproxy_server_http_responses_total metric is having a super-high cardinality which is growing.
We need to deal with it, so that our prometheus instance can breathe again. In this particular case, the solution was updating the haproxy.
... or burn it, up to you.