[Supported in Kublr 1.21.2 and later]
Tags: prometheus, grafana, SLO, slok/sloth
Kublr uses the Prometheus/Grafana/Alertmanager stack for centralized monitoring of managed clusters.
In most cases you can customize the stack through the cluster specification: https://support.kublr.com/en/support/solutions/articles/33000261531-monitoring-frequently-requested-configuration-changes
You can also use Prometheus recording rules (https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) stored in config maps:
spec:
  features:
    monitoring:
      values:
        prometheus:
          config:
            extraRulesConfigmaps:
              - name: prometheus-sloth-rules
                fileMask: '*.yaml'
Sloth
https://github.com/slok/sloth#sloth
Kublr runs Prometheus in server mode and does not use the Prometheus Operator, so you will need to write Sloth manifests in raw Prometheus v1 mode (https://github.com/slok/sloth#raw-prometheus).
1. Download the latest Sloth release from https://github.com/slok/sloth/releases
2. Clone the Sloth common SLI plugins repository from https://github.com/slok/sloth-common-sli-plugins
Example: KCP Kubernetes API SLO using SLI plugins
1. Create a kcp-sloth.yaml manifest with raw Sloth rules:
version: "prometheus/v1"
service: "kcp-k8s"
labels:
  sloth: kublr
  kublr_cluster: kublr-control-plane
  kublr_space: kublr-system
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "api-availability"
    objective: 99.9
    description: "K8S API SLO based on availability for HTTP request responses."
    sli:
      plugin:
        id: "sloth-common/kubernetes/apiserver/availability"
        options:
          filter: job="kubernetes-apiservers",kublr_cluster="kublr-control-plane",kublr_space="kublr-system"
    alerting:
      name: "k8sAPIAvailabilityProblem"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  - name: "api-latency"
    objective: 95.0
    description: "K8S API SLO based on latency for HTTP request responses."
    sli:
      plugin:
        id: "sloth-common/kubernetes/apiserver/latency"
        options:
          bucket: "0.5"
          filter: job="kubernetes-apiservers",kublr_cluster="kublr-control-plane",kublr_space="kublr-system"
    alerting:
      name: "Fake"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
Important: Alerting should be disabled in these manifests because we will create cumulative alert rules later.
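As a back-of-the-envelope check of what the 99.9% objective above means, the arithmetic below is illustrative only (Sloth computes its own thresholds from the same objective):

```python
# What a 99.9% availability objective allows, in plain numbers.
objective = 99.9                      # percent, as in the manifest above
error_budget = 1 - objective / 100    # fraction of requests allowed to fail

# "1 failed request per 1000" from the comment in the manifest:
print(round(error_budget * 1000, 6))   # ~1.0 failure allowed per 1000 requests

# Equivalent fully-failed time over a 30-day window:
minutes = 30 * 24 * 60
print(round(error_budget * minutes, 1))  # ~43.2 minutes per 30 days
```

The tighter the objective, the faster the budget shrinks: 99.99% would leave only about 4.3 minutes per 30 days.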
2. Generate Prometheus rules with Sloth:
# ./sloth generate -p ./sloth-common-sli-plugins -i kcp-sloth.yaml -o kcp-k8s-rules.yaml
INFO[0000] SLI plugins loaded  plugins=11 svc=storage.FileSLIPlugin version=v0.6.0
INFO[0000] Generating from Prometheus spec  version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated  out=kcp-k8s-rules.yaml slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated  out=kcp-k8s-rules.yaml rules=8 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated  out=kcp-k8s-rules.yaml rules=7 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated  out=kcp-k8s-rules.yaml rules=0 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated  out=kcp-k8s-rules.yaml slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated  out=kcp-k8s-rules.yaml rules=8 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated  out=kcp-k8s-rules.yaml rules=7 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated  out=kcp-k8s-rules.yaml rules=0 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Prometheus rules written  format=yaml groups=4 out=kcp-k8s-rules.yaml svc=storage.IOWriter version=v0.6.0
Kubernetes service monitoring example
1. Create services.yaml:
version: "prometheus/v1"
service: "kcp"
labels:
  sloth: kublr
  kublr_cluster: sloth
  kublr_space: kublr-system
slos:
  - name: "PrometheusTargetAvailability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/prometheus/targets/availability"
        options:
          filter: kublr_cluster="kcp",kublr_space="kublr-system"
    alerting:
      name: "PrometheusTargetAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%)
  # for the kublr-api service.
  - name: "kublr-api-requests-availability"
    objective: 99.9
    description: "Common SLO based on availability for HTTP request responses."
    sli:
      events:
        error_query: sum(rate(http_request_duration_milliseconds_count{app="kcp-kublr-api",kublr_cluster="kcp",kublr_space="kublr-system",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_milliseconds_count{app="kcp-kublr-api",kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "KublrAPIAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  - name: "ingress-requests-availability"
    objective: 99.9
    description: "Common SLO based on availability for HTTP request responses."
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
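The error_query/total_query pair defines an events-based SLI: Sloth divides the error rate by the total rate over each window. A minimal sketch of that division, with made-up per-second rates standing in for the PromQL results:

```python
def sli_error_ratio(error_rate: float, total_rate: float) -> float:
    """Events SLI: share of requests that failed (error_query / total_query)."""
    if total_rate == 0:
        return 0.0          # no traffic: treat as no error (a policy choice)
    return error_rate / total_rate

# Hypothetical per-second rates, as the two queries above might return:
errors, total = 0.12, 150.0
ratio = sli_error_ratio(errors, total)

budget = 1 - 99.9 / 100     # 0.1% error budget from the objective
print(ratio <= budget)      # 0.08% of requests failing is within budget
```

The recording rules Sloth generates materialize exactly this ratio per window (5m, 30m, 1h, ...) as slo:sli_error:ratio_rate* series, which the alert rules later compare against burn-rate thresholds.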
2. Generate Prometheus rules with Sloth:
# ./sloth-darwin-amd64 generate -p ./sloth-common-sli-plugins -i services.yaml -o kcp-rules.yaml
INFO[0000] SLI plugins loaded  plugins=11 svc=storage.FileSLIPlugin version=v0.6.0
INFO[0000] Generating from Prometheus spec  version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated  out=kcp-rules.yaml slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated  out=kcp-rules.yaml rules=8 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated  out=kcp-rules.yaml rules=7 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated  out=kcp-rules.yaml rules=0 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated  out=kcp-rules.yaml slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated  out=kcp-rules.yaml rules=8 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated  out=kcp-rules.yaml rules=7 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated  out=kcp-rules.yaml rules=0 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated  out=kcp-rules.yaml slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated  out=kcp-rules.yaml rules=8 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated  out=kcp-rules.yaml rules=7 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated  out=kcp-rules.yaml rules=0 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Prometheus rules written  format=yaml groups=6 out=kcp-rules.yaml svc=storage.IOWriter version=v0.6.0
Applying the Prometheus configuration
1. Create or modify the config map:
$ kubectl create configmap -n kublr prometheus-sloth-rules --from-file=./kcp-k8s-rules.yaml --from-file=./kcp-rules.yaml
configmap/prometheus-sloth-rules created
2. Reload the Prometheus configuration:
$ kubectl exec -n kublr $(kubectl get pods --no-headers -n kublr -l app=kublr-monitoring-prometheus -o=custom-columns=NAME:.metadata.name) -c prometheus -- killall -HUP prometheus
Grafana dashboard
https://grafana.com/grafana/dashboards/14348
The default dashboard uses Grafana exemplars. Exemplars are a way to associate higher-cardinality metadata from a specific event with traditional time series data.
https://grafana.com/docs/grafana/latest/datasources/prometheus/#configuring-exemplars
Note: This feature is available in Prometheus 2.26+ and Grafana 7.4+.
Modify the cluster specification to use Prometheus v2.28.1:
spec:
  features:
    monitoring:
      values:
        prometheus:
          image:
            tag: v2.28.1
Import Grafana dashboard 14348 into your Grafana.
Alertmanager rules
https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts
You can create cumulative alert rules in the cluster specification, or use the rules generated by Sloth.
For example, we can create cumulative alert rules:
spec:
  features:
    monitoring:
      values:
        alertmanager:
          alerts:
            - alert: SLOOverExpected
              annotations:
                description: '{{$labels.sloth_service}} {{$labels.sloth_slo}} in {{$labels.kublr_space}}.{{$labels.kublr_cluster}} SLO error budget burn rate is over expected.'
                summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.'
                title: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}}.{{$labels.sloth_slo}} - SLO error budget burn rate is too fast!'
              expr: |
                (
                    (slo:sli_error:ratio_rate5m{} > (14.4 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                    (slo:sli_error:ratio_rate1h{} > (14.4 * 0.0009999999999999432))
                )
                or ignoring(sloth_window)
                (
                    (slo:sli_error:ratio_rate30m{} > (6 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                    (slo:sli_error:ratio_rate6h{} > (6 * 0.0009999999999999432))
                )
              labels:
                severity: warning
            - alert: SLOOverCritical
              annotations:
                description: '{{$labels.sloth_service}} {{$labels.sloth_slo}} in {{$labels.kublr_space}}.{{$labels.kublr_cluster}} SLO error budget burn rate is over critical.'
                summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.'
                title: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}}.{{$labels.sloth_slo}} - SLO error budget burn rate is critical!'
              expr: |
                (
                    (slo:sli_error:ratio_rate2h{} > (3 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                    (slo:sli_error:ratio_rate1d{} > (3 * 0.0009999999999999432))
                )
                or ignoring(sloth_window)
                (
                    (slo:sli_error:ratio_rate6h{} > (1 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                    (slo:sli_error:ratio_rate3d{} > (1 * 0.0009999999999999432))
                )
              labels:
                severity: critical
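The constant 0.0009999999999999432 in these expressions is simply (100 - 99.9) / 100, the 0.1% error budget evaluated in 64-bit floats, and each factor (14.4, 6, 3, 1) is a burn rate: how many times faster than "exactly on budget" the error budget is being consumed. A short sketch reproducing the thresholds, with the window pairs taken from the expressions above:

```python
# Multiwindow, multi-burn-rate thresholds (see the Google SRE Workbook link above).
objective = 99.9
error_budget = (100 - objective) / 100   # the 0.000999... constant in the rules

# (burn-rate factor, short window, long window) pairs from the two alerts:
warning = [(14.4, "5m", "1h"), (6, "30m", "6h")]
critical = [(3, "2h", "1d"), (1, "6h", "3d")]

for severity, pairs in (("warning", warning), ("critical", critical)):
    for factor, short_w, long_w in pairs:
        # The alert fires only when BOTH windows exceed factor * error_budget,
        # which filters out short spikes while still catching fast burns.
        threshold = factor * error_budget
        print(f"{severity}: error ratio over {short_w} and {long_w} > {threshold:.6f}")
```

Requiring both a short and a long window to breach the threshold is what keeps these alerts fast on real incidents yet quiet on brief blips.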
Cumulative Example for Ingress
version: "prometheus/v1"
service: "ingress"
labels:
  sloth: kublr
slos:
  - name: "kcp-availability"
    objective: 88.1
    description: "Control Plane Ingress requests SLO based on availability for HTTP request responses."
    labels:
      kublr_cluster: kcp
      kublr_space: kublr-system
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  - name: "production"
    objective: 99.9
    description: "Ingress requests SLO based on availability for HTTP request responses."
    labels:
      kublr_cluster: prod-cluster
      kublr_space: production
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="prod-cluster",kublr_space="production",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="prod-cluster",kublr_space="production"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true