[Supported in Kublr 1.21.2 and later]

Tags: prometheus, grafana, SLO, slok/sloth


Kublr uses the Prometheus/Grafana/Alertmanager stack for centralized monitoring of managed clusters.

In most cases you can customize the stack through the cluster specification: https://support.kublr.com/en/support/solutions/articles/33000261531-monitoring-frequently-requested-configuration-changes

You can also use Prometheus recording rules (https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) stored in ConfigMaps:

spec:
  features:
    monitoring:
      values:
        prometheus:
          config:
            extraRulesConfigmaps:
              - name: prometheus-sloth-rules
                fileMask: '*.yaml'
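Each key in such a ConfigMap that matches fileMask must be an ordinary Prometheus rules file. A minimal illustrative sketch (the group and rule names here are placeholders, not part of the Sloth workflow):

```yaml
# example-rules.yaml -- a plain Prometheus rules file; every key in the
# ConfigMap matching the '*.yaml' fileMask is loaded by Prometheus as
# one of these.
groups:
  - name: example-recording-rules
    rules:
      - record: job:up:avg
        expr: avg by (job) (up)
```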

Sloth

https://github.com/slok/sloth#sloth


Kublr runs Prometheus in server mode without the Prometheus operator, so you will need to write Sloth manifests in raw Prometheus v1 mode (https://github.com/slok/sloth#raw-prometheus).


1. Download the latest Sloth release from https://github.com/slok/sloth/releases

2. Clone the Sloth common SLI plugins from https://github.com/slok/sloth-common-sli-plugins


KCP Kubernetes API SLO example using SLI plugins


1. Create a kcp-sloth.yaml manifest with raw Sloth rules:

version: "prometheus/v1"
service: "kcp-k8s"
labels:
  sloth: kublr
  kublr_cluster: kublr-control-plane
  kublr_space: kublr-system
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "api-availability"
    objective: 99.9
    description: "K8S API SLO based on availability for HTTP request responses."

    sli:
      plugin:
        id: "sloth-common/kubernetes/apiserver/availability"
        options:
          filter: job="kubernetes-apiservers",kublr_cluster="kublr-control-plane",kublr_space="kublr-system"
    alerting:
      name: "k8sAPIAvailabilityProblem"
      page_alert:
        disable: true
      ticket_alert:
        disable: true

  - name: "api-latency"
    objective: 95.0
    description: "K8S API SLO based on latency of HTTP request responses."
    sli:
      plugin:
        id: "sloth-common/kubernetes/apiserver/latency"
        options:
          bucket: "0.5"
          filter: job="kubernetes-apiservers",kublr_cluster="kublr-control-plane",kublr_space="kublr-system"
    alerting:
      name: "Fake"
      page_alert:
        disable: true
      ticket_alert:
        disable: true

Important: alerting is disabled in both SLOs because we will create cumulative alert rules later.

2. Generate Prometheus rules with Sloth:

# ./sloth generate -p ./sloth-common-sli-plugins -i kcp-sloth.yaml -o kcp-k8s-rules.yaml
INFO[0000] SLI plugins loaded                            plugins=11 svc=storage.FileSLIPlugin version=v0.6.0
INFO[0000] Generating from Prometheus spec               version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated        out=kcp-k8s-rules.yaml slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated                 out=kcp-k8s-rules.yaml rules=8 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated            out=kcp-k8s-rules.yaml rules=7 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated                     out=kcp-k8s-rules.yaml rules=0 slo=kcp-k8s-api-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated        out=kcp-k8s-rules.yaml slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated                 out=kcp-k8s-rules.yaml rules=8 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated            out=kcp-k8s-rules.yaml rules=7 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated                     out=kcp-k8s-rules.yaml rules=0 slo=kcp-k8s-api-latency svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Prometheus rules written                      format=yaml groups=4 out=kcp-k8s-rules.yaml svc=storage.IOWriter version=v0.6.0
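The generated file contains rule groups of recording rules named slo:sli_error:ratio_rate<window> (the series the cumulative alert rules later in this article query) plus metadata rules. An abridged, illustrative sketch of its shape; the actual group names and expressions are produced by Sloth:

```yaml
# Illustrative shape of kcp-k8s-rules.yaml (expressions elided).
groups:
  - name: sloth-slo-sli-recordings-kcp-k8s-api-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: <error events over 5m / total events over 5m>
        labels:
          sloth_service: kcp-k8s
          sloth_slo: api-availability
          sloth_window: 5m
      # ... the same rule repeated for the 30m, 1h, 2h, 6h, 1d, 3d
      # and 30d windows used by the burn-rate alerts (rules=8 above)
```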

Kubernetes service monitoring example

1. Create a services.yaml manifest:

version: "prometheus/v1"
service: "kcp"
labels:
  sloth: kublr
  kublr_cluster: kcp
  kublr_space: kublr-system
slos:
  - name: "PrometheusTargetAvailability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/prometheus/targets/availability"
        options:
          filter: kublr_cluster="kcp",kublr_space="kublr-system"
    alerting:
      name: "PrometheusTargetAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%) for the kublr-api service
  - name: "kublr-api-requests-availability"
    objective: 99.9
    description: "Common SLO based on availability for HTTP request responses."
    sli:
      events:
        error_query: sum(rate(http_request_duration_milliseconds_count{app="kcp-kublr-api",kublr_cluster="kcp",kublr_space="kublr-system",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_milliseconds_count{app="kcp-kublr-api",kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "KublrAPIAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
  - name: "ingress-requests-availability"
    objective: 99.9
    description: "Common SLO based on availability for HTTP request responses."
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true
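The events SLIs above are plain ratio queries: Sloth substitutes each alerting window for {{.window}} and divides error_query by total_query. A quick numeric check of what a 99.9% objective tolerates (the request counts are illustrative):

```python
# Events-based SLI: error ratio = error_query / total_query evaluated
# over the same window. A 99.9% objective budgets roughly one failed
# (5xx/429) request per 1000 requests.
def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests in the window that failed."""
    return errors / total

print(error_ratio(1, 1000))   # 0.001 -> within a 99.9% objective
print(error_ratio(5, 1000))   # 0.005 -> burns the budget 5x faster
```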

2. Generate Prometheus rules with Sloth:

# ./sloth-darwin-amd64 generate -p ./sloth-common-sli-plugins -i services.yaml -o kcp-rules.yaml 
INFO[0000] SLI plugins loaded                            plugins=11 svc=storage.FileSLIPlugin version=v0.6.0
INFO[0000] Generating from Prometheus spec               version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated        out=kcp-rules.yaml slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated                 out=kcp-rules.yaml rules=8 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated            out=kcp-rules.yaml rules=7 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated                     out=kcp-rules.yaml rules=0 slo=kcp-PrometheusTargetAvailability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated        out=kcp-rules.yaml slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated                 out=kcp-rules.yaml rules=8 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated            out=kcp-rules.yaml rules=7 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated                     out=kcp-rules.yaml rules=0 slo=kcp-kublr-api-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Multiwindow-multiburn alerts generated        out=kcp-rules.yaml slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLI recording rules generated                 out=kcp-rules.yaml rules=8 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Metadata recording rules generated            out=kcp-rules.yaml rules=7 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] SLO alert rules generated                     out=kcp-rules.yaml rules=0 slo=kcp-ingress-requests-availability svc=generate.prometheus.Service version=v0.6.0
INFO[0000] Prometheus rules written                      format=yaml groups=6 out=kcp-rules.yaml svc=storage.IOWriter version=v0.6.0


Applying the Prometheus configuration

1. Create/modify the ConfigMap:

$ kubectl create configmap -n kublr prometheus-sloth-rules --from-file=./kcp-k8s-rules.yaml --from-file=./kcp-rules.yaml
configmap/prometheus-sloth-rules created

2. Reload the Prometheus configuration:

$ kubectl exec -n kublr $(kubectl get pods --no-headers -n kublr -l app=kublr-monitoring-prometheus -o=custom-columns=NAME:.metadata.name) -c prometheus -- killall -HUP prometheus

Grafana dashboard

https://grafana.com/grafana/dashboards/14348


The default dashboard uses Grafana exemplars. Exemplars are a way to associate higher-cardinality metadata from a specific event with traditional time series data.

https://grafana.com/docs/grafana/latest/datasources/prometheus/#configuring-exemplars


Note: This feature is available in Prometheus 2.26+ and Grafana 7.4+.


Modify the cluster specification to use Prometheus v2.28.1:

spec:
  features:
    monitoring:
      values:
        prometheus:
          image:
            tag: v2.28.1

Import Grafana dashboard 14348 into your Grafana.


Alertmanager rules

https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts

You can create cumulative alert rules in the cluster specification, or use the rules generated by Sloth.

For example, we can create cumulative alert rules:

spec:
  features:
    monitoring:
      values:
        alertmanager:
          alerts:
            - alert: SLOOverExpected
              annotations:
                description: '{{$labels.sloth_service}} {{$labels.sloth_slo}} in {{$labels.kublr_space}}.{{$labels.kublr_cluster}} SLO error budget burn rate is over expected.'
                summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.'
                title: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}}.{{$labels.sloth_slo}} - SLO error budget burn rate is too fast!'
              expr: |
                (
                 (slo:sli_error:ratio_rate5m{} > (14.4 * 0.0009999999999999432)) 
                  and ignoring(sloth_window) 
                 (slo:sli_error:ratio_rate1h{} > (14.4 * 0.0009999999999999432))
                ) 
                 or ignoring(sloth_window) 
                (
                 (slo:sli_error:ratio_rate30m{} > (6 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                 (slo:sli_error:ratio_rate6h{}  > (6 * 0.0009999999999999432))
                )
              labels:
                severity: warning
            - alert: SLOOverCritical
              annotations:
                description: '{{$labels.sloth_service}} {{$labels.sloth_slo}} in {{$labels.kublr_space}}.{{$labels.kublr_cluster}} SLO error budget burn rate is over critical.'
                summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.'
                title: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}}.{{$labels.sloth_slo}} - SLO error budget burn rate is critical!'
              expr: |
                (
                 (slo:sli_error:ratio_rate2h{} > (3 * 0.0009999999999999432)) 
                  and ignoring(sloth_window) 
                 (slo:sli_error:ratio_rate1d{} > (3 * 0.0009999999999999432))
                ) 
                 or ignoring(sloth_window) 
                (
                 (slo:sli_error:ratio_rate6h{} > (1 * 0.0009999999999999432))
                  and ignoring(sloth_window)
                 (slo:sli_error:ratio_rate3d{} > (1 * 0.0009999999999999432))
                )
              labels:
                severity: critical
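The constants in these expressions follow the multiwindow multi-burn-rate recipe from the SRE workbook linked above: 0.0009999999999999432 is the error budget for a 99.9% objective as float64 arithmetic produces it, and 14.4 / 6 / 3 / 1 are the burn-rate factors for the window pairs. A sketch of the arithmetic, in Python for illustration:

```python
# How the alert thresholds above are derived for a 99.9% objective.
objective = 99.9
# Computed in float64, which is why the rules contain the long constant
# rather than a clean 0.001:
error_budget = (100 - objective) / 100

# (short window, long window, burn-rate factor); an alert fires only
# when BOTH windows of a pair exceed factor * error_budget.
window_pairs = [
    ("5m",  "1h", 14.4),  # warning: ~2% of a 30d budget burned in 1h
    ("30m", "6h", 6.0),   # warning: ~5% of a 30d budget burned in 6h
    ("2h",  "1d", 3.0),   # critical: ~10% of a 30d budget burned in 1d
    ("6h",  "3d", 1.0),   # critical: ~10% of a 30d budget burned in 3d
]
for short, long_, factor in window_pairs:
    print(f"{short} and {long_}: slo:sli_error ratio > {factor * error_budget:.10f}")
```

The and ignoring(sloth_window) pairing in the expressions implements the "both windows must exceed the threshold" requirement, which suppresses alerts on short spikes.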

Cumulative Example for Ingress

version: "prometheus/v1"
service: "ingress"
labels:
  sloth: kublr
slos:
  - name: "kcp-availability"
    objective: 88.1
    description: "Control Plane Ingress requests SLO based on availability for HTTP request responses."
    labels:
      kublr_cluster: kcp
      kublr_space: kublr-system
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="kcp",kublr_space="kublr-system"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true

  - name: "production"
    objective: 99.9
    description: "Ingress requests SLO based on availability for HTTP request responses."
    labels:
      kublr_cluster: prod-cluster
      kublr_space: production
    sli:
      events:
        error_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="prod-cluster",kublr_space="production",status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(nginx_ingress_controller_request_duration_seconds_count{kublr_cluster="prod-cluster",kublr_space="production"}[{{.window}}]))
    alerting:
      name: "IngressAvailability"
      page_alert:
        disable: true
      ticket_alert:
        disable: true