tags: logging, monitoring



Overview


Sometimes you need to send alerts when certain events appear in the Kublr log collection and audit indices (or in any other indices stored in the Kublr log collection and management Elastic stack).


One way to achieve this is using Elastic Watchers as described in this support article: elasticsearch watchers.

The downside of this approach is that it requires a commercial Elastic license and is only supported on Kublr 1.21 and later.


This article describes another, lighter-weight method that works with the free Elastic stack version included in Kublr by default and is not tied to a specific Kublr version.


This method uses the open source Prometheus ES Exporter package, which can be installed into the Kublr Control Plane cluster to run arbitrary ES queries and expose their results as Prometheus metrics.

These metrics can then be used to build Grafana dashboards and Alertmanager alerts as usual.


Deploy Prometheus ES Exporter


Use the following commands to deploy Prometheus ES Exporter.

Note that the queries in the following command are provided only as an example; customize them to fit your specific use case.


# get password required to access Elastic

KIBANA_PASSWORD="$(kubectl -n kublr get secret \
  kublr-logging-searchguard -o jsonpath="{.data.kibana-password}" |
  base64 -d)"

# Deploy Prometheus ES Exporter helm package

helm upgrade \
    elastic-exporter https://braedon.github.io/helm/prometheus-es-exporter-0.1.1.tgz \
    --create-namespace --namespace kublr \
    --install \
    --values - \
<<EOF
image:
  tag: '0.14.0'
container:
  extraArgs:
  - '--basic-user=system.kibanaserver'
  - '--basic-password=${KIBANA_PASSWORD}'
  - '--header=x-forwarded-for: 127.0.0.1'
  - '--header=x-proxy-user: admin'
  - '--header=x-proxy-roles: admin'
  - '--cluster-health-disable'
  - '--nodes-stats-disable'
  - '--indices-aliases-disable'
  - '--indices-mappings-disable'
  - '--indices-stats-disable'
service:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: '9206'
    prometheus.io/scrape: 'true'
elasticsearch:
  cluster: https://kublr-logging-elasticsearch-client.kublr:9200
  queries: |-
    [DEFAULT]
    QueryIntervalSecs = 15
    QueryTimeoutSecs = 30
    QueryIndices = _all
    QueryOnError = drop
    QueryOnMissing = drop

    [query_all]
    QueryJson = {
        "size": 0,
        "track_total_hits": true,
        "query": {
          "match_all": {}
        }
      }

    [query_aggregated]
    QueryJson = {
        "size": 0,
        "query": {
          "bool": {
            "filter": [
              { "range": {"@timestamp": {"gte": "now-1m", "lt": "now"}}},
              { "prefix": {"_index": "kublr"}},
              { "match_phrase": { "log": "hardware error" } }
            ]
          }
        },
        "aggs": {
          "hits_by_cluster": {
            "composite": {
              "sources": [
                { "space": { "terms": { "field": "cluster_space.keyword" } } },
                { "cluster": { "terms": { "field": "cluster_name.keyword" } } }
              ]
            }
          }
        }
      }
EOF


Customizing Prometheus ES Exporter configuration


The Helm values in the example above include the container.extraArgs parameter, whose additional arguments disable a number of global statistics that the exporter can otherwise expose as metrics.


Remove the corresponding arguments (e.g. '--cluster-health-disable', '--nodes-stats-disable', '--indices-aliases-disable', '--indices-mappings-disable', and/or '--indices-stats-disable') from the values to re-enable those statistics in the metrics.


Refer to the Prometheus ES Exporter documentation and source code for more details on customizing the arguments.


Customizing Prometheus ES Exporter queries


The example configuration above contains two ES queries that demonstrate basic capabilities of Prometheus ES Exporter.

Each query should be specified in a section with a name starting with "query_". The rest of the section name will be used as a basis for the Prometheus metrics exposed by the exporter, so for this example the exporter will expose the following metrics:

  • all_hits
  • all_took_milliseconds
  • aggregated_hits
  • aggregated_hits_by_cluster_doc_count
  • aggregated_took_milliseconds
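
To confirm which metrics the exporter actually exposes, you can list the metric names found in its /metrics output. The following is a minimal sketch: list_metric_names is a hypothetical helper, and the service name in the usage comment is an assumption that depends on how the Helm chart names its resources.

```shell
# List the distinct metric names found in Prometheus text-format output
# read from stdin. Comment lines (# HELP, # TYPE) and blank lines are skipped.
list_metric_names() {
  grep -v -e '^#' -e '^[[:space:]]*$' |
    sed -E 's/^([a-zA-Z_:][a-zA-Z0-9_:]*).*/\1/' |
    sort -u
}

# Hypothetical usage against a live exporter. The service name below is an
# assumption; check `kubectl -n kublr get svc` for the actual name.
#   kubectl -n kublr port-forward svc/elastic-exporter-prometheus-es-exporter 9206:9206 &
#   curl -s http://localhost:9206/metrics | list_metric_names
```

If a query section is misconfigured, its metrics will most likely be missing from this list, which makes the check a quick first diagnostic.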

The "all" query is very simple: it contains no aggregations and just returns the total count of all documents in all ES indices, so each of its metrics contains only one series.


The "aggregated" query is a more realistic example.


It counts log records in the Kublr log indices over the last minute that contain a specific phrase ("hardware error" in this example).


It also contains an aggregation named "hits_by_cluster", which aggregates counts by Kublr cluster space and name.


Each aggregation included in an ES query results in an additional Prometheus metric that contains multiple series, one per aggregation result bucket.


In this example it will result in the metric "aggregated_hits_by_cluster_doc_count", which will contain multiple series - one for each separate Kublr cluster - labeled with additional labels "hits_by_cluster_cluster" and "hits_by_cluster_space".
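
When adapting the "aggregated" query to another phrase or time window, it can help to generate the query body and run it by hand before putting it into the exporter configuration. The sketch below assumes a hypothetical helper named build_phrase_query; the field names and index prefix are copied from the example above, and the curl invocation in the comment assumes it runs from a place where the Elasticsearch service is reachable and KIBANA_PASSWORD is set as in the deployment step.

```shell
# Print a search body that counts documents whose "log" field contains PHRASE
# within the last WINDOW (Elasticsearch date-math unit, e.g. 1m, 5m, 1h).
# PHRASE must not contain double quotes or backslashes: no JSON escaping is done.
build_phrase_query() {
  phrase="$1"
  window="${2:-1m}"
  cat <<JSON
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-${window}", "lt": "now" } } },
        { "prefix": { "_index": "kublr" } },
        { "match_phrase": { "log": "${phrase}" } }
      ]
    }
  },
  "aggs": {
    "hits_by_cluster": {
      "composite": {
        "sources": [
          { "space": { "terms": { "field": "cluster_space.keyword" } } },
          { "cluster": { "terms": { "field": "cluster_name.keyword" } } }
        ]
      }
    }
  }
}
JSON
}

# Hypothetical manual check (endpoint and headers taken from the Helm values above):
#   build_phrase_query "hardware error" 1m |
#     curl -sk -u "system.kibanaserver:${KIBANA_PASSWORD}" \
#       -H 'Content-Type: application/json' \
#       -H 'x-forwarded-for: 127.0.0.1' -H 'x-proxy-user: admin' -H 'x-proxy-roles: admin' \
#       "https://kublr-logging-elasticsearch-client.kublr:9200/_all/_search" -d @-
```

The buckets under aggregations.hits_by_cluster in the response should correspond to the series exposed for the aggregation metric.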



Using Prometheus ES Exporter metrics to create alerts


The metrics exported by Prometheus ES Exporter can be used to create alerts as usual.


For example, the "aggregated_hits_by_cluster_doc_count" metric can be used to generate alerts with the following custom alert definition:


alert: HardwareErrorInLogs
expr: aggregated_hits_by_cluster_doc_count > 0
annotations:
  summary: 'Log record including hardware error phrase'
  description: >-
    Log record including hardware error phrase
    in {{ $labels.hits_by_cluster_cluster }} cluster
    (kublr space: {{ $labels.hits_by_cluster_space }}).
labels:
  severity: warning
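
Before adding the alert to a cluster, you may want to check its syntax locally. The following is a sketch that wraps the alert in a standalone Prometheus rule file and validates it with promtool when that tool is installed; the group name es-exporter-alerts and the temporary file location are arbitrary choices made for illustration.

```shell
# Write the alert into a standalone Prometheus rule file (the group name is arbitrary).
rules_file="$(mktemp)"
cat > "$rules_file" <<'RULES'
groups:
- name: es-exporter-alerts
  rules:
  - alert: HardwareErrorInLogs
    expr: aggregated_hits_by_cluster_doc_count > 0
    annotations:
      summary: 'Log record including hardware error phrase'
      description: >-
        Log record including hardware error phrase
        in {{ $labels.hits_by_cluster_cluster }} cluster
        (kublr space: {{ $labels.hits_by_cluster_space }}).
    labels:
      severity: warning
RULES

# Validate only when promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rules_file"
else
  echo "promtool not found; skipping validation of $rules_file"
fi
```

The heredoc delimiter is quoted ('RULES') so that the shell leaves $labels untouched.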


As usual, custom alerts may be added to the Kublr Control Plane via its cluster spec as follows:



spec:
  features:
    monitoring:
      values:
        alertmanager:
          alerts:
          - alert: HardwareErrorInLogs
            expr: aggregated_hits_by_cluster_doc_count > 0
            annotations:
              summary: 'Log record including hardware error phrase'
              description: >-
                Log record including hardware error phrase
                in {{ $labels.hits_by_cluster_cluster }} cluster
                (kublr space: {{ $labels.hits_by_cluster_space }}).
            labels:
              severity: warning


Reference