
    Monitoring certificate expiration with Prometheus

    Introduction

Most modern network communication relies on TLS to stay safe from unauthorized third parties, so it is vital to keep track of the expiration dates of the TLS certificates in use. Prometheus is widely used in professional environments for event monitoring and alerting, for example to warn you when a TLS certificate is about to expire.

Now imagine that we have Prometheus installed in a Kubernetes cluster and want to be alerted when the TLS certificates that we use for reaching the applications running on the cluster are about to expire.

    This in and of itself should not pose a significant issue. But what if the certificates are not installed within the cluster, but are instead configured on a load balancer that is external to the cluster itself?

    We can’t just add a ServiceMonitor for something that is not inside the cluster, can we? Well, there is actually a way to do that. Let’s see it in detail:

Screenshot: checking the certificate status in Grafana, using the x509_cert input

    Telegraf

First, we install Telegraf on the load balancer and export the certificate status using the x509_cert input.

That input can check a local PEM file or a certificate served at a URL. We opted to simply check every .pem file in the /etc/ssl/private directory, but you can tune the behavior to your needs: see the plugin README for more information.
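
For reference, the same input can also probe a live endpoint instead of local files. A minimal sketch, with a placeholder hostname:

[[inputs.x509_cert]]
  # Check the certificate served by a remote endpoint (placeholder URL).
  sources = ["https://example.org:443"]
  timeout = "5s"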

While we’re at it, we can instruct it to also monitor disk and CPU usage, so that we can monitor the overall performance of the load balancer with Prometheus as well.

    Here is a configuration example:

[agent]
  # Collect every 60 seconds and report the metrics under the load balancer's hostname.
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = "blabla"
  omit_hostname = false

# Expose everything in Prometheus format on port 9273.
[[outputs.prometheus_client]]
  listen = ":9273"

# Basic host metrics: CPU and disk usage.
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  core_tags = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

# Certificate status for every PEM file in /etc/ssl/private.
[[inputs.x509_cert]]
  sources = ["/etc/ssl/private/*.pem"]
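
Once Telegraf is running, a quick sanity check on the load balancer itself should show the certificate metrics being exported; this assumes the default /metrics path of the prometheus_client output:

# Run on the load balancer: the x509_cert_* series should be listed.
curl -s http://localhost:9273/metrics | grep x509_cert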

If you are using load balancing software that can “talk” to Telegraf (we use the excellent traefik for that role), you can also collect its metrics and expose them to Prometheus. That way, scraping a single endpoint fetches the information for the node itself, the certificate status and the load balancer metrics.
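
As a sketch of what that can look like with traefik: assuming its Prometheus metrics are enabled and reachable on port 8080 (the URL is an assumption, adjust it to your setup), Telegraf's prometheus input can pull them in so they get re-exposed on the same :9273 endpoint:

[[inputs.prometheus]]
  # Scrape traefik's own Prometheus metrics and re-expose them
  # together with the host and certificate metrics (URL is an assumption).
  urls = ["http://127.0.0.1:8080/metrics"]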

    Endpoint

Normally, Kubernetes only knows about entities (like pods, services, and so on) that are configured within the cluster.

In order to let Kubernetes know that there is an external entity to handle, we have to configure an endpoint for it by hand. So we are going to create one for our load balancer, so that Kubernetes knows it exists.

    apiVersion: v1
    kind: Endpoints
    metadata:
      labels:
        k8s-app: lbmon
      name: lbmon
      namespace: monitoring
    subsets:
    - addresses:
      - ip: 10.0.1.101
      ports:
      - name: metrics
        port: 9273
        protocol: TCP

We will use the monitoring namespace for it, which is the same one we used for our Prometheus instance. That is not mandatory, but it makes sense from a logical point of view. Take note of the name and label you used; we’ll need them later.
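
To apply and verify the object (the file name here is just an example):

# Apply the Endpoints manifest and confirm the address was registered.
kubectl apply -f lbmon-endpoints.yaml
kubectl get endpoints lbmon -n monitoring
# The output should list 10.0.1.101:9273 under ENDPOINTS.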

Screenshot: checking the certificate status in Prometheus, using the x509_cert input

    Service

Now that we have the endpoint, we configure a service for it. This will allow us to create a ServiceMonitor for it in the next steps:

    apiVersion: v1
    kind: Service
    metadata:
      name: lbmon
      namespace: monitoring
      labels:
        k8s-app: lbmon
    spec:
      type: ExternalName
      externalName: 10.0.1.101
      clusterIP: ""
      ports:
      - name: metrics
        port: 9273
        protocol: TCP
        targetPort: 9273

Note that the name of the service must match the name of the Endpoints object, or it won’t work. Also note how we explicitly set an empty clusterIP and point externalName at the IP of the load balancer instead.

    The port must match the TCP port configured in Telegraf.
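
Before moving on, it is worth checking that the metrics port of the load balancer is actually reachable from inside the cluster, since that is where Prometheus will scrape from. A throwaway pod does the job; the image is just a convenient choice:

# Spawn a temporary pod and fetch the metrics straight from the load balancer IP.
kubectl run lbcheck -n monitoring --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s http://10.0.1.101:9273/metrics
# Look for the x509_cert_* series in the output; the pod is removed afterwards.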

    Find the correct selector

A Kubernetes cluster can have multiple Prometheus instances, each scraping a different set of ServiceMonitors. To decide which ServiceMonitors to pick up, every Prometheus instance has selector rules that filter them by label.

In order to get our monitor scraped by one or more Prometheus instances, we need to add the right labels to our ServiceMonitor declaration.

Getting that information is quite easy: just look at the description of the Prometheus custom resource to see which selector is configured:

    kubectl describe -n monitoring prometheus prometheus_instance_name_here

Look in the output for the Service Monitor Namespace Selector and Service Monitor Selector fields; those determine which ServiceMonitors this instance will scrape. In our case, this is:

[...]
  Service Monitor Namespace Selector:
  Service Monitor Selector:
    Match Labels:
      Release:   prom00
[...]

So we know that in order for our ServiceMonitor to be scraped, we need to add this label: release: prom00 (kubectl describe capitalizes the key, but the actual label is lowercase, as you can see in the manifests below).
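
If you prefer to query just that field, a jsonpath one-liner works too (the instance name is the same placeholder as above):

# Print only the ServiceMonitor selector of the Prometheus instance.
kubectl get prometheus prometheus_instance_name_here -n monitoring \
  -o jsonpath='{.spec.serviceMonitorSelector}{"\n"}'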

    ServiceMonitor

We can now provide a ServiceMonitor that instructs Prometheus to “keep an eye” on our load balancer service:

    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: lbmon
      namespace: monitoring
      labels:
        k8s-app: lbmon
        release: prom00
    spec:
      selector:
        matchLabels:
          k8s-app: lbmon
      namespaceSelector:
        matchNames:
        - monitoring
      endpoints:
      - port: metrics
        interval: 60s

The selector we use in the ServiceMonitor spec must match the label we used on our Service and our Endpoints object. Note that we also added the label release: prom00, which is needed by our Prometheus selector.
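
Once the ServiceMonitor is applied, the new target should appear in Prometheus within a scrape interval or two. The operator creates a prometheus-operated service in the same namespace, which makes it easy to reach the UI with a port-forward:

# Forward the Prometheus UI locally, then open http://localhost:9090/targets
# and look for the lbmon endpoint.
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090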

    Alerting

    Now that the monitoring is in place, we can add an alerting rule to inform us that “…the end is near…” (for the certificate, at least).

Note that we will need to add the release: prom00 label here as well, so that the rule gets picked up by our Prometheus instance (which will then route any firing alert to Alertmanager).

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: prom00
  name: certificate-alert
  namespace: monitoring
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpirationNear
      annotations:
        # common_name and source are tags set by the x509_cert input.
        message: Certificate {{ $labels.common_name }} ({{ $labels.source }}) expires in less than 15 days
        summary: Certificate expiration notice for {{ $labels.common_name }}
      expr: (x509_cert_enddate - time()) / 86400 <= 15
      for: 60m
      labels:
        severity: warning

Both x509_cert_enddate and time() are Unix timestamps in seconds, so dividing their difference by 86400 (the number of seconds in a day) gives the number of days left. The alert fires when the certificate expires in 15 days or less, and the for: 60m clause makes sure the condition persists for an hour before it goes off.
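
To preview the values the rule will evaluate, you can run the same expression against the Prometheus HTTP API, reusing the port-forward from the previous step:

# Query the number of days left for every monitored certificate.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(x509_cert_enddate - time()) / 86400'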

And there you have it! Your certificate expiration is now kept under the watchful eye of your faithful Prometheus.

    Conclusions

Monitoring the status and expiration dates of your TLS certificates is vital to keeping communications over computer networks secure. Using Prometheus and Telegraf in tandem makes it simple to keep an eye on certificate expiration dates, even when the certificates are stored outside of your Kubernetes cluster.