Install the Kafka App, Alerts, and view the Dashboards

This page has instructions for installing the Sumo App and Alerts for Kafka and descriptions of each of the app dashboards. These instructions assume you have already set up collection as described in the Collect Logs and Metrics for Kafka App page.

Pre-Packaged Alerts

Sumo Logic has provided out-of-the-box alerts available through Sumo Logic monitors to help you quickly determine if the Kafka cluster is available and performing as expected. These alerts are built based on metrics datasets and have preset thresholds based on industry best practices and recommendations.

For details on the individual alerts, please see this page.

Installing Alerts

  • To install these alerts, you need to have the Manage Monitors role capability.
  • Alerts can be installed by importing either a JSON file or a Terraform script.
  • Note: There are limits to how many alerts can be enabled. Please see the Alerts FAQ for details.
Method 1: Install the alerts by importing a JSON file:
  1. Download a JSON file that describes the monitors. 

    1. The alerts in the JSON file are based on Sumo Logic searches that have no scope filters, so they apply to all Kafka clusters whose data has been collected via the instructions in the previous sections. To restrict these alerts to specific clusters or environments, update the JSON file by replacing the text messaging_system=kafka with <Your Custom Filter>.

Custom filter examples: 

  1. For alerts applicable only to a specific cluster, your custom filter would be: messaging_cluster=Kafka-prod.01
  2. For alerts applicable to all clusters that start with Kafka-prod, your custom filter would be: messaging_cluster=Kafka-prod*
  3. For alerts applicable to a specific cluster within a production environment, your custom filter would be: messaging_cluster=Kafka-1 and environment=prod (This assumes you have set the optional environment tag while configuring collection.)
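
The replacement described above can also be scripted. The following is a minimal Python sketch, assuming the scope text appears verbatim as messaging_system=kafka inside the downloaded file; the function names and the sample JSON are hypothetical and are not part of any Sumo Logic tooling.

```python
from pathlib import Path

DEFAULT_SCOPE = "messaging_system=kafka"  # scope text named in the step above


def apply_custom_filter(json_text: str, custom_filter: str) -> str:
    """Return the monitor JSON text with the default scope replaced.

    Note: a filter containing double quotes would need JSON escaping first;
    the example filters in this section contain none.
    """
    if DEFAULT_SCOPE not in json_text:
        raise ValueError(f"'{DEFAULT_SCOPE}' not found; nothing to replace")
    return json_text.replace(DEFAULT_SCOPE, custom_filter)


def rewrite_file(src: str, dst: str, custom_filter: str) -> None:
    """Read src, apply the filter, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(apply_custom_filter(text, custom_filter), encoding="utf-8")


# Hypothetical one-line sample standing in for the downloaded file.
sample = '{"query": "messaging_system=kafka metric=example"}'
print(apply_custom_filter(sample, "messaging_cluster=Kafka-prod*"))
```

Writing to a separate output file keeps the original download intact in case you want to re-import the unscoped alerts later.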
  2. Go to Manage Data > Alerts > Monitors.
  3. Click Add.
  4. Click Import to import monitors from the JSON above.

Method 2: Install the alerts using a Terraform script

Step 1: Generate a Sumo Logic access key and ID

Generate an access key and access ID for a user that has the Manage Monitors role capability in Sumo Logic using these instructions. Please identify which deployment your Sumo Logic account is in, using this link.

Step 2: Download and install Terraform 0.13 or later 

Step 3: Download the Sumo Logic Terraform package for Kafka alerts

The alerts package is available in the Sumo Logic GitHub repository. You can either download it with the “git clone” command or as a zip file.

Step 4: Alert Configuration 

After the package has been extracted, navigate to the package directory terraform-sumologic-sumo-logic-monitor/monitor_packages/Kafka/

Edit the monitor.auto.tfvars file and add the Sumo Logic Access Key, Access ID and Deployment from Step 1.

access_id   = "<SUMOLOGIC ACCESS ID>"
access_key  = "<SUMOLOGIC ACCESS KEY>"
environment = "<SUMOLOGIC DEPLOYMENT>"

The Terraform script installs the alerts without any scope filters. To restrict the alerts to specific clusters or environments, update the variable kafka_data_source. Custom filter examples:

  1. For alerts applicable only to a specific cluster, your custom filter would be: messaging_cluster=Kafka-prod.01
  2. For alerts applicable to all clusters that start with Kafka-prod, your custom filter would be: messaging_cluster=Kafka-prod*
  3. For alerts applicable to a specific cluster within a production environment, your custom filter would be: messaging_cluster=Kafka-1 and environment=prod
    (This assumes you have set the optional environment tag while configuring collection.)

All monitors are disabled by default on installation. To enable all the monitors, set the parameter monitors_disabled to false in this file.

By default, the monitors are configured in a monitor folder called “Kafka”. To change the name of the folder, update the monitor folder name in this file.

If you would like the alerts to send email or connection notifications, configure these in the file notifications.auto.tfvars. For configuration examples, refer to the next section.

Step 5: Email and Connection Notification Configuration Examples

To configure notifications, edit the notifications.auto.tfvars file and fill in the connection_notifications and email_notifications sections. See the examples for PagerDuty and email notifications below. See this document for creating payloads with other connection types.

PagerDuty Connection Example:
connection_notifications = [
    {
      connection_type       = "PagerDuty",
      connection_id         = "<CONNECTION_ID>",
      payload_override      = "{\"service_key\": \"your_pagerduty_api_integration_key\",\"event_type\": \"trigger\",\"description\": \"Alert: Triggered {{TriggerType}} for Monitor {{Name}}\",\"client\": \"Sumo Logic\",\"client_url\": \"{{QueryUrl}}\"}",
      run_for_trigger_types = ["Critical", "ResolvedCritical"]
    },
    {
      connection_type       = "Webhook",
      connection_id         = "<CONNECTION_ID>",
      payload_override      = "",
      run_for_trigger_types = ["Critical", "ResolvedCritical"]
    }
  ]

Replace <CONNECTION_ID> with the connection id of the webhook connection. The webhook connection id can be retrieved by calling the Monitors API.
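
Hand-escaping the payload_override string is error-prone. The following is a minimal Python sketch that generates the escaped value from a dictionary, assuming that Terraform's quoted-string escaping matches JSON string escaping for the characters involved (double quotes and backslashes). The helper name is hypothetical; the payload fields are the ones from the PagerDuty example above.

```python
import json


def build_payload_override(payload: dict) -> str:
    """Serialize payload to JSON, then quote it as a string literal.

    The inner json.dumps produces the webhook payload. The outer json.dumps
    wraps that text in double quotes and backslash-escapes the inner quotes,
    which gives the same shape as the payload_override value shown above.
    """
    return json.dumps(json.dumps(payload))


# Fields copied from the PagerDuty example; replace the key with your own.
pagerduty_payload = {
    "service_key": "your_pagerduty_api_integration_key",
    "event_type": "trigger",
    "description": "Alert: Triggered {{TriggerType}} for Monitor {{Name}}",
    "client": "Sumo Logic",
    "client_url": "{{QueryUrl}}",
}

print("payload_override = " + build_payload_override(pagerduty_payload))
```

The printed line can be pasted into notifications.auto.tfvars. Decoding the value twice with json.loads should return the original dictionary, which is a quick way to check the escaping.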

Email Notifications Example:
email_notifications = [
    {
      connection_type       = "Email",
      recipients            = ["abc@example.com"],
      subject               = "Monitor Alert: {{TriggerType}} on {{Name}}",
      time_zone             = "PST",
      message_body          = "Triggered {{TriggerType}} Alert on {{Name}}: {{QueryURL}}",
      run_for_trigger_types = ["Critical", "ResolvedCritical"]
    }
  ]

Step 6: Install the Alerts

  1. Navigate to the package directory terraform-sumologic-sumo-logic-monitor/monitor_packages/Kafka/ and run terraform init. This will initialize Terraform and will download the required components.
  2. Run terraform plan to view the monitors which will be created/modified by Terraform.
  3. Run terraform apply.
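
If you want to script the three commands above, the following is a minimal Python sketch. It assumes the terraform executable is on your PATH; the function name and the dry_run flag are hypothetical conveniences, not part of the Sumo Logic package.

```python
import shutil
import subprocess
from typing import List

# The three commands from the steps above, in order.
TERRAFORM_STEPS: List[List[str]] = [
    ["terraform", "init"],
    ["terraform", "plan"],
    ["terraform", "apply"],
]


def run_terraform_steps(package_dir: str, dry_run: bool = True) -> List[str]:
    """Run (or, with dry_run, only list) the install commands in package_dir."""
    planned = [" ".join(cmd) for cmd in TERRAFORM_STEPS]
    if dry_run:
        return planned
    if shutil.which("terraform") is None:
        raise RuntimeError("terraform executable not found on PATH")
    for cmd in TERRAFORM_STEPS:
        # check=True stops at the first failing command.
        subprocess.run(cmd, cwd=package_dir, check=True)
    return planned


# Dry run: lists the commands without executing anything.
print(run_terraform_steps("terraform-sumologic-sumo-logic-monitor/monitor_packages/Kafka/"))
```

With dry_run=False, terraform apply still prompts for confirmation in the terminal, so you can review the plan before any monitors are created.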

Step 7: Post Installation

If you haven’t enabled alerts and/or configured notifications through the Terraform procedure outlined above, we highly recommend enabling alerts of interest and configuring each enabled alert to send notifications to other people or services. This is detailed in Step 4 of this document.

Install the App

This section demonstrates how to install the Kafka App.

To install the app:

Locate and install the app you need from the App Catalog. If you want to see a preview of the dashboards included with the app before installing, click Preview Dashboards.

  1. From the App Catalog, search for and select the app. 

  2. Select the version of the service you're using and click Add to Library.

  3. To install the app, complete the following fields.

    1. App Name. You can retain the existing name, or enter a name of your choice for the app.


    2. Data Source. 

      • Choose Enter a Custom Data Filter, and enter a custom Kafka cluster filter. Examples: 

        1. For all Kafka clusters:
          messaging_cluster=*

        2. For a specific cluster:
          messaging_cluster=Kafka.dev.01

        3. Clusters within a specific environment:
          messaging_cluster=Kafka-1 and environment=prod 
          (This assumes you have set the optional environment tag while configuring collection)

    3. Advanced. Select the Location in Library (the default is the Personal folder in the library), or click New Folder to add a new folder.

    4. Click Add to Library.

Once an app is installed, it will appear in your Personal folder, or other folder that you specified. From here, you can share it with your organization. 

Panels will start to fill automatically. It's important to note that each panel slowly fills with data matching the time range query and received since the panel was created. Results won't immediately be available, but with a bit of time, you'll see full graphs and maps.

Dashboard Filters with Template Variables

Template variables provide dynamic dashboards that rescope data on the fly. As you apply variables to troubleshoot through your dashboard, you can view dynamic changes to the data for a fast resolution to the root cause. For more information, see the Filter with template variables help page.

Kafka - Cluster Overview

The Kafka - Cluster Overview dashboard gives you an at-a-glance view of your Kafka deployment across brokers, controllers, topics, partitions and zookeepers.

Use this dashboard to:

  • Identify when brokers don’t have active controllers.
  • Analyze trends across Request Handler Idle percentage metrics. Kafka’s request handler threads are responsible for servicing client requests (reading from and writing to disk). If the request handler threads get overloaded, requests take longer to complete. If the request handler idle percent is consistently below 0.2 (20%), it may indicate that your cluster is overloaded and requires more resources.
  • Determine the number of leaders, partitions and zookeepers across each cluster and ensure they match expectations.
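
The 20% guideline above can be expressed as a simple check. This is a minimal Python sketch, assuming you already have a window of request handler idle values expressed as fractions between 0 and 1. The function name is hypothetical, and "below 0.2" is interpreted here as every sample in the window falling under the threshold.

```python
from typing import Sequence


def handlers_overloaded(idle_samples: Sequence[float], threshold: float = 0.2) -> bool:
    """Return True when every idle sample in the window is below threshold.

    An empty window returns False, since there is no evidence of overload.
    """
    if not idle_samples:
        return False
    return all(sample < threshold for sample in idle_samples)


# Illustrative values only, not real measurements.
print(handlers_overloaded([0.12, 0.08, 0.15]))  # every sample is below 0.2
print(handlers_overloaded([0.12, 0.45, 0.15]))  # one sample recovers above 0.2
```

Requiring every sample to be low avoids flagging a single short dip. A looser rule, such as a percentage of samples, is equally reasonable depending on how noisy your metric is.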

Kafka - Outlier Analysis

The Kafka - Outlier Analysis dashboard helps you identify outliers for key metrics across your Kafka clusters.

Use this dashboard to:

  • Analyze trends and quickly discover outliers across key metrics of your Kafka clusters.

Kafka - Replication

The Kafka - Replication dashboard helps you understand the state of replicas in your Kafka clusters.

Use this dashboard to:

Monitor the following key metrics:

  • In-Sync Replicas (ISR) Expand Rate - The ISR Expand Rate metric displays the one-minute rate of increases in the number of In-Sync Replicas (ISR). ISR expansions occur when a broker comes online, such as when recovering from a failure or adding a new node. This increases the number of in-sync replicas available for each partition on that broker. The expected value for this rate is normally zero.
  • In-Sync Replicas (ISR) Shrink Rate - The ISR Shrink Rate metric displays the one-minute rate of decreases in the number of In-Sync Replicas (ISR). ISR shrinks occur when an in-sync broker goes down, as it decreases the number of in-sync replicas available for each partition replica on that broker. The expected value for this rate is normally zero.
  • ISR Shrink Vs Expand Rate - If you see a spike in ISR Shrink Rate followed by a spike in ISR Expand Rate, nodes may have fallen behind on replication and have either recovered or are in the process of recovering.
  • Failed ISR Updates
  • Under Replicated Partitions Count
  • Under Min ISR Partitions Count - The Under Min ISR Partitions metric displays the number of partitions where the number of In-Sync Replicas (ISR) is less than the minimum number of in-sync replicas specified. The two most common causes of under-min ISR partitions are that one or more brokers are unresponsive, or the cluster is experiencing performance issues and one or more brokers are falling behind. The expected value for this metric is normally zero.

Kafka - Zookeeper

The Kafka - Zookeeper dashboard provides an at-a-glance view of the state of your partitions, active controllers, leaders, throughput and network across Kafka brokers and clusters.

Use this dashboard to:

Monitor key Zookeeper metrics such as:

  • Zookeeper disconnect rate - This metric indicates if a Zookeeper node has lost its connection to a Kafka broker.
  • Authentication Failures - This metric indicates a Kafka broker is unable to connect to its Zookeeper node.
  • Session Expiration - When a session between a Kafka broker and a Zookeeper node expires, leader changes can occur and the broker can be assigned a new controller. If this metric is increasing, we recommend you:
    1. Check the health of your network.
    2. Check for garbage collection issues and tune your JVMs accordingly.
  • Connection Rate.

Kafka - Broker

The Kafka - Broker dashboard provides an at-a-glance view of the state of your partitions, active controllers, leaders, throughput, and network across Kafka brokers and clusters.

Use this dashboard to:

  • Monitor under-replicated and offline partitions to quickly identify if a Kafka broker is down or overutilized.
  • Monitor the Unclean Leader Election count metric, which shows the number of failures to elect a suitable leader per second. Unclean leader elections occur when there are no available in-sync replicas for a partition (due to network issues, lag causing the broker to fall behind, or brokers going down completely), so an out-of-sync replica is the only option for the leader. When an out-of-sync replica is elected leader, all data not replicated from the previous leader is lost forever.
  • Monitor producer and fetch request rates.
  • Monitor the log flush rate to determine the rate at which log data is written to disk.

Kafka - Failures and Delayed Operations

The Kafka - Failures and Delayed Operations dashboard gives you insight into all failures and delayed operations associated with your Kafka clusters.

Use this dashboard to:

  • Analyze failed produce requests - A failed produce request occurs when a problem is encountered while processing a produce request. This can happen for a variety of reasons; some common ones are:
    • The destination topic doesn’t exist (if auto-create is enabled then subsequent messages should be sent successfully).
    • The message is too large.
    • The producer is using request.required.acks=all or -1, and fewer than the required number of acknowledgements are received.
  • Analyze failed fetch requests - A failed fetch request occurs when a problem is encountered while processing a fetch request. This can happen for a variety of reasons, but the most common cause is consumer requests timing out.
  • Monitor delayed operations metrics - These metrics show the number of requests that are delayed and waiting in purgatory. The purgatory size metric can be used to determine the root cause of latency. For example, increased consumer fetch times could be explained by an increased number of fetch requests waiting in purgatory. Available metrics are:
    • Fetch Purgatory Size - The Fetch Purgatory Size metric shows the number of fetch requests currently waiting in purgatory. Fetch requests are added to purgatory if there is not enough data to fulfil the request (determined by fetch.min.bytes in the consumer configuration), and the requests wait in purgatory until the time specified by fetch.wait.max.ms is reached or enough data becomes available.
    • Produce Purgatory Size - The Produce Purgatory Size metric shows the number of produce requests currently waiting in purgatory. Produce requests are added to purgatory if request.required.acks is set to -1 or all, and the requests wait in purgatory until the partition leader receives an acknowledgement from all its followers. If the purgatory size metric keeps growing, some partition replicas may be overloaded. If this is the case, you can choose to increase the capacity of your cluster or decrease the number of produce requests being generated.

Kafka - Request-Response Times

The Kafka - Request-Response Times dashboard helps you get insight into key request and response latencies of your Kafka cluster.

Use this dashboard to:

  • Monitor request time metrics - The Request Metrics metric group contains information regarding different types of requests to and from the cluster. Important request metrics to monitor:
    1. Fetch Consumer Request Total Time - The Fetch Consumer Request Total Time metric shows the maximum and mean amount of time taken for processing, and the number of requests from consumers to get new data. Reasons for increased time taken could be increased load on the node (creating processing delays), or requests being held in purgatory for a long time (determined by the fetch.min.bytes and fetch.wait.max.ms settings).
    2. Fetch Follower Request Total Time - The Fetch Follower Request Total Time metric displays the maximum and mean amount of time taken while processing, and the number of requests to get new data from Kafka brokers that are followers of a partition. Common causes of increased time taken are increased load on the node causing delays in processing requests, or that some partition replicas may be overloaded or temporarily unavailable.
    3. Produce Request Total Time - The Produce Request Total Time metric displays the maximum and mean amount of time taken for processing, and the number of requests from producers to send data. Some reasons for increased time taken could be increased load on the node causing delays in processing the requests, or requests being held in purgatory for a long time (if request.required.acks is set to -1 or all).

Kafka - Logs

This dashboard helps you quickly analyze your Kafka error logs across all clusters.

Use this dashboard to:

  • Identify critical events in your Kafka broker and controller logs.
  • Examine trends to detect spikes in Error or Fatal events.
  • Monitor Broker added/started and shutdown events in your cluster.
  • Quickly determine patterns across all logs in a given Kafka cluster.

Kafka Broker - Performance Overview

The Kafka Broker - Performance Overview dashboard helps you get an at-a-glance view of the performance and resource utilization of your Kafka brokers and their JVMs.

Use this dashboard to:

  • Monitor the number of open file descriptors. If the number of open file descriptors reaches the maximum allowed, it can cause an IOException error.
  • Get insight into garbage collection and its impact on CPU usage and memory.
  • Examine how threads are distributed.
  • Understand the behavior of class count. If class count keeps on increasing, you may have a problem with the same classes loaded by multiple classloaders.

Kafka Broker - CPU

The Kafka Broker - CPU dashboard shows information about the CPU utilization of individual Broker machines.

Use this dashboard to:

  • Get insights into the process and user CPU load of Kafka brokers. High CPU utilization can make Kafka flaky and can cause read/write timeouts.

Kafka Broker - Memory

The Kafka Broker - Memory dashboard shows the percentage of heap and non-heap memory used, as well as the physical and swap memory usage, of your Kafka broker’s JVM.

Use this dashboard to:

  • Understand how memory is used across Heap and Non-Heap memory. 
  • Examine physical and swap memory usage and make resource adjustments as needed.
  • Examine the pending object finalization count, which, when high, can lead to excessive memory usage.

Kafka Broker - Disk Usage

The Kafka Broker - Disk Usage dashboard helps monitor disk usage across your Kafka Brokers.

Use this dashboard to:

  • Monitor Disk Usage percentage on Kafka Brokers. This is critical as Kafka brokers use disk space to store messages for each topic. Other factors that affect disk utilization are:
    1. Topic replication factor of Kafka topics.
    2. Log retention settings.
  • Analyze trends in disk throughput and find any spikes. This is especially important as disk throughput can be a performance bottleneck.
  • Monitor inodes, bytes used, and disk reads vs. writes. These metrics are important to monitor because Kafka may not necessarily redistribute data away from a heavily occupied disk, which can itself bring Kafka down.

Kafka Broker - Garbage Collection

The Kafka Broker - Garbage Collection dashboard shows key Garbage Collector statistics like the duration of the last GC run, objects collected, threads used, and memory cleared in the last GC run of your Java virtual machine.

Use this dashboard to:

  • Understand the amount of time spent in garbage collection. If this time keeps increasing, your Kafka brokers may have higher CPU usage.
  • Understand the amount of memory cleared by garbage collectors across memory pools and their impact on the Heap memory.

Kafka Broker - Threads

The Kafka Broker - Threads dashboard shows key insights into the usage and type of threads created in your Kafka broker JVM.

Use this dashboard to:

  • Understand the dynamic behavior of the system using peak, daemon, and current threads.
  • Gain insights into the memory and CPU time of the last executed thread.

Kafka Broker - Class Loading and Compilation

The Kafka Broker - Class Loading and Compilation dashboard helps you get insights into the behavior of class count trends.

Use this dashboard to:

  • Determine whether the class count keeps increasing, which indicates that the same classes are being loaded by multiple classloaders.
  • Get insights into the time spent by the Java virtual machine during compilation.

Kafka - Topic Overview

The Kafka - Topic Overview dashboard helps you quickly identify under-replicated partitions, and incoming bytes by Kafka topic, server and cluster. 

Use this dashboard to:

  • Monitor under-replicated partitions - The Under Replicated Partitions metric displays the number of partitions that do not have enough replicas to meet the desired replication factor. A partition is also considered under-replicated if the correct number of replicas exist, but one or more of the replicas have fallen significantly behind the partition leader. The two most common causes of under-replicated partitions are that one or more brokers are unresponsive, or the cluster is experiencing performance issues and one or more brokers have fallen behind.

This metric is tagged with cluster, server, and topic info for easy troubleshooting. The colors in the Honeycomb chart are coded as follows:

  1. Green indicates there are no under-replicated partitions.
  2. Red indicates a given partition is under-replicated.
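
The color coding above amounts to a zero-versus-nonzero rule. This is a minimal Python sketch of that rule, assuming the chart is driven by the under-replicated partition count for each cluster, server, and topic combination. The function name and the sample tags are hypothetical.

```python
def honeycomb_color(under_replicated_partitions: int) -> str:
    """Map an under-replicated partition count to the chart color."""
    if under_replicated_partitions < 0:
        raise ValueError("partition count cannot be negative")
    return "green" if under_replicated_partitions == 0 else "red"


# Hypothetical (cluster, server, topic) tags with illustrative counts.
samples = {
    ("Kafka-prod.01", "broker-1", "orders"): 0,
    ("Kafka-prod.01", "broker-2", "orders"): 3,
}
for tags, count in samples.items():
    print(tags, honeycomb_color(count))
```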

Kafka - Topic Details

The Kafka - Topic Details dashboard gives you insight into throughput, partition sizes and offsets across Kafka brokers, topics and clusters.

Use this dashboard to:

  • Monitor metrics like log partition size, log start offset, and log segment count.
  • Identify the count of offline and under-replicated partitions. Partitions can be in this state on account of resource shortages or broker unavailability.
  • Monitor the In-Sync Replica (ISR) Shrink rate. ISR shrinks occur when an in-sync broker goes down, as it decreases the number of in-sync replicas available for each partition replica on that broker.
  • Monitor the In-Sync Replica (ISR) Expand rate. ISR expansions occur when a broker comes online, such as when recovering from a failure or adding a new node. This increases the number of in-sync replicas available for each partition on that broker.