Skip to main content

Host and Process Metrics

Thumbnail icon

The Sumo Logic app for Host and Process Metrics allows you to monitor the performance and resource utilization of hosts and processes that your mission critical applications are dependent upon. Preconfigured dashboards provide insight into CPU, memory, network, file descriptors, page faults, and TCP connectors. This app uses Telegraf, an open-source, plugin-based collector for the collection of both host and process metrics data.

This app uses Telegraf and associated input plugins to collect both host and process metrics. To use the installed collector to collect and analyze host metrics, please see the Host Metrics app.

This app has been validated on Linux(Ubuntu 20.04.2 LTS) and Windows (Microsoft Windows Server 2019) and higher using Telegraf 1.18.2. This app is not recommended in Kubernetes environments; instead please use the Kubernetes app.

Collecting metrics for Host and Processes

We use the Telegraf agent for Host and Process metrics collection. Telegraf runs on the same system and uses the input plugins to obtain host and process metrics, and the Sumo Logic output plugin to send the metrics to Sumo Logic.

The following input plugins are used by Sumo Logic

For Linux:

For Windows:

Configuring metrics collection

This section provides instructions for configuring metrics collection for the Sumo Logic app for Host and Process metrics. Follow the below instructions to set up the metric collection for a given machine:

  1. Configure a Hosted Collector. To create a new Sumo Logic hosted collector, perform the steps in the Configure a Hosted Collector section of the Sumo Logic documentation.
  2. Configure an HTTP Logs and Metrics Source. Create a new HTTP Logs and Metrics Source in the hosted collector created above by following these instructions. Suggestions for setting your source category:
    • For identifying a specific cluster or a group of hosts: <clustername>/metrics
    • For identifying a group of hosts within a given deployment: <environment name>/<clustername>/metrics. Make a note of the HTTP Source URL and source category.
  3. Install Telegraf. Use the following steps to install Telegraf on each host machine.
  4. Configure and start Telegraf. As part of collecting metrics data from Telegraf, we will use the input plugins (described earlier) to get data from Telegraf and the Sumo Logic output plugin to send data to Sumo Logic.
    • Create or modify telegraf.conf (in linux it’s located in /etc/telegraf/telegraf.d/ and on Windows, it’s located in C:\Program Files\InfluxData\Telegraf\). Copy and paste the inputs, outputs and processors section from one of the below files.
    • Enter values for the following parameters (marked with CHANGE_ME) in the downloaded file:
      • In the output plugins section which is [[outputs.sumologic]]:
        • URL - This is the HTTP source URL created in step 3. Please see this doc for more information on additional parameters for configuring the Sumo Logic Telegraf output plugin.
        • Do not modify the following values set by this Telegraf configuration as it will cause the Sumo Logic app to not function correctly.
          • data_format = “carbon2” In the output plugins section, which is [[outputs.sumologic]], This indicates that metrics should be sent in the carbon2 format to Sumo Logic.
        • For other optional parameters refer to the respective plugindocumentation for configuring the input plugins for Telegraf.
        • For all other parameters, see this doc for more parameters that can be configured in the Telegraf agent globally.

Once you have finalized your telegraf.conf file, you can start or reload the telegraf service using instructions from the doc.

At this point, host and process metrics should start flowing into Sumo Logic.

See Telegraf’s Metric filtering capabilities to exclude metrics that you do not need from being sent to Sumo Logic.

Monitoring processes with certain pattern

exe. Selects the processes that have process names that match the string that you specify

pattern. Selects the processes that have command lines (including parameters and options used with the command) matching the specified string using regular expression matching rules.

For Linux

On Linux servers, the strings that you specify in an exe or pattern section are evaluated as regular expressions.

Example: For filtering executable name containing nginx (i.e., pgrep <exe>)

[[inputs.procstat]]
pid_tag=false
exe = "nginx"

Example: For filtering command lines containing config (i.e., pgrep -f <pattern>)

[[inputs.procstat]]
pid_tag=false
pattern = "config"

For Windows

On servers running Windows Server, these strings are evaluated as WMI queries. (Ex pattern: "%apache%"). For more information, see LIKE Operator.

Example: For filtering executable name containing nginx.

[[inputs.procstat]]
pid_finder = "native"
pid_tag=false
exe = "%nginx%"

Example: For filtering command lines containing config.

[[inputs.procstat]]
pid_finder = "native"
pid_tag=false
pattern = "%config%"

Example: For defining multiple patterns for multiple processes you can use the plugin multiple times.

[[inputs.procstat]]
pid_tag=false
exe = "nginx"

[[inputs.procstat]]
exe = "tomcat"
pid_tag=false

Troubleshooting

  • To identify the operating system version and name.
    • For Windows machines, run the command in Powershell to get the OS Version.
      [System.Environment]::OSVersion.Version
      (Get-WmiObject -class Win32_OperatingSystem).Caption
    • For Linux, run below command in terminal.
      uname -a
      lsb_release -a
  • To enable debug logs, set “debug = true” flag in telegraf.conf and run the command, it will output error in stdout.
    telegraf --config telegraf.conf --test
  • If the telegraf conf changes are not reflecting, make sure to restart Telegraf using the command
    • Windows: ./telegraf.exe --service restart
    • Linux: sudo service telegraf restart
  • If certain metrics are not coming you may have to run the telegraf agent as root. Check the respective plugin documentation for more information.

Sample queries

CPU Utilization by Host panel in Host Metrics - CPU Dashboard

host.name=*  cpu=cpu-total  metric=(host_cpu_usage_user OR host_cpu_usage_system OR host_cpu_usage_iowait OR host_cpu_usage_steal OR host_cpu_usage_softirq OR host_cpu_usage_irq OR host_cpu_usage_nice) | sum by host.name

CPU Usage panel in Process Metrics Details Dashboard

metric=procstat_cpu_usage host.name=*  process.executable.name=* | avg by host.name, process.executable.name | outlier

Installing the Alerts

The next few sections provide instructions for installing the Sumo app and Alerts for hosts and processes. These instructions assume you have already set up a collection as described in Collecting Metrics for Host and Processes.

Sumo Logic has provided out-of-the-box alerts available through Sumo Logic monitors to help you monitor your hosts and processes. These alerts are built based on metrics and logs datasets and include preset thresholds based on industry best practices and recommendations.

For details on the individual alerts, see last section.

  • To install these alerts, you need to have the Manage Monitors role capability.
  • Alerts can be installed by either importing them a JSON or a Terraform script.

There are limits to how many alerts can be enabled - please see the Alerts FAQ for details.

Method A: Importing a JSON file

  1. Download the JSON file describing all the monitors.
    The JSON contains the alerts that are based on Sumo Logic searches that do not have any scope filters and therefore will be applicable to all hosts, the data for which has been collected via the instructions in the previous sections. However, if you would like to restrict these alerts to specific hosts or environments, update the JSON file by replacing the text $$hostmetrics_data_source with <your sourceCategory>. SourceCategory examples:
    • For alerts applicable only to a specific cluster of hosts, your custom filter could be: '_sourceCategory=yourclustername/metrics'.
    • For alerts applicable to all hosts that start with ec2hosts-prod, your custom filter could be: '_sourceCategory=ec2hosts-prod*/metrics'.
    • For alerts applicable to a specific cluster within a production environment, your custom filter could be: '_sourceCategory=prod/yourclustername/metrics'
  2. Go to Manage Data > Alerts > Monitors.
  3. Click Add.
  4. Click Import to import monitors from the JSON above.

The monitors are disabled by default. Once you have installed the alerts using this method, navigate to the Host and Process Metrics folder under Monitors to configure them. See this document to enable monitors, to configure each monitor, to send notifications to teams or connections please see the instructions detailed in Alert Configuration of this document.

Method B: Using a Terraform script

  1. Generate a Sumo Logic access key and ID. Generate an access key and access ID for a user that has the Manage Monitors role capability in Sumo Logic using these instructions. Please identify which deployment your Sumo Logic account is in, using this link.

  2. Download and install Terraform 0.13 or later.

  3. Download the Sumo Logic Terraform package for Host and Process alerts. The alerts package is available in the Sumo Logic GitHub repository. You can either download it through the “git clone” command or as a zip file.

  4. Alert Configuration. After the package has been extracted, navigate to the package directory terraform-sumologic-sumo-logic-monitor/monitor_packages/host_process_metrics/. Edit the host_and_processes.auto.tfvars file and add the Sumo Logic Access Key, Access Id, and Deployment from Step 1.

    access_id   = "<SUMOLOGIC ACCESS ID>"
    access_key = "<SUMOLOGIC ACCESS KEY>"
    environment = "<SUMOLOGIC DEPLOYMENT>"

    Update the variable host_and_processes_data_source with your source category. SourceCategory examples:

    • For alerts applicable only to a specific cluster of hosts, your custom filter could be: _sourceCategory=yourclustername/metrics.
    • For alerts applicable to all hosts that start with ec2hosts-prod, your custom filter could be:_sourceCategory=ec2hosts-prod*/metrics.
    • For alerts applicable to a specific cluster within a production environment, your custom filter could be: _sourceCategory=prod/yourclustername/metrics.
  5. All monitors are disabled by default on installation. If you would like to enable all the monitors, set the parameter monitors_disabled to false in this file.

  6. By default, the monitors are configured in a monitor folder called “Host and “Process Metrics”, if you would like to change the name of the folder, update the monitor folder name in this file.

  7. Email and Connection Notification Configuration Examples. If you would like the alerts to send email or connection notifications, configure these in the file host_process_metrics_notifications.auto.tfvars.

    • To configure notifications, modify the file host_process_metrics_notifications.auto.tfvars file and fill in the connection_notifications and email_notifications sections. See the examples for PagerDuty and email notifications below. See this document for creating payloads with other connection types. Replace <CONNECTION_ID> with the connection id of the webhook connection. The webhook connection id can be retrieved by calling the Monitors API.
    Pagerduty Connection Example:
    connection_notifications = [
    {
    connection_type = "PagerDuty",
    connection_id = "<CONNECTION_ID>",
    payload_override = "{\"service_key\": \"your_pagerduty_api_integration_key\",\"event_type\": \"trigger\",\"description\": \"Alert: Triggered {{TriggerType}} for Monitor {{Name}}\",\"client\": \"Sumo Logic\",\"client_url\": \"{{QueryUrl}}\"}",
    run_for_trigger_types = ["Critical", "ResolvedCritical"]
    },
    {
    connection_type = "Webhook",
    connection_id = "<CONNECTION_ID>",
    payload_override = "",
    run_for_trigger_types = ["Critical", "ResolvedCritical"]
    }
    ]
    Email Notifications Example
    email_notifications = [
    {
    connection_type = "Email",
    recipients = ["abc@example.com"],
    subject = "Monitor Alert: {{TriggerType}} on {{Name}}",
    time_zone = "PST",
    message_body = "Triggered {{TriggerType}} Alert on {{Name}}: {{QueryURL}}",
    run_for_trigger_types = ["Critical", "ResolvedCritical"]
    }
    ]
  8. Install the Alerts.

    1. Navigate to the package directory terraform-sumologic-sumo-logic-monitor/monitor_packages/host_process_metrics/ and run terraform init. This will initialize Terraform and will download the required components.
    2. Run terraform plan to view the monitors which will be created/modified by Terraform.
    3. Run terraform apply.

Post Installation

If you haven’t enabled alerts or configured notifications through the Terraform procedure outlined above, we highly recommend enabling alerts of interest and configuring each enabled alert to send notifications to other people or services. This is detailed in Step 4 of this document.

Installing the Host and Process Metrics app

To install the app, do the following:

  1. From the Sumo Logic navigation, select App Catalog.
  2. In the Search Apps field, search for and then select your app.
    Optionally, you can scroll down to preview the dashboards included with the app.
  3. To install the app, click Install App.
  4. Click Next in the Setup Data section.
  5. In the Configure section of the respective app, complete the following fields.
    1. Key. Select either of these options for the data source.
      • Choose Source Category, and select a source category from the list for Default Value.
      • Choose Custom, and enter a custom metadata field. Insert its value in Default Value.
  6. Click Next. You will be redirected to the Preview & Done section.

Your app will be installed in the Installed Apps folder and dashboard panels will start to fill automatically.

Each panel slowly fills with data matching the time range query and received since the panel was created. Results will not immediately be available, updating with full graphs and charts over time.

Viewing Host and Process Metrics dashboards​

All dashboards have a set of filters that you can apply to the entire dashboard. Use these filters to drill down and examine the data to a granular level.

  • You can change the time range for a dashboard or panel by selecting a predefined interval from a drop-down list, choosing a recently used time range, or specifying custom dates and times. Learn more.
  • You can use template variables to drill down and examine the data on a granular level. For more information, see Filter with template variables.
  • Most Next-Gen apps allow you to provide the scope at the installation time and are comprised of a key (_sourceCategory by default) and a default value for this key. Based on your input, the app dashboards will be parameterized with a dashboard variable, allowing you to change the dataset queried by all panels. This eliminates the need to create multiple copies of the same dashboard with different queries.

Overview

The Host Metrics - Overview dashboard gives you an at-a-glance view of the key metrics like CPU, memory, disk, network, and TCP connections of all your hosts. You can drill down from this dashboard to the Host Metrics - CPU/Disk/Memory/Network/TCP dashboard by using the honeycombs or line charts in all the panels.

Use this dashboard to:

  • Identify hosts with high CPU, disk, memory utilization, and identify anomalies over time.
Host Metrics dashboards

CPU

The Host Metrics - CPU dashboard provides a detailed analysis based on CPU metrics. You can drill down from this dashboard to the Process Metrics - Details dashboard by using the honeycombs or line charts in all the panels.

Use this dashboard to:

  • Identify hosts and processes with high CPU utilization.
  • Examine CPU usage by type and identify anomalies over time.
Host Metrics dashboards

Disk

The Host Metrics - Disk dashboard provides detailed information about on disk utilization and disk IO operations.You can drill down from this dashboard to the Process Metrics - Details dashboard by using the honeycombs or line charts in all the panels.

Use this dashboard to:

  • Identify hosts with high disk utilization and disk IO operations.
  • Monitor abnormal spikes in read/write rates.
  • Compare disk throughput across storage devices of a host.
Host Metrics dashboards

Memory

The Host Metrics - Memory dashboard provides detailed information on host memory usage, memory distribution, and swap space utilization. You can drill down from this dashboard to the Process Metrics - Details dashboard by using the honeycombs or line charts in all the panels.

Use this dashboard to:

  • Identify hosts with high memory utilization.
  • Examine memory distribution (free, buffered-cache, used, total) for a given host.
  • Monitor abnormal spikes in memory and swap utilization.
Host Metrics dashboards

Network

The Host Metrics - Network dashboard provides detailed information on host network errors, throughput, and packets sent and received.

Use this dashboard to:

  • Determine top hosts with network errors and dropped packets.
  • Monitor abnormal spikes in incoming/outgoing packets and bytes sent and received.
  • Use dashboard filters to compare throughput across the interface of a host.
Host Metrics dashboards

TCP

The Host Metrics - TCP dashboard provides detailed information around inbound, outbound, open, and established TCP connections.

Use this dashboard to:

  • Identify abnormal spikes in inbound, outbound, open, or established connections.
Host Metrics dashboards

Process Metrics - Overview

The Process Metrics - Overview dashboard gives you an at-a-glance view of all the processes by open file descriptors, CPU usage, memory usage, disk read/write operations and thread count.You can drill down from this dashboard to the Process Metrics - Details dashboard by using the honeycombs or line charts in all the panels.

Use this dashboard to:

  • Identify top processes by CPU, memory usage, and open file descriptors.
  • Determine the longest running processes and users that have spawned the most number of processes.
Host Metrics dashboards

Process Metrics - Details

The Process Metrics - Details dashboard gives you a detailed view of key process related metrics such as CPU and memory utilization, disk read/write throughput, and major/minor page faults.

Use this dashboard to:

  • Determine the number of open file descriptors in all hosts. If the number of open file descriptors reaches the maximum file descriptor limits,, it can cause IOException errors.
  • Identify anomalies in CPU usage, memory usage, major/minor page faults and reads/writes over time.
  • Troubleshoot memory leaks using the resident set memory trend chart.
Host Metrics dashboards

The Process Metrics - Trend dashboard gives you insight into the state of your processes over time.

Use this dashboard to:

  • Analyze the current state of all the processes (sleeping, dead, idle, stopped, total, paging)
  • Identify anomalies over time in the number of threads, zombie processes, and total processes
Host Metrics dashboards

Host and Process Metrics Alerts

For Host Metrics

Sumo Logic provides out-of-the-box alerts available via Sumo Logic monitors. These alerts are built based on metrics datasets and have preset thresholds based on industry best practices and recommendations.

Alert NameAlert DescriptionAlert ConditionRecover Condition
Host Metrics - High CPU UtilizationThis alert fires when host CPU utilization is over 80%.> 80 %<= 80 %
Host Metrics - High Network ErrorsThis alert fires when a host has encountered network errors in the last five minutes.> 1%<= 1%
Host Metrics - Unusual network throughput inThis alert fires when host network interfaces are receiving an unusually high amount of data (> 100 MB/s) over a 5-minute time interval.> 100 MB/sec<= 100 MB/sec
Host Metrics - Unusual network throughput outThis alert fires when host network interfaces are sending an unusually high amount of data (> 100 MB/s) over a 5-minute time interval.> 100 MB/sec<= 100 MB/sec
Host Metrics - Host out of memoryThis alert fires when memory utilization is over 90%.> 90 %<= 90 %
Host Metrics - Host out of inodesThis alert fires when a host's filesystem is close to running out of available iNodes (> 90% used).> 90 %<= 90 %
Host Metrics - Host swap is filling upThis alert fires when swap utilization is over 80%.> 80 %<= 80 %
Host Metrics - Host out of disk spaceThis alert fires when disk utilization is over 90%.> 90 %<= 90 %
Host Metrics - Unusual disk read rateThis alert fires when the disk is reading an unusually high amount of data (> 50 MB/s) over a 5-minute time interval.> 50 MB/sec<= 50 MB/sec
Host Metrics - Unusual disk write rateThis alert fires when the Disk is writing an unusually high amount of data (> 50 MB/s) over a 5-minute time interval.> 50 MB/sec<= 50 MB/sec

For Process Metrics

Sumo Logic provides out-of-the-box alerts available via Sumo Logic monitors. These alerts are built based on metrics datasets and have preset thresholds based on industry best practices and recommendations.

Alert NameAlert DescriptionAlert ConditionRecover Condition
Process Metrics - High CPU UsageThis alert fires when the CPU utilization of a process is over 80% of the system CPU.> 80 %<= 80 %
Process Metrics - High Read RateThis alert fires when a process is reading an unusually high amount of data (> 20 MB/s) over a 5-minute time interval.> 50 MB/sec<= 50 MB/sec
Process Metrics - High Write RateThis alert fires when a process is writing an unusually high amount of data (> 20 MB/s) over a 5-minute time interval.> 50 MB/sec<= 50 MB/sec
Process Metrics - High Page FaultsThis alert fires when the rate of page faults is high (> 1000).> 1000<= 1000
Process Metrics - High Memory UsageThis alert fires when the memory used by a process is over 80% of system memory.> 80 %<= 80 %
Process Metrics - High Open file descriptorsThis alert fires when the number of file descriptors used by a process is more than 1000.> 1000<= 1000
Legal
Privacy Statement
Terms of Use

Copyright © 2024 by Sumo Logic, Inc.