
What if I don't want to send all the tracing data to Sumo Logic?

You can selectively filter tracing data before it is sent to Sumo Logic. OpenTelemetry Collector instances running on each node can connect to OpenTelemetry collectors in the aggregation layer, which can apply smart filtering to the data before forwarding it to Sumo Logic. Common reasons to filter tracing data in large environments include scaling, privacy, and cost optimization.
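
In the aggregating collector, the filtering runs in the trace pipeline, between the receivers that accept data from the node-level agents and the exporter that forwards data to Sumo Logic. The sketch below shows roughly where such a processor sits (receiver, processor, and exporter names here are illustrative; complete configurations follow later on this page):

service:
  pipelines:
    traces:
      receivers: [otlp]
      ## A filtering processor (for example cascading_filter, described below)
      ## runs before the data is batched and exported.
      processors: [memory_limiter, cascading_filter, batch]
      exporters: [otlp]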


Prerequisites

You have already installed and are running the OpenTelemetry Collector.

In the instructions below, we modify the configuration of the aggregating OpenTelemetry Collector.

Filtering data at the output of the OT collector in aggregation mode

Sumo Logic’s OpenTelemetry collector has a unique capability of shaping trace data during output, according to user-defined custom rules. You can define rules in a cascading fashion, assigning different pool sizes to each rule, and giving them different priorities. This ensures you will always have valuable, useful, and cost-optimized data for analysis in the backend.
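
The sketch below illustrates the cascading idea with made-up values and policy names (complete, working examples follow): policies are evaluated in the order they are listed, each reserves its own spans-per-second pool within the overall limit, and a final policy with spans_per_second: -1 fills whatever budget remains.

processors:
  cascading_filter:
    decision_wait: 30s
    num_traces: 50000
    ## Overall output limit shared by all policies below
    spans_per_second: 1000
    policies:
      ## Evaluated first: reserves part of the budget for long-running traces
      - name: slow-traces-first
        spans_per_second: 400
        properties: { min_duration: 5s }
      ## Evaluated last: randomly fills whatever budget remains
      - name: everything-else
        spans_per_second: -1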

Example 1 - filtering out health checks

Suppose your system generates health check traces that you want to filter out, and that these health checks can be identified by an operation name starting with "GET /health" or "POST /health". This can be achieved with the following configuration, which consists of two steps: the first removes the health checks, and the second reduces the volume of data. Combined, the configuration:

  • ensures that no more than 2,000 spans/second are sent
  • keeps at most 50,000 traces in the buffer
  • waits 30 seconds before making the sampling decision
  • passes only those traces in which no operation name starts with (case-insensitive) "get /health" or "post /health"

For example, a trace containing an operation named "GET /healthcheck" will be filtered out.

processors:
  cascading_filter/health:
    ## decision_wait specifies for how long spans are collected into traces
    ## until the sampling decision is made
    decision_wait: 30s
    num_traces: 500000
    spans_per_second: 50000
    ## To make sure that health checks are excluded, this setting MUST be set to 0
    probabilistic_filtering_ratio: 0
    policies:
      - name: everything-that-is-not-healthcheck
        ## This selects all traces where there is NO span
        ## whose operation name starts with `GET /health` or `POST /health`.
        ## Adjust the pattern to your environment.
        ## Note 1: (?i) makes the match case-insensitive
        ## Note 2: The expression matches from the start of the string.
        ##         To match e.g. "http get /dispatch", the expression would be:
        ##         "(?i)^(http get /dispatch).*"
        properties:
          name_pattern: "(?i)^(get /health|post /health).*"
        invert_match: true
        spans_per_second: -1
  cascading_filter/adaptive:
    # Since traces were collected already, the wait time here can be minimal
    decision_wait: 2s
    num_traces: 50000
    # Here the limit of spans per second can be set
    spans_per_second: 2000
    probabilistic_filtering_ratio: 1.0
    policies:
      - name: everything
        spans_per_second: -1
  batch:
    send_batch_size: 200
    send_batch_max_size: 400

service:
  pipelines:
    traces:
      receivers: [...]
      processors: [..., cascading_filter/health, cascading_filter/adaptive, batch]
      exporters: [...]

Example 2 - more complex filtering 

The config.yaml fragment below does the following:

  • ensures that no more than 1,500 spans/second are sent
  • keeps at most 50,000 traces in the buffer
  • allocates 10% of the above rate to traces sampled probabilistically
  • allocates 100 spans/second to traces where at least one span has a duration of over 5 seconds
  • allocates 200 spans/second to traces with at least 10 spans, where one of the operation names matches "foo.*"
  • allocates 300 spans/second to traces whose service name is neither "service-a" nor "service-b"
  • randomly selects traces from the remaining traffic to fill the overall limit of 1,500 spans/second

processors:
  cascading_filter:
    decision_wait: 30s
    num_traces: 50000
    spans_per_second: 1500
    probabilistic_filtering_ratio: 0.1
    policies:
      - name: min-duration
        spans_per_second: 100
        properties: { min_duration: 5s }
      - name: foo-policy
        spans_per_second: 200
        properties:
          name_pattern: "foo.*"
          min_number_of_spans: 10
      - name: not-service-a-or-b
        spans_per_second: 300
        string_attribute:
          key: service.name
          values:
            - service-a
            - service-b
        invert_match: true
      - name: everything_else
        spans_per_second: -1
  batch:
    send_batch_size: 200
    send_batch_max_size: 400

service:
  pipelines:
    traces:
      receivers: [...]
      processors: [..., cascading_filter, batch]
      exporters: [...]

For more details on the configuration and additional examples, see https://github.com/SumoLogic/opentelemetry-collector-contrib/tree/master/processor/cascadingfilterprocessor

Enabling cascading_filter for Kubernetes

To apply these rules with the Kubernetes collection Helm chart, prepare a custom my-values.yaml file, for example:

otelcol:
  config:
    processors:
      ## Smart cascading filtering rules with preset limits.
      cascading_filter:
       ...

    service:
      pipelines:
        traces:
          processors: [memory_limiter, k8s_tagger, source, resource, cascading_filter, batch]

 

You can also refer to the example values.yaml template available on GitHub and adjust it accordingly.
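
For instance, a my-values.yaml that embeds the health-check exclusion rule from Example 1 (simplified here into a single cascading_filter instance with the 2,000 spans/second limit applied directly) could look like the following sketch; adjust the rules and the pipeline to your environment:

otelcol:
  config:
    processors:
      cascading_filter:
        decision_wait: 30s
        num_traces: 50000
        spans_per_second: 2000
        ## Must stay at 0 so health checks cannot slip through probabilistic filtering
        probabilistic_filtering_ratio: 0
        policies:
          - name: everything-that-is-not-healthcheck
            properties:
              name_pattern: "(?i)^(get /health|post /health).*"
            invert_match: true
            spans_per_second: -1

    service:
      pipelines:
        traces:
          processors: [memory_limiter, k8s_tagger, source, resource, cascading_filter, batch]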

Once the config is prepared, issue the following command:

helm upgrade collection sumologic/sumologic \
  --namespace sumologic \
  --reuse-values \
  -f my-values.yaml

Enabling cascading_filter for other environments

To make sure cascading_filter has access to whole traces, it must run on a single, central aggregating collector that sends all data to Sumo Logic. The first configuration below is for this aggregating collector; it receives OTLP data from the node-level agents and exports it to Sumo Logic (ENDPOINT_URL should point to your Sumo Logic traces endpoint):

receivers:
  otlp:
    protocols:
      grpc: 
        endpoint: 0.0.0.0:55680
      http: 
        endpoint: 0.0.0.0:55681

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 1000
    spike_limit_mib: 500
  batch:
    send_batch_size: 256
    send_batch_max_size: 512
    timeout: 5s
  queued_retry:
    num_workers: 16
    queue_size: 5000
    retry_on_failure: true
extensions:
  health_check: {}
exporters:
  zipkin:
    endpoint: ENDPOINT_URL
  logging:
    loglevel: debug
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, queued_retry]
      # To enable verbose debugging, add ",logging" to the list of exporters
      exporters: [zipkin]

Each node-level agent then receives spans from instrumented applications over Jaeger, Zipkin, or OTLP and forwards them to the aggregating collector. Replace COLLECTOR_HOSTNAME below with the address of the aggregating collector:

receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
  otlp:
    protocols:
      grpc: 
        endpoint: 0.0.0.0:55680
      http: 
        endpoint: 0.0.0.0:55681

  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 1000
    spike_limit_mib: 500
  batch:
    send_batch_size: 256
    send_batch_max_size: 512
    timeout: 5s
  resourcedetection/aws:
    detectors: [ec2]
    timeout: 5s
    override: false
  resource/aws:
    attributes:
    - action: upsert
      key: cloud.namespace
      value: ec2
  resourcedetection/gcp:
    detectors: [gcp]
    timeout: 5s
    override: false
  resource/gcp:
    attributes:
    - action: upsert
      key: cloud.namespace
      value: gce
  queued_retry:
    num_workers: 16
    queue_size: 5000
    retry_on_failure: true
extensions:
  health_check: {}
exporters:
  otlp:
    endpoint: "COLLECTOR_HOSTNAME:55680"
    insecure: true
  logging:
    loglevel: debug
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [jaeger, zipkin, otlp]
      # Use exactly ONE "processors" line below, matching your environment.
      # For AWS EC2 environments:
      processors: [memory_limiter, batch, resourcedetection/aws, resource/aws, queued_retry]
      # For GCP Compute Engine environments, use this instead:
      # processors: [memory_limiter, batch, resourcedetection/gcp, resource/gcp, queued_retry]
      # For other environments, use this instead:
      # processors: [memory_limiter, batch, queued_retry]
      # To enable verbose debugging, add ",logging" to the list of exporters
      exporters: [otlp]

Edit the centralized/aggregating OpenTelemetry Collector config and include the following sections:

processors:
  ## Smart cascading filtering rules with preset limits.
  cascading_filter:
   ...

service:
  pipelines:
    traces:
      processors: [..., cascading_filter, batch]

 

After editing the config, restart the collector instance so the changes take effect. You can refer to the sample template file available on GitHub and adjust it accordingly.

Troubleshooting 

When cascading_filter is running, it emits a number of metrics describing the actions it takes, such as otelcol_processor_cascading_filter_count_final_decision or otelcol_processor_cascading_filter_count_policy_decision. These are available at the OpenTelemetry Collector metrics endpoint (http://<OPENTELEMETRY_COLLECTOR_ADDRESS>:8888/metrics) or, for Kubernetes, via collected metrics (the Helm chart flag otelcol.metrics.enabled must be set to true).
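
For Kubernetes, a minimal my-values.yaml fragment that turns on collection of these metrics via the otelcol.metrics.enabled flag mentioned above could look like this:

otelcol:
  metrics:
    ## Collects the OpenTelemetry Collector's own metrics,
    ## including the cascading_filter decision counters mentioned above.
    enabled: true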