In this lab, you will learn to create an alert that notifies you when your errors increase at a higher rate than your overall traffic (Fig. 2), rather than alerting on simple error counts with a static threshold, which can yield false positives (Fig. 1). For a more in-depth explanation, check out this blog post on creating meaningful alerts.
Fig. 1 & Fig. 2
Using Labs/Apache/Access data, search only for messages with status code 200 or 404. The 404 messages are your errors, and the 200 messages give you a sense of the overall traffic.
Count 200 messages as Successes and 404 messages as Fails.
Sum Successes and Fails to get a count by timeslice to identify a trend over time.
Create a ratio of fails to successes.
Use the outlier operator to identify anomalies in the ratio.
_sourceCategory=Labs/Apache/Access (status_code=200 or status_code=404)
| timeslice 1m
| if (status_code="200", 1, 0) as successes
| if (status_code="404", 1, 0) as fails
| sum(successes) as success_cnt, sum(fails) as fail_cnt by _timeslice
| fail_cnt/success_cnt as failure_rate
| sort _timeslice desc
| outlier failure_rate window=5, threshold=3, consecutive=1, direction=+
Adding the following where clause keeps only the outliers (timeslices where the ratio increased more than normal). Using your email address, you can now create a Scheduled Search that alerts when this query returns results.
| where failure_rate_violation > 0
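To see why the ratio-based alert is less noisy than a static count, the logic of the query can be sketched outside the product. The Python below is only an illustration under stated assumptions: the function names, the sample data, and the static threshold of 8 are hypothetical, and the rolling mean plus standard deviation check is a rough stand-in for the outlier operator with window=5, threshold=3, consecutive=1, direction=+, not a reproduction of its exact internals.

```python
from statistics import mean, pstdev

def failure_rates(buckets):
    """buckets: (timeslice, success_cnt, fail_cnt) tuples in ascending time order."""
    rates = []
    for ts, success_cnt, fail_cnt in buckets:
        # Assumption: treat a timeslice with no successes as rate 0 to avoid dividing by zero.
        rate = fail_cnt / success_cnt if success_cnt else 0.0
        rates.append((ts, rate))
    return rates

def flag_outliers(rates, window=5, threshold=3.0):
    """Flag timeslices whose rate exceeds mean + threshold * stddev of the previous `window` points."""
    flagged = []
    for i in range(window, len(rates)):
        history = [rate for _, rate in rates[i - window:i]]
        upper = mean(history) + threshold * pstdev(history)
        ts, rate = rates[i]
        if rate > upper:  # upward violations only, like direction=+
            flagged.append(ts)
    return flagged

# Hypothetical data: traffic doubles at minute 6 (errors double too),
# then errors spike at minute 7 while traffic returns to normal.
example_buckets = [
    (0, 100, 5), (1, 100, 5), (2, 100, 6), (3, 100, 5),
    (4, 100, 4), (5, 100, 5), (6, 200, 10), (7, 100, 20),
]

# A static threshold on the raw error count fires on the traffic surge as well.
static_alerts = [ts for ts, _, fail_cnt in example_buckets if fail_cnt > 8]

# The ratio-based check only fires when errors outpace traffic.
ratio_alerts = flag_outliers(failure_rates(example_buckets))

print("static:", static_alerts)
print("ratio:", ratio_alerts)
```

In this sample, the static count threshold flags minute 6 even though the error rate there is unchanged, while the ratio-based check flags only minute 7, which mirrors the contrast between Fig. 1 and Fig. 2.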