To keep pace with real-time analytics, Sumo Logic creates indexes in near real time. However, if data is collected with incorrect time zones or timestamps, or if there is sporadic latency in data collection, Sumo Logic can over-generate indexes in an attempt to handle the messages properly. These extra indexes can degrade search performance. We call this problem index fragmentation.
When Sumo Logic encounters indexes that may contain messages parsed with incorrect message times, we may drop those indexes from the search. If so, we let you know by displaying the warning message, "Your search contains messages that have been incorrectly parsed and cannot be displayed." These messages typically do not fit within the selected time range, so the results of your queries are usually unaffected. However, the dropped messages may not be found when you search other time ranges either.
To address this issue, you need to identify the messages or Sources that could be contributing to index fragmentation. With the "Use Receipt Time" option enabled in Sumo Logic, the following query can help you identify the minimum, maximum, and average delay (positive or negative) between the message time parsed from each log message (_messageTime) and the time the system received it (_receiptTime). The query reports only messages whose delay exceeds five minutes in either direction, and groups the results by Collector, Source, and SourceName. In an ideal situation, these values should all be as close to zero as possible. The larger the gap between the minimum and maximum delay times, the greater the chance of index fragmentation caused by that Source.
| _receiptTime - _messageTime as delay
| delay / 60000 as delayInMinutes
| toInt(delayInMinutes) as delayInMinutes
| where delayInMinutes > 5 or delayInMinutes < -5
| count(*) as messagecount, avg(delayInMinutes) as avgDelayInMinutes, min(delayInMinutes) as minDelayInMinutes, max(delayInMinutes) as maxDelayInMinutes by _collector, _source, _sourceName
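To make the query's logic easier to follow, here is a minimal Python sketch of the same computation on a handful of in-memory records. It is only an illustration, not how Sumo Logic evaluates the query: the function name delay_stats and the record keys (receipt_time, message_time, collector, source, source_name) are hypothetical, times are assumed to be epoch milliseconds, and the integer conversion is assumed to truncate toward zero.

```python
from collections import defaultdict


def delay_stats(messages, threshold_minutes=5):
    """Summarize receipt-time minus message-time delay per (collector, source, source_name).

    Each message is a dict with hypothetical keys; times are epoch milliseconds.
    """
    groups = defaultdict(list)
    for m in messages:
        delay_ms = m["receipt_time"] - m["message_time"]
        # Convert to whole minutes (assumption: truncation toward zero).
        delay_min = int(delay_ms / 60000)
        # Keep only delays larger than the threshold in either direction.
        if delay_min > threshold_minutes or delay_min < -threshold_minutes:
            key = (m["collector"], m["source"], m["source_name"])
            groups[key].append(delay_min)
    return {
        key: {
            "messagecount": len(delays),
            "avg": sum(delays) / len(delays),
            "min": min(delays),
            "max": max(delays),
        }
        for key, delays in groups.items()
    }
```

A Source whose minimum and maximum are far apart (for example, one message 10 minutes late and another 480 minutes "early") is the kind of result worth investigating first.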
Known causes of index fragmentation
Log messages with no timezone information
When no timezone information is found in a log message, Sumo Logic by default applies PST to the time parsed from the message, unless the Source configuration specifies an alternate timezone. If a message comes from a server configured with the UTC timezone, we would parse its time as PST, which produces an apparent 7 or 8 hour "delay" (depending on daylight saving time). Update this setting in the Source configuration, in Advanced, under the Timestamp options. For more information, see Timestamps, Time Zones, Time Ranges, and Date Formats.
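The 7 or 8 hour figure can be checked with a little arithmetic. The sketch below is only an illustration using Python's standard library: the function name apparent_delay_hours is hypothetical, fixed UTC offsets stand in for Pacific time (UTC-8 in winter, UTC-7 in summer), and transit latency is assumed to be zero so that the receipt time equals the true time the server wrote the message.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PST = timezone(timedelta(hours=-8), "PST")  # Pacific standard time
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific daylight time


def apparent_delay_hours(naive_timestamp, assumed_tz, actual_tz=UTC):
    """Receipt time minus parsed message time, in hours.

    naive_timestamp is the time as written in the log, with no zone attached.
    Assumes zero transit latency (receipt time == true write time).
    """
    receipt_time = naive_timestamp.replace(tzinfo=actual_tz)   # what really happened
    message_time = naive_timestamp.replace(tzinfo=assumed_tz)  # what the parser assumed
    return (receipt_time - message_time).total_seconds() / 3600


winter = datetime(2013, 1, 1, 12, 0, 0)
summer = datetime(2013, 7, 1, 12, 0, 0)
print(apparent_delay_hours(winter, PST))  # magnitude of 8 hours
print(apparent_delay_hours(summer, PDT))  # magnitude of 7 hours
```

The sign is negative in this sketch because a UTC wall-clock time read as Pacific time lands in the future relative to when the message was received.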
Sumo Logic also allows multiple remote Sources to be configured under one Source configuration. In some cases, the timezones of these remote sources may not all match the configured default timezone. If this occurs, some messages will be parsed with the correct time while others will appear delayed. To address this, you will need to create a separate Source configuration for each timezone.
Improper multiline message detection
In some cases Sumo Logic may not properly parse a multiline message as a single log message. When this occurs, the additional lines are parsed as separate messages. If these lines include a timestamp value, such as "lastLoginDate=01/01/2013 12:00:00", Sumo Logic may incorrectly parse this as the timestamp of the message and create an additional index for that time range. To address this issue, make sure that your multiline log messages are being parsed as a single message by using the Multiline Processing options in the Source configuration.
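The effect of correct multiline grouping can be sketched in a few lines of Python. This is not Sumo Logic's implementation. The boundary pattern, the function name group_multiline, and the sample log lines are all hypothetical, and the sketch assumes that every new message begins with a "YYYY-MM-DD HH:MM:SS" timestamp.

```python
import re

# Hypothetical boundary: a new message starts with "YYYY-MM-DD HH:MM:SS".
BOUNDARY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_multiline(lines, boundary=BOUNDARY):
    """Join continuation lines onto the preceding message."""
    messages = []
    for line in lines:
        if boundary.match(line) or not messages:
            # Line starts a new message (or is the very first line).
            messages.append(line)
        else:
            # Continuation line: append it to the current message.
            messages[-1] += "\n" + line
    return messages


lines = [
    "2013-06-01 08:15:00 ERROR login failed",
    "  user=jdoe",
    "  lastLoginDate=01/01/2013 12:00:00",
    "2013-06-01 08:15:02 INFO retry scheduled",
]
for message in group_multiline(lines):
    print(repr(message))
```

With grouping in place, the lastLoginDate line stays inside the first message, so its embedded date is never considered as a message timestamp on its own.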
Incorrect time settings across servers
If you have an environment where multiple servers generate and send logs to Sumo Logic, but the server clocks are not kept in sync using a Network Time Protocol (NTP) server, Sumo Logic may receive messages with inconsistent timestamps. If these timestamps differ by several minutes, Sumo Logic can create additional indexes, which causes index fragmentation. To address this issue, make sure your servers' clocks are synchronized.