Sumo Logic

Parse Variable Patterns Using Regex

The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.

User-added fields, such as extracted or parsed fields, can be named using alphanumeric characters as well as underscores ("_") and dashes ("-"). They must start and end with an alphanumeric character.
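
The naming rule above can be checked mechanically. The following is a minimal Python sketch, not part of Sumo Logic; the helper name is_valid_field_name is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
# First and last characters must be alphanumeric; underscores and
# dashes are allowed only in between.
FIELD_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?")


def is_valid_field_name(name: str) -> bool:
    """Return True if name follows the naming rule stated above."""
    return FIELD_NAME.fullmatch(name) is not None


for candidate in ["msg_host", "ip-address", "_user", "status-"]:
    print(candidate, is_valid_field_name(candidate))
```

Under these assumptions the first two names pass and the last two fail, because they start or end with a non-alphanumeric character.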

Syntax

  • ... | parse regex "start_anchor_regex(?<field_name>.*?)stop_anchor_regex" | ...
  • ... | parse regex "start_anchor_regex(?<field_name>.*?)stop_anchor_regex" nodrop | ...
  • ... | parse regex field=<field_name> "start expression(?<fieldname>field expression) stop expression" | ...

Options

  • parse field=fieldname 

    The parse field=fieldname option allows you to specify a field to parse other than the default message. For details, see Parse field.

  • parse nodrop

    The parse nodrop option forces results to also include messages that do not match any segment of the parse term, as in * | parse "a=*," as a nodrop. For details, see Parse nodrop.

  • parse multi 

    The parse multi option allows you to parse multiple values within a single log message. See Parse multi. You can use the alternate term "extract".
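
To make the effect of nodrop concrete, here is a rough Python sketch of the behavior described above. It is an illustration only, not how Sumo Logic is implemented. The helpers to_python_syntax and parse_regex are hypothetical; Python's re module spells named groups (?P<name>...), so the first helper rewrites the (?<name>...) form used on this page. Representing an unmatched field as None is an assumption, and the sample messages are made up.

```python
import re


def to_python_syntax(pattern: str) -> str:
    # Rewrite (?<name>...) as Python's (?P<name>...); lookbehinds
    # such as (?<= and (?<! are left alone.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def parse_regex(messages, pattern, nodrop=False):
    """Mimic parse regex with and without nodrop on a list of messages."""
    compiled = re.compile(to_python_syntax(pattern))
    results = []
    for message in messages:
        match = compiled.search(message)
        if match:
            results.append({"_raw": message, **match.groupdict()})
        elif nodrop:
            # Keep the message; its fields stay unset (None is an assumption).
            results.append({"_raw": message, **dict.fromkeys(compiled.groupindex)})
    return results


messages = ["user=alice: login ok", "heartbeat"]  # made-up sample messages
dropped = parse_regex(messages, "user=(?<user>.*?):")
kept = parse_regex(messages, "user=(?<user>.*?):", nodrop=True)
print(len(dropped), len(kept))
```

Without nodrop only the matching message survives; with nodrop both messages are returned and the second one carries an empty user field.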

For more information on Regular Expressions, see the Perl documentation. Or try the regex tester at regex101.com.

Rules

  • Regex must be a valid Java or RE2 regular expression enclosed within quotes.
  • Matching is case sensitive. If any of the text segments cannot be matched, then none of the variables will be assigned.
  • If no field is specified, then the entire text of incoming messages is used.
  • Multiple parse expressions are processed in the order they are specified. Each expression always starts matching from the beginning of the message string.
  • Multiple parse expressions can be written with shorthand using comma-separated terms.
  • Can be used with the parse anchor operator.
  • Nesting named capture groups is not supported.
  • The parse regex operator only supports regular expressions that contain at least one named capturing group. Regular expressions that have no capturing groups, or that contain unnamed (numbered) capturing groups, are not supported. See Named Capturing Groups for further details.

    You can convert a regular expression without capturing groups into one with a named capturing group as follows:

    Wrap the entire expression in parentheses, then insert “?” followed by a capturing group name enclosed within “<>” right after the opening parenthesis. For example:

    Normal Regex     Regex with named capturing group
    \d{3}-[\w]*      (?<regex>\d{3}-[\w]*)

    If your regex contains a numbered capturing group (part of the regex is enclosed within parentheses), then you have two options:

    1. Convert it into a non-capturing group. In this case, that part of your regex is not extracted into a Sumo field. To do this, insert “?:” right after the opening parenthesis of the group.

    Normal Regex     Regex with non-capturing group
    (abc|\d{3})      (?:abc|\d{3})

    2. Convert it into a named capturing group if you want to extract its value. To do this, insert “?” followed by the name of the capturing group enclosed within “<>” right after the opening parenthesis. Sumo will generate a field with the same name as the named capturing group.

    Normal Regex     Regex with named capturing group
    (abc|\d{3})      (?<test_group>abc|\d{3})
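
The named-group requirement can be approximated with a small check. The Python sketch below is only an illustration under stated assumptions: is_supported and to_python_syntax are hypothetical helpers, Python's re module stands in for the Java/RE2 engine, and the check does not detect nested named groups.

```python
import re


def to_python_syntax(pattern: str) -> str:
    # Python's re module spells named groups (?P<name>...), so rewrite the
    # (?<name>...) form; lookbehinds (?<= and (?<! are left alone.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def is_supported(pattern: str) -> bool:
    """True if the pattern has at least one named group and no numbered-only groups."""
    compiled = re.compile(to_python_syntax(pattern))
    named = len(compiled.groupindex)  # number of named capturing groups
    return named >= 1 and named == compiled.groups  # groups counts all capturing groups


for pattern in [
    r"\d{3}-[\w]*",
    r"(?<regex>\d{3}-[\w]*)",
    r"(abc|\d{3})",
    r"(?:abc|\d{3})",
    r"(?<test_group>abc|\d{3})",
]:
    print(pattern, is_supported(pattern))
```

Note that a pattern consisting only of a non-capturing group still fails the check, because it contains no named capturing group to turn into a field.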

Examples 

Parsing an IP address

Extracting IP addresses from logs is straightforward using a parse regex similar to:

... | parse regex "(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) " | ...

Parsing multiple fields in a single query

Parse regex supports parsing out multiple fields in one query. For example, say we want to parse username and host information from logs. Use a query similar to:

... | parse regex "user=(?<user>.*?):" 
| parse regex "host=(?<msg_host>.*?):" 
| ...
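
Because each parse expression starts matching from the beginning of the message, the order of the fields in the log line does not have to follow the order of the expressions. The Python sketch below illustrates this with a made-up message; parse_all and to_python_syntax are hypothetical helpers, and returning None for a dropped message is an assumption about how to represent the no-nodrop case.

```python
import re


def to_python_syntax(pattern: str) -> str:
    # Python's re module spells named groups (?P<name>...), so rewrite the
    # (?<name>...) form; lookbehinds (?<= and (?<! are left alone.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def parse_all(message, patterns):
    """Apply each pattern in order, always searching the whole message."""
    fields = {}
    for pattern in patterns:
        match = re.search(to_python_syntax(pattern), message)
        if match is None:
            return None  # without nodrop, a non-matching message is dropped
        fields.update(match.groupdict())
    return fields


# Made-up message in which host appears before user.
message = "host=web01: user=alice: session opened"
fields = parse_all(message, ["user=(?<user>.*?):", "host=(?<msg_host>.*?):"])
print(fields)
```

Both fields are extracted even though the user expression comes first in the query and second in the message.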

Using non-capturing groups for an OR condition

In situations where you want to use an OR condition, where multiple alternatives may match the regular expression, the best practice is to use a non-capturing group, written (?:regex).

To specify a list of alternative strings in a regular expression, use the group syntax. For example, for the following two log lines:

Oct 11 18:20:49 host123.example.com 16234563: Oct 11 18:20:49: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.1.2.3(1234) -> 10.1.2.4(5678), 1 packet
Oct 11 18:20:49 host123.example.com 16234564: Oct 11 18:20:49: %SEC-6-IPACCESSLOGP: list 101 accepted tcp 10.1.2.5(4321) -> 10.1.2.6(8765), 1 packet


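you might first try the following query to extract the "protocol":

parse regex "list 101 (accepted|denied) (?<protocol>.*?) "

However, (accepted|denied) is an unnamed capturing group, which parse regex does not support. So, you would actually write: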

parse regex "list 101 (?:accepted|denied) (?<protocol>.*?) "


To also capture whether the entry is "accepted" or "denied" into a field, use a named capturing group instead:

parse regex "list 101 (?<status>accepted|denied) (?<protocol>.*?) "
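
The difference between the two working forms can be seen by running them against the two sample log lines. This is a Python sketch for illustration only; extract and to_python_syntax are hypothetical helpers, and Python's re module stands in for the Java/RE2 engine.

```python
import re


def to_python_syntax(pattern: str) -> str:
    # Python's re module spells named groups (?P<name>...), so rewrite the
    # (?<name>...) form; lookbehinds (?<= and (?<! are left alone.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def extract(pattern, line):
    """Return the named fields for the first match in line, or None."""
    match = re.search(to_python_syntax(pattern), line)
    return match.groupdict() if match else None


LINES = [
    "Oct 11 18:20:49 host123.example.com 16234563: Oct 11 18:20:49: "
    "%SEC-6-IPACCESSLOGP: list 101 denied tcp 10.1.2.3(1234) -> 10.1.2.4(5678), 1 packet",
    "Oct 11 18:20:49 host123.example.com 16234564: Oct 11 18:20:49: "
    "%SEC-6-IPACCESSLOGP: list 101 accepted tcp 10.1.2.5(4321) -> 10.1.2.6(8765), 1 packet",
]

non_capturing = "list 101 (?:accepted|denied) (?<protocol>.*?) "
named = "list 101 (?<status>accepted|denied) (?<protocol>.*?) "

for line in LINES:
    print(extract(non_capturing, line), extract(named, line))
```

The non-capturing form yields only the protocol field, while the named form adds a status field holding "denied" or "accepted".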

Parse multi

In addition to parsing a field value, the multi option (also called parse multi) allows you to parse multiple values within a single log message. The multi keyword instructs the parse regex operator to look not just for the first value in a log message, but for all of the values, even in messages with a varying number of values. As part of this process, the multi keyword creates a copy of the message for each value found, so that each individual value in a field can be counted.
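
The copy-per-value behavior can be pictured with a short Python sketch. It is an approximation under stated assumptions, not Sumo Logic's implementation: parse_regex_multi and to_python_syntax are hypothetical helpers, and the message is a shortened form of the sample log line used earlier on this page.

```python
import re


def to_python_syntax(pattern: str) -> str:
    # Python's re module spells named groups (?P<name>...), so rewrite the
    # (?<name>...) form; lookbehinds (?<= and (?<! are left alone.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def parse_regex_multi(message, pattern):
    """Return one copy of the message per match, each with its own field values."""
    compiled = re.compile(to_python_syntax(pattern))
    return [{"_raw": message, **m.groupdict()} for m in compiled.finditer(message)]


message = "list 101 denied tcp 10.1.2.3(1234) -> 10.1.2.4(5678), 1 packet"
rows = parse_regex_multi(message, r"(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
for row in rows:
    print(row["ip_address"])
```

A single message containing two IP addresses produces two rows, one per address, which is what makes per-value counting possible.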

For example, in the Amazon VPC flow logs you can identify the messages with the same source and destination IP addresses using parse regex multi.

_sourceCategory=aws/vpc 
| parse regex "(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})" multi
| count by ip_address, _raw
| where _count > 1

The output looks like:

[Screenshot: ParseRegexMulti.png]