The Basics of Regular Expressions
Symbol / Characters | Description | Examples |
---|---|---|
Literal characters | Characters as they read in text. | `abc` **matches** `abc`, **not** `def` |
`.` | Any single character. | `b.d` **matches** `bcd`, **not** `bd`, `bde` |
`\w` | Any alphanumeric character or underscore, sometimes called a "word character". | `a\wc` **matches** `abc`, **not** `a c` |
`\W` | Any character that is not alphanumeric or an underscore. | `a\Wc` **matches** `a c`, **not** `abc`, `def` |
`\d` | Any digit character. | `a\d` **matches** `a1`, `a2`, **not** `ab`, `ac` |
`\D` | Any non-digit character. | `a\D` **matches** `ab`, `ac`, **not** `a1`, `a2` |
`\s` | Any whitespace character. | `a\sc` **matches** `a c`, **not** `abc`, `adc` |
`\S` | Any non-whitespace character. | `a\Sc` **matches** `abc`, `a1c`, **not** `a c` |
`[...]` | A single character of those in brackets. | `a[bc]` **matches** `ab`, `ac`, **not** `ad`, `bc` |
`[^...]` | A single character other than those in brackets. | `a[^bc]` **matches** `ad`, **not** `ab`, `de` |
`[n1-n2]` | Range notation, matching any single character within an alphabetic or numeric range. Case sensitivity applies. | `[a-z]` **matches** `a`, `b`, `z`, **not** `A`, `Z`; `[0-9]` **matches** `1`, `5`, `9`; `[a-zA-Z0-9]` **matches** `a`, `B`, `3` |
`*` | Zero or more of the preceding character or expression. | `abc*` **matches** `ab`, `abc`, `abcc`, **not** `ac`, `acd` |
`+` | One or more of the preceding character or expression. | `abc+` **matches** `abc`, `abcc`, **not** `ab`, `abd` |
`?` | Zero or one of the preceding character or expression. | `abc?` **matches** `abc`, `ab`, **not** `ac` |
`{n}` | Matches `n` occurrences of the preceding character or expression. | `\d{3}` **matches** `123`, **not** `12`, `a23` |
`\|` | Create an "OR" expression. | `a(b\|c)` **matches** `ab`, `ac`, **not** `ad` |
`\` | Escape special regex characters. | `ab\?` **matches** `ab?`, **not** `ab`, `ab\` |
`^` | Match position to beginning of line. | `^bc` **matches** `bcd`, **not** `abc` |
`$` | Match position to end of line. | `bc$` **matches** `abc`, **not** `bcd` |
`(...)` | Group characters together based on pattern in parentheses. Groups are referenced in numeric order (i.e. \1 is the first group, \2 is the second), typically in replacement or character group isolation. | `(ab)cd` captures `ab`, accessible with `\1` |
`(?<group_name>...)` | Create a named group. Splunk uses named groups in field extraction regex. | `(?<first>ab)cd` captures `ab` in a group named `first` |
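The rows in the table can be checked mechanically. The sketch below uses Python's `re` module as a stand-in, which is an assumption of this example rather than something Splunk itself runs; the basic symbols shown here behave the same way in both. It uses `re.fullmatch`, so the whole string must match the pattern, and the `find_mismatches` helper is a hypothetical name introduced for illustration.

```python
import re

# (pattern, strings that should match, strings that should not), copied from the table.
CASES = [
    (r"b.d", ["bcd"], ["bd", "bde"]),
    (r"a\wc", ["abc"], ["a c"]),
    (r"a[^bc]", ["ad"], ["ab", "de"]),
    (r"abc*", ["ab", "abc", "abcc"], ["ac", "acd"]),
    (r"abc+", ["abc", "abcc"], ["ab", "abd"]),
    (r"abc?", ["abc", "ab"], ["ac"]),
    (r"\d{3}", ["123"], ["12", "a23"]),
    (r"ab\?", ["ab?"], ["ab", "ab\\"]),
]


def find_mismatches(cases):
    """Return (pattern, text, expected) for every row that behaves unexpectedly."""
    mismatches = []
    for pattern, should_match, should_not_match in cases:
        expectations = [(t, True) for t in should_match]
        expectations += [(t, False) for t in should_not_match]
        for text, expected in expectations:
            # fullmatch requires the entire string to match the pattern.
            matched = re.fullmatch(pattern, text) is not None
            if matched != expected:
                mismatches.append((pattern, text, expected))
    return mismatches


if __name__ == "__main__":
    print(find_mismatches(CASES) or "all table rows behave as described")
```

Note that Python writes named groups as `(?P<name>...)`, so named-group patterns written for Splunk need that small change before they can be tested this way.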
Regular Expression Examples with Splunk
There are many common patterns in data where regex can be used to identify values. Often, this is most beneficial when data has a consistent format, but the values of words and numbers change. Common examples are phone numbers, IP addresses, and timestamps. For each of these examples, many variations of regex can be applied using basic symbols.
Use Cases
Example 1: Phone Numbers
For this example, a 10-digit phone number is expected in data. A simple match of 10 digits can be accomplished with the following:
# Example data we want to find
1234567890
# Digit symbol
\d{10}
# Numeric range
[0-9]{10}
Both regular expressions match the example data we are looking for. Now, let’s use the first one in a Splunk SPL search using the rex command.
index="<index>" sourcetype="<sourcetype>"
| rex "(?<phone_number>\d{10})"
| where isnotnull(phone_number)
| stats count as call_count by phone_number
| sort - call_count
Replace `<index>` and `<sourcetype>` with values from your Splunk environment. This search uses the rex command to extract the first 10-digit number from the raw text of each event (rex reads the `_raw` field by default), creating a new field called `phone_number`. The query then keeps only the events that contain a match, and presents the count of events for each phone number in a table, sorted from most to least frequent.
Example 2: Phone Numbers with Hyphens
Phone numbers are often written with hyphens. The following regular expressions match the area code, telephone prefix, and line number as three distinct sections separated by hyphens.
# Example data
123-456-7890
# Digit symbol
\d{3}-\d{3}-\d{4}
# Multiple occurrences of the group containing 3 digits and a hyphen
(\d{3}-){2}\d{4}
Let’s now try the same search with a different regular expression.
index="<index>" sourcetype="<sourcetype>"
| rex "(?<phone_number>\d{3}-\d{3}-\d{4})"
| where isnotnull(phone_number)
| stats count as call_count by phone_number
| sort - call_count
In the same way as our other example, this search will attempt to locate number sequences that look like phone numbers with dashes in them. Again, using the rex command, we are able to utilize a regular expression to find events in Splunk that have phone numbers with hyphens.
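For readers who want to try the patterns outside Splunk, the following is a minimal Python sketch of what the two searches do: extract the first phone number from each event, drop events without one, and count events per number. The sample events and the `count_phone_numbers` helper are hypothetical and only illustrate the logic.

```python
import re
from collections import Counter

# The two patterns from Examples 1 and 2.
PLAIN = re.compile(r"\d{10}")
HYPHENATED = re.compile(r"\d{3}-\d{3}-\d{4}")


def count_phone_numbers(events, pattern):
    """Count events by the first phone number found in each, most frequent first."""
    counts = Counter()
    for event in events:
        match = pattern.search(event)  # first match only, like rex by default
        if match:  # mirrors `where isnotnull(phone_number)`
            counts[match.group(0)] += 1
    return counts.most_common()  # mirrors `stats count ... | sort - call_count`


if __name__ == "__main__":
    # Hypothetical sample events.
    events = [
        "call from 123-456-7890 dropped",
        "call from 123-456-7890 connected",
        "call from 555-010-9999 connected",
        "no number in this event",
    ]
    for number, call_count in count_phone_numbers(events, HYPHENATED):
        print(number, call_count)
```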
Example 3: Phone Numbers in Multiple Formats
In an exploratory exercise, it may be unknown whether the data contains hyphens or is prefixed with a country code. Using the OR symbol in combination with frequency symbols like `*`, regex can match all formats presented. Groups used to separate out the sections of data can also be used to identify each section with a name.
# Example data
123-456-7890
+1 123-456-7890
1234567890
+11234567890
# Matching all examples with |, *, and groups
((\+\d)|)\s*(\d{3}(-|)){2}(\d{4})
# Using groups to name all sections of the phone number
((?<country_code>\+\d)|)\s*(?<area_code>\d{3}(-|))(?<prefix>\d{3}(-|))(?<line_number>\d{4})
Using the rex command to extract more unique fields for each occurrence of a phone number allows for more complex search queries to show granular details of the phone number data. This also allows Splunk users to write additional queries focusing on each field for presentation in a multi-panel dashboard.
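As a rough illustration of named groups, the Python sketch below splits each of the four example formats into its sections. The group names (`country_code`, `area_code`, `prefix`, `line_number`) and the `parse_phone` helper are assumptions made for this example. The pattern is also a slight variation on the one above: it uses `?` instead of an empty OR branch and keeps the hyphens outside the groups so that each captured value contains digits only.

```python
import re

# Python spells named groups (?P<name>...); Splunk uses (?<name>...).
PHONE = re.compile(
    r"(?P<country_code>\+\d)?\s*"
    r"(?P<area_code>\d{3})-?"
    r"(?P<prefix>\d{3})-?"
    r"(?P<line_number>\d{4})"
)


def parse_phone(text):
    """Return a dict of the named sections, or None if text is not a phone number."""
    match = PHONE.fullmatch(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in ["123-456-7890", "+1 123-456-7890", "1234567890", "+11234567890"]:
        print(example, parse_phone(example))
```

A section that is absent from the input, such as the country code in `123-456-7890`, comes back as `None`.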
Example 4: IP Addresses
In Splunk, IP addresses are often critical for data analysis and threat hunting. The principles in the previous example can be easily applied to IP addresses to account for the variable length of each octet.
# Example data
10.11.12.123
172.16.0.45
3.86.250.13
# Match with frequency range for each octet
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
# Use a group to match 3 octets that end with a period
(\d{1,3}\.){3}\d{1,3}
# Use | to match all octets in an abbreviated form
(\d{1,3}(\.|)){4}
In the example below, a regex pattern for extracting IP addresses is used to identify IP addresses potentially involved in a brute-force attack.
index="authentication" action="blocked"
| rex "(?<src_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| stats count as login_failures by src_ip
| where login_failures > 10
| sort - login_failures
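The same logic can be sketched in Python to show one limitation of the pattern: `\d{1,3}` accepts values such as `999`, which are not valid octets. The example below is an illustration under assumptions, not part of the Splunk search. It adds a validation step with the standard `ipaddress` module, and both the sample events and the `failed_logins_by_ip` helper are hypothetical.

```python
import ipaddress
import re
from collections import Counter

IP_PATTERN = re.compile(r"(?P<src_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")


def failed_logins_by_ip(events, threshold=10):
    """Count events per source IP and keep those above the threshold."""
    counts = Counter()
    for event in events:
        match = IP_PATTERN.search(event)
        if match is None:
            continue
        candidate = match.group("src_ip")
        try:
            # Rejects strings the regex accepts but that are not IPv4 addresses.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        counts[candidate] += 1
    # Mirrors `where login_failures > 10 | sort - login_failures`.
    return [(ip, n) for ip, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    # Hypothetical sample events.
    events = (
        ["blocked login from 10.11.12.123"] * 11
        + ["blocked login from 172.16.0.45"] * 3
        + ["blocked login from 999.1.1.1"] * 20
    )
    print(failed_logins_by_ip(events))
```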
Example 5: Timestamps
Timestamps are a primary concern for Splunk administrators when onboarding data. In some cases, data may contain additional timestamps that are useful to extract as fields. Timestamp data will generally follow a consistent pattern across the events of a unique log source, and knowledge of this format can help apply accurate regex.
# Example data
Jan 01, 2024 10:14:42
# Using whitespace and character frequency as observed
\w{3}\s\d{2},\s\d{4}\s\d{2}:\d{2}:\d{2}
# Use of + symbol for matching on month name if length is variable
\w+\s\d{2},\s\d{4}\s\d{2}:\d{2}:\d{2}
When working with time data in Splunk, regex in search commands should not be used to extract event timestamps. Event timestamps should instead be assigned during data ingestion by the timestamp parsing settings in Splunk configuration files, which produce them automatically.
Utilizing regex for timestamps can be useful in Splunk search when raw data contains fields that have additional timestamps that provide useful context in reporting or dashboarding. The example SPL below shows a method of tabling out service ticket data, where the primary timestamp is extracted from ticket creation, but resolution time is needed in the resulting table.
index="itsm"
| rex "(?<resolution_date>\w{3}\s\d{2},\s\d{4}\s\d{2}:\d{2}:\d{2})"
| table _time resolution_date analyst_comments
| rename _time as creation_time
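Once a secondary timestamp has been matched, it is often useful to turn the text into a real date value. The Python sketch below shows the idea with the first timestamp pattern from this example. It assumes English month abbreviations (the `%b` directive depends on the locale), and the sample event and the `extract_resolution_date` helper are hypothetical.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"\w{3}\s\d{2},\s\d{4}\s\d{2}:\d{2}:\d{2}")


def extract_resolution_date(event):
    """Return the first timestamp in the event as a datetime, or None."""
    match = TIMESTAMP.search(event)
    if match is None:
        return None
    try:
        # The regex only checks the shape; strptime checks that it is a real date.
        return datetime.strptime(match.group(0), "%b %d, %Y %H:%M:%S")
    except ValueError:
        return None


if __name__ == "__main__":
    event = "ticket resolved Jan 01, 2024 10:14:42 by analyst"
    print(extract_resolution_date(event))
```

The regex accepts any three word characters as a month, so a string such as `Xyz 01, 2024 10:14:42` matches the pattern but is rejected by the date parser.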
Considerations
The examples shown, in some cases, demonstrate methods of abbreviating regex with more complex combinations of symbols. While regex optimization is beyond the scope of this article, readers should be aware that functional but complex regex may slow down text parsing operations. Additionally, shorter strings do not always improve regex performance.
Filtering Searches with Regular Expressions
Regular Expressions in Splunk Search
For a regex beginner, using regex in Splunk search is a great way to explore data, create ad hoc field extractions, and test regex before applying it in administrative configurations. We will demonstrate how to apply the regex, rex, and erex SPL commands to enhance analytics and reporting capabilities.
The regex command uses the following syntax:
| regex <field>=<regex-expression>
In-Search Field Extractions
The erex Command
| erex <field> examples="<example1>,<example2>"
The rex Command
| rex field=<field> "<regex-expression>"
Conclusion
You can also learn more about our Atlas platform. The Atlas platform by Kinney Group is a comprehensive solution that empowers organizations to optimize their Splunk environments. By leveraging automation, best practices, and a unified interface, Atlas simplifies the management, monitoring, and scaling of Splunk deployments. With Atlas, businesses can enhance the performance, security, and cost-efficiency of their Splunk infrastructure, enabling them to derive maximum value from their machine data and drive informed decision-making. Get started by running the free Atlas Assessment, available on Splunkbase.
