What is the Splunk dedup Command?
The Splunk dedup command, short for “deduplication,” is an SPL command that removes duplicate values in fields, thereby reducing the number of events returned from a search. Typical uses of dedup produce a single event for each host, or two events for each sourcetype.
How the dedup Command Works
Dedup has two modes. We’ll focus on the standard mode, which runs as a streaming search command (it operates on each event as the search returns it).
The first thing to note is that the dedup command returns events, in contrast to stats commands, which return aggregate counts about the data. Outputting events is useful when you want to see the values of several fields, or the raw data, but only a limited number of events for each specified field.
When run as a historic search (e.g., against past data), the most recent events are searched first. If the dedup runs in real-time, the first events received are searched, which does not guarantee that they are the most recent (data doesn’t always arrive in a tidy order).
Splunk dedup Command Example
Let’s run through an example scenario and explore options and alternatives. I will use the windbag command for these examples since it creates a usable dataset (windbag exists to test UTF-8 in Splunk, but I’ve also found it helpful in debugging data).
Step 1: The Initial Data Cube
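The starting point is just the windbag generating command on its own, with no dedup applied yet (this is the same base search the later steps build on):

| windbag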
Result: 100 events. Twenty-five unique values for the field lang, with the highest value having eight events.
Step 2: Using Dedup to reduce events returned
Now, let’s limit that to 1 event for each of those values in lang.
| windbag | dedup lang
Result: 25 events. Lang still has 25 unique values, but there is only one event for each language specified this time.
We can also reduce by a combination of fields and even create fields before using dedup.
Step 3: Cast time into a bin, then reduce fields with lang and time bin
The windbag data is spread out over the past 24 hours. Taking advantage of this, we can create another usable field by using bin to put _time into 12-hour buckets. Using bin like this is one way to split the data. Since I ran this at 21:45, I wound up with four buckets (who said this was perfect?), with the middle two buckets holding forty-two events each.
| windbag | bin span=12h _time | dedup lang, _time
Result: 64 events. Twenty-five distinct lang values, with the highest event count for any combination at 3.
Step 4: Add a random 1 or 2 to the mix, and dedup off of those three fields.
The above exercise was one way to divide the data up. This time, we’re going to randomly assign each event a group of 1 or 2 (using random() and modulo arithmetic), and then dedup on that group along with lang and the 12-hour time bucket.
| windbag | eval group = (random() % 2) + 1 | bin span=12h _time | dedup lang, _time, group
Result: varies with each run. It ranged from seventy-five to eighty-six events across the ten runs I tried.
Step 5: What if we want more than one event per field?
This time we’ll add an integer after dedup to return more than one event per combination of values.
| windbag | dedup 2 lang
Result: Each of the twenty-five lang entries returned two events.
Step 6: How to Use the Data
Great, so we can reduce our count of events. What can we do with this? Anything you can picture in SPL. We may want a table of different fields. Stats counts based upon fields in the data? Why not?
index=_internal | dedup 100 host | stats count by component | sort - count
Result: dedup returned 500 events, which stats then counted by component. In case anyone is wondering, ~80% of those events belong to the Metrics component (apparently, we need to use this cloud stack more).
Other dedup Command Options and Considerations
There are several options available for dedup that affect how it operates.
Note: It may be better to use other SPL commands to meet some of these requirements, and dedup often works in combination with additional SPL commands.
- consecutive: This argument only removes events with duplicate combinations of values when those events are consecutive. By default it’s false, but you can probably see how it’s helpful for trimming repeating values.
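As a quick sketch: if we sort first so that duplicate lang values become adjacent, a consecutive-only dedup should behave like a plain dedup and collapse to one event per language.

| windbag | sort lang | dedup consecutive=true lang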
- keepempty: Allows keeping events where one or more fields have a null value. The problem this solves may be easier to rectify using fillnull, filldown, or autoregress.
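Here is a hedged sketch (the group field is invented for illustration): roughly half the events get a null group, and with keepempty=true those null-valued events survive the dedup instead of being discarded, which is what the default keepempty=false would do.

| windbag | eval group = if(random() % 2 == 0, 1, null()) | dedup keepempty=true lang, group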
- keepevents: Keeps all events, but removes the dedup fields from every event after the first event containing that particular combination of values.
This option is weird enough to try:
| windbag | eval group = (random() % 2) + 1 | dedup keepevents=true lang, group
Then add lang and group to the selected fields. Note how the first events show lang and group fields beneath them. Now flip to the last page of results: lang and group are no longer present on those events. Bonus points if you can tell me why this option exists.
- sortby: A set of sort options, which are excellent if your dedup takes place at the end of the search. Each field can be prefixed with + or - (ascending or descending), and you can control how values are compared: auto (let dedup figure it out), ip (interpret values as IP addresses), num (numeric order), and str (lexicographical order).
| windbag | bin span=12h _time | dedup lang, _time sortby -lang
This command sorts descending by language. What’s nice is that we don’t have to pipe the results to a separate sort command, which would add an intermediate search table.
- Multivalue Fields: Dedup works against multivalue fields, but all values of the field must match for events to be treated as duplicates.
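A minimal sketch using makeresults (the field name mv is invented for illustration): the two events below carry overlapping but not identical multivalue sets, so dedup should keep both rather than collapsing them.

| makeresults | eval mv = split("a,b,c", ",") | append [| makeresults | eval mv = split("a,b", ",")] | dedup mv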
- Alternative Commands: The uniq command works on small datasets to remove any search result that is an exact duplicate of the previous result. The dedup docs also suggest not running it against _raw, since comparing that field requires many calculations to determine whether an event is a duplicate.
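For instance, after reducing the results to a single sorted column, uniq should collapse the adjacent duplicates down to one row per language:

| windbag | table lang | sort lang | uniq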
- MLTK sample Command: The sample command that ships with the Machine Learning Toolkit does a great job of dividing data into samples. If my goal is to separate data, and MLTK exists on the box, the sample command is preferred.
- Stats Commands: The stats command, and its many derivatives, are faster if your goal is to return uniqueness for a few fields. For example, | windbag | bin span=12h _time | stats max(_time) as timebucket by lang returns the max value of _time for each lang, similar to dedup after a sort.
If you found this helpful…
You don’t have to master Splunk by yourself in order to get the most value out of it. Small, day-to-day optimizations of your environment can make all the difference in how you understand and use the data in your Splunk environment to manage all the work on your plate.
Cue Atlas Assessment: a customized report to show you where your Splunk environment is excelling and opportunities for improvement. Once you download the app below, you’ll get your report in just 30 minutes.