
Splunk spath Command: How to Extract Structured XML and JSON from Event Data

 

Written by: The Kinney Group Team | Last Updated: November 4, 2022

Originally Published: November 4, 2022

Your dilemma: You have XML or JSON data indexed in Splunk as standard event-type data.

Sure, you’d prefer to have brought it in as an indexed extraction, but other people onboarded the data before you got to it, and now you need to make your dashboards work.

How do you make this data searchable? You could write regular expressions and hope the shape of the data never changes, or you can use the easy button: the spath command.


What is the Splunk spath Command?

The spath command extracts fields and their values from either XML or JSON data. You can specify location paths or allow spath to run in its native form. Spath is a distributed streaming command, meaning that if it runs before any transforming or centralized commands in your search, the spath work occurs at the indexing layer. With a robust set of indexers, distributed streaming can significantly improve search performance.
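As a minimal sketch of that ordering (the index, sourcetype, and field name here are hypothetical), spath runs on the indexers as long as it appears before the transforming stats command:

```
index=structured sourcetype=vendor_xml
| spath
| stats count by team.City
```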

Splunk handles JSON well, even when it’s brought in as event data. We’ll show that later on as we discuss scenarios in which spath, or rename and multivalue commands, make more sense.

Spath helps with specialty issues in JSON but shines with XML data. So, we’ll start our investigation there.

How the spath Command Works on XML Ingested as Event Data

Note: By event data, we mean the XML was not ingested as indexed extractions. Data brought in as indexed extractions likely does not need spath, as the fields should already exist.

To help train our consultants, I created an XML that looks like this:

XML Example for spath command in Splunk
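The screenshot isn’t reproduced here, but based on the field names discussed below, the training file presumably resembled something like this (an illustrative reconstruction with made-up values, not the actual file):

```
<team>
  <NickName>Colts</NickName>
  <City>Indianapolis</City>
</team>
```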

So, we see an element named team and additional elements one level under it. We could write regular expressions and hope we know the field names, or use spath to extract the fields quickly. Here, we used spath and then a rename to pretty up the results (without the rename, we’d see team.NickName, team.City, etc.).

From here, we now have a set of events and extracted fields to work with, and can then execute standard SPL to calculate statistics or create reports and dashboards.
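That search could be sketched as follows, assuming a hypothetical index and sourcetype for this training data:

```
index=structured sourcetype=teams_xml
| spath
| rename team.* as *
| stats count by City
```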


How the spath Command Works on Compliant JSON Event Data

Note: By event data, we mean the JSON was not ingested as indexed extractions. Data brought in as indexed extractions likely does not need spath, as the fields should already exist.

Here we see a dataset of meteorite impacts on the earth (shout out to jdorfman’s GitHub list of awesome JSON datasets). We onboarded this data with a simple line breaker and not as indexed extractions. As we can see, the geolocation.coordinates{} field was spotted as JSON and brought in. The other fields, such as id, mass, name, etc., were also autodetected. Hence, there isn’t much work we need to do to make this data usable.

JSON without spath

Splunk brought the data in and displayed the fields. However, we still need additional handling on the multivalue field geolocation.coordinates{} since it returns longitude and latitude as two elements.

To adjust this data:

1. Rename geolocation.coordinates{} to coordinates, since subsequent commands object to those curly brackets.

rename geolocation.coordinates{} as coordinates

2. Merge the two values in coordinates for each event into one coordinate using the nomv command.

nomv coordinates

3. Use rex in sed mode to replace the \n separator that nomv inserts with a comma.

rex mode=sed field=coordinates "s/\n/,/g"

The best part of this approach is that rename, nomv, and rex are all distributed streaming commands, which take advantage of powerful indexing layers for excellent performance.
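Assembled into one pipeline (the index and sourcetype are hypothetical; the rest mirrors the three steps above):

```
index=structured sourcetype=meteorites
| rename geolocation.coordinates{} as coordinates
| nomv coordinates
| rex mode=sed field=coordinates "s/\n/,/g"
```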

JSON without spath, now with handling for the multivalue

How the spath Command Works on JSON Data with Headers

Sometimes we are provided JSON that has meaningless headers. Often this is the result of a syslog server applying headers that our JSON data doesn’t need but that are standard output from the servers. These headers keep the JSON from validating correctly; hence, Splunk doesn’t perform its automatic field extraction on that data. So, what can we do? If you guessed spath, then you earned a gold star.

This data is from the usgs.gov earthquake feed, and then horrible headers were added via syslog. (Apologies to USGS; their data is excellent, and the spectacle seen below is not their fault.)

Quality JSON data hidden by headers

The best move would be to strip the headers using props/transforms or ingest actions, but in our scenario, this data already exists, or we cannot modify the data due to compliance requirements. Spath to the rescue.

To adjust this data for meaningful searching:

1. Create a field of all the JSON data using rex

rex field=_raw "Earthquakes (?<thefeed>.*)"

2. Use spath to separate the data using an input parameter that references the name of the field we created using rex

spath input=thefeed

3. Rename fields if you desire

I won’t rename them here, but we could issue:

| rename properties.* as *

Hence, I have this for my search:

index="structured" sourcetype="earthquakes" | rex field=_raw "Earthquakes (?<thefeed>.*)" | spath input=thefeed

post spath with fields

Like the previous meteorite feed, we have coordinates split apart. I want them in x,y notation, so we repeat the steps as shown before.

1. Rename geometry.coordinates{} to coordinates, since subsequent commands object to those curly brackets.

rename geometry.coordinates{} as coordinates

2. Merge the two values in coordinates for each event into one coordinate using the nomv command.

nomv coordinates

3. Use rex in sed mode to replace the \n separator that nomv inserts with a comma.

rex mode=sed field=coordinates "s/\n/,/g"

Ergo, I end up with the following search:

index="structured" sourcetype="earthquakes" | rex field=_raw "Earthquakes (?<thefeed>.*)" | spath input=thefeed | rename geometry.coordinates{} as coordinates | nomv coordinates | rex mode=sed field=coordinates "s/\n/,/g"

The final product of search-time spath on this hidden data

The Splunk spath Command Made Easy

The spath command provides a great deal of flexibility when dealing with certain types of structured data onboarded as standard unstructured data. Try it out and see what issues you can solve.

If you found this helpful…

You don’t have to master Splunk by yourself in order to get the most value out of it. Small, day-to-day optimizations of your environment can make all the difference in how you understand and use the data in your Splunk environment to manage all the work on your plate.

Cue Atlas Assessment: a customized report to show you where your Splunk environment is excelling and where there are opportunities for improvement. Once you download the Atlas Assessment app, you’ll get your report in just 30 minutes.
