Your dilemma: You have XML or JSON data indexed in Splunk as standard event-type data.
Sure, you'd prefer to have brought it in as an indexed extraction, but other people onboarded the data before you got to it, and now you need to make your dashboards work.
So how do you make this data searchable? You could write regular expressions and hope the shape of the data never changes, or you can reach for the easy button: the spath command.
What is the Splunk spath Command?
The spath command extracts fields and their values from either XML or JSON data. You can specify location paths or let spath run in its default form. Spath is a distributed streaming command, meaning that if it appears in your search before any transforming or centralizing commands, the spath work happens on the indexers. With a robust set of indexers, distributed streaming can significantly improve search performance.
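As a minimal sketch of a location path, the following pulls a single nested value into a named field; the path and field names here are made up purely for illustration, not taken from the data below:

| spath output=server_name path=cluster.server.name

Run with no arguments, spath simply extracts every field it can find in _raw.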
Splunk does well on JSON data, even if it's brought in as event data. We'll show that later on as we discuss scenarios where spath, or a combination of rename and multivalue commands, makes more sense.
Spath helps with specialty issues in JSON but shines with XML data. So, we’ll start our investigation there.
The spath Command on XML Ingested as Event Data
Note: By event data, we mean the XML was not ingested as indexed extractions. Data brought in as indexed extractions likely do not need spath, as the fields should already exist.
To help train our consultants, I created an XML file that looks like this:
Maryland Stadium
College Park, Maryland
University of Maryland
Terrapins
Old Bay and Crab Cakes
Spartan Stadium
East Lansing, Michigan
Michigan State University
Spartans
He has trouble with snap, and the ball is free
So, we see an element named team, and additional elements one level under that. We could write a regex and hope we know the field names, or use spath to extract the fields quickly. Here, we used spath followed by a rename to pretty up the results (without the rename, we'd see team.NickName, team.City, and so on).
From here, we now have a set of events and extracted fields to work with, and can then execute standard SPL to calculate statistics or create reports and dashboards.
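A minimal sketch of that XML search, using a placeholder index and sourcetype, with a stats line tacked on to show the kind of reporting that becomes possible:

index="structured" sourcetype="college_teams"
| spath
| rename team.* as *
| stats count by NickName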
How spath Works on Compliant JSON Event Data
Note: By event data, we mean the JSON was not ingested as indexed extractions. Data brought in as indexed extractions likely do not need spath, as the fields should already exist.
Here we see a dataset of meteorite impacts on the earth (shout out to jdorfman's GitHub list of awesome JSON datasets). We onboarded this data with a simple line breaker and not as indexed extractions. As we can see, the geolocation.coordinates{} field was detected as JSON and extracted. The other fields, such as id, mass, name, etc., were also autodetected. Hence, there isn't much work we need to do to make this data usable.
Splunk brought the data in and displayed the fields. However, we still need additional handling on the multivalue field geolocation.coordinates{}, since it returns longitude and latitude as two separate values.
To adjust this data:
- STEP #1: Rename geolocation.coordinates{} to coordinates since subsequent commands object to those curly brackets.
rename geolocation.coordinates{} as coordinates
- STEP #2: Merge the two values in coordinates for each event into one coordinate using the nomv command.
nomv coordinates
- STEP #3: Use rex in sed mode to replace the \n that nomv uses to separate data with a comma
rex mode=sed field=coordinates "s/\n/,/g"
The best part of this approach is that rename, nomv, and rex are all distributed streaming commands, which can take advantage of a powerful indexing layer for excellent performance.
index="structured" sourcetype="look_out_dinosaurs" host=BoomGoesBigRock | rename geolocation.coordinates{} to coordinates | nomv coordinates | rex modex=sed field=coordinates "s/\n,/g"
The spath Command on JSON Data with Headers
Sometimes we are provided JSON that has meaningless headers. Often this is the result of a syslog server applying headers that our JSON data doesn't need but that are standard output from those servers. These headers keep the JSON from validating correctly; hence, Splunk doesn't perform its automatic field extraction on that data. So, what can we do? If you guessed spath, you've earned a gold star.
This data is from the usgs.gov earthquake feed, with headers then added via syslog. (Apologies to USGS; their data is excellent, and the spectacle seen below is not their fault.)
The best move would be to strip the headers using props/transforms or ingest actions, but in our scenario, this data already exists, or we cannot modify the data due to compliance requirements. Spath to the rescue.
To adjust this data for meaningful searching:
- STEP #1: Create a field of all the JSON data using rex
rex field=_raw "Earthquakes (?.*)"
- STEP #2: Use spath to separate the data using an input parameter that references the name of the field we created using rex
spath input=thefeed
- STEP #3: Rename fields if you desire. I won’t rename them here, but we could issue:
| rename properties.* as *
Hence, I have this for my search:
index="structured" sourcetype="earthquakes"
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed
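At this point the feed's fields are available to ordinary SPL. For example, appending a line like the following (assuming the standard USGS field names) shows each quake's location and magnitude:

| table properties.place, properties.mag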
Like the previous meteorite feed, we have coordinates split apart. I want them in x,y notation, so we repeat the steps as shown before.
- STEP #1: Rename geometry.coordinates{} to coordinates since subsequent commands object to those curly brackets.
rename geometry.coordinates{} as coordinates
- STEP #2: Merge the values in coordinates for each event into one coordinate using the nomv command.
nomv coordinates
- STEP #3: Use rex in sed mode to replace the \n that nomv uses to separate data with a comma
rex mode=sed field=coordinates "s/\n/,/g"
Ergo, I end up with the following search:
index="structured" sourcetype="earthquakes"
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed
| rename geometry.coordinates{} as coordinates
| nomv coordinates
| rex mode=sed field=coordinates "s/\n/,/g"
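As one last sketch, appending the rename suggested earlier plus a sort and table to the search above (place and mag assume the standard USGS field names) produces a tidy report of each quake, largest first, with its merged coordinate:

| rename properties.* as *
| sort - mag
| table place, mag, coordinates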
Conclusion
The spath command provides a great deal of flexibility when dealing with certain types of structured data onboarded as standard unstructured data. Try it out and see what issues you can solve.