
Using the spath Command


Written by: Michael Simko | Last Updated: May 1, 2024
Splunk Search Command Of The Week: spath
 
 

Originally Published: November 4, 2022

Your dilemma: You have XML or JSON data indexed in Splunk as standard event-type data.

Sure, you’d prefer to have brought it in as an indexed extraction, but other people onboarded the data before you got to it, and you need to make your dashboards work.

How do you handle this data in Splunk and make it searchable? We could write regular expressions and hope the shape of the data stays static, or we can use the easy button: the spath command.

What is the Splunk spath Command?

The spath command extracts fields and their values from XML or JSON data. You can specify location paths or let spath run in its default form, with no arguments. Spath is a distributed streaming command, meaning that if it appears in our search before any transforming or centralized commands, the spath work occurs at the indexing layer. With a robust set of indexers, distributed streaming can significantly improve search performance.
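
As a rough illustration of the location-path form, the sketch below asks spath for a single path and writes it to a new field before a transforming command runs. The index, the sourcetype teams_xml, the output field nickname, and the path team.Nickname are assumptions chosen to line up with the XML example later in this post; they are not values from a real environment.

```spl
index="structured" sourcetype="teams_xml"
| spath output=nickname path=team.Nickname
| stats count by nickname
```

Dropping the output= and path= arguments leaves a bare spath, which is the default form described above. Because spath sits ahead of stats here, the extraction should be eligible to run on the indexers rather than on the search head.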

Splunk does well on JSON data, even if it’s brought in as event data. We’ll show that later on as we discuss scenarios in which spath or rename and multivalue commands make more sense.

Spath helps with specialty issues in JSON but shines with XML data. So, we’ll start our investigation there.

The spath Command on XML Ingested as Event Data

Note: By event data, we mean the XML was not ingested as indexed extractions. Data brought in as indexed extractions likely do not need spath, as the fields should already exist.

To help train our consultants, I created an XML that looks like this:

				
<?xml version="1.0" encoding="UTF-8"?>
<dataroot>
    <team>
        <Stadium>Maryland Stadium</Stadium>
        <City>College Park, Maryland</City>
        <University>University of Maryland</University>
        <Nickname>Terrapins</Nickname>
        <KnownFor>Old Bay and Crab Cakes</KnownFor>
    </team>
    <team>
        <Stadium>Spartan Stadium</Stadium>
        <City>East Lansing, Michigan</City>
        <University>Michigan State University</University>
        <Nickname>Spartans</Nickname>
        <KnownFor>He has trouble with snap, and the ball is free</KnownFor>
    </team>
</dataroot>
				
			

So, we see an element named team, with additional elements one level under it. We could write regular expressions and hope we know the field names, or we can use spath to extract the fields quickly. Here, we used spath and then a rename to pretty up the results (without the rename, we’d see team.Nickname, team.City, etc.).
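
A minimal sketch of that spath-plus-rename search might look like the following. It assumes each event holds a single team element, and the index and sourcetype names are placeholders rather than the ones used for the original screenshot.

```spl
index="structured" sourcetype="teams_xml"
| spath
| rename team.* as *
| table University, Nickname, Stadium, City, KnownFor
```

The wildcard rename strips the team. prefix from every extracted field in one pass, so the table command can refer to the short names.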

From here, we now have a set of events and extracted fields to work with, and can then execute standard SPL to calculate statistics or create reports and dashboards.

How spath Works on Compliant JSON Event Data

Note: By event data, we mean the JSON was not ingested as indexed extractions. Data brought in as indexed extractions likely do not need spath, as the fields should already exist.

Here we see a dataset of meteorite impacts on Earth (shout out to jdorfman’s GitHub list of awesome JSON datasets). We onboarded this data with a simple line breaker and not as indexed extractions. As we can see, the geolocation.coordinates{} field was recognized as JSON and extracted. The other fields, such as id, mass, and name, were also autodetected. Hence, there isn’t much work we need to do to make this data usable.

JSON without spath

Splunk brought the data in and displayed the fields. However, we still need additional handling on the multivalue field geolocation.coordinates{} since it returns longitude and latitude as two elements.

To adjust this data:

  • STEP #1: Rename geolocation.coordinates{} to coordinates since subsequent commands object to those curly brackets.
				
					rename "geolocation.coordinates{}" as coordinates
				
			
  • STEP #2: Merge the two values in coordinates for each event into one coordinate using the nomv command.
				
					nomv coordinates
				
			
  • STEP #3: Use rex in sed mode to replace the \n that nomv uses to separate data with a comma
				
					rex mode=sed field=coordinates "s/\n/,/g"
				
			

The best part of this approach is that rename, nomv, and rex are all distributed streaming commands, which take advantage of powerful indexing layers for excellent performance.
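
Put together, the three steps above could be chained into one search along these lines. The sourcetype name meteorites is an assumption, since the post does not show the one used at onboarding; the name and mass fields are the autodetected ones mentioned earlier.

```spl
index="structured" sourcetype="meteorites"
| rename "geolocation.coordinates{}" as coordinates
| nomv coordinates
| rex mode=sed field=coordinates "s/\n/,/g"
| table name, mass, coordinates
```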

JSON without spath, now with handling for the multivalue

The spath Command on JSON Data with Headers

Sometimes we are provided JSON that has meaningless headers. Often this is the result of a syslog server applying headers that our JSON data doesn’t need but that are standard output from the servers. These headers result in the JSON not validating correctly; hence, Splunk doesn’t perform its automatic field extraction on that data. So, what can we do? If you guessed spath, then you earned a gold star.

This data is from the usgs.gov earthquake feed, with horrible headers added via syslog. (Apologies to USGS: their data is excellent, and the spectacle seen below is not their fault.)

Quality JSON data hidden by headers

The best move would be to strip the headers using props/transforms or ingest actions, but in our scenario, this data already exists, or we cannot modify the data due to compliance requirements. Spath to the rescue.

To adjust this data for meaningful searching:

  • STEP #1: Create a field of all the JSON data using rex
				
					rex field=_raw "Earthquakes (?<thefeed>.*)"
				
			
  • STEP #2: Use spath to separate the data using an input parameter that references the name of the field we created using rex
				
					spath input=thefeed
				
			
  • STEP #3: Rename fields if you desire. I won’t rename them here, but we could issue:
				
					| rename properties.* as *
				
			

Hence, I have this for my search:

				
					index="structured" sourcetype="earthquakes" 
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed
				
			

post spath with fields
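
To try the same rex-then-spath pattern without any indexed data, a throwaway event can be generated with makeresults. The event below is invented purely for illustration: the header text, the magnitude, the place, and the coordinates are made-up values, not real USGS output.

```spl
| makeresults
| eval _raw="Oct 3 12:00:00 host1 Earthquakes {\"properties\":{\"mag\":4.5,\"place\":\"Example Place\"},\"geometry\":{\"coordinates\":[-120.5,36.1]}}"
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed
| table properties.mag, properties.place
```

The rex capture keeps everything after the word Earthquakes, so spath only ever sees the JSON portion of the event.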

Like the previous meteorite feed, we have coordinates split apart. I want them in x,y notation, so we repeat the steps as shown before.

  • STEP #1: Rename geometry.coordinates{} to coordinates since subsequent commands object to those curly brackets.
				
					rename "geometry.coordinates{}" as coordinates
				
			
  • STEP #2: Merge the two values in coordinates for each event into one coordinate using the nomv command.
				
					nomv coordinates

				
			
  • STEP #3: Use rex in sed mode to replace the \n that nomv uses to separate data with a comma
				
					rex mode=sed field=coordinates "s/\n/,/g"
				
			

Ergo, I end up with the following search:

				
					index="structured" sourcetype="earthquakes" 
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed 
| rename "geometry.coordinates{}" as coordinates
| nomv coordinates 
| rex mode=sed field=coordinates "s/\n/,/g"
				
			

The final product of search-time spath on this hidden data
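
As an alternative sketch, not the approach used above, the rename, nomv, and rex steps could be collapsed into a single eval that uses the mvjoin function. The single quotes tell eval to treat the name with curly brackets as a field reference.

```spl
index="structured" sourcetype="earthquakes"
| rex field=_raw "Earthquakes (?<thefeed>.*)"
| spath input=thefeed
| eval coordinates=mvjoin('geometry.coordinates{}', ",")
```

This version leaves the original geometry.coordinates{} field in place alongside the new coordinates field, which may or may not be what you want in a dashboard.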

Conclusion

The spath command provides a great deal of flexibility when dealing with certain types of structured data onboarded as standard unstructured data. Try it out and see what issues you can solve.
