This article is split into three parts to improve readability – this is the first segment of the series.
TL;DR “My simple definition and mental model of metrics indexes, built on a foundational understanding of events indexes: metrics indexes are designed to store numeric measurements in a highly efficient manner. They consist of events that contain just the four standard Splunk index fields (_time, source, sourcetype, and host), along with numeric measurements stored under a metric_name, plus ‘dimensions’, which are string fields used for filtering and grouping the data.”
As a long-time Splunker, I have become increasingly aware of the existence and growth of ‘metrics’ data use in the IT industry, and my need to better understand how to work with Splunk’s metrics indexes. I recently had an opportunity to build an analytics solution in Splunk that leverages a metrics index – so I took some notes along the way, which I’ll share in this article. Metrics indexes are a bit of a different creature; if you’re just getting started on your metrics journey, I hope this will help shorten your path.
For the most part, metrics are a component of observability solutions. Metrics are values derived from counts or measurements that are calculated or aggregated over a period of time. Metrics can originate from a variety of sources, including infrastructure, hosts, services, cloud platforms, and IoT devices.
Splunk metrics indexes were introduced in the October 2017 Splunk Enterprise version 7.0 release as an alternative to using the ‘events’ index type to store time series numeric data.
Using metrics indexes in Splunk offers a significant boost in performance when storing and retrieving numerical data. Splunk metrics indexes use a highly structured format to handle the higher volume and lower latency demands associated with most metrics data, and storing data in metrics indexes provides faster search performance and lower storage requirements compared to saving the same data in events indexes.
You will see the term MTS (metric time series) used in discussions of metrics. An MTS is a collection of data points that have the same metric – such as CPU utilization – and the same set of dimensions, such as location and hostname. A series of measurements of CPU utilization for a given host over time is a Metric Time Series.
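To make that concrete, here is a sketch of one MTS (the metric name, host, and values are invented for illustration):

```text
metric_name=cpu.util.pct  host=web01  location=us-east-1   <- same metric, same dimensions
  _time=09:00:00  _value=21.3
  _time=09:05:00  _value=24.5
  _time=09:10:00  _value=19.8
```

Change any dimension value – say, host=web02 – and those data points belong to a different MTS.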
My simple definition and mental model of metrics indexes, built on a foundational understanding of events indexes: metrics indexes are designed to store numeric measurements in a highly efficient manner. They consist of events that contain just the four standard Splunk index fields (_time, source, sourcetype, and host), along with numeric measurements stored under a metric_name, plus ‘dimensions’, which are string fields used for filtering and grouping the data.
Metrics indexes are by nature somewhat different than events indexes, and the methods and tools for working with them are different as well. This article will outline these differences and provide examples of getting data into and out of metrics indexes, as well as strive to help you build some experience and a mental model of metrics indexes that aids in understanding and working with them intuitively.
Part 1 of this series will cover what metrics indexes are and how to create and populate one:
- Comparing events and metrics indexes – they are a bit different.
- Creating metrics indexes – and how to store multiple measurements in each event.
- Storing event data into metrics indexes – great for saving measurements and trend history.
Part 2 of the series will outline how to inspect and extract data from metrics indexes:
- Investigating metrics indexes – this is trickier than with events.
- Retrieving data from metrics indexes – this is trickier, too.
Part 3 wraps up the series with examples of how to analyze data from metrics indexes and use it in visualizations, as well as some notes on naming conventions and troubleshooting:
- Analyzing metrics data – much the same as events data, but there are some twists.
- Visualizing metrics data – formatting the data correctly helps.
- Naming conventions for metrics and dimensions – structure is important.
- Troubleshooting metrics indexes – what could go wrong?
Comparing events and metrics indexes
Assuming you are familiar with Splunk events indexes, you know that every event regardless of data source includes a timestamp (_time) and a sourcetype, source, and host field; the same is true of metrics indexes. But this is where the similarity fades.
Event indexes store timestamped events that can have multiple fields containing string or numeric data (name="Joe", cpu_pct=24.5, etc.), JSON, XML, or other key-value formats that use less common delimiters such as spaces. When you search an events index, you can (and should) filter on the index, sourcetype, and one or more other fields to reduce the result set to just the data of interest, and keep the time range only as large as needed, so you avoid searching through more data than necessary and improve search performance.
Metrics indexes store ‘metric data points’, which are a single measurement with a metric_name, a timestamp, and one or more ‘dimensions’ that might be considered labels for each measurement type. The _indextime field is absent from metrics indexes, and the source is optional.
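The anatomy of a single metric data point can be pictured like this (the field values are hypothetical):

```text
_time       = 1694090100      # timestamp (per-second resolution by default)
metric_name = cpu.util.pct    # what was measured
_value      = 24.5            # the numeric measurement (64-bit float)
host        = web01           # standard index field
source      = collectd        # optional in metrics indexes
sourcetype  = cpu_metrics
region      = us-east-1       # dimension (stored as a string)
role        = frontend        # dimension
```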
Note that a log records an event that occurred, while a metric is a measurement of some element of a system. Splunk introduced metrics indexes alongside traditional indexes to allow more efficient storage and search for each kind of data. Splunk metrics indexes use less storage space than events indexes and can increase search speed by as much as 500 times, using fewer system resources at a lower licensing cost. In this article I will use the term ‘events’ to refer to recorded metric data points, despite the risk that it is not the most accurate term.
The fields that make up a metrics index are depicted in the table below:
Some additional notes about metrics indexes include:
- Metrics timestamps have per-second precision by default. You can enable millisecond precision when creating the index.
- Numeric value measurements are 64-bit floating point numbers (only), which provide a precision of between 15 and 17 decimal digits.
- Dimensions are stored as string values – even when the value is a number.
- Metrics names are case-sensitive and cannot start with a number or underscore.
Working with metrics indexes may not be immediately intuitive – it wasn’t for me. It may be helpful to work through the examples below and then re-read this section to help solidify your mental model of metrics indexes. If you don’t want to create the suggested index and saved search, you can poke through the _metrics index that Splunk saves internal metrics to – but be warned that there is a lot to consume there for an initial learning path.
Creating metrics indexes
Metrics indexes are created in the same manner as events indexes. You can create them with Splunk Web (Settings > Indexes > New Index and click the Metrics button), the CLI, a REST call, or by manually adding a stanza and applicable entries in an indexes.conf file.
The contents of the indexes.conf file use the same format and entries as for events indexes, except that a metrics stanza must include a datatype = metric entry, plus an optional metric.timestampResolution = ms entry if millisecond resolution is desired (the default is 1 second, so no entry is required for 1-second resolution).
Note that millisecond timestamp resolution can result in decreased search performance.
It may be prudent to add a frozenTimePeriodInSecs = 63072000 (2 years) or similar entry to the indexes.conf stanza to limit retention to something less than the default of roughly 6 years, to cap the index size with a maxTotalDataSizeMB entry, or both.
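Putting those settings together, a minimal metrics index stanza might look like the following sketch (the index name, paths, and the size/retention values are examples, not recommendations):

```ini
[app_statistics_metrics]
homePath   = $SPLUNK_DB/app_statistics_metrics/db
coldPath   = $SPLUNK_DB/app_statistics_metrics/colddb
thawedPath = $SPLUNK_DB/app_statistics_metrics/thaweddb
# Required: marks this as a metrics index
datatype = metric
# Optional: enable millisecond timestamps (default is per-second)
#metric.timestampResolution = ms
# Optional: retention (2 years) and size limits
frozenTimePeriodInSecs = 63072000
maxTotalDataSizeMB = 512000
```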
Be aware that metric indexes don’t support the delete command, so take care when creating your metric_names, dimensions, etc. Try to achieve a well-thought-out, hierarchical structure for labeling measurements, or be prepared to exclude older, poorly identified measurements from your final analysis product or just delete the index and start over.
Storing multiple metric measurements
By default, sending measurements to a metrics index results in a single metrics event being created for each measurement value, even if you have multiple measurements to store for the same timestamp and source. If you are running Splunk Enterprise v8.0.0 or later, this behavior can be changed by creating or editing a limits.conf file (in the /local directory of the app you're building your metrics solution in, or in $SPLUNK_HOME/etc/system/local) with the following entry under the [mcollect] stanza:
[mcollect]
# The default (true) makes mcollect write one metric data point per measurement.
# Set to false to allow multiple measurements to be saved in a single event.
always_use_single_value_output = false
Storing event data into metrics indexes
There are a number of approaches and solutions for getting machine data into Splunk metrics indexes that are beyond the scope of this article. The discussion and examples below focus on extracting data from existing Splunk event type indexes and storing statistical data derived from these events into metrics indexes for later analysis and visualization. This is a good way to gain familiarity with metrics indexes, and the retrieval and visualization techniques covered in this article will apply regardless of where and how the metrics data originates.
The following example illustrates a search against the Splunk internal _audit index to analyze run times for searches that occurred in the previous hour, in 15-minute increments, and include information about the Splunk app the search originated from in case you need help tracking down an excessive offender:
index=_audit sourcetype=audittrail action=search search_id=* earliest=-1h@h latest=@h
| bin _time span=15m
```Reduce the fields returned by the indexing tier to just what is needed – improves performance```
| fields _time host app total_run_time
```Obtain statistical values for each 15-minute sample period and create hierarchical metric names```
| stats count(total_run_time) AS search.tot.run.time.count, avg(total_run_time) AS search.tot.run.time.sec.avg, max(total_run_time) AS search.tot.run.time.sec.max, perc95(total_run_time) AS search.tot.run.time.sec.p95, stdev(total_run_time) AS search.tot.run.time.sec.stdev BY _time, host, app
```Round each metric value except count to 3 decimal places```
| eval search.tot.run.time.sec.avg=round('search.tot.run.time.sec.avg', 3), search.tot.run.time.sec.max=round('search.tot.run.time.sec.max', 3), search.tot.run.time.sec.p95=round('search.tot.run.time.sec.p95', 3), search.tot.run.time.sec.stdev=round('search.tot.run.time.sec.stdev', 3)
```Ensure each metric has some numeric value```
| fillnull value=0
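The transformed output is a row/column table with one row per 15-minute bucket per host and app; its shape looks roughly like this (the values shown are invented for illustration):

```text
_time                host  app     search.tot.run.time.count  search.tot.run.time.sec.avg  ...
2023-09-07 09:00:00  sh01  search  142                        1.237                        ...
2023-09-07 09:15:00  sh01  search  118                        0.981                        ...
```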
The results include a search count and the average, peak, 95th percentile, and standard deviation metrics for each 15-minute period. This search can be saved as a Report with ‘Save As’ > Report – this one is named ‘Search Run Time Metrics’:
The search is then scheduled to run in a less-busy period after the top of each hour by clicking Edit > Edit Schedule and using a cron schedule such as ’13 * * * *’.
There are two approaches to getting this statistical data derived from event logs into a metrics index. You can add the mcollect or meventcollect command, with the appropriate syntax and arguments, as the last line of your search's SPL (Search Processing Language), or you can let Splunk do it for you.
The easier and more educational approach, at least to get started, is to let Splunk configure an mcollect command for you: find your report in ‘Searches, Reports, and Alerts’ and select Edit > Edit Summary Indexing. Click the ‘Enable summary indexing’ checkbox, select the appropriate metrics index from the ‘Select the summary index’ drop-down, and click ‘Save’.
After the scheduled search has run at least once, click ‘View Recent’ and then click the latest run entry. You will see the SPL and search results – note that an mcollect entry has been added as the last line of the SPL by Splunk as a result of configuring Summary Indexing in the step above:
| mcollect spool=t index="app_statistics_metrics" file="RMD5680f66cdd0d29571_464585425.stash_new" name="Search Run Time Metrics" marker="" split=allnums
The arguments to the mcollect command in the above example are as follows:
| mcollect – the mcollect command converts events into data points to be stored in a metrics index.
Basically, the mcollect command with the arguments above collects the search results into a unique file created in the $SPLUNK_HOME/var/spool/splunk directory on the search head, where it is indexed and then deleted.
Note: An alternative is the meventcollect command, which converts events generated by streaming search commands into metric data points and stores that data into a metrics index on the indexers. Put another way – meventcollect can be used to save non-transformed data from events into metrics indexes. If you ‘transform’ the data by converting it into a row/column table format with a stats or table command, you’ll need to use mcollect to store it in a metrics index. See the Splunk docs for details: https://docs.splunk.com/Documentation/Splunk/9.1.0/SearchReference/Meventcollect
spool=t If set to true, the metrics data file is written to the Splunk spool directory ($SPLUNK_HOME/var/spool/splunk) for ingestion and deletion. If set to false, the file is written to the $SPLUNK_HOME/var/run/splunk directory, and the file will remain in this directory unless further automation or administration is done to remove it.
file="RMD5680f66cdd0d29571_464585425.stash_new" The file name where you want the collected metric data to be written. The filename in this example was generated by Splunk when using the Summary Indexing option, but you can specify a filename if you manually configure the mcollect command. You can also use a timestamp or a random number for the file name by specifying either file=$timestamp$ or file=$random$.
name=”Search Run Time Metrics” This argument isn’t specified in the Splunk docs but appears to behave like an alias for the ‘marker’ argument discussed below.
marker=”” Unused in this example. The syntax for this argument is: marker=<string>
The marker argument string can be one or more comma-separated key/value pairs that mcollect adds as dimensions to metric data points it generates, to ease searching on those metric data points later. An example of using this argument is: marker=dataset=search_run_time_statistics. You could then use ‘dataset=search_run_time_statistics’ as a filter in your search to extract just those metrics events.
split=allnums The syntax for this argument is: split=<true | false | allnums>
When set to ‘allnums’, mcollect treats all numeric fields as metric measures and all non-numeric fields as dimensions. This eliminates having to specify these, but you can optionally use a ‘field-list’ argument to declare that mcollect should treat certain numeric fields in the events as dimensions.
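For example, a manually configured mcollect might look like the following sketch, where response_code is a hypothetical numeric field that should be stored as a dimension rather than as a measure (the index name matches the example above):

```spl
| mcollect index="app_statistics_metrics" split=allnums response_code
```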
You can review the syntax, usage, and additional information about mcollect arguments in the Splunk docs: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Mcollect
Final Notes on mcollect
If you inspect the savedsearches.conf file containing a scheduled search with Summary Indexing configured through Splunk Web, you will NOT find the mcollect command entries there. They are the result of an ‘action.summary_metric_index.command’ setting, which can be found under Edit > Advanced Edit for your saved search.
However, once you understand how to use the mcollect command and its arguments, you can add it to the bottom of your saved-search SPL and forgo using the Summary Index option altogether to provide better control over how the data is indexed.
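For instance, the scheduled search from earlier could end with a manually added line like this sketch (spool=t sends the results to the spool directory for indexing; the marker value is a hypothetical label of your choosing):

```spl
| mcollect spool=t index="app_statistics_metrics" marker="dataset=search_run_time_statistics" split=allnums
```

With the marker set, later searches can filter on dataset=search_run_time_statistics to retrieve just these data points.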
In this segment, we covered what metrics indexes are, how to create one, and how to use the mcollect command to save data extracted from event logs into a metrics index. Part 2 of this series will outline how to inspect and become familiar with metrics indexes, and how to extract data from them – see you there!
Subscribe to the Kinney Group blog to make sure you don’t miss out on parts 2 and 3 as they are released in the coming weeks!
If you found this helpful…
You don’t have to master Splunk by yourself in order to get the most value out of it. Small, day-to-day optimizations of your environment can make all the difference in how you understand and use the data in your Splunk environment to manage all the work on your plate.
Cue Atlas Assessment: Instantly see where your Splunk environment is excelling and opportunities for improvement. From download to results, the whole process takes less than 30 minutes using the button below: