Automation and configuration management tools are wonderful creatures. They come in many varieties, including BMC BladeLogic, Puppet, Salt, Chef, Ansible, IBM UrbanCode, and others. Implemented correctly, these tools can take days of manual effort down to minutes with a simple, wizard-like setup.
Splunk is delivered with the optional, and free, Deployment Server (see: Splunk documentation). The Splunk deployment server is a limited-use configuration management system that distributes application configuration across distributed Splunk architectures. Among other uses, we implement deployment servers to deploy input/output configurations to forwarders, props and transforms configurations to Splunk indexers, and applications to Splunk search heads.
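As a rough sketch of how that mapping works: the deployment server decides which clients get which apps in serverclass.conf, using whitelist and blacklist entries. The server class and app names below are illustrative, not taken from my lab:

```ini
# serverclass.conf on the deployment server (illustrative names)
[serverClass:base_config]
# Offer the base app to every deployment client...
whitelist.0 = *
# ...but never to the indexer itself (assumes a suffix naming convention)
blacklist.0 = splunkindex*

[serverClass:base_config:app:base_outputs]
# Restart splunkd on the client after the app is deployed
restartSplunkd = true
```

The blacklist entry is the important part of this sketch; it comes back to bite me later in the story.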
Automation and configuration management tools also create the most wondrous of problems. These tools are neither benevolent nor malevolent. Automation implements instructions regardless of the quality of those mandates. (Go ahead and ask how I know that…)
One day I was deploying a Splunk environment in our lab and I did what any good computer guy does — I borrowed working configurations. (Don’t judge, how many of us thought to make the wheel round on our own?) I built a new Splunk Index Server and Splunk Search Head, and named them <prefix>SplunkIndex and <prefix>SplunkSearch. I installed Splunk, hooked up the Indexer, and then enabled the Deployment Server. I copied applications to the deployment-apps directory on the deployment server, and then reloaded the deploy-server.
My forwarders, indexer, and search head all received their application configurations, and data started flowing into my new Splunk instance. It was great, until it wasn't.
After a few minutes my Splunk indexes stopped reporting any new events. The Splunk indexer was still online, the services were running on the indexer and on the forwarders, and new apps were still being deployed successfully to the forwarders. I checked outputs.conf on the forwarders, and even cycled those services to no avail. On the indexer, "netstat -na | grep 8089" showed connections from the forwarders, at least for a while. Then the connections went stale and the ephemeral ports froze. In splunkd.log I found references to frozen connections: the forwarders had ceased transferring data and declared the indexer frozen.
You win a brownie** if you know what was going on by this point in the story.
The key to this story is that the deployment server managed a base config application. In the name of automation, this base config deployed an outputs.conf to every server. However, the person I copied my configs from had the foresight to blacklist the Splunk index server so it wouldn't try to send outputs to itself (which can result in a really ugly loop). The configurations were fine until someone (ok, me) changed the name of the Splunk index server by adding a prefix to splunkindex instead of a suffix (in my defense, it looked better in vCenter). The blacklist controlling which servers receive the outputs.conf listed splunkindex*. If I had used a suffix, the indexer wouldn't have received the outputs.conf, and hence wouldn't have entered the computer version of an endless self-hug.
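Those whitelist and blacklist entries are shell-style glob patterns anchored at the start of the client name. A quick way to sanity-check a pattern offline is Python's fnmatch, which approximates the same matching behavior (the hostnames here are made up to mirror the story, not my actual server names):

```python
from fnmatch import fnmatch

# The blacklist pattern from the deployment server's serverclass.conf
blacklist = "splunkindex*"

# Suffix naming: the name starts with "splunkindex", so the pattern
# matches and the indexer is correctly blacklisted.
print(fnmatch("splunkindex01".lower(), blacklist))   # True

# Prefix naming: the name no longer starts with "splunkindex", the
# pattern misses, and the indexer happily receives an outputs.conf
# pointing at itself.
print(fnmatch("labsplunkindex".lower(), blacklist))  # False
```

Lowercasing both sides mirrors the fact that Splunk matches these patterns case-insensitively; fnmatch on its own is case-sensitive on some platforms.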
But, I decided to get cute on my naming convention and was rewarded with a very nice learning opportunity.
The takeaway: be like Santa and check your lists (white and black) twice before deploying applications to your environment.
** A figurative brownie in case you wondered.