Using HashiCorp Consul to Optimize Database Stability and Recovery Times

Nearly every admin has been on that all-night call, trying to find out why end users can’t access some data. After hours of troubleshooting, you discover that one of the databases in the cluster isn’t responding and a server or service is hanging on to a connection. You have to get a database administrator (DBA) on the call to shut down services on the database (DB) instance in the cluster, or manually point your application to a known-good DB instance. This typically requires a service restart before the application recognizes the new database URL.

Service Mesh to the Rescue

Teams spend hours troubleshooting and fixing issues like this one. Typical load balancing is great for checking basic TCP connectivity, and clustering can provide the most available instance of a server in the immediate cluster. There are, however, situations that demand deeper health checks (think API calls and custom scripting) or even extended connectivity outside of your immediate network. Traditional network-based load balancing and clustering are rarely robust enough for use cases like this.

This is where a service mesh can swoop in and save the day. Now, when I think about HashiCorp Consul’s service mesh, I immediately picture an infrastructure that is based on microservices. Pairing a service mesh with a database is the last thing you might think of, but bear with me for a moment while I walk through how I’ve used HashiCorp’s Consul to solve this exact use case and more.

Service Mesh 101

Let’s start with a definition. The term service mesh describes the network of microservices that make up a distributed application and the interactions between them.

HashiCorp’s Consul has a server/client architecture in which the server (or container host) keeps track of the services that are online, their performance, and their health. On the client side, an agent “phones home” to the Consul server to report status, including the status of the application running locally on that client. This lets the Consul server gauge how fast each instance can respond and compare the results. The Consul server can then report back the “most available” instance.

For an idea of what the architecture looks like, let’s reference the following image, courtesy of hashicorp.com.

Figure 1 - the client's databases connect to servers, which continuously communicate their status

There are a few interesting things about HashiCorp Consul’s service mesh that you can see from this image. While the clients talk to the server, they also communicate amongst themselves. This allows Consul to paint a bigger picture of what can communicate and how performant each component is at multiple points. You can also see that the application extends beyond a single datacenter, something traditional load balancers struggle to do.

Database Communication

Now, suppose your web service relies on a backend database (your database is represented by the silver DB icons in the image). The databases have a health check script that tells the servers (in black) that they are healthy and can be advertised as available.

Consul acts as a DNS server, so when you query it via DNS, it will return the most available IP for the name you are querying. For example, if you query for dbservice.service.consul, Consul might return the IP for DB 1 if that is the most available server. The next query might return DB 3 or even DB 5 if it happens to be closest and most available.

While this might seem complicated, Consul eliminates much of the complexity that can come with a service mesh. Consul’s DNS responses default to a TTL (time to live) of zero, so as long as your service honors DNS TTLs (some services may not), you will always get the latest result.
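
To make the TTL point concrete, here is a minimal Python sketch of a client that resolves the service name every time it connects instead of holding on to an old address. It assumes the host's resolver forwards the .consul domain to Consul. The helper names resolve_now and connect_fresh are hypothetical, chosen only for this illustration.

```python
import socket


def resolve_now(name, port):
    """Ask the system resolver for the current IPs behind `name` (nothing is cached here)."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    ips = []
    for info in infos:
        ip = info[4][0]
        if ip not in ips:
            ips.append(ip)
    return ips


def connect_fresh(name, port, timeout=2.0):
    """Resolve `name` again, then connect to the first address that accepts."""
    last_error = None
    for ip in resolve_now(name, port):
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_error = err  # try the next address that was returned
    raise ConnectionError("no reachable address for %s: %s" % (name, last_error))
```

An application would call something like connect_fresh("dbservice.service.consul", 3306) whenever it needs a new connection, so a failed instance drops out of rotation on the next lookup rather than after a restart.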

The next time you get that all-night troubleshooting call, this solution can help in the following ways:

  1. When a database instance is not available at all, Consul will see that it is not responding and stop providing its IP as a DNS result.
  2. The Consul server will also stop providing a database’s IP as a DNS result when the database is online but not passing health checks from the health check script.
  3. When an application queries Consul for a service, it will always get the instance that is most available and never one that is not available.
  4. Consul can extend from one data center or cloud provider to the next, removing the traditional dependency of having services “behind” the load balancer.
  5. All of the above result in near-instantaneous failover times.
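
The first three points boil down to a filter over check results. The following is a toy model in Python, not Consul's actual implementation: it treats an instance as eligible for DNS answers as long as none of its checks is critical, which matches Consul's default DNS behavior. The IP addresses and the function name dns_candidates are made up for the example.

```python
def dns_candidates(instances):
    """Return the addresses of instances that have no critical check.

    `instances` maps an address to the list of its check statuses
    ("passing", "warning", or "critical").
    """
    return [
        address
        for address, statuses in instances.items()
        if "critical" not in statuses
    ]


cluster = {
    "10.0.0.11": ["passing", "passing"],    # DB 1: script and TCP checks pass
    "10.0.0.12": ["critical", "passing"],   # DB 2: port is open, script check fails
    "10.0.0.13": ["critical", "critical"],  # DB 3: host is down
}
print(dns_candidates(cluster))  # prints ['10.0.0.11']
```

Note that an instance in the warning state is still returned in this model. Consul behaves the same way by default; its only_passing DNS setting is the option that tightens this.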

Let’s take a quick look at what it takes to implement something like this in your environment.

Installation

Consul, in its basic form, is simply a binary you download from HashiCorp. The latest installation procedures can be found on HashiCorp’s learning portal. These instructions will get your Consul server up and running and configure the basics. Once you do this, you’ll need to set up your database health check. To do so, download the same binary onto the database host and follow the concepts on this page of the learning portal.

With your Consul client running, you will want to implement a database health check script. As a fan of Python, I tend to lean on it for my health checks. For this example, my Consul client config (consul-conf) looks something like this:

services {
  id   = "mysql"
  # The service name is what DNS serves: mysql.service.consul
  name = "mysql"
  tags = ["primary"]
  address = ""
  port = 3306
  checks = [
    {
      # Script checks require enable_local_script_checks (or enable_script_checks) on the agent
      args     = ["/bin/check_mysql.py"]
      interval = "5s"
      timeout  = "20s"
    },
    {
      id       = "mysql_tcp"
      name     = "MySQL TCP on port 3306"
      tcp      = "localhost:3306"
      interval = "10s"
      timeout  = "1s"
    }
  ]
}

This will execute my check_mysql.py script from the /bin directory as well as ensure the server is responding on TCP 3306.
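
For a sense of what the TCP half of that configuration tests, here is a rough Python equivalent. It is only a sketch of the idea (open a connection, succeed or fail within a timeout), not how Consul implements its TCP check, and tcp_check is a hypothetical name.

```python
import socket


def tcp_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling tcp_check("localhost", 3306) mirrors the check above. It only proves that something is listening on the port, which is why the script check is there to confirm the database actually answers.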

A basic Python MySQL check might look like this:

#!/usr/bin/env python3

import sys

import MySQLdb  # provided by the mysqlclient package

try:
    # Connect to the MySQL database; connect() raises an exception on failure
    db = MySQLdb.connect(host='localhost', user='root', passwd='v3ryB@dP@ss', db='AGreatDbName')
except MySQLdb.Error as err:
    # Consul treats exit code 1 as "warning"; any other non-zero code is "critical"
    print("Connection unsuccessful: %s" % err)
    sys.exit(2)

# Connection succeeded
db.close()
print("Connection successful")
sys.exit(0)
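
Consul reads a script check's exit code: 0 means passing, 1 means warning, and any other code means critical. The sketch below shows one way to map a probe's outcome onto those codes. The run_probe helper and the slow_after threshold are assumptions made for this illustration and are not part of Consul or of the script above.

```python
import time

PASSING = 0   # Consul marks the check "passing"
WARNING = 1   # Consul marks the check "warning"
CRITICAL = 2  # any exit code other than 0 or 1 is treated as "critical"


def run_probe(probe, slow_after=1.0):
    """Run `probe` and translate the outcome into a script-check exit code."""
    started = time.monotonic()
    try:
        probe()
    except Exception as err:
        print("probe failed: %s" % err)
        return CRITICAL
    elapsed = time.monotonic() - started
    if elapsed > slow_after:
        print("probe slow: %.2fs" % elapsed)
        return WARNING
    print("probe ok: %.2fs" % elapsed)
    return PASSING
```

In a real check script, probe would open the database connection and perhaps run a trivial query, and the script would end with sys.exit(run_probe(probe)). Keep in mind that, by default, only critical instances are dropped from DNS results, so a database that must be pulled from rotation needs an exit code of 2 or higher.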

You can set up a couple of client database servers and easily test to see what Consul is returning by using dig to query Consul like this:

dig @consulserver.yourdomain.com -p 8600 mysql.service.consul
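
If you would rather do the same lookup from application code, Consul also exposes health information over its HTTP API. The sketch below uses the /v1/health/service endpoint with the passing filter and assumes the API is reachable on the default port 8500 of the same placeholder hostname used in the dig command. The function names are hypothetical.

```python
import json
import urllib.request


def passing_instances(entries):
    """Extract (address, port) pairs from a /v1/health/service response body."""
    results = []
    for entry in entries:
        service = entry["Service"]
        # An empty service address means "use the node's address".
        address = service.get("Address") or entry["Node"]["Address"]
        results.append((address, service["Port"]))
    return results


def lookup(service, consul_url="http://consulserver.yourdomain.com:8500"):
    """Fetch the instances of `service` whose checks are all passing."""
    url = "%s/v1/health/service/%s?passing=true" % (consul_url, service)
    with urllib.request.urlopen(url, timeout=5) as response:
        return passing_instances(json.load(response))
```

Calling lookup("mysql") would return a list of address and port pairs for the healthy instances, which is handy for scripted failover tests.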

Failovers in a Snap

Using this basic idea, you can have near-instant failovers from node to node, between datacenters, or even between cloud providers. In real-world implementations, I’ve used Consul to decrease outages from hours to mere seconds, with 100% of the failover happening automatically. For more Kinney Group solutions like this one, get in touch by filling out the form below:
