Trond's Working!

Infrastructure service discovery in a Windows-based, non-containerized world

October 20, 2017

I’ve spent so much time in the world of Kubernetes lately that I wanted to write about something completely different — even old-school by some standards.

Data is an important part of what we do at my employer, and Elasticsearch is one of our main tools to figure out what our systems are doing. We’ve been running an old (but well-working!) version of Elasticsearch for about 18 months, and upgrading it has been on the to-do list for a long time — not just Elasticsearch but our entire logging pipeline, which includes Filebeat (the lightweight agent that runs on all nodes), Logstash (log processing/parsing), Elasticsearch itself (storage), and Kibana (visualization).

Our old FELK stack (get it?) runs in Azure and is very “static”. Filebeat agents are configured using Ansible, the Logstash servers are “static” nodes, and our Elasticsearch cluster is a mix of “hot” and “cold” nodes. We transfer older data onto “cold” (cheaper) nodes where searches are slower, but it saves us a bunch of money. We have a custom process, running as a combination of Lambda and Flask, that moves data from hot to cold nodes and makes sure data is backed up and optimized/defragmented. All of this has been working really well — I can’t remember us having a single blip of downtime in ages.

But: Things are brewing at Elastic, and we didn’t want to be left behind version-wise, so it was time to upgrade our stack. This also gave us a chance to take a look at how we do things, and possibly replace our fairly static infrastructure with something more dynamic.

It’s worth mentioning that we are now also running Consul on all nodes — something we didn’t do at the time we deployed our “V1” FELK stack. And while the old stack runs happily behind Nginx proxies, we’ve had awesome results using the Traefik load balancer in our Kubernetes clusters, and I figured it was time to put it to use outside of Kubernetes as well to see how it worked.

The idea behind service discovery is that instead of configuring a client with a static address for whatever it needs to reach, you provide it with an “alias” which comes from Consul or whatever service discovery system you’ve put in place. Nodes offering a service simply tell Consul about it, and since Consul is a lightning-fast replicated system, every single node in your network has a full “view” of which nodes offer which services. For a lot of systems this is kinda redundant. If you run everything inside Kubernetes you already have a robust service discovery system built in. For http services it might be easier to just pipe all traffic through an L7 load balancer and let that route the traffic to where it needs to go.

However, Logstash especially is a bit wonky. Filebeat and Logstash don’t communicate over http; they use long-lived tcp connections. Because of this, we’ve always found it easier to just have Filebeat talk directly to Logstash without anything in between. This is fine for a static infra: configure your filebeat.yml with “logstash.mycompany.com” and you’re done. However, we wanted to see if we could build a more dynamic infra — especially now that we’re moving to AWS, where we’re able to use Auto Scaling groups to dynamically scale our stuff up and down and to make sure we’re running on relatively short-lived nodes so that we don’t have to deal with patching and upgrades.
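
To make this concrete, here’s a minimal sketch of what the Logstash output in filebeat.yml could look like when it points at a Consul DNS alias instead of a static hostname. The service name “logstash” and port 5044 are just assumptions for illustration, not our actual config:

```yaml
# filebeat.yml (sketch) -- only the output section shown
# "logstash.service.consul" is Consul's default DNS name for a service
# registered as "logstash"; 5044 is the conventional Beats port.
output.logstash:
  hosts: ["logstash.service.consul:5044"]
```

This assumes the nodes can resolve *.consul names, for example by forwarding that DNS zone to the local Consul agent.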

Another cool thing about Consul is that it supports so-called “checks” — both on the service-offering side and on the service-consuming side. On the server side you can use checks to make sure that only healthy nodes register themselves as service “offerers” (or “servers”, as some call it :-) ). On the client side you can have Consul perform some action when there’s a change in Consul — for example if a new service node comes online or if one leaves. Newer versions of Logstash have a built-in health endpoint which is perfect for this. So, instead of “assuming” that a service really works, you can make sure that the server continuously “proves” that it is capable of serving. We currently run a bunch of Logstash processes on the same servers, but this will also allow us to break things apart without thinking twice about it — because clients relate to the service and not the host offering the service. Very nice.
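
As an illustration, a Consul service definition for Logstash with a health check might look something like this. The specifics (the ports, the interval, using Logstash’s monitoring API on 9600 as the health endpoint) are assumptions on my part:

```json
{
  "service": {
    "name": "logstash",
    "port": 5044,
    "check": {
      "http": "http://localhost:9600",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```

With this in place, Consul only hands out nodes whose check is passing when clients look up the “logstash” service.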

On the client side, we’re experimenting a bit with different models. Filebeat will, as I wrote, keep an open tcp connection to its configured Logstash, and if it loses that connection it will simply try again. One thing we’ve found is that Filebeat will not refresh its dns before retrying, so if it loses its connection to a deleted Logstash server, it needs a “kick” to get back into shape. Consul watches fix this elegantly — we’ve simply configured Consul watches to restart the Filebeat agent if there’s a change in the service used by the agent. Filebeat is pretty lightweight and checkpoints its current status, so we’re not very worried about it restarting now and then. This means that if a Logstash node leaves or comes online, Filebeat will restart — which implicitly causes it to reevaluate which Logstash node to talk to. We still need to figure out how to get notified if a Filebeat agent keeps failing.
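
A rough sketch of such a watch definition, dropped into the Consul agent’s config directory on each node. Since these are Windows boxes, the handler is a PowerShell restart of the Filebeat service — the exact Windows service name and handler syntax here are assumptions:

```json
{
  "watches": [
    {
      "type": "service",
      "service": "logstash",
      "handler": "powershell.exe -Command \"Restart-Service filebeat\""
    }
  ]
}
```

The watch fires whenever the set of healthy “logstash” instances changes, which is exactly the “kick” Filebeat needs.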

We’ll also use Consul for Elasticsearch, but these nodes won’t be ephemeral. There’s too much data to copy back and forth for that to make any sense. However, Elasticsearch is essentially a REST endpoint, which we’ll put behind Traefik load balancers. Here we’ll use Consul to drive Traefik’s configuration so that only nodes that are alive and working get traffic. In normal operation this will be all “hot” nodes, but during maintenance windows we’ll simply be able to bring down nodes one by one without worrying about clients not hitting a working endpoint. And if we decide to set up dedicated “coordinator” nodes in Elasticsearch, then we’ll be able to do that without any manual change anywhere.
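
A sketch of the relevant bits of traefik.toml using Traefik’s Consul Catalog backend (Traefik 1.x syntax; the entrypoint port and the choice to only expose tagged services are my assumptions):

```toml
# traefik.toml (sketch)
[entryPoints]
  [entryPoints.http]
  address = ":9200"           # clients keep talking to "Elasticsearch" on 9200

[consulCatalog]
endpoint = "127.0.0.1:8500"   # local Consul agent
exposedByDefault = false      # only route services explicitly tagged for Traefik
```

The Elasticsearch nodes would then register in Consul with a "traefik.enable=true" tag, and Traefik only sends traffic to instances whose Consul checks are passing.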

So that’s what we plan to do. We have a metric ton of data in our old cluster, and we’ve decided to copy data across instead of simply upgrading our old cluster. This allows us to do some much-needed index cleanup and reorganization, so although it’s a bit of a painful process, I think it will be worth it in the end.
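
One way to do the copying (not necessarily the tool we’ll end up using) is Elasticsearch’s reindex-from-remote API, which pulls an index from the old cluster into the new one. The hostname and index name below are made up:

```
POST _reindex
{
  "source": {
    "remote": { "host": "http://old-elastic.example.com:9200" },
    "index": "logs-2017.01"
  },
  "dest": {
    "index": "logs-2017.01"
  }
}
```

The old cluster’s address has to be whitelisted via reindex.remote.whitelist on the new nodes, and this is also a natural point to rename or merge indices as part of the cleanup.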

So there. I wanted to write this down because I’m having a ton of fun working on this stuff, and even though this isn’t fancy new tech like Kubernetes, it’s still super-rewarding to revisit an existing design and find tons of ways to improve upon something that already works really well for us. I’m super-stoked about running our new stack in AWS, and being able to drive it all using Ansible/CloudFormation to construct a truly dynamic logging infrastructure makes this a really interesting project. Tons of fun!


Trond Hindenes

Hi, I'm Trond Hindenes, SRE lead at RiksTV. Fan of Python, drones, cloud and snowboarding. I'm on twitter.