Building a self-coordinating Config Management system

If you happen to read this blog from time to time, the title shouldn’t come as a surprise – it’s a topic I’ve kept revisiting over the last year for multiple reasons. The most important of these is simply that I find the topic extremely interesting.

So just to set the stage: Now that we have our Ansible playbooks and/or Chef cookbooks or DSC configs, we’re able to fully automate the entire configuration of the servers (nodes) that we run. However, it turns out that just changing stuff on computers running in production and serving thousands of users isn’t a very good idea, because you might bring down the entire service by doing so. Back in the day this was solved with “maintenance windows”: between 02:00 and 04:00 every other Sunday, the server could do whatever it wanted, and it was okay to bring down the entire service if needed. In this day and age, that doesn’t cut it anymore. What if you need that change ASAP? We simply cannot afford to wait two weeks until the next service window opens up.

Many organizations are solving this with so-called “immutable infrastructure” – the idea is that you never make changes to your infra; you rebuild it based on the new config. I like to call this “shift-left config management”, as it takes all the config management “gunk” and stuffs it into the server during the deployment process instead of while the server is in production. And if you’re able to manage your infra in an immutable way, it’s probably worth doing so: You get the benefits of testability and the reduced risk that comes with never making changes to a production system (because change = risk).

However, there’s a good chance that you won’t be able to use this methodology for 100% of your infra. Some nodes are simply too stateful to allow a rapid provision/run/destroy scheme – database servers, Elasticsearch nodes, or even stateful “runner-type” applications. Because of this (and because many of our apps are still way too stateful to go the “immutable infra” route), I wanted to see if we could build a modern way of managing stateful infrastructure.

Our first attempt was to add a layer of logic “on top of” our config management system (Ansible) that would coordinate jobs in such a way that we never brought down more than an acceptable percentage of the servers running a service. We experimented with a separate DynamoDB-based database where all servers were stored, and using tags we would try to “map out” the right order of invoking a config management run (a “config management run” in this context is simply running ansible-playbook against a server or group of servers). I think we could have made this model work, but it was getting complicated, with lots of logic around grouping servers into “chunks” that could be updated together without risk, and so on (I wrote about this a while back).

Lately, we’ve taken a step back and tried to look at the problem with fresh eyes. This “refresh” has been fueled in part by the fact that we’re using Consul more and more – not so much for service discovery as for service coordination – especially since we started using the Traefik load balancer in front of some of our external APIs. Anyway: it struck me that we don’t need a complex central component to coordinate config management jobs, because nodes are perfectly capable of doing that themselves, using Consul as a “shared source of truth”.

In this example, 3 servers offer the same service (let’s pretend it’s a public-facing API). Using Consul services, each server (potentially) knows about the other servers offering the same service, and can optionally look up their health state using the Consul API. So, if server1 and server3 both have problems, server2 would be very stupid to start reconfiguring itself based on an Ansible playbook.

So here’s what we built:

  • AnsibleJobService: This is a rest interface that takes a “job request”. A job request could be “run this playbook against all web servers”. It then uses multiple sources (such as the aws ec2 api) to figure out which servers should get updated. Each server gets targeted by a separate (and ephemeral) Ansible “job container” – these are provisioned using the Kubernetes api, and coordinated using SQS. For the diagram above, 3 pods would be started, each targeting a single server.
  • AnsibleConsul: This is a fairly simple Ansible module which simply returns true or false based on whether or not the server is okay to be taken offline, and initiates “maintenance mode” on the local server if true. It looks at the Consul services offered by the local server and makes a decision based on the state of itself and the other servers offering the same service. Our Ansible playbook will simply retry this in a loop until it succeeds.
  • AnsibleJobServiceFrontend: A fairly simple Aurelia app that allows some rudimentary job control and visualization of the jobs we kick off.


When a job is kicked off, all servers get the job in parallel (or close to it – we introduce a little bit of randomness just to err on the side of caution). The first node able to “maintmode” itself goes through its entire config run, while the other servers simply loop-wait for it. The last task in our Ansible playbooks is a step that disables “maintmode”, which signals that the current server is back in production and the next one can be processed. There are also parameters that can be set to control how many “offline” nodes it’s okay to have.
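The decision AnsibleConsul makes can be sketched as a small function over the health data you’d get back from Consul’s health API (a hedged sketch – the function and field names are my own, not the actual module):

```python
def safe_to_enter_maintenance(service_nodes, my_node, max_offline=1):
    """Decide whether my_node may take itself offline.

    service_nodes: a reduced view of Consul's /v1/health/service/<name>
    output, as [{"node": ..., "healthy": bool}, ...].
    max_offline: how many nodes of the service may be down at once.
    """
    # Count peers (other than us) that are already down or in maintenance.
    offline_peers = sum(
        1 for n in service_nodes
        if n["node"] != my_node and not n["healthy"]
    )
    # Taking ourselves offline must keep us within the budget.
    return offline_peers + 1 <= max_offline
```

The playbook’s retry loop just calls something like this until it returns true, then flips the node into Consul maintenance mode via the local agent.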

If multiple jobs are started that target the same server, AnsibleJobService will make sure that only a single job is released at a time against it.

This picture shows the “live view” of a “multi-node” job. As you can see, 2 of the 3 servers are already done with their stuff, and the last one has kicked off, performing some IIS config changes.


I’m super-happy with all of this, because it reduces the risk we used to have when invoking config management runs. With this new solution, anyone can trigger a job without the fear of taking services down – which in turn allows us to move faster and operate our systems without expert knowledge. A developer can simply PR a change, merge it, and then invoke the change against production – all in the time frame of a single coffee cup.

Infrastructure service discovery in a Windows-based, non-containerized world

I’ve spent so much time in the world of Kubernetes lately, that I wanted to write about something completely different – even old-school by some standard.

Data is an important part of what we do at my employer, and Elasticsearch is one of our main tools to figure out what our systems are doing. We’ve been running an old (but well-working!) version of Elasticsearch for about 18 months, and upgrading it has been on the to-do list for a long time – not just Elasticsearch but our entire logging pipeline, which includes Filebeat (the lightweight agent that runs on all nodes), Logstash (Log processing/parsing), Elasticsearch itself (storage) and Kibana (visualization).

Our old FELK stack (get it?) was running in Azure, and very “static”. Filebeat agents were configured using Ansible, Logstash servers were “static” nodes, and our Elasticsearch cluster was a mix of “hot” and “cold” nodes. We transfer older data onto “cold” (cheaper) nodes where searches are slower, but it saves us a bunch of money. We have a custom process that runs in a combination of Lambda and Flask to move data from hot to cold nodes, and to make sure data is backed up and optimized/defragmented. All of this has been working really well – I can’t remember us having had a single blip of downtime for ages.

But: Things are brewing at Elastic, and we didn’t want to be left behind version-wise, so it was time to upgrade our stack. This allowed us to take a look at how we do things, and possibly replace our fairly static infrastructure with something more dynamic.

It’s worth mentioning that we are now also running Consul on all nodes – something we didn’t do at the time we deployed our “V1” FELK stack. And while the old stack runs happily behind Nginx proxies, we’ve had awesome results using the Traefik load balancer in our Kubernetes clusters, and I figured it was time to put it to use outside of Kubernetes as well to see how it worked.

The idea behind service discovery is that instead of configuring a client with a static address for whatever it needs to reach, you provide it with an “alias” which comes from Consul or whatever service discovery system you’ve put in place. Nodes offering a service simply tell Consul about it, and since Consul is a lightning-fast replicated system, every single node in your network has a full “view” of which nodes offer which services. For a lot of systems this is kinda redundant. If you run everything inside Kubernetes you already have a robust service discovery system built in. For http services it might be easier to just pipe all traffic through an L7 load balancer and let that route the traffic to where it needs to go.

However, Logstash especially is a bit wonky. Filebeat and Logstash don’t communicate over http; they use long-living tcp connections. Because of this, we’ve always found it easier to just have Filebeat talk directly to Logstash without anything in between. This is fine for a static infra: configure your filebeat.yml with “” and you’re done. However, we wanted to see if we could build a more dynamic infra – especially now that we’re moving to AWS, where we’re able to use Auto Scaling groups to dynamically scale our stuff up and down, and to make sure we’re running on relatively short-lived nodes so that we don’t have to deal with patching and upgrades.
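To make the “alias” idea concrete, here’s a toy model of what the discovery layer does – purely illustrative (the catalog data and the resolve function are made up, not Consul’s API):

```python
import random

# Toy view of what Consul's DNS interface does for a name like
# "logstash.service.consul": map a service alias to healthy addresses.
catalog = {
    "logstash": [
        {"addr": "10.0.1.10:5044", "healthy": True},
        {"addr": "10.0.1.11:5044", "healthy": True},
        {"addr": "10.0.1.12:5044", "healthy": False},  # failed its check
    ],
}

def resolve(service):
    """Return one currently-healthy address for the service alias."""
    healthy = [n["addr"] for n in catalog[service] if n["healthy"]]
    if not healthy:
        raise LookupError(f"no healthy nodes for {service}")
    return random.choice(healthy)
```

The point being: the client only ever knows the alias; which concrete node answers can change from lookup to lookup as nodes come and go.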

Another cool thing about Consul is that it supports so-called “checks” – both on the service-offering side and on the service-consuming side. On the server side you can use checks to make sure that only healthy nodes register themselves as service “offerers” (or “servers”, as some call it 🙂 ). On the client side you can have Consul perform some action when there’s a change in Consul – for example if a new service node comes online or one leaves. Newer versions of Logstash have a built-in health endpoint which is perfect for this. So, instead of “assuming” that a service really works, you can make sure that the server continuously “proves” that it is capable of serving. We currently run a bunch of Logstash processes on the same servers, but this will also allow us to break things apart without thinking twice about it – because clients relate to the service and not the host offering the service. Very nice.

On the client side, we’re experimenting a bit with different models. Filebeat will, as I wrote, keep an open tcp connection to its configured Logstash, and if it loses that connection it will simply try again. One thing we’ve found is that Filebeat will not refresh its DNS before retrying, so if it loses its connection to a deleted Logstash server, it needs a “kick” to get back into shape. Consul watches fix this elegantly – we’ve simply configured consul watches to restart the Filebeat agent if there’s a change in the service used by the agent. Filebeat is pretty lightweight and checkpoints its current status, so we’re not very worried about those restarting now and then. This means that if a Logstash node leaves or comes online, Filebeat will restart – which implicitly causes it to reevaluate which Logstash node to talk to. We still need to figure out how to get notified if a Filebeat agent keeps failing.
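The watch-and-restart behavior boils down to very little logic, sketched here for illustration (a real Consul watch just runs a handler command for you; the function and endpoint sets are my own stand-ins):

```python
def needs_restart(previous, current):
    """previous/current: sets of healthy Logstash endpoints for the
    service, as consecutive Consul watch invocations would see them."""
    return previous != current

# A node leaving the service membership triggers a Filebeat restart,
# which in turn forces a fresh DNS lookup when it reconnects.
before = {"10.0.1.10:5044", "10.0.1.11:5044"}
after = {"10.0.1.10:5044"}
restart = needs_restart(before, after)  # True here
```

In our setup the “restart” action is just the handler script the watch invokes (e.g. a service restart of the Filebeat agent).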

We’ll also use Consul for Elasticsearch, but these nodes won’t be ephemeral. There’s too much data to copy back and forth for that to make any sense. However, Elasticsearch is essentially a rest endpoint, which we’ll put behind Traefik load balancers. Here we’ll use Consul to drive Traefik’s configuration so that only nodes that are alive and working get traffic. In normal operations this will be all “hot” nodes, but during maintenance windows we’ll simply be able to bring down nodes one by one without worrying about clients not hitting a working endpoint. And if we decide to set up dedicated “coordinator” nodes in Elasticsearch, then we’ll be able to do that without any manual change anywhere.

So that’s what we plan to do. We have a metric ton of data in our old cluster, and we’ve decided to copy data across instead of simply upgrading our old cluster. This allows us some much-needed index cleanup and reorganization so although it’s a bit of a painful process I think it will be worth it in the end.

So there. I wanted to write this down because I’m having a ton of fun working on this stuff, and even though this isn’t fancy new tech like Kubernetes, it’s still super-rewarding to revisit an existing design and find tons of ways to improve upon something that already works really well for us. I’m super-stoked about running our new stack in AWS, and being able to drive it all using Ansible/Cloudformation to construct a truly dynamic logging infrastructure makes this a really interesting project. Tons of fun!

London Calling! Conferences coming up

Happy to report that I will be giving multiple sessions around Ansible and Windows in September.

The first one is at WinOps, where I’ll be talking about our learnings from implementing Ansible during the last 12-18 months. I’ll touch on some technical things, but I will also be discussing the “soft” side of taking a “manual” organization through a journey of automation and devops.

If you want to learn more about how Ansible and Windows/Powershell actually plug together, I’ll be doing a talk about that at PSday – this session will get into the nuts and bolts of how to write Ansible modules and roles from a Windows/Powershell perspective, and also give a bit of an intro to why one would consider using Ansible instead of just “native” DSC.

So, if you’re in London, managing Windows platforms and want to learn more about Ansible, come join!!

There’s also rumors that the UK has beer, so that’s good.

I just replaced Windows 10 with Linux on my main laptop, and here’s why

As I’m writing this my computer is downloading some new bits – not from Windows update but from apt (wherever that may be). I just installed Linux on my main laptop, the one I’m using 8-15 hours a day to get stuff done and to learn things.

It’s funny, coming from a very Microsoft-centric background, to find myself at the point where I realized that I wasn’t tied to Windows as the OS of choice anymore. I spend my days in Chrome, VSCode and Pycharm. That’s pretty much it. I used to rely on a Powershell-based terminal, but over the last year I’ve found that a bash-based one (using Conemu/Clink) was better for me.

The main reason(s) I’m switching? Docker. Docker and Ansible.

To dig deeper: Here are my requirements for a well-working setup:
1. Not having to run a Linux vm in Hyper-V. Don’t get me wrong, Hyper-V is an awesome hypervisor. As a workstation hypervisor though, it has some serious shortcomings, which are especially painful when running Linux guests inside it. Stuff like storage mapping, proper graphics drivers for high-resolution screens and robust (nic-independent) networking. Honestly I don’t understand why Microsoft hasn’t moved Hyper-V out of “mmc hell” and given it a proper gui / workstation functionality.
2. To be able to run Linux-based containers both natively and through Windows. In short: Our CI process uses Linux VMs, but most of our devs are on Windows. When coming up with tooling to support “dockerized dev workflows” I need to be able to test/verify scripts etc. both on native Linux and on Docker where the “Docker client” is running on Windows.
3. To be able to test both “Docker for Windows” and “Docker toolbox”: Because of point 1 above I’m hesitant to demand that devs run Hyper-V on their workstations. Imho Docker toolbox with Virtualbox is a way better option for a good Docker experience on Windows. However, the Docker tooling in Visual Studio is for some reason tied to “Docker for Windows” – the suckiest of the options (at least right now).

All in all, I need flexibility in testing various combinations of Docker on Windows and natively. As far as I can see, the best way to get that flexibility is to run Linux on my computer and use “nested virtualization” inside Qemu/KVM (Qemu 2.7 and up supports Hyper-V guests).

Apart from the Docker thing, I notice that I’m spending less and less time in what used to be my “main” tools – the Powershell console and the Powershell ISE. Most of my “automation work” is done using Ansible, which day-to-day means editing a bunch of yaml files (Ansible uses yaml-based configs) which I can do in VS Code regardless of OS. Running natively on Linux also gives me a better testing experience when developing Ansible stuff, although I’ve had great success with Bash on Windows for the last year as well.

So: It’s still early, but I feel good about this. I have my main editors (Pycharm and VS Code) and both are working well. I can do whatever I want as far as VMs go using Qemu/KVM, and virt-manager gives qemu noobs like me a familiar interface to manage vms. I can write .Net core apps, and even run Powershell 6.0. I have Spotify and Chrome. So far, there’s not really anything I can think of that I’ll miss, except for the “full” Visual Studio – which I’m not spending too much time in anyways.

Gory details about my current setup:
OS: Ubuntu 17.04 (Qemu 2.7 comes bundled with this version of Ubuntu, which is why I chose it)
Main terminal: terminator (updated to 1.9 to get rid of some bugs)
IDE’s: Pycharm / VSCode
RDP client: myrdp (
Frameworks/Platforms: Python 2.7/3.5, .Net core 1.0.4, Powershell 6.0 beta


Getting just the right amount of concurrency in Ansible

This is something I’ve been working on for a while, and as I figure others might be struggling with it as well, I thought I’d share:

Configuration management is this incredibly powerful thing where servers “take care of themselves” – at least that’s the idea. Real life is often far messier. Automated configuration management also means that changes might occur on several of your servers at the same time, and that might not always be a good thing. We’re a (relatively) small shop without the luxury/burden of having our services spread out across thousands of servers – a lot of our stuff runs on as little as 2 vms. Most of our stack is still on .Net classic, so for the moment we’re not able to run it in Kubernetes either. This means we had to figure out a way to efficiently manage our vms in such a way that we would never bring down 2 servers running the same thing at the same time, while still automating their config management as much as possible. “Bring down” in this context could be as little as changing a setting inside IIS, as some of those changes may cause the AppPool running our services to restart. Ultimately, we want to get to a point where our config management tool is free to perform reboots or whatever it needs to, all while the service itself is still up and running as far as customers are concerned.

As you may know, we’ve been running Ansible for more than a year now, and we’ve changed how we execute ansible against our vms a lot during the last year. This blog post explains some of the thinking we’ve done about this, and what we’ve come up with (for now).

Even though we’re a fairly small shop we did something smart from the get-go: We organized our servers into a structure we call servicefamilies/servicegroups. These two attributes get tagged on a server wherever it is – in Datadog, Elasticsearch, AWS and inside the vm using envvars. In short, we stick those attributes wherever we can. They are used for a host of different things, but in this context the important thing is that they indicate a relationship between servers: If two servers share the same family and group, they do the same job – which means that we typically shouldn’t incur downtime on more than one vm inside a family/group combination at the same time. As we grow we may adjust this to something like max 10% instead, but for now it’s fine since our numbers are small. So, if server1 is going through a reboot, server2 better be up. Simple enough.
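The family/group relationship translates into a simple grouping step (a sketch – the field names mirror our servicefamily/servicegroup tags, but the function itself is illustrative):

```python
from collections import defaultdict

def group_servers(servers):
    """Bucket servers by (servicefamily, servicegroup); each bucket
    must then be processed serially, one server at a time."""
    groups = defaultdict(list)
    for s in servers:
        groups[(s["family"], s["group"])].append(s["name"])
    return dict(groups)

# Hypothetical inventory: two API frontends that share a group,
# plus one database server in its own group.
servers = [
    {"name": "web1", "family": "frontend", "group": "api"},
    {"name": "web2", "family": "frontend", "group": "api"},
    {"name": "db1", "family": "data", "group": "sql"},
]
```

Here web1 and web2 end up in the same bucket, so only one of them may be touched at a time, while db1 can be processed in parallel with them.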

It’s probably also worth mentioning that Ansible has some built-in smarts to take care of rolling updates, as it has the ability to perform serial execution on playbooks. The reason we’re not using those options is that I want to make it as easy as possible for devs to write playbooks – I simply don’t want them to have to concern themselves with the right amount of serial-ness inside playbooks they write.

We’ve been executing Ansible inside Docker for a little while already, and the plan all along was to end up in a place where we could use this to our advantage. A playbook is kicked off not by running ansible-playbook locally on a workstation or by ssh-ing into a server, but simply by sending a job request to a custom rest api we built. This job looks at the requested playbook and the family/group of the servers to run it against, and performs some smart grouping of them according to their family/group membership. So, a single “job request” is translated into a single “ansible run” for every server that gets “hit” by that playbook. This info gets spread into a set of sqs queues, where a number of messagehandlers (also running in containers) will pick them up one by one. This is essentially how we ensure serialness. So, if 4 servers get “hit” by a job, and they belong to two different family/groups, we will send jobs to 2 queues, each containing the jobs for the “grouped” servers. Something like this:

So, the magic here is that each “queue handler” runs in parallel with the other handlers, and each family/group’s servers get handled by a single handler at any given time. We can scale the number of queues up or down simply by adjusting a config file, and the “job controller” will make sure that an SQS queue and a messagehandler container exist for every requested queue. We also use the “Approximate number of messages” attribute of each queue to make sure we always send jobs to the least busy queue, ensuring that jobs get sent through the pipe as quickly as possible. Each job sent through the queue will cause the messagehandler to invoke an Ansible job targeting a single server (using the --limit parameter) and wait until it’s either done or failed – that’s how we ensure that each group gets executed in a serial fashion.
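The “least busy queue” selection is the simplest part: given the approximate message counts fetched from SQS, pick the queue with the fewest messages in flight (a sketch – fetching the counts via the SQS GetQueueAttributes call is left out):

```python
def least_busy_queue(queue_depths):
    """queue_depths: {queue_name: ApproximateNumberOfMessages},
    as reported by SQS. Return the queue with the fewest messages,
    so new family/group job batches land on the shortest line."""
    return min(queue_depths, key=queue_depths.get)
```

Since the counts are only approximate, this can occasionally pick a slightly busier queue – which is fine, because each handler still processes its queue serially either way.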

Each job also posts a truckload of status back to a DynamoDB table that keeps track of all jobs in flight, and the entire output log gets tagged with the job guid and uploaded to s3. This allows us to do stuff like “if the first server failed executing its job, cancel all the other corresponding jobs for the same group/family”. There are tons of fancy add-ons we can build around this as we progress.

It’s pretty cool to see a single request getting translated into a bunch of jobs, spread across all available queues and executed as quickly as possible. As we gain experience and confidence in this way of running Ansible, we can start performing reboots and other potentially “dangerous” actions in the middle of a playbook, knowing that this system will ensure that we’ll never bring down servers in such a way that it causes problems with the services we provide our customers.

Automating SQL Server credential rotation using Hashicorp Vault

When looking for a secret management solution last year we actually decided not to go with Vault, for a couple of different reasons – the most important being that we needed something lightweight and simple to set up, as we (I) had so much going on that we (I) just didn’t have the time to dig deep into Vault and all its intricacies at the time. It seems that Vault has a certain gravitational pull though, and more and more products integrate with it – tools like Rancher, Concourse and a bunch of others simply offload the meat of their credential management story to Vault. Which means that it was about time for me to have another look at it.

As you may know, I work at a “mostly .Net” shop with tons of SQL databases across multiple environments, and as we’re slowly starting to embrace cloudy things, it’s probably a pretty bad bet to just assume that apps will authenticate to SQL server using Kerberos for all eternity. Which means that we need to come up with some other way® to make sure our apps can access the DBs they need, and only those.

What I like about Vault is that it takes “credentials management” one step further than many other options – instead of storing “static” credentials in a (hopefully secure) database, it will generate a time-limited credential on the fly and automatically expire it (delete it) when the TTL has passed. That’s not to say you can’t use Vault as an “old-school” password management system – it supports that too. But if you happen to use a system Vault supports for its “on-the-fly” credentials feature you can use that to get much tighter security than your “static passwords” solution will provide.

Here’s how I envision a typical workflow (mind you, this is my own rambling, not taken from Vault’s documentation):

1. A VM is provisioned because we want to scale out/replace our API layer. Config management knows which app(s) this server will run, and requests an AppRole from vault with the correct permissions.
2. The app starts up, and using its AppRole credentials from step one, requests a credential from Vault which will give it access to Database1.
3. Vault checks that the VM’s AppRole (I got tired of writing in italics) has the necessary permissions (controlled through Policies) to request a db credential, which it has. Vault then talks to the SQL Server, which creates a Login, and Vault forwards the credentials to the VM.
4. If using TTLs, the VM needs to “know” when the TTL for its SQL credential is about to pass, and request another one before the first expires – then step 3 happens all over.
5. When the (first) SQL Login credential expires, vault again talks to SQL Server and asks it to remove that Login. And so it goes.
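Step 4 above is the only part the app itself has to implement. A common approach (my own rule of thumb, not something Vault prescribes) is to schedule the renewal well before the lease runs out – for instance halfway through:

```python
def next_renewal(lease_duration_seconds, fraction=0.5):
    """Seconds to wait before requesting a fresh credential.

    Renewing at half the lease leaves plenty of headroom before
    Vault deletes the old SQL Login out from under the app.
    """
    return max(1, int(lease_duration_seconds * fraction))
```

With the 1-minute TTLs used in the lab below, the app would request a new credential every 30 seconds and switch over before the old Login disappears.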

There are a couple of points worth mentioning about this approach:

  • There’s no single “I have all the access” credential that gives access to everything. Those are nasty.
  • Credential rotation is not a painful process you need to endure once every xx days/weeks/months; it’s something that happens automatically, as often as you want.
  • You can set up Vault to provide you with a full audit trail of all activities including applications rotating their credentials.
  • You can isolate as tightly as you want. For example, it’s fairly easy to configure policies so that only servers in production get to request prod DB credentials. I’m using SQL Server Logins here, but be sure to check out Vault’s support for dynamic credentials for stuff like RabbitMQ, AWS IAM users, and many others.

While diving into this I set up a small lab using Docker and Microsoft’s SQL Server for Linux image, so that I could test things without the hassle of standing up a full SQL Server instance. The rest of this post is a walkthrough of that lab, which you can find the code for here:

I’m assuming you have some way of running Docker on your computer. I’m using Docker machine with Virtualbox on my Windows 10 laptop. Before firing up the thing, make sure you adjust the mapping for the vault container so that it lines up with your local path:

- /c/Users/trond/Documents/projects/vault-dev:/vaultdev

In order to get up and running, do the following:

#Start the SQL Server container
docker-compose.exe up -d sqlserver
#Create a DB called "testdb"
docker exec sqlserver /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P MyPassword123 -Q "CREATE database testdb"
#Start the vault container
docker-compose.exe up -d vault

#Grab the output from the vault container - you'll find the "root token" here
docker logs vault

At this point, we have a running vault server, running in dev mode – which is nice for labbing but not for real production usage. To interact with the vault server I just create a second console session and use that to attach an interactive bash session with the vault container:

#enter an interactive session to the vault container
docker exec -it vault /bin/bash
#log in to vault using the obtained root token (replace with your own token)
vault auth 0c7d8a21-0d41-ecf0-b779-27aa8b1d8a67

At this point, we’re ready to do stuff with Vault. We’ll create a SQL Server secrets backend, an AppRole auth backend, and the policies and things we need to get this thing working. All of this can also be found in the “” script in the lab git repo, but it’s probably better to paste stuff in line by line.

First, we’ll create a “regular” admin user so that we don’t have to use the “root token” anymore. The “admins.hcl” policy enables access to everything, but you can lock this down as you move along.

#create a userpass auth backend:
vault auth-enable userpass
vault write auth/userpass/users/admin password=admin policies=admins
vault policy-write admins /vaultdev/data/policies/admins.hcl
#Switch to the regular "admin" login:
vault auth -method=userpass username=admin password=admin

At this point, you’re logged in to Vault with your “admin” user. Continue setting up the thing, starting with the SQL Server integration which will allow Vault to auto-provision SQL Server Logins.

#activate the database secret backend, and
#create the mssql 'connection'
vault mount database
vault write database/config/mssql plugin_name=mssql-database-plugin connection_url='sqlserver://sa:[email protected]:1433' allowed_roles="testdb_fullaccess,testdb_readaccess"

#create the role for db read-only access
vault write database/roles/testdb_readaccess db_name=mssql creation_statements="USE [master]; CREATE LOGIN [{{name}}] WITH PASSWORD='{{password}}', DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF;USE [testdb];CREATE USER [{{name}}] FOR LOGIN [{{name}}];ALTER ROLE [db_datareader] ADD MEMBER [{{name}}];" default_ttl="1m" max_ttl="5m"

#create the role for db full access
vault write database/roles/testdb_fullaccess db_name=mssql creation_statements="USE [master]; CREATE LOGIN [{{name}}] WITH PASSWORD='{{password}}', DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF;USE [testdb];CREATE USER [{{name}}] FOR LOGIN [{{name}}];ALTER ROLE [db_owner] ADD MEMBER [{{name}}];" default_ttl="1m" max_ttl="5m"

#create a policy for db read-only access. Note that we're not creating one for full access
vault policy-write testdb_readaccess /vaultdev/data/policies/testdb_readaccess.hcl

As you can see from the above code, the integration works by providing a connection string to the server (my “connection” is named mssql, but in a real-world scenario I’d probably use the name of the server/cluster). Each “role” essentially contains the SQL script that Vault will execute against the SQL Server, after replacing the templated values for the Login name and password (which will be auto-generated). We’ll see this in practice in a few secs. Also note that I’ve set the default ttl to a very low value (1m). Vault will simply delete the SQL Login after the ttl has passed.
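To make the templating concrete, here’s roughly the substitution Vault performs before shipping the creation statement to SQL Server (illustrative only – the generated name and password here are made up; Vault produces its own random values):

```python
# The creation_statements from the Vault role, shortened for readability
creation_statement = (
    "USE [master]; CREATE LOGIN [{{name}}] WITH PASSWORD='{{password}}'; "
    "USE [testdb]; CREATE USER [{{name}}] FOR LOGIN [{{name}}];"
)

def render(statement, name, password):
    """Substitute the templated values the way Vault does before
    executing the statement against the SQL Server."""
    return statement.replace("{{name}}", name).replace("{{password}}", password)

# Hypothetical generated values
sql = render(creation_statement, "v-token-testdb-abc123", "A1generated!")
```

The Logins you’ll see appear in SQL Server carry generated names along these lines, which is also how Vault knows what to delete once the ttl passes.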

Users can request a credential by using the “database” backend created above by specifying one of the two roles we created in that backend. Have a look at the contents of the “testdb_readaccess.hcl” policy to see an example of a policy controlling access to the backend/role.

We can now test the sql connection by requesting a few credentials:

#test the thing
vault read database/creds/testdb_readaccess
vault read database/creds/testdb_fullaccess

If you hook up SQL Server Management Studio to port 1433 of the SQL Server container (I’m using ip address by default, but your mileage on that will probably vary) you’ll see two Logins created, and if you wait 60 secs you’ll also see that Vault removes them automatically.

Now it’s time to create an auth backend that our app can use:

#Enable the approle auth backend:
vault auth-enable approle
vault write auth/approle/role/testdb_readaccess role_id=test policies=testdb_readaccess secret_id_ttl=0 token_num_uses=0

#Get a secretId - this is what your CM tool will somehow inject into your vm:
vault write -f auth/approle/role/testdb_readaccess/secret-id

I have to admit I’ve struggled a bit with Vault’s documentation. I think I would have laid out its structure very differently if it was a system I owned myself, so it takes some getting used to. It’s probably safe to say that the juice is worth the squeeze tho.

So, just to recap what we’ve done so far:

  • We’ve created a secrets backend based on the “database” type, using the mssql plugin
  • We’ve created an auth backend for regular users (where our admin user lives), and one other auth backend of the AppRole type.
  • Inside the “database” backend we’ve created 2 roles (testdb_readaccess and testdb_fullaccess). In Vault lingo a credential is created against a role. Think of roles as containers inside a secrets backend; the credentials you request get created inside that role.
  • We’ve created an ACL Policy called “testdb_readaccess”. ACL Policies are not tied to a certain auth or secrets backend; they control overall access in the Vault system. Since Vault is very REST-friendly with everything being a path, policies work by allowing different types of access to one or multiple paths.
  • Inside the approle auth backend, we created another role called testdb_readaccess, attached to the policy created above. Any AppRole created against (inside) that role will have the policy applied, and thus gets access to read a credential from the “testdb_readaccess” role inside our database secrets backend.


Before starting up our very real-life app there are a few more things about AppRoles probably worth mentioning: The idea (in my mind at least) is that you create an AppRole role (eh) per vm (or container) you spin up. For us it will probably be part of the provisioning/bootstrapping Ansible will do for us whenever we provision a vm in AWS. The actual “credential” the vm needs to “use” an AppRole is the AppRole role id (“test” in my example, but by default this is a guid-type string) and a secret id. It’s also possible to limit the use of the AppRole based on CIDR. This means that if your vm keeps a static ip address through its lifecycle, you can increase security by specifying that only vms with a certain ip address are allowed to use the credential, which is pretty awesome.

Okay, so let’s say you inject the role id and secret id as environment variables or something – now you need to decide whether to make your apps completely oblivious to Vault or to write some kind of integration. The latter is generally preferred from a security standpoint, as it allows short ttls on secrets – your app will simply refresh them itself. You could also write an “agent” which takes care of this and makes sure a “known config” file is kept up to date with the right database credentials – in that case you’d have to make sure that your app reads the updated config file. For IIS-hosted apps, any change to web.config triggers a recycle of the application pool running the app, so the problem shouldn’t be hard to solve there.

Just to test stuff out in a container I wrote a small python app which runs in a loop and requests a new database credential whenever its current one reaches 50% of its ttl (so, 30 secs in our example).
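The refresh trigger itself is trivial – here’s a sketch of the logic (the `should_refresh` helper is hypothetical, not the actual app code):

```python
def should_refresh(issued_at: float, ttl: float, now: float,
                   threshold: float = 0.5) -> bool:
    """Return True once the credential has lived past `threshold` of its ttl."""
    return (now - issued_at) >= ttl * threshold

# With a 60 second ttl and the default 50% threshold, the refresh
# kicks in at the 30 second mark:
assert not should_refresh(issued_at=0, ttl=60, now=29)
assert should_refresh(issued_at=0, ttl=60, now=30)
```

The loop then just sleeps, checks this condition, and re-reads database/creds/testdb_readaccess from Vault when it fires.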

Before starting it you need to set the secrets_id environment variable in our docker-compose.yml file for the lab, based on the last output (the “secret_id” field) from the scripts above.

Run the “app” container by issuing the following:

docker-compose up -d app

And stream its logs:

docker logs -f app

You should see something like:

number of responses from sql server: 6
number of responses from sql server: 6
number of responses from sql server: 6
getting/updating sql server credentials
number of responses from sql server: 6
number of responses from sql server: 6
number of responses from sql server: 6
number of responses from sql server: 6

The python app simply requests a new sql credential from vault after 30 seconds (50% of the ttl), so if you open SQL Server Management Studio you should see the Logins get removed as time passes.

Wrapping up, there’s a couple of things to note here:

  • The AppRole uses its role-id + secret-id to get a token. This token is what it uses to authenticate to Vault when requesting a SQL Server login. The token has a 20 minute ttl, which means that the example app will stop working after 20 minutes, as I haven’t coded any token refresh logic.
  • By default, an AppRole has some very strict TTLs, and re-using a secret id is not allowed. I’m not sure I quite understand how Hashicorp intends this to be used in practice, as a vm has to be able to re-request credentials after reboots and such without “help” from config management or other tools (or maybe I’m supposed to use refresh tokens for this, idk). My code simply sets very generous ttl/id-reuse values. You should probably investigate this before starting to use AppRoles; at least I will.
  • I’ve added an “env” file that will allow you to debug the python app in VSCode after replacing the “secret_id” with your own, if you prefer that instead of using the “app” container.

In any case, this is a very simple walkthru of one of the many capabilities of Vault. Hopefully you’ll find it interesting enough to learn more about the product.


Why we started running Ansible inside Docker containers

We’ve been using Ansible for about a year now to manage our (mostly Windows) vms and cloud resources, and it’s been mostly a success story. That’s not to say that config management and automation is easy – the time it takes to “reverse-engineer” a complex infrastructure that’s been manually (mis)treated for years is not to be underestimated.

Our Ansible codebase has grown exponentially in size and complexity, containing a multitude of roles, custom modules, callback/lookup plugins and regular playbooks. I realized a while ago that we would need to come up with “something” in order to alleviate some very obvious pain points, which is why I started looking at Docker.

Just a quick background on our stuff:
Our Ansible code (scripts) is split into 4 separate repos: playbooks, roles, modules+plugins, and cloud configuration playbooks. These get “built” whenever one of them changes, and published to our internal nuget server. From there Octopus Deploy lays them down on disk in the right folders on our “Ansible server”. (The “Ansible server” is simply a linux vm where the required stuff is installed).

This has caused a few problems lately:
1. Testing is hard. Sure, I can do some smoketests locally, but we don’t currently have a good way of doing efficient testing
2. No blue/green releases: Since we have a single “Ansible server”, any version deployed there would affect all environments (dev, staging, prod, etc). This is increasingly risky
3. It requires some specific knowledge to kick off an Ansible playbook, either using ssh or flansible.

So, that’s what we set out to fix. We’re already running Rancher for a truckload of our “utility” services so I started building a docker image pipeline containing the “base install” of Ansible. The actual Ansible version is configurable, so that we have the option to run our playbooks using both released and development versions of Ansible. The plan is to do nightly or weekly builds of these, so that we’ll be able to stay up-to-date against the Ansible devel branch.

Most of the “smarts” happens when the container starts, not at build time. This is important as it lets us inject options at run-time. When starting up, the container gets fed info about:

  • the playbook path to run
  • the environment it’s in (this is used to download some environment-specific config stored in s3)
  • the versions of the 4 “ansible repos” to run (these can also be set to “latest”, “release” or “prerelease”)
  • The environment to run against (dev/test etc)

Side note: The “pull ansible repos” part was actually the hardest to implement. Since we’re a “mostly .Net” shop, nuget is the de-facto artifact format. I quickly discovered that nuget simply does not work on linux if used outside of .Net Core tooling, and found myself fighting mono versions, ssl trusts and a thousand other things. Needless to say, we scrapped all that and built our own nuget client in pure python, which is able to pull the right version (or simply the latest) of a package from our nuget server.
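For illustration, the core of such a client can be tiny. This is a sketch, not our actual implementation, and it assumes your server exposes NuGet’s v2 `/package/{id}/{version}` download route:

```python
import io
import zipfile
from urllib.request import urlopen

def package_url(feed: str, name: str, version: str = "") -> str:
    """Build a NuGet v2 download URL; an empty version means 'latest'."""
    return f"{feed.rstrip('/')}/package/{name}/{version}".rstrip("/")

def pull_package(feed: str, name: str, version: str, dest: str) -> None:
    # A .nupkg file is just a zip archive, so unpacking needs no nuget tooling
    data = urlopen(package_url(feed, name, version)).read()
    zipfile.ZipFile(io.BytesIO(data)).extractall(dest)

# The feed URL and package name here are made up for the example:
assert package_url("https://nuget.example.com/api/v2/", "ansible-roles", "1.2.3") \
    == "https://nuget.example.com/api/v2/package/ansible-roles/1.2.3"
```

Authentication, retries and version pinning against the feed’s metadata endpoint are left out, but this is essentially all a “pull the right nupkg and unzip it” client has to do.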

This allows us for example to test new versions of our playbooks/roles/modules against our dev environment, only “releasing” to production when the changes are verified. We can also break up the CI job that currently deploys all 4 ansible repos into separate, smaller jobs, since there’s no need to have these aligned with each other anymore.

We invoke these “Ansible Jobs” (as I call them) using the excellent Rancher api from a separate webservice which serves as the job coordinator. The Ansible docker image contains a callback posting back to the coordinator’s rest api, which allows the job coordinator to track the status and result of each job.

We’re just barely scratching the surface on this, and we plan to build some more automation on top of this new capability – for instance the ability to continuously execute Ansible playbooks against our environments without any intervention. We also plan to implement mutex-like functionality to ensure that we don’t execute multiple jobs in parallel against the same parts of our services.

There’s still lots to do, but I’m super-happy about this new direction we’re taking with Ansible.



Automating Infrastructure Validation using Pester, Ansible and Elasticsearch

I’ve been pretty slow on the whole “Pester movement” lately – I’m simply not writing that many advanced PowerShell functions at the moment. However, Pester can do more than unit/integration testing of PowerShell modules. The PowerShell “Operation Validation Framework” uses Pester to invoke tests for verifying that the running infrastructure is configured as intended.

Unfortunately, OVF is not getting much love from MS, and the newest version of Pester has some changes that break the current OVF version. However, I don’t see any problems running Pester directly, so we simply decided to skip OVF altogether.

This whole “infra testing” project was triggered by the fact that one of our network suppliers notified us about a planned fw upgrade causing downtime and stuff – and we found ourselves needing to perform a large number of tests in a short period of time to be able to verify the fw upgrade, and possibly ask to have the change rolled back if not everything went according to plan.

Here’s the process we designed:

In short, we have all tests in the same (for now) repo. Any change to this repo triggers the usual CI Build/Publish module process.

We created an Ansible role which encapsulates all required activities for executing a test: making sure the nuget feed is configured as a package provider on every node, downloading the newest version of the “infratest” module, and executing the test. We already have a standard format for application logfiles being indexed by filebeat agents running on all servers. As long as an application writes its log in a certain format and places the logs in a certain directory (which the app can look up using envvars), those logs get parsed and stored in Elasticsearch.

Since we have a defined set of tags which is deployed to all servers, we can tag any test to make sure that the test only executes on the relevant nodes.

Here’s an example of a single test for a single node in ES:

The “sourcecontext” field contains the name of the test, and “message” will be true/false depending on success or failure.

Since we already have a custom “inventory” dashboard running (which pulls data from various sources such as Ansible inventory, dns and datadog) we could plug this in easily. Here’s the “server view” of one server, with info about what tests were run the last 24 hours, and whether they failed or not:

There’s also a “global” test list where users can click to get the status of each host.

In time we hope to use this to build a “queue” of potentially suspect nodes, and feed that back in to auto-running Ansible jobs for remediation.



Ansible + Windows: What we’ve learned from 6 months in production

One of the first tasks I started with at my current customer when we began working together back in May was to introduce the notion of configuration management. What I told them:

Obviously it’s not as clear-cut as this – there are a multitude of other things in play regarding choosing a configuration management solution. Still, coming from an environment with mostly manual config/deploy, whichever modern tool you choose will likely give you awesome results.

The customer is mostly a Windows shop, running (for now) on a very traditional stack of Windows 2012R2. Parts of the stack run on NodeJS, but that’s (mostly) outside of my scope. For now.

A couple of parallel initiatives also caused us to ramp up our cloud usage – among others we were in the process of deploying a “production-grade” Elasticsearch cluster for log ingestion, and this turned out to be a great starting point for our Ansible efforts.

Lesson learned: Don’t do all at once. Build configuration piece by piece. Iterate. Repeat.

I simply started at the beginning: What do we need on _all_ our servers to take them from freshly deployed generic vms to a “base config” we could build on? At first it wasn’t much, but later it turned out to be great to have a role where we could stick stuff we wanted configured on every single server. Our first Ansible Role was born. Later, this is where we would stick things like monitoring agent install and base config, generic environment variables and the like. We also built a special “AzureVM” role which is only run on freshly deployed Azure servers (we have some other providers too), and which configures data disk setups and similar azure-specific things.

Deploying Elasticsearch on Windows using Ansible

Turns out Elasticsearch is surprisingly easy to deploy: Java, env vars, and unzip a few packages. We stuck the Elasticsearch config in a template where the names of all nodes get auto-generated from Ansible’s inventory. Worked well. For Logstash (we have a looooot of logstash filters) we decided to create a role, since it was more complex. I noticed that the builtin “windows firewall” module for Ansible wasn’t all that good, so I turned to one of my own projects to generate a DSC-based module for firewall configs instead. Much better.

Lesson learned: Use DSC-based modules where the builtin ones don’t do the trick.

I spent a loooooong time figuring out how to deal with different environments. We created some roles for provisioning Azure stuff – among others we have a “deploy vm” role which performs a truckload of validation and configuration. We built this stuff on top of my “Ansible-Arm” modules, which allow us better control of azure stuff than the builtin modules do. Mind you, this is very much “advanced territory”. Still, using this approach we can fine-tune how every vm comes up, and use Jinja2 templating to construct the json files which form the azure resources we need.

For inventory, we’re running a custom build of my armrest web service. Actually, we’re running 4 instances of it, each pointing to its own environment (dev/test/uat/prod), which gives us 4 urls. In Ansible we have 4 corresponding “environment folders”, so when I point to “dev”, Ansible knows which webservice to talk to in order to grab inventory. Armrest also does some magic manipulation of the resource group names – especially stripping the “prod” or “dev” etc names from each RG, so that we can target stuff in playbooks which will work across environments. An RG called “dev.mywebservers.europe” will get the group “mywebservers.europe” in Ansible’s “dev” inventory, for instance. All fairly easy to do, and all super-flexible.
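The name-stripping logic is as simple as it sounds – a sketch (hypothetical helper name, not armrest’s actual code):

```python
def inventory_group(resource_group: str, env: str) -> str:
    """Strip the leading environment name from a resource group name,
    so the same playbook can target the group in every environment."""
    prefix = env + "."
    if resource_group.startswith(prefix):
        return resource_group[len(prefix):]
    return resource_group

assert inventory_group("dev.mywebservers.europe", "dev") == "mywebservers.europe"
# names without an environment prefix pass through unchanged
assert inventory_group("mywebservers.europe", "dev") == "mywebservers.europe"
```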

Using armrest we also rely heavily on azure’s “tags” feature, as these get translated into hostvars by Ansible. We used this to target playbooks where only a subset of the servers in a RG should perform a specific Ansible task.

For instance, this playbook:

- name: configure logstash nodes
  hosts: all
  tasks:
    - name: separate out logstash nodes
      group_by:
        key: "{{ application }}"
      failed_when: false

- name: Basicconfig
  hosts: elasticsearch
  roles:
    - role: rikstv_basicconfig_azurevm

Gets applied to Azure vms with this tag:

Note that armrest strips away the “ansible__” prefix and presents the rest as hostvars.

So, this allows us to control config using tags and resource groups, which we again can provision using Ansible (during VM deployment, Ansible knows to add the “winrm” tags to Windows vms which we deploy. Linux vms get another set of tags).
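The prefix handling armrest does boils down to something like this sketch (hypothetical function, not the actual armrest code):

```python
def tags_to_hostvars(tags: dict) -> dict:
    """Turn Azure tags with the 'ansible__' prefix into Ansible hostvars,
    dropping the prefix and ignoring everything else."""
    prefix = "ansible__"
    return {k[len(prefix):]: v for k, v in tags.items() if k.startswith(prefix)}

# Only the prefixed tags survive, minus the prefix:
assert tags_to_hostvars({"ansible__winrm": "true", "owner": "ops"}) == {"winrm": "true"}
```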

As for deployment, we have 3 “main” repos: Ansible playbooks, roles, and a separate one for Ansible-based azure deployments. These get bundled up into one “ball” by our CI process (teamcity), and deployed to our “Ansible server” using Octopus Deploy.

Configs are invoked manually, either by ssh-ing to the Ansible node, or by making flansible rest calls.

We need to come up with a more structured way of testing changes. Right now we deploy to “dev” first, and if stuff looks good, we push to prod. Note that our small batch sizes and low rate of change allow us to do this. Bigger environments with more stuff going on will likely require more stringent testing procedures. I’m also super-thankful for WSL in Windows 10, which lets me smoketest stuff before I commit to source control.

A few weeks back we got to test our stack as we essentially redeployed our entire Elasticsearch cluster node by node. Without Ansible there’s no way we would have been able to do that in a couple of hours without downtime. And the vast majority of that time was spent waiting for Elasticsearch to sync indices.

We’ve also used Ansible to push a large number of custom settings for our plethora of IIS nodes – stuff related to logging, IIS hardening, etc. The fact that we now can produce a “ready to go” IIS instance with filebeat indexing logfiles and all other required settings without manual intervention is great, and it already allows us to move a _lot_ faster than what we were used to. I’d still consider us “too manual” and not “cloud-scale”, but we’re slowly getting there, and Ansible has helped us every step of the way.

Lesson learned: Don’t over-engineer. One step at a time. Start today!

Using Ansible as a software inventory db for your Windows nodes

If you read this blog from time to time, you won’t be surprised by the fact that I’m a huge fan of Ansible. If you don’t, well – I am.

Ansible is really good at doing stuff to servers, but you can also use it to collect stuff from servers. There’s a builtin module which gathers a bunch of information about your infrastructure by default, and the cool thing is that it’s very much extensible. And very easy to do. So, since the wind was howling outside today and I had a Jira task regarding some package cleanup waiting for me, I decided to build a neat little software inventory thing using Ansible.

Note that this post requires some prior knowledge of Ansible; I won’t go thru all the things.

The first thing I needed was a PowerShell script which let me collect info about all software on each node. I came up with this little thingy:

Now, in order for Ansible to use that script you need to point Ansible to a folder where the script lives. I decided to do it like this:

This is a playbook solely responsible for gathering various pieces of information. So, as this runs, Ansible will execute the “softwarefacts” script and add it to the list of “known stuff” about the server.

The problem is, by default this info is not persisted anywhere. Ansible has some built-in support for storing facts in Redis, but that’s meant as a way of speeding up the inventory process, not storing the data indefinitely.

So, here’s what I do:
The last part of my “inventory playbook” looks like this:

What I do here is store each node’s info in a temp variable, and then use a template to write that to disk locally on the Ansible control node (the “dumpall.j2” template simply contains “{{ vars_hack | to_json }}”).

Lastly, I have a python script which will dump these files (one per node) into a RethinkDB database, which is the last step executed by Ansible:
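The script is roughly along these lines – a sketch, with the loading part shown as real code and the RethinkDB insert (which needs a running server) left as hedged comments; directory and db/table names are just my example values:

```python
import json
from pathlib import Path

def load_fact_docs(fact_dir: str) -> list:
    """Read the per-node JSON dumps written by the inventory playbook."""
    return [json.loads(p.read_text()) for p in sorted(Path(fact_dir).glob("*.json"))]

# Pushing the docs into RethinkDB with the (now legacy) python driver
# would look roughly like this:
#
#   import rethinkdb as r
#   conn = r.connect("localhost", 28015)
#   r.db("ansible_facts").table("ansible_facts") \
#       .insert(load_fact_docs("/tmp/facts"), conflict="replace").run(conn)
```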

(Word of caution: now that RethinkDB is closing shop, you might be wise to go with another DB engine. The procedure should be roughly the same for any NoSQL db tho.)

Using RethinkDB’s super-nice web-based data explorer I can query the db for the full doc for each of my nodes:

r.db('ansible_facts').table('ansible_facts').filter({"ansible_hostname": "servername"})

After executing the playbook I can query the RethinkDB database for the list of software installed on one of my nodes:

We already have a simple front-end getting data from this database using a super-simple rest api written in Flask, and implementing support for software facts took me about 20 minutes.

Go make something!