Automating Infrastructure Validation using Pester, Ansible and Elasticsearch

I’ve been pretty slow on the whole “Pester movement” lately – I’m simply not writing that many advanced PowerShell functions at the moment. However, Pester can do more than unit/integration testing of PowerShell modules. The PowerShell “Operation Validation Framework” uses Pester to invoke tests for verifying that the running infrastructure is configured as intended.

Unfortunately, OVF is not getting much love from MS, and the newest version of Pester has some changes that break the current OVF version. However, I don’t see any problems running Pester directly, so we simply decided to skip OVF altogether.

This whole “infra testing” project was triggered by one of our network suppliers notifying us about a planned firewall upgrade that would cause downtime. We found ourselves needing to perform a large number of tests in a short period of time in order to verify the upgrade, and possibly ask to have the change rolled back if not everything went according to plan.

Here’s the process we designed:

In short, we have all tests in the same (for now) repo. Any changes to this repo triggers the usual CI Build/Publish module process.

We created an Ansible role which encapsulates all required activities for executing a test: making sure the NuGet feed is configured as a package provider on every node, downloading the newest version of the “infratest” module, and executing the test itself. We already have a standard format for application logfiles being indexed by filebeat agents running on all servers. As long as an application writes its logs in that format and places them in a certain directory (which the app can look up using envvars), those logs will be parsed and stored in Elasticsearch.

Since we have a defined set of tags which is deployed to all servers, we can tag any test to make sure that the test only executes on the relevant nodes.

Here’s an example of a single test for a single node in ES:

The “sourcecontext” field contains the name of the test, and “message” will be true/false depending on success or failure.
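To make that concrete, here’s roughly what a single indexed test result might look like. The “sourcecontext” and “message” fields are the ones described above; the timestamp and host fields are assumptions about the filebeat-parsed format:

```python
import json

# Illustrative shape of one test result as a document in Elasticsearch.
doc = {
    "@timestamp": "2017-02-01T12:00:00Z",   # assumed field
    "host": "web01",                         # assumed field
    "sourcecontext": "Test-SqlConnectivity", # name of the test
    "message": "True",                       # success/failure of the test
}

print(json.dumps(doc, sort_keys=True))
```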

Since we already have a custom “inventory” dashboard running (which pulls data from various sources such as Ansible inventory, dns and datadog) we could plug this in easily. Here’s the “server view” of one server, with info about what tests were run the last 24 hours, and whether they failed or not:

There’s also a “global” test list where users can click to get the status of each host.

In time we hope to use this to build a “queue” of potentially suspect nodes, and feed that back into auto-running Ansible jobs for remediation.



Ansible + Windows: What we’ve learned from 6 months in production

One of the first tasks I started with at my current customer, when we began working together back in May, was to introduce the notion of configuration management. What I told them:

Obviously it’s not as clear-cut as this – there are a multitude of other things in play regarding choosing a configuration management solution. Still, coming from an environment with mostly manual config/deploy, whichever modern tool you choose will likely give you awesome results.

The customer is mostly a Windows shop, running (for now) on a very traditional stack of Windows 2012R2. Parts of the stack run on NodeJS, but that’s (mostly) outside of my scope. For now.

A couple of parallel initiatives also caused us to ramp up our cloud usage – among other things we were in the process of deploying a “production-grade” Elasticsearch cluster for log ingestion, and this turned out to be a great starting point for our Ansible efforts.

Lesson learned: Don’t do it all at once. Build configuration piece by piece. Iterate. Repeat.

I simply started at the beginning: what do we need on _all_ our servers to take them from freshly deployed generic VMs to a “base config” we could build on? At first it wasn’t much, but it later turned out to be great to have a role where we could stick stuff that we wanted configured on every single server. Our first Ansible role was born. Later, this is where we would put things like monitoring agent install and base config, generic environment variables and things like that. We also built a special “AzureVM” role which runs only on freshly deployed Azure servers (we have some other providers too) and configures data disk setups and similar Azure-specific things.

Deploying Elasticsearch on Windows using Ansible

Turns out Elasticsearch is surprisingly easy to deploy: Java, env vars, and unzipping a few packages. We stuck the Elasticsearch config in a template where the names of all nodes get auto-generated from Ansible’s inventory. Worked well. For Logstash (we have a looooot of Logstash filters) we decided to create a role, since it was more complex. I noticed that the builtin “windows firewall” module for Ansible wasn’t all that good, so I turned to one of my own projects to generate a DSC-based module for firewall configs instead. Much better.
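As a sketch of what that template does, here’s the node-list generation reduced to plain Python (host names are invented, and a real role would render this with a Jinja2 template into elasticsearch.yml rather than string formatting):

```python
# Build the unicast host list for elasticsearch.yml from the
# inventory group's host names (names here are made up).
inventory = ["es-data-01", "es-data-02", "es-master-01"]

unicast_hosts = ", ".join('"{}"'.format(h) for h in inventory)
line = "discovery.zen.ping.unicast.hosts: [{}]".format(unicast_hosts)
print(line)
```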

Lesson learned: Use DSC-based modules where the builtin ones don’t do the trick.

I spent a loooooong time figuring out how to deal with different environments. We created some roles for provisioning Azure stuff – among others we have a “deploy vm” role which performs a truckload of validation and configuration. We built this stuff on top of my “Ansible-Arm” modules, which allow us better control of Azure than the builtin roles do. Mind you, this is very much “advanced territory”. Still, using this approach we can fine-tune how every VM comes up, and use Jinja2 templating to construct the JSON files which form the Azure resources we need.

For inventory, we’re running a custom build of my armrest web service. Actually, we’re running 4 instances of it, each pointing to its own environment (dev/test/uat/prod), which gives us 4 URLs. In Ansible we have 4 corresponding “environment folders”, so when I point to “dev”, Ansible knows which web service to talk to in order to grab inventory. Armrest also does some magic manipulation of the resource group names – especially stripping the “prod” or “dev” etc. names from each RG, so that we can target stuff in playbooks which will work across environments. An RG called “dev.mywebservers.europe” will get the group “mywebservers.europe” in Ansible’s “dev” inventory, for instance. All fairly easy to do, and all super-flexible.
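The group-name manipulation is simple enough to sketch. This mirrors the “dev.mywebservers.europe” example above; the function is illustrative, not armrest’s actual code:

```python
# Strip the environment prefix from a resource group name so the same
# playbook can target the group across environments.
def inventory_group(resource_group, env):
    prefix = env + "."
    if resource_group.startswith(prefix):
        return resource_group[len(prefix):]
    return resource_group

print(inventory_group("dev.mywebservers.europe", "dev"))
print(inventory_group("prod.mywebservers.europe", "prod"))
```

Both calls yield the same group name, which is the whole point: one playbook, four environments.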

Using armrest we also rely heavily on Azure’s “tags” feature, as these get translated into hostvars by Ansible. We use this to target playbooks where only a subset of the servers in an RG should perform a specific Ansible task.

For instance, this playbook:

- name: configure logstash nodes
  hosts: all
  tasks:
    - name: separate out logstash nodes
      group_by:
        key: "{{ application }}"
      failed_when: false

- name: Basicconfig
  hosts: elasticsearch
  roles:
    - role: rikstv_basicconfig_azurevm

Gets applied to Azure vms with this tag:

Note that armrest strips away the “ansible__” prefix and presents the rest as hostvars.
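A sketch of that prefix stripping (illustrative only, not armrest’s actual code):

```python
# Tags whose names start with "ansible__" become hostvars with the
# prefix removed; all other tags are ignored for this purpose.
def tags_to_hostvars(tags):
    return {k[len("ansible__"):]: v
            for k, v in tags.items() if k.startswith("ansible__")}

azure_tags = {"ansible__application": "logstash", "owner": "ops"}
print(tags_to_hostvars(azure_tags))
```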

So, this allows us to control config using tags and resource groups, which we can again provision using Ansible (during VM deployment, Ansible knows to add the “winrm” tags to Windows VMs which we deploy; Linux VMs get another set of tags).

As for deployment, we have 3 “main” repos: Ansible playbooks, roles, and a separate one for Ansible-based Azure deployments. These get bundled up into one “ball” by our CI process (TeamCity), and deployed to our “Ansible server” using Octopus Deploy.

Configs are invoked manually, either by ssh-ing to the Ansible node, or by using flansible REST calls.

We need to come up with a more structured way of testing changes. Right now we deploy to “dev” first, and if stuff looks good, we push to prod. Note that small batch sizes and a low rate of change allow us to do this. Bigger environments with more stuff going on will likely require more stringent testing procedures. I’m also super-thankful for WSL in Windows 10, which lets me smoke-test stuff before I commit to source control.

A few weeks back we got to test our stack as we essentially redeployed our entire Elasticsearch cluster node by node. Without Ansible there’s no way we would have been able to do that in a couple of hours without downtime. And the vast majority of that time was spent waiting for Elasticsearch to sync indices.

We’ve also used Ansible to push a large number of custom settings for our plethora of IIS nodes, stuff related to logging, IIS hardening, etc. The fact that we now can produce a “ready to go” IIS instance with filebeat indexing logfiles and all other required settings without manual intervention is great, and it already allows us to move a _lot_ faster than what we were used to. I’d still consider us “too manual” and not “cloud-scale”, but we’re slowly getting there, and Ansible has helped us every step of the way.

Lesson learned: Don’t over-engineer. One step at a time. Start today!

Using Ansible as a software inventory db for your Windows nodes

If you read this blog from time to time, you won’t be surprised by the fact that I’m a huge fan of Ansible. If you don’t, well – I am.

Ansible is really good at doing stuff to servers, but you can also use it to collect stuff from servers. There’s a builtin module which gathers a bunch of information about your infrastructure by default, and the cool thing is that it’s very much extensible. And very easy to do. So, since the wind was howling outside today and I had a Jira task regarding some package cleanup waiting for me, I decided to build a neat little software inventory thing using Ansible.

Note that this post requires some prior knowledge of Ansible; I won’t go through all the details.

The first thing I needed was a PowerShell script which lets me collect info about all the software on each node. I came up with this little thingy:

Now, in order for Ansible to use that script, you need to point Ansible to a folder where the script lives. I decided to do it like this:

This is a playbook solely responsible for gathering various pieces of information. So, as this runs, Ansible will execute the “softwarefacts” script and add it to the list of “known stuff” about the server.
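Conceptually, the merge looks something like this (the “software” fact name and the values are made up for the example):

```python
# The JSON a custom fact script returns gets merged into what Ansible
# already knows about the host from its builtin fact gathering.
builtin_facts = {"ansible_hostname": "web01", "ansible_os_family": "Windows"}
software_facts = {"software": [{"name": "7-Zip", "version": "16.04"}]}

facts = dict(builtin_facts)
facts.update(software_facts)
print(sorted(facts))
```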

The problem is, by default this info is not persisted anywhere. Ansible has some built-in support for storing facts in Redis, but that’s meant as a way of speeding up the inventory process, not storing the data indefinitely.

So, here’s what I do:
The last part of my “inventory playbook” looks like this:

What I do here is store each node’s info in a temp variable, and then use a template to write it to disk locally on the Ansible control node (the “dumpall.j2” template simply contains “{{ vars_hack | to_json }}”).
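That template step reduces to a plain JSON serialization (the fact values here are invented):

```python
import json

# "{{ vars_hack | to_json }}" is essentially json.dumps over the
# per-node variables collected during the play.
vars_hack = {"ansible_hostname": "web01",
             "software": [{"name": "7-Zip", "version": "16.04"}]}

rendered = json.dumps(vars_hack, sort_keys=True)
print(rendered)
```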

Lastly, I have a python script which will dump these files (one per node) into a RethinkDB database, which is the last step executed by Ansible:

(Word of caution: now that RethinkDB is closing shop, you might be wise to go with another DB engine. The procedure should be roughly the same for any NoSQL db tho.)
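A minimal version of such a loader might look like this – the actual script isn’t shown here, the db/table names match the query below, and the file paths are made up:

```python
import glob
import json
import os
import tempfile

# Read the per-node JSON fact files written by the playbook.
def load_fact_files(pattern):
    return [json.load(open(path)) for path in glob.glob(pattern)]

# Insert the documents into RethinkDB (old, pre-2.4 driver API).
def insert_into_rethinkdb(docs, host="localhost"):
    import rethinkdb as r
    conn = r.connect(host, 28015)
    r.db("ansible_facts").table("ansible_facts").insert(
        docs, conflict="replace").run(conn)

# Demonstrate the loading half with a throwaway fact file:
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "web01.json"), "w") as f:
    json.dump({"ansible_hostname": "web01"}, f)

docs = load_fact_files(os.path.join(tmpdir, "*.json"))
print(docs[0]["ansible_hostname"])
```

Using `conflict="replace"` means re-running the playbook simply refreshes each node’s document instead of duplicating it.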

Using RethinkDB’s super-nice web-based data explorer I can query the db for the full doc for each of my nodes:

r.db('ansible_facts').table('ansible_facts').filter({"ansible_hostname": "servername"})

After executing the playbook I can query the RethinkDB database for the list of software installed on one of my nodes:

We already have a simple front-end getting data from this database using a super-simple rest api written in Flask, and implementing support for software facts took me about 20 minutes.

Go make something!

Why Azure is not the hybrid cloud it should be

“Hybrid Cloud” is one of those buzzwords which leave me uneasy, almost as overused as the dreaded “devops” word.

Anyways, hybrid is important as customers are starting to move parts of their stuff to the cloud. Microsoft has always preached that they are the premier “Hybrid Cloud” platform. After helping a customer get ready to seriously migrate to the cloud over the last couple of months, I’m gonna stamp a big “meh” on the “Microsoft Hybrid Cloud”.

But first, let’s get our definitions going. The best “hybrid cloud” definition I could find in 2 minutes or less is: “The public and private cloud infrastructures, which operate independently of each other, communicate over an encrypted connection, using technology that allows for the portability of data and applications.”, which I found at cnet.

So, hybrid cloud is the art of making a given public cloud work as seamlessly as possible with a customer’s “private cloud”. I would even like to expand on this a bit, and call a hybrid cloud something which integrates easily with the customer’s existing infrastructure, even if the customer hasn’t “clouded” their infrastructure, but actually is running on more or less traditional (VM-based) infrastructure.

Also, don’t take this post as a massive and hateful attack on Microsoft. I work with Azure every single day, and most of the time I enjoy it. I’m not a Microsoft fanboy, but I’m definitely a fan – although I might just as well use Python and Linux to get my work done as Windows and PowerShell.

So, here’s the deal: Azure is a fantastic place for your “born-in-the-cloud” applications. It’s a much less fantastic place for the traditional n-tier apps you’re trying to migrate away from your (dying) on-prem infrastructure. My reason for saying this mostly relates to the huge shortcomings in Azure’s networking stack.

If you’ve worked with Azure for a number of years you might remember that virtual networks actually came after the first version of the VM (the so-called “cloud service”) was introduced. Stuff used public IP addresses, and access to these was controlled using ACLs or not at all. Now that Virtual Networks are a thing, an Azure “environment” bears a much closer resemblance to a “traditional” on-prem network than used to be the case, and it’s easier to migrate an application as-is from on-premises to the cloud partly because of this. Define your subnets, create your VMs and hit play. That is, until you want to start mixing your VMs with PaaS-based services such as webapps, SQL databases or other offerings. And, for most shops, using PaaS as much as possible should be one of the goals of any cloud migration, since the PaaS model in general offers way lower operational costs than IaaS-based workloads do.

So, you want to take your existing SQL Server VM, break apart the databases and stuff them inside Azure SQL. Guess what: you can’t put Azure SQL inside a vnet. Which means that your databases will be exposed on the internet. Which means that you either have to trust your password policies and auditing, or lay down some ACLs to protect your DB instance, which is now running straight on the internet. Unfortunately, Microsoft has not prioritized putting in place a control plane which enables easy administration of these ACLs. There’s no “allow access to this DB from my resources only”. If you have 100 VMs talking to that DB, that’s 100 ACLs you need to add. If you’re using dynamic IP addressing, those 100 addresses will change if you deallocate your VMs from time to time (as you would if you’re following the immutable infrastructure model). And unless you attach a public IP address to each of your VMs, there’s no way of even knowing which public IP those VMs will use when communicating with your Azure SQL database. Redmond, we have a problem.

Okay, so you gave up on Azure SQL and decided to leave your SQL Server VMs for now. Instead you’ve decided to take a look at Azure webapps in order to get rid of some of your servers running IIS. Azure webapps are great. The problems surface (again) when you want to integrate these with resources you have inside a vnet or an on-prem network, and you don’t want to expose the traffic between them on the internet. Webapps offer a few options for integrating with stuff inside a vnet, and every single one of them sucks. You can:

  • Configure web apps to “phone home” using VPNs. But you can’t do that if your vnet is already connected to on-prem via Expressroute. And if you use the new Vnet peering feature to split up your networking under a connected “umbrella” you’re also going to run into trouble.
  • Use “Hybrid Connections”, which is an agent you can install on your VM in order to have a web app connect to it. Fine for a single webapp-to-server connection, but it doesn’t work at all on a larger scale.
  • Use App Service Environments – basically a privately hosted webapp component which you can put inside your vnet. The only problem is that it’s going to be much more expensive than the VMs you’re trying to replace.

So you’re basically stuck.

So you decide to just leave your VMs as-is for now, hoping (praying) for a better tomorrow. It’s time to implement some load balancers which will spread out the load across your IIS servers. Load balancing in Azure is fishy at best. You can’t publish an HTTP server using both an internal and an external load balancer at the same time (or, you can with the latest update – after configuring a loopback NIC inside your VM with the same IP address as your public load balancer. I mean, what?). You cannot attach an NSG directly onto a Load Balancer, so make sure you know what you’re doing, and please do not do what the Portal tells you, because you’ll end up in security hell. You can’t move VMs between networks, replace their NICs, or change availability sets after they’re created, so you’d better get your PowerShell hat on, because there’s no way you’re gonna get everything right the first time – and you can’t use the portal to spin up a VM based on an existing disk.

These are just examples. All in all, you’re gonna have a pretty lousy day when venturing into hybrid land. All of this is not because Azure is bad or Microsoft is evil or anything like that. The fact is that they’re churning out an impressive array of features almost daily. And the problem I have with that is that Microsoft right now has a fundamentally wrong focus on new shiny stuff instead of “let’s make Azure a great platform to operate”.

So there. Azure is a great platform for cloud-native stuff. For traditional hybrid, VM-based workloads it is simply not nearly good enough. Let’s hope it changes. In the meantime, I’m playing with this thing called GCE.


It’s not about Nano server or GUIs. It’s about modern vs legacy management tooling

I came across this article, fully knowing what it was about just by looking at the author’s name.

Don’t get me wrong, I have nothing at all against Aidan Finn, but the whole premise for the article is wrong in my opinion. In short, the article explains how Nano’s lack of a GUI makes management more difficult for the “regular” IT shops – even the ones that have been known to have a “little collection of scripts” that they use.

In short: Nano is not for you. Nano is not for the small shop on the corner running Small Business Server 2003. It’s not for the medium-sized businesses mainly managing using “legacy” tooling (and I consider System Center a very good example of legacy). There’s nothing at all wrong with using GUIs; it’s what made Windows server so popular among SMBs in the first place.

Or, to put it differently: If you’re not able to take your existing deployment/management stack and simply plug in Nano server with a minimal set of changes, it’s simply not meant for you. Sounds harsh, I know. But let’s look at the facts: Nano server doesn’t have a GUI. It will require you to be able to fully configure a server without touching it. If it’s a physical host, then your provisioning/configuration stack needs to be able to bring it into a fully workable state. Same thing if it’s a VM, although the provisioning part is probably a bit lighter. If something goes wrong with your server, you need to be able to shoot it in the head and redeploy, which means that the applications you run on Nano need to have resiliency against node failure built in. For a lot of modern web applications, this is trivial. The server is largely a stateless worker and easily scales out or in without users noticing it. For a single file server where spreadsheets are stored, not so much.

And just to clarify: “workable state” is not “look ma, I deployed a server”. Workable state means that the server does something meaningful. It runs an application which provides some sort of value to your employer or customer.

If you are already using modern tooling, this is not a big deal. You bring up and down servers year round. If they fail, you bring up a new one. If they fail in the same way often, you investigate. For you, Nano server is an interesting way of bringing down deployment time and node footprint.

However, if you’re still running servers “by GUI” there’s nothing at all wrong with that. The fact that this management model doesn’t scale in the same way as modern tooling doesn’t necessarily mean it’s wrong for every single company on the globe. It’s all good.

Looking at the comments section of that article tho, it is my clear opinion that people are getting it wrong if they plan to use Nano for “HCP, DNS, SOFS, PKI, etc”. It’s not the workload you put on your server that should dictate your choice of OS, it’s the management platform you have in place to, um, manage it. If you’re a “mostly GUI” shop suddenly deciding to implement Nano-based DNS servers because “it seemed cool”, you are gonna fail. Hard.

So, don’t look at Nano server. Look at your management stack. If you’re already using some modern tooling and your config is safely versioned in git, you’re golden. If not, do yourself (and your users) a favor and just disregard Nano for the time being.

Taking Docker for Azure out for a spin

As you may have read, I’ve been mighty impressed with the networking stack in Docker swarm mode. Today I got access to the beta of Docker for Azure, which is an offering aimed at getting you up and running super-quickly in the public cloud. All you need is essentially a service principal and an SSH key.

In addition to the network goodness of Docker swarm mode, Docker for Azure includes some auto-provisioning which lets Docker handle load balancer configuration for you. So, if you publish a service on port 80, Docker for Azure will make sure the load balancer and NSG rules are configured for that, and that’s what I thought I’d show in this post.

Getting access

First, the Docker for Azure thing is currently in private beta, so you’ll have to request access. Expect a week or so for this.

Deploying the thing

Before you get going, you’ll need two things: an SSH key and a service principal. The latter isn’t too hard to configure yourself, but why bother – the good folks at Docker have created a Docker image which will do it for you.

Using Docker for Windows, I simply spun up this container, which contains the scripts necessary to setup the required service principal:

docker run -ti docker4x/create-sp-azure sp-name

That script will provide you with a code, which you enter into a browser on your local computer, before signing in with an admin user (that user must have permissions to create service principals in Azure AD). It will then give you the app ID and secret which you need to deploy the template. The app will be named “dockeronazure”, and from the looks of it, it’s getting contributor access to your entire Azure subscription. I don’t quite understand why they don’t just give it access to the resource group instead of the entire thing, but that’s easy enough to tweak. If running the script yourself is not an option, it’s easy enough to create a service principal in the old portal (just make a note of the app ID and generate a secret). It doesn’t need any special permissions, as you can configure them in the “new” portal using the IAM link on either the resource group or the entire subscription.

This service principal is used whenever Docker for Azure needs to interact with the AzureRM REST API, for instance to configure load balancers or perform port openings.

In my email invitation I got a link to start the deployment of the thing:
[screenshot: Docker for Azure private beta invitation email]

Note that when you sign up for the beta, you have to specify your subscription ID. The deployment will check this, so make sure you’re deploying to the right one if you have multiple subscriptions.

Here are the parameters you have to supply for the template deployment:
[screenshot: template deployment parameters in the Azure portal]

For testing you can safely scale down to a STANDARD_A1 image. The lower 3 inputs are the most critical to get right.

Running the thing

After the deployment is done, you should get a confirmation page from the deployment with two outputs; the ssh command you need to run, and the public ip of the load balancer placed in front of your containers:
[screenshot: deployment outputs in the Azure portal]
Using my favorite console (which of course is cmder), I can simply log in to the manager node using my private SSH key.

So, now that the thing is running, let’s give it a few services:

docker network create -d overlay nginx-net
docker service create --name nginx1 --network nginx-net -p 80:80/tcp nginx
docker service create --name nginx2 --network nginx-net -p 81:80/tcp nginx
#Scale a bit
docker service scale nginx1=5
docker service scale nginx2=4

After a few moments, docker will have scaled the services correctly, which you can verify by running

docker service ls

[screenshot: docker service ls output in cmder]

Now, the interesting thing here, is that the Azure load balancer in front of these containers automatically gets updated with the corresponding rules. We configured one service on port 80 and another on 81.

Looking at the load balancer, it’s clear that the provisioner did its job:
[screenshot: load balancing rules in the Azure portal]

And we can go ahead and test the services from a browser using the ip in the deployment output shown further up:
[screenshot: the nginx welcome page in a browser]

The nginx start page might not be the most exciting thing we could have published, but it serves as a very quick way of showing how Docker for Azure allows for a very tightly orchestrated experience – bring up a service and don’t worry about the networking thing – it just works.

This is just the start tho, much more is possible using “docker deploy”.

Using Elasticsearch and Azure functions to store Azure Activity Logs for analysis

Whoa, that’s some title!

In short, I’m doing a bunch of work with Elasticsearch at the moment, so I thought it would be a fun project to start putting some Azure audit logs in there. That would enable us to set up reporting on failed requests, who did what, etc, etc, etc.

If you don’t know Elasticsearch, it’s an awesome platform for indexing and querying data. It is blazing fast, and is used for everything from website search engines to log analytics. Powerful stuff.

There’s a ton of ways to put data into Elasticsearch. I prefer to send logs and similar into Elasticsearch thru another product called Logstash. Logstash has a processing pipeline that enables validation and transformation of the incoming message stream, so that we can be sure that the data ending up in Elasticsearch is of high quality and correctly parsed.

Now, as far as Azure goes, more and more of Azure services support sending their logs to Azure Eventhubs, something I think is awesome. This lets us set up all kinds of data processing pipelines using these logs. You can for instance hook up Stream analytics/PowerBI to graph your data in a nice way. Tho that’s not the route I’m going right now.

Event Hubs are (tada) event-driven, so in order for us to push the incoming audit logs, we need something to trigger on these events. I thought this would be a perfect opportunity to test Azure Functions, the new “serverless” PaaS offering in Azure. The tooling is still way too rough for my taste, but it works.

So, here’s what we have:


  1. As you or someone else performs activities in Azure, Audit logs are generated. These are sent to an Event Hub you configure.
  2. An Azure Function gets triggered whenever there’s a new message in the Event Hub
  3. The message coming from Azure is an array which (may) consist of multiple messages. Therefore, our function will split these up into separate events
  4. Using Logstash’s http input plugin the function can post each event to Logstash
  5. Logstash performs formatting/parsing/whatever you want, and pushes the event into the correct Elasticsearch index
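Step 3 above – splitting the incoming array – can be sketched like this (the “records” key is how Azure wraps batched diagnostic events; the operation names are just examples):

```python
import json

# One Event Hub message may carry several activity-log records;
# split them into individual events before forwarding to Logstash.
def split_events(message_body):
    payload = json.loads(message_body)
    return payload.get("records", [])

raw = json.dumps({"records": [
    {"operationName": "Microsoft.Compute/virtualMachines/start/action"},
    {"operationName": "Microsoft.Network/networkSecurityGroups/write"},
]})

events = split_events(raw)
print(len(events))
# Each event would then be POSTed individually to Logstash's http input.
```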

All of this is relatively easy to setup.

  1. Sending Audit Logs to Event Hubs:
    1. In the Azure Portal, browse to “Activity Log”
    2. Click “Export”
    3. Choose “Export to an event hub”. Note that you may have to go to the old portal to set up a service bus namespace first. The wizard will create its own eventhub for you, but the Service bus namespace will have to exist already
  2. Using Functions as a trigger:
    1. Create a new Azure Function – I was lazy and used the “EventHubTrigger – C#” starter template. The code you need in there can be found here:

      Here you also set up the connection to the event hub created in step one.

    2. In order to reference nuget packages, your Functions app also needs a “project.json” in the same folder as the function itself. This is easiest done by going to the Function’s app service and then use the “App Service Editor”. Here’s the contents of my “project.json” file:
      [screenshot: project.json in the App Service Editor]
  3. Logstash setup: I won’t go into details about setting up the ELK stack here; basically you need a VM with Java, the JAVA_HOME env variable set, and a few zip packages for the products themselves. That’s been documented in detail in so many other blogs. The Logstash configuration I’m using basically says “listen for http messages on port 8081, and forward them to the Elasticsearch index called logstash-http-input-test”.
  4. As far as Elasticsearch goes, there’s really not much to configure

Here’s an example of a message viewed in Kibana, which is the web ui usually used on top of Elasticsearch:

[screenshot: an activity-log event viewed in Kibana]

As you can see, Logstash breaks apart the json message coming in from Azure Functions, and sends it to Elasticsearch. There’s obviously a ton of data transformation I could do as well on this, but I wanted to keep it simple.

At this point it’s time to go to town on the incoming data. Here’s a list of the count of each type of resource I’ve done an operation against:
[screenshot: count of operations per resource type in Kibana]


And these are the users that perform those operations:
[screenshot: operations per user in Kibana]

To sum up, this is just an example of how to do (potentially) meaningful things with the data that Azure spits out. Event hub integration is being rolled out to (as far as I know) all the “things” in Azure that produce data/logs, so this has the potential to be very powerful. It might be a good idea to consider webjobs instead of Azure Functions, as the sheer volume of these logs may be high enough to be expensive.

Azure Vnet peering

Vnet peering is a brand new feature in Azure networking that you need to know about. In short, it allows you to connect two separate vnets without using a site-to-site VPN link. VPN links rely on gateways which cost money by the hour, so this allows for a vastly simplified networking topology if you for some reason are using multiple vnets. My current project has separate subscriptions (and thereby vnets) for production and dev/test, so this allows us to link those vnets using peering.

Peering also allows for transitive communications thru a VPN link if one of the peered vnets has one, so you can use peering to set up a “hub and spoke” topology where only the hub vnet needs to have a site-to-site VPN link down to your on-prem network. I wanted to test this out, so I deployed a very simple example using 3 vnets, one of which simulates the local (on-prem) network. This is what I did:

[Diagram: peering combined with a VPN link]

In my example, vnet1 has a VPN connection to our mock local network (vnet3). In addition, vnet1 and vnet2 have a peering relationship. Configured correctly, this allows VMs in vnet2 and vnet3 to communicate with each other using vnet1’s VPN connection.

Note that vnet-to-vnet VPN links can be configured using a simplified setup in the Azure portal. I didn’t go that route, as I wanted a real-life simulation of a local-to-Azure VPN setup. This required me to deploy both a gateway and a “local network gateway” for vnet1 and vnet3, as the “local network gateway” is a representation of the opposite network in the link. The way to set this up is to first create the virtual network gateways for both vnets with their corresponding public IP addresses, and then use that info to create the required local network gateways. Logically, a vnet gateway connects to the opposite side’s “local network gateway”.

An important thing to note here is that the local network gateway representing vnet1 needs to include the address ranges of the vnet itself, in addition to the address ranges of any peered vnets. This is how the local network (vnet3) knows how to route traffic destined for the peered networks.

Also, the peering relationships need to be configured to allow this. A peering relationship is set up from both sides, so to make this work it needs to be configured as follows:

Vnet1’s peering config:
[Screenshot: vnet1’s peering configuration]

And vnet2’s peering config:
[Screenshot: vnet2’s peering configuration]


These settings instruct the peering relationship to allow the use of vnet1’s VPN link to get traffic to and from vnet3 (the local network).

I deployed a small Linux VM in each network to test ping, but I won’t bore you with pics of the console output. It works.









The beautiful networking stack in Docker Swarm mode

Okay, so I don’t normally run around and call networking beautiful, but this stuff really is. I’m getting up to speed on Docker Swarm mode at the moment, which is the new “in-box” clustering in Docker 1.12.

Here’s what the new networking stack in Swarm mode solves:

Take an app which you want to make accessible to users. I’m just using the nginx image, since it serves a website on port 80 by default. Here’s my setup:

[Diagram: two Docker hosts, each running an nginx container, behind a load balancer]

So, I have 2 Docker hosts, “manager” and “worker”. Each host is running a container using the nginx image. In this setup, I can publish port 80 on my Docker hosts, slap a load balancer in front and be home for dinner. This has been easy for as long as I’ve pretended to know Docker: simply run

docker run -d -p 80:80 nginx

on each container host.


But: what if one node failed? Or what if you’re running multiple container hosts, but aren’t running the nginx container on all of them? In the past, this would require a ton of tweaking on both the load balancer (which would require you to reach out to whatever you’re running the load balancer on) and possibly the container hosts. Also, you wouldn’t be able to run more than one of these containers on each container host, since you are placing a “lock” on port 80 of that node’s “public” IP.

Now, this is where Swarm mode shines.

Swarm mode sets up a cool forwarding feature (the routing mesh) that allows any host in the cluster to respond on the published port of any published service, and forward the traffic to whichever container host actually runs that container. Something like:

[Diagram: traffic hitting the manager node being forwarded to the container on Worker 1]

In this scenario, I’ve scaled my nginx service down to only one node. My external load balancer still balances between both container hosts, so in the “old days” this would mean that my stuff simply wouldn’t work (obviously the external LB would probe each host, but that’s beside the point right now). With Swarm mode, whatever traffic the “Manager” node gets is simply forwarded to the container running on Worker 1. I can even reconfigure my external LB like this, and it will still work:

[Diagram: the reconfigured external load balancer setup]


So all in all, this makes the whole (previously daunting) task of running web things (and other services with a listening port) so much easier.

Retaining my enthusiasm for tech through bad days at work

Wow, that was a screwed-up 6 months. I started a new job, my first CxO position, and realized after two weeks that the CEO was a sociopath. I managed to hang in there for 2 long months until I said “enough” and gave my notice, then stayed on for another 2 long (looooong) months to make sure my customers didn’t get stuck with half-finished projects. All in all, it’s been a large dip in my (so far) otherwise interesting and fun career.

I noticed stuff happening as my motivation reached lower and lower levels: I started to not care about the things I like to do. I stopped paying attention to news around Azure and PowerShell and Ansible. I had to let summit organizers know that I was going thru a rough patch and wasn’t able to present (which every single one accepted without giving me shit, for which I am very thankful). I was kind of going downhill in my love for tech.

Until I realized something: I can do fun tech-related stuff OUTSIDE of work. In other words, I realized I needed a hobby that didn’t involve snowboards and mountains. I decided to take up drones. The rest of this blog post is about what I’ve done and learned over the last 6 months, and let me warn you: it’s all pretty far from Azure and PowerShell.

So, why drones? I’ve dabbled in RC on and off for as long as I can remember. I even bought a Phantom 2 drone a couple of years back to goof around filming aerial shots. I quickly realized that filling my backpack with all the gear necessary to shoot my friends skiing or snowboarding from the air wasn’t going to happen as often as I’d hoped, so when Squadrone Systems released their autonomous drone, the Hexo+, I ordered one straight away. This was a drone MADE for the outdoors and for snowboarding. I didn’t even need to shoot my friends; I could have it film me as I rode down my favorite mountains here in the mountainous northwest of Norway.

Or so I hoped.

Turned out, the Hexo+ sucked (still does, last time I checked). It uses the operator’s cell phone GPS for tracking, and we all know how imprecise those units are. The system operates at 1 Hz (one update per second), and let me tell you: lots of stuff can happen in one second when you’re cruising down a mountain on a snowboard. All in all, the thing sucked. Which led me to believe that I could build something better myself. And so the research started.

Modern drones are controlled by a flight controller, basically a small onboard computer with tons of sensors and enough smarts to figure out what the motors need to do when you push the stick on your RC transmitter forward. The flight controller levels out the drone, makes sure it stays at the right altitude and does a truckload of other things. As I researched, I kept coming back to a flight controller unit called Pixhawk, which appealed to me because of its ability to run different flight controller operating systems, both of which are open-source. As I kept digging, I figured out that it’s even possible for a small Linux computer (such as the Raspberry Pi) to run these operating systems, and that it’s possible to communicate with the controller in different ways (both cabled and wireless). I even visited a drone company in Oslo, Norway, who are building some cutting-edge stuff based on these controllers. Much fun.

So, I went to town on a couple of Chinese online stores and ordered about 8000 pieces of small items I couldn’t exactly identify, but figured I’d need in order to build what I thought of as a “research platform” (which really was a butt-ugly hexacopter with a Pixhawk unit, a Raspberry Pi and all the dangling wires that come with it).

I decided to go with the flight controller operating system called ArduPilot, since I’d heard of it before and because there’s a corresponding Python SDK that goes with it. By the way, I didn’t know Python at the time. Good time to learn.

I also figured I needed some kind of “ground station” with a precise GPS, so I bought a USB-based GPS receiver with much higher precision and update speed than a regular smartphone GPS. I spent 2 days looking for a way to get my Android phone to use that as a source instead of the built-in GPS, and it turned out I needed to write that myself. By the way, Android apps are written in Java. I didn’t know Java. Good time to learn. 4 days after I started Android Studio (the Android IDE) for the first time, I had successfully integrated a USB driver I found in some obscure GitHub repo into my own app, and was able to replace the phone’s GPS function with my USB unit’s high-precision GPS signal. I was amazed at how similar Java actually is to C# (and, to some extent, to PowerShell). Very strongly typed, very object-oriented, very well documented on the internets. My app didn’t perform as well as I wanted (we’re talking a bunch of updates per second), so I had to learn about multithreading in Java, which came to good use later in my endeavour.
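In case you’re wondering what “replacing the phone’s GPS” actually involves: most GPS receivers just stream NMEA sentences over serial, and the interesting part is parsing them. My version lived in Java inside the Android app, but the core idea looks something like this Python sketch (the sentence below is a textbook example, not output from my unit):

```python
def parse_gga(sentence):
    """Parse an NMEA GGA sentence into (lat, lon) decimal degrees.

    NMEA encodes latitude as ddmm.mmmm and longitude as dddmm.mmmm,
    so 4807.038 means 48 degrees 7.038 minutes.
    """
    fields = sentence.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")

    def to_degrees(value, hemisphere, deg_digits):
        degrees = int(value[:deg_digits])
        minutes = float(value[deg_digits:])
        decimal = degrees + minutes / 60.0
        # South and West are negative in decimal-degree convention
        return -decimal if hemisphere in ("S", "W") else decimal

    lat = to_degrees(fields[2], fields[3], 2)
    lon = to_degrees(fields[4], fields[5], 3)
    return lat, lon

# Textbook example sentence (checksum omitted for brevity)
lat, lon = parse_gga("$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,")
print(round(lat, 4), round(lon, 4))  # prints 48.1173 11.5167
```

The real work on Android was getting the USB serial driver going; once sentences like this start arriving, the parsing is the easy part.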

Just for the fun of it, I found a YouTube video on how to reverse-engineer an Android app from start to finish (made by an extremely soft-spoken dude in India), and managed to crack open the Hexo+ app and force it to use my high-precision GPS signal instead of the phone’s. People are awesome.

Back to my main task: sending my phone’s GPS position to my “research drone” so that it would follow me smoothly. I banged my head against the Linux Bluetooth stack and measured performance differences between HTTP and raw TCP. I wrote a heavily multithreaded Python app to serve as the “command center”, receiving signals from my phone and deciding where to fly the drone. I learned about Python queues and TwistedMatrix, an amazing project for writing networked applications in Python. Oh, and I learned Python. Turns out it’s not so hard, although I personally prefer to work in a more strongly typed environment. The Python community is awesome tho, and I was learning so much that I felt like a new person at the end of every day.

I had to brush up on math skills learned and forgotten long ago. Stuff like “how do you calculate the distance in meters between two GPS coordinates?” What’s a covariance? Again, the community is awesome, and without it I would have gotten seriously stuck on problems calculating headings and offsets and other things my math teacher in high school would have loved.
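For the record, the distance question has a classic answer: the haversine formula. A minimal Python version, assuming a spherical Earth (which is good to within about 0.5%, plenty for follow-me logic):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    # haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two points roughly 100 m apart along a meridian
print(round(haversine_m(62.0000, 7.0000, 62.0009, 7.0000)))  # prints 100
```

At 1 Hz updates, the same formula also tells you how far the target jumps between position fixes at snowboarding speed, which is exactly the jerkiness problem I was trying to smooth away.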

And I had to learn about drone hardware. About ESCs and power ratings and motors and prop thrust and prop wash and a thousand other things. I (re)learned how to solder properly. I even had to teach myself how to crimp a PicoBlade connector, which required a magnifying glass and more patience than I thought I had in me.

And in the end, it flew. It flew beautifully. Or, not beautifully, but at least it flew. I needed to come up with a way to smooth out the flight commands my Python app sent to the Pixhawk flight controller. I discovered a whole simulation framework which enabled me to test new ideas without crashing my precious drones (yep, they multiplied. Don’t know how that happened). I figured I could use Windows 10’s new “Bash on Windows” to run the Linux parts of the simulation stack to avoid having to spin up a VM whenever I wanted to tweak something, which shot my productivity thru the roof. I even created a YouTube video about that, for which Erle Robotics in Spain sent me their own custom Raspberry Pi-based flight controller system, just because they thought it was cool. Like I said, the community rocks.

I learned more about Python and class-level properties versus instance-level properties. I even started enjoying just flying drones again, so I bought a crazy small/cheap FPV racing drone which I could (and did!) crash without it costing me more than my car is worth.

I got word that some Norwegian drone company had “heard of my work in the autopilot space”, whatever that means. I mean, I was just goofing around trying to cancel out the insanity that was my day job.

And I built a website with more ugly jQuery code behind it than I ever thought possible to put inside a single JavaScript file. But it works. It works beautifully.

I can use that webpage on my phone to take off my drone and have it follow me, and all of that happens because of code I wrote. In Java. In Python. In JavaScript.

Next week it’s time to realize that I’ll soon have a day job again. It’s a good company. I’ll get to work with Azure and AWS and automation and all the things I love. However, having a pet project to keep me busy thru the last months has meant so much to me. My buddies have even started buying drones because I couldn’t shut up about how much fun I was having. I’ve spent late nights and early mornings in front of the computer, not because I got paid but because I just couldn’t not. It was too much fun. And I’ve had the opportunity to dip my toes into a vast range of technologies and programming languages, and I’ve realized they all have their strengths and weaknesses. I used to look down on Java devs. I don’t anymore; it’s a kick-ass language. Python rocks as well. As does JavaScript, although my skill level there is mostly cut-paste-chrome-debug.

Money has been tight, and will be even tighter before it gets better. Still, I wouldn’t trade the last months’ whirlwind of technology learning for anything.

And if you think your VM deployment script is exciting, try writing code that makes things actually fly by themselves.

Now I’m looking forward to getting back on the horse, and to chatting with all my friends in the Azure/PowerShell community. And to getting paid.