Auto Scaling Continuous Integration

How we parallelised our builds

Automation is the backbone of Bipsync’s technical operations and we aggressively test every part of our application on every commit. This article is the story of how we outgrew our Continuous Integration (CI) setup and how we improved it to scale for our needs in 2017 and beyond.

The way we were

The CI setup for the main Bipsync product used to look like this:

Our CI ran an exhaustive test suite in real browsers and triggered our Continuous Deployment (CD) pipeline on success. This served us very well, but as Bipsync has grown, we began to experience problems.

Problem 1: Non-representative test environments

The Docker test containers were not fully representative of production. Whilst they installed roughly the same dependencies as production, they were not provisioned by our production Ansible playbooks so were not fully equivalent.

Secondly, the integration tests ran on a special “blessed” server that was set up to run a Selenium grid. By “blessed” I mean a server that we treat specially – this server was not maintained in the same way as our other infrastructure. “Blessed” infrastructure is usually a bad sign in devops land – it makes the system hard to reason about and scary to replace.

Problem 2: Locking architecture

Our builds had two lock points where the build could not be parallelised:

  1. Unit tests and static analysis
  2. Integration tests

All this led to resource contention, meaning that builds were getting backed up:

This led to a situation reminiscent of the old Three Stooges sketches, in which Larry, Curly and Moe would try to run through the same door at the same time and all three would get stuck. As our software development team grew, our commit volume increased and this type of problem became more prevalent.

Erosion of trust

The two problems we have described led to slow builds and frustrated developers. A key principle of software development is rapid feedback cycles – the longer the feedback takes, the less useful it is, so we were facing an erosion of trust in our CI setup, which could become a very serious problem.


Retooling our CI setup

Step 1: representative infrastructure

We have had great success with container-based CI for Ansible, so it was a natural next step to modify our build process to use Ansible-provisioned containers instead of our hand-rolled Docker ones.

We built on our existing Infra test container (shown in purple in the diagram below) and applied our production Ansible playbook to create bipsync_unittest:

 

This was straightforward and we were able to drop it into our build right away, so it occurred to us that we could use this approach to test the Bipsync application with experimental versions of dependencies like PHP and Mongo. So, we constructed multiple containers in the same way:

 

This meant that we could test dependencies without tinkering around with some “blessed” infrastructure.
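To illustrate, a dependency matrix like this can be driven from a scripted Jenkinsfile. This is a minimal sketch only: the image names, node label and test command below are invented, not our real configuration.

```groovy
// Hypothetical Jenkinsfile fragment: run the same unit suite inside
// several Ansible-provisioned images, one branch per image.
def images = ['bipsync_unittest', 'bipsync_unittest_php7', 'bipsync_unittest_mongo3']

def branches = [:]
images.each { img ->
    branches[img] = {
        node('docker') {              // any slave with Docker available
            checkout scm
            docker.image(img).inside { // Docker Pipeline plugin
                sh 'vendor/bin/phpunit'
            }
        }
    }
}
parallel branches
```

Adding a new experimental dependency version then becomes a one-line change to the image list.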

Step 2: removing the main lock

The lock on “main” existed largely because the whole build ran on the Jenkins master node. To make the first phase of our build non-blocking, we added Jenkins slave nodes that had the bipsync_unittest containers available to them:

We were therefore able to remove the main lock, and the first phase of our build was now parallelisable. This reduced the load on our master node and gave us faster feedback on coding standard violations and unit test failures.
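A declarative pipeline can express this non-blocking first phase directly. As a sketch (the node label and commands here are assumptions, not our exact setup):

```groovy
// Sketch: static analysis and unit tests run in parallel on any slave
// labelled 'unittest', so nothing queues on the master node.
pipeline {
    agent none
    stages {
        stage('Fast feedback') {
            parallel {
                stage('Static analysis') {
                    agent { label 'unittest' }
                    steps { sh 'vendor/bin/phpcs src/' }
                }
                stage('Unit tests') {
                    agent { label 'unittest' }
                    steps { sh 'vendor/bin/phpunit' }
                }
            }
        }
    }
}
```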

Our integration suite, however, was still on “blessed” infrastructure.

Step 3: obviating the “blessed” integration server

Removing the “blessed” integration server proved to be more difficult – not only did it have a web server preconfigured, it was also running Selenium Grid for integration tests. Because we already use Paratest to parallelise our integration tests, we also needed to serve the Bipsync application on separate URLs for the separate Paratest threads:

This meant that, however we ran Bipsync, the Windows slaves needed to somehow be able to route back to the containers.

Our eventual solution was to have a reverse proxy on the Jenkins slaves:

This proxy meant that the slave could route traffic through to the appropriate container. With this in place, all we needed to do was create some DNS entries so that our Windows slaves could route back to the appropriate slave:
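The proxy rule itself can be sketched roughly as follows – all hostnames and port numbers here are invented for illustration, on the assumption that each Paratest thread's container publishes on a sequential port:

```nginx
# Illustrative reverse-proxy rule on a Jenkins slave: one hostname per
# Paratest thread, each routed to the container serving that thread.
server {
    listen 80;
    server_name ~^thread(?<thread>\d+)\.slave1\.ci\.example$;

    location / {
        # Assumes containers publish on ports 8081, 8082, ...
        proxy_pass http://127.0.0.1:808$thread;
        proxy_set_header Host $host;
    }
}
```

The DNS entries then only need a wildcard per slave, so the Windows nodes can resolve any thread's URL.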

Once this was solved, we had no more “blessed” infrastructure – just a standard Ubuntu server running Selenium grid with a pool of Windows nodes behind it.

We determined that we could split our integration suite into three sub-suites, and these ran in roughly the same time as any two unit test container builds. Each build therefore required four executors:
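Dispatching the three sub-suites from a Jenkinsfile might look something like this sketch – the suite names, node label and Paratest flags are illustrative, not our actual configuration:

```groovy
// Sketch: three integration sub-suites run in parallel against the
// Selenium grid, each on its own executor.
parallel(
    'suite-1': { node('selenium') { checkout scm; sh 'vendor/bin/paratest --testsuite suite-1' } },
    'suite-2': { node('selenium') { checkout scm; sh 'vendor/bin/paratest --testsuite suite-2' } },
    'suite-3': { node('selenium') { checkout scm; sh 'vendor/bin/paratest --testsuite suite-3' } }
)
```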

Step 4: scaling

At this point, our executors were interchangeable – nothing was “magic” or “blessed”. This is very important: everything was replaceable at any time, nothing was treated like a “pet”, and any executor could handle any task. With that interchangeability, we could create “pools” of infrastructure to service our test builds and scale them up and down – in effect, a “swarm” of test executors. Very powerful indeed.

The infrastructure looks something like the following:

Because nothing is “blessed”, we can scale the Windows or Ubuntu clusters simply by adding or removing nodes.

We proceeded with manual scaling until we had a feel for the level of load each build executor could handle.

Step 5: autoscaling

All of this infrastructure doesn’t come for free! Once we had it working nicely we elected to reduce costs.

Thankfully, for the Ubuntu slaves this was simple. Jenkins has plugins for launching new slaves on demand, and we set them to terminate after 30 minutes of inactivity.

The Windows slaves were a little more complex. After some measurements, we determined that to run the integration tests at full speed, one Windows slave was required per 1.5 Ubuntu slaves. We produced a custom Jenkins job with a Groovy script that determined the correct number of executors and provisioned them in the cloud.
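The heart of such a job is the ratio calculation. A minimal sketch, in which the label names and the provisioning call are stand-ins for whatever cloud API is in use:

```groovy
// Sketch of the scaling calculation: one Windows slave per 1.5 Ubuntu
// slaves, rounded up. Label names are illustrative.
def ubuntuSlaves = jenkins.model.Jenkins.instance.nodes.count { node ->
    node.labelString.contains('unittest')
}
def windowsNeeded = Math.ceil(ubuntuSlaves / 1.5) as int

echo "Ubuntu slaves: ${ubuntuSlaves}, Windows slaves required: ${windowsNeeded}"
// provisionWindowsSlaves(windowsNeeded) // hypothetical cloud-provisioning call
```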

Step 6: do a talk and write the user manual

This is the bit that many blog posts don’t tell you – once you’ve made a big change, you need to communicate it to your coworkers!

Once our continuous integration infrastructure was up and running satisfactorily, we spent a lot of time documenting it and gave an internal talk to make sure the whole team was on board.


The benefits of our new infrastructure

The non-blocking builds mean that developers get their results faster, which is extremely important because rapid feedback is the lifeblood of software development.

There were further benefits that we perhaps did not anticipate – having parallelism in builds allows us to branch more freely, meaning our master branch is always in a deployable state. Our workflow therefore became more flexible and our incident reaction time decreased significantly.

Furthermore, the lack of “blessed” infrastructure means that everything is defined simply and clearly and everything can easily be destroyed/recreated at any time.

Advice on retooling your CI setup

Minimise disruption by working in parallel

For large changes, do not tinker with your main CI setup – have a parallel installation if possible.

For smaller changes, a central build file in your source control system is a perfect fit for experimenting with different build configurations. Thanks to Jenkinsfile, we were able to try all of these changes without disrupting the developers.

Measure everything!

A lot of time and tuning went into determining the optimum parallelism, build lengths, and test suite splits. Measure everything, don’t guess.

Make it work, make it fast, make it cheap

The way I approached this is to:

  1. Make it work
  2. Make it fast
  3. Make it cheap

There was a spike in our infrastructure costs as I was working on this – spinning up virtual machines and driving them hard with multiple containers – but it will pay off in the long term.

Aim for interchangeability

If you can use all of your build executors interchangeably, you will be able to service more builds than you would if you had specific executors for specific projects.

This is not always possible – e.g. some projects may require different operating systems – but through containerisation/virtualisation, build executors can be fairly generic.

Don’t expect to retool overnight

Retooling our CI setup did not happen overnight – in fact, here are just some of the notes that I took whilst designing this process!

 

See also