At Simple, we take great care to ensure our infrastructure is easy to understand. This is so important to us that, over the past few months, we rebuilt the entire system from the ground up in order to simplify it. In this post, I’ll explain why we did such a crazy thing, and how we did it.
Too Many Moving Pieces
Simple operates a service-oriented architecture on Amazon Web Services. We used Chef to manage these services on mutable and long-lived instances. Clients on an instance would periodically converge the system from its current state to the desired state, as defined by our central Chef server. However, as we added more services (and therefore more instances) to the infrastructure, we found it difficult to manage a system with so many moving parts.
We observed that the more often Chef converged these instances, the more likely we were to run into unexpected problems. A simple example is software upgrades: a shared library dependency could be upgraded to a version incompatible with another program. In one case, a small version bump in a seemingly unrelated cookbook caused our PostgreSQL instances to upgrade two major versions, leaving all of our databases temporarily inaccessible. In (arguably) worse cases, bugs wouldn’t trip our monitoring systems. Instead, instances would silently misbehave until someone noticed an issue, possibly hours or even days later. Our operations engineers ran into these kinds of problems so often that examining the Chef client’s logs became the first step during incident response.
Throw It All Out
The instability and complexity of this system slowed our ability to maintain our infrastructure. One small bug could quickly propagate to many machines. As a result, introducing a change in our Chef cookbooks took a lot of careful, calculated analysis.
This severely slowed down our team. We knew a massive overhaul was necessary. To fulfill PCI compliance requirements, we had to rebuild the system in Amazon’s VPC. We saw this as the perfect opportunity to re-think the way our infrastructure should work.
Our goal was to minimize the mutation of an instance. Our infrastructure would be less complex if instances in production never changed. We saw numerous benefits in moving to an immutable model:
- No more periodic convergence.
- Artifacts tested in development are shipped to production.
- Less complexity in deployment pipeline.
- Security and anomaly detection benefit from immutability.
Given these benefits, Simple undertook an enormous effort to convert our old infrastructure to a new, immutable one. We use Chef “wrapper-style” cookbooks in tandem with chef-solo to build immutable AMIs on AWS. Updates to the infrastructure are now a matter of changing code in a system we’ve developed called Cloudbank.
Infrastructure as Code
Cloudbank is a small Python program that manages changes to the infrastructure. It has two primary goals: to model our infrastructure as code, and to converge the infrastructure from an old state to a new state.
At its core, Cloudbank is a small Python library that codifies common patterns in our infrastructure; it is the higher-level code that manages our instances in AWS. Here is an example of how our Sensu monitoring stack is expressed in Cloudbank:
sensu = r.SensuServer("sensu-server", env, desired_capacity=4)
sensu_rabbitmq = r.SensuRabbitMQ("sensu-rabbitmq", env)

# sensu needs to talk to sensu-redis
sensu_redis = r.SensuRedis("sensu-redis", env)
sensu_redis.authorize(sensu, port='6379')

topology.has(sensu)
topology.has(sensu_redis)
topology.has(sensu_rabbitmq)
Instance-level configuration is managed by Cloudbank as well. All of our base AMIs have a built-in configuration process we call spinup, which is triggered when the instance boots. Cloudbank provides the environment for spinup by setting the instance’s (immutable) AWS user data. We use this extensively, especially in customizing services that run on our common platform AMI. We pass in information such as the service name, service version, network environment information, and credentials to the platform’s spinup process. This simplifies instance configuration: the final configuration state of an instance is simply the combination of the AMI’s base configuration and the configuration applied by spinup.
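As a rough sketch of the idea (the function and key names here are illustrative, not Cloudbank’s actual schema), the user data handed to spinup might be assembled like this:

```python
import json

def build_user_data(service, version, environment, credentials_path):
    """Assemble the immutable user-data blob that an instance's spinup
    process reads at boot. All keys are illustrative, not the real
    Cloudbank schema."""
    return json.dumps({
        "service_name": service,
        "service_version": version,
        "environment": environment,
        "credentials_path": credentials_path,
    })

# Hypothetical service and path, for illustration only.
user_data = build_user_data("transactions", "1.4.2", "production",
                            "/etc/secrets/transactions")
```

Because the user data is fixed at launch, the instance’s configuration is fully determined the moment it boots; there is nothing left to converge later.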
Since all of this is just Python code, changes to our infrastructure are run through our standard pull request process just like any other code change.
The ability to manage the infrastructure as code is immensely beneficial to us. The obvious benefit is that we can use the same tools (git) to manage and review history. A less obvious benefit is the ability to redeploy our infrastructure from any point stored in git. In the future, we plan to work on an automated system that can run integration tests across a temporary test infrastructure given only a git commit in Cloudbank.
Getting from State A to State B
Much like local Chef clients, Cloudbank converges the infrastructure from its current, live state to the desired state represented in Cloudbank’s code. In most cases, the process is simple: new instances, built on new AMIs, are deployed, and old instances are removed. No longer are long-running instances building up leftover gunk from many converges. Most of Cloudbank’s convergence work relies on AWS CloudFormation. Cloudbank generates a JSON template of our infrastructure to send to AWS CloudFormation, which does the actual deployment and removal of instances.
I like to think of Cloudbank like a compiler: Cloudbank analyzes our infrastructure-as-code AST, then compiles those objects down into AWS primitives, such as Security Groups, Instances, and AutoScaling Groups. Cloudbank’s unique structure even allows us to introduce optimization passes into our infrastructure. For example, Cloudbank can ensure that redundant Security Group rules are compiled into one simpler rule. After processing the AST, Cloudbank then sends the resulting structure to AWS CloudFormation, which converges the infrastructure as a whole.
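To make the compiler analogy concrete, here is a toy version of such a pass (the data model is invented for this sketch, not Cloudbank’s real one): it collapses redundant security group rules before emitting a CloudFormation-style resource:

```python
def compile_security_group(name, rules):
    """Compile a list of (protocol, port, cidr) rules into a
    CloudFormation-style SecurityGroup resource, merging exact
    duplicates -- a toy optimization pass."""
    unique = sorted(set(rules))  # redundant rules compile down to one
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": name,
            "SecurityGroupIngress": [
                {"IpProtocol": proto, "FromPort": port,
                 "ToPort": port, "CidrIp": cidr}
                for proto, port, cidr in unique
            ],
        },
    }

resource = compile_security_group("sensu-redis", [
    ("tcp", 6379, "10.0.0.0/16"),
    ("tcp", 6379, "10.0.0.0/16"),  # duplicate rule, merged away
])
```

The real passes operate on richer objects, but the shape is the same: analyze the AST, simplify, then emit AWS primitives.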
Relying on CloudFormation
What’s become critical in our usage of CloudFormation is how we’ve organized our CloudFormation Stacks for deployment. AWS resources, such as Security Groups, Instances, and so on, are organized in units of deployment called CloudFormation Stacks. When we were first working on this, we piled all of our systems into one CloudFormation Stack, grouped by environment. That is, we had one CloudFormation Stack which had all of production, and one Stack which had all of development. We hoped to have the production and development stacks closely follow each other. This was great for rapid development and iteration, and also helped with connecting Security Groups to one another. However, once we started serving customers on this system, updates became a source of major trouble. Deploying a set of new changes all at once to many different pieces of the infrastructure caused chaos. We quickly moved from environment-based Stacks into service-based Stacks. Every service that can be isolated now has its own AWS CloudFormation Stack. This now allows for more manageable updates to the infrastructure.
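In pseudocode terms (the names and data shapes here are ours, not Cloudbank’s), the shift was from one stack per environment to one stack per service:

```python
def stacks_for(topology, env):
    """Group resources into one CloudFormation stack per service,
    rather than one giant stack per environment. `topology` maps a
    service name to its resources -- an illustrative model only."""
    return {
        f"{env}-{service}": resources
        for service, resources in topology.items()
    }

# Hypothetical topology: each service carries its own resources.
topology = {
    "sensu-server": ["autoscaling-group", "security-group"],
    "sensu-redis": ["instance", "security-group"],
}
stacks = stacks_for(topology, "production")
```

Each service now updates independently, so a bad deploy to one stack can no longer take the rest of the environment down with it.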
While we rely heavily on AWS CloudFormation, we tend not to use advanced features. We limit our use to setting up Autoscaling Groups, Instances, and configuring Security Groups. We had lots of trouble with CloudFormation Stacks unexpectedly falling into unrecoverable states. In one example, a misconfigured Elastic Load Balancer caused a Stack to become corrupted. We had to delete our entire development Stack and rebuild it again. While this is fine during development of a non-production system, it was an enormous risk when we went live in production. Rebuilding an entire environment can take a long time, and we didn’t want to risk having to delete our entire production infrastructure.
We also avoid CloudFormation sub-stacks. A sub-stack is required to expand a CloudFormation Stack beyond its limit of 200 resources. This was a requirement for us when we were managing entire environments per stack, but it became dangerous. Cloudbank had an optimization pass that would attempt to intelligently allocate resources between the sub-stacks. Despite the presumed cleverness of the solution, moving a resource from one sub-stack to another became a delicate and complicated operation. As a result, if we weren’t careful, we’d see pieces of our infrastructure unexpectedly be deleted and re-created, causing downtime.
Finally, Cloudbank doesn’t use AWS CloudFormation for every infrastructure update. CloudFormation is unusable for any system that must retain state across updates, the most obvious example being a database. In the interim, we wrote our own code to deploy new instances without losing data. We don’t break our immutability goals to do this, however: we simply start a new database instance and transfer the database’s data volume from the old instance to the new one. Triggering this method is transparent to the user; Cloudbank’s interface for updating the infrastructure stays the same. Having said that, managing this transition ourselves is more brittle than we’d like. We’re experimenting with ways to manage our databases within CloudFormation, and we’re interested to see whether CloudFormation’s custom resources can help with this task. Other ideas include using the spinup process as a way to obtain the database’s dataset.
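The volume hand-off can be sketched as an ordered plan of steps (the function and step names are ours for illustration; the real implementation drives the AWS APIs):

```python
def plan_database_replacement(old_instance, new_instance, data_volume):
    """Return the ordered steps for replacing a database instance while
    keeping its data volume -- a sketch of the idea, not real AWS calls."""
    return [
        ("launch", new_instance),               # fresh immutable instance from a new AMI
        ("stop_database", old_instance),        # quiesce writes before detaching
        ("detach_volume", data_volume, old_instance),
        ("attach_volume", data_volume, new_instance),
        ("start_database", new_instance),
        ("terminate", old_instance),            # the old instance is replaced, never mutated
    ]

steps = plan_database_replacement("db-old", "db-new", "vol-data")
```

The instance is still disposable; only the data volume outlives it, which is what keeps this consistent with the immutable model.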
We’d do it again if we could
The engineering teams at Simple are never afraid of exploring new territory. Simple is a unique business. Although we’re in a heavily regulated industry, we need to move fast. What used to take days to set up and laboriously maintain, now takes minutes to spin up and destroy. Engineers can now set up, tear down, and manage corners of the infrastructure as they please, without having to choreograph tangly changes in Chef with others. This has led to a big boost to productivity, as now our infrastructure engineers effectively have full autonomy. By removing any chance of instances changing in production, we’ve also removed an entire class of possible downtime, letting us sleep a whole lot better. We’re excited to work within our immutable infrastructure. Just like any other system based on immutable values, it forces us to flip the way we think. Most importantly, though, it makes our large-scale system a whole lot easier to reason about.
Interested in joining our engineering team? We’re hiring.