June 06, 2016
by Tom Wanielista

Scaling Sensu By Doing Less

As we began to rapidly expand our infrastructure, we needed to rethink the way we used Sensu, our monitoring system. By removing components, we managed to find a way to scale effortlessly.

When we first started using Sensu for our monitoring system at Simple, we used the recommended configuration. This included running a RabbitMQ cluster for passing messages between the Sensu server and monitored hosts, Redis for storing intermediate state, and so on. This worked well enough for our development and production environments. But as we began to build tools to quickly start up and shut down new environments, things started to get messy.

We needed to clone this configuration for every environment we created. Managing multiple RabbitMQ clusters and multiple Redis clusters, each made up of multiple instances, across several isolated environments became unwieldy. To make matters worse, we expected to set up more of these environments in the future. We needed a lighter-weight Sensu configuration that was easy to set up and had very few moving components to manage per environment. In addition, we hoped our solution could monitor all of these environments from one place, as tracking alerts coming from different Sensu clusters had become awkward and confusing.

We ended up building a Sensu transport that uses Amazon SNS and Amazon SQS as the transport mechanism. We’ve been running the system in production for about a year, and we’re really happy with the final result.

Every one of our environments now has a dedicated Amazon SNS topic that our Sensu clients publish to. All of these topics push their messages onto a single SQS queue that one global Sensu cluster consumes from. This Sensu cluster runs in an Amazon Auto Scaling group for easy scale-up when needed. A diagram may help express this more clearly:

[Diagram: monitoring architecture, mid-2015]
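
To make the fan-in a little more concrete, here is a minimal Python sketch of the access policy an SQS queue typically needs before several SNS topics can deliver to it. This is an illustration, not our actual setup: the function name `build_fan_in_policy`, the account ID, and every ARN below are hypothetical placeholders.

```python
import json


def build_fan_in_policy(queue_arn, topic_arns):
    """Build an SQS access policy that lets the given SNS topics send to one queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEnvironmentTopics",
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only messages originating from these topics are accepted.
                "Condition": {"ArnEquals": {"aws:SourceArn": sorted(topic_arns)}},
            }
        ],
    }


# Hypothetical ARNs: one topic per environment, one global queue.
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:sensu-global"
TOPIC_ARNS = [
    "arn:aws:sns:us-east-1:123456789012:sensu-production",
    "arn:aws:sns:us-east-1:123456789012:sensu-development",
    "arn:aws:sns:us-east-1:123456789012:sensu-staging",
]

policy_json = json.dumps(build_fan_in_policy(QUEUE_ARN, TOPIC_ARNS), indent=2)
print(policy_json)
```

With a policy like this attached to the queue and each topic subscribed to it, adding an environment would come down to creating one more topic and listing its ARN here.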

As a result, we’ve exceeded our requirements: setup requires only creating an SNS topic, and there aren’t any extra instances to manage per environment. Some caveats:

  • As the flow of messages is unidirectional (messages flow from hosts to the Sensu cluster only), we can only use ‘standalone’ checks. This has been fine for us as we never used the subscription-based checks feature of Sensu.

  • We were bitten by a memory leak when using Ruby 2.2.1 with the aws-sdk gem. Upgrading to the latest Ruby version fixed this issue.
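
For readers unfamiliar with the first caveat, a standalone check is defined and scheduled on the client itself, so nothing has to travel from the server to the host. The Python sketch below renders what such a check definition might look like as Sensu client JSON. The check name `disk_usage`, the command line, and the helper `standalone_check` are illustrative assumptions, not taken from our configuration.

```python
import json


def standalone_check(command, interval_seconds):
    """Return a check definition that the client schedules on its own."""
    return {
        "command": command,            # what the client runs locally
        "standalone": True,            # scheduled by the client, not the server
        "interval": interval_seconds,  # seconds between executions
    }


# Hypothetical check: the name and command are placeholders.
config = {
    "checks": {
        "disk_usage": standalone_check("check-disk-usage.rb -w 80 -c 90", 60),
    }
}

print(json.dumps(config, indent=2))
```

A subscription-based check would instead list `subscribers` and rely on the server publishing check requests to clients, which is the direction of traffic a one-way transport cannot carry.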

We’ve open-sourced our Amazon SNS + SQS transport in case anyone else wants to try this out. I highly recommend it if you want to get Sensu monitoring up and running without also having to set up RabbitMQ. It’s made our monitoring infrastructure practically invisible, reliable, and performant. Our single Sensu cluster effortlessly processes around 24 million messages per day (roughly 280 per second on average), with lots of room to grow.
