On Tuesday, we were really excited to launch Goals. We’ve been talking about Goals for a long time, and we’ve been dreaming about it for even longer. But, as you may have noticed if you tried to share our excitement, we had some serious technical issues surrounding our announcement. We want to be honest with our customers about what happened.
Since Simple sits at the intersection of a technology company and a bank, we also want to introduce you to something that technology companies do after an outage in order to be transparent with their customers: the postmortem. A postmortem is a report, written after an outage, that objectively discusses what failed in order to diagnose the problems, document their effects, and suggest short- and long-term solutions. Customers who are familiar with the tech world will be used to seeing these. But customers who are familiar with banks are used to obfuscation, or worse, silence. This is how we bridge that gap.
You can read our postmortem of Tuesday’s events on status.simple.com. Any time a major outage occurs, we will publish a postmortem within a few days. If the outage significantly affected customers, we will also summarize our findings here on the company blog.
So, what went wrong on Tuesday?
I’ll describe a common situation that many of our customers ran into on Tuesday. First, some background for the uninitiated: we generally avoid providing customer support via email, because it’s not secure enough to transmit account-related information. Simple includes a messaging system inside the web app (and soon the mobile app) so that customers can have more secure text-based conversations with our Customer Relations staff. When a customer receives one of these messages, she gets a generic email directing her to log in and view the message. This was our first major feature announcement since adding thousands of customers, and it made sense to us to use our messaging feature to let customers know about it.
But it didn’t make sense to our customers. Because they knew this messaging channel was used for customer support, many customers understandably believed there was an issue with their accounts. When so many customers attempted to sign in at once, while messages were still being sent to other customers, one of our databases reached its connection limit, and users were unable to sign in. At that point we compounded the problem with another error: instead of saying “sorry, login is down right now, try again later,” our login error message said, in every case, “your username and passphrase don’t match.” This led customers to try resetting their passphrases, putting further strain on the same databases that were already causing the login errors. In turn, passphrase reset emails were either extremely delayed or never sent. It was even worse for users who received the email on their mobile phones and had no access to a computer, since messaging is not yet available in our mobile app. When customers called to find out what was happening, they were told that all of this happened because we had tried to send them a message announcing a new feature.
Definitely not a good experience. So what did we learn?
First, we now realize what should have been much more apparent from the beginning: a channel meant for customer support is not an appropriate place to announce a new feature. Our enthusiasm for our product clouded our vision on this one, but many customers pointed out that this is exactly what their old banks do; that’s why so many customers ignore “you have a new message” emails from their banks. Customers trusted us not to communicate like that, so we won’t. As much as we love our product’s features, and as much as we want our customers using them, marketing is marketing, and banking is banking. From now on, we will use email to announce new features and in-app messaging to communicate vital information.
Second, though we knew it was not ideal, we failed to consider just how serious the lack of messaging in our mobile app would be in a situation like this one. The next version of the iPhone app will allow customers to send and receive messages directly from the app. This will add a unique and powerful communications channel that we think customers will enjoy using.
Third, we learned that we can still be humbled by mistakes that seem foreseeable in hindsight. Simple is a growing company, and though we have built robust and dependable systems, we can still fail to predict how our actions will affect our infrastructure. We are already working to improve our ability to prevent problems like the one we experienced on Tuesday. From the technical postmortem:
We’re taking several steps to prevent similar incidents in the future. First, we’ve already deployed performance fixes and additional capacity for our user service. We’re also adjusting our monitoring thresholds so that we can add capacity before customers are impacted. Finally, we’re examining all of our services to make sure that they can serve our new customers as Simple continues to grow.
To our customers, we are sorry for any stress we caused you. Simple is a marriage of technology and finance, and in both sectors, transparency is at the heart of doing right by customers. So when we have problems, you’ll know what happened. We think that’s a good thing.