Roughly 3 months ago, after some unexpected downtime, our community asked us for more insight into the state of our API. Given our people-pleasing nature, we quickly pushed out an MVP using Stashboard, but soon discovered that it didn’t give us the granularity or accuracy we needed.
We sat down and brainstormed what was needed to accurately reflect the state of the system and satisfy customers:
- We provide 3 distinct services that our customers interact with, so we should show the uptime of each of these services separately.
- Balanced is distributed, so some parts of our system can be down while other parts are operating normally. There is no absolute up/down state, and the status page must reflect that.
- Everyone in the company needs to be able to communicate issues, but if something goes wrong in the middle of the night, the status page should announce it automatically.
- The status page should also show the history of incidents in addition to the uptime percentage.
- Customers want notifications pushed to them, so there needs to be some way for them to subscribe.
- It’s almost 2013, we want real-time stats!
- At the time, Balanced had 8 employees; 5 of them were engineers and -1 of them had time to build the system.
Quick, someone copy and paste a solution
Everyone knows the quickest way to solve a problem is to piggyback on someone else’s hard work, so we started exploring the existing commercial and open source products in the wild:
Stashboard was the go-to. It looks reasonable out of the box, it’s widely supported, and it takes about 10 seconds to set up. We had Stashboard up and running and added our 3 services. 100% uptime ensued. The problem was that we didn’t have 100% uptime. Nobody (us included) believed the status page, which made it useless. Clearly, health checks are insufficient in this case.
Pingdom provides one of the best commercial offerings that we’ve seen: great UI, runs on all devices, and pushes notifications to you. So we spun up a trial account in parallel with Stashboard and watched how it went. We got 99.999% uptime, but again, this wasn’t accurate. Another shortcoming was that we wanted health checks that included PUT requests. We could have written a script, mounted it at a URL, and mapped all requests to a GET to get it to run. That’s reasonable, but it still doesn’t accurately reflect all requests flowing through Balanced.
After looking at a couple of other offerings, it became clear that an external system wouldn’t have the level of access to our internal systems that we needed. There was nothing else for it: someone was going to have to write some code, and that someone was me.
One discovery we did make while evaluating external services was that a simple health check is not sufficient. These services provided uptime monitoring; what we wanted was availability. Essentially, if Balanced serves 100 requests and one request fails, then the uptime we want to show is 99%. Time-based calculations break down if you’re up when there are no requests but down during a spike.
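To make the difference concrete, here is a minimal sketch that compares the two calculations over the same traffic. The `Minute` record and both function names are made up for illustration; this is not code from the status page itself.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    """Request counts observed during one minute (hypothetical shape)."""
    total: int
    failed: int


def time_based_uptime(minutes):
    """Percentage of minutes in which nothing failed.

    This is roughly what a periodic external health check reports.
    """
    up = sum(1 for m in minutes if m.failed == 0)
    return 100.0 * up / len(minutes)


def request_based_availability(minutes):
    """Percentage of all requests that succeeded, whenever they arrived."""
    total = sum(m.total for m in minutes)
    failed = sum(m.failed for m in minutes)
    if total == 0:
        return 100.0  # no traffic, so nothing failed
    return 100.0 * (total - failed) / total


# Nine quiet minutes with no traffic, then a spike where half the requests fail.
minutes = [Minute(total=0, failed=0)] * 9 + [Minute(total=1000, failed=500)]
print(time_based_uptime(minutes))           # 90.0
print(request_based_availability(minutes))  # 50.0
```

The time-based number says 90% while half of the requests customers actually sent failed, which is exactly the spike problem described above.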
No solution in sight, a brave code warrior enters the arena
Quickly, pausing only for coffee, hummus, and a tasty bread-covered morsel, with only a MacBook Pro and a standing desk for company, I bravely created an empty git repo…
Like all good engineers, I took the chance to try something new. We run on AWS and AWS has infrastructure tools out the arse, so let’s leverage that. A quick Google search revealed that AWS load balancers (ELBs) have a CloudWatch API that returns HTTP status codes grouped by the most significant digit. Great. All I needed to do was sum the number of 2xx, 3xx, 4xx, and 5xx requests, figure out the percentage of these that were in the 5xx class, put a fork in it, and we’re done. ELBs are like gifts from the gods: for 99% of the things you need, they do them better than you ever could. Hit that 1% issue and you’re at the mercy of the AWS infra team. I begged and pleaded on the AWS forums, the days ticked past, and the CEO stood at my shoulder demanding stats for the hordes of customers banging at our digital walls. Not being one who gives up, I gave up and moved on.
Good developers build, great designers steal
In the meantime, our designer, Damon, began scouring the web for inspiration. After checking out some amazing designs, we found Heroku’s status page and whipped up a mock based on that. We quickly built out a static HTML version and then sat down to figure out how to get information into the app.
Internally, we’ve both leveraged and built a slew of tools. All I needed to do was pick and choose from the right tools and write some glue to string everything together. Since we couldn’t get the HTTP status codes from the ELB, we dropped down a level and decided to parse the NGINX logs. These almost always correspond with the actual status code the user got, so we consider them our authoritative source for deciding if a request succeeded. We already log these to a centralized server using RSYSLOG, so I already had a data source to draw from. Next, I went and brewed a fresh pot of coffee and bestowed it upon bninja for his prescient work in building our log parser, Slurp. We wrote a quick Slurp script that read the HTTP status code from each request and then fed them into Graphite buckets. Each bucket was based on service name (such as JS) and then response code family (such as 5xx, plus a special case, timeout, for slow requests).
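The bucketing idea can be sketched in plain Python. This is not our Slurp script and it does not use Slurp’s API. It assumes a log format like NGINX’s `combined` with `$request_time` appended to each line, a made-up `status.<service>.<family>` metric layout, and a made-up 10 second threshold for the timeout bucket.

```python
import re
from collections import Counter

# Picks out the status code and a trailing request time in seconds.
# Assumption: NGINX "combined" format with $request_time appended.
LINE_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) .* (?P<seconds>\d+\.\d+)$')

SLOW_SECONDS = 10.0  # hypothetical threshold for the "timeout" bucket


def bucket_for(line):
    """Return the response family for one log line, or None if unparseable."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    if float(match.group("seconds")) >= SLOW_SECONDS:
        return "timeout"
    return match.group("status")[0] + "xx"


def graphite_lines(service, log_lines, timestamp):
    """Count lines per bucket and render them in Graphite's plaintext format."""
    counts = Counter(b for b in map(bucket_for, log_lines) if b is not None)
    return [
        "status.%s.%s %d %d" % (service, family, count, timestamp)
        for family, count in sorted(counts.items())
    ]


sample = [
    '1.2.3.4 - - [10/Dec/2012:10:00:00 +0000] "GET /v1/cards HTTP/1.1" 200 512 "-" "curl/7.0" 0.120',
    '1.2.3.4 - - [10/Dec/2012:10:00:01 +0000] "POST /v1/debits HTTP/1.1" 500 128 "-" "curl/7.0" 0.300',
    '1.2.3.4 - - [10/Dec/2012:10:00:02 +0000] "PUT /v1/cards/1 HTTP/1.1" 200 64 "-" "curl/7.0" 12.500',
]
for line in graphite_lines("api", sample, 1355133600):
    print(line)
```

Each output line has the form `metric.path value timestamp`, which is what Carbon’s plaintext listener (port 2003 by default) accepts.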
Using Twitter as infrastructure
So we had a basic uptime for each service based on all requests that go through our system. Happily I sat back, stretched and went outside to get some sunshine, but then from out of nowhere a wild CEO appeared. Clearly, this was an unfinished solution. Yes. We have near real-time availability stats, but how does someone update the status manually? Great question. Leaving the warm California sunshine behind, we re-entered the office.
We chose Twitter because it satisfied the remaining requirements we had. Subscribe to notifications? Just follow @balancedstatus. Everyone needs to be able to update the status page? Just log in to the Twitter account and use the grammar we developed:
API-ISSUE: The API is currently giving out free money :-0
where API is the service, ISSUE is the state, and the remainder of the tweet is the message displayed.
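A parser for that grammar can be very small. The following is a sketch, not the code running on the status page: the regular expression and the `parse_status_tweet` name are assumptions, and it accepts any upper-case word as a service or state rather than checking against a known list.

```python
import re

# SERVICE-STATE: message, for example "API-ISSUE: The API is ..."
TWEET_RE = re.compile(
    r"^(?P<service>[A-Z]+)-(?P<state>[A-Z]+):\s*(?P<message>.+)$",
    re.DOTALL,
)


def parse_status_tweet(text):
    """Split a status tweet into service, state, and message.

    Returns None for tweets that do not follow the grammar, so ordinary
    tweets on the account are simply ignored.
    """
    match = TWEET_RE.match(text.strip())
    if match is None:
        return None
    return {
        "service": match.group("service"),
        "state": match.group("state"),
        "message": match.group("message").strip(),
    }


print(parse_status_tweet("API-ISSUE: The API is currently giving out free money :-0"))
```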
Twitter would feed updates into the status page instead of the other way around.
STFU and let me play with it
You too can get the status page up and running. If you want to explore the code we’ve released, just follow these simple steps. If you get stuck, join one of our support channels and get some help.
Before you begin, make sure you’ve got the Google App Engine SDK installed.
Get the code
git clone https://github.com/balanced/status.balancedpayments.com.git
cd status.balancedpayments.com
Run the local development server
Visit http://localhost:8000/ to view the page
To pull data from our test Twitter account, run this curl command
curl http://localhost:8000/twitter -d update=1 -u username:password
If you want to deploy this to your own GAE account, edit app.yaml, change the name of the app (situation-demo) to your own app name, and then run
appcfg.py update situation/
If you want to use this for your own service, you’ll need to point it at your own Graphite server and add your own Twitter app credentials into settings.py.
Finally, you’ll need to set up a cron job or some other sort of scheduled task that will POST to the /uptime URLs to pull data into the system.
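If you would rather script the scheduled task than use curl, a minimal Python 3 sketch could look like this. The base URL is the local development server from the steps above; the function names are made up, and whether the endpoint expects a request body or credentials is an assumption you should check against the code.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # local dev server; swap in your own deployment


def build_uptime_request(base_url=BASE_URL, path="/uptime"):
    """Build (but do not send) an empty-bodied POST to an uptime URL."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=b"", method="POST")


def refresh_uptime(base_url=BASE_URL, path="/uptime", timeout=30):
    """Send the POST and return the HTTP status code of the response."""
    request = build_uptime_request(base_url, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `refresh_uptime()` from whatever scheduler you already have (cron, a GAE cron entry, and so on) every few minutes keeps the numbers on the page fresh.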