The past week has been a rough one for Roll20, with a total of around 1.5 hours of downtime over the last 7 days. A variety of factors have been at play, and I wanted to take a few moments to hash out what happened and what we’re doing to keep it from happening again.
On November 10th and 11th, we experienced a total of approximately 30 minutes of downtime spread across a variety of our real-time shards. This is the part of the Roll20 service that handles your “in-game” play, and it’s where most of the magic happens. In addition to the downtime, you may have experienced slow response times during gameplay.
The cause of this degraded service was an issue with Firebase, our real-time hosting provider. While both Firebase and Roll20 use industry best practices to try to catch issues before they happen, in this case there was a small “blind spot” in Firebase’s monitoring, so the slow service went unnoticed. Once we were able to alert them to the issue, they quickly remedied the underlying problem, and they have since taken steps to add monitoring so they’ll catch this kind of issue before it affects the service in the future.
On November 16th, we experienced a total of around an hour of downtime (about 30 minutes in the morning around 6 AM CST, and around 30 minutes in the evening around 5:30 PM CST) on our Main site. This is the part of the Roll20 service that handles logging you in, posting on the forums, and inviting others to join your game, among other things.
The morning downtime was caused by a hardware failure of our database server. Computers do fail on occasion, and it was just bad luck that it happened to this server that morning. The only thing we can do here is be prepared for this inevitability, and as such our hosting provider (Linode) was able to migrate us to new hardware, and the service came back up. From the time the hardware failed to the time we were back up and running on new hardware was only about half an hour. While we hope it doesn’t happen again soon, we’re pleased with the way this was handled.
The evening downtime on November 16th was caused by a wide-scale Distributed Denial of Service (DDoS) attack on Linode’s Dallas datacenter, which caused severe network congestion and made thousands of websites (not just Roll20) unavailable for a short period of time. Again, while these attacks are unfortunate, they have become an expected price of doing business on the Internet, and we were pleased with the way that Linode was able to fully mitigate the attack in short order.
It was just really, really bad luck that both of these issues occurred on the same day, as they were completely unrelated, and both are issues that should be very rare occurrences.
We work very hard to keep Roll20 operating at peak performance, taking proactive steps both to prevent downtime from happening and to prepare for it so that if it does happen, we can be back up quickly. We know that even 30 minutes of downtime at the wrong time can mean the difference between having a great time and missing your only opportunity to play for weeks (or even longer!). We sincerely apologize to those of you who had interrupted games as a result of these issues, and we’re writing this post to assure you that we are doing everything we can to minimize downtime in the future. As always, you can view our history of uptime on our status page, at http://status.roll20.net.
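For a rough sense of scale, the approximate figures in this post can be turned into an uptime percentage for the week. The short Python sketch below is only back-of-the-envelope arithmetic: the minute values are the rounded numbers quoted above, the dictionary labels and the `uptime_percent` helper are made up for illustration, and this is not how the status page computes its numbers.

```python
# Back-of-the-envelope uptime estimate from the rounded figures in this post.
# The labels and the helper name are illustrative assumptions, not Roll20 tooling.
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a 7-day window

downtime_minutes = {
    "real-time shards (Nov 10-11)": 30,
    "main site, morning (Nov 16)": 30,
    "main site, evening (Nov 16)": 30,
}


def uptime_percent(down_minutes, window_minutes=MINUTES_PER_WEEK):
    """Return the share of the window that was NOT downtime, as a percentage."""
    return 100.0 * (1 - down_minutes / window_minutes)


total_down = sum(downtime_minutes.values())
print(f"total downtime: {total_down} min")
print(f"combined uptime: {uptime_percent(total_down):.2f}%")
```

Adding the three outages together treats the service as a single unit. Since the real-time shards and the Main site went down at different times, the uptime of each part taken separately would be somewhat higher than the combined figure.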
Thanks for your support of Roll20, and hopefully I won’t need to write another blog post about downtime for a good long time.