If you're online and notice that your favorite website or app isn't working, there's a good chance you might head over to a site like Downdetector to see if outages are being reported. Of course, this morning, if you tried that, you found that site to be, well, down.
That's because, for the second time in just over a week, Cloudflare suffered a major outage that resulted in thousands of sites going offline for a short time. Other sites and services that were affected include Shopify, customer help desk service Zendesk, chat service Discord, Cloudbase, Dropbox, and Nest.
That's a pretty big deal if you happen to use any of those services to run your business; never mind that thousands of websites rely on Cloudflare to provide network services.
The company posted the following on its blog:
For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.
This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.
The latter part is a reference to unconfirmed reports that the outage was the result of a distributed denial-of-service (DDoS) attack from China.
I spoke with Cloudflare's CEO, Matthew Prince, who confirmed that the outage wasn't the result of an attack but rather a "mistake on our part."
Which, let's agree, isn't necessarily what you want to hear from the head of the company you depend on to keep your business connected online, but it is absolutely the right thing to say. Give credit to Cloudflare for owning this one.
Prince went on to say that while his team is working to diagnose the exact cause of the problem, it was related to a "bug in the firewall application that caused it to spin up and consume all of the CPU capacity across all of our systems."
Which brings up an important point. Much of the technology we depend on is built on an increasingly complex and fragile network of servers, computers, and service providers. At any time, the internet depends on thousands of interconnected devices and services to work, and when a critical one of them fails, large parts of it can freeze up.
Sure, today it was just a bunch of websites we use every day to get work done. But it could certainly be worse. If the internet as we know it can be taken down by deploying some poorly written code that causes everything to overload and shut down, what happens when someone writes code intended to do exactly that?
Prince says he's confident Cloudflare will be able to track down exactly what happened and put processes in place to keep it from happening again.
"We're pretty good at learning from our mistakes," Prince told me. "And we believe in radical transparency."
That means that in addition to the blog post, a full accounting of what happened, and what Cloudflare is doing about it, will be posted in the near future, says Prince.
If you're a service provider, sure, mistakes happen. But there's a responsibility to make sure you are protecting your networks not only from the worst-case scenario, but also from the simple human errors that have a huge impact. The bottom line is, this kind of thing just can't happen. Not when people are depending on you. Not when other people's businesses are at stake.
But when it does, remember how Cloudflare and its CEO handled this outage, because it's exactly what you should do. Own it. Apologize. Tell everyone what you're going to do differently to make sure it never happens again.
If you're a consumer or a business owner, consider whether you have a plan for what to do when the worst-case scenario does happen. Let's be honest: We're on borrowed time before the worst-case scenario arrives. And when it comes, chances are it's not going to be a 20- or 30-minute outage that takes down a few popular sites and services.
It will be a downing of network communications systems. Business and customer databases. Credit card processing services. The electrical grid.
I'm not an alarmist or a conspiracy theorist, but you don't have to be either to recognize that it is ultimately your responsibility to have a plan. If all it takes for half the internet to go dark for 20 minutes is some poorly deployed software, imagine what happens the next time, when it's intentional.