Early Monday morning, almost all of Google's services were inaccessible in many areas of the world. While it didn't take long for the issue to be resolved, the 45 or so minutes that Gmail, YouTube, Google Drive, Docs, Classroom, and News were unavailable highlighted just how much of what we do depends on the search giant's services.
If you slept through it, you missed quite the panic, at least as expressed on social media, as people realized that they were completely at the mercy of whoever at Google was responsible for fixing whatever had suddenly broken. Even though it was still early, it didn't take long for people to imagine what a day might look like without access to Google.
Here in Michigan, back when our children rode a bus to school every morning (meaning last year), it wasn't all that uncommon this time of year for ice or snow to gift us all with an unexpected day off. Now that they're attending school virtually, a Google outage would be the digital version. I suspect that's also true for many people working remotely right now.
While it was a reminder that there are still forces beyond our control that have a lot to say about what we get done every day, there was an even more valuable lesson. It appears that the entire outage may have been due to one of the simplest of errors. Here's how a Google spokesperson responded when I asked the company for information about the outage:
"Today, at 3:47 a.m. PT Google experienced an authentication system outage for approximately 45 minutes due to an internal storage quota issue. Services requiring users to log in experienced high error rates during this period. The authentication system issue was resolved at 4:32 a.m. PT. All services are now restored. We apologize to everyone affected, and we will conduct a thorough follow-up review to ensure this problem cannot recur in the future."
I freely admit I'm not a network administrator, but "an internal storage quota issue" sounds an awful lot like the hard drive on one of the servers filled up and no one was paying enough attention to notice. I followed up to clarify whether this was due to human error or a system failure, but Google declined to comment further.
There are really two important lessons here:
First, Google's authentication system is used to sign in to all of the company's services, as well as many third-party services that use Google for Single Sign-On. That means it's pretty important, and not just to Google.
Google's authentication system affects the ability of millions of people to get work done, and of millions of students to log on for school. When something is that important, it's probably worth having a system in place that can't be foiled by running out of space on a hard drive, whether through human error or system failure.
That kind of failure can't happen. Sure, sometimes it does, but your job is to make sure it doesn't. When you ask people to trust you to be their primary service for most of the things they do for work, it has to work.
Besides, Google literally runs the third-largest cloud computing platform and has data centers all over the world. It's almost inconceivable that its services went out worldwide because of something that could have been avoided by sending someone on a 10-minute trip to Best Buy to pick up some extra storage.
Which highlights my second point: often it's the simplest things we overlook that cause very real problems. Most major technology problems result from things like using weak passwords, sharing accounts, forgetting to back up your data, or forgetting to set a reminder to renew your website domain.
Those are very real problems that can have big consequences for both your company and your customers. The good news is that most of us aren't running an enterprise quite as large as Google, so when we forget to do the obvious thing, it doesn't affect quite as many people. Even better news: they're all pretty easy to avoid.
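To show how small that kind of safeguard can be, here is a minimal sketch in Python of a check that flags a disk filling up before it turns into an outage. To be clear, this is not how Google manages its quotas, and Google hasn't said exactly what went wrong. The function names and the 80 and 95 percent thresholds are my own assumptions, chosen only for illustration.

```python
import shutil


def storage_alert(used_bytes, total_bytes, warn_at=0.80, critical_at=0.95):
    """Classify storage usage as "ok", "warning", or "critical".

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction_used = used_bytes / total_bytes
    if fraction_used >= critical_at:
        return "critical"
    if fraction_used >= warn_at:
        return "warning"
    return "ok"


def check_path(path="."):
    """Report the alert level for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return storage_alert(usage.used, usage.total)


if __name__ == "__main__":
    # Run this on a schedule and send yourself a message on anything but "ok".
    print(check_path("."))
```

The point isn't the dozen lines of code. It's that the check runs on a schedule, so nobody has to remember to look.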