IoT-Panic: The Cloud Just Disappeared!

Mattia Antonini and Massimo Vecchio
January 11, 2021

 

“Error 500: the server returned an internal error. Please retry later”: this is one of the scariest error messages that you can encounter when you use a web service. What happens if this is the error returned to an IoT device that calls its cloud service? Or what if the server just rejects the connection? Can the application continue to deliver a limited service, or will it be completely unusable?

The term Internet of Things (IoT) recalls, by definition, the idea that we can connect our devices to the Internet to exchange information, receive commands, take actions, and so on. Today, almost every commercial IoT application (e.g., Amazon Alexa, Google Home, etc.) exploits connected devices to interact with users in different ways using gestures, voice, vision, and so on. Such applications require at least the two-tier architecture depicted in Fig. 1, composed of IoT devices on the one side, and a cloud endpoint on the other side. Here, by analogy with biology, the cloud endpoint acts as the central nervous system of the entire application, while our devices behave just like the peripheral nervous system and our senses.

Figure 1: A traditional two-tier IoT architecture, where IoT devices fully rely on the cloud to deliver their service.

It is easy to see that this architecture may lead to several problems:
What if a small failure or a bug disrupts an entire cloud platform?
What if the services of a major public cloud provider become inaccessible?
What if a major disaster (e.g., an earthquake) destroys part of the infrastructure? Is the cloud really resilient against major failures?

Unfortunately, we already know the answer to some of these questions, simply because it has already happened: on 14th December 2020, around 12 p.m. UTC, all authentication-based Google services experienced a major outage due to a failure in a quota management system for the User ID Service, one of the services involved in the authentication of requests from users [1], as depicted in Fig. 2.

Figure 2: This is the message that millions of logged-in users received during the recent Google outage (see footnote 1).

This might look like a problem that only affects human users. Indeed, people all around the world were unable to access the Google Workspace apps (Gmail, Drive, Docs, Meet, Calendar, Hangouts, YouTube, etc.) for almost one hour, which resulted in a huge economic impact.

This incident should raise crucial (almost existential) questions:
What is the real impact of Google on our daily life?
Could we continue to work if Google disappeared or failed?

Let’s take a further step: are these questions valid only for services for humans? Again, and sadly, the answer is “no”. The same authentication service is used by Google to authenticate requests coming from the Google Home ecosystem and its devices. It is easy to see that this outage turned, literally, even the most sophisticated smart devices into dumb, yet shiny, pieces of technology. Millions of people could not use their smart devices to heat up their houses, turn their lights on and off, listen to music, remotely control their children, answer the doorbell, or even clean their houses with a smart vacuum cleaner. And this is not just a cyberpunk provocation: Fig. 3 reports a tweet that a Twitter user posted immediately after the outage and that went viral.

Figure 3: A tweet sent by Joe Brown immediately after the Google outage, source: https://twitter.com/joemfbrown/status/1338452107419148290.

Figure 4: A tweet, complete with a snapshot of the server error message (see footnote 2), sent by Alex Dunsdon, source: https://twitter.com/alexdunsdon/status/1338461046785368067.

We, as IoT practitioners, technologists, developers, designers, architects, theorists, and gurus, need a radical change in how we create IoT applications. The “too big to fail” dogma does not hold anymore, and we need to promote applications that are (more) resilient against cloud failures.

Said differently: we need a Plan B, and we need it now! We need applications that do not need a remote endpoint to work. For example, our smart homes should be able to work even if the Internet connection is not available. We can survive if we cannot get the latest weather forecast or listen to our favorite playlist, but we should always be able to heat our houses, even if we use smart thermostats. The same applies to other application domains that critically impact our lives: e-health, transportation, or, more generally, wherever IoT represents an enabling technology.
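To make the Plan B idea concrete, here is a minimal sketch of a smart thermostat that keeps heating when its cloud endpoint fails. Everything in it is an assumption made for illustration: the `Thermostat` class, the `CloudUnavailable` exception, the injected `fetch_cloud_setpoint` callable, and the default values do not come from any real product or API. The device asks the cloud for the desired temperature, but falls back to the last value it received (or to a safe local default) whenever the call fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CloudUnavailable(Exception):
    """Hypothetical error raised when the cloud endpoint fails (e.g., HTTP 500)."""


@dataclass
class Thermostat:
    # Hypothetical callable that asks the cloud for the target temperature.
    fetch_cloud_setpoint: Callable[[], float]
    default_setpoint_c: float = 20.0  # assumed safe local default
    hysteresis_c: float = 0.5         # avoids rapid on/off switching
    last_known_setpoint_c: Optional[float] = None
    heating_on: bool = False

    def target_setpoint(self) -> float:
        """Prefer the cloud value; fall back to cached or default values."""
        try:
            setpoint = self.fetch_cloud_setpoint()
            self.last_known_setpoint_c = setpoint  # cache for future outages
            return setpoint
        except (CloudUnavailable, OSError):
            if self.last_known_setpoint_c is not None:
                return self.last_known_setpoint_c
            return self.default_setpoint_c

    def step(self, room_temp_c: float) -> bool:
        """Run one local control step; the decision never requires the cloud."""
        target = self.target_setpoint()
        if room_temp_c < target - self.hysteresis_c:
            self.heating_on = True
        elif room_temp_c > target + self.hysteresis_c:
            self.heating_on = False
        return self.heating_on


def cloud_down() -> float:
    """Simulates the outage: every request to the cloud fails."""
    raise CloudUnavailable("Error 500: the server returned an internal error")


if __name__ == "__main__":
    thermostat = Thermostat(fetch_cloud_setpoint=cloud_down)
    # The room is cold and the cloud is unreachable: the heating still starts.
    print(thermostat.step(17.0))
```

The design choice in this sketch is that the control loop runs entirely on the device, while the cloud only supplies preferences. An outage therefore degrades the service (no remote changes to the setpoint) instead of disabling it.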

For all these reasons, we should push our application design and engineering processes towards computing paradigms that are resilient by design against this kind of outage. One above all: Edge Computing, which, by the way, is not a novel paradigm, as it has been discussed in some niche research circles for more than a decade now [2]. Once again: the edge is the key [3].

Further Readings

  1. Google Cloud Status Dashboard, “Google Cloud Infrastructure Components Incident #20013”, December 14, 2020. [Online]. Available: https://status.cloud.google.com/incident/zall/20013.
  2. M. Satyanarayanan, P. Bahl, R. Caceres and N. Davies, "The Case for VM-Based Cloudlets in Mobile Computing," in IEEE Pervasive Computing, vol. 8, no. 4, pp. 14-23, Oct.-Dec. 2009.
  3. M. Antonini, M. Vecchio and F. Antonelli, "Fog Computing Architectures: A Reference for Practitioners," in IEEE Internet of Things Magazine, vol. 2, no. 3, pp. 19-25, September 2019.

1 It is important to note that the error code may change based on the source of the request. For instance, Google services were returning the error code 400 (i.e., bad request) during the authentication phase of Nest devices (Fig. 4), even though the problem was triggered by the server itself (which would correspond to error code 500) and not by the client.

2 As anticipated, the error code 400 (i.e., bad request) refers to a malformed authentication request by the client; however, the real source of the problem was the faulty behavior of Google's User ID Service.


 

Mattia Antonini is a Ph.D. Candidate at FBK ICT, Italy. Mattia received the B.Sc. degree (summa cum laude) in computer, electronics and telecommunication engineering, and the M.Sc. degree (summa cum laude) in communication engineering from the University of Parma, Italy, in 2014 and 2017, respectively. He has been a member of international research groups and he has worked on EU-funded projects since his B.Sc. He is serving as a reviewer for a few IEEE journals. His current research topics cover edge intelligence architectures and system design, edge computing, embeddable machine learning, and data analytics.

 

Massimo Vecchio received his M.Sc. degree in Information Engineering from the University of Pisa, Italy, and his Ph.D. degree in Computer Science and Engineering from IMT Institute for Advanced Studies, Lucca, Italy, in 2005 and 2009, respectively. From 2015 until very recently, he was an associate professor at the eCampus University, and in September 2017 he also joined FBK, Trento, Italy, to coordinate the research activities of the OpenIoT Research Unit. His current research interests revolve around the IoT in general, and Edge Artificial Intelligence in particular. Regarding his most recent editorial activity, he is an associate editor of the Applied Soft Computing journal and the IEEE Internet of Things Magazine, besides being the managing editor of the IEEE IoT newsletters.