IoT-Panic: The Cloud Just Disappeared!
“Error 500: the server returned an internal error. Please retry later”: this is one of the scariest error messages you can encounter when using a web service. What happens if this error is returned to an IoT device calling its cloud service? Or what if the server simply rejects the connection? Can the application continue to deliver a limited service, or will it be completely unusable?
The term Internet of Things (IoT) recalls, by definition, the idea that we can connect our devices to the Internet to exchange information, receive commands, take actions, and so on. Today, almost every commercial IoT application (e.g., Amazon Alexa or Google Home) exploits connected devices to interact with users in different ways, using gestures, voice, vision, and so on. Such applications require at least the two-tier architecture depicted in Fig. 1, composed of IoT devices on one side and a cloud endpoint on the other. Here, by analogy with biology, the cloud endpoint acts as the central nervous system of the entire application, while our devices behave like the peripheral nervous system and our senses.
Figure 1: A traditional two-tier IoT architecture, where IoT devices fully rely on the cloud to deliver their service.
It is easy to see that this architecture may lead to several problems:
- What if a small failure or a bug disrupts an entire cloud platform?
- What if the services of a major public cloud provider become inaccessible?
- What if a major disaster (e.g., an earthquake) destroys part of the infrastructure? Is the cloud really resilient against major failures?
Unfortunately, we already know the answer to some of these questions, simply because it has already happened: on 14th December 2020, around 12 p.m. UTC, all Google services that require authentication experienced a major outage due to a failure in a quota management system for the User ID Service, one of the services involved in the authentication of requests from users [1], as depicted in Fig. 2.
Figure 2: The message that millions of logged-in users received during the December 2020 Google outage (see footnote 1).
This might look like a problem that affects only human users. Indeed, people all around the world were unable to access the Google Workspace apps (Gmail, Drive, Docs, Meet, Calendar, Hangouts, YouTube, etc.) for almost 1 hour, with a huge economic impact.
This incident should raise crucial (almost existential) questions:
- What is the real impact of Google on our daily life?
- Could we continue to work if Google disappeared or failed?
Let’s take a further step: are these questions valid only for services aimed at humans? Again, and sadly, the answer is “no”. The same authentication service is used by Google to authenticate requests coming from the Google Home ecosystem and its devices. As a result, this outage literally turned even the most sophisticated smart devices into dumb, albeit shiny, pieces of technology. Millions of people could not use their smart devices to heat their houses, turn their lights on or off, listen to music, remotely control their children, answer the doorbell, or even clean their houses with a smart vacuum cleaner. And this is not just a cyberpunk provocation: Fig. 3 reports a tweet that was posted by a Twitter user immediately after the outage and went viral.
Figure 3: A tweet sent by Joe Brown immediately after the Google outage, source: https://twitter.com/joemfbrown/status/1338452107419148290.
Figure 4: A tweet, complete with a snapshot of the server error message (see footnote 2), sent by Alex Dunsdon, source: https://twitter.com/alexdunsdon/status/1338461046785368067.
We, as IoT practitioners, technologists, developers, designers, architects, theorists, and gurus, need a radical change in how we create IoT applications. The “too big to fail” dogma does not work anymore, and we need to promote applications that are (more) resilient against cloud failures.
Said differently: we need a Plan B, and we need it now! We need applications that do not depend on a remote endpoint to work. For example, our smart homes should keep working even if the Internet connection is not available. We can survive if we cannot get the latest weather forecast or listen to our favorite playlist, but we should always be able to heat our houses, even if we use smart thermostats. The same applies to other application domains that critically impact our lives: e-health, transportation, or, more generally, wherever IoT represents an enabling technology.
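To make the thermostat example concrete, here is a minimal Python sketch of a controller that prefers the cloud-provided setpoint but keeps heating on its own when the cloud is unreachable. Everything in it is an illustrative assumption rather than the API of any real product: the `Thermostat` class, the `CloudUnavailable` exception, the `fetch_cloud_setpoint` callback, the 19 °C default, and the 0.5 °C hysteresis band are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CloudUnavailable(Exception):
    """Hypothetical error raised by the cloud client on HTTP 5xx,
    timeouts, or refused connections."""


@dataclass
class Thermostat:
    fetch_cloud_setpoint: Callable[[], float]   # hypothetical cloud call
    fallback_setpoint: float = 19.0             # assumed safe default (deg C)
    hysteresis: float = 0.5                     # assumed dead band (deg C)
    last_known_setpoint: Optional[float] = None
    heating: bool = False

    def target(self) -> float:
        """Return the cloud setpoint if reachable, otherwise the last
        value received, otherwise the local default."""
        try:
            self.last_known_setpoint = self.fetch_cloud_setpoint()
        except CloudUnavailable:
            pass  # degrade gracefully: keep using local knowledge
        if self.last_known_setpoint is not None:
            return self.last_known_setpoint
        return self.fallback_setpoint

    def step(self, room_temp: float) -> bool:
        """Run one control iteration entirely on the device and
        return whether the heater should be on."""
        target = self.target()
        if room_temp < target - self.hysteresis:
            self.heating = True
        elif room_temp > target + self.hysteresis:
            self.heating = False
        return self.heating


if __name__ == "__main__":
    def cloud_down() -> float:
        raise CloudUnavailable("HTTP 500")

    thermostat = Thermostat(fetch_cloud_setpoint=cloud_down)
    print(thermostat.step(17.0))  # heater state with the cloud down
```

The design choice is that the control loop never blocks on the remote endpoint: the cloud only refines the setpoint, while the decision to heat is always taken locally.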
For all these reasons, we should push our application design and engineering processes towards computing paradigms that are resilient by design against this kind of outage. One above all: edge computing, which, by the way, is not a novel paradigm, as a niche of researchers has been talking about it for more than a decade now [2]. Once again: the edge is the key [3].
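One common edge-side pattern that fits this goal is store-and-forward: the device keeps working and buffers its data locally while the cloud is unreachable, then uploads the backlog once the connection returns. The Python sketch below illustrates the idea under stated assumptions; the `EdgeBuffer` class, the `upload` callback, and the use of `ConnectionError` to signal an unreachable cloud are hypothetical and not taken from any specific platform.

```python
from collections import deque
from typing import Callable, Deque, Dict


class EdgeBuffer:
    """Buffers readings on the device and forwards them when possible."""

    def __init__(self, upload: Callable[[Dict], None], max_items: int = 1000):
        self._upload = upload  # hypothetical cloud upload function
        # Bounded queue: when full, the oldest readings are dropped.
        self._pending: Deque[Dict] = deque(maxlen=max_items)

    def record(self, reading: Dict) -> None:
        """Store a reading locally first, then try to drain the backlog."""
        self._pending.append(reading)
        self.flush()

    def flush(self) -> int:
        """Upload pending readings in order; stop at the first failure.
        Returns the number of readings successfully sent."""
        sent = 0
        while self._pending:
            try:
                self._upload(self._pending[0])
            except ConnectionError:
                break  # cloud unreachable: keep the data for later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A reading is removed from the queue only after its upload succeeds, so a failure in the middle of a flush does not lose data; the bounded queue is a deliberate trade-off for memory-constrained devices.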
Further Readings
[1] Google Cloud Status Dashboard, "Google Cloud Infrastructure Components Incident #20013," December 14, 2020. [Online]. Available: https://status.cloud.google.com/incident/zall/20013.
[2] M. Satyanarayanan, P. Bahl, R. Caceres and N. Davies, "The Case for VM-Based Cloudlets in Mobile Computing," in IEEE Pervasive Computing, vol. 8, no. 4, pp. 14-23, Oct.-Dec. 2009.
[3] M. Antonini, M. Vecchio and F. Antonelli, "Fog Computing Architectures: A Reference for Practitioners," in IEEE Internet of Things Magazine, vol. 2, no. 3, pp. 19-25, September 2019.
1 It is important to note that the error code may change depending on the source of the request. For instance, Google services were returning the error code 400 (i.e., bad request) during the authentication phase of Nest devices (Fig. 4), even though the problem was triggered by the server itself (i.e., error code 500) and not by the client.
2 As anticipated, the error code 400 (i.e., bad request) refers to a malformed authentication request by the client; however, the real source of the problem was the faulty behavior of Google's User ID Service.
Mattia Antonini is a Ph.D. Candidate at FBK ICT, Italy. Mattia received the B.Sc. degree (summa cum laude) in computer, electronics and telecommunication engineering, and the M.Sc. degree (summa cum laude) in communication engineering from the University of Parma, Italy, in 2014 and 2017, respectively. He has been a member of international research groups and he has worked on EU-funded projects since his B.Sc. He is serving as a reviewer for a few IEEE journals. His current research topics cover edge intelligence architectures and system design, edge computing, embeddable machine learning, and data analytics.
Massimo Vecchio received his M.Sc. degree in Information Engineering from the University of Pisa, Italy, and his Ph.D. degree in Computer Science and Engineering from IMT Institute for Advanced Studies, Lucca, Italy, in 2005 and 2009, respectively. From 2015 until very recently, he was an associate professor at eCampus University; in September 2017 he also joined FBK, Trento, Italy, to coordinate the research activities of the OpenIoT Research Unit. His current research interests revolve around IoT in general and Edge Artificial Intelligence in particular. Regarding his most recent editorial activity, he is an associate editor of the Applied Soft Computing journal and the IEEE Internet of Things Magazine, besides being the managing editor of the IEEE IoT newsletters.