Practical Artificial Intelligence for the IoT

Danilo Pietro Pau
July 22, 2021


This article proposes a practical methodology and associated tools for next-generation IoT developers who aim to productively conceive and deploy IoT applications with interoperable artificial neural networks (ANNs) on resource-constrained microcontrollers (MCUs).

Educating future computer-science engineers in artificial intelligence (AI) and ANNs for embedded systems requires a consistent study plan. The embedded industry is struggling to find well-prepared engineers, at the B.Sc., M.Sc., and Ph.D. levels, who are fluent in Python and C/C++ programming. Thus, there is a resource gap that will take several years to fill.

Academia has recognized this disparity and is taking steps to compensate. One example is an initiative between Harvard University and edX with the courses Fundamentals of TinyML[1], Applications of TinyML[2], and Deploying TinyML[3], which cover the basics of machine learning and embedded systems by teaching how to program with TensorFlow Lite for Microcontrollers. The program teaches IoT practitioners to write machine learning (ML) models for resource-constrained MCUs and culminates in a range of student-designed IoT applications.

And industry? What help can it provide? STMicroelectronics, in cooperation with the Università di Catania, has developed the Programming in Embedded Systems: from the basics of the Microcontroller to Artificial Intelligence[4] course, a clear demonstration of the industry’s interest in bringing this teaching to a large community of IoT practitioners. Programs like this will help graduates find ML jobs more easily in an environment rich with opportunities to drive their future professional growth.

When developing IoT applications based on ANNs that can be mapped onto tiny MCUs, the steps must be simple and easy to learn. We propose a simple, inexpensive, and productive five-step methodology, as shown in Figure 1:

  1. Define an IoT application: first, define the application problem to be solved and be prepared, accordingly, to capture enough representative data about the physical phenomena that will be subject to ANN processing for that data to be meaningful. This usually involves placing sensors at, on, or near the physical object to be monitored to record its state and associated changes over time. Examples of physical parameters include acceleration, temperature, sound, pressure, vision, thermal imaging, and battery charge, depending on the target IoT application.
  2. Create an ANN, which requires pre-processing and labeling the data acquired from the sensors. In "supervised learning," the designer must associate elements of the data set with semantically defined labels so that an input-output relation is uniquely set by construction. This classified set is the "ground truth" that will be used to train the ANN and then validate it using a proper partition of the set. The designer must decide what type of topology the ANN should feature to learn from the data and achieve high accuracy for the target IoT application. Usually, this step requires one of the popular off-the-shelf deep-learning frameworks to architect, train, and test the ANN topologies.
  3. Train the ANN, which involves passing the data sets through the ANN iteratively so that its outputs minimize the adopted error criterion. ANN definition, training, and testing are typically performed using an off-the-shelf deep-learning framework, as noted in the previous step. Training is usually done on a powerful computing platform (such as a server with GPUs) that features virtually unlimited memory and computational power, allowing many epoch-based iterations until the model converges to satisfactory accuracy. Training produces a pre-trained ANN stored in a file in the format of the adopted deep-learning framework (e.g., Keras .h5, TensorFlow Lite .tflite, PyTorch/MXNet/PaddlePaddle .onnx). These are interoperable file formats, either de facto industry standards (Google) or specified by a large community (ONNX).
  4. Convert the ANN into optimized code for the MCU: to avoid unproductive, repetitive, error-prone, hand-crafted C-code development, STMicroelectronics designed a technology that quickly, accurately, and automatically converts pre-trained ANNs into optimized C code. This C code can run on a tiny MCU with full validation and built-in performance-characterization facilities, without any need to hand-craft them. The technology, embodied in freely available tools[5],[6], guides IoT practitioners and developers through the selection of the MCU (from ST’s STM32 and SPC58 families) and provides rapid, detailed feedback on the implementation implications of the ANN on the selected MCU, for both IoT (STM32) and automotive (SPC58) applications. Validation of the ANN against the deep-learning runtime can run both on the PC and on the target MCU. The technology does not pose any specific constraint on the IoT application, and it facilitates integration into the design flow through a well-defined set of public application program interfaces (APIs). It also offers simple and efficient interoperability with popular deep-learning training tools widely used by the AI developer community.
  5. Finally, embed the ANN into an MCU integrated into the IoT application for field trials and in-field validation.
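Steps 2 and 3 can be illustrated with a deliberately tiny, framework-free sketch. NumPy stands in for a deep-learning framework here; the one-layer model, the synthetic "ground truth," and all numbers are illustrative only:

```python
import numpy as np

# Step 2 in miniature: inputs x paired with labels y by construction.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))            # 64 samples, 3 features each
true_w = np.array([0.5, -1.0, 2.0])
y = x @ true_w                          # noiseless labels for the toy problem

# Step 3 in miniature: iterate over the data, minimizing an error criterion.
w = np.zeros(3)
lr = 0.1
for epoch in range(300):
    pred = x @ w
    grad = 2.0 * x.T @ (pred - y) / len(x)   # gradient of mean squared error
    w -= lr * grad

print(np.allclose(w, true_w, atol=1e-3))     # training recovered the weights
```

A real deep-learning framework performs exactly this loop, only with multi-layer topologies, automatic differentiation, and GPU acceleration.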

Figure 1: The 5 steps needed to deploy an IoT application based on ANN onto STM32 and SPC58 MCUs.


To prove this methodology in practice, let us walk through a case study showing how easy and productive the methodology is, even for IoT practitioners who need to quickly prototype their ideas without using challenging, complex software packages or development boards.

Step 1: Let’s consider that we need to classify the fill level of sodium chloride sterile liquid in bottles for intravenous administration. One goal is to reduce or eliminate continuous human visual monitoring, which is an onerous, time-consuming, and error-prone task. Automating it can increase productivity and save time. Under normal circumstances, human visual monitoring of the saline level in the bottle is required from time to time, without any real-time criticality. However, if the saline liquid in the bottle is fully consumed and the bottle is not replaced, or the infusion is not stopped immediately, the difference in pressure between the patient's blood and the empty saline bottle could cause an outward rush of blood into the bottle.

Step 2: The dataset of saline-bottle pictures is openly available [1], free of charge, properly labeled, documented, and organized in folders.

Step 3: Various ANNs were designed based on convolutional (separable and depth-wise) feed-forward topologies (CNNs) using the Keras deep-learning framework. Their topologies are composed of a mix of Conv2D, ReLU, MaxPooling, Flatten, and Dense layers. By tuning the kernel size of the filters and their number, and by interleaving SeparableConv2D layers, model sizes can be dramatically reduced. This is important because it reduces model complexity, measured in the number of multiply-and-accumulate operations (MACC), occupation of non-volatile MCU memory (flash), and dynamic memory (RAM) occupation. The ANNs were first hand-crafted using 32-bit floating-point precision and then quantized to 8-bit integers. Quantization happens by first converting the pre-trained network from the Keras to the TensorFlow Lite file format and then applying the post-training quantization procedure, which also requires a calibration dataset. This procedure usually reduces accuracy only marginally, as it did in our case.
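At its core, post-training quantization maps float32 values to int8 through a scale and a zero point. The simplified per-tensor sketch below shows the idea and the resulting four-fold storage saving; it is not TensorFlow Lite's actual (per-axis, calibration-driven) implementation, and the weight tensor is random stand-in data:

```python
import numpy as np

def quantize_int8(w):
    """Affine per-tensor quantization: w ~= scale * (q - zero_point)."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = np.round(-128.0 - lo / scale)   # maps lo onto int8's minimum
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

rng = np.random.default_rng(1)
weights = rng.normal(scale=0.1, size=(3, 3, 8, 16)).astype(np.float32)
q, scale, zp = quantize_int8(weights)
dequant = scale * (q.astype(np.float32) - zp)    # what inference actually uses

print(weights.nbytes // q.nbytes)                # 4: int8 vs float32 storage
print(float(np.abs(weights - dequant).max()) <= scale)  # error bounded by one step
```

The calibration dataset mentioned above serves to pick ranges (lo, hi) for the activations, which, unlike the weights, are not known until real inputs flow through the network.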

Step 4: Those ANNs were automatically converted into C code for STM32 MCUs mounted on boards of the STM32 Nucleo family[7]. STM32 Nucleo boards are affordably priced, so they make it easy to try out new IoT ideas and quickly create prototypes or proofs of concept with any Arm® Cortex® M4- and M7-based STM32 MCU. Compiler and debugger tools are free of charge[8], too. Figure 2 highlights the various ANN versions developed in step 3, their complexity, and the validation accuracy achieved using the ST AI tools[9] that automatically deploy the ANN on the STM32 Nucleo board.

Figure 2: Complexity of the various ANN topologies automatically mapped on an STM32H743ZI2 Nucleo board (480 MHz, 2 Mbytes flash, 1 Mbyte RAM). Any other STM32 Nucleo M4 or M7 board can be used. DW stands for depth-wise convolutions.


Step 5: Since the STM32 Nucleo board does not feature an integrated image sensor or any other sensor, we designed the system depicted in Figure 3a. The demonstrator uses a PC connected to an STM32 Nucleo MCU board via USB: a) the STM32 MCU runs any of the ANN models in Figure 2, generated with the "validation on target" program built into X-CUBE-AI; b) a webcam is attached to the PC; c) a Python script using the OpenCV library, running on the PC in a properly configured Conda environment, acquires image frames from the sensor in real time and sends them to the STM32 Nucleo board via USB; the board processes the data and sends the ANN classification results, together with STM32 execution times, back to the GUI. PC <--> STM32 Nucleo bi-directional communication happens through a serial port emulated over USB. Images are encoded using a dedicated binary protocol (documentation is openly available by installing X-CUBE-AI) and decoded on the MCU.
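The frame layout below is hypothetical (the real X-CUBE-AI protocol is documented with the tool), and an in-memory buffer stands in for the USB serial link, but it shows the kind of encode/decode round trip the demonstrator performs:

```python
import struct
import numpy as np

MAGIC = 0xA5A5  # made-up sync word, NOT the actual X-CUBE-AI protocol

def encode_frame(img):
    """Pack an 8-bit grayscale image as [magic][height][width][pixel payload]."""
    h, w = img.shape
    return struct.pack("<HHH", MAGIC, h, w) + img.tobytes()

def decode_frame(buf):
    """Inverse of encode_frame, as the MCU side would implement it in C."""
    magic, h, w = struct.unpack_from("<HHH", buf)
    assert magic == MAGIC, "lost frame synchronization"
    return np.frombuffer(buf, dtype=np.uint8, offset=6).reshape(h, w)

# A stand-in for one webcam crop; on real hardware pyserial would carry `wire`.
frame = (np.arange(32 * 32) % 256).astype(np.uint8).reshape(32, 32)
wire = encode_frame(frame)
print(np.array_equal(decode_frame(wire), frame))  # round trip is lossless
```

A fixed little-endian header like this is easy to parse on a Cortex-M target with a plain C struct, which is why binary framing is preferred over text for sensor streams.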

Figure 3a: Demonstrator of the concept using a PC and an STM32 Nucleo MCU board.


Figure 3b: Some visual results.


The lessons that IoT practitioners can learn from this case study are:

  • Do not stop at the first ANN topology you conceive or reuse from related work. Further exploration must be done to shrink the model to be as tiny as possible while maintaining accuracy at the expected level. Every ANN layer has complexity and storage costs, measurable through X-CUBE-AI, that impact the model's parameter size and memory footprint. Be aware of those costs even before embarking on time-consuming model training such as K-fold validation.
  • Consider using ST’s AI tools to automatically explore the deployability of an ANN on an MCU. As noted, you can do this even before the training phase, using the automatic analysis that reports the computational complexity of the ANN as well as its impact on the MCU's embedded flash and RAM, so you can apply optimizations to the ANN as early as possible in the design process. Take advantage of the automatic deployment, including on-target validation and performance characterization, that the tool offers via built-in programs to program the MCU with your favorite ANN. All of this will have a dramatic impact on your productivity.
  • Consider using post-training quantization to convert full-precision models to 8-bit integers. This decreases storage cost by a factor of four and accelerates execution on the MCU by two to three times on average. Check carefully whether accuracy is compromised and by how much. Typically, a reduction of less than 1% can be tolerated.
  • Connect any sensor to the PC and consider using the X-CUBE-AI pre-defined data format to convert and route sensor data to the STM32 Nucleo MCU via USB. This will help to quickly assemble a proof of concept (POC) to show the idea to stakeholders. Write a Python script to support the purpose, as we did.
  • Unleash your fresh-minded creativity. This is perhaps the most important and valuable contribution any early practitioner can bring to the ML field, since they are not yet biased by experience. Address new problems and challenge the AI tools.
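The first lesson can be made concrete with a back-of-the-envelope layer-cost estimate of the kind X-CUBE-AI automates; the layer dimensions below are arbitrary examples:

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """MACCs and parameters of a standard KxK conv (same padding, stride 1, no bias)."""
    params = k * k * c_in * c_out
    maccs = params * h * w          # one MACC per weight per output position
    return maccs, params

def separable_cost(h, w, c_in, c_out, k):
    """Depth-wise KxK followed by point-wise 1x1, the SeparableConv2D split."""
    dw = k * k * c_in               # depth-wise weights
    pw = c_in * c_out               # point-wise weights
    return (dw + pw) * h * w, dw + pw

std = conv2d_cost(56, 56, 32, 64, 3)
sep = separable_cost(56, 56, 32, 64, 3)
print(std[0] / sep[0])  # separable is several times cheaper in MACCs
```

Running the two formulas before training makes the flash and MACC impact of each architectural choice visible immediately, which is exactly why the tool-reported figures should drive topology exploration.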

How will the methodology evolve to become even easier and more productive than it is today? All the topologies presented in Figure 2 were hand-crafted. The associated hyperparameters were changed manually, picking values based on personal insight, experience, and knowledge of the inner properties of the ANN layers used. Undoubtedly, this is the most challenging task for any early practitioner, especially one with an embedded-programming background and limited or no pre-existing ML knowledge. So we are back to a new gap: how to shape an ANN topology. Is there any development likely to provide a better way forward? AutoML tools represent an interesting evolution: they automatically design an ML algorithm while simultaneously setting its hyperparameters to optimize its empirical performance on a given dataset. When the ML algorithm to be optimized is an ANN, AutoML specializes into Neural Architecture Search (NAS). For a given ANN topology, hyperparameter optimization (HPO) supports the automatic choice of a set of optimal hyperparameters to maximize accuracy. Unfortunately, the resulting, very accurate ANN is typically run on powerful targets and is not deployable on resource-constrained MCUs. Fortunately, the research community is very active, and interesting technologies have been proposed to fill the gap and help the mapping process. Early examples are AutoTinyML [2], TinyNAS [3], μNAS [4], and [5]. These tools offer NAS/HPO features and address the challenge of NN implementability on MCUs in the earliest phase of the ANN design process.
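A minimal sketch of constraint-aware search, assuming a made-up search space and a crude flash-footprint model (real NAS/HPO tools use far more sophisticated estimators and search strategies than random sampling):

```python
import random

SEARCH_SPACE = {            # hypothetical hyperparameters, not any real tool's space
    "filters": [8, 16, 32, 64],
    "kernel": [3, 5],
    "layers": [2, 3, 4],
}
FLASH_BUDGET = 64 * 1024    # bytes of MCU flash reserved for int8 weights

def estimate_flash(cfg):
    """Crude int8 weight footprint of a stack of same-width conv layers."""
    per_layer = cfg["kernel"] ** 2 * cfg["filters"] ** 2
    return cfg["layers"] * per_layer

random.seed(0)
feasible = []
for _ in range(100):        # random search, keeping only MCU-deployable configs
    cfg = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
    if estimate_flash(cfg) <= FLASH_BUDGET:
        feasible.append(cfg)

print(len(feasible) > 0)    # the constraint prunes candidates before any training
```

The key point is that the deployability check runs in microseconds, so infeasible architectures are discarded before a single expensive training run, which is precisely the promise of MCU-aware NAS.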

The technology for mapping ANNs onto MCUs is moving incredibly fast these days. However, it must not leave the education of next-generation IoT practitioners behind, and every effort to help them must be made. The IEEE Region 8 Action for Industry sub-committee has also set up the Internship Initiative as a contribution to better connect industry with IEEE students as early as possible during their studies.


  1. Pau, D., Kumar, B. P., Namekar, P., Dhande, G., & Simonetta, L. (2020). Dataset of sodium chloride sterile liquid in bottles for intravenous administration and fill level monitoring. Data in Brief, 33, 106472.
  2. Perego, R., Candelieri, A., Archetti, F., & Pau, D. (2020). Tuning Deep Neural Network’s Hyperparameters Constrained to Deployability on Tiny Systems. In International Conference on Artificial Neural Networks (pp. 92-103). Springer, Cham.
  3. Lin, J., Chen, W. M., Lin, Y., Cohn, J., Gan, C., & Han, S. (2020). MCUNet: Tiny deep learning on IoT devices. arXiv preprint arXiv:2007.10319.
  4. Liberis, E., Dudziak, Ł., & Lane, N. D. (2021). μNAS: Constrained Neural Architecture Search for Microcontrollers. In Proceedings of the 1st Workshop on Machine Learning and Systems (pp. 70-79).
  5. Xiong, Y., Mehta, R., & Singh, V. (2019). Resource constrained neural network architecture search: Will a submodularity assumption help?. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1901-1910).












Danilo Pau graduated from Politecnico di Milano in 1992. One year before, he joined STMicroelectronics, where he worked first on HDMAC decoder design and then on MPEG2 video memory reduction, video coding, embedded graphics, and computer vision. Today, his work focuses on developing solutions for deep-learning tools and applications. Danilo was elevated to IEEE Fellow in 2019, served as Industry Ambassador coordinator for IEEE Region 8 South Europe, and is a member of the Machine Learning, Deep Learning and AI in the CE (MDA) Technical Stream Committee of the IEEE Consumer Electronics Society (CESoc). With over 80 patents, 100 publications, 113 authored MPEG documents, and 40 invited talks/seminars at universities, Ph.D. schools, and conferences worldwide, Danilo's favourite activity remains mentoring undergraduate students, M.Sc. engineers, and Ph.D. students. If you are interested in discussing further, please contact Danilo Pietro Pau, Technical Director, IEEE and ST Fellow, STMicroelectronics, Agrate Brianza (Italy) at