A Practical Hybrid IoT Architecture with Deep Learning Technique for Healthcare and Security Applications

Abstract: Facial mask detection technology has become increasingly important even beyond the context of the COVID-19 pandemic. Along with the advancement in facial recognition technology, face mask detection has become a crucial feature for various applications. This paper introduces an Internet of Things (IoT) architecture based on the deep learning algorithm You Only Look Once (YOLO) to keep society healthy and secure and to collect data for future research. The proposed paradigm is built with economic considerations in mind and is easy to implement, yet the YOLOv4-tiny model it uses is one of the fastest object detection models available. A mask detection camera (MaskCam) that leverages the computing power of NVIDIA’s Jetson Nano edge device was built side by side with a smart camera application to detect a mask on the face of an individual. MaskCam distinguishes between people wearing masks, people not wearing masks, and people not wearing masks properly, and reports its results over the MQTT protocol. Furthermore, a self-developed web application accompanies the MaskCam system to collect and visualize statistics for qualitative and quantitative analysis. The practical results demonstrate the effectiveness of the proposed smart mask detection system. On the one hand, YOLOv4-full obtained the best detection results even at smaller resolutions, although its frame rate is too low for real-time use. On the other hand, YOLOv4-tiny is twice as fast as the other detection models, with a detection quality similar to that of MobileNetV2. Consequently, inferences may be run more frequently over the entire video sequence, resulting in more accurate output.


Introduction
Facial mask detection technology has become increasingly important. It has been widely known and used as an effective tool for dealing with the COVID-19 pandemic. However, with the advancement in facial recognition technology, face mask detection has many other applications that extend far beyond the pandemic, such as access control in security-sensitive places. Fortunately, artificial intelligence (AI) techniques, particularly deep learning algorithms, have flourished in recent decades. To improve throughput, efficiency, and accuracy, libraries such as Keras, OpenCV, and TensorFlow are used in conjunction with the Python language. This provides researchers and engineers with a powerful toolset for solving the mask detection problem. However, in mobile environments where many people are walking and moving around, deep learning models can consume a tremendous amount of time and processing power. Therefore, a new technique is needed to address such problems.
Sethi et al. [8] present a face mask detector based on the MAFA (MAsked FAces) dataset and deep learning algorithms, achieving a relatively high mask detection accuracy of about 98.2%. According to Asif and Sohaib et al. [9], a machine learning method combined with transfer learning offers an alternative solution. Other researchers have addressed this problem using computer vision, AI, or other methods. These studies are similar in that they all achieve high mask detection accuracy. However, they focus only on techniques that maximize accuracy. With the Internet of Things (IoT), the applicability of a product is determined not only by the accuracy of its operation but also by its connectivity across many different locations [10]. To facilitate research and analysis, the collected data should be easy to extract, retrieve, and store. For these reasons, this paper proposes a MaskCam system that uses the deep learning algorithm You Only Look Once (YOLO) as its object detection method [11].
YOLO is an algorithm that employs neural networks to detect objects in real time. Its main advantages over other algorithms are its speed and accuracy, which make it suitable for a wide variety of applications, such as traffic signals and parking. The object detection function of YOLO is implemented by convolutional neural networks (CNNs). Its distinguishing characteristic is that the prediction for the whole image is made in a single run of the network. Furthermore, the algorithm has extensive learning capabilities, allowing it to learn representations of objects and then apply the resulting model to object detection. The YOLO algorithm utilizes residual blocks, bounding box regression, and intersection over union (IOU) techniques. First, the image is divided into a grid of cells, and objects that appear in each cell are detected. Bounding box regression identifies an object in an image by creating an outline around it in each cell. Through a single bounding box regression, YOLO predicts the object's height, width, and center, as well as its class. The IOU measures how much two boxes overlap. YOLO uses the IOU to select the output box that best fits each object. An IOU value of 1 indicates that the predicted box is identical to the real (ground-truth) box. During this process, predicted boxes that do not sufficiently overlap the real box are eliminated.
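The IOU computation described above can be illustrated with a minimal sketch. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the 0.5 threshold are assumptions made for illustration; they are not taken from the YOLO implementation used in this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def keep_overlapping(predicted_boxes, real_box, threshold=0.5):
    """Discard predicted boxes whose IOU with the real box is below the threshold."""
    return [box for box in predicted_boxes if iou(box, real_box) >= threshold]
```

A predicted box that coincides with the real box yields an IOU of 1, while disjoint boxes yield 0, so the threshold controls how close a prediction must be to survive.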
The YOLOv4-tiny model was used in this study. Compared with other similar object detectors, YOLO has long been the most popular option. In contrast to region-based detectors, which generate region proposals that are then sent to a classifier, it takes the complete picture as input. This makes it significantly faster than traditional detectors. YOLOv4-tiny is a compressed version of YOLOv4 designed for training on less powerful machines [12,13]. The MQTT (MQ Telemetry Transport) protocol is used to communicate between the data server and the detection device [14]. A detailed description of the hardware installation and a system overview are given in Section 2. A web-based GUI frontend, built with the Streamlit framework, displays statistics and their relationships and sends MQTT messages to devices. Offices, schools, hospitals, or any place that requires people to wear a mask as part of an epidemic prevention program can benefit from the developed smart mask detection system. Launching the system is easy and inexpensive. Several measures were taken as part of this study to analyze the problem and identify infractions:

• YOLOv4-tiny was used to measure the accuracy of masked-face detection on a custom-built dataset blended from several datasets.
• Mask detection algorithms that provide high accuracy and frame rates were analyzed, so that the system can operate in real time.
• The use of MQTT communication for IoT applications makes connecting devices to servers and data storage simpler and more convenient.
The remainder of the paper is organized as follows. Section 2 provides detailed explanations of the methodology. Section 3 describes the study results and discussion. Finally, Section 4 lays out the conclusions.

Methodology
This section presents the system overview, hardware installation, and software development.

System Overview
An intelligent mask-wearing surveillance system was developed using an NVIDIA Jetson Nano device. The overall system is depicted in Figure 1. The system consists of two main parts:
(1) Device-side: The intelligent camera is powered by an NVIDIA Jetson Nano, which is considered the brain of the intelligent camera. Using an optimized deep learning detection model, it captures video and detects mask and no-mask cases. The device side can comprise multiple devices installed in different surveillance locations.
(2) Server-side: Data received from the device side are stored on the server, which acts as a data warehouse. The detection data are then presented on a dashboard for analysis. The server also handles the user's commands and feedback and transmits them to the devices.

Hardware Installation
An NVIDIA Jetson Nano, a Logitech C270 HD Webcam, and an Intel Dual Band Wireless AC 8265 were used in the hardware setup of the intelligent camera device. The total cost of these components is about 350 USD. Figure 2 illustrates the device in its fully assembled state. In addition, a cooling fan was installed on the NVIDIA board to prevent thermal throttling and maintain performance over long periods.

Deep Mask Detection Model and Optimization
On the Jetson Nano device, the YOLOv4-tiny object detector is applied to locate the bounding boxes of faces and determine whether they are masked. The deep learning algorithm is implemented using OpenCV and the Python programming language. The detection model recognizes four categories: faces wearing masks, faces without masks, faces not visible, and faces with misplaced masks, as shown in Figure 3. Four public datasets, with approximately 6000 labels for each class, are used to train the model: Kaggle Medical Masks [15], MAFA [16], WiderFace [17], and WIDER FACE [18]. Furthermore, the detections are tracked across the scene using an open-source object tracker named Norfair [19]. Every time a person walks in front of the camera, the algorithm follows their face's bounding box as it moves within the scene. Rather than counting the same individual again in every frame, the algorithm counts each individual once. Once the detection score for a face exceeds a certain threshold in several frames, a voting process determines whether or not the individual is wearing a mask. The same process applies when the face cannot be clearly seen. The final output is a count of how many individuals passed in front of the camera, together with the percentage of those individuals who were wearing a mask.
For the trained model to run as efficiently as possible on the resource-constrained device, it is converted to an optimized format, producing a TensorRT engine [20]. The model weights are reduced to 16-bit floating point (FP16) precision, which runs well on the Jetson Nano while maintaining reasonable accuracy. Additionally, the NVIDIA DeepStream SDK was used to accelerate the inference process on the NVIDIA GPU [21]. The processing pipeline is presented in Figure 4. Python's multiprocessing module can be used to handle the detection, video streaming, and MQTT communication processes required to stream the rendered video.
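The per-person voting step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the MaskCam implementation: the label names, the score threshold `min_score`, the minimum vote count `min_votes`, and the `(track_id, label, score)` input format are all hypothetical, and the track identifiers are assumed to come from an object tracker such as Norfair.

```python
from collections import Counter, defaultdict


def vote_per_track(detections, min_score=0.5, min_votes=3):
    """Decide one label per tracked person by majority vote.

    detections: iterable of (track_id, label, score), one entry per frame
    in which the tracker reported that person (hypothetical format).
    """
    votes = defaultdict(Counter)
    for track_id, label, score in detections:
        if score >= min_score:  # ignore low-confidence detections
            votes[track_id][label] += 1
    decisions = {}
    for track_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count >= min_votes:  # require enough confident frames
            decisions[track_id] = label
    return decisions


def summarize(decisions):
    """Count each person once and report the share wearing a mask."""
    total = len(decisions)
    masked = sum(1 for label in decisions.values() if label == "mask")
    percentage = 100.0 * masked / total if total else 0.0
    return {"people": total, "mask_percentage": percentage}
```

Because the vote is keyed by track identifier, a person who stays in view for many frames still contributes a single entry to the final count.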

MQTT Broker and Webserver
To collect and visualize statistics, a separate server was implemented. This server can run on a machine separate from the Jetson Nano devices (for example, an AWS EC2 instance) [22]. The web-based GUI system receives statistics from the MaskCam system, stores them in a database, and displays them. It can also send MQTT commands directly to devices via its web interface.
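As a rough illustration of the device-to-server exchange, the sketch below packages one batch of statistics as an MQTT topic and a JSON payload. The topic layout `maskcam/<device_id>/stats` and the field names are hypothetical and are not taken from the MaskCam system; the actual publishing step, which would be performed by an MQTT client library, is deliberately omitted.

```python
import json
from datetime import datetime, timezone


def build_stats_message(device_id, people_total, people_with_mask, when=None):
    """Return (topic, payload) for one statistics report.

    The topic layout and field names are illustrative assumptions.
    """
    when = when or datetime.now(timezone.utc)
    topic = f"maskcam/{device_id}/stats"
    payload = json.dumps(
        {
            "timestamp": when.isoformat(),
            "people_total": people_total,
            "people_with_mask": people_with_mask,
        }
    )
    return topic, payload
```

Keeping the device identifier in the topic lets the server subscribe to all devices with a single wildcard subscription while still knowing which camera sent each report.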
A web application was designed and developed using the Streamlit framework for the web-based GUI frontend. Streamlit shortens the development time of IoT dashboards by making GUIs easy to build. The frontend web application displays statistics and their relationships and sends MQTT messages to devices. The dashboard interface consists of three main components:
(1) Device selection: A device is selected to view its recorded data.
(2) Filters: The data to be visualized are selected by date and time.
(3) Reported statistics: The statistical data gathered from the selected device during the specified period are analyzed and visualized.
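The three dashboard components map naturally onto a select, filter, and aggregate sequence. The following sketch shows that logic on in-memory records; the record fields and the function name `report` are assumptions for illustration, and the Streamlit widgets that would supply the device and date range are not shown.

```python
from datetime import datetime


def report(records, device_id, start, end):
    """Aggregate statistics for one device within [start, end].

    records: list of dicts with the hypothetical keys "device_id",
    "timestamp" (datetime), "people_total", and "people_with_mask".
    """
    selected = [
        r
        for r in records
        if r["device_id"] == device_id and start <= r["timestamp"] <= end
    ]
    total = sum(r["people_total"] for r in selected)
    masked = sum(r["people_with_mask"] for r in selected)
    percentage = round(100.0 * masked / total, 1) if total else 0.0
    return {
        "people_total": total,
        "people_with_mask": masked,
        "mask_percentage": percentage,
    }
```

In a deployed system the same filtering would more likely be expressed as a database query, with the dashboard only rendering the aggregated result.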
On the backend, PostgreSQL was selected to store statistical and device information. Compared with MySQL, PostgreSQL is better suited to systems that must execute complex queries or that perform data warehousing and analysis. The RESTful APIs for the web application were developed in Python using the FastAPI framework. In addition to supporting asynchronous programming, FastAPI can be served with Uvicorn and Gunicorn. Furthermore, the backend module runs an MQTT subscriber task that reads all messages sent by the devices and records them in the database.
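The subscriber task can be pictured as a handler that parses each incoming message and inserts a row into the statistics table. The sketch below is only an approximation: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server, and the topic layout `maskcam/<device_id>/stats`, the table schema, and the payload fields are hypothetical. The MQTT client that would invoke `record_message` is omitted.

```python
import json
import sqlite3


def open_database(path=":memory:"):
    """Create the statistics table (SQLite stands in for PostgreSQL here)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS statistics ("
        "device_id TEXT NOT NULL, "
        "timestamp TEXT NOT NULL, "
        "people_total INTEGER NOT NULL, "
        "people_with_mask INTEGER NOT NULL)"
    )
    return conn


def record_message(conn, topic, payload):
    """Store one device report; return False for topics that are not reports."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "maskcam" or parts[2] != "stats":
        return False
    data = json.loads(payload)
    conn.execute(
        "INSERT INTO statistics VALUES (?, ?, ?, ?)",
        (
            parts[1],  # device identifier taken from the topic
            data["timestamp"],
            int(data["people_total"]),
            int(data["people_with_mask"]),
        ),
    )
    conn.commit()
    return True
```

Parameterized queries are used so that values arriving from devices are never interpolated into the SQL text.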

Results and Discussions
This section presents the implementation of the IoT architecture with the deep learning technique. The face mask detection model is a YOLOv4-tiny network, implemented with TensorRT and accelerated by DeepStream. The model uses a combination of four public datasets (Kaggle Medical Masks, MAFA, WiderFace, and WIDER FACE) with approximately 6000 labels for each object class. The dataset includes four object classes: face with mask, face without mask, face not visible, and misplaced mask. With the integrated C270 HD Webcam, the video can reach 30 FPS at a resolution of 1280 × 720. Various object detection models were compared, including MobileNetV2, the full version of YOLOv4, and the tiny variant of YOLOv4. Using different input resolutions, the models were trained and optimized with TensorRT before being benchmarked on the same reference videos. The comparisons between these models can be found in Table 1. Although YOLOv4-full obtained the best results even at smaller resolutions, its frame rate is too low for real-time use. While the quality of detection is similar between YOLOv4-tiny and MobileNetV2, YOLOv4-tiny is significantly faster, i.e., twice as fast as the other detection models. Consequently, inferences can be run more frequently over the whole video sequence, leading to better results.
As shown in Figure 5, MaskCam detected the people in the picture regardless of their movements. Each detected object is given a number for clear presentation and analysis. Objects 9 and 11 are people who were not wearing masks or not wearing them properly, while the face of object 8 is classified as not visible. Figures 6 and 7 demonstrate how statistical detection data can be plotted using the web-based dashboard. In addition to reporting the number of people who pass through the surveillance area and whether or not they are wearing masks, the dashboard also reports the percentage of people wearing masks.
[...] dataset when using different models to determine whether or not an image contained a mask, as depicted in Figure 9. The variety of masks and the presence of multiple faces in an image are believed to be the main reasons for the difference in accuracy, according to the author. Therefore, when the current model is applied to masks different from those with which it was trained, the accuracy would differ slightly. Upscaling can be carried out on low-quality images for detection and classification [25,26]. Furthermore, CNNs can achieve higher accuracy when image quality is improved [27]. The CNN-based model proposed by Kaur et al. [28] proceeds by first recognizing the face and then evaluating whether or not the face is covered. Figure 10 shows that our model functions somewhat similarly.
A similar algorithm was developed by Bhuiyan et al. [29] to identify whether or not the individual being monitored is wearing a mask. The performance of the custom-trained model was enhanced with data augmentation [30]. The performance of different models is compared using the model metrics defined in Table 2. Figure 10 illustrates the comparison with the different models studied by Naeem Ullah et al. [31], whose algorithms could also be used to improve our current model. Compared with the other models, the current model demonstrated impressive results and can be implemented in real-world scenarios.
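The four metrics named in the caption of Figure 10 (accuracy, precision, recall, and F1 score) can be computed from confusion-matrix counts as sketched below. The sketch assumes the standard definitions of these metrics; Table 2 is not reproduced here, so whether the paper's definitions match exactly is an assumption, and the function name and argument order are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a multi-class detector such as the four-class model here, these counts would typically be computed per class and then averaged.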
The algorithm effectively detected all four conditions: people wearing masks, people not wearing masks, people not wearing masks properly, and people whose faces are not visible in the image frame. This facial mask detection technology is therefore versatile and can be employed for different purposes. For example, in places such as hospitals and transportation hubs, or during a pandemic, where there is a high risk of spreading contagious diseases and wearing masks is mandatory, this technology helps to identify individuals who are not following the mask-wearing protocol so that they can be advised to wear masks. Conversely, in restricted-access and sensitive places such as banks, military bases, and power plants, where people are asked not to wear masks when entering, this technology helps to identify those who are wearing masks and ask them to take off their masks so that their identity can be checked, thus enhancing security. Generally, individuals are given instructions on whether or not they need to wear masks in particular situations; when they do not follow them, the facial mask detection technology in our study helps remind them to comply with those instructions.

Conclusions
This study proposes a new cost-effective mask detection solution based on the Internet of Things and deep learning to support the many applications of facial mask detection technology. Indoor measurement was the primary focus of this study. In addition to running on inexpensive hardware, the pipeline achieves high efficiency in mask detection. Further, the Jetson Nano was successfully used in practice to deploy deep learning models using the proposed method. Nevertheless, the accuracy of the neural network and the implementation of the deep learning algorithms can still be improved. Lastly, a web application was developed for data visualization and analysis. Compared with the other models considered, the present smart mask detection system showed outstanding accuracy, a shorter processing time, and the smallest model size, at about 98%, 8.95 s, and 33 MB, respectively.
Future work will include state-of-the-art detection techniques (YOLOv5, motion detection) to enhance the accuracy for individuals at a distance while still working well on edge devices. A potential benefit of this work is that it can be applied to other applications by training more accurate models. These include self-driving cars, traffic signals, parking, facial recognition, robotics, and medical applications that use these technologies to detect objects in a variety of scenarios.

Figure 1. The pipeline of the intelligent mask-wearing surveillance system.

Information 2023, 14, x FOR PEER REVIEW

Figure 3. Detection strategy of the proposed system.

Figure 4. The mask face detector and tracker run as a DeepStream pipeline.

Figure 5. Real-time results of the system.

Figure 8. Comparison to other models in terms of accuracy, times, and size.

Figure 10. Comparison to other models in terms of accuracy, precision, recall, and F1 score.

Table 1. Performance of different models on the Jetson Nano with TensorRT optimization.