An IoT-Platform-Based Deep Learning System for Human Behavior Recognition in Smart City Monitoring Using the Berkeley MHAD Datasets

Abstract: Internet of Things (IoT) technology has been rapidly developing and has been well utilized in the field of smart city monitoring. The IoT offers new opportunities for cities to use data remotely for the monitoring, smart management, and control of device mechanisms that enable the processing of large volumes of data in real time. The IoT supports the connection of instruments with intelligible features in smart cities. However, there are some challenges due to the ongoing development of these applications. Therefore, there is an urgent need for more research from academia and industry to obtain citizen satisfaction, and efficient architectures, protocols, security, and services are required to fulfill these needs. In this paper, the key aspects of an IoT infrastructure for smart cities were analyzed. We focused on citizen behavior recognition using convolutional neural networks (CNNs). A new model was built for understanding human behavior using the Berkeley Multimodal Human Action Database (MHAD) Datasets. A video surveillance system using CNNs was implemented. The proposed model's simulation results achieved 98% accuracy for the citizen behavior recognition system.


Introduction
Smart city monitoring aims to improve people's quality of life and the performance of services by using the latest technology [1]. Data are acquired from the many devices that serve smart cities and include videos, security surveillance, environmental, e-government, and transportation data, among others. The IoT involves intelligent devices, sensors, wireless devices such as wireless sensors, and radio-frequency identification (RFID) tags built into service systems and connected in a network [2]. Data collected from these devices are classified and used for decision making. Many smart city applications have been introduced, such as transportation [3], healthcare, environment monitoring, public safety [4], and many others. Therefore, CNNs and machine learning (ML) can be leveraged in smart city monitoring [5][6][7][8][9]. These techniques aim to develop algorithms that can process input data for learning and accordingly predict unknown information or actions. These algorithms can be categorized into two streams: supervised learning and unsupervised learning.
Supervised learning enables one to find a certain mapping to predict the outputs of unknown data [10]. Unsupervised learning focuses on exploring the intrinsic characteristics of inputs. Since supervised learning leverages input labels that are understandable to a human, it can be applied to pattern classification and data regression problems [11,12]. However, supervised learning relies on labeled data, which requires a considerable amount of manual work. Moreover, there can be uncertainties and ambiguities in labels as well [13][14][15][16]. Additionally, the label for an object is not unique. To tackle these problems, unsupervised learning can be used to handle intra-class variation, as it does not require data labels. In previous research, ML techniques were applied in many applications such as computer vision, bioinformatics, medical applications, natural language processing, speech processing, robotics, and stock market analysis [17]. The use of the deep learning (DL) approach enables the improvement of detection and recognition processes in IoT-based human recognition platforms [18,19]. Accordingly, we propose a new model for citizen behavior recognition based on an IoT infrastructure for smart city monitoring. The model uses convolutional neural networks (CNNs) with the Berkeley MHAD Datasets, which helps to understand human behavior and improves the performance of the designed video surveillance system with high accuracy.
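The distinction above can be sketched in a few lines; this is a minimal illustration with toy one-dimensional data, a nearest-centroid rule standing in for supervised learning, and a single k-means-style assignment step standing in for unsupervised clustering (all values and rules are illustrative, not from the paper):

```python
import numpy as np

# Toy 1-D data: two groups of points around 0.0 and 5.0.
X = np.array([0.1, -0.2, 0.3, 4.8, 5.1, 5.3])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available only in the supervised case

# Supervised learning: use the labels to build a mapping (class centroids),
# then predict the label of an unseen point.
centroids = np.array([X[y == c].mean() for c in (0, 1)])
x_new = 4.9
pred = int(np.argmin(np.abs(centroids - x_new)))  # nearest-centroid rule

# Unsupervised learning: no labels; group points by intrinsic structure
# (here, one assignment step from rough initial centers).
centers = np.array([X.min(), X.max()])
clusters = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)

print(pred)      # 1 -> x_new falls in the second group
print(clusters)  # [0 0 0 1 1 1]
```

The supervised rule needed `y` to build its mapping; the unsupervised step recovered the same grouping from the data alone, which is why it tolerates missing or ambiguous labels.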
The outcome of this study is to improve human behavior modeling for various smart city applications. The existing smart city systems use human behavior to promote smart agents. The proposed model helps to build a knowledge base of human behavior from various sources using sensory data and IoT technologies.
The main motivation of this research was to detect suspicious human behavior, which will help to identify activities such as fighting, slapping, vandalism, and people running in public places, schools, or colleges [5]. This paper focuses on the recognition of several actions, including jumping, jumping jacks, boxing, waving two hands, waving one hand (the right hand), and clapping hands. The novelty of this paper can be summarized in the following two points: We developed a new framework for modeling human behavior based on deep learning models to understand and analyze human behavior better; the proposed algorithms explore convolutional deep neural networks, which learn different features of historical data to determine collective abnormal human behaviors. We tested and evaluated the experiments of human behavior recognition systems based on convolutional deep neural networks to demonstrate the usefulness of the proposed method.
An IoT-platform-based deep learning system for human behavior recognition is of benefit to society and the industry of smart city monitoring. Some high-value services and applications may involve: Safety and security services, i.e., suicide deterrence in municipal places, amenability monitoring, and the scrutiny of disaster mitigation due to the detection of vandalism in a crowd, the protection of critical infrastructures, the detection of violent and dangerous situations, perimeter monitoring and person detection, and weapon detection and reporting; Epidemic control policy services, i.e., social distancing in municipal spaces, automatic mask recognition, sanitary compliance detection, and monitoring healthcare; Infrastructure and Traffic monitoring, i.e., monitoring traffic in smart cities, the recognition of traffic rule violations, the surveillance of roadsides, and parking space management.
The paper is organized as follows: Section 2 provides an overview of the background and related work. Section 3 presents the proposed video surveillance system. The experimental results and discussion are illustrated in Section 4. Result validation is provided in Section 5, and finally, Section 6 provides the conclusions and future work.

Smart Sustainable Cities
Smart sustainable cities are innovative cities that use information and communication technologies (ICTs) to improve the quality of life, the efficiency of operations and services, and competitiveness while ensuring that the needs of the present and future generations concerning the economic, social, environmental, and cultural aspects are met. Smart cities have emerged as a possible solution to the problems related to sustainability that result from rapid urbanization [20]. They are considered imperative for a sustainable future. In general, those cities that aim to become smart sustainable cities have to become more attractive, sustainable, inclusive, and more balanced for the citizens who live or work in them, as well as city visitors. Figure 1 shows the classification of some applications in smart cities [21].

Smart cities have become a focus for many governments globally, as citizens demand satisfaction in terms of efficient services and smooth, secure transportation [22]. The concept of the sustainability of smart cities falls within the scope of data and information sustainability. Sustainable urban development needs clean and transparent data that represent the different aspects of smart cities and are available at an individual level. Furthermore, these data must be freely used for data exchange in IoT networks [23].
The concept of smart sustainable cities is linked to the possibility of obtaining the right information at the right time to help in making decisions by citizens or government service providers to improve the quality of life [24]. The process of monitoring abnormal human behavior through IoT platforms helps to improve the intelligent management of smart cities, which enables the transformation of the social behavior of citizens toward the sustainability of city resources by making decisions to produce new smart city management standards and rules [25]. It also helps the government evaluate citizens' behavior to improve services, in addition to environmental, social, and economic sustainability.
Artificial intelligence (AI) techniques help to improve the performance of smart sustainable cities that are based on IoT networks. These technologies offer effective solutions in intelligent transportation, urban planning [26], data confidentiality, and big data processing. They also help in decision-making processes and in predicting possible future events. AI technologies are involved in many smart city applications, providing solutions for traffic congestion, energy data analysis, health care diagnostics, and cyber security [27]. Some examples are shown in Figure 1.

IoT-Platform-Based Deep Learning Systems
Deep learning (DL) is a part of the field of artificial neural networks (ANNs). An ANN is a computational model inspired by biological principles of the human brain, and this field has been studied extensively for decades. The ANN is composed of connected artificial neurons that simulate the neurons in a biological brain. The outputs passed between the layers of neurons are shaped by a non-linear transformation function such as the sigmoid [28]. The main objective of ML is to develop algorithms that are capable of learning and making accurate predictions in a given task.
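The role of the sigmoid in a single fully connected layer can be sketched as follows; the weights and inputs below are illustrative values, not trained ones:

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A minimal fully connected layer of two artificial neurons:
# y = sigmoid(W x + b), where x is the input from the previous layer.
W = np.array([[0.5, -0.3],
              [0.8,  0.2]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 2.0])

y = sigmoid(W @ x + b)
print(y)  # each output lies strictly between 0 and 1
```

Stacking many such layers, with the non-linearity applied between them, is what gives a deep network its ability to model complex mappings.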
In recent years, DL has achieved noteworthy results in computer vision. The creators of AlexNet set a performance record on a highly challenging dataset named ImageNet [29]. AlexNet was capable of classifying millions of high-resolution images from different classes with the lowest error rate at the time. DL approaches are ML methods that operate on numerous (multi-layer) levels [30]. Convolutional neural networks (CNNs or ConvNets) are considered the most important DL architecture for human action recognition [31].
Different DL structures have been previously proposed and appear to produce state-of-the-art results on numerous tasks, not limited to human activity recognition [32]. As one of the most important deep learning models, CNNs obtain superior results in solving visual-related tasks. A CNN is a type of artificial neural network intended for processing visual and other two-dimensional information. The main advantage of this model is that it works directly on raw data with no manual feature extraction. The CNN model was first introduced in 1980 by Fukushima [33]. CNNs are inspired by the structure of the visual nervous system [34], and CNN models continue to be proposed and developed. Figure 2 shows a summarized statistic of the number of articles focused on smart city applications based on AI, ML, and DL technologies from 2018 to September 2022.
Deep learning (DL) techniques help to provide innovative solutions to the challenges facing smart applications in urban cities related to the environment, people, transportation, and security. These technologies help to improve data processing, transform data into useful information, and develop cognitive intelligence for sustainable cities [35]. Recently, the combination of ML and DL techniques has become more common as an unsupervised neural network and has been used to identify patterns and objects in videos in many applications, providing high detection ability with an accuracy level of more than 80%. DL technologies analyze aggregated and integrated big data, including images, videos, sensor readings, cloud computing, and resource management mechanisms, to implement many intelligent operations related to detection and prediction [36].

Related Work
Several studies have recently been presented to monitor human behavior using deep learning mechanisms for different applications [37]. Our proposed system is focused on designing an IoT-platform-based DL system for citizen behavior recognition. The proposed algorithm is part of an IoT-based citizen behavior recognition project that integrates the IoT-based structure of smart cities with a system for monitoring human behavior and predicting suspicious events, which helps the competent authorities to take appropriate actions. In previous studies, we found only a few models that address such a problem using deep learning mechanisms to investigate human behavior. By reviewing these studies, the differences between them and our study are presented in Table 1. The previous studies considered different methodologies for human behavior recognition with high detection performance; however, most of them give lower accuracy than our proposed method.

Human Behavior Recognition Methodology
After investigating the recently developed deep learning architectures, a recurrent neural network was selected based on the many features needed for recognizing human activities, as well as the pattern recognition used throughout this work. The proposed solution consisted of two main phases, as shown in Figure 3. These phases are testing and training, which describe the steps of video processing based on the HAR approach. In the testing phase, the video input is captured with a high-resolution video sensor, taking a set of image samples, and then passed to the image preprocessing unit, which is used to convert the analog captured video to normalized datasets and enhance the quality of feature extraction for analysis preparation.
The extracted datasets are then used as input into the human detection and segmentation unit, which enables the recognition of the sample of video objects based on the intelligent model approach used. The dimensions of the segmented samples are then reduced by efficiently representing a large number of pixels from the sample using the feature extraction and representation unit to effectively capture parts of interest [31]. Then, the extracted features of the dataset are trained using the DL and RNN algorithms and matched with the filtered detected pattern of the captured video in the training phase to output the human behavior recognition result.
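The image preprocessing step described above can be sketched as follows; this is a minimal illustration assuming frames arrive as 8-bit RGB arrays, with simple block averaging standing in for a proper resize (the 28 × 28 target matches the input size used later in the experiments):

```python
import numpy as np

def preprocess_frame(frame, size=28):
    """Convert an RGB frame to a normalized size x size grayscale sample.

    frame: uint8 array of shape (H, W, 3). H and W are assumed to be
    multiples of `size`, so a block average can stand in for resizing.
    """
    gray = frame.mean(axis=2)                        # RGB -> grayscale
    h, w = gray.shape
    bh, bw = h // size, w // size
    blocks = gray[:bh * size, :bw * size].reshape(size, bh, size, bw)
    small = blocks.mean(axis=(1, 3))                 # block-average downsample
    return small / 255.0                             # normalize to [0, 1]

# Example: a synthetic 280 x 280 frame standing in for a captured video frame.
frame = np.random.randint(0, 256, (280, 280, 3), dtype=np.uint8)
sample = preprocess_frame(frame)
print(sample.shape)  # (28, 28)
```

In a real deployment, a library resize (e.g., bilinear interpolation) would replace the block average, but the normalized fixed-size output is what the downstream detection and feature-extraction units consume.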
Different DL recognition structures have been previously proposed and appear to produce state-of-the-art results on numerous tasks. Overall, DL approaches are ML methods that work on numerous (multi-layer) dimensions [34]. Furthermore, the data were preprocessed to yield a skeleton of the body part to be detected in an activity, to reduce the number of incorrect predictions produced by the model. One of the main problems for human activity recognition (HAR) is the view-invariant issue, wherein the model is only able to detect activities from the same viewpoint that it was trained to detect. However, integrating deep learning with pose estimation technology should overcome this challenge.
The proposed DL recognition structure is shown in Figure 4. The hierarchical model enables the extraction of video features using a classification task. It consists of many layers that represent location and direction, combine them with corresponding objects, detect familiar objects in the video, and obtain recognition results [26]. The deep learning network performs the hierarchical recognition tasks. The process of human behavior recognition (HBR) is described in Algorithm 1. In Algorithm 1, we developed a new method for HBR based on deep learning to study and analyze certain HB aspects. The proposed algorithm explores convolutional deep networks (CDNs), which learn different historical data features. To test and evaluate Algorithm 1, we applied it to different experiments for HBR systems based on CDN networks to demonstrate the effectiveness of the proposed procedure.
Firstly, we began with training and testing, during which the signal acquisition was set and preprocessing was conducted. Secondly, the algorithm calculated the extracted features; the global and local features were identified. Thirdly, by using the extracted global and local features, the baseline was set. Fourthly, we tested the process and used the global and local features to compare with the baseline and state the classification model. Finally, the recognition mode was activated and performed.
The HBR enables the processing of the captured video by dividing it into several samples, where each sample is separately processed in a multi-layer structure. All deep learning modules extract information for segments in parallel, so that the features can be extracted without more redundant information [26]. In this model, each deep learning unit is jointly trained, which adds greater advantages to the process of identifying events and human behaviors. To extract more features from the captured video samples, the deep learning units were trained on a set of different exercises at the same time with the samples from three categories related to dynamic and static samples in addition to selectivity for feature extraction [27]. These three categories enabled us to detect human jumping, boxing, waving, and clapping.

Algorithm 1. Human Behavior Recognition (HBR) Algorithm
1. Initiate training process; initiate testing process
2. for each training and testing sample
3.    Acquire signal and preprocess
4.    Calculate and extract features
5.    Set global and local features
6. for the training process
7.    if global and local features are extracted
8.       Set the baseline
9. for the testing process
10.   if global and local features are extracted
11.      Do: comparison with the output of step 8
12.      Recognize action
13.   end
14. end

The process was performed in four main stages. First, the input video was obtained from the Berkeley MHAD Datasets [16,17], which were preprocessed using an OpenPose bottom-up approach. Second, the data were processed through human detection and segmentation, using filters and patterns to find the overall commonalities in those patterns, as previously described. Feature extraction and representation were used to select an action from the dataset by going deeper into the compression to where the patterns were assembled [38]. Action recognition was the last step, where the action was detected and extracted from a known pattern class.
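As a minimal sketch of Algorithm 1's flow, the snippet below uses hand-picked stand-ins for the learned features (the overall mean as a "global" feature and per-quadrant means as "local" features) and a nearest-baseline comparison. These concrete feature choices, the action labels, and the random samples are illustrative assumptions, not the paper's trained model:

```python
import numpy as np

def extract_features(sample):
    """Global feature: overall mean. Local features: per-quadrant means.

    These simple statistics stand in for the features learned by the
    deep network in the paper."""
    h, w = sample.shape
    quads = [sample[:h // 2, :w // 2], sample[:h // 2, w // 2:],
             sample[h // 2:, :w // 2], sample[h // 2:, w // 2:]]
    local = np.array([q.mean() for q in quads])
    return np.concatenate(([sample.mean()], local))

# Training phase: set a baseline feature vector for each known action class.
rng = np.random.default_rng(0)
train = {"waving": rng.random((28, 28)), "clapping": rng.random((28, 28))}
baseline = {label: extract_features(s) for label, s in train.items()}

# Testing phase: extract global and local features from a new sample,
# compare them with each baseline, and recognize the nearest action.
test_sample = train["waving"] + rng.normal(0, 0.01, (28, 28))
feats = extract_features(test_sample)
recognized = min(baseline, key=lambda k: np.linalg.norm(feats - baseline[k]))
print(recognized)  # prints "waving"
```

The structure mirrors the algorithm: features are extracted in both phases, the training phase fixes the baseline (step 8), and the testing phase compares against it to recognize the action.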

Experimental Results and Discussion
The proposed CNN model, consisting of two convolution layers, two maximum pooling layers, and two fully connected layers, is shown in Table 2. The activation function was the rectified linear unit (ReLU), and the input image was converted into a 28 × 28 monochrome image. The model was built using the recurrent neural network (RNN) architecture. The parameters that could vary, and thus change the accuracy and the precision of the overall activity recognition, were the number of epochs, the batch size, and the number of iterations. A single epoch was a point at which an entire dataset was passed forward and in reverse through the neural network just once. In contrast, the batch size was the number of training samples in one batch. Iterations were defined as the number of batches needed to complete one epoch.
As mentioned earlier, the dataset was the Berkeley MHAD Datasets [39], which contains 11 activities. Six of them were used in this model: jumping, jumping jacks, boxing, waving two hands, waving one hand (the right hand), and clapping hands. The best result, illustrated in Figure 5, showed an accuracy of ~99% and a precision of 99%. Furthermore, the figure compares the accuracy of testing and training. In training, the accuracy reached 100% in some cases, so an overall test accuracy of 99% was a good result. The rows in the normalized confusion matrix are the actual classes of activities, and the columns are the predicted classes. Thus, the diagonal elements represent the degree of correctly predicted classes.
To visualize the algorithm's performance on the six classes, a confusion matrix was used, as shown in Figure 6. For this model, the batch size was 256, and the number of epochs was 800. In addition, as shown in Figure 6, due to color darkening, there was a slight similarity between clapping hands and boxing, and between boxing and waving one hand, which is understandable, as these activities have much in common.
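As a rough sketch of how the 28 × 28 input shrinks through the two convolution and two pooling layers before reaching the two fully connected layers, the arithmetic below assumes 3 × 3 kernels with stride 1 and valid padding, 2 × 2 non-overlapping max pooling, and 64 feature maps in the second convolution layer; these hyperparameters are illustrative assumptions, not values reported in Table 2:

```python
def conv_out(n, k):
    """Spatial size after a valid k x k convolution with stride 1."""
    return n - k + 1

def pool_out(n, p):
    """Spatial size after non-overlapping p x p max pooling."""
    return n // p

n = 28                 # monochrome 28 x 28 input, as in the experiments
n = conv_out(n, 3)     # convolution layer 1 (3 x 3 kernel assumed) -> 26
n = pool_out(n, 2)     # max pooling layer 1 -> 13
n = conv_out(n, 3)     # convolution layer 2 -> 11
n = pool_out(n, 2)     # max pooling layer 2 -> 5
flat = n * n * 64      # flattened size, assuming 64 feature maps
print(n, flat)         # 5 1600 -> fed into the two fully connected layers
```

The same arithmetic explains why the input must be normalized to a fixed size: each convolution and pooling stage shrinks the spatial dimensions deterministically, so the fully connected layers expect a fixed flattened length.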
The testing accuracy was 98.69%.
Figure 7 shows that the overall accuracy dropped to 97%, and Figure 8 shows its confusion matrix. Theoretically, fewer epochs should cause a drop in the overall accuracy, as the level of learning would still be high, and there would still be many factors on which the model needs to be trained. This is commonly known as underfitting. This needed to be verified experimentally. The batch size was set to 256, the same as the previous one, but the number of iterations was reduced from 800 to 600 epochs.
Figure 9 shows the result and confirms that the overall accuracy drastically dropped to 93%. Theoretically, a larger batch size would cause a drop in the overall accuracy. The verification of this was required. The batch size was increased to 512, and the iterations were reduced to 100 epochs to reduce the training time. Furthermore, as can be seen from the confusion matrix in Figure 10, there was confusion between boxing and waving with one hand, and boxing and clapping, in addition to waving with two hands and jumping jacks.
Another experiment was conducted where the batch size was less than 256, as shown in Figure 11. The batch size was 128, and the number of epochs was 1000. The result for the overall accuracy was 97.91%, as shown in Figure 12. Unnecessarily increasing the number of epochs may cause propagation in the signals, creating an error in some values. The batch size of 128 with fewer iterations provided the best result thus far.
Figure 13 shows that the model batch size was decreased to 128, as discussed earlier, with fewer iterations, such as 200 epochs, which was enough to train the model without causing signal propagation. Figure 13 shows the result was 98.71%, which was as expected and was the highest result obtained thus far. The trend shown in the test exponentially closed on the training trend more than in previous models. Moreover, the confusion matrix was the clearest obtained thus far, as shown in Figure 14.
The above results, shown in Figures 6-14, indicate that when the model batch size was decreased, fewer iterations were enough to train the model without causing signal propagation. This practically confirmed that a larger batch size caused a drop in the overall accuracy. It was also shown that unnecessarily increasing the number of epochs may cause propagation in the signals, creating an error in some values.
Previously, many tests were conducted between the values of X_train, Y_train, and X_test and Y_test. The dataset was split with an allocation of 80% for training and 20% for testing. Figures 7, 9, 11 and 13 present the test results. The view-invariant issue required using a preprocessed dataset that used the OpenPose method to obtain body parts as the input to our model. This trained the model to recognize the action regardless of the view with an accuracy level of ~99%. Table 3 shows the results when the parameters were changed.
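The 80/20 split described above can be sketched as follows. This is an illustrative pure-Python version; the helper name and the fixed seed are assumptions, not the paper's actual code:

```python
import random

def split_80_20(samples, seed=0):
    # Deterministic shuffle, then an 80/20 slice into train and test sets.
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    cut = int(0.8 * len(samples))
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test

X = list(range(100))
X_train, X_test = split_80_20(X)
print(len(X_train), len(X_test))  # 80 20
```

Shuffling before slicing matters here: the MHAD recordings are grouped by subject and action, so a contiguous split would leave some classes unseen during training.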

Result Validation
To validate that the model was working with high accuracy and that it could detect actions from different viewpoints, a video was uploaded that was not from the training dataset. This was carried out to check whether the model could identify the activity's label. Figure 15 shows the images from the uploaded video. The video was converted to an array suitable to be uploaded in the Jupyter notebook. The result of the predicted activity is shown in Figure 16.
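The conversion of an uploaded video into a model-ready array can be sketched roughly as below. The fixed clip length, frame size, and function name are illustrative assumptions; the actual preprocessing in the paper (OpenPose body-part extraction) is more involved:

```python
import numpy as np

def video_to_model_input(frames, target_len=32):
    # Normalize pixel values to [0, 1], pad or truncate the clip to a
    # fixed number of frames, and add a leading batch dimension so the
    # array matches the model's expected input shape.
    arr = np.asarray(frames, dtype=np.float32) / 255.0
    if arr.shape[0] >= target_len:
        arr = arr[:target_len]
    else:
        pad = np.zeros((target_len - arr.shape[0],) + arr.shape[1:],
                       dtype=np.float32)
        arr = np.concatenate([arr, pad], axis=0)
    return arr[np.newaxis, ...]

# A dummy 20-frame RGB clip standing in for the decoded video.
clip = np.random.randint(0, 256, size=(20, 64, 64, 3), dtype=np.uint8)
batch = video_to_model_input(clip)
print(batch.shape)  # (1, 32, 64, 64, 3)
```

The resulting batch can then be passed to the trained model's predict call, and the index of the largest output score mapped back to an activity label.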

Conclusions
Smart sustainable city monitoring is challenging for many reasons, such as different application domains requiring different tasks. Furthermore, a large volume of data of different types and modalities requires different algorithms and analysis techniques. This paper proposed and simulated an IoT platform for human behavior recognition using CNNs. The development of citizen behavior and activity recognition may start by overcoming a small issue, such as the view-invariant problem. As currently used, recognition models perform poorly when tested with real-life scenario viewpoints and complex tasks.
The study showed that there are still many challenges ahead for this emerging technology owing to the deep and wide coverage of smart city applications. A model was built for understanding simple human behaviors such as jumping, jumping jacks, boxing, waving two hands, waving one hand (the right hand), and clapping hands. A video surveillance system using CNNs was implemented. The simulation results showed an overall accuracy of about 98%. Future work could investigate suspicious human activities, such as fighting, slapping, vandalism, and people running in public places, both indoors and outdoors.