Intelligent ADL Recognition via IoT-Based Multimodal Deep Learning Framework

Smart home monitoring systems based on the internet of things (IoT) are needed to care for elders at home. They give families and caregivers the flexibility to monitor elders remotely. Recognizing activities of daily living (ADLs) is an effective way to monitor elderly people at home and patients at caregiving facilities. The monitoring of such activities depends largely on IoT-based devices, either wireless or installed at different places. This paper proposes an effective and robust layered architecture that uses multisensory devices to recognize the activities of daily living from anywhere. Multimodality refers to sensory devices of multiple types working together to achieve the objective of remote monitoring. The proposed multimodal approach therefore fuses data from IoT devices, such as wearable inertial sensors, with videos recorded during daily routines. The data from these sensors are first passed through a pre-processing layer with several stages: data filtration, segmentation, landmark detection, and 2D stick model extraction. In the next layer, called features processing, different features from the multimodal sensors are extracted, fused, and optimized. The final layer, called classification, recognizes the activities of daily living via a deep learning technique known as the convolutional neural network. The results show that the proposed IoT-based multimodal layered system achieves an acceptable mean accuracy rate of 84.14%.


Sensors 2023, 23, 7927

Introduction
Smart home monitoring via IoT devices is an important concept [1,2]. Monitoring elderly people and patients in IoT-based smart homes or facilities remains a big challenge [3]. Machines are not yet intelligent enough to take care of such patients at facilities by themselves [4]. Therefore, continuous improvements are needed in human health monitoring [5,6]. However, the standard approaches are less efficient, and a multimodal IoT-based methodology is required to provide robust monitoring systems [7,8]. Activities of daily living (ADLs) need to be examined for smart home monitoring systems. ADL monitoring applications are widespread, including fall detection, home surveillance, smart environments, assistive robotics, and ambient assisted living [9][10][11][12][13][14]. ADLs are difficult to recognize because each ADL consists of multiple small actions performed together to make one long activity [15]. A single type of raw sensor data cannot capture the complex sequences within an ADL. Different subjects can perform a single ADL by carrying out its small actions in different orders [16]. Therefore, a robust multimodal IoT-based intelligent system is required to address these limitations [17].
Deep learning models can help machines to infer the natural intuitions of human body motion. They provide a great opportunity to learn from sufficient examples of human actions in ADLs in order to then identify them [18]. End-to-end deep learning techniques are effective for high-level feature extraction [19]. A deep learning framework will help facilities to cope with high costs and nursing shortages via ADL recognition [20]. Multiple hyperparameters can be tuned for each deep learning model to adjust the ADL recognition [21]. Therefore, we propose a unique framework for the ADL recognition of elderly people at smart homes and facilities using IoT-based multisensory devices. This study suggests a systematic method that takes multimodal data from many IoT devices and processes them to remove noise and bias. Next, human silhouette detection and features processing are performed to highlight the important characteristics of the data. Finally, these features are optimized and the ADL is classified using a deep learning model.
Two publicly available datasets based on multimodal sensors and videos, namely Opportunity++ [22] and Berkeley-MHAD [23], have been used to evaluate the proposed method. These datasets contain numerous types of data, including inertial and vision-based data. The key contributions of this research paper are:

• A novel algorithm has been proposed for 2D stick model extraction in this study for supporting more efficient ADL recognition in less computational time.
• An algorithm for human body landmarks detection has been proposed to effectively recognize the daily locomotion activities.
• A genetic algorithm has been optimized using a state-of-the-art fitness formula proposed for video and inertial sensors-based ADL data.
• The proposed layers of the ADL recognition model support the delivery of a robust IoT-based multimodal system to achieve extraordinary efficiency.
A literature review is presented in Section 2, and the architecture of the proposed IoT-based multimodal system is described in detail in Section 3. The experiments performed are described in Section 4, and this study's concluding remarks along with some future directions are offered in Section 5.

Literature Review
This section presents a detailed literature review of both simple and multimodal IoT-based approaches for ADL recognition in smart environments. We have divided the literature review into two parts, namely, simple modal systems and IoT-based multimodal systems.

Simple Modal Systems
In the literature, many researchers have worked to recognize ADLs through different methodologies. A module encompassing sensor fusion and features extraction has been proposed in [24]. Accelerometers, magnetometers, and gyroscopes have been used in different combinations for ADL recognition. That study is more focused on environment identification, which leads to a low performance in ADL recognition. The authors of [25] have proposed an IoT-based model for the remote health monitoring of patients. Different health sensors, such as pulse, temperature, and galvanic skin response sensors, were used. However, the system lacked an actual implementation and could not perform well in a real-time environment. In [26], M. Sridharan et al. have proposed a model that maps the location of the activities performed by using already-detected landmarks and zones inside the home. They have also detected the gait of a person in different zones of the home. The model achieved 85% accuracy for trajectory prediction. However, because it performed no processing in filtration or feature layers, the system attained its performance using only low-level information for ADL recognition. A methodology consisting of four stages, namely acquisition, processing, fusion, and classification, has been suggested in [27]. The classification stage covered the recognition of ADLs, the identification of the environment, and the detection of activities with no motion involved. However, the fewer sensors were utilized for classification, the less accurate the proposed methodology became. The researchers in [28] have presented an activity classification system analyzed over light gradient boosting, gradient boosting, cat boosting, extreme gradient boosting, and AdaBoost classifiers. A smartphone-based dataset has been utilized to test the performance, and the study also showed a few limitations in the context of ADL performance.
In [29], an ADL recognition module using video cameras has been proposed. First, the data from the cameras are acquired and pre-processed. Next, objects and humans, along with their interactions, are detected via two neural networks. Then, the activities are recognized through another neural network. Finally, the data are post-processed and transmitted to the gateway using priority queues, where a smartcare system uses the results to monitor patients. However, an activity recognition system based on a single sensor type, such as a camera, is not robust. The authors of [30] described inertial sensor-based ambient assisted living. They denoised the signal using Chebyshev, Kalman, and dynamic data reconciliation filters. Next, windows of seven seconds each were extracted from the signal. Then, the signals were normalized, and signal energy, variance, frequency, and empirical mode decomposition features were extracted. Furthermore, the feature dimensions were reduced using Isomap and the activities were classified using CNN-biLSTM classifiers. However, while the proposed method recognizes simple activities, it is not a robust approach for the complex ADLs present in the daily routine.

IoT-Based Multimodal Systems
Researchers have proposed different multimodal systems. An ADL recognition system based on audio and depth modalities has been proposed in [31]. A CNN has been used to recognize ADLs from depth videos, although the system was not applicable to real-time ADL recognition due to its computationally expensive nature. In [32], an ADL recognition approach using two deep learning methods has been suggested. The input has been provided to both a CNN and a bidirectional long short-term memory network, and the CNN layers performed direct mapping. However, tuning the hyperparameters with a grid search was very computationally expensive, and thus this approach is not a feasible solution for real-time ADL recognition.
Due to differences in age, gender, weight, height, etc., the authors of [33] proposed personalized models. Personalization makes it possible for machine learning algorithms to objectively evaluate the performance of proposed systems. The study also considered the resemblances between the physical and signal forms. However, the accuracy improvements for the physical form, the signal form, and both fused together are not very impressive. Another hybrid approach using both motion sensors and cameras has been suggested in [34]. A motion-state layer and an activity layer have been used along with long short-term memory and CNN to recognize ADLs. Motion sensor data improved the classification according to the motion state, while videos were utilized for the specification of the ADL. However, due to the grouping of the motion state layer, the system was not able to produce acceptable results.
In [35], Žarić et al. presented a system to monitor the cooking process in home kitchens and to identify critical conditions related to elderly people. The proposed system utilized humidity, ultrasound, and temperature sensors as input to a system that is capable of generating an alert or a warning in case of a dangerous situation. They have also identified some cases for the analysis of the cooking process. A Moore finite-state machine, whose states correspond to the activities performed, has been used to generate outputs in the proposed decision-making system. Nevertheless, the proposed system is limited to the kitchen environment, and it is designed and tested only for electrical cooking plates. The authors of [36] described an ADL recognition and fall detection system using an Mbient sleeve sensor research kit, Imou smart cameras, proximity sensors, and the Microsoft SQL Server. They have given four concepts for fall detection: pose detection, data collection and processing, learning, and performance measurement. The complex activities have been further divided into atomic actions to detect the indoor localization. Then, the semantic relationship between accelerometer data, gyroscope data, and the associated actions is inferred, studied, analyzed, and interpreted. Further, the integrated data are split into training and testing sets and the accuracy is computed. However, the system could only achieve an accuracy of 81.13% due to the real-time environment and associated costs. The system focused on a limited set of activities performed by the subjects, and its performance on several other ADLs is not clear.

Materials and Methods
This system uses two types of data: inertial and video. A multimodality-based system has been proposed to recognize the complex forms of ADLs. It also aids the recognition of ADLs when some data are missing from one sensor. The inertial data have been filtered using a Butterworth filter, and the video frame sequences have been filtered by subtracting the background from the frames. Furthermore, the landmarks have been detected from the filtered frame sequences, and the filtered inertial data have been divided into windows of 5 s each. Then, the pre-processed data have been given to the features engineering layer to extract and reduce the huge number of features. Lastly, an ADL recognition layer has been utilized to classify the ADLs from both state-of-the-art datasets. A detailed architecture diagram for the multimodal IoT-based deep learning framework is shown in Figure 1. The following subsections further explain each layer of this architecture for ADL recognition.

Pre-Processing of Inertial Sensor Signals
Three types of data have been retrieved from the inertial measurement unit: accelerometer, gyroscope, and magnetometer data. The accelerometer provides the acceleration data for each ADL. The gyroscope measures the angular velocity, i.e., the rate of change in the sensor's orientation. The magnetometer measures the strength and direction of the magnetic field, which gives a point of reference that is important for estimating locomotion precisely. Noise is present in all types of raw data obtained from the sensors, including the inertial data. To remove this noise, this study applies the Butterworth filter [37], which is designed to have a frequency response that is as flat as possible in the passband. Figure 2 shows the acceleration signal before and after applying the Butterworth filter to the inertial data.
In the pre-processing layer for inertial data, and so that the next layer can process the filtered data properly without any missing values, we propose to use a data segmentation technique. After the filtration of the raw data, the inertial signals have been segmented using an overlapping windowing procedure [38]. Figure 3 gives a detailed view of the data segmentation applied over the acceleration signal. Each color in the figure represents a data segment from the signal.
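The filtering and segmentation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the filter order (second order), the cutoff frequency, the sampling rate, and the 50% overlap are all assumptions chosen for illustration, since the text specifies only a Butterworth filter, overlapping windows, and a window length of 5 s. The function names `butterworth_lowpass` and `overlapping_windows` are hypothetical.

```python
import math


def butterworth_lowpass(signal, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass filter (bilinear-transform biquad).

    The order and cutoff are illustrative assumptions, not values from the paper.
    """
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    q = 1.0 / math.sqrt(2.0)  # Q factor of a second-order Butterworth section
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    # Feed-forward (b) and feedback (a) coefficients, normalised by a0.
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0

    out = []
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def overlapping_windows(signal, fs_hz, window_s=5.0, overlap=0.5):
    """Split a signal into fixed-length windows that overlap by `overlap`.

    The 5 s window length follows the text; the 50% overlap is an assumption.
    """
    size = int(round(window_s * fs_hz))
    step = max(1, int(round(size * (1.0 - overlap))))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]
```

For example, a 10 s recording sampled at an assumed 30 Hz yields three windows of 150 samples each with 50% overlap. A higher-order filter would be built by cascading such sections with the appropriate Q values.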

Pre-Processing of Videos
To produce accurate results, the input videos need to be processed. First, the videos have been converted into frames and the extracted images have been resized [39]. Then, the background has been subtracted from the frame sequences in order to obtain the human silhouette for further processing. Figure 4 displays the human silhouette extracted after background subtraction. Afterwards, the head landmark has been detected using the human body shape and size [40], and the lowest point of the body has been taken as the foot point of the human, calculated as:
where $T_{Fo}^{f}$ signifies the foot landmark position in the frame sequence $f$, calculated using the frames' variance. The human position has been calculated as:
where $T_{HS}^{f}$ provides the human position in frame $f$ and $T_{E}^{f}$ denotes the boundary of the frame. From the head and foot points, the torso midpoint has been extracted, followed by the neck, knee, hip, elbow, and shoulder points.
After landmark detection, a 2D stick model [41] has been extracted by joining the skeleton points detected from the mined landmarks, as shown in Figure 5. Algorithm 1 describes the pre-processing layer in detail for landmark detection and 2D stick model development. First, the algorithm detects the head position and foot position in the human silhouette, which are recognized as landmarks. If the head position is detected, then the other body landmarks are recognized and the mid-point of each recognized landmark is also detected. Next, the algorithm continues to detect the mid-points for each landmark detected. Lastly, when all seven landmarks are detected, the stick model is extracted by connecting the mid-points.
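The first steps of this procedure can be sketched in Python under strong simplifying assumptions. The sketch takes a binary silhouette mask, treats the topmost foreground row as the head and the lowest foreground row as the foot, and places the torso at their midpoint. It does not reproduce Algorithm 1 or the paper's equations, covers only three of the landmarks, and uses the hypothetical names `head_foot_torso` and `stick_segments`.

```python
def head_foot_torso(silhouette):
    """Locate head, foot, and torso points in a binary silhouette.

    silhouette: list of rows, each a list of 0/1 values (1 = human pixel).
    Returns (row, column) pairs, or None if the mask is empty.
    """
    rows = [r for r, row in enumerate(silhouette) if any(row)]
    if not rows:
        return None

    def centre_col(r):
        # Mean column index of the foreground pixels in row r.
        cols = [c for c, v in enumerate(silhouette[r]) if v]
        return sum(cols) / len(cols)

    head = (rows[0], centre_col(rows[0]))    # topmost foreground row
    foot = (rows[-1], centre_col(rows[-1]))  # lowest foreground row
    torso = ((head[0] + foot[0]) / 2.0, (head[1] + foot[1]) / 2.0)
    return {"head": head, "foot": foot, "torso": torso}


def stick_segments(landmarks, order=("head", "torso", "foot")):
    """Connect consecutive landmarks into line segments of a stick model."""
    return [(landmarks[a], landmarks[b]) for a, b in zip(order, order[1:])]
```

A fuller version would add the neck, shoulder, elbow, hip, and knee points before connecting them, for instance by using body-proportion ratios between the head and foot points.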

Features Processing Layer
In the second layer, we propose to apply features extraction methodologies to both the inertial and the video data. Linear prediction cepstral coefficients (LPCCs) [42] have been computed for the inertial data using the equations:
where $\sigma^{2}$ displays an estimate increase, $lpcc_{n}$ and $x_{m}$ denote the LPCCs, and $e$ conveys the LPCC statistics. Figure 6 shows the LPCCs extracted for the jumping jacks activity from the Berkeley-MHAD dataset.
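Because the LPCC equations did not survive in this copy, the following Python sketch uses the common textbook formulation instead: linear prediction coefficients are estimated from the autocorrelation sequence with the Levinson-Durbin recursion and then converted to cepstral coefficients. This may differ from the authors' exact formulation, and the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a signal window."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] with x[n] ~ sum a_k x[n-k].

    Returns (coefficients, prediction error). Assumes r[0] > 0.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients to LPCCs c[1..n_ceps].

    Uses c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0 for n > p.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

As a sanity check, a first-order predictor with coefficient 0.5 gives cepstral coefficients 0.5, 0.5²/2, and 0.5³/3, which matches the series expansion of the logarithm of the all-pole model.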
When it comes to predicting the ADL, the motion direction flow can significantly support the recognition of activities. It is a context-based feature that identifies human movement patterns and directions [43]. The motion flow for the human body can be calculated as:
where $F$ denotes the frame sequence extracted from video $v$, $Md_{f}$ gives the motion flow direction of the current frame sequence, and $Md$ gives the motion flow direction from the previous frame. Figure 7 shows the motion direction flow for the jumping in place activity from the Berkeley-MHAD dataset.
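The motion flow equation is missing from this copy, so the following Python sketch shows only one simple way a per-frame direction could be obtained: the angle of the displacement of the silhouette centroid between consecutive frames. The use of centroids, the angle convention, and the function name `motion_directions` are assumptions for illustration and are not taken from the paper.

```python
import math


def motion_directions(centroids):
    """Direction of movement between consecutive frames, in degrees.

    centroids: list of (x, y) silhouette centres, one per frame.
    Angles are measured counter-clockwise from the positive x axis in [0, 360).
    A frame pair with no displacement yields None.
    """
    angles = []
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            angles.append(None)  # no movement, direction undefined
        else:
            angles.append(math.degrees(math.atan2(dy, dx)) % 360.0)
    return angles
```

In image coordinates the y axis usually points downwards, so the sign of `dy` would need to be flipped if angles are meant to follow the visual orientation of the frame.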
After the features extraction stage, the dimensions of the feature vector have increased immensely. Therefore, to reduce the feature vector size, we have applied the genetic algorithm [44]. It involves a few biologically inspired operations, including mutation, selection, mating, and crossover of the chromosomes. We have utilized the fitness formula given as:
where $x_{i}$ denotes the scaling factor selected for inertial-based features, $y_{i}$ gives the average over all subjects in both datasets for inertial-based features, $x_{f}$ provides the scaling factor chosen for frame sequences-based features, $y_{f}$ shows the average over all subjects in both datasets for frame sequences-based features, $f_{n}$ denotes the number of features representing chromosomes, and $\alpha$ is the scale factor, set to 0.5. The detailed view of the genetic algorithm is represented in Figure 8.
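The selection, mating, crossover, and mutation steps can be sketched in Python as a feature-selection loop over binary masks. The fitness formula of the paper is not recoverable from this copy, so the `fitness` function below is a hypothetical stand-in: it rewards the summed relevance of the selected features and penalizes the fraction of features kept, weighted by α = 0.5. The population size, number of generations, and mutation rate are likewise illustrative assumptions.

```python
import random


def fitness(mask, relevance, alpha=0.5):
    """Hypothetical fitness: relevance gained minus a size penalty."""
    selected = sum(mask)
    gain = sum(r for r, m in zip(relevance, mask) if m)
    return gain - alpha * selected / len(mask)


def genetic_select(relevance, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve a binary feature mask; needs at least two features."""
    rng = random.Random(seed)
    n = len(relevance)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, relevance), reverse=True)
        history.append(fitness(pop[0], relevance))
        parents = pop[: pop_size // 2]   # selection: keep the fitter half
        children = [pop[0][:]]           # elitism: best mask survives unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)  # mating
            cut = rng.randrange(1, n)                # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=lambda m: fitness(m, relevance))
    return best, history
```

Because the best mask of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. In practice the relevance scores would be replaced by a measure computed from the fused inertial and video features.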

ADL Recognition Layer
The CNN [45] takes both data types, assigns weights and biases to different features, and classifies one activity from another. It is considered to be among the most effective algorithms for recognition, retrieval, and classification. Multiple layer-based variants are used by researchers in the literature. A CNN contains three types of layers: input, hidden, and output layers. The hidden layers contain multiple combinations of convolution, pooling, fully connected, and softmax layers. Each node also uses an activation function, which was selected as the rectified linear unit (ReLU) [46]. We set the learning rate to 0.002 and the maximum number of epochs to 100. Figure 9 helps in understanding the CNN model for the ADL recognition layer. The input layer had an activation shape of (32, 32, 3) with an activation size of 3072 and no parameters. Next, the first convolution layer, with ReLU activation and 5 × 5 filters, had a (28, 28, 8) activation shape, a 6272 activation size, and 608 parameters. Then, the first pooling layer had a (14, 14, 8) activation shape and a 1568 activation size with 0 parameters. Further, the second convolutional layer, also with 5 × 5 filters, had a (10, 10, 16) activation shape, a 1600 activation size, and 3216 parameters. Moreover, the second pooling layer had a (5, 5, 16) activation shape and a 400 activation size with 0 parameters. A flatten layer was used next. Two fully connected layers followed, with (120, 1) and (84, 1) activation shapes, 120 and 84 activation sizes, and 48,120 and 10,164 parameters, respectively. Finally, a softmax layer with a (10, 1) activation shape, an activation size of 10, and 850 parameters has been used.
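The activation sizes and parameter counts listed above can be checked with a short Python calculation. It assumes 5 × 5 kernels with stride 1 and no padding, and 2 × 2 pooling, which is the configuration consistent with the reported shapes and counts. The helper names are hypothetical.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square convolution kernel."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


def conv_output(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_output(size, window=2):
    """Spatial size after non-overlapping pooling."""
    return size // window


# Spatial sizes through the network: 32 -> 28 -> 14 -> 10 -> 5.
s1 = conv_output(32, 5)
p1 = pool_output(s1)
s2 = conv_output(p1, 5)
p2 = pool_output(s2)
flattened = p2 * p2 * 16  # input size of the first fully connected layer

layer_params = {
    "conv1": conv_params(5, 3, 8),
    "conv2": conv_params(5, 8, 16),
    "fc1": dense_params(flattened, 120),
    "fc2": dense_params(120, 84),
    "softmax": dense_params(84, 10),
}
total_params = sum(layer_params.values())
```

Under these assumptions the counts reproduce the values in the text (608, 3216, 48,120, 10,164, and 850), for a total of 62,958 trainable parameters.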

Dataset Experimental Setup and Results
A brief overview of the datasets utilized, experiments performed on them, and their results is discussed in this section.

Datasets Description: Berkeley-MHAD and Opportunity++
Berkeley-MHAD [23], an open access dataset and one of the earliest multimodal datasets, has been used to validate the proposed system experimentally. It contains 12 IoT-based ADLs performed in an indoor environmental setting. Figure 10 presents sample frame sequences from the Berkeley-MHAD dataset. Another publicly available dataset, Opportunity++ [22], is utilized to perform experiments on the proposed ADL model. A total of 12 subjects performed different IoT-based ADLs in an indoor environment. Figure 11 shows sample frame sequences from the Opportunity++ dataset.
In order to obtain a less biased and less optimistic estimate of the proposed ADL recognition system, we have used a 10-fold cross-validation technique to evaluate the system's accuracy. The datasets have been shuffled randomly and split into 10 groups. Each group in turn is used as the test set while the remaining groups are used to train the proposed ADL recognition model. An evaluation score is obtained for each test group, and the model's performance is determined from these scores.
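The 10-fold procedure described above can be sketched in Python as follows. The sketch only produces the index splits and averages a score supplied by the caller. The `evaluate` callback, the seed, and the function names are hypothetical placeholders for the actual training and testing of the ADL model.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices, split them into k groups, and yield
    (train, test) index lists with each group used once as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Here `evaluate` would train the classifier on the `train` indices and return its accuracy on the `test` indices, so every sample contributes to exactly one test score.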

Experimental Settings and Results
All the calculations and experiments have been performed on a DELL laptop (purchased in Islamabad, Pakistan) with an Intel® Core™ i7 4th-generation CPU @ 2.4 GHz, 24 GB of RAM, and 64-bit Windows 10. MATLAB (R2017a) was the software used for the complete experimentation.

Experiment 1: Confusion Matrices over Opportunity++ and Berkeley-MHAD
This subsection presents the confusion matrices obtained for the ADL recognition experiments performed on the Berkeley-MHAD and Opportunity++ datasets. Tables 1 and 2 report the true positives, false positives, true negatives, and false negatives [47][48][49] attained over both datasets with recognition through CNN.
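As an illustration of how the four counts relate to a multi-class confusion matrix, the sketch below derives them per class in a one-vs-rest fashion. The row/column convention (rows are true classes, columns are predicted classes) and the toy matrix are assumptions made for this example; the values are not taken from Tables 1 or 2.

```python
def per_class_counts(confusion):
    """Derive TP, FP, FN, TN for every class from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    counts = []
    for c in range(n):
        tp = confusion[c][c]
        # Samples of class c that were predicted as some other class.
        fn = sum(confusion[c]) - tp
        # Samples of other classes that were predicted as class c.
        fp = sum(confusion[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        counts.append({"TP": tp, "FP": fp, "FN": fn, "TN": tn})
    return counts


# Toy 3-class matrix with made-up values, used only to exercise the function.
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
counts = per_class_counts(toy)
```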

Experiment 2: Confidence Levels over Skeleton Points
We also calculated the confidence levels for each body part identified in the landmark detection and 2D stick model generation stages. Table 3 gives a detailed view of the 11 body points identified along with their confidence levels [50][51][52] in the range [0, 1]. Mean accuracies of 84.12% and 84.17% have been achieved by the proposed IoT-based multimodal system over the Opportunity++ and Berkeley-MHAD datasets, respectively.
We have further assessed the proposed system by comparing it with two well-known classification methods: artificial neural network (ANN) [53,54] and AdaBoost [55,56] classifiers. Both models were trained using the scikit-learn library. For the ANN, we used an input layer, two hidden layers, and an output layer. Each hidden layer contains 50 neurons, and gradient descent with momentum was selected as the learning algorithm. The mini-batch size is 50, the momentum is 0.15, the number of epochs is 500, and the biases were initialized to 0. Initial weights were drawn randomly from a normal distribution, and the learning rate decay is exponential. For AdaBoost, we set the base learners to decision trees with a maximum depth of 5 levels and the number of base estimators to 50. The learning rate was set to 0.001 to avoid unnecessary delays during the testing phase, and the estimator weights were chosen randomly.
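To make the ANN learning rule concrete, the following toy sketch applies gradient descent with momentum and an exponentially decayed learning rate to a one-dimensional quadratic. The momentum (0.15) and epoch count (500) follow the text; the initial learning rate `lr0`, the decay constant `decay`, and the objective itself are assumptions chosen only for illustration and are not the settings of the reported experiments.

```python
import math


def sgd_momentum(grad, w0, lr0=0.1, decay=0.01, momentum=0.15, epochs=500):
    """Minimise a 1-D function with momentum and an exponentially decayed step size.

    lr0 and decay are illustrative assumptions; momentum and epochs follow the text.
    """
    w, velocity = w0, 0.0
    for epoch in range(epochs):
        # Exponential learning-rate decay: the step shrinks with every epoch.
        lr = lr0 * math.exp(-decay * epoch)
        # Momentum keeps a fraction of the previous update direction.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


# Toy objective f(w) = (w - 3)^2 with gradient 2 (w - 3); its minimum is at w = 3.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

A real network would apply the same update to every weight and bias using gradients computed on mini-batches of 50 samples.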
It is evident from Tables 4 and 5 that our proposed model has achieved higher precision, recall [57], and F1-score [58] on both selected datasets, which shows that the multimodal IoT-based ADL recognition system using CNN has outperformed the other classifiers. The equations for precision, recall, and F1-score are as follows: $p = \frac{TP}{TP + FP}$, $r = \frac{TP}{TP + FN}$, and $F\text{-}m = \frac{2 \cdot p \cdot r}{p + r}$, where $p$ is the precision, $r$ is the recall, and $F\text{-}m$ is the F1-score. $TP$ denotes the true positives, $FP$ the false positives, $FN$ the false negatives, and $TN$ the true negatives. Further, to validate the performance of the proposed IoT-based recognition system, Table 6 compares it with other state-of-the-art methodologies presented in the literature. It is evident from the table that our proposed system outperformed the others in terms of accuracy for the Opportunity++ [59,60] and Berkeley-MHAD [61][62][63] datasets.
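The three metrics can be computed directly from the raw counts, as in the short sketch below. The F1-score is taken as the harmonic mean of precision and recall (the standard definition), and returning 0.0 when a denominator is zero is a convention assumed here. The example counts are made up and are not taken from Tables 4 or 5.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from raw counts.

    Returns 0.0 for any ratio whose denominator is zero (an assumed convention).
    """
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```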

Discussion
The proposed ADL recognition system has focused on the usage of IoT-based devices for collecting data from humans, including elderly people and patients, at a certain place. The data collected can be in the form of videos, their sequences, audio, locks, etc. A smart home or a private room in a hospital is a person's private and protected space. These IoT-based devices give rise to privacy and protection concerns, which can be mitigated by introducing multiple privacy mechanisms. Some studies proposed introducing a minimal amount of noise into the data in order to protect the privacy of a home [64][65][66][67]. A few articles proposed providing an infrastructure for such devices that can send personalized notices and give the option of obtaining a person's user preferences [68][69][70]. An auto-configuration support system has also been proposed to ensure that whenever a new device is attached to the existing system, it is auto-configured according to the security protocols and user preferences [71][72][73]. Moreover, in the datasets selected for this study, the faces of the individuals have been blurred to maintain the privacy of users [74][75][76].
ADL recognition has been achieved successfully using the proposed model with landmark detection and a 2D stick model along with inertial sensor signal processing. This method requires extracting different body points to build the 2D stick model. However, a few ADLs could not achieve the ideal 2D stick model shape, which caused the accuracy rates to decrease. Figure 12 gives examples of such activities encountered during the ADL recognition stage. The landmark areas marked by red dotted circles show that the mid-points of body landmarks can be mixed up in specific body postures, which compromises the 2D stick model and the accuracy rate.

Conclusions and Future Work
Our proposed method for IoT-based ADL recognition is a novel contribution to elderly home monitoring systems. It combines multimodal sensors to recognize ADLs efficiently. First, the multimodal data are filtered using multiple filtering techniques. Next, the inertial data are segmented using windows, and the vision data are used to find the landmarks and create the 2D stick model. Then, we used state-of-the-art techniques, namely LPCCs and motion direction flow determination, for the inertial and video data, respectively. Further, to reduce the dimensionality, we utilized a genetic algorithm with a novel fitness function. Lastly, an efficient deep learner, CNN, has been applied over the reduced features to classify the ADLs. Mean accuracies of 84.12% and 84.17% have been achieved over the Opportunity++ and Berkeley-MHAD datasets, respectively. The results have shown that the proposed ADL recognition technique performs well in terms of the confidence levels of body landmark detection, the accuracy rate of the system, and comparisons with other state-of-the-art methodologies.
In the future, we will focus on the privacy issues and on improving the 2D stick model. Another shortcoming worth mentioning is that the proposed system removes the background from videos provided by immobile indoor cameras, so it might not work when the data contain different background settings. Thus, the system will be implemented over more generalized environmental settings and data.

Figure 1. The architecture diagram of the multimodal IoT-based deep learning framework for ADL recognition.

Figure 2. Sample motion sensor signals after the filters have been applied.

Figure 3. Detailed view of the data segmentation applied over the inertial signal, with segments shown in multiple colors. The red dotted box shows a single segment of data.

Figure 4. (a) Real video frame and (b) extracted human figure after background extraction for the bending activity in the Berkeley-MHAD dataset.

Figure 5. (a) Human silhouette and (b) 2D stick model, where each red dot represents a detected body point, green lines show the upper body skeleton, and orange lines show the lower body skeleton.

Algorithm 1: Landmark detection and 2D stick model creation
Input: human silhouette detected after background subtraction;
Output: 2D_stik_mod: 2D stick model;
/* human body landmark detection in input data */
/* HSS is for human silhouette size */
/* HP is for head position */
/* FP is for foot position */
/* TP is for torso position */
/* NP is for neck position */
/* KP is for knee position */
/* HNP is for hand position */
/* EP is for elbow position */
for all pixel in HSS
    if pixel = HP | FP | TP | NP | KP | HNP | EP
        landmark = pixel
    end
    if all landmarks detected
        2D_stik_mod = connect … to …
    end
    return 2D_stik_mod
end;
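The connection step of Algorithm 1 might look like the sketch below. It assumes that the seven landmark types expand to 11 body points (head, neck, and torso, plus left and right elbow, hand, knee, and foot); the landmark names and the edge lists `UPPER_EDGES` and `LOWER_EDGES` are hypothetical and chosen only to illustrate joining detected points into upper-body and lower-body segments. They are not the authors' exact skeleton topology.

```python
# Hypothetical landmark names: head, neck, torso plus left/right elbow, hand, knee, foot.
UPPER_EDGES = [
    ("head", "neck"), ("neck", "torso"),
    ("neck", "l_elbow"), ("l_elbow", "l_hand"),
    ("neck", "r_elbow"), ("r_elbow", "r_hand"),
]
LOWER_EDGES = [
    ("torso", "l_knee"), ("l_knee", "l_foot"),
    ("torso", "r_knee"), ("r_knee", "r_foot"),
]
LANDMARKS = sorted({name for edge in UPPER_EDGES + LOWER_EDGES for name in edge})


def build_stick_model(points):
    """Return line segments for the stick model, or None if a landmark is missing.

    points maps each landmark name to an (x, y) pixel coordinate.
    """
    if any(name not in points for name in LANDMARKS):
        # Mirrors the "if all landmarks detected" check in Algorithm 1.
        return None
    return {
        "upper": [(points[a], points[b]) for a, b in UPPER_EDGES],
        "lower": [(points[a], points[b]) for a, b in LOWER_EDGES],
    }
```

Returning `None` when a landmark is missing reflects the case discussed for Figure 12, where an incomplete or confused set of body points prevents an ideal stick model from being formed.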

Figure 7. Upward motion direction flow in the Jumping in Place ADL.

Figure 8. Detailed view of feature optimization via the genetic algorithm.

Figure 12. Examples of problematic ADLs over Berkeley-MHAD, where red dotted circles point out the skeleton extraction problems.

Table 1. Confusion matrix for ADL recognition by the proposed approach through CNN over the Opportunity++ dataset.

Table 2. Confusion matrix for ADL recognition by the proposed approach through CNN over the Berkeley-MHAD dataset.

Table 3. Confidence levels over Berkeley-MHAD and Opportunity++ for the body points detected.

Table 4. Comparative analysis with other well-known classifiers in terms of precision and recall over the Berkeley-MHAD dataset.

Table 5. Comparative analysis with other well-known classifiers in terms of precision and recall over the Opportunity++ dataset.

Table 6. Comparative analysis with other state-of-the-art techniques over both datasets.