Quantum Water Strider Algorithm with Hybrid-Deep-Learning-Based Activity Recognition for Human–Computer Interaction

Featured Application: This work is based on automatic human activity recognition, which has drawn much attention in video analysis technology due to growing demands from many applications, such as surveillance environments, entertainment environments, and healthcare systems.

Abstract: Human action and activity recognition provide cues that facilitate human behavior analysis. Human action recognition (HAR) has become a significant challenge in various applications involving human-computer interaction (HCI) and intelligent video surveillance for enhancing security in distinct fields. Precise action recognition is highly challenging because of variations in clutter, backgrounds, and viewpoint. Recognition performance depends on proper feature extraction and learning from the data. Deep learning (DL) models have achieved effective performance in several image-related tasks. In this view, this paper presents a new quantum water strider algorithm with hybrid-deep-learning-based activity recognition (QWSA-HDLAR) model for HCI. The proposed QWSA-HDLAR technique mainly aims to recognize different types of activities. To recognize activities, the QWSA-HDLAR model employs a deep-transfer-learning-based neural-architectural-search-network (NASNet) feature extractor to generate feature vectors. In addition, the presented QWSA-HDLAR model exploits a QWSA-based hyperparameter tuning process to choose the hyperparameter values of the NASNet model optimally. Finally, the classification of human activities is carried out by the use of a hybrid convolutional neural network with a bidirectional recurrent neural network (HCNN-BiRNN) model. The experimental validation of the QWSA-HDLAR model is tested using two datasets, namely the KTH and UCF Sports datasets. The experimental values reported the supremacy of the QWSA-HDLAR model over recent DL approaches.


Introduction
In the digital era, human beings require an advanced level of computer intelligence [1]. Human-computer interaction (HCI) is no longer restricted to hardware-based interaction. Smarter interaction techniques have gradually entered people's lives, namely a range of highly intelligent techniques based on voice recognition, face recognition, and gesture recognition [2]. Intelligent mechanisms help establish interaction between computers and humans, and such convenient interaction approaches have become a major development trend in the present HCI domain. The main objective of HCI advancement is to make effective computers that adapt to and serve the requirements of humans [3], making the interaction people-centered rather than forcing individuals to adapt to the computer. Gaining HCI data allows very efficient learning and the creation of smarter systems [4]. Machine learning (ML) is a significant branch of artificial intelligence (AI). It performs well in several domains and shows powerful research and development (R&D) potential.
With the advent of ML technology in HCI, machines have become very intelligent. Among researchers in both industry and academia aiming at ubiquitous computing, the most broadly discussed research concept regarding HCI has been human activity recognition (HAR) [5,6]. Recently, the amount of research on HAR has increased rapidly due to the extensive availability of sensors, improvements in power utilization, reductions in cost, and the resulting technological advances in ML approaches; sensor data can now be live streamed through the Internet of Things (IoT) and processed with AI [7]. The growth in HAR has facilitated practical implementations in several real-world domains, including the medical sector, tactical military applications, the detection of crime and violence, and sports science [8]. The extensive range of conditions to which HAR is applicable is evidence that the domain holds powerful capabilities for enhancing living standards [9].
Mathematical methods applied to human activity data enable the recognition of a diversity of human activities, for instance, walking, running, sitting, standing, and sleeping. HAR mechanisms fall into two major groups: sensor-based systems and video-based systems. Time-series classification tasks are a major difficulty in HAR, i.e., whenever the movements of individuals are forecasted with the help of sensory data [10]. These tasks usually require accurately extracting features from raw data using signal processing approaches and deep field expertise before fitting an ML model. Recent research has demonstrated the capability of deep learning (DL) techniques, including long short-term memory (LSTM) neural networks and convolutional neural networks (CNNs), to automatically extract meaningful attributes from raw sensor data and attain state-of-the-art outcomes [11,12].
This paper presents a new quantum water strider algorithm with hybrid-deep-learning-based activity recognition (QWSA-HDLAR) model for HCI. The proposed QWSA-HDLAR technique employs a deep-transfer-learning-based neural-architectural-search-network (NASNet) feature extractor to generate feature vectors. In addition, the presented QWSA-HDLAR model exploits a QWSA-based hyperparameter tuning process for the NASNet model. Finally, the classification of human activities is carried out by the use of a hybrid convolutional neural network with a bidirectional recurrent neural network (HCNN-BiRNN) model. The experimental validation of the QWSA-HDLAR model is tested using two datasets, namely the KTH and UCF Sports datasets. In short, the major contributions are listed as follows:

• An automated QWSA-HDLAR technique encompassing NASNet-based feature extraction, QWSA-based hyperparameter tuning, and HCNN-BiRNN-based classification is presented for the identification and classification of human activities in HCI. To the best of our knowledge, the presented QWSA-HDLAR technique does not exist in the literature.

• A QWSA-optimized NASNet model is employed to extract feature vectors, where the QWSA helps accomplish enhanced classification results through the hyperparameter tuning process.

• The performance of the QWSA-HDLAR technique is validated using two datasets, namely the KTH and UCF Sports datasets.

Related Works
In [13], a novel technique was devised for action recognition. The presented technique was based on shape features and DL feature fusion. A two-step approach was performed: human extraction followed by action recognition. In the initial step, humans were detected through a simple learning process, during which HOG features were derived from selected datasets. After choosing the most powerful features through entropy-controlled feature selection, linear support vector machine (LSVM) maximization and detection were executed. Secondly, geometric features were derived from the detected regions, and in parallel, DL features were derived from the original video frames. The resulting feature vector was classified with a cubic multiclass SVM. Jaoued et al. [14] recommend a new technique for HAR depending upon a hybrid DL method. The devised method was assessed on the challenging UCF101, KTH, and UCF Sports datasets.
Zheng et al. [15] examine the impact of segmentation techniques on DL method performance and compare four data transformation methods. The multichannel technique, which included three overlapped color channels, generated the best performance. Additionally, the multichannel method was applied to three public datasets and generated satisfying outcomes for multisource acceleration data. Tanberk et al. [16] devise a hybrid deep method for understanding and interpreting videos, aiming at HAR. The devised architecture was built by combining dense auxiliary movement information and an optical flow approach on video datasets with the help of DL approaches. To the best of their knowledge, it was the first study to combine an LSTM fed by auxiliary data with a 3D-CNN fed by optical flow on video frames for HAR.
Abdulazeem et al. [17] devise a structure with three main stages for HAR: preprocessing, pretraining, and recognition. This structure provides a set of new methods, which are three-fold as follows: first, in the pretraining stage, a standard CNN is trained on a generic dataset to adjust the weights; second, this pretrained method is then applied to the target dataset to perform the recognition procedure; and finally, the recognition stage exploits CNN and LSTM components in five distinct architectures. Ronald et al. [18] devise iSPLInception, a DL method motivated by Google's Inception-ResNet architecture, which not only attains high prediction accuracy but also uses few device resources. The researchers in [19] devise the late fusion of a HAR classifier and visual recognition. Vision is utilized to recognize the several screws assembled in a mock part, while HAR from body-worn inertial measurement units (IMUs) categorizes the actions performed while assembling the parts. CNN techniques were utilized in both classifier modalities before several late fusion approaches were examined for producing a final state estimate.

The Proposed Model
In this paper, a novel QWSA-HDLAR model is developed for the recognition of human activities in the HCI environment. The proposed QWSA-HDLAR technique initially applies a NASNet model to derive a collection of feature vectors. Additionally, the presented QWSA-HDLAR model utilizes a QWSA-based hyperparameter tuning process to choose the hyperparameter values for the NASNet model optimally. Finally, the classification of human activities is carried out using the HCNN-BiRNN model.

Feature Extraction: NASNet Model
Primarily, the proposed QWSA-HDLAR technique exploits the NASNet model to derive a collection of feature vectors. Transfer learning is one of the most influential and popular techniques for handling smaller datasets through a pretrained network. A pretrained network is one previously trained on a massive dataset, generally for an image classification task, whose architecture and weights are then retained. If the primary dataset is relatively large and sufficiently general, then the feature set that the pretrained network has learned can serve as a generic visual model. Thus, these features can assist various computer vision tasks, even when the new task involves completely different classes from the primary task [20]. Transfer learning from a pretrained network is exploited in two ways: feature extraction and fine-tuning. Feature extraction uses the convolution base of the pretrained network to extract features from the new dataset and then trains a new classifier on top of its output.
Fine-tuning complements the feature extraction model: it unfreezes the final layers of the frozen convolution base employed for extracting features. The unfrozen layers are then retrained together with the new classifier formerly learned in the feature extraction model. Fine-tuning aims to adjust the pretrained model's most abstract features to make them more pertinent to the new task. The following steps are involved in this study:

• A pretrained NASNet is considered, and the classification base is detached.

• The convolution base of the pretrained model is frozen.

Equipped with engineering expertise and a large amount of computational power, Google launched NASNet [21], framing the search for an optimal CNN model as a reinforcement learning (RL) problem. RL is a type of ML approach that allows an agent to discern the best action in a virtual environment, attaining its goal through feedback from its own experiences and actions. The concept is to search for the optimal combination of parameters for the given number of layers, searching over filter sizes, strides, output channels, etc. In this RL setting, the reward after every search action is the accuracy of the searched model on the provided dataset. In NASNet, only the general framework is predetermined; the cells or blocks are not predetermined by the researchers. Rather, they are searched through the RL search technique. The structure of the NASNet model is shown in Figure 1.
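As an illustration of the feature extraction steps above, the following minimal sketch loads an ImageNet-pretrained NASNet, freezes its convolution base, and maps frames to feature vectors. The TensorFlow/Keras framework, the NASNetMobile variant, and the 224x224 input size are our assumptions for illustration, not details stated in this study:

```python
import numpy as np
import tensorflow as tf

# Load NASNet pretrained on ImageNet, with the classification base detached.
# NASNetMobile and the 224x224 input size are illustrative choices.
base = tf.keras.applications.NASNetMobile(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolution base (feature extraction mode)

def extract_features(frames):
    """Map a batch of RGB frames (N, 224, 224, 3) to feature vectors."""
    x = tf.keras.applications.nasnet.preprocess_input(
        np.asarray(frames, dtype="float32"))
    return base.predict(x, verbose=0)  # shape: (N, 1056) for NASNetMobile

# Example: feature vectors for 8 dummy frames.
feats = extract_features(np.random.randint(0, 255, (8, 224, 224, 3)))
print(feats.shape)
```

For fine-tuning rather than pure feature extraction, the last few layers of `base` would be unfrozen and retrained at a small learning rate together with the new classifier.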

Furthermore, the number of initial convolution filters and the number of motif repetitions N are free parameters used for scaling. The searched cells are of two types, named reduction and normal cells. A reduction cell is a convolution cell that returns a feature map whose width and height are reduced by a factor of 2, while a normal cell is a convolution cell that returns a feature map of the same dimensions. NASNet achieved state-of-the-art results in the ImageNet competition, but the computational power required to obtain NASNet was far greater than what a small company restricted to common methodologies could provide.

Hyperparameter Tuning: QWSA Model
In this study, the QWSA-based hyperparameter tuning process optimally chooses the hyperparameter values of the NASNet model. The WSA performs well on the majority of problems; occasionally, however, it can become trapped in local optima and converge prematurely [22]. Here, the concept of quantum computing is taken into account. In quantum space, the locations of the male and female water striders (WSs) cannot be defined deterministically; therefore, the location of a WS must be determined using the wave function $\psi(X, t)$, where $X$ denotes the location of the WS. The square of its modulus indicates the likelihood density of the WS appearing at location $X$ in space, which can be expressed as follows [23]:

$$Q(X, t) = |\psi(X, t)|^2 \tag{1}$$

In Equation (1), $Q$ defines a likelihood density function which fulfills the normalization condition

$$\int_{-\infty}^{+\infty} Q(X, t)\, dX = 1 \tag{2}$$

The location of a WS is then sampled using the Monte Carlo method, and its update formula is provided as follows:

$$X_i(t+1) = P_i(t) \pm \beta \cdot D \cdot \ln\!\left(\frac{1}{\delta}\right), \qquad D = |m(t) - X_i(t)|, \qquad P_i(t) = \gamma P_b + (1-\gamma) P_g, \qquad m(t) = \frac{1}{M}\sum_{i=1}^{M} P_i(t)$$
From the expression, $M$ defines the population size; $\delta$ and $\gamma$ signify arbitrary numbers lying within $[0, 1]$; $P_i(t)$ determines the local attraction point of the $i$-th WS at iteration $t$, which places every WS at an arbitrary point between the global and individual optimum locations; $D$ signifies the weighted distance between the candidate and the mean optimum location of the population; $m(t)$ defines the mean value of the individual optimum locations of the WSs; $G_{max}$ determines the maximum iteration count; and $\beta$ denotes the shrinkage-expansion coefficient, exploited to control the individual convergence rate, which decreases linearly within $[\beta_a, \beta_b]$. In many instances, $[\beta_a, \beta_b] = [1, 0.5]$, and $P_b$ and $P_g$ define the individual best and global best locations, respectively. The flowchart of the WSA is shown in Figure 2.
The quantum update is implemented in the mating phase of the original WSA, producing a candidate location $X_i^*$ for each WS. If $X_i^*$ yields a better outcome than $X_i(t)$, then $X_i(t+1) = X_i^*$; otherwise, $X_i(t+1) = X_i(t)$, as explained in Algorithm 1. The QWSA method employs a fitness function for achieving enhanced classifier performance; it assigns a positive value indicating the quality of a candidate solution. In this article, reduction of the classifier error rate is regarded as the fitness function, as provided in Equation (9):

$$fitness(x_i) = ClassifierErrorRate(x_i) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100 \tag{9}$$

The optimum solution attains a minimal error rate, and a poor solution attains a higher error rate.
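To make the tuning loop concrete, the sketch below is our illustrative reading of the quantum-behaved position update and the error-rate fitness of Equation (9), not the authors' released code; the two-dimensional search space (learning rate, dropout) and the `error_rate` surrogate are hypothetical placeholders for a real train-and-validate cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

def error_rate(x):
    """Hypothetical stand-in for Eq. (9): train the NASNet model with
    hyperparameters x = (learning_rate, dropout) and return the
    validation error rate in percent. Here a toy surrogate is used."""
    return 100.0 * ((x[0] - 1e-3) ** 2 * 1e5 + (x[1] - 0.3) ** 2)

M, dim, G_max = 20, 2, 50                       # population size, dims, iterations
low, high = np.array([1e-5, 0.0]), np.array([1e-2, 0.9])
X = rng.uniform(low, high, (M, dim))            # water strider locations
P_best = X.copy()                               # individual best locations P_b
f_best = np.array([error_rate(x) for x in X])   # fitness of individual bests

for t in range(G_max):
    beta = 1.0 - 0.5 * t / G_max                # shrinks linearly from 1.0 to 0.5
    g = P_best[np.argmin(f_best)]               # global best location P_g
    m = P_best.mean(axis=0)                     # mean best location m(t)
    for i in range(M):
        gamma = rng.random(dim)
        delta = rng.uniform(1e-12, 1.0, dim)    # avoid log(1/0)
        P = gamma * P_best[i] + (1 - gamma) * g           # local attractor P_i(t)
        sign = np.where(rng.random(dim) < 0.5, -1.0, 1.0)
        cand = P + sign * beta * np.abs(m - X[i]) * np.log(1.0 / delta)
        cand = np.clip(cand, low, high)
        if error_rate(cand) < error_rate(X[i]):           # greedy acceptance
            X[i] = cand
        if error_rate(X[i]) < f_best[i]:
            f_best[i], P_best[i] = error_rate(X[i]), X[i].copy()

print("best hyperparameters:", P_best[np.argmin(f_best)], "error:", f_best.min())
```

In a full pipeline, each fitness evaluation would retrain or fine-tune NASNet with the candidate hyperparameters and measure validation error, which is why the population size and iteration budget are kept small.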

Activity Recognition: HCNN-BiRNN Model
At the final stage, the classification of human activities is carried out by the use of the HCNN-BiRNN model. The HCNN-BiRNN hybrid model contains two major components: a BiRNN with an attention model [24] on the top half and a CNN with $L$ convolution layers on the bottom half. These components of the model are jointly trained in an end-to-end manner. One sample $X \in \mathbb{R}^{1 \times W \times 1}$ is a real-valued vector ($W$ refers to the range dimension length); therefore, a 1D-CNN module is applied, in which the convolutional operation occurs along the range dimension. For the initial layer, a convolutional operation with stride length 1 applies $K_1$ filters $W^{(1)} \in \mathbb{R}^{1 \times w^{(1)} \times 1}$ to $X$, resulting in the feature maps of layer 1, $H^{(1)} \in \mathbb{R}^{1 \times W^{(h_1)} \times K_1}$. For the next $L-1$ convolutional layers, convolution-pooling operations repeatedly apply $K_l$ filters $W^{(l)} \in \mathbb{R}^{1 \times w^{(l)} \times K_{l-1}}$ to $H^{(l-1)}$, and the feature map $H^{(L)} \in \mathbb{R}^{1 \times W^{(h_L)} \times K_L}$ is obtained after the $L$-th convolution layer.
Every convolution layer is followed by a pooling layer; therefore, the time dimension is shortened, and the temporal dependency captured increases as the number of convolution layers grows. After dropping the singleton dimension, $H^{(L)} \in \mathbb{R}^{W^{(h_L)} \times K_L}$ is regarded as a series of length $W^{(h_L)}$ with a $K_L$-dimensional feature vector at every time step. In the following, we denote the $W^{(h_L)}$ time steps by $T$ for notational convenience. The forward recurrent neural network (RNN) reads $H^{(L)}$ in its original order and produces a hidden state $fh_t$ at each time step, and the backward RNN reads $H^{(L)}$ in reverse order and generates $bh_t$, as follows:

$$fh_t = f\left(W_{fxh} H^{(L)}_t + W_{fhh}\, fh_{t-1}\right), \qquad bh_t = f\left(W_{bxh} H^{(L)}_t + W_{bhh}\, bh_{t+1}\right)$$

From the expression, $W_{fxh} \in \mathbb{R}^{m \times K_L}$ and $W_{bxh} \in \mathbb{R}^{m \times K_L}$ refer to the input-hidden weights, $W_{fhh} \in \mathbb{R}^{m \times m}$ and $W_{bhh} \in \mathbb{R}^{m \times m}$ denote the weights that connect the hidden layers, $m$ denotes the dimensionality of the hidden state, and $f$ indicates the sigmoid function. Next, the concatenation of the forward and backward states $fh_t$ and $bh_t$ creates $h_t = [fh_t, bh_t]$. Consequently, every hidden state $h_t$ comprises data from the entire sequence, with stronger emphasis on the part near the region at step $t$. In a BiRNN, the data at earlier time steps may gradually be lost along the forward and backward propagation.
To prevent this loss of data and to focus on the discriminative time steps automatically, the attention module is introduced, which, as a byproduct, is capable of relaxing the misalignment problem. In this technique, we adopt a multilayer perceptron to calculate the attention weights based on the hidden states and denote $g$ as an invariant feature vector, i.e., the weighted sum of $h_t$, $t = 1, \ldots, T$:

$$g = \sum_{t=1}^{T} \alpha_t h_t$$

The weight $\alpha_t$ is calculated using the following equation:

$$\alpha_t = \frac{\exp\left(U_a \tanh\left(W_a h_t\right)\right)}{\sum_{k=1}^{T} \exp\left(U_a \tanh\left(W_a h_k\right)\right)}$$

Let $U_a \in \mathbb{R}^{1 \times n}$ and $W_a \in \mathbb{R}^{n \times 2m}$ be the parameters of the attention module; the weight $\alpha_t$ is the coefficient that scores the matching degree between the recognition task and the hidden state $h_t$. The invariant feature vector $g$ incorporates the data at each time step based on the discriminative power of the hidden states. Given the invariant feature vector $g$, we adopt the $softmax$ function to predict the label of the input sample $X$, as follows:

$$p(c \mid g; \theta) = \frac{\exp\left(\theta_c^{\top} g\right)}{\sum_{c'=1}^{C} \exp\left(\theta_{c'}^{\top} g\right)} \tag{15}$$

In Equation (15), $C$ denotes the class count, $p(c \mid g; \theta)$ represents the probability that $g$ belongs to the $c$-th class, and $\theta$ indicates the parameters of the softmax classifier.
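A compact Keras sketch of this architecture is given below; it is a minimal illustration under our own choices of layer sizes (W, L, K, m, n are placeholders), not the authors' exact configuration. It stacks 1D convolution-pooling layers, a bidirectional RNN, and the tanh-based attention pooling described above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_hcnn_birnn(W=128, num_classes=6, L=2, K=64, m=64, n=32):
    """HCNN-BiRNN sketch: L conv-pool layers, BiRNN, attention, softmax."""
    inp = layers.Input(shape=(W, 1))            # sample X in R^{W x 1}
    x = inp
    for _ in range(L):                          # convolution + pooling blocks
        x = layers.Conv1D(K, 3, strides=1, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(2)(x)           # shortens the time dimension
    # h_t = [fh_t, bh_t]: concatenated forward/backward states (2m dims)
    h = layers.Bidirectional(layers.SimpleRNN(m, return_sequences=True))(x)
    # attention: e_t = U_a tanh(W_a h_t); alpha_t = softmax over time steps
    e = layers.Dense(n, activation="tanh")(h)   # W_a h_t
    e = layers.Dense(1, use_bias=False)(e)      # U_a (...), shape (batch, T, 1)
    alpha = layers.Softmax(axis=1)(e)
    # g = sum_t alpha_t * h_t (invariant feature vector)
    g = layers.Lambda(lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([alpha, h])
    out = layers.Dense(num_classes, activation="softmax")(g)  # Eq. (15)
    return Model(inp, out)

model = build_hcnn_birnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In practice, the input here would be the sequence of NASNet feature vectors produced by the extraction stage, and all layers are trained jointly end to end.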

Performance Validation
The experimental result analysis of the QWSA-HDLAR model is tested using two datasets, namely the KTH dataset [25] and the UCF Sports dataset [26]. The first, the KTH dataset, includes 600 samples with six class labels, as given in Table 1. Next, the UCF Sports dataset contains 1000 samples with ten class labels of 100 samples each (including Golf-Swing, Kicking-Front, Lifting, Riding Horse, Run-Side, SkateBoarding-Front, Swing-Bench, Swing-SideAngle, and Walk), as provided in Table 2.

Table 3 illustrates the overall HAR outcomes of the QWSA-HDLAR model on the test KTH dataset. The experimental output demonstrates that the QWSA-HDLAR model shows enhanced performance on all datasets. For instance, on the entire dataset, the QWSA-HDLAR model obtained an average accuracy of 99.33%, recall of 98%, specificity of 99.60%, F-score of 97.99%, and an area under the receiver operating characteristic curve (AUROC) score of 98.80%. Eventually, with 70% of the TR (training) dataset, the QWSA-HDLAR model attained an average accuracy of 99.21%, recall of 97.63%, specificity of 99.52%, F-score of 97.62%, and AUROC score of 98.58%. Meanwhile, on 30% of the TS (testing) dataset, the QWSA-HDLAR model reached an average accuracy of 99.63%, recall of 98.80%, specificity of 99.78%, F-score of 98.83%, and an AUROC score of 99.29%, as illustrated in Figure 3.

The training and validation accuracies depicted by the QWSA-HDLAR technique over distinct epochs on the KTH dataset are demonstrated in Figure 4. The results assure that the accuracies increase with the number of epochs. Additionally, the training accuracy appears superior to the testing accuracy.
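For reference, the reported per-class metrics can be derived from a confusion matrix; the following is a generic sketch on dummy labels using scikit-learn, not code tied to the paper's experiments:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])   # dummy ground-truth labels
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 1])   # dummy predictions

cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)

accuracy = (tp + tn) / cm.sum()               # per-class accuracy
specificity = tn / (tn + fp)                  # true negative rate, per class
print("accuracy:   ", accuracy.mean())
print("recall:     ", recall_score(y_true, y_pred, average="macro"))
print("specificity:", specificity.mean())
print("F-score:    ", f1_score(y_true, y_pred, average="macro"))
```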
The training and validation losses gained by the QWSA-HDLAR method on the test KTH dataset are reported in Figure 5. The figure shows that the QWSA-HDLAR technique results in lower training and validation loss values.

To report the enhanced performance of the QWSA-HDLAR model, a comparison study with existing methods [25-27] on the KTH dataset is given in Table 4. The results imply that the gated recurrent neural network (GRNN) model shows the poorest performance with an accuracy of 85.85%, whereas the Gaussian mixture model with Kalman filter (GMM-KF) attains a slightly enhanced accuracy of 90.47%. This is followed by the support vector machine with 3DCNN (SVM-3DCNN) and the CNN with convolutional autoencoder (CNN-CAE) models, which obtain improved accuracy values of 90.45% and 92.80%, respectively. Though the GMM-KFGRNN and SDL-HBC models result in reasonable accuracy values of 95.52% and 99.38%, the QWSA-HDLAR model gains a higher accuracy of 99.63%.

Figure 6 establishes the confusion matrices produced by the QWSA-HDLAR approach on the UCF Sports dataset. The figure implies that the QWSA-HDLAR technique proficiently identifies all ten class labels on the applied data. Table 5 exemplifies the overall HAR outcomes of the QWSA-HDLAR technique on the test UCF Sports dataset. The experimental output illustrates that the QWSA-HDLAR approach shows enhanced performance on all datasets. For example, on the entire dataset, the QWSA-HDLAR algorithm achieves an average accuracy of 99.06%, recall of 95.30%, specificity of 99.48%, F-score of 95.30%, and AUROC score of 97.39%. With 70% of the TR dataset, the QWSA-HDLAR approach reaches an average accuracy of 99.06%, recall of 95.40%, specificity of 99.48%, F-score of 95.28%, and AUROC score of 97.44%. At the same time, on 30% of the TS dataset, the QWSA-HDLAR methodology reaches an average accuracy of 99.07%, recall of 95.26%, specificity of 99.48%, F-score of 95.13%, and AUROC score of 97.37%.
The training and validation accuracies depicted by the QWSA-HDLAR methodology over distinct epochs on the UCF Sports dataset are demonstrated in Figure 7. The results ensure that the accuracies increase with the number of epochs. In addition, the training accuracy appears better than the testing accuracy. The training and validation losses inferred by the QWSA-HDLAR technique on the test UCF Sports dataset are reported in Figure 8. The figure shows that the QWSA-HDLAR algorithm results in lower training and validation loss values.

To report the enhanced performance of the QWSA-HDLAR technique, a comparison study with recent methodologies [28,29] on the UCF Sports dataset is given in Figure 9. The results imply that the AR-DT and LTP-HAR approaches show poor performance with lower accuracy values of 78.21% and 78.84%, respectively. This is followed by the average two-stream CNN and GMM-KFGRNN algorithms, which reach improved accuracy values of 88.30% and 88.51%, respectively. Though the DTR-DNN and GS-LOF techniques result in reasonable accuracy values of 95.83% and 95.54%, the QWSA-HDLAR method obtains a higher accuracy of 99.07%. The detailed results and discussion show that the proposed model delivers effectual performance on HAR relative to the other models.

Conclusions
In this paper, a new QWSA-HDLAR approach has been presented for the recognition of human activities in the HCI environment. The proposed QWSA-HDLAR technique encompasses a series of operations: preprocessing, NASNet-based feature extraction, QWSA-based hyperparameter tuning, and HCNN-BiRNN-based classification. In the QWSA-HDLAR model, the QWSA is applied to optimally choose the hyperparameter values of the NASNet model. The experimental validation of the QWSA-HDLAR model is tested using two datasets, namely the KTH and UCF Sports datasets. The experimental values reported the supremacy of the QWSA-HDLAR model over recent DL approaches, with maximum accuracies of 99.63% and 99.07% on the applied KTH and UCF Sports datasets, respectively. In the future, an ensemble of three DL models can be applied to improve the overall recognition performance of the QWSA-HDLAR method.