A Fusion-Assisted Multi-Stream Deep Learning and ESO-Controlled Newton–Raphson-Based Feature Selection Approach for Human Gait Recognition

The performance of human gait recognition (HGR) is affected by the partial obstruction of the human body caused by the limited field of view in video surveillance. The traditional method required the bounding box to recognize human gait in the video sequences accurately; however, it is a challenging and time-consuming approach. Due to important applications, such as biometrics and video surveillance, HGR has improved performance over the last half-decade. Based on the literature, the challenging covariant factors that degrade gait recognition performance include walking while wearing a coat or carrying a bag. This paper proposed a new two-stream deep learning framework for human gait recognition. The first step proposed a contrast enhancement technique based on the local and global filters information fusion. The high-boost operation is finally applied to highlight the human region in a video frame. Data augmentation is performed in the second step to increase the dimension of the preprocessed dataset (CASIA-B). In the third step, two pre-trained deep learning models—MobilenetV2 and ShuffleNet—are fine-tuned and trained on the augmented dataset using deep transfer learning. Features are extracted from the global average pooling layer instead of the fully connected layer. In the fourth step, extracted features of both streams are fused using a serial-based approach and further refined in the fifth step by using an improved equilibrium state optimization-controlled Newton–Raphson (ESOcNR) selection method. The selected features are finally classified using machine learning algorithms for the final classification accuracy. The experimental process was conducted on 8 angles of the CASIA-B dataset and obtained an accuracy of 97.3, 98.6, 97.7, 96.5, 92.9, 93.7, 94.7, and 91.2%, respectively. Comparisons were conducted with state-of-the-art (SOTA) techniques, and showed improved accuracy and reduced computational time.


Introduction
Human verification or identification plays a significant role in information security, public security systems, point-of-sales machines, automatic teller machines, etc. [1]. Human beings can be identified by examining their different external and internal body parts, such as blood samples, skin, hair, ear shape, bite by forensic odontology, face recognition, and walking style by gait analysis [2]. Fingerprints and face recognition are well-known clothing, and walking style. The authors extracted features using ResNet101 deep model and then selected the best features using the KcE approach. The experimental process was conducted on the CASIA-B dataset and obtained an accuracy of 95.26% and 96.60%, respectively. Khan et al. [23] introduced a single-stream HGR framework based on optimal deep learning fused features. In their work, the authors performed data augmentation at the first step and then used two pre-trained models, such as Inseption-ResNet-V2 and NASNet mobile. Features of both deep learnings were fused and further optimized using the whale optimization algorithm. Machine learning classifiers were applied and obtained the best accuracy of 89%.
Huang et al. [24] demonstrated a gait recognition method based on multisource sensing information. The 3D human features data was extracted using the human body's structure and multisource stream information during a human walk. Athlete walk includes different characteristics, and based on these characteristics, a person is identified. The CASIA A dataset was used for the experimental process and obtained an accuracy of 88.33%. Hasan et al. [25] presented a modified residual block and a novel shallow convolutional layer for HGR. Wearable sensors were embedded in objects that can be worn on the subject body, such as wristwatches, necklaces, and smartphones, and were used for gait analysis. Template matching and conventional matching were not appropriate and did not provide improved performance for low-device wearable devices. They also introduced a modified residual block and shallow convolutional neural network that obtained an accuracy of 85% on the IMU-based dataset. Junaid et al. [26] presented a human gait analysis approach for osteoarthritis using DL and kernel extreme learning machine. The authors faced numerous difficulties in this approach, such as abnormal walking, patients' clothes, and angle changes. Conventional techniques are only concerned with feature selection and do not address such issues; therefore, the authors employed a novel robust method to address that disparity. For experimental purposes, two pre-trained models (VGG16-Net and AlexNet) were used and obtained improved accuracy. Yonghong et al. [27] addressed the free-view gait recognition problem. They faced the problems of traditional methods that capture gait sequences under uncontrolled scenes, unknown view angles, and dynamically changing viewing angles during the walk. They presented a unique walking trajectory fitting (WTF) approach for these challenges. Also, they introduced a joint gait manifold (JGM) technique for gait similarity evaluation.

Major Challenges
In summary, all the above methods still faced several issues, such as selecting important features and extracting irrelevant features. Moreover, the above methods did not select the entire CASIA-B dataset for the experimental process. Features extraction from the original video frames may extract some redundant and irrelevant features due to complicating factors, such as outdoor environment, lighting conditions, complex background, noise, and low-resolution frames. These factors impact recognition accuracy. In addition, several studies extracted features from a region of interest (ROI), which is a time-consuming step that, sometimes, leads to a chance of incorrect ROI detection. The incorrect ROI detection consumes the developed system's overall time and extracts irrelevant features that later reduce the classification accuracy. Therefore, this work presents a new framework using the best fusion-assisted deep learning features.

Major Contributions
The major contributions of this work are as follows: • A contrast enhancement technique based on local and global filter information fusion is proposed. The high-boost operation is finally applied to highlight the human region in a video frame.

•
The data augmentation was performed, and two fine-tuned deep learning models (MobilenetV2 and ShuffleNet) were trained using deep transfer learning. Features Features of both streams were fused in a serial-based fashion that can minimize the loss of information and then select the best features using a new approach called ESO-controlled Newton-Raphson.

•
The detailed ablation study-based results have been computed and discussed, showing the improvement in the accuracy of this work.

Manuscript Organization
The rest of the manuscript is organized in the following order. The proposed methodology is presented in Section 2, which includes the contrast enhancement technique, deep learning features, proposed feature fusion, and best feature selection. Section 3 presents the experimental results of the proposed methodology. Section 4 concludes the manuscript.

Proposed Methodology
The proposed human gait recognition framework is presented in Figure 1. This figure illustrates that the proposed HGR framework consists of several phases: contrast enhancement of original frames, data augmentation, training of the deep learning models, extraction of features, a fusion of both stream features, selection of the most optimal features, and finally, classification. A brief description of each step in the form of mathematics and numerical values is discussed below.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , )

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) and Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1)

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) . Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (i, j) denotes the HSV color-transformed image, and R, G, and B denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1)

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: , Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: (1) , Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning an optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine th original images into better information. In this work, a new fusion-based technique proposed for contrast enhancement and improvement of frame quality. For this purpos HSV color transformation is initially applied, which encodes 24-bit colors by hue, sat ration, and value. This selection aims to organize the colors in a more practically app cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , denotes the HSV color-transformed image, and , , and denote the red, green, an blue channels of values between 0-255. The mathematical formulation is defined as fo lows: Proposed framework of HGR using two-stream fusion-assisted deep learning optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refin original images into better information. In this work, a new fusion-based techniqu proposed for contrast enhancement and improvement of frame quality. For this purp HSV color transformation is initially applied, which encodes 24-bit colors by hue, s ration, and value. This selection aims to organize the colors in a more practically a cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ denotes the HSV color-transformed image, and , , and denote the red, green, blue channels of values between 0-255. The mathematical formulation is defined as lows:

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows:

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine t original images into better information. In this work, a new fusion-based technique proposed for contrast enhancement and improvement of frame quality. For this purpos HSV color transformation is initially applied, which encodes 24-bit colors by hue, sat ration, and value. This selection aims to organize the colors in a more practically app cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , denotes the HSV color-transformed image, and , , and denote the red, green, an blue channels of values between 0-255. The mathematical formulation is defined as fo lows: Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refin original images into better information. In this work, a new fusion-based techniq proposed for contrast enhancement and improvement of frame quality. For this pur HSV color transformation is initially applied, which encodes 24-bit colors by hue, ration, and value. This selection aims to organize the colors in a more practically a cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ denotes the HSV color-transformed image, and , , and denote the red, green blue channels of values between 0-255. The mathematical formulation is defined a lows: Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learn optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to re original images into better information. In this work, a new fusion-based techn proposed for contrast enhancement and improvement of frame quality. For this p HSV color transformation is initially applied, which encodes 24-bit colors by hu ration, and value. This selection aims to organize the colors in a more practicall cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. denotes the HSV color-transformed image, and , , and denote the red, gre blue channels of values between 0-255. The mathematical formulation is defined lows: Sensors 2023, 23, x FOR PEER REVIEW Here Ꝿ denotes the change in maximum and minimum range, ℟ deno tracted red channel, Ꞡ denotes the extracted green channel, and ₿ denotes ed blue channel, respectively. Based on the above information, the hue, satu value channels are computed as follows: After that, the averaging filter is applied to the transformed image due filter type. This filter minimized the ambient noise, refined edges, and recti lighting. This approach involves filtering the frame by correlation with a su kernel. The value of the resultant pixel is determined as the weighted combi neighbor pixels. On the input signal, it functions as an averaging filter; it p input vector of values and determines an average for every value inside the v sider we have an input image Ɪ ( , ), filtered image Ꞇ ( , ), and ґ ( , ) is of the average filter that convolves the input image and produced the smo image: the mathematically averaging filter can be illustrated as: The high-boost filter is later applied on the resultant image Ꞇ ( , ) fo ening of edges in a video frame. This operation is employed to strengthen t high-frequency components, further improving the relative relevance of fe veyed by high-frequency components. The high-boost filtering image is c follows:  Figure 2. 
Based on these contrast enhancement outpu mented the entire dataset and results outputs are described in Table 1.
Here Here Ꝿ denotes the change in maximum and minimum range, ℟ denotes the extracted red channel, Ꞡ denotes the extracted green channel, and ₿ denotes the extracted blue channel, respectively. Based on the above information, the hue, saturation, and value channels are computed as follows: After that, the averaging filter is applied to the transformed image due to its linear filter type. This filter minimized the ambient noise, refined edges, and rectified uneven lighting. This approach involves filtering the frame by correlation with a suitable filter kernel. The value of the resultant pixel is determined as the weighted combination of its neighbor pixels. On the input signal, it functions as an averaging filter; it processes an input vector of values and determines an average for every value inside the vector. Consider we have an input image Ɪ ( , ), filtered image Ꞇ ( , ), and ґ ( , ) is the weight of the average filter that convolves the input image and produced the smooth filtered image: the mathematically averaging filter can be illustrated as: The high-boost filter is later applied on the resultant image Ꞇ ( , ) for the sharpening of edges in a video frame. This operation is employed to strengthen the image of high-frequency components, further improving the relative relevance of features conveyed by high-frequency components. The high-boost filtering image is computed as follows: ( , ) = ( − 1) * Ꞇ ( , ) + Ꞇ ( , ) * ℎ ( , ) where is an increasing factor for adjusting the weights, ( , v) denotes the high-pass filtered image, and ( , ) is the final fused enhanced image. The visual illustration is shown in Figure 2. Based on these contrast enhancement outputs, we augmented the entire dataset and results outputs are described in Table 1.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be original images into better information. In this work, a new fusion-b proposed for contrast enhancement and improvement of frame quality. HSV color transformation is initially applied, which encodes 24-bit co ration, and value. This selection aims to organize the colors in a more cable manner. Consider we have a CASIA-B dataset denoted as ẞ a denotes the HSV color-transformed image, and , , and denote th blue channels of values between 0-255. The mathematical formulation lows:

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: denotes the extracted green channel, and Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep l optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used t original images into better information. In this work, a new fusion-based t proposed for contrast enhancement and improvement of frame quality. For th HSV color transformation is initially applied, which encodes 24-bit colors by ration, and value. This selection aims to organize the colors in a more practi cable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ denotes the HSV color-transformed image, and , , and denote the red, blue channels of values between 0-255. The mathematical formulation is def lows: denotes the extracted blue channel, respectively. Based on the above information, the hue, saturation, and value channels are computed as follows: Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be used to refine the original images into better information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This selection aims to organize the colors in a more practically applicable manner. Consider we have a CASIA-B dataset denoted as ẞ and ẞ ℇ ℛ. ẞ ( , ) denotes the HSV color-transformed image, and , , and denote the red, green, and blue channels of values between 0-255. The mathematical formulation is defined as follows: Here Ꝿ denotes the change in maximum and minimum range, tracted red channel, Ꞡ denotes the extracted green channel, and ₿ d ed blue channel, respectively. Based on the above information, the h value channels are computed as follows: After that, the averaging filter is applied to the transformed ima filter type. This filter minimized the ambient noise, refined edges, an lighting. This approach involves filtering the frame by correlation w kernel. The value of the resultant pixel is determined as the weighted neighbor pixels. On the input signal, it functions as an averaging fi input vector of values and determines an average for every value ins sider we have an input image Ɪ ( , ), filtered image Ꞇ ( , ), and ґ of the average filter that convolves the input image and produced image: the mathematically averaging filter can be illustrated as: The high-boost filter is later applied on the resultant image Ꞇ ( ening of edges in a video frame. This operation is employed to stren high-frequency components, further improving the relative relevan veyed by high-frequency components. The high-boost filtering ima follows:

Novelty 1: Hybrid Fusion Enhancement Technique
Preprocessing is an important step in computer vision that can be us original images into better information. In this work, a new fusion-base proposed for contrast enhancement and improvement of frame quality. Fo HSV color transformation is initially applied, which encodes 24-bit color ration, and value. This selection aims to organize the colors in a more pr cable manner. Consider we have a CASIA-B dataset denoted as ẞ and denotes the HSV color-transformed image, and , , and denote the blue channels of values between 0-255. The mathematical formulation is lows: Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique
Figure 1. Proposed framework of HGR using two-stream fusion-assisted deep learning and optimal features selection.

Novelty 1: Hybrid Fusion Enhancement Technique

Preprocessing is an important step in computer vision that refines the original images into more useful information. In this work, a new fusion-based technique is proposed for contrast enhancement and improvement of frame quality. For this purpose, HSV color transformation is initially applied, which encodes 24-bit colors by hue, saturation, and value. This transformation organizes the colors in a more practically applicable manner. Consider a CASIA-B dataset denoted as ẞ, with ẞ ∈ ℛ, where ẞ(u, v) denotes the HSV color-transformed image and ℟, Ꞡ, and ₿ denote the red, green, and blue channels with values between 0 and 255. The mathematical formulation is defined as follows:

Ꝿ = max(℟, Ꞡ, ₿) − min(℟, Ꞡ, ₿)   (1)

Here Ꝿ denotes the difference between the maximum and minimum channel values, ℟ denotes the extracted red channel, Ꞡ denotes the extracted green channel, and ₿ denotes the extracted blue channel, respectively. Based on the above information, the hue, saturation, and value channels are computed as follows:

Hue = 60° × (((Ꞡ − ₿)/Ꝿ) mod 6),  if max(℟, Ꞡ, ₿) = ℟
Hue = 60° × ((₿ − ℟)/Ꝿ + 2),      if max(℟, Ꞡ, ₿) = Ꞡ
Hue = 60° × ((℟ − Ꞡ)/Ꝿ + 4),      if max(℟, Ꞡ, ₿) = ₿
Saturation = Ꝿ / max(℟, Ꞡ, ₿),  Value = max(℟, Ꞡ, ₿) / 255

After that, the averaging filter is applied to the transformed image due to its linear filter type. This filter minimizes ambient noise, refines edges, and rectifies uneven lighting. The approach filters the frame by correlation with a suitable filter kernel, so the value of each resultant pixel is determined as the weighted combination of its neighboring pixels. On the input signal, it functions as an averaging filter: it processes an input vector of values and determines an average for every value inside the vector. Consider an input image Ɪ(u, v), a filtered image Ꞇ(u, v), and ґ(a, b), the weight of the average filter that convolves the input image and produces the smooth filtered image. Mathematically, the averaging filter can be illustrated as:

Ꞇ(u, v) = Σa Σb ґ(a, b) Ɪ(u − a, v − b)

The high-boost filter is later applied on the resultant image Ꞇ(u, v) for the sharpening of edges in a video frame. This operation strengthens the high-frequency components of the image, further improving the relative relevance of the features conveyed by those components. The high-boost filtered image is computed as follows:

P_hb(u, v) = Ꞇ(u, v) + W × P_hpf(u, v)

where W is an increasing factor for adjusting the weights, P_hpf(u, v) denotes the high-pass filtered image, and P_hb(u, v) is the final fused enhanced image. The visual illustration is shown in Figure 2. Based on these contrast enhancement outputs, we augmented the entire dataset; the resulting outputs are described in Table 1.
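The averaging and high-boost stage described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the 3 × 3 kernel size and the boost factor `w = 1.5` are assumed values, the high-pass image is derived as the residual of the smoothed frame (unsharp-mask style), and the HSV conversion that precedes this stage is omitted.

```python
import numpy as np

def average_filter(img, k=3):
    """Smooth the frame with a k x k averaging (box) kernel, reflect-padded."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=np.float64)
    for du in range(k):
        for dv in range(k):
            out += padded[du:du + img.shape[0], dv:dv + img.shape[1]]
    return out / (k * k)

def high_boost_enhance(frame, w=1.5, k=3):
    """Smooth the frame, take the high-pass residual P_hpf, and add it
    back scaled by the boost factor w (w and k are illustrative)."""
    frame = frame.astype(np.float64)
    smooth = average_filter(frame, k)     # T(u, v)
    high_pass = frame - smooth            # P_hpf(u, v)
    boosted = smooth + w * high_pass      # P_hb(u, v)
    return np.clip(boosted, 0, 255)
```

A flat region passes through unchanged (its high-pass residual is zero), while edges are sharpened in proportion to w.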

Pre-Trained Deep Models
Mobilenetv2: Mobilenetv2 was introduced by Google in 2018 and is a variation of the Mobilenet model. It is a convolutional neural network that contains 53 deep layers and is based on an inverted residual structure with connections between the bottleneck levels; therefore, we used this model as a backbone network. A visual illustration of this network is shown in Figure 3. The figure shows two types of blocks: a residual block with a stride of 1, and a block with a stride of 2 for downsizing. Both types have three layers. The first layer is a 1 × 1 convolution with ReLU6, the second layer is a depth-wise convolution, and the third layer is another 1 × 1 convolution without any non-linearity. It is claimed that if ReLU is used again, the deep network only has the power of a linear classifier on the non-zero-volume part of the output domain. Mobilenetv2 is a very effective feature extractor, mostly used for object detection and segmentation, and was pre-trained on the ImageNet dataset of about 1000 object classes.

ShuffleNet: ShuffleNet is a CNN model specially designed for mobile devices that is highly efficient in computation and power consumption. The model was evaluated on the ImageNet 2016 classification dataset. It introduces three variants of the shuffle unit, composed of group convolution and channel shuffle. Group convolution uses multiple kernels per layer and generates multiple channel outputs per layer; it can learn more intermediate features and increase the channels for the next layer. Moreover, channel shuffle is an operation that helps information flow across feature channels in a CNN. The input and output channels can be fully related if a group convolution is allowed to obtain input data from different groups, so the network can be designed for very limited computing power. Two new operations are used in this architecture (pointwise group convolution and channel shuffle) that reduce the computation cost while maintaining accuracy. The visual description of this architecture is shown in Figure 4.
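The channel shuffle operation described above can be expressed compactly with a reshape and transpose. The sketch below assumes an (N, C, H, W) tensor layout and an illustrative group count; it shows only the shuffle itself, not the full ShuffleNet unit.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups so that the next group
    convolution sees inputs from every group (ShuffleNet-style).
    x has shape (N, C, H, W); C must be divisible by `groups`."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (N, groups, C//groups, H, W) -> swap the two channel axes -> flatten
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)
```

For example, with 6 channels and 2 groups, the channel order [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5], so each group in the next layer receives channels from both original groups.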

Deep Transfer Learning based Features Extraction
The deep transfer learning process is employed in this work for the training of both pre-trained models on the enhanced CASIA-B dataset. For the deep transfer learning process, we first fine-tuned both models: the last three consecutive layers were removed, and new layers were added and connected to the previous global average pooling layer. Next, the models were trained through deep transfer learning.

Domain. A domain can be represented as:

Ɗ = {V, ρ(V)}

which contains two parts: the feature space, V, and the probability distribution, ρ(V), where V = {v_i | v_i ∈ V, i = 1, …, n} is a dataset with n instances. The source and target domains are subcategories of transfer learning with the same feature vector but different probability distributions.

Task. The task can be represented as:

Ƭ = {L, ʄ(·)}   (14)

which includes two factors: the label space, L, and a mapping function, ʄ(·), generally written as ʄ(v) = ρ(l | v). This is a non-linear and indirect function that may fill the gap between the input instances and the projected judgment acquired from the suggested datasets. Similarly, distinct objectives are specified because of the label spaces among these tasks.

The visual process of deep transfer learning is shown in Figure 5. This figure illustrates that none of the layers are frozen, and the entire network is trained on the selected enhanced CASIA-B dataset. After training, features are extracted from the selected layers, such as global average pooling. In the first trained model, MobilenetV2, the global average pooling layer is employed and its activations are computed; from this layer, we obtained 1280 features for each image, so the resultant vector is denoted by N × 1280. In the second trained model, ShuffleNet, we extracted features from the global average pooling layer and obtained 544 features; hence, the resultant vector has dimension N × 544. This is shown in Table 2.
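Global average pooling simply collapses each channel's feature map to a single number, which is why the two streams yield N × 1280 and N × 544 matrices. The numpy sketch below illustrates the shape arithmetic only; the 7 × 7 spatial size and the random activations are hypothetical stand-ins for the fine-tuned networks' outputs.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse (N, H, W, C) activations to an N x C feature matrix
    by averaging over the two spatial axes."""
    return feature_maps.mean(axis=(1, 2))

# Hypothetical activations for N = 4 frames from each fine-tuned stream.
mobilenet_maps = np.random.rand(4, 7, 7, 1280)   # MobilenetV2 GAP input
shufflenet_maps = np.random.rand(4, 7, 7, 544)   # ShuffleNet GAP input

fv1 = global_average_pool(mobilenet_maps)    # N x 1280
fv2 = global_average_pool(shufflenet_maps)   # N x 544
```

In a framework such as PyTorch or Keras the same result comes from reading activations at the model's global-average-pooling layer rather than the fully connected layer.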

Novelty 2: Minimal Serial Features Fusion
The two feature vectors FV1 and FV2 contain N × 1280 and N × 544 features for the MobilenetV2 and ShuffleNet models, respectively. FV3 is the fused feature vector obtained using a serial approach, which returns a feature vector of dimension N × 1824 by employing the following mathematical formulation:

FV3 = [FV1, FV2],  FV3 ∈ ℛ^(N × 1824)

The resultant fused feature vector contains some redundant information; therefore, we implemented a minimization function that removes the redundant features after each iteration. The objective of this function is to minimize the error rate and reduce the computational time required after the reduction of the redundant features. The working of this process is described in Algorithm 1, and the final fused vector is denoted by FV.
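Serial fusion is a concatenation along the feature axis. The sketch below shows that step in numpy; the redundancy-removal function is only a placeholder that drops exactly duplicated columns, since the paper's minimization function is specified at the level of Algorithm 1 rather than a closed form.

```python
import numpy as np

def serial_fuse(fv1, fv2):
    """Serially fuse two feature matrices along the feature axis,
    e.g. N x 1280 and N x 544 -> N x 1824."""
    return np.concatenate([fv1, fv2], axis=1)

def drop_redundant(fv):
    """Placeholder for the paper's minimization step: keep only the
    first occurrence of identical feature columns."""
    _, keep = np.unique(fv, axis=1, return_index=True)
    return fv[:, np.sort(keep)]
```

`np.sort(keep)` preserves the original column order, so the surviving features still line up with the two source streams.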

Novelty 3: Proposed ESOcNR Feature Selection
Feature selection has become an important step in machine learning in recent years. Many techniques have been introduced, but they face a few issues, such as discarding important features or selecting extra ones; these factors can reduce accuracy and increase computational time. This work proposes a new equilibrium state optimization technique controlled by the Newton–Raphson method (ESOcNR) for best feature selection. The proposed technique is initially based on the original ESO algorithm [28], which uses a mass balance equation to define the concentration of a nonreactive ingredient in a control volume. The mass balance equation describes the mechanics of mass entering, leaving, and being created in a control volume. The universal mass-balance equation is represented by a first-order ordinary differential equation, described as follows:

W dE/dt = R E_fr − R E + H

where W represents the control volume, E denotes the concentration, the rate of mass change in the control volume is denoted by W dE/dt, and R denotes the volumetric flow rate through the control volume. The variable E_fr implies the concentration at an equilibrium state with no production within the control volume, and H denotes the mass generation rate within the control volume. A stable equilibrium condition is achieved once W dE/dt hits zero. Reordering the above equation helps in solving dE/dt as a function of R/W, where R/W reflects the inverse of the residence period, referred to as λ or the turnover rate (λ = R/W). Integrating over the interval [t_0, t], the concentration E is computed as follows:

E = E_fr + (E_0 − E_fr) e^{−λ(t − t_0)} + (H/(λW)) (1 − e^{−λ(t − t_0)})   (19)

where t_0 and E_0 are the initial start time and concentration over the integration interval, respectively. Equation (19) can be utilized to determine the concentration in the control volume with a specified turnover rate, among several other things.
It can also be employed to compute the average turnover rate by applying a simple linear regression with a predetermined generation rate. The first term is the equilibrium concentration, one of the best solutions chosen randomly from a pool known as the equilibrium pool. The second term, a direct search mechanism, is concerned primarily with the concentration difference between a particle and the equilibrium state; it acts as an explorer, urging particles to search the entire region. The third term is associated with the generation rate and primarily contributes as an exploiter or solution refiner, although it can occasionally act as an explorer. Each term, and how it influences the search pattern, is defined below.
Evaluation and Initialization of Functions: Like several other meta-heuristic algorithms, ESO initiates the optimization process with an initial population. The initial concentrations are determined by the number of particles, with uniform random initialization over the dimensions of the search space, as follows:

E_initial_i = E_min + rand_i (E_max − E_min),   i = 1, 2, …, n,

where E_initial_i is the ith particle's initial concentration vector, E_max and E_min are the maximal and minimal values for the dimensions, respectively, n is the population size, and rand_i is a random vector in the range 0-1. The particles are then evaluated on the fitness function and sorted to identify the equilibrium candidates.
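The initialization above can be sketched as follows (a hedged illustration; the function name and the example bounds are assumptions):

```python
import numpy as np

def init_population(n, dim, e_min, e_max, seed=None):
    """Uniform random initialization: E_i = E_min + rand_i * (E_max - E_min)."""
    rng = np.random.default_rng(seed)
    return e_min + rng.random((n, dim)) * (e_max - e_min)

# 30 particles over a 1824-dimensional search space bounded in [0, 1).
pop = init_population(n=30, dim=1824, e_min=0.0, e_max=1.0, seed=1)
print(pop.shape)  # (30, 1824)
```

Each row is one particle's initial concentration vector; its fitness would then be evaluated to rank the equilibrium candidates.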
Candidates and the Equilibrium Pool (E_eq): The algorithm's ultimate convergence state is the global optimal equilibrium state. There is no knowledge about the equilibrium state at the outset of the optimization procedure; thus, equilibrium candidates are picked to create a search pattern for the particles. The four finest particles discovered throughout the optimization process are taken, along with one additional particle whose concentration is the arithmetic mean of those four. The four candidates aid ESO in improving its exploration abilities, whereas the average helps with exploitation. The number of candidates is arbitrary and determined by the nature of the optimization problem.
In contrast, selecting fewer than four candidates degrades the method's performance on multimodal and composition functions, although it improves outcomes on unimodal functions; furthermore, having more than four candidates may have a detrimental effect. Thus, five particles, referred to as the equilibrium candidates, form the equilibrium pool vector:

E_pool = { E_eq(1), E_eq(2), E_eq(3), E_eq(4), E_eq(ave) }.

In each cycle, each particle's concentration is updated by picking a candidate at random from the pool, with the same probability for each. For instance, in the first iteration a particle may update its concentration based on E_eq(1); in the second iteration, it may update based on E_eq(ave). Each particle is updated until the optimization process is complete, with about the same share of updates going to each candidate solution.
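Building the five-member equilibrium pool can be sketched as below (assuming a minimization objective; function and variable names are ours):

```python
import numpy as np

def build_equilibrium_pool(population, fitness):
    """Select the four fittest particles (lowest fitness) and append their
    arithmetic mean, yielding the five-member equilibrium pool."""
    best4 = population[np.argsort(fitness)[:4]]
    e_ave = best4.mean(axis=0)
    return np.vstack([best4, e_ave])  # shape (5, dim)

# Toy example: five 2-D particles with given fitness values.
pop = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.], [4., 4.]])
fit = np.array([3., 0., 4., 1., 2.])
pool = build_equilibrium_pool(pop, fit)
print(pool.shape)  # (5, 2)
print(pool[-1])    # [2. 2.]  (mean of the four best particles)
```

During the update loop, each particle would draw one of these five rows uniformly at random as its E_eq.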
Exponential Term (H): The next term contributing to the primary concentration update rule is the exponential term (H). A careful formulation of this term helps ESO strike a fair balance between exploration and exploitation. Because the turnover rate in a real control volume fluctuates over time, λ is taken to be a random vector in the range 0-1.
The time t is formulated as follows:

t = (1 − Itr/Max_Itr)^(b_2 · Itr/Max_Itr),

where Itr and Max_Itr represent the current and maximum number of iterations, respectively, and b_2 is a constant used to adjust the exploitation capability. To ensure convergence, the search velocity is reduced over iterations, balancing the algorithm's exploration and exploitation capabilities:

H = b_1 sign(s − 0.5) (e^(−λt) − 1),
where b_1 is a constant that regulates exploration: the higher the b_1 value, the better the exploration capacity and hence the lower the exploitation efficiency. Similarly, increasing b_2 improves exploitation while decreasing exploration. The component sign(s − 0.5) governs the direction of exploration and exploitation, where s is a random vector with values ranging from 0 to 1. These constants were determined empirically by evaluating a set of test functions; they can, however, be modified as needed for specific circumstances.
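Under these definitions, the time term and the exponential term can be computed as in the following sketch (the defaults b1 = 2 and b2 = 1 follow the original equilibrium optimizer paper and are assumptions here):

```python
import numpy as np

def exponential_term(itr, max_itr, lam, s, b1=2.0, b2=1.0):
    """Exponential term of the concentration update rule:
        t = (1 - Itr/Max_Itr) ** (b2 * Itr / Max_Itr)
        H = b1 * sign(s - 0.5) * (exp(-lam * t) - 1)
    lam and s are random vectors in [0, 1)."""
    t = (1.0 - itr / max_itr) ** (b2 * itr / max_itr)
    return b1 * np.sign(s - 0.5) * (np.exp(-lam * t) - 1.0)

# Example at iteration 10 of 100 with fixed random vectors.
lam = np.full(3, 0.5)
s = np.array([0.2, 0.6, 0.9])
h = exponential_term(itr=10, max_itr=100, lam=lam, s=s)
print(h.shape)              # (3,)
print(h[0] > 0, h[1] < 0)   # True True
```

The sign flip via sign(s − 0.5) sends roughly half of the components in each direction, which is what drives exploration.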
Generation Rate (G): The generation rate is one of the most important terms for providing accurate solutions by improving the exploitation phase. It is modeled as a first-order exponential decay process:

G = G_0 e^(−l (t − t_0)),

where G_0 represents the initial value and l represents the decay constant. To obtain a more controlled and systematic search pattern and to limit the number of random variables, we assume l = λ and reuse the previously computed exponential term. As a result, the final set of generation-rate equations is as follows:

G = G_0 e^(−λ (t − t_0)) = G_0 H,
G_0 = GCP (E_eq − λ E),
GCP = 0.5 r_1 if r_2 ≥ GP, and GCP = 0 otherwise,

where r_1 and r_2 are random numbers in the range 0-1. The generation-rate control parameter (GCP) governs the potential contribution of the generation term to the updating process, while the generation probability (GP) determines how many particles utilize the generation term to update their states. With GP = 0.5, a reasonable balance between exploration and exploitation is reached. Finally, the ESO update rule is defined as follows:

E = E_eq + (E − E_eq) H + (G / (λ W)) (1 − H).

The first term reflects the equilibrium concentration, whereas the second and third terms describe variations in concentration. The second term is in charge of searching the entire space for the best position. The third term contributes to exploitation by refining the solution once a promising region is reached. Depending on factors such as particle concentrations, equilibrium candidates, and the turnover rate (λ), the second and third terms may have the same or opposite signs. The same sign promotes diversity, which helps with global search, while the opposite sign reduces variation, which helps with local search. Finally, a memory-saving mechanism lets each particle keep track of its best position in space via its fitness value: each particle's fitness in the current iteration is compared with that of the previous iteration and is overwritten only if a better solution is attained.
This mechanism improves the exploitation capacity but increases the likelihood of becoming trapped in local minima if the approach does not retain a global exploration capability.
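Putting the pieces together, one ESO concentration update can be sketched as follows (a hedged illustration of the standard update rule; W = 1 and all names are assumptions):

```python
import numpy as np

def eso_update(e, e_eq, lam, h, gp=0.5, w=1.0, rng=None):
    """One concentration update following the rule
        E = E_eq + (E - E_eq) * H + (G / (lam * W)) * (1 - H),
    with G = GCP * (E_eq - lam * E) * H and GCP = 0.5 * r1 if r2 >= GP else 0."""
    if rng is None:
        rng = np.random.default_rng()
    r1, r2 = rng.random(), rng.random()
    gcp = 0.5 * r1 if r2 >= gp else 0.0   # generation-rate control parameter
    g = gcp * (e_eq - lam * e) * h        # generation rate G = G0 * H
    return e_eq + (e - e_eq) * h + (g / (lam * w)) * (1.0 - h)

# Sanity check: with H = 0 the generation term vanishes and the particle
# collapses onto the equilibrium candidate.
rng = np.random.default_rng(0)
e, e_eq = rng.random(4), rng.random(4)
lam = 0.1 + 0.9 * rng.random(4)           # keep lam away from zero
e_new = eso_update(e, e_eq, lam, h=np.zeros(4), rng=rng)
print(np.allclose(e_new, e_eq))  # True
```

In a full optimizer, this update would run for every particle in each iteration, with e_eq drawn at random from the equilibrium pool.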
Newton-Raphson-based Final Selection: The features of each ESO iteration are passed to a Newton-Raphson-based function [29] that computes a resultant value, which determines when the iterations stop. The main purpose of this function is to quickly find the value used for threshold selection, which reduces the computational time and improves the performance. Mathematically, this process is defined as follows:

f_(n+1) = f_n − F(f_n) / F′(f_n),

where f_(n+1) ∈ E is the selected feature vector after each iteration. This vector is updated after each iteration; once the value of f_(n+1) becomes constant, the process stops and returns the best feature vector. The finally selected features are classified using machine learning classifiers.
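The stopping rule can be illustrated with a plain Newton-Raphson iteration (a generic sketch; the actual objective F applied to the feature vector is not specified in the text, so a simple scalar root-finding example is used):

```python
def newton_raphson_threshold(f, df, x0, tol=1e-6, max_iter=100):
    """Newton-Raphson iteration used as a stopping/threshold rule:
        x_{n+1} = x_n - f(x_n) / f'(x_n).
    Iteration stops once successive values become (numerically) constant."""
    x = x0
    for _ in range(max_iter):
        x_next = x - f(x) / df(x)
        if abs(x_next - x) < tol:   # value has become constant -> stop
            return x_next
        x = x_next
    return x

# Example: root of x^2 - 2; the iteration converges to sqrt(2).
root = newton_raphson_threshold(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(round(root, 5))  # 1.41421
```

The quadratic convergence of this rule is what makes the "value becomes constant" check a cheap stopping criterion for the outer selection loop.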

Results and Analysis
Dataset and Performance Measures: The proposed HGR framework was evaluated on the CASIA-B dataset; a detailed description of the dataset is given in Section 3. Several classifiers were used for the classification accuracies, such as fine tree, medium tree, linear SVM, quadratic SVM, weighted KNN, coarse KNN, bagged trees, subspace discriminant, bi-layered neural network, and tri-layered neural network. The performance of each classifier is computed using the recall rate, precision rate, accuracy, and time (seconds).
Experimental Setup: We divided the entire dataset 50:50 for training and testing, and 10-fold cross-validation was chosen for the testing process. Moreover, several hyperparameters were utilized in training the deep learning models: a learning rate of 0.0001, 100 epochs, a momentum value of 0.7, a mini-batch size of 32, and the stochastic gradient descent optimizer. The entire framework was simulated in MATLAB 2022a on a personal computer with a Core i7 processor, 16 GB of RAM, and an 8 GB graphics card. Table 3 presents the results of angle 0 of the CASIA-B dataset using the proposed framework, covering both the fusion and optimization methods. Several classifiers were employed for the classification results. First, the fusion method obtained the highest accuracy of 97.3% on Quadratic SVM, whereas the recall rate was 97.33% and the precision rate was 97.37%. Computational time was also noted, and it was observed that the fusion process consumed 572.19 s for this classifier.

However, the minimum noted time for the fusion process was 69.438 s on the medium tree classifier, whereas the maximum reported time was 4420.7 s on the tri-layered neural network classifier. Second, the optimization results were obtained, with a maximum accuracy of 97.2% on Quadratic SVM. The recall rate of this classifier was 97.23%, and the precision rate was 97.3%. The computation time of this step was also noted, and Quadratic SVM executed in 45.259 s. However, the minimum noted time for the optimization process was 38.502 s on the medium tree classifier, whereas the maximum reported time was 2049.1 s on the tri-layered neural network classifier. This shows that the accuracy of the optimization process was almost consistent, but the execution time was significantly reduced, which demonstrates the strength of the proposed framework. Table 4 presents the results of the fusion and optimization methods on angle 18 of the CASIA-B dataset using the proposed framework. First, the fusion method obtained the highest accuracy of 98.6% on Quadratic SVM, whereas the recall rate was 98.57% and the precision rate was 98.57%. Computational time was also noted, and it was observed that the fusion process consumed 859.46 s for this classifier. However, the minimum noted time for the fusion process was 174.72 s on the fine tree classifier, whereas the maximum reported time was 2049.8 s on the subspace discriminant classifier. Second, the optimization obtained a maximum accuracy of 98.0% on Quadratic SVM. The recall rate of this classifier was 97.93%, and the precision rate was 98%. The computation time of this step was also noted, and Quadratic SVM executed in 42.512 s. However, the minimum noted time for the optimization process was 34.35 s on the medium tree classifier, whereas the maximum reported time was 1573.6 s on the bagged trees classifier.
This shows that the accuracy of the optimization process was almost consistent, but the execution time was significantly reduced. Table 5 presents the results of angle 36 of the CASIA-B dataset using the proposed framework. First, the fusion method obtained the highest accuracy of 97.7% on Quadratic SVM, whereas the recall rate was 97.67% and the precision rate was 97.63%. Computational time was also noted, and it was observed that the fusion process consumed 1014.7 s for this classifier. However, the minimum noted time for the fusion process was 95.118 s on the medium tree classifier, whereas the maximum reported time was 5675.8 s on the bagged trees classifier. Second, the optimization obtained the maximum accuracy of 97.2% on Quadratic SVM. The recall rate of this classifier was 97.23%, and the precision rate was 97.23%. The computation time of this step was also noted, and Quadratic SVM executed in 53.864 s. However, the minimum noted time for the optimization process was 23.989 s on the medium tree classifier, whereas the maximum reported time was 1827.5 s on the bagged trees classifier. The results in this table show consistent accuracy, but the time was significantly reduced, which is the main strength of this step. Table 6 discusses the results of angle 54 of the CASIA-B dataset using the proposed framework. First, the fusion method obtained the highest accuracy of 96.5% on Quadratic SVM, whereas the recall rate was 96.53% and the precision rate was 96.53%. The minimum noted computational time for the fusion process was 45.349 s on the medium tree classifier, whereas the maximum reported time was 4780.4 s on the bagged trees classifier. Second, the optimization obtained the maximum accuracy of 96.2% on Quadratic SVM. The recall rate of this classifier was 96.27%, and the precision rate was 96.27%. After this process, the computational time was significantly reduced, with little change in accuracy.
For angle 72 (Table 7), the maximum reported time for the fusion process was 3645.4 s on the tri-layered neural network classifier. Second, the optimization obtained the maximum accuracy of 92.9% on coarse KNN. The recall rate of this classifier was 83%, and the precision rate was 82.9%. The minimum computation time of this step was 72.293 s on the medium tree classifier, whereas the maximum reported time was 2918 s on the tri-layered neural network classifier. Overall, this step reduced the computational time while remaining consistent in classification accuracy. Table 7. Classification results of HGR using the proposed framework on angle 72 of the CASIA-B dataset.

Table 8 shows the results of angle 90 of the CASIA-B dataset using the proposed framework. Results are presented for both the fusion and optimization steps. First, the fusion method obtained the highest accuracy of 93.7% on Quadratic SVM, whereas the recall rate was 93.73% and the precision rate was 94.03%. The minimum noted computational time was 281.13 s on the medium tree classifier, whereas the maximum reported time was 4960.7 s on the bagged trees classifier. Second, the optimization obtained the maximum accuracy of 93.1% on Quadratic SVM. The recall rate of this classifier was 93.17%, and the precision rate was 93.53%. The minimum recorded time for the optimization process was 78.005 s on the medium tree classifier, whereas the maximum reported time was 1280.4 s on the weighted KNN classifier. These facts show that the accuracy did not change much, but a significant reduction was noted in computational time, which is the strength of the proposed optimization algorithm. Table 9 presents the results of angle 108 of the CASIA-B dataset using the proposed framework. In this table, the fusion process obtained the highest accuracy of 94.7% on Quadratic SVM, whereas the recall rate was 94.57% and the precision rate was 94.63%. Second, the optimization process obtained the maximum accuracy of 94.2% on Quadratic SVM. Computational time was noted for both experiments, and the minimum noted time for the fusion process was 83.087 s on the medium tree classifier. In contrast, the minimum noted time for the optimization process was 39.314 s on the medium tree classifier. This shows a significant improvement in the optimization process's computational time, which is the strength of this step. Table 10 presents the results of angle 126 of the CASIA-B dataset using the proposed framework. In this table, the fusion method obtained the highest accuracy of 92.4% on Quadratic SVM, whereas the recall rate was 92.34% and the precision rate was 92.6%. The minimum computational time for this experiment was 126.29 s on the fine tree classifier.
Second, the optimization results were obtained, with a maximum accuracy of 91.9% on Quadratic SVM. The recall rate of this classifier was 91.9%, and the precision rate was 92.17%. Compared to the fusion process, the computation time of this step was significantly reduced, to 96.952 s on the fine tree classifier. Table 12 presents the results of angle 162 of the CASIA-B dataset, where the maximum accuracy of the fusion method was 96.5% on Quadratic SVM, with a recall rate of 96.47% and a precision rate of 96.57%. Computational time was also noted, and it was observed that the minimum reported time was 32.703 s on the medium tree classifier. Second, the optimization obtained the maximum accuracy of 96.3% on Quadratic SVM. The recall rate of this classifier was 96.23%, and the precision rate was 96.4%. The computation time of this step was 68.363 s on the fine tree classifier, which is significantly lower than that of the fusion process. Table 11. Classification results of HGR using the proposed framework on angle 144 of the CASIA-B dataset.

For angle 180 (Table 13), the minimum noted time for the fusion process was 88.322 s on the medium tree classifier, whereas the maximum reported time was 2476.1 s on the bagged trees classifier. Second, the optimization obtained the maximum accuracy of 99.8% on Quadratic SVM. The recall rate of this classifier was 99.8%, and the precision rate was 99.77%. The computation time of this step was also noted, and Quadratic SVM executed in 113.25 s. However, the minimum noted time for the optimization process was 28.024 s on the bi-layered neural network classifier, whereas the maximum reported time was 168.63 s on the weighted KNN classifier. This shows that the accuracy of the optimization process was almost consistent, but the execution time was significantly reduced. Table 13. Classification results of HGR using the proposed framework on angle 180 of the CASIA-B dataset.

A detailed comparison of the proposed framework is included in this section, based on the performance of the intermediate steps and on individual ESO-based feature selection. Figure 6 shows the analysis of the intermediate steps of the proposed framework. This figure illustrates that the accuracy of the ShuffleNet and MobilenetV2 deep-model features is insufficient on its own, with each model performing better only for a few angles. The fusion process improves the accuracy, but at the cost of increased computational time, as presented in Tables 3-13. Therefore, a new technique named ESOcNR is proposed. Using this technique, a significant reduction occurred in the number of features, with only a minor drop in accuracy (Tables 3-13). Figure 7 illustrates the comparison between ESO-based feature selection and ESOcNR-based feature selection. In this figure, it is noted that the accuracy improved after employing the proposed selection method. For example, for angle 0, a 3% change occurred, and for the other angles an almost 3-4% change is reported after employing the proposed ESOcNR. Table 14 shows the results of the proposed feature selection technique for the CASIA-B dataset on all 11 angles. In this table, it is noted that the QSVM classifier shows the most improved accuracy for most angles. Table 15 presents the comparison of the proposed framework with state-of-the-art techniques, conducted for each angle of the CASIA-B dataset. The proposed framework performed better on angles 0, 18, 36, 54, 144, 162, and 180. However, the performance on the other angles (72, 90, 108, and 126) did not improve, which will be addressed in the future. Hence, overall, the proposed framework shows improved accuracy.

Conclusions
A new framework based on fusion-assisted deep learning features and the ESOcNR feature selection technique is proposed in this work. The proposed framework consists of a few important sequential steps: contrast enhancement of video frames, deep learning feature extraction from the selected models, the proposed minimal serial fusion approach, and ESOcNR-based feature selection. Results are computed on the enhanced CASIA-B dataset using all 11 angles and show improvements in accuracy on 7 of these 11 angles. Based on the results and comparative analysis, we conclude the following points:

• The training of deep learning models on the enhanced dataset extracted more useful features that later improved the accuracy.
• The proposed fusion approach improved the accuracy but increased the computation time.
• The original ESO-based feature selection approach selected some redundant features that reduced the classification accuracy.
• Selection of the best features using the proposed ESOcNR maintains the classification accuracy and reduces the computational time of the fusion process.
In the future, the weights of the deep learning models can be optimized using meta-heuristic feature selection techniques. Moreover, angles 72, 90, 108, and 126 should be analyzed further to improve their accuracy.