Article

Merging-Squeeze-Excitation Feature Fusion for Human Activity Recognition Using Wearable Sensors

Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani 12120, Thailand
Appl. Sci. 2023, 13(4), 2475; https://doi.org/10.3390/app13042475
Submission received: 23 January 2023 / Revised: 7 February 2023 / Accepted: 13 February 2023 / Published: 14 February 2023
(This article belongs to the Special Issue Novel Approaches for Human Activity Recognition)

Abstract

Human activity recognition (HAR) has been applied to several advanced applications, especially when individuals may need to be monitored closely. This work focuses on HAR using wearable sensors attached to various locations of the user body. The data from each sensor may provide unequally discriminative information, and an effective fusion method is therefore needed. In order to address this issue, inspired by the squeeze-and-excitation (SE) mechanism, we propose the merging-squeeze-excitation (MSE) feature fusion, which emphasizes informative feature maps and suppresses ambiguous feature maps during fusion. The MSE feature fusion consists of three steps: pre-merging, squeeze-and-excitation, and post-merging. Unlike the SE mechanism, the set of feature maps from each branch is recalibrated by using channel weights that are also computed from the pre-merged feature maps. The recalibrated feature maps from all branches are merged to obtain a set of channel-weighted and merged feature maps, which is used in the classification process. Additionally, a set of MSE feature fusion extensions is presented. In these proposed methods, three deep-learning models (LeNet5, AlexNet, and VGG16) are used as feature extractors and four merging methods (addition, maximum, minimum, and average) are applied as merging operations. The performances of the proposed methods are evaluated by classifying popular public datasets.

1. Introduction

Human activity recognition (HAR) is an active and challenging research field [1] that aims to identify human activities (e.g., sitting, walking, running) based on the data collected from devices such as cameras [2] and wearable sensors [3,4,5]. It has become essential in many applications, especially healthcare [6]. In addition, HAR helps an information-technology system to automatically monitor and record the activities of users such that we can analyze them and alert related persons (e.g., users, relatives, doctors) when an abnormal activity or an accident happens [7]. Due to the limitations of using cameras in HAR, such as user privacy concerns, using wearable devices (e.g., smart watches and smartphones) in HAR is receiving significant attention. These wearable devices commonly use sensors such as accelerometers, gyroscopes, and magnetometers to monitor the activities of the users [5,8,9]. In addition, many studies have focused on using several inertial measurement units (IMUs) attached to different parts of the user body such that we can have data from different locations and utilize them together to obtain better recognition accuracy [10].
A HAR system using wearable devices receives sensor data from accelerometers, gyroscopes, and/or magnetometers and uses them to classify the activities. Of the classification models/algorithms, two types are popularly applied to HAR: traditional machine learning (ML) algorithms and deep-learning (DL) models. By using a traditional ML algorithm (e.g., support vector machine, random forest), we manually extract a set of useful features from the sensor data and pass them to the ML algorithm to identify the corresponding activities [11,12]. On the other hand, a DL model automatically extracts a set of features from the sensor data and uses them in the classification process. As a result, DL models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been extensively studied in HAR [9,13,14,15].
A CNN model essentially consists of a series of convolutional layers and pooling layers that extract a set of features called feature maps, which are used later in the classification. However, only some feature maps may be very useful for classifying the activities of interest. Therefore, informative feature maps should be emphasized while ambiguous feature maps should be suppressed. A channel-attention mechanism called the squeeze-and-excitation (SE) block [16] was proposed to solve this issue. The SE block recalibrates each feature map by a weight value which is proportional to the importance of this feature map in the classification. The SE block has recently been applied to CNN and/or RNN models to improve HAR performance [17,18].
The performance of wearable-sensor HAR can be improved by implementing multi-branch DL architectures [19,20]. A multi-branch DL architecture consists of several parallel branches using DL models to extract different sets of feature maps independently. Specifically, each branch provides a set of feature maps denoting local information. Thereafter, these sets of feature maps are fused by using a feature fusion method such as concatenation to generate a set of fused feature maps (denoting global information), which will be used later in the classification process. A traditional feature fusion method combines the local feature maps equally without being aware that, in each set, some feature maps may be informative while others are ambiguous. Motivated by this issue, we need a feature fusion method which is able to emphasize informative feature maps and suppress ambiguous feature maps in each branch during fusion such that we can combine the local feature maps efficiently and obtain useful discriminative fused feature maps.
Inspired by the squeeze-and-excitation (SE) mechanism [16], we propose a feature fusion method called the merging-squeeze-excitation (MSE) feature fusion. In each branch, the MSE feature fusion recalibrates the local feature maps by using a set of channel weights. Since the fused feature maps are the ones that enter the classification process, the channel weights should be computed such that the fused feature maps provide very discriminative information. Therefore, unlike the SE mechanism, we design the MSE feature fusion such that, at each branch, it computes the channel weights based on both the local feature maps and the fused feature maps. As a result, when we consider a set of C local feature maps, the c-th local feature map will be emphasized if either it is important to the classification or the corresponding c-th fused feature map is useful to the classification.
The MSE feature fusion consists of three steps: pre-merging step, squeeze-and-excitation step, and post-merging step. In the pre-merging step, the feature maps from all branches are merged together to obtain a set of pre-merged feature maps. Thereafter, during the squeeze-and-excitation step, the feature maps from each branch are recalibrated according to their importance measured from both the channel-wise statistics obtained from themselves and the channel-wise statistics obtained from the pre-merged feature maps. Finally, in the post-merging step, the MSE feature fusion applies the same merging operation used in the first step to combine the output feature maps from all branches and obtain a set of channel-weighted and merged feature maps which will be used to classify the activities of interest. In this work, we have applied three DL models (i.e., LeNet5, AlexNet, and VGG16) as feature extractors and four merging methods (addition, maximum, minimum, and average) as merging operations in the pre-merging step and post-merging step. Furthermore, we also modify the proposed MSE feature fusion by adding local skip connections, adding a global skip connection, using global channel attention, and stacking a series of the MSE feature fusions to create deep MSE feature fusions. Their performances are evaluated on three public HAR datasets: PAMAP2, DaLiAc, and DSAD.
The main contributions of this work are summarized as follows:
  • We propose five MSE feature fusion architectures for wearable-sensor HAR using multi-branch architectures such that the feature maps will be recalibrated according to their importance during the fusion. Three DL models and four merging methods used in the MSE feature fusion are studied and investigated.
  • Extensive experiments are conducted to evaluate and compare the performances of the proposed methods and baseline architectures by using the PAMAP2, DaLiAc, and DSAD datasets. The results show the following findings:
    • The MSE feature fusion with a global skip connection when using the average merging and AlexNet achieves the highest accuracy score of 99.24% on classifying the PAMAP2 dataset.
    • The MSE feature fusion with local skip connections when using the minimum merging and AlexNet achieves the highest accuracy score of 98.59% on classifying the DaLiAc dataset.
    • The original MSE feature fusion using the average merging and AlexNet achieves the highest accuracy score of 98.04% on classifying the DSAD dataset.
    • Among the merging methods studied in the proposed methods, the addition merging offers the worst accuracy scores. The maximum, minimum, and average merging have similar performances.
    • All of the highest accuracy scores are from using AlexNet as the feature extractor.
The rest of this paper is organized as follows. Section 2 reviews the previous work focusing on using the SE block in HAR, proposing multi-branch DL architectures for HAR, and presenting SE-based feature fusion methods. The data collection and preparation are described in Section 3. Section 4 and Section 5 present the proposed MSE feature fusion and its extensions, respectively. Their performances are evaluated and compared in Section 6. Finally, conclusions and future work are provided in Section 7.
The main symbols used in this paper are summarized as follows. Lower-case and upper-case bold letters represent vectors and three-dimensional (3D) arrays, respectively. The symbols $\mathbb{R}^{1 \times C}$ and $\mathbb{R}^{C}$ denote the space of C-real-number row vectors and the space of C-real-number column vectors, respectively. The symbol $\mathbb{R}^{H \times W \times C}$ denotes the space of 3D arrays (of real numbers) whose height, width, and channel number are equal to H, W, and C, respectively. Table 1 and Table 2 summarize the main symbols used in this paper.

2. Related Work

The SE block was proposed to improve the performance of CNNs and demonstrated its potential in image classification [16]. It consists of two successive operations: squeeze and excitation. In the squeeze operation, the input feature maps are passed to a global-average pooling (GAP) layer to generate channel-wise statistics, where each value is an average of the corresponding feature map. Thereafter, in the excitation operation, the SE block computes an appropriate weight for each feature map by using the channel-wise statistics and fully connected (FC) layers. The SE block multiplies the input feature maps by their weights and obtains the channel-weighted feature maps. Due to its success in image classification, the SE block has been adopted in many applications, including HAR. Zhongkai et al. [17] investigated the potential of the SE blocks by adding them to a list of state-of-the-art CNN models (e.g., VGG16, Inception, ResNet18, and PyramidNet18) and comparing the corresponding HAR performances. Mekruksavanich et al. [18] proposed a DL model called the SEResNet-BiGRU, which is a combination of residual blocks, SE blocks, and bidirectional gated recurrent units (BiGRUs), and applied it to transitional activity recognition. Khan et al. [21] proposed a multi-branch DL architecture where each branch uses a CNN model with an SE block to extract and re-weight feature maps. The above DL models with SE blocks are summarized in Table 3.
Several DL architectures have been extensively proposed and investigated in HAR using sensor data. In order to improve the HAR performances, instead of using only one branch, we can implement DL architectures with multiple branches such that several different and unique sets of feature maps will be obtained and helpful in classifying the activities. There are two common categories of multi-branch architectures. In the first category, we consider a scenario wherein there is a set of wearable sensors (e.g., IMUs) attached to parts of the user body (e.g., a wrist, an ankle, the chest). Therefore, different sets of sensor data are obtained initially. These sets of sensor data are inputted into a multi-branch DL architecture. Each branch receives one set and extracts the corresponding features by using the same DL model independently [19,22,23]. In order to obtain a set of feature maps on each branch, Rueda et al. [19] applied a series of convolutional layers and max pooling layers, Liu et al. [22] implemented stacked convolutional layers, and Al-qaness et al. [23] employed a CNN model with residual blocks. The feature maps of all branches are fused (i.e., feature fusion) by using concatenation. In the second category, we apply a set of sensor data to a multi-branch DL architecture where each branch uses a different DL model and results in a different set of features. Three-branch DL architectures were proposed in [20,21,24,25,26,27,28], where a CNN model [20,24,27,28], a hybrid of a CNN model and a bidirectional long short-term memory (LSTM) layer [25], a CNN model with an SE block [21], and a hybrid of convolutional layers and gated recurrent unit (GRU) layers [26] were used on each branch to extract a set of features. The differences among these branches are the kernel sizes of the convolutional layers [20,21,24,25,26,27,28] and the number of layers [20]. Similarly, the output feature sets are combined by using concatenation. We summarize the aforementioned multi-branch DL architectures in Table 4.
Recently, the SE mechanism (i.e., squeeze and excitation operations) has been applied to feature fusion in multi-branch DL architectures where feature maps from each branch will be recalibrated before fusing them together. Li et al. [29] proposed a model called the temporal-spectral-based squeeze-and-excitation feature fusion network (TS-SEFFNet) to classify motor imagery tasks by using electroencephalography (EEG) signals. The TS-SEFFNet receives EEG signals and uses two branches with different DL models called the deep-temporal convolution block and the multi-spectral convolution block to extract two different sets of feature maps. The feature maps of each set are recalibrated by using the SE mechanism. The TS-SEFFNet combines the outputs of these two branches by using concatenation.
Instead of using sensor data from one modality, multimodal classification [30] receives data from multiple modalities and has gained a significant amount of attention [31,32,33]. Essentially, a multi-modal classification model will be implemented based on a multi-branch architecture where each branch will receive different modal data and extract the corresponding features. These features obtained from various modalities will be combined and sent to the classification process. Since the feature maps from each modality contribute information unequally, an efficient fusion method must be investigated [31,32,33].
In addition, several SE-based feature fusion methods were extensively investigated in multi-modal classification. Jia et al. [34] proposed a feature fusion method called the multi-modal SE feature fusion module to combine feature maps from EEG signals and feature maps from electrooculogram (EOG) signals for sleep-staging classification. Unlike [16,29] where each branch computes the weights for the feature-map calibration in the excitation operation separately, the multi-modal SE feature fusion module will calculate the channel weights based on the channel-wise statistics from both EEG feature maps and EOG feature maps.
Shu et al. [35] proposed a DL model called the expansion-squeeze-excitation fusion network (ESE–FN) for elderly activity recognition using RGB videos and skeleton sequences. The ESE–FN applies two successive fusion modules (modal fusion and channel fusion) to combine RGB features and skeleton features properly. The modal-fusion module performs modal attention, where modal-wise weights are computed and applied to the corresponding modalities' feature maps. The channel-fusion module obtains channel attention by calculating channel-wise weights and applying them to the feature maps. Both modules apply a new attention mechanism called the expansion-squeeze-excitation, which consists of three operations: expansion, squeeze, and excitation. The expansion is operated by using convolutional layers to expand the depth along the modality dimension for the modal fusion and along the channel dimension for the channel fusion. The squeeze and excitation operations are similar to those in [16]. A summary of the work in [29,34,35] is shown in Table 5.

3. Data Collection and Preparation

In order to evaluate the proposed HAR classification architectures in Section 4 and Section 5, we select the datasets whose sensor data are from IMUs attached to various locations of the user body. Thereafter, we preprocess the sensor data by scaling and segmentation. The details are explained as follows. The sensor data are from the following three wearable-sensor datasets:
  • PAMAP2: The PAMAP2 dataset [36] contains the sensor data collected from nine subjects who performed 18 physical activities. However, here, only 12 activities are considered: lying, sitting, standing, ironing, vacuum cleaning, descending stairs, walking, Nordic walking, cycling, ascending stairs, running, and rope jumping. Three IMUs were attached to a wrist, the chest, and an ankle of each subject. Each IMU was equipped with two triaxial accelerometers, one triaxial gyroscope, and one triaxial magnetometer. As a result, 12 types of sensor data ($M = 12$) were obtained from each IMU. The sampling rate was set to 100 Hz.
  • DaLiAc: The DaLiAc dataset [37] contains the sensor data collected from 19 subjects who performed 13 physical activities: sitting, lying, standing, washing dishes, vacuuming, sweeping, walking, ascending stairs, descending stairs, treadmill running (8.3 km/h), bicycling on an ergometer (50 W), bicycling on an ergometer (100 W), and rope jumping. A total of four IMUs were attached to the right hip, the right wrist, the chest, and the left ankle. Each IMU was equipped with one triaxial accelerometer and one triaxial gyroscope. As a result, six types of sensor data ($M = 6$) were obtained from each IMU. The sampling rate was set to approximately 200 Hz.
  • DSAD: The DSAD dataset [38] contains the sensor data collected from eight subjects who performed 19 physical activities: sitting, standing, lying on the back, lying on the right side, ascending stairs, descending stairs, standing still in an elevator, moving around in an elevator, walking in a parking lot, walking on a treadmill at a speed of 4 km/h in a flat position, walking on a treadmill at a speed of 4 km/h at a 15-degree inclined position, running on a treadmill at a speed of 8 km/h, exercising on a stepper, exercising on a cross trainer, cycling on an exercise bike in a horizontal position, cycling on an exercise bike in a vertical position, rowing, jumping, and playing basketball. Five IMUs were attached to the torso, right arm, left arm, right leg, and left leg. Each IMU was equipped with one triaxial accelerometer, one triaxial gyroscope, and one triaxial magnetometer. As a result, nine types of sensor data ($M = 9$) were obtained from each IMU. The sampling rate was set to 25 Hz.
The sensor data used to predict the current activity are obtained from different sensor types and vary within different ranges. It is therefore common to apply data scaling such that the values of these sensor data lie within the same range. In this work, the standardization method is applied to transform the sensor data such that their mean and standard deviation are zero and one, respectively. Let $z_{t,m}^{(n)}$ be the sensor value at the t-th point obtained from the m-th sensor data of the n-th IMU. Its standardized value is obtained from:
$$\tilde{z}_{t,m}^{(n)} = \frac{z_{t,m}^{(n)} - \mu_m^{(n)}}{\sigma_m^{(n)}}, \quad (1)$$
where $\mu_m^{(n)}$ and $\sigma_m^{(n)}$ are the mean and standard deviation, respectively, of the values from the m-th sensor data of the n-th IMU.
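For illustration, the standardization above can be sketched in a few lines of Python; this is a minimal NumPy sketch (the function name and the handling of a zero standard deviation are ours, not from the original implementation):

```python
import numpy as np

def standardize(z):
    """Standardize one raw sensor stream z (1D array) to zero mean and unit
    standard deviation, as in Equation (1); applied per sensor type and per IMU."""
    mu = z.mean()
    sigma = z.std()
    return (z - mu) / sigma if sigma > 0 else z - mu
```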
Next, the series of standardized values $\tilde{z}_{t,m}^{(n)}$ is divided into segments by using a non-overlapping window method. Each segment consists of L values. Let $\mathbf{x}_m^{(n)} \in \mathbb{R}^{1 \times L}$ be a segment of the standardized values from the m-th sensor data of the n-th IMU. The row vector $\mathbf{x}_m^{(n)}$ can be expressed as
$$\mathbf{x}_m^{(n)} = \left[ \tilde{z}_{1,m}^{(n)}, \tilde{z}_{2,m}^{(n)}, \ldots, \tilde{z}_{L,m}^{(n)} \right]. \quad (2)$$
The length L is set to 300, 600, and 125 data points for PAMAP2, DaLiAc, and DSAD, respectively (corresponding to three-second, three-second, and five-second windows, respectively). A summary of the sensor data which will be used in the evaluation is shown in Table 6.
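The non-overlapping segmentation can likewise be sketched as follows (a minimal sketch; dropping the leftover tail that is shorter than L is our assumption):

```python
import numpy as np

def segment(z_tilde, L):
    """Split a standardized sensor stream into non-overlapping windows of length L
    (L = 300 for PAMAP2, 600 for DaLiAc, 125 for DSAD); returns an array of shape
    (number_of_segments, L). Samples left over at the end are discarded."""
    n_segments = len(z_tilde) // L
    return np.reshape(z_tilde[:n_segments * L], (n_segments, L))
```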

4. Proposed Architecture

The proposed architecture is shown in Figure 1, which is based on a multi-branch architecture. There are N branches to receive the inputs from N IMUs. The number N is equal to 3, 4, and 5 for the PAMAP2, DaLiAc, and DSAD datasets, respectively, as shown in Table 6. Each branch receives the input $\mathbf{X}^{(n)}$ from an IMU and uses a one-dimensional (1D) CNN model to extract a set of feature maps $\mathbf{A}^{(n)}$. Since each feature map carries information of different significance, we propose the merging-squeeze-excitation (MSE) feature fusion, which combines these N sets of feature maps $\mathbf{A}^{(n)}$ by applying the SE mechanism [16] and produces the channel-weighted and merged feature maps $\mathbf{Y}_{mse}$, which will be used to predict the corresponding activity. The details are provided as follows:

4.1. Input and Feature Extraction

The input $\mathbf{X}^{(n)} \in \mathbb{R}^{1 \times L \times M}$ is a three-dimensional (3D) array (consisting of the height, width, and channel dimensions) storing data segments of all sensor data from the n-th IMU, where M is the number of sensor data types per IMU and L is the number of data points in one segment. It can be expressed as $\mathbf{X}^{(n)} = [\mathbf{x}_1^{(n)}; \mathbf{x}_2^{(n)}; \ldots; \mathbf{x}_M^{(n)}]$, where $\mathbf{x}_m^{(n)}$ is a data segment of the m-th sensor from the n-th IMU as expressed in Equation (2). Note that $[(\cdot); (\cdot); \ldots; (\cdot)]$ denotes that the elements inside are arranged along the channel dimension. Each branch applies a 1D CNN model to extract feature maps $\mathbf{A}^{(n)} \in \mathbb{R}^{1 \times W \times C}$, where W and C are the width and number of channels, respectively. The feature maps $\mathbf{A}^{(n)}$ can be expressed as $\mathbf{A}^{(n)} = [\mathbf{a}_1^{(n)}; \mathbf{a}_2^{(n)}; \ldots; \mathbf{a}_C^{(n)}]$, where the row vector $\mathbf{a}_c^{(n)} = [a_{1,c}^{(n)}, a_{2,c}^{(n)}, \ldots, a_{W,c}^{(n)}]$ is a 1D feature map and $a_{w,c}^{(n)}$ is a value at the w-th data point of the c-th channel. The following CNN models are considered as feature extractors due to their simplicity, low number of layers, and low computational complexity: LeNet5 [39], AlexNet [40], and VGG16 [41]. Note that these models originally consist of two-dimensional (2D) layers since they were designed for image processing. Here, we implement their 1D versions by changing all 2D layers to 1D layers. For example, 2D convolutional layers are replaced by 1D convolutional layers and 2D max pooling layers are replaced by 1D max pooling layers. The other parameters, such as the numbers of filters and kernel sizes, are unchanged. These 1D CNN structures are summarized in Appendix A. The width W and the number of channels C of $\mathbf{A}^{(n)}$ according to the considered CNN models are shown in Table 7. In addition to these three CNN models, other CNN models can be applied to extract $\mathbf{A}^{(n)}$.
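As an example of such a feature extractor, the 1D LeNet5 of Table A1 can be written in Keras roughly as follows; this is a minimal sketch under our reading of Appendix A (the function name is ours, and the batch dimension is implicit):

```python
import tensorflow as tf
from tensorflow.keras import layers

def lenet5_1d_extractor(L, M):
    """1D LeNet5 feature extractor for one IMU branch (cf. Table A1).
    Input: a segment of shape (L, M); output: feature maps of shape (W, C)."""
    inputs = tf.keras.Input(shape=(L, M))
    x = layers.Conv1D(6, 5, padding="same", activation="tanh")(inputs)
    x = layers.AveragePooling1D(pool_size=2, strides=2)(x)
    x = layers.Conv1D(16, 5, padding="valid", activation="tanh")(x)
    x = layers.AveragePooling1D(pool_size=2, strides=2)(x)
    x = layers.Conv1D(120, 5, padding="valid", activation="tanh")(x)
    return tf.keras.Model(inputs, x, name="lenet5_1d")
```

For PAMAP2 ($L = 300$, $M = 12$), this extractor yields feature maps with $W = 69$ and $C = 120$, consistent with Table A1.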

4.2. Merging-Squeeze-Excitation Feature Fusion

Conventional feature fusion methods [19,20,21,22,23,24,25,26,27,28] combine all feature maps from all branches equally without considering which feature maps are useful. However, some feature maps in $\mathbf{A}^{(n)}$ may be unhelpful for the classification and should be suppressed, while the informative feature maps in $\mathbf{A}^{(n)}$ should be emphasized. Therefore, in this work, inspired by the SE mechanism [16], we propose a feature fusion method called the merging-squeeze-excitation, which is aware of this issue. As shown in Figure 1, all sets of feature maps $\mathbf{A}^{(n)}$, for $n = 1, 2, \ldots, N$, are first combined in the pre-merging step to create a set of pre-merged feature maps $\mathbf{B}$. Unlike [16], in the squeeze step, the channel-wise statistics $\mathbf{h}^{(n)}$ used to compute the channel weights $\mathbf{s}^{(n)}$ are computed according to both the feature maps $\mathbf{A}^{(n)}$ and the pre-merged feature maps $\mathbf{B}$. This implies that the importance of each feature map in $\mathbf{A}^{(n)}$ is measured not only from $\mathbf{A}^{(n)}$ but also from $\mathbf{B}$. Accordingly, we find the corresponding channel weights $\mathbf{s}^{(n)}$, multiply them by $\mathbf{A}^{(n)}$, and obtain the channel-weighted feature maps $\mathbf{P}^{(n)}$ in the excitation step. Finally, in the post-merging step, we recombine $\mathbf{P}^{(n)}$, for $n = 1, 2, \ldots, N$, using the same merging method as in the pre-merging step to obtain the channel-weighted and merged feature maps $\mathbf{Y}_{mse}$, which will be used in the classification process later. The details of these steps are explained as follows.

4.2.1. Pre-Merging

We use the pre-merging step to initially combine the feature maps $\mathbf{A}^{(n)}$ from all N branches together and to produce the pre-merged feature maps $\mathbf{B} \in \mathbb{R}^{1 \times W \times C}$, which will be used along with $\mathbf{A}^{(n)}$ to compute the channel weights. The feature maps $\mathbf{B}$ can be expressed as $\mathbf{B} = [\mathbf{b}_1; \mathbf{b}_2; \ldots; \mathbf{b}_C]$, where the row vector $\mathbf{b}_c \in \mathbb{R}^{1 \times W}$ is expressed as $\mathbf{b}_c = [b_{1,c}, b_{2,c}, \ldots, b_{W,c}]$ and $b_{w,c}$ is a value at the w-th data point of the c-th channel. Several feature merging methods are available [42]. Here, we investigate and compare the following methods:
  • Addition merging creates feature maps $\mathbf{B}$ by using the element-wise addition. The value $b_{w,c}$ is obtained from
    $$b_{w,c} = \sum_{n=1}^{N} a_{w,c}^{(n)}, \quad (3)$$
    where $a_{w,c}^{(n)}$ is a value at the w-th data point of the c-th channel of $\mathbf{A}^{(n)}$.
  • Maximum merging creates feature maps $\mathbf{B}$ by using the element-wise maximum operation. The value $b_{w,c}$ is obtained from
    $$b_{w,c} = \max\left\{ a_{w,c}^{(1)}, a_{w,c}^{(2)}, \ldots, a_{w,c}^{(N)} \right\}. \quad (4)$$
  • Minimum merging creates feature maps $\mathbf{B}$ by using the element-wise minimum operation. The value $b_{w,c}$ is obtained from
    $$b_{w,c} = \min\left\{ a_{w,c}^{(1)}, a_{w,c}^{(2)}, \ldots, a_{w,c}^{(N)} \right\}. \quad (5)$$
  • Average merging creates feature maps $\mathbf{B}$ by using the element-wise averaging operation. The value $b_{w,c}$ is obtained from
    $$b_{w,c} = \frac{1}{N} \sum_{n=1}^{N} a_{w,c}^{(n)}. \quad (6)$$
For future use, we denote the merging operation as $F_{Merge}(\cdot)$. Specifically, we have
$$\mathbf{B} = F_{Merge}\left( \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)} \right). \quad (7)$$
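All four merging operations are element-wise tensor reductions; the following minimal TensorFlow sketch (the function name is ours) operates on a list of branch feature maps of shape (batch, W, C):

```python
import tensorflow as tf

def merge(feature_maps, method="average"):
    """Element-wise merging of N branch feature maps, Equations (3)-(7).
    feature_maps: list of N tensors, each of shape (batch, W, C)."""
    stacked = tf.stack(feature_maps, axis=0)   # shape (N, batch, W, C)
    if method == "addition":
        return tf.reduce_sum(stacked, axis=0)
    if method == "maximum":
        return tf.reduce_max(stacked, axis=0)
    if method == "minimum":
        return tf.reduce_min(stacked, axis=0)
    if method == "average":
        return tf.reduce_mean(stacked, axis=0)
    raise ValueError(f"unknown merging method: {method}")
```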

4.2.2. Squeeze and Excitation

In the squeeze-and-excitation step, we recalibrate each set of feature maps $\mathbf{A}^{(n)}$ such that the informative feature maps are emphasized and the ambiguous feature maps are suppressed by using channel weights, which are computed according to both $\mathbf{A}^{(n)}$ and $\mathbf{B}$. First, we obtain the channel-wise statistics $\mathbf{u} \in \mathbb{R}^{C}$ by passing $\mathbf{B}$ to a 1D GAP layer and the channel-wise statistics $\mathbf{g}^{(n)} \in \mathbb{R}^{C}$ by passing $\mathbf{A}^{(n)}$ to another 1D GAP layer. The statistics $\mathbf{u}$ are expressed as $\mathbf{u} = [u_1, u_2, \ldots, u_C]^T$ and the statistics $\mathbf{g}^{(n)}$ are expressed as $\mathbf{g}^{(n)} = [g_1^{(n)}, g_2^{(n)}, \ldots, g_C^{(n)}]^T$, where $[\cdot, \cdot, \ldots, \cdot]^T$ denotes the transpose, $u_c$ is obtained by averaging the values in the c-th 1D feature map of $\mathbf{B}$, and $g_c^{(n)}$ is obtained by averaging the values in the c-th 1D feature map of $\mathbf{A}^{(n)}$. Specifically, we have
$$u_c = \frac{1}{W} \sum_{w=1}^{W} b_{w,c}, \quad (8)$$
and
$$g_c^{(n)} = \frac{1}{W} \sum_{w=1}^{W} a_{w,c}^{(n)}. \quad (9)$$
Thereafter, we obtain the channel-wise statistics $\mathbf{h}^{(n)} \in \mathbb{R}^{C}$ from
$$\mathbf{h}^{(n)} = \mathbf{u} + \mathbf{g}^{(n)}. \quad (10)$$
The statistics $\mathbf{h}^{(n)}$ are expressed as $\mathbf{h}^{(n)} = [h_1^{(n)}, h_2^{(n)}, \ldots, h_C^{(n)}]^T$.
Next, a set of channel weights $\mathbf{s}^{(n)} \in \mathbb{R}^{C}$, where $\mathbf{s}^{(n)} = [s_1^{(n)}, s_2^{(n)}, \ldots, s_C^{(n)}]^T$, for individual $\mathbf{A}^{(n)}$, is obtained by using two fully connected (FC) layers with the ReLU activation after the first FC layer and the Sigmoid activation after the second FC layer [16]:
$$\mathbf{s}^{(n)} = \sigma\left( \mathbf{W}_2^{(n)} \, \delta\left( \mathbf{W}_1^{(n)} \mathbf{h}^{(n)} \right) \right), \quad (11)$$
where $\sigma(\cdot)$ is the Sigmoid activation function, $\delta(\cdot)$ is the ReLU activation function, $\mathbf{W}_1^{(n)} \in \mathbb{R}^{\frac{C}{r} \times C}$ is the weight matrix of the first FC layer, $\mathbf{W}_2^{(n)} \in \mathbb{R}^{C \times \frac{C}{r}}$ is the weight matrix of the second FC layer, and r is the reduction ratio which is used to reduce the first FC layer's output dimension.
Finally, we recalibrate the feature maps $\mathbf{A}^{(n)}$ according to the channel weights $\mathbf{s}^{(n)}$ to emphasize useful feature maps and suppress ambiguous feature maps and, then, obtain a set of channel-weighted feature maps $\mathbf{P}^{(n)} \in \mathbb{R}^{1 \times W \times C}$. The feature maps $\mathbf{P}^{(n)}$ can be expressed as $\mathbf{P}^{(n)} = [\mathbf{p}_1^{(n)}; \mathbf{p}_2^{(n)}; \ldots; \mathbf{p}_C^{(n)}]$, where $\mathbf{p}_c^{(n)} = [p_{1,c}^{(n)}, p_{2,c}^{(n)}, \ldots, p_{W,c}^{(n)}]$ is a 1D feature map obtained from the multiplication between the channel weight $s_c^{(n)}$ and the 1D feature map $\mathbf{a}_c^{(n)}$:
$$\mathbf{p}_c^{(n)} = s_c^{(n)} \mathbf{a}_c^{(n)}. \quad (12)$$
For future use, we denote the squeeze-and-excitation operation to compute $\mathbf{P}^{(n)}$ as
$$\mathbf{P}^{(n)} = F_{SE}\left( \mathbf{A}^{(n)}, \mathbf{B} \right). \quad (13)$$
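This per-branch squeeze-and-excitation step can be sketched as a small Keras layer; the following is a minimal sketch operating on tensors of shape (batch, W, C), with layer and variable names of our own choosing:

```python
import tensorflow as tf
from tensorflow.keras import layers

class SqueezeExcite(layers.Layer):
    """Squeeze-and-excitation step of the MSE feature fusion, Equations (8)-(13):
    recalibrates branch feature maps A^(n) using channel statistics computed from
    both A^(n) and the pre-merged feature maps B."""

    def __init__(self, channels, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.gap = layers.GlobalAveragePooling1D()
        self.fc1 = layers.Dense(channels // reduction, activation="relu")
        self.fc2 = layers.Dense(channels, activation="sigmoid")

    def call(self, a, b):
        g = self.gap(a)              # channel statistics of A^(n), Equation (9)
        u = self.gap(b)              # channel statistics of B, Equation (8)
        h = u + g                    # combined statistics, Equation (10)
        s = self.fc2(self.fc1(h))    # channel weights, Equation (11)
        return a * s[:, None, :]     # channel-wise recalibration, Equation (12)
```

Each branch uses its own instance of this layer, so the FC weights $\mathbf{W}_1^{(n)}$ and $\mathbf{W}_2^{(n)}$ are learned per branch.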

4.2.3. Post-Merging

The post-merging step will apply the merging method used in the pre-merging step to combine the N sets of channel-weighted feature maps $\mathbf{P}^{(n)}$ and obtain the channel-weighted and merged feature maps $\mathbf{Y}_{mse}$. Similar to Section 4.2.1, we can express
$$\mathbf{Y}_{mse} = F_{Merge}\left( \mathbf{P}^{(1)}, \mathbf{P}^{(2)}, \ldots, \mathbf{P}^{(N)} \right). \quad (14)$$
The set of feature maps $\mathbf{Y}_{mse}$ will be used in the classification process.
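Putting the three steps together, the whole MSE feature fusion can be sketched as follows, assuming the merge() function and the SqueezeExcite layer sketched above are in scope:

```python
def mse_feature_fusion(branch_maps, method="average", reduction=8):
    """Full MSE feature fusion (Figure 1): pre-merging, per-branch
    squeeze-and-excitation, and post-merging with the same operation."""
    b = merge(branch_maps, method)                          # pre-merged maps B, Equation (7)
    channels = branch_maps[0].shape[-1]
    weighted = [SqueezeExcite(channels, reduction)(a, b)    # P^(n), Equation (13)
                for a in branch_maps]
    return merge(weighted, method)                          # Y_mse, Equation (14)
```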

4.3. Classification

In this work, the classifier shown in Figure 1 consists of a 1D GAP layer and two FC layers, where the ReLU activation function is used in the first FC layer and the Softmax activation function is used in the second FC layer. The numbers of neurons in the first and second FC layers are 1024 and K, respectively, where K is the number of classes (depending on the dataset). As specified in Section 3, the number of classes K is 12, 13, and 19 for the PAMAP2, DaLiAc, and DSAD datasets, respectively. Note that other classifier structures are also applicable.
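A minimal sketch of this classification head (the function name is ours):

```python
from tensorflow.keras import layers

def classifier_head(fused_maps, num_classes):
    """Classification head of Section 4.3: a 1D global-average-pooling layer
    followed by two FC layers (1024 neurons with ReLU, then K with Softmax)."""
    x = layers.GlobalAveragePooling1D()(fused_maps)
    x = layers.Dense(1024, activation="relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```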

5. Extensions of Merging-Squeeze-Excitation Feature Fusion

In this section, we present four extensions of the MSE feature fusion: MSE feature fusion with local skip connections, MSE feature fusion with a global skip connection, MSE feature fusion with global channel attention, and deep MSE feature fusion. Their performances will be evaluated and compared in Section 6.

5.1. MSE Feature Fusion with Skip Connections

Skip connections were used in ResNet models [43] to solve the vanishing-gradient issue. Here, we apply this technique to the MSE feature fusion such that the feature maps entering the classification will be at least as good as the feature maps obtained from the earlier step. We consider two possible positions to add skip connections; both variants are sketched in the code after the following list.
  • The MSE feature fusion with local skip connections is shown in Figure 2a, where we add a skip connection to each branch of $\mathbf{A}^{(n)}$. As a result, we have
    $$\mathbf{Q}^{(n)} = \mathbf{A}^{(n)} + \mathbf{P}^{(n)}. \quad (15)$$
    The feature maps $\mathbf{Y}_{lsc}$ that will enter the classifier are obtained from
    $$\mathbf{Y}_{lsc} = F_{Merge}\left( \mathbf{Q}^{(1)}, \mathbf{Q}^{(2)}, \ldots, \mathbf{Q}^{(N)} \right). \quad (16)$$
  • The MSE feature fusion with a global skip connection is shown in Figure 2b. We create a skip connection on the MSE feature fusion such that the pre-merged feature maps $\mathbf{B}$ from the pre-merging step will be added to the channel-weighted and merged feature maps $\mathbf{Y}_{mse}$. Thereafter, we have $\mathbf{Y}_{gsc}$ entering the classifier as follows:
    $$\mathbf{Y}_{gsc} = \mathbf{Y}_{mse} + \mathbf{B}, \quad (17)$$
    where $\mathbf{Y}_{mse}$ is defined in (14). The prediction will be based on both $\mathbf{Y}_{mse}$ and $\mathbf{B}$.
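The two skip-connection variants can be sketched as follows, again assuming the merge() function and the SqueezeExcite layer from Section 4.2 are in scope:

```python
def mse_fusion_local_skip(branch_maps, method="average", reduction=8):
    """MSE feature fusion with local skip connections (Figure 2a): each branch
    adds its original maps A^(n) back to its recalibrated maps P^(n)."""
    b = merge(branch_maps, method)
    channels = branch_maps[0].shape[-1]
    q = [a + SqueezeExcite(channels, reduction)(a, b)      # Q^(n), Equation (15)
         for a in branch_maps]
    return merge(q, method)                                # Y_lsc, Equation (16)

def mse_fusion_global_skip(branch_maps, method="average", reduction=8):
    """MSE feature fusion with a global skip connection (Figure 2b): the
    pre-merged maps B are added to the fused maps Y_mse."""
    b = merge(branch_maps, method)
    channels = branch_maps[0].shape[-1]
    weighted = [SqueezeExcite(channels, reduction)(a, b) for a in branch_maps]
    return merge(weighted, method) + b                     # Y_gsc, Equation (17)
```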

5.2. MSE Feature Fusion with Global Channel Attention

In the proposed MSE feature fusion shown in Figure 1, the channel-weighted and merged feature maps $\mathbf{Y}_{mse}$ are obtained from $F_{Merge}\left( \mathbf{P}^{(1)}, \mathbf{P}^{(2)}, \ldots, \mathbf{P}^{(N)} \right)$. In addition, we may compute a different set of channel-weighted feature maps based directly on the channel dependency of $\mathbf{B}$ (the output of the pre-merging step). Figure 3 shows the MSE feature fusion with global channel attention, where we create an additional set of channel-weighted feature maps $\mathbf{R} \in \mathbb{R}^{1 \times W \times C}$ according to $\mathbf{B}$. The set of feature maps $\mathbf{R}$ is denoted as $\mathbf{R} = [\mathbf{r}_1; \mathbf{r}_2; \ldots; \mathbf{r}_C]$, where $\mathbf{r}_c = [r_{1,c}, r_{2,c}, \ldots, r_{W,c}]$ and $r_{w,c}$ is a value. Similar to the previous calculation, we can obtain $\mathbf{R}$ according to the following steps. We find the channel weights $\mathbf{v} \in \mathbb{R}^{C}$, where $\mathbf{v} = [v_1, v_2, \ldots, v_C]^T$ and $v_c$ is a value, from
$$\mathbf{v} = \sigma\left( \mathbf{W}_2 \, \delta\left( \mathbf{W}_1 \mathbf{u} \right) \right), \quad (18)$$
where $\mathbf{u}$ is the channel-wise statistics defined in Section 4.2.2, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is the weight matrix of the first FC layer, and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ is the weight matrix of the second FC layer. The c-th 1D feature map $\mathbf{r}_c$ is equal to the feature map $\mathbf{b}_c$ weighted by $v_c$:
$$\mathbf{r}_c = v_c \mathbf{b}_c. \quad (19)$$
Finally, the set of channel-weighted and merged feature maps $\mathbf{Y}_{gca}$ entering the classifier is obtained from
$$\mathbf{Y}_{gca} = \mathbf{Y}_{mse} + \mathbf{R}, \quad (20)$$
where $\mathbf{Y}_{mse}$ is defined in (14). As a result, the prediction will be computed from both local-channel-attention and global-channel-attention feature maps.
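This extension can be sketched as follows (a minimal sketch; the extra SE path on B reuses the same GAP-and-two-FC-layer pattern as Section 4.2.2, and merge() and SqueezeExcite are assumed to be in scope):

```python
from tensorflow.keras import layers

def mse_fusion_global_attention(branch_maps, method="average", reduction=8):
    """MSE feature fusion with global channel attention (Figure 3): an extra SE
    path recalibrates the pre-merged maps B directly and its output R is added
    to Y_mse."""
    b = merge(branch_maps, method)
    channels = branch_maps[0].shape[-1]
    weighted = [SqueezeExcite(channels, reduction)(a, b) for a in branch_maps]
    y_mse = merge(weighted, method)
    u = layers.GlobalAveragePooling1D()(b)                 # statistics of B
    v = layers.Dense(channels, activation="sigmoid")(
        layers.Dense(channels // reduction, activation="relu")(u))   # Equation (18)
    r = b * v[:, None, :]                                  # R, Equation (19)
    return y_mse + r                                       # Y_gca, Equation (20)
```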

5.3. Deep MSE Feature Fusion

Instead of using only one-level MSE feature fusion to combine and recalibrate the feature maps $\mathbf{A}^{(n)}$ as shown in Figure 1, we can stack a series of MSE feature fusion blocks to create the deep MSE feature fusion, where feature maps are merged and weighted multiple times. The structure of the deep MSE feature fusion is shown in Figure 4a, where D MSE feature fusion blocks are connected in series. The d-th block, as shown in Figure 4b, receives the channel-weighted feature maps $\tilde{\mathbf{P}}^{(d-1,n)}$, for $n = 1, 2, \ldots, N$, and the channel-weighted and merged feature maps $\tilde{\mathbf{Y}}_{deep,(d-1)}$ from the previous block to create the new channel-weighted feature maps $\tilde{\mathbf{P}}^{(d,n)}$ and the new channel-weighted and merged feature maps $\tilde{\mathbf{Y}}_{mse,(d)}$. Note that $\tilde{\mathbf{P}}^{(0,n)}$ and $\tilde{\mathbf{Y}}_{deep,(0)}$ are equal to $\mathbf{A}^{(n)}$ and $\mathbf{B}$ (defined in (7)), respectively. Similar to Section 4.2.2, we have $\tilde{\mathbf{P}}^{(d,n)} = F_{SE}\left( \tilde{\mathbf{P}}^{(d-1,n)}, \tilde{\mathbf{Y}}_{deep,(d-1)} \right)$ and $\tilde{\mathbf{Y}}_{mse,(d)} = F_{Merge}\left( \tilde{\mathbf{P}}^{(d,1)}, \tilde{\mathbf{P}}^{(d,2)}, \ldots, \tilde{\mathbf{P}}^{(d,N)} \right)$. Thereafter, the new merged feature maps $\tilde{\mathbf{Y}}_{deep,(d)}$ will be obtained from
$$\tilde{\mathbf{Y}}_{deep,(d)} = \tilde{\mathbf{Y}}_{deep,(d-1)} + \tilde{\mathbf{Y}}_{mse,(d)}, \quad (21)$$
where a skip connection is used to keep the deep MSE feature fusion stable. The feature maps $\tilde{\mathbf{Y}}_{deep,(D)}$ of the last block will be used in the classification process.
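The stacked structure can be sketched with a simple loop (a minimal sketch, again assuming the merge() function and the SqueezeExcite layer are in scope):

```python
def deep_mse_fusion(branch_maps, depth, method="average", reduction=8):
    """Deep MSE feature fusion (Figure 4): D stacked MSE blocks; each block
    re-weights the branch maps using the running fused maps and accumulates
    its output through a skip connection, Equation (21)."""
    p = list(branch_maps)              # P~(0,n) = A^(n)
    y = merge(p, method)               # Y~deep,(0) = B
    channels = branch_maps[0].shape[-1]
    for _ in range(depth):
        p = [SqueezeExcite(channels, reduction)(a, y) for a in p]
        y = y + merge(p, method)       # Y~deep,(d) = Y~deep,(d-1) + Y~mse,(d)
    return y
```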

6. Experimental Results and Discussion

6.1. Experimental Setup

All experiments were implemented by using the Python programming language and Python libraries such as Scikit-learn, TensorFlow, and Keras. They were run on the Google Colab Pro+ platform. The performances of the investigated models were measured by the accuracy score, which is obtained from
$$\mathrm{Accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k + TN_k}{TP_k + FP_k + TN_k + FN_k} \times 100, \quad (22)$$
where K is the number of classes, $TP_k$ is the number of true positives of the k-th class, $FP_k$ is the number of false positives of the k-th class, $TN_k$ is the number of true negatives of the k-th class, and $FN_k$ is the number of false negatives of the k-th class. There are two basic approaches to evaluating model performance: the training–validation–testing split and k-fold cross validation. The training–validation–testing split divides a dataset into three separate parts: a training set, a validation set, and a testing set. Therefore, the performance results of the investigated model will highly depend on the data in the testing set. In order to avoid this problem, similar to [18,22,25,27], we applied k-fold cross validation, with k set to 10, in the experiments. The 10-fold cross validation divides a dataset into 10 parts. One part is selected as the testing set while the remaining nine parts form the training set. We evaluate an investigated model 10 times, each time selecting a different part as the testing set. Thereafter, the performance results are the average of the testing scores. The investigated models were trained by minimizing the categorical cross-entropy using the Adam optimizer with the settings $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-7}$. The learning rate was set to 0.001. The batch size was 32. The number of epochs was 40. We did not experience an overfitting issue; our training scores are only slightly higher than the testing scores.
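The training configuration can be sketched in Keras as follows (a minimal sketch; the function name is ours, and the model and the per-fold data are assumed to be prepared elsewhere, e.g., inside the 10-fold cross-validation loop):

```python
import tensorflow as tf

def compile_and_train(model, x_train, y_train):
    """Training settings from Section 6.1: Adam optimizer, categorical
    cross-entropy loss, batch size 32, and 40 epochs."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(x_train, y_train, batch_size=32, epochs=40, verbose=0)
    return model
```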

6.2. Baseline Architectures

We consider a single-branch DL architecture and a multi-branch DL architecture shown in Figure 5 as our baseline architectures for the performance comparison. The classifiers in these two architectures are similar to those used in the proposed MSE feature fusion as shown in Figure 1 and explained in Section 4.3.
  • For a single-branch DL architecture, all available sensor data are combined first before we extract a set of features [13]. Here, all sensor data $\mathbf{X}^{(n)}$ (from N IMUs) are concatenated along the channel dimension. We denote this new array as $\mathbf{X} \in \mathbb{R}^{1 \times L \times NM}$. Thereafter, a 1D CNN model extracts a set of feature maps $\mathbf{A} \in \mathbb{R}^{1 \times W \times C}$, which will be used in the classification process.
  • A multi-branch DL architecture consists of N branches that receive the sensor data $\mathbf{X}^{(n)}$ individually [19,22,23]. Each branch extracts a set of feature maps $\mathbf{A}^{(n)}$ using a 1D CNN model. Here, we concatenate these N sets of feature maps along the channel dimension and obtain a new array $\mathbf{Y}_{mb} \in \mathbb{R}^{1 \times W \times NC}$, which is sent to the classifier.
Note that the sensor data $\mathbf{X}^{(n)}$ and feature maps $\mathbf{A}^{(n)}$ were defined in Section 4. The values W and C are shown in Table 7. Both baselines are sketched in the code below.
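A minimal sketch of the two baselines (the function names are ours; the extractors are the 1D CNN models of Section 4.1):

```python
from tensorflow.keras import layers

def single_branch_baseline(inputs, extractor):
    """Single-branch baseline (Figure 5): concatenate the raw IMU segments along
    the channel dimension, then extract one set of feature maps with one CNN."""
    x = layers.Concatenate(axis=-1)(inputs)           # shape (batch, L, N*M)
    return extractor(x)                               # shape (batch, W, C)

def multi_branch_baseline(inputs, extractors):
    """Multi-branch baseline (Figure 5): one extractor per IMU, then concatenate
    the branch feature maps along the channel dimension."""
    branch_maps = [f(x) for f, x in zip(extractors, inputs)]
    return layers.Concatenate(axis=-1)(branch_maps)   # shape (batch, W, N*C)
```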
The performances of these architectures are evaluated by classifying the PAMAP2, DaLiAc, and DSAD datasets, where three CNN models (LeNet5, AlexNet, and VGG16) are used as feature extractors. The accuracy scores are shown in Table 8, which will be compared to those achieved by the proposed architectures. We observe that the single-branch architectures outperform the multi-branch architectures in many cases. A reason is that the multi-branch architectures extract too many features (the output of the GAP in the classifier) and some of them may be ambiguous. The number of features out of the GAP in the multi-branch architectures is equal to NC while the number of features out of the GAP in the single-branch architectures is equal to C. As seen in Table 8, the single-branch architectures using AlexNet offer the highest accuracy scores of 98.77%, 97.60%, and 97.18% for the PAMAP2, DaLiAc, and DSAD datasets, respectively.

6.3. Proposed Merging-Squeeze-Excitation Feature Fusion

The performances of the proposed MSE feature fusion in Section 4 and its extensions in Section 5 are shown in the following subsections. For each proposed architecture, we will compare the accuracy scores among the merging methods (addition, maximum, minimum, and average) and DL models (LeNet5, AlexNet, and VGG16) to determine which combination offers the highest accuracy score on classifying each dataset. Thereafter, the highest accuracy scores of the proposed architectures are compared to determine the best architecture. Note that the reduction ratio r is fixed to eight for all experiments. Varying r is considered as future work.

6.3.1. MSE Feature Fusion

Table 9 presents the accuracy scores of the MSE feature fusion proposed in Section 4 according to the merging methods, DL models, and datasets. We have the following results:
  • The highest accuracy score in each dataset is indicated by the asterisk (*). The MSE feature fusion using the minimum merging and AlexNet achieves the highest accuracy score of 99.17% for the PAMAP2 dataset. The MSE feature fusion using the average merging and AlexNet achieves the highest accuracy scores of 98.32% and 98.04% for the DaLiAc and DSAD datasets, respectively.
  • We compare the accuracy scores of the MSE feature fusion to those of the baseline architectures in Section 6.2. According to the highest accuracy scores obtained from these architectures, the results show that the MSE feature fusion outperforms the baseline models.
  • Among the considered merging methods, the MSE feature fusion using the addition merging offers the worst accuracy scores. The MSE feature fusion architectures using the other merging methods provide the same level of performance. Their accuracy scores are rather close to each other. We do not have a conclusive result on which merging method is the best.
  • Among the considered DL models used as feature extractors, the MSE feature fusion using AlexNet outperforms the MSE feature fusion using the other DL models.

6.3.2. MSE Feature Fusion with Skip Connections

Table 10, Table 11 and Table 12 show the accuracy scores of the MSE feature fusion with skip connections proposed in Section 5.1 on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. Each table presents the accuracy scores according to the merging methods, the DL models, and skip-connection methods. We have the following results:
  • On classifying the PAMAP2 dataset (Table 10), the MSE feature fusion with local skip connections achieves the highest accuracy score of 99.18% when using the minimum merging and AlexNet, while the MSE feature fusion with a global skip connection offers the highest accuracy score of 99.24% when using the average merging and AlexNet. Both architectures outperform the original MSE feature fusion (whose highest accuracy score is 99.17%).
  • On classifying the DaLiAc dataset (Table 11), the MSE feature fusion with local skip connections achieves the highest accuracy score of 98.59% when using the minimum merging and AlexNet, while the MSE feature fusion with a global skip connection offers the highest accuracy score of 98.42% when using the minimum merging and AlexNet. Both architectures outperform the original MSE feature fusion (whose highest accuracy score is 98.32%).
  • On classifying the DSAD dataset (Table 12), the MSE feature fusion with local skip connections achieves the highest accuracy score of 98.02% when using the average merging and AlexNet while the MSE feature fusion with a global skip connection offers the highest accuracy score of 97.97% when using the average merging and AlexNet. Both architectures offer lower accuracy scores than that of the original MSE feature fusion (whose highest accuracy score is 98.04%).
  • Since the results are not conclusive, we cannot indicate whether the MSE feature fusion with skip connections is better than the original MSE feature fusion nor which skip connection method is the best.

6.3.3. MSE Feature Fusion with Global Channel Attention

Table 13 shows the accuracy scores of the MSE feature fusion with global channel attention proposed in Section 5.2 according to the merging methods, DL models, and datasets. We have the following results:
  • On classifying the PAMAP2 and DaLiAc datasets, the MSE feature fusion with global channel attention achieves the highest accuracy score of 99.17% and 98.08%, respectively, when using the minimum merging and AlexNet.
  • On classifying the DSAD dataset, the MSE feature fusion with global channel attention achieves the highest accuracy scores of 97.87% when using the average merging and AlexNet.
  • By comparing these accuracy scores to those of the original MSE feature fusion, we see that the original MSE feature fusion outperforms the MSE feature fusion with global channel attention. The feature maps obtained by using the global channel attention do not provide any additional information.

6.3.4. Deep MSE Feature Fusion

Table 14, Table 15 and Table 16 show the accuracy scores of the deep MSE feature fusion (in Section 5.3) using AlexNet as the feature extractor on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. We consider only AlexNet since it outperforms the other DL models as shown in the previous subsections. Each table presents the accuracy scores according to the merging methods and the number of MSE feature fusion blocks (D). We see that the deep MSE feature fusion with D = 1 offers the highest accuracy scores for all three datasets (i.e., 99.17% for the PAMAP2 dataset, 98.32% for the DaLiAc dataset, and 98.04% for the DSAD dataset). In fact, with D = 1, the deep MSE feature fusion is equivalent to the original MSE feature fusion. This indicates that, for the investigated datasets, increasing the number of MSE feature fusion blocks does not provide any further useful information to the output feature maps $\tilde{\mathbf{Y}}_{deep,(D)}$ which are used in the classification process.

6.4. Computational Complexity Comparison

Table 17, Table 18 and Table 19 show the numbers of trainable parameters of the baseline architectures, the proposed MSE feature fusion, and the extensions of the MSE feature fusion on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. We do not specify the numbers of trainable parameters for individual merging methods since they are the same. The following results are obtained:
  • The numbers of trainable parameters of the proposed MSE feature fusion are higher than those of the single-branch architecture since the proposed MSE feature fusion consists of several branches using CNN models as feature extractors. On the other hand, the proposed MSE feature fusion requires lower numbers of trainable parameters than the multi-branch architecture does since the MSE feature fusion reduces the number of features which will enter the classification process by using the addition, maximum, minimum, and average merging instead of the concatenation merging.
  • The numbers of trainable parameters of the extensions (of the MSE feature fusion) are slightly higher than those of the proposed MSE feature fusion since the modification parts in the extensions require few trainable parameters.

6.5. Performance Comparison to Other HAR Approaches

Table 20 shows the accuracy scores of other HAR approaches which were evaluated by using the PAMAP2, DaLiAc, and DSAD datasets. These accuracy scores were presented in their publications. Note that the evaluation setups and pre-processing may be different from ours. We compare them to the highest accuracy scores achieved by the original MSE feature fusion (Section 6.3.1). The proposed MSE feature fusion offers higher accuracy scores than those obtained from the other approaches.

7. Conclusions and Future Work

In this work, we proposed a feature fusion method called the merging-squeeze-excitation (MSE) feature fusion for wearable-sensor-based HAR using multi-branch architectures. The MSE feature fusion recalibrates the feature maps during the fusion. Each feature map is emphasized or suppressed according to its importance measured from both itself and the corresponding pre-merged feature map. In addition, we presented the following four extensions of the MSE feature fusion: the MSE feature fusion with local skip connections, the MSE feature fusion with a global skip connection, the MSE feature fusion with global channel attention, and the deep MSE feature fusion. LeNet5, AlexNet, and VGG16 were applied as feature extractors. The addition, maximum, minimum, and average merging were used in the pre-merging and post-merging steps. According to the experimental results, the MSE feature fusion with a global skip connection (using the average merging and AlexNet), the MSE feature fusion with local skip connections (using the minimum merging and AlexNet), and the original MSE feature fusion (using the average merging and AlexNet) achieve the highest accuracy scores of 99.24%, 98.59%, and 98.04% on the PAMAP2, DaLiAc, and DSAD datasets, respectively. For future work, in addition to the channel-attention mechanism, other attention techniques such as spatial attention, modal attention, convolutional block attention, and selective kernel convolution can be applied to feature fusion in order to combine feature maps from different branches effectively.

Funding

This research was funded by the SIIT Young Researcher Grant, under a contract No. SIIT 2019-YRG-SL04, the Sirindhorn International Institute of Technology, Thammasat University, Thailand.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used to support the findings of this work are available in [36,37,38].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Architectures of 1D LeNet5, 1D AlexNet, and 1D VGG16 Models

The architectures of 1D LeNet5, 1D AlexNet, and 1D VGG16 models used as feature extractors in Section 4.1 are presented in Table A1, Table A2 and Table A3. Note that, for the AlexNet and VGG16 models, we added a batch normalization layer between the 1D convolutional layer and its activation layer to normalize the output of the convolutional layer before passing it to the activation layer.
Table A1. Architecture of 1D LeNet5. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.
| Name | Layer | # Filters | K Size | Stride | Pad | Activ | PAMAP2 (W × C) | DaLiAc (W × C) | DSAD (W × C) |
|---|---|---|---|---|---|---|---|---|---|
| In | Input | - | - | - | - | - | 300 × 12 | 600 × 6 | 125 × 9 |
| C1 | Conv1D | 6 | 5 | 1 | same | tanh | 300 × 6 | 600 × 6 | 125 × 6 |
| S2 | AveragePooling1D | - | 2 | 2 | valid | - | 150 × 6 | 300 × 6 | 62 × 6 |
| C3 | Conv1D | 16 | 5 | 1 | valid | tanh | 146 × 16 | 296 × 16 | 58 × 16 |
| S4 | AveragePooling1D | - | 2 | 2 | valid | - | 73 × 16 | 148 × 16 | 29 × 16 |
| C5 | Conv1D | 120 | 5 | 1 | valid | tanh | 69 × 120 | 144 × 120 | 25 × 120 |
Table A2. Architecture of 1D AlexNet with batch normalization (BatchNorm) layers. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.
| Name | Layer | # Filters | K Size | Stride | Pad | Activ | PAMAP2 (W × C) | DaLiAc (W × C) | DSAD (W × C) |
|---|---|---|---|---|---|---|---|---|---|
| In | Input | - | - | - | - | - | 300 × 12 | 600 × 6 | 125 × 9 |
| C1 | Conv1D + BatchNorm | 96 | 11 | 4 | valid | ReLU | 73 × 96 | 148 × 96 | 29 × 96 |
| S2 | MaxPooling1D | - | 3 | 2 | valid | - | 36 × 96 | 73 × 96 | 14 × 96 |
| C3 | Conv1D + BatchNorm | 256 | 5 | 1 | same | ReLU | 36 × 256 | 73 × 256 | 14 × 256 |
| S4 | MaxPooling1D | - | 3 | 2 | valid | - | 17 × 256 | 36 × 256 | 6 × 256 |
| C5 | Conv1D + BatchNorm | 384 | 3 | 1 | same | ReLU | 17 × 384 | 36 × 384 | 6 × 384 |
| C6 | Conv1D + BatchNorm | 384 | 3 | 1 | same | ReLU | 17 × 384 | 36 × 384 | 6 × 384 |
| C7 | Conv1D + BatchNorm | 256 | 3 | 1 | same | ReLU | 17 × 256 | 36 × 256 | 6 × 256 |
| S8 | MaxPooling1D | - | 3 | 2 | valid | - | 8 × 256 | 17 × 256 | 2 × 256 |
Table A3. Architecture of 1D VGG16 with batch normalization (BatchNorm) layers. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.
| Name | Layer | # Filters | K Size | Stride | Pad | Activ | PAMAP2 (W × C) | DaLiAc (W × C) | DSAD (W × C) |
|---|---|---|---|---|---|---|---|---|---|
| In | Input | - | - | - | - | - | 300 × 12 | 600 × 6 | 125 × 9 |
| C1 | Conv1D + BatchNorm | 64 | 3 | 1 | same | ReLU | 300 × 64 | 600 × 64 | 125 × 64 |
| C2 | Conv1D + BatchNorm | 64 | 3 | 1 | same | ReLU | 300 × 64 | 600 × 64 | 125 × 64 |
| S3 | MaxPooling1D | - | 2 | 2 | valid | - | 150 × 64 | 300 × 64 | 62 × 64 |
| C4 | Conv1D + BatchNorm | 128 | 3 | 1 | same | ReLU | 150 × 128 | 300 × 128 | 62 × 128 |
| C5 | Conv1D + BatchNorm | 128 | 3 | 1 | same | ReLU | 150 × 128 | 300 × 128 | 62 × 128 |
| S6 | MaxPooling1D | - | 2 | 2 | valid | - | 75 × 128 | 150 × 128 | 31 × 128 |
| C7 | Conv1D + BatchNorm | 256 | 3 | 1 | same | ReLU | 75 × 256 | 150 × 256 | 31 × 256 |
| C8 | Conv1D + BatchNorm | 256 | 3 | 1 | same | ReLU | 75 × 256 | 150 × 256 | 31 × 256 |
| C9 | Conv1D + BatchNorm | 256 | 3 | 1 | same | ReLU | 75 × 256 | 150 × 256 | 31 × 256 |
| S10 | MaxPooling1D | - | 2 | 2 | valid | - | 37 × 256 | 75 × 256 | 15 × 256 |
| C11 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 37 × 512 | 75 × 512 | 15 × 512 |
| C12 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 37 × 512 | 75 × 512 | 15 × 512 |
| C13 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 37 × 512 | 75 × 512 | 15 × 512 |
| S14 | MaxPooling1D | - | 2 | 2 | valid | - | 18 × 512 | 37 × 512 | 7 × 512 |
| C15 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 18 × 512 | 37 × 512 | 7 × 512 |
| C16 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 18 × 512 | 37 × 512 | 7 × 512 |
| C17 | Conv1D + BatchNorm | 512 | 3 | 1 | same | ReLU | 18 × 512 | 37 × 512 | 7 × 512 |
| S18 | MaxPooling1D | - | 2 | 2 | valid | - | 9 × 512 | 18 × 512 | 3 × 512 |

References

  1. Yadav, S.K.; Tiwari, K.; Pandey, H.M.; Akbar, S.A. A Review of Multimodal Human Activity Recognition with Special Emphasis on Classification, Applications, Challenges and Future Directions. Knowl. Based Syst. 2021, 223, 106970. [Google Scholar] [CrossRef]
  2. Özyer, T.; Ak, D.S.; Alhajj, R. Human Action Recognition Approaches with Video Datasets—A Survey. Knowl. Based Syst. 2021, 222, 106995. [Google Scholar] [CrossRef]
  3. Bulling, A.; Blanke, U.; Schiele, B. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. ACM Comput. Surv. 2014, 46, 1–33. [Google Scholar] [CrossRef]
  4. Bouchabou, D.; Nguyen, S.M.; Lohr, C.; LeDuc, B.; Kanellos, I. A Survey of Human Activity Recognition in Smart Homes Based on IoT Sensors Algorithms: Taxonomies, Challenges, and Opportunities with Deep Learning. Sensors 2021, 21, 6037. [Google Scholar] [CrossRef] [PubMed]
  5. Chaurasia, S.K.; Reddy, S.R.N. State-of-the-art Survey on Activity Recognition and Classification Using Smartphones and Wearable Sensors. Multimed. Tools Appl. 2022, 81, 1077–1108. [Google Scholar] [CrossRef]
  6. Yang, Y.; Wang, H.; Jiang, R.; Guo, X.; Cheng, J.; Chen, Y. A Review of IoT-Enabled Mobile Healthcare: Technologies, Challenges, and Future Trends. IEEE Internet Things J. 2022, 9, 9478–9502. [Google Scholar] [CrossRef]
  7. Achirei, S.-D.; Heghea, M.-C.; Lupu, R.-G.; Manta, V.-I. Human Activity Recognition for Assisted Living Based on Scene Understanding. Appl. Sci. 2022, 12, 10743. [Google Scholar] [CrossRef]
  8. Sousa Lima, W.; Souto, E.; El-Khatib, K.; Jalali, R.; Gama, J. Human Activity Recognition Using Inertial Sensors in a Smartphone: An Overview. Sensors 2019, 19, 3213. [Google Scholar] [CrossRef]
  9. Ramanujam, E.; Perumal, T.; Padmavathi, S. Human Activity Recognition with Smartphone and Wearable Sensors Using Deep Learning Techniques: A Review. IEEE Sens. J. 2021, 21, 13029–13040. [Google Scholar] [CrossRef]
  10. Pannurat, N.; Thiemjarus, S.; Nantajeewarawat, E.; Anantavrasilp, I. Analysis of Optimal Sensor Positions for Activity Classification and Application on a Different Data Collection Scenario. Sensors 2017, 17, 774. [Google Scholar] [CrossRef]
  11. Ahmed, N.; Rafiq, J.I.; Islam, M.R. Enhanced Human Activity Recognition Based on Smartphone Sensor Data Using Hybrid Feature Selection Model. Sensors 2020, 20, 317. [Google Scholar] [CrossRef] [PubMed]
  12. Chen, L.; Fan, S.; Kumar, V.; Jia, Y. A Method of Human Activity Recognition in Transitional Period. Information 2020, 11, 416. [Google Scholar] [CrossRef]
  13. Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep Learning for Sensor-Based Human Activity Recognition: Overview, Challenges, and Opportunities. ACM Comput. Surv. 2021, 54, 77. [Google Scholar] [CrossRef]
  14. Gu, F.; Chung, M.-H.; Chignell, M.; Valaee, S.; Zhou, B.; Liu, X. A Survey on Deep Learning for Human Activity Recognition. ACM Comput. Surv. 2021, 54, 177. [Google Scholar] [CrossRef]
  15. Zhang, S.; Li, Y.; Zhang, S.; Shahabi, F.; Xia, S.; Deng, Y.; Alshurafa, N. Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances. Sensors 2022, 22, 1476. [Google Scholar] [CrossRef] [PubMed]
  16. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  17. Zhongkai, Z.; Kobayashi, S.; Kondo, K.; Hasegawa, T.; Koshino, M. A Comparative Study: Toward an Effective Convolutional Neural Network Architecture for Sensor-Based Human Activity Recognition. IEEE Access 2022, 10, 20547–20558. [Google Scholar] [CrossRef]
  18. Mekruksavanich, S.; Hnoohom, N.; Jitpattanakul, A. A Hybrid Deep Residual Network for Efficient Transitional Activity Recognition Based on Wearable Sensors. Appl. Sci. 2022, 12, 4988. [Google Scholar] [CrossRef]
  19. Moya Rueda, F.; Grzeszick, R.; Fink, G.A.; Feldhorst, S.; Ten Hompel, M. Convolutional Neural Networks for Human Activity Recognition Using Body-Worn Sensors. Informatics 2018, 5, 26. [Google Scholar] [CrossRef]
  20. Avilés-Cruz, C.; Ferreyra-Ramírez, A.; Zúñiga-López, A.; Villegas-Cortéz, J. Coarse-Fine Convolutional Deep-Learning Strategy for Human Activity Recognition. Sensors 2019, 19, 1556. [Google Scholar] [CrossRef]
  21. Khan, Z.N.; Ahmad, J. Attention Induced Multi-Head Convolutional Neural Network for Human Activity Recognition. Appl. Soft Comput. 2021, 110, 107671. [Google Scholar] [CrossRef]
22. Liu, S.; Yao, S.; Li, J.; Liu, D.; Wang, T.; Shao, H.; Abdelzaher, T. GlobalFusion: A Global Attentional Deep Learning Framework for Multisensor Information Fusion. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4, 19. [Google Scholar] [CrossRef]
  23. Al-qaness, M.A.A.; Dahou, A.; Elaziz, M.A.; Helmi, A.M. Multi-ResAtt: Multilevel Residual Network with Attention for Human Activity Recognition Using Wearable Sensors. IEEE Trans. Industr. Inform. 2023, 19, 144–152. [Google Scholar] [CrossRef]
  24. Zhang, H.; Xiao, Z.; Wang, J.; Li, F.; Szczerbicki, E. A Novel IoT-Perceptive Human Activity Recognition (HAR) Approach Using Multihead Convolutional Attention. IEEE Internet Things J. 2020, 7, 1072–1080. [Google Scholar] [CrossRef]
  25. Ihianle, I.K.; Nwajana, A.O.; Ebenuwa, S.H.; Otuka, R.I.; Owa, K.; Orisatoki, M.O. A Deep Learning Approach for Human Activities Recognition from Multimodal Sensing Devices. IEEE Access 2020, 8, 179028–179038. [Google Scholar] [CrossRef]
  26. Dua, N.; Singh, S.N.; Semwal, V.B. Multi-Input CNN-GRU Based Human Activity Recognition Using Wearable Sensors. Computing 2021, 103, 1461–1478. [Google Scholar] [CrossRef]
  27. Yen, C.-T.; Liao, J.-X.; Huang, Y.-K. Feature Fusion of a Deep-Learning Algorithm into Wearable Sensor Devices for Human Activity Recognition. Sensors 2021, 21, 8294. [Google Scholar] [CrossRef]
  28. Challa, S.K.; Kumar, A.; Semwal, V.B. A Multibranch CNN-BiLSTM Model for Human Activity Recognition Using Wearable Sensor Data. Vis. Comput. 2022, 38, 4095–4109. [Google Scholar] [CrossRef]
29. Li, Y.; Guo, L.; Liu, Y.; Liu, J.; Meng, F. A Temporal-Spectral-Based Squeeze-and-Excitation Feature Fusion Network for Motor Imagery EEG Decoding. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 1534–1545. [Google Scholar] [CrossRef]
  30. Sleeman, W.C.; Kapoor, R.; Ghosh, P. Multimodal Classification: Current Landscape, Taxonomy and Future Directions. ACM Comput. Surv. 2022, 55, 150. [Google Scholar] [CrossRef]
  31. Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. arXiv 2017, arXiv:1702.01992. [Google Scholar]
32. Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  33. Yuan, Z.; Zhang, W.; Tian, C.; Mao, Y.; Zhou, R.; Wang, H.; Fu, K.; Sun, X. MCRN: A Multi-Source Cross-Modal Retrieval Network for Remote Sensing. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103071. [Google Scholar] [CrossRef]
34. Jia, Z.; Cai, X.; Jiao, Z. Multi-Modal Physiological Signals Based Squeeze-and-Excitation Network With Domain Adversarial Learning for Sleep Staging. IEEE Sens. J. 2022, 22, 3464–3471. [Google Scholar] [CrossRef]
  35. Shu, X.; Yang, J.; Yan, R.; Song, Y. Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5281–5292. [Google Scholar] [CrossRef]
  36. Reiss, A.; Stricker, D. Introducing a New Benchmarked Dataset for Activity Monitoring. In Proceedings of the 6th International Symposium on Wearable Computers, Newcastle, UK, 18–22 June 2012; pp. 108–109. [Google Scholar]
  37. Leutheuser, H.; Schuldhaus, D.; Eskofier, B.M. Hierarchical, Multi-Sensor Based Classification of Daily Life Activities: Comparison with State-of-the-Art Algorithms Using a Benchmark Dataset. PLoS ONE 2013, 8, e75196. [Google Scholar] [CrossRef] [PubMed]
  38. Altun, K.; Barshan, B.; Tunçel, O. Comparative Study on Classifying Human Activities with Miniature Inertial and Magnetic Sensors. Pattern Recognit. 2010, 43, 3605–3620. [Google Scholar] [CrossRef]
39. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  40. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  41. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  42. Merging Layers. Available online: https://keras.io/api/layers/merging_layers (accessed on 22 January 2023).
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  44. Hur, T.; Bang, J.; Huynh-The, T.; Lee, J.; Kim, J.-I.; Lee, S. Iss2Image: A Novel Signal-Encoding Technique for CNN-Based Human Activity Recognition. Sensors 2018, 18, 3910. [Google Scholar] [CrossRef]
  45. Huynh-The, T.; Hua, C.-H.; Tu, N.A.; Kim, D.-S. Physical Activity Recognition With Statistical-Deep Fusion Model Using Multiple Sensory Data for Smart Health. IEEE Internet Things J. 2021, 8, 1533–1543. [Google Scholar] [CrossRef]
Figure 1. Proposed MSE feature fusion architecture. It consists of the following stages: data inputs, feature extraction, merging-squeeze-excitation feature fusion, and classification. The inputs are the sensor data X^(n) from N IMUs attached to several parts of the user body. Each branch extracts a set of feature maps A^(n) independently by using a 1D CNN model. In the merging-squeeze-excitation feature fusion stage, each set of feature maps is calibrated by using a set of channel weights. A merging method combines the sets of channel-weighted feature maps P^(n) and produces a new set of channel-weighted and merged feature maps Y_mse, which is used later in the classification process.
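To make the data flow of Figure 1 concrete, the following is a minimal sketch of the fusion head in Keras (the framework whose merging layers are cited in [42]). The SE bottleneck shown here (global average pooling followed by two dense layers with ReLU and sigmoid activations and a reduction ratio of 8) is the standard SE design and an assumption; the function names and hyper-parameters are illustrative rather than the exact configuration used in the paper.

```python
# Illustrative sketch only: MSE feature fusion for N branch feature maps
# A^(n) of shape (batch, L, C); symbols follow Table 1.
from tensorflow.keras import layers


def mse_recalibrate(branch_maps, merge_layer=layers.Average, ratio=8):
    """Return the channel-weighted maps P^(n) and the pre-merged maps B."""
    channels = branch_maps[0].shape[-1]
    b = merge_layer()(branch_maps)                    # pre-merging: B
    u = layers.GlobalAveragePooling1D()(b)            # channel statistics u of B
    weighted = []
    for a in branch_maps:                             # branch n = 1, ..., N
        g = layers.GlobalAveragePooling1D()(a)        # channel statistics g^(n)
        h = layers.Add()([g, u])                      # h^(n) = g^(n) + u
        s = layers.Dense(channels // ratio, activation="relu")(h)
        s = layers.Dense(channels, activation="sigmoid")(s)   # channel weights s^(n)
        s = layers.Reshape((1, channels))(s)
        weighted.append(layers.Multiply()([a, s]))    # P^(n) = s^(n) * A^(n)
    return weighted, b


def mse_feature_fusion(branch_maps, merge_layer=layers.Average, ratio=8):
    weighted, _ = mse_recalibrate(branch_maps, merge_layer, ratio)
    return merge_layer()(weighted)                    # post-merging: Y_mse
```

Swapping merge_layer for layers.Add, layers.Maximum, or layers.Minimum yields the addition, maximum, and minimum merging variants compared in the result tables.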
Figure 2. MSE feature fusion architectures with skip connections. There are two types: (a) MSE feature fusion with local skip connections, where a skip connection is added to each branch; as a result, we combine the feature maps Q^(n) = P^(n) + A^(n) instead of P^(n), so the output Y_lsc still contains the feature maps obtained directly from the 1D CNN models; and (b) MSE feature fusion with a global skip connection, where the pre-merged feature maps B are added to the feature maps Y_mse; as a result, the output feature maps Y_gsc contain both pre-merged feature maps (with no channel weighting) and channel-weighted feature maps.
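The two variants in Figure 2 change only where unweighted feature maps are added back. A hedged sketch, reusing mse_recalibrate from the previous sketch and following the wiring stated in the caption (Q^(n) = P^(n) + A^(n) for the local variant, Y_gsc = Y_mse + B for the global one):

```python
from tensorflow.keras import layers


def mse_fusion_local_skip(branch_maps, merge_layer=layers.Average, ratio=8):
    weighted, _ = mse_recalibrate(branch_maps, merge_layer, ratio)
    # Local skip connections: Q^(n) = P^(n) + A^(n), then merge.
    q = [layers.Add()([p, a]) for p, a in zip(weighted, branch_maps)]
    return merge_layer()(q)                           # Y_lsc


def mse_fusion_global_skip(branch_maps, merge_layer=layers.Average, ratio=8):
    weighted, b = mse_recalibrate(branch_maps, merge_layer, ratio)
    y_mse = merge_layer()(weighted)
    # Global skip connection: add the pre-merged maps B to Y_mse.
    return layers.Add()([y_mse, b])                   # Y_gsc
```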
Figure 3. MSE feature fusion architecture with global channel attention. We compute an additional set of channel-weighted feature maps R, which is obtained from the pre-merged feature maps B. The output feature maps Y_gca contain both Y_mse (where the feature maps are calibrated and then merged) and R (where the feature maps are merged and then calibrated).
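For the global-channel-attention variant in Figure 3, a second SE path recalibrates the pre-merged maps B themselves to produce R. The caption states that the output contains both Y_mse and R; since Y_gca keeps the same 1 × L × C shape (Table 1), the sketch below combines them by element-wise addition, which is an assumption.

```python
from tensorflow.keras import layers


def mse_fusion_global_channel_attention(branch_maps, merge_layer=layers.Average, ratio=8):
    channels = branch_maps[0].shape[-1]
    weighted, b = mse_recalibrate(branch_maps, merge_layer, ratio)
    y_mse = merge_layer()(weighted)
    # Channel weights v for the pre-merged maps B (a standard SE block on B).
    u = layers.GlobalAveragePooling1D()(b)
    v = layers.Dense(channels // ratio, activation="relu")(u)
    v = layers.Dense(channels, activation="sigmoid")(v)
    v = layers.Reshape((1, channels))(v)
    r = layers.Multiply()([b, v])                     # R = v * B
    return layers.Add()([y_mse, r])                   # Y_gca
```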
Figure 4. Deep MSE feature fusion. We implement a series of D MSE feature fusion blocks such that the feature maps A^(1), A^(2), …, A^(N) are calibrated and merged several times, as shown in (a) the deep MSE feature fusion architecture; the structure of the d-th MSE feature fusion block is shown in (b). The output feature maps Ỹ_deep,(D) are used in the classification process.
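A rough sketch of stacking D blocks as in Figure 4, again reusing mse_recalibrate. The exact per-block routing of the channel statistics ũ^(d) (Table 2) is richer than this simple loop, so the code is only an approximation of the idea that the branch feature maps are recalibrated and merged several times.

```python
from tensorflow.keras import layers


def deep_mse_feature_fusion(branch_maps, depth=3, merge_layer=layers.Average, ratio=8):
    p = list(branch_maps)                              # P~(0,n) = A^(n)
    y = None
    for _ in range(depth):                             # block d = 1, ..., D
        p, _ = mse_recalibrate(p, merge_layer, ratio)  # P~(d,n)
        y = merge_layer()(p)                           # merged output of block d
    return y                                           # Y~deep,(D) feeds the classifier
```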
Figure 5. Baseline architectures. (a) Single-branch architecture. (b) Multi-branch architecture.
Table 1. List of main symbols used in Section 4, Section 5.1 and Section 5.2.

Symbol | Definition
g^(n) ∈ ℝ^C | A vector of channel-wise statistics according to the local feature maps A^(n) at the n-th branch.
h^(n) ∈ ℝ^C | A vector of channel-wise statistics according to the addition of g^(n) and u at the n-th branch.
s^(n) ∈ ℝ^C | A vector of channel weights for the local feature maps A^(n) at the n-th branch.
u ∈ ℝ^C | A vector of channel-wise statistics according to the pre-merged feature maps B.
v ∈ ℝ^C | A vector of channel weights for the pre-merged feature maps B.
A^(n) ∈ ℝ^(1×L×C) | A 3D array of local feature maps at the n-th branch.
B ∈ ℝ^(1×L×C) | A 3D array of pre-merged feature maps.
P^(n) ∈ ℝ^(1×L×C) | A 3D array of channel-weighted feature maps according to the local feature maps A^(n) at the n-th branch.
Q^(n) ∈ ℝ^(1×L×C) | A 3D array of channel-weighted feature maps according to the addition of P^(n) and B at the n-th branch.
R ∈ ℝ^(1×L×C) | A 3D array of channel-weighted feature maps according to the pre-merged feature maps B.
X^(n) ∈ ℝ^(1×W×C) | A 3D array of sensor data at the n-th branch (obtained from the n-th IMU).
Y_gca ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with global channel attention.
Y_gsc ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with a global skip connection.
Y_lsc ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with local skip connections.
Y_mse ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion.
Table 2. List of main symbols used in Section 5.3.

Symbol | Definition
g̃^(d,n) ∈ ℝ^C | A vector of channel-wise statistics according to the feature maps P̃^(d−1,n) at the n-th branch in the d-th MSE feature fusion block.
h̃^(d,n) ∈ ℝ^C | A vector of channel-wise statistics according to the addition of g̃^(d,n) and ũ^(d) at the n-th branch in the d-th MSE feature fusion block.
s̃^(d,n) ∈ ℝ^C | A vector of channel weights for the feature maps P̃^(d,n) at the n-th branch in the d-th MSE feature fusion block.
ũ^(d) ∈ ℝ^C | A vector of channel-wise statistics according to the merged and channel-weighted feature maps Ỹ_deep,(d) in the d-th MSE feature fusion block.
A^(n) ∈ ℝ^(1×L×C) | A 3D array of local feature maps at the n-th branch.
B ∈ ℝ^(1×L×C) | A 3D array of pre-merged feature maps.
P̃^(d,n) ∈ ℝ^(1×L×C) | A 3D array of channel-weighted feature maps according to the feature maps P̃^(d−1,n) at the n-th branch in the d-th MSE feature fusion block.
X^(n) ∈ ℝ^(1×W×C) | A 3D array of sensor data at the n-th branch (obtained from the n-th IMU).
Ỹ_mse,(d) ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the weighted feature merging in the d-th MSE feature fusion block.
Ỹ_deep,(d) ∈ ℝ^(1×L×C) | A 3D array of channel-weighted and merged feature maps, which is the output of the d-th MSE feature fusion block.
Table 3. A summary of DL models with SE blocks for HAR using sensor data.

Year | Ref. | Dataset | Device | DL Model
2021 | [21] | UCI HAR, WISDM | Smartphone | CNN with an SE block
2022 | [17] | HASC, UCI HAR, WISDM | Smartphone | State-of-the-art CNNs with SE blocks
2022 | [18] | HAPT, MobiAct v2.0 | Smartphone | CNN with a residual block, an SE block, and BiGRU
Table 4. A summary of multi-branch DL architectures for HAR using sensor data.

Year | Ref. | Dataset | Device | Category | DL Model | Feature Fusion
2018 | [19] | Opportunity, Order Picking, PAMAP2 | IMUs | Multiple Inputs | CNN | Concatenation
2019 | [20] | UCI HAR, WISDM | Smartphone | Multiple DL Models | CNN | Concatenation
2020 | [22] | DG, DSAD, PAMAP2, RealWorld-HAR | IMUs | Multiple Inputs | CNN | Concatenation
2020 | [24] | WISDM | Smartphone | Multiple DL Models | CNN | Concatenation
2020 | [25] | MHEALTH, WISDM | IMUs, Smartphone | Multiple DL Models | CNN and LSTM | Concatenation
2021 | [21] | UCI HAR, WISDM | Smartphone | Multiple DL Models | CNN with an SE block | Concatenation
2021 | [26] | PAMAP2, UCI HAR, WISDM | IMUs, Smartphone | Multiple DL Models | CNN and GRU | Concatenation
2021 | [27] | Self-Recorded Data, UCI HAR | IMUs, Smartphone | Multiple DL Models | CNN | Concatenation
2022 | [28] | PAMAP2, UCI HAR, WISDM | IMUs, Smartphone | Multiple DL Models | CNN | Concatenation
2023 | [23] | Opportunity, PAMAP2, UniMiB-SHAR | IMUs, Smartphone | Multiple Inputs | CNN and Residual Blocks | Concatenation
Table 5. A summary of related SE fusion.

Year | Ref. | Modality | Dataset | Classification | Fusion Mechanism
2021 | [29] | EEG | BCI IV 2a, HGD | Motor imagery tasks | SE mechanism
2022 | [34] | EEG, EOG | MASS-SS3 | Sleep staging | Multimodal SE mechanism
2022 | [35] | RGB videos, skeleton sequences | ETRI-Activity3D | Elderly activities | Expansion SE mechanism
Table 6. A summary of sensor data from three datasets.

Property | PAMAP2 | DaLiAc | DSAD
Sensor | 2 accelerometers, 1 gyroscope, 1 magnetometer | 1 accelerometer, 1 gyroscope | 1 accelerometer, 1 gyroscope, 1 magnetometer
Sampling Rate | 100 Hz | 200 Hz | 25 Hz
No. IMUs (N) | 3 | 4 | 5
Positions | wrist, chest, ankle | right wrist, chest, right hip, left ankle | torso, right arm, left arm, right leg, left leg
No. Sensor Data Types per IMU (M) | 12 | 6 | 9
No. Subjects | 9 | 19 | 8
No. Activities | 12 | 13 | 19
Window Size | 3 s | 3 s | 5 s
Segment Length (L) | 300 data points | 600 data points | 125 data points
No. Segments | 5764 | 7802 | 9120
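The segment lengths in Table 6 follow directly from the window size and sampling rate (for example, 3 s × 100 Hz = 300 data points for PAMAP2). The snippet below sketches such fixed-window segmentation; the 50% overlap is a common choice and purely an assumption, since the stride is not listed in the table.

```python
import numpy as np


def segment(stream, window_s, rate_hz, overlap=0.5):
    """stream: (T, M) array of M sensor channels from one IMU."""
    length = int(window_s * rate_hz)                  # segment length L
    stride = max(1, int(length * (1.0 - overlap)))
    starts = range(0, stream.shape[0] - length + 1, stride)
    return np.stack([stream[i:i + length] for i in starts])  # (num_segments, L, M)


# Example: a 60 s PAMAP2-like stream (100 Hz, 12 channels) -> 300-sample windows.
dummy = np.random.randn(60 * 100, 12)
print(segment(dummy, window_s=3, rate_hz=100).shape)  # (39, 300, 12)
```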
Table 7. The width W and the number of channels C of the feature maps A^(n) obtained from LeNet5, AlexNet, and VGG16 by using the PAMAP2, DaLiAc, and DSAD datasets.

CNN Model | PAMAP2 (W, C) | DaLiAc (W, C) | DSAD (W, C)
LeNet5 | 69, 120 | 144, 120 | 25, 120
AlexNet | 8, 256 | 17, 256 | 2, 256
VGG16 | 9, 512 | 18, 512 | 3, 512
Table 8. Accuracy scores (%) of the baseline architectures on classifying the PAMAP2, DaLiAc, and DSAD datasets where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

Model | PAMAP2 (LeNet5 / AlexNet / VGG16) | DaLiAc (LeNet5 / AlexNet / VGG16) | DSAD (LeNet5 / AlexNet / VGG16)
Single-Branch Model | 96.79 / 98.77 * / 98.75 | 95.37 / 97.60 * / 95.53 | 91.39 / 97.18 * / 96.04
Multi-branch Model | 98.73 / 98.51 / 97.85 | 96.09 / 97.10 / 94.49 | 94.54 / 96.59 / 93.31
Table 9. Accuracy scores (%) of the MSE feature fusion on classifying the PAMAP2, DaLiAc, and DSAD datasets where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

Merging | PAMAP2 (LeNet5 / AlexNet / VGG16) | DaLiAc (LeNet5 / AlexNet / VGG16) | DSAD (LeNet5 / AlexNet / VGG16)
Addition | 98.59 / 98.91 / 97.99 | 96.82 / 97.63 / 96.22 | 95.53 / 97.45 / 94.64
Maximum | 98.92 / 99.06 / 98.82 | 98.03 / 98.04 / 96.92 | 97.42 / 97.68 / 96.00
Minimum | 98.84 / 99.17 * / 98.72 | 97.59 / 98.18 / 97.72 | 97.34 / 97.75 / 97.50
Average | 98.91 / 99.06 / 98.79 | 97.86 / 98.32 * / 97.67 | 97.00 / 98.04 * / 97.92
Table 10. PAMAP2 dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the PAMAP2 dataset where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each type of skip connections.

Merging | Local Skip Connections (LeNet5 / AlexNet / VGG16) | Global Skip Connection (LeNet5 / AlexNet / VGG16)
Addition | 98.68 / 98.75 / 98.16 | 98.49 / 98.77 / 98.23
Maximum | 98.99 / 99.20 / 98.66 | 98.89 / 98.99 / 98.85
Minimum | 99.05 / 99.18 * / 98.70 | 98.87 / 99.13 / 98.77
Average | 98.91 / 99.15 / 98.70 | 98.72 / 99.24 * / 98.94
Table 11. DaLiAc dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the DaLiAc dataset where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each type of skip connections.

Merging | Local Skip Connections (LeNet5 / AlexNet / VGG16) | Global Skip Connection (LeNet5 / AlexNet / VGG16)
Addition | 96.67 / 97.71 / 93.30 | 96.08 / 97.12 / 94.59
Maximum | 98.30 / 97.78 / 97.13 | 98.19 / 97.97 / 97.00
Minimum | 98.13 / 98.59 * / 97.55 | 98.12 / 98.42 * / 97.60
Average | 98.12 / 97.92 / 97.68 | 98.05 / 98.06 / 98.00
Table 12. DSAD dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the DSAD dataset where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each type of skip connections.

Merging | Local Skip Connections (LeNet5 / AlexNet / VGG16) | Global Skip Connection (LeNet5 / AlexNet / VGG16)
Addition | 94.98 / 96.61 / 95.96 | 95.16 / 97.16 / 94.57
Maximum | 97.65 / 97.40 / 97.00 | 97.57 / 97.89 / 97.27
Minimum | 97.42 / 97.72 / 97.60 | 97.42 / 97.27 / 97.48
Average | 97.18 / 98.02 * / 97.83 | 97.12 / 97.97 / 97.97
Table 13. Accuracy scores (%) of the MSE feature fusion with global channel attention on classifying the PAMAP2, DaLiAc, and DSAD datasets where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

Merging | PAMAP2 (LeNet5 / AlexNet / VGG16) | DaLiAc (LeNet5 / AlexNet / VGG16) | DSAD (LeNet5 / AlexNet / VGG16)
Addition | 98.33 / 98.32 / 96.88 | 96.65 / 97.10 / 95.95 | 95.09 / 97.38 / 94.65
Maximum | 98.94 / 99.06 / 98.04 | 97.73 / 97.55 / 96.85 | 97.54 / 97.81 / 97.12
Minimum | 99.01 / 99.17 * / 98.73 | 97.47 / 98.08 * / 97.69 | 97.24 / 97.27 / 97.55
Average | 98.82 / 98.99 / 98.73 | 97.60 / 97.78 / 97.28 | 97.08 / 98.03 * / 97.97
Table 14. Accuracy scores (%) of the deep MSE feature fusion on classifying the PAMAP2 dataset where AlexNet is applied as the feature extractor.

Merging | D = 1 | D = 2 | D = 3 | D = 4 | D = 5
Addition | 98.91 | 98.54 | 98.84 | 98.77 | 98.79
Maximum | 99.06 | 98.73 | 98.94 | 99.08 | 99.06
Minimum | 99.17 * | 99.03 | 99.03 | 99.03 | 98.99
Average | 99.06 | 98.79 | 98.72 | 99.03 | 99.10
Table 15. Accuracy scores (%) of the deep MSE feature fusion on classifying the DaLiAc dataset where AlexNet is applied as the feature extractor.

Merging | D = 1 | D = 2 | D = 3 | D = 4 | D = 5
Addition | 97.63 | 97.69 | 97.32 | 96.90 | 98.03
Maximum | 98.04 | 97.90 | 97.83 | 98.21 | 98.17
Minimum | 98.18 | 98.04 | 98.03 | 97.74 | 98.06
Average | 98.32 * | 98.06 | 97.49 | 97.41 | 98.08
Table 16. Accuracy scores (%) of the deep MSE feature fusion on classifying the DSAD dataset where AlexNet is applied as the feature extractor.

Merging | D = 1 | D = 2 | D = 3 | D = 4 | D = 5
Addition | 97.45 | 97.53 | 97.05 | 97.52 | 97.28
Maximum | 97.68 | 97.19 | 97.60 | 97.32 | 97.52
Minimum | 97.75 | 97.31 | 97.18 | 97.28 | 97.45
Average | 98.04 * | 97.61 | 97.50 | 97.58 | 97.49
Table 17. PAMAP2 dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the PAMAP2 dataset.

Architecture | LeNet5 | AlexNet | VGG16
Baselines: Single-Branch Model | 147,506 | 1,472,684 | 5,460,108
Baselines: Multi-branch Model | 413,710 | 4,315,372 | 16,339,852
Proposed MSE Feature Fusion | 179,155 | 3,841,100 | 15,489,612
Extensions: Local Skip Connections | 179,155 | 3,841,100 | 15,489,612
Extensions: Global Skip Connection | 179,155 | 3,841,100 | 15,489,612
Extensions: Global Channel Attention | 182,890 | 3,857,772 | 15,555,724
Extensions: Deep MSE (D = 2) | - | 3,891,116 | -
Extensions: Deep MSE (D = 3) | - | 3,941,132 | -
Extensions: Deep MSE (D = 4) | - | 3,991,148 | -
Extensions: Deep MSE (D = 5) | - | 4,041,164 | -
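Counts such as those in Table 17 can be read off from any concrete Keras implementation via model.count_params(). The toy model below (three IMU branches, 300 × 12 inputs, a small Conv1D stack standing in for the LeNet5/AlexNet/VGG16 extractors, and mse_feature_fusion from the earlier sketch) only illustrates the procedure and will not reproduce the exact figures in the table.

```python
from tensorflow.keras import layers, models


def small_branch(x):
    # A small 1D CNN feature extractor used purely for illustration.
    x = layers.Conv1D(64, 5, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(120, 5, activation="relu")(x)
    return layers.MaxPooling1D(2)(x)


inputs = [layers.Input(shape=(300, 12)) for _ in range(3)]
fused = mse_feature_fusion([small_branch(i) for i in inputs])
outputs = layers.Dense(12, activation="softmax")(layers.Flatten()(fused))
model = models.Model(inputs, outputs)
print(model.count_params())   # total parameter count (all trainable in this toy model)
```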
Table 18. DaLiAc dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the DaLiAc dataset.

Architecture | LeNet5 | AlexNet | VGG16
Baseline: Single-Branch Model | 148,171 | 1,461,037 | 5,458,829
Baseline: Multi-branch Model | 547,477 | 5,725,069 | 21,778,445
Proposed MSE Feature Fusion | 193,777 | 5,005,325 | 20,470,029
Extensions: Local Skip Connections | 193,777 | 5,005,325 | 20,470,029
Extensions: Global Skip Connection | 193,777 | 5,005,325 | 20,470,029
Extensions: Global Channel Attention | 197,512 | 5,021,997 | 20,536,141
Extensions: Deep MSE (D = 2) | - | 5,072,013 | -
Extensions: Deep MSE (D = 3) | - | 5,138,701 | -
Extensions: Deep MSE (D = 4) | - | 5,205,389 | -
Extensions: Deep MSE (D = 5) | - | 5,272,077 | -
Table 19. DSAD dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the DSAD dataset.

Architecture | LeNet5 | AlexNet | VGG16
Baseline: Single-Branch Model | 154,951 | 1,489,363 | 5,469,011
Baseline: Multi-branch Model | 687,359 | 7,174,739 | 27,228,499
Proposed MSE Feature Fusion | 214,514 | 6,209,523 | 25,461,907
Extensions: Local Skip Connections | 214,514 | 6,209,523 | 25,461,907
Extensions: Global Skip Connection | 214,514 | 6,209,523 | 25,461,907
Extensions: Global Channel Attention | 218,249 | 6,226,195 | 25,528,019
Extensions: Deep MSE (D = 2) | - | 6,292,883 | -
Extensions: Deep MSE (D = 3) | - | 6,376,243 | -
Extensions: Deep MSE (D = 4) | - | 6,459,603 | -
Extensions: Deep MSE (D = 5) | - | 6,542,963 | -
Table 20. Accuracy scores (%) of related works evaluated on classifying the PAMAP2, DaLiAc, and DSAD datasets.

Dataset | Year | Reference and Model Name | Accuracy
PAMAP2 | 2018 | [19] CNN-IMU | 93.13
PAMAP2 | 2020 | [22] GlobalFusion | 90.86
PAMAP2 | 2021 | [26] Multi-Input CNN-GRU | 95.27
PAMAP2 | 2022 | [28] Multibranch CNN-BiLSTM | 94.29
PAMAP2 | 2023 | [23] Multi-ResAtt | 93.19
PAMAP2 | 2023 | Proposed MSE Feature Fusion | 99.17
DaLiAc | 2018 | [44] Iss2Image | 96.40
DaLiAc | 2021 | [45] DeepFusionHAR | 97.20
DaLiAc | 2023 | Proposed MSE Feature Fusion | 98.32
DSAD | 2020 | [22] GlobalFusion | 94.28
DSAD | 2021 | [45] DeepFusionHAR | 96.10
DSAD | 2023 | Proposed MSE Feature Fusion | 98.04
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
