Merging-Squeeze-Excitation Feature Fusion for Human Activity Recognition Using Wearable Sensors

Abstract: Human activity recognition (HAR) has been applied to several advanced applications, especially when individuals need to be monitored closely. This work focuses on HAR using wearable sensors attached to various locations of the user body. The data from each sensor may provide unequally discriminative information, so an effective fusion method is needed. In order to address this issue, inspired by the squeeze-and-excitation (SE) mechanism, we propose the merging-squeeze-excitation (MSE) feature fusion, which emphasizes informative feature maps and suppresses ambiguous feature maps during fusion. The MSE feature fusion consists of three steps: pre-merging, squeeze-and-excitation, and post-merging. Unlike the SE mechanism, the set of feature maps from each branch is recalibrated by using channel weights that are also computed from the pre-merged feature maps. The calibrated feature maps from all branches are merged to obtain a set of channel-weighted and merged feature maps, which is used in the classification process. Additionally, a set of MSE feature fusion extensions is presented. In these proposed methods, three deep-learning models (LeNet5, AlexNet, and VGG16) are used as feature extractors and four merging methods (addition, maximum, minimum, and average) are applied as merging operations. The performances of the proposed methods are evaluated by classifying popular public datasets.


Introduction
Human activity recognition (HAR) is an active and challenging research field [1] that aims to specify human activities (e.g., sitting, walking, running) based on the data collected from devices such as cameras [2] and wearable sensors [3-5]. It has been essential in many applications, especially healthcare [6]. In addition, HAR helps an information-technology system to automatically monitor and record the activities of users such that we can analyze them and alert related persons (e.g., users, relatives, doctors) when an abnormal activity or an accident happens [7]. Due to the limitations of using cameras in HAR, such as user privacy, using wearable devices (e.g., smart watches and smartphones) in HAR is receiving significant attention. These wearable devices commonly use sensors such as accelerometers, gyroscopes, and magnetometers to monitor the activities of the users [5,8,9]. In addition, many studies have focused on using several inertial measurement units (IMUs) attached to different parts of the user body such that we can have data from different locations and utilize them together to obtain better recognition accuracy [10].
HAR using wearable devices receives sensor data from accelerometers, gyroscopes, and/or magnetometers and uses them to classify the activities. Of the classification models/algorithms, two types are popularly applied to HAR: traditional machine-learning (ML) algorithms and deep-learning (DL) models. By using a traditional ML algorithm (e.g., support vector machine, random forest), we manually extract a set of useful features from the sensor data and pass them to the ML algorithm to specify the corresponding activities [11,12]. On the other hand, a DL model automatically extracts a set of features from the sensor data and uses them in the classification process. As a result, DL models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been extensively studied in HAR [9,13-15].
A CNN model essentially consists of a series of convolutional layers and pooling layers to extract a set of features called feature maps, which are used later in the classification. However, only some feature maps may be very useful for classifying the activities of interest. Therefore, informative feature maps should be emphasized while ambiguous feature maps should be suppressed. A channel-attention mechanism called the squeeze-and-excitation (SE) block [16] was proposed to solve this issue. The SE block recalibrates each feature map by a weight value which is proportional to the importance of this feature map in the classification. The SE block has recently been applied to CNN and/or RNN models to improve HAR performances [17,18].
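As a concrete illustration of this mechanism, the following NumPy sketch implements the squeeze and excitation operations for a single set of feature maps; the array shapes and the reduction ratio (r = 2) are illustrative assumptions rather than settings from this paper:

```python
import numpy as np

def se_block(A, W1, W2):
    """Squeeze-and-excitation over feature maps A of shape (W, C).

    Squeeze: global average pooling per channel -> statistics g in R^C.
    Excitation: FC -> ReLU -> FC -> Sigmoid -> channel weights s in (0, 1).
    Each feature map (channel) of A is then scaled by its weight.
    """
    g = A.mean(axis=0)                                          # squeeze
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ g, 0.0))))   # excitation
    return A * s                                                # recalibration

# Toy example: W = 8 data points, C = 4 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
W1 = rng.standard_normal((2, 4))   # first FC layer:  C -> C/r
W2 = rng.standard_normal((4, 2))   # second FC layer: C/r -> C
P = se_block(A, W1, W2)
assert P.shape == A.shape
```

Because the Sigmoid keeps every weight strictly between 0 and 1, informative channels are passed through nearly unchanged while ambiguous channels are attenuated.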
The performances of wearable-sensor HAR can be improved by implementing multi-branch DL architectures [19,20]. A multi-branch DL architecture consists of several parallel branches using DL models to extract different sets of feature maps independently. Specifically, each branch provides a set of feature maps denoting local information. Thereafter, these sets of feature maps are fused by using a feature fusion method, such as concatenation, to generate a set of fused feature maps (denoting global information), which is used later in the classification process. A traditional feature fusion method combines the local feature maps equally without being aware that, in each set, some feature maps may be informative while others are ambiguous. Motivated by this issue, we need a feature fusion method which is able to emphasize informative feature maps and suppress ambiguous feature maps in each branch during fusion such that we can combine the local feature maps efficiently and obtain useful discriminative fused feature maps.
Inspired by the squeeze-and-excitation (SE) mechanism [16], we propose a feature fusion method called the merging-squeeze-excitation (MSE) feature fusion. In each branch, the MSE feature fusion recalibrates the local feature maps by using a set of channel weights. Since the fused feature maps are the ones that enter the classification process, the channel weights should be computed such that the fused feature maps provide very discriminative information. Therefore, unlike the SE mechanism, we design the MSE feature fusion such that, at each branch, it computes the channel weights based on both the local feature maps and the fused feature maps. As a result, when we consider a set of C local feature maps, the c-th local feature map will be emphasized if either it is important to the classification or the corresponding c-th fused feature map is useful to the classification.
The MSE feature fusion consists of three steps: the pre-merging step, the squeeze-and-excitation step, and the post-merging step. In the pre-merging step, the feature maps from all branches are merged together to obtain a set of pre-merged feature maps. Thereafter, during the squeeze-and-excitation step, the feature maps from each branch are recalibrated according to their importance, measured from both the channel-wise statistics obtained from themselves and the channel-wise statistics obtained from the pre-merged feature maps. Finally, in the post-merging step, the MSE feature fusion applies the same merging operation used in the first step to combine the output feature maps from all branches and obtain a set of channel-weighted and merged feature maps, which will be used to classify the activities of interest. In this work, we have applied three DL models (i.e., LeNet5, AlexNet, and VGG16) as feature extractors and four merging methods (addition, maximum, minimum, and average) as merging operations in the pre-merging and post-merging steps. Furthermore, we also modify the proposed MSE feature fusion by adding local skip connections, adding a global skip connection, using global channel attention, and stacking a series of MSE feature fusions to create deep MSE feature fusions. Their performances are evaluated on three public HAR datasets: PAMAP2, DaLiAc, and DSAD.
The main contributions of this work are summarized as follows:
1. We propose five MSE feature fusion architectures for wearable-sensor HAR using multi-branch architectures such that the feature maps are recalibrated according to their importance during the fusion. Three DL models and four merging methods used in the MSE feature fusion are studied and investigated.
2. Extensive experiments are conducted to evaluate and compare the performances of the proposed methods and baseline architectures by using the PAMAP2, DaLiAc, and DSAD datasets. The results show the following findings:
• The MSE feature fusion with a global skip connection, when using the average merging and AlexNet, achieves the highest accuracy score of 99.24% on classifying the PAMAP2 dataset.
• The MSE feature fusion with local skip connections, when using the minimum merging and AlexNet, achieves the highest accuracy score of 98.59% on classifying the DaLiAc dataset.
• The original MSE feature fusion, using the average merging and AlexNet, achieves the highest accuracy score of 98.04% on classifying the DSAD dataset.
• Among the merging methods studied in the proposed methods, the addition merging offers the worst accuracy scores. The maximum, minimum, and average merging have similar performances.
• All of the highest accuracy scores are obtained using AlexNet as the feature extractor.
The rest of this paper is organized as follows. Section 2 reviews the previous work focusing on using the SE block in HAR, proposing multi-branch DL architectures for HAR, and presenting SE-based feature fusion methods. The data collection and preparation are described in Section 3. Sections 4 and 5 present the proposed MSE feature fusion and its extensions, respectively. Their performances are evaluated and compared in Section 6. Finally, conclusions and future work are provided in Section 7.
The main symbols used in this paper are summarized as follows. Lower-case and upper-case bold letters represent vectors and three-dimensional (3D) arrays, respectively. The symbols R^{1×C} and R^C denote the space of C-real-number row vectors and the space of C-real-number column vectors, respectively. The symbol R^{H×W×C} denotes the space of 3D arrays (of real numbers) whose height, width, and number of channels are equal to H, W, and C, respectively. Tables 1 and 2 summarize the main symbols used in this paper.

Table 1. List of main symbols used in Section 4 and Sections 5.1 and 5.2.

Symbol | Definition
g^(n) ∈ R^C | A vector of channel-wise statistics according to the local feature maps A^(n) at the n-th branch.
h^(n) ∈ R^C | A vector of channel-wise statistics according to the addition of g^(n) and u at the n-th branch.
s^(n) ∈ R^C | A vector of channel weights for the local feature maps A^(n) at the n-th branch.
u ∈ R^C | A vector of channel-wise statistics according to the pre-merged feature maps B.
v ∈ R^C | A vector of channel weights for the pre-merged feature maps B.
A^(n) ∈ R^{1×W×C} | A 3D array of local feature maps at the n-th branch.
B ∈ R^{1×W×C} | A 3D array of pre-merged feature maps.
P^(n) ∈ R^{1×W×C} | A 3D array of channel-weighted feature maps according to the local feature maps A^(n) at the n-th branch.
Q^(n) ∈ R^{1×W×C} | A 3D array of channel-weighted feature maps according to the addition of P^(n) and B at the n-th branch.
R ∈ R^{1×W×C} | A 3D array of channel-weighted feature maps according to the pre-merged feature maps B.
X^(n) ∈ R^{1×L×M} | A 3D array of sensor data at the n-th branch (obtained from the n-th IMU).
Y_gca ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with global channel attention.
Y_gsc ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with a global skip connection.
Y_lsc ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion with local skip connections.
Y_mse ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the MSE feature fusion.
Table 2. List of main symbols used in Section 5.3.

Symbol | Definition
g^(d,n) ∈ R^C | A vector of channel-wise statistics according to the feature maps P^(d−1,n) at the n-th branch in the d-th MSE feature fusion block.
h^(d,n) ∈ R^C | A vector of channel-wise statistics according to the addition of g^(d,n) and u^(d) at the n-th branch in the d-th MSE feature fusion block.
s^(d,n) ∈ R^C | A vector of channel weights for the feature maps P^(d,n) at the n-th branch in the d-th MSE feature fusion block.
u^(d) ∈ R^C | A vector of channel-wise statistics according to the channel-weighted and merged feature maps Y_deep,(d−1) entering the d-th MSE feature fusion block.
B ∈ R^{1×W×C} | A 3D array of pre-merged feature maps.
P^(d,n) ∈ R^{1×W×C} | A 3D array of channel-weighted feature maps according to the feature maps P^(d−1,n) at the n-th branch in the d-th MSE feature fusion block.
X^(n) ∈ R^{1×L×M} | A 3D array of sensor data at the n-th branch (obtained from the n-th IMU).
Y_mse,(d) ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the weighted feature merging in the d-th MSE feature fusion block.
Y_deep,(d) ∈ R^{1×W×C} | A 3D array of channel-weighted and merged feature maps, which is the output of the d-th MSE feature fusion block.

Related Work
The SE block was proposed to improve the performances of CNNs and demonstrated its potential in image classification [16]. It consists of two successive operations: squeeze and excitation. In the squeeze operation, the inputted feature maps are passed to a global-average-pooling (GAP) layer to generate channel-wise statistics, where each value is the average of the corresponding feature map. Thereafter, in the excitation operation, the SE block computes an appropriate weight for each feature map by using the channel-wise statistics and fully connected (FC) layers. The SE block multiplies the inputted feature maps by their weights and obtains the channel-weighted feature maps. Due to its success in image classification, the SE block has been adopted in many applications, including HAR. Zhongkai et al. [17] investigated the potential of SE blocks by adding them to a list of state-of-the-art CNN models (e.g., VGG16, Inception, ResNet18, and PyramidNet18) and comparing the corresponding HAR performances. Mekruksavanich et al. [18] proposed a DL model called the SEResNet-BiGRU, which is a combination of residual blocks, SE blocks, and bidirectional gated recurrent units (BiGRUs), and applied it to transitional activity recognition. Khan et al. [21] proposed a multi-branch DL architecture where each branch uses a CNN model with an SE block to extract and re-weight feature maps. The above DL models with SE blocks are summarized in Table 3.

Several DL architectures have been extensively proposed and investigated in HAR using sensor data. In order to improve the HAR performances, instead of using only one branch, we can implement DL architectures with multiple branches such that several different and unique sets of feature maps will be obtained and helpful in classifying the activities. There are two common categories of multi-branch architectures. In the first category, we consider a scenario wherein a set of wearable sensors (e.g., IMUs) is attached to parts of the user body (e.g., a wrist, an ankle, the chest). Therefore, different sets of sensor data are obtained initially. These sets of sensor data are inputted into a multi-branch DL architecture. Each branch receives one set and extracts the corresponding features by using the same DL model independently [19,22,23]. In order to obtain a set of feature maps on each branch, Rueda et al. [19] applied a series of convolutional layers and max-pooling layers, Liu et al. [22] implemented stacked convolutional layers, and Al-qaness et al. [23] employed a CNN model with residual blocks. The feature maps of all branches are fused (i.e., feature fusion) by using concatenation. In the second category, we apply a set of sensor data to a multi-branch DL architecture where each branch uses a different DL model and results in a different set of features. Three-branch DL architectures were proposed in [20,21,24-28], where a CNN model [20,24,27,28], a hybrid of a CNN model and a bidirectional long short-term memory (LSTM) layer [25], a CNN model with an SE block [21], and a hybrid of convolutional layers and gated recurrent unit (GRU) layers [26] were used on each branch to extract a set of features. The differences among these branches are the kernel sizes of the convolutional layers [20,21,24-28] and the number of layers [20]. Similarly, the output feature sets are combined by using concatenation. We summarize the aforementioned multi-branch DL architectures in Table 4.

Recently, the SE mechanism (i.e., the squeeze and excitation operations) has been applied to feature fusion in multi-branch DL architectures, where the feature maps from each branch are recalibrated before being fused together. Li et al. [29] proposed a model called the temporal-spectral-based squeeze-and-excitation feature fusion network (TS-SEFFNet) to classify motor-imagery tasks by using electroencephalography (EEG) signals. The TS-SEFFNet receives EEG signals and uses two branches with different DL models, called the deep-temporal convolution block and the multi-spectral convolution block, to extract two different sets of feature maps. The feature maps of each set are recalibrated by using the SE mechanism. The TS-SEFFNet combines the outputs of these two branches by using concatenation.
Instead of using sensor data from one modality, multi-modal classification [30] receives data from multiple modalities and has gained a significant amount of attention [31-33]. Essentially, a multi-modal classification model is implemented based on a multi-branch architecture where each branch receives different modal data and extracts the corresponding features. These features obtained from various modalities are combined and sent to the classification process. Since the feature maps from each modality contribute information unequally, an efficient fusion method must be investigated [31-33].
In addition, several SE-based feature fusion methods have been extensively investigated in multi-modal classification. Jia et al. [34] proposed a feature fusion method called the multi-modal SE feature fusion module to combine feature maps from EEG signals and feature maps from electrooculogram (EOG) signals for sleep-staging classification. Unlike [16,29], where each branch computes the weights for the feature-map calibration in the excitation operation separately, the multi-modal SE feature fusion module calculates the channel weights based on the channel-wise statistics from both the EEG feature maps and the EOG feature maps.
Shu et al. [35] proposed a DL model called the expansion-squeeze-excitation fusion network (ESE-FN) for elderly activity recognition using RGB videos and skeleton sequences. The ESE-FN applies two successive fusion modules (modal fusion and channel fusion) to combine RGB features and skeleton features properly. The modal-fusion module performs modal attention, where modal-wise weights are computed and multiplied with the corresponding modalities' feature maps. The channel-fusion module obtains channel attention by calculating channel-wise weights and multiplying them with the feature maps. Both modules apply a new attention mechanism called the expansion-squeeze-excitation, which consists of three operations: expansion, squeeze, and excitation. The expansion is operated by using convolutional layers to expand the depth along the modality dimension for the modal fusion and along the channel dimension for the channel fusion. The squeeze and excitation operations are similar to those in [16]. A summary of the work in [29,34,35] is shown in Table 5.

Data Collection and Preparation
In order to evaluate the proposed HAR classification architectures in Sections 4 and 5, we select datasets whose sensor data are from IMUs attached to various locations of the user body. Thereafter, we preprocess the sensor data by scaling and segmentation. The details are explained as follows. The sensor data are from the following three wearable-sensor datasets:
• PAMAP2: The PAMAP2 dataset [36] contains the sensor data collected from nine subjects who performed 18 physical activities. However, here, only 12 activities are considered: lying, sitting, standing, ironing, vacuum cleaning, descending stairs, walking, Nordic walking, cycling, ascending stairs, running, and rope jumping. Three IMUs were attached to a wrist, the chest, and an ankle of each subject. Each IMU was equipped with two triaxial accelerometers, one triaxial gyroscope, and one triaxial magnetometer. As a result, 12 types of sensor data (M = 12) were obtained from each IMU. The sampling rate was set to 100 Hz.
• DaLiAc: The DaLiAc dataset [37] contains the sensor data collected from 19 subjects who performed 13 physical activities: sitting, lying, standing, washing dishes, vacuuming, sweeping, walking, ascending stairs, descending stairs, treadmill running (8.3 km/h), bicycling on an ergometer (50 W), bicycling on an ergometer (100 W), and rope jumping. A total of four IMUs were attached to the right hip, the right wrist, the chest, and the left ankle. Each IMU was equipped with one triaxial accelerometer and one triaxial gyroscope. As a result, six types of sensor data (M = 6) were obtained from each IMU. The sampling rate was set to approximately 200 Hz.
• DSAD: The DSAD dataset [38] contains the sensor data collected from eight subjects who performed 19 physical activities: sitting, standing, lying on the back, lying on the right side, ascending stairs, descending stairs, standing still in an elevator, moving around in an elevator, walking in a parking lot, walking on a treadmill at 4 km/h in a flat position, walking on a treadmill at 4 km/h at a 15-degree inclined position, running on a treadmill at 8 km/h, exercising on a stepper, exercising on a cross trainer, cycling on an exercise bike in a horizontal position, cycling on an exercise bike in a vertical position, rowing, jumping, and playing basketball. Five IMUs were attached to the torso, right arm, left arm, right leg, and left leg. Each IMU was equipped with one triaxial accelerometer, one triaxial gyroscope, and one triaxial magnetometer. As a result, nine types of sensor data (M = 9) were obtained from each IMU. The sampling rate was set to 25 Hz.
The sensor data used to predict the current activity are obtained from different sensor types and vary within different ranges. It is a common step to apply data scaling such that the values of these sensor data are within the same range. In this work, the standardization method is applied to transform the sensor data such that their mean and standard deviation become zero and one, respectively. Let z^(n)_{t,m} be the sensor value at the t-th point obtained from the m-th sensor data of the n-th IMU. Its standardized value is obtained from

z̃^(n)_{t,m} = ( z^(n)_{t,m} − µ_m ) / σ_m, (1)

where µ_m and σ_m are the mean and the standard deviation of the m-th sensor data, respectively. Thereafter, the standardized sensor data are segmented into windows of L data points. The length L is set to 300, 600, and 125 data points for PAMAP2, DaLiAc, and DSAD, respectively (which correspond to a three-second window, a three-second window, and a five-second window, respectively). A summary of the sensor data which will be used in the evaluation is shown in Table 6.
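A minimal NumPy sketch of this preprocessing; the use of non-overlapping windows is an illustrative assumption, since the windowing policy is not restated in this excerpt:

```python
import numpy as np

def standardize(z):
    """Standardize one sensor stream: subtract its mean, divide by its std."""
    return (z - z.mean()) / z.std()

def segment(z, L):
    """Cut a 1D sensor stream into non-overlapping windows of L data points."""
    n_windows = len(z) // L
    return z[:n_windows * L].reshape(n_windows, L)

# Toy stream standing in for the m-th sensor data of one IMU
z = np.arange(1000, dtype=float)
z_std = standardize(z)
windows = segment(z_std, 300)      # L = 300 (PAMAP2: 3 s at 100 Hz)
assert abs(z_std.mean()) < 1e-9 and abs(z_std.std() - 1.0) < 1e-9
assert windows.shape == (3, 300)
```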

Proposed Architecture
The proposed architecture is shown in Figure 1, which is based on a multi-branch architecture. There are N branches to receive the inputs from N IMUs. The number N is equal to 3, 4, and 5 for the PAMAP2, DaLiAc, and DSAD datasets, respectively, as shown in Table 6. Each branch receives the input X^(n) from an IMU and uses a one-dimensional (1D) CNN model to extract a set of feature maps A^(n). Since each feature map carries information of different significance, we propose the merging-squeeze-excitation (MSE) feature fusion to combine these N sets of feature maps A^(n) by applying the SE mechanism [16] and produce the channel-weighted and merged feature maps Y_mse, which are used to predict the corresponding activity. The details are provided as follows.

Figure 1. The proposed architecture receives the inputs X^(n) from N IMUs attached to several parts of the user body. Each branch extracts a set of feature maps A^(n) independently by using a 1D CNN model. In the merging-squeeze-excitation feature fusion stage, each set of feature maps is calibrated by using a set of channel weights. A merging method combines the sets of channel-weighted feature maps P^(n) and produces a new set of channel-weighted and merged feature maps Y_mse, which is used later in the classification process.

Input and Feature Extraction
The input X^(n) ∈ R^{1×L×M} is a three-dimensional (3D) array (consisting of the height, width, and channel dimensions) storing data segments of all sensor data from the n-th IMU, where M is the number of sensor-data types per IMU and L is the number of data points in one segment. It can be expressed as X^(n) = [x^(n)_1; x^(n)_2; ...; x^(n)_M], where x^(n)_m is a data segment of the m-th sensor from the n-th IMU as expressed in Equation (2). Note that [(•); (•); ...; (•)] denotes that the elements inside are arranged along the channel dimension. Each branch applies a 1D CNN model to extract feature maps A^(n) ∈ R^{1×W×C}, where W and C are the width and the number of channels, respectively. The feature maps A^(n) can be expressed as A^(n) = [a^(n)_1; a^(n)_2; ...; a^(n)_C], where the row vector a^(n)_c = [a^(n)_{1,c}, a^(n)_{2,c}, ..., a^(n)_{W,c}] and a^(n)_{w,c} is the value at the w-th data point of the c-th channel. The following CNN models are considered as feature extractors due to their simplicity, low numbers of layers, and low computational complexities: LeNet5 [39], AlexNet [40], and VGG16 [41]. Note that these models originally consist of two-dimensional (2D) layers since they are applied to image processing. Here, we implement their 1D versions by changing all 2D layers to 1D layers. For example, 2D convolutional layers are replaced by 1D convolutional layers, and 2D max-pooling layers are replaced by 1D max-pooling layers. The other parameters, such as the numbers of filters and kernel sizes, are unchanged. These 1D CNN structures are summarized in Appendix A. The width W and the number of channels C of A^(n) according to the considered CNN models are shown in Table 7. In addition to these three CNN models, other CNN models can be applied to extract A^(n).
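To make the 2D-to-1D replacement concrete, the following NumPy sketch implements a 1D convolutional layer (with ReLU) followed by a 1D max-pooling layer; the filter count, kernel size, and pool size are illustrative assumptions, not the exact LeNet5/AlexNet/VGG16 settings:

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1D convolution: x (L, M) with C kernels of shape (k, M) -> (L - k + 1, C)."""
    k = kernels.shape[1]
    out = np.array(
        [[float((x[i:i + k] * K).sum()) for K in kernels]
         for i in range(x.shape[0] - k + 1)]
    )
    return np.maximum(out, 0.0)        # ReLU activation

def maxpool1d(x, p=2):
    """1D max pooling with pool size p along the width dimension."""
    L = (x.shape[0] // p) * p
    return x[:L].reshape(-1, p, x.shape[1]).max(axis=1)

# Toy input: L = 10 data points, M = 3 sensor channels; C = 4 filters, k = 3
rng = np.random.default_rng(1)
x = rng.standard_normal((10, 3))
kernels = rng.standard_normal((4, 3, 3))   # (C, k, M)
A = maxpool1d(conv1d(x, kernels))
assert A.shape == (4, 4)                   # width (10 - 3 + 1) // 2 = 4, C = 4
```

The only structural difference from the 2D versions is that the kernels slide along the single width (time) dimension instead of both spatial dimensions.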

Merging-Squeeze-Excitation Feature Fusion
Conventional feature fusion methods [19-28] combine all feature maps from all branches equally without considering which feature maps are useful. However, some feature maps in A^(n) may be unhelpful for the classification, and they should be suppressed, while the informative feature maps in A^(n) should be emphasized. Therefore, in this work, inspired by the SE mechanism [16], we propose a feature fusion method called the merging-squeeze-excitation feature fusion, which is aware of this issue. As shown in Figure 1, all sets of feature maps A^(n), for n = 1, 2, ..., N, are firstly combined in the pre-merging step to create a set of pre-merged feature maps B. Unlike [16], in the squeeze step, the channel-wise statistics h^(n) used to compute the channel weights s^(n) are computed according to both the feature maps A^(n) and the pre-merged feature maps B. This implies that the importance of each feature map in A^(n) is measured not only from A^(n) but also from B. Accordingly, we find the corresponding channel weights s^(n), multiply them with A^(n), and obtain the channel-weighted feature maps P^(n) in the excitation step. Finally, in the post-merging step, we recombine P^(n), for n = 1, 2, ..., N, using the same merging method as in the pre-merging step to obtain the channel-weighted and merged feature maps Y_mse, which will be used in the classification process later. The details of these steps are explained as follows.

Pre-Merging
We use the pre-merging step to initially combine the feature maps A^(n) from all N branches together and to produce the pre-merged feature maps B ∈ R^{1×W×C}, which will be used along with A^(n) to compute the channel weights. The feature maps B can be expressed as B = [b_1; b_2; ...; b_C], where the row vector b_c ∈ R^{1×W} is expressed as b_c = [b_{1,c}, b_{2,c}, ..., b_{W,c}] and b_{w,c} is the value at the w-th data point of the c-th channel. Several feature-merging methods are available [42]. Here, we investigate and compare the following methods:

• Addition merging creates the feature maps B by using element-wise addition. The value b_{w,c} is obtained from

b_{w,c} = Σ_{n=1}^{N} a^(n)_{w,c}, (3)

where a^(n)_{w,c} is the value at the w-th data point of the c-th channel of A^(n).
• Maximum merging creates the feature maps B by using the element-wise maximum operation. The value b_{w,c} is obtained from

b_{w,c} = max( a^(1)_{w,c}, a^(2)_{w,c}, ..., a^(N)_{w,c} ). (4)

• Minimum merging creates the feature maps B by using the element-wise minimum operation. The value b_{w,c} is obtained from

b_{w,c} = min( a^(1)_{w,c}, a^(2)_{w,c}, ..., a^(N)_{w,c} ). (5)

• Average merging creates the feature maps B by using the element-wise averaging operation. The value b_{w,c} is obtained from

b_{w,c} = (1/N) Σ_{n=1}^{N} a^(n)_{w,c}. (6)

For future usage, we denote the merging operation as F_Merge(•). Specifically, we have

B = F_Merge( A^(1), A^(2), ..., A^(N) ). (7)
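The four merging operations above amount to element-wise reductions over the stacked branch outputs, as this NumPy sketch shows:

```python
import numpy as np

# The four element-wise merging operations F_Merge over N branches,
# where each branch contributes feature maps of shape (W, C).
MERGE = {
    "addition": lambda stack: stack.sum(axis=0),
    "maximum":  lambda stack: stack.max(axis=0),
    "minimum":  lambda stack: stack.min(axis=0),
    "average":  lambda stack: stack.mean(axis=0),
}

def merge(feature_maps, method):
    """feature_maps: list of N arrays of shape (W, C); returns one (W, C) array."""
    return MERGE[method](np.stack(feature_maps))

# Toy example with N = 2 branches, W = 2, C = 2
A1 = np.array([[1.0, 4.0], [2.0, 0.0]])
A2 = np.array([[3.0, 2.0], [6.0, 2.0]])
assert np.array_equal(merge([A1, A2], "addition"), [[4.0, 6.0], [8.0, 2.0]])
assert np.array_equal(merge([A1, A2], "maximum"),  [[3.0, 4.0], [6.0, 2.0]])
assert np.array_equal(merge([A1, A2], "average"),  [[2.0, 3.0], [4.0, 1.0]])
```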

Squeeze and Excitation
In the squeeze-and-excitation step, we recalibrate each set of feature maps A^(n) such that the informative feature maps are emphasized and the ambiguous feature maps are suppressed by using channel weights, which are computed according to both A^(n) and B. First, we obtain the channel-wise statistics u ∈ R^C by passing B to a 1D GAP layer and the channel-wise statistics g^(n) ∈ R^C by passing A^(n) to another 1D GAP layer. The statistics u are expressed as u = [u_1, u_2, ..., u_C]^T and the statistics g^(n) are expressed as g^(n) = [g^(n)_1, g^(n)_2, ..., g^(n)_C]^T, where [•, •, ..., •]^T denotes the transpose, u_c is obtained by averaging the values in the c-th 1D feature map of B, and g^(n)_c is obtained by averaging the values in the c-th 1D feature map of A^(n). Specifically, we have

u_c = (1/W) Σ_{w=1}^{W} b_{w,c} (8)

and

g^(n)_c = (1/W) Σ_{w=1}^{W} a^(n)_{w,c}. (9)

Thereafter, we obtain the channel-wise statistics h^(n) ∈ R^C by adding g^(n) and u:

h^(n) = g^(n) + u. (10)

The statistics h^(n) are expressed as h^(n) = [h^(n)_1, h^(n)_2, ..., h^(n)_C]^T. The vector of channel weights, s^(n), is obtained by using two fully connected (FC) layers with the ReLU activation after the first FC layer and the Sigmoid activation after the second FC layer [16]:

s^(n) = σ( W_2 δ( W_1 h^(n) ) ), (11)

where σ(•) is the Sigmoid activation function, δ(•) is the ReLU activation function, W_1 ∈ R^{(C/r)×C} is the weight matrix of the first FC layer, W_2 ∈ R^{C×(C/r)} is the weight matrix of the second FC layer, and r is the reduction ratio which is used to reduce the first FC layer's output dimension.
Finally, we recalibrate the feature maps A^(n) according to the channel weights s^(n) to emphasize useful feature maps and suppress ambiguous feature maps and, then, obtain a set of channel-weighted feature maps P^(n) ∈ R^{1×W×C}. The feature maps P^(n) can be expressed as

P^(n) = [ s^(n)_1 a^(n)_1; s^(n)_2 a^(n)_2; ...; s^(n)_C a^(n)_C ]. (12)

For future usage, we denote the squeeze-and-excitation operation to compute P^(n) as

P^(n) = F_SE( A^(n), B ). (13)
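The squeeze-and-excitation step for one branch can be sketched as follows; the function name f_se and the passing of the FC weight matrices as arguments are illustrative assumptions (the excerpt does not state whether the FC weights are shared across branches):

```python
import numpy as np

def f_se(A_n, B, W1, W2):
    """Squeeze-and-excitation step of the MSE feature fusion for one branch.

    A_n: local feature maps (W, C); B: pre-merged feature maps (W, C).
    The channel statistics combine the branch's own statistics g with
    those of the pre-merged maps u, so a channel is emphasized when it
    is informative either locally or after merging.
    """
    g = A_n.mean(axis=0)                   # squeeze the local maps A_n
    u = B.mean(axis=0)                     # squeeze the pre-merged maps B
    h = g + u                              # combined channel-wise statistics
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ h, 0.0))))  # channel weights
    return A_n * s                         # channel-weighted maps P_n

# Toy example: N = 2 branches, W = 8, C = 4, reduction ratio r = 2
rng = np.random.default_rng(4)
A1, A2 = rng.standard_normal((2, 8, 4))
B = (A1 + A2) / 2                          # pre-merged maps (average merging)
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 2))
P1 = f_se(A1, B, W1, W2)
assert P1.shape == (8, 4)
```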

Post-Merging
The post-merging step applies the merging method used in the pre-merging step to combine the N sets of channel-weighted feature maps P^(n) and obtain the channel-weighted and merged feature maps Y_mse. Similar to Section 4.2.1, we can express

Y_mse = F_Merge( P^(1), P^(2), ..., P^(N) ). (14)

The set of feature maps Y_mse will be used in the classification process.

Classification
In this work, the classifier shown in Figure 1 consists of a 1D GAP layer and two FC layers, where the ReLU activation function is used in the first FC layer and the Softmax activation function is used in the second FC layer. The numbers of neurons in the first and second FC layers are 1024 and K, respectively, where K is the number of classes (depending on the dataset). As specified in Section 3, the number of classes K is 12, 13, and 19 for the PAMAP2, DaLiAc, and DSAD datasets, respectively. Note that other classifier structures are applicable.
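A NumPy sketch of this classifier head; the weight values below are random placeholders and training is omitted:

```python
import numpy as np

def classifier_head(Y, W1, b1, W2, b2):
    """Classifier: 1D GAP, then FC(1024) + ReLU, then FC(K) + Softmax.

    Y: fused feature maps (W, C); W1: (1024, C); W2: (K, 1024).
    Returns a probability vector over the K activity classes.
    """
    f = Y.mean(axis=0)                       # 1D global average pooling -> (C,)
    h = np.maximum(W1 @ f + b1, 0.0)         # first FC layer with ReLU
    logits = W2 @ h + b2                     # second FC layer
    e = np.exp(logits - logits.max())        # numerically stable Softmax
    return e / e.sum()

# Toy sizes: W = 6, C = 5, K = 12 (PAMAP2 has 12 classes), 1024 hidden units
rng = np.random.default_rng(2)
Y = rng.standard_normal((6, 5))
W1, b1 = rng.standard_normal((1024, 5)), np.zeros(1024)
W2, b2 = rng.standard_normal((12, 1024)), np.zeros(12)
probs = classifier_head(Y, W1, b1, W2, b2)
assert probs.shape == (12,) and abs(probs.sum() - 1.0) < 1e-9
```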

Extensions of Merging-Squeeze-Excitation Feature Fusion
In this section, we present four extensions of the MSE feature fusion: the MSE feature fusion with local skip connections, the MSE feature fusion with a global skip connection, the MSE feature fusion with global channel attention, and the deep MSE feature fusion. Their performances will be evaluated and compared in Section 6.

MSE Feature Fusion with Skip Connections
Skip connections were used in ResNet models [43] to solve the vanishing-gradient issue. Here, we apply this technique to the MSE feature fusion such that the feature maps entering the classification will be at least as good as the feature maps obtained from the earlier step. We consider two possible positions to add skip connections.

• The MSE feature fusion with local skip connections is shown in Figure 2a, where we add a skip connection to each branch. As a result, following the definition of Q^(n) in Table 1, we have

Q^(n) = P^(n) + B, for n = 1, 2, ..., N. (15)

The feature maps Y_lsc that enter the classifier are obtained from

Y_lsc = F_Merge( Q^(1), Q^(2), ..., Q^(N) ). (16)

• The MSE feature fusion with a global skip connection is shown in Figure 2b. We create a skip connection over the MSE feature fusion such that the pre-merged feature maps B from the pre-merging step are added to the channel-weighted and merged feature maps Y_mse. Thereafter, we have Y_gsc entering the classifier as follows:

Y_gsc = Y_mse + B, (17)

where Y_mse is defined in (14). The prediction will be based on both Y_mse and B.
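Both skip-connection variants can be sketched as follows, taking Q^(n) as the addition of P^(n) and B per the symbol definitions in Table 1, and using average merging purely for illustration:

```python
import numpy as np

# P_list: channel-weighted maps per branch; B: pre-merged maps;
# Y_mse: the fused output of the post-merging step.

def fuse_local_skip(P_list, B):
    """Y_lsc: add the pre-merged maps to each branch before post-merging."""
    Q = [P_n + B for P_n in P_list]          # Q(n) = P(n) + B
    return np.stack(Q).mean(axis=0)          # average merging of Q(n)

def fuse_global_skip(Y_mse, B):
    """Y_gsc: add the pre-merged maps to the fused output."""
    return Y_mse + B

# Toy example with N = 2 branches, W = 1, C = 2
P1 = np.array([[1.0, 2.0]])
P2 = np.array([[3.0, 6.0]])
B = np.array([[10.0, 20.0]])
Y_mse = np.stack([P1, P2]).mean(axis=0)      # average merging of P(n)
assert np.array_equal(fuse_local_skip([P1, P2], B), [[12.0, 24.0]])
assert np.array_equal(fuse_global_skip(Y_mse, B), [[12.0, 24.0]])
# Note: with average merging the two variants coincide; with maximum or
# minimum merging they generally differ.
```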

MSE Feature Fusion with Global Channel Attention
In the proposed MSE feature fusion shown in Figure 1, the channel-weighted and merged feature maps Y_mse are obtained from F_Merge(P(1), P(2), . . ., P(N)). In addition, we may compute a different set of channel-weighted feature maps directly from the channel dependency of B (the output of the pre-merging step). Figure 3 shows the MSE feature fusion with global channel attention, where we create an additional set of channel-weighted feature maps R ∈ R^(1×W×C) according to B. The set of feature maps R is denoted as R = [r_1; r_2; . . .; r_C], where r_c = [r_{1,c}, r_{2,c}, . . ., r_{W,c}] and each r_{w,c} is a scalar value. Similar to the previous calculation, we obtain R according to the following steps. We find the channel weights v = [v_1, v_2, . . ., v_C] from u, where u is the channel-wise statistics as shown in Section 4.
Finally, the set of channel-weighted and merged feature maps Y_gca entering the classifier is obtained by combining Y_mse (defined in (14)) and R. As a result, the prediction is computed from both local-channel-attention and global-channel-attention feature maps. The output feature maps Y_gca contain both Y_mse (where feature maps are calibrated and then merged) and R (where feature maps are merged and then calibrated).
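The global-channel-attention branch can be sketched as a standard squeeze-and-excitation applied to the pre-merged maps B. The weight matrices W1 and W2 below correspond to the two FC layers with reduction ratio r; their values here are random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_channel_attention(B, W1, W2):
    """SE-style recalibration of the pre-merged feature maps B of shape (1, W, C)."""
    u = B.mean(axis=1).ravel()                   # squeeze: channel-wise statistics u in R^C
    v = sigmoid(W2 @ np.maximum(W1 @ u, 0.0))    # excitation: two FC layers, ReLU then Sigmoid
    return v.reshape(1, 1, -1) * B               # r_c = v_c * b_c

rng = np.random.default_rng(2)
W_, C_, r = 8, 16, 8                             # reduction ratio r = 8 as in the experiments
B = rng.standard_normal((1, W_, C_))
W1 = rng.standard_normal((C_ // r, C_)) * 0.1    # first FC weight matrix, (C/r) x C
W2 = rng.standard_normal((C_, C_ // r)) * 0.1    # second FC weight matrix, C x (C/r)
R = global_channel_attention(B, W1, W2)
```

How R is then combined with Y_mse follows the equation elided from the extracted text; since R and Y_mse share the same shape (1, W, C), element-wise addition would be one consistent choice, but this sketch only computes R itself.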

Deep MSE Feature Fusion
Instead of using only a one-level MSE feature fusion to combine and recalibrate the feature maps A(n) as shown in Figure 1, we can stack a series of MSE feature fusion blocks to create the deep MSE feature fusion, where feature maps are merged and weighted multiple times. The structure of the deep MSE feature fusion is shown in Figure 4a, where D MSE feature fusion blocks are connected in series. The d-th block, shown in Figure 4b, receives the channel-weighted feature maps P̃(d−1,n), for n = 1, 2, . . ., N, and the channel-weighted and merged feature maps Ỹ_deep,(d−1) from the previous block, and creates the new channel-weighted feature maps P̃(d,n) and the new channel-weighted and merged feature maps Ỹ_mse,(d). Note that P̃(0,n) and Ỹ_deep,(0) are equal to A(n) and B (defined in (7)), respectively. The computation within each block is similar to that in Section 4.
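The series connection can be sketched as a loop over simplified MSE blocks. Note this is a reduced sketch: the paper's d-th block also receives Ỹ_deep,(d−1) as an input, which is omitted here; the channel weights are derived only from the merged per-branch maps, and the weight matrices are random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mse_block(P_prev, params):
    """One MSE feature fusion block (simplified sketch).

    Pre-merges the N per-branch maps, derives channel weights from the
    merged maps, recalibrates each branch, and post-merges the result.
    """
    W1, W2 = params
    B = np.mean(np.stack(P_prev), axis=0)          # pre-merging (average)
    u = B.mean(axis=1).ravel()                     # squeeze
    s = sigmoid(W2 @ np.maximum(W1 @ u, 0.0))      # excitation
    P_new = [s.reshape(1, 1, -1) * p for p in P_prev]
    return P_new, np.mean(np.stack(P_new), axis=0) # post-merging

rng = np.random.default_rng(3)
N, W_, C_, r, D = 3, 8, 16, 8, 3
A = [rng.standard_normal((1, W_, C_)) for _ in range(N)]
P = A                                              # P(0, n) = A(n)
for d in range(D):                                 # D blocks connected in series
    params = (rng.standard_normal((C_ // r, C_)) * 0.1,
              rng.standard_normal((C_, C_ // r)) * 0.1)
    P, Y_deep = mse_block(P, params)               # Y_deep after the last block feeds the classifier
```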

Experimental Setup
All experiments were implemented using the Python programming language and Python libraries such as Scikit-learn, TensorFlow, and Keras. They were run on the Google Colab Pro+ platform. The performances of the investigated models were measured by the accuracy score, which is obtained from

Accuracy = (1/K) Σ_{k=1}^{K} (TP_k + TN_k) / (TP_k + TN_k + FP_k + FN_k),

where K is the number of classes, TP_k is the number of true positives of the k-th class, FP_k is the number of false positives of the k-th class, TN_k is the number of true negatives of the k-th class, and FN_k is the number of false negatives of the k-th class. There are two basic approaches to evaluating model performance: the training-validation-testing split and k-fold cross validation. The training-validation-testing split divides a dataset into three separate parts: a training set, a validation set, and a testing set. The performance results of the investigated model therefore highly depend on the data in the testing set. To avoid this problem, similar to [18,22,25,27], we applied k-fold cross validation with k set to 10. The 10-fold cross validation divides a dataset into 10 parts; one part is selected as the testing set while the remaining nine parts form the training set. We evaluate each investigated model 10 times, selecting a different part as the testing set each time, and the reported performance is the average of the testing scores. The investigated models were trained by minimizing the categorical cross-entropy using the Adam optimizer with the settings β1 = 0.9, β2 = 0.999, and ε = 10^−7. The learning rate was set to 0.001, the batch size was 32, and the number of epochs was 40. We did not experience an overfitting issue; our training scores are only slightly higher than the testing scores.
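The 10-fold rotation described above can be sketched as follows. This is a simplified, framework-free version: the actual experiments trained Keras models with the Adam settings given in the text, whereas here `train_eval_fn` is an arbitrary callable returning a test accuracy:

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

def cross_validate(X, y, train_eval_fn, k=10):
    """Each fold serves once as the test set; report the average test score."""
    folds = kfold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_eval_fn(X[train_idx], y[train_idx],
                                    X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Toy run with a dummy evaluator that always reports an accuracy of 1.0.
X = np.zeros((100, 4))
y = np.zeros(100)
acc = cross_validate(X, y, lambda *a: 1.0)
```

Every sample appears in exactly one test fold, so the average over the 10 rotations uses each sample for testing exactly once.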

Baseline Architectures
We consider the single-branch DL architecture and the multi-branch DL architecture shown in Figure 5 as our baseline architectures for the performance comparison. The classifiers in these two architectures are the same as those used in the proposed MSE feature fusion, as shown in Figure 1 and explained in Section 4.3.

• For the single-branch DL architecture, all available sensor data are combined before a set of features is extracted [13]. Here, all sensor data X(n) (from N IMUs) are concatenated together along the channel dimension. We denote this new array as X ∈ R^(1×L×NM). Thereafter, a 1D CNN model extracts a set of feature maps A ∈ R^(1×W×C) which is used in the classification process.
• The multi-branch DL architecture consists of N branches that receive the sensor data X(n) individually [19,22,23]. Each branch extracts a set of feature maps A(n) using a 1D CNN model. Here, we concatenate these N sets of feature maps together along the channel dimension and obtain a new array Y_mb ∈ R^(1×W×NC), which is sent to the classifier.
Note that the sensor data X(n) and the feature maps A(n) were defined in Section 4, and the values W and C were shown in Table 7. The performances of these architectures are evaluated by classifying the PAMAP2, DaLiAc, and DSAD datasets, where three CNN models (LeNet5, AlexNet, and VGG16) are used as feature extractors. The accuracy scores are shown in Table 8 and will be compared to those achieved by the proposed architectures. We observe that the single-branch architectures outperform the multi-branch architectures in many cases. One reason is that the multi-branch architectures extract too many features (the output of the GAP in the classifier), some of which may be ambiguous: the number of features out of the GAP is NC in the multi-branch architectures but only C in the single-branch architectures. As seen in Table 8, the single-branch architectures using AlexNet offer the highest accuracy scores of 98.77%, 97.60%, and 97.18% for the PAMAP2, DaLiAc, and DSAD datasets, respectively.
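The shape bookkeeping of the two baselines can be made concrete in NumPy. The dimensions N, L, M, W, and C below are hypothetical placeholders, not the values from Table 7:

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, M = 5, 125, 9          # N IMUs, window length L, M sensor channels per IMU
W_, C_ = 8, 64               # feature-map width and channels out of the CNN backbone

# Single-branch: concatenate raw sensor data along the channel dimension first,
# then a single 1D CNN would map (1, L, N*M) to feature maps (1, W, C).
X = [rng.standard_normal((1, L, M)) for _ in range(N)]
X_cat = np.concatenate(X, axis=2)            # shape (1, L, N*M)

# Multi-branch: one CNN per IMU produces A(n), then feature maps are concatenated.
A = [rng.standard_normal((1, W_, C_)) for _ in range(N)]
Y_mb = np.concatenate(A, axis=2)             # shape (1, W, N*C)

# After GAP, the classifier sees C features (single-branch) vs N*C (multi-branch).
n_features_single, n_features_multi = C_, N * C_
```

This makes the point in the text explicit: the multi-branch baseline hands the classifier N times as many pooled features as the single-branch baseline.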

Proposed Merging-Squeeze-Excitation Feature Fusion
The performances of the proposed MSE feature fusion in Section 4 and its extensions in Section 5 are shown in the following subsections. For each proposed architecture, we compare the accuracy scores among the merging methods (addition, maximum, minimum, and average) and DL models (LeNet5, AlexNet, and VGG16) to determine which combination offers the highest accuracy score on each dataset. Thereafter, the highest accuracy scores of the proposed architectures are compared to determine the best architecture. Note that the reduction ratio r is fixed to eight in all experiments; varying r is left as future work.

MSE Feature Fusion
Table 9 presents the accuracy scores of the MSE feature fusion proposed in Section 4 according to the merging methods, DL models, and datasets. We have the following results:

• The highest accuracy score on each dataset is indicated by an asterisk (*). The MSE feature fusion using the minimum merging and AlexNet achieves the highest accuracy score of 99.17% for the PAMAP2 dataset. The MSE feature fusion using the average merging and AlexNet achieves the highest accuracy scores of 98.32% and 98.04% for the DaLiAc and DSAD datasets, respectively.

• We compare the accuracy scores of the MSE feature fusion to those of the baseline architectures in Section 6.2. According to the highest accuracy scores obtained from these architectures, the MSE feature fusion outperforms the baseline models.
• Among the considered merging methods, the MSE feature fusion using the addition merging offers the worst accuracy scores. The architectures using the other merging methods provide the same level of performance; their accuracy scores are rather close to each other, so we do not have a conclusive result on which merging method is best.
• Among the considered DL models used as feature extractors, the MSE feature fusion using AlexNet outperforms the MSE feature fusion using the other DL models.

Tables 10–12 show the accuracy scores of the MSE feature fusion with skip connections proposed in Section 5.1 on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. Each table presents the accuracy scores according to the merging methods, the DL models, and the skip-connection methods.

Tables 14–16 show the accuracy scores of the deep MSE feature fusion (Section 5.3) using AlexNet as the feature extractor on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. We consider only AlexNet since it outperforms the other DL models, as shown in the previous subsections. Each table presents the accuracy scores according to the merging methods and the number of MSE feature fusion blocks (D). We see that the deep MSE feature fusion with D = 1 offers the highest accuracy scores for all three datasets (i.e., 99.17% for the PAMAP2 dataset, 98.32% for the DaLiAc dataset, and 98.04% for the DSAD dataset). In fact, with D = 1, the deep MSE feature fusion is equivalent to the original MSE feature fusion. This indicates that, for the investigated datasets, increasing the number of MSE feature fusion blocks does not add further useful information to the output feature maps Ỹ_deep,(D) used in the classification process.

Tables 17–19 show the numbers of trainable parameters of the baseline architectures, the proposed MSE feature fusion, and the extensions of the MSE feature fusion on classifying the PAMAP2, DaLiAc, and DSAD datasets, respectively. We do not specify the numbers of trainable parameters for individual merging methods since they are the same. The following results are obtained:

• The numbers of trainable parameters of the proposed MSE feature fusion are higher than those of the single-branch architecture, since the proposed MSE feature fusion consists of several branches using CNN models as feature extractors. On the other hand, the proposed MSE feature fusion requires fewer trainable parameters than the multi-branch architecture, since it reduces the number of features entering the classification process by using the addition, maximum, minimum, or average merging instead of concatenation.

• The numbers of trainable parameters of the extensions of the MSE feature fusion are slightly higher than those of the proposed MSE feature fusion, since the modification parts in the extensions require only a few additional trainable parameters.

Performance Comparison to Other HAR Approaches
Table 20 shows the accuracy scores of other HAR approaches that were evaluated using the PAMAP2, DaLiAc, and DSAD datasets. These accuracy scores were reported in their respective publications; note that their evaluation setups and pre-processing may differ from ours. We compare them to the highest accuracy scores achieved by the original MSE feature fusion (Section 6.3.1). The proposed MSE feature fusion offers higher accuracy scores than those obtained by the other approaches.

Conclusions and Future Work
In this work, we proposed a feature fusion method called the merging-squeeze-excitation (MSE) feature fusion for wearable-sensor-based HAR using multi-branch architectures. The MSE feature fusion calibrates the feature maps during fusion: each feature map is emphasized or suppressed according to its importance, measured from both itself and the corresponding pre-merged feature map. In addition, we presented the following four extensions of the MSE feature fusion: the MSE feature fusion with local skip connections, the MSE feature fusion with a global skip connection, the MSE feature fusion with global channel attention, and the deep MSE feature fusion. LeNet5, AlexNet, and VGG16 were applied as feature extractors, and the addition, maximum, minimum, and average merging were used in the pre-merging and post-merging steps. According to the experimental results, the MSE feature fusion with a global skip connection (using the average merging and AlexNet), the MSE feature fusion with local skip connections (using the minimum merging and AlexNet), and the original MSE feature fusion (using the average merging and AlexNet) achieve the highest accuracy scores of 99.24%, 98.59%, and 98.04% on the PAMAP2, DaLiAc, and DSAD datasets, respectively. For future work, in addition to the channel-attention mechanism, other attention techniques such as spatial attention, modal attention, convolutional block attention, and selective kernel convolution can be applied to feature fusion in order to combine feature maps from different branches more effectively.

where μ(n)_m and σ(n)_m are the mean and standard deviation, respectively, of the values from the m-th sensor data of the n-th IMU. Next, the series of standardized values z(n)_{t,m} is divided into segments using a non-overlapping window method. Each segment consists of L values. Let x(n)_m ∈ R^(1×L) be a segment of the standardized values from the m-th sensor data of the n-th IMU. The row vector x(n)

Figure 1 .
Figure 1. Proposed MSE feature fusion architecture. It consists of the following stages: data inputs, feature extraction, merging-squeeze-excitation feature fusion, and classification. The inputs are the sensor data X(n) from N IMUs attached to several parts of the user body. Each branch extracts a set of feature maps A(n) independently by using a 1D CNN model. In the merging-squeeze-excitation feature fusion stage, each set of feature maps is calibrated by using a set of channel weights. A merging method combines the sets of channel-weighted feature maps P(n) and produces a new set of channel-weighted and merged feature maps Y_mse, which is used later in the classification process.

Figure 2 .
Figure 2. MSE feature fusion architectures with skip connections. There are two types: (a) MSE feature fusion with local skip connections, where a skip connection is added to each branch. As a result, we combine the feature maps Q(n) = P(n) + A(n) instead of P(n); the output Y_lsc still contains the feature maps directly obtained from the 1D CNN models. (b) MSE feature fusion with a global skip connection, where the pre-merged feature maps B are added to the feature maps Y_mse. As a result, the output feature maps Y_gsc contain both the pre-merged feature maps (with no channel weighting) and the channel-weighted feature maps.
where W†_1 ∈ R^((C/r)×C) is the weight matrix of the first FC layer and W†_2 ∈ R^(C×(C/r)) is the weight matrix of the second FC layer. The c-th 1D feature map r_c is equal to the feature map b_c weighted by v_c: r_c = v_c b_c.

Figure 3 .
Figure 3. MSE feature fusion architecture with global channel attention. We compute an additional set of channel-weighted feature maps R, which is obtained from the pre-merged feature maps B. The output feature maps Y_gca contain both Y_mse (where feature maps are calibrated and then merged) and R (where feature maps are merged and then calibrated).

Figure 4 .
Figure 4. Deep MSE feature fusion. We implement a series of D MSE feature fusion blocks such that the feature maps A(1), A(2), . . ., A(N) are calibrated and merged several times, as shown in (a) the deep MSE feature fusion architecture; the structure of the d-th MSE feature fusion block is shown in (b). The output feature maps Ỹ_deep,(D) are used in the classification process.

Table 1 .
List of main symbols used in Sections 4, 5.1 and 5.2.

Table 3 .
A summary of DL models with SE blocks for HAR using sensor data.

Table 4 .
A summary of multi-branch DL architectures for HAR using sensor data.

Table 5 .
A summary of related SE fusion.

Table 6 .
A summary of sensor data from three datasets.

Table 7 .
The width W and the number of channels C of the feature maps A(n) obtained from LeNet5, AlexNet, and VGG16 by using the PAMAP2, DaLiAc, and DSAD datasets.

Table 8 .
Accuracy scores (%) of the baseline architectures on classifying the PAMAP2, DaLiAc, and DSAD datasets, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

Table 9 .
Accuracy scores (%) of the MSE feature fusion on classifying the PAMAP2, DaLiAc, and DSAD datasets, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

• On classifying the PAMAP2 dataset (Table 10), the MSE feature fusion with local skip connections achieves the highest accuracy score of 99.18% when using the minimum merging and AlexNet, while the MSE feature fusion with a global skip connection offers the highest accuracy score of 99.24% when using the average merging and AlexNet. Both architectures outperform the original MSE feature fusion (whose highest accuracy score is 99.17%).
• On classifying the DaLiAc dataset (Table 11), the MSE feature fusion with local skip connections achieves the highest accuracy score of 98.59% when using the minimum merging and AlexNet, while the MSE feature fusion with a global skip connection offers the highest accuracy score of 98.42% when using the minimum merging and AlexNet. Both architectures outperform the original MSE feature fusion (whose highest accuracy score is 98.32%).
• On classifying the DSAD dataset (Table 12), the MSE feature fusion with local skip connections achieves the highest accuracy score of 98.02% when using the average merging and AlexNet, while the MSE feature fusion with a global skip connection offers the highest accuracy score of 97.97% when using the average merging and AlexNet. Both architectures offer lower accuracy scores than the original MSE feature fusion (whose highest accuracy score is 98.04%).
• Since the results are not conclusive, we cannot indicate whether the MSE feature fusion with skip connections is better than the original MSE feature fusion, or which skip-connection method is best.

Table 10 .
PAMAP2 dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the PAMAP2 dataset, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score for each type of skip connection.

Table 11 .
DaLiAc dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the DaLiAc dataset, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score for each type of skip connection.

Table 12 .
DSAD dataset: Accuracy scores (%) of the MSE feature fusion with skip connections on classifying the DSAD dataset, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score for each type of skip connection.

Table 13
shows the accuracy scores of the MSE feature fusion with global channel attention proposed in Section 5.2 according to the merging methods, DL models, and datasets.

Table 13 .
Accuracy scores (%) of the MSE feature fusion with global channel attention on classifying the PAMAP2, DaLiAc, and DSAD datasets, where LeNet5, AlexNet, and VGG16 are applied as feature extractors. The asterisk (*) indicates the highest accuracy score of each dataset.

Table 14 .
Accuracy scores (%) of the deep MSE feature fusion on classifying the PAMAP2 dataset where AlexNet is applied as the feature extractor.

Table 15 .
Accuracy scores (%) of the deep MSE feature fusion on classifying the DaLiAc dataset where AlexNet is applied as the feature extractor.

Table 16 .
Accuracy scores (%) of the deep MSE feature fusion on classifying the DSAD dataset where AlexNet is applied as the feature extractor.

Table 17 .
PAMAP2 dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the PAMAP2 dataset.

Table 18 .
DaLiAc dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the DaLiAc dataset.

Table 19 .
DSAD dataset: The numbers of trainable parameters of the baseline architectures and the proposed MSE feature fusion architectures on classifying the DSAD dataset.

Table 20 .
The accuracy scores (%) of related works that were evaluated on classifying the PAMAP2, DaLiAc, and DSAD datasets.

Table A1 .
Architecture of 1D LeNet5. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.

Table A2 .
Architecture of 1D AlexNet with batch normalization (BatchNorm) layers. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.

Table A3 .
Architecture of 1D VGG16 with batch normalization (BatchNorm) layers. The column names # Filters, K Size, Pad, and Activ are short for number of filters, kernel size, padding, and activation, respectively.