1. Introduction
As an important research topic in ubiquitous computing, Human Activity Recognition (HAR) from wearable sensors is a key element of various intelligent applications, such as smart personal assistants [1], healthcare assessment [2,3,4,5,6], sports monitoring [7], and aging care [8]. HAR can recognize a variety of individual behaviors, such as running and walking, and has a wide range of applications [9]. HAR mainly uses wearable devices to collect individual behavior data and models these data to identify individual behaviors.
Deep neural networks perform well on HAR because they can extract features without explicit feature design. Activity recognition models based on neural networks often consist of three modules: a convolution module that extracts the correlations within the data of a single sensor, a convolution module that extracts the correlations among different kinds of sensor data, and a recurrent neural network module that extracts the correlations among data at different moments. The models use convolutional networks for feature extraction in the spatial dimension and then use recurrent neural networks for feature extraction in the temporal dimension [10]. In addition, some models also use the Attention mechanism to further enhance the generalization of the model [11].
However, the datasets currently used to train activity recognition models for the real world have two major deficiencies. First, sensor data and activity labels are mostly collected using special experimental equipment in a supervised experimental environment, which affects the normal life of users [12,13,14]. Second, the modalities of sensor data are limited [15,16]: existing studies focus on one or a few modalities of sensor readings, which neglects the useful information and relations present in multimodal sensor data. To address these two deficiencies, we build MarSense, an experimental platform for collecting activity recognition data and activity labels. MarSense can automatically collect a large amount of sensor data in the user’s daily life with only a small number of user operations. This data collection method takes full advantage of the powerful functions and widespread availability of smartphones, reduces the cost of data collection, and can collect richer data from a larger population.
We also build our activity recognition model, Marfusion, and train it with the sensor data and labels collected in user experiments on the MarSense platform. Marfusion extracts features in each dimension, together with the associations between different features, in order to uncover hidden features. It extracts features from multimodal sensors using a Convolution Neural Network (CNN) structure and an Attention mechanism. For each sensor, a set of CNN-based networks extracts the features of each modality. After that, Dot-Product Scaled Self Attention is used to process each feature and assign it a weight. Then, the data are multiplied by the corresponding weights and input into the fusion feature extraction module. Finally, the data are input into the classification subnet to obtain the probability values of the different categories. This paper also compares the training results of the Marfusion model with those of several other representative models. The Precision, Recall, and F1 values of Marfusion are higher than those of the other models, which verifies the advantages of Marfusion over existing models in activity recognition.
The rest of the paper is organized as follows. In Section 2, we introduce the existing approaches to HAR, HAR in the wild, and Data Fusion. In Section 3, we introduce Marfusion and explain the principle and mechanism of its vital modules. Section 4 presents the data collection platform, MarSense. Section 5 presents the data collection, including our collection experiment and dataset construction. Section 6 presents the model training and model evaluation. Finally, Section 7 presents the concluding remarks of this paper.
  3. Multimodal HAR Model: Marfusion
In this section, we introduce Marfusion, our proposed multi-modality sensor fusion model for Human Activity Recognition (HAR), which classifies the data from the embedded sensors of smartphones to identify the corresponding activities. The following subsections describe the overall structure and each component of Marfusion in detail.
  3.1. The Overall Structure of Marfusion
Marfusion aims to extract features in each dimension and the associations between different features to extract hidden features. The overall structure of Marfusion is shown in 
Figure 1.
Marfusion extracts features from multimodal sensors by leveraging both the Convolution Neural Network (CNN) structure and the Attention mechanism. Specifically, for each sensor, a set of CNN-based networks extracts the features of each modality. After that, we use Dot-Product Scaled Self Attention to process each feature channel of each sensor and assign it a weight. Then, we multiply the data of the different channels by the corresponding weights and input them into the fusion feature extraction module. The structure of the fusion feature extraction module is the same as that of the sensor-independent feature extraction module, but the parameters may be different. The fusion feature module uses a large convolution kernel to better extract features across multiple sensors. Finally, the data are input into the classification subnet, which is composed of a fully connected layer, a Batch Normalization layer, a Dropout layer, a ReLU activation function layer, and a Softmax activation function layer, to obtain the probability values of the different categories. Next, the principle and mechanism of the vital modules are introduced.
  3.2. Data Augmentation Based on Fast Fourier Transform (FFT)
Considering that noise is unavoidable in sensor data, we capture the noise pattern and frequency features in the multidimensional sensing signals using data augmentation techniques. The sensory signal is essentially composed of functions of time and frequency features that represent the changes in the energy content of a signal [10]. Therefore, to extract frequency features, we use 1-order FFT to transform the data on each axis.
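The process of FFT is as follows. Set a polynomial $A(x)$ with $n$ coefficients ($n$ even), as shown in Equation (1).

$$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \tag{1}$$

It is divided into two parts according to the parity of the exponents, as shown in Equation (2).

$$A(x) = \left(a_0 + a_2 x^2 + \cdots + a_{n-2} x^{n-2}\right) + \left(a_1 x + a_3 x^3 + \cdots + a_{n-1} x^{n-1}\right) \tag{2}$$

Define two polynomials $A_1(x)$ and $A_2(x)$, as shown in Equations (3) and (4).

$$A_1(x) = a_0 + a_2 x + a_4 x^2 + \cdots + a_{n-2} x^{n/2-1} \tag{3}$$

$$A_2(x) = a_1 + a_3 x + a_5 x^2 + \cdots + a_{n-1} x^{n/2-1} \tag{4}$$

Then we can obtain $A(x) = A_1(x^2) + x A_2(x^2)$. Suppose $k < n/2$, and let $x = \omega_n^{k}$ and $x = \omega_n^{k+n/2}$, respectively, where $\omega_n$ is the primitive $n$-th root of unity; the following equations will be deduced.

$$A(\omega_n^{k}) = A_1(\omega_{n/2}^{k}) + \omega_n^{k} A_2(\omega_{n/2}^{k}) \tag{5}$$

$$A(\omega_n^{k+n/2}) = A_1(\omega_{n/2}^{k}) - \omega_n^{k} A_2(\omega_{n/2}^{k}) \tag{6}$$

where we obtain the values of $A(\omega_n^{k})$ and $A(\omega_n^{k+n/2})$ according to the values of $A_1(\omega_{n/2}^{k})$ and $A_2(\omega_{n/2}^{k})$. In this way, the FFT sequence can be evaluated iteratively.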
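The even/odd splitting described above can be illustrated with a short recursive sketch. This is not the authors' implementation: the function names are hypothetical, the root of unity is assumed to follow the common DFT convention $\omega_n = e^{-2\pi i/n}$, and the input length is assumed to be a power of two (a 200-sample window would need padding or a mixed-radix variant).

```python
import cmath


def fft(coeffs):
    """Evaluate the polynomial with these coefficients at the n-th roots of unity.

    Assumes w_n = exp(-2*pi*i/n) and that len(coeffs) is a power of two.
    """
    n = len(coeffs)
    if n == 1:
        return [complex(coeffs[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(coeffs[0::2])  # A1 at the (n/2)-th roots of unity
    odd = fft(coeffs[1::2])   # A2 at the (n/2)-th roots of unity
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # value at w_n^k
        out[k + n // 2] = even[k] - twiddle  # value at w_n^(k + n/2)
    return out


def naive_dft(coeffs):
    """Direct O(n^2) evaluation, used only to cross-check the recursion."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(-2j * cmath.pi * k * j / n) for j, c in enumerate(coeffs))
        for k in range(n)
    ]


def real_imag_channels(axis_samples):
    """Split the spectrum of one sensor axis into real and imaginary sequences."""
    spectrum = fft(axis_samples)
    return [z.real for z in spectrum], [z.imag for z in spectrum]
```

Applying `real_imag_channels` to each of the three axes of a sensor would give three real and three imaginary sequences per sensor.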
The FFT result has two parts: the real part and the imaginary part. Taking the accelerometer as an example, we obtain six data axes after FFT: a three-dimensional real part and a three-dimensional imaginary part. The accelerometer data of different behaviors before FFT are visualized in Figure 2.
As shown in the figure, the data generated by the accelerometer vary greatly when users behave differently. We collect the data with smartphone sensors in real-life scenarios and place no restrictions on the participants during data collection. Therefore, the participants may use their mobile phones while sitting, so the data for sitting fluctuate. After the data are transformed from the time domain into the frequency domain with FFT, periodic noise shows distinct features in the frequency domain, which the neural network can discover easily, and this weakens the influence of the noise on the accuracy of the recognition results.
After Fourier Transformation, the shape of the data of each sensor is $2 \times 200 \times 3$. The 2 represents both the time and frequency domain after Fourier transformation, 200 represents the length of the sequence, and 3 represents the three data-generating axes of each sensor. The transformed data, shown in Figure 3, are the input of the neural network.
  3.3. Feature Extraction Module Based on CNN
In Marfusion, each type of sensor has its own independent feature extraction module for better feature extraction. The feature extraction module is mainly composed of four Convolution layers, a Batch Normalization layer, a ReLU activation function layer, and a Max Pooling layer. The Convolution layers extract the features in the spatial dimension; the Pooling layer compresses the features extracted by the Convolution layers and reduces the dimension of the feature vectors, which reduces the parameters and computation of the network. To guard against exploding and vanishing gradients during training, we use the Batch Normalization layer to transform the data so that the gradients of the model parameters of each layer stay in an appropriate range, which prevents non-convergence and repeated oscillation. To improve the fitting of the data, a ReLU activation function layer is placed before the Max Pooling layer in the feature extraction module. The ReLU function is non-linear and cheap to compute, which further reinforces the fitting of the input data. Note that we apply a 2-D CNN architecture to extract the features of multi-dimension sensors such as the accelerometer, while for single-dimension sensors such as the linear accelerometer, a 1-D CNN is applied.
In our proposed method, multiple-modality sensor streams are introduced into Marfusion. Specifically, acceleration data has three axes. Linear acceleration is the combination of the three axes. The gyroscope contains three modalities. The rotation vector represents the current tilt of the phone, and its data are computed from a combination of axes and angles. The coordinate system of the rotation vector is the same as that of the accelerometer. The magnetic field represents the magnetic field value in three directions of the space coordinate system. Orientation represents the orientation vector that the phone is currently pointing at, and its data are the angles of each axis. In total, 19 modalities of data streams are introduced into the proposed HAR model. In the SensorConv of Marfusion, we use a five-layer convolution network. The structure of MergeConv is the same as that of SensorConv, but the parameters may be different. The specific parameters and the changes in data shape of the convolution layers in SensorConv are shown in Figure 4.
  3.4. Attention Mechanism
For better performance, different sensors are given different weights: we use Dot-product scaled Self-Attention to process the outputs of the different sensors. By evaluating the different sensors, Marfusion dynamically selects the sensors that impact the result and focuses on them. The Attention Mechanism improves the anti-interference capability of Marfusion, which yields higher accuracy on the test set and enhances the generalization ability across different inputs. The equation of the Self-Attention mechanism used in the paper is shown in Equation (7).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$

where $d_k$ is the dimension of the key vectors. $Q$, $K$, and $V$ are all obtained from the input data $X$ after a linear transformation, as Equations (8)–(10) show.

$$Q = XW^{Q} \tag{8}$$

$$K = XW^{K} \tag{9}$$

$$V = XW^{V} \tag{10}$$

$W^{Q}$, $W^{K}$, and $W^{V}$ are the weight matrices of the linear transformations corresponding to $Q$, $K$, and $V$.
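As an illustration of the scaled dot-product self-attention described above, the following is a minimal sketch on plain Python lists. The helper names and the toy weight matrices are assumptions made for the example; in the actual model the weight matrices are trainable and the computation would run on tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (output, attention weights) for input rows x.

    Each row of x is one feature vector (for example, one sensor's features).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned weights sums to 1, so every output row is a weighted average of the value vectors, which is how the model can emphasize the more informative sensors.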
  3.5. Batch Normalization
To guard against exploding and vanishing gradients during training, we use Batch Normalization to optimize the model. During training, if the parameters of a network layer tend to 0, then after successive multiplications the gradient values of some parameters may become so small that the parameters fail to update; this is the vanishing gradient problem. The exploding gradient problem is the opposite: the parameters are too large, which makes the parameter updates unstable. Batch Normalization transforms the input data so that they fluctuate within a small range near 0, which mitigates exploding and vanishing gradients, speeds up training, and improves the stability of the training process.
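The Batch Normalization layer computes statistics on the input data to obtain the mean and variance of each batch, and normalizes the data by this mean and variance so that the mean and variance of the current batch are 0 and 1, respectively. However, such scaling discards part of the feature information in the data, which is not conducive to feature extraction. Therefore, in the last step of Batch Normalization, the data are transformed by the trainable parameters $\gamma$ and $\beta$ to preserve the feature information. The process of data transformation by Batch Normalization is shown in Equations (11)–(14), where $x_1, \ldots, x_m$ are the inputs in the current batch and $\epsilon$ is a small constant for numerical stability.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \tag{11}$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \tag{12}$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{13}$$

$$y_i = \gamma \hat{x}_i + \beta \tag{14}$$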
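The normalization steps can be sketched for a single feature as follows. This is an illustrative training-mode computation only: the function name, the default values of the scale and shift parameters, and the stability constant `eps` are assumptions, and a real layer would also learn the two parameters and track running statistics.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then rescale and shift it."""
    m = len(batch)
    mean = sum(batch) / m                              # batch mean
    var = sum((x - mean) ** 2 for x in batch) / m      # batch variance
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]
```

With the default parameters the output has mean close to 0 and variance close to 1; other values of `gamma` and `beta` let the layer restore a different scale and offset.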
        
  3.6. L2 Regularization
To prevent the model from over-fitting and therefore failing to predict behavior categories from sensor data accurately, we use L2 regularization to constrain the loss function. With L2 regularization, the loss function of the model consists of two parts: the Cross-Entropy loss function, whose value is determined by the difference between the predicted value and the actual value, and the L2 regular penalty term, which is determined by all the weight parameters.
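Before adding the L2 regular penalty term, the loss function of the model is shown in Equation (15), where $N$ is the number of samples, $C$ is the number of categories, $y_{i,c}$ is the actual label, and $\hat{y}_{i,c}$ is the predicted probability.

$$J = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log \hat{y}_{i,c} \tag{15}$$

After adding, the loss function is shown in Equation (16). $J$ represents the original Cross-Entropy loss function, and $\tilde{J}$ represents the loss function after adding the penalty term, where $\lambda$ is the regularization coefficient and the sum runs over all weight parameters $w$.

$$\tilde{J} = J + \frac{\lambda}{2}\sum_{w} w^{2} \tag{16}$$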
Before the penalty term is added, the model is optimized to minimize the Cross-Entropy loss. Once the L2 regular penalty term is added, the optimization becomes a trade-off between the Cross-Entropy loss and the L2 regular term. To an extent, reducing the cross-entropy and reducing the L2 penalty term constrain each other, so to minimize their sum the model tries to find a balance between them.
The addition of the L2 regular term generally drives the weight parameters toward 0, which avoids excessively large parameters, makes the model smoother, and prevents the model from over-learning interference features. Thus, the model performs better on the test set, and its generalization and stability improve.
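The two-part loss can be sketched as follows. The function names are hypothetical, and both the coefficient `lam` and the conventional factor of one half in the penalty are assumptions of this example rather than values taken from the paper.

```python
import math


def cross_entropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class of each sample."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / n


def l2_penalty(weights, lam):
    """L2 penalty over a flat list of weight parameters (assumed form: lam/2 * sum w^2)."""
    return 0.5 * lam * sum(w * w for w in weights)


def regularized_loss(probabilities, labels, weights, lam=1e-2):
    """Cross-entropy plus the L2 penalty on the weights."""
    return cross_entropy(probabilities, labels) + l2_penalty(weights, lam)
```

Because the penalty grows with the squared magnitude of every weight, lowering the total loss requires either fitting the labels better or shrinking the weights, which is the balance the text describes.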
  4. Data Collection Platform: MarSense
MarSense is a software platform specially developed for smartphone sensor data collection. In previous studies, sensor data were mostly collected by wearing special equipment, which has a great impact on users’ daily life and makes it inconvenient for more users to participate in data collection. With their large-scale popularization and rapid development, smartphones are now generally equipped with a variety of physical sensors. Using smartphones to collect data therefore reduces the interference in users’ daily life: users can complete the data collection through a few simple interactions, which reduces both the cost of data collection and the impact on users’ life. As a result, more participants can take part in the data collection experiment, which helps improve the performance of the model in real-life scenarios. The overall architecture of the MarSense platform is shown in Figure 5.
MarSense is mainly composed of two parts:
1. Android client. The Android client is the core of the whole platform and collects sensor data and labels. Data collection is completed automatically by the built-in sensors of the mobile phone, while label collection is completed by interacting with the users according to the experimental setting.
2. Server and database. The main function of the server is to provide storage, retrieval, and download services for the experimental data. In the MarSense client, the experimental data are stored in a file in the form of a log. At an appropriate time, the log file is uploaded to the Object Storage Service (OSS) of the server through the Hyper Text Transfer Protocol (HTTP). These log files are kept permanently in OSS. At the beginning of data pre-processing, the log data generated by the experiment are retrieved and downloaded from OSS with the corresponding tools; these data are then filtered, segmented, and transformed, and finally assembled into an experimental dataset for model training.
After the model training, we use the built-in smartphone sensors to collect data, input the data into the model for activity recognition, and finally display the recognition results on the smartphone. The mobile phone is only responsible for data collection, data uploading, and result display. The Android client calls the built-in sensors of the system for data collection and short-term storage. When the data volume reaches the length of the window, the data are uploaded to the server. The server inputs the data into the model and obtains the classification result. The result is then sent to the Android client and displayed there to realize real-time recognition. The flowchart of activity recognition using MarSense is shown in Figure 6.
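The client-side buffering step of this recognition flow can be sketched in a few lines. Everything here is hypothetical: the class name, the `upload` callback, and the stub server are illustrative stand-ins, not part of MarSense, and a real client would send the window over HTTP.

```python
class WindowBuffer:
    """Collects sensor samples and hands off each full window for recognition."""

    def __init__(self, window_length, upload):
        self.window_length = window_length
        self.upload = upload  # callable that sends one window and returns a label
        self.samples = []

    def add(self, sample):
        """Store one sample; return a recognition result when a window fills up."""
        self.samples.append(sample)
        if len(self.samples) >= self.window_length:
            window, self.samples = self.samples, []
            return self.upload(window)
        return None


def stub_server(window):
    """Stand-in for the server-side model; always answers 'walking'."""
    return "walking"
```

A client loop would call `add` for every new sensor reading and display any non-`None` result it gets back.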
  5. Data Collection
Because the datasets used in existing work were collected in experimental scenarios, we aim to collect real-world data to improve the generalization ability of the Human Activity Recognition (HAR) model. To collect data from smartphone sensors under natural usage conditions in real-world environments, we leverage the MarSense platform to conduct a large-scale data collection experiment. Through the MarSense client running on Android mobile phones, the data are collected in scenes as close as possible to the user’s real life.
  5.1. Data Collection Procedure
The purpose of the data collection experiment is to use MarSense, developed in this paper, to collect sensor data and the corresponding behavior labels for training the activity recognition model. The participants in the experiment were students at Jilin University. Each participating student must have an Android mobile phone with Internet access, and the system version must be later than Android 7. The dataset is available.
The experiment lasted for 1 week, with a total of seven participants. The experiment uses the user-defined mode: before performing an activity, users switch the activity label on the client to the activity they are about to perform. After that, the phone turns on the sensors to collect data continuously. During the data collection, the phone generates a behavior label every 2 s. When the user ends the current behavior, the client stops data collection and label generation. The selectable activities are common simple activities in daily life, such as walking and going upstairs. During the experiment, the client on the mobile phone automatically uploads data and labels without user intervention.
As shown in Table 1, seven kinds of sensor data are collected in this experiment: accelerometer, gyroscope, gravity, linear acceleration, magnetic field, direction, and rotation vector. These sensors are generally available in current mobile phones. Acceleration is the change in velocity per second along the three axes. Linear acceleration is the combination of the three axes. The gyroscope represents the current angular velocity. Gravity represents the local acceleration of gravity. The rotation vector represents the current tilt of the phone, and its data are computed from a combination of axes and angles. The coordinate system of the rotation vector is the same as that of the accelerometer. The magnetic field represents the magnetic field value in three directions of the space coordinate system. Orientation represents the orientation vector that the phone is currently pointing at, and its data are the angles of each axis. In total, 19 modalities of data streams are introduced into the proposed HAR model. Six kinds of data labels are collected in this experiment: walking, running, lying, sitting, going upstairs, and going downstairs.
The data and labels collected in the experiment need to be further filtered and processed to generate the dataset for training the model. In order to show the structure and the form of the collected data more clearly, this paper visualizes the sensor data, and shows the specific meaning of the experimental data through graphics. In addition, this paper also introduces how to use the collected data to generate datasets for model training through pre-processing and transformation.
  5.2. Data Set Construction
  5.2.1. Data Structure
Taking the accelerometer on the mobile phone as an example, it generates data on three axes, namely X, Y, and Z. The data of these three axes change constantly while the mobile phone is moving, so the sensor continuously generates data reflecting the current acceleration state of the mobile phone. The direction of the three axes of the accelerometer relative to the mobile phone is shown in Figure 7.
The three data generated by the three axes of a sensor on the mobile phone at a certain time point form a data item, and these successively generated data items will form a data sequence, as shown in 
Figure 8.
All seven sensors used in the experiment have three axes, so the shapes of the generated data are completely consistent. However, the sampling frequency of each sensor is slightly different, so the sensors cannot be guaranteed to sample at the same time points. Consequently, the data generated by different types of physical sensors on the mobile phone cannot be automatically aligned in time, so it is necessary to pre-process the data after the experiment to generate the data set for model training.
  5.2.2. Data Pre-Processing
After data collection, this paper carries out format conversion and segmented filtering on the original data. The specific process is as follows:
1. Make the necessary format conversion of the original data. The original data from the server are in CSV format, so we use Python for pre-processing and store the data as tensors.
2. Put the data of each sensor into the MongoDB database by sensor type so that they can be identified easily.
3. After the data import, establish a joint index on two attributes of the sensor data, “timestamp” and “sensorName”, to speed up the search when segmenting the data.
After importing the sensor data into the database and establishing the index, we traverse the activity labels in the questionnaire and search the sensor data in the corresponding time period according to the time when the label is generated.
In the process of data search, Algorithm 1 is used to search the data of each sensor for every activity label. If the searched data are longer than 200 samples, the part beyond the first 200 is cut off and only the preceding data are retained. If the length is less than 200, zeros are appended until the length equals 200. If the actual available data for a label are shorter than 180 samples, the label is excluded from the training dataset to prevent too many zeros from affecting the accuracy of the model.
After the data of the seven sensors are searched, they are merged along the sensor dimension to form the sensor data corresponding to the activity label. The data shape after merging is $7 \times 200 \times 3$, where 7 is the number of sensors, 200 is the length of sensor data, and 3 is the number of sensor axes.
The search process for certain sensor data is shown in Algorithm 1.
          
Algorithm 1 Search Certain Sensor Data According to Activity Labels
Require: The timestamp of the label A, the maximum length M of the window, the data quantity K inside the window, the current window C and the data length threshold T.
for i = … do
    if … then
        finish
    else
        continue
    end if
end for
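The truncation, zero-padding, and exclusion rules described above can be sketched as follows. This is not a reconstruction of Algorithm 1: the function names are hypothetical, and the half-open time range `[start, end)` used for the search is an assumption of this example. Only the lengths 200 and 180 come from the text.

```python
from bisect import bisect_left

WINDOW_LENGTH = 200  # target sequence length, from the text
MIN_LENGTH = 180     # labels with fewer real samples are excluded, from the text


def search_window(timestamps, samples, start, end):
    """Return the samples whose timestamp lies in [start, end).

    `timestamps` must be sorted and aligned index-by-index with `samples`.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, end)
    return samples[lo:hi]


def fit_window(samples, axes=3):
    """Truncate or zero-pad to WINDOW_LENGTH rows; return None if too short."""
    if len(samples) < MIN_LENGTH:
        return None
    clipped = [list(s) for s in samples[:WINDOW_LENGTH]]
    padding = [[0.0] * axes for _ in range(WINDOW_LENGTH - len(clipped))]
    return clipped + padding
```

Running `fit_window` on the result of `search_window` for each of the seven sensors and stacking the seven outputs would give one training sample with seven 200-by-3 blocks.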
Taking the label of “running” as an example, the corresponding accelerometer, gyroscope and gravity acceleration data are visualized as shown in 
Figure 9.
As Figure 9 shows, different types of sensors have significantly different characteristics with a certain periodicity. Therefore, the neural network model can accurately classify the behavior corresponding to the sensor data through feature extraction and fusion.
  5.2.3. Dataset Exploration
Through the data preprocessing procedure, the dataset (we open-source the SmartJLU dataset and source code on GitHub: https://github.com/Super-Shen/marfusion-dataset (accessed on 17 May 2022)) is finally transformed into a tensor of shape $N \times 7 \times 2 \times 200 \times 3$, whose dimensions represent the number of available label instances, the seven types of sensors, the real part and the imaginary part formed after Fourier transform, the length of the data sequence, and the three axes of the sensor, respectively. The population information of the participants is shown in Table 2.
The number and proportion of different types of labels are shown in Figure 10. Running, going upstairs, and going downstairs are rare in daily life, so the dataset contains fewer data samples for these three behavior labels than for the other labels.
  6. Evaluation
  6.1. Experimental Settings
In order to evaluate the effect more accurately, this paper adopts the F1 value as the evaluation index of the model. The F1 value is an improved evaluation index based on Precision and Recall. The specific meanings of these indicators are introduced below:
- Precision. The precision rate is the proportion of the samples judged to be positive by the model that are truly positive. From the perspective of the prediction results, the precision rate describes how many of the samples predicted as positive are correct.
- Recall. The recall rate is the proportion of all samples of a certain category that are correctly found by the model. From the perspective of a certain category of samples, the recall rate describes the ability of the model to find such samples among all samples. The higher the recall rate, the more completely the model finds such samples.
- F1 value. The F1 value is the harmonic mean of the precision rate and the recall rate and describes the performance comprehensively. Compared with using the precision rate or the recall rate alone, using the F1 value as the evaluation index describes the effect more accurately and comprehensively and prevents the model from scoring well by favoring only one of the two.
This paper will use the evaluation measurements mentioned above to monitor the training process of the model, and better display the changes of the model in the training process by drawing these evaluation indicators into images.
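For concreteness, the three indicators can be computed per label category as in the sketch below. The function names and the choice to return 0.0 when a denominator is zero are assumptions of this example, and the unweighted average over categories is one plausible reading of the averaged values reported per epoch.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred, labels):
    """Unweighted mean of precision, recall, and F1 over the given categories."""
    rows = [per_class_metrics(y_true, y_pred, label) for label in labels]
    return tuple(sum(col) / len(rows) for col in zip(*rows))
```

A model that labels almost everything as one category gets high recall but low precision for that category, and the F1 value exposes this imbalance.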
The PyTorch framework is used to build and train the neural network model. The development and training environment of the Marfusion activity recognition model is shown in Table 3. As for the Convolution Neural Network (CNN) structures, we apply a 4-layer CNN and one pooling layer to extract the features of the different modalities.
For a fair comparison, we split the dataset following the settings of the state-of-the-art approaches [10,11,23,33]. These state-of-the-art studies evaluated their models on publicly available datasets, which, like our dataset, contain multiple individuals. Before starting the model training, all samples in the dataset were divided into two parts, one used for model training and the other for testing the training effect and generalization ability. During the training, the batch size was set to 32, and the total loss in each epoch was recorded. After each training epoch, an accuracy calculation was performed. When the loss value became stable, the training was stopped and the model was saved on the hard disk.
The prediction value and true value will be used to calculate the current F1 value for each epoch training. The training was stopped when the epoch loss no longer decreased and the model tended to be stable.
  6.2. Training Procedure
After each epoch is trained, we test the data fitting ability of the model on the training set and the test set. The higher the precision is on the training set or the test set, the better the model can fit the data. The average precision of each tag category changes with the training epoch, as shown in 
Figure 11.
After the end of each epoch, the training state was detected by using the data of the training set and the test set. Each training sample was input into the model to obtain the prediction sequence, and the prediction sequence was compared with the real sequence to obtain the current recall of the model. The average value of recall of each label category changes with the training epoch, as shown in 
Figure 12.
Precision and Recall can only reflect the performance of the model in a certain aspect, but not the overall performance. To better detect the real situation in the training process, at the end of each epoch, the current precision and recall rate were used to calculate the current F1 value. The change of F1 mean values of each label category with the training epoch is shown in 
Figure 13.
After the model training, the confusion matrix is used to visually display the classification effect of the model. The confusion matrix displays the real and predicted categories of the samples in the whole dataset. If all the samples in the dataset are correctly classified, all of them fall on the diagonal of the confusion matrix, where the real and predicted categories coincide. A sample that falls in a non-diagonal position is a misclassified sample.
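A confusion matrix of this kind can be built with a few lines of code. The sketch below is illustrative; the function name and the row/column convention (rows are real categories, columns are predicted categories) are assumptions of this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count samples by (real category, predicted category).

    Rows follow the real category and columns the predicted one,
    both in the order given by `labels`.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for real, predicted in zip(y_true, y_pred):
        matrix[index[real]][index[predicted]] += 1
    return matrix


def misclassified(matrix):
    """Number of samples that fall off the diagonal."""
    return sum(
        count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if i != j
    )
```

A perfectly classified dataset yields a matrix whose off-diagonal entries are all zero, so `misclassified` returns 0.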
The trained model is used to predict the training set, and the resulting confusion matrix is shown in Figure 14. The confusion matrix shows that the model fits the training set well: few samples have predicted tags that differ from the actual tags, which indicates that the capacity and expressive ability of the model are sufficient for predicting specific behaviors.
After the model completely converges and the training is finished, the saved model is loaded from the hard disk, and the test dataset is input into the model to obtain the predicted value sequence of the model. Then, the predicted value sequence of the model is compared with the real tag value sequence to calculate the accuracy and generalization ability of the model. The corresponding evaluation indexes of labels of each category in the final model are shown in 
Table 4.
As can be seen from the performance on the test set, the accuracy of the model on a certain label category is slightly affected by the number of samples corresponding to this label category. The reason for this may be that the number of samples in the dataset is relatively small, leading to a certain lack of fitting of the sensor data characteristics corresponding to a certain behavior tag. This situation can be improved in the future by increasing the size of the data.
  6.3. Model Comparison
In order to better verify the accuracy of the Marfusion model in behavior recognition, the Marfusion model and existing classical models are trained on the same dataset. The final performance of the models is measured by Precision, Recall, and F1 value. The following is a brief overview of the models compared with Marfusion.
1. Support vector machine (SVM) [34]. SVM classifies data by finding a hyperplane that separates the different classes while maximizing the distance from the data to the separating hyperplane. Because the goal of SVM is to maximize this margin, training can be formalized as a convex quadratic programming problem. SVM has the advantages of easy calculation and fast solution.
2. Random forest [35]. Random forest classifies data by building multiple simple decision trees and aggregating their votes. By combining many learners, the algorithm trains quickly and is resistant to over-fitting.
3. Convolutional neural network (CNN) [36]. CNN has performed very well in computer vision and image classification in recent years and has strong feature extraction and self-learning ability. However, a CNN with too many layers may suffer from vanishing gradients during training, which makes it difficult to converge. The CNN model used for comparison in this paper includes three convolution layers and one fully connected output layer.
4. Recurrent neural network (RNN) [37]. The RNN is well suited to extracting contextual features from sequential data. LSTM, a variant of RNN, handles long-term memory better by introducing a gating mechanism, so it can associate and extract features across long time spans in long sequences. The RNN model used for comparison in this paper includes two LSTM layers, two Batch Normalization layers, and a fully connected output layer using Softmax as the activation function.
5. DeepSense [10]. By combining CNN and RNN, the DeepSense model addresses feature extraction and noise interference in behavior recognition and performs well on different behavior recognition datasets.
6. AttnSense [11]. The AttnSense model uses CNN to extract spatial features and a Gated Recurrent Unit (GRU), a variant of RNN, to extract temporal features, so as to better classify individual behaviors. In addition, AttnSense introduces the Attention mechanism to assign different weights to feature data from different sensors, which enhances the robustness and generalization performance of the model on individual behavior classification tasks.
All of the above models are trained on our activity recognition dataset and compared with Marfusion. Precision, Recall, and F1 value are averaged over the label categories, weighted by the number of samples in each category, and the comparison results are shown in
Table 5.
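The sample-weighted averaging of the per-category scores can be sketched as follows. This is only an illustration under stated assumptions: the helper `weighted_average` and the scores and sample counts are hypothetical, and the unweighted (macro) mean is computed only for contrast.

```python
def weighted_average(per_class_scores, supports):
    """Average per-class scores, weighting each class by its sample count."""
    total = sum(supports.values())
    if total == 0:
        raise ValueError("at least one sample is required")
    weighted_sum = sum(per_class_scores[label] * supports[label] for label in supports)
    return weighted_sum / total


# Hypothetical per-class F1 values and sample counts (not results from this paper).
f1_by_class = {"walk": 0.8, "sit": 0.8, "run": 1.0}
samples_by_class = {"walk": 3, "sit": 2, "run": 1}

weighted_f1 = weighted_average(f1_by_class, samples_by_class)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
print(f"weighted F1 = {weighted_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

With weighting, categories that have many samples dominate the summary score, so a rare category with a high or low score moves the weighted mean less than it moves the macro mean.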
Based on these results, the Marfusion model achieves better recognition performance on the multimodal data collected by mobile phones. A possible reason is that Marfusion adds the Attention mechanism, which weights the values of each sensor, and fuses the multimodal data to extract new features. In addition, we analyze the contribution of each sensor, as shown in
Figure 15. Specifically, we average the attention weights for each sensor and then normalize them for easier comparison and visualization. The results indicate that linear acceleration and acceleration contribute more than the other sensors, while the magnetic field sensor contributes the least to recognizing the corresponding activity.
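The averaging and normalization of attention weights can be sketched as below. The normalization method is not specified in the text, so this sketch assumes that the mean weights are divided by their sum; the function `sensor_contributions`, the sensor names, and the weight values are hypothetical and are not the measurements behind Figure 15.

```python
def sensor_contributions(attention_weights, sensor_names):
    """Average attention weights per sensor over samples, then normalize to sum to 1."""
    n_samples = len(attention_weights)
    if n_samples == 0:
        raise ValueError("no attention weights provided")
    # Mean attention weight of each sensor across all samples.
    means = [
        sum(row[i] for row in attention_weights) / n_samples
        for i in range(len(sensor_names))
    ]
    total = sum(means)
    # Assumed normalization: divide by the sum so the contributions add up to 1.
    return {name: mean / total for name, mean in zip(sensor_names, means)}


# Hypothetical sensors and weights, one row per sample (illustrative values only).
sensors = ["linear_acceleration", "acceleration", "magnetic_field"]
weights = [
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
]
contrib = sensor_contributions(weights, sensors)
ranking = sorted(contrib, key=contrib.get, reverse=True)
print(ranking)
```

Sorting the normalized values gives the kind of ranking that a bar chart of sensor contributions would display.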
  6.3. Discussion
The advantages of this work are as follows. We used MarSense to collect data without user intervention and built a new multimodal dataset. Based on this dataset, we developed Marfusion, which (1) introduces the Attention mechanism and (2) fuses multimodal data to extract new features. As a result, Marfusion compares favorably with the other models on the multimodal dataset, reaching a precision of 0.944 and an F1 value of 0.943.
As for open research questions, although our model performs well on recognition, it has not yet been deployed in applications. Moreover, the participants in data collection were all students, which limits the diversity of the dataset. Another shortcoming is the limited duration of data collection.
As for potential future work, we can combine Marfusion and MarSense to construct a recognition system for intelligent health systems, intelligent medical systems, motion monitoring systems, and aging-care systems. For example, in an intelligent health system, MarSense would collect sensor data from the built-in sensors of the mobile phone and Marfusion would recognize the current activity of the user. If the user sits for a long time, the system would emit a beep to remind the user to take a rest or stand up for a while. Furthermore, we will invite participants with different jobs, from different places, and of different age groups. For example, workers who sit at their workstations for a long time and workers who change their activities constantly will be involved in our experiment. Elderly people and children will also be invited to participate in the data collection to improve the generalization of the model in real-world scenarios. Finally, we will extend the duration of data collection.
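The sedentary reminder mentioned above could be driven by logic along the following lines. This is a hypothetical sketch of future work, not an implemented component: the function `sedentary_alerts`, the label string "sitting", and the one-hour threshold are all assumptions made for illustration.

```python
def sedentary_alerts(predictions, max_sitting_seconds=3600):
    """Return the timestamps at which a reminder should fire.

    predictions: iterable of (timestamp_seconds, activity_label), in time order.
    """
    alerts = []
    sitting_since = None
    for timestamp, activity in predictions:
        if activity != "sitting":
            sitting_since = None        # any other activity resets the timer
            continue
        if sitting_since is None:
            sitting_since = timestamp   # a new sitting episode starts
        elif timestamp - sitting_since >= max_sitting_seconds:
            alerts.append(timestamp)    # the system would beep here
            sitting_since = timestamp   # restart the timer after reminding
    return alerts


# Hypothetical stream of recognized activities (timestamps in seconds).
stream = [
    (0, "sitting"), (1800, "sitting"), (3600, "sitting"),
    (4000, "walking"), (4200, "sitting"), (5000, "sitting"),
]
alerts = sedentary_alerts(stream)
print(alerts)
```

Resetting the timer on any non-sitting prediction keeps the rule simple; a deployed system would probably need to smooth over isolated misclassifications before resetting.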