Article

A Deep Learning-Based Semantic Segmentation Model Using MCNN and Attention Layer for Human Activity Recognition

School of Integrated Technology, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(4), 2278; https://doi.org/10.3390/s23042278
Submission received: 27 January 2023 / Revised: 14 February 2023 / Accepted: 16 February 2023 / Published: 17 February 2023
(This article belongs to the Section Sensors and Robotics)

Abstract

With the development of wearable devices such as smartwatches, several studies have been conducted on the recognition of various human activities. Various types of data are used, e.g., acceleration data collected using an inertial measurement unit sensor. Most scholars segment the entire timeseries data with a fixed window size before performing recognition. However, this approach has limitations in performance because the execution time of a human activity is usually unknown. Therefore, there have been many attempts to solve this problem by sliding the classification window along the time axis. In this study, we propose a method that classifies every frame rather than a window-based recognition method. For implementation, features extracted using multiple convolutional neural networks with different kernel sizes are fused. In addition, similar to the convolutional block attention module (CBAM), an attention layer is applied at the channel and spatial levels to improve the recognition performance of the model. To verify the performance of the proposed model and prove the effectiveness of the proposed method for human activity recognition, evaluation experiments were performed. For comparison, models using various basic deep learning modules, as well as models that classify all frames for recognizing a specific wave in electrocardiography data, were applied. The proposed model reported the best F1-score (over 0.9) for all target activities compared with the other deep learning-based recognition models. Further, to verify the improvement of the proposed classification-of-every-frame (CEF) method, it was compared with three configurations of the sliding window (SW) method. On average, the proposed method reported an F1-score 0.154 higher than SW; for the designed model, the improvement was as much as 0.184.

1. Introduction

1.1. Research Background

Various issues concerning the safety and health of the elderly have emerged in our aging society, and studies are being conducted to prevent them. In particular, awareness of daily activity is becoming more important because it is directly related to the health of the elderly. As society ages, the elderly population is increasing, but the workforce available to care for them is limited; thus, technologies that can supplement elderly care are required. For this reason, with the recent development of wearable devices and deep learning (DL)-based artificial intelligence technology, human activity recognition (HAR) is being employed to recognize what people are doing from a series of data collected over time.
HAR is a technology suitable for the current healthcare field in an aging society. This is because data on perceived human activity can be used in various technological fields, such as human–computer interaction and human–robot interaction (HRI) [1]. By fusion with the internet of things (IoT) technology or timeseries sensor data, it is possible to propose appropriate services for various targets. For example, a mobile robot that can operate in an indoor environment can provide appropriate and proactive new services such as medication recognition for the elderly considering recognized activities. In addition, HAR can generate significant information for implementing a home care system. It is crucial to quickly and accurately recognize issues directly related to diseases or health, such as falls, in the time domain. In this sense, the HAR technology can be of great significance for distributing a monitoring system in a real environment [2].
There are three typical types of data used for HAR [3]. The first one is biosignal data such as electroencephalography, electromyography, and electrocardiography (ECG). Such data cannot be collected easily because the data collection requires specific equipment, including an electrode for recording electrical signals. The second one is behavior-sensing data, such as images obtained from an RGB-D sensor. Such data provide a great deal of useful information for HAR in the form of the original image and the skeleton extracted from the depth image. However, because of issues such as privacy invasion, they are unsuitable for application in people’s home environments. In addition, an RGB-D sensor has a coverage limitation in that the target must be located in the field of view of the sensor during data collection, and there should be no occlusions that compromise the data quality. These issues significantly influence the recognition of the behavior of multiple targets. Therefore, such data are unsuitable for an elderly home care or monitoring system in daily life. The last type is activity-sensing data from an inertial measurement unit (IMU) comprising an accelerometer and a gyroscope. These data have high usability owing to the development of wearable devices, and the privacy issue is less severe. Moreover, because data can be collected from the sensor itself or using specific anchors with signal communication, coverage limitations are less impactful than for RGB-D sensors. Most IMU data are obtained from a wearable sensor attached to the body and have high scalability because they can easily be fused with other sensor data that have timeseries characteristics. For example, if IMU data are combined with an indoor localization technology such as ultra-wide-band (UWB) sensing, not only the information necessary for behavior recognition but also context information such as the target position can be obtained. Such sensor fusion improves the performance of behavior recognition and can be the key to HRI or IoT technology. For these reasons, IMU data are the most appropriate data type for HAR in daily life [4]. Therefore, a new HAR method using IMU data, particularly acceleration data, is proposed in this study.
With the development of DL, many scholars have performed HAR using timeseries acceleration data collected with wearable sensors. An elaborate HAR can be achieved by detecting the start and end points of the target activity in timeseries data that include single or multiple activities. Most scholars first segment the entire timeseries data into windows of an optimal size for classifying the target activities [5] and then classify each segment. However, this approach has limitations in performance because human activities are not standardized across people [6]. In detail, the fixed-size window (FSW) method, shown on the left of Figure 1, cannot properly cover an activity whose execution time is longer than the window size. In addition, when multiple activities fall within a single window, classification accuracy decreases.
To tackle the above issue, the sliding window (SW) method, shown on the right of Figure 1, has been used recently. The classification window moves along the time axis by a step determined by the size of the overlapping area, and classification is performed at every step. However, as with the FSW method, the SW method still requires determining the optimal window size and overlapping area. Several studies have reported different optimal sizes for different datasets, and the recognition performance obtained with a given optimal size does not generalize well. In particular, because the duration of human activities is not constant, there are limitations in accurately classifying behavior even using SW.
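For illustration, the following minimal sketch shows how the FSW/SW approach segments a timeseries before classification; the array shapes, window size, and overlap are arbitrary example values rather than settings prescribed by this study.

```python
# Minimal sketch of fixed-size/sliding window segmentation of a (T, C) timeseries.
import numpy as np

def sliding_windows(data: np.ndarray, window: int, overlap: int) -> np.ndarray:
    """Split a (T, C) timeseries into overlapping segments of shape (window, C)."""
    step = window - overlap
    starts = range(0, len(data) - window + 1, step)
    return np.stack([data[s:s + window] for s in starts])

acc = np.random.randn(250, 3)                      # e.g., one 250-frame, 3-axis sample
segments = sliding_windows(acc, window=40, overlap=20)
print(segments.shape)                              # (11, 40, 3): 11 windows to classify
```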
To solve the above problems, a method of classification for every frame (CEF) in timeseries acceleration data is proposed in this study. The proposed method is similar to the segmentation method presented in fields of two-dimensional (2D) image recognition named semantic segmentation. In addition, a DL-based new architecture is designed for conducting semantic segmentation on three-axis acceleration data.

1.2. Related Work

Many scholars have conducted recognition-based model development studies for HAR using convolutional neural networks (CNNs) or recurrent neural networks (RNNs). In particular, most scholars have developed HAR models using CNNs. In [7], a one-dimensional CNN (1D-CNN)-based model for HAR was proposed. For the input, the acceleration data from multiple IMU sensors attached at different positions on the body were used. The proposed model applies multiple CNN (MCNN) and pooling layers to the data of each sensor separately. In addition, the extracted features are concatenated and used to predict the segment class. The proposed classification model predicts which behavior each segment corresponds to. For the evaluation, three public datasets of human activity presented in [8,9,10] were used. The window sizes were 0.72, 3, and 1 s for each dataset, and the size of the overlapping area of SW was set to 50%, 78%, and 99% of the segmented window sizes, respectively. As a result, the accuracy was 92.22%, 93.68%, and 70.80%, respectively. The authors of [11] proposed a 1D-CNN-based model for recognition. The proposed model extracts meaningful features by capturing local dependencies and scale invariance of timeseries activity data acquired by IMU. Similar to [7], the recognition model included a channel-wise 1D-CNN layer (applying 1D-CNN layers to the x, y, and z channels) and a pooling layer. Moreover, the window size and the size of the overlapping area were 64 and 50%, respectively. For the evaluation, the public human activity datasets using IMU presented in [8,12,13] were used. The accuracy for each dataset was 76.83%, 88.19%, and 96.88%. The authors of [14] proposed a HAR model using a 1D-CNN and a conquer-based classifier. First, the proposed model recognized activity as static (sit, stand, and lay) or dynamic (walking, walking upstairs, and walking downstairs) using binary classification. Then, two three-class classifiers were implemented to predict the class of each FSW. Finally, test data sharpening was adopted to improve the HAR performance. The window size and the size of the overlapping area were 500 and 250 ms, respectively. The proposed model was evaluated on two public datasets presented in [8,15]. As a result, the accuracy for each dataset was 94.2% and 97.62%. In [16], multiple DL architectures, including a deep feed-forward neural network, CNN, long short-term memory (LSTM), and bidirectional LSTM, were implemented for HAR. The recognition models were evaluated on three public datasets [8,9,17] using different window sizes (1, 5.12, and 1 s) and sizes of the overlapping area (50%, 78%, and 50%). As a result, the bidirectional LSTM showed the best performance on the three datasets, with F1-scores of 0.929 for [8], 0.745 for [9], and 0.76 for [17]. The authors of [18] used stacked LSTM modules for HAR. Further, the SW method was implemented with a window size of 10 s, and 90% of the window size was set as the size of the overlapping area. As a result, an accuracy of 94% was achieved for the public dataset presented in [19]. The authors of [20] proposed a HAR model using bidirectional LSTM modules with a residual connection. For the classifier implementation, the input data were segmented with an FSW of 2.56 s. In addition, the size of the overlapping area of SW was 50% of the window size.
Notably, the residual connection made model optimization much easier than the original structure because the gradient values used in the learning process could propagate to the layers more directly through the residual connection. As a result, the F1-scores for the two public datasets presented in [8,15] were 0.905 and 0.935. The key characteristics of the previous works introduced above are summarized in Table 1.
Owing to the aforementioned studies, the performance of HAR using SW has been improved by applying DL technology. However, some issues remain. First, the size of the fixed window differs for each proposed model. Even when the same model is used, the optimal size of the fixed window and the size of the overlapping area of SW differ for different datasets. This means that SW with a fixed size is difficult to generalize. In other words, when the SW method is applied to data collected in different environments, the error rate could increase. Second, as mentioned in Section 1, the duration of human activity is usually variable. Therefore, the performance of a HAR model can vary according to the size of the window used as input for the recognition model. Similarly, the size of the overlapping area of SW also affects performance. The authors of [19] found that window size is a key parameter for improving recognition accuracy. A window that is too small cannot include the entire activity, and a window that is too large can cause classification errors. In [6], the window size of SW was shown to significantly influence recognition performance. In addition, the authors mentioned that the optimal window size is hard to predefine because the type and duration of human activities are not constant. Further, a predefined optimal window size could differ for various unseen activities. Therefore, the SW method struggles to handle diverse activities. Finally, in the process of training with SW, the issue of determining the label of each segment remains. Most SW studies set the label of each segment as the class occupying the largest part of the segment or the class of the last frame of the segment to improve the training performance. This means that more than one activity could exist in a single window, and the proportion occupied by each class may be biased. This can degrade the performance of a recognition model on real data and prevent accurate classification. To tackle these issues, the CEF-based semantic segmentation method is proposed in this study.
The remainder of this article is structured as follows. Section 2 describes a DL model made up of stacking MCNN and the designed attention layer. In addition, a new dataset comprising only types of transition activities is described in Section 2. Section 3 describes the evaluation results of the proposed method compared with the basic DL models and previous work that conducted semantic segmentation of ECG data. Further, the performance of CEF was evaluated by comparing it with the existing SW method. Section 4 analyzes and discusses the experimental results. Finally, the conclusion and future work are presented in Section 5.

2. Materials and Methods

2.1. CEF Using DL Model

To implement CEF, the method of semantic segmentation was adopted. Semantic segmentation was proposed for 2D image segmentation; it usually means detecting pixels corresponding to the target in an input image. Similar to image segmentation, the semantic segmentation method was adopted for timeseries data in this study. The designed recognition model used MCNN as the feature extraction layer in timeseries data. In addition, an attention layer similar to CBAM was designed to improve recognition performance.

2.1.1. Feature Extraction Block Using MCNN

Timeseries data comprise different feature channels along the time axis. In the case of data collected using an IMU sensor, the features correspond to the x, y, and z axes. For feature extraction from timeseries data, a 1D-CNN is appropriate because the kernel moves only along the time axis. In addition, the convolutional kernel extracts features using data within a certain time range. Therefore, a 1D-CNN can operate as a local feature extractor on timeseries data. However, an optimal kernel size for improving recognition performance cannot be specified, similar to the fixed-window-size problem of SW. Consequently, multiple 1D-CNNs with different kernel sizes were adopted. Multiple features that consider receptive fields of various ranges can be extracted using MCNN. The architecture designed in this study was inspired by SPP-Net, which was proposed for image classification [21]. The proposed feature extraction layer uses five 1D-CNN layers with different kernel sizes of 5, 10, 20, 50, and 100. Each layer adjusts the padding so that the output has the same length as the input. Each kernel performs a convolution operation on the input data and extracts features by sliding along the time axis. In other words, the features of each time point are extracted considering the surrounding data within different ranges. Then, the extracted features are fed into two 1D-CNNs with kernel size 1. Afterward, the different features from each 1D-CNN are concatenated along the channel axis. Finally, the features are fed into a single 1D-CNN layer with kernel size 1. This last layer not only adjusts the feature size but also fuses the meaningful features extracted by kernels of different sizes. The detailed shape of the feature extraction block is shown in Figure 2. After every 1D-CNN layer, a batch normalization function is added to prevent overfitting; a layer normalization function is additionally added to prevent the feature values from becoming too large. The proposed feature extraction block is stacked several times to make the model deep.
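The following PyTorch sketch illustrates the structure of the described feature extraction block; the exact placement of the normalization layers and the use of padding='same' (available in PyTorch 1.9 and later) are assumptions rather than the reference implementation.

```python
# Minimal sketch of the MCNN feature extraction block (kernel sizes 5, 10, 20, 50, 100).
import torch
import torch.nn as nn

class MCNNBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int = 64,
                 kernel_sizes=(5, 10, 20, 50, 100)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, k, padding='same'),   # multi-scale receptive field
                nn.BatchNorm1d(out_ch), nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, 1),                  # two kernel-1 convs per branch
                nn.BatchNorm1d(out_ch), nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, 1),
                nn.BatchNorm1d(out_ch), nn.ReLU(),
            ))
        # fuse the concatenated branch features back to out_ch channels
        self.fuse = nn.Conv1d(out_ch * len(kernel_sizes), out_ch, 1)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):                                      # x: (batch, channels, frames)
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        out = self.fuse(feats)                                 # (batch, out_ch, frames)
        return self.norm(out.transpose(1, 2)).transpose(1, 2)  # layer norm over channels

feat = MCNNBlock(in_ch=3)(torch.randn(8, 3, 250))
print(feat.shape)                                              # torch.Size([8, 64, 250])
```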

2.1.2. Implementation of the Attention Layer

To improve recognition performance, an attention layer similar to the CBAM presented in [22] was implemented in the designed model. The attention layer integrates multiple features by weighting each channel and each time step. Further, it can also guide the feature extraction block to extract important features. The designed attention layer was based on CBAM, reported for image recognition, which performs attention for the channel and spatial dimensions separately. In the channel attention layer of CBAM, an attention score indicating which input channel is more important is generated. To obtain the channel-level attention map used for calculating the attention score, average pooling and max pooling are applied to the input data in the spatial direction. Then, the attention score is calculated using a sigmoid function after a feed-forward network. Similarly, in the case of the spatial attention of CBAM, an attention score is calculated using a 1D-CNN on the result of compressing the input with average pooling and max pooling in the channel direction. In this study, an attention layer similar to the described CBAM was applied to timeseries data. As an output, features with the important parts highlighted are obtained.
The channel attention layer of the proposed model is the same as that in CBAM, as shown in the upper part of Figure 3. The inputs of the attention layer are the values obtained by compressing the input data with average pooling and max pooling along the time axis. Then, the attention maps are generated by passing the two types of inputs (average and max) through the same two 1D-CNN layers and a rectified linear unit (ReLU) activation function. The number of filters used in the first 1D-CNN of the channel attention layer is 1/16 of the number of input channels. Then, the number of filters used in the second 1D-CNN is restored to the number of input channels. This is the same as the channel attention layer described in the original CBAM, which increases the generalization performance of the model, as explained in [22]. After the attention maps are generated, the two attention maps are added element by element. Finally, through a sigmoid function that maps the values to the range of 0–1, the attention score is obtained. Then, the score is multiplied by the input data along the channel axis. As a result, the more important channels that better represent the data can be emphasized.
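A minimal PyTorch sketch of this channel attention layer is given below; the kernel-1 convolutions standing in for the shared feed-forward network and the reduction ratio of 16 follow the description above, while the remaining details are assumptions.

```python
# Minimal sketch of CBAM-style channel attention over timeseries features.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared by avg- and max-pooled inputs
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1),
        )

    def forward(self, x):                               # x: (batch, channels, frames)
        avg = self.mlp(x.mean(dim=2, keepdim=True))     # average pooling over the time axis
        mx = self.mlp(x.amax(dim=2, keepdim=True))      # max pooling over the time axis
        score = torch.sigmoid(avg + mx)                 # (batch, channels, 1), values in [0, 1]
        return x * score                                # emphasize informative channels
```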
For attention at the spatial level, a self-attention layer is adopted. The original spatial attention layer presented in [22] is inappropriate for semantic segmentation of timeseries data because spatial information is largely lost when average and max pooling are applied along the time axis. Therefore, the dot-product self-attention layer presented in [23], as described in the lower part of Figure 3, was used. There are three specific features—query, key, and value—that represent the input data differently. All features are generated by passing the same input data through different 1D-CNN layers; at this point, the output maintains the same number of frames as the input data. Then, the attention map is obtained by multiplying the query and key. In the original self-attention, the attention map represents the relation between positions; for timeseries, it represents the relation between time steps. Then, the attention score is obtained using a sigmoid activation function, similar to the channel attention layer. Finally, the output of the layer is derived by multiplying the attention score and the value representing the input data. This means that the features of each time step of the output are emphasized considering the entire data. This does not lose positional (time-domain) information and allows the model to be trained to recognize every frame without specifying the input data size. In other words, when calculating the features of a specific frame, the network can be trained to emphasize the features at positions important for classification. Therefore, it is more suitable for performing semantic segmentation than the spatial attention layer in CBAM.
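The following sketch outlines the dot-product self-attention used at the time (spatial) level as described above: query, key, and value are produced by kernel-1 convolutions, the attention map relates every pair of time steps, and a sigmoid (rather than a softmax) yields the attention score. Scaling and output projections are omitted as simplifications.

```python
# Minimal sketch of sigmoid-scored dot-product self-attention along the time axis.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv1d(channels, channels, 1)
        self.key = nn.Conv1d(channels, channels, 1)
        self.value = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                               # x: (batch, channels, frames)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.bmm(q.transpose(1, 2), k)          # (batch, T, T): relation of time steps
        score = torch.sigmoid(attn)                     # attention score in [0, 1]
        out = torch.bmm(v, score.transpose(1, 2))       # weight value features over all frames
        return out                                      # (batch, channels, frames), T preserved
```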

2.1.3. Semantic Segmentation and Loss Function

The proposed model is designed by stacking three structures comprising a feature extraction layer and an attention layer. In addition, the output of each layer maintains the same length along the time axis as the initial input data; thus, if necessary, zero-padding is applied to the input data. As mentioned above, semantic segmentation of a 2D image means classifying every pixel in the image. In this study, to apply this approach to timeseries, the feature size of the final output of the model was matched to the number of target classes to be predicted. For implementation, the output was fed into a fully connected layer with the same number of filters as target classes. Finally, the features corresponding to each frame pass through the softmax function to generate a probability distribution, and each frame is encoded with the class of maximum probability. To train the proposed DL model, the cross-entropy loss, a loss function mainly used in classification problems, is applied to every frame. The losses generated in each frame are summed up as the final loss value for training the proposed model, as described in Formula (1). In the actual training phase, the number of filters used in every block and layer was 64.
Total Cross Entropy Loss $= -\sum_{j=1}^{T}\sum_{i=1}^{C} y_{ji}\log z_{ji}$,  (1)
where $T$ denotes the length of the time axis of the input data, $C$ denotes the number of classes, $y_{ji}$ denotes the true label for class $i$ at time $j$, and $z_{ji}$ denotes the probability output by the softmax function of the recognition model for class $i$ at time $j$.
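The frame-wise loss can be written compactly in PyTorch as below; the batch size, class count, and frame count are illustrative, and reduction='sum' mirrors the summation over frames in Formula (1).

```python
# Minimal sketch of the frame-wise cross-entropy loss over a length-preserving output.
import torch
import torch.nn as nn

num_classes, T = 7, 250
logits = torch.randn(8, num_classes, T)         # model output: one score vector per frame
labels = torch.randint(0, num_classes, (8, T))  # per-frame ground-truth class indices

# nn.CrossEntropyLoss applies softmax internally; summing matches Formula (1).
loss = nn.CrossEntropyLoss(reduction='sum')(logits, labels)
pred = logits.argmax(dim=1)                     # (8, 250): one predicted class per frame
```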

2.2. Dataset Construction

There are various human activity datasets comprising acceleration data, such as WISDM, UCI HAR, and MHEALTH [24], but most of them focus on the change in the human state rather than on the transition activity. The human state is changed by a transition activity and mainly refers to human posture. For example, after the transition activity of sitting, the human state becomes seated. However, to precisely recognize the target activity, it is important to recognize the transition activity through which the target state is reached. In addition, if the transition activity can be recognized with high accuracy, the human state can be predicted easily. Nevertheless, most public datasets label the transition activity and the subsequent target state identically. In other words, there is no label distinguishing the state from the transition activity. Therefore, a new dataset comprising only types of transition activity was constructed in this study.
As previously mentioned, there are various issues with public datasets for implementing and evaluating semantic segmentation of human activity data. Therefore, a new dataset was constructed using a watch-type IMU. The target activities comprise get-up, laying, stand-up, picking, sitting, and walking. Further, the background class indicates that no movement is included. The target classes comprise behaviors that can occur in human daily life and are commonly included in many public datasets. Data comprising two activities are also included in the dataset; these two-behavior data cover all combinations of the target classes that humans could perform. The sensors used to construct the dataset consisted of a UWB sensor and an IMU containing a 3-axis (x, y, and z axes) accelerometer (LIS2DS12TR, STMicroelectronics), as shown in Figure 4. The UWB signal, which provides the location information of the sensor indoors, was not used in this study; however, it will be used in future studies that use context information to improve recognition performance. The acceleration data capturing speed was set to 15 fps. Therefore, the movement of the subject was captured with a sampling duration of 66.6 ms. All subjects wore the provided sensor on their right wrists and performed each motion for 250 frames. This means that the size of a single sample (the input data) was 250 frames; every sample collected by the subjects, whether it contained a single action or a combination of two actions, had a size of 250 frames. The subjects were 8 males between the ages of 20 and 40 years. In addition, all subjects performed 6 single actions and 12 multi-actions, 10 times each, for a total of 180 recordings per subject. All data were labeled with the corresponding activity by pinpointing the starting and ending points. The labeling procedure was performed manually by one person who watched all movements of all subjects. In addition, activities were performed at various time points within a single recording. The state and target activities of the dataset are described in Table 2.
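As an illustration of this frame-level labeling, the snippet below shows how one 250-frame sample and its per-frame labels might be represented; the class order, boundary frames, and zero-valued signal are hypothetical.

```python
# Illustrative frame-level labeling of a single 250-frame acceleration sample (15 fps).
import numpy as np

classes = ['background', 'get-up', 'laying', 'stand-up', 'picking', 'sitting', 'walking']
sample = np.zeros((250, 3), dtype=np.float32)      # one 250-frame acceleration sample (x, y, z)

labels = np.zeros(250, dtype=np.int64)             # background everywhere by default
labels[40:85] = classes.index('stand-up')          # e.g., a stand-up motion from frame 40 to 84
labels[100:220] = classes.index('walking')         # followed by walking from frame 100 to 219
```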

3. Results

Two experiments were performed to evaluate the proposed model on the new dataset. First, we evaluated the performance improvement compared with the basic DL modules and the models proposed in the ECG segmentation studies. Second, we experimented to evaluate how much more accurate the proposed CEF method is than SW. The evaluation metrics for all evaluations were the F1-score, precision, and recall:

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$,

where TP (true positive) denotes the number of frames whose predicted class matches their actual class, FP (false positive) denotes the number of frames predicted as a specific class although they actually belong to another class, and FN (false negative) denotes the number of frames that actually belong to a specific class but are predicted as another class.
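A minimal example of computing these frame-level metrics for one class is shown below; the helper function is hypothetical rather than the evaluation code used in this study.

```python
# Per-class precision, recall, and F1 over frame-level predictions.
import numpy as np

def frame_metrics(y_true: np.ndarray, y_pred: np.ndarray, cls: int):
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```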
In addition, experiments were performed on an Intel i9-11900F octa-core microprocessor clocked at 2.50 GHz with 32 GB RAM. For running the proposed DL model and all comparison models, an RTX 3070 GPU was used. Model development and implementation were performed using PyTorch 1.10.2 and Python 3.7. The size of the designed model was 13.063 MB, with 3,416,583 parameters. All models were trained using leave-one-subject-out cross-validation, with the number of epochs fixed at 200 for each training run. The learning rate and batch size were set to 0.001 and 100, respectively, in all experiments, and the Adam method was employed for optimization. The experimental results are described below, and the analysis is presented in the next section.
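The training configuration described above can be sketched as follows; the dataset tensors and the leave-one-subject-out split are placeholders, and only the stated hyperparameters (Adam, learning rate 0.001, batch size 100, 200 epochs, frame-wise cross-entropy) are taken from the text.

```python
# Minimal training-loop sketch under the stated hyperparameters.
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_fold(model, train_x, train_y, epochs=200, lr=1e-3, batch_size=100):
    # train_x: float tensor (N, 3, 250); train_y: long tensor (N, 250) of frame labels
    loader = DataLoader(TensorDataset(train_x, train_y),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss(reduction='sum')
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # model(x): (batch, classes, 250), frame-wise loss
            loss.backward()
            optimizer.step()
    return model
```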
In the first experiment, the basic DL modules used for comparison included the gated recurrent unit (GRU), LSTM, and CNN modules with different kernel sizes. For the RNNs, bidirectional modules were also employed for comparison, and kernel sizes of 5, 10, 20, and 40 steps were used for the CNNs. All basic DL modules were stacked three times, including batch normalization and the ReLU activation function; finally, an output with the same feature size as the input data was obtained through a fully connected layer. In addition, several models presented in [25,26,27,28,29] that reported high accuracy by applying the CEF method to ECG data were adopted for comparison. As mentioned in Section 2, the loss is calculated by comparing each model's output, which has the same length as the input along the time axis, with the frame-level labels. Details of the results for each activity are presented in Table 3. Additionally, the computational cost of all models (the number of parameters and the size of the model) is described in Appendix A.
The proposed model reported the highest performance in all classes, with an F1-score of 0.9 or higher. In detail, for the background, the proposed model reported the best precision and F1-score values of 0.979 and 0.978, respectively. However, for the recall value, [29] was the best, with 0.990. Considering laying, picking, get-up, stand-up, and sitting, for precision, the results of [29] were the best, with 0.945, 0.936, 0.948, 0.91, and 0.94; meanwhile, for recall, the results of the proposed model were the best, with 0.927, 0.930, 0.930, 0.896, and 0.905, respectively. For walking, the CNNs with kernel sizes of 20 and 40 steps had the highest precision, 0.950, but the recall of the proposed model was the highest (0.953). Overall, the F1-score of the proposed model was the best, with an average of 0.929, which is 0.019 higher than that of the CNN with a 40-step kernel size, the highest among the comparison models. The confusion matrices of the experiments are provided in Appendix A.
In the second experiment, the comparison models were the same as in the first experiment. However, to reproduce the SW method, each 250-frame sample was divided into several segments according to the window size and the size of the overlapping area. Then, all recognition models classified each segment as a specific behavior. As a result, the output differs in length from the input and refers to the predicted class of the corresponding segment. For overlapping areas, the class is determined by comparing the confidence of the surrounding predictions; in other words, the prediction with the higher confidence value from the same recognition model at different positions is selected as the final decision. For an accurate comparison, the predicted class is expanded to its original window size, and the loss is calculated through comparison with the label. For training the models, the aforementioned cross-entropy was adopted as the loss function. The sizes of the fixed window were set to 10, 20, and 40 time steps, with overlapping areas of 5, 10, and 20 steps, respectively. In addition, the evaluation criterion is the F1-score averaged over all activities. Details of the results are described in Table 4.
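The confidence-based expansion used for the SW comparison can be sketched as follows; window starts, shapes, and the tie-breaking rule are assumptions consistent with the description above.

```python
# Expand per-window predictions back to frames, keeping the more confident prediction
# wherever windows overlap, so the result can be compared with frame-level labels.
import numpy as np

def expand_window_predictions(window_probs, starts, window, num_frames):
    """window_probs: (n_windows, n_classes) softmax outputs; starts: window start frames."""
    frame_pred = np.full(num_frames, -1)
    frame_conf = np.zeros(num_frames)
    for probs, s in zip(window_probs, starts):
        cls, conf = probs.argmax(), probs.max()
        span = slice(s, min(s + window, num_frames))
        better = conf > frame_conf[span]                 # keep the more confident prediction
        frame_pred[span] = np.where(better, cls, frame_pred[span])
        frame_conf[span] = np.maximum(frame_conf[span], conf)
    return frame_pred                                    # per-frame labels for comparison with CEF
```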
The proposed CEF method reported the best performance compared with the three configurations of SW. The average F1-score of SW with a window size of 10 and an overlapping area of 5 was 0.732. For SW with a window size of 20 and an overlapping area of 10, the F1-score was 0.701. In addition, the F1-score of SW with a window size of 40 and an overlapping area of 20 was 0.742. The proposed CEF method showed the best performance with an F1-score of 0.88, an improvement of 0.154 on average. In detail, CEF improved the F1-score over SW by 0.198, 0.17, and 0.147 for the GRU, LSTM, and RNN modules, respectively, and by 0.171, 0.15, and 0.154 for the Bi RNN, Bi LSTM, and Bi GRU modules, respectively. Compared with the CNNs, the proposed CEF method showed improvements of 0.101, 0.14, 0.121, and 0.165 for kernel sizes of 5, 10, 20, and 40, respectively. Finally, for the proposed model, the F1-score of CEF was higher than that of SW by as much as 0.184.

4. Discussion

In this study, the performance of models designed by stacking various basic DL modules was evaluated through the first experiment described in Section 3. For the RNN modules, we confirmed that the average F1-score was 0.083 lower than that of the bidirectional RNN modules, which use not only the previous but also the subsequent information. RNN modules may be unsuitable for predicting behaviors with long execution times because of problems such as gradient vanishing. Further, because it is difficult to apply a bidirectional RNN in real time, we judged that the CNN module is more suitable for implementing CEF. In addition, among the CNN modules, the one with the largest kernel size has the widest receptive field and thus showed the best performance among the basic DL models. This means that using a wider range of surrounding information to classify a particular frame has an advantage over RNNs. We also confirmed that the size of the receptive field used when classifying a specific frame greatly affects performance. Moreover, because the proposed model using MCNN had the best F1-score, features extracted from receptive fields of different sizes can have a positive effect on prediction performance depending on the execution time of the target action and the amplitude of the signal.
The model presented in [29], which reported the highest precision for the transition activities of laying, get-up, stand-up, sitting, and picking, all of which have short execution times, used both a 1D-CNN and a dilated 1D-CNN for feature extraction. This means that features from multiple receptive fields of various sizes had a positive effect on recognition performance. Therefore, when classifying data at a specific location on the time axis, it is essential to fuse meaningful features using surrounding data of various ranges. In the results of the CNN modules with a single kernel size, the F1-score of transition activities with short execution times was 0.094 lower than that of activities that include repetitive patterns, such as background and walking. This is because information that interferes with predicting the class of a specific frame is included when the features extracted from the previous layer are passed to the next layer, and this can be improved by selectively filtering the necessary features. The autoencoder-based models presented in [25,27] can compress and remove relatively insignificant features, but they lose positional information, resulting in a low F1-score of 0.7. In addition, Refs. [28,29], which adopted a U-net architecture with skip connections to preserve positional information, reported a relatively higher F1-score of 0.757, but their performance was still limited. Moreover, the model of [25], which emphasizes meaningful features by applying an attention layer that can give low weight to insignificant features, reported higher performance than the aforementioned two approaches (0.874).
Consequently, the proposed model was designed by stacking MCNN, which can reflect receptive fields of various sizes, and an attention layer, which emphasizes meaningful features. The attention scores at the channel level were derived differently for each activity (Figure 5). In other words, the features obtained from receptive fields of different sizes were emphasized selectively according to the properties of the target activities. In addition, the channel attention layer behaved differently according to the execution time of the target activity: when the execution time of an action was long, features extracted by kernels of all sizes were evenly emphasized, whereas when the execution time was short, features extracted from a small receptive field were emphasized. For spatial attention, the regions belonging to the same behavior as the time step being recognized were emphasized (Figure 6). This improved the classification performance for data at a specific location by filtering out data unrelated to the target and helped demarcate the boundary between the background and an activity or distinguish between different adjacent activities. In summary, the proposed model was designed as a stacked structure fusing the two methods, and it reported the highest performance.
In the second experiment, the proposed CEF was evaluated by comparison with the SW method. The performance of SW was lower than that of CEF for several reasons. First, if the window size is too small or too large, recognition errors occur. When too little data are included, they may be insufficient to classify the window. However, if a large amount of data is included in a single window, more than one activity may be involved, increasing the error. In other words, if a window contains more than one action, misrecognition occurs, and it is impossible to specify the dividing point between the different actions. These problems can occur unpredictably depending on where the start and end points of the target behavior fall relative to the windows. As shown in Figure 7, with the SW method implemented in this study, misrecognition frequently occurred in the areas corresponding to division points, such as the start, end, and transition of activities. Notably, existing research treats the segment label as one class rather than labeling each frame. This can have a positive effect on training the recognition model, but it prevents precise quantitative evaluation. Through additional tuning, the performance of SW can be improved by finding the optimal window size and size of the overlapping area. However, performance improvements are not guaranteed for data with different behaviors or for other datasets, because SW fundamentally depends only on the data contained in the window. Therefore, CEF, which is not limited by changes in window size, can perform activity recognition more precisely. In addition, by classifying each frame rather than each window, the boundaries between various activities can be recognized more elaborately.
In this study, the CEF method was proposed to overcome the limitations of SW. However, several limitations remain. First, as mentioned above, a CNN module with a different kernel size may be required depending on the properties of the target activity. Therefore, the proposed model uses features extracted from various receptive fields. Nevertheless, if more complex behaviors that are difficult to distinguish from other activities need to be recognized, a different kernel size may be more suitable. Thus, the number of CNN modules and the kernel sizes of the currently designed model have to be set experimentally. Second, for spatial attention, where the attention layer is applied to time-axis data, the size of the attention map used for calculating the score increases as the input data size increases. This issue can cause limitations when applying the recognition model to embedded systems. Third, a quantitative comparison with previous studies was not performed in this study. As mentioned in Section 2, there are limitations to using the public datasets employed in previous SW-based HAR research, and the evaluation criterion of SW differs from that of CEF; nevertheless, further quantitative comparisons with state-of-the-art studies are needed. Finally, if the wearing position of the sensor is changed, the recognition performance may decrease. Therefore, a method that can handle various sensor configurations is needed for generalization.

5. Conclusions

In this study, a CEF method, rather than the conventional SW method, was proposed for HAR. For implementation, features extracted from various receptive fields were fused using MCNN. Moreover, the extracted features could be selectively weighted by the proposed layer that applies the attention mechanism at the channel and spatial levels, similar to CBAM. The channel level has the same structure as that of CBAM; meanwhile, for the spatial level, a dot-product self-attention layer that does not lose positional information was adopted. Further, the proposed recognition model was evaluated using a newly constructed dataset. As a result, the proposed recognition model reported a higher F1-score than the models using basic DL modules. In addition, the proposed model outperformed several models applying CEF to ECG data. An experiment was also performed to verify the superiority of the proposed method over the existing SW method. It was found that the CEF method can perform HAR more precisely than the SW method with three different window sizes and overlapping areas. In addition, we confirmed that the proposed model is suitable for implementing CEF for HAR.
The performance of the proposed CEF method and recognition model was verified through experiments. However, several issues remain to be resolved, necessitating further studies for improvement. First, the proposed model will be advanced using a DL method that is more suitable for timeseries data, such as the temporal convolutional network (TCN) structure presented in [30]. We expect the performance to increase because TCNs, which have performed well on timeseries data, can better capture long-term dependencies along the time axis. In addition, it is expected that the memory usage, which increases with the input data size in spatial attention, can be reduced. The second is the design of a canonical domain transformer layer to increase the generalization performance of the proposed model. It is possible to increase the generalization performance simply by increasing the diversity of the dataset using various augmentation methods, but this is not a fundamental solution and requires considerable time. Therefore, a layer that transforms the input data or extracted features into a domain advantageous for recognition is required. A method such as the canonical domain transformer suggested in [31,32], which obtains a transformation matrix based on the input data, will be adopted to improve the generalization performance. Finally, positional and historical context information will be used to improve recognition performance. The positional context information can be extracted using the distance between the target position and surrounding objects, such as a bed, chair, or desk. For instance, when the target is close to the bed, static activities related to the bed, such as laying, are more likely than dynamic actions such as running. These rules correspond to the positional context information and will be used in the learning process of an adapted model. To facilitate the collection of positional context information, a UWB sensor that can be attached to various objects to provide the positional information of the targets in real time will be used. Meanwhile, historical context information can be extracted from actions that a subject has performed before. For example, after the subject performs the laying action, the subject cannot walk without a stand-up or get-up action. Such constraints will enable training a recognition model that reduces the weight of activities inappropriate for the situation. Through these additional studies, if the generalization performance of the proposed CEF method can be improved, it will contribute to the field of healthcare technology that requires precise recognition, such as human monitoring systems or elderly home care. Further, the recognized daily activities will be a significant control factor for the proactive services of robots.

Author Contributions

S.-h.L. designed the algorithm, performed the experimental work, and wrote the manuscript. D.-W.L. organized the experimental setup. Corresponding author: M.S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by GIST Research Project grant funded by the GIST in 2022.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the data used were collected in a laboratory setting and do not contain any information about the subjects’ identities.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study, and written informed consent was obtained from the subjects to publish this paper.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Confusion matrix of comparison with other DL models for evaluating performance improvement.
GRU226,8338391772911139510822427
185513,7236551984016627
157830514,750272331639
27585834613,621172429247
9397026922215,8174701093
1269554529836416,677347
11236655359120222531,480
LSTM227,39171814651125104912682243
182113,6146872573951510
157535514,4961281503601
25475332213,95844497255
113844528415,4395701400
1503230229754116,297563
13403386513111923531,184
RNN226,796112710201378113411152689
175613,133122926112435468
1742136313,273182984070
3126115018312,324498392183
1435305741213,8007922354
19530454272113614,966724
20002328455113331830,553
Bi RNN229,210866125278989110391212
183614,083549248101990
120935615,11071222610
1351506015,086273357283
102252426916,529235796
13989127728434316,898214
116615624949631132,267
Bi LSTM228,42612151461115283210041169
102814,765487147431670
78545015,43017286970
1148528015,559153333135
113202222716,2332301036
1253021031511617,391220
996184967135919332,224
Bi GRU229,067981108998491794412,77
113514,765515164273730
98135315,375111501950
1243386015,551166334176
11430021616,707207607
12821330033812017,257195
104910016333214132,815
CNN 5229,4817228817717058241875
133112,697151535043457226
1540115213,1244995012463
219827815913,191973650407
11438922385914,2339241409
196848044634574814,927591
14052135739968224731,507
CNN 1023,02658638518046867381052
126914,11260323741324051
153338914,49614330618810
13193294015,132459423154
11406316732716,367340476
136016358126232816,603208
11368610737356923632,003
CNN 2023,1035743667728663651772
133914,93738115662473
166539114,636101841709
1310244815,490201412191
1128166720316,805164497
13908531632316517,012214
11832473320664315432,044
CNN 40230,854795707764651652836
130515,130354955360
140031815,237041060
1532232815,412170362140
1434568421816,262294532
13148632130713917,152186
1215628821267821332,042
[25]228,87380298697714037751443
186013,62447733228129358
209445913,7673947621812
1518424014,983265456210
934262834016,1303541068
149310034854518816,638193
1038330256104223031,911
[26]22,7453121711781414107915001418
155214,3583393156825835
148224414,886665631120
18226963413,774492365673
156617836187313,45414431005
180811381233161415,497330
20613503597501231103928,720
[27]188,898461344559894807113330,364
188811,1692535101072322766
2505640957917126110622847
17941587174656826465574530
141734681455947316344799
19194071257861276669425353
269335667646443468338219,291
[28]229,161111441190410395832047
298410,237286876831072352
374421324654235173120272542
539464621474207416302811
143619315933113,1693743218
5113163301432151186623323
219863910518468111330,590
[29]232,803249208543505212739
365212,7282192890037
352030813,131106441
2381116014,294238316511
20561911011514,7142331633
3695021734319414,494562
1918010128699332,292
Proposed22,9838945948884993757956
70015,752277204000
62045115,8212201480
819177015,924204405240
712009617,566154352
958019635416117,651185
1030009331717632,914
Table A2. Computational cost of each model.
Model | GRU | LSTM | RNN | Bi RNN | Bi LSTM | Bi GRU | CNN 5 | CNN 10
Size of model (MB) | 0.198 | 0.263 | 0.07 | 0.191 | 0.732 | 0.552 | 0.139 | 0.272
Number of parameters | 51,607 | 68,487 | 17,847 | 49,255 | 191,239 | 143,911 | 36,031 | 70,991
Model | CNN 20 | CNN 40 | [25] | [26] | [27] | [28] | [29] | Proposed
Size of model (MB) | 0.538 | 1.07 | 0.835 | 0.889 | 0.224 | 0.162 | 0.964 | 13.063
Number of parameters | 140,671 | 280,191 | 217,863 | 232,839 | 58,663 | 41,975 | 251,279 | 3,416,583

References

  1. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef] [Green Version]
  2. Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Comput. Surv. 2021, 54, 1–40. [Google Scholar] [CrossRef]
  3. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef] [PubMed]
  4. Demrozi, F.; Pravadelli, G.; Bihorac, A.; Rashidi, P. Human activity recognition using inertial, physiological and environmental sensors: A comprehensive survey. IEEE Access 2020, 8, 210816–210836. [Google Scholar] [CrossRef] [PubMed]
  5. Abdel-Salam, R.; Mostafa, R.; Hadhood, M. Human activity recognition using wearable sensors: Review, challenges, evaluation benchmark. In Proceedings of the International Workshop on Deep Learning for Human Activity Recognition, Kyoto, Japan, 8 January 2021; pp. 1–15. [Google Scholar]
  6. Uslu, G.; Baydere, S. A Segmentation Scheme for Knowledge Discovery in Human Activity Spotting. IEEE Trans. Cybern. 2022, 52, 5668–5681. [Google Scholar] [CrossRef] [PubMed]
  7. Rueda, F.M.; Grzeszick, R.; Fink, G.A.; Feldhorst, S.; Hompel, M.T. Convolutional neural networks for human activity recognition using body-worn sensors. Informatics 2018, 5, 26. [Google Scholar] [CrossRef] [Green Version]
  8. Chavarriaga, R.; Sagha, H.; Calatroni, A.; Digumarti, S.T.; Tröster, G.; Millán, J.d.R.; Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 2013, 34, 2033–2042. [Google Scholar] [CrossRef] [Green Version]
  9. Reiss, A.; Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the 2012 16th International Symposium on Wearable Computers, Newcastle, UK, 18–22 June 2012; IEEE: Piscataway, NJ, USA, 2021; pp. 108–109. [Google Scholar]
  10. Grzeszick, R.; Lenk, J.M.; Rueda, F.M.; Fink, G.A.; Feldhorst, S.; Ten Hompel, M. Deep neural network based human activity recognition for the order picking process. In Proceedings of the 4th international Workshop on Sensor-Based Activity Recognition and Interaction, Rostock, Germany, 21–22 September 2017; pp. 1–6. [Google Scholar]
  11. Zeng, M.; Nguyen, L.T.; Yu, B.; Mengshoel, O.J.; Zhu, J.; Wu, P.; Zhang, J. Convolutional neural networks for human activity recognition using mobile sensors. In Proceedings of the 6th International Conference on Mobile Computing, Applications and Services, Austin, TX, USA, 6–7 November 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 197–205. [Google Scholar]
  12. Stiefmeier, T.; Roggen, D.; Ogris, G.; Lukowicz, P.; Tröster, G. Wearable activity tracking in car manufacturing. IEEE Pervasive Comput. 2008, 7, 42–50. [Google Scholar] [CrossRef]
  13. Lockhart, J.W.; Weiss, G.M.; Xue, J.C.; Gallagher, S.T.; Grosner, A.B.; Pulickal, T.T. Design considerations for the WISDM smart phone-based sensor mining architecture. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, San Diego, CA, USA, 21 August 2011; pp. 25–33. [Google Scholar]
  14. Cho, H.; Yoon, S.M. Divide and conquer-based 1D CNN human activity recognition using test data sharpening. Sensors 2018, 18, 1055. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Anguita, D.; Ghio, A.; Oneto, L.; Parra Perez, X.; Reyes Ortiz, J.L. A public domain dataset for human activity recognition using smartphones. In Proceedings of the 21th International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 24–26 April 2013; pp. 437–442. [Google Scholar]
  16. Hammerla, N.Y.; Halloran, S.; Plötz, T. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv 2016, arXiv:1604.08880. [Google Scholar]
  17. Bachlin, M.; Roggen, D.; Troster, G.; Plotnik, M.; Inbar, N.; Meidan, I.; Herman, T.; Brozgol, M.; Shaviv, E.; Giladi, N.; et al. Potentials of Enhanced Context Awareness in Wearable Assistants for Parkinson’s Disease Patients with the Freezing of Gait Syndrome. In Proceedings of the 2009 International Symposium on Wearable Computers, Linz, Austria, 4–7 September 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 123–130. [Google Scholar] [CrossRef]
  18. Pienaar, S.W.; Malekian, R. Human activity recognition using LSTM-RNN deep neural network architecture. In Proceedings of the 2019 IEEE 2nd Wireless Africa Conference (WAC), Pretoria, South Africa, 18–20 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  19. Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM SigKDD Explor. Newsl. 2011, 12, 74–82. [Google Scholar] [CrossRef]
  20. Zhao, Y.; Yang, R.; Chevalier, G.; Xu, X.; Zhang, Z. Deep residual bidir-LSTM for human activity recognition using wearable sensors. Math. Probl. Eng. 2018, 2018, 7316954. [Google Scholar] [CrossRef]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  22. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  24. Twomey, N.; Diethe, T.; Fafoutis, X.; Elsts, A.; McConville, R.; Flach, P.; Craddock, I. A comprehensive study of activity recognition using accelerometers. Informatics 2018, 5, 27. [Google Scholar] [CrossRef] [Green Version]
  25. Malali, A.; Hiriyannaiah, S.; Siddesh, G.; Srinivasa, K.; Sanjay, N. Supervised ECG wave segmentation using convolutional LSTM. ICT Express 2020, 6, 166–169. [Google Scholar] [CrossRef]
  26. Matias, P.; Folgado, D.; Gamboa, H.; Carreiro, A. Time Series Segmentation Using Neural Networks with Cross-Domain Transfer Learning. Electronics 2021, 10, 1805. [Google Scholar] [CrossRef]
  27. Sereda, I.; Alekseev, S.; Koneva, A.; Kataev, R.; Osipov, G. ECG segmentation by neural networks: Errors and correction. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–7. [Google Scholar]
  28. Moskalenko, V.; Zolotykh, N.; Osipov, G. Deep learning for ECG segmentation. In Proceedings of the International Conference on Neuroinformatics, Dolgoprudny, Russia, 7–11 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 246–254. [Google Scholar]
  29. Liang, X.; Li, L.; Liu, Y.; Chen, D.; Wang, X.; Hu, S.; Wang, J.; Zhang, H.; Sun, C.; Liu, C. ECG_SegNet: An ECG delineation model based on the encoder-decoder structure. Comput. Biol. Med. 2022, 145, 105445. [Google Scholar] [CrossRef] [PubMed]
  30. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  31. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  32. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
Figure 1. (Left) fixed-size window and (Right) sliding window for activity recognition.
Figure 2. Feature extraction block using MCNN.
Figure 3. Designed CBAM (up) channel level and (down) spatial level.
Figure 4. IMU sensor.
Figure 5. Examples of attention score in channel level.
Figure 6. Examples of attention score in spatial level.
Figure 7. Example of misrecognition of SW method.
Table 1. Key characteristics of previous works.

Author | Dataset | Window Size (s) | Sliding Window (%) | Accuracy (%) | Model
[7] | [8] | 0.7 | 50 | 92.22 | 1D-CNN
[7] | [9] | 3 | 78 | 93.68 | 1D-CNN
[7] | [10] | 1 | 99 | 70.80 | 1D-CNN
[11] | [8] | 64 | 50 | 76.83 | 1D-CNN
[11] | [12] | 64 | 50 | 88.19 | 1D-CNN
[11] | [13] | 64 | 50 | 96.88 | 1D-CNN
[14] | [8] | 500 | 50 | 94.2 | 1D-CNN
[14] | [15] | 500 | 50 | 97.62 | 1D-CNN
[16] | [8] | 1 | 50 | 0.929 (F1-Score) | CNN + LSTM
[16] | [9] | 5.12 | 78 | 0.745 (F1-Score) | CNN + LSTM
[16] | [17] | 1 | 50 | 0.76 (F1-Score) | CNN + LSTM
[18] | [19] | 10 | 90 | 94 | LSTM
[20] | [8] | 2.56 | 50 | 0.905 (F1-Score) | LSTM
[20] | [15] | 2.56 | 50 | 0.935 (F1-Score) | LSTM
Table 2. Status of constructed dataset.

ID | a | b | c | d | e | f | g | h
get-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
walking—picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
walking—sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
stand-up—walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
sitting—laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
get-up—stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
picking—walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
get-up—laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
laying—get-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
stand-up—picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
stand-up—sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
picking—sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
sitting—stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
Table 3. Comparison with other DL models for evaluating performance improvement.

Model | Background (Precision/Recall/F1) | Laying (Precision/Recall/F1) | Picking (Precision/Recall/F1) | Get-Up (Precision/Recall/F1)
GRU | 0.960/0.964/0.962 | 0.815/0.864/0.839 | 0.808/0.838/0.822 | 0.880/0.811/0.844
LSTM | 0.958/0.967/0.962 | 0.850/0.849/0.850 | 0.824/0.818/0.821 | 0.890/0.804/0.845
RNN | 0.950/0.964/0.957 | 0.817/0.778/0.797 | 0.770/0.731/0.750 | 0.781/0.776/0.778
Bi RNN | 0.966/0.974/0.970 | 0.878/0.885/0.881 | 0.886/0.875/0.881 | 0.884/0.832/0.857
Bi LSTM | 0.973/0.971/0.972 | 0.874/0.904/0.889 | 0.882/0.860/0.871 | 0.870/0.872/0.871
Bi GRU | 0.971/0.974/0.972 | 0.890/0.901/0.895 | 0.895/0.885/0.890 | 0.894/0.872/0.883
CNN 5 | 0.960/0.975/0.968 | 0.800/0.769/0.784 | 0.779/0.754/0.766 | 0.812/0.750/0.780
CNN 10 | 0.967/0.979/0.973 | 0.861/0.849/0.855 | 0.856/0.867/0.861 | 0.882/0.834/0.857
CNN 20 | 0.966/0.982/0.974 | 0.909/0.858/0.882 | 0.898/0.890/0.894 | 0.896/0.883/0.889
CNN 40 | 0.966/0.981/0.973 | 0.907/0.893/0.900 | 0.908/0.861/0.884 | 0.907/0.894/0.900
[25] | 0.962/0.973/0.968 | 0.882/0.807/0.843 | 0.815/0.854/0.834 | 0.881/0.805/0.841
[26] | 0.957/0.967/0.962 | 0.828/0.872/0.85 | 0.792/0.713/0.75 | 0.837/0.848/0.843
[27] | 0.939/0.803/0.866 | 0.582/0.561/0.571 | 0.403/0.502/0.447 | 0.594/0.66/0.625
[28] | 0.917/0.974/0.944 | 0.759/0.273/0.401 | 0.695/0.698/0.696 | 0.677/0.605/0.639
[29] | 0.931/0.99/0.959 | 0.945/0.769/0.848 | 0.936/0.779/0.851 | 0.948/0.752/0.839
Proposed | 0.979/0.977/0.978 | 0.918/0.927/0.922 | 0.913/0.930/0.922 | 0.909/0.930/0.920

Model | Stand-Up (Precision/Recall/F1) | Sitting (Precision/Recall/F1) | Walking (Precision/Recall/F1)
GRU | 0.871/0.763/0.813 | 0.873/0.855/0.864 | 0.884/0.912/0.898
LSTM | 0.843/0.782/0.811 | 0.841/0.836/0.838 | 0.875/0.904/0.889
RNN | 0.806/0.690/0.744 | 0.816/0.767/0.791 | 0.835/0.885/0.860
Bi RNN | 0.891/0.845/0.867 | 0.876/0.866/0.871 | 0.928/0.935/0.931
Bi LSTM | 0.860/0.871/0.866 | 0.900/0.892/0.896 | 0.926/0.934/0.930
Bi GRU | 0.892/0.871/0.882 | 0.901/0.885/0.893 | 0.936/0.951/0.943
CNN 5 | 0.804/0.739/0.770 | 0.812/0.765/0.788 | 0.880/0.913/0.896
CNN 10 | 0.876/0.847/0.861 | 0.885/0.851/0.868 | 0.943/0.927/0.935
CNN 20 | 0.905/0.867/0.886 | 0.914/0.872/0.893 | 0.950/0.929/0.939
CNN 40 | 0.906/0.863/0.884 | 0.912/0.879/0.895 | 0.950/0.928/0.939
[25] | 0.858/0.839/0.848 | 0.877/0.853/0.865 | 0.914/0.925/0.92
[26] | 0.786/0.771/0.779 | 0.759/0.795/0.776 | 0.892/0.832/0.861
[27] | 0.432/0.368/0.397 | 0.465/0.356/0.403 | 0.276/0.559/0.369
[28] | 0.715/0.416/0.526 | 0.693/0.444/0.541 | 0.652/0.886/0.752
[29] | 0.91/0.801/0.852 | 0.94/0.743/0.83 | 0.902/0.936/0.918
Proposed | 0.906/0.896/0.901 | 0.915/0.905/0.910 | 0.950/0.953/0.952
Table 4. Results of comparison with the SW method.

Model | GRU | LSTM | RNN | Bi RNN | Bi LSTM | Bi GRU | CNN 5 | CNN 10 | CNN 20 | CNN 40 | Proposed
SW 10-5 | 0.725 | 0.734 | 0.685 | 0.734 | 0.74 | 0.747 | 0.737 | 0.741 | 0.766 | 0.723 | 0.727
SW 20-10 | 0.631 | 0.657 | 0.65 | 0.753 | 0.725 | 0.717 | 0.672 | 0.701 | 0.781 | 0.718 | 0.71
SW 40-20 | 0.639 | 0.671 | 0.65 | 0.678 | 0.779 | 0.795 | 0.746 | 0.799 | 0.817 | 0.797 | 0.799
CEF | 0.863 | 0.858 | 0.809 | 0.893 | 0.898 | 0.907 | 0.82 | 0.887 | 0.909 | 0.911 | 0.93
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lee, S.-h.; Lee, D.-W.; Kim, M.S. A Deep Learning-Based Semantic Segmentation Model Using MCNN and Attention Layer for Human Activity Recognition. Sensors 2023, 23, 2278. https://doi.org/10.3390/s23042278

