Empirical Mode Decomposition Based Multi-Modal Activity Recognition
Sensors 2020, 20, 6055

This paper aims to develop an activity recognition algorithm to allow parents to monitor their children at home after school. A common method used to analyze electroencephalograms is to apply infinite impulse response filters to decompose the electroencephalograms into various brain wave components. However, these filters introduce nonlinear phase distortions. To address this issue, this paper applies empirical mode decomposition to decompose the electroencephalograms into various intrinsic mode functions and categorizes them into four groups. In addition, the features commonly used to analyze electroencephalograms are the energy and the entropy. However, with only two features, the available information is limited. To address this issue, this paper extracts 11 different physical quantities from each group of intrinsic mode functions and employs them as the features. Finally, this paper uses the random forest to perform activity recognition. It is worth noting that the conventional approach to activity recognition is based on a single type of signal, which limits the recognition performance. In this paper, a multi-modal system based on electroencephalograms, image sequences, and motion signals is used for activity recognition. The numerical simulation results show that the percentage accuracies based on three types of signal are higher than those based on two types of signal or on the individual signals. This demonstrates the advantages of using the multi-modal approach for activity recognition. In addition, our proposed empirical mode decomposition-based method outperforms the conventional filtering-based method. This demonstrates the advantages of using the nonlinear and adaptive time-frequency approach for activity recognition.


Introduction
Activity recognition plays an important role in many research areas. For example, activity recognition via electroencephalograms helps in the understanding of the working principles of the human brain. Activity recognition via motion signals also helps physical therapists to evaluate the effectiveness of rehabilitation. Moreover, activity recognition via image sequences can be used in security surveillance [1].
Recently, reading recognition using electroencephalograms was proposed [2]. First, the electroencephalograms are filtered by a bandpass filter to suppress noise and artifacts caused by unwanted movements of the subjects. Then, the electroencephalograms are decomposed into the β, α, θ, and δ waves using the short-time discrete Fourier transform [3,4]. The sums of the absolute values of the discrete Fourier transform coefficients form the feature vectors, and the k-nearest neighbor algorithm with k = 3 is used as the classifier.

Multi-Modal Activity Recognition
The objective of this paper is to perform activity recognition via three types of signals. Here, seven common activities are classified. These are: (1) watching television; (2) playing with toys; (3) eating; (4) playing electronic games; (5) performing online exercises; (6) reading/writing; and (7) drawing. The signals employed for activity recognition are electroencephalograms, image sequences, and motion signals.

Features Extracted from the Electroencephalograms
The empirical mode decomposition assumes that a signal can be represented as the sum of a finite number of intrinsic mode functions. The intrinsic mode functions are obtained using the following procedure: Step 1: Initialization: let r_0(t) = x(t) and i = 1, and set the threshold value to 0.3.
Step 2: Let the ith intrinsic mode function be c_i(t). This can be obtained as follows: (a) Initialization: let d_0(t) = r_{i-1}(t) and j = 1.
(b) Find all the maxima and minima of d_{j-1}(t).
(c) Denote the upper envelope and the lower envelope of d_{j-1}(t) as e_+(t) and e_-(t), respectively. Obtain e_+(t) and e_-(t) by interpolating the cubic spline function at the maxima and the minima of d_{j-1}(t), respectively.
(d) Let m(t) = (e_+(t) + e_-(t))/2 be the mean of the upper envelope and the lower envelope of d_{j-1}(t).
(e) Let d_j(t) = d_{j-1}(t) − m(t).
(f) Compute SD = Σ_t |m(t)|^2 / Σ_t |d_{j-1}(t)|^2. If SD is not greater than the given threshold, then set c_i(t) = d_j(t). Otherwise, increment the value of j and go back to Step (b).
Step 3: Set r_i(t) = r_{i-1}(t) − c_i(t). If r_i(t) is a monotonic function or no longer contains enough extrema to continue the sifting, then the decomposition is completed. Otherwise, increment the value of i and go back to Step 2.
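A minimal Python sketch of the sifting procedure above, assuming NumPy and SciPy are available; the helper names (sift_imf, emd), the iteration caps, and the extrema-count stopping details are our assumptions rather than the paper's implementation:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_imf(d, threshold=0.3, max_sift=50):
    """Extract one intrinsic mode function via the sifting loop (Steps 2(a)-(f))."""
    t = np.arange(len(d))
    for _ in range(max_sift):
        maxima = argrelextrema(d, np.greater)[0]
        minima = argrelextrema(d, np.less)[0]
        if len(maxima) < 4 or len(minima) < 4:
            break  # not enough extrema to fit cubic-spline envelopes
        e_plus = CubicSpline(maxima, d[maxima])(t)   # upper envelope e_+(t)
        e_minus = CubicSpline(minima, d[minima])(t)  # lower envelope e_-(t)
        m = 0.5 * (e_plus + e_minus)                 # envelope mean m(t)
        sd = np.sum(m ** 2) / np.sum(d ** 2)         # stopping criterion SD
        d = d - m                                    # Step (e): d_j(t)
        if sd <= threshold:
            break
    return d

def emd(x, threshold=0.3, max_imfs=10):
    """Decompose x into intrinsic mode functions plus a residue (Steps 1-3)."""
    imfs, r = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        c = sift_imf(r, threshold)
        imfs.append(c)
        r = r - c  # Step 3: r_i(t) = r_{i-1}(t) - c_i(t)
        n_ext = len(argrelextrema(r, np.greater)[0]) + len(argrelextrema(r, np.less)[0])
        if n_ext < 4:
            break  # residue is (close to) monotonic: decomposition complete
    return imfs, r
```

By construction the intrinsic mode functions and the residue sum back to the original signal, which is a quick sanity check for any implementation.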
The details of the empirical mode decomposition can be found in [8,9,10]. Because a signal with more extrema contains more high-frequency components, the intrinsic mode functions with lower indices are localized in the higher frequency bands. Hence, the empirical mode decomposition is a form of time-frequency analysis. Because of these desirable properties, this paper applies empirical mode decomposition to decompose the electroencephalograms into various intrinsic mode functions.
However, because the total number of intrinsic mode functions is determined automatically by the above algorithm, it is difficult to obtain fixed-length feature vectors for performing activity recognition. To tackle this difficulty, the intrinsic mode functions are grouped together. Because there are four to eight intrinsic mode functions for most of the electroencephalograms, the intrinsic mode functions are categorized into four groups. Let I_1, I_2, I_3, and I_4 be the sets of the first, second, third, and fourth groups of intrinsic mode functions, respectively.
If only four intrinsic mode functions are obtained in the empirical mode decomposition, then each set contains one intrinsic mode function. That is, c_1(t) ∈ I_1, c_2(t) ∈ I_2, c_3(t) ∈ I_3, and c_4(t) ∈ I_4.
If five intrinsic mode functions are obtained, then the third and fourth intrinsic mode functions are combined into one group. That is, c_1(t) ∈ I_1, c_2(t) ∈ I_2, c_3(t) + c_4(t) ∈ I_3, and c_5(t) ∈ I_4. If six are obtained, then the second and third are combined, as are the fourth and fifth. That is, c_1(t) ∈ I_1, c_2(t) + c_3(t) ∈ I_2, c_4(t) + c_5(t) ∈ I_3, and c_6(t) ∈ I_4. If seven are obtained, then the first and second, the third and fourth, and the fifth and sixth are combined pairwise. That is, c_1(t) + c_2(t) ∈ I_1, c_3(t) + c_4(t) ∈ I_2, c_5(t) + c_6(t) ∈ I_3, and c_7(t) ∈ I_4. If eight are obtained, then consecutive pairs are combined throughout. That is, c_1(t) + c_2(t) ∈ I_1, c_3(t) + c_4(t) ∈ I_2, c_5(t) + c_6(t) ∈ I_3, and c_7(t) + c_8(t) ∈ I_4.
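Under the assumption that the intrinsic mode functions are stored lowest-index-first as equal-length NumPy arrays, the grouping rules above reduce to a small lookup table (the function name is ours):

```python
import numpy as np

def group_imfs(imfs):
    """Map K in {4, ..., 8} intrinsic mode functions to the four groups
    I_1..I_4 by summation, following the combination rules in the text."""
    plans = {
        4: [[0], [1], [2], [3]],
        5: [[0], [1], [2, 3], [4]],
        6: [[0], [1, 2], [3, 4], [5]],
        7: [[0, 1], [2, 3], [4, 5], [6]],
        8: [[0, 1], [2, 3], [4, 5], [6, 7]],
    }
    plan = plans[len(imfs)]  # KeyError outside the 4-8 range handled in the paper
    return [sum(imfs[i] for i in idx) for idx in plan]
```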
Because the magnitudes of various brain waves differ between activities, the magnitudes and the energies of various brain waves are usually employed as the features for activity recognition. Similar but more numerous physical quantities are employed as the features in this paper. In particular, the entropy, mean, interquartile range, mean absolute deviation, range, variance, skewness, kurtosis, L_2 norm, L_1 norm, and L_∞ norm of each group of intrinsic mode functions are computed and employed as the features [11,12]. Here, there are four groups of intrinsic mode functions for each electroencephalogram and there are 11 features extracted from each group of intrinsic mode functions. Hence, the length of each feature vector is 44.
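The 11 statistics per group can be computed directly with NumPy/SciPy. The paper does not fix a definition of the entropy; the histogram-based Shannon entropy below is one common choice and is our assumption, as are the function names:

```python
import numpy as np
from scipy.stats import skew, kurtosis, iqr

def group_features(g):
    """The 11 physical quantities for one group of intrinsic mode functions."""
    hist, _ = np.histogram(g, bins=16)
    p = hist / hist.sum()
    ent = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # histogram Shannon entropy (assumed)
    mad = np.mean(np.abs(g - np.mean(g)))         # mean absolute deviation
    return np.array([
        ent,
        np.mean(g),
        iqr(g),                                   # interquartile range
        mad,
        np.ptp(g),                                # range (max - min)
        np.var(g),
        skew(g),
        kurtosis(g),
        np.linalg.norm(g, 2),                     # L_2 norm
        np.linalg.norm(g, 1),                     # L_1 norm
        np.linalg.norm(g, np.inf),                # L_inf norm
    ])

def eeg_feature_vector(groups):
    """Concatenate the 11 features of the four groups: length-44 vector."""
    return np.concatenate([group_features(g) for g in groups])
```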

Features Extracted from the Image Sequences
Because different activities involve different objects, the objects are segmented from each image. Due to the movements of the subjects, the camera rotates and translates. As a result, the sizes of the same object in two consecutive images are different. To address this difficulty, the discrete cosine transform, which can be used to resize the objects, is first applied to the objects. Next, the matrices of the discrete cosine transform coefficients of the objects in two consecutive images are compared [13]. Zeros are placed into the matrix of discrete cosine transform coefficients corresponding to the smaller object so that the zero-filled matrix has the same size as the larger matrix [14]. The zero-filled matrix or the matrix without zeros of the discrete cosine transform coefficients of the object in the ith image is denoted D_i.
It is worth noting that the rates of change of the objects in the image are different for different activities. For example, the objects on the computer screen change faster when playing electronic games than when performing the online exercises. This implies that the changes of the positions of the objects between two consecutive images can be employed as the features for activity recognition. Let the minimum x-coordinate, the maximum x-coordinate, the minimum y-coordinate, and the maximum y-coordinate of the object in the ith image be x_min,i, x_max,i, y_min,i, and y_max,i, respectively. The middle points of the x-coordinate and the y-coordinate of the object in the ith image are defined as x_mean,i = (x_min,i + x_max,i)/2 and y_mean,i = (y_min,i + y_max,i)/2, respectively. Here, x_mean,i+1 − x_mean,i, y_mean,i+1 − y_mean,i, and ((x_mean,i+1 − x_mean,i)^2 + (y_mean,i+1 − y_mean,i)^2)^(1/2) are employed as the features. In addition to using the features in the spatial domain, this paper also extracts features based on the differences of the discrete cosine transform coefficients of the objects between two consecutive images. In particular, the mean, median, variance, skewness, and kurtosis of all of the coefficients in D_i+1 − D_i are also employed as the features. The length of the feature vectors is 12.
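A sketch of the per-image-pair computation, assuming SciPy's dctn for the 2D discrete cosine transform. The text spells out eight of the twelve features (three positional, five DCT-difference statistics); only those eight are computed here, and the function names and bounding-box convention are our assumptions:

```python
import numpy as np
from scipy.fft import dctn
from scipy.stats import skew, kurtosis

def pad_to(D, shape):
    """Zero-fill a DCT coefficient matrix to a common shape
    (the DCT-domain resizing trick described above)."""
    out = np.zeros(shape)
    out[:D.shape[0], :D.shape[1]] = D
    return out

def image_pair_features(obj_i, obj_j, bbox_i, bbox_j):
    """Features from two consecutive object crops; bbox = (xmin, xmax, ymin, ymax)."""
    # midpoint displacement features
    xm_i = 0.5 * (bbox_i[0] + bbox_i[1]); ym_i = 0.5 * (bbox_i[2] + bbox_i[3])
    xm_j = 0.5 * (bbox_j[0] + bbox_j[1]); ym_j = 0.5 * (bbox_j[2] + bbox_j[3])
    dx, dy = xm_j - xm_i, ym_j - ym_i
    # statistics of D_{i+1} - D_i in the DCT domain
    shape = (max(obj_i.shape[0], obj_j.shape[0]), max(obj_i.shape[1], obj_j.shape[1]))
    diff = pad_to(dctn(obj_j, norm='ortho'), shape) - pad_to(dctn(obj_i, norm='ortho'), shape)
    d = diff.ravel()
    return np.array([dx, dy, np.hypot(dx, dy),
                     np.mean(d), np.median(d), np.var(d), skew(d), kurtosis(d)])
```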

Features Extracted from the Motion Signals
It is worth noting that the positions and the angles of the camera are different for different activities. For example, the head points forward while watching television, whereas it points downward while reading/writing and drawing. Furthermore, the movements of the head are different for different activities. For example, the head moves more while eating. Hence, the means and the variances of the x-direction, the y-direction, and the z-direction of the motion signals are employed as the features [15,16]. The length of the feature vectors is 6.
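Assuming the motion signal is stored as a (T, 3) array of x, y, z samples, this six-feature vector is a one-liner (the function name is ours):

```python
import numpy as np

def motion_features(motion):
    """motion: array of shape (T, 3) with x, y, z samples.
    Returns the per-axis means followed by the per-axis variances (length 6)."""
    motion = np.asarray(motion, dtype=float)
    return np.concatenate([motion.mean(axis=0), motion.var(axis=0)])
```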

Fusion of All the Features Together
The features extracted from each electroencephalogram, each image, and each motion signal are combined to form a feature vector. Here, the length of the feature vectors is 62.

Classification
A random forest is an extended variant of bagging. It consists of a collection of a large number of individual decision trees. However, it differs from bagging in the sense that each node variable is generated only from a handful of randomly selected variables. Therefore, not only is the sampling random, but the generation of each node variable (feature) is also random. Each tree in the random forest gives a class prediction, and the class with the most votes becomes the prediction of the model [17]. The procedures for performing the random forest are summarized as follows and shown in Figure 1: Step 1: If there are N samples, then N samples are drawn at random with replacement. That is, each sample is drawn randomly one at a time, and a sample may be drawn again after it has been selected. These N selected samples are used to train a decision tree.
Step 2: Suppose that each sample has M attributes. When a node of the decision tree needs to split, m attributes are selected randomly such that m << M is satisfied. Then, a strategy such as the information gain is adopted to evaluate these m attributes, and one of them is selected as the split attribute of the node.
Step 3: During the formation of the decision tree, each node is split according to Step 2 until it can no longer be split.
Step 4: Repeat Step 1 to Step 3 to establish a large number of decision trees. Thus, a random forest is formed.
From the above, it can be seen that different strategies, such as the information gain, can be adopted in the random forest. Hence, if an appropriate strategy is selected, then a high classification accuracy can be achieved. Moreover, due to the introduction of two sources of randomness, i.e., from the samples and from the features, the random forest does not easily suffer from the problem of overfitting. Furthermore, the tree structure helps the model to address nonlinear data [18,19]. In addition, in the training process, the interaction among the features can be exploited and the importance of the features can be ranked accordingly.
Because of the above advantages, this paper adopts the random forest to extract the features and perform the classification. In particular, the random forest selects five of these 62 features and classifies the feature vectors into seven activities. Here, 30% of the overall data are employed for training and the remaining 70% are employed for testing. The total number of data points in the training and testing sets is summarized in Table 1. For simplicity, no cross-validation is performed.
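A hedged sketch of this classification stage using scikit-learn's RandomForestClassifier as a stand-in: the 30%/70% training/testing split and the forest of 100 trees follow the text, while the synthetic 62-feature data and the importance-based selection of five features are our assumptions about how the paper's pipeline could be realized.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 62))     # stand-in for the 62-dimensional feature vectors
y = rng.integers(0, 7, size=700)   # seven activity labels

# 30% of the data for training, the remaining 70% for testing, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# feature importances offer one way to pick the five most informative features
top5 = np.argsort(clf.feature_importances_)[-5:]
```

With real data, the percentage accuracy and macro F1 score of `pred` against `y_te` would be the evaluation metrics used later in the paper.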

Computational Complexity Analysis
It is important to investigate the required computational complexity of the algorithm. It is worth noting that the random forest is the module that requires the heaviest computational power. Let N be the total number of samples, M be the total number of features, and D be the depth of the trees. When the classification and regression tree (CART) grows, the values in all of the features of all samples are taken as the candidates for performing the splitting, and an evaluation index, such as the information gain, gain ratio, or Gini coefficient, is calculated. Therefore, the required computational power for each layer of the random forest is O(N × M). Because there are D layers in each tree, the required computational power for the random forest is O(N × M × D). Furthermore, the spatial complexity of the random forest depends on Split and TreeNum, where Split is the average number of segmentation points for each feature and TreeNum is the total number of trees in the random forest. In the numerical simulation results, N is chosen as 1336, 112, 1044, 1422, and 400 for volunteers 1 to 5, respectively, M is chosen as 62, and TreeNum is chosen as 100. In addition, Split is set to its default value of 10^−7 and D is determined automatically. The processing time of the proposed method is about 2.094193 s.

Computer Numerical Simulation Results
Here, the full set of measurements was provided by five volunteers, including two girls and three boys. The electroencephalograms were acquired by a single-channel device and sampled at 512 Hz. The motion data were sampled at 31 Hz and the red green blue (RGB) images were taken at 0.1 Hz. The data acquisition for each volunteer took between 6 and 10 min, with most acquisitions taking 10 min. A set of motion signals was taken randomly from both the training and the testing sets; these signals were recorded while the first volunteer performed various activities. These motion signals are shown in Figure 2. It can be seen that the motion signals in the training set are consistent with those in the testing set.
Conventional electroencephalogram-based activity recognition applies various filters to the electroencephalograms to obtain various waves. In particular, the electroencephalograms are localized in the frequency band between 0.5 and 49 Hz. The frequency band of the δ wave is between 0.5 and 4 Hz, that of the θ wave is between 4 and 8 Hz, that of the α wave is between 8 and 12 Hz, that of the sensory motor rhythm (SMR) wave is between 12 and 14.99 Hz, that of the mid-β wave is between 15 and 19.99 Hz, that of the high-β wave is between 20 and 30 Hz, that of the low-β wave is between 12 and 19 Hz, that of the whole-β wave is between 12 and 30 Hz, and that of the γ wave is between 30 and 49 Hz. To extract these waves from the electroencephalograms, the fast Fourier transform approach is employed. That is, the fast Fourier transform coefficients of the electroencephalograms are computed and the coefficients outside the corresponding frequency bands are set to zero. Then, the inverse fast Fourier transform is computed to obtain the corresponding waves. It can be seen from the above that this approach does not introduce nonlinear phase distortion.
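The FFT zeroing step of this conventional baseline can be sketched as follows; the function name and the test signal are ours, and the band edges follow the table above. Because the retained coefficients are left untouched, the reconstruction is zero-phase, which is why this approach avoids nonlinear phase distortion.

```python
import numpy as np

def extract_band(x, fs, f_lo, f_hi):
    """Zero all FFT coefficients outside [f_lo, f_hi] Hz and invert,
    as in the conventional filtering-based baseline described above."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

fs = 512                                         # EEG sampling rate from the text
t = np.arange(fs) / fs                           # one second of samples
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 2 * t)  # alpha + delta content
alpha = extract_band(x, fs, 8.0, 12.0)           # keeps the 10 Hz component
delta = extract_band(x, fs, 0.5, 4.0)            # keeps the 2 Hz component
```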
To investigate the effectiveness of applying the empirical mode decomposition to the electroencephalograms for activity recognition, Figure 3 plots the magnitude responses of the decomposed components of the electroencephalograms via both empirical mode decomposition and conventional filtering when the first and second volunteers perform various activities. Then, the physical quantities discussed in Section 2.1 are calculated for each wave and these physical quantities are employed as the features for performing the classification. To evaluate the performance of our proposed empirical mode decomposition-based method, it is compared to the above conventional filtering-based method. Here, the same set of physical quantities is computed for the comparison. It can be seen from the figure that the empirical mode decomposition can yield intrinsic mode functions with very narrow bandwidths. Hence, the features extracted from the intrinsic mode functions are more specific. As a result, the empirical mode decomposition approach can yield higher average classification accuracy.
To investigate the effects of various signals on the activity recognition, Figure 4 plots some of the features extracted from the motion signals when various volunteers perform various activities. Similarly, Figure 5 plots some of the features extracted from the image sequences when various volunteers perform various activities. Figure 6 plots some of the features extracted from the electroencephalograms when the first and second volunteers perform various activities. It can be seen from these figures that the features of various signals corresponding to different activities are localized in different regions in the feature space. This implies that the features of these signals are effective.
The percentage accuracy and the macro F1 score are used as the metrics to evaluate the performance of various methods, because these are the common criteria used in classification problems. Tables 2-6 show the percentage accuracies and the macro F1 scores obtained by both our proposed empirical mode decomposition-based method and the conventional filtering-based method using the signals acquired from the five volunteers, respectively. It can be seen from the tables that the percentage accuracies based on three types of signal are higher than those using two types of signal. Furthermore, the percentage accuracies based on two types of signal are higher than those using the corresponding individual signals. Although this is not the case for all of the macro F1 scores, it is true for most of the cases. This demonstrates the advantages of using the multi-modal approach for activity recognition.
Moreover, using three types of signal, our proposed empirical mode decomposition-based method outperforms the conventional filtering-based method for the last four volunteers. Although this is not the case for the first volunteer, the difference is very small and can be ignored. To understand this exception, it can be seen from Figure 6 that the overlaps of the features among various activities in the feature space based on the empirical mode decomposition approach are larger than those based on the conventional filtering approach for the first volunteer, whereas this is not the case for the second volunteer. This accounts for the exception. Overall, the obtained results demonstrate the advantages of using the nonlinear and adaptive time-frequency approach for activity recognition.

Conclusions
This paper applies empirical mode decomposition to electroencephalograms to obtain the intrinsic mode functions localized in various frequency bands. The intrinsic mode functions are categorized into four groups. Then, 11 physical quantities are computed for each group of the intrinsic mode functions and used as features. Finally, the random forest is employed to perform multi-modal activity recognition. Numerical simulation results show that the percentage accuracies for activity recognition range between 78.21% and 96.90%. This demonstrates that the activities can be successfully recognized by our proposed algorithm. Furthermore, it can be seen that the percentage accuracies based on three types of signal are higher than those using two types of signal or individual signals. This demonstrates the success of using the multi-modal approach for activity recognition. Moreover, the numerical simulation results also show that the empirical mode decomposition-based method outperforms the conventional filtering-based method. This demonstrates the effectiveness of using the nonlinear adaptive approach to decompose the signal into various components for performing activity recognition.

Conflicts of Interest:
The authors declare no conflict of interest.