Hierarchical Activity Recognition Using Smart Watches and RGB-Depth Cameras

Human activity recognition is important for healthcare and lifestyle evaluation. In this paper, a novel method for activity recognition is presented that jointly considers motion sensor data recorded by wearable smart watches and image data captured by RGB-Depth (RGB-D) cameras. A normalized cross correlation based mapping method is implemented to associate motion sensor data with the corresponding image data from the same person in multi-person situations. Further, to improve recognition accuracy, a hierarchical structure embedded with an automatic group selection method is proposed. With this method, if the number of activities to be classified is changed, the structure is changed correspondingly without manual interaction. Comparative experiments against single data source and single layer methods show that our method is more accurate and robust.


Introduction
Personal lifestyle significantly impacts our health. For example, habits such as prolonged sedentary activity or overeating are harmful to the body and lead to many chronic diseases such as diabetes, heart disease, and hypertension. In order to monitor and evaluate personal health and lifestyle and to discover the relationship between lifestyle and health, an automatic monitoring system is needed to capture personal physical activity data and evaluate personal lifestyle.
In past decades, many methods for activity capture and recognition using different kinds of sensors have been proposed. The most popular approach uses a stationary camera, which is useful and convenient for activity recognition since closed circuit television systems are widespread. Neil et al. [1] utilized public videos captured from a single Red-Green-Blue (RGB) camera to detect some simple activities such as walking, running, and stopping. Multiple cameras have also been introduced in activity recognition (AR) systems: a distributed camera network [2] and a multi-view framework [3] were developed for activity recognition. Further, RGB image data combined with depth data provide more information about activity, so many frameworks have been implemented using RGB-Depth (RGB-D) data directly [4,5] or using skeleton data [6,7] extracted from RGB-D data. However, when subjects are out of view of the cameras, image data alone cannot provide any useful information for activity analysis. Aiming at monitoring personal activity 24 h per day, wearable sensors are applied to provide valid data and also to assist image data in achieving more accurate recognition of activity.
A wearable device is usually equipped with multiple sensors such as an accelerometer, a gyroscope, and a Global Positioning System (GPS) receiver. It is widely used to capture activity data since it records the location of the subject, which is important for activity recognition [8]. Also, some methods use one or two wearable cameras worn on the chest or head for video recording [9]. However, both the GPS and the camera are power-consuming sensors which cannot last for a long time within wearable devices. Thus, some existing methods have used one accelerometer or gyroscope to capture movement of the wrist [10,11], hip [12], chest [13], and ankle [14] for activity recognition, while others have placed multiple motion sensors over the whole body [15,16]. The number and position of motion sensors are critical for activity recognition, but it is necessary to balance recognition accuracy and wearing convenience. Ling et al. [17] compared five positions including thigh, ankle, arm, wrist, and hip and claimed that combining both thigh and wrist sensors could get better results than combining any other sensors. Furthermore, a cell phone is a special type of wearable device which offers a convenient way to monitor lifestyle [18,19]. The cell phone is embedded with many sensors and a transmission module (Wi-Fi/Bluetooth) which can send data wirelessly. However, although wearable devices provide a convenient method for data collection, the data from these sensors is too simple for complicated activity recognition. Moreover, some other ambient sensors are used for indoor activity recognition and healthcare, such as microphones, infrared-ray position sensors, and floor sensors, which are not available in cell phones [20].
In order to capture activity data and monitor lifestyle continuously, we have constructed a multi-source daily life monitoring system which consists of wearable smart watches and RGB-D cameras, as shown in Figure 1. The smart watch runs an Android Operating System and is embedded with an accelerometer, a gyroscope, and a Wi-Fi module for data transmission. The data captured through the watch are shown in Figure 1. Data are sent to a server when the network is available. Moreover, one or more fixed RGB-D cameras are utilized for indoor RGB-D image data capture, and the skeleton data [21] is extracted from the depth data. The advantage of choosing a smart watch over traditional sensors is that the watch is a common accessory and so it is comfortable for users to wear. In our system, for indoor cases, RGB-D cameras, which capture rich activity information, combined with motion sensors can easily detect many complicated activities that are undetectable by using motion sensors alone. For outdoor cases, the motion sensor embedded in the smart watch continuously records the wearer's activities. Thus, combining these two types of devices offers a useful tool for activity recognition, lifestyle evaluation, and healthcare. However, there are still some problems that need to be solved using this system. Because there are many data sources in the system, it is common to use a hierarchical framework to combine them together for activity recognition. For example, motion sensor data is utilized at the first layer to separate all activities into a static activity class and a dynamic activity class, and other data sources are introduced at the next layer for further processing.
However, the existing hierarchical methods are all hand-designed and cannot be applied to all situations. If the types of activity to be classified are changed, the whole structure has to be changed manually. Moreover, these hierarchical structures are designed using estimates based on experience, and it is not certain whether these structures reflect real situations. Therefore, it is important to design a hierarchical method and an automatic group selection method which could adapt to most general situations.
Moreover, it is possible that there are multiple subjects in the view of one RGB-D camera, and the recognition process will be confused since it cannot judge whether sensor data and skeleton data are from the same subject. In our system, in order to deal with the multi-person situation, a rapid and effective mapping method to bind motion sensor data and image data from the same subject is needed.
The main contributions of this paper are listed as follows: (1) a novel monitoring system which combines both wearable and fixed devices for activity recognition is proposed. Through this system, we can continuously capture personal daily data; (2) a hierarchical activity recognition structure with an automatic group selection method which combines RGB-D data, accelerometer data, and gyroscope data is proposed. With the help of the new recognition structure, if the types of activity to be classified are changed, then the structure could be changed correspondingly without interaction; (3) a normalized cross correlation (NCC)-based mapping method is proposed to establish association between smart watch data and camera data from the same person in multi-person situations.
There is plenty of previous work related to activity recognition based on motion sensor data, RGB-D data, and skeleton data. Features such as mean, variance, energy, entropy [17], and signal-magnitude area (SMA) [13] were designed for motion sensors. However, with motion sensors, only some simple activities could be detected such as walking, sitting, and running [11,22].
Moreover, a rapid and accurate image feature is key for activity recognition. Histogram of Oriented Gradient 3-Dimension (HOG3D) [23] and Histogram of Optical Flow 3-Dimension (HOF3D) [24] have performed well on many datasets. In addition, Kantorov [25] extracted motion information as the feature from compressed video, and Fisher vectors were used to encode the feature. Liu et al. [26] proposed a hierarchical partwise bag-of-words feature from both local and global areas. Many methods also use data from RGB-D cameras. Ni et al. [27] proposed a novel feature using 3D spatial and temporal descriptors from both grayscale and depth image channels. The Actionlet Ensemble Model [28], based on depth data and skeleton data, achieved good performance. However, fixed cameras restrict the system to tracking subjects only in limited locations.
Classification methods from generative to discriminative methods are commonly used in activity recognition. The Hidden Markov model is the most common generative approach in activity detection [29,30]. There are many other approaches using a dynamic Bayesian Network [31]. Support Vector Machine (SVM) [32], Boosting [33], Neural Network [34] and Conditional Random Field (CRF) [35] are all discriminative methods utilized in activity recognition. However, one single layer of classification method cannot obtain good results, so some hierarchical methods have been proposed to improve the performance. Khan et al. [13] proposed a two-layer method using Neural Networks based on accelerometer data. Three states, including dynamic, static, and transition, are detected at the first layer, and three classifiers for each state are trained for the second layer. Yin et al. [36] proposed a hierarchical probabilistic latent model which contains four layers using video sequence. However, most of the existing methods use hand-designed hierarchical structures which cannot be applied in varied situations. In addition, if the number and types of activity are changed, the whole structure needs to be adjusted manually. Therefore, our study focuses on providing a general hierarchical method which could generate the recognition strategy automatically from multi-source data.
The remainder of the paper is organized as follows. The method is described in Section 2, and comprehensive experiments are described and comparative results are given in Section 3. Finally, some conclusions are drawn in Section 4.

Overview
The method presented in this paper is illustrated in Figure 2. There are three main steps in the proposed method. In the first step, a Normalized Cross Correlation (NCC) based mapping method is used to bind each skeleton data sequence to its corresponding motion sensor data. In the second step, mean, variance, and some other features from motion sensor data, together with the Skeleton Shape Histogram and Edge Histogram Descriptors from RGB-D data, are extracted for activity recognition. In the last step, a hierarchical classifier is constructed, and an automatic group selection method is proposed to build an optimal hierarchical structure to improve the performance of activity recognition.

Normalized Cross Correlation Mapping Method
With the help of data from the motion sensor and the RGB-D camera, a large amount of activity-related information is acquired. However, when more than one person is captured by an RGB-D camera, each person's motion sensor data must be matched to the corresponding skeleton data before further processing. Therefore, a mapping method is proposed in this paper to discover the relationship between motion sensor data and skeleton data from a single subject. The process is shown in Figure 3. Features are extracted from both the gyroscope and the skeleton, and an NCC method is introduced for mapping.
First, the total velocity feature VEL_gyro(t) is extracted from the gyroscope using Equation (1), where Gyro_x(t) is the gyroscope data of the x axis at time t. VEL_gyro(t) represents the motion level of the subject's hand, since the smart watch is worn on the wrist. At the same time, the skeleton data also records the location of the hand in the image sequence. In order to capture the motion information from the skeleton data, the hand motion velocity VEL_image(t) is calculated using Equation (2), where I_x(position, t) is the x coordinate of the position in the image at time t, and position can be any joint such as head, left hand, right hand, left foot, right foot, and so on. We use left_hand in Equation (2) since the smart watch is worn on the left hand, so VEL_image(t) is the velocity of the left hand in the image.
When the velocities from the two data sources are available, the NCC method is applied to map the two data sources and identify the subject. Because both data sources are noisy, the original velocities are not used for mapping directly. Instead, two thresholds T_gyro and T_image are introduced, and GT(t) and IT(t) indicate whether the hand is active or not according to the gyroscope and the image, respectively. The optimal thresholds T_gyro and T_image are found through Equation (7), which provides a simple method to locate them. In Equation (7), D is the number of all testing points, ∑_{d=1}^{D} GT(d) · IT(d) is the number of matching points which meet both thresholds over all testing points, and Max(∑_{d=1}^{D} GT(d), ∑_{d=1}^{D} IT(d)) is the larger of the two numbers of points which meet a single threshold. The results of threshold determination are illustrated in the experiment section.
After the threshold determination, for the same skeleton data, the different gyroscope data streams are tested through the proposed NCC-based method, and the gyroscope data which obtains the maximum result in Equation (4) is mapped to the skeleton data.
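The mapping procedure can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the paper's exact formulas: the gyroscope velocity is taken as the Euclidean norm of the three axes, the indicators are computed with a strict "greater than" comparison, and the NCC is the zero-lag, non-centered form (returning 0 when either indicator sequence is all zeros). The function names, watch identifiers, and sample values are hypothetical; only the thresholds 1.3 and 0.023 come from the experiments.

```python
import math

def velocity_gyro(gx, gy, gz):
    """Assumed total velocity: Euclidean norm of the three gyroscope axes."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(gx, gy, gz)]

def activity_indicator(velocity, threshold):
    """1 where the velocity exceeds the threshold (hand active), else 0."""
    return [1 if v > threshold else 0 for v in velocity]

def ncc(a, b):
    """Zero-lag normalized cross correlation of two equal-length sequences.

    Returns 0.0 when either sequence is all zeros (no activity to compare).
    """
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def map_skeleton_to_watch(vel_image, gyro_velocities, t_image, t_gyro):
    """Pick the watch whose activity pattern best matches one skeleton."""
    it = activity_indicator(vel_image, t_image)
    scores = {
        watch_id: ncc(activity_indicator(vel, t_gyro), it)
        for watch_id, vel in gyro_velocities.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical velocities for one skeleton and two candidate watches.
vel_image = [0.00, 0.05, 0.06, 0.00, 0.00, 0.04]
gyro = {
    "watch_1": [0.2, 0.1, 2.0, 2.5, 0.3, 0.1],
    "watch_2": [0.1, 1.9, 2.2, 0.2, 0.4, 1.6],
}
best, scores = map_skeleton_to_watch(vel_image, gyro, t_image=0.023, t_gyro=1.3)
```

In this toy input, the second watch is active at exactly the time steps where the skeleton's hand moves, so it receives the higher score and is selected.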

Feature of Accelerometer and Gyroscope
In our study, data captured by the motion sensors, including the accelerometer and gyroscope, are cut into sub-segments of six seconds each. The features of the motion sensor include mean, variance, range, spectral energy, and absolute change (AC). AC is defined over the raw data sequence, where x_i, y_i, and z_i are the ith values on the X, Y, and Z axes of the gyroscope or accelerometer raw data sequence and N is the length of the sequence.
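As an illustration of the per-segment features, the following Python sketch computes them for a single axis. The definitions of spectral energy (sum of squared DFT magnitudes without the DC term, divided by the segment length) and of AC (mean absolute difference between consecutive samples) are assumptions chosen for the example, not the definitions used in this paper, and all function names are hypothetical.

```python
import cmath
import math

def segment(signal, rate_hz=50, seconds=6):
    """Cut a signal into non-overlapping windows of rate_hz * seconds samples."""
    size = rate_hz * seconds
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def spectral_energy(x):
    """Assumed definition: squared DFT magnitudes (DC excluded) over length."""
    n = len(x)
    energy = 0.0
    for k in range(1, n):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        energy += abs(coeff) ** 2
    return energy / n

def absolute_change(x):
    """Assumed stand-in for AC: mean absolute difference of consecutive samples."""
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)

def axis_features(x):
    """Mean, variance, range, spectral energy, and AC for one axis of a segment."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    return {
        "mean": mean,
        "variance": variance,
        "range": max(x) - min(x),
        "spectral_energy": spectral_energy(x),
        "absolute_change": absolute_change(x),
    }

features = axis_features([1.0, 3.0, 1.0, 3.0])
```

A full feature vector would repeat `axis_features` for the three accelerometer axes and the three gyroscope axes of each six-second segment.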

Feature of RGB-D Data
RGB-D data are captured through Kinect sensors. The RGB image resolution is 1280 × 960 pixels and the depth image resolution is 640 × 480 pixels. The sampling rate of the RGB images is 12 Hz and that of the depth images is 30 Hz. Skeleton data is extracted from the RGB-D data through the Kinect Software Development Kit (SDK). Due to the limitation of the Kinect sensor, the subject cannot be too far away (>5 m) from the sensor. The 3D Shape Histogram feature [37] is utilized for feature extraction. For each skeleton at time t, twelve joints including head, left elbow, right elbow, left hand, right hand, left knee, right knee, left foot, right foot, hip center, left hip, and right hip are used for the histogram calculation. Each joint is transformed to a spherical coordinate system whose origin is the hip center. Six seconds of skeleton data form a histogram according to their zenith and azimuth angles. The zenith angle is divided into seven equal bins and the azimuth angle is divided into ten equal bins.
Moreover, the Edge Histogram Descriptor (EHD) [38] is extracted around the hand area from the RGB image data. The EHD, which calculates the edge distribution over the hand area, is helpful for activity recognition since the activity is highly related to the object with which the subject interacts. The skeleton histogram and the EHD are concatenated as the final feature of the RGB-D data.
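The skeleton shape histogram described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: joints are given as (x, y, z) tuples, the z axis is treated as the zenith direction (the actual vertical axis depends on the sensor's coordinate convention), the hip center itself is skipped because its direction is undefined, and the histogram holds raw counts. The function name and joint keys are hypothetical.

```python
import math

ZENITH_BINS = 7    # zenith angle in [0, pi] split into seven equal bins
AZIMUTH_BINS = 10  # azimuth angle in [0, 2*pi) split into ten equal bins

def shape_histogram(frames, hip_key="hip_center"):
    """Accumulate a 7 x 10 histogram of joint directions over all frames.

    frames: list of dicts mapping joint name -> (x, y, z).
    Returns a flat list of 70 counts, indexed as zenith_bin * 10 + azimuth_bin.
    """
    hist = [0] * (ZENITH_BINS * AZIMUTH_BINS)
    for joints in frames:
        ox, oy, oz = joints[hip_key]
        for x, y, z in joints.values():
            dx, dy, dz = x - ox, y - oy, z - oz
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            if r == 0.0:
                continue  # the origin joint has no direction
            zenith = math.acos(max(-1.0, min(1.0, dz / r)))
            azimuth = math.atan2(dy, dx) % (2 * math.pi)
            zi = min(int(zenith / math.pi * ZENITH_BINS), ZENITH_BINS - 1)
            ai = min(int(azimuth / (2 * math.pi) * AZIMUTH_BINS), AZIMUTH_BINS - 1)
            hist[zi * AZIMUTH_BINS + ai] += 1
    return hist

# One hypothetical frame with three joints.
frame = {
    "hip_center": (0.0, 0.0, 0.0),
    "head": (0.0, 0.0, 1.0),
    "left_hand": (1.0, 0.0, 0.0),
}
hist = shape_histogram([frame])
```

For a six-second segment, `frames` would hold every skeleton in that window, and the resulting 70 values would be concatenated with the EHD values.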

Hierarchical Recognition Scheme
Once the motion sensor data are mapped to the RGB-D data, all data sources are available for activity recognition. In order to utilize all types of data, a hierarchical method is proposed for activity recognition. Using this structure, when the motion data and image data are all available, the mapping method is used to bind them together, and the activity is recognized through a two-layer hierarchical structure. At the first layer, the motion sensor data are utilized for classification, and all activities are divided into groups. At the next layer, image features are introduced. When the subject is out of the scope of the camera, we only use the accelerometer and gyroscope data for activity recognition since the skeleton data is unavailable.
We chose the hierarchical method because motion sensor data alone cannot provide enough information, owing to the limitations and simplicity of its features; specifically, insufficient information is provided for some subtle or complicated activities such as eating or making a call. Therefore, a coarse-to-fine hierarchical method is used to improve the accuracy rate of the classification. In the hierarchical method, all data are separated into multiple groups at the first layer as the input for the next layer. Finding the optimal groups at the first layer is an important issue. The existing methods all focus on how to design a complicated hierarchical structure, but group selection is only based on experience, such as grouping all activities into a static group, a dynamic group, and a transition group. Moreover, once the types of activities are changed, the whole structure cannot be changed correspondingly since the structure is already fixed. So, an automatic group selection method and a hierarchical structure are proposed in this paper.

Automatic Group Selection Method
In this section, we propose a novel automatic group selection method. For the sensor data, we use an SVM classifier for activity recognition. Assume that there are N types of activity to be recognized and that the number of groups is M = 2 at the beginning, which indicates that all activities will be separated into M groups. Then, the performance of every group combination (GC) is tested to find the best one, where the recognition accuracy of the dth GC is T_d^M. In the next step, M is increased repeatedly and the corresponding T_d^M is calculated each time. When the recognition accuracies of all combinations have been calculated, the GC with the highest T_d^M is selected as the final GC.
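The exhaustive search described above can be sketched in Python as an enumeration of every way to split the activities into M non-empty groups. This is only an illustration: `score` is a hypothetical callable standing in for the recognition accuracy T_d^M, which in the paper comes from training and evaluating an SVM classifier on each GC.

```python
def group_combinations(items, m):
    """Yield every partition of items into exactly m non-empty groups."""
    items = list(items)
    if m <= 0 or m > len(items):
        return
    if m == 1:
        yield [items]
        return
    if m == len(items):
        yield [[x] for x in items]
        return
    first, rest = items[0], items[1:]
    # Case 1: the first item forms a group of its own.
    for p in group_combinations(rest, m - 1):
        yield [[first]] + p
    # Case 2: the first item joins one of the m groups built from the rest.
    for p in group_combinations(rest, m):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def select_best(activities, score):
    """Evaluate every GC for M = 2..N and return the best one with its score."""
    best_gc, best_score = None, float("-inf")
    for m in range(2, len(activities) + 1):
        for gc in group_combinations(activities, m):
            s = score(gc)
            if s > best_score:
                best_gc, best_score = gc, s
    return best_gc, best_score

ACTIVITIES = ["BR", "CL", "CW", "DK", "ET", "RD", "ST", "SD"]
counts = {m: sum(1 for _ in group_combinations(ACTIVITIES, m)) for m in range(2, 9)}
total = sum(counts.values())
```

For the eight activities used in the experiments, `counts[2]` is 127 and `total` is 4139, so the exhaustive search would train and evaluate a classifier several thousand times.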
However, the computational complexity of this method is very high since the recognition accuracy rates of all GCs have to be calculated. We therefore utilized a group selection approach with low time complexity, which is shown below:

Algorithm 1. Algorithm of Group Selection
Step 1
N: number of types of activities; M = 2.

Step 2
The recognition accuracy rates of all group combinations from C_1^M to C_d^M are evaluated. Each C_d^M contains M classes, and an SVM classifier is trained for performance evaluation.

The Recognition Hierarchical Structure
The whole structure of the proposed method is shown in Figure 4. If the subject is out of the view of the camera, then only motion sensor data are available; a single layer SVM classifier is utilized for activity recognition, and only four kinds of activities can be detected since motion sensors provide limited information. The four activities are walking, standing, sitting, and running. On the other hand, if both sensor data and image data are available, the proposed hierarchical method is implemented. At the first layer, data from the motion sensor are used for recognition, and the features mentioned in Section 2.3.1, including mean, variance, etc., are extracted in this layer. The number of groups and the GC depend on the group selection method. The probability of a certain group g given the sensor data, P(g|Sensor), is obtained through the SVM classifier with probability output. At the second layer, another SVM classifier involving all activities is trained, and the probability of a certain activity a given the image data, P(a|Image), is calculated. After obtaining the two probabilities, the final output is obtained through Equation (10):
Activity = \arg\max_{a \in g} P(a \mid \mathrm{Image}) \, P(g \mid \mathrm{Sensor})   (10)
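The fusion of the two layers can be illustrated with a short Python sketch. It assumes that Equation (10) multiplies P(a|Image) by P(g|Sensor) for the group g that contains activity a and takes the activity with the largest product. The group layout and all probability values below are hypothetical, and in practice both sets of probabilities would come from SVM classifiers with probability output.

```python
def fuse(p_group_given_sensor, p_activity_given_image, groups):
    """Return the activity maximizing P(a | Image) * P(g | Sensor), a in g.

    groups: dict mapping group id -> list of activities in that group.
    """
    best_activity, best_score = None, -1.0
    for g, members in groups.items():
        for a in members:
            score = p_activity_given_image[a] * p_group_given_sensor[g]
            if score > best_score:
                best_activity, best_score = a, score
    return best_activity, best_score

# Hypothetical first-layer groups and classifier outputs.
groups = {"g1": ["SD"], "g2": ["CW", "ST"], "g3": ["ET", "DK"]}
p_group = {"g1": 0.1, "g2": 0.2, "g3": 0.7}
p_activity = {"SD": 0.05, "CW": 0.40, "ST": 0.05, "ET": 0.35, "DK": 0.15}

activity, score = fuse(p_group, p_activity, groups)
```

In this toy case the image classifier alone would favor computer working (CW), but the sensor layer assigns most of the probability to the group containing eating (ET), so the fused decision is ET.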

Dataset
Ten human subjects, six males and four females with an average age of 32 years, participated in the experimental study for data collection. Each subject wore the smart watch and stood in front of an RGB-D camera for data capture. Two types of dataset were recorded. One was obtained through subjects acting according to an activity list including the following eight activities, which cover a large part of daily life: brushing (BR), calling (CL), computer working (CW), drinking (DK), eating (ET), reading (RD), sitting (ST), and standing (SD). The collected data contained 3-axis gyroscope data, 3-axis accelerometer data, and RGB-D image data. The sampling rate of the motion sensors was 50 Hz, the range of the accelerometer was −2 g to +2 g, and the range of the gyroscope was −4 rad/s to +4 rad/s. The duration of collecting one activity of one subject was four minutes. The collected data was cut into more than 40 segments, and each segment lasted six seconds. More than 400 segments including all data sources were recorded for each activity. To deal with the situation without image data, the other dataset was generated through a simple method: subjects only wore the smart watch to capture motion data according to an activity list which contained walking, standing, running, and sitting. Similarly, each activity had 400 segments as training and testing samples. The activity recognition experiments used a leave-one-subject-out cross validation protocol.
Eight typical activities including all types of data are presented in Figure 5. It can be seen that accelerometer and gyroscope signals and image data have distinctive patterns for activity classification. Moreover, we found that some activities share similar features from motion sensor data such as sitting and computer working.
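The leave-one-subject-out protocol used for evaluation can be sketched as follows. The sample layout (a tuple of subject id, feature vector, and label) and the function name are assumptions made for the example.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) for every subject in turn.

    samples: list of (subject_id, features, label) tuples.
    """
    subjects = sorted({s[0] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Three hypothetical subjects with two segments each.
samples = [
    ("S1", [0.1, 0.2], "ST"), ("S1", [0.9, 0.8], "SD"),
    ("S2", [0.2, 0.1], "ST"), ("S2", [0.8, 0.9], "SD"),
    ("S3", [0.1, 0.1], "ST"), ("S3", [0.9, 0.9], "SD"),
]
folds = list(leave_one_subject_out(samples))
```

Each fold trains on all segments from the other subjects and tests on the held-out subject, so no subject contributes data to both sides of a fold.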

Thresholds Determination
Two thresholds, T_gyro and T_image, are used in the NCC-based mapping method to indicate whether the hand is active or not. In this section, we try to find the optimal thresholds. We collected data from both sources for more than five hours for threshold determination. Features of the gyroscope and the image were extracted through Equations (1) and (2). The thresholds were determined through Equation (7), and the value of Equation (7) with respect to the two thresholds is shown in Figure 6. The two thresholds were selected where the surface reached its peak, at T_gyro = 1.3 and T_image = 0.023.
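The threshold search can be illustrated with a small grid search in Python. The sketch assumes that the quantity maximized in Equation (7) is the ratio of the number of points active in both sources to the larger of the two per-source active counts. The velocity samples and candidate grids are hypothetical and were chosen so that the maximum falls on the values reported above.

```python
def match_ratio(vel_gyro, vel_image, t_gyro, t_image):
    """Assumed criterion: points active in both sources over the larger count."""
    gt = [1 if v > t_gyro else 0 for v in vel_gyro]
    it = [1 if v > t_image else 0 for v in vel_image]
    both = sum(g * i for g, i in zip(gt, it))
    denom = max(sum(gt), sum(it))
    return both / denom if denom else 0.0

def grid_search(vel_gyro, vel_image, gyro_grid, image_grid):
    """Return (best_ratio, t_gyro, t_image) over all candidate pairs."""
    return max(
        (match_ratio(vel_gyro, vel_image, tg, ti), tg, ti)
        for tg in gyro_grid
        for ti in image_grid
    )

# Hypothetical velocities: the hand is truly active at indices 1, 2, and 4.
vel_gyro = [0.5, 2.0, 2.2, 0.6, 1.8, 0.2]
vel_image = [0.00, 0.05, 0.06, 0.00, 0.04, 0.01]

best_ratio, t_gyro, t_image = grid_search(
    vel_gyro, vel_image, gyro_grid=[0.3, 1.3, 2.1], image_grid=[0.005, 0.023]
)
```

Thresholds that are too low let sensor noise count as activity and lower the ratio, while a gyroscope threshold that is too high discards genuine motion.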

Evaluation on Matching Result
The NCC-based mapping method is tested in this section. Eight subjects were involved in the test, and each two subject pairing were asked to act freely in the front of a RGB-D camera. The data was cut into six second sub segments. For each skeleton data, NCC-based mapping was implemented to test all gyroscope data to find the max results in Equation (4) and map it to the skeleton data. It is possible that some NCC results of both sub segments of skeleton data and gyroscope data were close to zero since subjects were inactive at that moment; all these sub segments were removed from the test. Figure 7 is an example of the mapping process of eight minutes data from two subjects S 1 and S 2 , the first curve is the velocity of the gyroscope data from S 1 and the second curve is that from S 2 . The third curve is the velocity of the skeleton data from S 2 . All velocity data were processed by a media filter to remove noise. The second and third curves are very similar since they come from the same subject. Two series of N(t) values are calculated during a relative long time period respectively, the first one is between S 2 image (image data of S 2 ) and S 1 gyro (gyroscope data of S 1 ), denote as N(t) 21 , illustrated as light green curve and the second is between S 2 image and (gyroscope data of S 2 ) S 2 gyro , denote as N(t) 22 , illustrated as light red curve. Then each corresponding values in two series are compared and number of higher values is recorded. The length of green bar is the number of higher values of N(t) 21 . Similarly, the length of red bar is the number of higher values of N(t) 22 . S 2 image and S 2 gyro which come from the same subject constructed the serie with more number of higher values (the red bar) are selected as a pair since it indicates a higher match degree.

Evaluation on Matching Result
The NCC-based mapping method is tested in this section. Eight subjects were involved in the test, and each two subject pairing were asked to act freely in the front of a RGB-D camera. The data was cut into six second sub segments. For each skeleton data, NCC-based mapping was implemented to test all gyroscope data to find the max results in Equation (4) and map it to the skeleton data. It is possible that some NCC results of both sub segments of skeleton data and gyroscope data were close to zero since subjects were inactive at that moment; all these sub segments were removed from the test.  Figure 7 is an example of the mapping process of eight minutes data from two subjects S1 and S2, the first curve is the velocity of the gyroscope data from S1 and the second curve is that from S2. The third curve is the velocity of the skeleton data from S2. All velocity data were processed by a media filter to remove noise. The second and third curves are very similar since they come from the same subject. Two series of ( ) N t values are calculated during a relative long time period respectively, the first one is between 2 image S (image data of S2) and 1 gyro S (gyroscope data of S1), Normally, we tracked more than 15 min data for mapping. But it was possible that both subjects preformed similar activities or kept still at the same time, such as A 1 and A 2 intervals in Figure 7. Both subjects kept still in the A 1 interval and performed similar active activities in the A 2 interval. The N(t) values from the two subjects are very close or the same in these situations, and these intervals are denoted as invalid intervals for mapping. However, in our method we tracked a long time period data sequence and it is impossible that all subjects always perform similarly, so these invalid intervals do not affect the mapping results.
Figure 8 shows the mapping results of the eight subjects, denoted S1 to S8. Each cell [Sx, Sy] is the NCC-based mapping result between the skeleton data of one subject Sx and the gyroscope data of two subjects (Sx and Sy). In the figure, 52 samples are matched correctly and four are matched incorrectly, so the accuracy of the mapping method is 92.86% (52/56). Most of the four mismatched samples were matched incorrectly because the skeleton extraction was inaccurate: the position of the skeleton hand data shifted to other places, so it was difficult to find the active segment.
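The segment-wise comparison described for Figure 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a zero-lag, zero-mean normalized cross correlation in place of Equation (4), and the names `ncc`, `match_by_votes`, `seg_len`, and `min_ncc` are hypothetical. Segments in which every correlation is close to zero are skipped, mirroring the removal of inactive sub-segments.

```python
import math

def ncc(a, b):
    """Zero-lag normalized cross correlation of two equal-length sequences.

    Assumption: this stands in for Equation (4), which is not reproduced here.
    Returns 0.0 when either sequence is constant (no activity).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def match_by_votes(skeleton, gyros, seg_len, min_ncc=0.1):
    """Pick the gyroscope stream that best matches one skeleton stream.

    skeleton: velocity samples of one skeleton
    gyros:    dict mapping a subject label to gyroscope velocity samples
    seg_len:  samples per sub-segment (six seconds times a sampling rate)
    min_ncc:  hypothetical threshold below which a segment counts as inactive
    """
    votes = {label: 0 for label in gyros}
    for start in range(0, len(skeleton) - seg_len + 1, seg_len):
        seg = skeleton[start:start + seg_len]
        scores = {label: ncc(seg, g[start:start + seg_len])
                  for label, g in gyros.items()}
        # Skip segments where all correlations are near zero (inactive subjects).
        if all(abs(s) < min_ncc for s in scores.values()):
            continue
        votes[max(scores, key=scores.get)] += 1
    # The stream that wins the most segments is paired with the skeleton.
    return max(votes, key=votes.get), votes
```

Counting wins per segment, rather than averaging the correlation, corresponds to the bar lengths in Figure 7: a few invalid intervals change only a few votes.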

Results of Group Selection Method
The group selection method is proposed in Section 2.4.1, and the results are shown in Figure 9. The top part of the figure is the curve of the value of Q(C M d) with respect to M, based on the motion sensor data. The curve reaches its highest point at M = 5, which indicates that the five groups shown in the red rectangle are the optimal GC. When M equals 2, only standing is separated from the other activities. It is possible that one axis of the watch accelerometer is parallel to the direction of gravity when subjects are standing, so the mean accelerometer value during standing differs from that during other activities. When M equals 3, computer working and sitting are selected together, since subjects remain still in these two activities. In the next round, reading is separated from the remaining set, since flipping pages produces distinctive features in the motion data. Finally, when M = 5, calling is separated from the other activities and the value of Q(C M d) reaches its maximum, because subjects always kept their hand still and were standing when they made a call.

Figure 9. Results of group selection method.

Performance on the Proposed Method
The summed confusion matrix from the leave-one-subject-out cross validation is shown in Table 1. The F1 measurement [39] is used for evaluation, which is computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. This definition indicates that F1 is the harmonic mean of recall and precision. The average F1 value is 0.839. The F1 value of standing is the highest since it is easily detected by the motion sensor; the mean and variance features are representative for its detection. Moreover, the performance on computer working and sitting is good. Although these two activities are grouped together at the first layer because their motion sensor features are similar, the image feature helps to distinguish them through the area around the hands. The F1 values of brushing, calling, and reading are over 0.8, since both motion sensor and image data play important roles in detecting these activities. However, the results for drinking and eating are not as good: they are placed in the same group, and the differences between these two activities in the skeleton data are too slight.
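As an illustration of how per-activity F1 values follow from a summed confusion matrix such as Table 1, the sketch below applies the definition of F1 as the harmonic mean of precision and recall to each class. The function name `per_class_f1` and the matrix layout (rows are true classes, columns are predicted classes) are assumptions made for this example, and the numbers in the usage note are toy values, not results from the paper.

```python
def per_class_f1(confusion):
    """Return the F1 value of each class from a square confusion matrix.

    Assumed layout: confusion[i][j] is the number of samples whose true
    class is i and whose predicted class is j.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return scores
```

For a toy two-class matrix `[[8, 2], [1, 9]]`, the first class has TP = 8, FP = 1, and FN = 2, so its F1 value is 16/19; averaging the per-class values gives a macro F1 of the kind reported as the average F1.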
Results of activity recognition when subjects are out of the view of the cameras are shown in Figure 10. As before, the experiment used leave-one-subject-out cross validation. Ten subjects were involved in this part, and each subject performed four kinds of activity according to an activity list. Each activity recording was again cut into six-second segments. Because motion sensor data provide limited information and cannot distinguish complicated activities, only four kinds of activity were detected: standing, sitting, running, and walking. The average F1 value is 0.946, since the motion sensor data are sufficient to detect these simple activities robustly.
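The evaluation protocol used here, six-second segments and leave-one-subject-out cross validation, can be sketched as below. This is an illustrative outline only: the helper names `segment` and `leave_one_subject_out` and the `rate_hz` parameter are hypothetical, since the sampling rate is not stated in this section.

```python
def segment(samples, rate_hz, seconds=6):
    """Cut a recording into non-overlapping windows of `seconds` length.

    rate_hz is an assumed sampling rate; a trailing partial window is dropped.
    """
    size = int(rate_hz * seconds)
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each fold.

    subject_ids[i] is the subject who produced segment i, so every segment
    of the held-out subject is excluded from training in that fold.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

Splitting by subject rather than by segment keeps segments from the same person out of both the training and test sets of a fold, which is what makes the reported F1 values a measure of generalization to unseen subjects.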

Comparison between the Proposed Method and Single Layer Method
A hierarchical structure is proposed in this paper. Results of a single layer method without the group selection method are presented for comparison with the proposed method in Figure 11. In the single layer method, the motion features and image features are concatenated and an SVM classifier is used for recognition. It can be observed from Figure 11 that the proposed method performs better than the single layer method in all cases. The F1 value of reading is much higher with the hierarchical method: in the single layer method, reading must be classified among all eight activities, and a large proportion of reading samples were misclassified as computer working or sitting, whereas in the proposed method these two activities belong to a different group. Similarly, the proposed method performs better on computer working. The results for standing and sitting are close between the two methods, since the features from both sources are discriminative.
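The difference between the two structures compared here can be sketched as follows. This is a schematic outline under stated assumptions, not the authors' implementation: classifiers are represented as plain callables, the names `single_layer_predict`, `hierarchical_predict`, `group_members`, and `second_layer` are hypothetical, and passing both feature vectors to the second layer is an assumption.

```python
def single_layer_predict(classifier, motion_feat, image_feat):
    """Single layer: one classifier over the concatenated feature vector."""
    return classifier(list(motion_feat) + list(image_feat))

def hierarchical_predict(group_classifier, group_members, second_layer,
                         motion_feat, image_feat):
    """Two layers: motion features pick a group, then a per-group classifier decides.

    group_classifier: callable mapping motion features to a group label
    group_members:    dict mapping a group label to its list of activities
    second_layer:     dict mapping a group label to a callable that takes
                      (motion_feat, image_feat) and returns an activity
    """
    group = group_classifier(motion_feat)
    members = group_members[group]
    if len(members) == 1:
        # A group with a single activity needs no second-layer decision.
        return members[0]
    return second_layer[group](motion_feat, image_feat)
```

In the hierarchical version, a reading sample only competes with the activities in its own group, which is the mechanism the comparison above credits for fewer confusions with computer working and sitting.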

Comparison between the Data Fusion and Single Source Data
In order to evaluate whether combining two data sources improves the performance and accuracy of activity recognition, we split the dataset into two datasets: one containing only the motion sensor data and the other containing only the image data. A linear SVM was used for training and testing on each. The results are shown in Figure 12. Combining both data sources performs better than using either single data source. Standing obtains a high F1 value on all three datasets. The average F1 value for the image dataset is higher than that for the sensor dataset, since images provide more information about the activity. For example, the results for computer working and drinking based on the image dataset are higher than those based on the sensor dataset, possibly because the objects the user interacts with are visible in the image data. On the other hand, sensor data perform better in detecting calling, while many calling samples were classified as brushing using image data, since image and skeleton data cannot capture small movements when the subject is far away from the camera.

Conclusions
In this paper, a hierarchical activity recognition method is proposed. A capturing system was constructed for data capture. In this system, a smart watch worn by the subject was used to capture motion sensor data for a whole day, and the RGB-D cameras were used to capture image data and skeleton data indoors. Combining both types of data provides rich information and a novel way of collecting data and recognizing activity.
In this system, it is possible that there were two or more subjects in the view of the camera. In order to map motion sensor data to its corresponding skeleton data, an NCC-based mapping method was implemented for data binding. If the subject was out of the view of the camera, we could not find his corresponding skeleton data. In this case, only four types of activity were recognized including standing, sitting, walking, and running based on motion sensor data only. Otherwise, the hierarchical recognition method was implemented.
In the hierarchical method, a two-layer activity recognition structure was built. The first layer only used motion sensor data, and in order to utilize all data sources effectively, a group selection method was proposed. Instead of designing the structure manually according to experience, the group selection method finds the optimal group combination automatically and builds the hierarchical recognition structure. With this method, if the activities to be classified are changed, the structure will change its group combination and structure automatically without interaction. Our experimental results demonstrate that the proposed algorithm performs better than other methods which use only single layer or single data sources.
