Posture Recognition Using Ensemble Deep Models under Various Home Environments

This paper is concerned with posture recognition using ensemble convolutional neural networks (CNNs) in home environments. With the increasing number of elderly people living alone at home, posture recognition is very important for helping them cope with sudden danger. Traditionally, recognizing posture required the coordinates of body points, depth, video frame information, and so on, and conventional machine learning is limited in recognizing posture directly from only an image. However, with advancements in deep learning, good performance in posture recognition can be achieved using only an image. Thus, we performed experiments based on VGGNet, ResNet, DenseNet, InceptionResNet, and Xception as pre-trained CNNs, using five types of preprocessing. On the basis of these deep learning methods, we finally present ensemble deep models combined by majority and average methods. The experiments were performed on a posture database constructed at the Electronics and Telecommunications Research Institute (ETRI), Korea, consisting of 51,000 images with 10 postures from 51 home environments. The experimental results reveal that the ensemble system of InceptionResNetV2s with five types of preprocessing shows good performance in comparison with other combination methods and with the pre-trained CNNs themselves. Thus, we expect that the presented method will be an important technique for helping elderly people cope with sudden danger in home environments.
In future research, we shall study the behavior recognition of elderly people using a large 3D behavior database constructed under home service robot environments.


Introduction
Posture recognition is a technology that classifies and identifies the posture of a person and has received considerable attention in the field of computer vision. Posture is the arrangement of the body skeleton that arises naturally or mandatorily through the motion of a person. Posture recognition helps detect crimes, such as kidnapping or assault, using camera input [1] and also provides a service robot with important information for judging a situation to perform advanced functions in an automatic system. Posture recognition also helps rehabilitate posture correction in the medical field, is used as game content in the entertainment field, and provides suggestions to athletes for maximizing their ability in sports [2,3]. Additionally, it helps elderly people who live alone and have difficulty performing certain activities, by determining sudden danger from their posture in home environments. Thus, there are several studies on posture recognition because it is an important technique in our society.
Several studies have been performed on posture analysis, involving recognition, estimation, etc. Chan [4] proposed the scalable feedforward neural network-action mixture model to estimate three-dimensional (3D) human poses using viewpoint and shape feature histogram features extracted from a 3D point-cloud input. This model is based on a mapping that converts a Bayesian network to a feedforward neural network. Veges [5] performed 3D human pose estimation with Siamese equivariant embedding. Two-dimensional (2D) positions were detected, and then the detections were lifted into 3D coordinates. A rotation-equivariant hidden representation was learned by the Siamese architecture to compensate for the lack of data. Stommel [6] proposed the spatiotemporal segmentation of keypoints given by a skeletonization of depth contours. Here, the Kinect generates both a color image and a depth map. After the depth map is filtered, a 2D skeletonization of the contour points is utilized as a keypoint detector. The extracted keypoints simplify human detection to a 2D clustering problem. For all poses, the distances to other poses were calculated and arranged by similarity. Shum [7] proposed a method for reconstructing a valid movement from the deficient, noisy posture provided by Kinect. Kinect localizes the positions of the body parts of a person; however, when some body parts are occluded, the accuracy decreases, because Kinect uses a single depth camera. Thus, the measurements are objectively evaluated to obtain a reliability score for each body part. By fusing the reliability scores into a query of the motion database, kinematically valid similar postures are obtained. Posture recognition is also commonly studied using inertial sensors in smartphones or other wearable devices. Lee [8] performed automatic classification of squat posture using inertial sensors via deep learning.
One correct and five wrong squat postures were defined and classified using inertial data from five inertial sensors attached to the lower body, a random forest, and a convolutional neural network with long short-term memory. Chowdhury [9] studied detailed activity recognition with a smartphone using trees and support vector machines. The data from the accelerometer in the smartphone were used to recognize detailed activity, such as sitting on a chair rather than simply sitting. Wu [10] studied yoga posture recognition with wearable inertial sensors based on a two-stage classifier. The backpropagation artificial neural network and fuzzy C-means were used to classify yoga postures. Idris [11] studied human posture recognition using an Android smartphone and an artificial neural network. The gyroscope data from two smartphones attached to the arm were used to classify four gestures. To acquire data from inertial sensors or smartphones, a sensor usually needs to be attached to the body, which is inconvenient for elderly people at home.
The development of deep learning has greatly exceeded the performance of machine learning; thus, deep learning is actively studied. Various models and learning methods have been developed. The depth of deep learning networks has expanded from tens to hundreds [12], in contrast to conventional neural networks with a depth of two to three. Deep learning networks abstract data to a high level through a combination of nonlinear transforms. There are many deep learning methods, such as deep belief networks [13], deep Q networks [14], deep neural networks [15], recurrent neural networks [16], and convolutional neural networks (CNNs) [17]. CNNs are designed by integrating a feature extractor and a classifier into a network to automatically train them through data and exhibit the optimal performance for image processing [18]. There are many pre-trained deep models based on the CNN, such as VGGNet [19], ResNet [20], DenseNet [21], InceptionResNet [22], and Xception [23]. These pre-trained models can be used for transfer learning. Transfer learning reuses the parameters of the models, which are trained from a large-scale database using high-performance processing units for a long time. Transfer learning can be effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data [24].
There are many studies on posture estimation based on deep learning. Tompson [25] used a CNN to extract several scale features of body parts. These features include a high-order spatial model sourced from a Markov random field and indicate the structural constraint of the domain between joints. Carreira [26] proposed a feedback algorithm employing a top-down strategy called iterative error feedback. It carries the learned hierarchical representation of the CNN from the input to the output with a self-modifying model. Pishchulin [27] concentrated on local information by reducing the receptive field size of the fast R-CNN (region-based CNN) [28]. Thus, partial detection was converted into multi-label classification and combined with DeepCut to perform bottom-up inference. Insafutdinov [29] proposed residual learning [20] that includes more context by increasing the size of the receptive field and the depth of the network. Georgakopoulos [30] proposed a methodology for classifying poses from binary human silhouettes using a CNN, and the method was improved by image features based on modified Zernike moments for fisheye images. The training set is composed of synthetic images created from a 3D human model using the calibration model of the fisheye camera, whereas the test set consists of actual images acquired by the fisheye camera [31]. To increase performance, many studies have used ensemble deep models in various applications. Lee [32] designed an ensemble stacked auto-encoder based on sum and product for classifying horse gaits using wavelet packets from motion data of the rider. Maguolo [33] studied an ensemble of convolutional neural networks trained with different activation functions, combined using the sum rule, to improve performance on small- or medium-sized biomedical datasets. Kim [34] studied deep learning based on 1D ensemble networks using an electrocardiogram for user recognition.
The ensemble network is composed of three CNN models with different parameters, and their outputs are combined into a single result.
Traditionally, to recognize posture, it was necessary to obtain the coordinates of the body points or inertial data. This was achieved using a depth camera such as Kinect, image processing through a body model, or motion-capture devices attached to the body; regarding the latter, it is a nuisance to wear these sensors carefully in everyday life. Posture recognition using images does not require a sensor attached to the body and therefore does not have this problem. Since posture recognition is performed using images, it can be applied to inexpensive cameras; the device used for acquiring the experimental data is also inexpensive, even though it supports a depth camera. In conventional machine learning, there is a limitation to recognizing posture directly using only an image [12,35–37]. However, owing to advancements in deep learning, good performance in posture recognition can be achieved using only one image. In the present study, several pre-trained CNNs were employed for recognizing the posture of a person using variously preprocessed 2D images. To train the deep neural network for posture recognition, a large number of posture images was required. Therefore, a daily domestic posture database was directly constructed for posture recognition under the assumption of an environment of domestic service robots. The database includes ten postures: "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". The training, validation, and test sets were real captured images, not synthetic images. Moreover, ensemble CNN models were studied to improve performance. The performances of posture recognition using ensemble CNNs with various types of preprocessing, not studied thus far, were compared.
In the case of single models, type 2 preprocessing exhibited 13.63%, 3.12%, 2.79%, and 0.76% higher accuracy than types 0, 1, 3, and 4, respectively, under transfer learning; and VGG19 exhibited 15.78%, 0.78%, 3.53%, 4.11%, 10.61%, and 16.02% higher accuracy than the simple CNN, VGG16 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], respectively, under transfer learning. In the case of ensemble systems, the ensemble combining InceptionResNetV2s using the average of scores was 2.02% more accurate than the best single model in our experiments under non-transfer learning. We performed posture recognition that can be applied to general security cameras to detect and respond to human risks. To do so, we acquired a large amount of data for training a neural network. In order to use the existing CNN models with fixed input forms, various methods were applied to fit the input form. We compared the performance of posture recognition by applying various existing CNN models, and proposed ensemble models based on input forms and CNN models.
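The two combination rules used for the ensembles, majority voting on predicted labels and averaging of class scores, can be sketched as below. The score vectors are hypothetical stand-ins for CNN softmax outputs, not values from the paper's experiments:

```python
import numpy as np

def average_ensemble(score_list):
    """Average the class-score vectors of all models, then take argmax."""
    return int(np.argmax(np.mean(np.stack(score_list), axis=0)))

def majority_ensemble(score_list):
    """Each model votes for its top class; the most frequent label wins
    (ties broken toward the lowest label index by bincount/argmax)."""
    votes = [int(np.argmax(s)) for s in score_list]
    return int(np.bincount(votes).argmax())

# Hypothetical softmax outputs of three models over four posture classes.
scores = [np.array([0.1, 0.6, 0.2, 0.1]),
          np.array([0.2, 0.5, 0.2, 0.1]),
          np.array([0.4, 0.1, 0.4, 0.1])]
avg_label = average_ensemble(scores)    # class 1 has the highest mean score
maj_label = majority_ensemble(scores)   # class 1 wins the vote 2-to-1
```

Note that the two rules can disagree: averaging uses the full score distribution of every model, whereas majority voting discards everything but each model's top choice.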
We performed posture recognition using ensemble preconfigured deep models in home environments. Section 2 describes the deep models of the CNN. Section 3 discusses the database and experimental methods. Section 4 presents the experimental results, and Section 5 concludes the paper.

Deep Models of CNN
There are many pre-trained deep models based on the CNN, such as VGGNet [19], ResNet [20], DenseNet [21], InceptionResNet [22], and Xception [23]. These pre-trained models can be used for transfer learning. Transfer learning reuses the parameters of the models, which are trained from a large-scale database using high-performance processing units for a long time. Transfer learning can be effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data [24].
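As a toy illustration of the transfer-learning idea described above (reusing a frozen feature extractor and training only a small classifier on top), the sketch below replaces the CNN backbone with fixed random features. All sizes and the linear labeling rule are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for features produced by a frozen pre-trained CNN backbone:
# 200 images, 512-dimensional feature vectors, 10 posture classes.
features = rng.normal(size=(200, 512))
true_W = rng.normal(size=(512, 10))              # hypothetical labeling rule
labels = np.argmax(features @ true_W, axis=1)

# Transfer learning trains only this small softmax head; the "backbone"
# that produced `features` is never updated.
W = np.zeros((512, 10))
for _ in range(500):                             # plain gradient descent
    logits = features @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = features.T @ (p - np.eye(10)[labels]) / len(labels)
    W -= 0.1 * grad

train_acc = (np.argmax(features @ W, axis=1) == labels).mean()
```

Because only the small head is optimized, training is cheap even when the backbone itself took weeks to train, which is exactly why transfer learning suits small databases and tight schedules.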

VGGNet
VGGNet ranked second in the ILSVRC 2014, following GoogLeNet, but it is more widely used than GoogLeNet because it has a significantly simpler structure. VGGNet uses relatively small 3 × 3 convolution (conv) filters and a 1 × 1 convolution filter, in contrast to AlexNet's 11 × 11 filters in the first layer and ZFNet's 7 × 7 filters. When the nonlinear rectified linear unit (ReLU) activation function is applied after the 1 × 1 convolution, the model becomes more discriminative. Additionally, a smaller filter requires fewer parameters to be learned and results in higher processing speed. In VGGNet, a maxpool with a 2 × 2 kernel size and stride 2, 2 fully connected (FC) layers with 4096 nodes, 1 FC layer with 1000 nodes, and 1 softmax layer are used. Two stacked 3 × 3 convolution layers and three stacked 3 × 3 convolution layers have effective receptive fields of 5 × 5 and 7 × 7, respectively. By doubling the number of filters after each maxpool layer, the spatial dimensions are reduced, but the network depth is increased. Originally, VGGNet was designed to investigate how errors are affected by the network depth. There are models with layer depths of 8, 11, 13, 16, and 19 in VGGNet. As the network depth increases, the error decreases, but the error increases if the layer depth exceeds 19. VGGNet uses data augmentation with scale jittering for training and was trained with batch gradient descent using four Nvidia Titan Black graphics processing units for approximately three weeks [19]. Figure 1 shows the structure of VGGNet. In an expression of the form L@M × N, L, M, and N represent the size of the map, the row of the kernel, and the column of the kernel, respectively; the @ symbol separates the map size and the filter size.
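The claim that two and three stacked 3 × 3 convolutions have effective receptive fields of 5 × 5 and 7 × 7 can be checked with a small receptive-field calculation (a generic sketch, not code from the paper):

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers, each given as (kernel, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s              # stride compounds the step between outputs
    return rf

two_convs = receptive_field([(3, 1), (3, 1)])             # 5x5 field
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])   # 7x7 field

# Stacked small filters also need fewer weights per channel pair:
# 2 * (3*3) = 18 < 25 = 5*5, and 3 * (3*3) = 27 < 49 = 7*7.
```

This is the parameter-saving argument in the text: the stack covers the same input patch as one large filter while learning fewer weights and inserting extra ReLU nonlinearities.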

ResNet
Deep layers of neural networks result in a vanishing gradient, an exploding gradient, and degradation. The vanishing gradient refers to a propagated gradient becoming too small, and the exploding gradient refers to a propagated gradient becoming too large to train. Degradation indicates that a deep neural network has worse performance than a shallow neural network even though there is no overfitting. ResNet attempts to solve these problems by reusing the input features of the previous layer. Figure 2 shows the structure of ResNet. The output Y is calculated from the input X, and the input X is reused by being added to the output Y. This is called a skip connection. Learning is then performed so that the residual ReLU(W × X) converges to 0, indicating that the output Y is almost equal to X. This reduces the vanishing gradient, and even small changes in the input are delivered to the output. The number of intermediate layers in the section of the skip connection can be set arbitrarily, and ResNet uses this method to stack layers deeply. ResNet is structured according to VGGNet [19] and uses convolutions with 3 × 3 filters, but no pooling or dropout. Pooling is replaced with convolutions of stride 2. After every two convolutions, the input is added to the output [20].
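The skip connection described above can be sketched as follows. `W1` and `W2` are illustrative weight matrices; when the learned residual is zero, the block reduces to the identity, which is why adding such blocks cannot easily make a deeper network worse:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection adds the input back, so even
    when the learned residual F is near zero, the identity survives."""
    return x + W2 @ relu(W1 @ x)

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))
# With zero weights the residual vanishes and the block is the identity.
y_identity = residual_block(x, W_zero, W_zero)
```

During backpropagation the same additive path carries the gradient of `y` straight back to `x`, which is the mechanism the text credits with reducing the vanishing gradient.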

DenseNet
A typical network structure is a sequential combination of convolution, activation, and pooling. In contrast to the typical network, DenseNet solves the degradation problem by introducing a new concept called dense connectivity. DenseNet has approximately 12 filters per layer and uses the dense connectivity pattern to continuously pile up the feature maps of previous layers, effectively conveying the information in the early layers to the later layers. This allows all feature maps within the network to enter the last classifier evenly, while simultaneously reducing the total number of parameters, making the network sufficiently learnable. The dense connection also functions as regularization, which reduces overfitting even for small datasets. Dense connectivity is expressed by Equation (1) and shown in Figure 3:

x_l = H_l([x_0, x_1, . . . , x_(l−1)]), (1)

where [x_0, x_1, . . . , x_(l−1)] denotes the concatenation of the feature maps produced by all preceding layers and H_l is the composite function of the l-th layer. DenseNet divides the entire network into several dense blocks and groups layers with the same feature-map size into the same dense block. The part consisting of pooling and convolution between blocks is called the transition layer. This layer comprises a batch normalization (BN) layer, a 1 × 1 convolution layer for adjusting the dimensions of the feature map, and a 2 × 2 average pooling layer. A bottleneck structure (i.e., BN-ReLU-conv(1)-BN-ReLU-conv(3)) is employed to reduce the computational complexity. Usually, global average pooling is used instead of the FC layer that most networks have as the last layer. The network is trained via stochastic gradient descent [21]. Figure 4 shows the structure of DenseNet.
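Dense connectivity can be sketched as repeated concatenation. The `layer` below is a crude stand-in for DenseNet's BN-ReLU-conv composite function H_l, with an assumed growth rate of 12 feature maps per layer:

```python
import numpy as np

def dense_block(x, layers):
    """Each layer H_l receives the concatenation of all earlier feature
    maps, mirroring DenseNet's dense connectivity pattern."""
    feats = [x]
    for H in layers:
        feats.append(H(np.concatenate(feats)))
    return np.concatenate(feats)

growth = 12                          # ~12 new feature maps per layer
layer = lambda v: np.ones(growth)    # stand-in for BN-ReLU-conv
out = dense_block(np.zeros(24), [layer, layer, layer])
# Output channels = input channels + (number of layers x growth rate).
out_channels = out.shape[0]
```

The linear growth visible here (24 + 3 × 12 channels) is why DenseNet stays parameter-efficient: each layer adds only a small number of feature maps yet still sees everything computed before it.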

InceptionResNetV2
The objective of the inception module is to cover a wide area of the image while maintaining the resolution for smaller information. Thus, convolutions of different sizes from 1 × 1 to 5 × 5 are performed in parallel. The inception module first performs convolution with a 1 × 1 filter and then performs convolutions with filters of different sizes, because the convolution with a 1 × 1 filter reduces the number of feature maps and thus the computational cost. The results of the convolutions performed in parallel are concatenated at the output layer of the inception module. InceptionResNetV2 consists of a stem layer, three types of inception modules (A, B, and C), and two types of reduction modules (A and B). The stem layer is the frontal layer, and the reduction module is used to reduce the size of the feature map in InceptionResNetV2. The inception modules of InceptionResNetV2 are based on the integration of the inception module and the skip connection of ResNet [22]. The stem layer, the three types of inception modules (A, B, and C), the two types of reduction modules (A and B), and the structure of InceptionResNetV2 are shown in Figures 5–8.
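The parallel-branch-then-concatenate pattern of the inception module can be sketched as below. The branches here are trivial channel-slicing stand-ins for the real 1 × 1, 3 × 3, and 5 × 5 convolution paths:

```python
import numpy as np

def inception_module(x, branches):
    """Run all branches (e.g., 1x1, 3x3, 5x5 conv paths) in parallel on
    the same input and concatenate their outputs along the channel axis."""
    return np.concatenate([b(x) for b in branches], axis=0)

x = np.ones((8, 4, 4))        # channels x height x width
b1 = lambda v: v[:2]          # stand-in for a 1x1 conv path, 2 output maps
b2 = lambda v: v[:3]          # stand-in for a 3x3 conv path, 3 output maps
b3 = lambda v: v[:1]          # stand-in for a 5x5 conv path, 1 output map
out = inception_module(x, [b1, b2, b3])
out_shape = out.shape          # (2 + 3 + 1) concatenated channel maps
```

The spatial size is untouched while the channel dimension is the sum of the branch widths, which is exactly how the module mixes receptive-field sizes without committing to one filter size.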


Xception
Xception, which is based on inception, seeks to completely separate the search for relationships between channels from the search for regional information in images. In Xception, a depth-wise separable convolution is performed for each channel, and the result is projected to a new channel space via a 1 × 1 convolution. Whereas the existing convolution creates a feature map considering all the channels and local information together, the depth-wise convolution creates one feature map per channel, and then a 1 × 1 convolution is performed to adjust the number of feature maps. The 1 × 1 convolution is called the pointwise convolution (point-conv). In inception, each convolution is followed by the nonlinearity of the ReLU; however, in a depth-wise separable convolution, the first convolution is not followed by a nonlinearity. Xception has 36 convolution layers in 14 modules for feature extraction. Except for the beginning and end layers, each module has a linear residual connection. In summary, Xception is formed by linearly stacking depth-wise separable convolution layers with residual connections [23]. Figure 6 shows the structure of Xception.
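The efficiency of the depth-wise separable convolution follows from a simple parameter count; the channel and kernel sizes below are illustrative, not taken from Xception's actual configuration:

```python
# Parameter counts (ignoring biases) for a conv layer mapping
# c_in = 64 channels to c_out = 128 channels with a 3x3 kernel.
c_in, c_out, k = 64, 128, 3

standard = c_in * c_out * k * k   # ordinary convolution: 73,728 weights
depthwise = c_in * k * k          # one 3x3 filter per input channel
pointwise = c_in * c_out          # 1x1 conv mixing the channels
separable = depthwise + pointwise # 576 + 8,192 = 8,768 weights
```

Splitting spatial filtering (depth-wise) from channel mixing (pointwise) yields roughly an order of magnitude fewer parameters here, which is the separation of concerns Xception is built around.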

Construction of Database for Posture Recognition
Training a deep neural network for posture recognition requires a large amount of posture images. Thus, the daily domestic posture database was constructed for posture recognition under the assumption of an environment of domestic service robots. Astra (i.e., an assembly of sensors for color images (red, green, and blue (RGB)), depth images (infrared rays), and sound sources (microphones) made by Orbbec), was used to construct a daily domestic posture database. It senses the depth from 0.6 to 8 m and the color inside a field of view of 60° × 49.5° × 73°. The resolution and frame rate are 640 × 480 pixels and 30 fps, respectively, for the RGB and depth images. The size of Astra is 165 × 30 × 40 m 3 , its power consumption is 2.4 W (through Universal Serial Bus (USB) 2.0), and it has two microphones. Figure 7 shows the Astra for capturing posture images. It can be developed using the Astra software development kit (SDK) and OpenNI in several operating systems (e.g., Android, Linux, and Windows 7, 8, and 10). The SDK also supports body tracking with a skeleton [38]. To construct the database, a graphical user interface (GUI) for the capture tool was developed using the

Construction of Database for Posture Recognition
Training a deep neural network for posture recognition requires a large number of posture images. Thus, a daily domestic posture database was constructed for posture recognition under the assumption of an environment of domestic service robots. Astra (i.e., an assembly of sensors for color images (red, green, and blue (RGB)), depth images (infrared rays), and sound sources (microphones) made by Orbbec) was used to construct the daily domestic posture database. It senses depth from 0.6 to 8 m and color inside a field of view of 60° × 49.5° × 73°. The resolution and frame rate are 640 × 480 pixels and 30 fps, respectively, for the RGB and depth images. The size of Astra is 165 × 30 × 40 mm³, its power consumption is 2.4 W (through Universal Serial Bus (USB) 2.0), and it has two microphones. Figure 7 shows the Astra used for capturing posture images. Applications can be developed using the Astra software development kit (SDK) and OpenNI in several operating systems (e.g., Android, Linux, and Windows 7, 8, and 10). The SDK also supports body tracking with a skeleton [38]. To construct the database, a graphical user interface (GUI) for the capture tool was developed using the Microsoft Foundation Class library and the Astra SDK in Windows 7. Figure 8 shows the developed GUI for constructing the posture database.
A total of 51 homes participated in the construction of the daily domestic posture database. These homes had a living room, a kitchen, a small room, a dinner table, chairs, a bed, and a sofa. More than two persons in each home equally contributed as subjects for the posture images. Ten postures were defined to construct the daily domestic posture database. The postures were "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". Each home generated 100 images per posture, and each image included one subject. Each home generated 1000 posture images (10 postures × 100 images per posture). Thus, the total number of images was 51,000 (1000 posture images × 51 homes). Each image was captured by varying the position, type of room, prop (such as clothes), furniture (such as a chair), pose of the sensor, pose of the person, and small movement. Figure 9 shows the environment for capturing the posture images using Astra. Figure 10 shows examples of posture images captured using Astra.

Preprocessing Types
To segment the person images, You Only Look Once (YOLO)-v3 was used as a person detector [39]. After segmenting the person, preprocessing was performed to consider various input methods, which have different pros and cons, because the neural networks are limited to a fixed-size square input image. The pros and cons concern how tightly the person is cropped and how much the person is distorted by stretching the image. Five types of preprocessing for cropping images were defined to input posture images into the neural network. In type 0, the original image was resized to the size of the input layer; this changed the original ratio of the image. In type 1, the person image was segmented from the original image while satisfying the size of the input layer; the original ratio of the image was maintained. In type 2, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized while maintaining the ratio of the original image until the row or column size of the person equaled the row or column size of the input layer, respectively. The extra border due to the difference in image ratio between the person and the input layer was zero-padded. In type 3, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized to the size of the input layer; this changed the original ratio of the image. In type 4, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized while maintaining the ratio of the original image until the row or column size of the person equaled the row or column size of the input layer, respectively. The extra border due to the difference in image ratio between the person and the input layer was filled by replicating the edge of the original image.
Figure 11 shows the examples of five types of original and preprocessed images.

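As an illustration of types 2 and 4 (a sketch under our own assumptions, not the authors' code), the following NumPy function fits a tightly segmented person crop into a square input while preserving the aspect ratio, filling the leftover border either with zeros (type 2) or by replicating the edge (type 4). A nearest-neighbour resize keeps the example dependency-free:

```python
import numpy as np

def fit_square(person, size, border="zero"):
    """Resize `person` (H, W[, C]) so its longer side equals `size`, keeping the
    aspect ratio, then pad the shorter side to `size`.
    border='zero' -> type 2 (zero padding); border='edge' -> type 4 (replication)."""
    h, w = person.shape[:2]
    scale = size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    # Nearest-neighbour resize via index sampling.
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    resized = person[rows][:, cols]
    # Center the resized person and pad the remaining border.
    top, left = (size - nh) // 2, (size - nw) // 2
    pad = [(top, size - nh - top), (left, size - nw - left)]
    pad += [(0, 0)] * (person.ndim - 2)
    return np.pad(resized, pad, mode="constant" if border == "zero" else "edge")
```

Type 3 would correspond to resizing the crop to the input size directly (distorting the ratio), and type 0 to applying that resize to the whole original image instead of the crop.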

Posture Recognition Using Ensemble CNN Models
As more people live alone, there is a need for ways to cope with crime or dying alone. Posture recognition has emerged as a solution because posture contains information about a person's situation. However, in posture recognition using conventional inertial sensors, the sensor is cumbersome to attach to the body. For the elderly, it is difficult to wear the sensor all the time, and extra help is needed for them to do so. In contrast, image-based posture recognition through a camera can be applied to existing camera systems, and there is no need to attach any sensors. In addition, low-cost webcams can also be used because the posture is recognized only from the image. It has been difficult to recognize posture using only two-dimensional images, but recently, deep learning has proved excellent in various fields. We applied this technique to 2D image-based posture recognition. First, a daily domestic posture database was constructed. Then, the posture images were subjected to different types of preprocessing. The CNN models were VGG16, VGG19 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23]. Transfer learning is effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data. In this study, posture recognition was performed by training the CNNs using transfer learning. Most of the parameters used to extract the feature maps were fixed, and only the final FC layers and Softmax were trained. In addition, the CNNs were trained by updating all the parameters (nontransfer learning). Figure 12 shows the posture recognition using the preprocessed images for different CNN models. Then, the CNNs of the deep models and the CNNs of the preprocessing types were combined using the majority of outputs and the average of scores as ensemble methods. The majority of outputs selects the most voted class among the outputs of the CNNs. The average of scores selects the class with the maximum score from the averaged scores of the CNNs. Tables 1 and 2 show the ensemble methods of majority vote and score average. Figure 13 shows the various ensemble systems by input types and CNN models.
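The two combination rules can be sketched as follows (a minimal NumPy illustration of the majority-vote and score-average rules described above; the array shapes are our assumption):

```python
import numpy as np

def majority_vote(score_list):
    """score_list: list of (n_samples, n_classes) score arrays, one per CNN.
    Each network votes for its argmax class; the most voted class wins."""
    votes = np.stack([s.argmax(axis=1) for s in score_list])   # (n_models, n_samples)
    n_classes = score_list[0].shape[1]
    # Count the votes each class received, per sample.
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)                               # (n_samples,)

def score_average(score_list):
    """Average the class scores over all CNNs, then pick the max-score class."""
    return np.mean(score_list, axis=0).argmax(axis=1)
```

Score averaging keeps each network's confidence in play, which is one plausible reason the averaged ensemble outperforms plain voting in the experiments reported below.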


Table 1 describes the ensemble method of majority vote: each network outputs the scores of each class as a 1 × n matrix, and its output is obtained by picking the class with the maximum score as a one-hot matrix. Table 2 describes the ensemble method of score average.

In the score-average method, each network likewise outputs the scores of each class as a 1 × n matrix, and the scores are averaged before the class is selected. The number of postures is the number of classes. The number of defined postures was 10: "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". The 10 posture classes were divided into four categories: standing, sitting, lying, and lying crouched. Instead of 10 and 4 classes, there were 11 and 5 when other images, such as the background, were considered as a separate class. The results of posture recognition under 10, 4, 11, and 5 classes were obtained. The other images were configured by programmatically cropping the original posture images at random while avoiding the person to the greatest extent possible. These various numbers of classes have the advantage of extracting context information; for example, a person does not sleep in a standing posture. Figure 14 shows the four categories. The posture-recognition performance was evaluated with regard to accuracy, which was defined as the number of correct classifications (CC) divided by the number of total classifications (i.e., the sum of CC and the number of wrong classifications (WC)) [40]: Accuracy = CC / (CC + WC).
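The accuracy measure defined above can be computed directly from a confusion matrix, whose diagonal holds the correct classifications (a small sketch of our own, not from the paper):

```python
import numpy as np

def accuracy(confusion):
    """Accuracy = CC / (CC + WC): diagonal over total of the confusion matrix."""
    confusion = np.asarray(confusion)
    cc = np.trace(confusion)      # correct classifications (CC)
    total = confusion.sum()       # CC + WC
    return cc / total
```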

Database
The daily domestic posture database was directly constructed from 51 homes using Astra. All homes had a living room, a kitchen, a small room, a dinner table, chairs, a bed, and a sofa. The postures included "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". Each home had 100 images per posture, and each image included one subject. Each home had 1000 posture images (10 postures × 100 images per posture). The total number of images was 51,000 (1000 posture images × 51 homes). The images were captured by varying the position, type of room, prop (such as clothes), furniture (such as a chair), pose of the sensor, pose of the person, and small movement. The posture database was constructed with men and women ranging from 19 to 68 years of age. Figure 15 shows the data configuration for the training, validation, and testing. The posture images were subjected to different types of preprocessing. To segment the person image, YOLO-v3 was used as a person detector [39], and the segmented person images were filtered manually. The number of preprocessed posture images for the training was 3910 for standing normal, 3802 for standing bent, 3899 for sitting sofa, 3899 for sitting chair, 3896 for sitting floor, 3859 for sitting squat, 3752 for lying face, 3777 for lying back, 3773 for lying side, 3505 for lying crouched, and 6802 for other images. The number of preprocessed posture images for the validation was 689 for standing normal, 670 for standing bent, 688 for sitting sofa, 687 for sitting chair, 687 for sitting floor, 681 for sitting squat, 662 for lying face, 666 for lying back, 665 for lying side, 618 for lying crouched, and 1200 for other images. The number of


Experimental Results
The computer used in the experiment had the following specifications: Intel i7-6850K central processing unit at 3.60 GHz, Nvidia GeForce GTX 1080 Ti, 64 GB of random-access memory, and Windows 10 64-bit operating system. The CNN models used for posture recognition in this study were VGG16, VGG19 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], which are pre-trained CNNs. A pre-trained CNN can be used for transfer learning when new data need to be classified efficiently. Most of the parameters used to extract the feature maps were fixed, and only the final FC layers and Softmax were trained. A simple CNN (the conventional method) was added to the experiment for performance comparison. The structure of the simple CNN is depicted in Figure 16. The simple CNN was trained by updating all the parameters, because it was not a pre-trained model. Here, the number of postures was the number of classes. The number of defined postures was 10: "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". If a class of "other images" (e.g., background) was added to the 10 posture classes, then the number of total classes was 11. The training, validation, and testing data sets for the 11 classes consisted of 44,874, 7913, and 6934 images, respectively. Additionally, the 10 posture classes were divided into four categories: standing, sitting, lying, and lying crouched. If a class for "other images" was added to these four posture classes, the number of total classes was five. The performance for the other numbers of classes was measured through simple mapping of the trained model with the data of 11 classes. The simple CNN was trained with a batch size of 128, 50 epochs, and the RMSProp optimizer [41]. The other pre-trained models were trained via transfer learning with a batch size of 128, 30 epochs, and the Adadelta optimizer [42]. 
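The "simple mapping" from the 11-class model to the coarser class sets can be expressed as a lookup table. The index order below is our assumption for illustration; only the grouping into standing, sitting, lying, lying crouched, and other follows the text.

```python
# Assumed 11-class index order: the 10 postures as listed in the text, then "other".
POSTURES = ["standing normal", "standing bent", "sitting sofa", "sitting chair",
            "sitting floor", "sitting squat", "lying face", "lying back",
            "lying side", "lying crouched", "other"]
# Category per posture index: 0 standing, 1 sitting, 2 lying, 3 lying crouched, 4 other.
CATEGORY = [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 4]

def map_to_5_classes(predictions):
    """Map 11-class predictions to the 5-class (4 categories + other) labels."""
    return [CATEGORY[p] for p in predictions]
```

With such a mapping, the 11-class model can be evaluated on the 5-class task without retraining, which is how the performance for the other numbers of classes was measured.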
The highest accuracies without data augmentation under transfer learning were 78.35% for VGG19 and type 2 preprocessing with the posture data of 11 classes, 69.89% for VGG19 and type 2 preprocessing with the posture data of 10 classes, 88.86% for VGG19 and type 2 preprocessing with the posture data of five classes, and 88.50% for ResNet50 and type 1 preprocessing with the posture data of four classes. Table 3 presents the classification performance of the CNNs without data augmentation under transfer learning. Figure 17 shows the confusion matrix for VGG19 with type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes without data augmentation under transfer learning. Tables 4-6 present the accuracies of the CNNs for the posture data of 10, 5, and 4 classes, respectively, without data augmentation under transfer learning.
Next, the experiment was performed in the same way, but data augmentation was added for the training. The data were augmented with a rotation range of 10, a shear range of 10, a zoom range of 0.2, and a horizontal flip of "true". The highest accuracies with data augmentation under transfer learning were 76.54% for VGG19 and type 2 preprocessing with the posture data of 11 classes, 67.47% for VGG19 and type 2 preprocessing with the posture data of 10 classes, 87.92% for VGG19 and type 2 preprocessing with the posture data of five classes, and 85.44% for DenseNet201 and type 2 preprocessing with the posture data of four classes. Table 7 presents the classification performance of the CNNs with data augmentation under transfer learning. Figure 18 shows the confusion matrix for VGG19 and type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes with data augmentation under transfer learning. Tables 8-10 present the accuracies of the CNNs for the posture data of 10, 5, and 4 classes, respectively, with data augmentation under transfer learning. The CNNs were also trained via nontransfer learning. VGG19, ResNet50, ResNet101, and InceptionResNetV2 were considered for nontransfer learning.
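The augmentation settings listed above correspond to a generator configuration along the following lines (a sketch assuming the Keras `ImageDataGenerator` API, which the paper does not state explicitly):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters as reported in the text:
# rotation 10, shear 10, zoom 0.2, horizontal flip "true".
train_datagen = ImageDataGenerator(
    rotation_range=10,      # degrees of random rotation
    shear_range=10,         # shear intensity
    zoom_range=0.2,         # random zoom in [0.8, 1.2]
    horizontal_flip=True,
)
```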
The first three models were trained with a batch size of 60, three epochs, and the Adam optimizer [43]; the last model was trained with a batch size of 20, one epoch, and the Adam optimizer. The highest accuracy was 93.32%, achieved by InceptionResNetV2 with type 2 preprocessing for the posture data of 11 classes without data augmentation under nontransfer learning. Table 11 presents the classification performance for 11 classes without augmentation under nontransfer learning. Table 12 presents the training time of a single model for 11 classes without augmentation under nontransfer learning. Figure 19 shows the training process for InceptionResNetV2 and type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes without data augmentation under nontransfer learning. Figure 20 shows the activations for InceptionResNetV2 and type 2 preprocessing. Figure 21 shows the activations of the last ReLU for ResNet50 and type 2 preprocessing. The trained CNN models from Table 11 were combined using the majority of outputs and the average of scores. First, the five CNNs trained with different input types were combined under VGG19, ResNet50, ResNet101, and InceptionResNetV2 as EV19TNet, ER50TNet, ER101TNet, and EIR2TNet, respectively. Table 13 describes the classification performance of the ensemble deep models designed by input types. Second, the four CNNs trained with different models were combined under input types 0 to 4 as ET0MNet, ET1MNet, ET2MNet, ET3MNet, and ET4MNet, respectively. Table 14 indicates the classification performance of the ensemble deep models designed by the pre-trained CNNs. Table 15 indicates the training time of the ensemble system for 11 classes without augmentation under nontransfer learning. Among the ensemble systems, the highest accuracy was 95.34%, achieved by the ensemble of InceptionResNetV2s using the average of scores. The classification performances listed in Tables 3-10 were computed as average values.
Figure 22 shows the average values of the classification performance under transfer learning for the different types of preprocessing. The experimental results show that preprocessing of type 2 achieved a 13.63%, 3.12%, 2.79%, and 0.76% higher classification rate than types 0, 1, 3, and 4, respectively. Figure 23 shows a comparison of the average total accuracies for the different models under transfer learning. Here, 'SimCNN', 'DenseNet', and 'IncResNet' denote the simple CNN, DenseNet201, and InceptionResNetV2, respectively. The experimental results show that VGG19 achieved a 15.78%, 0.78%, 3.53%, 4.11%, 10.61%, and 16.02% higher classification performance than the simple CNN, VGG16 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], respectively.
Figure 24 visualizes the classification performance of the ensemble deep models from the results listed in Tables 13 and 14. Here, 'BestSingle' is the best result of the single models listed in Table 11. As shown in Figure 24, the ensemble deep model by InceptionResNetV2s with the average method showed the best classification performance in comparison to the other models.