Multi-Channel Convolutional Neural Network Based 3D Object Detection for Indoor Robot Environmental Perception

Environmental perception is a vital capability for service robots that work in an indoor environment for a long time. General 3D reconstruction provides a low-level geometric description that cannot convey semantics. In contrast, higher-level perception similar to that of humans requires more abstract concepts, such as objects and scenes. Moreover, 2D object detection based on images fails to provide the actual position and size of an object, which are quite important for a robot's operation. In this paper, we focus on 3D object detection to regress the object's category, 3D size, and spatial position through a convolutional neural network (CNN). We propose a multi-channel CNN for 3D object detection, which fuses three input channels: RGB, depth, and bird's eye view (BEV) images. We also propose a method to generate 3D proposals based on 2D proposals in the RGB image and a semantic prior. Training and testing are conducted on the modified NYU V2 dataset and the SUN RGB-D dataset in order to verify the effectiveness of the algorithm. We also carry out real-world experiments on a service robot, using the proposed 3D object detection method to enhance the robot's environmental perception.


Introduction
The traditional environmental perception of indoor service robots mainly solves the problems of localization, navigation, and obstacle avoidance in order to carry out autonomous movement. However, most of these studies focus on the description of the geometric information of an environment. The high-level perception of the environment for the indoor service robots requires more abstract information, such as semantic information like objects and scenes. With the rapid development of computer vision and artificial intelligence, many intelligent detection technologies have emerged, such as image-based object detection and scene recognition. If the service robot can recognize the common objects in the environment, it will greatly improve the environmental perception ability and the semantic information, and the 3D proposal bounding box is determined according to the a priori size of the object.
The structure of this paper is organized as follows: Section 2 summarizes the recent research in the area of 3D object detection; Section 3 describes the proposed multi-channel CNN for 3D object detection in detail; Section 4 presents experimental results of the algorithm; and finally, Section 5 summarizes the content of the article.

Related Work
Object detection is one of the main tasks in both computer vision and robotics. Beyond recognizing and classifying objects, object detection requires placing a bounding box around each object. In 2D object detection, popular traditional methods include selective search, which is based on region proposals, and DPM (deformable parts model) [21]. Recently, deep learning methods such as RCNN [4], Fast-RCNN [5], Faster-RCNN [6], YOLO [22][23][24], and R-FCN [25] have greatly improved the accuracy and efficiency of the 2D object detection task. However, the application of 2D object detection is still quite limited in robotics due to the lack of 3D information such as location, direction, and size. For example, if an indoor service robot recognizes a cup on a table from an RGB image, it does not know how to grasp it, as the relative location of the cup in space is unknown. With the emergence of 3D sensors, numerous works have utilized 3D information to conduct 3D object detection, obtaining the category, 3D size, and position of an object. Here, we briefly summarize some of the methods that utilize the depth information gained from the sensors to detect 3D objects in the environment.

Voxel and Point Cloud-Based Approach
Some of the existing works, such as Vote3Deep [26] and Deep Sliding Shapes [27], represent 3D point clouds with voxels and extend general image CNN methods to 3D space. However, these methods are limited by the sparsity of the point clouds and the high cost of computation. In contrast, PointNet [28], PointNet++ [29], and Frustum PointNets (F-PointNet) [10] can directly process raw, unordered point clouds. As the F-PointNet method relies on 2D object detection from images, missed 2D detections also lead to missed 3D detections.

3D Multi-View Approach
In order to extract more useful information from point clouds and to improve the efficiency of detection, the multi-view work [30] represents the volumetric information of a point cloud from different views. It evaluates how views affect the detection accuracy and presents a strong multi-view (MV) classifier that accounts for different object views. MV3D [15] combines the region proposal method with a multi-view representation. The network extracts 3D proposals from the bird's eye view (BEV), and the 3D proposals are then projected to the BEV, front view, and RGB image to obtain 2D proposals. AVOD [16] also utilizes RGB and BEV images to extract features and fuses them to obtain 3D proposals for regression. The multi-view approach is commonly used in autonomous driving, as the BEV image preserves the physical information of objects and avoids potential occlusion. We find that this approach can also be used for indoor object detection, as the BEV image describes information about objects that is not available from RGB and depth images alone. In this paper, we employ the BEV to enhance 3D object detection in an indoor environment.

2.5D Image Approach
Another way to achieve the goal is to treat a depth image the same as a color image and extract features. The previous work [31] fuses the information from color images and depth images at an early stage and trains pose-based classifiers for 2D pedestrian detection. The paper [32] also includes a multimodality component in their framework that explores the fusion of RGB and depth images obtained by high-definition light detection and ranging. The other paper [33] applies the state-of-the-art 2D object detection on a color image first and uses 3D information to orient, place, and score bounding boxes around the objects. Amodal3Det [34] uses the color and depth images to extract features and build models to convert 2D results to 3D space. However, in practice, the rebuilt 3D models are always incomplete because of occlusion and reflection, which may lead to an inaccuracy in the regression. Our algorithm is inspired by the Amodal3Det method. To avoid the inaccuracy of an incomplete depth image, we combine this 2.5D fusion approach with the BEV image to express more 3D information.

Multi-Channel 3D Object Detection CNN
Since an image is a 2D projection of a 3D scene, it cannot fully express 3D information. A depth image expresses the distance from an object to the camera plane, and a BEV image describes the spatial distribution of objects from a viewpoint perpendicular to the camera's viewing direction. Therefore, fusing multiple kinds of information benefits object detection in 3D space. In this paper, a 3D object detection CNN combining three channels of information is designed. RGB, depth, and BEV images are used as the input of the network, and the object's category, 3D size, and spatial position are regressed.
The designed multi-channel object detection neural network system is shown in Figure 1. Fast R-CNN is employed as the basic network structure to extend 2D image object detection to 3D object detection. The input is extended to three channels: an RGB image, a depth image, and a BEV image. VGG16 is utilized as the main convolutional network structure for feature extraction to enhance the learning of 3D spatial information. We employ the multiscale combinatorial grouping (MCG) [35] algorithm to generate many 2D proposals in the RGB image. As the depth image has the same view angle as the RGB image, the same 2D proposals are shared by the depth image. Then, we combine the 2D proposals with the depth information and semantic prior knowledge to generate 3D proposals, and project them onto the BEV plane. In this way, we obtain 2D proposals for each channel. The 2D proposals in each channel are then used to generate feature vectors of the same dimension through a single-layer spatial pyramid pooling layer, and all vectors are concatenated into a single vector. Finally, a multi-task regression is performed through two fully connected layers to predict the object category and the 3D bounding box.

Input Data Generation for Convolutional Neural Network
In order to increase the network's ability to perceive 3D information, a BEV image is generated from the RGB and depth images and used as an additional input to the convolutional neural network. Following the way an RGB image is processed as the input of a convolutional neural network, the depth and BEV images are quantized into the pixel range [0, 255] before being fed to the network. The depth image acquired by the RGB-D sensor stores distance information; given a defined maximum depth $z_{max}$, each depth value $z$ is quantized to the image pixel range.
A point cloud is generated from the RGB and depth images, projected onto the ground plane, and rasterized to generate a BEV image. Each pixel of the BEV image represents the number of points in the corresponding 2D grid cell. In a typical indoor environment, the range of depth images acquired by the RGB-D sensor is limited, so the range of the point cloud is restricted in order to generate BEV images of the same size. To illustrate this clearly, the projection of the camera coordinate system onto the BEV plane is shown in Figure 2. The origin $o_c$ is the projection of the origin of the camera coordinate system, and the axes $o_c x_c$ and $o_c z_c$ follow the corresponding axes of the camera coordinate system. Let the coordinate system have a range of $[x_{min}, x_{max}]$ in the x-axis direction and $[0, z_{max}]$ in the z-axis direction, and let $\delta$ be the resolution of the grid; then the size of the BEV image is $((x_{max} - x_{min})/\delta) \times (z_{max}/\delta)$. Because the density of the point cloud varies across space, the number of points projected into each cell differs greatly, which is not conducive to data processing. Therefore, the count is logarithmically transformed and converted to the image pixel range. Let the number of points in one cell of the BEV image be $n$ and the maximum number of points in one cell be $N$; the quantized pixel value of each cell is computed from $n$ and $N$.
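The input generation described above can be sketched as follows. This is a minimal illustration rather than the exact implementation: the linear depth scaling `255 * z / z_max` and the logarithmic scaling `255 * log(n + 1) / log(N + 1)` are assumed forms of the two quantization steps, and the function names `quantize_depth` and `bev_image` are hypothetical.

```python
import numpy as np

def quantize_depth(depth_m, z_max):
    """Map depth values (metres) to [0, 255]; linear scaling is an assumption."""
    clipped = np.clip(depth_m, 0.0, z_max)
    return np.round(255.0 * clipped / z_max).astype(np.uint8)

def bev_image(points_xyz, x_min, x_max, z_max, delta):
    """Rasterize camera-frame points onto the x-z plane and log-scale the counts."""
    width = int(round((x_max - x_min) / delta))
    height = int(round(z_max / delta))
    x, z = points_xyz[:, 0], points_xyz[:, 2]
    # Keep only points inside the fixed range so every BEV image has the same size.
    keep = (x >= x_min) & (x < x_max) & (z >= 0.0) & (z < z_max)
    cols = np.minimum(((x[keep] - x_min) / delta).astype(int), width - 1)
    rows = np.minimum((z[keep] / delta).astype(int), height - 1)
    counts = np.zeros((height, width), dtype=np.int64)
    np.add.at(counts, (rows, cols), 1)  # accumulate the number of points per cell
    n_max = counts.max()
    if n_max == 0:
        return np.zeros((height, width), dtype=np.uint8)
    # Assumed log transform: log(n + 1) / log(N + 1), scaled to [0, 255].
    scaled = 255.0 * np.log1p(counts) / np.log1p(n_max)
    return np.round(scaled).astype(np.uint8)
```

The logarithm compresses the large spread of per-cell counts, so sparsely populated cells remain visible next to dense ones after quantization.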

2D and 3D Proposals Generation Based on Semantic Prior
The main idea of Fast R-CNN for object detection is to obtain the feature map of a whole image through the convolutional neural network and to employ a spatial pyramid pooling layer to extract features from the pre-generated 2D proposals in the image. A classifier determines which category the features extracted from each proposal belong to, and a regressor adjusts the position of the proposal. This paper draws on this region-proposal-based detection strategy to extend 2D image detection to spatial 3D object detection. Therefore, it is necessary to first acquire the 3D proposal of the object, and then to perform category detection and 3D position and size regression on this basis. In addition, the obtained 3D proposal must be projected back onto the BEV image to obtain a proposal that is consistent with the other input channels.

3D Proposal Parameter Representation
The regression of the 3D object is parameterized as a seven-dimensional vector $(x_c, y_c, z_c, l, w, h, \theta)$. Here, $(x_c, y_c, z_c)$ is the coordinate of the center point of the object bounding box in the camera coordinate system; $(l, w, h)$ are the length, width, and height of the bounding box, respectively; and $\theta$ is the angle (in the range $[-\pi/2, \pi/2]$) between the z-axis direction of the camera coordinate system and the longer edge of the bounding box projected onto the xz plane. The initial value of the object center point can be computed from the point cloud corresponding to the proposal. Since measurement noise always exists and data are often missing from the point cloud, the median value in the z-axis direction is taken as the initial value $z_c$, and $x_c$ and $y_c$ are then solved with the camera parameters:

$$x_c = \frac{(c_x - u)\, z_c}{f_x}, \qquad y_c = \frac{(c_y - v)\, z_c}{f_y}$$

where $(c_x, c_y)$ is the center point of the proposal in the image; $(u, v)$ is the center point of the image; and $f_x$, $f_y$ are the focal lengths.
Since the point cloud of the proposal may contain a background point cloud other than the object, the error will be large if the initial size of the object is directly obtained by the point cloud. For common objects in the indoor environment, such as sofas and chairs, similar objects usually have similar dimensions; therefore, the average size of the objects on the dataset can be used as a priori knowledge to determine the size of the 3D proposal. In addition, since it is difficult to estimate the directional angle of the 3D proposal in the initial stage, all the initial angle values are set to zero for the sake of simplicity.
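The initialization of a 3D proposal from a 2D proposal can be sketched as below. It is an illustrative sketch under stated assumptions: the pinhole back-projection is used for $x_c$ and $y_c$, a depth value of 0 is assumed to mark missing measurements, and the function name `initial_3d_proposal` and its argument layout are hypothetical.

```python
import numpy as np

def initial_3d_proposal(box_2d, depth_m, fx, fy, u, v, prior_size):
    """Build (x_c, y_c, z_c, l, w, h, theta) from a 2D box and a depth image.

    box_2d     : (x1, y1, x2, y2) in pixels
    depth_m    : depth image in metres; 0 marks missing data (assumption)
    prior_size : (l, w, h) average size of the proposal's category
    """
    x1, y1, x2, y2 = box_2d
    patch = depth_m[y1:y2, x1:x2]
    valid = patch[patch > 0]
    if valid.size == 0:
        return None  # no usable depth inside the proposal
    # The median is robust to noise and to background points inside the box.
    z_c = float(np.median(valid))
    c_x, c_y = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # Pinhole back-projection of the 2D box center at depth z_c.
    x_c = (c_x - u) * z_c / fx
    y_c = (c_y - v) * z_c / fy
    l, w, h = prior_size
    return (x_c, y_c, z_c, l, w, h, 0.0)  # initial angle is set to zero
```

Taking the size from the category prior instead of the point cloud extent avoids inflating the box when background points fall inside the 2D proposal.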

Proposals Generation
The 3D object detection neural network proposed in this paper is based on the Fast R-CNN network structure. The single-layer spatial pyramid is utilized to pool the object proposal of different sizes in the feature map in order to obtain the output vector with the same dimension. Therefore, it is necessary to obtain object proposals in three channels separately. Since the RGB image and the depth image are acquired under the same view angle, the same proposal can be shared. However, the BEV image is obtained using the point cloud projection and has a constraint relationship with the RGB and depth images. Thus, it is vital to solve how to generate the object proposal in all three channels.
Searching for a suitable proposal box in a 3D space is usually computationally intensive and inefficient. Since the method of extracting the proposal from a 2D image is relatively mature, the 3D information can be acquired by combining the depth image. Multiscale combinatorial grouping [35] (MCG), which generates 2D bounding box proposals in RGB images, usually generates thousands of object proposals of different sizes. In the camera coordinate system, a depth image can be combined to generate a point cloud for each proposal box. According to the representation of the 3D proposal in Section 3.2.1, the center point of the proposal can be computed.
During training, the proposal samples need to be divided into positive samples and negative samples. In order to determine the possible category of a proposal, the IoU (intersection over union) is calculated between the proposal and each 2D ground truth bounding box in the image, and the category corresponding to the largest value is selected. Inspired by image semantic segmentation, we can obtain the pixel-level semantic category in the image and count the number of pixels of the same semantic category in the 2D proposal in order to estimate the probability that the proposal belongs to a certain category. This information can be used to separate the positive and negative samples during training. In order to better select the positive and negative samples, the ground truth in the RGB image is also used when calculating the category probability of each proposal. Let the number of pixels inside the ground truth bounding box that belong to the category of the ground truth be $n_{gt}$; the area of the ground truth bounding box be $S_{gt}$; the number of pixels of that category in the intersection area (the common area between the 2D proposal and the ground truth bounding box) be $n_{prop}$; and the area of the proposal box be $S_{prop}$. The score of the proposal belonging to the category is calculated from these four quantities. The scores between the proposal box and each ground truth bounding box are calculated, and the category of the ground truth with the maximum score is taken as the category of the proposal.
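The plain IoU-based category assignment mentioned at the start of this paragraph can be sketched as follows. This covers only the IoU step, not the segmentation-based score; the function names `iou_2d` and `assign_category` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou_2d(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def assign_category(proposal, gt_boxes, gt_labels):
    """Return the label and IoU of the ground truth box that best overlaps the proposal."""
    scores = [iou_2d(proposal, gt) for gt in gt_boxes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return gt_labels[best], scores[best]
```

For example, `assign_category((0, 0, 2, 2), [(1, 1, 3, 3), (0, 0, 2, 3)], ["chair", "sofa"])` selects `"sofa"`, because the second box overlaps the proposal more.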
A diagram of the system that generates the 2D and 3D proposals required for training is shown in Figure 3. First, the MCG algorithm is applied to obtain the 2D proposals in the image. The semantic segmentation of the image is obtained using the fully convolutional network (FCN) algorithm. Then, the most probable category of each proposal is calculated. Finally, an initial value of the 3D proposal corresponding to each 2D proposal is obtained by combining the depth image and the a priori size of the object. In order to obtain the proposal in the BEV image, the 3D proposal is projected onto the BEV image, so that a BEV image proposal is obtained for each proposal in the image.
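The last step, projecting a 3D proposal onto the BEV image, can be sketched as follows. The sketch assumes that the BEV grid starts at `x_min` along x and at 0 along z with resolution `delta`, that the longer edge `l` makes the angle `theta` with the z-axis, and that the BEV proposal is the axis-aligned box enclosing the projected footprint; the function name `bev_proposal` is hypothetical.

```python
import math

def bev_proposal(x_c, z_c, l, w, theta, x_min, delta):
    """Axis-aligned BEV box (col1, row1, col2, row2), in pixels, enclosing the footprint."""
    # Half extents of the rotated l-by-w rectangle along the x and z axes.
    half_x = 0.5 * (abs(l * math.sin(theta)) + abs(w * math.cos(theta)))
    half_z = 0.5 * (abs(l * math.cos(theta)) + abs(w * math.sin(theta)))
    col1 = (x_c - half_x - x_min) / delta
    col2 = (x_c + half_x - x_min) / delta
    row1 = (z_c - half_z) / delta
    row2 = (z_c + half_z) / delta
    return (col1, row1, col2, row2)
```

With the initial angle set to zero, the box reduces to `w` along x and `l` along z. Clipping to the image bounds is omitted here for brevity.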

3D Bounding Box Regression of Objects
The designed multi-channel convolutional neural network extracts the features of the RGB image, depth image, and BEV image. The features in the 2D proposals are transformed into vectors of uniform size through the RoI pooling layer and concatenated into a single vector. Finally, the predicted results of the network are obtained by the classifier and the 3D bounding box regressor. For each positive sample, the output of the network is expressed relative to the ground truth bounding box as a seven-dimensional vector $(\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta\theta)$. For each proposal, the ground truth category of maximum probability is obtained according to the above method, and the regression targets are obtained by normalizing the difference between the ground truth value and the 3D proposal value, where $(x_c, y_c, z_c, l, w, h, \theta)$ is the 3D proposal generated from the 2D proposal, and $(x_{gt}, y_{gt}, z_{gt}, l_{gt}, w_{gt}, h_{gt}, \theta_{gt})$ is the ground truth 3D bounding box corresponding to the maximum probability of the 2D proposal.

Multi-Task Loss Function
For joint training of classification and bounding box regression, the designed multi-task loss function is:

$$L = L_{cls} + \lambda L_{3DBB}$$

where $L_{cls}$ is the classification loss, using the softmax loss; $L_{3DBB}$ is the 3D bounding box regression loss, using the smooth $L_1$ loss; and $\lambda$ is the coefficient that balances the two loss terms.
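A weighted sum of a softmax classification loss and a smooth $L_1$ box loss can be sketched in NumPy as below. This is a minimal sketch: the function names are hypothetical, and applying the box loss only to non-background proposals (with label 0 as background) follows common Fast R-CNN practice and is an assumption here.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def smooth_l1(pred, target):
    """Smooth L1 loss summed over the seven box parameters."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    per_elem = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return float(per_elem.sum())

def multitask_loss(logits, label, box_pred, box_target, lam=1.0, background=0):
    """Classification loss plus lam times the box loss for positive samples."""
    loss = softmax_cross_entropy(logits, label)
    if label != background:  # assumption: negatives contribute no box loss
        loss += lam * smooth_l1(box_pred, box_target)
    return loss
```

The smooth $L_1$ term is quadratic for small residuals and linear for large ones, which makes the regression less sensitive to outlier proposals than a pure squared loss.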

Experiments
In order to verify the proposed multi-channel object detection neural network, the open source NYU V2 dataset is selected for experiments. The dataset was collected using an RGB-D camera in several indoor scenes. It consists of color and depth images with labeled 3D object bounding boxes. The training set contains 795 images and the test set contains 654 images. Zhuo et al. [34] re-labeled the ground truth of some bounding boxes in some images in order to improve the correctness of the positive samples during training. Therefore, to facilitate a fair experimental comparison with the work by Zhuo et al. [34], the modified NYUV2 dataset is adopted in the experiments.

Training Data Generation
To train the multi-channel neural network described in Section 3, the relevant training data needs to be prepared, such as RGB images, depth images, BEV images, 2D proposals in three channels, and the corresponding 3D proposals.
RGB and depth images are already included in the dataset. However, the BEV images need to be generated. As the NYUV2 dataset is collected by using a Kinect sensor, the range of the point cloud should be limited in order to acquire the same size for the BEV image obtained by the point cloud projection. In the camera coordinate system, the range of the x-axis direction is [−2.5 m,2.5 m], and the range of the z-axis direction is [0,5 m]. The point cloud is projected onto the plane to obtain a BEV image. The resolution of the grid is set as 0.01 m, then the size of the BEV image is 500 × 500 pixels.
In the RGB image, 2D proposals are generated using the MCG algorithm. Since the depth image and the RGB image have the same viewpoint, the proposals of the depth image are identical to those of the RGB image. The 3D proposals are generated according to the algorithm described in Section 3, and the image semantic segmentation is produced using FCN. Each proposal is then scored against the ground truth 2D bounding boxes in the image, and the proposal takes the category of the bounding box with the highest score. A 3D proposal is initialized with the average size of the corresponding object category on the training set. The 3D proposal box is projected onto the BEV image to obtain a BEV proposal box corresponding to the 2D proposal box in the image. Data augmentation is performed during training: the 2D proposals are flipped horizontally in the image, and the corresponding 3D proposals and BEV image proposals are flipped simultaneously.
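The consistent horizontal flip across the three channels can be sketched as follows. The sketch assumes that the principal point lies at the horizontal center of the image and that the BEV range is symmetric about x = 0, so that mirroring the image corresponds to negating the x coordinate and the angle of the 3D proposal; the function names are hypothetical.

```python
def flip_2d_box(box, image_width):
    """Mirror an (x1, y1, x2, y2) box about the vertical center line of the image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

def flip_3d_proposal(proposal):
    """Mirror (x_c, y_c, z_c, l, w, h, theta) about the camera's y-z plane."""
    x_c, y_c, z_c, l, w, h, theta = proposal
    # Assumption: a mirror about x = 0 negates both the x position and the angle.
    return (-x_c, y_c, z_c, l, w, h, -theta)
```

Under the symmetric-range assumption, a BEV proposal can be flipped with `flip_2d_box` by passing the BEV image width. Applying either function twice returns the original proposal.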
The categories of FCN-based semantic segmentation on the NYUV2 dataset are not completely consistent with the categories of 3D object detection: FCN lacks two categories, "garbage bin" and "monitor", compared to the object categories in Zhuo et al.'s work [34]. Therefore, these two categories are removed and the remaining 17 categories are used.

Training Parameter Setting
The Caffe framework is applied during training, and the pre-trained VGG16 model on ImageNet is utilized to initialize the parameters of the forward channel of the network. The coefficient λ in the loss function is set to 1; the base learning rate is set to 0.0005; and the learning strategy is "step", which is multiplied by 0.1 every 10,000 iterations. The stochastic gradient descent algorithm is employed to implement 40,000 iterations. Two images are randomly selected in each batch, 128 proposals are randomly selected in each image, and the ratio of positive samples to negative samples is 1:3. It takes about 4 h to iterate 40,000 times in an NVIDIA DGX Station (Tesla V100 GPU, NVIDIA, Santa Clara, CA, USA). During the test, the forward channel average inference time is around 0.18 s.
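The "step" learning rate policy and the nominal mini-batch composition stated above can be checked with a short calculation. The function names `step_lr` and `batch_composition` are hypothetical; the defaults are the values given in this section.

```python
def step_lr(iteration, base_lr=0.0005, gamma=0.1, step_size=10000):
    """Learning rate under a step policy: multiply by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)

def batch_composition(images_per_batch=2, proposals_per_image=128, pos_to_neg=(1, 3)):
    """Nominal numbers of positive and negative proposals in one mini-batch."""
    pos, neg = pos_to_neg
    n_pos = proposals_per_image * pos // (pos + neg)
    n_neg = proposals_per_image - n_pos
    return images_per_batch * n_pos, images_per_batch * n_neg
```

With these settings, the learning rate falls from 0.0005 to 0.0000005 over the 40,000 iterations, and each mini-batch nominally holds 64 positive and 192 negative proposals (32 and 96 per image).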

Experimental Results and Analysis
Experiments are carried out on the modified NYUV2 dataset [34] and they are compared with two recently related papers [27,34]. The experimental results are shown in Table 1. The proposed algorithm improves the average accuracy compared with the previous works [27,34], and the performance of the algorithm after adding the BEV image is also verified. In most categories, the accuracy is improved by several percent, especially for objects that are relatively independent in the BEV image, such as beds, chairs, and sofas. To illustrate the details of the algorithm, we conduct an ablation study on the dataset. The results are shown in Table 2. We remove the BEV channel to verify the effect. Furthermore, we can observe that the average accuracy is improved by 3.1% with the BEV channel. To observe the results more intuitively, we give some examples with all the input images and results, as shown in Figure 4. It shows the detection results in four different categories (chair, sofa, toilet, and bed). The figure shows the input three channel images (RGB, depth, and BEV images), semantic segmentation images, and point cloud images in which the detected 3D object bounding boxes are displayed. From the detected 3D bounding boxes, we can observe that the sizes and positions of different objects are estimated well.
To further verify the performance of the algorithm, we carry out another experiment on the SUN RGB-D [18] dataset, which contains 10,335 RGB-D images and 64,595 3D bounding boxes. We train the model using our method and compare it with previous works [27,36], which are also tested on the same dataset. The experimental results are shown in Table 3. It can be observed that our method achieves better performance in most categories and improves the mean average precision.
Since the categories of semantic segmentation based on FCN in the NYUV2 dataset are not completely consistent with the categories of 3D object detection, FCN lacks two categories named "garbage bin" and "monitor" compared to the object categories in Zhuo et al.'s work [34]. Therefore, the two categories are removed and the remaining 17 categories were used.

Training Parameter Setting
The Caffe framework is applied during training, and the pre-trained VGG16 model on ImageNet is utilized to initialize the parameters of the forward channel of the network. The coefficient in the loss function is set to 1; the base learning rate is set to 0.0005; and the learning strategy is "step", which is multiplied by 0.1 every 10,000 iterations. The stochastic gradient descent algorithm is employed to implement 40,000 iterations. Two images are randomly selected in each batch, 128 proposals are randomly selected in each image, and the ratio of positive samples to negative samples is 1:3. It takes about 4 h to iterate 40,000 times in an NVIDIA DGX Station (Tesla V100 GPU, NVIDIA, Santa Clara, CA, USA). During the test, the forward channel average inference time is around 0.18 s.

Experimental Results and Analysis
Experiments are carried out on the modified NYUV2 dataset [34] and they are compared with two recently related papers [27,34]. The experimental results are shown in Table 1. The proposed algorithm improves the average accuracy compared with the previous works [27,34], and the performance of the algorithm after adding the BEV image is also verified. In most categories, the accuracy is improved by several percent, especially for objects that are relatively independent in the BEV image, such as beds, chairs, and sofas. To illustrate the details of the algorithm, we conduct an ablation study on the dataset. The results are shown in Table 2. We remove the BEV channel to verify the effect. Furthermore, we can observe that the average accuracy is improved by 3.1% with the BEV channel. To observe the results more intuitively, we give some examples with all the input images and results, as shown in Figure 4. It shows the detection results in four different categories (chair, sofa, toilet, and bed). The figure shows the input three channel images (RGB, depth, and BEV images), semantic segmentation images, and point cloud images in which the detected 3D object bounding boxes are displayed. From the detected 3D bounding boxes, we can observe that the sizes and positions of different objects are estimated well.
To verify the performance of the algorithm more thoroughly, we carry out another experiment on the SUN RGB-D dataset [18], which contains 10,335 RGB-D images and 64,595 3D bounding boxes. We train the model using our method and compare it with previous works [27,36] that are tested on the same dataset. The experimental results are shown in Table 3. Our method achieves better performance in most categories and a certain improvement in the mean average precision.
Each 2D proposal is scored according to the ground truth of the 2D bounding boxes in the image, and the proposal is assigned the category of the bounding box with the highest score. A 3D proposal is generated with the average size of each type of object on the training set as its initial value. The 3D proposal box is projected onto the BEV image to obtain a plan-view proposal box corresponding to the 2D proposal box in the image. Data augmentation is performed during training: the 2D proposals are flipped horizontally in the image, and the corresponding 3D proposals and BEV-image proposals are flipped simultaneously.
Because the categories of the FCN-based semantic segmentation on the NYUV2 dataset are not completely consistent with the categories of 3D object detection, the FCN lacks two categories, "garbage bin" and "monitor", compared with the object categories in Zhuo et al.'s work [34]. Therefore, these two categories are removed and the remaining 17 categories are used.
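The horizontal-flip augmentation can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this work: boxes are axis-aligned tuples (x_min, y_min, x_max, y_max) in continuous pixel coordinates, the camera x axis points to the right, the optical axis passes through the image centre, and the BEV image is horizontally centred on the camera. All function names are hypothetical.

```python
def flip_box_2d(box, image_width):
    """Mirror an axis-aligned box (x_min, y_min, x_max, y_max) about the
    vertical centre line of an image of the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def flip_center_3d(center):
    """Mirror a 3D proposal centre (x, y, z) across the plane x = 0 of the
    camera frame (assumption: x points right, optical axis through the centre)."""
    x, y, z = center
    return (-x, y, z)


def flip_sample(rgb_box, bev_box, center_3d, rgb_width, bev_width):
    """Flip the 2D proposal, its BEV proposal, and the 3D proposal centre together,
    so that the three representations stay consistent after augmentation."""
    return (
        flip_box_2d(rgb_box, rgb_width),
        flip_box_2d(bev_box, bev_width),
        flip_center_3d(center_3d),
    )
```

The size of the 3D proposal is unchanged by the flip; only its position (and, if an orientation were used, its sign) is mirrored.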

Figure 4. Experiments on the modified NYUV2 dataset, in which four different categories (chair, sofa, toilet, and bed) are detected. The first row (a-d) shows RGB images, the second row (e-h) shows depth images, the third row (i-l) shows BEV images, the fourth row (m-p) shows semantic images, and the last row (q-t) shows the point cloud with detected 3D bounding boxes.
To apply the 3D object detection to indoor robot environmental perception, we conduct experiments on a service robot in three indoor scenes. A Kinect RGB-D sensor mounted on the robot collects RGB and depth images. As the robot moves in the environment, the ORB-SLAM2 [37,38] algorithm with a dense point cloud is employed to estimate the poses of the robot and build a dense map. The 3D object detection algorithm in this paper is run only on the keyframes of ORB-SLAM2 to reduce the computation. Because one object can be observed in several keyframes, the 3D bounding boxes of objects are transformed into the global coordinate system, and the average position and size of the same object's 3D bounding boxes are calculated. The experimental results are shown in Figure 5, in which 3D maps are constructed and several 3D object bounding boxes are shown in each panel. In Figure 5a, four chairs are detected in an office; in Figure 5b, three sofas and a television are detected; and in Figure 5c, a bookshelf and two sofas are detected.
To measure the results of the algorithm quantitatively, the estimated 3D bounding boxes of objects are compared with the ground truth, as shown in Table 4. To facilitate measurement, we select the scenes in Figure 5b,c for this test, because the 3D bounding boxes of objects in these scenes are almost parallel to each other and the ground truth is easy to measure. The Kinect RGB-D sensor is fixed on the robot at a height of 1.2 m. We keep the sensor parallel to the objects when starting the ORB-SLAM2 algorithm and set the coordinate system of the first keyframe as the global one (the coordinate system drawn in Figure 5b,c). Then, we transform the estimated results in each keyframe to the global coordinate system. Table 4 gives the ground-truth 3D bounding boxes of the objects and the results estimated by our method in the global coordinate system. Finally, we calculate the 3D IoU (intersection over union) of the 3D bounding boxes to measure the accuracy. The average 3D IoU is 0.61, and our method detects the sofas with higher accuracy than the bookshelf and television. During the 3D object detection, it takes about 2.2 s to regress the 3D bounding boxes with a Tesla V100 GPU (about 1.6 s to extract region proposals in each keyframe based on MCG, 0.42 s to obtain the semantic segmentation using FCN, and 0.18 s to run the forward inference). The experimental results verify that the proposed algorithm can benefit robot environmental perception using 3D object detection.
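The fusion of detections across keyframes and the 3D IoU measurement can be illustrated with a minimal sketch. It assumes that all boxes are already expressed in the global coordinate system and are axis-aligned with it, which matches the test scenes where the objects are almost parallel to each other; the `Box3D` type and the function names are hypothetical and are not taken from the implementation used in this work.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    # Centre of the box in the global coordinate system (metres).
    cx: float
    cy: float
    cz: float
    # Full extent of the box along each global axis (metres).
    sx: float
    sy: float
    sz: float


def average_boxes(boxes):
    """Average the position and size of several detections of the same object."""
    n = len(boxes)
    fields = ("cx", "cy", "cz", "sx", "sy", "sz")
    return Box3D(*(sum(getattr(b, f) for b in boxes) / n for f in fields))


def _overlap_1d(c1, s1, c2, s2):
    """Length of the overlap of two intervals given by centre and full extent."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)


def iou_3d(a, b):
    """3D intersection over union of two axis-aligned boxes."""
    inter = (
        _overlap_1d(a.cx, a.sx, b.cx, b.sx)
        * _overlap_1d(a.cy, a.sy, b.cy, b.sy)
        * _overlap_1d(a.cz, a.sz, b.cz, b.sz)
    )
    union = a.sx * a.sy * a.sz + b.sx * b.sy * b.sz - inter
    return inter / union if union > 0 else 0.0
```

For boxes that are rotated relative to the global axes, the intersection volume would require a polygon clipping step in the ground plane, which this sketch omits.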

Conclusions
To address 3D object detection for the environmental perception of indoor service robots, this paper proposes a multi-channel convolutional neural network that combines RGB, depth, and BEV images. Adding the BEV image channel improves the perception ability of the neural network. Training and testing are carried out on the modified NYUV2 dataset and the SUN RGB-D dataset, and the experimental results show that our algorithm outperforms the recent works used for comparison. Furthermore, the experiments on a real service robot demonstrate that the proposed 3D object detection algorithm has the potential to be applied in indoor service robots for better environmental perception. Because the method still relies on a separate 2D proposal generation step, future work will focus on an end-to-end method that achieves 3D object detection using a CNN.