Object Semantic Grid Mapping with 2D LiDAR and RGB-D Camera for Domestic Robot Navigation

: Occupied grid maps are su ﬃ cient for mobile robots to complete metric navigation tasks in domestic environments. However, they lack semantic information to endow the robots with the ability of social goal selection and human-friendly operation modes. In this paper, we propose an object semantic grid mapping system with 2D Light Detection and Ranging (LiDAR) and RGB-D sensors to solve this problem. At ﬁrst, we use a laser-based Simultaneous Localization and Mapping (SLAM) to generate an occupied grid map and obtain a robot trajectory. Then, we employ object detection to get an object’s semantics of color images and use joint interpolation to reﬁne camera poses. Based on object detection, depth images, and interpolated poses, we build a point cloud with object instances. To generate object-oriented minimum bounding rectangles, we propose a method for extracting the dominant directions of the room. Furthermore, we build object goal spaces to help the robots select navigation goals conveniently and socially. We have used the Robot@Home dataset to verify the system; the veriﬁcation results show that our system is e ﬀ ective.


Introduction
Robots are the next frontier technology that affect many aspects of our society, including industry [1], transportation [2], and services [3]. In recent years, more and more service robots, such as FABO [4], KeJia [5], and NAO [6], have entered the human living environment to improve people's quality of life. In order to complete navigation tasks, the robot needs a map to describe its environment. Traditional map representations include metric, topological, and hybrid ones [7]. The metric map uses grids or geometric features of points, lines, or planes to describe the geometric layout of the environment. To achieve an abstract representation of the environment, a topological map models the environment as a graph, in which vertices correspond to places and edges correspond to the paths. Since the metric map has the characteristic of high localization accuracy and topological map of high path-planning efficiency, the hybrid map takes advantage of both to improve the navigation performance.
Based on an occupied grid or hybrid maps, the metric navigation for small or medium-scale environments is relatively mature. However, when it comes to using the robots in the domestic scene, they lack semantic information for users to order them conveniently. For example, the goal point is defined in geometric coordinates, but humans are more inclined to use natural language interaction. Thus, many researchers are committed to building semantic maps, which not only describe the geometric layout but, also, contain the concepts of rooms or objects, to improve human-robot interactions [8].
At present, there is no consensus on the map forms. Some researchers directly add semantic information to the metric map and some combine metric, topological, and concept maps to generate a 1.
An object semantic grid mapping system with 2D LiDAR and RGB-D sensors for domestic robot navigation is proposed.

2.
A method for extracting the dominant directions of the room to generate OOMBRs for objects is proposed. 3.
Object goal spaces are created for robot goal point selection conveniently and socially. 4.
The effectiveness of the system on a home dataset is verified.
The remainder of this paper is organized as follows. The system overview is introduced in Section 2. The front-end and back-end of the system are detailed in Sections 3 and 4, respectively. Section 5 presents the experimental results and discussion. Finally, Section 6 concludes this paper.

System Overview
The proposed system is shown in Figure 1. First, a 2D laser-based SLAM is used to build an occupied grid map and obtain a robot trajectory. Then, an object detection algorithm is employed to detect the semantics of the objects located in the color image of the RGB-D sensor. Based on the timestamp of the image and the robot trajectory, the camera pose is interpolated to obtain a more accurate result. Based on object detection, depth images, and interpolated poses, a point cloud with object semantic information is generated. To more accurately approximate the occupied spaces of objects, the dominant directions of the room are extracted to generate OOMBRs. To make the robot select goal points conveniently and socially, object goal spaces are created. Finally, the object semantic grid map is generated.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 21 In summary, the contributions of this paper are as follows: 1. An object semantic grid mapping system with 2D LiDAR and RGB-D sensors for domestic robot navigation is proposed.
2. A method for extracting the dominant directions of the room to generate OOMBRs for objects is proposed. 3. Object goal spaces are created for robot goal point selection conveniently and socially. 4. The effectiveness of the system on a home dataset is verified.
The remainder of this paper is organized as follows. The system overview is introduced in Section 2. The front-end and back-end of the system are detailed in Section 3 and Section 4, respectively. Section 5 presents the experimental results and discussion. Finally, Section 6 concludes this paper.

System Overview
The proposed system is shown in Figure 1. First, a 2D laser-based SLAM is used to build an occupied grid map and obtain a robot trajectory. Then, an object detection algorithm is employed to detect the semantics of the objects located in the color image of the RGB-D sensor. Based on the timestamp of the image and the robot trajectory, the camera pose is interpolated to obtain a more accurate result. Based on object detection, depth images, and interpolated poses, a point cloud with object semantic information is generated. To more accurately approximate the occupied spaces of objects, the dominant directions of the room are extracted to generate OOMBRs. To make the robot select goal points conveniently and socially, object goal spaces are created. Finally, the object semantic grid map is generated. We divide the system into two parts: the front-end and the back-end. The front-end is used for data processing, including laser-based SLAM, camera pose interpolation, and object detection. The back-end is used to generate an object semantic grid map, including object point cloud generation, room dominant direction detection, and object goal space generation. The following first introduces the front-end of the system.

Laser-Based SLAM
The occupied grid map uses grids to represent the environment. Each grid has three states: occupied, free, or unknown. The occupied indicates the robot cannot pass through while moving; otherwise, it will collide with obstacles. In contrast, the free indicates that there are no obstacles at We divide the system into two parts: the front-end and the back-end. The front-end is used for data processing, including laser-based SLAM, camera pose interpolation, and object detection. The back-end is used to generate an object semantic grid map, including object point cloud generation, room dominant direction detection, and object goal space generation. The following first introduces the front-end of the system.

Laser-Based SLAM
The occupied grid map uses grids to represent the environment. Each grid has three states: occupied, free, or unknown. The occupied indicates the robot cannot pass through while moving; otherwise, it will collide with obstacles. In contrast, the free indicates that there are no obstacles at this position, and the robot can move freely. Finally, the unknown means that the robot failed to detect this grid during the map-building process.
The laser-based SLAM algorithm is always used to build grid maps. The frameworks can be divided into filter-based and graph-based types [10]. Recently, the graph optimization framework has gradually become the mainstream framework, as it can build a more globally consistent map.
Cartographer is an open-source SLAM and proposed by Google in 2016. It is state-of-art, due to its high map accuracy, strong robustness, and good system maintainability. We thus use it in the proposed system. As shown in Figure 2, the algorithm can finally generate an occupied grid map and obtain a robot trajectory.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 4 of 21 this position, and the robot can move freely. Finally, the unknown means that the robot failed to detect this grid during the map-building process. The laser-based SLAM algorithm is always used to build grid maps. The frameworks can be divided into filter-based and graph-based types [10]. Recently, the graph optimization framework has gradually become the mainstream framework, as it can build a more globally consistent map.
Cartographer is an open-source SLAM and proposed by Google in 2016. It is state-of-art, due to its high map accuracy, strong robustness, and good system maintainability. We thus use it in the proposed system. As shown in Figure 2, the algorithm can finally generate an occupied grid map and obtain a robot trajectory.

Camera Pose Interpolation
Camera pose is an important factor that affects the accuracy of the point cloud. Although the Cartographer can estimate the robot pose more accurately, it is difficult to collect synchronized laser and RGB-D sensor data, resulting in the two being not strictly aligned on the timestamp. Interpolation is a method of estimating new data between two discrete points; we thus use it to refine camera poses.
The robot's pose ϵ SE 3 contains a translation vector ∈  and a rotation matrix ∈ SO 3 . SO 3 and SE 3 represent the special orthogonal group and the special Euclidean group, respectively. They are defined as follows: The method for robot pose interpolation mainly includes two paradigms: the split and the joint [37]. The former first interpolates the position and rotation separately and then composes the results of both to generate a new pose. On the contrary, the latter interpolates the position and rotation simultaneously from the perspective of the special Euclidean group. As shown in Figure 3, they have different interpolation trajectories. We use the joint interpolation to generate camera poses, since the robot trajectory is more in-line with it.
Assuming that the timestamp of an image is between two laser timestamps and , and the camera poses corresponding to these timestamps are and , then the interpolated camera

Camera Pose Interpolation
Camera pose is an important factor that affects the accuracy of the point cloud. Although the Cartographer can estimate the robot pose more accurately, it is difficult to collect synchronized laser and RGB-D sensor data, resulting in the two being not strictly aligned on the timestamp. Interpolation is a method of estimating new data between two discrete points; we thus use it to refine camera poses.
The robot's pose T SE(3) contains a translation vector p ∈ R 3 and a rotation matrix R ∈ SO(3). SO(3) and SE(3) represent the special orthogonal group and the special Euclidean group, respectively. They are defined as follows: The method for robot pose interpolation mainly includes two paradigms: the split and the joint [37]. The former first interpolates the position and rotation separately and then composes the results of both to generate a new pose. On the contrary, the latter interpolates the position and rotation Appl. Sci. 2020, 10, 5782 5 of 21 simultaneously from the perspective of the special Euclidean group. As shown in Figure 3, they have different interpolation trajectories. We use the joint interpolation to generate camera poses, since the robot trajectory is more in-line with it.
Appl. Sci. 2020, 10, x FOR PEER REVIEW  5 of 21 where exp • and log • are the exponential and logarithm maps of SE 3 .

Object Detection
At present, object detection based on deep learning has become a mainstream method. The stateof-the-art algorithms, such as YOLO [38], Faster R-CNN [39], and Mask R-CNN [40], are favored by researchers. Using the YOLO and Faster R-CNN, we can obtain the semantic labels, probability values, and bounding rectangles (BRs) of objects. However, the BR is only an approximate representation of the object area-it may include some pixels that do not belong to the object. To solve this problem, Mask R-CNN adds a masked prediction branch on the basis of Faster R-CNN object detection. It can further determine whether the pixels belong to the object and provides more accurate results. We thus use Mask R-CNN to obtain object semantics. Figure 4 shows example results of this algorithm.
The algorithm can detect thousands of objects, so the types need to be limited. We detected six object categories, including bed, couch, sink, toilet, microwave, and oven. These objects are usually not moved frequently and can be used to determine room concepts. For example, a room containing beds is most likely a bedroom, and couches have a large probability of being located in the living room. Considering small objects, such as cups, bowls, and forks, are frequently moved, they are not detected. When the navigation goal points are related to these objects, goal reasoning will be performed according to the instances of the six object categories and room concepts.   Assuming that the timestamp t c of an image is between two laser timestamps t l0 and t l1 , and the camera poses corresponding to these timestamps are T c0 and T c1 , then the interpolated camera pose T c at the time t c is: where exp(·) and log(·) are the exponential and logarithm maps of SE(3).

Object Detection
At present, object detection based on deep learning has become a mainstream method. The state-of-the-art algorithms, such as YOLO [38], Faster R-CNN [39], and Mask R-CNN [40], are favored by researchers. Using the YOLO and Faster R-CNN, we can obtain the semantic labels, probability values, and bounding rectangles (BRs) of objects. However, the BR is only an approximate representation of the object area-it may include some pixels that do not belong to the object. To solve this problem, Mask R-CNN adds a masked prediction branch on the basis of Faster R-CNN object detection. It can further determine whether the pixels belong to the object and provides more accurate results. We thus use Mask R-CNN to obtain object semantics. Figure 4 shows example results of this algorithm. where exp • and log • are the exponential and logarithm maps of SE 3 .

Object Detection
At present, object detection based on deep learning has become a mainstream method. The stateof-the-art algorithms, such as YOLO [38], Faster R-CNN [39], and Mask R-CNN [40], are favored by researchers. Using the YOLO and Faster R-CNN, we can obtain the semantic labels, probability values, and bounding rectangles (BRs) of objects. However, the BR is only an approximate representation of the object area-it may include some pixels that do not belong to the object. To solve this problem, Mask R-CNN adds a masked prediction branch on the basis of Faster R-CNN object detection. It can further determine whether the pixels belong to the object and provides more accurate results. We thus use Mask R-CNN to obtain object semantics. Figure 4 shows example results of this algorithm.
The algorithm can detect thousands of objects, so the types need to be limited. We detected six object categories, including bed, couch, sink, toilet, microwave, and oven. These objects are usually not moved frequently and can be used to determine room concepts. For example, a room containing beds is most likely a bedroom, and couches have a large probability of being located in the living room. Considering small objects, such as cups, bowls, and forks, are frequently moved, they are not detected. When the navigation goal points are related to these objects, goal reasoning will be performed according to the instances of the six object categories and room concepts.   The algorithm can detect thousands of objects, so the types need to be limited. We detected six object categories, including bed, couch, sink, toilet, microwave, and oven. These objects are usually not moved frequently and can be used to determine room concepts. For example, a room containing beds is most likely a bedroom, and couches have a large probability of being located in the living room. Considering small objects, such as cups, bowls, and forks, are frequently moved, they are not detected. When the navigation goal points are related to these objects, goal reasoning will be performed according to the instances of the six object categories and room concepts.

Object Point Cloud
An object point cloud is a set of 3D points with object semantic labels. We use two steps to generate the cloud-namely, point cloud generation and refinement. The specific contents of each step are introduced below.
The pinhole camera model describes the process of projecting a 3D point in the camera coordinate system to a 2D pixel in the image coordinate system. Defining a point in the camera coordinate system is X c = [ x c , y c , z c ] T ; its corresponding pixel u is: where K is the camera intrinsic and can be obtained by calibration. As shown in Equation (4), depth information is lost during this projection process, resulting in the inability to recover the 3D point by a color image alone. However, an RGB-D camera can obtain both depth image and color image simultaneously. It can directly obtain (u, v, z c ). Therefore, based on the RGB-D sensor, the back-projection process of the pinhole camera can generate the 3D point in the camera coordinate system: We further use depth and object height constraints to improve the accuracy of the point cloud. The error of the depth image increases with the distance, and the interpolated camera poses may be not very accurate, so we only use the depth value smaller than a certain range.
The object height constraint requires that the 3D object point is within a certain range. We use this constraint to reduce the error results from Mask R-CNN. For example, the model provided by Mask R-CNN is difficult to distinguish between a sink and a chandelier, but the chandelier hangs on the wall and the sink stands on the ground. According to the constraint of the sink, the error results can be removed. Finally, as shown in Figure 5, a 3D point cloud with semantic information can be generated.
Due to sensor noise, inaccurate camera poses, and object detection errors, the point cloud contains many noise and error points. Moreover, the point cloud only includes object semantic labels but does not distinguish different instances. For example, although the environment contains two couches, the point cloud treats them as one. Therefore, the point cloud filter method is used to remove some noise points, and the point cloud segmentation method is employed to obtain different object instances.
Noise points are relatively sparse, while the points within the object are more concentrated. According to this characteristic, we use the statistical filter to remove some noise points [41]. This method assumes that the average distance between a point and its k neighbors follows a Gaussian distribution, then calculates the mean and standard deviation of the distribution, and, finally, takes the points outside the standard deviation as noise points and removes them.
To distinguish different objects with the same semantic label, the Euclidean cluster segmentation is adopted to generate object instances [42]. This method first randomly selects a point in the point cloud and simultaneously generates a new instance, then calculates the nearest neighbors that meets the Euclidean distance threshold and adds them to the instance. Repeat this process until no neighbors can be added to the instance to complete the instance segmentation. At last, repeat the aforementioned process for the remaining point clouds, until all points have an instance label.
Finally, we use the instance minimum number constraint to further refine the object points. In other words, the object instances whose numbers do not satisfy this constraint are removed. Through the above point cloud refinement methods, the generated object point cloud is shown in Figure 6.
not very accurate, so we only use the depth value smaller than a certain range.
The object height constraint requires that the 3D object point is within a certain range. We use this constraint to reduce the error results from Mask R-CNN. For example, the model provided by Mask R-CNN is difficult to distinguish between a sink and a chandelier, but the chandelier hangs on the wall and the sink stands on the ground. According to the constraint of the sink, the error results can be removed. Finally, as shown in Figure 5, a 3D point cloud with semantic information can be generated.
Due to sensor noise, inaccurate camera poses, and object detection errors, the point cloud contains many noise and error points. Moreover, the point cloud only includes object semantic labels but does not distinguish different instances. For example, although the environment contains two couches, the point cloud treats them as one. Therefore, the point cloud filter method is used to remove some noise points, and the point cloud segmentation method is employed to obtain different object instances.  Noise points are relatively sparse, while the points within the object are more concentrated. According to this characteristic, we use the statistical filter to remove some noise points [41]. This method assumes that the average distance between a point and its k neighbors follows a Gaussian distribution, then calculates the mean and standard deviation of the distribution, and, finally, takes the points outside the standard deviation as noise points and removes them.
To distinguish different objects with the same semantic label, the Euclidean cluster segmentation is adopted to generate object instances [42]. This method first randomly selects a point in the point cloud and simultaneously generates a new instance, then calculates the nearest neighbors that meets the Euclidean distance threshold and adds them to the instance. Repeat this process until no neighbors can be added to the instance to complete the instance segmentation. At last, repeat the aforementioned process for the remaining point clouds, until all points have an instance label.
Finally, we use the instance minimum number constraint to further refine the object points. In other words, the object instances whose numbers do not satisfy this constraint are removed. Through the above point cloud refinement methods, the generated object point cloud is shown in Figure 6.

Room Dominant Direction Detection
Although the filtered and instantiated point cloud can better approximate the object positions, due to the influence of the robot trajectory or the installation position of the camera, the robot may not be able to completely observe an object. In other words, the robot can only observe part of the object. To better represent the objects, the minimum bounding box (MBB) is used to determine the approximate shape of the objects.
MBB mainly includes axis-aligned, arbitrarily oriented, and object-oriented forms. The objectoriented means that the box orientation is the same as the local coordinate system of the object. As it can better represent the object and the grid map as a two-dimensional map, we use OOMBR to describe all instances.
The domestic floor plan is usually a rectangular layout, as it has the advantages of simple structure, a high indoor effective area utilization rate, and easy construction. Defining the two directions, which are parallel to the two edges of the rectangle, are the dominant directions of the room. Furthermore, considering that most objects are placed parallel to the dominant directions, we

Room Dominant Direction Detection
Although the filtered and instantiated point cloud can better approximate the object positions, due to the influence of the robot trajectory or the installation position of the camera, the robot may not be able to completely observe an object. In other words, the robot can only observe part of the object. To better represent the objects, the minimum bounding box (MBB) is used to determine the approximate shape of the objects.
MBB mainly includes axis-aligned, arbitrarily oriented, and object-oriented forms. The object-oriented means that the box orientation is the same as the local coordinate system of the object. As it can better Appl. Sci. 2020, 10, 5782 8 of 21 represent the object and the grid map as a two-dimensional map, we use OOMBR to describe all instances.
The domestic floor plan is usually a rectangular layout, as it has the advantages of simple structure, a high indoor effective area utilization rate, and easy construction. Defining the two directions, which are parallel to the two edges of the rectangle, are the dominant directions of the room. Furthermore, considering that most objects are placed parallel to the dominant directions, we define the object local coordinate axes parallel to them.
Aiming at the rectangular layout, the dominant directions can be extracted by means of line length and direction statistics. While the line detection algorithm needs a binary image, the occupied grid map has three states. To convert the grid map to a binary image, the free grids are regarded as the image foreground, and the unknown and occupied grids are used as the image background. Based on the binary image, we can detect the line segments. In the Cartesian coordinate system, the general equation of a straight line is: The traditional Hough transform turns a point in the Cartesian coordinate system into a curve in the Hough space: where ρ is the distance from the origin to the line, and θ is the angle between the perpendicular line and the x-axis. The parameters can be determined by voting. However, this method cannot obtain the coordinates of the endpoints of the line segments. Cumulative probability Hough line detection can help to obtain them. While reducing the accumulation space, the coordinates of the endpoints of the line segments can also be obtained. Therefore, we use a four-tuple (P s , P e , L, l) to describe the line segment parameters. Assuming the points of a line segment are P s = [x s , y s ] T and P e = [x e , y e ] T , the straight line parameters L and line segment length l are: When determining the dominant direction of the room, we take the cumulative line length in the same direction as the number of votes. However, some of the obtained line segments belong to the same straight line but are extracted as two line segments with overlapping parts, as shown in Figure 7.
The traditional Hough transform turns a point in the Cartesian coordinate system into a curve in the Hough space: where is the distance from the origin to the line, and is the angle between the perpendicular line and the x-axis. The parameters can be determined by voting. However, this method cannot obtain the coordinates of the endpoints of the line segments. Cumulative probability Hough line detection can help to obtain them. While reducing the accumulation space, the coordinates of the endpoints of the line segments can also be obtained. Therefore, we use a four-tuple , , , to describe the line segment parameters. Assuming the points of a line segment are , and , , the straight line parameters L and line segment length l are: When determining the dominant direction of the room, we take the cumulative line length in the same direction as the number of votes. However, some of the obtained line segments belong to the same straight line but are extracted as two line segments with overlapping parts, as shown in Figure 7. Although, by adjusting the algorithm thresholds, we can reduce this situation. However, in order to reduce the sensitivity of thresholds and improve the detection accuracy of the dominant directions, we use the direction, distance, and coincidence constraints to detect and merge these line segments. Suppose the points of the first line segment are , and , , Although, by adjusting the algorithm thresholds, we can reduce this situation. However, in order to reduce the sensitivity of thresholds and improve the detection accuracy of the dominant directions, we use the direction, distance, and coincidence constraints to detect and merge these line segments. Suppose the points of the first line segment are P s1 = [x s1 , y s1 ] T and P e1 = [x e1 , y e1 ] T , and the second line segment are P s1 = [x s1 , y s1 ] T and P e1 = [x e1 , y e1 ] T , the direction constraint requires the angle between the two line segments to be less than the angle threshold δ φ : The distance constraint requires that the distance from the first segment points to the second line be smaller than the distance threshold δ d : The coincidence constraint requires the distance between the midpoints P c1 and P c2 of the line segments be less than the sum of the half-lengths of the line segments: The segment merging method is shown in Figure 8. Based on the first line segment, the projection points of the second line segment are calculated. At last, the points and the first line segment length are recalculated as the merge result.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 9 of 21 The coincidence constraint requires the distance between the midpoints and of the line segments be less than the sum of the half-lengths of the line segments: The segment merging method is shown in Figure 8. Based on the first line segment, the projection points of the second line segment are calculated. At last, the points and the first line segment length are recalculated as the merge result. To count the length of line segments, the directions of all line segments are used as the vote box, the direction constraint threshold is regarded as the box width, and the cumulative length of each direction is the number of votes. The voting result is calculated as shown in Figure 9. The dominant directions of the room and are: argmax .
argmin | | 2 | . To count the length of line segments, the directions of all line segments are used as the vote box, the direction constraint threshold is regarded as the box width, and the cumulative length of each direction is the number of votes. The voting result is calculated as shown in Figure 9. The dominant directions of the room φ m1 and φ m2 are: Appl. Sci. 2020, 10

Object Goal Space Generation
The dominant direction is used as the orientation of the object. Let us assume that a point , , belongs to an object instance. According to the dominant direction, the rotated point can be computed as follows: .
Then, the MBR can be computed as follows: where is the lower left point of the rectangle, is the upper left, is the lower right, and is the upper right. Then, we transform the rectangle in the rotated coordinate system to the map coordinate system: Based on the four vertex points of the rectangle, the OOMBR can be generated, as shown in Figure 10.

Object Goal Space Generation
The dominant direction φ m1 is used as the orientation of the object. Let us assume that a point P i = [x i , y i , z i ] T belongs to an object instance. According to the dominant direction, the rotated point P i can be computed as follows: Then, the MBR can be computed as follows: where P ld is the lower left point of the rectangle, P ld is the upper left, P rd is the lower right, and P ru is the upper right. Then, we transform the rectangle in the rotated coordinate system to the map coordinate system: Based on the four vertex points of the rectangle, the OOMBR can be generated, as shown in Figure 10. The navigation goal point is located on a free grid, but the object is located on the occupied or unknown grids. In order to help the robot select goals and make its behavior more in-line with social norms, the object goal space is established. It describes the affordance and nonaffordance space (NAS) of an object. According to the definition in [43], affordance space is related to human activities and is a social space provided by the environment, which is of great significance to realize the navigation behavior accepted by humans. For example, Figure 11a defines the affordance space for a couch, and Figure 11b shows that a FABO robot is in this space. The rationality of this goal needs further discussion. If a human orders the robot to take the cushion on the couch to the bedroom, the point is reasonable, as it is convenient for the robot to execute the next action. On the contrary, if the human just orders the robot to stop nearby the couch, the point is unreasonable, as the affordance space is the potential space for human activities. When people want to sit down, the robot will become an obstacle. Therefore, the robot should stop in the nonaffordance space, in this case. To formally define these spaces, we first introduce the concepts of 4-connected and 8-connected pixels. The 4-connected pixels are shown in Figure 12a and are defined as follows: The navigation goal point is located on a free grid, but the object is located on the occupied or unknown grids. In order to help the robot select goals and make its behavior more in-line with social norms, the object goal space is established. It describes the affordance and nonaffordance space (NAS) of an object. According to the definition in [43], affordance space is related to human activities and is a social space provided by the environment, which is of great significance to realize the navigation behavior accepted by humans. For example, Figure 11a defines the affordance space for a couch, and Figure 11b shows that a FABO robot is in this space. The rationality of this goal needs further discussion. If a human orders the robot to take the cushion on the couch to the bedroom, the point is reasonable, as it is convenient for the robot to execute the next action. On the contrary, if the human just orders the robot to stop nearby the couch, the point is unreasonable, as the affordance space is the potential space for human activities. When people want to sit down, the robot will become an obstacle. Therefore, the robot should stop in the nonaffordance space, in this case. The navigation goal point is located on a free grid, but the object is located on the occupied or unknown grids. In order to help the robot select goals and make its behavior more in-line with social norms, the object goal space is established. It describes the affordance and nonaffordance space (NAS) of an object. According to the definition in [43], affordance space is related to human activities and is a social space provided by the environment, which is of great significance to realize the navigation behavior accepted by humans. For example, Figure 11a defines the affordance space for a couch, and Figure 11b shows that a FABO robot is in this space. The rationality of this goal needs further discussion. If a human orders the robot to take the cushion on the couch to the bedroom, the point is reasonable, as it is convenient for the robot to execute the next action. On the contrary, if the human just orders the robot to stop nearby the couch, the point is unreasonable, as the affordance space is the potential space for human activities. When people want to sit down, the robot will become an obstacle. Therefore, the robot should stop in the nonaffordance space, in this case. To formally define these spaces, we first introduce the concepts of 4-connected and 8-connected pixels. The 4-connected pixels are shown in Figure 12a and are defined as follows: To formally define these spaces, we first introduce the concepts of 4-connected and 8-connected pixels. The 4-connected pixels are shown in Figure 12a and are defined as follows: There are two shortcomings in the theoretical model: (1) It does not consider the uncertainty of the BR of the object. Therefore, it may contain free grids. (2) The real affordance and nonaffordance spaces are a subset of the theoretical ones. For example, the couch has only one rectangle in Figure  11a. We use the grid map to partially solve this problem. For the former, we also consider the free grids in the BR as affordance space; for the latter, we use the occupied and unknown girds to remove some rectangles. Therefore, the real affordance  , and nonaffordance spaces and  , are: where  , is the grid set of the BR, and  is the free grid set of the map. Based on the above method, we build the object semantic grid map, as shown in Figure 13.   The 8-connected pixels are shown in Figure 12b and are defined as follows: The theoretical model of the goal space is shown in Figure 12c. Similar to the definition of the pixel neighborhoods, the rectangles of the four neighborhoods of the BRs are the theoretical affordance space N a (i, j). The rectangles belonging to the eight neighborhoods but not the four neighborhoods are the theoretical nonaffordance space N na (i, j): Assuming the width of a BR is l w , and the height is l h . The affordance space contains four rectangles. One edge length of each rectangle is l e , and the other is l w or l h . The nonaffordance space is four squares with the edge length l e : l e = s r l r . (22) where s r is the safety factor, and l r is the radius of the robot circumscribed circle. There are two shortcomings in the theoretical model: (1) It does not consider the uncertainty of the BR of the object. Therefore, it may contain free grids. (2) The real affordance and nonaffordance spaces are a subset of the theoretical ones. For example, the couch has only one rectangle in Figure 11a. We use the grid map to partially solve this problem. For the former, we also consider the free grids in the BR as affordance space; for the latter, we use the occupied and unknown girds to remove some rectangles. Therefore, the real affordance N a (i, j) and nonaffordance spaces and N na (i, j) are: where B(i, j) is the grid set of the BR, and F is the free grid set of the map. Based on the above method, we build the object semantic grid map, as shown in Figure 13.

Experimental Setup
Currently, the academic community has not established a unified standard for the evaluation of semantic maps. Although absolute and relative trajectory errors are usually used to evaluate the algorithm in the field of SLAM, the state-of-the-art can achieve centimeter accuracy in small and medium-scale scenes [44]. For domestic robot navigation, further comparison of the trajectory accuracy is of little practical significance. This evaluation method is more in-line with the needs of 3D reconstruction, virtual reality (VR), augmented reality (AR), and other fields. If we consider how humans are navigating, we can apply this approach to the robot's path planning. When humans are navigating, they usually do not need to reach the accurate goal position but only need to reach the approximate position near an object and then carry out refined operations. Similar to if the robot can navigate to an object and then use the object to establish a reference coordinate system. This approach to path planning can improve a robot's overall completion of operation tasks. Using this method may be more practical than requiring the robot to further build a very accurate map [45].
To summarize, for the following considerations, we do not evaluate the accuracy of robots' trajectories and objects' poses: (1) the current laser SLAM algorithm can meet the navigation accuracy requirements for small and medium-scale domestic environments. (2) Unlike AR and VR application scenarios, the robot does not need very accurate object poses when performing navigation tasks.
We use precision and the recall of object instantiation to evaluate object semantic gird maps. Supposedly the number of correctly instantiated objects, which have the correct semantic labels and their position errors are not very obvious, is , the number of instantiated objects on the map is , and the number of objects in the environment is ; then, the precision and recall of object instantiation are defined as follows: , .
For object detection, we employ an instance of Mask R-CNN pretrained with the COCO [46] dataset, which can detect the six object categories used in this paper.
For the dataset, we use Robot@Home [47] to verify the system. The dataset was generated by a Giraff robot, which was equipped with four Asus Xtion Live RGB-D cameras (manufacturer, city, country) and a Hokuyo 2D laser scanner (manufacturer, city, country). The four cameras were vertically placed on the robot base, making the overall camera field of view up to ~180°. Considering

Experimental Setup
Currently, the academic community has not established a unified standard for the evaluation of semantic maps. Although absolute and relative trajectory errors are usually used to evaluate the algorithm in the field of SLAM, the state-of-the-art can achieve centimeter accuracy in small and medium-scale scenes [44]. For domestic robot navigation, further comparison of the trajectory accuracy is of little practical significance. This evaluation method is more in-line with the needs of 3D reconstruction, virtual reality (VR), augmented reality (AR), and other fields. If we consider how humans are navigating, we can apply this approach to the robot's path planning. When humans are navigating, they usually do not need to reach the accurate goal position but only need to reach the approximate position near an object and then carry out refined operations. Similar to if the robot can navigate to an object and then use the object to establish a reference coordinate system. This approach to path planning can improve a robot's overall completion of operation tasks. Using this method may be more practical than requiring the robot to further build a very accurate map [45].
To summarize, for the following considerations, we do not evaluate the accuracy of robots' trajectories and objects' poses: (1) the current laser SLAM algorithm can meet the navigation accuracy requirements for small and medium-scale domestic environments. (2) Unlike AR and VR application scenarios, the robot does not need very accurate object poses when performing navigation tasks.
We use precision and the recall of object instantiation to evaluate object semantic gird maps. Supposedly the number of correctly instantiated objects, which have the correct semantic labels and their position errors are not very obvious, is N cins , the number of instantiated objects on the map is N ins , and the number of objects in the environment is N gtins ; then, the precision P ins and recall R ins of object instantiation are defined as follows: For object detection, we employ an instance of Mask R-CNN pretrained with the COCO [46] dataset, which can detect the six object categories used in this paper.
For the dataset, we use Robot@Home [47] to verify the system. The dataset was generated by a Giraff robot, which was equipped with four Asus Xtion Live RGB-D cameras (manufacturer, city, country) and a Hokuyo 2D laser scanner (manufacturer, city, country). The four cameras were vertically placed on the robot base, making the overall camera field of view up to~180 • . Considering that most robot platforms are only equipped with one RGB-D sensor, we only use the RGB-D information coming from the front. The data were collected within five dwelling apartments, named anto, alma, pare, rx2, and sarmis. As the observations from the laser scanner were saved at a lower frequency in sarmis than in the rest of the apartments, we cannot build its map successfully. Thus, we use alma, rx2, pare, and anto to verify the effectiveness of the system.

Results
The experimental results of the alma scene are shown in Figure 14, and the statistical results are presented in Table 1. The system instantiated five objects, including two beds, one couch, one sink, and one microwave. Four objects, which were two beds, one couch, and one sink, were correctly instantiated. Therefore, the precision was 80%, and the recall was 50%. For each object, the precision and recall of the bed were 100%. The precisions of the couch and the sink were 100%, and the recalls were 50%. As the system failed to instantiate the toilet and microwave, the corresponding indexes were 0%.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 14 of 21 frequency in sarmis than in the rest of the apartments, we cannot build its map successfully. Thus, we use alma, rx2, pare, and anto to verify the effectiveness of the system.

Results
The experimental results of the alma scene are shown in Figure 14, and the statistical results are presented in Table 1. The system instantiated five objects, including two beds, one couch, one sink, and one microwave. Four objects, which were two beds, one couch, and one sink, were correctly instantiated. Therefore, the precision was 80%, and the recall was 50%. For each object, the precision and recall of the bed were 100%. The precisions of the couch and the sink were 100%, and the recalls were 50%. As the system failed to instantiate the toilet and microwave, the corresponding indexes were 0%.
As shown in Figure 14d, the goal spaces of the objects are generated. Since the free grids near the objects have object semantics, and the navigation goal point is located on a free grid, they can facilitate the robot to select the goal point. In addition, due to the consideration of the affordance of the object, the goal point selection can be socialized to a certain extent.  The experimental results of the rx2 scene are shown in Figure 15, and the statistical results are shown in Table 2. The precision is 66.67%, and the recall is 80%. Specifically, the system instantiated  As shown in Figure 14d, the goal spaces of the objects are generated. Since the free grids near the objects have object semantics, and the navigation goal point is located on a free grid, they can facilitate the robot to select the goal point. In addition, due to the consideration of the affordance of the object, the goal point selection can be socialized to a certain extent.
The experimental results of the rx2 scene are shown in Figure 15, and the statistical results are shown in Table 2. The precision is 66.67%, and the recall is 80%. Specifically, the system instantiated six objects, including three beds, one couch, one sink, and one toilet. Among them, four objects were correctly instantiated-namely, one bed, one couch, one sink, and one toilet.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 15 of 21 six objects, including three beds, one couch, one sink, and one toilet. Among them, four objects were correctly instantiated-namely, one bed, one couch, one sink, and one toilet. For each object, although the recall of the bed was 100%, the object detection algorithm recognized some observations that belonged to the couch as a bed, so two beds were incorrectly instantiated, resulting in a precision of 33.33%. The precisions and recalls of the couch and toilet were 100%. The precision of the sink was 100%, and the recall was 50%. The reason is that the scene contains two sinks, but one of them is far from the observation position of the robot and cannot be instantiated correctly, so its recall is lower than precision. In addition, as shown in Figure 15d, each object instance has generated the object goal space, which can help the robot select navigation goals.  The experimental results of the pare scene are shown in Figure 16, and the statistical results are shown in Table 3. The precision is 73.33%, and the recall is 84.62%. The system instantiated 15 objects, including seven beds, two couches, two sinks, two toilets, one microwave, and one oven. Among them, 11 objects were instantiated correctly, including three beds, two couches, two sinks, two toilets, one microwave, and one oven.
The bed recall is 100%, and precision is 42.86%. Similar to the rx2 scene, the object detection algorithm recognized the couch, curtain, and dining table as a bed, respectively, and, thus, incorrectly instantiated three beds. Moreover, the point cloud belonging to the same bed was divided into two  For each object, although the recall of the bed was 100%, the object detection algorithm recognized some observations that belonged to the couch as a bed, so two beds were incorrectly instantiated, resulting in a precision of 33.33%. The precisions and recalls of the couch and toilet were 100%. The precision of the sink was 100%, and the recall was 50%. The reason is that the scene contains two sinks, but one of them is far from the observation position of the robot and cannot be instantiated correctly, so its recall is lower than precision. In addition, as shown in Figure 15d, each object instance has generated the object goal space, which can help the robot select navigation goals.
The experimental results of the pare scene are shown in Figure 16, and the statistical results are shown in Table 3. The precision is 73.33%, and the recall is 84.62%. The system instantiated 15 objects, including seven beds, two couches, two sinks, two toilets, one microwave, and one oven.
Among them, 11 objects were instantiated correctly, including three beds, two couches, two sinks, two toilets, one microwave, and one oven. by the Euclidean cluster segmentation, so one bed was instantiated incorrectly. The precisions and recalls of the toilet, microwave, and oven were 100%. The precision of the sink was 100%, and the recall was 66.67%. This scene contained 2twosinks, but in the observation data, one sink was partially occluded by other objects, resulting in a sparse point cloud, and could not be instantiated correctly.
As shown in Figure 16d, the free grids near each object contain the semantics of the object, which can facilitate the robot to select the goal point. However, since the goal space of bed6 is partly located in one room and partly located in another room, this will lead to an inaccurate selection of the goal. Specifically, the robot may choose a free grid in the room without the bed as the goal, causing the robot to not complete the navigation task successfully.  The experimental results of the anto scene are shown in Figure 17, and the statistical results are shown in Table 4. The precision is 58.33%, and the recall is 50%. The system instantiated 12 objects, including four beds, one couch, one toilet, one sink, one microwave, and four ovens. Among them, seven objects were correctly instantiated-namely, three beds, one toilet, one sink, one microwave, and one oven.  The bed recall is 100%, and precision is 42.86%. Similar to the rx2 scene, the object detection algorithm recognized the couch, curtain, and dining table as a bed, respectively, and, thus, incorrectly instantiated three beds. Moreover, the point cloud belonging to the same bed was divided into two by the Euclidean cluster segmentation, so one bed was instantiated incorrectly. The precisions and recalls of the toilet, microwave, and oven were 100%. The precision of the sink was 100%, and the recall was 66.67%. This scene contained 2twosinks, but in the observation data, one sink was partially occluded by other objects, resulting in a sparse point cloud, and could not be instantiated correctly.
As shown in Figure 16d, the free grids near each object contain the semantics of the object, which can facilitate the robot to select the goal point. However, since the goal space of bed6 is partly located in one room and partly located in another room, this will lead to an inaccurate selection of the goal. Specifically, the robot may choose a free grid in the room without the bed as the goal, causing the robot to not complete the navigation task successfully.
The experimental results of the anto scene are shown in Figure 17, and the statistical results are shown in Table 4. The precision is 58.33%, and the recall is 50%. The system instantiated 12 objects, including four beds, one couch, one toilet, one sink, one microwave, and four ovens. Among them, seven objects were correctly instantiated-namely, three beds, one toilet, one sink, one microwave, and one oven.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 17 of 21 For each object, although the bed recall was 100%, the object detection algorithm recognized part of the observations that belonged to the couch as a bed, so one bed was incorrectly instantiated, resulting in the precision as 75%. Since no couch was instantiated correctly, the corresponding indexes were 0%. The precision of the sink was 100%, and the recall was 33.33%. The precision of the toilet was 100%, and the recall was 50%.
As shown in Figure 17d, each object can generate a goal space. However, due to the inaccurate generation of the bounding box of the bed4, there are too-few free grids in the nonaffordance space, which may cause the robot to not select the goal point socially.

Discussion
The statistical results of the above four scenes are shown in Table 5. The precision of the bed is 56.26%, and the recall is 100%. The precision of the couch is 80%, and the recall is 40%. Part of the reason is that the object detection algorithm easily misidentifies a couch as a bed, reducing the precision of the bed and the recall of the couch. The precision of the sink is 100%, and the recall is 50%. The precision of the toilet is 100%, and the recall is 66.67%. If only the sink and toilet in the bathroom are considered, the recall of them is not much different, but the system hardly instantiated the sink in the kitchen; there is a 16.67% gap between the two. The microwave is relatively small in size and can be easily removed as noise, so its precision and recall are 50% and 66.67%, respectively.  For each object, although the bed recall was 100%, the object detection algorithm recognized part of the observations that belonged to the couch as a bed, so one bed was incorrectly instantiated, resulting in the precision as 75%. Since no couch was instantiated correctly, the corresponding indexes were 0%. The precision of the sink was 100%, and the recall was 33.33%. The precision of the toilet was 100%, and the recall was 50%.
As shown in Figure 17d, each object can generate a goal space. However, due to the inaccurate generation of the bounding box of the bed4, there are too-few free grids in the nonaffordance space, which may cause the robot to not select the goal point socially.

Discussion
The statistical results of the above four scenes are shown in Table 5. The precision of the bed is 56.26%, and the recall is 100%. The precision of the couch is 80%, and the recall is 40%. Part of the reason is that the object detection algorithm easily misidentifies a couch as a bed, reducing the precision of the bed and the recall of the couch. The precision of the sink is 100%, and the recall is 50%. The precision of the toilet is 100%, and the recall is 66.67%. If only the sink and toilet in the bathroom are considered, the recall of them is not much different, but the system hardly instantiated the sink in the kitchen; there is a 16.67% gap between the two. The microwave is relatively small in size and can be easily removed as noise, so its precision and recall are 50% and 66.67%, respectively. In contrast, the size of the oven is relatively large, so the recall is 100%. However, the object detection algorithm recognized many other objects as an oven, so the precision is 40%. In summary, the average precision of the system is 68.42%, and the recall is 65%. The sources of error in the precision and recall mainly include the following three aspects: (1) The object detection algorithm generates many wrong results as we directly employ an instance of Mask R-CNN pretrained with the COCO dataset. (2) There are many insufficient observations of objects, resulting in the corresponding object point clouds being sparse and removed by the point cloud filter algorithm. (3) An object is instantiated into two, as the camera interpolation method cannot refine some camera poses.
For the object goal spaces, since the free grids near an object have object semantics and the navigation goal point is located on the free grid, they can help the robot select the goal conveniently. Furthermore, with the affordance and nonaffordance spaces, the robot can choose the goal socially. However, there are still two main problems with the goal space: (1) Since the object goal space depends on the OOMBR, it may not be accurate enough.
(2) The object goal space may be defined through the walls and in other rooms.
The experimental results provide important ideas for the subsequent improvement of system performance: (1) To make the object detection more accurate, we can retrain the neural network model for domestic environments. As precision is related to the number of correctly instantiated objects, we can obviously improve it. (2) To obtain sufficient observations of the objects, the robot should approach the detected objects to achieve more complete observations and generate dense object point clouds. Based on (1), we can correctly instantiate more objects and improve the recall significantly. (3) To further improve the precision and recall scores and make the object goal spaces more in-line with the actual objects, we can apply a bundle adjustment-like method over all the estimated camera poses and depth maps to improve their accuracy. (4) As the object we detect can only be located in one room, we will use a room segmentation algorithm to solve the problem that part of the goal space is located in another room.

Conclusions
In this paper, we presented an object semantic grid mapping system for domestic robot navigation. We used the Cartographer to build an occupied grid map and Mask R-CNN to detect the objects, including beds, couches, sinks, ovens, microwaves, and toilets. These objects are always static and useful to compute room concepts. To generate the point cloud more accurately, we employed joint interpolation to refine the camera poses. We utilized the statistical filter to remove some outliers and the Euclidean cluster segmentation to get object instances. Furthermore, the maximum range of the depth image and geometric constraints, such as the minimum number of points of the object instances and the height of objects' points, were used to refine the object point clouds. We proposed a dominant direction detection method for rectangle layout environments to generate the object-oriented minimum bounding rectangles of the object instances. We built the object goal space to help the robot select goals conveniently and socially. We used the precision and recall of object instances as the evaluation standard and employed the Robot@Home dataset to verify the system. Experimental results show that the average precision of our system is 68.42%, and average recall is 65%. In our future work, we will improve the system and build semantic topological and concept maps to finally generate a multi-layer map that contains a semantic grid layer, a semantic topological layer, and a semantic concept layer for domestic robot navigation.