Exploration for Object Mapping Guided by Environmental Semantics Using UAVs

This paper presents a strategy for the autonomous exploration of unknown indoor environments, focusing on 3D mapping and grid-level semantic labeling of all available objects. Unlike conventional exploration techniques that rely on geometric heuristics and information-gain theory over an occupancy grid map, the strategy presented in this paper incorporates semantic information, such as the class of objects, to steer the exploration towards environmental segmentation and object labeling. The proposed approach utilizes deep learning to map 2D semantically segmented images into 3D semantic point clouds that encapsulate both occupancy and semantic annotations. A next-best-view exploration algorithm then iteratively explores and labels all the objects in the environment using a novel utility function that balances exploration and semantic object labeling. The proposed strategy was evaluated in a realistically simulated indoor environment, and the results were benchmarked against other exploration strategies.


Introduction
The growth in aerial robotics has led to their ubiquitous presence in various fields, such as urban search and rescue (USAR) [1][2][3], infrastructure inspection [4], and surveillance [5]. Recent research has focused on USAR activities performed by unmanned aerial vehicles (UAVs), which assist rescue teams by providing vital information in time-sensitive situations without endangering human lives. Introducing UAVs into USAR environments is beneficial when rapid responses are required to help first responders locate victims, survivors, and danger sources. The speed, agility, and ability of UAVs to navigate hazardous environments containing rubble and obstacles make them an ideal platform for deployment in USAR scenarios. UAVs with autonomous exploration, mapping, and navigation capabilities can provide rescuers with valuable data, such as 2D and 3D maps, victims' locations, and localized danger sources, improving their situational awareness and enabling them to respond appropriately.
Maps can provide a variety of information about the environment: they can reflect its structure, its connectivity, and the knowledge it contains. Providing first responders in USAR with a rich, informative map in the right representation is therefore an active topic in USAR robotics.
Robot pose and sensor data are used to build a globally consistent map on which path planning and exploration become feasible. In the literature, maps are classified by the way they represent information into three types [6,7]: topological [8], metric [8,9], and semantic [10][11][12][13]. The majority of existing mapping approaches aim to construct a globally consistent metric map for robots operating in the environment [6][7][8][14]. Metric maps, however, do not encapsulate characteristics compatible with human conception. Semantic maps are therefore preferable for high-level objectives, such as detecting and locating different objects, path planning, and increasing environmental awareness.
Autonomous exploration could be performed to minimize the effort required by the first responders in a USAR environment to search for victims and build awareness maps. Many approaches have been heavily used in the literature to perform autonomous exploration, such as frontier-based exploration [15] and information-theory-based exploration [16]. The concept of "next-best-view" (NBV) [17] was extensively used to guide the next best exploration action, whether using the frontier-based or the information-theory-based concepts. In order to identify the best next exploration action, a cost function (also called utility function) is used to evaluate various views to identify the one that maximizes the objectives of the exploration.
Both frontier-based and information-theory-based exploration of unknown indoor environments use metric (i.e., 2D/3D occupancy) maps without considering the high-level contextual information provided by onboard sensors, such as the class of objects in occupied voxels. Additionally, neither approach filters or classifies the information that is of merit to the application-specific environment, which would reduce the time needed to find objects of interest or to label all the objects in the environment.
In this work, an efficient strategy for autonomous, semantically-aware exploration, object labeling, and 3D mapping of unknown indoor environments is proposed. The proposed approach utilizes the semantic information encapsulated in the proposed 3D, semantically-aware map for object localization and labeling using a UAV with an onboard RGB-D camera. New utility functions are proposed to direct the robot towards the objects in the environment. The results show that the proposed strategy is capable of exploring unknown environments and labeling objects effectively. Our contributions can be summarized as follows:
• A 3D, semantically-aware, colored annotated mapping method that uses a deep learning model to semantically segment the objects in 2D images and project the annotated objects into a 3D, semantically-aware occupancy map. This includes a novel data structure that extends volumetric occupancy grid representations to include semantic-specific information.
• A multi-objective utility function that encapsulates the quantified information generated from the semantic and volumetric maps, allowing the robot to explore the unknown space while directing it to visit and label objects effectively.
• An overall exploration strategy that utilizes rapidly-exploring random trees (RRT) for viewpoint sampling, together with the next-best-view approach and the proposed semantic utility functions, to iteratively explore the unknown environment and label all the objects within it.
The remainder of the paper is structured as follows. Section 2 provides an overview of the related work. Section 3 details the methodology presented in this work: Section 3.1 illustrates the proposed strategy and describes its main components, Section 3.2 describes the semantic mapping strategy employed in this work, and Section 3.3 presents the proposed exploration approach and its components. Section 4 presents our experimental setup and the preliminary results of our simulation experiments, and finally, Section 5 summarizes our contributions and provides recommendations for future work.

Related Work
Robots deployed in search and rescue missions should possess basic yet very important skills, such as object detection, 3D mapping, and indoor exploration. The significant impact of unmanned aerial vehicles (UAVs) in various robotics fields has made them a promising solution for urban search and rescue tasks. For victim detection, the UAV can utilize onboard sensors and onboard intelligence to accurately detect and locate victims in disaster scenes, increasing the likelihood of finding victims [18][19][20]. For mapping and 3D reconstruction, the UAV should be capable of reconstructing the scene in 3D and locating the contextual, rich information that can be used by the rescue team. For exploration, the UAV should be able to select exploration positions that facilitate both the victim detection and 3D reconstruction tasks. In this section, the related work on object detection, semantic mapping, and exploration is provided.

Object Detection and Classification
Victim detection is the most critical task in search and rescue missions, wherein victims should be identified as quickly as possible so that the first responders can intervene and rescue them. The object detection task is concerned with discovering instances of semantic objects that belong to a particular class, such as a human, a table, or a chair. In the literature, a variety of mechanisms are used to detect objects, either by using a single sensor or by fusing multiple sensors [20].
Visual approaches are the most reliable means of finding objects in scenes. Visual object detection has been performed extensively using classical machine learning methods and, more recently, deep learning techniques. Both depend on the visual information provided by an individual image frame. In classical machine learning approaches, a feature extraction method, such as the scale-invariant feature transform (SIFT) [21] or histogram of oriented gradients (HOG) [22], is used to define the features; a machine learning model, such as a support vector machine (SVM) [23], is then employed for object classification. In deep learning, neural networks are capable of detecting objects without a separate feature extraction process. Different types of deep-learning-based object detection techniques exist depending on the source of information captured from the onboard sensors, such as 2D images and 3D point clouds.
Utilizing the various information captured from onboard sensors, such as 2D images and 3D point clouds, different deep learning models have been designed to detect objects. Commonly, convolutional neural networks (CNNs) are used to detect and identify objects in 2D images. The most commonly used deep learning models for object detection from 2D images are the single shot multibox detector (SSD) [24] and you only look once (YOLO) [25][26][27]. These methods create a bounding box that surrounds each classified object in the 2D image. Object detection and classification can also be performed using deep learning models that apply semantic segmentation to 2D images, such as PSPNet [28]. With depth images, object detection and classification are performed using deep learning networks that segment the point cloud and classify the segments into different classes, such as PointNet [29], PointNet++ [30], and Frustum PointNets [31].

Semantic Mapping
A semantic map is an augmented representation of the robot's environment that encapsulates characteristics compatible with human conception. Exploration and path planning algorithms benefit from the additional information that semantic maps provide about the surrounding environment. Most existing mapping strategies intend to build a globally consistent metric map for robots operating in the environment [6][7][8]14,32]. In USAR, the robot should be capable of guiding the first responders to the victim's location safely. Therefore, the robot should acquire skills beyond metric mapping; it should understand the surrounding environment. A solution can be found in semantic mapping, a qualitative representation of the robot's surrounding environment that augments exploration and task planning. In the literature, since contemporary robots use metric maps to explore the environment, most work on semantic maps uses a metric map as a basis and builds semantic attributes on top of it; see [33,34].

Exploration
Two of the most commonly used autonomous exploration approaches are frontier-based approaches and information-theory-based approaches. In frontier-based exploration, the robot navigates to the boundaries between free space and unexplored space by evaluating a cost function that maximizes map coverage [15]. Utility functions, such as path cost and expected information gain, are usually applied to evaluate each candidate position (i.e., frontier) to visit. In information-theory-based approaches, the robot navigates to the candidate position (i.e., viewpoint sample) that maximizes the information gain computed over a probabilistic map representation [16]. Therefore, a good information gain prediction algorithm plays a vital role in both methods. During exploration, different types of maps can be created, either to reconstruct the 3D scene or to present the scene with varying types of information depending on their semantics, utilizing the data obtained from the onboard sensors [6]. Advanced variants of the frontier-based algorithm were developed in [35][36][37].
The commonly used exploration techniques for unknown indoor environments use metric (i.e., 2D/3D occupancy) maps without considering the high-level contextual information provided by onboard sensors, such as the class of objects in occupied voxels. These methods may work locally in simple environments, but they are unlikely to function well in cluttered and dynamic environments. These drawbacks motivate the need for a new, efficient exploration strategy that evaluates the contextual information from the environment. Fortunately, current advancements in hardware and software algorithms (such as deep learning) have opened a new research direction by encapsulating semantic information in occupancy maps, which allows efficient object labeling and localization.
Recently, autonomous semantic-based exploration applied in [33,34] has used the contextual information encapsulated in semantic maps to enhance the exploration process by maximizing the use of available resources. T. Dang et al. [34] proposed an autonomous exploration model that explores an unknown environment and simultaneously searches for objects of interest. The proposed method utilizes a deep learning model that classifies the objects from 2D images and projects them into a 3D semantic occupied map. Each object of interest occupies a voxel in the 3D map with a specific color. The exploration process uses the 3D semantic occupied map to search for objects of interest and to explore the unknown environment. T. Dang et al. in [33] employed the visual saliency model to build an annotated map using visual importance incrementally. The exploration planner simultaneously optimizes the exploration of unknown spaces and directs the robot towards the salient objects.
In contrast to these methods, we first propose a new 3D, semantically-aware occupancy map structure that carries both the occupancy information and the semantic annotations to be utilized in the autonomous exploration task. A deep learning model is utilized to semantically segment the objects in 2D images and project them into a 3D map where each object is labeled with a unique color, as illustrated in Section 3.2. Second, we propose a new multi-objective utility function that uses the semantic information to direct the robot towards the objects in the environment and label them, as shown in Section 3.3. Our work differs from the studies in the literature: instead of using only an occupancy map for autonomous exploration, we utilize the contextual information from the 3D semantic map to explore the environment and label all the objects. Moreover, the semantic map utilizes a deep learning model to semantically segment the objects in 2D, which provides an accurate projection of the objects into the 3D semantic map. Additionally, our proposed utility function encapsulates the contextual information by engaging either the confidence value of an object belonging to a particular class or the number of visits for objects of interest, thereby facilitating the labeling task.

Proposed Approach
In this work, a bounded exploration area V ⊂ R^3 is initially unknown and has to be explored. Our exploration strategy attempts to create a 3D map M and find a collision-free path σ, starting from an initial position ζ_init, that leads to the following: (a) identifying the free V_free and occupied V_occ parts of the whole exploration area and (b) maximizing the information gathered to label objects in the environment using two sources of information: (1) the depth information captured by the robot's sensor and (2) the semantic information obtained from a deep learning model. Figure 1 shows the system architecture of the proposed semantically-aware exploration for object labeling and 3D mapping. The proposed architecture consists of two main modules: (1) mapping and information quantification, and (2) autonomous exploration. The proposed approach aims to provide first responders in USAR scenarios with a rich environmental representation. In the first module, the 3D semantic occupancy map is generated by using a deep learning model to segment the objects in 2D images into different labels. To perform this step, semantic segmentation is used to classify and segment objects at the pixel level, assigning each pixel to a particular class. The point clouds corresponding to the image pixels are annotated using the output of the deep learning model, and the annotated point clouds are used to create a 3D semantic occupancy map. The map is divided into small cubical voxels m ∈ M with edge length res, which is the map resolution. Each voxel stores three types of information, occupancy, color, and confidence values, which correspond to the voxel's vacancy, class, and certainty, respectively.
Initially, all the voxels are assigned as unknown. After the initial data gathering from the onboard sensors, the 3D semantic occupancy map is updated to contain V_free ⊂ V and V_occ ⊂ V, together with the updated per-voxel information on the number of visits, confidence value, and class type. Feasible paths σ are subject to the limited field of view (FOV) and effective sensing distance constraints.
In the second module, the autonomous exploration module uses the contextual information obtained from the mapping stage. In this work, we adopted the NBV algorithm to determine the next best view that ultimately fulfills the goal of labeling all objects in the environment. The exploration module involves three stages: (i) viewpoint sampling employing a receding horizon, (ii) viewpoint evaluation, and (iii) navigation and termination.
Two new multi-objective utility functions are proposed to evaluate each candidate viewpoint sample in terms of two criteria: (i) information gain obtained from exploring new space that has not been mapped yet; (ii) the information gain obtained from semantically labeled objects with either lower confidence or visits values for the object of interest to direct the robot to label the objects in the unknown environment accurately.
The candidate that maximizes the gain is selected as the next best view, and the robot moves to that position. The whole process is repeated in a receding-horizon fashion. Paths are planned only in the free space V_free. Within the algorithm's implementation, the robot configuration is given as P = [x, y, z, yaw], where roll and pitch are considered near-zero. In this work, the following assumptions were made: (i) the robot's position was assumed to be fully known, since the simulation experiment was performed in the Gazebo simulator and the position was obtained directly from the simulated UAV model (robot localization is out of the scope of this research); (ii) the environment is unknown; and (iii) the evaluation of the utility functions for the exploration process is performed at the voxel scale rather than the object scale.

Mapping and Information Quantification
The procedure for mapping and information quantification is shown in Figure 1. It involves three stages: scene understanding, where semantic segments are generated for objects found in a 2D image frame; annotated point cloud generation, where the point cloud captured from the depth camera is annotated with the corresponding classes from the deep learning model output; and 3D, semantically-aware mapping, which registers the annotated point cloud into the 3D occupancy grid map. The stages are explained in detail in the following sections.

2D-Image-Based Semantic Segmentation and Semantic Fusion
• Semantic Segmentation
The pyramid scene parsing network (PSPNet), a deep neural network for semantic segmentation [28], is employed to provide semantic segments for the different objects in 2D images. The network architecture is shown in Figure 2. PSPNet operates in five steps: (i) create a feature map using a ResNet backbone, (ii) apply pyramid pooling on the created feature map to produce feature maps at varying resolutions, (iii) perform convolution on each pooled feature map, (iv) create a final feature map by upsampling the pooled feature maps and concatenating them, and (v) generate a class score map through final convolutions. The PSPNet model was trained [28] on two datasets, SUNRGBD [38] and ADE20K [39]. The ADE20K dataset includes a large vocabulary of 150 classes of both indoor and outdoor objects, whereas the SUNRGBD dataset contains 38 semantic classes for indoor scenes. The fact that ADE20K contains a larger number of object classes made it the preferable choice for segmenting an unknown environment with a diversity of objects. Examples of 2D images and their semantic segments are shown in Figure 3a,b, respectively. The RGB images are obtained using the simulated sensor in the Gazebo simulator, which mimics the characteristics of an actual camera.
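As a rough, self-contained illustration of step (ii) above, the following sketch applies pyramid pooling to a single-channel feature map with NumPy. The bin sizes match those commonly reported for PSPNet, but the convolution steps are omitted and the map size is a toy value, so this is only a conceptual sketch, not the network's implementation:

```python
import numpy as np

def pyramid_pool(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Sketch of PSPNet-style pyramid pooling on a single-channel feature
    map: average-pool at several grid resolutions, upsample each pooled map
    back to the input size, and stack them with the input."""
    h, w = feature_map.shape
    pooled_maps = [feature_map]
    for bins in bin_sizes:
        # Average-pool the map into a bins x bins grid.
        pooled = np.zeros((bins, bins))
        ys = np.linspace(0, h, bins + 1, dtype=int)
        xs = np.linspace(0, w, bins + 1, dtype=int)
        for i in range(bins):
            for j in range(bins):
                pooled[i, j] = feature_map[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean()
        # Nearest-neighbour upsample back to (h, w).
        up = pooled[np.linspace(0, bins - 1, h, dtype=int)][:,
                    np.linspace(0, bins - 1, w, dtype=int)]
        pooled_maps.append(up)
    return np.stack(pooled_maps)  # shape: (1 + len(bin_sizes), h, w)

fm = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 feature map
out = pyramid_pool(fm)
print(out.shape)  # (5, 6, 6); out[1] is the global-average (1x1 bin) context
```

The 1x1 bin captures global scene context while the finer bins preserve sub-region detail, which is the intuition behind PSPNet's pyramid pooling module.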
In this work, only the ADE20K dataset is used; the semantic labels of the ADE20K dataset are shown in Figure 4.

• Semantic Fusion
Semantic information encapsulated in the point cloud is fused using the "max fusion" approach [40]: the semantic color that corresponds to the highest confidence value generated by the CNN model is included in the generated point cloud. In addition, the same confidence value and label color are saved in the corresponding voxel of the octomap.
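A minimal sketch of the max fusion rule for a single pixel, assuming a placeholder class list and color table (not the actual ADE20K palette):

```python
# Hypothetical class-to-color table for illustration only.
CLASS_COLORS = {"wall": (120, 120, 120), "chair": (204, 5, 255), "table": (255, 6, 51)}

def max_fusion(class_scores):
    """class_scores: dict mapping class name -> CNN confidence for one pixel.
    Keep only the class with the highest confidence; return the label, its
    confidence, and its semantic color, as stored with the point / voxel."""
    label = max(class_scores, key=class_scores.get)
    return label, class_scores[label], CLASS_COLORS[label]

scores = {"wall": 0.10, "chair": 0.75, "table": 0.15}
label, conf, color = max_fusion(scores)
print(label, conf, color)  # chair 0.75 (204, 5, 255)
```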

Annotated Point Cloud Generation
A point cloud is a collection of points in an unordered structure, where each point contains a coordinate in a specific reference frame. To generate a point cloud, the depth image is first registered in the same reference frame as the color image, which is usually the camera frame. The pixels are then transformed from the camera frame to the world frame using the image position, the depth, and the intrinsic camera parameters to form the point cloud. An example of the point cloud generation process is illustrated in Figure 5. The point clouds are textured with color; therefore, the points have the same colors as the objects presented in the image frame, as shown in Figure 5c. An illustration of the point cloud data structure when storing semantic information using max fusion is shown in Figure 6.
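The back-projection step can be sketched with the standard pinhole camera model; the intrinsics and depth values below are toy values, not those of the simulated sensor:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into camera-frame 3D points using
    the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    v, u = np.indices(depth.shape)          # pixel row (v) and column (u) grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]         # drop invalid (zero-depth) pixels

depth = np.full((4, 4), 2.0)                # toy 4x4 depth image, all 2 m away
pts = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3)
```

A further rigid transform (rotation and translation from robot localization) would map these camera-frame points into the world frame before map insertion.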

3D Semantic Map Generation
To improve the semantic exploration strategies, the object classes detected and localized from each depth and color sensor input are used to obtain a 3D semantic-occupied annotated map representation of the environment.
The proposed 3D semantic occupancy map structure is based on an occupancy grid map called octomap [41], which uses an octree structure. An octree is a hierarchical data structure that divides the space into cubes of the same size, called voxels. Each cube is then recursively subdivided into eight sub-volumes until a given minimum voxel size is reached, as shown in Figure 7. The resolution determines the voxel size. The map M = {m_1, m_2, ..., m_i} consists of cubical elements (voxels) m_i of the same size. The voxels in traditional occupancy maps hold only volumetric information. In addition to the volumetric information, the proposed semantic occupancy map adds semantic information to the voxels: the semantic color, the confidence value, and the number of visits, where the confidence value is the highest confidence generated by the CNN. Each voxel m_i holds the following information:
1. Probability of being occupied P_o;
2. Semantic color and confidence value;
3. Number of visits.
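A minimal sketch (not the actual octomap extension) of the per-voxel record described above:

```python
from dataclasses import dataclass

@dataclass
class SemanticVoxel:
    """Per-voxel record: occupancy plus the added semantic fields."""
    p_occ: float = 0.5               # occupancy probability; unknown = 0.5
    color: tuple = (255, 255, 255)   # class color; white = unlabeled
    confidence: float = 0.0          # max CNN confidence for the class
    num_visits: int = 0              # times the voxel fell in the camera FOV

v = SemanticVoxel()                  # a freshly initialized (unknown) voxel
print(v.p_occ, v.confidence, v.num_visits)  # 0.5 0.0 0
```

In the actual implementation these fields extend the octree leaf nodes, so the spatial indexing and occupancy updates of octomap are reused unchanged.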
Initially, all voxels are assigned as unknown, with P_o = 0.5, colored white, and given a confidence value of 0. At every view, the input data to the mapping and exploration module are the 2D image and the 3D point cloud, whose points consist of positional and semantic (x, y, z, r, g, b, confidence) information. The color indicates a certain class type, not the texture color. The point clouds are used to build the voxels in the semantic occupancy map. This information is fed back into the map, and the corresponding position and semantic color are assigned to the voxels located within the camera FOV. The occupancy value P_o(m_i) ∈ [0, 1], which represents the probability of being occupied, is assigned to the observed voxels using volumetric mapping. Simultaneously, the class type is assigned to the voxel using the semantic color indication. The constructed map is visualized with colors according to the object class.

Semantically-Aware Exploration
The proposed exploration strategy provides the robot with the ability to explore unknown environments while simultaneously optimizing information gathering and directing the robot to label objects in the environment. To enable this functionality, two new multi-objective utility functions are proposed to account for the semantic information (confidence or number of visits) that each voxel encapsulates. The proposed system uses information from the semantic occupancy map to evaluate the next best exploration action via the next-best-view (NBV) approach. The main steps in the exploration task are: (A) viewpoint sampling, (B) viewpoint evaluation, (C) navigating toward the selected viewpoint, and (D) termination. The exploration process is summarized in Figure 8. At the beginning of the exploration process, the robot uses the onboard sensors to observe the scene and produce a set of candidate viewpoints (also known as states or poses). The accepted viewpoint candidates are then evaluated using a utility function (also known as a reward, cost, or heuristic function). The viewpoint that maximizes the utility function is selected as the next goal and is called the "next best view". The exploration process is repeated until a termination condition is met, indicating the end of the exploration. Algorithm 1, adapted from [42], is used as the path planning procedure.

Algorithm 1: Receding-horizon next-best-view path planning
1: ξ_0 ← current robot configuration
2: Initialize T with ξ_0
3: g_best ← 0
4: n_best ← n_0(ξ_0)
5: N_T ← number of initial nodes in T
6: while N_T < N_max or g_best = 0 do
7:   Incrementally build T by adding n_new(ξ_new)
8:   if InformationGain(n_new) > g_best then
9:     g_best ← InformationGain(n_new)
10:    n_best ← n_new
11:  end if
12:  if N_T > N_TOL then
13:    Terminate
14:  end if
15: end while
16: σ_RH, n_RH, ξ_RH ← ExtractBestSegment(n_best)
17: return σ_RH
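A schematic Python rendering of this receding-horizon loop, with hypothetical `sample_config` and `information_gain` stand-ins (the real planner grows and scores whole RRT branches rather than single samples, so this is only a structural sketch):

```python
import random

def plan_next_best_view(xi_0, information_gain, sample_config,
                        n_max=50, n_tol=200):
    """Grow a tree of sampled configurations, track the best-gain node,
    and return it as the next-best-view step (cf. Algorithm 1)."""
    tree = [xi_0]                      # T initialized with current config
    g_best, n_best = 0.0, xi_0
    while len(tree) < n_max or g_best == 0.0:
        xi_new = sample_config(tree)   # incrementally extend T
        tree.append(xi_new)
        g = information_gain(xi_new)
        if g > g_best:
            g_best, n_best = g, xi_new
        if len(tree) > n_tol:          # safety cutoff (N_TOL)
            break
    return n_best                      # first step toward the best branch

random.seed(0)
best = plan_next_best_view(
    xi_0=(0.0, 0.0),
    information_gain=lambda xi: xi[0] + xi[1],     # toy gain function
    sample_config=lambda t: (random.random(), random.random()))
print(best)
```

In the receding-horizon scheme, only the first edge toward `n_best` is executed before the tree is rebuilt with updated map information.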

Viewpoint Sampling
The viewpoint samples are the possible positions the robot may visit as a next step. Taking all possible views into consideration is computationally intractable and expensive [43,44]. Hence, a subset of possible views is selected for the evaluation step based on the information they can provide. Different sampling strategies are used, such as (i) regular grid sampling [45], (ii) frontier sampling [46], and (iii) incremental sampling (i.e., rapidly-exploring random trees (RRT)) [47].
The incremental approaches differ from regular grid sampling and frontier sampling by selecting a series of points randomly in a tree-like manner instead of multiple single views for evaluation. These trees can be built using rapidly-exploring random trees (RRT) [48] or their variant, RRT* [47]. The tree is expanded throughout the exploration space, forming groups of random branches. RRT is extensively used due to its ability to work in higher-dimensional spaces, and it ensures probabilistic completeness. Each vertex in the search space has a Voronoi region computed by the RRT, where the randomization guarantees a search biased towards vertices with larger Voronoi regions.
Each point in the branch is evaluated using a utility function, and the branch that maximizes the utility function is chosen. Although the evaluation is performed for the whole branch and the best branch is selected for execution, Bircher et al. [42] execute only the first edge of the selected branch. The rest of the branch is utilized for new tree initialization, maintaining the original path while allowing new paths to grow as the map is updated. The rapidly-exploring random trees and the viewpoint sampling approach provided in [42] are adapted in this research, as shown in Figure 9.
In each iteration, a random geometric tree T of finite depth with maximum edge length l is sampled in the known free space of robot configurations and incrementally built from an initial configuration ξ_0. All branches of this tree are evaluated using a utility function that combines the information gain related to the exploration of unknown areas with the semantic-information-based class confidence values represented in the occupancy map. When the best branch is extracted, the first viewpoint ξ_RH, corresponding to the vertex n_RH on the path σ_RH, is selected, and the robot moves to that point. Subsequently, the whole procedure is repeated in a receding-horizon fashion.
Viewpoints or edges are added to the tree and accepted for evaluation only if they lie inside the exploration area and the direct line of sight from the parent does not cross occupied voxels; in other words, the path must be collision-free.

Viewpoint Evaluation
The utility function is used to select the best candidate viewpoint for the robot to visit. Many factors impact the viewpoint evaluation, such as the cost of moving to a new point in terms of distance, time, and power consumed, as well as the expected information gain (IG) the viewpoint can provide. The proposed utility function utilizes both volumetric and semantic information that a single voxel holds. The gain is calculated for each branch as the accumulated gain for the single nodes in the branch.
At each viewpoint sample ξ_RH, the sensor generates a set of rays R, where each ray ends when it is incident on a physical surface or reaches the limit of the map. For a single view v within a set of views V, the 3D points from the sensor are annotated and projected into the map. The projection is carried out through ray casting, resulting in a set of rays R_v cast for every view. As each ray r traverses the map, the volumetric and semantic information is accumulated over the set of visited voxels X. The predicted information gain G_v for a single view is the cumulative volumetric and semantic information collected along the rays cast from v, such that:

G_v = Σ_{x ∈ X} (α I_vol(x) + β I_sem(x)),    (1)

where I_vol(x) is the volumetric information, I_sem(x) is the semantic information that a single voxel encapsulates, and α and β are the weights for alternating between the volumetric and semantic gains. The formulation of I_vol(x) and I_sem(x) defines the behavior of the system.
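A toy illustration of this accumulation over the traversed voxels, with hypothetical indicator functions standing in for I_vol and I_sem:

```python
def view_gain(voxels, i_vol, i_sem, alpha=0.3, beta=0.7):
    """voxels: iterable of voxel records traversed by the view's rays;
    i_vol / i_sem: per-voxel information terms.
    Returns G_v = sum over voxels of (alpha * I_vol + beta * I_sem)."""
    return sum(alpha * i_vol(x) + beta * i_sem(x) for x in voxels)

# Toy example: three unknown voxels (volumetric info 1 each), of which one
# is also a low-confidence labeled voxel (semantic info 1).
voxels = [{"unknown": True, "svv": False},
          {"unknown": True, "svv": False},
          {"unknown": True, "svv": True}]
g = view_gain(voxels,
              lambda x: 1.0 if x["unknown"] else 0.0,
              lambda x: 1.0 if x["svv"] else 0.0)
print(g)  # ≈ 1.6, i.e., 0.3*3 + 0.7*1
```

Raising β relative to α shifts the planner's preference from covering unknown space toward re-observing poorly labeled objects.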
For a single voxel, the uncertainty that encodes the occupancy probability is defined as the voxel entropy, as in Equation (2):

H(x) = −P_o log2(P_o) − (1 − P_o) log2(1 − P_o),    (2)

where P_o ∈ [0, 1] denotes the probability of voxel x being occupied and (1 − P_o) is its complement. A voxel about which no information is available has P_o = 0.5 and the highest entropy, H(x) = 1 Shannon. Numerous studies in the literature considered volumetric information to formulate utility functions for different exploration purposes, such as volumetric gain [42], pure entropy [49], average entropy [50], occlusion-aware [49], and unobserved voxels [49].
The volumetric gain and pure entropy utility functions tend to reduce the map entropy by directing the robot towards views that contain a larger number of unknown voxels. In contrast, the information gain obtained via average entropy favors views that traverse fewer voxels with higher entropy. In addition, the occlusion-aware utility function tends to visit views with more unknown voxels near the candidate view.
Utility functions in the literature account for the volumetric information or the visibility likelihood of the voxels and neglect semantic information. We propose two multi-objective functions that consider both the volumetric and the semantic information encapsulated in each voxel. In this research, the volumetric information is adapted from [42], where I_vol(x) is defined as the accumulated number of unknown visible voxels in the candidate viewpoint. Let I_vis(x) be the indicator that voxel x is visible; a voxel is visible if the line of sight from the robot to the voxel does not contain an occupied voxel. Hence, I_vis(x) is defined as:

I_vis(x) = 1 if voxel x is visible from the candidate view, and 0 otherwise.

We propose two multi-objective utility functions, semantic visible voxel (SVV) and semantic visited object of interest visible voxel (SVOI), each with its own semantic information formulation.

Semantic Visible Voxel (SVV)
The proposed SVV multi-objective function directs the robot towards views that contain known occupied voxels with a small confidence value, in order to label all the objects in the scene. The confidence value is obtained from the CNN and is the maximum confidence value over the 150 candidate object classes. Let I_c(x) be the indicator that the semantic confidence value assigned to a voxel is below a predefined threshold:

I_c(x) = 1 if c(x) < c_threshold, and 0 otherwise,

where c(x) is the semantic confidence value fused by the deep learning model and c_threshold is a predefined confidence threshold.
We propose the semantic information I_svv(x) for a single voxel to be 1 when the voxel is visible, semantically labeled, and its confidence value is less than the predefined threshold:

I_svv(x) = I_vis(x) · I_c(x).

Finally, the gain obtained from the proposed utility function is calculated by substituting I_svv(x) for I_sem(x) in Equation (1), with α = 0.3 and β = 0.7. The total gain for a single view combines the volumetric and semantic information of all the voxels in the view. Figure 10 illustrates the SVV function, which accumulates both the volumetric information in Figure 10a and the semantic information in Figure 10b.
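A sketch of the SVV per-voxel semantic term, assuming boolean visibility and labeling flags plus the fused CNN confidence are available for each voxel (the argument names are illustrative):

```python
def i_svv(visible, labeled, confidence, c_threshold=0.7):
    """I_svv(x): 1 iff the voxel is visible, semantically labeled, and
    its class confidence is below the predefined threshold, so candidate
    views are rewarded for revisiting weakly labeled surfaces."""
    return 1 if (visible and labeled and confidence < c_threshold) else 0
```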

Semantic Visited Object Of Interest Visible Voxel (SVOI)
The proposed SVOI multi-objective function directs the robot towards views that contain occupied voxels classified as an object of interest but not yet visited sufficiently. In this function, the number of visits to each voxel is recorded and used in the semantic information calculation. Let I_s(x) be the indicator that the object belongs to the set of objects of interest S:

I_s(x) = 1 if the class of voxel x belongs to S, and 0 otherwise.

Each voxel in the semantic map holds a value x_numOfVisits that represents the number of times it has been visited. In addition, let I_obj(x) be the indicator for the visiting threshold:

I_obj(x) = 1 if x_numOfVisits < x_threshold, and 0 otherwise,

where x_threshold is a predefined threshold for the sufficient number of visits.
We propose the semantic information I_svoi(x) for a single voxel to be 1 when the voxel is visible, semantically labeled, belongs to an object of interest, and its visit count is less than the predefined threshold:

I_svoi(x) = I_vis(x) · I_s(x) · I_obj(x).

Finally, the gain obtained from the proposed utility function is calculated by substituting I_svoi(x) for I_sem(x) in Equation (1), with α = 0.3 and β = 0.7. The total gain for a single view combines the volumetric and semantic information of all the voxels in the view.
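A sketch of the SVOI per-voxel semantic term; the set of objects of interest and the visit threshold below are illustrative defaults, not values fixed by the method:

```python
def i_svoi(visible, label, num_visits,
           interest_set=frozenset({"person"}), visit_threshold=3):
    """I_svoi(x): 1 iff the voxel is visible, its semantic label belongs
    to the set of objects of interest S, and its recorded visit count is
    still below the visiting threshold x_threshold."""
    return 1 if (visible and label in interest_set
                 and num_visits < visit_threshold) else 0
```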

Termination
The exploration task finishes when a termination condition is satisfied. Most termination conditions depend on the scenario, such as detecting all victims [51], the number of iterations surpassing a predefined limit [52], or the information gain of all sampled viewpoints falling below a particular threshold [45]. Additionally, for frontier-based methods, the exploration terminates when no more frontiers are found [46].
Since the aim of the exploration is to label all the objects in the environment, the proposed termination condition is satisfied when all the voxels are identified with a confidence value greater than the confidence threshold.
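The proposed condition can be sketched as a simple check over per-voxel confidence values (a simplified stand-in for iterating the semantic map):

```python
def labeling_complete(confidences, c_threshold=0.7):
    """Proposed termination condition: exploration stops once every
    identified voxel carries a semantic confidence above the threshold.

    confidences -- iterable of fused confidence values, one per voxel."""
    return all(c > c_threshold for c in confidences)
```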

Scenario
The main goal of the proposed approach is to autonomously explore the environment and label all the detected objects. For each scenario, the robot begins at a starting location, and the entire process runs until a termination condition is satisfied. In the proposed scenario, takeoff and rotation steps are performed initially to create an initial map that allows the planner to generate sample viewpoints for evaluation.
The viewpoint sampling is performed using an incremental approach, specifically a rapidly-exploring random tree (RRT), where a series of points are selected randomly in a tree-like manner instead of multiple single points for evaluation. The viewpoints in each branch are then evaluated using the different utility functions, one per test. The branch that maximizes the gain is selected, and its first edge is executed. The robot then navigates to the selected point, and the process terminates when the number of exploration iterations exceeds a predefined limit.
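The receding-horizon selection step described above can be sketched as follows, assuming each branch is a list of viewpoints and gain_fn scores a single viewpoint (both are hypothetical stand-ins for the planner's data structures):

```python
def select_next_viewpoint(branches, gain_fn):
    """Score every RRT branch by the total gain of its viewpoints, pick
    the best branch, and return only its first viewpoint -- the first
    edge is executed before the tree is regrown in the next iteration."""
    best_branch = max(branches, key=lambda b: sum(gain_fn(v) for v in b))
    return best_branch[0]
```

For example, with numeric viewpoints and an identity gain, `select_next_viewpoint([[1, 2], [5, 0]], lambda v: v)` picks the branch with total gain 5 and returns its first viewpoint, 5.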

Simulation
Simulation experiments were performed on an ASUS laptop (Intel Core i7 @ 2.8 GHz × 8, 16 GB RAM). The NBV framework was implemented on Ubuntu 16.04 using the Robot Operating System (ROS Kinetic) to handle message sharing and ease the transition to hardware. Gazebo was used to perform the actual simulation, with programming done in both C++ and Python. The simulations used a UAV equipped with one RGB-D camera; the specifications of the camera are shown in Table 1. The Gazebo environment, shown in Figure 11, was used as the unknown environment that the robot should explore. The simulation environment measures 9.2 m × 8.2 m × 2.5 m and consists of multi-connected rooms with a corridor and several objects placed inside the rooms. The environment contains 11 objects: walls, floors, three people, a sofa, two chairs, a bookshelf, a vase, and a table. The constructed maps are based on a 3D occupancy grid using the OctoMap library [41] with a resolution of 0.15 m per voxel. Each utility function was tested separately under controlled conditions, and the following assumptions were made:
• The position of the robot is entirely known from Gazebo.
• The explored environment is unknown, with known boundaries only.

Baselines for Comparison
For comparison baselines, we use the following utility functions for our evaluation: volumetric utility [42], pure entropy [49], average entropy [49], occlusion aware [49], and unobserved voxels [49]. The different utility functions are shown in Table 2.

Evaluation Metrics
The evaluation metrics used in this work are the volumetric coverage, the number of detected objects, and the number of sufficiently labeled objects. An object is sufficiently labeled when its voxels have either been visited repeatedly until their confidence values rise above a threshold, set to 0.7, or been visited more than a predefined number of times, set to three visits; this results in labeling the voxels correctly. Table 3 summarizes the evaluation metrics used.
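The "sufficiently labeled" criterion can be sketched per voxel as follows, using the thresholds stated above (the function name is illustrative):

```python
def is_sufficiently_labeled(confidence, num_visits,
                            c_threshold=0.7, visit_threshold=3):
    """A voxel counts as sufficiently labeled when either its fused
    semantic confidence has risen above the confidence threshold or it
    has been visited more times than the visit threshold."""
    return confidence > c_threshold or num_visits > visit_threshold
```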

Mapping Evaluation
The 3D semantically-aware occupancy maps constructed using the seven different utility functions are presented in Figure 12. The presented maps were obtained after 120 iterations. As Figure 12 shows, the volumetric gain, SVV, and SVOI utility functions were capable of constructing most of the map, covering more than 85% of it. However, the pure entropy, average entropy, and occlusion-aware functions failed to construct the map within the predefined number of iterations.

Area Coverage Over Number of Iteration
The two proposed utility functions are compared with commonly used state-of-the-art utility functions. In both of the proposed utility functions, α was set to 0.3 and β was set to 0.7 to steer the utility towards semantic labeling rather than volumetric exploration. The reported results are for seven different simulation experiments, divided according to the viewpoint evaluation approaches. The exploration terminates after 120 iterations, and the semantic fusion method used in this test is max fusion. Table 4 summarizes the results obtained. Figure 13 shows the volumetric coverage versus the number of iterations for all the tests. The SVV utility function reached almost the same volumetric coverage as the SVOI and volumetric utility functions, while the SVOI and volumetric utility functions reached higher volumetric coverage in fewer iterations.
Figure 12. Generated semantic maps using different utility functions after 120 iterations. The colors correspond to semantic classes, the green line segments represent the path travelled, and the red arrows represent the viewpoints along the path.

Object Detection and Labeling
The color map shown previously in Figure 4 is used to show the detected objects. The volumetric gain, SVV, and SVOI utility functions were able to detect all the objects in the environment and to sufficiently label more than 75% of them. Figure 14 shows four different object detections at the voxel level. The proposed SVOI function was capable of detecting a larger number of person voxels in fewer iterations, as shown in Figure 14a. In addition, the SVOI function showed good performance when detecting chair voxels compared to the other utility functions, as shown in Figure 14b.
The number of voxels for each object is recorded at each iteration until the exploration process terminates (120 iterations). As observed in Figure 14, the recorded number of voxels for each object varies. The variation is due to the semantic fusion method, max fusion, which selects the semantic label with the highest confidence for each voxel. In addition, max fusion switches the label of a voxel if the confidence values of competing classes are marginal at the end of an iteration.

Object Search Application
The SVOI utility is an adaptive function in which the object of interest can be selected; any object from the dataset can be chosen. In this test, rather than choosing the entire dataset as the set of interest objects, only the person class was selected as the object of interest. Hence, in a search and rescue context, this test aims to search for victims, locate them, and project their locations onto the 3D semantic map. The results are compared to the state-of-the-art utility functions in terms of volumetric coverage, constructed semantic map, and the ability to detect persons in fewer iterations. Figure 15 shows that the proposed SVOI utility function outperforms the benchmark utility functions by covering 85% of the environment in fewer iterations. Figure 12g shows the map reconstructed using the proposed function with the person as the object of interest. The results show that the proposed utility function was able to sufficiently label one additional object (person), as shown in Table 4. Finally, Figure 16 shows that the proposed utility function performed better at detecting a person (e.g., a victim) compared with the other utilities in this study.

Conclusions
In this paper, a strategy for autonomous, semantically-aware exploration for object labeling and 3D semantic mapping was presented. This strategy facilitates exploring and mapping unknown environments efficiently while directing the robot to label the objects present in the environment. We proposed a novel 3D semantically-aware occupancy map data structure that carries the occupancy information as well as the semantic annotations to be utilized in the autonomous exploration task. The mapping method employs a deep learning model to segment objects semantically, generate an annotated 3D point cloud, and create a 3D semantically-aware occupancy map. Two new multi-objective utility functions were introduced to quantify the contextual information in this map when exploring unknown environments.
Experimental results in simulation demonstrated that the semantically-aware exploration and mapping strategy was able to explore the environment and label the objects by utilizing the contextual information from the 3D semantic map. The proposed utility functions showed reliable performance in exploration and object labeling. A comparison between the proposed utility functions and the state-of-the-art utility functions showed that the proposed utilities outperform the benchmarks in volumetric coverage and labeling accuracy.
The semantic fusion method used in the proposed strategy suffers from inaccuracies (frequent label shifts) when the confidence in the semantic labeling is on the border between multiple classes. This will be addressed in future work by improving the accuracy of the semantic labeling and adopting a variation of Bayesian probabilistic semantic fusion. Additionally, the utility functions introduced require careful tuning of the scaling weights α and β; in the future, we will explore the possibility of introducing an adaptive scaling strategy that accommodates various objectives and environments.