Power Tower Inspection Simultaneous Localization and Mapping: A Monocular Semantic Positioning Approach for UAV Transmission Tower Inspection

Realizing autonomous unmanned aerial vehicle (UAV) inspection is of great significance for power line maintenance. This paper introduces a scheme that uses the structure of a tower to realize visual geographical positioning of a UAV for tower inspection and presents a monocular semantic simultaneous localization and mapping (SLAM) framework, termed PTI-SLAM (power tower inspection SLAM), to cope with the challenges of a tower inspection scene. The proposed scheme utilizes prior knowledge of tower component geolocation and regards geographical positioning as the estimation of the transformation between the SLAM coordinates and the geographic coordinates. To accomplish robust positioning and semi-dense semantic mapping with limited computing power, PTI-SLAM combines the feature-based SLAM method with a fusion-based direct method and adopts a loosely coupled architecture between the semantic task and the SLAM task. The fusion-based direct method is specially designed to overcome the fragility of the direct method under the adverse conditions of the inspection scene. Experimental results show that PTI-SLAM inherits the robustness of the feature-based method and the semi-dense mapping ability of the direct method and achieves decimeter-level real-time positioning on the airborne system. The geographical positioning experiment indicates more competitive accuracy than the previous visual approach and manual UAV operation, demonstrating the potential of PTI-SLAM.


Introduction
Regular inspection of overhead transmission lines is of great significance for securing the uninterrupted distribution of electricity [1]. Overhead transmission lines have traditionally been inspected by foot patrol [1], which is known for its low efficiency, labor intensiveness and hostile work environment. To remedy these deficiencies, unmanned aerial vehicles (UAVs) have been widely used in transmission line inspection and are expected to replace foot patrol owing to their higher efficiency, lower cost and higher trafficability; this has made autonomous UAV inspection a hot issue worldwide [1].

Automatic UAV Power Line Inspection
Most studies of UAV autonomous inspection fall into the category of power corridor inspection [2], which can be directly guided by power lines [3][4][5][6] and towers [7], as shown in Figure 1. To realize autonomous inspection for power towers, the industrial world, including Chinese utility companies and their main inspection service providers (e.g., DJI enterprise), adopt a strategy of geographical waypoint tracking. In a typical waypoint-tracking-based automatic inspection, the patrol team carries the UAV to the tower and sets up the inspection task, including the global positioning system (GPS) coordinate and orientation of UAV for each checkpoint. The UAV automatically flies to the designated position, adjusts its attitude and collects inspection data. There are two ways to obtain the GPS waypoints and their shooting attitude: geometrically calculating in a three-dimensional surveying and mapping map [2], or recording the flight path of a skilled inspector. Such a strategy relies highly on high-precision positioning equipment such as real-time kinematic (RTK) fixed GPS. When the RTK signal coverage is not stable, manual operating is unavoidable. To overcome this limitation, Alvaro et al. [8] developed a cooperative positioning architecture with an unmanned ground vehicle (UGV) and a UAV for tower inspection. In this architecture, both UGVs and UAVs are equipped with RTK, while the UGV carries an additional visual positioning identifier to provide a positioning gain for the UAV. Nonetheless, the positioning is entirely unrelated to the scene so that the RTK fixed GPS is required to perform pre-programmed track flight and the bulky identifier also limits its application. 
Owing to the lack of real-time environmental information, the existing positioning approaches cannot support further intelligent tasks such as autonomous navigation and self-adapting to scene changes (e.g., vegetation invasion to the flight route and location movement of inspection target caused by maintenance).

Scene Challenges for Monocular Object SLAM
In the context of robotics, the study of surrounding perception often refers to semantic simultaneous localization and mapping (SLAM) [9]. Considering the hardware compatibility of inspection UAVs, monocular semantic SLAM is the stronger candidate for a power line inspection scenario; however, it faces challenges in the power tower inspection scene.
The semantic SLAM concerning navigation usually utilizes objects as the landmarks for locating the scene map and providing navigation information, i.e., the object SLAM. There are two main approaches to extract and represent spatial information of objects: using preset templates to estimate the position and orientation of objects from sparse environment mapping [10] and using dense mapping to reconstruct objects [11]. Concerning the former, the structure and appearance of power tower components are typically variable. For example, the insulator string, which is an important component used to hang the power lines, has a very variable aspect ratio (possibly from 0.1 to 1.2) and is likely to be deformed by the traction of conductor. Fitting such objects requires the template to be very adaptable. When using Quadric to fit the objects (like QuadricSLAM [12]), such a large aspect ratio can cause even a slight error to lead to a huge deviation. Cube-SLAM [13] utilizes a 2D bounding box and the vanishing points of the surface line to model the objects as a cube; however, the sufficiently linear edges are not always present on the tower components like insulator string.
As for using dense mapping to reconstruct objects, limited by the computation consumption and real-time requirement, it is usually achieved by the direct method-based SLAM [11], which relies on the assumption of brightness constancy across multiple images [14] and is violated when illumination varies. Power tower inspection needs to observe objects from multiple angles in outdoor bright light and in scenarios where illumination changes often occur. In addition, power tower components are typically made of metal, ceramic or glass and the surface reflection easily causes a dramatic change of local brightness, resulting in the failure of direct method pixel tracking.

The Objective of the Paper
Motivated by the observations above, this paper introduces an idea of using the structure of a power tower to realize visual positioning of UAVs for tower inspection. In order to select the semantic objects from hundreds of tower components, we analyze the characteristics of tower inspection and offer recommendations concerning the aspects of visual detectability and scene structure describing ability. On this basis, a geographical UAV positioning scheme using prior knowledge of semantic objects is devised. A monocular semantic SLAM combining the feature-based method SLAM and the fusion-based direct method, termed power tower inspection SLAM (PTI-SLAM), is developed to perform robust positioning and semi-dense object mapping under the challenges of a tower inspection scene. The fusion-based direct method is designed to work against complex illumination by using a sliding window strategy to transform long-term pixel tracking into batch-based semantic observation fusion. PTI-SLAM utilizes loose coupling between the SLAM task and the semantics task to guarantee a real-time performance in a UAV onboard platform and further prevent the vulnerability of the direct method from affecting SLAM positioning.

Semantic Object Selection
For a vision task, the suitability of a semantic object can be considered from two perspectives: the capability of the scene structure representation and the adaptability for visual detection.

Structure Semantics for Inspection Flight
With the aim of examining the tower meticulously at close range, the UAV needs to traverse the tower and hover from time to time to perform more detailed observation and data acquisition. The flight trajectories, including the route and hovering points, are carefully charted to match the architecture of the tower and the assembly of power lines, resulting in a variety of flight schemes. Still, the guidelines for a flight scheme can be traced through the intent of each decomposed movement. Figure 2 shows a typical flight scheme adopted by a local utility company for a double circuit line tower inspection. The tower is dissected according to the component categorization used by the China Electric Power Research Institute (CEPRI) in its power line inspection aerial image processing competitions. The large-size fittings involve a variety of accessories such as the yoke plate and clamp. The hovering points A-N are the checkpoints directed by the examination targets, which are listed on the left side of Figure 2. The inspection flight is essentially a traversal of the checkpoints outside the no-fly zone, which is defined by the outermost edge of the power corridor and the regulations about safety distance (typically 2.5 m).
The insulator string is the examination target of hovering points F-H and K-M, while I and J can be located using a clamp and a damper. The particularity of these points is that they are usually located in the outermost areas of the tower and delineate the outline of the power corridor, which is crucial to the identification of a no-fly zone. The targets of hovering points B and C are the structures of the tower. Mapping and modelling such large-size hollowed-out structures requires multi-angle observations, which are usually only possible after a tower has been totally inspected. This means they cannot be used for scene cognition and navigation during proximity inspection. Besides that, the reconstruction of a tower is less important since checkpoints F-M can also outline the tower. The targets of checkpoints A and N are not concrete objects; they are directed by the semantic structure of the scene.

Component Detectability
In terms of visual detection adaptability, object detection is taken as representative of a semantic detection task since it attracts more attention compared to the other methods such as semantic segmentation and image classification and makes an efficient comparison more easily. We summarize reports of the last decade and list the best detection results in the 'Object Detection' column of Table 1. The components are classified similarly to Figure 2. Similar to the large-size fittings, the small-size fittings involve a variety of accessories including bolts, core-pin and other tiny accessories. The ancillary facilities consist of the support units like lightning arrester, nameplate and anti-bird tool. The statistics on the foundation and small fittings are marked with the '/' symbol, as their studies are too outdated or too rare to make an efficient comparison. The detections for insulator and damper achieved the most outstanding result in the laboratory, with average precision (AP) of over 95%. Such a high adaptability to visual detection benefits from the conspicuousness in the aerial image and the unified structural features. Although the cable and tower have similar precision in detection, they are hardly helpful in proximity inspection. Zhai et al. [15] detect multiple fittings using hybrid prior knowledge and achieve mean AP (mAP) of 78.6%, where the mAP of the clamp is 78.86%. Kong et al. [16] evaluate the detection performance on the tower plate and obtain an AP of 73.2%. They are among the best results in related studies, while there is still a big gap compared to the detection of the insulator and the damper. The main reason for the low accuracy is that the components are usually small and easily occlude each other. The results of two nationwide invitational competitions organized by CEPRI are also taken into account. The two competitions are aimed at detecting defects of components in the aerial images. 
Their datasets are collected by UAVs and helicopters, respectively, and labelled by CEPRI. Although defect detection differs significantly from object detection, these results are valuable because they provide a fair comparison on an authoritative dataset. In addition, the adaptability of defect detection may offer a further hint about the object detection performance for components with defects. Among all the components, the insulator shows high adaptability to defect detection and performs consistently in both competitions. In the UAV dataset, the damper is categorised as a large-size fitting and contributes the majority of the experimental results for this category; it therefore leads to a conclusion similar to that for the insulator.
To summarize, the insulator and damper are the stronger semantic candidates for navigation purposes, as they are highly associated with the inspection targets and have significant advantages in terms of scene structure-describing ability and visual detection adaptability. Taking tower plates and large-size fittings such as clamps as semantic targets carries a certain risk of failure since their visual detection is not always reliable. Although the foundation is one of the inspection targets, its visual detectability has not yet been proved. In addition, the foundation is easily obscured by the tower body in an aerial view, making accurate positioning difficult. Therefore, it is not recommended as a semantic object. In this paper, the insulator is selected as the representative semantic object for subsequent work.

Geographical UAV Positioning Scheme
The key to applying visual positioning in navigation is to establish an association between UAV positioning and flight tasks. Semantic objects offer an opportunity to accomplish this. However, some special checkpoint targets are not concrete objects (e.g., checkpoint N) or are not suitable for visual detection (e.g., checkpoint D), making location recognition difficult.
Alternatively, we devise an object-based geographical UAV positioning strategy to achieve geolocation without relying on on-board GPS. Specifically, we utilize the existing tower surveying and mapping data to take the geographical location of the semantic object as prior knowledge. The transformation relationship between the visual positioning coordinates and geographical coordinates is obtained by using object localization in the semantic map established by semantic SLAM. Such a transformation makes it possible to achieve a real-time geographical UAV positioning and use visual positioning instead of RTK-fixed GPS in the waypoint flight. In addition, since the UAV position is calculated based on object location, this strategy can accommodate a certain degree of scene change.

Framework of PTI-SLAM
The proposed PTI-SLAM is built on a well-known open source framework termed DS-SLAM [20], which improves the RGB-D mode of ORB-SLAM2 for indoor dynamic environments by segmenting and removing dynamic objects. A general overview of PTI-SLAM is shown in Figure 3. Five threads run in PTI-SLAM: tracking, local mapping, loop closing, segmentation and semantic positioning. In Figure 3, the white boxes are the modules from ORB-SLAM2 and the green boxes are the semantic modules designed for power line component positioning. The RGB image captured by the pan-tilt camera is first fed into the tracking thread. The tracking thread estimates the camera pose and position for every frame by matching ORB features to the local map and minimizing the reprojection error. The keyframes are representative frames extracted from the frame sequence. The generation of keyframes is determined according to the keyframe interval, the tracking status, the local mapping thread occupancy and the number of map points. Subsequently, the keyframes are served to two parallel backends for two separate tasks. The first task is to generate and maintain the sparse map points and participate in UAV positioning optimization. This is performed by the backend of the SLAM modules, including the local mapping thread and the loop closing thread. The local mapping thread updates the existing map point observations, culls the map points that are difficult to observe from multiple keyframes and creates new map points. A local bundle adjustment is adopted to optimize the estimations of map points and keyframes in the local area. The keyframes that are too visually similar to the incoming keyframe will be replaced. The loop closing thread detects motion loops and corrects the pose and position estimation of keyframes to resolve the drift of long-term estimation. It is noteworthy that the tracking thread calculates on the basis of a local map. 
When the local map, including map points and keyframes, is changed, it not only corrects past estimations but also affects current calculations.
The second task is to detect, map and locate the semantic objects. This is performed by two parallel threads: semantic segmentation and semantic positioning (i.e., the green boxes in Figure 3). The segmentation thread is responsible for segmenting the semantic object in the keyframe image, while the semantic positioning thread restores the depth of the segmented area from successive keyframe images and locates objects. The spatial coordinates of the object are determined by the pose and position of the keyframe and the depth of the object in the keyframe image plane. In other words, the position of the semantic object is bound to the keyframes. When the keyframes are optimized by the local mapping thread or the loop closing thread, the global position of the object is recalculated. Details of the semantic modules are provided in Section 3.2.
The module of the map is not an actual thread, but a conceptual integration of databases distributed across modules. Figure 3 utilizes it to represent the output interface of PTI-SLAM and the visualization is only used to illustrate the output of the system. In practice, the visual interface is not deployed in order to reduce the system overhead.
It is noteworthy that PTI-SLAM adopts a loose coupling between the SLAM task and the semantic task to benefit real-time performance. In our geolocation architecture, the positioning of the UAV is provided by the SLAM part in real time, while the semantic object positioning serves the coordinate transformation. A delay in the semantic task only causes a delay in the correction of the UAV geolocation. Figure 4 shows the overview of the semantic positioning. The semantic task can be divided into three steps: two-dimensional semantic segmentation, three-dimensional semi-dense mapping and object positioning. Following DS-SLAM, PTI-SLAM adopts SegNet, a deep fully convolutional neural network architecture for semantic segmentation, as the semantic extractor to process the raw RGB images of keyframes and provide pixel-wise segmentation. Owing to possible surface reflection, the segmentation of an object occasionally contains voids or incomplete edges. To overcome this limitation, a grid-based region of interest (RoI) strategy is designed. The grid strategy uses a voxel-like idea to simplify the two-dimensional semantic representation and tolerate the internal voids caused by misdetection. More specifically, when the input images have a resolution of 640 × 480, they are divided into grids of 10 × 10 pixels. The grids that contain semantic pixels are treated as the object region and combined into RoIs. Subsequently, the grids surrounded by RoIs are also merged, as they are considered to be misdetections. RoIs that are two grids away from each other are regarded as independent RoIs belonging to separate objects. RoIs with too small an area are filtered out, as they usually come from incorrect segmentation or overly distant objects.
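The grid-based RoI strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `grid_rois`, the parameters `min_gap` and `min_cells`, the 4-connected border flood used to find enclosed voids, and the reading of "two grids away" as a grid-index distance of at least two are all assumptions made for illustration.

```python
from collections import deque


def grid_rois(mask, cell=10, min_gap=2, min_cells=2):
    """Group the semantic pixels of a binary mask (list of rows) into grid-based RoIs.

    Returns a list of RoIs, each a sorted list of (row, col) grid indices.
    min_gap and min_cells are hypothetical tuning parameters.
    """
    rows, cols = len(mask) // cell, len(mask[0]) // cell
    # 1) A grid cell is occupied if it contains at least one semantic pixel.
    occ = [[any(mask[r * cell + i][c * cell + j]
                for i in range(cell) for j in range(cell))
            for c in range(cols)] for r in range(rows)]
    # 2) Fill voids: empty cells that cannot be reached from the image border.
    outside = [[False] * cols for _ in range(rows)]
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if (r in (0, rows - 1) or c in (0, cols - 1)) and not occ[r][c])
    for r, c in queue:
        outside[r][c] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not occ[nr][nc] and not outside[nr][nc]):
                outside[nr][nc] = True
                queue.append((nr, nc))
    for r in range(rows):
        for c in range(cols):
            if not occ[r][c] and not outside[r][c]:
                occ[r][c] = True
    # 3) Group occupied cells; cells at least min_gap grids apart stay separate.
    reach = min_gap - 1
    seen, rois = set(), []
    for r in range(rows):
        for c in range(cols):
            if not occ[r][c] or (r, c) in seen:
                continue
            roi, stack = [], [(r, c)]
            seen.add((r, c))
            while stack:
                cr, cc = stack.pop()
                roi.append((cr, cc))
                for nr in range(max(0, cr - reach), min(rows, cr + reach + 1)):
                    for nc in range(max(0, cc - reach), min(cols, cc + reach + 1)):
                        if occ[nr][nc] and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
            # 4) Drop RoIs whose area is too small.
            if len(roi) >= min_cells:
                rois.append(sorted(roi))
    return rois
```

Working on grid cells rather than pixels keeps the bookkeeping small: a 640 × 480 image reduces to 64 × 48 cells, so the flood fill and grouping touch only a few thousand cells per keyframe.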

Semantic Positioning
As for three-dimensional semi-dense mapping and object positioning, they are performed by the proposed fusion-based semantic direct method. Instead of long-term pixel tracking, the fusion-based direct method utilizes a sliding window to establish individual batches and locally track the pixel to densely reconstruct and position the objects and then globally fuse the object position.
Let $F_i$, $i = 0, 1, 2, \ldots, M$ denote the keyframes, with $F_M$ being the newest keyframe. The object positioning based on a sliding batch can be presented as follows:

$$O_{M-N+1,j} = f\left(F_{M-N+1}, F_{M-N+2}, \ldots, F_M\right)$$

where $O_{M-N+1,j}$ indicates the object coordinates in the camera coordinates of the reference keyframe, $j$ is the index of the object and $N$ is the sliding window size (it is empirically taken as 10 in our case; more details are provided in Section 4.5.1). $f(\cdot)$ denotes the object positioning using the inverse depth filter within a batch. In other words, the first keyframe in the batch is used as the reference frame; we track the pixels of the object forward in turn and densely map the object. To economise on the computational load, PTI-SLAM models the object as a centroid, shown as the red dot in Figure 4. Details of the inverse depth filter and object positioning within a batch are provided in Sections 3.2.1 and 3.2.2.
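As a small illustration of the sliding batch, the sketch below enumerates which keyframes fall into each batch and which one serves as the reference frame. The function name `sliding_batches` is hypothetical; the numbers in the usage line are the ones reported later in the object positioning evaluation (21 keyframes, window of 10).

```python
def sliding_batches(num_keyframes, window):
    """Return (reference_index, batch_indices) for every full sliding window.

    The first keyframe of each batch is the reference frame; the last one is
    the newest keyframe at the time the batch is processed.
    """
    batches = []
    for newest in range(window - 1, num_keyframes):
        ref = newest - window + 1
        batches.append((ref, list(range(ref, newest + 1))))
    return batches


batches = sliding_batches(21, 10)
print(len(batches))  # 12 batches, with reference frames 0 through 11
```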
To globally fuse the object positioning, the object coordinates need to be transformed into SLAM coordinates; this can be easily obtained as follows:

$$O^{W}_{ref} = T^{-1}_{ref}\, O_{M-N+1,j}$$

where $O^{W}_{ref}$ denotes the object coordinates in SLAM coordinates and $T^{-1}_{ref}$ is the transformation matrix from the reference frame to SLAM coordinates provided by the SLAM part. Then, the object coordinates are fused as follows:

$$O_{fuse,J} = \frac{w_J\, O_J + w_{M-N+1,j}\, O^{W}_{ref}}{w_J + w_{M-N+1,j}}, \qquad w_{fuse,J} = w_J + w_{M-N+1,j}$$

where $w$ is the existing fusion weight of the object and the subscript $fuse$ denotes the updated estimates. We employ the quantity of the point cloud of the object in the reference frame as $w_{M-N+1,j}$. The subscript $J$ denotes the global object associated with $O_{M-N+1,j}$. Details of the object association are provided in Section 3.2.3.
The fusion-based semantic direct method is designed based on two assumptions that have been observed several times in our scenario testing:
• Although illumination changes often occur, the brightness of the image is usually consistent in a small area. By limiting the range of pixel tracking to a small region on both a spatial and a temporal scale, our method enhances the confidence in the assumption of brightness constancy.
• A few extreme cases, such as drastic but transient reflections from the target surface, can destabilize the direct method, resulting in a sharp decrease in the quantity of generated point clouds. We utilize this property to reduce the contribution of these unreliable observations to object positioning.

Semi-Dense Mapping for RoIs
Figure 5 shows how the direct method restores depth from two adjacent keyframes. $I_r$ and $I_c$ are the images of the reference frame and the current frame, $O_r$ and $O_c$ denote the corresponding camera optical centres, and $P_r$ and $P_c$ are the projection points of $P$ in $I_r$ and $I_c$, respectively. According to the epipolar geometry, $P_c$ is constrained to lie along the epipolar line $l_c$ conjugate to $P_r$. Geometrically, $l_c$ is the projection of the extension of $O_r P_r$; it can be uniquely determined given $P_r$ and the transformation matrix $T$ of the camera.
To find the match point $P_c$, the pixel blocks on $l_c$ are examined in a stepwise manner by calculating their zero-mean normalized cross correlation (ZNCC) [21] with the block of $P_r$:

$$S(A, B)_{ZNCC} = \frac{\sum_{i,j}\left(A(i,j) - \bar{A}\right)\left(B(i,j) - \bar{B}\right)}{\sqrt{\sum_{i,j}\left(A(i,j) - \bar{A}\right)^2 \sum_{i,j}\left(B(i,j) - \bar{B}\right)^2}}$$

where $A, B \in \mathbb{R}^{\omega \times \omega}$ are the pixel blocks in $I_r$ and $I_c$, respectively, and $S(A, B)_{ZNCC}$ is the matching score. $A(i, j)$ and $B(i, j)$ are the pixel values in $A$ and $B$, and $\bar{A}$ and $\bar{B}$ denote the mean values of the corresponding blocks. Only the block pairs with $S(A, B)_{ZNCC} > 0.85$ are considered as potential matches and the best of them is regarded as the correct match. By subtracting the mean pixel value, the local brightness variation of the image can be suppressed to some extent.
To speed up the traversal of blocks, the search is restricted to a bounded domain on the epipolar line, whose endpoints correspond to the spatial points $p_{cmax}$ and $p_{cmin}$. These points represent the hypothetical range of the spatial point $P$ on the extension of $O_r P_r$. They are determined by the depth assumption $d_{ini}$ (i.e., the hypothetical distance from $P$ to the image plane $I_r$) and the depth uncertainty $\sigma$ and are then transferred to the camera $O_c$ coordinates by the transformation matrix $T$. The transformation matrix $T$ represents the transformation between the reference frame and the current frame, including the translation of the position and the rotation of the perspective. It is given by the SLAM module, i.e., the keyframe database in the map module. $d_{ini}$ and $\sigma$ are manually assigned at the very beginning and then iteratively updated in the depth filter. $p_{cmax}$ and $p_{cmin}$ can be easily mapped into $I_c$ by the pinhole camera model; for a point $(X, Y, Z)^T$ expressed in the camera $O_c$ coordinates, the pixel coordinates $(u, v)$ are

$$u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y$$

where $f_x$, $f_y$, $c_x$ and $c_y$ are the camera intrinsics.
Estimating the depth of a pixel from only two frames is not reliable, as the baseline between keyframes is usually short. To combat this problem, the inverse depth filter is employed to fuse observations from multiple keyframes. As Figure 4 shows, the last $M$ keyframes are packaged into an estimation batch, where $M$ is the sliding window coefficient. In the processing of a single batch, the first frame is used as the reference frame. The depth of the RoIs in the reference frame is estimated with the subsequent frames one by one using the direct method described above. There are $M - 1$ depth estimations for the reference frame in total. To fuse these observations, the estimated depths are represented as Gaussian probability distributions over the inverse depth [22] and fused as follows:

$$\mu_{fuse} = \frac{\sigma_{obs}^2\, \mu + \sigma^2\, \mu_{obs}}{\sigma^2 + \sigma_{obs}^2}, \qquad \sigma_{fuse}^2 = \frac{\sigma^2\, \sigma_{obs}^2}{\sigma^2 + \sigma_{obs}^2}$$

where $\mu$ and $\sigma^2$ denote the existing inverse depth estimate and its error variance ($\sigma$ being the uncertainty that bounds the search domain), the subscript $obs$ indicates the estimates of the new incoming observation and the subscript $fuse$ denotes the updated estimates. Owing to the strategies of ZNCC and the sliding window, we only consider the geometric disparity error [23] in the calculation of $\sigma_{obs}$. Suppose there is a one-pixel error in the matching of $P_r$, as shown in Figure 5, where $P_c'$ stands for the match with error. The observation uncertainty $\sigma_{obs}$ is obtained from this one-pixel matching error. To avoid inefficient observation, only the pixels whose gradient exceeds the threshold of 50 are selected for depth estimation.
At the last stage of batch processing, the depth estimates of the reference frame are filtered with respect to the threshold of $\sigma$ to erase untrustworthy observations. Qualified estimates are used to generate the point cloud of the reference frame.
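Two building blocks of this section, the ZNCC block score and the Gaussian fusion of inverse depth observations, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, blocks are plain nested lists, and `sigma` is treated as a standard deviation so that the fusion is the usual product of two Gaussians.

```python
import math


def zncc(a, b):
    """Zero-mean normalized cross correlation of two equally sized pixel blocks."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) * sum((y - mb) ** 2 for y in fb))
    # A textureless block has zero variance; report no correlation in that case.
    return num / den if den > 0 else 0.0


def fuse_inverse_depth(mu, sigma, mu_obs, sigma_obs):
    """Fuse the current inverse depth estimate with a new observation.

    Both estimates are modelled as Gaussians; sigma values are standard
    deviations (an assumption of this sketch).
    """
    var, var_obs = sigma ** 2, sigma_obs ** 2
    mu_fuse = (var_obs * mu + var * mu_obs) / (var + var_obs)
    sigma_fuse = math.sqrt(var * var_obs / (var + var_obs))
    return mu_fuse, sigma_fuse
```

Because the block means are subtracted and the result is normalized, a block that differs from the reference only by a brightness offset and gain still scores close to 1, which is the property the text relies on to suppress local brightness variation. Each fusion step can only shrink the uncertainty, so thresholding it at the end of a batch keeps the pixels that were matched consistently.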

Object Positioning within Batch
Essentially, depth recovery is an estimate of the norm of d p in Figure 5. The point cloud of the RoI is always distributed in a pyramidal space, as shown in Figure 6 and the outliers typically come from estimation errors and background noises. Owing to the scale uncertainty of monocular SLAM, the statistical filtering method is adopted to filter out noise. More specifically, in a single RoI block, the average distance between a point and its nearest 80 points is calculated; then the mean and standard deviation of the average distances are worked out to exclude the points that exceed 0.5 times the standard deviation. After statistical filtering, the residual outliers, however they are distributed, have very little effect on the centroid because of the significant difference in numbers between inliers and outliers.
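The statistical filtering and centroid computation can be sketched as follows. This is a brute-force illustration, not the authors' implementation: the function names are hypothetical, and the rejection rule (mean neighbour distance above the global mean plus 0.5 standard deviations) is one reading of the description above, following the convention of common statistical outlier removal filters.

```python
import math
import statistics


def statistical_filter(points, k=80, std_mul=0.5):
    """Remove points whose mean distance to their k nearest neighbours is unusually large."""
    if len(points) < 3:
        return list(points)
    k = min(k, len(points) - 1)
    mean_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(sum(dists[:k]) / k)
    limit = statistics.fmean(mean_dists) + std_mul * statistics.pstdev(mean_dists)
    return [p for p, m in zip(points, mean_dists) if m <= limit]


def centroid(points):
    """Arithmetic mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

The filter compares each point only with the other points of the same RoI, so it does not depend on the metric scale of the map, which suits the scale ambiguity of monocular SLAM mentioned above. The quadratic neighbour search is acceptable for a sketch; a k-d tree would be the usual choice for larger clouds.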

Object Association
To speed up object association, the potential associated global object is suggested by two-dimensional reprojection and is verified by three-dimensional distance statistics. Specifically, the global centroids of objects are reprojected into the reference frame image to identify the centroids within three grids of the RoI. The qualified centroids are projected into the reference frame point cloud to calculate its average distance to the points and compare with the centroids of the reference frame. When the former one is not greater than twice the latter, the global centroid is considered as the same object as the reference frame centroid.
Objects in the reference frame that cannot be matched by the object database are temporarily recorded as potential new objects. A new object can only be established if it can be observed in three consecutive batches. Such a mechanism can reject occasional fake observations.
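The three-dimensional verification step and the subsequent position fusion can be sketched as below. The function names are hypothetical, the two-dimensional reprojection pre-selection is omitted, and the weighted average is an assumed reading of the fusion weights described in the Semantic Positioning section (point-cloud size as the weight), not a confirmed reproduction of the authors' formula.

```python
import math


def mean_distance(center, cloud):
    """Average Euclidean distance from a point to every point of a cloud."""
    return sum(math.dist(center, p) for p in cloud) / len(cloud)


def is_same_object(global_centroid, frame_centroid, cloud, ratio=2.0):
    """Accept the association if the global centroid is not much farther
    from the reference frame cloud than the frame centroid itself."""
    return mean_distance(global_centroid, cloud) <= ratio * mean_distance(frame_centroid, cloud)


def fuse_centroid(global_centroid, w_global, frame_centroid, w_frame):
    """Weighted average of two centroids; returns the fused centroid and weight."""
    w = w_global + w_frame
    fused = tuple((w_global * g + w_frame * f) / w
                  for g, f in zip(global_centroid, frame_centroid))
    return fused, w
```

With point counts as weights, a batch disturbed by a transient reflection produces few points and therefore moves the global centroid only slightly.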

Object-Based Geographical UAV Positioning
The proposed strategy can be described as follows (Algorithm 1):

Algorithm 1
while observed objects ≥ 3 do
    Given OP_Sr and OP_E of the observed objects
    Perform the Umeyama algorithm [24], umeyama(OP_Sr, OP_E), to obtain the rotation matrix R, translation vector t and scale factor s
    Perform the sim3 transformation: UP_Er = s R × UP_Sr + t
    if continuous observation ≥ 12 then
        Find the five observations with the lowest object positioning deviation from the current fusion result and obtain their UP_Sri and UP_Eri
        R_gol, t_gol, s_gol = umeyama(UP_Sri, UP_Eri)
    end if
end while
while observed objects < 3 do
    if R_gol, t_gol and s_gol are empty then
        Wait for initialization
    else
        UP_Ec = s_gol R_gol × UP_Sc + t_gol
    end if
end while

The essence of the strategy can be simply summarised as:
• When the observation of objects is possible, obtain the transformation matrix and recalculate the UAV position by using the object associations between SLAM and the GPS.
• When objects are not observed, transform the UAV positioning in the SLAM coordinates into GPS coordinates using the existing transformation matrix.
It is noteworthy that the Umeyama algorithm [24] adopted in Algorithm 1 needs at least three matched object pairs to calculate the transformation variables. In practice, it is not difficult to find multiple semantic objects (such as insulator, damper and other potential large-size fittings) for the hover points.
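The control flow of the positioning strategy can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class `GeoPositioner` and the `solver` callable are hypothetical, the similarity solver itself (for example an Umeyama implementation) is passed in rather than implemented, and the refinement that selects the five best observations after twelve continuous observations is omitted, so the most recent transformation is simply reused when too few objects are visible.

```python
def apply_sim3(R, s, t, p):
    """Return s * R p + t for a 3x3 rotation R (nested lists) and 3-vectors t and p."""
    return tuple(s * sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


class GeoPositioner:
    """Convert UAV positions from SLAM coordinates to geographic coordinates."""

    def __init__(self, solver, min_objects=3):
        # solver is a hypothetical callable: (slam_points, geo_points) -> (R, s, t)
        self.solver = solver
        self.min_objects = min_objects
        self.transform = None  # last usable (R, s, t)

    def update(self, objects_slam, objects_geo, uav_slam):
        # Enough matched objects: re-estimate the transformation.
        if len(objects_slam) >= self.min_objects:
            self.transform = self.solver(objects_slam, objects_geo)
        # No transformation yet: still waiting for initialization.
        if self.transform is None:
            return None
        R, s, t = self.transform
        return apply_sim3(R, s, t, uav_slam)
```

Keeping the solver behind a callable separates the bookkeeping (when to re-estimate, when to fall back) from the numerical estimation, which makes it easy to swap in a library implementation of the similarity transform.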

Environment Setup
All experiments in this paper are carried out in the simulated transmission tower scene, as shown in Figure 7a. To verify the performance in the onboard system, PTI-SLAM is deployed in the prototype of the inspection UAV. As Figure 7b shows, the prototype is a refitted DJI Matrice 100 quad-rotor UAV equipped with a Zenmuse X3 pan-tilt camera, two advanced low-power-consumption embedded processors (DJI Manifold 1 and DJI Manifold 2-GPU) and an additional RTK module. While Manifold 1 takes charge of communicating with the sensors and the flight controller, Manifold 2-GPU, with more computing power, is primarily responsible for running the algorithms. These two separate subsystems are connected by a local area network and communicate via the robot operating system (ROS). The groundtruths, including the UAV trajectories and the locations of insulators, are obtained with the RTK module, with an absolute error of less than 2 centimeters.
The images captured by the Zenmuse X3 camera are resized to a resolution of 640 × 480 and then fed into PTI-SLAM. PTI-SLAM employs ORB-SLAM2 as the backbone of the SLAM part for monocular UAV positioning; ORB-SLAM2 extracts 1500 point features per image at eight scale levels with a scale factor of 1.2. As for the semantic part, the sliding window coefficient M is taken as 10. The initial values of d_ini and σ are empirically set to 3 and 0.1, respectively. Meanwhile, the threshold of σ for erasing the untrustworthy observations in the generation of the frame point cloud is 0.001.

Trajectory Consistency Evaluation
The trajectory consistency represents the ability to keep track of UAV position during continuous movement, which plays a significant role in UAV flight control. For example, the autopilot in the DJI onboard software development kit (SDK) is realized by controlling the deviation between the current position and the destination, all of which are represented in local coordinates taking the take-off point as the origin. In other words, the performance of autonomous flight is determined by the tracking accuracy of continuous motion relative to the origin, rather than the absolute positioning accuracy of the GPS. In this part, the trajectory consistency of PTI-SLAM is tested and compared with airborne GPS to explore the feasibility of PTI-SLAM for UAV inspection flight control. A comparison of existing methods is provided in Section 4.6. Concerning quantitative assessment, the absolute trajectory error (ATE), which stands for the global consistency of the trajectory, is adopted and the metrics, including root mean squared error (RMSE), mean error, max error, min error and standard deviation (S.D.), are presented in Table 2, where x, y and z items show the errors along the x-y-z axes of the east north up (ENU) coordinate system and 'Resultant' denotes the resultant errors. In practical terms, civilian GPS usually has an error of 2-15 m. As other airborne sensing data are fused by the SDK, our airborne GPS draws a relatively accurate trajectory during continuous motion and the positioning error often manifests as the shift of the origin point, as shown in Figure 8a. While the RMSE and mean error of GPS nearly reach 8 m, the S.D. of GPS is only 0.2, signifying that the error likely comes from a global consistent offset. For further comparison, the zero of GPS trajectory is aligned with the true value, that is, the GPS-alignment, which has an RMSE and mean error of only 0.37 and 0.35 m, respectively, and a max error of no more than 0.65 m. 
As illustrated in Figure 8b-d, the trajectory estimated by the proposed PTI-SLAM is more closely matched to the groundtruth than the GPS-alignment, both in the horizontal (x-y axes) and vertical (z axis) components. As seen in the last item of Table 2, the proposed PTI-SLAM has an overall advantage over airborne GPS, with errors less than half those of the GPS-alignment. This advantage of motion estimation demonstrates the high potential of PTI-SLAM for UAV inspection.
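To illustrate why a large RMSE combined with a small S.D. points to a constant offset, the following minimal sketch computes the ATE statistics on synthetic data. The trajectories, the 8 m offset and the helper names (`ate_stats`, `align_origin`) are assumptions made for this example and are not taken from the experiment; the alignment simply shifts the first pose onto the groundtruth, which is one plausible reading of the GPS-alignment described above.

```python
import math
from statistics import mean, pstdev


def ate_stats(estimated, groundtruth):
    """Resultant (Euclidean) error statistics of two time-aligned trajectories."""
    errors = [math.dist(e, g) for e, g in zip(estimated, groundtruth)]
    return {
        "rmse": math.sqrt(mean(err * err for err in errors)),
        "mean": mean(errors),
        "max": max(errors),
        "min": min(errors),
        "sd": pstdev(errors),
    }


def align_origin(estimated, groundtruth):
    """Shift the estimate so that its first pose coincides with the first groundtruth pose."""
    offset = [g - e for e, g in zip(estimated[0], groundtruth[0])]
    return [tuple(c + o for c, o in zip(pose, offset)) for pose in estimated]


# Synthetic data (hypothetical): a straight flight at 10 m height and a
# "GPS" track that follows the same path shifted by a constant 8 m along x.
groundtruth = [(float(i), 0.0, 10.0) for i in range(20)]
gps = [(x + 8.0, y, z) for x, y, z in groundtruth]

raw = ate_stats(gps, groundtruth)
aligned = ate_stats(align_origin(gps, groundtruth), groundtruth)
print(raw["rmse"], raw["sd"])   # large RMSE, zero spread
print(aligned["rmse"])          # the offset disappears after origin alignment
```

In this toy case the whole error is removed by the origin shift; with real data the residual after alignment reflects the drift of the trajectory shape itself.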

Evaluation of Object Positioning
SegNet is trained using an actual inspection dataset with 3741 images and tested with 418 images. After 56,000 iterations of training, it achieves a result of 92.3% mean pixel accuracy (MPA) and 88.6% mean intersection over union (MIoU). Figure 9 shows the visualization of segmentation results on actual inspection and a simulated scene. The positioning for an object is estimated on the basis of the positions and poses of a camera given by the SLAM part. The experiment results of this part therefore incorporate the factor of the SLAM module error. Table 3 gives the effect of point cloud filtering for a single batch. For intuitive comparison, the point cloud is converted to the global coordinate system. About a quarter of the points are culled after statistical filtering. As shown in Figure 6, the culled points are mainly outliers and the errors of the centroid are significantly reduced by 39.9%. The experiments of batch fusion are presented in Table 4. The data contains a total of 21 valid keyframes coming from a 14 second flight period of simulated checkpoint examination. The observation distance is within the range of 1-3.8 m. By setting the sliding window coefficient M to 10, there are 12 estimate batches in total. The semantic positioning module delivers a competitive object positioning effect with RMSEs of no more than 0.4 m. The errors and S.D. along the x axis are notably larger compared to the others. This is mainly due to the error distribution of SLAM positioning. As seen in Table 2, the SLAM error in the x axis is about one-third larger than in the y axis, which is similar to the error distribution of object positioning. Altogether, the object positionings for a complete flight around the tower are visualised in Figure 8b.
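The batch construction and the statistical filtering described above can be sketched as follows. This is a simplified illustration rather than the implementation used in PTI-SLAM: the stride-one window is an assumption (it happens to reproduce the 12 batches obtained from 21 keyframes with a window of 10), the filter is a generic mean-distance-to-neighbours outlier test, and the parameters `k` and `std_ratio` as well as the point data are hypothetical.

```python
import math
from statistics import mean, pstdev


def sliding_batches(keyframes, window):
    """All consecutive windows of `window` keyframes, advancing one keyframe at a time."""
    return [keyframes[i:i + window] for i in range(len(keyframes) - window + 1)]


def statistical_filter(points, k=4, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbours is unusually large."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(mean(dists[:k]))
    threshold = mean(mean_knn) + std_ratio * pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= threshold]


def centroid(points):
    """Component-wise mean of a list of 3D points."""
    return tuple(mean(axis) for axis in zip(*points))


# 21 keyframes and a window of 10 yield 12 estimate batches.
batches = sliding_batches(list(range(21)), 10)
print(len(batches))

# Synthetic cloud (hypothetical): a compact object plus two far background points.
cluster = [(x * 0.1, y * 0.1, z * 0.1) for x in range(3) for y in range(3) for z in range(3)]
outliers = [(5.0, 5.0, 5.0), (-4.0, 6.0, 3.0)]
kept = statistical_filter(cluster + outliers)
print(centroid(cluster + outliers), centroid(kept))
```

Because the object position is taken from the centroid of the cloud, even a few distant background points shift the estimate noticeably, which is why culling them reduces the centroid error.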

Evaluation of Object-Based Geographical Positioning
To satisfy the requirement of the Umeyama algorithm, two additional insulators are added to each side of the tower model. Note that the object association between SLAM coordinates and geographical coordinates is created manually. In fact, establishing this association mathematically is an urgent issue that needs to be addressed in follow-up work. In addition, incomplete object observations (i.e., those in which the RoI of the object is located on the image boundary) are rejected for object positioning. Figure 10 shows the results of object-based UAV repositioning. In Figure 10a, we denote the SLAM trajectory with several colors. The blue and green segments indicate the intervals with the observations of blue and green insulators, respectively. They are located by the dynamic transformation variables R_gol, t_gol and s_gol. The olive segment represents the interval without object observation and is estimated by the fixed transformation variables. The red line is the correction for the cumulative error of the olive segment.
Table 5 shows the tests on the sliding window coefficient N. In a normal scene, each of the two additional frames in the batch takes an extra 0.7 s to process and increases the number of points in the point cloud, leading to finer details in the object modelling but with little effect on object positioning. However, increasing N allows more candidate frames into the batch, enhancing the success rate of the direct method in disadvantageous scenarios. To test this idea, an extremely disadvantageous scenario is set up in which the target is examined against the sun. We deliberately move the UAV from side to side while facing the sunlight to check the target, so that the brightness of the image changes with the drastic change in lighting conditions. In such an adverse situation, the object cannot be located from only a few adjacent keyframes; as shown in Table 5, the object positioning is invalid when N = 4.
By adding more candidate frames, the direct method can find more pixel matches in the later frames and give a decent object positioning with a low fusion weight (i.e., the number of points in the point cloud). For some extreme cases, such as the first observation of a semantic object being in an extreme backlight condition, this strategy can provide a rough positioning at the beginning and then gradually correct it in subsequent observations. Otherwise, it takes another 5-10 s to establish a reliable object positioning. In order to ensure the adaptability of PTI-SLAM to these unfavorable conditions, N is set to the maximum value allowed by the computation time. For PTI-SLAM in the testing scene, the keyframe generation period is about 0.656 s, so we set N = 10.
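For readers unfamiliar with the Umeyama algorithm used in this section, the sketch below estimates a scale, rotation and translation from matched landmarks and then maps a SLAM position into the geographic frame. It follows the standard closed-form least-squares solution and is not the code used in the experiments; the landmark coordinates, the true transformation and the function name `umeyama` are hypothetical, and the variable names merely borrow the symbols s_gol, R_gol and t_gol from the text. The algorithm needs at least three non-collinear correspondences, which is consistent with the additional insulators placed on the tower model.

```python
import numpy as np


def umeyama(src, dst):
    """Least-squares s, R, t such that dst ≈ s * R @ src + t for matched 3D points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)     # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Hypothetical ground truth: 30 degree yaw, scale 2.5 and an arbitrary translation.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 2.5, np.array([100.0, -50.0, 20.0])

# Four semantic landmarks in SLAM coordinates and their known geographic positions.
landmarks_slam = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.2],
                           [0.0, 1.0, 0.4], [1.0, 1.0, 1.0]])
landmarks_enu = s_true * landmarks_slam @ R_true.T + t_true

s_gol, R_gol, t_gol = umeyama(landmarks_slam, landmarks_enu)

# Any SLAM position can now be expressed in geographic (ENU) coordinates.
uav_slam = np.array([0.5, 0.5, 0.5])
uav_enu = s_gol * R_gol @ uav_slam + t_gol
print(s_gol, uav_enu)
```

Since the scale is part of the estimate, the same step also resolves the scale ambiguity of the monocular SLAM map relative to the geographic frame.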

Influence Factor
In practice, there are three main factors that affect object positioning accuracy: direct method error, SLAM module error and background noise. Direct method errors are mainly caused by a mismatch of image block pairs. According to Figure 5, the mismatch of block pairs will only lead to an incorrect depth estimate on spatial points and the error is then assigned to each axis of the coordinate according to the view angle. SLAM module errors, including camera position and orientation, not only cause direct method errors (since they can affect the determination of the epipolar lines), but also lead to errors in coordinate transformation. Background noise is driven by segmentation accuracy and is intensified by the grid strategy. It usually appears at the rear of the target because the UAV typically observes the objects from the outside, resulting in a backward deviation of object positioning, as seen in Figure 6.
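The matching of image block pairs mentioned above is scored with ZNCC (see the nomenclature). As a hedged illustration of why this score tolerates brightness changes but can still mismatch, the sketch below implements the standard zero-mean normalized cross-correlation; the exact formulation used in PTI-SLAM is not reproduced here, and the pixel blocks and the function name `zncc` are hypothetical.

```python
import math


def zncc(block_a, block_b):
    """Zero-mean normalized cross-correlation of two equally sized pixel blocks."""
    a = [v for row in block_a for v in row]
    b = [v for row in block_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0   # flat blocks carry no texture to match


# Hypothetical 3x3 reference block and two candidates along the epipolar line.
reference = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[1.5 * v + 20 for v in row] for row in reference]   # gain and bias change
inverted = [[255 - v for v in row] for row in reference]        # contrast reversal

print(zncc(reference, brighter))   # close to 1: same pattern under new lighting
print(zncc(reference, inverted))   # close to -1: opposite pattern
```

The score is unchanged by a uniform gain and bias, but strong local reflections alter the pattern itself, and the block with the highest score along the epipolar line may then be the wrong one, which yields an incorrect depth.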
In order to investigate the factors influencing object positioning, an assessment was conducted with movement around the east side of the tower. To stably control the motion variables, the data in this experiment are acquired with a hand-held UAV. Figure 11 shows five representative object positionings, which are estimated from different angles and distances. The results are denoted in different colors according to their error in depth estimation. It is noteworthy that the observation results are the raw data without batch fusion. The marker × represents the locations of the keyframes within the estimate batch; the marker at the end point of the line connecting the camera and the object is the reference frame.
Table 6 shows the quantitative results. The object positioning is evaluated in two metrics: depth error and absolute positioning error (APE). The depth error is the deviation in the distance between the object and the camera, which is mainly affected by the matching error and excludes the effect of coordinate transformation. The APE is the position deviation between the estimated object location and the truth, representing the comprehensive error level. We adopt the angle between l_est and l_gt to approximately express the orientation error of SLAM, where l_est is the line connecting the estimated object and the estimated camera and l_gt is the line connecting the groundtruths of the object and the camera. To eliminate the influence of the difference in segmentation results, the objects are manually segmented in this experiment. The difference between manual segmentation and automatic segmentation is shown in Figure 12. Three additional factors, which most likely contribute to the observation error, are examined: the distance from the camera to the object, the object location in the image and the movement range within the batch. The object location in the image is denoted by the deviation between the object pixel centroid and the image center.
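The three quantities defined above can be written compactly as follows. This is a minimal sketch under stated assumptions: the depth error is taken as a signed difference (the sign convention is not specified in the text), the function names are hypothetical, and the positions are synthetic values chosen to isolate the effect of a pure direction error.

```python
import math


def depth_error(cam_est, obj_est, cam_gt, obj_gt):
    """Estimated minus true camera-to-object distance (signed, by assumption)."""
    return math.dist(cam_est, obj_est) - math.dist(cam_gt, obj_gt)


def absolute_positioning_error(obj_est, obj_gt):
    """Distance between the estimated and the true object location (APE)."""
    return math.dist(obj_est, obj_gt)


def line_angle_deg(cam_est, obj_est, cam_gt, obj_gt):
    """Angle between l_est (estimated camera to object) and l_gt (groundtruth)."""
    l_est = [o - c for c, o in zip(cam_est, obj_est)]
    l_gt = [o - c for c, o in zip(cam_gt, obj_gt)]
    dot = sum(a * b for a, b in zip(l_est, l_gt))
    cos_angle = dot / (math.hypot(*l_est) * math.hypot(*l_gt))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


# Synthetic case: correct camera position and depth, but a 5 degree direction error at 3 m.
cam = (0.0, 0.0, 0.0)
obj_gt = (3.0, 0.0, 0.0)
yaw = math.radians(5.0)
obj_est = (3.0 * math.cos(yaw), 3.0 * math.sin(yaw), 0.0)

print(depth_error(cam, obj_est, cam, obj_gt))      # approximately 0 m
print(line_angle_deg(cam, obj_est, cam, obj_gt))   # approximately 5 degrees
print(absolute_positioning_error(obj_est, obj_gt)) # approximately 0.26 m
```

In this synthetic case the depth is estimated perfectly, yet the object is displaced by about 0.26 m, which shows how an orientation error alone can account for a decimeter-level APE at typical observation distances.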
We numbered the observations according to their error in depth estimation. As seen in Figure 11 and Table 6, the depth estimate error is not directly related to the viewing angle, the SLAM error or the movement range within the batch, but has a strongly positive relationship with the object distance and the object location in the image. This is consistent with prior knowledge:
• The accuracy of visual measurement decreases with increasing distance and the measurement for an object at the edge of the field of view is usually less accurate than in the central area. The latter is partly related to lens distortion, even if distortion correction has been performed.
• The fusion-based direct method uses the relative position and orientation between frames, rather than the absolute positioning of each frame, to restore the depth. In a short-term gradual motion, SLAM typically exhibits good movement tracking capabilities and the estimation errors of relative orientation and position between frames tend to converge to a consistent range. This error is less related to the absolute error of the current frame, unless the batch happens to be in an extremely unfavourable situation. Therefore, in normal conditions, there is no direct correlation between the positioning error of the camera and the depth estimation error.
The absolute location of the object is essentially the projection of the depth estimation onto the global map. Theoretically, the APE is positively correlated with the depth error and the SLAM error. According to Table 6, the orientation error of SLAM has a much greater impact on the APE than either the location error of SLAM or the depth error, and can even dominate it. When the orientation errors are similar, such as in obs-1 and obs-5, the depth error has a greater impact on the object positioning than the positioning error of SLAM.
Compared with Table 4, the object positioning errors in Table 6 are significantly lower, indicating that the background noise plays an essential role in the object positioning error. This means that the proposed method is sensitive to the background noise and highly dependent on noise filtering.
In terms of overall performance, the positioning of the object converges within a small range, which ensures the reliability of the batch fusion. As shown in Figure 11, the depth estimate is slightly greater than the truth. This is mainly caused by the background noise. The gridding strategy improves the recall of 2D segmentation while also increasing the background noise of observations.

Robustness of Algorithms
To verify its advancement, the proposed PTI-SLAM is compared with other representative methods that are widely used as baselines in many studies. The proposed PTI-SLAM shares a consistent configuration with ORB-SLAM2 to guarantee a fair comparison. Direct ORB-SLAM2, which is a restructured ORB-SLAM2 accelerated by the direct method of SVO [25], extracts twice as many features as PTI-SLAM to improve its robustness in the testing scene. We also compare LSD-SLAM, which is a milestone in direct method-based SLAM and is adopted as the SLAM part in a monocular semantic SLAM [11].
All the methods are compared in three different operation scenes: smooth flight in steady illumination (termed the normal scene), flight with occasional rapid rotation in steady illumination and smooth flight in variable illumination. Two datasets are collected for each scene. The methods are tested five times separately on each dataset and the tracking success rate and average RMSE of ATE are presented in Table 7. As to the tracking success rate, there are two situations that are judged as tracking failure. The first is that the system state is lost and the relocalization fails. The second is that the RMSE is too large (more than 0.8 m in this paper), in which case the algorithm is considered to have failed in practice. Direct methods, including direct ORB-SLAM2 and LSD-SLAM, have a significantly lower success rate than feature-based SLAM. By adopting a more advanced depth filter, direct ORB-SLAM2 is notably more accurate than LSD-SLAM; however, the success rate is not improved. In contrast, the feature-based SLAM demonstrates a high robustness against disadvantageous scenes, especially illumination variation. As a result of using the feature-based method in the SLAM part, the success rate and accuracy of PTI-SLAM are similar to those of ORB-SLAM2, indicating that PTI-SLAM inherits the robustness advantage of the feature-based approach. It is noteworthy that PTI-SLAM is slightly less accurate than ORB-SLAM2. This is mainly due to the fact that semantic tasks take up computational resources and reduce the number of executions of backend optimization. Although PTI-SLAM adopts the direct method in semantic modules, its robustness is not compromised. Figure 13 shows a comparison of the traditional direct method, which is represented by LSD-SLAM, and the fusion-based direct method. The red dots in Figure 13b,d represent the pixels with a filtered point cloud. In the normal scene, the fusion-based direct method creates a dense mapping similar to that of LSD-SLAM.
In the scene of intense lighting, there is a strong reflection on the surface of the tower and the insulator, resulting in a failure of LSD-SLAM. In contrast, the fusion-based direct method still works as expected.
The time consumption of the methods is compared in Table 8. DS-SLAM [20] is taken as the baseline because it shares the same algorithms with PTI-SLAM for 2D semantic extraction and SLAM solving. The tracking thread is the main source of the real-time requirement, as it must respond in a timely manner to the image input and provide UAV positioning. The direct methods have significant advantages over feature-based SLAM in time consumption. Even so, the processing speed of ORB-SLAM2, which exceeds 12 frames per second, also satisfies the real-time requirements [7]. As for semantic SLAM, DS-SLAM serially performs a semantic task for every frame in the tracking thread, resulting in low tracking efficiency. A tracking speed of 2 frames per second makes practical applications on airborne platforms impossible. By stripping out the semantic-related tasks, the tracking thread of PTI-SLAM is executed at a speed exceeding 12 frames per second, which is consistent with the original ORB-SLAM2. Compared with Cube-SLAM, which is an advanced monocular object SLAM and is also improved from ORB-SLAM2, PTI-SLAM shows a significant improvement in tracking speed, demonstrating its advantage in real-time performance. As for the time consumption of semantic tasks, there are two parallel threads in PTI-SLAM responsible for the semantic task. In the segment thread, a keyframe can be handled in 0.29 s, while the average period of keyframe generation is 0.645 s; hence, the requirements of keyframe segmentation can be easily satisfied. The time consumption of the semantic positioning thread is related to the RoI size of the reference frame. When taking the whole image as the RoI, the processing time can be up to 2.31 s.
Table 9 presents the average time consumptions of semantic positioning threads in the presence and absence of RoI, which are 0.7 s and approximately zero, respectively. Owing to the parallelism of threads, the incoming keyframe can be segmented during the object positioning, so the overall elapsed time for the semantic positioning is still 0.7 s. As shown in Figure 2, there are a large number of non-RoI flight areas during the whole inspection, in which the time consumption of object positioning is approximately zero. Therefore, all the keyframes can be processed even though the time consumption of semantic positioning is slightly larger than the keyframe generation period. In our geographical UAV positioning strategy, the processing speed of semantic positioning only affects the refresh rate of geographical positioning correction, for which a delay of seconds is not fatal.

Geographical Positioning Comparison
Table 9 shows the comparisons on geographical UAV positioning of PTI-SLAM and the existing visual method for tower inspection. The artificial operating performance reported by the China Southern Power Grid (CSG) [28] is adopted as the baseline. Generally, the hovering error of artificial operation is about 1.5 m, indicating that the inspection target can be observed within this error range. This provides a good reference baseline to verify the feasibility of our approach.
The UGA-UAV cooperative architecture [8] mentioned in the introduction reports max positioning errors of 1.75 and 2.07 m at different resolutions of the input image without using RTK in the UAV. By taking tower components as the positioning identifier, our approach reduces the measurement distance to a small range and presents a significant accuracy gain.

Discussion
Power tower inspection is a serial task oriented by scene structure. Since some scene structures are difficult to locate directly, treating visual navigation as structure-based geolocating combined with geographical waypoint flight is a feasible scheme at present. By using semantic SLAM, geolocating can be converted into estimating the transformation between the SLAM and geographic coordinate systems based on semantic landmarks. This scheme can reduce the real-time requirements for semantic tasks. Once the system has completed the initialization of geolocation, the delay of the semantic task will only affect the refresh rate of geolocation correction. Taking advantage of this, the proposed PTI-SLAM removes semantic tasks from the real-time constraints by loosely coupling semantic tasks and SLAM tasks. In the time consumption comparison, we verify that PTI-SLAM can perform an additional semantic task without slowing down the tracking speed of the SLAM part. This scheme can be seamlessly plugged into current inspection work and can be combined with the study of tower-structure-based automatic waypoint planning [29] to further explore the possibility of autonomous UAV tower inspection. In addition, the idea can be easily extended to other similar applications.
The challenge of the scene mainly comes from the complex and changeable lighting conditions. The proposed fusion-based direct method utilizes a sliding window strategy to address this challenge and realize semantic semi-dense mapping. The sliding window strategy is essentially a fault tolerance mechanism. For lighting changes that persist over a stretch of the flight, it strengthens the validity of the brightness constancy assumption by limiting the range of pixel tracking to a small region on both a spatial and a temporal scale. For transient changes affecting only a few frames, it increases the fault tolerance to local variation by adding more alternative frames for the pixel tracking. In the SLAM part, the robustness experiment preliminarily shows that the feature-based SLAM approach can effectively deal with this challenge; however, it still has a risk of failure during fast rotation. According to previous studies [30], fusing the camera and the IMU can alleviate this problem and also facilitate the restoration of the monocular scale. Since there is a transition between the gimbal camera and the body, this is a challenging information fusion problem of three systems with different sampling frequencies: the camera, the IMU and the gimbal attitude sensing.
As for the performance of semantic object positioning, there are two additional influence factors besides the accuracy of segmentation and the SLAM part that should be noted: the distance between the object and the camera and the offset between the object and the image center. The former suggests that the distance factor can be taken into account in the fusion of object observation, while the latter indicates that some optimization aid, such as auxiliary aiming and shooting, can be considered. Generally, we regard our approach as a rough global positioning solution of guiding UAV to approach the inspection target and a finer adjustment for data acquisition should be achieved by an auxiliary shooting subsystem.

Conclusions
This paper reports an investigation on using the structure of a power tower to realize visual positioning of a UAV for tower inspection and presents a monocular semantic positioning framework to cope with the challenges of the inspection scene. To offer advice on semantic selection, the potential of the tower component as the semantic object for inspection flight is examined in terms of scene structure describing ability and visual detection adaptability. The insulator and damper are strong semantic candidates, while some of the large-size fittings and ancillary facilities, such as the clamp and tower plate, are of great promise but need more evidence to prove their visual detection adaptability. The proposed PTI-SLAM adopts a hybrid architecture combining the feature-based SLAM method with the direct method to balance positioning robustness and mapping density. A fusion-based direct method is presented to improve the robustness of the direct method against the adverse conditions in the inspection scene by limiting the pixel tracking and taking semantic observation fusion as a substitute for global tracking. The trajectory consistency evaluation shows that PTI-SLAM offers a better motion estimate compared to the airborne GPS, verifying its feasibility for UAV control. Comparisons of methods demonstrate that the hybrid architecture inherits the advantages of both the feature-based method and the direct method. The loose coupling between the SLAM task and the semantic task guarantees real-time performance and preserves the robustness against a complex scene with the assistance of the fusion-based direct method. To explore the application of PTI-SLAM in tower inspection, an object-based geographical UAV positioning strategy is devised. The preliminary experiment indicates more competitive accuracy compared to the artificial UAV operation and the previous visual approach, demonstrating the potential of PTI-SLAM for inspection.
Compared with the existing UAV positioning approaches, PTI-SLAM breaks the dependence on GPS and provides additional environment information, opening up the prospect of autonomous UAV inspection. In future work, establishing additional constraints of semantic mapping may contribute to richer semantic modelling, thereby improving semantic map accuracy and providing more possibilities for semantic applications. In general, this work presents the first part of a more intelligent transmission tower inspection system. There are certain other aspects that must be investigated to practically implement autonomous inspection, such as object association between SLAM and the actual world, dynamic path planning based on object observation and obstacle avoidance.

Acknowledgments: The authors thank the local power supply company and its UAV working group members for their advice and practical experience sharing.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
The updated coordinate of object J in SLAM coordinate system
w_J: The existing fusion weight of object J
w_M,j: The fusion weight of the j-th object of reference frame
w_fuse,J: The updated fusion weight of object J
I_r: The image of the reference frame
I_c: The image of the current frame
O_r: The camera optical centre of the reference frame
O_c: The camera optical centre of the current frame
P: The spatial point
P_r: The projection point of P in I_r
P_c: The projection point of P in I_c
l_c: The epipolar line of P in I_c
S(A, B)_ZNCC: The matching score obtained by ZNCC
A, B: The pixel blocks in I_r and I_c, respectively
A(i, j), B(i, j): The values of the pixels in A and B
Ā, B̄: The mean values of A and B
p_c^max(X, Y, Z): The hypothetical max depth of the spatial point P to I_c
p_c^min(X, Y, Z): The hypothetical min depth of the spatial point P to I_c
d_ini: The initial value of depth estimation
The true depth of P in I_r
d_p: The false depth of P in I_r
UP_Sr: UAV position and pose of reference frame in SLAM coordinate system
UP_Sc: UAV position and pose of current frame in SLAM coordinate system
OP_Sr: Object position of reference frame in SLAM coordinate system
OP_E: Object position of reference frame in ENU coordinate system
UP_Ec: UAV position and pose of current frame in ENU coordinate system
R: The rotation matrix
t: The translation vector
s: The scale factor