3D Reconstruction of a Complex Grid Structure Combining UAS Images and Deep Learning

Abstract: The latest advances in the technical characteristics of unmanned aerial systems (UAS) and their onboard sensors have opened the way for smart flying vehicles that exploit new application areas and perform missions that previously seemed impossible. One of these complicated tasks is the 3D reconstruction and monitoring of large, complex, grid-like structures such as radio and television towers. Although an image-based 3D survey contains a lot of visual and geometrical information useful for drawing preliminary conclusions on construction health, standard photogrammetric processing fails to perform a dense and robust 3D reconstruction of complex large-size mesh structures. The main problem with such objects is their repeated and self-occluding similar elements, which result in false feature matching. This paper presents a method developed for an accurate Multi-View Stereo (MVS) dense 3D reconstruction of the Shukhov Radio Tower in Moscow (Russia) based on a UAS photogrammetric survey. A key element for the successful image-based 3D reconstruction is the developed WireNetV2 neural network model for robust automatic semantic segmentation of wire structures. The proposed neural network provides high matching quality due to an accurate masking of the tower elements. The main contributions of the paper are: (1) a deep learning WireNetV2 convolutional neural network model that outperforms state-of-the-art semantic segmentation results on a dataset containing images of grid structures of complicated topology with repeated elements, holes and self-occlusions, thus providing robust grid structure masking and, as a result, accurate 3D reconstruction; (2) an advanced image-based pipeline aided by a neural network for the accurate 3D reconstruction of large and complex grid structures, evaluated on UAS imagery of the Shukhov Radio Tower in Moscow.


Introduction
Periodical monitoring of the technical state of industrial buildings and constructions is of great importance for their safety and proper operation. The importance of this issue grows notably if the object to be monitored is aged and of cultural heritage significance. New sensors and technologies, such as photogrammetric multi-view stereo or laser scanning, can now provide accurate and dense 3D geometric information about complex objects. However, some complicated man-made structures, such as tall mesh-like objects, electricity towers, metallic bridges with arches, and so forth, still pose challenges for comprehensive studies and 3D reconstruction.
Nowadays, unmanned aerial systems (UAS) [1][2][3] are used in a wide variety of applications [1,4,5] due to their ability to fly in an extensive range of heights and velocities, to reach hardly accessible areas and to carry various sensors as payload. These advanced capabilities allow them to be used in very complicated missions such as rescue operations, cargo delivery to dangerous or inaccessible areas, monitoring of hardly accessible objects, and so forth. The technical abilities of UASs and their onboard sensors are sufficient to reach, inspect and survey complex objects such as television towers and bridges for surveying and image acquisition purposes.
The paper addresses the problem of an image-based 3D reconstruction of the Shukhov Radio Tower in Moscow (Russia), also known as Shabolovka. This tower was built in 1920-1922 by the Russian engineer and architect Vladimir Shukhov, who proposed a novel type of construction: doubly curved structural forms used both for light-weight towers (Figure 1a,b) and roofs (Figure 1c). The Shukhov Radio Tower (Figure 1b) is one of these constructions, and it is now part of the World Cultural Heritage. Unfortunately, the Shukhov Tower has had no extensive technical inspection for a long time, so its technical state needs to be observed and documented. With this aim, two surveys were carried out during the period 2012-2015. The first employed laser scanning with subsequent 3D modeling of the wired structure [6], whereas the second employed UAS-based images for photogrammetric processing and 3D reconstruction. The UAS-based survey resulted in a set of images acquired during the UAS ascending/descending trajectory (vertical image 'stripes', Figure 2).
Despite the impressive progress in image processing techniques for automatic 3D reconstruction based on Structure-from-Motion and Multi-View Stereo frameworks [7], complex wired structures such as the Shukhov Tower could not be automatically reconstructed. Indeed, the complicated grid elements of the tower and its tubular structure induce many failures in image-based methods and require a lot of manual processing, even for laser scanning [6].
The paper presents a methodology developed for the automated image-based 3D reconstruction of complex grid structures aided by deep learning. The main contributions of the paper are: (1) a deep learning WireNetV2 convolutional neural network model that outperforms state-of-the-art semantic segmentation results on a dataset containing images of grid structures of complicated topology with repeated elements, holes and self-occlusions, thus providing robust grid structure masking and, as a result, accurate 3D reconstruction; (2) the first accurate image-based textured 3D reconstruction of the Shukhov Radio Tower by multi-view stereo processing of UAS imagery, which standard photogrammetric methods fail to process. We made our dataset and reconstruction results publicly available (http://www.zefirus.org/ShukhovTower).

UAS Based Photogrammetric Imaging
The spectrum of UAS types and application fields is wide and expanding rapidly, including agriculture, industrial monitoring of large-size objects, cultural heritage, forestry, environmental and ecological monitoring and mapping, fast updating of local geospatial information, and so forth [8][9][10][11][12]. Due to their high performance and easy control, UASs are successfully exploited for photogrammetric surveying.
In archaeology and cultural heritage, UASs support aerial surveying for planning and monitoring excavation sites [13,14] and for producing new types of archaeological documents such as textured 3D models and orthoimages [15].
The application of UASs in agriculture and forestry grows rapidly due to the possibility of obtaining high quality, up-to-date data that allows planning agricultural activity, supporting precision farming [16], estimating plant condition [17], and so forth.
Environmental monitoring involves the use of UASs for different purposes such as disaster impact analysis [5,18], wildlife monitoring and conservation [19], plastic pollution detection and classification [20], collecting real-time information from a specific location and uploading this data onto a web server for on-line viewing [21], and so forth.
Due to the flexibility and variety of UAS-based imaging platforms, they are also used to survey and monitor roads for estimating traffic conditions and road pavement state [22][23][24]. UAS is also an attractive platform for acquiring imagery for 3D reconstruction and technical inspection of large-size industrial objects [25][26][27]. The UAS' ability to reach almost any place, acquire multi-modal data and deliver high quality information makes these flying machines also useful in many situations and tasks where up-to-date geo-spatial information is required for rapid reaction to changing circumstances [28].

Grid Structures 3D Reconstruction
Image-based 3D reconstruction of grid structures poses significant challenges due to the complicated topology of the objects, repetitive features that are hard to distinguish in matching operations, self-occlusions, and so forth. Due to the complexity of the problem, the list of works related to grid structure 3D reconstruction is not so long. Most of the approaches presented in recent publications try to extract the topology of a mesh object using some assumption about the object structure, like wire smoothness [29], linearity of elements [30,31] or their tubular shape [32].
Huang et al. [33] introduced an L1-medial skeleton as a curve skeleton representation for 3D point cloud data. The developed algorithm extracts curve skeletons from unorganized, un-oriented, and incomplete 3D raw point clouds, thus providing a topology representation (but not a 3D reconstruction) of the object. Similar to Reference [33], Morioka et al. [34] retrieved the topology of a 3D point cloud as a 3D graph, with further representation of a wire-structure object as a combination of cylindrical elements centered along the edges of the graph. The method uses Delaunay tetrahedralization to create the initial edges and simplifies them by applying iterative edge contractions to extract the graph representing the wire topology. Furthermore, an optimization technique is applied to the positions of the cylindrical surfaces in order to improve the geometrical accuracy of the final reconstructed surface. Thus the method allows reconstructing the 3D structure of an object without reconstructing the shape of the wire elements. Su et al., 2018 [35] used an optical-based method to produce digital 3D data of spider web architecture and perform topology analysis. The focus of the study was developing an innovative experimental method to directly capture the complete digital 3D spider web architecture with micron-scale resolution. The authors built an automatic segmentation and scanning platform to obtain high-resolution 2D images of individual cross-sections of the web that were illuminated by a sheet laser. Dedicated image processing algorithms were used to reconstruct the digital 3D fibrous network by analyzing the 2D images. This digital network provides a model that contains all of the structural and topological features of the porous regions of a 3D web with high fidelity and, when combined with a mechanical model of silk materials, allows directly simulating and predicting the mechanical response of a realistic 3D web under mechanical loads.
The available publications on grid structure 3D reconstruction do not propose methods for dense, accurate 3D reconstruction of grid objects from raw UAS imagery. The main problem to solve for multi-view stereo 3D reconstruction of such objects is false feature matching in the images, caused by repeated construction elements appearing both in the foreground and in the background of a scene. Masking images to eliminate areas that are disturbing or not significant for 3D reconstruction seems to be a promising approach for complex cases. Reference [36] used semantic image segmentation and binary masking to eliminate the effect of moving objects in images. The method of [37] utilises the relative camera orientation of a pair of images to find a reliable object segmentation for further accurate 3D reconstruction. But these and some other related works [38][39][40] address the problem of continuous (not grid-structured) objects. Recent impressive progress in deep learning methods makes them a powerful means for solving various complicated tasks with high quality.

Deep Convolutional Neural Networks
In recent years, deep convolutional neural networks (CNNs) started to be employed within the image-based 3D pipeline in order to boost the processing and facilitate some steps. According to their role within the 3D reconstruction pipeline, neural networks can be divided into three broad groups: 1. CNNs for single-photo 3D reconstruction: multiple neural network models were proposed for the reconstruction of objects and buildings from a single image using conditional generative adversarial networks (GAN) [41][42][43][44][45][46][47]. While deep models such as Pix2Vox [44] and Z-GAN [47] proved able to reconstruct complex structures from a single photo, a large training dataset is required to achieve the desired quality. However, no public datasets of wire structures are available to date to train such models. 2. CNNs for feature matching: the presented approaches [48][49][50][51][52] seem to outperform handcrafted feature detector/descriptor methods. Still, their performance is closely related to the similarity of local image patches in the training dataset with respect to the images used during inference. However, the repeating metal beams of wire structures are not present in modern datasets.

3. CNNs for semantic image segmentation and boosting of SfM/MVS procedures:
CNN methods [53][54][55][56][57][58][59] have also demonstrated their potential for detecting numerous elements in images and then boosting the processing pipeline in terms of constrained tie point extraction or semantic multi-view stereo [60][61][62]. The advantages of image masking for dense point cloud generation are well known in the literature [62][63][64]. While there are multiple readily available segmentation models for oblique aerial photos [63] or buildings [64,65], the generation of pixel-level semantic segmentation for sparse wire objects is challenging. The analysis of repetitive patterns [66,67] allows partly solving this problem for opaque objects (e.g., skyscrapers). Still, for objects with holes, such methods do not provide robust results. Generative Adversarial Networks (GANs) [68,69] have demonstrated a significant improvement for models that generate high fidelity output such as color images and semantic segmentation. Luc et al. [70] proposed an adversarial framework for learning robust semantic segmentation models capable of reconstructing fine details in the input imagery. Luc proposed to use masked images as input for the discriminator. The discriminator observes color images masked with real masks and with masks predicted by the framework, and learns to distinguish 'real' images from 'fake' ones. This provides a meaningful adversarial loss that improves the quality of segmentation in terms of small objects and object boundaries. So, considering image segmentation and masking as a key point for 3D reconstruction of repetitive and self-occlusive structures from images, several deep network models are relevant: MobileNetV2 [54], a fast network leveraging inverted residuals and linear bottlenecks; UPerNet [71], a multi-task network that uses internal feature map fusion to increase the labelling accuracy; and HRNetV2 [72], which utilizes a high-resolution representation and multiple streams of different spatial sizes to perform high-fidelity image segmentation. These CNN models serve as baselines and a starting point for developing our deep learning technique for accurate and robust image segmentation for further multi-view stereo 3D reconstruction of the Shukhov Radio Tower.

Shukhov Radio Tower as a Photogrammetric Challenge
Shukhov Radio Tower, also called the Shabolovka tower, was built in Moscow (Russia) in 1920-1922. The author of the tower design is Vladimir Shukhov, a genius Russian engineer and architect. He invented a new type of grid construction based on a hyperboloid structure. Such an approach allows significantly reducing the weight of the construction while keeping its high rigidity.
Shukhov built the first diagrid tower for the All-Russian Exhibition in Nizhny Novgorod (Russia) in 1896 (Figure 1a). Later, Shukhov designed the Shabolovka tower, which was built in Moscow under his direction in 1920-1922. The Shukhov radio tower in Moscow is a landmark in the history of structural engineering and an emblem of the creative genius of an entire generation of modernist architects in the years that followed the Russian Revolution. The tower is interesting for its original architectural construction method and is now a cultural heritage monument under preservation.
Due to historical circumstances, the original drawings of the Shukhov towers have been lost. During the almost a century since the tower started operating, only two inspections of the tower condition were performed (1947 and 1971). So gathering information about the current state of the Shukhov tower is very important for the safety and preservation of this historical monument.
While photogrammetric techniques, such as SfM and MVS, have demonstrated impressive performance in automatic image processing and accurate high-quality 3D model generation of continuous surfaces, they meet significant problems with complicated objects like grid structures. As such, the Shukhov tower poses several challenges for image-based 3D reconstruction: 1. The tower's size (137 m height) and shape require specific means for acquiring the necessary images while keeping appropriate scale and ray intersection angles. UASs give a solution to this challenge, allowing images of such a huge and hard-to-reach object to be acquired according to the specific requirements. 2. The 3D surveying's design and preparation must consider that the historical monument is now an operating radio translation tower: radio transmitters located on the tower disturb UAS control and operations. 3. An effective image processing should minimize manual operation and also be able to handle holes, wire structures, repeated elements and shiny surfaces. This challenge can be answered with a deep learning technique for detecting tower elements in images (Section 4).

UAS-Based Survey
The aim of the photogrammetric UAS-based survey was the 3D reconstruction of the tower geometry along with visual data acquisition about the current state of the tower's steel elements. The photogrammetric survey aimed at collecting and producing documentation and restoration data about the tower. The survey was performed using an AscTec Falcon 8 UAS equipped with a SONY NEX-5 camera (Figure 3). The main technical characteristics of the UAS and the camera are presented in Tables 1 and 2. A preliminary geodetic survey was carried out to obtain a set of ground control points (GCP). GCPs are necessary to assess the quality of the 3D reconstruction and for geo-referencing the resulting photogrammetric 3D results. A geodetic group, using a Geomax Zoom 25pro total station, measured 10 GCPs located at two levels of the tower (Figure 4): at foundation level and at the 3rd section level (about 50 meters altitude). Special targets located on the tower parts helped to identify the control points both while measuring them and in the acquired imagery (Figure 5). For the UAS survey, an '8-ray star configuration' (Figure 4) was applied that allows all-around imaging with the image overlap required for photogrammetric processing. This configuration provided a scale of approximately 1:1700 with an average ground sample distance (GSD) of 8.4 mm. Eight vertical stripes were flown and images were acquired during the UAS ascending/descending. This resulted in about 600 images. Sample images from one of the 'stripes' are shown in Figure 2.
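The reported scale (about 1:1700) and GSD (about 8.4 mm) follow from the standard pinhole relations. A minimal sketch, where the pixel pitch, focal length and stand-off distance are illustrative assumptions rather than the exact calibrated survey parameters:

```python
def ground_sample_distance_mm(pixel_pitch_mm, focal_mm, distance_m):
    """GSD = pixel pitch x object distance / focal length (result in mm)."""
    return pixel_pitch_mm * (distance_m * 1000.0) / focal_mm

def image_scale_denominator(focal_mm, distance_m):
    """Image scale is 1 : (object distance / focal length)."""
    return (distance_m * 1000.0) / focal_mm

# Illustrative values: ~5 um pixel pitch, 16 mm lens, ~27 m stand-off
gsd = ground_sample_distance_mm(0.005, 16.0, 27.0)   # ~8.4 mm
scale = image_scale_denominator(16.0, 27.0)          # ~1:1690
```

With these assumed values the formulas reproduce figures close to those reported for the survey.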

Standard Imagery Processing
The first attempt to perform 3D reconstruction using the acquired UAS imagery was carried out applying a standard photogrammetric pipeline provided by the Agisoft Photoscan software (https://www.agisoft.com). The results of the image triangulation process (SfM) and dense point cloud generation (MVS) are shown in Figure 6. The 3D point cloud has a lot of outliers and many images are not correctly oriented. Even after a manual selection of images with reliable orientation and good intersection angles, the multi-view stereo 3D reconstruction (Figure 6c) failed due to the many repeated structures and holes in the tower. The complex structure of the tower poses a challenge for corresponding point detection and dense point cloud generation, resulting in a great number of false correspondences and, as a consequence, in problems with image orientation and 3D coordinate estimation. In the absence of a robust algorithm for corresponding point matching, occlusion detection and repeated pattern handling, the only way to overcome the problem would be manual image masking, which is very time consuming and error-prone.
With the recent advances in deep learning techniques, it became clear that a learning-based approach to image segmentation and background detection could allow developing a convolutional neural network model for robust tower structure detection and masking in the acquired images.

Deep Learning Approach
Local patch similarity is one of the main problems in photogrammetric 3D reconstruction procedures. False matches result in poor quality of camera exterior orientation estimation and a large number of outliers in the dense point clouds. In the case study under investigation, the main reason for false feature point matching is the repeating structures and similar elements. Moreover, feature point matching algorithms confuse points located on the foremost sections of the tower with points located on the rear but visible through the holes of the wire tower.
Masking irrelevant object parts to improve stereo matching accuracy is a well-known technique for improving the quality of 3D reconstructions. Still, the total number of photos in the UAS survey exceeded 600, so manual labelling of all acquired data was impossible. The presented approach was inspired by recent research [62] which used semantic segmentation in images to improve the accuracy of multi-view stereo processing. A deep learning based technique is proposed to automatically generate image masks in the case of complex wired structures (like the Shukhov tower).
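The role of the generated masks in the pipeline can be sketched with a toy NumPy example: background pixels are zeroed out so that subsequent feature detection and dense matching only consider the tower structure. The array shapes and the `apply_mask` helper are illustrative, not part of the actual implementation:

```python
import numpy as np

def apply_mask(image, mask):
    """Zero out background pixels so that feature detection and
    matching are restricted to the foreground (tower) structure.
    image: HxWx3 uint8 array; mask: HxW boolean array (True = tower)."""
    out = image.copy()
    out[~mask] = 0
    return out

# Tiny synthetic example: a 4x4 grey image with a 2x2 foreground patch
img = np.full((4, 4, 3), 200, dtype=np.uint8)
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
masked = apply_mask(img, m)
```

In the real pipeline the boolean mask comes from the WireNetV2 prediction and the masked images constrain the patch-based MVS step.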
Firstly, a simple U-net [53] model was trained, but the quality of image segmentation was insufficient for correct point matching. The segmentation results of HRNetV2 [72] were much more correct. Still, the model was unable to distinguish between foreground and background wire structures in the images. Hence, a new model based on HRNetV2, called WireNet [73], was developed to improve the segmentation of the frontmost and rear parts of the tower.

WireNetV2 Model Architecture
The WireNet model was designed using the multi-scale fusion and high-to-low resolution convolutions developed for the HRNetV2 [72] model. Similar to HRNetV2, the original WireNet model has four multi-resolution convolution blocks that provide parallel fusion of multi-scale convolutions. Such an approach allows the model to track both fine and coarse details of the processed image at layers of different depth. The multi-resolution group convolution layer is similar to a regular group convolution layer that divides the input channels into groups and learns separate kernels for each group. In contrast to a regular group convolution, the multi-resolution group convolution operates over different spatial resolutions. This allows the network to reason implicitly about relationships between fine and coarse details.
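The fusion idea behind the parallel multi-resolution streams can be illustrated with a toy single-channel NumPy sketch: the high-resolution stream is downsampled by average pooling, the low-resolution stream is upsampled by nearest-neighbour repetition, and each stream sums its own features with the resampled other stream. The real blocks use learned strided/transposed convolutions and more streams; the functions below are a simplification for illustration only:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling: the high-to-low resolution path."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling: the low-to-high resolution path."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def multi_resolution_fusion(high, low):
    """Exchange information between two parallel streams: each output
    stream sums its own features with the resampled other stream."""
    return high + upsample2(low), low + avg_pool2(high)

high = np.ones((8, 8))
low = np.ones((4, 4))
new_high, new_low = multi_resolution_fusion(high, low)
```

Each stream keeps its own spatial resolution while receiving information from the other scale, which is the mechanism that lets the network relate fine and coarse details.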
The original WireNet [73] model extended HRNetV2 with two key contributions: (1) an additional parallel channel for the segmentation of rear structures, and (2) a negative log likelihood loss function. While the modified architecture was capable of segmenting images of dense wire structures with sufficient quality, it still suffered from two disadvantages: (i) it had low generalization ability and failed to label wire parts with a similar texture but different structure, such as the upper levels of the tower, if the training dataset included only ground-truth masks of the bottom and middle levels; (ii) the segmentation had soft edges at sharp corners, caused by the negative log likelihood loss. Such soft edges reduced the matching accuracy during the sparse key-point matching stage.
To eliminate these disadvantages, the neural network was improved into WireNetV2 by adding an additional adversarial loss to the WireNet baseline (Figure 7). The assumptions made by Luc et al. [70] were used as a starting point for the developed adversarial loss. Specifically, an additional adversarial loss provided by a discriminator network was added to improve the labelling quality in terms of both generalization ability and reduction of the soft edges of the segmentation. Furthermore, following Zhang [74], the proposed approach uses a tree-like discriminator structure that verifies the synthesized images at different resolutions. Five PatchGAN [69] discriminators were added to the framework: D_B, D_0, D_1, D_2, D_3. Discriminator D_B aims to qualify the labelling of the rear structures as either 'real' or 'fake', and the remaining discriminators similarly verify the network labelling at different spatial resolutions. The PatchGAN [69] discriminator consists of N convolutional layers. Each layer provides a receptive field of r_i = r_{i-1} · s + k, where s is the stride of the layer and k is the kernel size. Hence, the total receptive field of the PatchGAN model depends on the number of convolutional layers. The architecture of the PatchGAN discriminator and the receptive fields for various numbers of layers are presented in Table 3. The necessity of the proposed adversarial loss function was evaluated by comparing two ablated versions of the WireNetV2 framework (Figure 8). Qualitative experimental results demonstrate that the adversarial loss improves the quality of the segmentation in terms of contour accuracy. This improvement results in a notable reduction of the root mean square error of the best-fit point-to-point alignment with the laser scanning point cloud [6] in comparison with the first version of WireNet [73] (Section 5.3, Figure 15). Note that the adversarial loss reduces the amount of background visible through the masks (areas in red circles in Figure 8).
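The growth of the discriminator's receptive field with depth follows directly from the recurrence stated above; a short sketch, using typical PatchGAN values (kernel 4, stride 2) as assumptions, since the exact layer parameters are given in Table 3:

```python
def patchgan_receptive_field(num_layers, kernel=4, stride=2):
    """Receptive field after num_layers conv layers via the recurrence
    r_i = r_{i-1} * s + k stated in the text, starting from r_0 = 1."""
    r = 1
    for _ in range(num_layers):
        r = r * stride + kernel
    return r

# Receptive field grows quickly with depth: 6, 16, 36, ...
fields = [patchgan_receptive_field(n) for n in (1, 2, 3)]
```

This is why varying only the number of convolutional layers lets the five discriminators judge the labelling at different spatial scales.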

WireNetV2 Loss Function
Three loss functions govern the training process of the WireNetV2 model. The total loss is

L = λ_f · L_NLL(L_f, L̂_f) + λ_b · L_NLL(L_b, L̂_b) + λ_adv · L_adv(L_f, L_b, L̂_f, L̂_b),

where L_f is the ground truth foreground segmentation, L̂_f is the predicted foreground segmentation, L_b is the ground truth background segmentation, L̂_b is the predicted background segmentation, and λ_f, λ_b and λ_adv are hyperparameters. L_NLL(A, B) is a negative log likelihood loss function given by

L_NLL(A, B) = −(1/(w·h)) Σ_{x=1..w} Σ_{y=1..h} m_{A(x,y)} · log B_{A(x,y)}(x, y),

where w, h are the image width and height, A ∈ {0, 1}^{w×h} is the ground truth semantic labelling, B ∈ [0, 1]^{2×w×h} is a multichannel probability map defining the probability of the pixel with coordinates (x, y) belonging to class i, and m_i is the class weight for class i.
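The class-weighted negative log likelihood term can be sketched in NumPy; this is a toy two-class implementation for illustration, with the array layout (labels as an integer map, predictions as a per-class probability map) chosen to mirror the definitions above:

```python
import numpy as np

def weighted_nll(A, B, class_weights):
    """Class-weighted negative log likelihood over a h x w label map.
    A: (h, w) integer ground-truth labels;
    B: (C, h, w) predicted per-class probabilities;
    class_weights: length-C sequence of class weights m_i."""
    h, w = A.shape
    # Probability assigned to the true class of each pixel
    px = B[A, np.arange(h)[:, None], np.arange(w)[None, :]]
    m = np.asarray(class_weights)[A]  # per-pixel class weight
    return -(m * np.log(px)).sum() / (h * w)
```

In WireNetV2 this term is applied separately to the foreground and background channels, with λ_f and λ_b weighting their contributions.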
The adversarial loss function L_adv sums the standard adversarial losses of the five discriminators D_B, D_0, D_1, D_2 and D_3: each discriminator observes the input image masked with either the ground-truth or the predicted segmentation and scores it as 'real' or 'fake'.

WireNet Training Dataset
The acquired UAS imagery, consisting of about 600 images, was analyzed to identify the training sample for the WireNet model. Fifty images (about 8% of the whole data), containing descriptive features of the grid structure, were selected for creating a training dataset. Image processing for preparing the training dataset included the following steps: i. pixel-wise segmentation into the two classes "tower" and "background"; ii. generation of training labels.
As a result, the training dataset contains original RGB images and corresponding binary ground truth labels.
Figure 9 shows samples of image-label pairs from the training dataset at different heights (levels) of the tower.

Training Process and Performance of WireNetV2 Model
The developed WireNetV2 model has been trained using the PyTorch library [75] on the training part of the generated dataset (Section 4.4).
The training procedure was similar to the baseline training protocol. The data are augmented by random cropping (from 4912 × 3264 to 1000 × 1000), random scaling in the range [0.5, 2], and random horizontal flipping. The stochastic gradient descent (SGD) optimizer had a base learning rate of 0.01, a momentum of 0.9 and a weight decay of 0.0005. The poly learning rate policy with a power of 0.9 is used for dropping the learning rate. All the models were trained for 120K iterations with a batch size of 12 using two NVIDIA RTX 2080 Ti GPUs and syncBN.
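The poly learning rate policy mentioned above follows a simple closed form; a minimal sketch using the reported settings (base rate 0.01, power 0.9, 120K iterations):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' learning rate policy: base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Decays smoothly from base_lr at iteration 0 to 0 at the final iteration
lr_start = poly_lr(0.01, 0, 120_000)
lr_mid = poly_lr(0.01, 60_000, 120_000)
lr_end = poly_lr(0.01, 120_000, 120_000)
```

The schedule keeps the rate close to the base value early in training and decays it faster towards the end, which is the usual motivation for choosing a power below 1.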
The evaluation of the model on the independent test set demonstrated 91% accuracy in terms of the Intersection-over-Union (IoU) metric. The validation proved the better generalization ability of the WireNetV2 model compared with WireNet, thus improving the quality of the image processing aimed at tower segmentation.
An independent test split consisting of 100 labeled images was used to evaluate the segmentation accuracy of the WireNetV2 model with respect to three other state-of-the-art methods. The test split contains images captured at different heights of the tower (I, II, III). The accuracy is reported in terms of the Intersection over Union (IoU) metric. Qualitative results are given in Figure 10. Quantitative results are reported in Table 4.
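The IoU metric used for these evaluations can be stated in a few lines of NumPy; the empty-mask convention below is an assumption for the degenerate case:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```

A per-image IoU averaged over the test split gives the kind of scores reported in Table 4.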

Image-Based 3D Reconstruction
Preliminary image selection was performed to eliminate blurred and low quality images captured during the complex acquisition moments in the field. Then the photogrammetric image processing was performed on the remaining set of ca 500 images using the COLMAP pipeline (https://colmap.github.io) [76,77]. As all images contain the far-away scene in the background of the tower and the baselines between the UAS images are short, a threshold on the ray intersection angle was imposed in order to avoid 3D points reconstructed under a very small intersection angle.
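The intersection-angle check that motivates this threshold is a simple geometric computation; a sketch (the threshold value and function name are illustrative, not the actual pipeline configuration):

```python
import numpy as np

def intersection_angle_deg(c1, c2, point):
    """Angle (degrees) between the rays from two camera centres c1, c2
    to a triangulated 3D point; small angles mean poorly constrained depth."""
    r1 = np.asarray(point, float) - np.asarray(c1, float)
    r2 = np.asarray(point, float) - np.asarray(c2, float)
    cos_a = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# A distant background point seen from two nearby cameras subtends a
# tiny angle and would be rejected by a threshold of, say, 2 degrees.
angle_far = intersection_angle_deg((0, 0, 0), (1, 0, 0), (0.5, 0, 100))
```

Filtering out points whose best pairwise intersection angle falls below such a threshold suppresses unstable 3D points triangulated from the far-away background.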
COLMAP demonstrated good performance, and the results of the camera orientation are shown in Figure 11. These camera poses and the sparse point cloud were used in the dense image matching procedure to derive the final dense point cloud of the wire tower. As said (Section 3.3), the main problem preventing an accurate dense 3D reconstruction of the wire structure was the incorrect and noisy dense matching result when no masks were used in the MVS process (Figure 6). Therefore, a detailed image masking, based on the developed neural network, was applied to constrain the patch-based MVS method. Figures 12 and 13 illustrate the results of dense point cloud generation and surface 3D reconstruction. To evaluate the accuracy of the image-based 3D reconstruction results, the obtained dense point cloud was compared with the 3D data (http://www.andreyleonov.ru/projects/shukhov-tower.html) produced by the manual processing of the laser scanning survey (32 million points) [6].
Qualitative results that demonstrate the impact of the masks quality on the reconstructed 3D model are given in Figure 14.
Table 5 shows the results of the best-fit point-to-point alignment between the point cloud obtained by laser scanning and those obtained by multi-view stereo processing with different masking methods: UPerNet [71], HRNetV2 [72], and WireNetV2.
Results of the best-fit point-to-point alignment between the two point clouds are presented in Figure 15. The root mean square error between the two 3D clouds is 0.12 m with a standard deviation of 0.115 m.
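The point-to-point error reported above can be sketched as a nearest-neighbour RMSE between two clouds; this brute-force NumPy version is for illustration only (real comparisons on multi-million-point clouds use a k-d tree and an ICP-style best-fit alignment first):

```python
import numpy as np

def point_to_point_rmse(source, target):
    """RMSE of nearest-neighbour distances from each source point to the
    target cloud. source: (N, 3), target: (M, 3). Brute force O(N*M)."""
    d = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    nearest = d.min(axis=1)  # distance to closest target point
    return float(np.sqrt((nearest ** 2).mean()))

src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
tgt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
rmse = point_to_point_rmse(src, tgt)
```

The standard deviation of the same nearest-neighbour distances gives the second statistic quoted with the RMSE.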

Textured 3D Model
An advantage of image-based 3D reconstruction methods is the provision of high quality texture information, useful to investigate the current construction state (e.g., the location of rust areas or cracks in the metallic structures) and to plan conservation activities. Figure 16 presents a fragment of the created textured 3D model of the tower structures. The surface 3D model was created by Delaunay triangulation of the dense point cloud with a restriction on edge length. The image whose orientation is closest to the mean normal of the surface fragment was used for texture mapping. Photorealistic surface texturing was performed using the exterior orientation of the chosen image provided by the MVS processing. The quality of the produced textured 3D model is sufficient for a preliminary analysis of the condition of the tower elements and for the identification of possible problems in its structures.
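The view-selection criterion for texturing can be sketched as picking the image whose viewing direction is most anti-parallel to the surface fragment's normal, that is, the most frontal view. This is a simplified illustration that ignores the visibility and occlusion tests a full texturing implementation would need:

```python
import numpy as np

def best_texture_view(face_normal, view_dirs):
    """Select the index of the image whose (unit) viewing direction is
    most anti-parallel to the face normal, i.e. the most frontal view.
    view_dirs: (N, 3) unit viewing directions of the oriented images."""
    n = np.asarray(face_normal, float)
    n = n / np.linalg.norm(n)
    scores = np.asarray(view_dirs, float) @ n  # cosine per view
    return int(np.argmin(scores))  # most negative = most frontal

# A fragment facing +z is best textured by the camera looking along -z
idx = best_texture_view((0, 0, 1), [[0, 0, -1], [1, 0, 0], [0, 0, 1]])
```

Selecting the most frontal image minimizes perspective distortion of the mapped texture on that fragment.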

Discussion
Image-based 3D reconstruction of wire-structured and non-Lambertian reflecting objects shares many common problems. Reconstruction of such objects is repeatedly reported to be one of the most challenging fields in modern photogrammetry and computer vision [31,78,79]. Many attempts were made to reconstruct various complicated objects with multiple holes such as grid structures [29][30][31][32], gas flows [80,81], and flames [82,83]. While multiple modern baselines demonstrate the theoretical possibility of reconstructing wire-structured objects, most methods require either a fixed setup of sensors or a detailed physical model of the object being reconstructed. An example of such an object is a radio tower constructed from metal rods with multiple holes. Multiple attempts were made to develop methods for the reconstruction of such objects. Existing algorithms either leverage constraints based on the repetitive structure of an object [66,67] or try to match line segments instead of feature points [84].
The presented research focused on developing a robust pipeline for 3D reconstruction of complex wire structures with repeated elements. The pipeline should use only multi-view images as input and perform robustly without prior orientation information or physical constraints on the object's structure. The primary goal was to keep the proposed pipeline robust and general; to this end, it was designed to stay close to existing image-based 3D reconstruction methods. The main contribution of the presented research is a deep learning-based image masking approach that virtually converts a semi-transparent wire object into an opaque one that can be easily reconstructed using readily available structure-from-motion and multi-view stereo algorithms. The developed WireNetV2 model separates the front-facing elements of the object from the rear-facing elements visible through the holes in the object's surface.
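The masking step that makes the wire object effectively opaque can be sketched as follows. This is a simplified illustration under assumed data structures (nested lists for the image and the binary mask), not the paper's implementation:

```python
def mask_image(pixels, mask):
    """pixels: 2-D list of grayscale values; mask: 2-D list of 0/1
    labels from the segmentation network (1 = front wire structure).
    Background pixels are zeroed out so the feature detector cannot
    fire on rear structures visible through the holes."""
    return [[p if m else 0 for p, m in zip(prow, mrow)]
            for prow, mrow in zip(pixels, mask)]
```

In a real pipeline the mask would instead be passed to the SfM/MVS software (e.g. as a per-image alpha mask) so that keypoints are only detected and matched inside the masked region.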
The extensive experimental research showed that false matches are one of the main problems of the MVS procedure. Such matches usually occur when the point matching algorithm confuses a feature point on the front side with a feature point on the rear part visible through the holes in the object. These false matches generate random outliers inside and outside of the true wire 'surface'. A comparison of reconstruction results (Figure 14 and Table 5) generated with and without image masking proves that only the developed WireNetV2 model, which leverages semantic labeling, allows reconstructing both the surface of the complicated wire object and its texture.
Moreover, the comparison of the developed WireNetV2 model with state-of-the-art segmentation baselines presented in Figure 10 demonstrated that only the developed model could solve this sophisticated task of distinguishing front and rear metal structures. An ablation study (Figures 8 and 14) demonstrates that the proposed adversarial loss reduces the number of outliers at the boundaries of the object's structure. The main reason is that the negative log-likelihood loss minimizes the integral segmentation error and therefore tends to learn smoothed object silhouettes, especially at sharp edges. While such smoothing is acceptable for general segmentation tasks, it causes parts of the background to be included at the feature point matching step. Such errors in the masks produce outlier points in the resulting dense point cloud. While the developed method was crucial for the reconstruction of the Shukhov tower, the proposed pipeline is general and can be applied to other similar objects.
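The interplay between the two loss terms can be sketched as a weighted sum. The weighting factor `lam` and the non-saturating generator loss used for the adversarial term are assumptions for illustration, not the exact formulation of WireNetV2:

```python
import math

def nll_loss(probs, labels):
    """Mean per-pixel negative log-likelihood of the true class.
    probs: list of per-pixel class probability vectors; labels: list
    of true class indices."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def total_loss(probs, labels, disc_score_on_fake, lam=0.1):
    """Combined objective: NLL minimizes the integral segmentation
    error (and thus smooths silhouettes), while the adversarial term
    penalizes masks the discriminator recognizes as generated,
    pushing the network toward sharper boundaries."""
    adv = -math.log(disc_score_on_fake)  # generator wants D(fake) -> 1
    return nll_loss(probs, labels) + lam * adv
```

With a perfectly fooled discriminator (score 1.0) the adversarial term vanishes and only the NLL term remains.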

Conclusions
The paper presented an advanced image-based pipeline, aided by a neural network, for the 3D reconstruction of the large-size and complex Shukhov radio tower in Moscow. The performed study demonstrates the high potential of UASs for solving challenging 3D reconstruction tasks on complex grid structures. The image-based pipeline was combined with a deep learning approach in order to robustly detect and separate wire structure elements in UAS imagery, thus facilitating feature matching and the creation of accurate 3D products. The developed WireNetV2 neural network model carries out robust segmentation of complex grid structures in images with an IoU of 0.83, significantly reducing the number of false feature matches and, as a consequence, improving the quality of the 3D reconstruction to 0.12 m RMSE with respect to laser scanning. The quality of the resulting dense point cloud is sufficient for creating a textured 3D model of the tower, a product useful for preliminary visual inspection of the construction condition and for easily identifying places of interest in the tower.
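For reference, the IoU metric used to report the 0.83 segmentation quality can be computed as follows (a standard definition; the flat-list mask representation is an assumption for brevity):

```python
def iou(pred, gt):
    """Intersection-over-Union between two binary masks given as flat
    lists of 0/1 labels: |pred AND gt| / |pred OR gt|."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0
```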
Further research will address improving the performance of the WireNetV2 model for more accurate wire structure segmentation, in particular to capture thin elements in images. This will allow performing a more detailed 3D reconstruction of the tower elements from the same UAS imagery. Another topic of future research is the use of texture maps for the automatic detection of potentially weak elements.

Figure 1 .
Figure 1. The world's first diagrid hyperboloid water tower (37 m high), built by V. Shukhov for the All-Russian Exposition in 1896 in Nizhny Novgorod, Russia (a). The Shukhov radio tower, also known as the Shabolovka tower, built between 1919 and 1922 in Moscow, Russia (b). The world's first double-curvature steel diagrid by Shukhov (during construction), Vyksa near Nizhny Novgorod, 1897 (c).

Figure 5 .
Figure 5. Special markers used for labeling the reference points measured by a Geomax Zoom 25pro total station (a-c) and the corresponding images from UAS imagery (d-f).

Figure 6 .
Figure 6. Image orientation (a), sparse point cloud (b), and dense point cloud (c) produced on a set of images by standard processing.

Figure 8 .
Figure 8. Ablation study of the adversarial loss function: two versions of the WireNetV2 model are compared, an ablated version without adversarial loss (NLL) and the full version (NLL + adv). Note that the adversarial loss reduces the amount of background visible through the masks (areas in red circles).

Figure 9 .
Figure 9. Training dataset: samples from the paired training dataset containing color images and their labeling. Roman numerals indicate the various levels of the tower. All images were labeled at the original resolution of 4912 × 3264 pixels. To match the size of the receptive field of the WireNetV2 model, full images were cropped into tiles of 1000 × 1000 pixels. Note that labeling only levels I-IV of the tower from two imaging directions proved sufficient to perform segmentation of all levels from all viewpoints.

Figure 10 .
Figure 10. Examples of semantic segmentation using MobileNetV2 [54], UPerNet [71], HRNetV2 [72], and the WireNetV2 model on an independent test split of the dataset to automatically separate wires and background. Note that most of the compared methods fail to distinguish between foremost and rear wire structures.

Figure 11 .
Figure 11. Results of the camera orientation procedure for ca. 500 acquired images.

Figure 12 .
Figure 12. One of the 'stripes' processed with the proposed technique: left, whole 'stripe' processing; right, a fragment of the surface 3D model. The trained WireNetV2 model was applied to the UAS images for the automatic segmentation of the wire structures, providing quick and robust masking of the background scene and eliminating many outliers in the dense point cloud.

Figure 13 .
Figure 13. Dense 3D point cloud of the Shukhov tower (4.2 million points): bottom view (a), isometric view (b), and front view (c), with the rear part removed for better presentation.

Figure 14 .
Figure 14. Comparison of the accuracy (in meters) of the final point clouds with respect to the varying accuracy of the object masks provided by UPerNet [71], HRNetV2 [72], and the WireNetV2 model in its ablated and full versions. All point clouds were obtained from a series of ten photos of level II of the tower.

Figure 15 .
Figure 15. Textured full 3D model of the tower (a), and point-to-point alignment and comparison of the two point clouds, image-based and manually processed laser scanning data (b).

Figure 16 .
Figure 16. Textured fragment of the tower 3D model (a) with an up-scaled detail (b).

Table 1 .
Main characteristics of the SONY NEX-5 camera.
B and D0 use five convolutional layers. Discriminators D1, D2, and D3 use four convolutional layers.

Table 4 .
Intersection-over-Union (IoU) values for the WireNetV2 model and three state-of-the-art methods for the various levels of the tower, and the average IoU over all levels.

Table 5 .
Best-fitting errors (in meters) between the reference laser scanning point cloud and the photogrammetric point cloud computed with the different masking methods.