1. Introduction
With the development of deep learning and the continuous increase in GPU performance, convolutional neural networks (CNNs) have found wide applications in image recognition and object detection [
1,
2,
3,
4]. These algorithms are still being developed, and new deep learning models are being created. However, one problem in this area is that training deep convolutional neural networks requires a large number of ground-truth annotations. Moreover, acquiring such datasets is time-consuming. The acquisition of traditional training datasets involves data gathering and annotation. Although new image data are provided every day, whether by aircraft, drones, mobile platforms, or sensors mounted on satellites, there are many reasons why these data may not be sufficient or helpful in training deep learning models. One of the biggest challenges is preprocessing and preparing the acquired data so that a model can be trained on them.
The largest training databases for CNNs consist of natural scenes. These are large datasets made available for public use. ImageNet [
5], PASCAL VOC 2012 [
6], and MSCOCO [
7] are a few examples of learning datasets [
8]. However, the limitation is that they can only be applied to specific scenes or as an input to pretrain models. The challenge is also that objects are mapped differently in aerial photographs compared with those in natural ground scenes. The low diversity of object classes, variability of scale, orientation, and shape of objects on the Earth’s surface are also limitations that make using these existing and well-archived datasets impossible in specific tasks. As for photogrammetry and remote sensing applications, manual annotation is usually unavoidable, which requires a lot of time and labor.
However, with the rise of data-intensive deep learning methods, the number of solutions and effective strategies for generating ground-truth data is increasing. One methodology that is also worth mentioning here is that proposed by Laupheimer and Haala, 2021 [
9]. It involves transferring labels from the manually annotated point clouds to the mesh and from there to the image space for the semantic scene analysis task. This is one example of the methods and approaches that partially solve the problem of dataset inaccessibility and manual work. Several other examples are briefly described in
Section 2 with a review of related works.
This work also uses a process to semi-automate the creation of training datasets for object detection in nadir and oblique aerial images. For this purpose, orthophotomaps and point clouds are used as a starting point for generating ground-truth bounding boxes on images. The methodology is described in more detail in
Section 4.
Although such an approach does not eliminate manual work, it reduces the effort. Creating a point shapefile layer and marking objects using an orthophotomap is much less time-consuming than labeling all photos in a project and marking all bounding boxes surrounding objects.
The main contribution of this work is the assessment of the accuracy and quality of training datasets acquired in this way. The research aims to assess how the accuracy of point clouds extracted from dense matching of UAV imagery influences the bounding boxes generated with the proposed method.
The rest of this paper consists of the following parts.
Section 2 reviews the availability of training datasets and describes methods for automating the process of generating annotations for images.
Section 3 describes the data used.
Section 4 is a presentation of the methodology adopted in this work.
Section 5 contains the results and an assessment of the accuracy. Finally,
Section 6 is a summary of the work and includes conclusions.
2. Related Works
The spread of deep learning methods in Earth observation has resulted in the creation and availability of some datasets for training models on aerial and satellite images. However, training datasets consisting of aerial photographs or satellite scenes are far less abundant than natural-scene collections such as ImageNet. Moreover, these collections tend to have a low diversity of object categories. The most frequent datasets include objects such as cars: TAS set [
10], VEDAI [
11], UCAS-AOD [
12], the 3K-DLR-Munich [
13]; ships: RSOD [
14], HRSC2016 [
15]; and buildings: the SZTAKI-INRIA dataset [
16]. The two largest datasets for object detection in the Earth observation domain are DOTA [
17], which consists of 15 categories of objects and 2806 aerial images, and DIOR [
18], which contains 23,463 images and 192,472 instances, covering 20 object classes. Nevertheless, even these multiclass datasets, with their dozen or so object categories, may not contain the classes required in a particular case, since the objects of interest are sometimes specific to a particular application.
However, the availability of oblique aerial images in training datasets for object detection is still low. The potential that oblique photogrammetry brings is quite significant. Firstly, it provides a data source with distinct advantages: multiple views from different perspectives and significantly different image scales [
19]. In addition, oblique photogrammetry carries the possibility of obtaining information about the location of an object in a terrain system and the use of the multitemporal feature [
20].
The advantages of this technique have attracted the scientific community’s interest in oblique aerial photographs. This is evidenced by the appearance of publications and scientific studies on object detection [
1,
21].
UAV photogrammetry is a cost-effective and flexible data acquisition approach that provides a data source of nadir and oblique images [
22]. Aerial and drone photogrammetry also brings high-accuracy imagery of the same object several times in different photos due to the side and forward overlap in a block of images. Moreover, the use of oblique images gives the additional advantage that a given object is imaged from several directions at different angles, which further increases the size and diversity of the dataset. Using data acquired from UAVs and photogrammetric products such as point clouds or orthophotomaps enables the detection of objects using algorithms based on deep learning frameworks [
23]. Due to the fast high-resolution data acquisition ability, a UAV-based system could be used in many fields. Such solutions can be applied in the inventory and modeling of technical and transport infrastructure objects. The interest in using deep learning algorithms for object detection has grown in such industries as railroads, power generation, and road construction [
24,
25,
26,
27,
28]. Drone-based solutions are also applied to inspect solar panels [
29].
The presented examples outline the potential of using UAV-acquired data to detect objects of interest using deep learning methods. The review papers [
30,
31] summarize earlier work on the fundamentals of deep learning applied to UAV-based imagery. Ramachandran and Sangaiah highlighted the aforementioned problem of dataset availability in their work [
30], and it was pointed out that it is essential for the progress of research in this field to create a large benchmark dataset dedicated to the problem of object detection by UAVs.
Similar to aerial and satellite imagery, publicly available training datasets for UAV applications most often include classes such as cars [
21,
32,
33,
34]. Training datasets for engineering infrastructure [
35] or tree detection [
36] are also starting to appear, but class size and diversity are still disappointing.
Therefore, research teams spend a lot of time creating such datasets or using other solutions such as fine-tuning-based approaches or other transfer learning methods. Alternative methods can also be used to speed up the manual process—weak supervision or Semi-Supervised Learning (SSL) [
37,
38].
Another approach to tackle the lack of training data is to automate the process of generating annotations for images. Such solutions are used in both image segmentation and object detection. In their work, Ros et al., 2016 [
39], proposed to generate synthetic images with pixel-level annotations. A further proposition to solve this problem is using a LiDAR point cloud or 3D reconstruction of the scene to lift the semantic instance labeling task from 2D into 3D [
40]. The authors in [
41] proposed the automatic generation of annotations on images. The method consists of three steps: (1) manual labeling of one or two aerial images; (2) transferring the pixel labels to multiple UAV images via the UAV point cloud; (3) refining the generated annotations using a densely-coupled CRF model and naive Bayes classifier. In their study, Zachar et al., 2022 [
20], addressed the lack of training data for the model and proposed a methodology where manual labor is replaced by the use of existing resources for transferring references to new databases for training models for detecting objects on oblique aerial images. Similar solutions with transferring references and adopting deep learning-based algorithms in natural scene images to detect objects in UAV images have already been proposed [
42,
43,
44,
45].
3. Dataset Description
For the experiments in this work, photogrammetric data (aerial nadir and oblique images) were acquired with a DJI Phantom 4 RTK drone equipped with the FC6310R camera (resolution: 5472 × 3648 pixels; focal length: 8.8 mm; pixel size: 2.41 µm), and the derived products (point clouds and orthophotomaps) were used. The study area over which the photogrammetric data were acquired was a railroad section near Czestochowa, Poland (Herby). The data were acquired in March 2021. Oblique and nadir aerial images were acquired in multivariate photogrammetric missions. Different flight heights and forward and side overlaps of the photos were tested. Data were processed in Pix4D Mapper software for a section of railroad infrastructure in the test area.
Data acquired in multiple variations were used to generate point clouds. The selection of photogrammetric mission parameters was driven by experiments on UAV data acquisition methodology with respect to the input data requirements of neural networks. This topic was not studied in the publications cited above. Nadir images were acquired at two heights (54 m and 90 m), while oblique images were acquired at 40 m with the camera tilted by 45°. In addition, the variants differed in their forward and side overlap, which affects not only the resulting point cloud but also the processing time and the economic aspect.
Due to different mission flight parameters (heights, coverages) and different parameter settings in the software, products with other characteristics were obtained. Fourteen scenarios for image acquisition and processing were prepared for the study area. The variants also differed in the use of vertical and oblique photos and their combinations. Not all variants used all acquired photos, and this was modified by excluding, for example, every second row or every second photo.
Table 1 shows all the variants’ summaries and basic parameters. As a result, 14 different point cloud variants were obtained.
The resulting alignment accuracies are shown in the tables below (
Table 2 and
Table 3). For control points, the average RMS error ranges from 0.2 cm to 3.6 cm, while for check points it ranges from 1.3 cm to 3.6 cm. On control points, the best accuracy was obtained for variant no. 3 (flight altitude: 54 m; OF: 90%; OS: 80%; GSD: 1.50 cm). Variant no. 1 (flight altitude: 54 m; OF: 90%; OS: 90%; GSD: 1.50 cm) showed the best accuracy on check points. The bundle block adjustment is further characterized by the average reprojection error in pixels (
Table 4).
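As a rough plausibility check of the GSD values quoted above, the nadir ground sampling distance can be estimated from the flight height and the camera parameters given at the beginning of this section. The sketch below assumes a simple pinhole camera over flat terrain; the function name is illustrative and not part of the original processing chain.

```python
def nadir_gsd_cm(flight_height_m, pixel_size_um=2.41, focal_length_mm=8.8):
    """Approximate nadir GSD in cm: GSD = H * pixel_size / focal_length."""
    pixel_size_m = pixel_size_um * 1e-6
    focal_length_m = focal_length_mm * 1e-3
    return flight_height_m * pixel_size_m / focal_length_m * 100.0


# The two nadir flight heights used in the study
for height in (54.0, 90.0):
    print(f"H = {height:.0f} m -> GSD ~ {nadir_gsd_cm(height):.2f} cm")
```

For a flight height of 54 m this gives approximately 1.48 cm, which is consistent with the 1.50 cm reported for variants no. 1 and no. 3. The relation does not hold directly for the oblique images, where the scale varies across the frame.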
For the experiment’s performance, terrestrial laser scanning (TLS) point clouds were used as the ground truth (the reference data to which the results would be compared).
An important aspect from a research perspective was that the data were acquired on the same day because it was a railroad station reconstruction site. The dynamic changes that can occur from day to day in such an environment could make it impossible to compare results from data taken at different times. Then, changes in land cover—such as the placement of a new building—would additionally need to be verified. In the case of the experiments above, the data were confirmed in this regard, but it is a crucial point to keep in mind.
The area for which TLS data were acquired was smaller than the area covered by the UAV data. Therefore, the UAV data were also limited to the site for which TLS data were acquired. This area contained 20 traction poles and 15 railroad gates, and these objects were analyzed.
The terrestrial laser scanning (TLS) data collection was conducted using a Leica RTC360 scanner. For the study area, 115 scans were taken using the medium settings, corresponding to a point resolution of 6 mm at a distance of 10 m. The data were georeferenced into the PL-1992 terrain coordinate system using ground control points. For this purpose, 13 points were measured by the RTK method, using the national reference network (ASG-EUPOS) with a measurement accuracy of 0.03 m (horizontal) and 0.05 m (vertical). Registration of the scans was performed using the Cloud-to-Cloud (C2C) method in Leica Cyclone REGISTER 360 software. The accuracy of the C2C fit for the bundles was 1 cm. The average error of matching the point clouds acquired from different scanner locations to ground control points was below 10 cm. The accuracy values are shown in
Table 5.
An important aspect investigated in the present experiments is the accuracy of the point clouds generated by dense image matching (DIM). The alignment accuracies are presented above. In addition to the alignment reports of each variant, the point cloud densities were compared, and the point clouds were visually inspected to evaluate their noise. A summary of the average point density per m³ is shown in
Table 6. The highest densities were obtained for the variants whose point clouds were generated on high settings.
By subjecting all variants to visual analysis, it can be seen that a more significant number of elements were mapped into the point cloud for the variants generated on the high setting. However, point clouds from these variants are noisier. This is particularly evident for the traction poles. The following figures show examples of a gate (
Figure 1) and a traction pole (
Figure 2) for all point cloud variants.
When comparing even the most accurate point cloud variants from dense image matching with terrestrial laser scanning data, it is apparent that some elements have not mapped into the dense point cloud. Furthermore, as more details are mapped and the density of the point cloud increases, noise increases. Such an effect is not desirable, especially when the next step is to project the points into pixel coordinates of the image to determine each object’s precise location (bounding box).
Taking all these aspects into account, experiments were conducted to verify how the point cloud accuracy affects the resulting bounding boxes. As a result, recommendations were formulated on which flight mission parameters should be used to automate the creation of high-accuracy training datasets. In addition to the flight mission parameters, the data processing settings with which the point clouds are generated are essential, so that the point clouds are sufficient for the specified purpose.
In the following part of the paper, we describe in detail the particular steps of the research, the methodology adopted, and the results.
4. Methodology and Experimental Setup
The processing of UAV-acquired images produces various photogrammetric products, such as point clouds and orthophotomaps. These products were used as source data as part of the methodology to support the preparation of training datasets.
The first essential step was to process the image blocks; that is, to orient the data. The results from this step were described in an earlier section, where the alignment accuracies for each variant were also presented. As part of the data processing, point clouds and orthophotomaps for all variants were created in Pix4D.
The next part was the preparation of files containing information on the location of objects of interest. This step was necessary because of the need to have the terrain coordinates of each object so that, based on them, the point clouds from dense image matching could be clipped to the cloud fragments containing the object. Two different approaches were used for gates and traction poles. Based on the orthophotomap taken from one of the image variants, a point layer was created in ArcGIS Pro. The traction poles were marked with points on the orthophotomap, which made it possible to capture the information of the pole’s X, Y, and Z terrain coordinates (
Figure 3a). As for the gates, it was decided to create a polygon layer due to the different characteristics of the object. An example of a vectorized gate can be seen in
Figure 3b.
The resulting terrain coordinates of all the objects served as input information for cutting the sections from the point cloud that contained points belonging to the object (
Figure 4). In the case of gates, the point clouds were cropped with a polygon, while for traction poles, a buffer from a point with a radius of 5 m was used, so there was assurance that all traction pole elements would be mapped including the booms.
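The two cropping strategies described above can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names are hypothetical, the point cloud is assumed to be an N × 3 array of terrain coordinates (X, Y, Z), and the polygon is assumed to be simple (non-self-intersecting).

```python
import numpy as np


def crop_by_radius(points, center_xy, radius=5.0):
    """Keep points whose horizontal distance to center_xy is <= radius."""
    d = np.hypot(points[:, 0] - center_xy[0], points[:, 1] - center_xy[1])
    return points[d <= radius]


def crop_by_polygon(points, polygon_xy):
    """Keep points whose (X, Y) lies inside a simple polygon (ray casting)."""
    x, y = points[:, 0], points[:, 1]
    poly = np.asarray(polygon_xy, dtype=float)
    inside = np.zeros(len(points), dtype=bool)
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # True where the edge straddles the horizontal line through the point
        crosses = (y1 > y) != (y2 > y)
        # X position where the edge meets that horizontal line
        # (horizontal edges never "cross", so their division result is unused)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
        # An odd number of crossings to the right means the point is inside
        inside ^= crosses & (x < x_cross)
    return points[inside]


# Synthetic cloud standing in for a dense-matching point cloud
rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 20.0, size=(1000, 3))
pole_points = crop_by_radius(cloud, center_xy=(10.0, 10.0), radius=5.0)
gate_points = crop_by_polygon(cloud, [(2, 2), (6, 2), (6, 4), (2, 4)])
```

The 5 m radius follows the value stated in the text; the coordinates and the gate polygon in the usage lines are made up for illustration.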
As can be seen in
Figure 4, the cropped point clouds still contain elements that do not belong to the objects of interest (including points belonging to the ground class). Thus, ground filtering proved necessary for the bounding boxes to represent the objects well. The Cloth Simulation Filter (CSF) method, originally developed to extract ground points from LiDAR point clouds, was chosen. Testing (initially in CloudCompare software) and verification of the results showed that, for the purposes of these experiments, this method is also suitable for point clouds from dense image matching, as shown in
Figure 5. Therefore, the CSF filter [
46] implementation provided on GitHub (
https://github.com/jianboqi/CSF, accessed on 1 November 2022) was used and added as a component of our algorithm.
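To illustrate what the ground filtering step achieves, the sketch below uses a much simpler stand-in technique: a grid-minimum height filter. It is explicitly not the Cloth Simulation Filter used in this work; it only shows the idea of separating ground points from object points in a cropped cloud. The function name, the cell size, and the height threshold are assumptions chosen for illustration.

```python
import numpy as np


def remove_ground_grid_min(points, cell=1.0, height_thresh=0.3):
    """Drop points lying within height_thresh of the lowest point of their
    XY grid cell (a crude stand-in for a real ground filter such as CSF)."""
    xy = points[:, :2]
    ij = np.floor((xy - xy.min(axis=0)) / cell).astype(int)
    # One integer key per grid cell
    keys = ij[:, 0] * (ij[:, 1].max() + 1) + ij[:, 1]
    _, inverse = np.unique(keys, return_inverse=True)
    # Lowest Z value found in each cell
    zmin = np.full(inverse.max() + 1, np.inf)
    np.minimum.at(zmin, inverse, points[:, 2])
    height_above_local_min = points[:, 2] - zmin[inverse]
    return points[height_above_local_min > height_thresh]


# Synthetic example: flat ground plus a vertical pole 8 m high
gx, gy = np.meshgrid(np.arange(0.0, 4.5, 0.5), np.arange(0.0, 4.5, 0.5))
ground = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)])
pole = np.column_stack([np.full(17, 2.2), np.full(17, 2.2), np.linspace(0.0, 8.0, 17)])
objects_only = remove_ground_grid_min(np.vstack([ground, pole]))
```

A filter of this kind fails where a grid cell contains no ground at all or where the terrain is steep, which is one reason a dedicated method such as CSF is preferable in practice.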
In addition to filtering out ground points, it was also necessary to filter out points that were noise and did not belong to the object of interest. First, “isolated points” noise filtering was applied using Open3D library functions. These methods examine the neighborhood in the point cloud and reject outliers on this basis. Two methods were used:
statistical outlier removal—removes points that are further away from their neighbors compared to the point cloud average;
radius outlier removal—removes points that have few neighbors in a given surrounding space.
An example of outlier filtering for gates is shown in
Figure 6, where the red points represent noise.
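The principle behind both filters can be sketched in plain NumPy as follows. This is a re-implementation of the idea for illustration, not the Open3D code used in this work (Open3D provides the corresponding functions `remove_statistical_outlier` and `remove_radius_outlier`). The parameter values are assumptions, since the settings actually used are not stated in the text, and the brute-force distance matrix is only suitable for small cropped clouds.

```python
import numpy as np


def _pairwise_dist(points):
    """Full N x N Euclidean distance matrix (memory grows with N squared)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def statistical_outlier_mask(points, k=8, std_ratio=2.0):
    """True for points whose mean distance to their k nearest neighbours does
    not exceed the cloud-wide mean of that value by std_ratio std deviations."""
    d = _pairwise_dist(points)
    # After sorting, column 0 is the distance of each point to itself (zero)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    return knn_mean <= knn_mean.mean() + std_ratio * knn_mean.std()


def radius_outlier_mask(points, radius=0.5, min_neighbors=4):
    """True for points with at least min_neighbors other points within radius."""
    d = _pairwise_dist(points)
    neighbors = (d <= radius).sum(axis=1) - 1  # do not count the point itself
    return neighbors >= min_neighbors


# Dense synthetic cluster plus one isolated point far away
rng = np.random.default_rng(1)
cloud = np.vstack([rng.uniform(0.0, 1.0, size=(200, 3)), [[50.0, 50.0, 50.0]]])
keep = statistical_outlier_mask(cloud) & radius_outlier_mask(cloud)
filtered = cloud[keep]
```

In the synthetic example the isolated point is rejected by both criteria, which corresponds to the red points marked as noise in the figure.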
While the aforementioned filters worked well for gates, which generally had less noise, manual filtering was necessary for traction poles, which turned out to be more complex objects. An example of such a case is shown in
Figure 4c, where objects that do not belong to the object of interest (the cloud points on the right) have also been mapped in the cut-out fragment of the point cloud (in the buffer of 5 m from the pole location point). It was decided that, despite the initial filtering using Open3D library functions, all objects would be verified for each variant, and unnecessary objects would be removed since no optimal tool was found that would automatically remove such fragments of the cloud. Manual verification and editing (filtering) were applied to both TLS point clouds and point clouds from UAV image matching. After making sure that all the point clouds for the objects were correctly prepared, the final step occurred. This consisted of projecting the cloud points onto the image, basically converting the terrain coordinates into pixel coordinates of the images.
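The projection step can be illustrated with a distortion-free pinhole model. The sketch below is a simplification under stated assumptions: the principal point is placed at the image center, lens distortion is ignored, and the nadir rotation matrix and camera position are invented for the example. In practice, the calibrated interior and exterior orientation from the bundle block adjustment would be used. All names are hypothetical.

```python
import numpy as np

# Camera constants derived from the parameters quoted in the dataset section
F_PX = 8.8e-3 / 2.41e-6          # focal length expressed in pixels
CX, CY = 5472 / 2.0, 3648 / 2.0  # assumed principal point (image center)

# Assumed nadir attitude: camera x = east, camera y = south, camera z = down
R_NADIR = np.array([[1.0, 0.0, 0.0],
                    [0.0, -1.0, 0.0],
                    [0.0, 0.0, -1.0]])


def project_points(points_xyz, R, camera_center, f_px, cx, cy):
    """Project terrain points to pixel coordinates (u, v).

    R rotates terrain axes into camera axes (x right, y down, z forward).
    Returns the pixel coordinates of the points in front of the camera and
    a boolean mask telling which input points those are."""
    cam = (np.asarray(points_xyz, dtype=float) - camera_center) @ R.T
    in_front = cam[:, 2] > 0
    cam = cam[in_front]
    uv = f_px * cam[:, :2] / cam[:, 2:3] + np.array([cx, cy])
    return uv, in_front


# A point directly below the camera and a point 1 m to the east of it
camera_center = np.array([0.0, 0.0, 54.0])
terrain_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
uv, visible = project_points(terrain_points, R_NADIR, camera_center, F_PX, CX, CY)
```

With these assumptions, the point below the camera maps to the image center, and a 1 m offset on the ground corresponds to roughly 68 pixels at a flight height of 54 m.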
Having the cloud points transformed into pixel coordinates of the photo, it was possible to carry out the last part; that is, to calculate the coordinates of the bounding box. These values were estimated based on the maximum and minimum values of the projected points (umax, vmax, umin, and vmin). This is how the final result was created. The bounding box surrounding the object was obtained by transferring the point cloud to the images (
Figure 8).
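Computing the bounding box from the projected points can be sketched as below. The handling of objects that are only partly visible (clipping the box to the image, and skipping objects with no projected point inside the image) is an assumption made for this example and is not described in the text; the function name is hypothetical.

```python
import numpy as np


def bbox_from_pixels(uv, width, height):
    """Return (umin, vmin, umax, vmax) clipped to the image frame, or None if
    no projected point falls inside the image."""
    uv = np.asarray(uv, dtype=float)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
              (uv[:, 1] >= 0) & (uv[:, 1] < height))
    if not inside.any():
        return None
    umin, vmin = uv.min(axis=0)
    umax, vmax = uv.max(axis=0)
    return (float(max(umin, 0.0)), float(max(vmin, 0.0)),
            float(min(umax, width - 1.0)), float(min(vmax, height - 1.0)))


# Three projected points, one of them slightly above the top image edge
box = bbox_from_pixels([[100, 200], [150, 260], [120, -10]], 5472, 3648)
print(box)  # (100.0, 0.0, 150.0, 260.0)
```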
The methodology (
Figure 9) thus developed involves using photogrammetric data and products as source data for object annotation. Then all the steps described in this section are carried out. The final result is numerous training datasets consisting of images and information about each object’s position in the image in the form of a bounding box saved in a text file according to the requirements of ML detectors and models.
For example, the resulting training databases could be used to train detectors such as YOLO or Fast R-CNN. On the other hand, an important issue is evaluating the accuracy of the resulting bounding boxes. This part related to the evaluation of the results is addressed in the next section, where the overlap between the reference (ground truth) and the bounding box, which is the result obtained by transferring the point cloud to the images, is examined.
As mentioned earlier, the experiments were conducted on two objects of railroad infrastructure—gates and traction poles. The following section shows sample results—bounding boxes plotted on the images and bounding objects of interest. Visual inspection of the results also formed part of the analyses.
Figure 10 shows examples of the results for the traction poles. The right side shows results for point clouds from TLS, and the left side shows bounding boxes generated from point clouds from dense image matching. A similar visual comparison was made for the second object analyzed—gates (
Figure 11).
The resulting bounding boxes obtained in this way can constitute a training dataset. Thus, the input to the network will be an image or a fragment of a photo containing the object of interest, and a file, for example, a text file containing information about the object’s position in the pixel coordinate system. If a photo involves more than one object, all objects should be included in the file containing bounding box information to complete the collection. An important point to emphasize here is that by marking an object on an orthophotomap at one time, it is possible to obtain numerous training datasets as a given object may be visible in up to a dozen images. This depends on the parameters of the flight mission. This is described more extensively in the next section, including the number of bounding boxes obtained on all the images of a variant. The presented methodology was used to conduct experiments, the results of which are described in the next section.
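As an illustration of such a text file, the sketch below writes bounding boxes in the label format commonly used by YOLO-family detectors: one line per object with a class index followed by the box center, width, and height, all normalized by the image size. The choice of this format, the class indices, and the function names are assumptions made for the example; the format actually used in this work is not specified.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (umin, vmin, umax, vmax) to a YOLO-style label line."""
    umin, vmin, umax, vmax = box
    x_center = (umin + umax) / 2.0 / img_w
    y_center = (vmin + vmax) / 2.0 / img_h
    width = (umax - umin) / img_w
    height = (vmax - vmin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def write_labels(path, objects, img_w, img_h):
    """Write one line per object; objects is a list of (class_id, box) pairs,
    so that every object visible in the photo ends up in the same file."""
    with open(path, "w") as fh:
        for class_id, box in objects:
            fh.write(to_yolo_line(class_id, box, img_w, img_h) + "\n")


# Hypothetical class indices: 0 = traction pole, 1 = gate
line = to_yolo_line(1, (1368, 912, 4104, 2736), 5472, 3648)
print(line)  # 1 0.500000 0.500000 0.500000 0.500000
```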
5. Results
An important issue in evaluating the accuracy of deep learning models is the quality and accuracy of the training dataset. As is well known, the accuracy of an object detection model depends on the quality and number of training samples, input images, model parameters, and the required accuracy threshold. Therefore, the accuracy, correctness, and completeness of the bounding boxes for training the model are crucial when automating the training dataset generation process.
When interpreting the deep learning model results for object detection, the accuracy is evaluated using different metrics. One of them is the Intersection over Union (IoU) factor. This is a metric used as a threshold for determining whether a predicted result is a true positive or a false positive. This coefficient determines the overlap between the predicted bounding box around the object and the ground-truth bounding box. In this way, IoU can be used as a metric to evaluate the accuracy of an object detector on a particular dataset by comparing the results from the model to the reference data.
As previously mentioned, the main contribution of this paper is to analyze the impact of the accuracy of point clouds extracted from the nadir and oblique image matching on the resulting bounding boxes extracted automatically. Thus, in order to be able to evaluate the influence of the accuracy of point clouds extracted from image matching on the resulting bounding boxes, in addition to the visual assessment, the metric Intersection over Union was used (
Figure 12).
This metric is used for the accuracy assessment of object detectors on a given set of data. Each detector, which provides bounding boxes as an output from the model, can be evaluated with the IoU metric. To apply this metric, it is necessary to have:
ground-truth bounding boxes (labeled bounding boxes from the test dataset, which specify the location of the object on a pixel coordinate system of the image);
predicted bounding boxes (output bounding boxes from the model).
Based on these values, it is possible to evaluate the overlap between the reference (ground truth) and the bounding box obtained from projecting cloud points onto images. IoU is therefore measured as the area of intersection of the ground-truth bounding box with the output bounding box divided by the area of their union.
The numerator is the area of intersection of the predicted and ground-truth bounding boxes, and the denominator is the area of their union, that is, the total area covered by both boxes with the overlapping part counted only once. Dividing the area of overlap by the area of union yields the final result, Intersection over Union. The IoU value ranges from 0 (no overlap) to 1 (the boxes are identical).
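For two axis-aligned boxes A and B, this corresponds to IoU = area(A ∩ B) / area(A ∪ B). A minimal sketch of the computation, assuming boxes given as (umin, vmin, umax, vmax) in pixel coordinates (the function name is illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (umin, vmin, umax, vmax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 (no overlap)
```

Two equally sized boxes shifted by half their width overlap by half of each box, yet their IoU is only 1/3, which shows how strongly the metric penalizes misalignment.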
Due to the fact that no object detector was used in the above experiments, the ground-truth bounding boxes were treated as a reference (obtained based on point clouds from the terrestrial laser scanning), while the resulting bounding boxes (obtained based on point clouds from dense image matching) were treated as “predicted”. Both ground-truth and the resulting bounding boxes were obtained using a semi-automation script, as described in the section above.
As a result, all 14 variants were compared with ground-truth bounding boxes generated from TLS.
5.1. Gates
The first evaluated objects were traction gates. The experiment results are presented in
Table 7, where the accuracies for all variants are summarized. The best results were obtained for the variant marked no. 5, for which the IoU value was 92.42%. The results obtained for the rest of the variants also showed relatively high accuracies.
The lowest IoU value was obtained for the variant marked no. 14—77.06%. A regularity that can be observed from the results is that for the variants for which the point cloud was generated with the “high” settings, lower values were obtained than for the variants with the default settings. After analyzing the experiment results, it can be concluded that the higher noise, which was obtained for the dense point clouds generated with “high point density” settings, significantly influenced the results.
Based on the results, it is possible to indicate the flight mission variant for which the resulting bounding boxes were most accurate relative to the references. Furthermore, a threshold can be defined to distinguish a valid result (more precisely, what is or is not a “good” match). In that case, the threshold metric is the IoU value. Adjusting the threshold makes it possible to separate acceptable results from erroneous ones and thus to create a high-quality training dataset: erroneous results can be discarded, and those that may need manual correction before being passed to the model can be highlighted. More about the threshold value is described in
Section 5.3.
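Such a thresholding step could look like the sketch below, which splits per-instance IoU values into accepted results and results flagged for manual review. The threshold of 0.75, the record layout, and the image and object names are hypothetical; no threshold is fixed in this work.

```python
def split_by_iou(records, threshold=0.75):
    """Split (image_name, object_id, iou_value) records into those accepted
    automatically and those flagged for manual checking."""
    accepted = [r for r in records if r[2] >= threshold]
    to_review = [r for r in records if r[2] < threshold]
    return accepted, to_review


def mean_iou(records):
    """Average IoU over all records, e.g. for one flight mission variant."""
    return sum(r[2] for r in records) / len(records)


# Made-up records for illustration only
records = [
    ("img_0001.JPG", "gate_03", 0.92),
    ("img_0002.JPG", "gate_03", 0.81),
    ("img_0003.JPG", "pole_11", 0.48),
]
accepted, to_review = split_by_iou(records, threshold=0.75)
```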
Thus, it can be stated that high point cloud density settings gave slightly worse results. Moreover, such a point cloud requires up to four times more processing time and RAM than one generated with the optimal density setting.
Examples of the resulting bounding boxes plotted on the images around the objects are presented below. Examples of correct results with high IoU values (
Figure 13) and worse results (
Figure 14) that would require possible manual improvement before inclusion in the training of the model are shown.
5.2. Traction Poles
The results for the second object differ from those obtained for the gates. The traction poles turned out to be more complex objects that were mapped less well in the point clouds than the gates. As with the gates, the accuracies for traction poles are summarized in
Table 8.
The best results were obtained for the variant marked no. 9, for which the IoU value was 87.54%. For the rest of the variants, the IoU values were around 0.75. The lowest IoU value, 68.64%, was obtained for the variant marked no. 6. The regularity noticed in the experiments for gates, that point clouds generated on high settings gave worse values, did not occur for traction poles. Here, higher accuracies were obtained for three out of four variants on high settings.
Analyzing the accuracy results for the traction poles, it can also be noticed that for variants where in addition to the nadir photos the oblique images were used, much better results were obtained. Thus, based on these results, it can be concluded that for more complex objects extending above the terrain, such as traction poles, it is appropriate to include oblique photos when generating point clouds.
As for the gates, example results are presented for the traction poles. A correct result had an IoU of 0.94, where the differences between the bounding boxes are almost invisible to the eye (
Figure 15). An example where part of the traction pole was probably not mapped to the point cloud from the dense image matching resulted in an IoU of only 0.48 (
Figure 16).
5.3. General Considerations
For both types of analyzed objects, no correlation with, or significant effect of, forward and side overlaps or other flight mission parameters (such as flight altitude) was observed.
As stated before, IoU values are usually expressed in percentages; the most used threshold values are 50% and 75% [
47]. However, such values are used as detection evaluation metrics to quantify the performance of detection algorithms in different areas and fields. Thus, the question arises as to what threshold to apply to the above methodology. Obviously, the automatically obtained training dataset should be as accurate as possible. Therefore, to define an IoU threshold for individual objects, a more precise evaluation would have to be carried out, aiming to indicate the value below which a result should still be manually checked and corrected. Such a study could be an extension of the above experiments.
The overall conclusion is that the type of flight mission, including image acquisition and processing parameters, significantly influences the quality of the point clouds.
An important achievement is the size of the training dataset. A summary of the instances of objects in the images for all variants can be seen in the above tables (
Table 7,
Table 8). A total of 35 objects (20 traction poles and 15 gates) were vectorized, and for some variants the final number of object instances in the photos reached about three thousand. Thus, the methodology can be used to automate the creation of training datasets to a large extent.
Figure 17 and
Figure 18 show whole high-resolution images with the ground-truth annotations and the resulting bounding boxes obtained from the process.
6. Conclusions
Annotating ground-truth data and gathering training datasets are crucial for training models, supervised object detectors, and deep learning approaches. This is tedious, time-consuming, and labor-intensive work, which results in a lack of ground-truth datasets. The necessity to train models with a vast amount of independent annotated data (usually on the order of tens of thousands) is still an open problem that limits the use of the potential that CNNs carry. This is noticeable in many applications, including datasets for training models to detect objects in UAV images.
The desire to exploit the potential of available deep learning methods for object detection is the reason for developing the topic of automating the acquisition of training datasets. Thus, the need to apply the proposed approach is apparent, and the above work highlights the utility of automating this process as much as possible.
Adopting the strategy proposed in this paper makes data labeling easier. The method makes it possible to create bounding boxes based on points indicating only the location of the object. Because overlaps are commonly used in a photogrammetric block of images, each object is transferred to the several or a dozen or so images in which it appears. Although the described approach does not exclude manual annotation at the beginning of the process, it significantly reduces the time required. Moreover, nothing prevents the locations of existing objects from being taken from already created databases, which would eliminate the manual work. Open tools for manual annotation exist and are quite widespread (for example, LabelMe), but preparing training datasets in this way is very time-consuming.
The above method, or the others described in Section 1 that use semi-automated processes, can help overcome the problem of acquiring training datasets. This process minimizes the manual effort involved in generating ground truth and significantly supports and promotes the training of deep learning algorithms.
An important point to highlight is the size of the training datasets created using the above methodology. Using both vertical and oblique images, we obtain a dataset that maps the same area and the same objects but from different perspectives. Therefore, by vectorizing the object once on the orthophotomap, we obtain a significantly larger collection of images using the methodology described above. For example, for the twenty traction poles in variant 9, for which the highest accuracies were obtained, a total of 3930 instances were obtained in the images.
This approach, of course, could also be applied using an orthophotomap cut into smaller tiles instead of images. The size of such a collection would be much smaller than if all the images in the block were used, since each object would be imaged only once. Nevertheless, it is worth keeping in mind that, with a few changes in the code, this method can also be applied to orthophotomaps, the use of which has many advantages.
A particularly important issue in the context of the automatic creation of training datasets is the accuracy of the input data. When applying this approach, attention should be paid to the accuracy of the point clouds included in the process, as it affects the resulting bounding boxes and thus the prepared training dataset.
However, it has been shown that even with slightly noisy point clouds, it is possible to achieve a reasonably accurate set that may require only a few manual corrections. Thus, it can be concluded that this method can be used analogously to the labeling methods mentioned above, where automatic data annotation is performed first, followed by verification and possible improvement. The lowest accuracies obtained on the test dataset were about 70%, while for some variants the accuracies exceeded 90%.
Based on the experiments, it can be said that the proposed approach is optimal in terms of performance and processing speed. It provides a semi-automated way to achieve consistent labels for all images in a block, reducing the effort of manually labeling each object in the photos. This is accomplished by labeling the objects once on the orthophotos, then clipping the point cloud to the fragment containing the object of interest, and finally projecting the points belonging to the object onto the pixel coordinates of the photo; the finished result is a bounding box surrounding the object. It is worth pointing out again that manually marking the location of objects on an orthophotomap is only one of the options. One can also start from terrain points extracted, for example, from a database of topographic objects or from BIM data. Manual labeling of bounding boxes on thousands or hundreds of thousands of images is much more time-consuming. In addition, point clouds from image matching and orthophotos are products often generated in the production process, which underlines this strategy’s potential. Naturally, the proposed methodology can also be adapted to other types of data (not only from UAVs). The approach could also use point clouds from dense matching of aerial images, or point clouds acquired by laser scanning from different altitudes (both aerial and ground). Admittedly, the methods used are not novel and rely on simple calculations, but the added value here is the systematization of the strategy and the accuracy analysis experiments.
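The projection step summarized above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: it assumes an ideal pinhole camera without lens distortion, a world-to-camera rotation matrix with the camera z-axis pointing in the viewing direction, and a focal length and principal point given in pixels. All function and parameter names are hypothetical.

```python
def project_point(point, camera_center, rotation, focal_px, cx, cy):
    """Project a 3D point to pixel coordinates with an ideal pinhole model.

    rotation is a 3x3 world-to-camera matrix given as nested lists (assumed
    convention: camera z-axis points in the viewing direction).
    Returns None for points behind the camera.
    """
    d = [p - c for p, c in zip(point, camera_center)]
    xc, yc, zc = (sum(row[i] * d[i] for i in range(3)) for row in rotation)
    if zc <= 0:
        return None
    return (cx + focal_px * xc / zc, cy + focal_px * yc / zc)


def bounding_box(points, camera_center, rotation, focal_px, cx, cy, width, height):
    """Bounding box (x_min, y_min, x_max, y_max) of the projected object points.

    points are the point cloud points already clipped to the object.
    The box is clipped to the image frame; None is returned when the object
    does not fall inside the image or the box is degenerate.
    """
    pixels = [
        px
        for px in (
            project_point(p, camera_center, rotation, focal_px, cx, cy)
            for p in points
        )
        if px is not None
    ]
    if not pixels:
        return None
    xs = [u for u, _ in pixels]
    ys = [v for _, v in pixels]
    x_min, x_max = max(0.0, min(xs)), min(float(width), max(xs))
    y_min, y_max = max(0.0, min(ys)), min(float(height), max(ys))
    if x_min >= x_max or y_min >= y_max:
        return None
    return (x_min, y_min, x_max, y_max)


# Example: a nadir camera 10 m above two object points (assumed values).
nadir_rotation = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
box = bounding_box(
    [(1.0, 2.0, 0.0), (-1.0, -2.0, 0.0)],
    camera_center=(0.0, 0.0, 10.0),
    rotation=nadir_rotation,
    focal_px=1000.0, cx=500.0, cy=500.0, width=1000, height=1000,
)
```

In practice, the exterior and interior orientation parameters would come from the bundle adjustment of the image block, and lens distortion would need to be accounted for; both are outside the scope of this sketch.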
In summary, based on the above results, it can be concluded that it is reasonable to use the developed methodology to semi-automate the process of creating datasets for training deep learning models for object detection in nadir and oblique images acquired from UAVs.