1. Introduction
Environmental conservation is a highly relevant topic in the global environmental context, especially given ongoing climate change. In this sense, the conservation of the Amazon holds special significance, given its vast area of over 5 million square kilometers [1].
Due to the large quantity and diversity of natural resources, the Amazon is often targeted by illegal exploratory activities, with illegal mining on indigenous lands being one of the most prominent [2]. According to [3], 91% of Brazilian mining is located in the Legal Amazon, which can lead to humanitarian problems due to the violent actions of such criminals against indigenous tribes. According to [4], if the Amazon were considered a country, it would have had the fourth highest homicide rate in the world in 2017, demonstrating how ingrained criminality is in this region.
However, for such criminal activities to be carried out, consistent supply lines must be established [5], given the high forest density. In this context, clandestine aviation acts as a logistical ally [6], transporting the materials necessary to sustain these activities. In 2021, there were 456 illegal airstrips within 5 km of mining sites [3], demonstrating the connection between these two factors. Additionally, in 2021 there were 804 airstrips within environmental conservation areas [3], showing the link between this activity and the deforestation of this biome.
Thus, it becomes necessary to map airstrips in the Amazon quickly and accurately. Such a mapping already exists, carried out by [3]. However, some weaknesses in this mapping hinder its use by the competent authorities. Firstly, it was conducted through visual inspection of satellite images, which, given the vastness of the Amazon biome, complicates reproducibility and makes the task slow to repeat. Moreover, images from 2021 were used, making the results less relevant to the present day.
Some works aim to automate the search for such targets. One of the seminal works on identifying these targets was conducted by Alves et al. [7]. In their approach, images from Synthetic Aperture Radar (SAR) sensors were used, and the Circularity Ratio (CR) [8] was calculated to assess the shape of the targets, selecting those that were more elongated and straight.
Additionally, ref. [9] conducted a study to perform segmentation without using Artificial Neural Networks (ANNs). The proposed methodology focused on decomposing the RGB bands into three matrices and then generating a fourth grayscale matrix from them. After that, attributes with greater variability were searched for and grouped using clustering techniques.
In this context, Convolutional Neural Networks (CNNs) have shown very promising results in target detection tasks since the work of [10], with new architectures frequently being released that present superior results. For example, the You Only Look Once (YOLO) architecture, initially conceived by [11], has shown strong results in object detection tasks. Over the years, several versions of this architecture have been created, with new tasks being added. The authors of [12] presented the 8th version of this network, which added the capability to perform image classification and segmentation. The latest version of YOLO is its 11th version [13], which presents improvements over previous versions but only performs object detection tasks.
A brief review of the evolution of research related to CNNs shows how each study contributed to advancements in this field. Beginning with [14], their discoveries about the functional organization of the visual cortex, including receptive fields, orientation columns, and binocular interaction, provided a crucial theoretical foundation that inspired the structure and function of CNNs.
According to [15], “the neocognitron, introduced by Fukushima, is a self-organizing neural network that, after the self-organization process, acquires a hierarchical structure similar to the visual system model proposed by Hubel and Wiesel, allowing visual pattern recognition independent of its position”. This position-invariance feature is fundamental to developing modern CNNs, which use hierarchical layers to extract and recognize complex image features.
Rumelhart et al. [16] introduced the backpropagation technique, which efficiently adjusts weights in CNNs by propagating the error backward through the layers and using the error gradient to minimize the difference between the network’s actual output vector and the desired output vector. Each subsequent layer in a CNN combines features extracted from the previous layers, enabling the detection of increasingly complex features in images.
LeCun et al. [17] contributed by applying the backpropagation algorithm to handwritten digit recognition, such as the MNIST dataset used for determining ZIP codes on mail, demonstrating CNNs’ ability to handle distortions and positional variations.
The findings of [18] further established that standard feedforward neural networks are universal approximators, capable of approximating any measurable function to the desired accuracy. This fact implies that shortcomings in applications can often be attributed to inadequate learning, insufficient hidden units, or stochastic relationships between input and target.
In recent years, ref. [19] noted that “machine learning techniques, especially when applied to neural networks, have played an increasingly important role in pattern recognition system design”. The advances in these techniques have enabled the creation of more accurate and efficient systems capable of handling a wide range of pattern recognition tasks across various fields.
As noted by [20], CNNs have been widely applied in fields such as object detection, fault diagnosis, and image recognition. Since 2010, these applications have emerged as some of the most active research areas in computer vision.
In the literature, several papers discuss target detection algorithms, specifically focusing on the YOLO model. Studies such as [11,21], and more specifically [22,23], develop research with the YOLOv3 model, while [24,25] address YOLOv5. Ref. [26] provides an overview of YOLO models from YOLOv1 to YOLOv8, making it an excellent resource for object detection studies. Additionally, works such as [27] address the limitations of traditional algorithms due to background noise and limited information, proposing improvements such as attention mechanisms in the architecture to optimize the detection of small targets.
Using such techniques to identify targets, ref. [28] performed target classification through transfer learning [29]. In their work, a CNN was used to extract image features, and classification algorithms then used these data to determine whether the images contained airstrips.
Ref. [30] presented the first architecture of the so-called Vision Transformer (ViT), specializing the work of [10] for computer vision. In the following years, several improvements were made, such as the Swin Transformer of [31], which enhanced the architecture further. Later, ref. [32] presented the Global Context Vision Transformers (GCViT), which showed very promising results in image classification.
The literature review reveals that studies on airstrip detection in the Amazon often address isolated aspects of the problem rather than providing a comprehensive solution. A notable exception is the work of [33], which employs various techniques to identify these targets. However, several areas in their proposed solution still require improvement. This study contributes by enhancing the algorithm of [33], reducing the number of false positives while maintaining high recall, addressing limitations in the original approach to better meet the specific challenges of this detection task. Additionally, a key focus is improving the algorithm’s inference time, enabling faster target identification and more efficient monitoring.
This work is organized as follows: Section 1 presents the introduction, Section 2 covers the materials and methods, Section 3 discusses the results, and Section 4 provides the concluding remarks.
2. Materials and Methods
In this section, we present the entire methodology related to the work. First, we will discuss data acquisition and processing to generate the datasets for training the CNNs. We will then present the proposed algorithm, the metrics created to evaluate performance, and the experiment conducted. It is worth noting that all experiments were carried out on a computer with an AMD Ryzen 9 7900X CPU, 256 GB of RAM, and an Nvidia RTX 4090 GPU.
2.1. Data Acquisition
To generate the dataset for training Artificial Intelligence (AI) techniques, we used the mapping conducted by [3], as shown in Figure 1. This mapping provides the spatial locations of 2869 airstrips as of 2021. To capture the targets in their entirety, we created 2 × 2 km square cutouts. This size was chosen because the illegal airstrips we aim to find are relatively small, so a 2 km square captures most targets to their full extent. In addition, these cutouts make the targets more evident relative to the size of the analyzed image: satellite scenes typically cover large areas, meaning the targets would appear very small compared to the total image size. Furthermore, the available hardware was unable to process entire satellite scenes.
Using the airstrip locations, we selected the “Image © 2023 Planet Labs PB” dataset from the Planet satellite constellation [34], which provides images with a spatial resolution of 4.7 m, enabling clear visualization of the targets. Additionally, these images consist of four spectral bands, three in the visible region (RGB) and one in the near-infrared (NIR) region, each with 16-bit radiometric resolution. Using the previously obtained squares, we cropped the targets from scenes acquired in June 2021 to ensure temporal alignment between the mapping and the dataset. After this, images in which the target was difficult to identify were excluded, resulting in a final set of 1989 Planet constellation images of airstrips. These exclusions were made because, in some images, heavy cloud cover obstructed visibility, making it impossible to observe the target.
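For illustration, the cropping step can be written as a small rasterio routine that reads a 2 × 2 km window around a mapped airstrip; the library choice, the file layout, and the assumption of a metric CRS are ours, not details given in the original pipeline.

```python
import rasterio
from rasterio.windows import from_bounds

HALF_SIDE_M = 1000  # half of the 2 km square, in meters

def crop_around(scene_path: str, x: float, y: float):
    """Read a 2 x 2 km cutout centered on (x, y), given in the scene's CRS.

    Assumes a projected (metric) CRS such as UTM; geographic coordinates
    would need to be reprojected first.
    """
    with rasterio.open(scene_path) as src:
        window = from_bounds(
            x - HALF_SIDE_M, y - HALF_SIDE_M,
            x + HALF_SIDE_M, y + HALF_SIDE_M,
            transform=src.transform,
        )
        return src.read(window=window)  # array of shape (bands, rows, cols)
```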
Furthermore, the algorithm must detect the target using different satellite imaging sensors, as the most recent data may only be available from a specific constellation, depending on the sensor’s temporal resolution. Therefore, to achieve such generalization, the same process was applied to images from the Sentinel-2 satellite [35,36], selecting the same four bands as those from the Planet constellation. This database contains images with a 10 m spatial resolution, which still allows for clear identification despite offering a lower target definition. As before, the cutouts were taken from images from June 2021. After the visual verification stage, we obtained 2468 Sentinel-2 images of airstrips.
Figure 2 shows an example of the same airstrip for both sensors.
To generate the dataset of images that do not contain airstrips, we conducted an empirical sampling of the Amazon. The reason for this is that, due to the high forest density of the biome, a random sampling would produce images with very similar shapes, mostly representing forest areas. Therefore, images were empirically selected to contain shapes resembling airstrips, such as rivers, cities, and highways.
Figure 3 shows two samples contained in this set.
2.2. Data Processing
When examining the images, we noticed a mismatch in the radiometric resolutions of the sensors: Planet images have 16-bit resolution, while Sentinel-2 images have 12-bit resolution. Therefore, we rescaled all cutouts to 8 bits so that all images share the same per-pixel value scale. This rescaling was performed because the YOLO-series CNNs are implemented using the Ultralytics framework [12], which only supports images with 8-bit resolution and a maximum of three bands. Furthermore, we considered only the visible bands: as Figure 2 shows, the target is well represented in these bands, and this was also the best-performing configuration reported in the original study [33], which this work aims to improve.
For the Planet images, we standardized the image dimensions by cropping each image to 352 × 352 pixels. This value was chosen because the smallest dimension among the raw images was 366 pixels, and the dimensions needed to be multiples of 32 to be used in the CNNs; 352 is the largest multiple of 32 that does not exceed 366, satisfying both criteria.
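A minimal sketch of these two operations is given below; the linear stretch from the sensor's full bit depth and the centered crop are our assumptions, as the exact rescaling function and crop anchor are not specified.

```python
import numpy as np

def to_uint8(img: np.ndarray, bit_depth: int) -> np.ndarray:
    """Linearly rescale a 12-bit (Sentinel-2) or 16-bit (Planet) image to 8 bits."""
    max_val = 2 ** bit_depth - 1
    return np.round(img.astype(np.float32) / max_val * 255.0).astype(np.uint8)

def center_crop(img: np.ndarray, size: int = 352) -> np.ndarray:
    """Crop a (rows, cols, bands) image to size x size around its center."""
    r0 = (img.shape[0] - size) // 2
    c0 = (img.shape[1] - size) // 2
    return img[r0:r0 + size, c0:c0 + size]
```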
Due to the difference in spatial resolution, Sentinel-2 images had much smaller dimensions than Planet images. Therefore, we upsampled all Sentinel-2 images to 352 × 352 pixels using bicubic interpolation [37].
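This resampling is a one-liner with OpenCV's bicubic mode; the library choice is ours, and any resampler with a bicubic kernel would serve.

```python
import cv2
import numpy as np

def upsample(cutout: np.ndarray, size: int = 352) -> np.ndarray:
    """Bicubic upsampling of a Sentinel-2 cutout (about 200 x 200 px at 10 m)."""
    return cv2.resize(cutout, (size, size), interpolation=cv2.INTER_CUBIC)
```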
2.3. Base Algorithm
In this section, the base algorithm proposed by [33] is presented, a five-stage algorithm. The first stage involves an iterative search of the scenes. To reduce potential issues with targets that span multiple scenes, the current image is extended by 2 km using the spatially adjacent scenes. Then, 352 × 352 pixel cutouts are made from the images with 75% overlap between successive cutouts. There are two reasons for this overlap. The first is to reduce the likelihood that no cutout fully covers a target in the image. The second is that the previously generated datasets have centered targets, so a greater overlap increases the probability of obtaining a cutout with the target centered. However, there is a trade-off, as increasing the overlap also increases the computational cost of the algorithm. Thus, 75% was the empirically determined value that balances these factors.
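The sliding-window search described above reduces to a simple generator; the sketch below reproduces the 352-pixel patch and the 88-pixel step (75% overlap).

```python
import numpy as np

PATCH = 352
STEP = 88  # PATCH * (1 - 0.75): successive cutouts overlap by 75%

def iter_patches(scene: np.ndarray):
    """Yield (x, y, patch) for every 352 x 352 window of an (H, W, C) scene."""
    h, w = scene.shape[:2]
    for y in range(0, h - PATCH + 1, STEP):
        for x in range(0, w - PATCH + 1, STEP):
            yield x, y, scene[y:y + PATCH, x:x + PATCH]
```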
The second stage involves an image classification algorithm that determines whether the current cutout contains the target or not. This step acts as a filter for the subsequent steps, forwarding only the cutouts most likely to contain the target. After this, the segmentation of the cutout is performed using a pre-trained segmentation algorithm to generate the specific shape of the target.
In the fourth stage, post-processing of the generated segment occurs. First, the minimum bounding box of the segment is generated using the algorithm presented by [38], and the smallest dimension of this rectangle is checked: if it is less than 250 m, the segment is discarded. This threshold was chosen because we verified that the smallest landing strip in the training set was 434.23 m long, making 250 m a reasonable cutoff for discarding detections. This step is important because small scars in the forest, despite their size, can closely resemble landing strips.
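The size check can be sketched with OpenCV's rotated-rectangle fit, as below; the use of cv2.minAreaRect in place of the exact routine of [38], and the per-sensor pixel size, are our assumptions.

```python
import cv2
import numpy as np

MIN_SIDE_M = 250.0

def passes_size_check(mask: np.ndarray, pixel_size_m: float = 4.7) -> bool:
    """Keep a segment only if its minimum bounding box is at least 250 m wide."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    if not contours:
        return False
    # minAreaRect returns ((cx, cy), (width, height), angle)
    (_, _), (w, h), _ = cv2.minAreaRect(max(contours, key=cv2.contourArea))
    return min(w, h) * pixel_size_m >= MIN_SIDE_M
```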
Still in the fourth stage, the centroid of the segment is calculated and converted to a pixel coordinate. Knowing the position of the centroid within the cutout and the position of the cutout within the entire scene, it is possible to determine the centroid’s position in the complete scene. With this position, and using the georeferencing of the original image, the geographic location of the landing strip can be obtained.
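Translating the centroid to a geographic location then amounts to an affine-transform lookup, sketched here with rasterio (the function and variable names are illustrative):

```python
import rasterio

def centroid_to_coords(scene_path: str, patch_x: int, patch_y: int,
                       centroid_col: float, centroid_row: float):
    """Map a centroid inside a patch to coordinates in the scene's CRS."""
    with rasterio.open(scene_path) as src:
        col = patch_x + centroid_col  # position in the complete scene
        row = patch_y + centroid_row
        x, y = src.transform * (col, row)  # affine: pixel -> CRS coordinates
        return x, y
```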
In the fifth and final stage, clustering of the points generated by the algorithm is performed. This stage uses a modified version of hierarchical clustering [39], which merges points based on a distance threshold: if two points are within this threshold, they are merged into a cluster. The reason for this is the significant overlap between cutouts, which means a valid target is likely detected by more than one cutout. Clustering therefore merges these points to avoid multiple predictions of the same target. Given the required size of the target, a merging distance of 1250 m was used. Algorithm 1 presents the pseudo-code of the base algorithm; a sketch of the merging step follows it.
Algorithm 1 Landing Strip Detection Algorithm Presented in [33]

1:  function LandingStripDetection(image)
2:      Input: image, a georeferenced satellite image with dimensions W × H
3:      Output: a set of geographical coordinates of detected landing strips
4:      Step 1: Patch Extraction and Processing
5:      Patch size: 352 × 352 pixels
6:      Step size for patch movement: 88 pixels (75% overlap)
7:      Initialize the list of detected centroids: C ← ∅
8:      for y ← 0 to H − 352 with step size 88 do
9:          for x ← 0 to W − 352 with step size 88 do
10:             Extract patch P from image with top-left corner at (x, y)
11:             Step 2: Classification
12:             Apply the classification function f(P)
13:             if f(P) = 0 (Non-Landing Strip) then
14:                 Discard the patch
15:             else
16:                 Step 3: Segmentation
17:                 Generate the segment S
18:                 Step 4: Bounding Box Size Check
19:                 Calculate the minimum bounding box B of segment S
20:                 Let d be the smallest dimension of B
21:                 if d ≥ 250 meters then
22:                     Calculate the centroid c of the segment
23:                     Append c to the list C
24:                 end if
25:             end if
26:         end for
27:     end for
28:     Step 5: Clustering of Detected Points
29:     Perform clustering on the centroids C using hierarchical clustering
30:     Merge points within a distance threshold of 1250 meters
31:     return the final set of clustered centroids representing the predicted landing strips
32: end function
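The merging step of Stage 5 can be approximated with single-linkage hierarchical clustering cut at the 1250 m threshold; the sketch below, including the averaging of merged points, is our reading of the procedure rather than the authors' exact modification.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_detections(points_m: np.ndarray, threshold_m: float = 1250.0) -> np.ndarray:
    """Merge centroids within threshold_m of each other; points_m has shape (N, 2), in meters."""
    if len(points_m) < 2:
        return points_m
    labels = fcluster(linkage(points_m, method="single"),
                      t=threshold_m, criterion="distance")
    # represent each cluster by the mean of its member points
    return np.array([points_m[labels == k].mean(axis=0)
                     for k in np.unique(labels)])
```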
2.4. Modified Algorithm
Analyzing the algorithm presented by the authors of [33], it was observed that, despite the high recall achieved, the number of false positive predictions was relatively high, even in the best result presented.
To address this issue, several improvements were introduced in the fourth step of the algorithm. Firstly, the Normalized Difference Water Index (NDWI) [40] is calculated for each pixel of the generated segment using the original image bands, without any prior processing. Since, in production, the algorithm receives the raw satellite image without the preprocessing steps mentioned earlier, it is possible to retrieve these data. This step checks whether any pixel in the image represents a water body; if so, the cutout is discarded. This is crucial because the Amazon region contains numerous water bodies, which can, depending on the cutout, resemble the shapes of landing strips. Additionally, when the minimum bounding box of the generated segment is calculated, the Circularity Ratio (CR) is also computed. If the CR is greater than 0.1, the cutout is discarded; this threshold was determined empirically through extensive testing.
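Both filters are simple per-segment computations, sketched below with McFeeters' NDWI, (Green − NIR)/(Green + NIR), water flagged at NDWI > 0, and the common circularity definition CR = 4πA/P²; the exact water threshold and CR variant used by the authors are our assumptions.

```python
import numpy as np

def has_water(green: np.ndarray, nir: np.ndarray) -> bool:
    """Flag the cutout if any pixel looks like water (NDWI > 0)."""
    g = green.astype(np.float32)
    ndwi = (g - nir) / (g + nir + 1e-9)
    return bool((ndwi > 0).any())

def circularity_ratio(area_px: float, perimeter_px: float) -> float:
    """CR = 4*pi*A / P**2: close to 1 for a circle, near 0 for elongated strips."""
    return 4.0 * np.pi * area_px / (perimeter_px ** 2 + 1e-9)

# a cutout is discarded when has_water(...) is True or circularity_ratio(...) > 0.1
```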
Furthermore, after calculating the segment’s centroid, this location is checked for proximity to any federal or state highway [41]. If the distance is less than 50 m, the location is discarded. This step is important for eliminating potential false positives, as the cutout provides only regional context rather than a complete view of the entire scene, and highways often have shapes that closely resemble landing strips.
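The highway filter is a plain distance query against the road vector data of [41]; a sketch with shapely, assuming both geometries are expressed in a metric CRS:

```python
from shapely.geometry import Point

def near_highway(centroid_xy: tuple, highways, threshold_m: float = 50.0) -> bool:
    """True if the centroid lies within 50 m of any highway LineString."""
    point = Point(centroid_xy)
    return any(point.distance(road) < threshold_m for road in highways)
```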
To improve the algorithm’s computational time, the entire process described above was parallelized using 4 threads, meaning that 4 scenes are processed in parallel, each going through the steps mentioned. The number of threads was chosen as the highest value that did not cause memory issues on the GPU.
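This scene-level parallelism can be expressed with a four-worker thread pool; process_scene and scene_paths below are placeholders for the per-scene pipeline and input list described above.

```python
from concurrent.futures import ThreadPoolExecutor

# 4 workers: the largest count that fit in GPU memory in our setup
with ThreadPoolExecutor(max_workers=4) as pool:
    detections = list(pool.map(process_scene, scene_paths))
```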
Additionally, a new step was added to the algorithm after the generation of the locations. This step involves obtaining images with dimensions of 2 × 2 km around all the generated locations. With these images, a new classification step is performed using a different classifier from the previous one. The reason for this new step is that, in the initial stage, due to the nature of the search conducted, the targets were unlikely to be centered, which could confuse the algorithm and result in a high number of false positives. However, with the locations of the predictions, it is now possible to have images with the targets centered, which enhances this classification step. In Algorithm 2, the modified algorithm is presented.
Algorithm 2 Modified Landing Strip Detection Algorithm

1:  function ModifiedLandingStripDetection(image)
2:      Input: image, a georeferenced satellite image with dimensions W × H
3:      Output: a set of geographical coordinates of detected landing strips
4:      Step 1: Patch Extraction and Processing
5:      Patch size: 352 × 352 pixels
6:      Step size for patch movement: 88 pixels (75% overlap)
7:      Initialize the list of detected centroids: C ← ∅
8:      for y ← 0 to H − 352 with step size 88 do
9:          for x ← 0 to W − 352 with step size 88 do
10:             Extract patch P from image with top-left corner at (x, y)
11:             Step 2: Classification
12:             Apply the classification function f(P)
13:             if f(P) = 0 (Non-Landing Strip) then
14:                 Discard the patch
15:             else
16:                 Step 3: Segmentation
17:                 Generate the segment S
18:                 Step 4: Enhanced Validation
19:                 Calculate the NDWI for the pixels of segment S
20:                 if any pixel in the segment represents a water body then
21:                     Discard the patch
22:                 end if
23:                 Calculate the minimum bounding box B of segment S
24:                 Calculate the Circularity Ratio CR of S
25:                 if CR > 0.1 then
26:                     Discard the patch
27:                 end if
28:                 Calculate the centroid c of the segment
29:                 if the distance from c to any federal or state highway < 50 meters then
30:                     Discard the location
31:                 else
32:                     Append c to the list C
33:                 end if
34:             end if
35:         end for
36:     end for
37:     Step 5: Clustering of Detected Points
38:     Perform clustering on the centroids C using hierarchical clustering
39:     Merge points within a distance threshold of 1250 meters
40:     Step 6: Enhanced Classification
41:     for each centroid c in C do
42:         Obtain an image of dimensions 2 × 2 km around c
43:         Apply the new classification function g on this image
44:         if g = 1 (Landing Strip) then
45:             Add c to the final list of predicted landing strips
46:         end if
47:     end for
48:     return the final set of predicted landing strips
49: end function
2.5. Training Parameters
As presented in the previous sections, the modified algorithm requires three distinct models: two for classification and one for segmentation. For the first classification algorithm, YOLOv8 was used, as in the original paper. For the second classification task, the GCViT network was employed due to its excellent results in classification tasks. YOLOv8 was again utilized for the segmentation task, as in the original paper.
For the training of neural networks, defining certain hyperparameters is necessary. Initially, the batch size for all training executions was set to 16, as this was the largest value that did not cause memory issues.
Additionally, to reduce the possibility of overfitting during training, learning rate scaling was implemented using the CosineAnnealingLR strategy [42], as defined by the following equation:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)$$

where $\eta_t$ represents the current learning rate at epoch $t$, and $T_{\max}$ represents the maximum number of training epochs. The parameters $\eta_{\min}$ and $\eta_{\max}$ define the minimum and maximum learning rates, which are set before the start of training.
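In PyTorch, this schedule corresponds directly to torch.optim.lr_scheduler.CosineAnnealingLR; a minimal sketch, where the model, the training step, and the hyperparameter values are placeholders:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

eta_min, eta_max, num_epochs = 1e-5, 1e-2, 100  # placeholder values
model = ...            # classification or segmentation network
train_one_epoch = ...  # one pass over the training data

optimizer = torch.optim.SGD(model.parameters(), lr=eta_max)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=eta_min)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)
    scheduler.step()  # anneal the learning rate following the equation above
```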
We searched for the best hyperparameter values: both $\eta_{\max}$ and $\eta_{\min}$ were searched over predefined ranges divided into 20 equally spaced intervals. Additionally, we searched for the best optimizer for training, selecting between Adam and Stochastic Gradient Descent (SGD) [43].
For hyperparameter tuning, the recall metric was used for the classification training, while the segmentation training used Intersection over Union (IoU). IoU is calculated as the ratio of the number of pixels in the intersection between the predicted and ground-truth segments to the number of pixels in their union [44].
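For binary masks, IoU reduces to a few array operations:

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union between two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0
```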
For YOLOv8, the loss function chosen is the standard one defined by [12] for the network. For GCViT, binary cross-entropy [45] was chosen due to its extensive use in image classification tasks.
To reduce overfitting during training, data augmentation was applied to the selected images [46]. The selected operations included horizontal and vertical flips, translation limited to 20% of the image dimensions, rotation by a random angle between 0 and 90° counterclockwise, and removal of a rectangle with dimensions corresponding to 15% of the image at a random position. These operations are not all applied simultaneously: during each training epoch, each image undergoes a random combination of the aforementioned operations, with each operation having a 50% probability of being applied in that epoch.
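An equivalent pipeline can be written with the albumentations library (1.x API); the library choice and the exact parameterization of the rectangle removal are our assumptions.

```python
import albumentations as A

PATCH = 352
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Affine(translate_percent=(-0.2, 0.2), p=0.5),  # translation up to 20%
    A.Rotate(limit=(0, 90), p=0.5),                  # random angle in [0, 90] degrees
    A.CoarseDropout(max_holes=1,                     # remove one rectangle
                    max_height=int(0.15 * PATCH),
                    max_width=int(0.15 * PATCH),
                    p=0.5),
])
# each operation fires independently with probability 0.5 every epoch
```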
2.6. Performance Metrics
As in the original work, to enable comparison in this study, a predicted location is considered a true positive (TP), or a correct detection, if it is within a certain distance threshold from a previously mapped location. Conversely, a predicted location is classified as a false positive (FP) if no mapped location exists within the chosen distance threshold. Finally, a mapped location is considered a false negative (FN) if no predicted location falls within the defined distance threshold.
To ensure the effectiveness of this metric, only one correct detection is considered per mapped location. For example, if two predicted locations are within the distance threshold of a mapped location, only one correct detection will be counted.
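Under these definitions, the counts follow directly from the pairwise distance matrix; the sketch below assumes planar coordinates in meters (a production version would use geodesic distances) and defaults to the 1500 m threshold adopted later in this section.

```python
import numpy as np
from scipy.spatial.distance import cdist

def evaluate(pred_xy: np.ndarray, true_xy: np.ndarray, threshold_m: float = 1500.0):
    """Count TP, FP, FN with at most one correct detection per mapped strip."""
    d = cdist(pred_xy, true_xy)                      # pairwise distances, meters
    tp = int((d.min(axis=0) <= threshold_m).sum())   # mapped strips with a nearby prediction
    fn = len(true_xy) - tp                           # mapped strips missed entirely
    fp = int((d.min(axis=1) > threshold_m).sum())    # predictions far from every mapped strip
    return tp, fp, fn
```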
When analyzing the impact of detections, it is evident that a false negative carries more weight than a false positive. A false negative means that a landing strip was not detected by the algorithm, implying that a potential focal point of illegal activity would not be identified by authorities. On the other hand, a false positive indicates a predicted location that does not correspond to a real landing strip, which is a less significant error since all locations must be visually validated by authorities. However, the number of false positives should be controlled, as a higher number would delay the visual validation process.
Therefore, to evaluate which network composition produces superior results, the recall metric will be used, defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$
The recall metric will be used to determine how many landing strips were detected by the algorithm in relation to the total number of landing strips. Additionally, the number of false positives will be analyzed to minimize such errors. Thus, the best composition will be the one that achieves the highest recall and the lowest number of FPs.
A distance threshold of 1500 m between predictions and actual targets will be used to calculate performance metrics. This is because landing strips, being of considerable length, may have predictions that do not exactly coincide with the center or any specific point of the strip. A smaller distance could result in counting errors, as a prediction that represents a point on the strip might still be considered incorrect if it is not exactly at the center.
2.7. Experiment Across the Amazon Region
To compare the results of the algorithms, tests were conducted in the same area as in the original paper, using the same images. Regarding the modified algorithm, two tests were performed: one without the second classifier (GCViT) and another with both classifiers, in order to evaluate the impact of the second classifier.
Figure 4 shows the test region where the algorithm was evaluated. A visual inspection of all targets present in June 2023 was carried out in this region, and images from that month were acquired to compare the performance of the original and modified algorithms. In the original paper [33], results for different dataset compositions are presented for this area; in this study, the comparison is made against the best-performing dataset composition from the original work.
After that, the modified algorithm will be applied across the entire Amazon biome. A total of 22,164 Planet images from June 2023 were acquired, amounting to 2.8 terabytes of data and covering the entire Legal Amazon region. The algorithm will be run on the same images used for the original algorithm and, as in the original paper, a visual inspection of the images was conducted to verify which landing strips mapped by MapBiomas remained in the same locations on the date the images were captured. This ensures that the evaluation reflects current conditions, allowing for a meaningful comparison between the modified and original algorithms.
4. Conclusions
The Amazon is the largest and most diverse biome on Earth, containing an immense wealth of natural resources. Because of this, irregular exploitation is an urgent issue in the Brazilian socio-environmental context, driving up crime rates and often leaving deep scars on indigenous tribes. Many of these activities face logistical problems, since the Amazon’s high forest density makes it complex to establish supply lines. Clandestine aviation therefore becomes an ally, enabling the delivery of necessary materials to any location in the forest with considerable speed.
In this context, this work presented a modification of the seminal algorithm for solving this problem [33], aiming to address some of its shortcomings. For this purpose, pre-existing MapBiomas mappings were used to compose the training datasets for the neural networks. The modifications caused a slight drop in the recall of identified targets in both tests; however, there was a significant reduction in the number of false positives, which was a notable issue in the original algorithm.
In the test conducted on a specific region of the biome, the recall dropped by less than 1%, but there was a 26.6% reduction in the number of false positives. For the test applied across the entire Amazon biome, the recall drop was slightly larger, at 1.7%, but the number of false positives decreased by 17.88%. These results indicate a substantial improvement in the algorithm, as the significant reduction in false positives leads to less time required for visual inspection of the predictions, thus enhancing the speed and efficiency of biome mapping. This, in turn, contributes significantly to Brazil’s environmental protection efforts.
One important point is that this improvement comes with a small loss in recall. However, since the recall values remain high, this trade-off is justified by the substantial reduction in false positives.
Regarding future work, it would be important to explore how the modified algorithm can be implemented for large-scale, ongoing mapping of the biome. Furthermore, the recently released YOLOv11 network [13] could be an interesting addition to the presented algorithm for further performance gains. It would also be worthwhile to apply the proposed methodology in other biomes worldwide to assess the overall effectiveness of the technique, to check for existing water-body maps of the Amazon that could be incorporated into the algorithm, and to conduct tests with other classical techniques to compare their performance against CNNs.