Lithological Mapping from UAV Imagery Based on Lightweight Semantic Segmentation Methods

Liu, Jingzhi; Wei, Zhen; Gong, Xiangkuan; Sun, Minjia; Cheng, Yuanfeng; Zhang, Yingying; Zhang, Zizhao

doi:10.3390/drones9120866


by Jingzhi Liu ¹, Zhen Wei ²,*, Xiangkuan Gong ², Minjia Sun ², Yuanfeng Cheng ², Yingying Zhang ² and Zizhao Zhang ²

¹ School of Geology and Mining Engineering, Xinjiang University, Urumqi 830017, China

² Xinjiang Key Laboratory for Geodynamic Processes and Metallogenic Prognosis of the Central Asian Orogenic Belt, Xinjiang University, Urumqi 830017, China

* Author to whom correspondence should be addressed.

Drones 2025, 9(12), 866; https://doi.org/10.3390/drones9120866

Submission received: 5 November 2025 / Revised: 10 December 2025 / Accepted: 10 December 2025 / Published: 15 December 2025

(This article belongs to the Special Issue Advances in Deep Learning for Drones and Its Applications: 2nd Edition)


Highlights

What are the main findings?

Applied a UAV oblique photogrammetry-based CA-DeepLabV3+ framework for automatic lithology identification.
Integrated Coordinate Attention and a lightweight MobileNetV2 backbone, improving multi-scale feature extraction and spatial context; achieved 97.95% OA and 95.71% mIoU.

What are the implications of the main findings?

Overcame limitations of manual mapping and satellite imagery, enabling precise boundaries and fine-texture recognition in UAV field data.
Provided an efficient, scalable solution for geological mapping and mineral exploration in remote high-altitude regions.

Abstract

Traditional geological mapping is often time-consuming, labor-intensive, and restricted by rugged terrain. This study addresses these challenges with a methodology for automated lithological identification in the Ququleke area of the eastern Kunlun Mountains that integrates portable UAV oblique photogrammetry with a Coordinate Attention-enhanced DeepLabV3+ (CA-DeepLabV3+) semantic segmentation framework for geological mapping. Using a DJI Mavic 3M quadcopter, high-resolution oblique photogrammetric orthophotos were captured to build a pixel-level lithology dataset containing four classes: sandstone, diorite, marble, and Quaternary sediments. The CA-DeepLabV3+ model, adapted from the DeepLabV3+ encoder–decoder framework, integrates a lightweight MobileNetV2 backbone and a Coordinate Attention mechanism to strengthen spatial position encoding and fine-scale feature extraction, both crucial for detailed lithological discrimination. Experimental evaluation demonstrates that the proposed model achieves an overall accuracy (OA) of 97.95%, a mean accuracy of 97.80%, and a mean intersection over union (mIoU) of 95.71%, a 5.48% improvement in mIoU over the standard DeepLabV3+. These results indicate that combining UAV oblique photogrammetry with the CA-DeepLabV3+ network enables accurate lithological mapping in complex terrains. The proposed method provides an efficient and scalable solution for geological mapping and mineral resource exploration, highlighting the potential of low-altitude UAV remote sensing for field-based geological investigations.

Keywords: deep learning; semantic segmentation; lithology identification; unmanned aerial vehicle (UAV); eastern Kunlun

1. Introduction

In mineral exploration, lithology identification and mapping play an essential and irreplaceable role. At present, data collection still relies heavily on manual field surveys. However, in high-altitude prospecting regions characterized by complex terrain and harsh environmental conditions, such surveys encounter considerable challenges, greatly limiting the depth of geological investigation and the efficiency of mineral exploration.

Remote sensing technology, with its unique capability to capture spatial information, offers a promising solution to these obstacles. Early lithology identification efforts relied primarily on satellite imagery [1], which proved effective on a macro scale. Nevertheless, its relatively low spatial resolution hampers the accurate depiction of fine-scale geological features [2]. In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has brought revolutionary changes to geological mapping. Compared with traditional ground surveys or satellite remote sensing, UAVs provide notable advantages—low cost, convenience, and high flexibility—and have been widely applied in diverse fields [3,4,5]. Operating efficiently even in high-altitude and inaccessible areas, UAVs equipped with visible, multispectral, or hyperspectral cameras can capture centimeter-level low-altitude imagery. Such imagery enables precise characterization of lithological details and facilitates the extraction of small-scale geological structures [6]. Integrating UAV-acquired high-resolution remote sensing images with image classification methods thus allows for reliable lithological discrimination and a comprehensive understanding of mapping areas, overcoming the inherent constraints of manual surveys.

Early lithology extraction from imagery depended largely on manual classification [7], which is time-consuming, reliant on expert knowledge, and prone to misclassification due to suboptimal feature selection. The emergence of computer-based approaches introduced traditional machine learning algorithms—such as Support Vector Machines (SVMs) and Random Forests (RFs) [8,9,10,11]—for lithology identification. However, these methods rely on manually engineered features and struggle to handle the complexity of field outcrop scenes.

Recent developments in deep learning, particularly convolutional neural networks (CNNs), have enabled automatic feature extraction without manual intervention, markedly improving classification accuracy and efficiency [12]. From early shallow architectures such as AlexNet and VGGNet to deeper networks including ResNet and Inception [13,14,15], CNNs have shown strong performance in image classification, object detection, and semantic segmentation. Applied in lithology mapping, Fully Convolutional Networks (FCNs) have improved category accuracy [16], while U-Net has achieved higher segmentation precision in land-cover mapping [17]. Yet, limitations remain—such as reduced receptive field, blurred boundaries, and high computational costs. The DeepLab series, particularly DeepLabV3+, offers high segmentation accuracy [18] but still demands considerable computational and memory resources.

Lightweight network architectures, such as MobileNetV2, have been developed to address the high computational demands of processing high-resolution remote sensing imagery. MobileNetV2 utilizes inverted residual blocks and depthwise separable convolutions to reduce parameter count and computational complexity while maintaining strong representational power [19]. Attention mechanisms have also gained prominence as a way to make feature extraction more effective. Among these, the Coordinate Attention (CA) mechanism is particularly notable: it captures spatial and channel-wise relationships by encoding precise positional information into channel attention, and it has been shown to improve model performance in computer vision tasks such as agricultural pest detection and microstructure analysis [20,21].

Despite these advances, most lithology identification models are still trained on satellite imagery and rarely on high-resolution UAV oblique photogrammetry data. Even when UAV data are used, complex high-altitude geological environments can pose persistent challenges, such as blurred boundaries, high intraclass texture similarity, class imbalance, and heavy computational costs when applying conventional CNN architectures.

To address these challenges, this study focuses on the Ququleke Mountain region in Qiemo County, Xinjiang, China. A high-resolution orthophoto dataset was constructed using UAV oblique photogrammetry. Building upon the DeepLabV3+ semantic segmentation framework, a Coordinate Attention (CA) mechanism together with a multi-scale feature fusion module was integrated to enhance the extraction of multi-scale information and spatial context. While the CA-DeepLabV3+ framework has demonstrated effectiveness in various remote sensing and computer vision applications, such as impervious surface extraction [22] and SAR image semantic segmentation [23], its integration with UAV oblique photogrammetry for lithological mapping represents a pioneering application in a geological context. In addition, a lightweight MobileNetV2 backbone was employed and transfer learning was applied to reduce computational complexity. Comparative experiments against FPN, FCN, U-Net, PSPNet, DeepLabV1, and the standard DeepLabV3+ [24,25,26,27,28,29] were conducted to validate the accuracy and portability of this novel methodology, offering an efficient approach for lithological mapping in complex regions.

2. Dataset Construction

2.1. Overview of the Study Area

The study area is located in the Ququleke region of Qiemo County, at the southern end of Xinjiang, China (Figure 1a). The region is situated within a high-altitude mountainous environment, with an average elevation exceeding 4000 m. The landscape is dominated by deeply incised valleys, steep ridges, and exposed rock walls (Figure 1b). Slopes are steep, and numerous gullies dissect the terrain. In many sections, vegetation coverage is sparse, weathering is intense, and transportation is extremely inconvenient. These harsh natural conditions pose serious challenges to conventional geological field surveys, leading to limited accessibility, poor visibility of outcrops, and increased safety risks (Figure 1c).

Tectonically, the Ququleke region occupies a key position within the Kunlun Orogenic Belt, forming a structural junction between the Tarim Craton to the north and the Bayan Har Block to the southeast [30] (Figure 2a). This large-scale tectonic setting influences local structures, resulting in granodiorite plutons, felsic magmatic intrusions, and several major fault zones—including the Ququlekebei Fault and the Ququlekedong Fault [31]. These complex geological conditions, particularly the interplay of major faults, intrusive bodies, and multi-stage deformation, create favorable environments for various types of mineralization (Figure 2b). The region hosts several documented mineral occurrences, including gold-antimony (Au-Sb) and copper-gold (Cu-Au) deposits, copper-zinc (Cu-Zn) mineralization, as well as gold (Au) and placer gold deposits [32]. The strong association between these mineralizations and the local lithological and structural features underscores the practical relevance of accurate and efficient lithological mapping for mineral exploration and resource assessment in this region.

Stratigraphically, the study area is characterized by a diverse rock assemblage (Figure 2b). Exposed stratigraphic units primarily include the Lower Proterozoic Kuhai Group, the Lower Carboniferous Tuokuzidaban Group, the Upper Carboniferous Halamilanhe Group, the Lower-Middle Permian Shuweimenke Formation, and the Lower-Middle Jurassic Yeerqiang Group [33,34]. These sedimentary and volcanic-clastic strata are extensively distributed and are intruded by various magmatic bodies. Prominent intrusive rocks, influenced by the regional tectonic activity, include granites, monzogranites, granodiorites, diorites, and diabases, which form plutons and smaller intrusions. Metamorphic rocks, such as marble, are also present within the Kuhai Group and Halamilanhe Group. Holocene and Pliocene sediments represent the youngest geological units, primarily filling valley bottoms and depressions. This diverse lithological composition, ranging from ancient metamorphic and intrusive rocks to younger sedimentary and unconsolidated deposits, forms the basis for our lithological identification task. In addition to the large-scale regional geological map (Figure 2b), we also incorporate a detailed geological map provided by local geological workers (Figure 2c). This map covers the UAV survey extent and provides a finer representation of lithological boundaries and unit distributions, thereby facilitating validation of the predicted results.

2.2. Data Acquisition and Processing

Field outcrop images were acquired using a DJI Mavic 3M (DJI, Shenzhen, China) quadcopter equipped with a high-definition visible-light camera (24 mm focal length, 4/3 CMOS sensor, 20-megapixel resolution) and an integrated GPS/Position and Orientation System (POS) for precise recording of flight position and camera orientation. The UAV survey was conducted on 23 August 2025 at an altitude of 500 m, with 70% side overlap and 80% forward overlap. These systematic flight parameters were carefully designed to ensure comprehensive coverage and consistent spatial data acquisition across the rugged Ququleke terrain, drawing parallels to established practices in autonomous UAV mapping for various environmental and reconstruction tasks [35]. A total of 212 aerial photographs were captured, covering approximately 4.11 km² in about 120 min of flight (Figure 3).

Given its geometric accuracy and radiometric fidelity, UAV orthophotography formed the primary data source for lithology identification. Orthophotos preserve true spectral reflectance and color relationships of surface features and typically achieve centimeter-level ground sample distance (GSD), revealing fine-grained rock textures. Photogrammetric correction removes image tilt and terrain distortion, ensuring that each pixel is accurately georeferenced. Distinct lithologies exhibit unique spectral and texture characteristics in the imagery—for instance, diorite appears light gray, marble appears whitish, and sandstone tends toward yellowish tones. Variations in grain size and arrangement provide subtle texture differences that geologists can interpret visually [36].

Pix4D Mapper (version 4.5.6, Pix4D S.A., Prilly, Switzerland) was used for image stitching and photogrammetric processing [37]. The workflow included organizing raw images, running quality control, extracting feature points, performing image alignment, and generating a sparse point cloud followed by a 3D surface model (Figure 4). Afterward, a Digital Elevation Model (DEM) and an orthorectified mosaic at 0.23 m spatial resolution were produced. This robust processing pipeline is essential for transforming raw UAV imagery into spatially consistent and geometrically accurate orthophotos, which is a prerequisite for reliable downstream semantic segmentation, echoing the importance of data quality and consistency highlighted in advanced UAV mapping studies [38]. These products provided the fundamental base data for subsequent lithological analysis (Figure 5).

All visible elements within the orthophotos were manually classified into four lithological categories (sandstone, diorite, marble, and Quaternary sediments) using ArcGIS (version 10.8.1, Esri, Redlands, CA, USA) software for pixel-level annotation (Figure 6). The classification drew on expert geological guidance, existing large-scale geological maps of the study area, and detailed visual interpretation of the UAV orthophotos. For visual interpretation, the distinct color and textural characteristics in the imagery were critical: diorite typically appeared as light gray, coarse-grained masses; marble was identified by its whitish, blocky distribution; and sandstone exhibited yellowish tones with medium-to-thick bedded structures. Quaternary sediments were recognized as light white, irregularly distributed accumulations occupying valley bottoms and low-lying depressions, with a visually loose and heterogeneous texture. Areas outside the study site were marked as “invalid regions.” After annotation, vector shapefiles (.shp) were converted to raster format, keeping the same pixel size as the original orthophoto. Background, sandstone, Quaternary sediments, marble, and diorite were assigned numeric labels from 0 to 4, respectively.
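The label encoding described above can be sketched as a small lookup table, together with a helper that reports how many pixels fall into each class. This is a minimal illustration, not the authors' code: the helper name `class_pixel_fractions` and the toy raster are assumptions, and only the 0 to 4 label order comes from the text.

```python
import numpy as np

# Label ids as stated in the text: 0 = background ... 4 = diorite.
CLASS_NAMES = {
    0: "background",
    1: "sandstone",
    2: "Quaternary sediments",
    3: "marble",
    4: "diorite",
}


def class_pixel_fractions(mask: np.ndarray) -> dict:
    """Return the fraction of pixels per class in an integer label raster."""
    counts = np.bincount(mask.ravel(), minlength=len(CLASS_NAMES))
    total = counts.sum()
    return {CLASS_NAMES[i]: counts[i] / total for i in range(len(CLASS_NAMES))}


# Toy 2 x 4 raster (hypothetical values, not data from the study).
toy_mask = np.array([[0, 1, 1, 2],
                     [3, 4, 4, 4]], dtype=np.int64)
fractions = class_pixel_fractions(toy_mask)
```

A check of this kind is a cheap way to spot class imbalance before training, since a class that covers very few pixels tends to be under-represented in the loss.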

To balance class representation, the original large orthophoto was divided into two spatially non-overlapping regions of equal size for training and validation (Figure 1c). This strict geographical separation was implemented to ensure that the validation set provides a genuinely independent assessment of model generalization, mitigating the risk of overfitting to specific spatial features within the training area. The training images were further cropped using a sliding-window approach to generate 256 × 256-pixel patches while maintaining spatial coherence. Data augmentation techniques—including rotation, flipping, scaling, color adjustment, and noise addition—were applied to expand the training samples and enhance model generalization (Figure 7). Ultimately, the dataset comprised 15,659 training samples and 3023 validation samples.
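For segmentation data, geometric augmentations such as rotation and flipping have to be applied identically to a patch and its label mask, whereas photometric changes such as noise apply to the image only. The sketch below illustrates that distinction with NumPy. The function names, the 90-degree rotation steps, and the noise level are assumptions for illustration; the article does not state the augmentation parameters that were actually used.

```python
import numpy as np


def augment_pair(image, mask, k_rot, flip_lr):
    """Apply the same 90-degree rotation and optional left-right flip to image and mask."""
    image = np.rot90(image, k=k_rot, axes=(0, 1))
    mask = np.rot90(mask, k=k_rot, axes=(0, 1))
    if flip_lr:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to an 8-bit image; labels are never modified."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


# Synthetic 256 x 256 patch and label mask (random values, not study data).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
labels = rng.integers(0, 5, size=(256, 256), dtype=np.uint8)

aug_patch, aug_labels = augment_pair(patch, labels, k_rot=1, flip_lr=True)
noisy_patch = add_gaussian_noise(aug_patch, sigma=5.0, rng=rng)
```

Scaling, which the text also lists, would need nearest-neighbour resampling for the mask so that no new label values are interpolated into existence.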

3. Experiment and Methods

3.1. Overview of the Proposed Lithology Segmentation Method

To rigorously evaluate the performance and adaptability of the proposed UAV orthophoto-based lithology identification approach, a step-by-step experimental workflow was implemented.

First, high-resolution RGB orthophotos of the Ququleke study area were obtained via UAV oblique photogrammetry. Raw images underwent geometric correction to remove distortions due to flight tilt and terrain relief, followed by color normalization to standardize brightness and contrast across different acquisition sessions. The imagery was then tiled into fixed-size 256 × 256-pixel patches, a format compatible with deep convolutional networks. Combined with precise manual annotation, a lithology semantic segmentation dataset was constructed.
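The tiling step can be sketched as a sliding window over the orthophoto array. This is a minimal version under stated assumptions: the function name `tile_image` is hypothetical, incomplete windows at the right and bottom edges are simply dropped, and the stride is a free parameter because the article does not say whether neighbouring patches overlapped.

```python
import numpy as np


def tile_image(array, patch=256, stride=256):
    """Yield (top, left, tile) for every full patch x patch window of a 2D or 3D array."""
    height, width = array.shape[:2]
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            yield top, left, array[top:top + patch, left:left + patch]


# Synthetic 512 x 768 RGB raster standing in for an orthophoto.
ortho = np.zeros((512, 768, 3), dtype=np.uint8)
tiles = list(tile_image(ortho))                    # non-overlapping windows
overlapping = list(tile_image(ortho, stride=128))  # 50% overlap between windows
```

Running the same generator over the label raster with identical arguments yields masks that line up with the image tiles, because the window positions depend only on the array shape.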

For the semantic segmentation task, DeepLabV3+ was adopted as the baseline model from which modifications were applied. Its original heavy backbone (such as Xception or ResNet) was replaced by the lightweight MobileNetV2 to reduce parameter count and computational complexity—essential for running on limited GPU memory. MobileNetV2 incorporates inverted residual blocks and depthwise separable convolutions, drastically lowering computational cost while maintaining strong representational power.
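The saving from depthwise separable convolutions is easy to verify by counting weights. The sketch below compares a standard convolution with its depthwise separable counterpart and counts the weights of one inverted residual block. Biases and batch-normalization parameters are ignored, and the channel widths and the expansion factor of 6 are illustrative assumptions rather than values reported in the article.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise kernel x kernel conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def inverted_residual_params(c_in, c_out, expansion=6, kernel=3):
    """Weights in a 1 x 1 expansion, a depthwise conv, and a 1 x 1 projection."""
    hidden = c_in * expansion
    expand = c_in * hidden
    depthwise = kernel * kernel * hidden
    project = hidden * c_out
    return expand + depthwise + project


standard = standard_conv_params(3, 32, 64)          # 3 x 3, 32 -> 64 channels
separable = depthwise_separable_params(3, 32, 64)
reduction = standard / separable
```

For this 3 × 3 layer with 32 input and 64 output channels, the separable form needs roughly one eighth of the weights of the standard form.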

Transfer learning was employed: initial weights were pre-trained on large-scale image datasets, allowing the model to converge faster and reduce overfitting risk with the relatively small UAV lithology dataset. Mixed-precision training (using AMP in PyTorch) was used to accelerate computation and lower memory usage, enabling larger batch sizes under the same resource constraints.

To address the difficulties intrinsic to UAV field outcrop imagery—such as blurred lithological boundaries, high intra-class texture similarity, and loss of fine-scale details—we integrated a Coordinate Attention (CA) module into DeepLabV3+. The CA mechanism augments conventional channel attention by encoding spatial positional information into the attention computation. Specifically, it separates global average pooling into two directions—horizontal (width-wise) and vertical (height-wise)—thereby capturing long-range dependencies with explicit coordinate context [39]. This allows the network to assign higher weights to channels that are both spatially and semantically relevant, strengthening discrimination of lithologies with irregular boundaries.

For comparison, alternative state-of-the-art segmentation networks—FPN, FCN, U-Net, PSPNet, DeepLabV1, and standard DeepLabV3+—were trained and tested on the same dataset with identical preprocessing and augmentation. This ensured that performance differences reflect architectural strengths rather than data processing bias.

3.2. Semantic Segmentation Model

In recent years, digital image processing techniques have been extensively studied in the field of computer vision, with image segmentation [40] being one of its key tasks. Image segmentation aims to classify pixels within an image, thereby identifying and separating different targets and regions. For UAV-captured field outcrop imagery, factors such as vegetation cover, rock weathering, and blurred lithological boundaries can reduce the accuracy of traditional segmentation methods. Therefore, this study adopts the DeepLabV3+ network model as the baseline segmentation framework. Proposed by Google in 2018, DeepLabV3+ builds upon an improved encoder–decoder structure and a multi-scale feature fusion strategy, enabling substantial accuracy improvements under complex background conditions. It is particularly suitable for geological imagery tasks where details are abundant and inter-class differences are subtle.

The powerful performance of DeepLabV3+ stems from the automatic feature learning capability of convolutional neural networks (CNNs). For input UAV orthophoto RGB imagery, low-level convolutions extract features such as color, edges, and fine-grained textures; mid-level convolutions capture shape contours and repetitive texture patterns; and high-level convolutions learn patterns and class combinations at a higher semantic level. These hierarchical features collectively provide the multi-scale information required for lithology identification. Unlike traditional handcrafted feature extraction methods, deep learning can directly learn the mapping between features and rock categories from large-scale annotated datasets, thereby enabling UAV imagery-driven automatic lithology recognition and mapping.

To achieve effective extraction and fusion of the aforementioned multi-level features, the DeepLabV3+ variant used here employs in its encoder a MobileNetV2 feature extraction network composed of a series of inverted residual blocks. Each inverted residual block consists of a 1 × 1 convolution for channel expansion, a depthwise 3 × 3 convolution, and a 1 × 1 convolution for channel reduction. The encoder output is processed by the ASPP (Atrous Spatial Pyramid Pooling) module, which contains a 1 × 1 convolution layer, three 3 × 3 convolution layers with different atrous rates, and a pooling layer. This multi-scale pooling approach expands the receptive field and integrates semantic information across different scales. In the decoder, the high-level features from the encoder are fused with low-level detail features from the feature extraction network. The fused features are then passed through a 1 × 1 convolution and an upsampling operation to restore spatial resolution, ultimately producing high-precision lithology segmentation results (Figure 8).
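How atrous rates widen the receptive field without adding weights can be shown with a short calculation: a k × k kernel with rate r samples the input at intervals of r pixels, so it spans k + (k − 1)(r − 1) pixels per side. The rates 6, 12, and 18 below are an assumption taken from common DeepLab configurations; the article only states that three different rates are used.

```python
def effective_kernel_size(kernel, rate):
    """Side length covered by a kernel x kernel convolution with the given atrous rate."""
    return kernel + (kernel - 1) * (rate - 1)


def tap_offsets(kernel, rate):
    """Offsets, relative to the centre pixel, at which a 1D atrous kernel samples its input."""
    half = (kernel - 1) // 2
    return [rate * (i - half) for i in range(kernel)]


ASSUMED_RATES = (6, 12, 18)  # illustrative; not stated in the article
spans = {rate: effective_kernel_size(3, rate) for rate in ASSUMED_RATES}
offsets_rate_6 = tap_offsets(3, 6)
```

Each branch still has only nine weights per input-output channel pair, so the parallel ASPP branches differ in the context they see rather than in their cost per weight.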

3.3. Coordinate Attention Mechanism

To address the difficulty of classifying boundary pixels for rock lithology in remote sensing imagery, an attention mechanism was incorporated into the DeepLabV3+ network. Specifically, channel attention was applied to high-level features, while spatial attention was applied to low-level features, and the two were then concatenated in series. This design enables more effective extraction of key feature information from remote sensing images, thereby improving the classification accuracy of boundary pixels.

Attention mechanisms widely used in computer vision primarily include spatial attention [41], channel attention [42], and hybrid attention [43]. However, these approaches often focus solely on feature information in either the channel or spatial dimension, without emphasizing the importance of positional information. Positional information is crucial for visual tasks that aim to capture the structural characteristics of a target. Therefore, this study employs the Coordinate Attention (CA) mechanism, which is well-suited for outdoor rock lithology segmentation. CA is an improved attention mechanism designed to capture both channel attention and positional information simultaneously. Unlike traditional channel attention methods (e.g., SE-Net) that rely solely on global average pooling to obtain global channel weights, CA decomposes the global pooling operation and encodes spatial information into the channel attention. This process preserves the position-dependent characteristics of features along the height (H) and width (W) dimensions. Such capability has a significant impact on tasks requiring precise spatial localization, including semantic segmentation and object detection. The structure of the CA module is illustrated in Figure 9. In this article, the symbols used in the network architecture are defined as follows: C denotes the number of channels in the feature map, H and W denote the height and width, respectively, and r denotes the channel reduction ratio used to control the compression of feature dimensions in the attention module.

For the input image, to avoid compressing all spatial information into the channel dimension and to capture long-range spatial interactions with precise positional cues, global average pooling is applied separately to each column and each row. This produces a width-wise feature vector of length W and a height-wise feature vector of length H, as follows:

z_{c}^{h} (w) = \frac{1}{H} \sum_{h = 1}^{H} x_{c} (h, w), w \in [1, W]

(1)

z_{c}^{v} (h) = \frac{1}{W} \sum_{w = 1}^{W} x_{c} (h, w), h \in [1, H]

(2)

The outputs are denoted as z^{h} \in R^{C \times W}, representing the global distribution of each channel along the width dimension, and z^{v} \in R^{C \times H}, representing the global distribution of each channel along the height dimension.

Subsequently, z^{h} and z^{v} are concatenated and passed through a shared 1 × 1 convolution followed by a non-linear activation function to generate the intermediate features:

f = \delta (F_{1} ([z^{h}, z^{v}]))

(3)

where F₁ denotes a 1 × 1 convolution, reducing the channel dimension to C/r × (H + W) (with r being the reduction ratio, typically r = 8), and δ represents a non-linear activation function, such as Sigmoid or ReLU.

The above formulation addresses the embedding of coordinate information. Next, the directionally encoded features are split into horizontal and vertical components to generate attention weights separately. Specifically, f is split along the spatial dimension into two parts, f^{h} \in R^{C / r \times W} and f^{v} \in R^{C / r \times H}. Each part is then passed through a 1 × 1 convolution for dimensionality expansion, followed by a sigmoid activation function, resulting in the final attention vectors g^{h} \in R^{C \times 1 \times W} and g^{v} \in R^{C \times H \times 1}:

g^{h} = σ (F_{h} (f^{h}))

(4)

g^{v} = σ (F_{v} (f^{v}))

(5)

where F_h and F_v denote 1 × 1 convolutions that restore the channel dimension to C, and σ represents the Sigmoid function, which normalizes the weights to the range [0, 1]. The final output feature Y is obtained by the element-wise multiplication of the input X with the attention weights:

y_{c} (h, w) = x_{c} (h, w) \times g_{c}^{h} (w) \times g_{c}^{v} (h)

(6)

In summary, the entire process can be regarded as an automatic learning and optimization procedure for the weight coefficients of individual channels. These weights are adaptively adjusted by the network without manual intervention, thereby enhancing its ability to discriminate between different channel features and increasing the emphasis on informative features. By introducing the Coordinate Attention (CA) module, the model can assign higher weights to channels with strong responses to target features. Although this approach introduces some additional computational cost, the overall performance of the model is significantly improved. On the one hand, by encoding positional information along the height and width dimensions, CA enables the network to account for both spatial layout and channel feature importance during semantic segmentation, thus improving recognition of fine-grained textures and spatial organization. On the other hand, for rock segmentation tasks, lithological features are often associated with spatial distribution patterns (e.g., fracture orientation, bedding position), and CA helps preserve such directional information.
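Equations (1)–(6) can be sketched as a single forward pass in NumPy, which makes the tensor shapes at each step explicit. This is a simplified illustration and not the authors' implementation: it handles one feature map without a batch dimension, omits bias terms and batch normalization, uses ReLU for δ (one of the options the text names), and treats each 1 × 1 convolution as a matrix product. The weight names `w1`, `wh`, and `wv` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def coordinate_attention(x, w1, wh, wv):
    """x: (C, H, W); w1: (C // r, C); wh and wv: (C, C // r). Returns y, g_h, g_v."""
    C, H, W = x.shape
    z_h = x.mean(axis=1)                      # Eq. (1): average over height -> (C, W)
    z_v = x.mean(axis=2)                      # Eq. (2): average over width  -> (C, H)
    z = np.concatenate([z_h, z_v], axis=1)    # concatenation -> (C, W + H)
    f = np.maximum(w1 @ z, 0.0)               # Eq. (3): shared 1x1 conv + ReLU -> (C/r, W + H)
    f_h, f_v = f[:, :W], f[:, W:]             # split back into the two directions
    g_h = sigmoid(wh @ f_h)                   # Eq. (4): weights indexed by w -> (C, W)
    g_v = sigmoid(wv @ f_v)                   # Eq. (5): weights indexed by h -> (C, H)
    y = x * g_h[:, None, :] * g_v[:, :, None]  # Eq. (6): broadcast over (C, H, W)
    return y, g_h, g_v


# Random feature map and weights with C = 16, r = 8, H = 4, W = 6 (illustrative sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4, 6))
y, g_h, g_v = coordinate_attention(
    x,
    rng.normal(size=(2, 16)),
    rng.normal(size=(16, 2)),
    rng.normal(size=(16, 2)),
)
```

Because both attention maps lie between 0 and 1, the module can only attenuate a response, never amplify it; with all-zero weights every gate equals 0.5 and the output is the input scaled by 0.25.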

3.4. Experimental Environment and Parameter

The experiments were conducted on a Windows 11 operating system. A virtual working environment was configured using the Anaconda environment manager with Python 3.12.3 and PyTorch 2.6.0. The hardware setup comprised an Intel Core i9-13900HX CPU and an NVIDIA RTX 4060 GPU. The initial learning rate was set to 0.001, with the AdamW optimizer and AMP mixed-precision training employed. Due to limited computational resources, the batch size was set to 8, the number of epochs to 100, and the patch size to 256 × 256, with five target classes in total.

3.5. Evaluation Metrics

To evaluate the accuracy of the constructed dataset and the applicability of the deep learning model, the experiments adopted four metrics for lithology identification performance assessment: mean intersection over union (mIoU), class IoU (CIoU), overall accuracy (OA), and mean accuracy (MeanAcc). mIoU represents the ratio between the intersection and the union of the predicted and ground-truth regions for each class, averaged across all classes. CIoU denotes the overlap accuracy between the predicted and ground-truth regions for a single class. OA is defined as the ratio of correctly predicted samples to the total number of samples across the entire test set. MeanAcc refers to the average recall across all classes, reflecting the model’s ability to correctly capture each lithological category. The calculation formulas are as follows:

\mathrm{mIoU} = \frac{1}{K} \sum_{k = 1}^{K} \mathrm{IoU}_{k} = \frac{1}{K} \sum_{k = 1}^{K} \frac{TP_{k}}{TP_{k} + FP_{k} + FN_{k}}

(7)

\mathrm{IoU}_{k} = \frac{TP_{k}}{TP_{k} + FP_{k} + FN_{k}}

(8)

\mathrm{OA} = \frac{\sum_{k = 1}^{K} TP_{k}}{\sum_{k = 1}^{K} (TP_{k} + FP_{k})}

(9)

\mathrm{MeanAcc} = \frac{1}{K} \sum_{k = 1}^{K} \frac{TP_{k}}{TP_{k} + FN_{k}} = \frac{1}{K} \sum_{k = 1}^{K} \mathrm{Recall}_{k}

(10)

where K denotes the number of labeled categories (including background); TP_k represents the number of correctly predicted pixels for class k (true positives); FP_k denotes the number of pixels from other classes that are incorrectly predicted as class k (false positives); FN_k denotes the number of pixels belonging to class k that are missed by the prediction (false negatives).
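All four metrics can be derived from a single confusion matrix, which is a convenient way to check reported numbers against each other. The sketch below follows the definitions in the text, with MeanAcc taken as the average per-class recall, TP_k / (TP_k + FN_k). The function name and the 3-class toy matrix are assumptions for illustration, and the code assumes every class occurs at least once so that no denominator is zero.

```python
import numpy as np


def segmentation_metrics(conf):
    """conf[i, j] = number of pixels whose true class is i and predicted class is j."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as class k but belonging to another class
    fn = conf.sum(axis=1) - tp   # belonging to class k but predicted as another class
    iou = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    return {
        "IoU": iou,
        "mIoU": iou.mean(),
        "OA": tp.sum() / conf.sum(),
        "MeanAcc": recall.mean(),
    }


# Toy 3-class confusion matrix (hypothetical counts, not results from the study).
toy_conf = [[8, 1, 1],
            [2, 6, 2],
            [0, 0, 10]]
metrics = segmentation_metrics(toy_conf)
```

In this toy case OA and MeanAcc coincide because every class has the same number of true pixels; on an imbalanced dataset the two diverge, which is why both are reported.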

4. Results and Discussion

4.1. Convergence Analysis

The performance of the proposed CA-DeepLabV3+ model was benchmarked against the baseline DeepLabV3+ and other mainstream semantic segmentation networks, including U-Net, PSPNet, and DeepLabV1.

Figure 10 illustrates the loss curves across training epochs. All models exhibited a rapid loss reduction during the early stage (Epoch < 10), indicating that they quickly learned the basic discriminative features necessary for classifying different lithologies. In the mid-to-late training stage (Epoch 20–100), the rate of loss reduction slowed, approaching a stable convergence value.

CA-DeepLabV3+ demonstrated the most pronounced early-stage improvement: the loss dropped sharply to approximately 0.28 within the first few iterations, markedly faster than in other networks. This suggests that the integration of Coordinate Attention enhances the model’s ability to focus on critical spatial-channel features at an early stage, accelerating convergence. By Epoch 40, the loss fell below 0.20 and steadily decreased to ~0.06—the lowest among all tested architectures. In comparison, DeepLabV1 and standard DeepLabV3+ converged around 0.12–0.13, while U-Net and PSPNet converged more slowly and at higher final losses (>0.16).

Taken together, these results highlight the capacity of CA-DeepLabV3+ to capture key lithological features earlier in training while maintaining stable optimization in later epochs, ultimately achieving a substantially lower convergence loss than alternative methods.

4.2. Comparison Experiments of Backbone Network

To comprehensively evaluate the lightweight nature and practical deployability of the CA-DeepLabV3+ framework as implemented in this study, we conducted an efficiency analysis comparing various backbone networks. This analysis focused on two key metrics: the total number of trainable parameters and the inference time. Such quantification is crucial for understanding the model’s suitability for real-world geological mapping applications, especially in resource-constrained environments.

We integrated the Coordinate Attention mechanism and the DeepLabV3+ decoder structure with five different backbone networks: MobileNetV2 (our primary choice), ResNet50, ResNet101, EfficientNet-B3, and timm-mobilenetv3_large_100. All models were evaluated on the same hardware setup (NVIDIA RTX 4060 GPU, NVIDIA Corporation, Santa Clara, CA, USA) with an input image size of 256 × 256 pixels and a batch size of 1. Inference times were averaged over 100 runs after 10 warm-up runs to ensure stable measurements. The results are summarized in Table 1.
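The timing protocol described above (10 discarded warm-up runs followed by 100 timed runs) can be sketched as a small, framework-agnostic harness. This is an illustration under assumptions, not the authors' benchmarking script: `time_inference` and `dummy_forward` are hypothetical names, and the stand-in workload replaces a real forward pass on a 256 × 256 input.

```python
import time
from statistics import mean
from typing import Callable, List


def time_inference(run_once: Callable[[], object],
                   warmup: int = 10, runs: int = 100) -> float:
    """Return the mean wall-clock seconds per call, ignoring warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up: caches, lazy initialisation, clock ramp-up
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return mean(samples)


# Stand-in workload (hypothetical); in practice this would be one forward pass.
calls: List[int] = []


def dummy_forward() -> int:
    calls.append(1)
    return sum(i * i for i in range(1000))


avg_seconds = time_inference(dummy_forward)
```

When the workload runs on a GPU, kernels are typically launched asynchronously, so the device should be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the measured time may reflect only the launch overhead.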

As shown in Table 1, the choice of backbone network significantly affects both the model’s parameter count and its inference speed. Our primary backbone, MobileNetV2, is the most efficient of the five, with the lowest parameter count (4.54 million) and the fastest inference time (0.0046 s, i.e., 4.6 ms, per image). Heavier backbones such as ResNet50 (27.08 million parameters, 0.0052 s) and ResNet101 (46.07 million parameters, 0.0086 s) offer greater representational capacity but carry roughly 6 and 10 times as many parameters, respectively. MobileNetV2 is also faster than the other lightweight architectures, EfficientNet-B3 (11.70 million parameters, 0.0112 s) and timm-mobilenetv3_large_100 (4.80 million parameters, 0.0056 s).
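To make the comparison in Table 1 easier to read, the following sketch expresses each backbone relative to MobileNetV2 and converts per-image latency into throughput. The numbers are copied from Table 1; the dictionary layout and variable names are illustrative choices.

```python
# Values copied from Table 1: (parameters in millions, inference time in s/image)
table1 = {
    "MobileNetV2": (4.54, 0.0046),
    "ResNet50": (27.08, 0.0052),
    "ResNet101": (46.07, 0.0086),
    "EfficientNet-B3": (11.70, 0.0112),
    "timm-mobilenetv3_large_100": (4.80, 0.0056),
}

base_params, base_time = table1["MobileNetV2"]
summary = {}
for name, (params, seconds) in table1.items():
    summary[name] = {
        "param_ratio": round(params / base_params, 2),  # size relative to MobileNetV2
        "time_ratio": round(seconds / base_time, 2),    # latency relative to MobileNetV2
        "images_per_s": round(1.0 / seconds, 1),        # throughput at batch size 1
    }

for name, row in summary.items():
    print(f"{name:28s} {row['param_ratio']:6.2f}x params "
          f"{row['time_ratio']:5.2f}x time {row['images_per_s']:7.1f} img/s")
```

The ratios show that parameter count and latency do not track each other closely at batch size 1: ResNet50 has about six times the parameters of MobileNetV2 but is only about 13% slower, whereas EfficientNet-B3 is the slowest backbone despite its moderate size.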

This efficiency analysis underscores the practical advantages of our CA-DeepLabV3+ model with the MobileNetV2 backbone. It confirms that our design choice effectively achieves the goal of a lightweight architecture, enabling faster training and inference without compromising accuracy. Such computational efficiency is paramount for deploying deep learning models in large-scale geological surveys where rapid processing and limited hardware resources are common constraints.

4.3. Quantitative Evaluation

As shown in Table 2, all tested networks achieved relatively high accuracy after sufficient training. Traditional frameworks such as FPN, FCN, and U-Net delivered stable performance in overall accuracy (OA), mean accuracy (MeanAcc), and mean intersection over union (mIoU). Among them, FCN recorded the highest mIoU (92.99%), while U-Net attained a lower MeanAcc (95.02%) and a lower mIoU (90.07%), reflecting its limitations in fine-grained boundary extraction. PSPNet, despite its multi-scale pyramid pooling module, produced an mIoU of only 90.03%, likely affected by the high texture complexity and inter-class spectral similarity in field lithology imagery. DeepLabV1 benefited from atrous convolution to expand the receptive field while preserving resolution, yielding OA = 96.68% and MeanAcc = 96.83%, the highest among non-attention models.

In further comparative experiments, standard DeepLabV3+, which adds the ASPP module and decoder refinement, reached OA = 95.08%, MeanAcc = 95.04%, and mIoU = 90.23% on this dataset. CA-DeepLabV3+ outperformed all baselines across metrics (OA = 97.95%, MeanAcc = 97.80%, mIoU = 95.71%), corresponding to absolute gains of 2.87, 2.76, and 5.48 percentage points over standard DeepLabV3+. These gains indicate that coordinate attention effectively models spatial–channel dependency, enabling more precise boundary localization and detailed texture preservation while maintaining robust global context modeling.
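The gains quoted above are differences between percentage scores, that is, percentage points. The short sketch below recomputes them from Table 2 and, for comparison, also reports the change relative to the baseline value; the variable names are illustrative.

```python
# Scores in percent, copied from Table 2
baseline = {"OA": 95.08, "MeanAcc": 95.04, "mIoU": 90.23}  # DeepLabV3+
proposed = {"OA": 97.95, "MeanAcc": 97.80, "mIoU": 95.71}  # CA-DeepLabV3+

gains = {}
for metric in baseline:
    diff = proposed[metric] - baseline[metric]
    gains[metric] = {
        "points": round(diff, 2),                                   # absolute difference
        "relative_pct": round(100.0 * diff / baseline[metric], 2),  # change relative to baseline
    }

for metric, g in gains.items():
    print(f"{metric}: +{g['points']} points, +{g['relative_pct']}% relative")
```

For mIoU, for instance, the 5.48-point difference corresponds to a relative improvement of about 6.07% over the baseline value of 90.23%.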

From a feature extraction perspective, embedding coordinate information into channel attention allows simultaneous modeling of long-range contextual dependencies and local structural details. This is particularly advantageous in UAV field imagery, where lithologies often share similar colors/textures and have irregular boundaries. By enriching global semantics with focused positional attention, the CA module reduces misclassification and boosts segmentation accuracy for small targets (Figure 11). Furthermore, during high-resolution feature reconstruction, CA-DeepLabV3+ achieves enhanced edge preservation owing to its independent horizontal–vertical encoding mechanism, thereby improving the network’s spatial resolution awareness.
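The independent horizontal–vertical encoding mentioned above can be illustrated with a deliberately simplified, parameter-free sketch. It keeps only the two directional average-pooling steps and the multiplicative gating; the learned 1 × 1 convolutions, normalization, and channel reduction of the Coordinate Attention block [39] are omitted, so this is not the module used in CA-DeepLabV3+. The names `directional_attention` and `toy` are hypothetical.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def directional_attention(x: Feature) -> Feature:
    """Reweight x with separate row-wise and column-wise gates (no learned layers)."""
    out: Feature = []
    for channel in x:
        h, w = len(channel), len(channel[0])
        # Pool along the width: one descriptor per row (vertical position)
        row_desc = [sum(row) / w for row in channel]
        # Pool along the height: one descriptor per column (horizontal position)
        col_desc = [sum(channel[i][j] for i in range(h)) / h for j in range(w)]
        row_gate = [sigmoid(v) for v in row_desc]
        col_gate = [sigmoid(v) for v in col_desc]
        # Each pixel is scaled by the gate of its row and the gate of its column
        out.append([[channel[i][j] * row_gate[i] * col_gate[j] for j in range(w)]
                    for i in range(h)])
    return out


# Toy single-channel 2 x 3 feature map with one strong response
toy = [[[0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0]]]
weighted = directional_attention(toy)
```

Because each gate depends on a single row or column, the strong response at position (1, 1) raises the weight of its entire row and column, which is the sense in which position along each axis is encoded separately.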

4.4. Category-Specific Performance

Table 3 compares CIoU values across lithological categories. Sandstone, marble, and diorite reported consistently high CIoU scores across models, indicating stable and distinguishable spectral-textural properties. Notably, sandstone CIoU exceeded 96% for all methods, with CA-DeepLabV3+ achieving 98.48%. Marble recognition remained strong (>86% CIoU in all cases) due to its distinct brightness and texture, with CA-DeepLabV3+ attaining 94.62%. Diorite, characterized by visible mineral grain patterns, achieved a maximum CIoU of 94.86% in CA-DeepLabV3+, outperforming other models.

Quaternary sediments posed the greatest challenge, with CIoU scores as low as 77–79% for PSPNet and U-Net, and 79.24% for standard DeepLabV3+. Causes include strong spectral resemblance to sandstone, fine-grained structure variability, dust cover from weathering, and blurred visual boundaries. CA-DeepLabV3+ improved Quaternary sediment CIoU to 90.66%, a gain of 11.42 percentage points over the baseline and the only score for this class exceeding 90% among all networks, underscoring its effectiveness for visually similar, heterogeneous classes.
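The class-level pattern described above can be checked directly against Table 3. The sketch below identifies the lowest-scoring class for each network and the per-class difference between CA-DeepLabV3+ and standard DeepLabV3+; the values are copied from Table 3 and the variable names are illustrative.

```python
classes = ["Quaternary sediments", "Sandstone", "Marble", "Diorite"]

# CIoU in percent, copied from Table 3 (same class order as above)
table3 = {
    "FPN": [83.26, 97.23, 89.79, 90.88],
    "FCN": [85.31, 97.58, 90.73, 91.40],
    "U-Net": [79.04, 96.96, 86.68, 87.76],
    "PSPNet": [77.55, 96.91, 87.57, 88.30],
    "DeepLabV1": [86.06, 97.82, 91.00, 91.87],
    "DeepLabV3+": [79.24, 96.40, 87.37, 88.33],
    "CA-DeepLabV3+": [90.66, 98.48, 94.62, 94.86],
}

# Lowest-scoring class for every network
hardest = {net: classes[scores.index(min(scores))] for net, scores in table3.items()}

# Per-class difference (percentage points) between the proposed model and the baseline
class_gain = {
    name: round(table3["CA-DeepLabV3+"][i] - table3["DeepLabV3+"][i], 2)
    for i, name in enumerate(classes)
}
```

Quaternary sediments are the lowest-scoring class for every network, and they also show the largest difference between the two DeepLab variants, followed by marble, diorite, and sandstone.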

By incorporating refined spatial positional encoding into multi-scale contextual extraction, CA guides the network toward key spatial regions during category separation, reducing both false positives and false negatives. Figure 12 visualizes this improvement: all networks approximated the overall lithology distribution, but CA-DeepLabV3+ achieved the cleanest category boundaries and preserved fine-scale texture details, aligning with quantitative gains. For clearer interpretation of the model’s predictions, Figure 12 compares the predicted lithological map directly with the detailed geological map of the same area. This visual comparison reveals good agreement in the main lithological boundaries and unit distributions, supporting the reliability of the model’s outputs.

4.5. Field Validation and Error Correction Capability

High-altitude, deeply dissected regions present severe mapping challenges, including steep relief, sparse accessibility, extreme climate, and uneven outcrop visibility [44]. Conventional surveys often yield discontinuous, sample-based observations that cannot capture complete spatial patterns. Portable UAVs, with low-altitude oblique photogrammetry, offer centimeter-level, multi-view coverage of otherwise inaccessible areas, bridging gaps between field photography and remote sensing continuity [45].

Following model training, a field validation was conducted in a representative lithological zone in the eastern study area (Figure 13).

The lithological distribution within the validation area exhibits a clear spatial zonation pattern. Marble is widely distributed across mid-to-upper-slope sections and exposed mountain bodies, appearing gray-white to light gray in color with medium-to-thick-bedded structures. Diorite primarily intrudes along fault zones into marble strata, showing a gray-black color and massive structures. Sandstone is concentrated on gentle lower slopes and the mid-sections of valley slopes, characterized by light yellow to brown coloration and thick-bedded structures. Quaternary sediments fill the lowest depressions of river valleys and gullies, consisting mainly of gravel, sand, and silt, with loose structures and poor particle sorting. The lithological contact boundaries and overall distribution patterns are jointly controlled by river valley erosion and tectonic activity.

During the field validation, we identified an area located to the north of the survey route (Figure 13d) that had been misclassified as marble during the initial manual annotation stage. This misclassification primarily stemmed from visual interpretation errors based on imagery: in the UAV orthophoto, the area exhibited a relatively light tone similar to typical marble, leading to failure in recognizing it as heavily weathered diorite. However, the deep learning-based lithology classification model predicted this area as diorite, and this identification was fully confirmed during the field survey.

This observation indicates that the deep learning model can classify lithologies by leveraging multidimensional image features, such as texture, spatial distribution patterns, and relationships with adjacent strata, and that its predictions can, to a certain extent, correct errors that arise from manual interpretation. In the context of remote-sensing-based geological surveys, this qualitative demonstration of “automatic error-correction” capability is significant in two respects. First, it can reduce the influence of subjective human judgment on lithological delineation, thereby enhancing the objectivity and consistency of the results. Second, when field conditions are constrained (e.g., steep terrain, obstructed viewpoints), model predictions can serve as a reference for geologists, reducing redundant fieldwork and the cost of post hoc revisions. The practical validation conducted in this study therefore demonstrates the potential of integrating UAV remote sensing with deep learning in geological surveying, not only as a tool for automated recognition but also as a means to improve geological data accuracy and correct human interpretation errors. Although limited to one validation area, this real-world field check on unseen terrain provides a practical, qualitative assessment of the model’s generalization and evidence against overfitting to the training data.
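One way to make such error correction systematic is to tabulate where the model and the manual annotation disagree and to prioritize the largest disagreement groups for field checks. The sketch below is a hypothetical workflow, not a procedure reported in this study; the function `disagreement_counts` and the toy label maps are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def disagreement_counts(manual: List[List[str]],
                        predicted: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count pixels per (manual label, predicted label) pair where the two differ."""
    counts: Counter = Counter()
    for manual_row, predicted_row in zip(manual, predicted):
        for m, p in zip(manual_row, predicted_row):
            if m != p:
                counts[(m, p)] += 1
    return dict(counts)


# Toy 3 x 4 label maps (invented for illustration)
manual = [
    ["marble", "marble", "marble", "sandstone"],
    ["marble", "marble", "marble", "sandstone"],
    ["diorite", "diorite", "marble", "sandstone"],
]
predicted = [
    ["marble", "diorite", "diorite", "sandstone"],
    ["marble", "diorite", "diorite", "sandstone"],
    ["diorite", "diorite", "marble", "sandstone"],
]
flags = disagreement_counts(manual, predicted)
```

A compact block labeled marble by hand but predicted as diorite, as in the toy example, is the kind of pattern that would merit a field visit, since either the annotation or the prediction may be wrong.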

4.6. Practical Implications for Geological Mapping

Combining multi-view, high-resolution UAV imagery with the CA-DeepLabV3+ segmentation framework enables precise lithology identification under extreme topographic constraints. The approach reliably recognizes small-scale lithological variations, leverages the fine detail of UAV data, and addresses limitations of accessibility and coverage in mountainous gorge regions.

Regarding its generalizability, the model typically requires retraining or fine-tuning when applied to new geological settings with different lithologies or significantly different environmental conditions, because deep learning models learn patterns specific to their training data. However, the use of a lightweight backbone (MobileNetV2) and transfer learning facilitates faster adaptation to new datasets. For readers unfamiliar with UAV-based deep learning workflows, it is important to clarify that the initial identification of lithologies in the training area, as detailed in Section 2.2, relies on a combination of expert geological guidance, existing large-scale geological maps, and meticulous visual interpretation of the UAV orthophotos. In areas entirely lacking prior lithological information, the framework would initially require a geologist to provide a limited set of high-confidence labels for new rock types (e.g., through active learning or weakly supervised methods) to guide the model’s learning process.

However, it is crucial to discuss the method’s applicability under varying conditions. For instance, its performance in areas with dense vegetation cover is inherently limited when relying solely on visible light imagery, as vegetation can obscure underlying lithological features. In such scenarios, integration with multispectral or hyperspectral data, or bare-earth models derived from LiDAR, would be necessary to effectively delineate obscured rock units. Similarly, distinguishing between lithologies that exhibit very subtle visual characteristics, such as various types of quartz schist, mica schist, and calcite schist in metamorphic terrains, presents a significant challenge for RGB-based deep learning models. While the Coordinate Attention mechanism aids in capturing nuanced textural and spatial differences, achieving robust discrimination in these highly ambiguous cases might necessitate incorporating additional geological features or more advanced spectroscopic data.

Despite these considerations, the framework’s demonstrated capability to produce high-precision lithological maps in the complex geomorphic conditions of the Ququleke area is substantial. This methodology can be extended to various applications, including geological mapping, mineral resource prediction, and hazard assessment in other high-altitude, deeply dissected terrains, thereby enhancing both accuracy and efficiency in field geological surveys. Future work will focus on integrating multi-sensor data and advanced annotation strategies to broaden the method’s robustness and applicability across a wider spectrum of complex geological environments.

5. Conclusions

In this study, a novel methodology was presented that pioneers the integration of portable UAV oblique photogrammetry with a Coordinate Attention-enhanced DeepLabV3+ semantic segmentation framework for automated lithological mapping in complex geological scenes. High-resolution UAV orthophotos of the Ququleke area were utilized to construct a pixel-level lithology dataset comprising sandstone, marble, diorite, and Quaternary sediments. The CA-DeepLabV3+ model, an enhanced version of the DeepLabV3+ framework, effectively integrates the Coordinate Attention mechanism to strengthen spatial position encoding and fine-scale feature extraction, crucial for detailed lithological discrimination.

Experimental results confirmed substantial performance improvements, with overall accuracy of 97.95%, mean accuracy of 97.80%, and mIoU of 95.71%, outperforming all comparative models. The CA mechanism effectively preserved boundary details and fine textures, significantly improving the recognition of visually similar lithologies such as Quaternary sediments. Field validation further demonstrated that the model can correct manual mapping errors and deliver reliable lithological interpretation under harsh terrain conditions.

Overall, the proposed UAV-based deep learning approach provides an efficient and accurate solution for automatic lithological mapping in high-altitude and inaccessible regions. It offers a valuable technical reference for geological investigation, mineral resource assessment, and the broader application of low-altitude remote sensing in field geology.

Author Contributions

Conceptualization, J.L., Z.W. and X.G.; methodology, J.L.; software, J.L., Z.W. and Y.C.; formal analysis, J.L., X.G. and M.S.; investigation, J.L. and Z.W.; data curation, J.L., Y.C. and Y.Z.; writing—original draft preparation, J.L., Z.W. and X.G.; writing—review and editing, M.S., Y.Z. and Z.Z.; visualization, J.L.; funding acquisition, Z.W. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the special fund of the Deep Earth Probe and Mineral Resources Exploration-National Science and Technology Major Project (No. 2025ZD1006208).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request. The source code for the proposed model is publicly available at https://github.com/LJZ1129/Semantic-segmentation-model-code.git (accessed on 25 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Takodjou Wambo, J.D.; Nomo Negue, E.; Traore, M.; Asimow, P.D.; Ganno, S.; Pour, A.B.; Yamgouot Ngounouno, F.; Nzenti, J.P. Integrating multispectral remote sensing and geological investigation for gold prospecting in the Borongo-Mborguene gold field, eastern Cameroon. Adv. Space Res. 2024, 74, 4574–4597.
2. Carrino, T.A.; Crósta, A.P.; Toledo, C.L.B.; Silva, A.M. Hyperspectral remote sensing applied to mineral exploration in southern Peru: A multiple data integration approach in the Chapi Chiara gold prospect. Int. J. Appl. Earth Obs. Geoinf. 2018, 64, 287–300.
3. Chen, J.; Dowman, I.; Li, S.; Li, Z.; Madden, M.; Mills, J.; Paparoditis, N.; Rottensteiner, F.; Sester, M.; Toth, C.; et al. Information from imagery: ISPRS scientific vision and research agenda. ISPRS J. Photogramm. Remote Sens. 2016, 115, 3–21.
4. Scholtz, A.; Kaschwich, C.; Krüger, A.; Kufieta, K.; Schnetter, P.; Wilkens, C.S.; Krüger, T.; Vörsmann, P. Development of a new multi-purpose UAS for scientific application. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, XXXVIII-1/C22, 149–154.
5. Colomina, I.; Molina, P. Unmanned aerial systems for photogrammetry and remote sensing: A review. ISPRS J. Photogramm. Remote Sens. 2014, 92, 79–97.
6. Wei, W.; Chen, J.; Wang, F.; Li, J.; Wang, J.; Zhou, W.; Wu, J.; Xiao, Y. Digital outcrop based on UAV oblique photogrammetry 3D modeling and its application in profile measurement and research—A case study of the Xiaheyan section, Zhongwei City, Ningxia. Comput. Tech. Geophys. Geochem. Explor. 2024, 46, 490–498.
7. Sang, X. Application of UAV and Deep Learning in Geological Surveys. Ph.D. Thesis, Jilin University, Changchun, China, 2018.
8. Gupta, A.K.; Mathur, P.; Sheth, F.; Travieso-Gonzalez, C.M.; Chaurasia, S. Advancing geological image segmentation: Deep learning approaches for rock type identification and classification. Appl. Comput. Geosci. 2024, 23, 100192.
9. Wei, Y.; Huang, Y.; Liu, N.; Li, Y.; Liu, D.; Zhang, J. Prospects and exploration of artificial-intelligence-aided geological mapping in practical teaching—A case study of Zhoukoudian geological field course. China Geol. Educ. 2024, 33, 101–105.
10. Xu, Z.; Ma, W.; Li, S.; Lin, P.; Liang, F.; Xu, G.; Li, S.; Han, T.; Shi, H. Lithology identification: Methods, current status, and trends toward intelligent development. Geol. Rev. 2022, 68, 2290–2304.
11. Li, X.; Li, H. A new method of identification of complex lithologies and reservoirs: Task-driven data mining. J. Pet. Sci. Eng. 2013, 109, 241–249.
12. Chen, G.; Liang, S.; Wang, J.; Sui, S. Application of convolutional neural networks in lithology identification. Well Logging Technol. 2019, 43, 129–134.
13. Chen, Z.; Yuan, F.; Li, X.; Zheng, C. Interpretability study of deep transfer learning from granitoid images of the Dabie Mountains. Geol. Rev. 2023, 69, 2263–2273.
14. Liu, C.; Zhao, X.; Liang, N.; Zhang, Y. Lithology recognition and classification based on ResNet-50 and transfer learning. Comput. Digit. Eng. 2021, 49, 2526–2530+2578.
15. Hou, Y.; Quan, J.; Wang, H. A review on the development of deep learning. Ship Electron. Eng. 2017, 37, 5–9+111.
16. Liu, H.; Zuo, R. Remote sensing image classification and lithology recognition based on fully convolutional neural networks. In Proceedings of the 1st National Conference on Mineral Exploration, Anhui, China, 11–13 October 2021; p. 510.
17. Xu, H. Research on High-Resolution Remote Sensing Image Classification Method Based on U-Net Deep Learning Model. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2018.
18. Luo, S.; Yin, S.; Chen, J.; Wu, Y.; Chen, X. Lithology Identification of UAV Oblique Photography Images Based on Semantic Segmentation Neural Network Algorithm. Math. Geosci. 2023, 56, 1053–1072.
19. Wang, Y.; Guo, D.; Wang, Y.; Shuai, H.; Li, Z.; Ran, J. Improved DeepLabV3+ for UAV-Based Highway Lane Line Segmentation. Sustainability 2025, 17, 7317.
20. Gong, Y.; Li, R.; Liu, Y.; Wang, J.; Cao, B.; Fu, X.; Li, R.; Chen, D.Z. MR2CPPIS: Accurate prediction of protein-protein interaction sites based on multi-scale Res2Net with coordinate attention mechanism. Comput. Biol. Med. 2024, 176, 108543.
21. Sun, W.; Li, Y.; Feng, H.; Weng, X.; Ruan, Y.; Fang, K.; Huang, L. Lightweight and accurate aphid detection model based on an improved deep-learning network. Ecol. Inform. 2024, 83, 102794.
22. Wei, D.; Chang, Y.; Kuang, H. Extraction and spatiotemporal analysis of impervious surfaces in Chongqing based on enhanced DeepLabv3. Sci. Rep. 2025, 15, 9807.
23. Li, Q.; Kong, Y. An Improved SAR Image Semantic Segmentation Deeplabv3+ Network Based on the Feature Post-Processing Module. Remote Sens. 2023, 15, 2153.
24. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
25. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014, arXiv:1411.4038.
26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; pp. 234–241.
27. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 660–669.
28. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062.
29. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision–ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11213, pp. 833–851.
30. Gao, J.; Man, R.; Shi, H.; Zhao, X.; Ma, F.; Han, B.; Zhu, W.; Xue, C.; Yu, L.; Sun, Y.; et al. Primary halo characteristics and deep ore prospecting prediction of the Qu Kulek East gold-antimony deposit in the Eastern Kunlun Mountains. Geol. Explor. 2024, 60, 944–954.
31. Ma, X.; Zhao, W.; Zhao, X.; Ma, F.; Han, B.; Zhu, W.; Xue, C.; Man, R.; Liu, Y.; Chen, Z.; et al. Geological characteristics, ore-controlling structures, and genesis of the Qu Kulek East gold-antimony deposit in the Eastern Kunlun Mountains. Miner. Depos. 2024, 43, 865–876.
32. Xing, L. Genetic Study of the Large Qu Kulek East Au–Sb Deposit in the Eastern Kunlun Mountains. Ph.D. Thesis, Chengdu University of Technology, Chengdu, China, 2023.
33. Liu, B.; Lian, Z.; Chen, W.; Zheng, H. Structural superposition halo characteristics and deep ore prospecting prediction of the Qu Kulek East gold-antimony deposit in the Eastern Kunlun Mountains, Xinjiang. Xinjiang Geol. 2025, 43, 54–60.
34. Sun, Y. Prospecting potential for gold-copper deposits in the Qu Kulek area, Qiemo County, Xinjiang. World Nonferrous Met. 2024, 7, 103–105.
35. Bakirci, M. Efficient air pollution mapping in extensive regions with fully autonomous unmanned aerial vehicles: A numerical perspective. Sci. Total Environ. 2024, 909, 168606.
36. Lu, D.; Liu, D.; Liu, Q.; Zhang, Z.; Chai, Z. Satellite imagery experiment on Quaternary geology visual interpretation in the Chengdu region. Geol. Rev. 1981, 1, 79–86+89.
37. Zhao, M. Data processing using Pix4D Mapper software in UAV aerial photography and engineering geological investigation. Hydropower Stn. Des. 2017, 33, 47–48+62.
38. Koch, T.; Körner, M.; Fraundorfer, F. Automatic and Semantically-Aware 3D UAV Flight Planning for Image-Based 3D Reconstruction. Remote Sens. 2019, 11, 1550.
39. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
40. Huang, X.; Zhang, S.; Li, J. Review of image segmentation technology research. Equip. Mach. 2021, 2, 6–9.
41. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. arXiv 2015, arXiv:1506.02025.
42. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
43. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision–ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11211, pp. 3–19.
44. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Martinez-Gonzalez, P.; Garcia-Rodriguez, J. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65.
45. Zheng, M.; Song, Y.; Tang, J.; Liu, Z.; Hu, G.; Hu, Y. Experimental study and future prospects of UAV geological surveys in high-altitude and inaccessible areas of the Qinghai–Tibet Plateau. Geol. Rev. 2022, 68, 1423–1438.

Figure 1. Overview of the Study Area. (a) The northern part of the Kunlun Mountains is located at the southern part of Xinjiang Province; (b) Digital Elevation Model of the Eastern Kunlun Mountain; (c) Locations of training and test sets in the study area.

Figure 2. Geological map of the study area, modified according to Dr. Xing Ling’s dissertation [32]. (a) Schematic Map of Regional Tectonics of the Study Area; (b) Large-Scale Geological Map of the Study Area; (c) Detailed Small-Scale Geological Map of the UAV Survey Area (provided by local geological workers).

Figure 3. Route layout diagram of the UAV survey. The blue rectangles represent the planned flight paths and image footprints; black vertical lines indicate the camera positions along each flight path; the underlying grayscale surface shows the 3D terrain model of the study area generated from UAV photogrammetry. The axes in the lower right corner indicate the spatial orientation (X, Y, Z).

Figure 4. Three-dimensional model of the study area.

Figure 5. Orthophoto of the study area.

Figure 6. Examples of UAV orthophotos (top row) and corresponding lithology label maps (bottom row). The label colors indicate lithology classes: light green—marble; light yellow—Quaternary sediments; pale beige—sandstone; pink—diorite.

Figure 7. Data augmentation examples. (a) Original image from the training set; (b) image after random cropping; (c) image after random rotation; (d) image after cutout occlusion, where the black square represents the masked region randomly removed to simulate missing data; (e) image after color jitter.

Figure 8. DeepLabV3+ network model architecture.

Figure 9. Coordinate Attention Mechanism Module.

Figure 10. Training loss curves of each model.

Figure 11. Comparison of attention weight distributions between DeepLabV3+ and CA-DeepLabV3+ for the same lithological region. The color scale ranges from blue to red, indicating low to high attention intensity. (a) Original image; (b) attention distribution of DeepLabV3+; (c) attention distribution of CA-DeepLabV3+; (d) overlay comparison of the two models, where the left image shows the DeepLabV3+ attention map overlaid on the original image with 40% transparency, and the right image shows the CA-DeepLabV3+ attention map overlaid on the original image with 40% transparency.

Figure 12. Comparison of recognition results with other semantic segmentation methods.

Figure 13. Field validation. (a) Survey route and survey points in the validation area; (b) a1—Diorite-based gully; (c) a2—Diorite; (d) a3—Fractured diorite; (e) a4—Marble.

Table 1. Efficiency Comparison of CA-DeepLabV3+ with Different Backbone Networks.

Backbone Network	Parameter Count (Millions)	Inference Time (s/Image)
MobileNetV2	4.54	0.0046
ResNet50	27.08	0.0052
ResNet101	46.07	0.0086
EfficientNet-B3	11.70	0.0112
timm-mobilenetv3_large_100	4.80	0.0056

Table 2. Comparison of Evaluation Metrics for Different Semantic Segmentation Networks Based on the Field Lithology Dataset.

Network	OA (%)	MeanAcc (%)	mIoU (%)
FPN	96.15	96.26	92.19
FCN	96.52	96.50	92.99
U-Net	94.95	95.02	90.07
PSPNet	95.08	94.95	90.03
DeepLabV1	96.68	96.83	93.32
DeepLabV3+	95.08	95.04	90.23
CA-DeepLabV3+	97.95	97.80	95.71

Table 3. Comparison of CIoU of Different Semantic Segmentation Networks Based on Field Lithology Dataset.

Network	Quaternary Sediments (%)	Sandstone (%)	Marble (%)	Diorite (%)
FPN	83.26	97.23	89.79	90.88
FCN	85.31	97.58	90.73	91.40
U-Net	79.04	96.96	86.68	87.76
PSPNet	77.55	96.91	87.57	88.30
DeepLabV1	86.06	97.82	91.00	91.87
DeepLabV3+	79.24	96.40	87.37	88.33
CA-DeepLabV3+	90.66	98.48	94.62	94.86

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
