Article

Data-Model Complexity Trade-Off in UAV-Acquired Ultra-High-Resolution Remote Sensing: Empirical Study on Photovoltaic Panel Segmentation

1 State Key Laboratory of Soil Pollution Control and Safety, Zhejiang University, Hangzhou 310058, China
2 Center for Intelligent Ecology & Digital Remote Sensing, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310058, China
3 Engineering Management Department, Hangzhou New Energy Investment and Development Co., Ltd., Hangzhou 310051, China
4 Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(9), 619; https://doi.org/10.3390/drones9090619
Submission received: 12 July 2025 / Revised: 24 August 2025 / Accepted: 1 September 2025 / Published: 3 September 2025


Highlights

What are the main findings?
  • Ultra-high-resolution offsets the benefit of adding spectral bands.
  • State-of-the-art architectures do not guarantee better segmentation performance.
What is the implication of the main finding?
  • A public ultra-high-resolution PV segmentation benchmark dataset is available.
  • ResNet50+UNet with spatially diverse data is recommended as a strong baseline.

Abstract

With the growing adoption of deep learning in remote sensing, the increasing diversity of models and datasets has made method selection and experimentation more challenging, especially for non-expert users. This study presents a comprehensive evaluation of photovoltaic panel segmentation using a large-scale ultra-high-resolution benchmark of over 25,000 manually annotated unmanned aerial vehicle image patches, systematically quantifying the impact of model and data characteristics. Our results indicate that increasing the spatial diversity of training data has a more substantial impact on training stability and segmentation accuracy than simply adding spectral bands or enlarging the dataset volume. Across all experimental settings, moderate-sized models (DeepLabV3_50, ResUNet50, and SegFormer B4) often provided the best trade-off between segmentation performance and computational efficiency, achieving an average Intersection over Union (IoU) of 0.8966, comparable to the 0.8970 of larger models. Moreover, model architecture plays a more critical role than model size: the ResUNet models consistently achieved higher mean IoU than both the DeepLabV3 and SegFormer models, with average improvements of 0.047 and 0.143, respectively. Our findings offer quantitative guidance for balancing architectural choices, model complexity, and dataset design, ultimately promoting more robust and efficient deployment of deep learning models in high-resolution remote sensing applications.

1. Introduction

1.1. Background

Recent advancements in unmanned aerial vehicle (UAV) platforms and sensor technologies have facilitated the extensive real-world application of ultra-high-resolution (UHR) remote sensing [1,2,3]. In comparison to conventional satellite systems and manned aerial platforms, UAV remote sensing offers unparalleled advantages in agricultural monitoring, infrastructure inspection, and environmental hazard assessment, attributable to its operational agility and centimeter-scale spatial resolution [4,5]. For instance, imagery obtained from the MicaSense RedEdge-MX UAV exhibited a relative root mean square error (rRMSE) up to 20.11% lower than that of Sentinel-2 and PlanetScope when evaluated against ground spectroradiometer data, particularly in regions with complex land-cover transitions. In addition, UAV-derived normalized difference vegetation index (NDVI) also exhibited greater sensitivity to fine-scale land cover and shadow variations, thus outperforming satellite-based estimates in spatially heterogeneous environments [6]. However, the proliferation of UHR data acquisition has markedly increased collection costs, storage demands, annotation workload, and computational complexity. These factors constitute a significant bottleneck to large-scale deployment.
Deep learning has increasingly supplanted traditional machine learning methods, such as Random Forest and Support Vector Machines, in remote sensing image segmentation [7,8]. Three core architectural paradigms now dominate the field. First, convolutional encoder–decoder frameworks (e.g., U-Net variants) use skip connections to combine low-level spatial details with high-level semantic features [9,10]. Their parameter counts typically grow with the depth of the backbone network (e.g., ResNet18 vs. ResNet101) [11]. Second, multi-scale context aggregation models, such as DeepLabv3+, extend CNN-based architectures by incorporating dilated convolutions and pyramid pooling modules (e.g., atrous spatial pyramid pooling, ASPP), enabling effective modeling of contextual information across varying scales [12]. Third, transformer-based models (e.g., SegFormer) employ self-attention mechanisms to learn global dependencies across spatial or spectral dimensions [13]. These architectures generally require larger datasets to avoid overfitting and to fully realize their capacity [14,15]. Importantly, model performance is highly task-dependent. For instance, SegFormer’s global attention may outperform U-Net’s local convolutions for multispectral data, while DeepLabv3 achieves accuracy comparable to U-Net+ResNet101 with fewer parameters at 10 cm resolution [16,17,18]. At the same time, escalating model diversity and complexity introduce computational burdens, parameter optimization challenges, and overfitting risks, particularly in data-scarce scenarios like photovoltaic (PV) panel detection in specific regions, where complex models converge to suboptimal minima and compromise generalization [19,20].
The integration of high-resolution and multidimensional data with advanced deep learning models theoretically creates a dual-drive synergy, whereby finer-grained data provides detailed information, while sophisticated architectures are tasked with extracting higher-level abstract features. However, practical applications frequently reveal asymmetric performance gains, wherein indiscriminate increases in data or model complexity may lead to diminishing marginal returns. For instance, the regular geometric patterns of PV panels induce significant edge information redundancy in high-resolution data, where overcomplicated models (e.g., Unet+ResNet101) may suffer generalization degradation due to parameter redundancy [21]. Conversely, even state-of-the-art architectures like SegFormer-B5 are unable to recover texture details lost in low-resolution satellite data [22,23,24]. This underscores the critical need to quantify the synergies between data and model complexities to fully harness deep learning’s potential in high-resolution remote sensing.

1.2. Related Works

For PV segmentation, current comparative studies exhibit three major limitations that hinder generalizability and practical relevance. The first is architectural and parameter homogeneity. Most existing comparisons focus on intra-family tweaks (e.g., attention blocks, loss functions) but rarely test scaling effects or cross-paradigm baselines (e.g., Transformers), which may exhibit distinct sensitivities to spectral input, resolution, and data volume [23,25,26]. For instance, Kleebauer et al. [23] conducted a comprehensive analysis using publicly available datasets, comparing the performance of DeepLabV3 with a ResNet101 backbone across different training hyperparameters (e.g., batch size, loss function, stride), and assessed its adaptability to spatial resolutions ranging from 0.1 m to 3.2 m. The second is insufficient assessment of data adaptability. Most comparative analyses have focused on RGB data, emphasizing cross-platform resolution differences [22,23,27]. For example, Jiang et al. [22] and Guo et al. [27] evaluated model performance across resolutions from 0.1 m to 0.8 m using RGB aerial and satellite imagery. However, they rarely consider intra-platform variation, such as the fine-grained resolution gradients within UAV imagery (e.g., 5 cm, 10 cm, 15 cm, up to 30 cm/pixel). Such variation can lead to significant differences in data acquisition burden, spatial detail, and data volume. In addition, the impact of input spectral dimensionality (e.g., RGB vs. multispectral) has not been systematically quantified. The third is the oversimplification of evaluation metrics. Current comparisons predominantly emphasize accuracy metrics such as Intersection over Union (IoU) and F1-score, while overlooking practical constraints like training time, memory usage, and inference speed. This limits their applicability in computationally demanding UHR scenarios, particularly for non-expert users with constrained resources and limited technical expertise.

1.3. Motivations and Contributions

To address the lack of systematic evaluation of spectral band sensitivity and model parameter scalability, we focused on the practical problem of selecting data–model combinations for UHR UAV segmentation that maximize accuracy while satisfying data acquisition and model computational constraints. To tackle this problem, we constructed a benchmark PV array segmentation dataset from real UAV acquisitions, encompassing multi-spectral imagery across a fine-grained resolution gradient (10 cm to 25 cm/pixel). Using this dataset, we conducted a series of orthogonal experiments to quantitatively evaluate model sensitivity to spectral bands, spatial resolution, and training data volume. We systematically benchmarked parameter-scaled variants of three core deep learning architectures, evaluating both computational cost and segmentation accuracy. Our results provide quantitative insights to guide practical decision-making in UAV-based segmentation, promoting efficient use of data and computational resources.

2. Materials and Methods

2.1. Data Collection

This study employed the DJI Phantom 4 Multispectral UAV for data acquisition, equipped with one RGB imaging sensor (blue, green, and red bands) and five discrete single-band sensors capturing multispectral imagery (blue, green, red, red edge, and near-infrared bands), each with 2.08 million effective pixels. Based on a prior review of UAV-based UHR remote sensing, which summarizes typical dataset configurations in terms of patch size, dataset scale, and spatial resolution, we determined that a ground sampling distance (GSD) of 10 cm would be optimal for balancing spatial detail and acquisition flexibility [28]. In addition, the 10 cm resolution provides unambiguous boundary delineation between PV modules and the surrounding vegetation from the perspective of manual visual interpretation (Figure S1). In 2022, we conducted a multi-site UAV survey at ten PV power plants and non-adjacent areas in Zhejiang Province, flying at an altitude of 180 m to achieve the target GSD and desired dataset scale (Figure 1a). This campaign produced 20 orthomosaic datasets, covering a total area of 31.8 km² and encompassing diverse meteorological, illumination, and vegetation conditions (referred to as the 10 cm cross-site dataset). At one of the PV power plants, supplementary flights at altitudes of 275 m, 370 m, and 465 m (resulting in GSDs of 15 cm/pixel, 20 cm/pixel, and 25 cm/pixel, respectively) were conducted in June and July to evaluate resolution-dependent segmentation performance (referred to as the multi-resolution single-site dataset). This 10–25 cm resolution span, covering a 1.8 km² area, was chosen to ensure consistent illumination and environmental conditions across flights. Real-Time Kinematic (RTK) positioning ensured centimeter-level georeferencing accuracy, with flight parameters set to 75% frontal overlap and 60% sidelap for optimal spatial registration [29,30]. The post-processing phase included geometric correction and image stitching using DJI Terra (version 3.7), which automatically reads the RTK positioning data and interior orientation parameters of each image to achieve precise pixel-level alignment between the RGB and multispectral layers. Radiometric calibration, utilizing reference panels with reflectance values of 25%, 50%, and 75%, was employed to standardize spectral responses for each flight.

2.2. Data Pre-Processing

All PV plant orthomosaics underwent meticulous manual annotation through precise vector delineation of PV array boundaries, resulting in high-quality segmentation datasets. These vector annotations were rasterized at the native resolution of each orthomosaic prior to patch generation. Following annotation, the orthomosaics were partitioned into 512 × 512 pixel patches using non-overlapping cropping to minimize redundancy in PV panel information. Subsequently, since the raw orthophotos contained no-data areas, we eliminated invalid patches containing more than 5% null-value pixels (Figure S2c). The resulting 10 cm cross-site dataset contains 17,953 validated image–label pairs. A separate multi-resolution single-site dataset was also generated, consisting of 4270 pairs at 10 cm/pixel, 1878 at 15 cm/pixel, 1014 at 20 cm/pixel, and 669 at 25 cm/pixel. All images were stored in JPG format with corresponding PNG masks encoding the PV panel and non-panel classes.
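As a concrete illustration of this pre-processing step, the sketch below implements the non-overlapping 512 × 512 cropping and the 5% null-pixel filter described above, assuming the orthomosaic and its rasterized mask are already loaded as NumPy arrays; the no-data value of 0 and the function name are illustrative assumptions, not the authors' released code.

```python
import numpy as np

PATCH, NODATA, MAX_NULL_FRAC = 512, 0, 0.05  # no-data value of 0 is assumed

def iter_valid_patches(image: np.ndarray, mask: np.ndarray):
    """Yield non-overlapping 512 x 512 image/label pairs, skipping any
    patch whose fraction of null-value pixels exceeds 5%."""
    h, w = image.shape[:2]
    for r in range(0, h - PATCH + 1, PATCH):        # non-overlapping grid
        for c in range(0, w - PATCH + 1, PATCH):
            img = image[r:r + PATCH, c:c + PATCH]
            # a pixel is "null" when all of its bands equal the no-data value
            null_frac = np.mean(np.all(img == NODATA, axis=-1))
            if null_frac <= MAX_NULL_FRAC:
                yield img, mask[r:r + PATCH, c:c + PATCH]
```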

2.3. Experimental Design

To systematically evaluate the impact of data and model variations on the training and performance of PV panel segmentation, this study conducted a series of comparative experiments. Model variations were systematically investigated across two principal dimensions: architecture and parameter size. Three representative deep learning segmentation models were selected for architectural comparison: (1) U-Net, which employs an encoder–decoder architecture with skip connections to integrate multi-scale features and preserve spatial details; (2) DeepLabV3, which utilizes dilated convolutions combined with an ASPP module to capture multi-scale contextual information and enlarge the receptive field; and (3) SegFormer, which leverages Transformer-based self-attention mechanisms coupled with lightweight multi-layer perceptrons (MLPs) to facilitate efficient global feature representation. To control for the influence of parameter scale, specific configurations were implemented. Within the flexible U-Net and DeepLabV3 frameworks, ResNet18, ResNet50, and ResNet101 backbones were utilized, generating three graded versions with approximate parameter sizes of 12 million (12 M), 42 million (42 M), and 60 million (60 M), respectively. For SegFormer, the officially provided B1, B4, and B5 versions were directly selected to ensure that their parameter scales were comparable to the aforementioned ResNet-based variants. In total, nine model configurations were included in this study: DeepLabV3_18, DeepLabV3_50, DeepLabV3_101; ResUNet18, ResUNet50, ResUNet101; and SegFormer B1, B4, and B5. These represent three distinct model families across multiple parameter scales.
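For reference, a minimal sketch of how these nine configurations could be assembled is shown below, assuming the segmentation_models_pytorch package for the ResNet-backbone U-Net and DeepLabV3 variants and the Hugging Face transformers implementation of SegFormer; the paper does not state which implementations were actually used, so this is one plausible realization rather than the authors' code.

```python
import segmentation_models_pytorch as smp
from transformers import SegformerForSemanticSegmentation

def build_model(family: str, size: str, in_channels: int = 3):
    """Return one of the nine configurations, e.g. build_model('ResUNet', '50')."""
    if family == "ResUNet":
        return smp.Unet(encoder_name=f"resnet{size}",
                        in_channels=in_channels, classes=1)
    if family == "DeepLabV3":
        return smp.DeepLabV3(encoder_name=f"resnet{size}",
                             in_channels=in_channels, classes=1)
    # SegFormer B1/B4/B5 from the pre-trained MiT encoder checkpoints;
    # input layers that do not match (e.g., 4- or 5-band inputs) are
    # re-initialized rather than loaded from the checkpoint.
    return SegformerForSemanticSegmentation.from_pretrained(
        f"nvidia/mit-{size.lower()}", num_labels=1,
        num_channels=in_channels, ignore_mismatched_sizes=True)
```

Under this sketch, ResUNet50 would be build_model("ResUNet", "50") and SegFormer B4 would be build_model("SegFormer", "B4"), with in_channels raised to 4 or 5 for the multispectral input groups described below.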
Data variations were examined across three critical aspects pertinent to UHR remote sensing: input spectral bands, training data volume, and spatial resolution. To assess the influence of spectral band combinations, three distinct datasets were constructed: (1) an RGB Group, representing the most commonly utilized deep learning input format (Figure 1b); (2) an RGB+NIR Group, incorporating the near-infrared (NIR) band widely used in both satellite and UAV-based environmental monitoring and precision agriculture (Figure 1c); and (3) an RGB+NIR+Red Edge Group, which further includes the red edge (RE) band, particularly valuable for vegetation health monitoring due to the unique reflectance characteristics of vegetation in this spectral region (Figure 1d). The RGB+NIR+RE configuration was selected not only for its broad support across commercial UAV platforms (e.g., MicaSense, DJI, Parrot, and Sentera) but also for its potential to enhance the discernibility of fine boundaries between the PV modules and surrounding vegetation, given its proven effectiveness in vegetation analysis [31,32,33]. Furthermore, these bands correspond closely to those used in widely adopted satellite platforms such as Sentinel-2 and Landsat-8, which facilitates the transfer, adaptation, or comparison of algorithms developed for satellite data [34,35].
To investigate the effect of training sample size on model performance, 20% of the original dataset was reserved for testing and evaluation, with the remaining data allocated for training. Stratified random sampling was then conducted on the training pool at three distinct scales relative to its full size: 100% (yielding a large dataset with 14,363 patches), 62.5% (yielding a medium dataset with 8976 patches), and 25% (yielding a small dataset with 3590 patches). Importantly, the ratio of PV-containing patches to non-PV patches was rigorously maintained throughout all sampling procedures to ensure consistent class distribution and mitigate potential impacts of data imbalance. Finally, the influence of spatial resolution was assessed through an additional comparative test using the specially constructed multi-resolution single-site dataset derived from actual UAV flight data. This design simulates practical constraints in UAV operations, where selecting a lower spatial resolution reduces the number of usable training patches due to larger ground sampling distance. While additional flights could theoretically compensate for the reduction in patch quantity, such redundancy is generally impractical in operational scenarios given flight time, data storage, and labor limitations.
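As an illustration of the stratified sampling described at the start of this subsection, the sketch below draws a fixed fraction of the training pool while preserving the ratio of PV-containing to non-PV patches; the index arrays, seed, and function name are hypothetical stand-ins for whatever bookkeeping the actual pipeline used.

```python
import numpy as np

def stratified_subset(pv_ids, bg_ids, fraction, seed=0):
    """Sample `fraction` of the training pool while keeping the ratio of
    PV-containing to non-PV patches fixed."""
    rng = np.random.default_rng(seed)
    def take(ids):
        ids = np.asarray(ids)
        return rng.choice(ids, int(round(len(ids) * fraction)), replace=False)
    return np.concatenate([take(pv_ids), take(bg_ids)])

# e.g., the medium (62.5%) and small (25%) subsets of the training pool:
# medium_ids = stratified_subset(pv_ids, bg_ids, 0.625)
# small_ids  = stratified_subset(pv_ids, bg_ids, 0.25)
```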

2.4. Training and Evaluation Framework

Based on insights from a prior PV segmentation algorithm comparison [23] and our preliminary experiments (Table S1), we employed the optimal hyperparameter combination of BCEWithLogitsLoss as the loss function, AdamW as the optimizer, and an initial learning rate of 1 × 10⁻⁴. IoU was adopted as the primary evaluation metric during the training phase [22,23]. To mitigate class imbalance caused by the limited spatial representation of PV panels in the imagery, we applied a positive weight of 5.0 to recalibrate gradient contributions. The learning rate was dynamically adjusted using the ReduceLROnPlateau scheduler to enable finer parameter refinements when the validation loss plateaued. Training terminated when both the BCEWithLogitsLoss and IoU on the validation set varied by less than 0.1% over a patience window of 7 epochs, or after 200 epochs. The model was then reverted to its best-performing state and its parameters were saved. To minimize potential biases arising from random initialization, each experiment was conducted in triplicate. If any replicate deviated by more than three times the sample variance, it was replaced with an additional run to ensure statistical reliability. For the 10 cm cross-site dataset, a total of 243 valid training runs were conducted, corresponding to 9 models × 3 input combinations × 3 dataset sizes × 3 repetitions. For the multi-resolution single-site dataset, 324 valid runs were conducted, corresponding to 9 models × 3 input combinations × 4 spatial resolutions × 3 repetitions. Batch sizes were adaptively configured based on model architecture and input data dimensions to optimize performance within computational constraints (Table 1). Furthermore, Automatic Mixed Precision (AMP) was activated exclusively during SegFormer training, achieving both accelerated throughput and enhanced batch capacity, in line with practical deployment scenarios [36]. Training stability was assessed by comparing the convergence trajectories of the loss and IoU metrics throughout optimization.
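The training configuration above can be condensed into a short PyTorch sketch, shown below with an assumed evaluate helper and a simplified single-metric plateau test standing in for the dual loss/IoU criterion; it is illustrative of the stated setup (pos_weight 5.0, AdamW at 1 × 10⁻⁴, ReduceLROnPlateau, 7-epoch patience, 200-epoch cap), not the authors' released code.

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

best_iou, best_state, stall = -1.0, None, 0
for epoch in range(200):                              # 200-epoch cap
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()

    val_loss, val_iou = evaluate(model, val_loader)   # assumed helper
    scheduler.step(val_loss)                          # finer steps on plateau
    if val_iou > best_iou * 1.001:                    # >0.1% relative improvement
        best_iou, stall = val_iou, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        stall += 1
        if stall >= 7:                                # 7-epoch patience window
            break
model.load_state_dict(best_state)                     # revert to best state
```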
During the testing phase, all three independent runs of each training configuration were systematically evaluated on the test set. A comprehensive set of quantitative metrics was computed, namely accuracy, precision, recall, F1-score, IoU, boundary F1-score, and boundary IoU [37]. Given the binary nature of the segmentation task and to minimize redundancy (Figure S3), only IoU and boundary IoU are rigorously reported. These metrics were supplemented by expert visual inspection of segmentation boundaries to ensure a comprehensive assessment of spatial accuracy and boundary fidelity. Furthermore, computational efficiency was evaluated in terms of inference speed (measured in frames per second, FPS) and GPU memory utilization.
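As an illustration of how the reported IoU and inference-speed figures can be obtained, the sketch below computes a dataset-level (micro-averaged) IoU and FPS in one pass; boundary IoU follows Cheng et al. [37] and is omitted here, the 0.5 threshold and raw-logit model output are assumptions, and peak GPU memory can be read separately via torch.cuda.max_memory_allocated().

```python
import time
import torch

@torch.no_grad()
def test_metrics(model, loader, device="cuda"):
    """Dataset-level (micro-averaged) IoU plus inference FPS in one pass."""
    model.eval().to(device)
    inter = union = n_images = 0
    torch.cuda.synchronize()                 # accurate GPU timing
    t0 = time.perf_counter()
    for images, masks in loader:
        logits = model(images.to(device))    # assumes raw logits output
        preds = torch.sigmoid(logits) > 0.5  # assumed binarization threshold
        masks = masks.to(device).bool()
        inter += (preds & masks).sum().item()
        union += (preds | masks).sum().item()
        n_images += images.size(0)
    torch.cuda.synchronize()
    fps = n_images / (time.perf_counter() - t0)
    return inter / union, fps
```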
Hardware configurations were kept consistent to maintain experimental rigor: both the 10 cm-resolution experiments and the multi-resolution gradient experiments were conducted on an NVIDIA RTX 6000 workstation equipped with 24 GB of VRAM. The computational environment standardized CUDA 12.1 and PyTorch 2.4.1 across all trials to ensure software consistency.

3. Results

3.1. Effects of Data Variations on Training

During the training phase, comparisons within models across different input band combinations revealed only slight increases in GPU memory usage, which did not affect batch size settings (Table 1). The loss and IoU curves showed no substantial differences across input configurations (Figures S4 and S5, Figure 2 and Figure 3). Furthermore, for each model, the total number of training epochs, total training time, and best validation IoU remained largely consistent regardless of the spectral bands used (Figure 4, Figures S6 and S7).
The training behavior varied noticeably across datasets of different sizes, particularly in terms of stability, efficiency, and performance outcomes. Larger training datasets resulted in more stable training processes, with smaller epoch-to-epoch fluctuations in both loss and IoU curves (Figure 2 and Figure S4). The reduction in fluctuations was more noticeable from the small to the medium dataset, while the difference between medium and large datasets was less significant.
For large models such as DeepLabV3_101, ResUNet101, and SegFormerB5, increasing the dataset size did not substantially alter the total number of training epochs (Figure 5a). In contrast, small and medium-sized models generally required fewer epochs as the dataset grew, especially within the SegFormer series. Training time increased significantly with dataset size across all model types and scales (Figure 5b). However, this increase did not consistently translate into improvements in best validation IoU (Figure 5c). From small to medium datasets, most models showed clear IoU gains (overall mean IoU increased from 0.844 to 0.864), except DeepLabV3_101. From medium to large datasets, significant improvements were observed only in ResUNet50, SegFormerB4, and SegFormerB5, with overall IoU rising marginally from 0.864 to 0.874. Notably, this marginal accuracy improvement came at a disproportionate computational cost: the average total training time per run rose from 2.749 h (small dataset) to 10.247 h (large dataset), a nearly 3.7-fold increase for just a 3.6% gain in best validation IoU.
The impact of spatial resolution on training stability, duration, and model performance varied notably across different models. When training with the multi-resolution dataset, coarser spatial resolutions inherently resulted in fewer image patches. This led to more pronounced epoch-to-epoch fluctuations in both loss and IoU curves, particularly evident even during the late-stage convergence of the 20 cm and 25 cm datasets (Figure 3 and Figure S5). For the DeepLabV3 models, the total number of training epochs at 10 cm and 15 cm was significantly higher than at 20 cm and 25 cm, while no significant difference was observed between 10 cm and 15 cm (Figure 6a). In contrast, the ResUNet models exhibited a gradual increase in total training epochs as resolution became coarser. The SegFormer models displayed a bimodal pattern, with both 10 cm and 25 cm configurations requiring more epochs than the 15 cm and 20 cm cases. In terms of total training time, all models followed a consistent pattern: the 10 cm dataset resulted in the longest training time, followed by 15 cm, while 20 cm and 25 cm yielded the shortest durations. No significant difference was found between the 20 cm and 25 cm groups (Figure 6b). As for validation performance, DeepLabV3 and SegFormer models showed a clear decline in best validation IoU with decreasing resolution. Specifically, the average IoU of DeepLabV3 dropped from 0.899 (at 10 cm resolution) to 0.665 (at 25 cm), while SegFormer decreased from 0.907 to 0.816 over the same range. In contrast, the ResUNet models were relatively unaffected by resolution changes, maintaining an average IoU of 0.960 across all resolutions.

3.2. Effects of Data Variations on Testing

Consistent with the training observations, testing results also showed no significant differences across input band combinations (Figure 7, Figures S8 and S9). In other words, augmenting the standard RGB input with NIR and Red Edge bands did not lead to substantial improvements or degradations in the training dynamics or segmentation performance across any of the model architectures examined. However, from the perspective of inference efficiency, incorporating additional spectral bands resulted in a noticeable reduction in FPS and a significant increase in GPU memory usage during the testing phase (Table 2).
In the evaluation on the test dataset, boundary IoU improved with increasing dataset size across all models except DeepLabV3_18 and ResUNet18 (Figure 8). IoU also increased for most models, except DeepLabV3_18, ResUNet18, and ResUNet50, with overall IoU rising marginally from 0.901 to 0.911. Overall, increasing the dataset size significantly improved test performance in terms of both IoU and boundary IoU.
A similar trend was also evident in the test results when dataset resolution changed (Figure 9). IoU consistently declined with lower resolution for all models, although adjacent resolution levels did not always show statistically significant differences. Specifically, the average IoU of DeepLabV3 dropped from 0.889 (at 10 cm resolution) to 0.702 (at 25 cm), while ResUNet remained relatively stable, decreasing only slightly from 0.978 to 0.967 over the same range. Boundary IoU of DeepLabV3 showed a similar decline, whereas that of ResUNet slightly increased, and SegFormer exhibited minimal change. Visual inspection of the segmentation outputs further supported this observation: as resolution became coarser, DeepLabV3 models failed to clearly distinguish the narrow gaps between adjacent PV arrays (Figure 10).

3.3. Effects of Model Variations on Training

Under identical GPU hardware, the maximum attainable training batch size varied significantly across model families (Table 1). However, peak GPU memory usage differed significantly only between the DeepLabV3 models and the ResUNet and SegFormer models. Learning curve analysis further revealed distinct convergence patterns (Figures S4 and S5, Figure 2 and Figure 3). The DeepLabV3 and ResUNet models exhibited broadly similar trajectories, while the SegFormer models diverged markedly. These differences were also reflected in the summary training statistics. While total training time did not differ significantly between the DeepLabV3 and ResUNet models, model family had a pronounced impact on both the total number of epochs and the best validation IoU (Figure 4, Figures S6 and S7). SegFormer models required substantially more training epochs and wall-clock time to converge (7.227 h on average) than ResUNet models (2.876 h). However, despite the increased training cost, SegFormer achieved a significantly lower best validation IoU (0.844) than ResUNet (0.933).
Differences among model sizes within the same architecture family were generally less pronounced than those observed across families. During training, most discrepancies were found between the smallest variant and the medium or large versions (Figure 4, Figures S6 and S7). Specifically, the total number of training epochs did not differ significantly among models within the DeepLabV3 or SegFormer families. However, within the ResUNet models, the ResUNet18 model required significantly more epochs to converge. Despite this, the increased average epoch duration in larger models meant that the smallest variants (namely DeepLabV3_18, ResUNet18, and SegFormer B1) consistently achieved convergence in significantly less total training time compared to their larger counterparts. In terms of best validation IoU, only DeepLabV3_18 performed significantly worse than the medium and large models within its family, with an average IoU of 0.720 compared to 0.843 for the latter. For the ResUNet and SegFormer models, no significant differences in validation IoU were observed among different model sizes.
When the evaluation was limited to the 10 cm resolution dataset, intra-family differences became even less pronounced (Figure 5 and Figure S4). No significant differences in the total number of training epochs were observed across model sizes within any architecture family. Differences in total training time were significant only between the smallest models and the medium or large variants.
When comparing model architectures with similar parameter scales, the training behaviors of the DeepLabV3 and ResUNet models appeared more closely aligned, particularly in contrast to the SegFormer models (Figure 4 and Figure 11). For instance, DeepLabV3_18 and ResUNet18, as well as DeepLabV3_50 and ResUNet50, exhibited no significant differences in total training time. Although DeepLabV3 and SegFormer exhibited dissimilar convergence behaviors during training, their best validation IoU did not significantly differ at either the medium or large model size. Nonetheless, both architectures consistently underperformed compared to the ResUNet models in terms of validation IoU. When the comparison was restricted to the 10 cm dataset, the differences became more nuanced (Figure S10). Specifically, DeepLabV3_50 and ResUNet50, as well as DeepLabV3_101 and ResUNet101, achieved comparable best validation IoU values, with no statistically significant differences observed between them. However, both significantly outperformed their SegFormer counterparts of similar model size.

3.4. Effects of Model Variations on Testing

Performance discrepancies among the model families persisted on the held-out test dataset. Both boundary IoU and IoU showed statistically significant differences, with ResUNet achieving the highest scores, DeepLabV3 the lowest, and SegFormer yielding intermediate results (Figure 7). In terms of inference efficiency, clear differences were observed across model architectures. The DeepLabV3 series achieved the fastest inference speed and the lowest GPU memory usage, followed by ResUNet, while the SegFormer models consistently demonstrated the slowest inference and the highest memory consumption (Table 2).
In the testing phase, DeepLabV3_18 showed significantly lower IoU and boundary IoU than its larger counterparts, while SegFormer achieved higher boundary IoU in the larger variant (Figure 7). Visual inspection of the segmentation results further corroborated these findings. Within the DeepLabV3 family, DeepLabV3_50 and DeepLabV3_101 produced comparably detailed outputs, with IoU scores of 0.894 and 0.896, respectively. In contrast, DeepLabV3_18 was unable to capture finer structural features and primarily distinguished only the relatively wide gaps between neighboring PV arrays, resulting in notably inferior segmentation quality (Figure 12). When evaluated on the multi-resolution dataset, training-phase differences showed slight changes, but the overall testing-phase trends remained consistent. In particular, the SegFormer family continued to exhibit no significant differences in performance among different model sizes (Figure 9). Similarly to the differences observed across model architectures, increasing model size led to reduced inference speed (FPS) and higher GPU memory consumption (Table 2). However, the magnitude of change associated with model size was generally smaller than that introduced by architectural differences.
In contrast to training-phase results, inter-family comparisons during testing revealed significant differences even between models with similar parameter scales (Figure 7 and Figure 13). In terms of both boundary IoU and IoU, ResUNet models consistently achieved the highest scores, followed by SegFormer, while DeepLabV3 models performed the worst. When focusing specifically on the 10 cm dataset, DeepLabV3_50 and SegFormer B4, as well as DeepLabV3_101 and SegFormer B5, showed no significant differences between each other but both significantly underperformed compared to their ResUNet counterparts (Figure S11). Visual comparisons of the testing results support these quantitative findings (Figure 14). Using the medium-sized models as examples, both DeepLabV3_50 and SegFormer B4 failed to accurately delineate narrow gaps between adjacent PV arrays and were unable to reconstruct the sharp angular boundaries that ResUNet50 captured reliably.

4. Discussion

Under our available computational resources, all model configurations could be trained with appropriately adjusted batch sizes [23]; even the SegFormer variants were trainable with a batch size of 11. For a dataset of approximately 5000 image patches, a single training run required about 14 h. However, in practical workflows, model selection and training typically involve multiple trials and repeated runs, which can easily extend the total workflow beyond one week. When transferred to a consumer-grade GPU (e.g., RTX 3070 with 8 GB VRAM), the training time for a single run increased sharply to roughly 51 h (Table S2). Under these conditions, the tuning process could extend to a month or longer. Although accuracy metrics such as IoU remain comparable, this represents an impractical timeframe for most applications. Therefore, based on our findings, selecting an appropriate model architecture is critical. This choice can substantially reduce trial-and-error costs and improve the overall feasibility of the pipeline under constrained computational resources. Furthermore, for real-time onboard inference on UAV platforms, the DeepLabV3 and ResUNet variants emerge as more suitable options due to their modest memory footprint (≈1–1.5 GB) and lower computational demands, making them well aligned with the capabilities of mainstream edge AI hardware [38].
The effectiveness of spectral and other auxiliary data is highly task-dependent. In our study, incorporating additional spectral bands did not improve segmentation performance, even in high vegetation cover areas (Figure S12). This suggests that the absolute advantage provided by UHR imagery might outweigh the marginal gains from additional spectral inputs. One possible reason is that, at such fine spatial scales, UHR RGB imagery offers significantly richer texture and spatial variation than satellite imagery. This results in a more pronounced contrast between PV panels and their background, along with sharper boundaries, making segmentation easier and reducing the need for additional spectral inputs [39]. Furthermore, PV panel segmentation is inherently a binary classification task with minimal class confusion, which further reduces the potential benefit of adding spectral information. However, when task complexity increases and the contrast between targets and background decreases, auxiliary inputs become more important. For example, Collin et al. [40] demonstrated that adding near-infrared or red edge bands improved accuracy by more than 2% in intertidal reef mapping. Similarly, in fine scale land use classification within sparsely vegetated regions, the inclusion of these bands increased accuracy by more than 3%, with improvements for the dead matter class reaching up to 30% [41]. In PV-related applications, the optimal GSD for detecting cracks on panels was reported as 0.1 mm using RGB imagery, whereas hotspot detection remains highly effective using thermal infrared data with a coarser GSD of 13.4 cm [42]. Moreover, our UAV data acquisition experience indicates that reducing the GSD from 10 cm to 5 cm would require roughly four times the flight time to capture the same scene coverage. These examples further highlight the importance of selecting task-specific input data when working with UHR imagery.
In real-world remote sensing segmentation, two key challenges arise: (1) ensuring model generalization across spatially diverse training samples, and (2) maintaining robustness across varying image resolutions. To address the first challenge, we compared models trained on subsets with a similar number of image patches, drawn from the single-site multi-resolution dataset and the 10 cm cross-site dataset. Although the single-site-trained model achieved higher validation IoU, its performance dropped by 13.2% on unseen multi-site data, whereas the multi-site-trained model showed only a 1.7% drop, highlighting the importance of spatial diversity (Figure 15). Adding NIR and Red Edge bands had negligible or negative effects on generalization. For cross-resolution generalization, models trained on multi-resolution data were tested on unseen resolutions. IoU decreased consistently when models were applied to resolutions not seen during training, although SegFormer and ResUNet trained on mixed-resolution data showed greater robustness (Figure S13 and Figure 16). These results align with previous studies [22,23], indicating that spatially diverse and mixed-resolution datasets are essential for reliable deep learning performance. Even modest resolution changes, such as from 5 cm to 10 cm, led to noticeable performance degradation (Figure S13 and Figure 16). Notably, SegFormer demonstrated strong robustness in both scenarios (Figure 15, Figure S13, and Figure 16), suggesting that it is particularly well-suited for complex tasks requiring robustness to multi-modal data across multiple spatial resolutions.
Model performance uncertainty across architectures remains a critical concern in UHR segmentation tasks. Although our experiments indicated that ResUNet achieves the highest IoU for UAV-based PV-array segmentation, other studies reported DeepLabV3+, FCN-8s, or SegFormer as the best alternatives [22,25,43]. This disparity underscores how architectural “superiority” is often task- and data-dependent. Several factors help explain these inconsistent findings. First, annotation protocols differ markedly. Many PV-segmentation studies rely on imagery coarser than 1 m; even those labeled “UHR” frequently use data in the 20–50 cm range. Narrow gaps between panel rows are therefore often labeled as part of the arrays (similar to Figure 10b3), producing ground truths that favor certain architectures and obscure fine-scale differences [39,44]. Second, variations in dataset size, class balance, and random initialization (i.e., different runs) contribute significantly to performance differences. In our experiments, enlarging the training set raised IoU by more than 0.10, while the standard deviation of boundary IoU across runs reached 0.14 (Table S3). This is comparable to the performance gaps typically reported between the top two models or modules in prior studies [45,46]. When sample imbalance is added to these differences, model rankings based on narrowly defined datasets can become unreliable [27]. Therefore, to reliably evaluate deep learning models for UHR imagery segmentation, particularly for challenging tasks like PV array extraction, there is a pressing need for a unified benchmark dataset that is diverse, well-annotated, and functionally analogous to those used for evaluating large language models [47]. Such a resource would help bridge the gap between research and real-world deployment. Fortunately, both our findings and previous studies suggest that for a given task, most publicly available architectures already deliver acceptable performance, with differences that remain tolerable in operational settings.
While our systematic comparison of model–data combinations provides practical guidance for directly reusing standard models in operational contexts, it offers less direct insight into scenarios involving modified architectures. In the model development domain, making minor adjustments to established architectures before deployment is a common strategy. For example, Zhu et al. [26] incorporated a detail-oriented attention mechanism into Deeplabv3+ and added a PointRend module to achieve more accurate segmentation of small solar PV systems in satellite imagery. Our results indicated that SegFormer exhibited five times more patch-level false positives than the other two model families because of its global attention mechanism (Table 3). Introducing an appropriate mechanism to suppress such errors could therefore yield significant performance gains. Similarly, our findings suggest that other mainstream enhancement approaches, such as data augmentation and the inclusion of prior knowledge, should be applied with greater caution or may even be unnecessary in UHR imagery segmentation. For instance, while Tan et al. [43] demonstrated that the color of PV panels can serve as a key cue for distinguishing them from the background, Yang et al. [39] showed that this cue is highly sensitive to weather conditions and can become unreliable under certain circumstances.

5. Conclusions

Many studies have demonstrated significant effectiveness in remote sensing by developing task-specific models through architectural modularization and data fusion. However, systematic comparisons of fundamental differences in model design and data characteristics remain scarce. Such a gap in knowledge increases the cost and uncertainty of selecting effective approaches, particularly for ultra-high-resolution applications where experimentation is computationally intensive. To address these challenges, this study presents a systematic evaluation of how model architecture and size interact with spectral inputs, dataset volume, and spatial resolution to influence photovoltaic panel segmentation performance. Our findings yield several significant insights:
(1)
Limited impact of spectral band augmentation: the incorporation of NIR and Red Edge bands into standard RGB inputs did not significantly improve segmentation performance, but did reduce inference speed.
(2)
Sample diversity outweighs dataset volume: while both training data volume and diversity contribute to model generalization, models trained on geographically diverse datasets consistently outperformed those trained on single-site data of comparable size.
(3)
Architecture matters more than size: ResUNet models consistently achieved higher performance than DeepLabV3 and SegFormer across scenarios. Specifically, the average accuracy of ResUNet reached 0.9873, compared to 0.9742 for SegFormer and 0.9322 for DeepLabV3; for IoU, the corresponding means were 0.9579 (ResUNet), 0.9110 (SegFormer), and 0.8145 (DeepLabV3).
(4)
Moderate model sizes offer optimal trade-offs: although increasing model size improved training stability and accuracy, medium-sized models often matched the performance of larger counterparts. Across all experimental settings, the average IoU of medium-sized models was 0.8966, nearly identical to that of larger models (0.8970), suggesting that they represent a practical balance between efficiency and effectiveness.
Collectively, these findings provide quantitative guidance for balancing architecture choice, model size, and dataset design, ultimately supporting more efficient and robust deployment of deep learning in ultra-high-resolution remote sensing tasks.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones9090619/s1, Figure S1: Examples of local visible-light imagery at different resolutions; Figure S2: Dataset annotation examples; Figure S3: Index relationship of test performance on the 10 cm cross-site dataset; Figure S4: Loss and IoU curves during the training phase under different training dataset sizes and experimental settings; Figure S5: Loss and IoU curves during the training phase under different training data resolutions and experimental settings; Figure S6: Training metrics on the 10 cm cross-site dataset series under different input bands and experimental settings; Figure S7: Training metrics on the multi-resolution dataset series under different input bands and experimental settings; Figure S8: Test metrics on the 10 cm cross-site dataset series under different input bands and experimental settings; Figure S9: Test metrics on the multi-resolution dataset series under different input bands and experimental settings; Figure S10: Statistical significance matrices across different model settings on the 10 cm cross-site dataset series; Figure S11: Statistical significance matrices across different model settings on the 10 cm cross-site dataset series; Figure S12: Comparison of IoU and boundary IoU across different input band combinations for ResUNet50 trained on a high vegetation cover subset of the 10 cm cross-site dataset; Figure S13: Cross-test IoU results with same-site training datasets of different spatial resolutions; Table S1: Summary of training settings for different models; Table S2: Performance metrics across different configurations on RTX 3070 (8 GB VRAM) using the 10 cm subset of the multi-resolution single-site dataset; Table S3: Standard deviation and relative variability of performance metrics across runs.

Author Contributions

Conceptualization, Z.Z. and W.Y.; methodology, Z.Z.; formal analysis, Z.Z.; investigation, Z.Z. and X.Z.; data curation, Z.Z. and J.L.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z., X.Z., P.Y. and W.Y.; visualization, Z.Z.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (42471101).

Data Availability Statement

The benchmark photovoltaic segmentation dataset and the main scripts from this study are openly available at the following website: https://zenodo.org/.

Acknowledgments

The authors would like to express their sincere gratitude to all staff members of the photovoltaic power plants involved in this study for their cooperation during the field investigations. Their support in facilitating site access and ensuring safe and efficient UAV data collection was invaluable to the success of this work.

Conflicts of Interest

Xinhui Zhou is employed by Hangzhou New Energy Investment and Development Co., Ltd. The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
UAV  Unmanned aerial vehicle
UHR  Ultra-high resolution
GSD  Ground sampling distance
IoU  Intersection over Union
GPU  Graphics processing unit
FPS  Frames per second

References

1. Kwon, D.Y.; Kim, J.; Park, S.; Hong, S. Advancements of remote data acquisition and processing in unmanned vehicle technologies for water quality monitoring: An extensive review. Chemosphere 2023, 343, 140198.
2. Feng, L.; Chen, S.; Zhang, C.; Zhang, Y.; He, Y. A comprehensive review on recent applications of unmanned aerial vehicle remote sensing with various sensors for high-throughput plant phenotyping. Comput. Electron. Agric. 2021, 182, 106033.
3. Hassler, S.C.; Baysal-Gurel, F. Unmanned Aircraft System (UAS) Technology and Applications in Agriculture. Agronomy 2019, 9, 618.
4. Sozzi, M.; Kayad, A.; Gobbo, S.; Cogato, A.; Sartori, L.; Marinello, F. Economic comparison of Satellite, Plane and UAV-acquired NDVI images for site-specific nitrogen application: Observations from Italy. Agronomy 2021, 11, 2098.
5. Khaliq, A.; Comba, L.; Biglia, A.; Ricauda Aimonino, D.; Chiaberge, M.; Gay, P. Comparison of satellite and UAV-based multispectral imagery for vineyard variability assessment. Remote Sens. 2019, 11, 436.
6. Jiang, J.; Johansen, K.; Tu, Y.-H.; McCabe, M.F. Multi-sensor and multi-platform consistency and interoperability between UAV, Planet CubeSat, Sentinel-2, and Landsat reflectance data. GISci. Remote Sens. 2022, 59, 936–958.
7. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
8. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417.
9. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144.
10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
12. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
13. Jiao, L.; Wang, M.; Liu, X.; Li, L.; Liu, F.; Feng, Z.; Yang, S.; Hou, B. Multiscale Deep Learning for Detection and Recognition: A Comprehensive Survey. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 5900–5920.
14. Ying, X. An Overview of Overfitting and its Solutions. J. Phys. Conf. Ser. 2019, 1168, 022022.
15. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203.
16. Shang, J.; Xu, J.; Zhang, A.A.; Liu, Y.; Wang, K.C.P.; Ren, D.; Zhang, H.; Dong, Z.; He, A. Automatic Pixel-level pavement sealed crack detection using Multi-fusion U-Net network. Measurement 2023, 208, 112475.
17. Gao, Y.; Li, Y.; Jiang, R.; Zhan, X.; Lu, H.; Guo, W.; Yang, W.; Ding, Y.; Liu, S. Enhancing Green Fraction Estimation in Rice and Wheat Crops: A Self-Supervised Deep Learning Semantic Segmentation Approach. Plant Phenomics 2023, 5, 0064.
18. Gibril, M.B.A.; Shafri, H.Z.M.; Al-Ruzouq, R.; Shanableh, A.; Nahas, F.; Al Mansoori, S. Large-Scale Date Palm Tree Segmentation from Multiscale UAV-Based and Aerial Images Using Deep Vision Transformers. Drones 2023, 7, 93.
19. Shi, Y.; Han, L.; Zhang, X.; Sobeih, T.; Gaiser, T.; Thuy, N.H.; Behrend, D.; Srivastava, A.K.; Halder, K.; Ewert, F. Deep Learning Meets Process-Based Models: A Hybrid Approach to Agricultural Challenges. arXiv 2025, arXiv:2504.16141.
20. Long, A.; Han, W.; Huang, X.; Li, J.; Wang, Y.; Chen, J. Distributed Deep Learning for Big Remote Sensing Data Processing on Apache Spark: Geological Remote Sensing Interpretation as a Case Study. In Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science; Song, X., Feng, R., Chen, Y., Li, J., Min, G., Eds.; Springer: Singapore, 2024; pp. 96–110.
21. Chen, Y.; Zhou, J.; Chen, Y.; Wang, J.; Zhang, X.; Ge, Y.; Ma, H. Edge-enhanced SAM for extracting photovoltaic power plants from remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2025, 140, 104580.
22. Jiang, H.; Yao, L.; Lu, N.; Qin, J.; Liu, T.; Liu, Y.; Zhou, C. Multi-resolution dataset for photovoltaic panel segmentation from satellite and aerial imagery. Earth Syst. Sci. Data 2021, 13, 5389–5401.
23. Kleebauer, M.; Marz, C.; Reudenbach, C.; Braun, M. Multi-resolution segmentation of solar photovoltaic systems using deep learning. Remote Sens. 2023, 15, 5687.
24. Meng, Z.; Hu, Y.; Ren, G.; Zhu, W.; Wang, J.; Liu, S.; Ma, Y. Remote sensing monitoring of seagrass bed dynamics using cross-temporal-spatial domain transfer learning in Yellow river Delta. Int. J. Remote Sens. 2024, 45, 1972–1996.
25. Ge, F.; Wang, G.; He, G.; Zhou, D.; Yin, R.; Tong, L. A hierarchical information extraction method for large-scale centralized photovoltaic power plants based on multi-source remote sensing images. Remote Sens. 2022, 14, 4211.
26. Zhu, R.; Guo, D.; Wong, M.S.; Qian, Z.; Chen, M.; Yang, B.; Chen, B.; Zhang, H.; You, L.; Heo, J.; et al. Deep solar PV refiner: A detail-oriented deep learning network for refined segmentation of photovoltaic areas from satellite imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103134.
27. Guo, Z.; Zhuang, Z.; Tan, H.; Liu, Z.; Li, P.; Lin, Z.; Shang, W.-L.; Zhang, H.; Yan, J. Accurate and generalizable photovoltaic panel segmentation using deep learning for imbalanced datasets. Renew. Energy 2023, 219, 119471.
28. Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and datasets on semantic segmentation for Unmanned Aerial Vehicle remote sensing images: A review. ISPRS J. Photogramm. Remote Sens. 2024, 211, 1–34.
29. Stroner, M.; Urban, R.; Reindl, T.; Seidl, J.; Broucek, J. Evaluation of the Georeferencing Accuracy of a Photogrammetric Model Using a Quadrocopter with Onboard GNSS RTK. Sensors 2020, 20, 2318.
30. Taddia, Y.; Stecchi, F.; Pellegrinelli, A. Coastal Mapping Using DJI Phantom 4 RTK in Post-Processing Kinematic Mode. Drones 2020, 4, 9.
31. Dimyati, M.; Supriatna, S.; Nagasawa, R.; Pamungkas, F.D.; Pramayuda, R. A Comparison of Several UAV-Based Multispectral Imageries in Monitoring Rice Paddy (A Case Study in Paddy Fields in Tottori Prefecture, Japan). ISPRS Int. J. Geo-Inf. 2023, 12, 36.
32. Shafiee, S.; Mroz, T.; Burud, I.; Lillemo, M. Evaluation of UAV multispectral cameras for yield and biomass prediction in wheat under different sun elevation angles and phenological stages. Comput. Electron. Agric. 2023, 210, 107874.
33. Franzini, M.; Ronchetti, G.; Sona, G.; Casella, V. Geometric and Radiometric Consistency of Parrot Sequoia Multispectral Imagery for Precision Agriculture Applications. Appl. Sci. 2019, 9, 5314.
34. Pádua, L.; Guimarães, N.; Adão, T.; Sousa, A.; Peres, E.; Sousa, J.J. Effectiveness of Sentinel-2 in Multi-Temporal Post-Fire Monitoring When Compared with UAV Imagery. ISPRS Int. J. Geo-Inf. 2020, 9, 225.
35. Mazzia, V.; Comba, L.; Khaliq, A.; Chiaberge, M.; Gay, P. UAV and Machine Learning Based Refinement of a Satellite-Driven Vegetation Index for Precision Agriculture. Sensors 2020, 20, 2530.
36. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed Precision Training. arXiv 2017, arXiv:1710.03740.
37. Cheng, B.; Girshick, R.; Dollar, P.; Berg, A.C.; Kirillov, A. Boundary IoU: Improving Object-Centric Image Segmentation Evaluation. arXiv 2021, arXiv:2103.16562.
38. Koubaa, A.; Ammar, A.; Abdelkader, M.; Alhabashi, Y.; Ghouti, L. AERO: AI-Enabled Remote Sensing Observation with Onboard Edge Computing in UAVs. Remote Sens. 2023, 15, 1873.
39. Yang, R.; He, G.; Yin, R.; Wang, G.; Peng, X.; Zhang, Z.; Long, T.; Peng, Y.; Wang, J. A large-scale ultra-high-resolution segmentation dataset augmentation framework for photovoltaic panels in photovoltaic power plants based on priori knowledge. Appl. Energy 2025, 390, 125879.
40. Collin, A.; Dubois, S.; James, D.; Houet, T. Improving Intertidal Reef Mapping Using UAV Surface, Red Edge, and Near-Infrared Data. Drones 2019, 3, 67.
41. Furukawa, F.; Laneng, L.A.; Ando, H.; Yoshimura, N.; Kaneko, M.; Morimoto, J. Comparison of RGB and Multispectral Unmanned Aerial Vehicle for Monitoring Vegetation Coverage Changes on a Landslide Area. Drones 2021, 5, 97.
42. Zefri, Y.; ElKettani, A.; Sebari, I.; Ait Lamallam, S. Thermal Infrared and Visual Inspection of Photovoltaic Installations by UAV Photogrammetry—Application Case: Morocco. Drones 2018, 2, 41.
43. Tan, H.; Guo, Z.; Zhang, H.; Chen, Q.; Lin, Z.; Chen, Y.; Yan, J. Enhancing PV panel segmentation in remote sensing images with constraint refinement modules. Appl. Energy 2023, 350, 121757.
44. Zhang, X.; Wu, H.; Qi, K.; Qian, Y.; Zhang, Y.; Wang, L.; Wang, J. Detailed PV Monitor: A Highly Generalized Photovoltaic Panels Segmentation Network Integrating Context-Aware and Deep Feature Reconstruction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10131–10143.
45. Jie, Y.; Ji, X.; Yue, A.; Chen, J.; Deng, Y.; Chen, J.; Zhang, Y. Combined Multi-Layer Feature Fusion and Edge Detection Method for Distributed Photovoltaic Power Station Identification. Energies 2020, 13, 6742.
46. da Costa, M.V.C.V.; de Carvalho, O.L.F.; Orlandi, A.G.; Hirata, I.; de Albuquerque, A.O.; e Silva, F.V.; Guimarães, R.F.; Gomes, R.A.T.; Júnior, O.A.d.C. Remote Sensing for Monitoring Photovoltaic Solar Plants in Brazil Using Deep Semantic Segmentation. Energies 2021, 14, 2960.
47. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615.
Figure 1. Overview of data collection sites and example imagery: (a) distribution of flight sampling points; (b) visible-band image sample; (c) near-infrared image sample; and (d) red-edge image sample.
Figure 2. Loss and IoU curves during the validation phase under different training dataset sizes and experimental settings. RGB: Red, Green, Blue; NIR: near-infrared; RE: Red Edge. Panels (a–i) illustrate the results corresponding to each row and column setting in the experimental design.
Figure 3. Loss and IoU curves during the validation phase under different training data resolutions and experimental settings. RGB: Red, Green, Blue; NIR: near-infrared; RE: Red Edge. Panels (a–l) illustrate the results corresponding to each row and column setting in the experimental design.
Figure 4. Training metrics under different input bands and experimental settings: (a) total epochs, (b) total training time, and (c) best validation IoU. *** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05.
Figure 5. Training metrics under different dataset sizes and experimental settings: (a) total epochs, (b) total training time, and (c) best validation IoU. *** p ≤ 0.001, * p ≤ 0.05. Letters A–C denote significant differences at the 0.01 level; same letters indicate no significant difference.
Figure 6. Training metrics under different dataset resolutions and experimental settings: (a) total epochs, (b) total training time, and (c) best validation IoU. *** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05. Letters A–C denote significant differences at the 0.01 level; same letters indicate no significant difference.
Figure 7. Test metrics under different input bands and experimental settings: (a) mean boundary IoU, (b) mean IoU. *** p ≤ 0.001, * p ≤ 0.05.
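Both test metrics reported in Figures 7–9 can be reproduced from binary masks in a few lines. Below is a minimal sketch, assuming NumPy/SciPy and a boundary width d in pixels; the function names are ours, not from the study's codebase, and d of roughly 2% of the image diagonal follows the boundary IoU formulation of Cheng et al. (2021).

```python
import numpy as np
from scipy.ndimage import binary_erosion

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Standard Intersection over Union for two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    union = (gt | pred).sum()
    return float((gt & pred).sum() / union) if union else 1.0

def boundary_iou(gt: np.ndarray, pred: np.ndarray, d: int = 15) -> float:
    """IoU restricted to a band of width d px around each mask's contour.

    The band is the mask minus its d-fold erosion; d is a free parameter
    here, not a value taken from the paper.
    """
    gt, pred = gt.astype(bool), pred.astype(bool)
    gt_band = gt & ~binary_erosion(gt, iterations=d)
    pred_band = pred & ~binary_erosion(pred, iterations=d)
    union = (gt_band | pred_band).sum()
    return float((gt_band & pred_band).sum() / union) if union else 1.0
```

Mean IoU and mean boundary IoU as plotted are then simple averages of these per-image values over the test set.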
Figure 8. Test metrics under different dataset sizes and experimental settings: (a) mean boundary IoU, (b) mean IoU. *** p ≤ 0.001. Letters A–C denote significant differences at the 0.01 level; same letters indicate no significant difference.
Figure 9. Test metrics under different dataset resolutions and experimental settings: (a) mean boundary IoU, (b) mean IoU. *** p ≤ 0.001. Letters A–D denote significant differences at the 0.01 level; same letters indicate no significant difference.
Figure 10. Visual comparison of segmentation results across different models and resolutions using the medium-sized series as an example. Subfigures (a1–d5) illustrate the results corresponding to each row and column setting in the experimental design.
Figure 11. Statistical significance matrices across different model settings for: (a) total number of training epochs, (b) total training time, and (c) best validation IoU.
Figure 12. Visual comparison of segmentation results across different model sizes, using the DeepLabV3 series trained on the 10 cm large dataset as an example. Subfigures (a1–c5) present samples (a–c), each consisting of the RGB image, the ground truth, and three different prediction results.
Figure 13. Statistical significance matrices across different model settings for: (a) mean boundary IoU and (b) mean IoU. *** p ≤ 0.001, * p ≤ 0.05.
Figure 14. Visual comparison of segmentation results for different models using the medium-sized series as an example. Subfigures (a1–c5) present samples (a–c), each consisting of the RGB image, the ground truth, and three different prediction results.
Figure 15. Cross-test IoU results with patch number-matched training datasets: (a) trained on the single-site dataset and tested on the multi-site dataset; and (b) trained on the multi-site dataset and tested on the single-site dataset.
Figure 16. Visual comparison of cross-resolution test results using ResUNet101 as an example. Subfigures (a1–d7) illustrate the results corresponding to each row and column setting in the experimental design.
Table 1. Summary of training settings for different models.
| Model | Input Bands | Batch Size | Fwd/Bwd Pass Size (MB) | Trainable Parameters (M) | Train GPU Memory (MiB) | Validation GPU Memory (MiB) |
|---|---|---|---|---|---|---|
| DeepLabV3_18 | 3 | 110 | 5133.98 | 15.9 | 19,857.27 | 19,864.44 |
| DeepLabV3_18 | 4 | 110 | 5133.98 | 15.9 | 19,864.79 | 19,869.6 |
| DeepLabV3_18 | 5 | 110 | 5133.98 | 15.91 | 19,951.02 | 19,953.6 |
| DeepLabV3_50 | 3 | 35 | 26,072.09 | 39.63 | 21,313.47 | 21,316.5 |
| DeepLabV3_50 | 4 | 35 | 26,072.09 | 39.64 | 21,396.41 | 21,398.72 |
| DeepLabV3_50 | 5 | 35 | 26,072.09 | 39.64 | 21,470.6 | 21,474.17 |
| DeepLabV3_101 | 3 | 24 | 36,339.74 | 58.63 | 19,950.15 | 19,951.67 |
| DeepLabV3_101 | 4 | 24 | 36,339.74 | 58.63 | 20,007.91 | 20,028.86 |
| DeepLabV3_101 | 5 | 24 | 36,339.74 | 58.63 | 20,220.95 | 20,223.44 |
| ResUNet18 | 3 | 43 | 18,689.29 | 12.91 | 21,450.95 | 21,454.24 |
| ResUNet18 | 4 | 43 | 18,689.29 | 12.92 | 21,628.44 | 21,638.32 |
| ResUNet18 | 5 | 43 | 18,689.29 | 12.92 | 21,671.14 | 21,673.89 |
| ResUNet50 | 3 | 26 | 27,549.24 | 44.31 | 20,954.92 | 20,957.88 |
| ResUNet50 | 4 | 26 | 27,549.24 | 44.31 | 20,960.28 | 20,961.27 |
| ResUNet50 | 5 | 26 | 27,549.24 | 44.32 | 21,765.62 | 21,767.05 |
| ResUNet101 | 3 | 17 | 25,285.89 | 63.3 | 18,439.77 | 18,433.43 |
| ResUNet101 | 4 | 17 | 25,285.89 | 63.31 | 20,062.14 | 20,059.79 |
| ResUNet101 | 5 | 17 | 25,285.89 | 63.31 | 20,915.45 | 20,916.1 |
| SegFormerB1 | 3 | 23 | 10,201.6 | 11.45 | 21,346.68 | 21,348.21 |
| SegFormerB1 | 4 | 23 | 10,201.6 | 11.45 | 21,579.26 | 21,581.35 |
| SegFormerB1 | 5 | 23 | 10,201.6 | 11.46 | 21,591.96 | 21,594.03 |
| SegFormerB4 | 3 | 14 | 18,827.18 | 46.63 | 22,242.63 | 22,245.07 |
| SegFormerB4 | 4 | 14 | 18,827.18 | 46.63 | 22,342.3 | 22,344.28 |
| SegFormerB4 | 5 | 14 | 18,827.18 | 46.63 | 20,362.44 | 20,350.79 |
| SegFormerB5 | 3 | 11 | 17,336.11 | 62.24 | 20,249.75 | 20,250.7 |
| SegFormerB5 | 4 | 11 | 17,336.11 | 62.25 | 20,753.57 | 20,748.75 |
| SegFormerB5 | 5 | 11 | 17,336.11 | 62.25 | 21,566.58 | 21,555.58 |
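The bookkeeping columns of Table 1 can be reproduced along the following lines in PyTorch. This is a sketch, not the study's measurement script: torchvision's `deeplabv3_resnet50` stands in for the benchmarked models, the batch and patch sizes are illustrative, and `torch.cuda.max_memory_allocated` reports allocator-level usage, which can differ from the process totals shown by `nvidia-smi`.

```python
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights=None, num_classes=2).cuda()  # stand-in model

# "Trainable Parameters (M)" column: count parameters with gradients enabled.
params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
print(f"trainable parameters: {params_m:.2f} M")

# Peak memory over one training step (cf. the "Train GPU Memory" column).
torch.cuda.reset_peak_memory_stats()
x = torch.randn(8, 3, 512, 512, device="cuda")         # illustrative batch of 3-band patches
y = torch.randint(0, 2, (8, 512, 512), device="cuda")  # dummy integer label masks
loss = F.cross_entropy(model(x)["out"], y)
loss.backward()
print(f"peak training memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```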
Table 2. Summary of test information for different models.
| Model | Input Bands | FPS | Avg Time per Image (ms) | Avg GPU Memory (MiB) | Max GPU Memory (MiB) |
|---|---|---|---|---|---|
| DeepLabV3_18 | 3 | 61.46 | 16.65 | 121.12 | 506.34 |
| DeepLabV3_18 | 4 | 45.74 | 22.07 | 132.91 | 518.36 |
| DeepLabV3_18 | 5 | 37.1 | 28.95 | 144.71 | 530.37 |
| DeepLabV3_50 | 3 | 38.71 | 26.03 | 223.93 | 921.28 |
| DeepLabV3_50 | 4 | 31.97 | 31.75 | 224.59 | 922.03 |
| DeepLabV3_50 | 5 | 27.9 | 36.24 | 236.38 | 934.04 |
| DeepLabV3_101 | 3 | 33.47 | 30 | 285.09 | 982.3 |
| DeepLabV3_101 | 4 | 27.91 | 35.91 | 296.86 | 994.31 |
| DeepLabV3_101 | 5 | 25.4 | 39.43 | 405.83 | 1103.49 |
| ResUNet18 | 3 | 58.37 | 17.47 | 139.74 | 1047.06 |
| ResUNet18 | 4 | 41.21 | 24.66 | 120.16 | 1027.7 |
| ResUNet18 | 5 | 35.51 | 28.58 | 170.18 | 1078.06 |
| ResUNet50 | 3 | 41.09 | 24.42 | 229.34 | 1434.56 |
| ResUNet50 | 4 | 32.64 | 30.76 | 241.15 | 1446.57 |
| ResUNet50 | 5 | 28.54 | 35.11 | 253.01 | 1458.62 |
| ResUNet101 | 3 | 35.68 | 28.11 | 302.42 | 1507.63 |
| ResUNet101 | 4 | 29.44 | 34.09 | 314.2 | 1519.64 |
| ResUNet101 | 5 | 26.01 | 38.58 | 325.98 | 1531.65 |
| SegFormerB1 | 3 | 43.36 | 23.21 | 111.59 | 7118.41 |
| SegFormerB1 | 4 | 34.55 | 29.1 | 123.39 | 7130.42 |
| SegFormerB1 | 5 | 30.01 | 33.56 | 135.17 | 7142.43 |
| SegFormerB4 | 3 | 26.94 | 37.17 | 247.62 | 7254.74 |
| SegFormerB4 | 4 | 23.39 | 42.87 | 259.54 | 7266.76 |
| SegFormerB4 | 5 | 21.25 | 47.19 | 271.33 | 7278.77 |
| SegFormerB5 | 3 | 23.87 | 41.93 | 307.66 | 7314.54 |
| SegFormerB5 | 4 | 21.18 | 47.3 | 319.51 | 7326.6 |
| SegFormerB5 | 5 | 19.25 | 52.05 | 331.28 | 7338.61 |
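Throughput figures like those in Table 2 are typically obtained with a warm-up-then-time loop. Below is a sketch under assumed settings (single-image batches, 512 × 512 patches, a CUDA device); the paper's exact measurement protocol may differ.

```python
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, bands: int = 3, size: int = 512,
              warmup: int = 10, iters: int = 100) -> tuple[float, float]:
    """Return (FPS, average ms per image) for single-image inference."""
    model = model.eval().cuda()
    x = torch.randn(1, bands, size, size, device="cuda")
    for _ in range(warmup):       # warm-up: cuDNN autotuning, allocator pools
        model(x)
    torch.cuda.synchronize()      # ensure queued kernels finish before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    ms = (time.perf_counter() - start) * 1000.0 / iters
    return 1000.0 / ms, ms
```

Called as, e.g., `benchmark(model, bands=4)` for a four-band configuration; reading `torch.cuda.max_memory_allocated()` after the loop would give a memory figure comparable in spirit to the table's memory columns.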
Table 3. Summary of error-related test metrics for different models.
| Model | Recall-BG | Recall-PV | Error-FP (%) | Error-TN (%) | Error-Both (%) |
|---|---|---|---|---|---|
| DeepLabV3_18 | 0.87 ± 0.01 | 0.99 ± 0.01 | 0.01 ± 0.01 | 0.01 ± 0.00 | 0.02 ± 0.01 |
| DeepLabV3_50 | 0.95 ± 0.04 | 0.98 ± 0.01 | 0.01 ± 0.01 | 0.01 ± 0.00 | 0.01 ± 0.01 |
| DeepLabV3_101 | 0.95 ± 0.04 | 0.98 ± 0.01 | 0.01 ± 0.01 | 0.00 ± 0.00 | 0.01 ± 0.01 |
| ResUNet18 | 0.99 ± 0.01 | 0.98 ± 0.01 | 0.02 ± 0.01 | 0.00 ± 0.00 | 0.02 ± 0.01 |
| ResUNet50 | 0.99 ± 0.01 | 0.99 ± 0.00 | 0.01 ± 0.01 | 0.00 ± 0.00 | 0.02 ± 0.01 |
| ResUNet101 | 0.99 ± 0.01 | 0.99 ± 0.00 | 0.01 ± 0.01 | 0.00 ± 0.00 | 0.01 ± 0.01 |
| SegFormerB1 | 0.97 ± 0.00 | 0.98 ± 0.01 | 0.10 ± 0.04 | 0.00 ± 0.00 | 0.10 ± 0.04 |
| SegFormerB4 | 0.97 ± 0.01 | 0.98 ± 0.00 | 0.05 ± 0.03 | 0.00 ± 0.00 | 0.06 ± 0.03 |
| SegFormerB5 | 0.97 ± 0.01 | 0.98 ± 0.00 | 0.05 ± 0.03 | 0.00 ± 0.00 | 0.06 ± 0.03 |
Note: BG, background; PV, photovoltaic; Error-FP and Error-TN, false positives and true negatives at the patch level; Error-Both, combined FP and TN.
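These metrics admit a compact implementation. The sketch below follows our reading of the note above (a patch counts toward Error-FP when PV is predicted on a PV-free patch, and toward Error-TN when a PV-bearing patch is predicted as pure background); the authors' exact definitions may differ.

```python
import numpy as np

def patch_metrics(gts: list[np.ndarray], preds: list[np.ndarray]) -> dict:
    """Pixel recall per class plus patch-level error rates over a test set."""
    tp = fn = tn = fp = 0   # pixel-level counts across all patches
    err_fp = err_tn = 0     # patch-level error counts
    for gt, pr in zip(gts, preds):
        gt, pr = gt.astype(bool), pr.astype(bool)
        tp += (gt & pr).sum()
        fn += (gt & ~pr).sum()
        tn += (~gt & ~pr).sum()
        fp += (~gt & pr).sum()
        if not gt.any() and pr.any():   # PV predicted on a PV-free patch
            err_fp += 1
        if gt.any() and not pr.any():   # PV-bearing patch predicted as background
            err_tn += 1
    n = len(gts)
    return {
        "recall_pv": tp / (tp + fn),
        "recall_bg": tn / (tn + fp),
        "error_fp_pct": 100.0 * err_fp / n,
        "error_tn_pct": 100.0 * err_tn / n,
        "error_both_pct": 100.0 * (err_fp + err_tn) / n,
    }
```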
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
