Remote Sensing for Monitoring Photovoltaic Solar Plants in Brazil Using Deep Semantic Segmentation

Abstract: Brazil is a tropical country with continental dimensions and abundant solar resources that are still underutilized. However, solar energy is one of the most promising renewable sources in the country. The proper inspection of photovoltaic (PV) solar plants is an issue of great interest for the Brazilian territory's energy management agency, and advances in computer vision and deep learning allow automatic, periodic, and low-cost monitoring. The present research aims to identify PV solar plants in Brazil using semantic segmentation and a mosaicking approach for large image classification. We compared four architectures (U-net, DeepLabv3+, Pyramid Scene Parsing Network, and Feature Pyramid Network) with four backbones (Efficient-net-b0, Efficient-net-b7, ResNet-50, and ResNet-101). For mosaicking, we evaluated a sliding window with overlapping pixels using different stride values (8, 16, 32, 64, 128, and 256). We found that: (1) the models presented similar results, suggesting that, in many scenarios, acquiring high-quality labels matters more than the choice of model; (2) U-net presented slightly better metrics, and the best configuration was U-net with the Efficient-net-b7 encoder (98% overall accuracy, 91% IoU, and 95% F-score); (3) mosaicking progressively improves results (precision-recall and receiver operating characteristic area under the curve) as the stride value decreases, at a higher computational cost. The rapid growth of solar energy in Brazil requires rapid mapping, and the proposed study provides a promising approach.


Introduction
Solar energy is one of the most promising renewable energy sources, being crucial for sustainable development in places with intense sunlight. Several studies have shown that solar energy systems allow for economic and efficiency gains, driven by technological and productive development that enables cost reduction to overcome technical barriers [1,2]. According to Sampaio and Gonçalez [3], the main advantages of solar energy systems are reliability, low costs of operation and servicing, low maintenance, a free energy source, clean energy, high availability, generation closer to the consumer, a low environmental impact, potential to mitigate greenhouse gas emissions, and noiselessness. In contrast, the main disadvantages are a high initial cost, large installation area, high dependence on technology development, and climatic conditions (solar irradiation). The benefits of solar technology provided an exponential increase in installed solar energy capacity between 1992 and 2020 [4,5]. This detected growth of solar energy was not foreseen in previous scenarios of the Intergovernmental Panel on Climate Change's fifth assessment report [6].
In automatic detection, Deep Learning (DL) emerges as a powerful method, especially for computer vision problems using convolutional neural networks (CNNs), due to its ability to process multi-dimensional arrays [69], with wide remote sensing applications [70][71][72][73][74][75]. Several reviews have covered the different DL methods, in which object detection, semantic segmentation, and instance segmentation were the most common approaches [76][77][78]. The choice of method is highly dependent on the task objectives. When the main goal is a pixel-wise classification (as is the case with PV solar plants), semantic segmentation is a suitable alternative [79,80].
Previous studies in PV solar panel detection have shown promising results using the DL method, presenting very high accuracy. However, most studies consider urban PV panels using aerial or high-resolution satellite images [81][82][83], while PV solar plant mapping is still restricted [84]. This approach is an effective alternative to construction inspection, requiring periodic data and free satellite imagery. Previous studies on PV panel detection have not yet shown reasonable solutions for classifying large regions, and the use of mosaicking with sliding windows is a promising solution [85][86][87].
The primary motivation for this study is the development of a methodology based on remote sensing for the automatic monitoring of new installations of PV solar plants. In Brazil, the high growth of solar energy throughout the territory, with a continental dimension, prevents on-site inspection due to the financial and time cost, requiring the development of technological alternatives. Therefore, this research aims to evaluate the use of DL methods, representing the state of the art of computer vision, to identify and monitor PV solar power plants from ANEEL's database using Sentinel-2 images. This methodology represents an innovation for the management and monitoring of installed solar energy structures on the Brazilian territory, and similar research does not exist in the country to date.

Materials and Methods
The present research had the following methodological steps (

Study Area
Brazil has a large and diverse territory, presenting different solar energy incidence [21]. Nevertheless, many areas are extremely suitable for the installation of PV panels. Therefore, we selected 24 areas to conduct this experiment ( Figure 2). There are limited PV plants installed in the Brazilian territory and currently no open datasets considering Sentinel-2 data [88]. However, the development of methodologies and expansion of databases is a fundamental strategy for monitoring large-scale PV with a high growth trend.


Image Acquisition and Annotations
We obtained Sentinel-2 cloudless images with four channels (Red, Green, Blue, and near infra-red) for each region containing PV solar power plants. For each image, a specialist manually annotated ground truth (GT) masks considering two classes: background and PV solar plant. The background class presents a wide variety of spectral behaviors, including the different soil and vegetation compositions present in a large-scale country such as Brazil. The research considered the difference in the light incidence and the construction of panels in each region for DL model training.

Data Split
After preparing each tile with their respective annotations, we separated the dataset into training, validation, and testing sets. For each area of interest that may contain more than one PV solar plant, we cropped at least seven 256 × 256-pixel tiles. Table 1 lists the distribution of areas and images for training, validation, and testing.

Model Configurations
In addition to choosing the appropriate models, it is crucial to make fine adjustments for the task at hand. The first problem is the reduced number of available samples. Therefore, in addition to obtaining at least seven frames from each location, we applied two augmentations in the training process: random horizontal flip and random vertical flip (both with a probability of 0.5). The second problem is class imbalance (there are many more background pixels than solar panel pixels). Thus, we used a loss function that minimizes this effect, the Dice Loss:

$$\text{Dice Loss} = 1 - \frac{2\,|pred \cap GT|}{|pred| + |GT|},$$

in which pred is the DL prediction, and GT is the ground truth mask. In addition, we used transfer learning with ImageNet [99] pre-trained weights for faster convergence; to avoid overfitting, we applied callbacks, saving the model with the lowest Dice Loss in the validation set. Regarding hyperparameters, we used: (a) 300 epochs; (b) Adam optimizer; (c) 5 × 10⁻³ learning rate (lr); and (d) batch size of 5.
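As an illustration of the two adjustments described above, the following is a minimal NumPy sketch of a soft Dice Loss and of the random flips. It is not the training code used in this study: the function names, the smoothing term `eps`, and the (H, W, C) array layout are assumptions introduced here for illustration.

```python
import numpy as np


def dice_loss(pred, gt, eps=1e-7):
    """Soft Dice loss: 1 - 2*|pred ∩ GT| / (|pred| + |GT|).

    pred holds predicted probabilities in [0, 1]; gt is a binary mask.
    eps is an assumed smoothing term that avoids division by zero.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    intersection = np.sum(pred * gt)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(gt) + eps)


def random_flips(image, mask, rng, p=0.5):
    """Flip an (H, W, C) image and its (H, W) mask together.

    The horizontal and the vertical flip are each applied with probability p.
    """
    if rng.random() < p:  # horizontal flip (reverse the column order)
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < p:  # vertical flip (reverse the row order)
        image, mask = image[::-1], mask[::-1]
    return image, mask
```

Because the loss depends only on the overlap between prediction and mask, a large number of correctly classified background pixels does not lower it, which is why it suits the imbalance between background and panel pixels.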

DL Accuracy Analysis
Accuracy analysis is a fundamental step for DL model evaluation. Since semantic segmentation models provide a pixel-wise mask, the metrics compare the predicted mask and the GT mask through confusion matrix metrics. The confusion matrix (Table 2) counts the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The model outputs probabilities, whereas the GTs are integers. Thus, it was necessary to establish a cutoff point for the threshold metrics. A stricter threshold tends to reduce the commission errors, while a more permissive threshold tends to reduce omission errors. Thus, we applied a commonly used intermediate threshold of 0.5 for three metrics (overall accuracy, F-score, and IoU):

$$\text{Overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{F-score} = \frac{2\,TP}{2\,TP + FP + FN}$$

$$\text{IoU} = \frac{TP}{TP + FP + FN}$$
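The three threshold metrics can be sketched as follows from a probability map and a GT mask. This is only an illustrative NumPy example: the function name, the use of `>=` at the cutoff, and the convention of returning 0 when a denominator is empty are assumptions, not details stated in this study.

```python
import numpy as np


def threshold_metrics(prob, gt, threshold=0.5):
    """Overall accuracy, F-score, and IoU after binarizing prob at threshold."""
    pred = np.asarray(prob) >= threshold  # assumed: values equal to the cutoff count as panel
    gt = np.asarray(gt).astype(bool)

    # Confusion matrix counts
    tp = int(np.sum(pred & gt))
    tn = int(np.sum(~pred & ~gt))
    fp = int(np.sum(pred & ~gt))  # commission errors
    fn = int(np.sum(~pred & gt))  # omission errors

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    return {
        "accuracy": accuracy,
        "f_score": 2 * tp / f_denominator if f_denominator else 0.0,
        "iou": tp / iou_denominator if iou_denominator else 0.0,
    }


# Toy 2 x 2 example: one TP, one FP, one FN, and one TN
metrics = threshold_metrics([[0.9, 0.6], [0.4, 0.1]], [[1, 0], [1, 0]])
print(metrics)  # accuracy 0.5, F-score 0.5, IoU 1/3
```

Note that TN appears only in the overall accuracy, so a dominant background class inflates that metric while leaving the F-score and IoU unaffected.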

Mosaicking
The 256 × 256 pixel tiles used in training may not represent an entire scene, requiring a postprocessing stage. Mosaicking using a sliding window algorithm is a very promising solution. However, combining frames side by side to reconstruct a scene may also induce errors in the single frame edges. A way to minimize this effect is to apply a sliding window with overlapping pixels, where the final pixel will be the average from the overlapped pixels. Thus, we compared six different stride values for the mosaicking strategy: 8, 16, 32, 64, 128, and 256 (adjacent frames). Figure 3 shows four images with consecutive frames using different stride values. The smaller the stride value, the more overlapping pixels (which tends to reduce errors in the frame edges).
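The overlapping sliding window can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the implementation used in this study: `predict_tile` is a hypothetical callable standing in for the trained model, and the extra window appended at the bottom and right borders (when the stride does not divide the image evenly) is a choice made here so that every pixel is covered.

```python
import numpy as np


def mosaic_predict(image, predict_tile, window=256, stride=128):
    """Average overlapping tile predictions over a large (H, W, C) image.

    predict_tile is a hypothetical function mapping a (window, window, C)
    tile to a (window, window) array of probabilities.
    """
    height, width = image.shape[:2]
    if height < window or width < window:
        raise ValueError("image must be at least as large as the window")

    prob_sum = np.zeros((height, width), dtype=np.float64)
    counts = np.zeros((height, width), dtype=np.float64)

    # Top-left corners of the windows; add a final window flush with the
    # border if the stride leaves the last rows or columns uncovered.
    ys = list(range(0, height - window + 1, stride))
    xs = list(range(0, width - window + 1, stride))
    if ys[-1] != height - window:
        ys.append(height - window)
    if xs[-1] != width - window:
        xs.append(width - window)

    for y in ys:
        for x in xs:
            tile = image[y:y + window, x:x + window]
            prob_sum[y:y + window, x:x + window] += predict_tile(tile)
            counts[y:y + window, x:x + window] += 1.0

    # Each pixel is the mean of all window predictions that covered it
    return prob_sum / counts
```

With a stride equal to the window size, every pixel is predicted once (adjacent frames); smaller strides let pixels near a frame edge also be predicted from windows in which they lie closer to the center.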


Mosaicking Accuracy Analysis
To evaluate the mosaicking, we analyzed the ranking metrics Receiver Operating Characteristic Area Under the Curve (ROC AUC) and Precision-Recall (PR) AUC, considering six stride values: 8, 16, 32, 64, 128, and 256. The ROC curve considers the true positive rate (TP/(TP + FN)) and false positive rate (FP/(TN + FP)), and the PR curve considers the precision (TP/(TP + FP)) and recall (TP/(TP + FN)). From the points generated, it is possible to calculate the area under these curves.
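As a sketch of how the two areas can be obtained from the generated points, the example below sweeps the threshold over the sorted scores and integrates both curves with the trapezoidal rule. The function name and the integration rule are assumptions (other conventions exist for the PR curve, and the one used in this study is not stated), and the code assumes both classes are present in the labels.

```python
import numpy as np


def roc_pr_auc(scores, labels):
    """Return (ROC AUC, PR AUC) using trapezoidal integration.

    scores are predicted probabilities; labels are binary (1 = PV panel).
    Assumes at least one positive and one negative label.
    """
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel().astype(bool)

    # Sort by decreasing score so each prefix is "predicted positive"
    order = np.argsort(-scores, kind="stable")
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)

    # Keep one curve point per distinct score value (handles ties)
    last = np.r_[np.nonzero(np.diff(scores))[0], scores.size - 1]
    tp, fp = tp[last], fp[last]

    tpr = np.r_[0.0, tp / labels.sum()]        # recall, TP / (TP + FN)
    fpr = np.r_[0.0, fp / (~labels).sum()]     # FP / (TN + FP)
    precision = np.r_[1.0, tp / (tp + fp)]     # TP / (TP + FP)

    roc_auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    pr_auc = float(np.sum(np.diff(tpr) * (precision[1:] + precision[:-1]) / 2.0))
    return roc_auc, pr_auc
```

Unlike the threshold metrics, these scores depend only on how the pixels are ranked by probability, so they capture improvements in the averaged mosaic that a single 0.5 cutoff could hide.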

DL Metrics Results
Overall, the different architectures and backbones presented good results (Table 3). Among the architectures, the U-net presented the best metrics, followed by DeepLabv3+, FPN, and PSPNet. Despite the higher complexity of the DeepLabv3+ architecture, the U-net presented better results, probably because the targets do not vary much in scale, and handling multi-scale objects is one of the most significant benefits of DeepLabv3+. Moreover, although PSPNet provided the worst results, the difference is not large, and its training period is considerably shorter (less than half the time to train the Eff-b7 using the U-net architecture, and nearly one-fifth of the period for training on the DeepLabv3+ architecture). When analyzing the different backbones, apart from Eff-b0 with the PSPNet architecture, the results did not change significantly. Moreover, metrics-wise, the accuracy score shows high values among all models (<3% variation), possibly because there are many more pixels corresponding to the background class than to the panels class. The IoU and F-score provide much more meaningful results. The Eff-b7 using the U-net architecture had the best IoU and F-score results, and an intermediate computational cost.

Table 3. Semantic segmentation evaluation (accuracy, IoU, F-score, and epoch period) using four architectures (U-net, DeepLabv3+, FPN, and PSPNet), and four backbones (Efficient-net-b7 (Eff-b7), Efficient-net-b0 (Eff-b0), ResNet-101 (R-101), and ResNet-50 (R-50)).

Figure 4 shows three examples from the test set, and three examples from the validation set with their corresponding original images (RGB channels), GT, and prediction. Despite some errors in the edges of the objects, these results suggest a correct identification of the target, with few errors.

Table 4 shows the ROC AUC and PR AUC scores for the 1536 × 768-pixel area, using six different stride values (8, 16, 32, 64, 128, and 256). The analysis only considered the best model (U-net with Eff-b7 backbone). When the stride value decreases, results progressively improve in both metrics. Nevertheless, decreasing the stride value increases the computational cost, which becomes a significant limitation, especially for practical applications. Figure 5 shows the original image, its corresponding GT, and the prediction using U-net with Eff-b7 backbone and an 8-pixel stride value on a 1536 × 768-pixel image. This mosaicking strategy enables the classification of areas with large dimensions, outputting images with no discontinuity.

Figure 5. Mosaic representation on a 1536 × 768-pixel image with the original image, the corresponding ground truth (GT), and prediction using the U-net with Efficient-net-b7 backbone.
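To give a sense of how the computational cost scales with the stride, the short sketch below counts the 256 × 256 windows needed to cover a 1536 × 768-pixel image for each stride value. The function name is hypothetical, and the count is only a proxy for processing time, assuming one model inference per window.

```python
def window_count(height, width, window=256, stride=256):
    """Number of sliding-window positions that fit inside the image."""
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return rows * cols


counts = {s: window_count(768, 1536, stride=s) for s in (256, 128, 64, 32, 16, 8)}
for stride, n in counts.items():
    print(f"stride {stride:>3}: {n:>5} windows")
# stride 256:    18 windows
# stride 128:    55 windows
# stride  64:   189 windows
# stride  32:   697 windows
# stride  16:  2673 windows
# stride   8: 10465 windows
```

Halving the stride roughly quadruples the number of inferences, so the 8-pixel stride requires about 580 times as many windows as adjacent frames.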

Discussion
The best result of our study was the U-net with the Eff-b7 backbone, although the other methods also reach high or adequate values. However, an unexpected result is that U-net outperformed DeepLabv3+ by a slight margin. This result is probably because the input images do not present multi-scale objects, the handling of which is one of the main contributions of the DeepLabv3+ method. Therefore, these results show that simpler structures may be well suited in some scenarios, highlighting the importance of testing different architectures.
Other solar panel detection studies using DL methods have demonstrated high accuracy in different locations. However, studies on PV solar plants are still far fewer than those on residential PV solar panels. Considering large-scale solar plants, Hou et al. [84] proposed a study in China with one thousand images, achieving 95% IoU with the U-net model. They used a much larger amount of data, and the results were not dissimilar to ours (92% IoU).
Generally, accuracy results are lower in residential PV solar panels due to their smaller dimension and higher susceptibility to noise interference. Yuan et al. [100] applied a simple ConvNet for large-scale solar panel mapping from aerial images, and evaluated their model in the cities of Boston and San Francisco using completeness (0.84 and 0.87) and correctness (0.81 and 0.85) metrics. Yu et al. [101] proposed DeepSolar with a substantial amount of training data using high-resolution satellite images, obtaining 93.1% recall and 88.5% precision, results very similar to our F-score (95%). Zhuang et al. [83] applied the U-net in satellite images for residential panels, achieving 74% IoU. Recently, Jie et al. [82] combined a U-net model with edge detection networks. The authors showed that the edge detection increased performance on two city panel datasets by nearly 2% IoU. This effect may be even less prominent in large solar plants since it is easier to detect borders, as shown in our study. Even though these studies trained with smaller PV solar panels, the results show an excellent ability to segment panels even with simpler models.
Thus, the results of our and previous studies suggest that the mapping of PV solar panels should be addressed from a data-driven, rather than model-driven, perspective, i.e., the DL models do not differ significantly, and the most important endeavor is to obtain a reliable source of good annotations. Moreover, the present study showed significant results using data augmentation despite a limited amount of data.
The mosaicking procedure enables the classification of areas of indefinite and large sizes. We have shown that using a smaller stride value increases performance, but also the computational cost. The stride value for a practical application should take both factors into consideration. Regarding the mosaicking technique on semantic segmentation models, de Albuquerque et al. [86] performed a comparative analysis using different stride values, presenting progressively better ROC AUC scores for lower stride values, a result also verified in our research.
This research presents many possibilities for future studies. A first proposition would be to estimate energy production using the mapping of the photovoltaic solar panel from DL, and the level of solar incidence in a specific region. Another relevant test would be evaluating radar images due to cloud cover and atmospheric interference in optical images. Although synthetic aperture radar (SAR) images are noisy, they can be useful in some scenarios. Studies comparing the frame sizes according to the proposal by Bem et al. [102] can also be valuable in understanding the model's differences in various tasks (e.g., binary and multiclass) and object scales.

Conclusions
The survey and monitoring of PV solar power plants are extremely important for energy management and planning. The high growth of solar energy in Brazil, a country with continental dimensions, increases the inspection workload for ANEEL to a level that is only manageable through technological innovation. Thus, this paper presented a comparison between DL models for the classification of PV solar plants using Sentinel-2 images with four spectral bands (RGB and near infra-red), comparing four architectures (U-net, DeepLabv3+, FPN, and PSPNet) with four backbones (ResNet-50, ResNet-101, Eff-b0, and Eff-b7), totaling 16 combinations. Additionally, we used augmentation and transfer learning. The PV panel spectral and shape characteristics facilitate the accurate detection of the panels. Results were satisfactory using the different backbones and architectures, but U-net with Eff-b7 backbone presented the best results with 98% accuracy, 92% IoU, and 95% F-score. We estimate that the most critical factors when mapping PV solar panels are a reliable source of data and their possible applications. For the classification of large regions, the image mosaicking results improve significantly when using more overlapping pixels, which minimizes edge errors. The ROC AUC and PR AUC scores show the same trend, progressively increasing as the stride value decreases. However, the computational cost may be a significant challenge for practical applications, since the processing time increases significantly as the stride value is reduced. This methodology has many applications and satisfies the conditions for automatically classifying PV solar plants using free Sentinel-2 imagery, allowing for a significant advance in monitoring the installed infrastructure.