1. Introduction
Wind energy is one of the fastest-growing renewable energy sources worldwide. In 2023, it recorded its highest growth to date: in a single year, more than 100 GW of new onshore capacity and over 11 GW of offshore capacity were added globally. Total installed capacity worldwide exceeded the symbolic milestone of 1 TW for the first time and is expected to reach 2 TW before the end of this decade if current growth trends continue [1]. In addition, the International Energy Agency (IEA) outlines scenarios in which wind energy could meet more than 20% of global electricity demand by 2030, provided that ambitious climate protection measures are implemented [2]. The transition to renewable energy sources nevertheless presents major challenges. Accurate mapping and monitoring of wind turbine locations, together with meta-information on turbine characteristics (e.g., turbine type, nominal power, hub height, or rotor diameter), are critical for effective integration into electricity grids and sustainable infrastructure planning.
Despite its growing global importance, detailed and spatially accurate datasets of wind turbine infrastructure remain scarce in many regions of the world. Existing global datasets often focus on aggregated capacities or rough location data and lack the precision needed for localized planning and operational decision-making. Recent research efforts address these limitations through advanced remote sensing and machine learning approaches. For instance, global offshore wind turbine locations were mapped using Sentinel-1 radar images [3,4], while segmentation methods utilizing high-resolution aerial images [5,6] and Sentinel-2 RGB imagery [7,8] improved the detection accuracy of onshore wind turbines. Moreover, the integration of multimodal data sources [9,10] further enhances detection accuracy and completeness. Approaches that enable detection at a global scale are also being researched [11,12].
In the specific context of South Africa, the national Renewable Energy Independent Power Producer Procurement Programme (REIPPPP) plays a central role in realizing the country’s long-term energy infrastructure goals. Launched in 2011 by the Department of Mineral Resources and Energy in cooperation with National Treasury and the Development Bank of Southern Africa, the REIPPPP was designed to facilitate private sector investment into grid-connected renewable energy generation through competitive bidding. The programme has since led to the procurement of more than 6.3 GW of renewable capacity, including wind, solar photovoltaic (PV), and other sources [
13]. Recent regulatory changes have further expanded the landscape of wind energy development in South Africa. In particular, the lifting of the 100 MW licensing cap for private generation in January 2023 has enabled the construction of wind farms outside the REIPPPP framework [
14]. As part of the REIPPPP, the Independent Power Producers (IPP) Projects Database is maintained by the IPP Office and provides a structured overview of utility-scale renewable energy projects, including wind farms. The database includes information such as project names, capacities, and commissioning dates. However, it does not contain detailed geospatial information on individual turbines and typically excludes smaller or non-utility-scale developments. Nor does it provide technical information such as turbine types, hub height, or rotor diameter [
15]. This study aims to fill this data gap and provide a spatially refined and attribute-based dataset that captures the full extent of wind turbine infrastructure in the country. This includes both large utility-scale farms and smaller, decentralized installations, enabling more comprehensive and accurate energy system analyses.
To overcome these data limitations, this article builds upon the methodologies initially presented in the conference paper by Kleebauer et al. (2024) entitled “Enhancing Wind Turbine Location Accuracy: A Deep Learning-Based Object Regression Approach for Validating Wind Turbine Geo-Coordinates” [
16]. Here, the original methods are further developed by combining OpenStreetMap (OSM) data, deep learning (DL)-based object detection with RetinaNet, high-resolution satellite imagery from Google and Bing, and manual attribute enrichment to produce a comprehensive, spatially precise dataset of wind turbines in South Africa. This multi-step pipeline combines open data sources, DL-based geo-coordinate correction, and manual validation. It significantly enhances data quality and applicability for detailed infrastructure planning and energy modelling. For better readability, the term “coordinate” is used synonymously with “geo-coordinate” in the following.
As illustrated in Figure 1, the construction of the dataset follows a multi-stage workflow. First, training data is prepared using the German Core Energy Market Data Register (MaStR) and high-resolution aerial imagery. A RetinaNet-based deep learning model is trained and fine-tuned to detect turbines based on this reference data. Preparation of the South African wind turbine dataset starts with downloading, extracting and filtering the raw wind turbine data from OSM. High-resolution satellite imagery from both Bing Maps and Google Satellite is then integrated to provide visual context for the turbine locations. The model is then applied to correct the spatial positions of the turbines, improving the coordinate accuracy. Subsequently, a manual attribute enrichment step adds key turbine information such as name, turbine type, turbine capacity and total wind farm capacity. A capacity analysis and a spatial analysis are then carried out for further description and evaluation. This leads to the final high-quality, geo-referenced dataset of wind turbines in South Africa.
In the larger project context, a comprehensive open-source strategy was developed to ensure barrier-free access to tools and data for energy system modelling. This ecosystem promotes transparency and supports the wider use of open-source solutions for renewable energy planning and analysis. The methodological chain includes renewable energy system detection [17], high-resolution time series generation [18], and energy system modelling with integration into IRENA FlexTool [19].
4. Results
We first briefly present the results of the model training and the data extracted and processed from OSM, followed by the results of the location correction. Finally, we present the results of the additional attribute enrichment.
4.1. Performance and Results of Deep Learning Training
This section presents the results of the DL training, including the loss functions and the accuracy achieved. These results provide insight into the robustness and performance of the applied RetinaNet approach. To summarize the training progress, the progression of the two losses from the classification and regression networks, as well as the average precision (AP), was evaluated to determine the networks’ performance, as displayed in Figure 4.
The results of the first training session, in which all training data was used, are shown in blue, and those of the second training session, in which the filtered training data was used, are shown in red. A consistent upward trend can be observed in the AP. The final AP is 85% for the first training and 96% for the second training with manually post-filtered samples. In addition, Figure 5 shows the losses during the training phase.
Both the Smooth L1 regression loss used to localize the objects and the focal loss used for classification decrease significantly and almost evenly in both training runs. The total loss is the sum of the individual losses. Training is terminated by early stopping after 17 epochs in each case, indicating no further progress. On the test set of 700 samples, the final model correctly identified 420 wind turbines (true positives, TP), missed 18 turbines (false negatives, FN), and incorrectly identified 17 objects as turbines (false positives, FP). The remaining 245 samples were correctly identified as true negatives (TN). Overall, these figures indicate strong generalization of the network beyond the training examples. Incorrect detections are shown in Figure 6.
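The test-set counts reported above can be turned into the usual detection metrics. The following minimal sketch derives precision, recall, F1 score and accuracy from the four counts; the function name is illustrative, and the derived values are computed here rather than taken from the study.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),        # share of detections that are turbines
        "recall": tp / (tp + fn),           # share of turbines that are detected
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
        "accuracy": (tp + tn) / total,      # share of all samples classified correctly
    }


# Counts reported for the 700-sample test set
metrics = detection_metrics(tp=420, fp=17, fn=18, tn=245)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision, recall and F1 score all come out at roughly 0.96, and accuracy at 0.95.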
The false positives include a construction site, a biogas plant and two churches. In addition, some poorly resolved turbines are not recognized by the network. This occurs against different backgrounds: turbines in open fields, in forests and in settlements are missed. However, these turbines are also difficult to identify by visual inspection. Examples of correctly recognized wind turbines are shown in Figure 7. In addition to well-resolved turbines, poorly resolved turbines can also be identified in the images. All images show that the regression locates the towers of the turbines exactly in the centers of the bounding boxes. In other words, the centers of the regression boxes can be interpreted as the coordinates of the wind turbines.
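To illustrate how a bounding-box center can be interpreted as a coordinate, the sketch below maps the center of a pixel box to longitude and latitude by linear interpolation within the image extent. It assumes a north-up image tile that is small enough for linear interpolation to be adequate; the function name, tile size, box and bounds are hypothetical and do not come from the study.

```python
def box_center_to_lonlat(box, image_size, bounds):
    """Map the centre of a pixel bounding box to (lon, lat).

    box        -- (x_min, y_min, x_max, y_max) in pixels, origin at top-left
    image_size -- (width, height) in pixels
    bounds     -- (lon_west, lat_south, lon_east, lat_north) of the image
    Assumes a north-up image small enough for linear interpolation.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    lon_west, lat_south, lon_east, lat_north = bounds
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    lon = lon_west + cx / width * (lon_east - lon_west)
    lat = lat_north - cy / height * (lat_north - lat_south)  # pixel y grows southwards
    return lon, lat


# Hypothetical 512 x 512 px tile with one detection centred in the image
lon, lat = box_center_to_lonlat(
    box=(200, 220, 312, 292),
    image_size=(512, 512),
    bounds=(24.000, -32.002, 24.002, -32.000),
)
print(f"{lon:.4f}, {lat:.4f}")
```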
4.2. OSM Data Extraction
The initial dataset for South Africa’s wind turbines, extracted from OSM, contained a total of 1546 point features. After a manual review and refinement process, this number was reduced to 1487 verified wind turbines. Point features tagged as generators with diesel or solar as the source were excluded and deleted. Of the remaining turbines, 55 are not assigned to any wind farm in the OSM data; these were assigned manually to the respective farms to ensure the completeness of the data. Among the wind farms, Longyuan Mulilo De Aar 2 North has the highest number of turbines with 96 individual units, while the smallest wind farm, Buffeljags Abalone Farm, consists of only two turbines. A capacity is given for 351 of the 1487 turbines, while no capacity data is available for 1144 turbines. As a result, every wind turbine is assigned to a wind farm and, where possible, to capacity information.
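The filtering step described above can be sketched as follows. This is a simplified illustration rather than the processing code used in the study: it assumes that each extracted point feature is available as a dictionary of OSM tags, uses the common OSM keys `power`, `generator:source` and `generator:output:electricity`, and introduces a hypothetical `farm` field for the wind farm assignment.

```python
def is_wind_turbine(tags: dict) -> bool:
    """Keep only generators whose source is wind (drops diesel, solar, etc.)."""
    return tags.get("power") == "generator" and tags.get("generator:source") == "wind"


# Hypothetical extracted point features
features = [
    {"id": 1, "farm": "Farm A",
     "tags": {"power": "generator", "generator:source": "wind",
              "generator:output:electricity": "2.3 MW"}},
    {"id": 2, "farm": None,
     "tags": {"power": "generator", "generator:source": "wind"}},
    {"id": 3, "farm": None,
     "tags": {"power": "generator", "generator:source": "diesel"}},
    {"id": 4, "farm": None,
     "tags": {"power": "generator", "generator:source": "solar"}},
]

turbines = [f for f in features if is_wind_turbine(f["tags"])]

# Turbines that still need a manual wind farm assignment or capacity lookup
without_farm = [f["id"] for f in turbines if not f["farm"]]
without_capacity = [f["id"] for f in turbines
                    if "generator:output:electricity" not in f["tags"]]

print(len(turbines), without_farm, without_capacity)
```

Features flagged in `without_farm` and `without_capacity` correspond to the cases that require manual assignment and enrichment.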
4.3. Coordinate Correction
The accuracy of the neural network’s predictions heavily depends on the domain-specific characteristics of the training and application datasets. To analyze this effect, we compare the confidence scores of the predictions for onshore wind turbines in South Africa.
Table 1 presents the results of the coordinate correction process using both Bing and Google satellite imagery.
The table summarizes the results for 1487 wind turbines and shows that the overall distribution of confidence scores differs considerably between Bing and Google imagery. While only a small fraction of detections reaches confidence scores above 0.8 (0.2% for Bing and 3.0% for Google), the majority falls below 0.5, indicating potential challenges in image consistency or domain transfer. Despite this, visual inspection confirms the accurate detection of turbines in both datasets, as illustrated in
Figure 8 and
Figure 9.
A total of 90 turbines (6.05%) in the Bing images and 43 turbines (2.89%) in the Google images are not detected and thus fall into the null category. Of these, 36 turbines are undetected in both the Bing and the Google imagery. All of these overlaps are located within four specific wind farms: San Kraal Wind Farm, Phezukomoya, Cookhouse Wind Farm, and Wolf Wind Farm. Visual inspection of the null category shows that the images often depict turbine construction sites at these locations, which means that some of the images are not recent enough to show the existing wind turbines. In addition to the accuracy of the detection, the accuracy of the regression is examined below. Table 2 summarizes the distances between the original coordinates and the coordinates predicted by the regression.
Table 2 presents the distribution of coordinate deviations for wind turbines in South Africa, comparing the results derived from Bing and Google imagery. The deviations are categorized into six distance intervals: <5 m, 5–10 m, 10–15 m, 15–20 m, 20–25 m, and >25 m. A large share (64.3%) of the Google-based coordinates falls within 10 m of the reference, whereas only 29.4% of the Bing-based coordinates achieve this accuracy. The largest deviations (>25 m) occur in 16.6% of the Bing-based and 12.9% of the Google-based predictions. To provide a visual summary of the distribution of location errors, a histogram of the distance deviations was created, as shown in Figure 10. It shows the proportion of turbines falling within specific distance ranges for both Bing and Google images.
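The deviation analysis can be reproduced in principle by computing the distance between each original and corrected coordinate pair and sorting the result into the six intervals. The sketch below uses the haversine formula as one possible distance measure; the study does not state which formula was used, the handling of values that fall exactly on an interval boundary is an assumption, and the coordinate pairs are hypothetical.

```python
import bisect
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

BIN_EDGES = [5, 10, 15, 20, 25]
BIN_LABELS = ["<5 m", "5–10 m", "10–15 m", "15–20 m", "20–25 m", ">25 m"]


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_bin(distance_m):
    """Return the interval label; a value on an edge goes to the upper interval."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, distance_m)]


# Hypothetical (original, corrected) coordinate pairs as (lon, lat)
pairs = [
    ((24.0000, -32.0000), (24.0000, -32.0001)),  # about 11 m apart
    ((24.0100, -32.0100), (24.0100, -32.0100)),  # unchanged
    ((24.0200, -32.0200), (24.0200, -32.0204)),  # about 44 m apart
]

histogram = Counter(
    distance_bin(haversine_m(*original, *corrected)) for original, corrected in pairs
)
shares = {label: histogram[label] / len(pairs) for label in BIN_LABELS}
print(shares)
```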
4.4. Wind Turbine Dataset
An overview of the existing wind farms in South Africa is provided below. Table 3 combines spatial information with key technical attributes for each wind farm. It includes both operational and under-construction sites and was cross-checked and harmonized based on multiple publicly available sources. Listed are the commissioning year, the number of turbines, the total installed capacity in MW, the rated capacity per turbine in MW and the type of turbine installed in each wind farm.
Two wind farms, Phezukomoya and San Kraal, are still under construction. In these cases, not all turbines have yet been built or identified, which explains deviations from the detailed point-based turbine dataset. A more detailed graphical evaluation is summarized in
Figure 11. Boxplots illustrate three key parameters from left to right: the number of turbines per wind farm, the total installed capacity, and the specific capacity per turbine.
The number of turbines varies significantly, ranging from small farms with only 2 to 4 turbines to large-scale farms hosting up to 96 turbines. However, the majority of wind farms contain between about 15 and 50 turbines. On average, there are 37 turbines per farm. The total installed capacity per wind farm ranges from as little as 0.1 MW to 147 MW. The majority of projects lie within the interquartile range of 35 to 140 MW; the median is 100 MW. The nominal capacity per turbine spans a wide range, from small-scale units with 25 kW to modern high-capacity turbines rated at 4.5 MW. Most turbines, however, fall within the interquartile range of 2.3 to 3.1 MW. The mean capacity per turbine is 2.5 MW, which is typical for recent onshore turbine installations.
Figure 12 shows the development of wind power capacity in South Africa over time, from the first installations in 2008 through to 2025. To illustrate the growth trend in recent years, the left panel shows the annual installed capacity between 2008 and 2025 based on the commissioning years of the individual wind farms. At least three different phases of capacity growth can be observed: an initial phase with isolated installations between 2008 and 2012, a first strong expansion phase from 2014 to 2021 with significant annual growth, and a second expansion phase since 2022. The largest annual increases were in 2016 with around 580 MW and in 2021 with almost 570 MW of newly installed capacity. The right panel shows the cumulative installed capacity over the same period. By 2025, the total installed capacity reaches over 3.9 GW.
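The two panels described above follow from a simple aggregation: capacities are summed per commissioning year and then accumulated over time. The following sketch shows this with hypothetical wind farm records; the years and capacities are placeholders, not values from the dataset.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical wind farms: (commissioning year, installed capacity in MW)
farms = [(2014, 138.0), (2014, 100.0), (2015, 90.0), (2017, 140.0)]

# Annual additions: sum of capacities per commissioning year
annual_mw = defaultdict(float)
for year, capacity_mw in farms:
    annual_mw[year] += capacity_mw

# Include years without additions so the cumulative curve has no gaps
years = list(range(min(annual_mw), max(annual_mw) + 1))
additions = [annual_mw.get(year, 0.0) for year in years]

cumulative_mw = dict(zip(years, accumulate(additions)))
cumulative_gw = {year: mw / 1000 for year, mw in cumulative_mw.items()}

print(dict(zip(years, additions)))
print(cumulative_gw)
```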
Figure 13 shows the total installed capacity per wind farm in descending order, distributed across 42 different wind farms with capacities ranging from 147 MW to 0.1 MW. The bar lengths provide a quick indication of the relative capacity of the individual wind farms. This ranking makes it easier to identify the wind farms in South Africa with the highest rated capacity. The largest farms, such as Roggeveld, Karusa, Nxuba or Soetwater, reach around 140–150 MW. The smallest wind farms, such as Coega, Buffeljags Abalone Farm and Silo District’s Sustainable Design, have significantly lower total capacities of less than 2 MW.
Alongside the total installed capacity, Figure 14 shows the number of wind turbines installed in the individual wind farms in descending order. The order provides a quick overview of the locations with a particularly high number of turbines. Longyuan Mulilo De Aar 2 North stands out with 96 turbines, while Longyuan Mulilo De Aar Maanhaarberg with 67 turbines and Cookhouse Wind Farm with 66 turbines are the next largest farms. Coega Wind Farm has only two turbines. In combination with the capacity data, this also gives an indication of the average turbine size in each wind farm.
Figure 15 shows the nominal capacity per wind turbine at each wind farm. This overview can be used to determine which sites mainly use smaller turbines and which rely on turbines with a higher rated capacity. The frequent use of turbines with a capacity of 2.3 MW (here Siemens SWT-2.3 turbines) in the Jeffreys Bay Wind Farm, Kangnas Wind Farm, Khobab Wind Farm, Loeriesfontein 2, Noupoort Mainstream, and Perdekraal East Wind Farm is particularly evident. However, turbines with a capacity of 3 MW are also widely used, namely in Dassieklip, Chaba Wind Farm, Copperton Wind Farm, Gouda Wind Facility, Red Cap - Gibson Bay, and Van Stadens Wind Farm. The lower end of the scale includes turbines with relatively small capacities, such as those at Buffeljags Abalone Farm or the vertical axis turbines in the Silo District. Higher bars correspond to larger capacity turbines, such as the Vestas V136 and V162 models with capacities of up to 5 MW.
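The per-farm figures discussed above (total capacity, number of turbines and capacity per turbine) can all be derived from turbine-level records. The sketch below shows one way to do this; the farm names and capacities are hypothetical placeholders, and the record layout is an assumption rather than the schema of the published dataset.

```python
from collections import defaultdict

# Hypothetical turbine-level records: (wind farm, rated capacity in MW)
turbine_records = [
    ("Farm A", 2.3), ("Farm A", 2.3), ("Farm A", 2.3),
    ("Farm B", 3.0), ("Farm B", 3.0),
    ("Farm C", 0.025),
]

# Aggregate turbine count and total capacity per wind farm
farms = defaultdict(lambda: {"turbines": 0, "total_mw": 0.0})
for farm, capacity_mw in turbine_records:
    farms[farm]["turbines"] += 1
    farms[farm]["total_mw"] += capacity_mw

# Average turbine size follows from total capacity and turbine count
for stats in farms.values():
    stats["mean_mw"] = stats["total_mw"] / stats["turbines"]

# Rankings in descending order, as used for the bar charts
by_capacity = sorted(farms, key=lambda f: farms[f]["total_mw"], reverse=True)
by_count = sorted(farms, key=lambda f: farms[f]["turbines"], reverse=True)

print(by_capacity, by_count)
```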
The following section of the results focuses on the spatial distribution of wind turbines in South Africa. The installed wind power capacity is concentrated in just three of the country’s nine provinces, Northern Cape, Eastern Cape, and Western Cape.
Table 4 provides a summary of wind energy infrastructure at the provincial level.
The majority of capacity is located in the Northern Cape and Eastern Cape, which together host 32 wind farms and 1231 turbines. The Western Cape follows with 10 wind farms. The Eastern Cape and the Northern Cape account for 1571 MW and 1670 MW of installed capacity, respectively. The Western Cape contributes 575 MW, bringing the total installed capacity in these three provinces to more than 3800 MW. The Roggeveld Wind Farm represents a special case, as it spans two provinces. Since the majority of its 42 turbines are located in the Northern Cape and only five fall within the Western Cape, the entire wind farm is attributed to the Northern Cape for consistency in the provincial analysis.
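The provincial totals and the majority rule applied to the Roggeveld Wind Farm can be expressed compactly. In the sketch below, the provincial capacities are the values stated above, whereas the turbine split used to demonstrate the majority rule is purely illustrative and the function name is hypothetical.

```python
from collections import Counter

# Installed capacity per province in MW, as stated in the text
capacity_mw = {"Northern Cape": 1670, "Eastern Cape": 1571, "Western Cape": 575}

total_mw = sum(capacity_mw.values())
shares = {province: mw / total_mw for province, mw in capacity_mw.items()}


def assign_province(turbine_provinces):
    """Attribute a wind farm to the province that hosts most of its turbines."""
    return Counter(turbine_provinces).most_common(1)[0][0]


# Illustrative farm spanning two provinces (counts are placeholders)
example_farm = ["Northern Cape"] * 8 + ["Western Cape"] * 2

print(total_mw)
print({province: f"{share:.1%}" for province, share in shares.items()})
print(assign_province(example_farm))
```

The three provincial values sum to 3816 MW, consistent with the total of more than 3800 MW.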
Figure 16 illustrates the spatial distribution of all 42 existing wind farms in South Africa. It clearly shows that the facilities are located exclusively in the Northern Cape, Eastern Cape, and Western Cape.
To supplement the analysis at provincial level, a more detailed spatial aggregation was carried out at district municipality level. This approach enables a finer resolution of the spatial distribution and highlights the differences within the provinces in the expansion of wind energy.
Figure 17 shows the total installed capacity on the one hand and the number of wind turbines per municipality on the other. The results show a very uneven distribution, with a limited number of municipalities hosting the majority of turbines and installed capacity. In contrast, many regions are still completely undeveloped, indicating a significant spatial concentration of wind energy infrastructure.
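An aggregation of this kind requires assigning each turbine to an administrative unit and then summing per unit. The sketch below uses a plain ray-casting point-in-polygon test for the assignment; this is a simplified stand-in for a GIS spatial join, not the procedure used in the study, and the district outlines and turbines are hypothetical.

```python
from collections import defaultdict


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical district outlines (simple rectangles) and turbines
districts = {
    "District A": [(20.0, -33.0), (22.0, -33.0), (22.0, -31.0), (20.0, -31.0)],
    "District B": [(22.0, -33.0), (24.0, -33.0), (24.0, -31.0), (22.0, -31.0)],
}
turbines = [
    {"lon": 21.0, "lat": -32.0, "capacity_mw": 2.3},
    {"lon": 21.5, "lat": -32.5, "capacity_mw": 2.3},
    {"lon": 23.0, "lat": -32.0, "capacity_mw": 3.0},
]

# Count turbines and sum capacity per district
summary = defaultdict(lambda: {"turbines": 0, "capacity_mw": 0.0})
for turbine in turbines:
    for name, outline in districts.items():
        if point_in_polygon(turbine["lon"], turbine["lat"], outline):
            summary[name]["turbines"] += 1
            summary[name]["capacity_mw"] += turbine["capacity_mw"]
            break

print(dict(summary))
```

Real administrative boundaries are multi-part polygons with holes, so a production workflow would typically rely on a dedicated geospatial library instead.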
4.5. Validation Against Official Capacity Figures
In order to assess the accuracy of the data compiled in this study with regard to installed capacity, the total installed capacity of wind farms in operation was compared with the official IPP project database [15]. According to our data, a total of 3627 MW is currently in operation. The IPP database lists an installed capacity of 3428 MW (Wolf Wind Farm is considered to be already in operation). The slight deviation of less than 200 MW can be explained by the inclusion of additional wind farms in our dataset that are not part of the projects supported by the REIPPPP, such as small or privately financed farms. According to the official database, three additional wind farms, each with a capacity of 140 MW, are currently in the planning phase but have not yet been commissioned and are therefore not included in our dataset. This comparison confirms both the consistency of our data with national figures and the added value of including additional data sources.
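The size of the deviation can be checked directly from the two totals given above. The short calculation below also expresses it relative to the IPP figure; the relative value is derived here and is not stated in the study.

```python
dataset_mw = 3627  # operational capacity in this dataset
ipp_mw = 3428      # capacity listed in the IPP database

difference_mw = dataset_mw - ipp_mw
relative_difference = difference_mw / ipp_mw

print(f"{difference_mw} MW ({relative_difference:.1%} of the IPP total)")
```

The difference amounts to 199 MW, or about 5.8% of the capacity listed in the IPP database.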
5. Discussion
This study presents a comprehensive and spatially validated dataset of wind power infrastructure in South Africa. With 1487 turbines across 42 wind farms and a total installed capacity exceeding 3.9 GW, the dataset offers both spatial and technical detail, with a total of 3.6 GW currently in operation. Most turbines are concentrated in the Northern Cape, Eastern Cape, and Western Cape provinces, reflecting the regional clustering of wind development in the country. In addition to the spatial information, the dataset includes harmonized metadata such as commissioning year, turbine type, wind farm capacity, and per-turbine capacity. These attributes were manually collected and cross-checked from various sources.
Although labor-intensive, this enrichment process significantly increases the usability and reliability of the dataset—enabling advanced applications in energy system modelling, infrastructure planning, and policy design. However, manually collecting turbine-specific information also revealed common challenges regarding the availability and quality of public data. The information on operators’ websites was often unstructured, inconsistently formatted, or partially incomplete. In several cases, additional sources such as press releases, freely accessible news articles, and energy-related databases were consulted. While these secondary sources were useful for cross-checking, they sometimes contained unverifiable or contradictory data, highlighting the limitations of public reporting on renewable energy infrastructure. These challenges underline the crucial role of manual processing within the overall pipeline, which, despite advances in automation, remains indispensable for ensuring technical completeness and high data quality.
While most of the data processing, including the localization of the turbines for coordinate correction using DL methods, was automated, manual steps were essential to ensure the technical completeness and reliability of the dataset. In particular, turbine attributes such as turbine type, capacity and year of commissioning were manually enriched by comparing several publicly available sources (e.g., operator website, project reports, press releases). This manual effort was necessary because the detailed technical metadata in open datasets such as OSM or national databases is almost completely missing, incomplete or inconsistent. If the pipeline were transferred to other countries or regions, a similar manual enrichment step would probably be required due to the heterogeneous availability of data and the different reporting standards worldwide. Automated extraction of attributes from semi-structured text sources (e.g., using NLP methods) could be investigated as a future extension to partially automate this step. However, full automation is currently only possible to a limited extent due to the lack of standardized and structured publication of turbine metadata. Furthermore, regular updates of the dataset (e.g., every 1-2 years) would require re-verification of new wind farm projects and updating of technical attributes, meaning that some level of manual verification and enrichment will still be essential to maintain data quality. Nevertheless, further improvements, such as the integration of automated web scraping techniques combined with manual quality checks, could significantly reduce the manual workload while ensuring high standards of data accuracy.
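As a rough illustration of what partial automation of the attribute extraction might look like, the sketch below pulls capacity figures out of free text with a regular expression. This is a deliberately simple pattern-matching stand-in rather than an NLP method, the example sentence is invented, and any value extracted this way would still need manual verification.

```python
import re

# Matches numbers followed by a kW or MW unit, e.g. "2.3 MW" or "25 kW"
CAPACITY_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(kW|MW)\b", re.IGNORECASE)


def extract_capacities_mw(text: str) -> list:
    """Return all capacity values found in the text, converted to MW."""
    values = []
    for number, unit in CAPACITY_PATTERN.findall(text):
        value = float(number)
        if unit.lower() == "kw":
            value /= 1000  # convert kW to MW
        values.append(value)
    return values


# Invented example sentence in the style of an operator website
text = "The wind farm comprises 61 turbines rated at 2.3 MW each, for a total of 140.3 MW."
print(extract_capacities_mw(text))
```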
The dataset was systematically checked against several external sources to ensure its completeness. A comparison with the official South African IPP database [15] confirms that all 34 large wind farms currently in operation are included in this dataset. In addition, two projects under construction and several smaller wind farms not listed in the official database have been included. The dataset thus shows that it not only covers large infrastructures but also takes into account smaller and emerging projects. It is noteworthy that the aggregate installed capacity of the wind farms currently included in our dataset is largely consistent with the total capacity reported in official IPP sources, further supporting the validity and representativeness of the dataset.
The RetinaNet model used for coordinate correction was trained on German aerial imagery and applied to South African wind turbine locations using both Bing and Google satellite data. The application resulted in a notable drop in confidence scores, which can be attributed to the domain shift between training and application imagery, a typical challenge in DL when transferring models across data sources. Despite this, the visual and statistical evaluation confirms a high localization accuracy. More than 60% of Google-based predictions and 29% of Bing-based predictions fall within 10 m of the reference coordinates. The model’s ability to correctly identify turbine locations across different landscapes and image types confirms its practical value as a scalable validation tool. Due to the lack of official, publicly available data on wind turbines in South Africa, the spatial validation of the turbine coordinates was carried out by visual comparison with high-resolution satellite images from Google and Bing. Although this method does not replace GPS-based ground validation, it improves accuracy compared to the raw OSM data. Furthermore, the high degree of agreement between the visually validated and corrected coordinates suggests that the original OSM point data already provides relatively high positional accuracy in many cases.
However, some aspects of the detection and correction process could be improved in future applications. First, the exclusive use of a RetinaNet architecture could limit performance in more complex or visually diverse environments. Although RetinaNet has demonstrated high accuracy in correcting wind turbine coordinates, its performance is sensitive to variations in image quality and background complexity. This may reduce its generalizability when applied to unknown regions or alternative satellite image sources. These limitations become more apparent in large-scale applications where wind turbines need to be detected across large areas without predefined coordinate references. In such contexts, it can be difficult for the model to distinguish wind turbines from visually similar structures such as high-voltage pylons, cranes, or communication towers, especially in complex environments. Alternative approaches—such as modern transformer-based models—could offer greater robustness and accuracy, particularly under conditions of visual ambiguity or clutter. Second, the image data itself could be further diversified. The current approach is limited to single time frames from Bing and Google images, which may not capture seasonal variations or recent changes in infrastructure. The use of time series imagery or higher-resolution commercial datasets could improve model generalization and enable the detection of newer or smaller installations.
From a methodological perspective, the study highlights the importance of combining open spatial data, deep learning, and manual curation to overcome the usual limitations of public datasets. OSM offers broad coverage but lacks standardization and, in some cases, location accuracy. The integration of DL fills this gap by refining the location data, while manual enrichment ensures the completeness and technical detail required for meaningful application. Together, these components form a transferable and reproducible workflow for the creation of high-quality renewable energy datasets in data-poor regions.
6. Conclusions
To our knowledge, this study presents the most accurate, comprehensive, and up-to-date dataset on wind turbines and wind farms currently available for South Africa. By integrating publicly available OSM data, high-resolution satellite imagery, and advanced DL-based coordinate correction using RetinaNet, the spatial accuracy of turbine locations has been significantly improved. The dataset has been further enhanced through manual enrichment with important technical and temporal attributes such as wind farm names, turbine types, capacities, and commissioning years, information that is often missing or inconsistent in existing sources. Spatial metadata has been mapped to administrative boundaries from the GADM database, enabling regional analysis and integration with other relevant datasets.
This dataset thus provides accurate turbine coordinates, technical specifications, and harmonized metadata. It includes not only all large wind farms currently listed in the South African IPP project database, but also smaller and emerging wind farms that are not covered by official sources. The result is a high-quality, freely accessible dataset that provides a solid foundation for research, energy system modelling, infrastructure planning, and policy evaluation. It makes an important contribution to the open energy data landscape and provides a transferable methodology for creating similarly detailed datasets in other countries and for other renewable energy technologies.
Keeping the data up to date is particularly important given the rapid expansion of wind energy infrastructure and evolving project developments. In order to continue to provide valuable support to this ongoing development in South Africa’s dynamic wind energy sector, we are currently in discussions with national stakeholders to facilitate regular updates to the datasets. The aim is to establish a process that ensures updates every 1–2 years, including the review of new wind farms and the enrichment of technical attributes.
The dataset is freely available for download [36]. We strongly encourage its reuse and further development by the broader research and planning community.