1. Introduction
To meet the 1.5 °C climate target set in Paris in 2015, the world must achieve net zero carbon dioxide (CO2) emissions by 2050, which will require a systemic transformation of the global energy system through the transition from fossil fuels to renewable resources. The year 2023 set a new record for the deployment of installed renewable power capacity (473 GW), mainly driven by the highest annual increase ever in solar photovoltaic (PV) systems (347 GW, corresponding to 73% of the total), with China, the European Union, and the United States accounting for 83% of these new installations in 2023 [1].
In some regions, photovoltaic installations are beginning to face land scarcity due to high acquisition costs and competing land uses [2]. Consequently, several studies around the world are investigating the installation of photovoltaic systems along highways [3]. One of the infrastructures that has been considered is noise (or sound) barriers.
In Italy, a 1 km barrier (5.6 m high) with 3944 modules (the installed power is not reported) has been in operation since 2009, generating 750 MWh of electricity per year [4]. The Dutch Ministry of Infrastructure and Water Management, together with the company TNO, developed the Solar Highways project [5]. The barrier, installed in 2018, supports bifacial modules that generate electricity from both sides of the acoustic barrier; it is 400 m long and five metres high. The total installed capacity is 248 kWp, and in 2019 the system generated 202.8 MWh, resulting in an annual yield of 816.8 kWh/kWp. Based on this, Peerlings et al. [6] estimated that the generation potential on Dutch highways could reach 200 GWh/year. In that work, the locations of the sound barriers are known and mapped in a GIS database, and the PV generation potential is estimated using the irradiance data at the locations of the sound barriers, multiplied by the installation area and the efficiency. More recently, in Germany [7], a barrier 5 m high and 234 m long was capable of generating 51.5 MWh/year.
Portugal has 3322 km of highways, and it is estimated that there are 381 km of sound barriers, but their exact number and location have not been fully determined [8]. This poses an opportunity to take advantage of the potential synergies between the digital economy and sustainability, namely by applying artificial intelligence techniques to georeferenced imagery of highway infrastructure to identify suitable locations for the installation of PV systems [9]. Accordingly, this study proposes the use of You Only Look Once (YOLO), a state-of-the-art object detection algorithm, to automatically perform a large-scale identification of sound barriers along highways based on a small database of images, combined with the use of PVLIB to assess their suitability for solar energy generation under different installation configurations. In this way, this work extends the methodology proposed in [10].
The article is structured as follows: Section 2 provides a brief literature review on the use of YOLO; Section 3 presents the methodology used in this work; Section 4 presents the results; and Section 5 presents the main conclusions and future research directions.
2. Literature Review
Computer vision, a subfield of AI, enables machines to interpret and analyse visual data, providing automated insights from digital images and videos [
11]. Among its core tasks, object detection (OD) is particularly relevant to this study, as it combines object classification and object localisation [
12]. Whereas image classification assigns a label to an entire image, OD not only identifies objects within an image but also determines their spatial location using bounding boxes [
13].
Deep learning-based OD algorithms are generally categorised into two-stage and one-stage detectors. Two-stage detectors, such as region-based convolutional neural networks (R-CNNs) and their variants, first generate region proposals and then refine object classifications and bounding box coordinates [
14]. These models are highly accurate but computationally expensive, making them less suitable for real-time or large-scale applications.
In contrast, one-stage detectors like YOLO, first proposed by Redmon et al. in 2015 [
15], revolutionised object detection by unifying detection steps into a single-stage process, enabling significantly faster inference speeds. This efficiency is achieved by using convolutional neural networks (CNNs) to directly predict bounding boxes and class probabilities from an input image.
2.1. Artificial Neural Networks and Convolutional Neural Networks
OD models such as YOLO are built upon Artificial Neural Networks (ANNs), which aim to mimic the structure and function of the human brain. ANNs consist of neurones that process and transmit information through weighted connections, allowing the model to learn complex patterns in data.
Figure 1 represents an example of a neuron architecture, which consists of:
Inputs ($x_1, x_2, \cdots, x_n$): Data fed into the neuron, which could come from external sensory systems or other neurons within the network.
Weights ($w_1, w_2, \cdots, w_n$): Synaptic weights modify the importance of each input, simulating the way in which synapses between biological neurons strengthen or weaken signals. Weights can either amplify or attenuate the input values.
Bias ($b$): The bias is an additional parameter that helps the neuron adjust its output independently of the input signals, functioning as a threshold.
Net Input: The neuron computes the net input by summing the weighted inputs and adding the bias. Mathematically, this is expressed as:
$$u = \sum_{i=1}^{n} w_i x_i + b$$
Here, $u$ represents the net input to the neuron [16].
Activation Function: A non-linear function applied to $u$ to introduce non-linearity, enabling the network to learn complex patterns.
Figure 1. Example of a neuron architecture. Adapted from [17].
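As an illustration, a minimal sketch of this neuron computation in Python with NumPy (the input values, weights, bias, and the sigmoid activation are arbitrary choices for demonstration):

```python
import numpy as np

def neuron(x, w, b):
    """Single artificial neuron: weighted sum plus bias, then activation."""
    u = np.dot(w, x) + b                 # net input u = sum_i w_i * x_i + b
    return 1.0 / (1.0 + np.exp(-u))      # sigmoid activation (one common choice)

# Illustrative values: three inputs with arbitrary weights and bias.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(neuron(x, w, b))
```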
While traditional ANNs can handle various tasks, CNNs are specifically designed for image-based applications such as object detection and classification [
18]. The key difference between CNNs and ANNs lies in the presence of convolutional and pooling layers, which act as feature extractors before fully connected layers generate final predictions.
The convolutional layer is the core building block of CNNs. It applies filters/kernels (small, learnable matrices) to the input image, performing a convolution operation that extracts specific features, such as edges, shapes, and textures.
For a 2D image input, the convolution operation is defined as:
$$h(i, j) = \sum_{m} \sum_{n} f(i - m, j - n) \, g(m, n)$$
where $f$ represents the input image matrix, $g$ is the kernel matrix, and $h$ is the output feature map. The indices $i$ and $j$ correspond to the positions in the output matrix $h$, while $m$ and $n$ represent the positions within the kernel [19].
If the kernel size is 3 × 3, the indices
m and
n range from −1 to 1, as the kernel slides across the image. This operation effectively combines the kernel values with the corresponding region of the input image to produce the feature map, where each value represents the presence or intensity of the detected feature at that location. This process is illustrated in
Figure 2.
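To make the operation concrete, a minimal sketch of a 2D sliding-window filter (the cross-correlation variant that CNN libraries actually implement) over a single-channel image, using NumPy:

```python
import numpy as np

def conv2d(f, g):
    """'Valid' 2D cross-correlation of image f with kernel g."""
    kh, kw = g.shape
    oh, ow = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    h = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel with the image patch, then sum.
            h[i, j] = np.sum(f[i:i + kh, j:j + kw] * g)
    return h

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
edge_kernel = np.array([[-1, 0, 1],                 # 3x3 vertical-edge detector
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)
print(conv2d(image, edge_kernel))                   # 3x3 feature map
```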
2.2. YOLO General Model
Unlike region proposal-based methods, YOLO predicts all bounding boxes simultaneously, significantly reducing computation time and enabling real-time performance. It divides the input image into an S × S grid, where each cell predicts B bounding boxes, each with an associated confidence score, together with C class probabilities.
The confidence score reflects not only the likelihood that the box contains an object but also the accuracy of the bounding box’s coordinates. Mathematically, the confidence score is given by:
$$\text{Confidence} = P(\text{Object}) \times \text{IoU}^{\text{truth}}_{\text{pred}}$$
where IoU (Intersection over Union) measures the overlap between the predicted and ground truth bounding boxes, as illustrated in Figure 3. This metric evaluates how well the predicted box matches the true location, contributing to a more precise assessment of object detection models [21].
A detection is considered correct if IoU ≥ threshold (t). Otherwise, it is classified as incorrect. This threshold ensures consistent accuracy in object location in different scenarios [
23].
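A minimal sketch of the IoU computation for axis-aligned boxes given in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: a prediction partially overlapping a ground truth box.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14 -> below a 0.5 threshold
```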
The final YOLO output is a tensor of dimensions S × S × (B × 5 + C), which can be post-processed using Non-Maximum Suppression (NMS) to eliminate duplicate detections.
Figure 4 illustrates a simplified YOLO output vector for a 3 × 3 grid with three classes and a single bounding box per grid cell. In this case, the output tensor with the coordinates has dimensions 3 × 3 × 8. However, since only two classes are detected and there are no coordinates associated with the missing class, a "?" sign is assigned.
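A minimal sketch of the NMS post-processing step, operating on (box, score) pairs and reusing the iou() helper from the previous sketch:

```python
def nms(detections, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression over (box, score) pairs.

    Reuses iou() from the previous sketch.
    """
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in detections:
        # Keep a box only if it does not heavily overlap an already-kept box.
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept

# Two near-duplicate detections of the same barrier plus one distinct detection.
dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
print(nms(dets))  # the 0.8 duplicate is suppressed
```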
Due to its one-stage detection mechanism, YOLO stands out for its exceptional balance of speed and accuracy, enabling the reliable identification of objects at high speeds. Its versatility makes it invaluable in domains where both accuracy and efficiency are critical [
21].
2.3. YOLOv10 Architecture
When the research work presented in this paper was initiated, the latest YOLO version was YOLOv10 (May 2024). Since then, YOLOv11 (September 2024) and YOLOv12 (February 2025) have been released. However, YOLOv10 was still used due to its balance between accuracy and inference speed. Compared to earlier versions such as YOLOv9, YOLOv10 achieved higher detection performance [
24]. This underscores the trade-offs between speed and accuracy in real-time object detection, a key consideration when mapping large-scale sound barriers.
The YOLOv10 architecture, illustrated in
Figure 5, introduced key innovations to enhance both training and inference efficiency.
This architecture is mainly defined by two components: Dual Label Assignments and the Consistent Matching Metric.
In traditional YOLO models, training typically employs a one-to-many assignment strategy, where multiple positive samples are assigned to each ground truth instance. This technique, known as Task-Aware Learning, provides richer supervisory signals, leading to improved optimisation and overall performance. However, in inference, NMS is required to filter out redundant predictions, increasing computational overhead and latency.
To address this inefficiency, YOLOv10 integrates a dual label assignment strategy, combining one-to-many and one-to-one matching:
One-to-One Matching: Assigns a single prediction to each ground truth instance, eliminating the need for NMS, which enables end-to-end deployment. However, this approach alone may lead to weaker supervision, reducing accuracy, and slowing convergence.
One-to-Many Assignment: Provides stronger supervision by assigning multiple positive predictions per instance, but requires NMS at inference.
YOLOv10 strategically combines these two approaches by introducing an additional one-to-one head that mirrors the original one-to-many branch in structure and optimisation. During training, both heads are optimised jointly, leveraging the rich supervision from one-to-many assignments while ensuring one-to-one consistency. During inference, only the one-to-one head is used, eliminating the need for NMS and maintaining high efficiency without increasing computational cost.
To further enhance the dual label assignment strategy, YOLOv10 uses a consistent matching metric, ensuring alignment between the one-to-one and one-to-many training processes. This metric is defined as:
$$m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \text{IoU}(\hat{b}, b)^{\beta}$$
where:
$p$ is the classification score;
$\hat{b}$ and $b$ represent the prediction and instance bounding boxes;
$s$ indicates whether the prediction’s anchor point is within the instance;
$\alpha$ and $\beta$ control the relative importance of classification and localisation.
The one-to-many and one-to-one matching metrics are denoted as $m_{o2m} = m(\alpha_{o2m}, \beta_{o2m})$ and $m_{o2o} = m(\alpha_{o2o}, \beta_{o2o})$, respectively.
Using the same metric for both heads, YOLOv10 ensures that the best samples chosen by the one-to-many head are also the best for the one-to-one head. This alignment reduces the supervision gap between the two training branches, improving the one-to-one head’s predictions at inference. When $\alpha_{o2o} = \alpha_{o2m}$ and $\beta_{o2o} = \beta_{o2m}$, both heads select the same best samples, which leads to improved detection accuracy.
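A minimal sketch of this metric, reusing the iou() helper from the Section 2.2 sketch; the α and β values are illustrative, not the framework's exact defaults:

```python
def matching_metric(p, pred_box, gt_box, anchor_in_instance, alpha=0.5, beta=6.0):
    """Consistent matching metric m = s * p**alpha * IoU(pred, gt)**beta.

    s is 1 if the prediction's anchor point lies inside the instance, else 0.
    alpha and beta weight classification versus localisation (values illustrative).
    """
    s = 1.0 if anchor_in_instance else 0.0
    return s * (p ** alpha) * (iou(pred_box, gt_box) ** beta)

# With identical alpha/beta for both heads, the sample ranking is the same,
# so the one-to-many and one-to-one branches agree on the best samples.
print(matching_metric(0.9, (0, 0, 10, 10), (1, 1, 11, 11), True))
```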
By integrating dual label assignments with a consistent matching metric, YOLOv10 optimises both training efficiency and inference performance, making it a robust choice for real-time object detection tasks [
26,
27].
2.4. Applications of YOLO
YOLO has been successfully applied to many different fields. In agriculture, YOLO models are used to detect and classify crops, pests, and diseases, enabling precision agriculture techniques. These applications optimise farming operations, improve productivity, and reduce input costs [
28].
In healthcare, YOLO has significantly impacted diagnostic processes, helping to detect lesions, segment brain tumours, classify skin lesions, and detect personal protective equipment. These applications demonstrate YOLO’s adaptability to various challenges in medical imaging and diagnostics [
29]. YOLO has also supported public health measures, such as face mask detection and social distancing monitoring during pandemics, ensuring compliance with health regulations [
30].
Regarding surveillance and security, YOLO has proven invaluable for real-time monitoring and rapid identification of suspicious activities [
31]. Detecting unwanted human actions, especially in low-light conditions and in varying poses, remains a complex task. By integrating YOLO models into surveillance systems, security personnel can monitor environments more effectively and respond promptly to potential threats, enhancing public safety [
32].
In industrial settings, YOLO has been used for surface inspection processes to detect defects and anomalies, ensuring quality control in manufacturing. These systems can be seamlessly integrated into production lines, maximising efficiency and reducing operational costs [
33].
In the energy area, YOLO has been successfully employed for the detection of solar panels in satellite imagery, achieving a mean average precision (mAP) of 74.0%. This study demonstrated the feasibility of using computer vision for automated mapping of solar infrastructure, which is essential to assess the potential of photovoltaic systems on sound barriers [
34].
YOLO has also been used to detect sound barriers [10]. In that work, the authors used YOLOv3 to process Google Street View images in urban contexts [35]. They trained a model with 72,399 images and obtained an accuracy of 96.22%. Solar potential estimation was performed using a simple model, in which the estimated energy is calculated by multiplying the installed capacity of the solar PV system, the average peak sunlight hours, and an overall performance coefficient of 0.8.
Several studies have also explored the application of YOLO for infrastructure monitoring and object detection in similar environments, demonstrating the effectiveness of the algorithm in various domains. For example, for road crack detection, YOLOv10 achieved an accuracy of 77.4%, while an enhanced version (YOLOv10-D-CBAM) significantly improved the accuracy to 99.6%. This highlights the potential of methods based on convolutional neural networks (CNNs) for fine-grained infrastructure analysis, which could be adapted for the identification of sound barriers [36].
Since no existing datasets were available for this task, a dedicated dataset was created to train the YOLO model. In addition to object detection, this study integrates geolocation capabilities using a GoPro camera, allowing the detected structures to be accurately mapped. This enables two key applications:
Infrastructure Management—Highway operators can automatically map and monitor their infrastructure.
Solar Energy Feasibility Assessment—By performing the geolocation of the barriers, it is possible to cross-reference this information with solar irradiance data, allowing a precise evaluation of their solar energy generation potential.
Despite these promising applications, several challenges must be addressed, including the need for a large and diverse dataset, vegetation occlusions, and variations in the colour and size of sound barriers. However, the proposed approach enables both automated detection and geolocation, facilitating broader applications beyond sound barriers, such as mapping other highway infrastructure through platforms such as Google Earth Pro [
37].
Given these advantages in diverse scenarios, this study uses YOLOv10 to detect sound barriers efficiently, enabling a systematic evaluation of their feasibility for photovoltaic integration.
3. Methodology
The object detection problem of sound barriers can be divided into four phases: the generation of a training dataset, which includes the annotation of the objects to be identified in an image; the training of the object detection model (e.g., YOLO); the application of the trained model to new images, which in our case includes the geographical localisation of the objects and their count; and finally, the estimation of the amount of electricity that can be generated on the sound barriers.
Figure 6 presents a flowchart of the applied methodology.
As mentioned above, we extend the methodology proposed in [10] in different ways. Firstly, we use a more recent version of YOLO (YOLOv10 versus YOLOv3) and train the model using a significantly smaller training dataset (fewer than 1000 images versus 72,399). Secondly, we integrate other image sources in addition to Google Street View, namely video images recorded using a GoPro 7 camera. Thirdly, we integrate the use of PVLIB for the estimation of solar energy generation, which improves the energy generation approach in three ways: the estimation of the irradiance on the modules is improved, as it is computed using typical meteorological year data; the estimation of the generation of the panel is improved, as it uses equivalent diode circuit models for the PV modules rather than a general efficiency; and finally, it accounts for the module arrangements in strings and arrays according to the inverter characteristics. In this way, the proposed methodology can easily be implemented to automatically process large extensions of highway sound barriers.
3.1. Dataset Preparation and Annotation
For any computer vision algorithm to detect objects in various scenarios, it is essential to annotate a large and varied dataset that captures a wide range of real-world conditions. To this end, four different datasets were prepared.
Initially, we manually annotated 40 images of sound barriers, mainly sourced from Google Street View along the A32 highway [
38], which connects Oliveira de Azeméis to Vila Nova de Gaia. The authors had access to a georeferenced database with the coordinates of sound barriers for some Portuguese highways, and Google Street View was selected as the image data source, as it is very easy to obtain the images based on the coordinates of a certain location. This allowed for the efficient extraction of images using the Google Street View static API.
To facilitate the annotation process, Roboflow was used [
39]. Roboflow is a platform that enables efficient bounding box annotations and provides built-in data augmentation techniques. The annotations were primarily bounding boxes, but in cases where the sound barriers were significantly rotated, polygons were used to achieve a more precise representation. Furthermore, to ensure consistency in the input of the model, all images were resized to 640 × 640 pixels and automatically orientated as part of the pre-processing pipeline.
Figure 7 illustrates the annotation approach, in which bounding boxes were applied to locate individual sound barriers along the highway.
To increase the diversity of the dataset and improve the model generalisation, the following augmentation techniques were applied:
Horizontal flipping: Since sound barriers can appear in various orientations, flipping improved the model’s adaptability to different perspectives;
Rotations (small range): This range was selected to account for the slight angular deviations commonly seen in video footage captured from a moving vehicle. It was determined through trial and error, ensuring that the rotation improved model robustness without introducing unrealistic distortions not representative of real driving scenarios;
Blur (up to 0.7 pixels): Applied to simulate mild motion blur caused by vehicle movement during video capture. A blur intensity of 0.7 pixels was chosen to reflect driving speeds without significantly degrading image quality.
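These augmentations were applied through Roboflow's pipeline; a minimal sketch of equivalent transformations with Pillow (the file name and rotation angle are illustrative):

```python
from PIL import Image, ImageFilter

img = Image.open("barrier.jpg")  # placeholder file name

# Horizontal flip, a small rotation, and a mild Gaussian blur.
flipped = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
rotated = img.rotate(5, resample=Image.Resampling.BILINEAR)   # angle is illustrative
blurred = img.filter(ImageFilter.GaussianBlur(radius=0.7))    # 0.7-pixel blur

for suffix, out in (("flip", flipped), ("rot", rotated), ("blur", blurred)):
    out.save(f"barrier_{suffix}.jpg")
```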
In the end, the original 40-image dataset was extended to 94 images. Then, three additional datasets were created using further images from videos of sound barriers collected on Highway A8, which connects Lisbon to Leiria, using a GoPro 7 camera with GPS location [40]:
A dataset with 629 images, including the augmented ones, in which all sound barriers were annotated (complete or incomplete, close or far away, and even transparent sound barriers);
A dataset with 647 images, including the augmented ones, which excluded sound barriers that were incomplete, far away, or transparent;
A dataset with 383 manually annotated images, which, after augmentation, resulted in a final dataset of 984 images.
To ensure an unbiased evaluation, a common validation dataset of 76 images that include 64 sound barriers was created using frames extracted from unseen videos from A1, A8, A32, and IC17, and used for all trained models. This approach allowed for a direct comparison of model performance under identical conditions, mitigating potential over-fitting to the training images.
The dataset used for the final model was divided into 92% training, 6% validation, and 2% testing, prioritising a larger training set to improve model learning while ensuring sufficient validation data for performance evaluation. This dataset consists of sound barriers with and without vegetation, with and without graffiti, and with varying colours and formats. Despite its existing complexity, the dataset could be further diversified to include additional scenarios that were not covered in this study.
3.2. Model Training
To train the object detection model, we employed both YOLOv10n, a lightweight version optimised for real-time applications, and YOLOv10m, a medium-sized version intended for general-purpose use.
The naming convention for the models follows the structure SB_X_Y, where SB refers to sound barriers, X represents the number of images in the dataset, and Y indicates the YOLO version used.
Table 1 summarises the key hyperparameters used in training the YOLOv10 models. The batch size refers to the number of samples (images) that are processed together in one iteration during training. The initial model was trained using YOLOv10n with a batch size of 8, restricted by CPU limitations. As training progressed, YOLOv10m was adopted, leveraging a GPU (T4 on Google Colab), allowing the batch size to increase to 16, improving stability.
The number of epochs is the number of complete passes through the training dataset; it was set at 200, as increasing it to 250 epochs did not result in performance improvements.
The confidence threshold is the minimum probability required for an object to be classified as a sound barrier. Fine-tuning this parameter is crucial: a higher confidence threshold reduces false positives but may cause the model to miss actual sound barriers (increasing false negatives), while a lower threshold increases true positives but can introduce misclassifications (increasing false positives). We considered a value of 0.4 to balance detection sensitivity, ensuring that even less prominent sound barriers were identified without significantly increasing false positives.
Data augmentation, including horizontal flipping, rotation, and blur, was applied consistently across the models to enhance generalisation and robustness.
The YOLOv10 model was trained using the AdamW optimiser, automatically selected by the Ultralytics framework. The default training parameters included a learning rate of 0.002, a momentum of 0.9, and a weight decay of 0.0005 applied to selected weights.
The model was trained using the code snippet described in Algorithm 1.
Algorithm 1 YOLO Model Training
Require: YOLO model file, dataset file
Ensure: Trained YOLO model
1: model ← LoadModel(YOLOv10m) ▹ Load YOLOv10m model
2: Train model on the dataset with the hyperparameters of Table 1
3: return model
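A minimal, runnable sketch of this step with the Ultralytics framework (the weight and dataset file names are placeholders; the hyperparameters follow Table 1):

```python
from ultralytics import YOLO

# Load a pretrained YOLOv10m checkpoint (file name is a placeholder).
model = YOLO("yolov10m.pt")

# Train with the hyperparameters reported in Table 1.
results = model.train(
    data="sound_barriers.yaml",  # dataset definition file (placeholder)
    epochs=200,
    batch=16,
    imgsz=640,
)
```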
If the model is accurate, the number of detected bounding boxes should correspond directly to the number of sound barriers. However, in video-derived images, the same barrier may appear in multiple frames depending on the vehicle’s speed and the video’s frame rate. In addition, accurately mapping the sound barriers is essential for estimating the electricity generation potential of PV systems. Therefore, the next step in the methodology is to determine their exact geographic locations.
3.3. Geolocation of Sound Barriers
For the geolocation of the sound barriers, a GoPro 7 camera was used to record video footage along highways while also capturing GPS coordinates and timestamps at one-second intervals. The metadata were extracted using ExifTool [41], which generated a GPX file containing the recorded GPS data.
To identify the locations of the detected sound barriers, the trained YOLOv10m model was applied to each video frame. The model detected sound barriers and recorded the frame indices where the detections occurred. Since the video was recorded at a known frame rate and the start time of the recording was available, each frame index was converted into a precise timestamp using the following relation:
$$t = t_{\text{start}} + \frac{\text{frame index}}{\text{frame rate}}$$
This allowed the transformation of detection frames into real-world timestamps. The timestamps were then matched with the GPS data extracted from the GPX file to determine the geolocations of the first and last detected sound barriers.
Algorithm 2 was used for object detection and geolocation extraction:
Algorithm 2 Geolocation algorithm for a particular dataset
Require: YOLO model file, video file
Ensure: First and last detection timestamps
1: detections ← ∅
2: for each frame f in the video do
3:  if a sound barrier is detected in f then
4:   Append the index of f to detections
5:  end if
6: end for
7: if detections ≠ ∅ then
8:  print “First detection timestamp:”, timestamp of the first element of detections
9:  print “Last detection timestamp:”, timestamp of the last element of detections
10: end if
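A minimal sketch of this procedure in Python, assuming the trained weights, a video file, a known frame rate, a known recording start time, and a gpxpy-parsed track (all file names and values are placeholders):

```python
import datetime
import gpxpy
from ultralytics import YOLO

FPS = 30  # assumed video frame rate
model = YOLO("sb_984_v10m.pt")                         # trained weights (placeholder)
start_time = datetime.datetime(2024, 5, 1, 10, 0, 0)   # recording start (illustrative)

# Collect the indices of frames containing at least one detection.
detected = [i for i, r in enumerate(model.predict("a8_run.mp4", conf=0.4, stream=True))
            if len(r.boxes) > 0]

if detected:
    t_first = start_time + datetime.timedelta(seconds=detected[0] / FPS)
    t_last = start_time + datetime.timedelta(seconds=detected[-1] / FPS)

    # Match the timestamps against the GPX track extracted with ExifTool.
    with open("a8_run.gpx") as f:
        gpx = gpxpy.parse(f)
    points = [p for trk in gpx.tracks for seg in trk.segments for p in seg.points]
    first_fix = min(points, key=lambda p: abs(p.time.replace(tzinfo=None) - t_first))
    last_fix = min(points, key=lambda p: abs(p.time.replace(tzinfo=None) - t_last))
    print("First detection:", t_first, (first_fix.latitude, first_fix.longitude))
    print("Last detection:", t_last, (last_fix.latitude, last_fix.longitude))
```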
Finally, the GPX file was modified to retain only the coordinates of the detected sound barriers. This modified file was imported into Google Earth Pro, allowing a precise visualisation of the sound barriers detected along the highway for validation purposes. Furthermore, by calculating the distance between the first and last detected sound barriers, an estimate of the total number of sound barriers on a given highway section was obtained, which can be compared to the number of detected bounding boxes.
Knowing the exact number of sound barriers and their exact location, it is possible to estimate the potential for PV generation.
3.4. Estimating the Electricity Generation Potential
To estimate the generation potential of PV systems installed on sound barriers, we use the PVLIB Python library, a tool that provides a set of functions and classes to simulate the performance of photovoltaic energy systems [42]. The advantage of using PVLIB over the simple product of irradiance, area, and efficiency, as performed in [6,10], is that we obtain a more accurate generation potential, as it takes into account the typical meteorological year data and the losses of specific PV system equipment.
To simulate a PV system, it is necessary to: (1) choose a PV module and an inverter; (2) provide a location of the system to obtain meteorological data such as irradiance and temperature; and (3) indicate the slope (tilt angle) and the orientation (azimuth angle) of the PV panels.
Sound barriers are generally rectangular structures measuring 4 m in width and 5 m in height. PV panels can be installed directly on the surface (in which case the tilt angle is 90°) or placed on top of the barriers using another tilt angle. Using the Google Street View API, it is possible to estimate the orientation of sound barriers by assuming that they are perpendicular to the image’s orientation.
In this way, knowing the number of sound barriers, their dimensions, location, orientation, and tilt, as well as the PV module and inverter, it is possible to estimate the yearly energy generation, considering typical meteorological year data for the location.
In this work, we consider the SunPower X-Series PV module X22-485-COM (485 W) and the Yaskawa Solectria Solar inverter PVI 10 kW 208 (10 kW/208 V), both available in the PVLIB database [42]. The PV module has an area of 2.47 m² (1.3 m × 1.9 m), and we assume that at most 60% of the barrier area is available for installation (4 modules), to avoid overestimation due to potential vegetation shading effects.
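A minimal sketch of such a simulation with pvlib (the module and inverter database keys, coordinates, temperature model, and string sizing are assumptions for illustration):

```python
import pvlib
from pvlib.location import Location
from pvlib.pvsystem import PVSystem, retrieve_sam
from pvlib.modelchain import ModelChain
from pvlib.temperature import TEMPERATURE_MODEL_PARAMETERS

# Database keys are assumed; check the exact names in the SAM tables.
module = retrieve_sam('CECMod')['SunPower_SPR_X22_485_COM']
inverter = retrieve_sam('cecinverter')['Yaskawa_Solectria_Solar__PVI_10KW_208__208V_']

site = Location(39.35, -9.15, tz='Europe/Lisbon')   # illustrative point near the A8
weather = pvlib.iotools.get_pvgis_tmy(site.latitude, site.longitude,
                                      map_variables=True)[0]  # TMY weather DataFrame

system = PVSystem(
    surface_tilt=90, surface_azimuth=225,   # surface-mounted, facing southwest
    module_parameters=module, inverter_parameters=inverter,
    temperature_model_parameters=TEMPERATURE_MODEL_PARAMETERS['sapm']['open_rack_glass_glass'],
    modules_per_string=10, strings_per_inverter=2,  # ~9.7 kW per 10 kW inverter (assumed)
)
mc = ModelChain(system, site, aoi_model='physical', spectral_model='no_loss')
mc.run_model(weather)
print("Annual AC energy (kWh):", mc.results.ac.sum() / 1000)
```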
4. Results and Discussion
In this section, we present the performance of the models across different datasets, followed by a detailed comparison of their predictions. Finally, we discuss the geolocation results and the estimation of the generation potential of the PV system.
4.1. Model Performance
When training a YOLO model, the algorithm generates a set of weights, which captures the features that the model has learnt from the training data. These weights determine how the model detects and classifies objects, refining its ability to differentiate between relevant and irrelevant elements. The best-performing weights are then used for the prediction of sound barriers in new data.
For the prediction phase, we applied the model to unseen video frames (the validation dataset with 76 images and 64 sound barriers).
Table 2 summarises the key performance metrics obtained for each model.
The results confirm that the size of the dataset significantly affects performance. The smallest dataset (SB_94_v10n) resulted in the model with the lowest precision, highlighting its limited generalisation ability. In contrast, the largest dataset (SB_984_v10m) yielded the best performance, demonstrating the strong correlation between dataset expansion and model accuracy.
When comparing SB_647_v10m and SB_629_v10m, two models trained on similar dataset sizes, the main difference lies in recall. The SB_629_v10m model had a recall of 0.78, lower than the 0.82 recall of SB_647_v10m. This discrepancy is due to the annotation methodology. Recall that in the dataset with 629 images, transparent sound barriers were annotated in an attempt to include them in the detection process. However, this led the model to learn patterns from objects behind the barriers rather than the barriers themselves, increasing false positives. Since transparent barriers are relatively uncommon on Portuguese highways, they were removed from the annotations in the subsequent dataset to avoid negatively impacting model performance. Furthermore, in the 629-image dataset, distant sound barriers were also annotated, while in the 647-image dataset only sound barriers that are clearly visible and closer to the camera were annotated. These annotation differences contributed to the observed differences in recall, highlighting the importance of consistent and accurate labelling.
To evaluate the feasibility of using YOLOv10m or YOLOv10n, we applied both models to smaller datasets and assessed their performance, focussing on mAP, as it is the most relevant metric for this type of project. The mAP@0.5 metric evaluates both the localisation and classification accuracy, providing a more comprehensive assessment of detection performance. By comparing SB_94_v10n with SB_94_v10m and SB_629_v10n with SB_629_v10m, we observed an improvement in precision values (0.7 for SB_94_v10n compared to 0.79 for SB_94_v10m, and 0.83 for SB_629_v10n compared to 0.84 for SB_629_v10m). Although there was a slight decrease in recall, the overall mAP@0.5 indicates that YOLOv10m achieves higher accuracy. Since this improvement does not come with a significant cost in inference time, YOLOv10m was selected as the preferred model for this study.
The SB_984_v10m model’s mAP@0.5 of 0.91 confirms that it not only detects sound barriers accurately but also correctly localises them within the image, a key factor for real-world applications.
4.2. Comparison of Model Predictions
Although comparing performance metrics is important to evaluate and select the best model, the most effective way to understand their differences is by visualising particular examples.
Figure 8 provides a comparison of the different models when predicting sound barriers in frames captured from various highways.
To evaluate and compare the models, we discuss the differences in bounding box accuracy, false positives, false negatives, and confidence scores.
4.2.1. Bounding Box Accuracy
One of the first observable differences between the models is how well the bounding boxes fit the actual sound barriers. The SB_984_v10m model exhibits well-defined and tightly fitted bounding boxes for almost all detected sound barriers, indicating a high level of localisation precision. This precision is particularly important when calculating properties such as the area of detected objects, which is relevant for assessing photovoltaic potential.
The other models also demonstrate reasonable performance, especially the SB_94_v10n model, which performs surprisingly well given the limited number of images it was trained on.
4.2.2. False Positives
False positives occur when the model incorrectly classifies non-sound barrier structures as sound barriers. The SB_94_v10n model exhibits a significant number of misdetections, particularly in the fourth column, third-row frame, where it falsely detects multiple sound barriers with confidence scores of 0.5 or higher, despite the image containing only background. This contributes to its lower precision value.
Similarly, the SB_647_v10m model also presents misclassifications, with the most notable case occurring in the fourth column, first-row frame, where a section of the road is mistakenly identified as a sound barrier with a high confidence score.
Both examples can be seen in detail in
Figure 9.
These errors may be attributed to inaccuracies in the annotation process or the smaller size of the dataset, as was explained in
Section 4.1.
4.2.3. False Negatives
False negatives, where actual sound barriers are not detected, present the most noticeable contrast among the models.
The SB_94_v10n model does not detect a substantial number of sound barriers, as is evident in
Figure 10, where multiple structures remain undetected.
The SB_647_v10m and SB_629_v10m models, while more successful in identifying most sound barriers, struggle with certain challenging conditions, such as horizontally aligned barriers, barriers partially covered by vegetation, or those that are not perfectly aligned, as shown in detail in
Figure 11. These limitations are likely due to their smaller training datasets, resulting in weaker generalisation capabilities. This correlates directly with their lower recall values and reinforces the importance of dataset expansion and diversity in improving detection robustness.
In contrast, the SB_984_v10m model not only detects the majority of sound barriers but also identifies some that are particularly difficult to discern, even for the human eye (the last frame in the fourth column, fourth row). See the detail in
Figure 12.
4.2.4. Confidence Score
The confidence score associated with each detection is critical to assessing the reliability of predictions.
The SB_984_v10m model consistently produces high confidence scores, reflecting its balanced trade-off between precision and recall, as well as its robust generalisation ability. Some predictions even achieve a perfect confidence score of 1, meaning complete certainty in detection.
For the other models, confidence scores are noticeably lower for the same predictions. This suggests that models trained on smaller or less diverse datasets not only misclassify objects but also do so with high certainty, which can be problematic in real-world applications. In the context of photovoltaic feasibility assessment, high-confidence false positives could lead to incorrect site selection, while false negatives might result in missed opportunities for solar energy generation. This highlights the need for careful threshold tuning and refinement of the dataset to optimise the practical usability of the model.
Based on these results, the SB_984_v10m model was selected for geolocating sound barriers in videos captured using a GoPro 7.
4.3. Geolocation
After selecting the best-performing model and following the approach described in
Section 3.3, we applied it to a video recorded using a GoPro 7 to geolocate the detected sound barriers along a highway. The SB_984_v10m model was chosen for this task because of its strong precision and recall and mean average precision, which ensures reliable detections.
Figure 13 shows the track where the model identified sound barriers. The detected locations match exactly where the video was recorded, with the track starting at the first detected sound barrier and ending when no more barriers are visible. This validates the model’s ability to accurately locate sound barriers along highways.
This map also enables a qualitative assessment of the model’s spatial accuracy by allowing a visual comparison of the predicted detection points with the actual sound barrier locations.
Geolocating sound barriers is particularly useful for mapping these structures and assessing their potential for renewable energy applications. By determining their exact locations, we can estimate the solar irradiance at each site, information that would otherwise be unavailable. Furthermore, since sound barriers are continuously placed along highways, the total detected segment length of 292 m represents the full extent of the barriers in this section. This allows for an estimation of the total number of barriers present, contributing to large-scale mapping efforts. To derive the number of independent physical barriers from frame-level detections, a spatial filtering approach was used: a new barrier was only registered when the geodesic distance from the previous detection exceeded 4 m. This ensured that repeated detections of the same structure across consecutive frames were not overcounted, resulting in a total of 64 spatially distinct sound barriers along the analysed section, which was further confirmed by manually counting the number of sound barriers on the predicted track.
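A minimal sketch of this spatial filtering, using geopy’s geodesic distance (the coordinates are illustrative):

```python
from geopy.distance import geodesic

MIN_SPACING_M = 4.0  # minimum geodesic distance to register a new barrier

def count_distinct_barriers(fixes):
    """Count spatially distinct detections from an ordered list of (lat, lon) fixes."""
    barriers = []
    for fix in fixes:
        if not barriers or geodesic(barriers[-1], fix).meters > MIN_SPACING_M:
            barriers.append(fix)
    return len(barriers)

# Three GPS fixes: the second is a re-detection of the same barrier (~1 m away).
fixes = [(39.3500, -9.1500), (39.35001, -9.15001), (39.3505, -9.1500)]
print(count_distinct_barriers(fixes))  # -> 2
```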
This capability enables the automatic identification and mapping of highway sound barriers, which can be further refined for applications such as solar panel feasibility studies and infrastructure planning.
4.4. PV Generation Potential in Sound Barriers
The detected sound barriers (64 over an extension of 292 m) are orientated toward the southwest (azimuth angle of 225°). Please note that the 64 sound barriers are not mounted continuously. The azimuth of the barriers is derived from the orientation of the road (extracted from the Google Street View API), assuming that the sound barriers are mounted perpendicularly to the road. The area of each sound barrier is 4 × 5 = 20 m², but considering the height of the module (1.3 m) and the width (1.9 m), the available area per sound barrier is 12 m², which means that if the modules are installed directly on the surface of the sound barriers, it will be possible to install 296 modules. If they are installed at the top of the sound barriers, with a length of 1.9 m, it will be possible to install 128 modules at the optimal tilt angle. The irradiance data and the optimal tilt angle are extracted from PVGIS [43].
Table 3 shows the results for both scenarios. Although mounting the PV modules directly on the surface allows a higher installed capacity (146.8 kW) and more yearly energy (167.5 MWh), the yield is significantly lower (1141 kWh/kW) than in the case where the PV panels are mounted on top. This result is aligned with a previous study on the feasibility of installing PV systems on sound barriers in the US [44]; however, that study’s economic feasibility analysis does not directly address the fact that surface mounting could use existing sound barriers, whereas installation on top could require new sound barriers structurally capable of supporting the PV module loads under different weather conditions (e.g., wind). In any case, the presented results demonstrate that highway sound barriers have an interesting solar yield for generating renewable electricity. In addition, sound barriers are located where there are households close to the highway, so the installation of PV systems on sound barriers can be viewed as an opportunity to develop energy communities between nearby households and the highway sound barriers. However, due to the uncertainty associated with installations in this type of location, a risk-averse energy management strategy should be used to avoid over-optimistic solutions [45].
5. Conclusions
This study demonstrated the potential to leverage computer vision techniques, specifically the YOLOv10 object detection algorithm, for the identification and geolocation of highway sound barriers. By automating the detection process, this approach facilitates infrastructure mapping and provides essential data for evaluating the feasibility of PV system integration on sound barriers.
The results indicate that model performance is highly dependent on the diversity of the dataset and the consistency of the annotation. The final model, trained on an expanded dataset of 984 images, achieved strong detection performance, with a mAP@0.5 of 0.91. The geolocation approach, using video data captured with a GoPro 7, accurately mapped the detected barriers along highways, demonstrating the viability of this method for large-scale infrastructure monitoring.
5.1. Challenges
The diversity of the dataset played a crucial role in the model’s ability to generalise to different scenarios. Since most of the sound barriers in the training set were vertically orientated, the models struggled to detect horizontally positioned barriers. Furthermore, variations in colour and the presence or absence of graffiti and/or surrounding vegetation further affected the detection accuracy, as these elements altered the visual characteristics of the barriers. The videos used for the prediction were recorded along the A1, A32, A8, and IC17, providing a broader evaluation of the model’s ability to detect sound barriers in diverse highway environments.
One of the primary challenges in this work was the annotation process. Initially, all annotations needed to be revised to ensure that transparent sound barriers were not included. This issue led to unexpected misclassifications, as the model learnt to detect objects behind transparent barriers rather than the barriers themselves. Additionally, partway through the project, we decided to focus only on sound barriers that were closer and more visible, excluding those that were too far in the background. This required re-annotating a significant portion of the dataset, which was both time consuming and labour intensive. Manual annotation remains one of the most demanding aspects of dataset creation, particularly given that all images were captured manually and that the videos were recorded specifically for this study.
Another challenge was related to model performance under different driving conditions. While augmentation techniques were applied to improve generalisation, the model showed increased difficulty in detecting sound barriers when the vehicle’s velocity was higher. This is expected due to motion blur and rapid scene changes, but despite this, the model still performed well, demonstrating its potential for real-time applications.
Finally, the size of the dataset was a constraint. Since data collection was entirely manual, expanding the dataset was a slow process, limiting the diversity of training examples. A larger and more varied dataset could further improve the model’s ability to generalise between different highway environments and barrier types.
5.2. Future Work
This work can be further extended in several directions. The first is to improve generalisation and enable the detection of sound barriers in other regions or countries; the dataset should be expanded to include images representing a broader diversity of sound barrier designs, road geometries, and environmental conditions, ensuring that the model performs reliably beyond Portuguese highways and adapts to different infrastructure contexts. The second is to apply the same methodology to identify other areas along highways that could host PV systems, namely slopes (excavation or landfill) as well as clear areas on the edgeways; however, these pose additional challenges, as such objects are not as homogeneous as sound barriers, vary in size and colour, and may or may not include vegetation. The third potential development is to contribute to the digitalisation of highway operators’ databases, namely by identifying different objects such as road signs. Finally, another important development is to perform an economic analysis and use the economic factors to evaluate and promote the social acceptance of this type of system on highways [46].