DARE-YOLO: A Lightweight Object Detection Algorithm and Its FPGA Acceleration for Sustainable PV Panel Inspection

Yang, Yuchuan; Xing, Feng; Qin, Caiyan; Chen, Shuxu; Shin, Hyundong; Lee, Sungyoung

doi:10.3390/su18104999


by Yuchuan Yang ¹, Feng Xing ¹, Caiyan Qin ²,*, Shuxu Chen ³, Hyundong Shin ³ and Sungyoung Lee ⁴,*

¹ School of Electrical Engineering, Liaoning University of Technology, Jinzhou 121001, China

² School of Robotics and Advanced Manufacture, Harbin Institute of Technology, Shenzhen 518055, China

³ Department of Electronics and Information Convergence Engineering, KyungHee University, Yongin-si 17104, Gyeonggi-do, Republic of Korea

⁴ College of Software, School of Computer Science and Engineering, KyungHee University, Yongin-si 17104, Gyeonggi-do, Republic of Korea

* Authors to whom correspondence should be addressed.

Sustainability 2026, 18(10), 4999; https://doi.org/10.3390/su18104999

Submission received: 16 April 2026 / Revised: 11 May 2026 / Accepted: 12 May 2026 / Published: 15 May 2026

(This article belongs to the Special Issue Sustainable Solar Power Systems and Applications)


Abstract

As a critical component of sustainable energy systems, photovoltaic (PV) panels require efficient maintenance. While deep learning is an important approach for PV panel defect detection, the high complexity of existing models and their substantial computational demand make deployment on edge platforms difficult. This paper studies an acceleration method for photovoltaic panel defect detection on the Zynq-7020 heterogeneous platform. We design DARE-YOLO, a lightweight network for photovoltaic panel defect detection, together with a Zynq-based accelerator. In DARE-YOLO, we introduce RepConv and a lightweight single-path backbone to reduce the memory bandwidth overhead caused by multi-branch structures. We further design a Dilated Context Block (DCB) and a Dual-scale Decoupled Head (DDH), which improve the detection accuracy of DARE-YOLO. On the Zynq platform, we develop the accelerator through a mixed fixed-point quantization strategy, a custom convolution IP core, and pipeline unrolling. These optimizations reduce data access latency, improve computational parallelism, and increase computational throughput. Experimental results show that DARE-YOLO achieves 93.84% mAP@0.5 with only 6.4 M parameters. The accelerator has a total on-board power consumption of only 1.95 W while delivering a throughput of 37.5 GOPS and an energy efficiency of 19.23 GOPS/W, with an image inference latency of 661.3 ms. This low-power, high-efficiency co-design paradigm supports the long-term reliability of renewable energy facilities.

Keywords:

object detection; FPGA acceleration; photovoltaic defect detection; edge computing; hardware–software co-design; sustainable energy maintenance

1. Introduction

Against the broader backdrop of the global energy transition, photovoltaic (PV) power generation has seen rapid growth as a core clean energy solution [1]. PV panels operate for long periods in complex outdoor environments and are highly vulnerable to physical impact and environmental erosion, which leads to various defects such as dust coverage, bird droppings, surface cracks, and deep electrical damage [2]. These heterogeneous defects not only reduce photoelectric conversion efficiency, but may also induce hotspot effects and even cause fires in extreme cases, thereby posing a serious threat to the operational safety of the entire power station [3,4,5]. Traditional defect detection relies heavily on manual visual inspection or basic image processing algorithms, such as Otsu thresholding, Canny edge detection, and template matching. While these conventional methods are computationally simple, they depend heavily on hand-crafted features and strict environmental conditions. Such a paradigm is time-consuming and labor-intensive, and it cannot meet the demand for frequent maintenance in ultra-large-scale PV arrays [6]. This limitation becomes more pronounced for low-contrast and small-scale defects, where conventional methods are prone to missed detections and false alarms, revealing the limited generalization ability of traditional vision techniques [7]. In recent years, deep learning methods represented by convolutional neural networks (CNNs) have reshaped industrial visual inspection [8,9]. The mathematical foundations and architectures of these detection techniques draw on deep learning paradigms [10,11] and, for hardware implementation, on digital signal processing theory [12,13].

Among them, the YOLO family of one-stage detectors has gradually become a mainstream baseline for high-frequency defect detection because of its strong end-to-end inference efficiency [14,15,16].

However, real industrial defect datasets exhibit a highly long-tailed distribution [17]. Defects such as electrical damage and physical scratches often occupy only a very small number of pixels and have ambiguous geometric boundaries [18]. During multi-level feature downsampling, conventional CNNs tend to lose the high-frequency edge details of tiny defects in deep semantic representations [19,20]. At the same time, the class imbalance caused by the long-tailed distribution makes gradient optimization dominated by a large amount of background information, which suppresses the model’s ability to capture features of scarce defect samples [21]. In the regression stage, the irregular physical boundaries of small defects often cause local geometric drift in the predicted boxes, which reduces detection accuracy [22]. To alleviate this issue, some studies introduce complex feature pyramids, yet the forced expansion of the receptive field instead brings substantial background noise [23]. In addition, conventional detection heads suffer from feature misalignment between classification and regression when handling long-tailed data, which makes optimization more difficult [24]. Therefore, exploring a new network topology that balances sensitivity to weak fine-grained features with a highly simplified forward path remains a key challenge in industrial vision [25,26,27].

Beyond the accuracy challenge at the algorithm level, edge computing platforms in industrial scenarios are usually constrained by strict power and memory budgets [28]. Portable inspection devices such as micro unmanned aerial vehicles (UAVs) cannot accommodate high-power GPUs [29]. High-power computing platforms run counter to the carbon reduction goals of renewable energy. Field-programmable gate arrays (FPGAs), with their superior computational energy efficiency and highly flexible low-level spatial parallelism capabilities, have become an ideal platform for green computing in industry [30,31].

Nevertheless, mainstream lightweight vision networks commonly adopt complex multi-branch topologies and dense cross-layer connections [32]. Such architectures require frequent off-chip memory access, which quickly exhausts the limited internal bus bandwidth of heterogeneous chips [33]. The high latency of data movement severely limits the effective speed of the underlying logic array, so the theoretical peak computing capability cannot be translated into actual throughput [34]. Moreover, directly deploying full-precision floating-point models consumes a large amount of DSP resources, while conventional low-bit quantization may introduce severe information loss and cause small-object features to disappear [35]. The separate deployment of operators at different scales further consumes on-chip logic resources and reduces the overall rate of hardware reuse [36]. This disconnect between algorithm design and physical hardware constraints limits the large-scale deployment of vision models on edge devices [37].

In summary, this study aims to bridge the existing scientific gap, namely the contradiction between complex algorithms for long-tail and tiny defect detection and the resource constraints of edge deployment. To address the fundamental engineering tension between visual detection and low-latency physical deployment, this study proposes DARE-YOLO, a lightweight object detection algorithm with deep co-optimization for heterogeneous hardware, where DARE denotes Dilated Aggregation and Reparameterized Edge. The main contributions of this paper are as follows:

The DARE-YOLO model employs RepConv to construct a hardware-friendly single-path backbone network, and integrates a Dilated Context Block (DCB) and a Dual-scale Decoupled Head (DDH) to achieve high-precision photovoltaic panel defect detection while reducing the number of model parameters.
Under the on-chip resource constraints of Zynq, we design a custom convolution IP core, which pushes the hardware throughput of single-path forward inference toward the physical limit.
We adopt a mixed fixed-point quantization strategy on the Zynq-7020 platform to improve data transfer efficiency and reduce on-chip resource usage, thereby providing practical support for PV defect detection deployment.

The remainder of this paper is organized as follows. Section 2 presents the lightweight PV defect detection model DARE-YOLO and describes its core modules in detail, with a focus on small defect size, complex morphology, and the limited computing capability of the target platform. Section 3 introduces a dedicated hardware accelerator and a software-hardware co-designed quantization strategy under the resource constraints of the Zynq platform. Section 4 reports experimental results and analysis. Through ablation studies and comparisons with mainstream YOLO models, we verify the effectiveness of the proposed method. We further compare different hardware platforms to demonstrate the advantages of the proposed model and acceleration scheme. Section 5 concludes the paper.

2. DARE-YOLO Algorithm Architecture Design

Surface defects on photovoltaic panels usually exhibit small spatial scales and complex shapes. At the same time, edge computing devices in industrial environments are often constrained by limited computation and storage resources. To address the tension between high-accuracy detection and low-latency deployment, this section proposes DARE-YOLO, a lightweight hardware-oriented object detection algorithm. The network introduces lightweight RepConv and removes the large-scale feature branch to construct an extremely simple single-path backbone, thereby avoiding the memory bandwidth overhead caused by multi-branch structures. In the feature fusion stage, the conventional down-sampling operation in feature pyramid networks is discarded, and a DCB is adopted to aggregate rich contextual semantic information without sacrificing spatial resolution. In the prediction stage, a dedicated DDH is designed to separately output class predictions, bounding box regression, and confidence scores through a low-level feature reuse mechanism. This comprehensive pruning and reconstruction at the algorithm level reduces the number of parameters and computational complexity, while improving both the detection accuracy for photovoltaic panel defects and the inference throughput of the hardware system.

2.1. Mathematical Formalization of Hardware and Software Co-Design

To mathematically define the objective of this hardware–software co-design, photovoltaic panel defect detection is modeled as a multi-criteria optimization problem. Let θ denote the architectural parameters of the DARE-YOLO network, and A denote the hardware deployment configuration on the FPGA. The objective is to achieve a balance among detection accuracy, energy efficiency, and economic feasibility. The overall system utility is maximized as Φ. To ensure comparability among indicators with different dimensions, each metric is normalized. The multi-criteria objective function is formally defined as follows:

\max_{\theta, A} \Phi = \alpha \times \overline{\mathrm{mAP}}(\theta) + \beta \times \bar{\eta}(\theta, A) - \gamma \times \bar{C}(A).

(1)

Physical constraints:

P_{\mathrm{total}}(A) \leq P_{\max}, \quad M_{\mathrm{req}}(\theta, A) \leq M_{\mathrm{chip}},

(2)

where mAP denotes the mean average precision of defect detection, η represents hardware energy efficiency, and C denotes the cost of the deployment platform. The weighting coefficients α, β, and γ are non-negative balancing factors satisfying α + β + γ = 1. The system is subject to physical constraints, including the maximum power budget P_max and the physical memory limit M_chip of the edge device.
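The selection rule implied by Equations (1) and (2) can be sketched as a small scoring routine. This is only an illustration: the candidate names, the normalized metric values, the weights, and the power and memory budgets below are all hypothetical and are not taken from the experiments in this paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One (theta, A) design point; metrics are assumed to be normalized to [0, 1]."""
    name: str
    map_norm: float   # normalized mAP
    eff_norm: float   # normalized energy efficiency (eta)
    cost_norm: float  # normalized platform cost (C)
    power_w: float    # P_total(A) in watts
    mem_mb: float     # M_req(theta, A) in megabytes


def utility(c, alpha, beta, gamma):
    """Eq. (1): Phi = alpha * mAP + beta * eta - gamma * C, with alpha + beta + gamma = 1."""
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * c.map_norm + beta * c.eff_norm - gamma * c.cost_norm


def feasible(c, p_max, m_chip):
    """Eq. (2): power and memory constraints of the edge device."""
    return c.power_w <= p_max and c.mem_mb <= m_chip


def best_design(cands, alpha, beta, gamma, p_max, m_chip):
    """Return the feasible candidate with the highest utility, or None."""
    ok = [c for c in cands if feasible(c, p_max, m_chip)]
    return max(ok, key=lambda c: utility(c, alpha, beta, gamma)) if ok else None


# Hypothetical design points, for illustration only.
candidates = [
    Candidate("large-model/GPU", 0.98, 0.20, 0.90, 75.0, 4096.0),
    Candidate("small-model/FPGA", 0.94, 0.90, 0.15, 2.0, 512.0),
    Candidate("tiny-model/FPGA", 0.80, 0.95, 0.15, 1.5, 256.0),
]
best = best_design(candidates, alpha=0.5, beta=0.3, gamma=0.2, p_max=5.0, m_chip=1024.0)
print(best.name)
```

With these made-up numbers the GPU candidate is rejected by the power constraint, and the remaining two are ranked by Φ alone, which shows how the constraints and the weighted objective play different roles.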

2.2. Overall Architecture of DARE-YOLO

The overall architecture of DARE-YOLO is shown in Figure 1. The design aims to achieve efficient photovoltaic panel defect detection and low-latency deployment on edge devices. The network takes a three-channel image with a resolution of 640 × 640 as input, denoted by X_in ∈ R^3×640×640 and feeds it directly into the backbone composed of RepConv blocks. Since surface defects on photovoltaic panels usually occupy only a small number of pixels, the network retains only the high-resolution feature maps at down-sampling ratios of 8 and 16, while discarding the deep low-resolution feature branch that is mainly used for large objects. This lightweight design reduces the overall number of model parameters and concentrates the limited computational resources on feature extraction for small and medium-sized defects. From the perspective of multi-objective optimization, this structural truncation optimizes the algorithmic parameters θ, maximizing hardware energy efficiency without significantly compromising detection accuracy.

The extracted high-resolution features are then sent to the feature fusion neck. At this stage, the complex bidirectional feature pyramid network is removed, and DCB is adopted instead. The two-scale features are first processed by DCB for context-aware aggregation, and then fused across scales through up-sampling and basic convolution. The fused features are finally passed to DDH. The detection head separately predicts object categories, regresses bounding box locations, and estimates object confidence, thereby completing the end-to-end defect detection task efficiently.

During inference, the algorithm execution flow is as follows: first, the network receives a 640 × 640 resolution RGB image as input. The data then passes through a single-path backbone composed of RepConv for feature extraction. After feature extraction, the representations are sent to the Neck, where a Dilated Context Block (DCB) is applied for multi-scale feature fusion. Subsequently, the fused features are fed into the Dual-scale Decoupled Head (DDH), which independently computes defect classification and spatial localization, and outputs confidence scores. Finally, post-processing is performed to generate the final defect detection results.

2.3. RepConv-Based Backbone

The core of the backbone lies in the lightweight design based on RepConv. The backbone of DARE-YOLO is shown in Figure 2. During deployment on edge computing platforms, the system faces two physical constraints, namely limited memory bandwidth and insufficient computing resources. Conventional lightweight designs, especially those with complex multi-branch topologies, often trigger frequent low-level data movement, which degrades the throughput of the parallel pipeline inside the FPGA. To overcome the bottleneck between computation and data transfer, the network adopts a fully decoupled structural design for training and inference. In the training stage, the network uses a multi-branch topology composed of a 3 × 3 convolution, a 1 × 1 convolution, and an identity mapping. This design ensures strong feature representation capability and smooth gradient flow during parameter optimization. After training is completed, these parallel branches are mathematically merged into a standard 3 × 3 convolution. During the fusion process, the parameters of the batch normalization layer are fully absorbed into the convolution weights through a linear transformation. Let γ denote the scaling factor of the batch normalization layer, β the bias term, μ the running mean, σ² the running variance, ε a small constant introduced to avoid division by zero, and W the original convolution weight. The equivalent fused weight W’ and bias B’ for a single branch are computed as follows:

W' = \frac{\gamma}{\sqrt{\sigma^{2} + \varepsilon}} \times W,

(3)

B' = \beta - \frac{\gamma \times \mu}{\sqrt{\sigma^{2} + \varepsilon}}.

(4)

After the basic fusion of each branch is completed, zero padding is applied to the equivalent weights of the 1 × 1 convolution and the identity mapping, so that their spatial dimensions are expanded and aligned as W_1×1 and W_id, respectively. For single-path deployment, the final 3 × 3 convolution weight W_final and bias B_final of the network are obtained by matrix summation over the feature branches as follows:

W_{\mathrm{final}} = W' + W_{1 \times 1} + W_{\mathrm{id}},

(5)

B_{\mathrm{final}} = B' + B_{1 \times 1} + B_{\mathrm{id}}.

(6)
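The branch merging in Equations (5) and (6) can be checked numerically with a minimal sketch. It assumes a single input and output channel, so the identity branch reduces to one scalar weight, and it assumes that batch normalization has already been folded into each branch. All numeric values are hypothetical.

```python
def conv2d_same(x, k):
    """Single-channel 2-D convolution (cross-correlation form) with zero padding."""
    h, w, n = len(x), len(x[0]), len(k)
    r = n // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for c in range(w):
            for i in range(n):
                for j in range(n):
                    yy, cc = y + i - r, c + j - r
                    if 0 <= yy < h and 0 <= cc < w:
                        out[y][c] += k[i][j] * x[yy][cc]
    return out


def pad_to_3x3(center):
    """Zero-pad a 1x1 weight into the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, center, 0.0], [0.0, 0.0, 0.0]]


def merge_branches(w3, b3, w1, b1, w_id, b_id):
    """Eqs. (5)-(6): sum the spatially aligned kernels and the biases."""
    k1, k_id = pad_to_3x3(w1), pad_to_3x3(w_id)
    w_final = [[w3[i][j] + k1[i][j] + k_id[i][j] for j in range(3)] for i in range(3)]
    return w_final, b3 + b1 + b_id


# Hypothetical BN-folded parameters of the three training-time branches.
w3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
b3, w1, b1, w_id, b_id = 0.05, 0.7, -0.02, 0.9, 0.01
x = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# Training-time form: three parallel branches whose outputs are added.
y3, y1, y_id = conv2d_same(x, w3), conv2d_same(x, [[w1]]), conv2d_same(x, [[w_id]])
multi = [[y3[r][c] + b3 + y1[r][c] + b1 + y_id[r][c] + b_id for c in range(4)]
         for r in range(4)]

# Deployment form: one 3x3 convolution with the merged weight and bias.
w_final, b_final = merge_branches(w3, b3, w1, b1, w_id, b_id)
y_f = conv2d_same(x, w_final)
single = [[y_f[r][c] + b_final for c in range(4)] for r in range(4)]

max_diff = max(abs(multi[r][c] - single[r][c]) for r in range(4) for c in range(4))
print(max_diff)
```

The two outputs agree up to floating-point rounding, which is why the merge changes the memory access pattern but not the function the layer computes. In the multi-channel case the identity branch becomes a kernel whose centre tap is non-zero only where the input and output channel indices match.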

After parameter reparameterization, the deployed network exhibits a branch-free straight-through forward structure. Beyond the internal reparameterization of each module, the backbone is also simplified in depth. As shown in Table 1, compared with the YOLOv6 series, DARE-YOLO adopts a lighter backbone with fewer stacked layers. The network retains only the initial layer and the first three feature extraction stages, while removing the deeper layers that are mainly used to enlarge the receptive field for large objects. As a result, the total number of RepConv blocks is reduced from 19 to 11. This structural truncation at the physical level, together with the single-path forward pattern, matches the parallel computing architecture of the Zynq-7020, eliminates unnecessary feature map buffering, and improves the overall computational throughput of the acceleration system.

2.4. Dilated Context Block

The neck of the YOLO family usually relies on feature pyramid structures with down-sampling to enlarge the receptive field, but this process often damages the edge details of defects. To address this issue, this paper proposes a DCB, whose structure is shown in Figure 3.

As illustrated in Figure 3, the input feature tensor is first denoted as X₀ ∈ R^C×H×W. A 1 × 1 pointwise convolution is then applied to reduce the channel dimension C by half, thereby lowering the computational cost of the subsequent operations. The reduced features are next distributed in parallel to three convolution branches with different dilation settings for feature extraction. The base branch uses a standard RepConv to capture local features while preserving the basic 3 × 3 receptive field. The other two expanded branches adopt dilated convolutions with dilation rates d = 2 and d = 4, which enlarge the effective receptive fields to 5 × 5 and 9 × 9, respectively, without reducing the spatial resolution. The contextual features that cover both local details and larger-scale information are then directly concatenated along the channel dimension C. Finally, the resulting mixed feature tensor is fed into a final RepConv layer for channel restoration and deeper semantic alignment, producing an output tensor Y₀ ∈ R^C×H×W with the same dimensions as the input. This discrete sampling scheme effectively avoids the degradation caused by spatial down-sampling and enhances the model’s multi-scale perception of defect edges and surrounding textures on photovoltaic panels.
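The receptive-field figures quoted above follow from the standard relation k_eff = k + (k − 1)(d − 1) for a k × k kernel with dilation rate d. The short sketch below evaluates it for the three branches. The padding rule p = d(k − 1)/2 is an assumption made here to keep the spatial size unchanged at stride 1, and the 80 × 80 map size is used only as an example.

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)


def same_padding(k, d):
    """Padding that preserves the spatial size at stride 1 (odd k assumed)."""
    return d * (k - 1) // 2


def output_size(n, k, d, stride=1):
    """Output length along one axis for input length n."""
    p = same_padding(k, d)
    return (n + 2 * p - effective_kernel(k, d)) // stride + 1


# The three DCB branches: d = 1 (base), d = 2, and d = 4, all with 3 x 3 kernels.
fields = {d: effective_kernel(3, d) for d in (1, 2, 4)}
sizes = {d: output_size(80, 3, d) for d in (1, 2, 4)}
print(fields)  # {1: 3, 2: 5, 4: 9}
print(sizes)   # {1: 80, 2: 80, 4: 80}
```

Each branch still multiplies only nine weights per output pixel, so the larger context comes at the cost of a wider read window rather than extra multiplications.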

2.5. Dual-Scale Decoupled Head

Photovoltaic panel defect datasets usually exhibit a long-tailed distribution with severe class imbalance. When the feature network processes such extremely skewed data, it often suffers from a conflict between category mapping and spatial localization, which leads to feature misalignment between classification and regression. To resolve this underlying computational issue, DARE-YOLO designs a DDH that is deeply optimized for FPGA deployment in the final prediction stage.

As shown in Figure 4, the two-scale feature tensors output by the feature fusion neck, namely X₁ ∈ R^128×80×80 and X₂ ∈ R^256×40×40, are independently fed into two parallel detection branches. The input features are first passed through RepConv for channel compression and preliminary semantic alignment. The data flow is then fully decoupled into two independent paths, namely classification (cls) and regression (reg). The classification path uses RepConv to extract semantic features and predict class probabilities, while the regression path focuses on modeling the geometric boundaries and spatial coordinates of defects. To address the limited on-chip resources of the FPGA, this module further incorporates a feature reuse mechanism. The object confidence prediction does not require an additional deep convolution branch. Instead, it directly reuses the deep geometric features from the regression path and maps them to objectness scores through only a single 1 × 1 pointwise convolution. This architecture avoids redundant memory consumption and effectively balances detection accuracy under long-tailed data with the demand for lightweight edge deployment.

3. Zynq-Based Hardware Accelerator and System Design

The deployment of photovoltaic panel defect detection systems is constrained by power consumption and on-chip memory resources. An FPGA with a single-processor architecture can hardly satisfy the real-time throughput demand of high-frequency tensor computation in feature networks. To overcome this underlying physical bottleneck, this section develops a dedicated hardware accelerator on the Zynq-7020 heterogeneous computing platform, which integrates a processing system (ARM) with programmable logic (FPGA). The selection of the Zynq-7020 platform is based on both hardware and economic considerations. This platform is not a high-performance system intended for offline training tasks, but rather a low-power edge computing platform designed for on-site inference. Drone inspection scenarios impose strict power consumption constraints, and high-end graphics processing units (GPUs) consume excessive power. The low-power characteristics of the Zynq-7020 therefore align well with these application requirements. The platform costs approximately 150 USD, and compared with expensive high-end hardware families such as Zynq UltraScale+, its advantages are evident. Its internal heterogeneous architecture and abundant digital signal processing (DSP) resources are well matched with the proposed hardware–software co-design strategy. The physical architecture of the accelerator follows a hardware–software co-design paradigm. At the hardware level, the system adopts a custom convolution IP core together with mixed-precision quantization and pipeline scheduling, thereby enabling photovoltaic panel defect detection under limited hardware overhead.

3.1. Overall Accelerator Architecture and Dataflow Scheduling

Deploying DARE-YOLO on the Zynq-7020 requires precise task partitioning at the hardware level. The block diagram of the accelerator is shown in Figure 5. Through the heterogeneous architecture that combines programmable logic (PL) and the processing system (PS), the system builds a highly coordinated global dataflow. Feature tensors, network weights, and static bias parameters are transferred in batches through the high-performance bus ports to the input physical buffer at the front end of the logic array. The PS is responsible for global network configuration and interaction with external devices, while the PL deploys the proposed convolution IP core. To improve parallel throughput, tasks with irregular memory access patterns and complex nonlinear control, such as non-maximum suppression and feature map up-sampling, are offloaded to the PS. This task partitioning matches the requirement of high-frequency full-load inference for defect detection models on edge devices. From a multi-objective optimization perspective, the customized hardware deployment configuration A ensures that the system satisfies the physical constraints on power consumption and memory capacity.

3.2. Hardware–Software Co-Designed Quantization Strategy

High-precision floating-point computation on Zynq consumes substantial memory bandwidth and incurs high computational latency. To overcome the hardware computing constraints while preserving defect detection accuracy, this section introduces a hardware–software co-designed quantization strategy. The whole deployment pipeline is divided into two sequential optimization stages, namely batch normalization fusion and a mixed fixed-point acceleration strategy for the programmable logic (PL).

3.2.1. Batch Normalization Fusion

In the inference stage of deep convolutional networks, batch normalization (BN) operators are invoked frequently. Retaining BN during hardware deployment introduces redundant off-chip memory access and additional floating-point computation overhead. In essence, the normalization operation applies a linear transformation to the output tensor of the convolution by subtracting the mean and dividing by the standard deviation, followed by scaling and shifting. Through mathematically equivalent reformulation, these separate computations are folded into a single operation, and the scaling and shifting parameters of the normalization layer are fused into the weight matrix of the preceding convolution operator. Let γ denote the scaling factor of the BN layer, β the bias term, μ the running mean, σ² the running variance, ε a small constant introduced to avoid division by zero, W the original convolution weight, and B the original convolution bias. The fused weight W_fused and fused bias B_fused can then be derived from the original network parameters as follows:

W_{\mathrm{fused}} = \frac{\gamma}{\sqrt{\sigma^{2} + \varepsilon}} \times W,

(7)

B_{\mathrm{fused}} = \frac{\gamma}{\sqrt{\sigma^{2} + \varepsilon}} \times (B - \mu) + \beta.

(8)

Through the above transformation, the two originally separate high-frequency operators are fused into a single multiply–accumulate computation. This strategy eliminates the independent computational overhead introduced by the normalization operation.
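The equivalence behind Equations (7) and (8) can be verified on a single output value. The sketch below treats one output pixel as a dot product between a flattened kernel and an input window. The weights, inputs, and BN statistics are hypothetical, and the default ε = 1e-5 is an assumption rather than a value reported here.

```python
import math


def fold_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    """Eqs. (7)-(8): fold BN parameters into the convolution weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wi for wi in w], scale * (b - mu) + beta


def conv_then_bn(x, w, b, gamma, beta, mu, var, eps=1e-5):
    """Reference path: convolution output followed by a separate BN step."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b
    return gamma * (y - mu) / math.sqrt(var + eps) + beta


# Hypothetical 3 x 3 window (flattened) and BN statistics.
x = [0.2, -1.0, 0.5, 0.0, 1.5, -0.3, 0.8, 0.1, -0.6]
w = [0.1, 0.4, -0.2, 0.3, 0.5, -0.1, 0.0, 0.2, -0.4]
b, gamma, beta, mu, var = 0.05, 1.2, -0.1, 0.3, 0.8

reference = conv_then_bn(x, w, b, gamma, beta, mu, var)
w_fused, b_fused = fold_bn(w, b, gamma, beta, mu, var)
fused = sum(wi * xi for wi, xi in zip(w_fused, x)) + b_fused
print(abs(reference - fused))
```

Setting B = 0 in `fold_bn` gives the bias β − γμ/√(σ² + ε), which is the usual form for a convolution without its own bias term.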

3.2.2. Mixed Fixed-Point Strategy for the PL

Full-precision feature tensors consume substantial bus bandwidth and storage space on the Zynq platform. To reduce this overhead, the accelerator adopts a fixed-point quantization strategy for both network weights and input feature maps, and compresses the feature tensors into low-precision 8-bit fixed-point integers. This design greatly reduces data transfer time and storage cost. However, during multi-dimensional convolution, feature maps are prone to numerical overflow and truncation, which leads to accumulated errors. If low-precision multiply–accumulate operations are used throughout the entire pipeline, the fine geometric textures of small defects may gradually disappear as features propagate through the network. To preserve the effective representation range of features, the core operators on Zynq are designed with an asymmetric bit-width path. After receiving low-precision feature pixels and network weights, the convolution IP core performs internal computation at high precision and uses 32-bit registers to store intermediate results. This mixed fixed-point strategy combines low-precision external data transfer with high-precision internal computation. It fully exploits the hardware resources while keeping the accuracy loss of the edge platform within an acceptable range for practical deployment.
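The effect of the accumulator width can be illustrated with a minimal integer model of one multiply-accumulate chain. This is a simplified sketch and not the IP core itself: the quantization scales, the saturating behaviour of the narrow accumulator, and the input values are assumptions chosen for illustration.

```python
def saturate(v, bits):
    """Clamp an integer to the signed range of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, v))


def quantize(x, scale, bits=8):
    """Map a real value to a signed fixed-point integer with the given scale."""
    return saturate(round(x / scale), bits)


def mac(acts, weights, acc_bits):
    """Multiply-accumulate with an accumulator of acc_bits (saturating)."""
    acc = 0
    for a, w in zip(acts, weights):
        acc = saturate(acc + a * w, acc_bits)
    return acc


# Hypothetical 3 x 3 window: nine activations of 1.0 and nine weights of 0.5.
act_scale, w_scale = 0.02, 0.01
acts = [quantize(1.0, act_scale)] * 9
weights = [quantize(0.5, w_scale)] * 9

acc32 = mac(acts, weights, acc_bits=32)   # 8-bit operands, 32-bit accumulator
acc8 = mac(acts, weights, acc_bits=8)     # everything kept at 8 bits
print(acc32 * act_scale * w_scale)  # close to the exact value 4.5
print(acc8 * act_scale * w_scale)   # collapses after saturation
```

Both variants move the same 8-bit operands over the bus. Only the width of the on-chip accumulator differs, which is the distinction the mixed strategy relies on.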

Table 2 records the model’s average precision on the validation set under different quantization strategies. Accuracy Drop is defined as the absolute difference between the FP32 baseline accuracy and the accuracy of the quantized model. Full-precision floating-point computation preserves the most complete feature tensor distribution with 32-bit representation, and is therefore used as the reference for evaluating quantization error in hardware deployment. In conventional quantization strategies, the input activations, network weights, and core accumulators in the logic array are all uniformly compressed to low-bit formats. This aggressive truncation causes the loss of high-frequency semantic information when handling long-tailed photovoltaic defect data. The geometric boundaries of small defects are especially vulnerable to severe numerical overflow in low-precision accumulators, which reduces the mean detection accuracy by 4.5%. In contrast, the mixed fixed-point strategy retains 8-bit fixed-point format for bus transfer and data loading to reduce off-chip memory bandwidth pressure, while using 32-bit fixed-point format inside the convolution IP core to preserve computational precision. As a result, the loss in defect detection accuracy is only 1.36%. Since the objective of this study is deployment on edge hardware, the 93.84% achieved under the mixed-precision fixed-point strategy is used as the standard evaluation metric for DARE-YOLO in subsequent ablation studies and comparative analyses.

3.3. Design of the Custom Convolution IP Core

The DARE-YOLO model contains convolutions with different kernel scales and dilation rates. However, the multiplier-accumulators and storage resources on Zynq are highly limited. Instantiating a separate IP core for each convolution type would consume a large amount of hardware logic resources. To address this issue, we design a reusable convolution IP core, in which the same underlying computing array processes different convolution types in a time-multiplexed manner. This design effectively reduces hardware resource consumption. The unified address derivation formula for input feature map access is denoted by Addr_in, and is expressed as follows:

\mathrm{Addr}_{\mathrm{in}}(i, j) = (x_{\mathrm{out}} \times S + i \times d) \times W_{\mathrm{in}} + (y_{\mathrm{out}} \times S + j \times d),

(9)

where x_out and y_out denote the output pixel coordinates, S is the stride, d is the dilation rate, K is the kernel size with 0 ≤ i, j ≤ K − 1, and W_in is the width of the input feature map. This mapping formula provides a unified addressing scheme for pointwise, standard, dilated, and strided convolutions, allowing the reuse rate of the original discrete hardware logic to reach 100%.
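A direct transcription of Equation (9) shows how one address generator covers the four convolution types. The sketch assumes a row-major feature map in which the first coordinate is the one multiplied by W_in, and it ignores padding and channel offsets. The map width of 8 is chosen only for illustration.

```python
def window_addresses(x_out, y_out, k, stride, dilation, w_in):
    """Eq. (9): linear input addresses read for one output pixel."""
    return [
        (x_out * stride + i * dilation) * w_in + (y_out * stride + j * dilation)
        for i in range(k)
        for j in range(k)
    ]


W_IN = 8  # hypothetical input feature-map width

pointwise = window_addresses(2, 3, k=1, stride=1, dilation=1, w_in=W_IN)
standard = window_addresses(0, 0, k=3, stride=1, dilation=1, w_in=W_IN)
dilated = window_addresses(0, 0, k=3, stride=1, dilation=2, w_in=W_IN)
strided = window_addresses(1, 1, k=3, stride=2, dilation=1, w_in=W_IN)

print(pointwise)  # [19]
print(standard)   # [0, 1, 2, 8, 9, 10, 16, 17, 18]
print(dilated)    # [0, 2, 4, 16, 18, 20, 32, 34, 36]
print(strided)    # [18, 19, 20, 26, 27, 28, 34, 35, 36]
```

Only the three parameters K, S, and d change between the cases, so the convolution type can be selected by configuration registers rather than by separate logic.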

As shown in the blue section of Figure 6, the IP core adopts a task-level pipeline architecture for read, compute, and write operations. Through the ping-pong buffering mechanism of the on-chip memory, data transfer and core computation are fully overlapped in time. Assume that one convolution layer is divided into N computation blocks. The read, compute, and write times of a single block are denoted as t_load, t_comp, and t_store, respectively. Without pipeline scheduling, the serial execution time T_serial is given by:

T_{\mathrm{serial}} = N \times (t_{\mathrm{load}} + t_{\mathrm{comp}} + t_{\mathrm{store}}).

(10)

After pipeline execution is enabled, the total latency T_pipeline is given by:

T_{\mathrm{pipeline}} = t_{\mathrm{load\_first}} + \sum_{n = 1}^{N} \max(t_{\mathrm{load}}, t_{\mathrm{comp}}, t_{\mathrm{store}}) + t_{\mathrm{store\_last}}.

(11)

When the pipeline reaches the steady state, the system throughput is limited only by the slowest stage among read, compute, and write, while the bus access latency is effectively hidden.
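The gap between Equations (10) and (11) can be made concrete with a small calculation. The sketch assumes that every block has the same per-stage times, so the sum in Equation (11) reduces to N times the slowest stage. The cycle counts are hypothetical.

```python
def serial_latency(n, t_load, t_comp, t_store):
    """Eq. (10): read, compute, and write executed one after another."""
    return n * (t_load + t_comp + t_store)


def pipeline_latency(n, t_load, t_comp, t_store):
    """Eq. (11) with identical blocks: fill + N * slowest stage + drain."""
    return t_load + n * max(t_load, t_comp, t_store) + t_store


# Hypothetical per-block stage times in clock cycles.
N, T_LOAD, T_COMP, T_STORE = 100, 20, 50, 10

t_serial = serial_latency(N, T_LOAD, T_COMP, T_STORE)
t_pipe = pipeline_latency(N, T_LOAD, T_COMP, T_STORE)
print(t_serial, t_pipe, t_serial / t_pipe)
```

In this example the compute stage dominates, so the pipelined latency is close to N × t_comp and the transfer time is almost entirely hidden, as the steady-state argument above predicts.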

As shown by the purple part in Figure 6, the read module uses the maximum physical width of the high-speed interface for burst transfer, so that data movement does not become the bottleneck of the pipeline. This design adopts an 8-bit quantization format. With a 64-bit bus, eight channels of data can be transferred simultaneously in a single clock cycle. Under a conventional scalar read mode, data movement would waste a large number of clock cycles. Let V_feature denote the total feature map volume. Under the 64-bit full-load burst read mode, the theoretical read latency is given by:

T_{\mathrm{load}} = \frac{V_{\mathrm{feature}}}{8} + L_{\mathrm{burst\_delay}}.

(12)

This strategy raises the memory bandwidth utilization of the read module to the physical limit.

As shown by the yellow part in Figure 6, the core computation of the compute module is carried out with parallelism unfolded along three dimensions. After the read module transfers data into the on-chip buffer, the address generator fetches T_K × K × K raw feature values from the buffer at one time, and this same data block is replicated to T_C fully independent computing units through physical wiring with no extra delay. Inside each computing unit, a dedicated set of convolution kernel weights is assigned. The computing array then performs parallel expansion in both spatial area and channel depth. Spatial expansion corresponds to K × K parallel operations, while depth expansion corresponds to the parallel execution of T_K spatially expanded operations. Each computing unit therefore occupies T_K × K × K multipliers. All product terms generated by the multiplier array are immediately sent to the downstream pipelined adder tree. After multi-stage registered accumulation, the final output for a single pixel is produced. Before parallel expansion, the computation of one output pixel requires C_in × K² clock cycles. Under the fully unfolded three-dimensional architecture, the latency of the core computation module, denoted by T_comp, is given by:

T_{comp} = \frac{C_{out}}{T_{C}} \times \frac{C_{in}}{T_{K}} \times H_{out} \times W_{out}.

(13)
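
The loop structure behind Equation (13) can be sketched as a software model. This is a behavioral illustration only, not the authors' HLS code: it assumes stride 1, no padding, and tile sizes that divide the channel counts, and it counts one cycle each time the T_C units consume one T_K × K × K window. All function and variable names are hypothetical.

```python
def conv_reference(x, w):
    """Plain convolution (stride 1, no padding) used as the golden model."""
    c_in, k = len(x), len(w[0][0])
    h_out, w_out = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    return [[[sum(x[ci][r + i][c + j] * w[co][ci][i][j]
                  for ci in range(c_in) for i in range(k) for j in range(k))
              for c in range(w_out)] for r in range(h_out)]
            for co in range(len(w))]


def conv_unrolled(x, w, t_c, t_k):
    """Tiled convolution; one counted cycle = T_C units x (T_K x K x K) multipliers."""
    c_out, c_in, k = len(w), len(x), len(w[0][0])
    assert c_out % t_c == 0 and c_in % t_k == 0, "tile sizes must divide channels"
    h_out, w_out = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    y = [[[0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    cycles = 0
    for co0 in range(0, c_out, t_c):             # C_out / T_C output-channel groups
        for r in range(h_out):
            for c in range(w_out):
                for ci0 in range(0, c_in, t_k):  # C_in / T_K input-channel groups
                    cycles += 1
                    # One T_K x K x K window, shared by all T_C computing units.
                    window = [[[x[ci0 + d][r + i][c + j] for j in range(k)]
                               for i in range(k)] for d in range(t_k)]
                    for u in range(t_c):         # these units run in parallel in hardware
                        kernel = w[co0 + u][ci0:ci0 + t_k]
                        y[co0 + u][r][c] += sum(
                            window[d][i][j] * kernel[d][i][j]
                            for d in range(t_k) for i in range(k) for j in range(k))
    return y, cycles


# Small deterministic test tensors (values are arbitrary integers).
C_IN, C_OUT, H, W, K = 4, 4, 5, 5, 3
x = [[[(7 * ci + 3 * r + c) % 5 - 2 for c in range(W)] for r in range(H)]
     for ci in range(C_IN)]
w = [[[[(co + 2 * ci + i - j) % 3 - 1 for j in range(K)] for i in range(K)]
      for ci in range(C_IN)] for co in range(C_OUT)]

y, cycles = conv_unrolled(x, w, t_c=2, t_k=2)
print("matches reference:", y == conv_reference(x, w), "cycles:", cycles)
```

The cycle counter equals (C_out / T_C) × (C_in / T_K) × H_out × W_out, which is the expression in Equation (13); pipeline fill and adder-tree depth are ignored in this model.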

After each computing unit produces its output, the write module does not immediately return each result individually. Instead, it first stores the results in an output queue. Once the accumulated data reach eight bytes and are aligned with the 64-bit system bus width, the write module uses the burst mode of the high-speed bus to write the results back to off-chip memory in a batched manner. After this optimization, the latency of the write module, denoted by T_store, is given by:

T_{store} = \frac{H_{out} \times W_{out} \times C_{out}}{8} + L_{axi\_write\_resp},

(14)

where L_{axi\_write\_resp} denotes the time consumed by the handshake protocol. The write-back width alignment strategy eliminates invalid bus occupancy and improves write-back efficiency.
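
A minimal sketch of the output-queue behavior described above is given below, assuming 8-bit results packed in little-endian byte order; the byte order, the function names, and the response delay value are assumptions made for illustration.

```python
import math


def pack_outputs(values, bytes_per_word=8):
    """Queue 8-bit results and emit one 64-bit word whenever eight bytes are ready."""
    words, queue = [], []
    for v in values:
        queue.append(v & 0xFF)  # keep the two's-complement byte pattern
        if len(queue) == bytes_per_word:
            words.append(int.from_bytes(bytes(queue), "little"))
            queue.clear()
    return words, queue  # queue holds results still waiting for alignment


def store_cycles(h_out, w_out, c_out, resp_delay=0):
    """Write latency in cycles following Eq. (14); resp_delay is the handshake term."""
    return math.ceil(h_out * w_out * c_out / 8) + resp_delay


words, pending = pack_outputs(range(1, 11))
print([hex(word) for word in words], pending)
print(store_cycles(80, 80, 16, resp_delay=4))
```

Ten results therefore produce one full 64-bit word and leave two bytes in the queue until six more arrive.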

By combining the three-stage pipeline of read, compute, and write with the three-dimensional parallel expansion in the core computation module, the proposed convolution IP core achieves a substantial performance gain over a serial computing architecture. Let T_serial denote the latency of the conventional serial architecture and T_pipeline the latency of the proposed pipelined architecture, which is taken as the larger of the read and compute latencies. The overall acceleration ratio, denoted by Acc_total, is defined as the ratio between T_serial and T_pipeline, which can be written as:

Acc_{total} = \frac{C_{out} \times C_{in} \times H_{out} \times W_{out} \times K^{2}}{\max\left(\frac{V_{feature}}{8}, \frac{C_{out} \times C_{in} \times H_{out} \times W_{out}}{T_{C} \times T_{K}}\right)}.

(15)

Under the ideal computing condition, the upper bound of the acceleration ratio approaches the product of the hardware unfolding factors. The ideal acceleration ratio, denoted by Acc_ideal, is given by:

Acc_{ideal} \approx T_{C} \times T_{K} \times K^{2}.

(16)
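
Equations (15) and (16) can be evaluated numerically to see when the ideal bound is reached. The layer shapes and unfolding factors below are hypothetical; they are not the values used in the implemented IP core.

```python
def acc_total(c_out, c_in, h_out, w_out, k, t_c, t_k, v_feature):
    """Acceleration ratio of Eq. (15): serial latency over the slower pipeline stage."""
    t_serial = c_out * c_in * h_out * w_out * k * k
    t_load = v_feature / 8
    t_comp = c_out * c_in * h_out * w_out / (t_c * t_k)
    return t_serial / max(t_load, t_comp)


def acc_ideal(t_c, t_k, k):
    """Upper bound of Eq. (16)."""
    return t_c * t_k * k * k


V_FEATURE = 16 * 80 * 80  # hypothetical input volume: 16 channels, 80 x 80

# Compute-bound layer: the compute stage dominates, so the ideal bound is reached.
compute_bound = acc_total(32, 16, 80, 80, 3, t_c=4, t_k=4, v_feature=V_FEATURE)
# Memory-bound layer: with one output channel the read stage dominates.
memory_bound = acc_total(1, 16, 80, 80, 3, t_c=4, t_k=4, v_feature=V_FEATURE)
print(compute_bound, memory_bound, acc_ideal(4, 4, 3))
```

In this example the compute-bound layer attains the ideal ratio T_C × T_K × K², whereas the memory-bound layer reaches only half of it because the read latency exceeds the compute latency.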

This derivation quantifies the speedup expected from the proposed convolution IP core with three-dimensional parallel expansion and provides a theoretical foundation for the subsequent implementation of photovoltaic defect detection.

4. Experiments and Results Analysis

To evaluate the performance of DARE-YOLO and its deployment efficiency on edge devices, this study builds a complete hardware–software experimental platform that covers two stages: model training and evaluation, followed by Zynq-based hardware deployment.

4.1. Photovoltaic Dataset and Experimental Deployment Environment

This section describes in detail the data foundation and system environment used in the two experimental stages.

4.1.1. Photovoltaic Panel Defect Dataset and Software Platform for Model Training

The original dataset was constructed by integrating multiple open-source photovoltaic defect datasets from the Roboflow platform, containing approximately 2100 images. The dataset exhibits a long-tailed distribution with severe class imbalance. To mitigate this issue and address physical disturbances in real-world inspection scenarios, data augmentation techniques were applied, including random flipping, occlusion, brightness adjustment, rotation, cropping, translation, mirroring, and Gaussian blur. These targeted augmentation strategies simulate complex noise and variations encountered in real-world environments. For example, brightness adjustment mimics natural illumination conditions under different times of day and weather states; geometric transformations represent dynamic viewpoints from UAV-mounted cameras; occlusion forces the model to recognize targets partially blocked by shadows or environmental objects; and Gaussian blur simulates motion blur and defocus caused by drone vibration. These tailored augmentation processes effectively enhance the model’s robustness in practical deployment scenarios.

The final dataset contains 8000 images. The augmented dataset achieves a relatively balanced sample distribution across five defect categories: Bird dropping, Dust, Electrical damage, Physical damage, and Snow. The dataset is randomly split into training, testing, and validation sets in an 8:1:1 ratio. The training set contains 6400 images for model optimization, while the testing and validation sets each contain 800 images for performance evaluation.
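
The 8:1:1 random split can be reproduced with a few lines of code. The sketch below is illustrative: the seed, the function name, and the use of integer identifiers in place of image paths are assumptions, and the authors' actual splitting script is not described in the text.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle once with a fixed seed, then cut into training, testing, and validation sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val


train, test, val = split_dataset(range(8000))
print(len(train), len(test), len(val))
```

Fixing the seed keeps the three subsets disjoint and reproducible across runs.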

In the model training stage, all algorithms are implemented based on the PyTorch deep learning framework. To meet the computational demand of deep neural network training, the models are trained on a high-performance platform equipped with an NVIDIA GeForce RTX 4070 GPU with 8 GB of memory. The detailed software environment for model training is listed in Table 3.

For the training configuration, a typical configuration of mainstream object detection frameworks was adopted as the initial reference. Subsequently, a grid search strategy was applied on the validation set to fine-tune key hyperparameters such as the initial learning rate and batch size, balancing convergence speed against the risk of overfitting. The input image resolution is normalized to 640 × 640 pixels. The batch size is set to 32, and the total number of training epochs is 300. In the backpropagation stage, the SGD optimizer is adopted, with the momentum factor set to 0.937 and the weight decay coefficient set to 5 × 10⁻⁴ to prevent overfitting. The global learning rate follows a cosine annealing schedule, with an initial learning rate of 10⁻² that gradually decays to 10⁻⁴. The classification branch computes Focal Loss only on positive samples, with the balancing factor set to 0.25 and the focusing parameter set to 2.0, so that the model pays greater attention to scarce electrical and physical damage samples. To reduce local localization drift in small defects, the bounding box regression branch introduces CIoU as a geometric penalty term and assigns it a relatively large weight of 2.5 to accelerate the fitting of true physical boundaries. In the post-processing stage, the confidence threshold is set to 0.1, and the IoU threshold of the NMS algorithm is fixed at 0.5.
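
To make two of these settings concrete, the sketch below evaluates a per-epoch cosine annealing schedule and the focal loss of a single positive sample. It uses the standard textbook forms of both formulas; whether the authors anneal per epoch or per iteration, and whether a warm-up phase is used, is not stated, so those details are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_max=1e-2, lr_min=1e-4):
    """Cosine annealing from lr_max at epoch 0 to lr_min at the last epoch."""
    progress = epoch / (total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


def focal_loss_positive(p, alpha=0.25, gamma=2.0):
    """Focal loss of one positive sample with predicted probability p."""
    return -alpha * (1 - p) ** gamma * math.log(p)


for epoch in (0, 150, 299):
    print(epoch, f"{cosine_lr(epoch):.6f}")

# A hard positive (p = 0.1) is weighted far more than an easy one (p = 0.9).
print(focal_loss_positive(0.1), focal_loss_positive(0.9))
```

The factor (1 − p)^γ is what shifts the loss toward poorly classified positives, which is consistent with the stated aim of emphasizing scarce defect classes.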

To objectively evaluate detection performance, this paper adopts Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP) as the evaluation metrics.

Precision = \frac{TP}{TP + FP},

(17)

Recall = \frac{TP}{TP + FN},

(18)

AP_{i} = \sum_{k=1}^{n} P_{k} \Delta r_{k},

(19)

mAP = \frac{1}{C} \sum_{i=1}^{C} AP_{i}.

(20)

Here, TP denotes the number of correctly detected objects, FP denotes the number of falsely detected objects, and FN denotes the number of missed objects. C denotes the number of defect categories, and n denotes the total number of discrete recall levels sorted by confidence. P_k represents the highest precision among the current recall level and all levels to its right, and Δr_k denotes the change in recall between the k-th level and the (k − 1)-th level.
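
Equations (19) and (20) can be illustrated with a short routine. The sketch assumes that each detection has already been labeled as a true or false positive by IoU matching at the chosen threshold; the function names and the toy detections are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """AP of one class following Eq. (19) with right-maximum interpolated precision."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # walk detections from the highest to the lowest confidence
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # P_k: highest precision at the current recall level and all levels to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)  # P_k * delta r_k
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP following Eq. (20)."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy class with four ground-truth objects and four detections (one false positive).
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
print(ap)
```

For this toy class, three of four objects are found, and the interpolation lifts the precision assigned to the middle recall levels to 0.75.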

To intuitively demonstrate the stability of the training process, Figure 7 illustrates the variation in evaluation metrics over 300 epochs. As shown in Figure 7a, mAP@0.5 increases rapidly during the initial learning stage and gradually stabilizes near its peak performance without significant fluctuations. In addition, Figure 7b presents both the total training loss and validation loss. Both loss curves exhibit a continuous and smooth decreasing trend. The validation loss remains consistently parallel to the training loss, showing no signs of divergence or upward rebound in the later stages. This indicates that DARE-YOLO avoids overfitting and successfully learns defect features.

4.1.2. Edge Hardware Acceleration Deployment Platform

The low-level logic synthesis and high-level architectural mapping of the edge hardware accelerator are completed using Vivado HLS 2018.3. Global hardware scheduling and nonlinear post-processing control are jointly compiled with Vivado 2018.3 and its corresponding Software Development Kit (SDK). A hardware–software co-design test framework is established in the integrated development environment. Figure 8 shows the physical interconnection topology of the top-level hardware system. The system primarily consists of the Zynq Processing System (PS) shown on the right side of Figure 8 and IP cores in the Programmable Logic (PL) on the left side. Data communication is implemented through four AXI SmartConnect modules instantiated between the PS and PL.

The deployment experiments are conducted on the Zynq-7020 heterogeneous computing platform. Figure 9 presents a close-up view of the core board equipped with the XC7Z020-CLG400-2 main chip. The Zynq-7020 mainly consists of two parts, namely the processing system (PS) and the programmable logic (PL). The PS integrates a dual-core ARM Cortex-A9 general-purpose processor to coordinate global scheduling instructions. The PL contains 4.9 Mb of block random access memory (BRAM), 220 digital signal processors (DSPs), 106,400 flip-flops (FFs), and 53,200 lookup tables (LUTs).

To evaluate the efficiency of the hardware accelerator, this study further introduces throughput (GOPS) and energy efficiency (GOPS/W) as hardware performance metrics. The complete experimental platform for photovoltaic panel defect detection is shown in Figure 10. The test images are sent from the host computer to the Zynq-7020 development board, and the detection results are then returned to the host display for visualization.

4.2. Ablation Study

To evaluate the contribution of each module to the overall detection accuracy, this study conducts ablation experiments under a unified hardware–software testing setting. The baseline model and its upgraded variants are denoted as Baseline, Rep-YOLO, and DCB-Rep-YOLO, respectively. The statistical results of the ablation study are presented in Table 4.

As shown in Table 4, the Baseline network achieves an mAP of 84.12%, with 5.5 M parameters and 21.6 GFLOPs. After introducing the lightweight RepConv module, the upgraded model is denoted as Rep-YOLO. Because RepConv adopts a fully decoupled design for training and inference, Rep-YOLO improves the mAP by 3.68 percentage points without introducing any additional parameters or GFLOPs.

Based on the Rep-YOLO architecture, the feature pyramid structure is further removed and replaced by the DCB module, resulting in DCB-Rep-YOLO. By enlarging the receptive field during feature extraction, the dilated convolution in DCB strengthens the network’s ability to capture the geometric features of small defects on photovoltaic panel surfaces. With only 0.6 M additional parameters and 2.3 additional GFLOPs, this module further improves the mAP by 2.48 percentage points.

In the final DARE-YOLO architecture, a DDH is deployed at the prediction stage to separate the underlying feature computation of category prediction and spatial bounding box regression. With only 0.3 M additional parameters and 0.9 GFLOPs, the final detection accuracy reaches 93.84% mAP.
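
The increments quoted in this subsection can be tallied as a consistency check. The sketch assumes that the reported mAP gains are absolute differences in percentage points and that parameter and GFLOP costs add linearly across stages; Table 4 remains the authoritative source.

```python
# (stage, mAP gain in points, added parameters in M, added GFLOPs), as quoted in the text
BASELINE = {"map": 84.12, "params": 5.5, "gflops": 21.6}
INCREMENTS = [
    ("Rep-YOLO", 3.68, 0.0, 0.0),
    ("DCB-Rep-YOLO", 2.48, 0.6, 2.3),
]
DDH_COST = {"params": 0.3, "gflops": 0.9}
FINAL_MAP = 93.84  # reported mAP of DARE-YOLO

state = dict(BASELINE)
for name, d_map, d_params, d_gflops in INCREMENTS:
    state["map"] += d_map
    state["params"] += d_params
    state["gflops"] += d_gflops
    print(f"{name}: mAP {state['map']:.2f}%, {state['params']:.1f} M, "
          f"{state['gflops']:.1f} GFLOPs")

map_before_ddh = state["map"]
ddh_gain = FINAL_MAP - map_before_ddh          # implied contribution of the DDH
final_params = state["params"] + DDH_COST["params"]
final_gflops = state["gflops"] + DDH_COST["gflops"]
print(f"DARE-YOLO: +{ddh_gain:.2f} points, {final_params:.1f} M, {final_gflops:.1f} GFLOPs")
```

Under these assumptions the totals agree with the 6.4 M parameters and 24.8 GFLOPs reported for DARE-YOLO, and the DDH accounts for the remaining gain of about 3.56 points.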

4.3. Comprehensive Comparison with Mainstream YOLO Series

To comprehensively evaluate the performance of the proposed DARE-YOLO on the photovoltaic defect detection task, this section compares it with mainstream one-stage YOLO detectors through a set of controlled experiments. Table 5 reports the evaluation results of all models under a unified input resolution of 640 × 640 and on the same photovoltaic panel defect dataset. To ensure a fair comparison, all baseline models used their official pretrained weights and were fine-tuned on the photovoltaic defect dataset. In addition, the same data augmentation strategies and hyperparameter optimization settings as those used for DARE-YOLO were applied.

As shown in Table 5, benefiting from the feature extraction capabilities of the lightweight backbone, DCB, and DDH, the mAP of DARE-YOLO improves by 7.44 and 1.14 percentage points compared with YOLOv5s and the latest YOLO11s architecture, respectively.

From the perspective of physical resource consumption, DARE-YOLO removes redundant deep large-scale feature branches in the backbone and adopts a fully decoupled design for training and inference. As a result, it achieves a parameter count of only 6.4 M, which is even lower than that of the compact YOLOv5s model. This compressed parameter scale is well suited to the limited storage resources of FPGA platforms. The network requires 24.8 GFLOPs, which is slightly higher than that of some YOLO variants, yet still remains within the data throughput limit of the target hardware bus. Therefore, DARE-YOLO achieves a favorable balance between detection accuracy and hardware resource constraints.

To further illustrate the practical advantage of DARE-YOLO in photovoltaic panel inspection, this section presents a visual comparison between DARE-YOLO and YOLOv8s on five representative defect categories. As shown in Figure 11, when detecting densely distributed and very small targets such as Bird dropping and Snow, YOLOv8s exhibits evident missed detections, whereas DARE-YOLO accurately identifies and localizes these small defects. In addition, for low-contrast and weak-feature defect regions, such as the faint dark hotspot on the right side of Electrical damage, YOLOv8s fails to provide reliable detection, while DARE-YOLO shows stronger fine-grained feature perception and produces accurate bounding boxes.

To further investigate the classification performance across different defect categories, normalized confusion matrices of YOLOv8s and DARE-YOLO are visualized in Figure 12. In Figure 12a, YOLOv8s shows a significant misclassification rate between the Bird dropping and Snow categories, as well as a high missed-detection rate for small-scale bird droppings. In contrast, Figure 12b shows that DARE-YOLO reduces the missed-detection rate of small targets to 6% and halves the cross-misclassification rate between Bird dropping and Snow.

4.4. FPGA Resource Utilization and Power Analysis

The performance of the hardware platform is evaluated in terms of system resource utilization and power consumption. According to the synthesis results reported by the Vivado tool, the hardware resource utilization is summarized in Table 6. The reported system utilization includes the convolution IP core, the AXI interconnect, and other auxiliary modules.

As shown in Table 6, the system uses 59.8% of LUTs, 36.5% of FFs, 84.3% of BRAMs, and 87.3% of DSPs. The convolution IP core accounts for 53.5% of LUTs, 32.1% of FFs, 81.4% of BRAMs, and 87.3% of DSPs. The AXI interconnect uses 5.9% of LUTs, 4.0% of FFs, and 2.9% of BRAMs, while the remaining modules occupy only 0.4% of LUTs and 0.3% of FFs. The customized IP cores effectively utilize the DSP array for multiply–accumulate operations while retaining sufficient logic resources.
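
The utilization percentages can be converted into absolute resource counts using the PL totals given in Section 4.1.2. This is only a cross-check of the quoted figures; small rounding differences with respect to the Vivado report are to be expected, and the dictionary names are hypothetical.

```python
# Available PL resources of the Zynq-7020 as listed in Section 4.1.2.
PL_TOTAL = {"LUT": 53200, "FF": 106400, "DSP": 220}
# System-level utilization in percent, as quoted from Table 6.
SYSTEM_PCT = {"LUT": 59.8, "FF": 36.5, "DSP": 87.3}
# LUT utilization by component in percent: convolution IP, AXI interconnect, others.
LUT_PARTS_PCT = [53.5, 5.9, 0.4]


def used(resource):
    """Approximate absolute usage implied by the percentage."""
    return round(PL_TOTAL[resource] * SYSTEM_PCT[resource] / 100)


for name in PL_TOTAL:
    print(f"{name}: about {used(name)} of {PL_TOTAL[name]}")
print("LUT parts sum to", round(sum(LUT_PARTS_PCT), 1), "%")
```

The implied values of about 31.8 k LUTs and 192 DSPs agree with the figures cited in Section 4.6.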

4.5. Comparison Across Different Hardware Platforms

In real-world deployment scenarios, the feasibility of DARE-YOLO depends on power constraints, making energy efficiency a critical metric. Table 7 presents a comparison between the accelerator and conventional hardware computing platforms. To ensure the rigor of the comparison in Table 7, the power consumption of the GPU and CPU was obtained using the NVIDIA System Management Interface and Intel Power Gadget v3.7.0, respectively. For the Zynq-7020 platform, total power consumption was analyzed using the Vivado 2018.3 tool. Throughput was calculated by multiplying the theoretical computational cost per frame by the number of frames processed per second.

The NVIDIA RTX 4070 GPU achieves a computational throughput of 2137.4 GOPS due to its massive parallel processing architecture; however, its high power consumption of 165.9 W limits its energy efficiency to 12.88 GOPS/W. The Intel i9-13980 CPU consumes 103.4 W to deliver 617.0 GOPS; although its power consumption is lower than that of the GPU, its energy efficiency is only 5.97 GOPS/W. In contrast, the accelerator based on the Zynq-7020 keeps power consumption at only 1.95 W, with a throughput of 37.5 GOPS and an inference latency of 661.3 ms per image. Although the latency is higher than that of CPU and GPU platforms, it is still sufficient for UAV-based photovoltaic defect inspection tasks. Ultimately, the energy efficiency reaches 19.23 GOPS/W, which is 3.2 times that of the CPU platform and about 1.5 times that of the GPU platform.
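
The reported hardware metrics can be reproduced from the quoted quantities. The sketch assumes that the 24.8 GFLOPs of the network is the per-frame operation count used for the Zynq throughput figure; this is consistent with the numbers but is not stated explicitly, and the function names are hypothetical.

```python
def throughput_gops(giga_ops_per_frame, latency_s):
    """Throughput = operations per frame x frames per second."""
    return giga_ops_per_frame / latency_s


def energy_efficiency(gops, power_w):
    """Energy efficiency in GOPS/W."""
    return gops / power_w


zynq_gops = throughput_gops(24.8, 0.6613)       # 661.3 ms per image
zynq_eff = energy_efficiency(37.5, 1.95)        # reported throughput and power
gpu_eff = energy_efficiency(2137.4, 165.9)
cpu_eff = energy_efficiency(617.0, 103.4)

print(f"Zynq throughput: {zynq_gops:.1f} GOPS")
print(f"Efficiency (GOPS/W): Zynq {zynq_eff:.2f}, GPU {gpu_eff:.2f}, CPU {cpu_eff:.2f}")
print(f"Zynq vs CPU: {zynq_eff / cpu_eff:.1f}x, Zynq vs GPU: {zynq_eff / gpu_eff:.1f}x")
```

The computed values match the reported 37.5 GOPS, 19.23 GOPS/W, 12.88 GOPS/W, and 5.97 GOPS/W, which supports the per-second reading of the throughput definition.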

4.6. Performance Comparison with Existing Advanced Hardware Acceleration Schemes

In the vertical application field of photovoltaic panel defect detection, the performance comparison is reported in Table 8. Reference [38] deploys a lightweight YOLOv4-Tiny architecture on the same Zynq-7020 platform. Compared with Reference [38], the proposed system reduces LUT consumption from 44.4 k to 31.8 k and, with the reuse mechanism of the convolution IP core, increases the overall throughput to 37.5 GOPS. The power consumption reported in Reference [39] is 1.2 W, but that design relies on the high-end multi-core heterogeneous chip ZU9EG, which costs about 1500 dollars. In contrast, this work achieves high-concurrency computation on a hardware platform that costs only about 150 dollars, which demonstrates a favorable cost-performance ratio for industrial deployment.

For different object detection tasks on the same hardware platform, References [40,41] deploy YOLOv3-Tiny and YOLOv5s, respectively. In particular, the method in Reference [41], which is designed for tiny object feature extraction, uses 205 DSPs at a clock frequency of 100 MHz and achieves an overall throughput of 10.2 GOPS and an energy efficiency of 3.64 GOPS/W. In this work, under a clock frequency of 200 MHz, the proposed design uses only 192 DSPs. Benefiting from the single-path deployment structure of RepConv and the pipeline scheduling of on-chip buffers, DARE-YOLO achieves a total power consumption of 1.95 W, with an overall throughput of 37.5 GOPS and an energy efficiency of 19.23 GOPS/W. In addition, to maintain higher detection accuracy, the system sacrifices part of the inference speed, keeping it within an acceptable range for UAV inspection scenarios. The actual inference latency is 661.3 ms, which is higher than that reported in [40,41].

4.7. Discussion

As shown in the hardware comparisons in Table 7 and Table 8, the integration of the DARE-YOLO algorithm with the Zynq-7020 platform provides a practical trade-off solution to the multi-criteria optimization problem described in Section 2.1. From the power consumption perspective, compared with high-performance GPU and CPU platforms, the system sacrifices part of the throughput to satisfy the low-power constraints of edge deployment. This trade-off aligns well with the industry demand for green computing and sustainable development. From an economic perspective, compared with high-end FPGA families such as ZU9EG, the selected hardware platform demonstrates strong cost-effectiveness and feasibility. From the accuracy perspective, the mixed-precision quantization strategy combined with algorithmic re-architecture preserves high detection performance.

In practical deployment scenarios, the system is integrated with a micro UAV equipped with an RGB camera. With the support of a wireless transmission module, it forms a complete, low-cost, and highly maneuverable inspection system. Although promising results have been achieved, the system still has certain limitations in both hardware deployment and visual perception. On the hardware side, FPGA devices are constrained by on-chip resources, making it difficult to deploy highly complex models. Compared with purely software-based solutions, updating the network architecture requires re-synthesizing and recompiling the hardware logic, which reduces deployment flexibility. In addition, from the perspective of large-scale production, its economic efficiency is lower than that of application-specific integrated circuits (ASICs). On the visual perception side, the current model does not account for internal structural defects such as micro-cracks and hot spots. These defects cannot be captured by standard RGB cameras and require specialized electroluminescence or infrared thermal imaging equipment. In future work, the dataset will be expanded by collecting a wider range of surface defects under diverse environmental conditions. Furthermore, multimodal data fusion combining infrared thermal imaging and RGB imaging will be introduced to develop a more comprehensive photovoltaic inspection system.

5. Conclusions

To address the practical conflict in real photovoltaic power station inspection, where defect targets are small, background interference is complex, and portable edge devices are constrained by power and computing resources, this paper proposes a lightweight object detection system with deep co-optimization for heterogeneous hardware.

At the algorithm level, this paper designs DARE-YOLO, which integrates RepConv, DCB, and DDH. The proposed model enhances feature perception while maintaining an extremely simple forward backbone. With only 6.4 M parameters, it achieves 93.84% mAP@0.5, which is 7.44 percentage points higher than the classical YOLOv5s baseline, and effectively alleviates the missed detection problem of small defects in conventional lightweight networks. At the hardware acceleration and edge deployment level, the system customizes a convolution IP core on the Zynq-7020 heterogeneous platform and integrates a mixed fixed-point quantization strategy with a three-dimensional pipelined unfolding scheme. Experimental results show that this hardware architecture achieves a computational throughput of 37.5 GOPS and image inference latency of 661.3 ms at a low power consumption of 1.95 W, with an energy efficiency of 19.23 GOPS/W. Its energy efficiency reaches 3.2 times that of the high-performance CPU Intel i9-13980.

Overall, the proposed hardware–software co-design framework bridges the gap between algorithm iteration and underlying physical constraints at the theoretical level, and demonstrates a favorable balance among high accuracy, acceptable latency, and low power consumption in engineering practice, providing a highly energy-efficient and sustainable solution for the maintenance of photovoltaic power plants.

Author Contributions

Conceptualization, F.X. and Y.Y.; methodology, F.X.; software, Y.Y.; validation, F.X.; formal analysis, C.Q. and H.S.; investigation, S.C.; writing—original draft preparation, Y.Y. and F.X.; writing—review and editing, C.Q. and S.L.; visualization, S.C.; supervision, F.X.; funding acquisition, C.Q. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Shenzhen Municipal Science and Technology Innovation Bureau (JCYJ20250604145646062), Power China Key Technology Research and Development Program (DJ-HXGG-2025-25), and Jinzhou 2025 Guided Science and Technology Program (JZ2025B061).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DARE	Dilated Aggregation and Reparameterized Edge
DCB	Dilated Context Block
DDH	Dual-scale Decoupled Head
CNNs	convolutional neural networks
cls	classification
reg	regression
BN	Batch Normalization
ICNN	Improved Convolutional Neural Network Algorithm
PV	photovoltaic
PL	programmable logic
PS	processing system
RAM	Random Access Memory
BRAM	block random access memory
DSPs	digital signal processors
FFs	flip-flops
LUTs	lookup tables

References

1. Wang, J.; Bi, L.; Sun, P.; Jiao, X.; Ma, X.; Lei, X.; Luo, Y. Deep-learning-based automatic detection of photovoltaic cell defects in electroluminescence images. Sensors 2022, 23, 297.
2. Karakan, A. Detection of Defective Solar Panel Cells in Electroluminescence Images with Deep Learning. Sustainability 2025, 17, 1141.
3. Lu, H.; Huang, X.; Shi, H.; He, H. Defect detection of photovoltaic panels based on deep learning. In Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE); IEEE: New York, NY, USA, 2023; pp. 99–103.
4. Islam, M.; Rashel, M.R.; Ahmed, M.T.; Islam, A.K.M.K.; Tlemçani, M. Artificial intelligence in photovoltaic fault identification and diagnosis: A systematic review. Energies 2023, 16, 7417.
5. Su, B.; Chen, H.; Chen, P.; Bian, G.-B.; Liu, K.; Liu, W. Deep learning-based solar-cell manufacturing defect detection with complementary attention network. IEEE Trans. Ind. Inform. 2020, 17, 4084–4095.
6. Kuznetsov, P.; Kotelnikov, D.; Yuferev, L.; Panchenko, V.; Bolshev, V.; Jasiński, M.; Flah, A. Method for the automated inspection of the surfaces of photovoltaic modules. Sustainability 2022, 14, 11930.
7. Fan, Z.; Yan, Z.; Wen, S. Deep learning and artificial intelligence in sustainability: A review of SDGs, renewable energy, and environmental health. Sustainability 2023, 15, 13493.
8. Alhazmi, A.; Kholoud, M.; Eke, C.I. A Systematic Review of Advances in Deep Learning Architectures for Efficient and Sustainable Photovoltaic Solar Tracking: Research Challenges and Future Directions. Sustainability 2025, 17, 9625.
9. Czimmermann, T.; Ciuti, G.; Milazzo, M.; Chiurazzi, M.; Roccella, S.; Oddo, C.M.; Dario, P. Visual-based defect detection and classification approaches for industrial applications—A survey. Sensors 2020, 20, 1459.
10. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
11. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
12. Parhi, K.K. VLSI Digital Signal Processing Systems: Design and Implementation; John Wiley & Sons: Hoboken, NJ, USA, 2007.
13. Guo, K.; Zeng, S.; Yu, J.; Wang, Y.; Yang, H. A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 2019, 12, 2.
14. Tang, J.; Liu, S.; Zhao, D.; Tang, L.; Zou, W.; Zheng, B. PCB-YOLO: An improved detection algorithm of PCB surface defects based on YOLOv5. Sustainability 2023, 15, 5963.
15. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A review of YOLO algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
16. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
17. Liang, Q.; Tian, W.; Han, S.; Sun, J.; Jia, M.; Chi, J. ACN-YOLO: Solar cell surface defect detection based on improved YOLOv8s. AIP Adv. 2025, 15, 105132.
18. Li, W.; Li, J.; Cao, B.; Zhu, J.; Tian, M. LW-YOLO: A Lightweight Network for Defect Detection in Photovoltaic Modules. In Proceedings of the 2024 4th International Conference on Robotics, Automation and Intelligent Control (ICRAIC); IEEE: New York, NY, USA, 2024; pp. 538–543.
19. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 7464–7475.
20. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910.
21. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2980–2988.
22. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586.
23. Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 10213–10224.
24. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 13733–13742.
25. Qiu, S.; Yang, C.H.; Wu, L.; Song, W.; Pan, J.-Z. Improved Reparameterization You-Only-Look-Once v5 Model for Strip-Steel Surface Defect Detection. Sens. Mater. 2024, 36, 4881–4902.
26. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 13713–13722.
27. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 11963–11975.
28. Saidani, T.; Ghodhbani, R.; Alhomoud, A.; Alshammari, A.; Zayani, H.; Ben Ammar, M. Hardware acceleration for object detection using YOLOv5 deep learning algorithm on Xilinx Zynq FPGA platform. Eng. Technol. Appl. Sci. Res. 2024, 14, 13066–13071.
29. Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2. IEEE Access 2020, 8, 116569–116585.
30. Mani, P.; Komarasamy, P.R.G.; Rajamanickam, N.; Shorfuzzaman, M.; Abdelfattah, W.M. Enhancing sustainable transportation infrastructure management: A high-accuracy, FPGA-based system for emergency vehicle classification. Sustainability 2024, 16, 6917.
31. Ponzina, F.; Machetti, S.; Rios, M.; Denkinger, B.W.; Levisse, A.; Ansaloni, G.; Peón-Quirós, M.; Atienza, D. A hardware/software co-design vision for deep learning at the edge. IEEE Micro 2022, 42, 48–54.
32. Kim, H.; Kim, T.K. Design and Implementation of a YOLOv2 Accelerator on a Zynq-7000 FPGA. Sensors 2025, 25, 6359.
33. Xu, G.; Zhao, W.; Ren, Z.; Denkinger, B.W.; Levisse, A.; Ansaloni, G.; Peon-Quiros, M.; Atienza, D. Design and Implementation of the High-Performance YOLO Accelerator Based on Zynq FPGA. In Proceedings of the 2024 3rd International Conference on Electronics and Information Technology (EIT); IEEE: New York, NY, USA, 2024; pp. 246–250.
34. Lv, Q.; Yu, X.; Liu, Y.; Hu, T.; Wu, H.; Wang, A. Design and implementation of object detection system based on Zynq hardware acceleration. In Proceedings of the International Conference on Measurement, Communication, and Virtual Reality (MCVR 2024); SPIE: San Francisco, CA, USA, 2025; Volume 13634, pp. 229–239.
35. Wang, J.; Gu, S. FPGA implementation of object detection accelerator based on Vitis-AI. In Proceedings of the 2021 11th International Conference on Information Science and Technology (ICIST); IEEE: New York, NY, USA, 2021; pp. 571–577.
36. Li, S.; Yu, C.; Xie, T.; Feng, W. A power-efficient optimizing framework FPGA accelerator for YOLO. In Proceedings of the 2022 15th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI); IEEE: New York, NY, USA, 2022; pp. 1–6.
37. Tan, A.; Yang, Y.; Duan, S.; Wang, L. FPGA-based convolution accelerator and memristor IP core for cooperative acceleration of the YOLO network. In Proceedings of the 2024 4th International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI); IEEE: New York, NY, USA, 2024; pp. 54–57.
38. Ye, J.; Liu, Y.; Chen, H.; Wang, C.; Zhou, Y.; Yang, L.; Zhang, W. Edge computing accelerator for real-time defect detection of photovoltaic panel on lightweight FPGAs. IEEE Trans. Instrum. Meas. 2025, 74, 3001815.
39. Zhao, J.; Xiao, H.; Zhi, M.; Su, W. An Improved Convolutional Neural Network Algorithm (ICNN) with Channel Attention Mechanism and Hardware Acceleration Concept for Solar Panel Dust Detection. IEEE Access 2025, 13, 189814–189832.
40. Calì, R.; Falaschetti, L.; Biagetti, G. Optimized implementation of YOLOv3-tiny for real-time image and video recognition on FPGA. Electronics 2025, 14, 3993.
41. Fang, N.; Li, L.; Zhou, X.; Zhang, W.; Chen, F. An FPGA-based hybrid overlapping acceleration architecture for small-target remote sensing detection. Remote Sens. 2025, 17, 494.

Figure 1. Overall architecture of DARE-YOLO.

Figure 2. Lightweight feature extraction backbone network.

Figure 3. Structure of dilated convolution block.

Figure 4. Dual-scale Decoupled Head.

Figure 5. Overall architecture of hardware acceleration system.

Figure 6. Architecture of the custom convolution IP core.

Figure 7. Training stability of DARE-YOLO over 300 epochs. (a) Mean Average Precision (mAP@0.5) curve. (b) Total training and validation loss curves.

Figure 8. Top-level hardware interconnection topology.

Figure 9. Physical view of Zynq-7020 core board.

Figure 10. Physical view of defect detection experimental platform.

Figure 11. Visualization of defect detection results.

Figure 12. Normalized confusion matrices on the photovoltaic defect dataset: (a) YOLOv8s. (b) DARE-YOLO.

Table 1. Comparison of backbone network structures.

Network Stage (Output)	DARE-YOLO	YOLOv6N/YOLOv6S
Stem	1 RepConv	1 RepConv
Layer1 (P2)	2 RepConv	3 RepConv
Layer2 (P3)	4 RepConv	5 RepConv
Layer3 (P4)	4 RepConv	7 RepConv
Layer4 (P5)	0 (Truncated)	3 RepConv
Total Blocks	11	19

Table 2. Accuracy comparison under different quantization strategies.

Quantization Strategy (DARE-YOLO)	Weight Bit-Width	Feature Map Bit-Width	Accumulator Bit-Width	mAP@0.5 (%)	Accuracy Drop (%)
Full-Precision Baseline	FP32	FP32	FP32	95.2	Baseline
Conventional Quantization	INT8	INT8	INT8	90.7	4.5
Proposed Mixed Fixed-Point Strategy	INT8	INT8	INT32	93.84	1.36
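
Both quantized rows of Table 2 keep weights and feature maps at INT8 and differ only in accumulator width. One plausible reading is that a narrow accumulator clips the running sum of INT8 products, whereas a 32-bit accumulator holds it exactly. The NumPy sketch below illustrates that effect on a single dot product. It is not the authors' implementation: the saturating model of the narrow accumulator, the 3 × 3 × 64 kernel size, the random seed, and the function names are all assumptions made for the example.

```python
import numpy as np

def dot_int32(w, x):
    """Dot product of two INT8 vectors using a 32-bit accumulator."""
    return int(np.dot(w.astype(np.int32), x.astype(np.int32)))

def dot_int8_saturating(w, x):
    """Same dot product, but the running sum is clamped to [-128, 127]
    after every multiply-accumulate (assumed model of an INT8 accumulator)."""
    acc = 0
    for wi, xi in zip(w.tolist(), x.tolist()):
        acc = max(-128, min(127, acc + wi * xi))
    return acc

rng = np.random.default_rng(0)
# Hypothetical operands: one 3x3 kernel over 64 input channels.
w = rng.integers(-128, 128, size=3 * 3 * 64, dtype=np.int8)
x = rng.integers(0, 128, size=w.size, dtype=np.int8)  # non-negative activations

exact = dot_int32(w, x)
narrow = dot_int8_saturating(w, x)
print("INT32 accumulator:", exact)
print("INT8 saturating accumulator:", narrow)
```

For a vector of this length the largest possible magnitude of the sum is 576 × 128 × 128, which is well inside the INT32 range, so the wide accumulator never overflows in this setting.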

Table 3. Software and hardware environment configuration.

Component	Specification/Version
Operating System (OS)	Windows 11 64-bit
Central Processing Unit (CPU)	Intel Core i9-13980
Graphics Processing Unit (GPU)	NVIDIA GeForce RTX 4070 Laptop GPU
Random Access Memory (RAM)	16 GB DDR5
Deep Learning Framework	PyTorch 2.6.0
Parallel Computing Platform	CUDA 12.6
Programming Language	Python 3.13.5

Table 4. Ablation study results of improved modules.

Model	mAP@0.5 (%)	Params (M)	GFLOPs
Baseline	84.12	5.5	21.6
Rep-YOLO	87.8	5.5	21.6
DCB-Rep-YOLO	90.28	6.1	23.9
DARE-YOLO	93.84	6.4	24.8
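
The per-module contributions implied by Table 4 can be read off by differencing consecutive rows. The short sketch below does this arithmetic. The row values are copied from the table, while the variable and function names are illustrative only.

```python
# Rows of Table 4: (model, mAP@0.5 in %, parameters in M, GFLOPs)
ABLATION = [
    ("Baseline", 84.12, 5.5, 21.6),
    ("Rep-YOLO", 87.8, 5.5, 21.6),
    ("DCB-Rep-YOLO", 90.28, 6.1, 23.9),
    ("DARE-YOLO", 93.84, 6.4, 24.8),
]

def stepwise_deltas(rows):
    """Return (model, mAP gain, parameter increase, GFLOPs increase)
    for each row relative to the row before it."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append((
            cur[0],
            round(cur[1] - prev[1], 2),
            round(cur[2] - prev[2], 1),
            round(cur[3] - prev[3], 1),
        ))
    return deltas

for model, d_map, d_params, d_gflops in stepwise_deltas(ABLATION):
    print(f"{model}: +{d_map:.2f} mAP, +{d_params:.1f} M params, +{d_gflops:.1f} GFLOPs")
```

Read this way, the RepConv step adds accuracy at no tabulated cost in parameters or GFLOPs, while the DCB and DDH steps each trade a small increase in both for a further accuracy gain.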

Table 5. Performance comparison with mainstream object detection models.

Model	Input Size	mAP@0.5 (%)	Params (M)	GFLOPs
YOLOv5s	640 × 640	86.4	7.0	16.5
YOLOv8s	640 × 640	89.3	11.1	28.6
YOLOv10s	640 × 640	91.6	7.2	21.6
YOLO11s	640 × 640	92.7	9.3	21.5
DARE-YOLO	640 × 640	93.84	6.4	24.8

Table 6. FPGA hardware resource utilization.

Module	LUT	FF	BRAM	DSP
Available on board	53,200	106,400	140	220
Conv Accelerators	28,450 (53.5%)	34,200 (32.1%)	114 (81.4%)	192 (87.3%)
AXI Interconnect	3120 (5.9%)	4250 (4.0%)	4 (2.9%)	0
Miscellaneous	237 (0.4%)	405 (0.4%)	0	0
Total utilization	31,807 (59.8%)	38,855 (36.5%)	118 (84.3%)	192 (87.3%)
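
The totals and percentages in Table 6 follow from summing the per-module counts and dividing by the resources available on the device. The sketch below reproduces that bookkeeping as a consistency check. The counts come from the table, and the dictionary layout and function name are assumptions for illustration.

```python
# Resources available on the board, from the first row of Table 6.
AVAILABLE = {"LUT": 53200, "FF": 106400, "BRAM": 140, "DSP": 220}

# Per-module usage, from Table 6.
MODULES = {
    "Conv Accelerators": {"LUT": 28450, "FF": 34200, "BRAM": 114, "DSP": 192},
    "AXI Interconnect": {"LUT": 3120, "FF": 4250, "BRAM": 4, "DSP": 0},
    "Miscellaneous": {"LUT": 237, "FF": 405, "BRAM": 0, "DSP": 0},
}

def total_utilization(modules, available):
    """Sum each resource over all modules and express it as a
    percentage of what is available, rounded to one decimal place."""
    totals = {res: sum(m[res] for m in modules.values()) for res in available}
    percent = {res: round(100 * totals[res] / available[res], 1) for res in available}
    return totals, percent

totals, percent = total_utilization(MODULES, AVAILABLE)
for res in AVAILABLE:
    print(f"{res}: {totals[res]} ({percent[res]}%)")
```

BRAM and DSP are the binding resources in this breakdown, which matches the convolution accelerators dominating both columns.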

Table 7. Performance comparison across hardware platforms.

Platform	Power (W)	Throughput (GOPS)	Energy Efficiency (GOPS/W)	Latency (ms)
Intel i9-13980	103.4	617.0	5.97	40.2
RTX 4070	165.9	2137.4	12.88	11.6
Zynq-7020	1.95	37.5	19.23	661.3
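
The energy-efficiency column of Table 7 is throughput divided by power. The latency column also appears consistent with dividing the 24.8 GFLOPs workload of DARE-YOLO (Tables 4 and 5) by each platform's throughput. That second relationship is inferred from the tabulated numbers rather than stated as a formula, so the sketch below should be read as a consistency check under that assumption. The function names and the energy-per-image figure are illustrative additions.

```python
# Rows of Table 7: platform -> (power in W, throughput in GOPS)
PLATFORMS = {
    "Intel i9-13980": (103.4, 617.0),
    "RTX 4070": (165.9, 2137.4),
    "Zynq-7020": (1.95, 37.5),
}

# Assumption: the 24.8 GFLOPs from Tables 4 and 5 is the number of
# operations needed per image, counted in the same units as GOPS.
MODEL_GOPS = 24.8

def energy_efficiency(power_w, throughput_gops):
    """Throughput delivered per watt, in GOPS/W."""
    return throughput_gops / power_w

def latency_ms(workload_gops, throughput_gops):
    """Time to process one image, in milliseconds."""
    return 1000.0 * workload_gops / throughput_gops

def energy_per_image_j(power_w, latency):
    """Energy drawn while processing one image, in joules."""
    return power_w * latency / 1000.0

for name, (power, gops) in PLATFORMS.items():
    eff = energy_efficiency(power, gops)
    lat = latency_ms(MODEL_GOPS, gops)
    joules = energy_per_image_j(power, lat)
    print(f"{name}: {eff:.2f} GOPS/W, {lat:.1f} ms, {joules:.2f} J/image")
```

Under these assumptions the Zynq-7020 is the slowest platform per image but also the one that spends the least energy on each image, which is the trade-off the efficiency column summarizes.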

Table 8. Performance comparison with existing FPGA acceleration schemes.

	[38]	[39]	[40]	[41]	This Work
Year	2025	2025	2025	2025	2026
Platform	Zynq-7020	ZU9EG	Zynq-7020	Zynq-7020	Zynq-7020
Board cost	$150	$1500	$150	$150	$150
Frequency (MHz)	200	-	200	100	200
Network	YOLOv4-Tiny	ICNN	YOLOv3-Tiny	YOLOv5s	DARE-YOLO
BRAM	94	375	138	107	118
DSP	220	148	204	205	192
LUT	44.4 k	25.1 k	41.6 k	46 k	31.8 k
Throughput (GOPS)	24.25	-	64.1	10.2	37.5
Energy efficiency (GOPS/W)	5.29	-	25.1	3.64	19.23
Latency (ms)	272	87	6.07	477	661.3
Power (W)	4.58	1.2	2.55	2.8	1.95
mAP@0.5 (%)	62.73	-	17.68	93.2	93.84

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Y.; Xing, F.; Qin, C.; Chen, S.; Shin, H.; Lee, S. DARE-YOLO: A Lightweight Object Detection Algorithm and Its FPGA Acceleration for Sustainable PV Panel Inspection. Sustainability 2026, 18, 4999. https://doi.org/10.3390/su18104999