Article

PONet: A Compact RGB-IR Fusion Network for Vehicle Detection on OrangePi AIpro

1 The School of Geography and Remote Sensing, Guangzhou University, Guangzhou 510006, China
2 Guangdong Zhengchuang Space-Time Information Technology Co., Ltd., Guangzhou 510006, China
3 The Institute of Aerospace Remote Sensing Innovations, Guangzhou University, Guangzhou 510006, China
4 School of Geography and Planning, Sun Yat-sen University, Guangzhou 510275, China
5 Guangdong Provincial Key Laboratory for Urbanization and Geo-Simulation, Guangzhou 510275, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2650; https://doi.org/10.3390/rs17152650
Submission received: 2 June 2025 / Revised: 22 July 2025 / Accepted: 25 July 2025 / Published: 30 July 2025

Abstract

Multi-modal object detection that fuses RGB (Red-Green-Blue) and infrared (IR) data has emerged as an effective approach for addressing challenging visual conditions such as low illumination, occlusion, and adverse weather. However, most existing multi-modal detectors prioritize accuracy while neglecting computational efficiency, making them unsuitable for deployment on resource-constrained edge devices. To address this limitation, we propose PONet, a lightweight and efficient multi-modal vehicle detection network tailored for real-time edge inference. PONet incorporates Polarized Self-Attention to improve feature adaptability and representation with minimal computational overhead. In addition, a novel fusion module is introduced to effectively integrate RGB and IR modalities while preserving efficiency. Experimental results on the VEDAI dataset demonstrate that PONet achieves a competitive detection accuracy of 82.2% mAP@0.5 while sustaining a throughput of 34 FPS on the OrangePi AIpro 20T device. With only 3.76 M parameters and 10.2 GFLOPs (Giga Floating Point Operations), PONet offers a practical solution for edge-oriented remote sensing applications requiring a balance between detection precision and computational cost.

1. Introduction

Object detection is a fundamental task in computer vision with widespread applications, including remote sensing imagery analysis [1], video surveillance [2], and autonomous driving [3]. Although recent advances in detection methods have significantly improved performance, most state-of-the-art techniques primarily rely on RGB images [4,5,6]. However, the performance of RGB-based detectors degrades markedly under challenging conditions such as low-light environments, intense glare, or adverse weather. This is because RGB sensors depend heavily on ambient lighting, making them unreliable in poorly illuminated or highly variable lighting scenarios. Additionally, RGB imaging struggles when objects are partially obscured by environmental factors such as smoke, fog, or physical barriers, often leading to inaccurate detection results. In contrast, infrared (IR) sensors capture thermal radiation emitted by objects, allowing for more robust detection across diverse lighting conditions and even through visual occlusions [7,8]. By leveraging the complementary strengths of RGB and IR modalities, multi-modal object detection has shown considerable promise in improving target characterization and enhancing detection accuracy under complex environmental conditions [9,10]. Specifically, IR imaging performs well in low-visibility situations, while RGB imaging provides rich spatial and color information under optimal lighting. Their integration, therefore, offers a balanced and robust foundation for accurate object detection.
To improve the accuracy and robustness of multi-modal object detection, various fusion architectures have been developed. Wagner et al. [11] first introduced a two-branch convolutional network (ConvNet) with a Halfway Fusion strategy, which fused mid-level features from the RGB and IR branches through simple concatenation. While such early fusion techniques are straightforward and computationally efficient, they often result in feature redundancy and fail to fully exploit the complementary information between modalities; simply concatenating mid-level features prevents effective modality interaction, so the detection performance and generalization capability of these models remain limited. In response to these shortcomings, more sophisticated fusion strategies have emerged. For example, Fang et al. [12] proposed the Cross-Modality Fusion Transformer (CFT), built upon a dual-stream YOLOv5 backbone, to perform both intra- and inter-modal feature fusion using self-attention mechanisms. Similarly, Shao et al. [13] developed MOD-YOLO, which integrates a Cross-Stage Partial CFT module to further improve multi-modal feature interaction. Meng et al. [14] enhanced target representation by incorporating a pre-trained semantic branch to guide the fusion process, while Zhang et al. [15] introduced an illumination-guided feature weighting module to adaptively balance modality-specific features. Furthermore, Zhang et al. [16] addressed the cross-modal misalignment issue by designing a region alignment module that adaptively adjusts positional biases during fusion, leading to more reliable multi-modal representations.
In addition to detection accuracy, real-time performance is critical for remote sensing applications, particularly in time-sensitive scenarios such as UAV-based traffic monitoring, disaster assessment, and military reconnaissance. In these cases, rapid and accurate detection of vehicles or objects is essential for enabling timely response and decision-making. Moreover, many of these applications rely on edge computing platforms with limited resources. As a result, designing lightweight models capable of real-time inference without sacrificing detection accuracy becomes a necessary objective in multi-modal remote sensing.
To address these constraints, we propose PONet, a lightweight yet effective multi-modal vehicle detection network tailored for edge deployment scenarios. PONet leverages Polarized Self-Attention (PSA) modules to enhance the quality of feature extraction and fusion between RGB and IR modalities. PSA allows the network to capture both channel-wise and spatial-wise dependencies in a more structured and efficient manner, thereby improving feature representation without significantly increasing computational overhead. Additionally, we design a custom fusion mechanism that tightly integrates multi-modal features while preserving computational efficiency. With only 3.76 M parameters and a computational footprint of 10.2 GFLOPs, PONet achieves high detection performance with an average mAP@0.5 of 82.2% on the VEDAI dataset, while maintaining a throughput of 34 FPS on the OrangePi AIpro 20T—demonstrating its potential for practical remote sensing applications on edge devices.
The main contributions of this paper are summarized as follows:
  • We propose PONet, a novel lightweight multi-modal object detection network optimized for RGB-IR fusion under constrained edge computing environments.
  • We integrate Polarized Self-Attention into PONet to improve feature representation while keeping the model compact and efficient.
  • We achieve competitive performance on the VEDAI dataset with an mAP@0.5 of 82.2%, while sustaining inference at 34 FPS on the OrangePi AIpro 20T, validating the network’s suitability for edge deployment.

2. Related Work

2.1. Visible–Infrared Object Detection Methods

Conventional object detection methods predominantly rely on unimodal visible light (RGB) imagery. However, their effectiveness significantly diminishes under adverse conditions such as low illumination, overexposure, or occlusions during nighttime. These limitations hinder their performance in complex, real-world environments. In contrast, infrared (IR) imagery captures thermal emissions from objects, offering complementary visual cues that are invariant to lighting conditions. Consequently, the integration of RGB and IR modalities within a unified detection framework has gained increasing attention for enhancing detection precision and overall system robustness [10,17].
Wagner et al. [11] were among the first to introduce a dual-branch convolutional neural network for multispectral pedestrian detection. Their study demonstrated that mid-level feature fusion—referred to as the Halfway Fusion strategy—outperformed both early and late fusion approaches. This pioneering work laid the foundation for many later methods that adopted dual-stream architectures, often incorporating YOLO-based backbones, to better exploit cross-modal interactions [17].
For instance, Fang et al. [12] developed the Cross-Modality Fusion Transformer (CFT), utilizing self-attention mechanisms to facilitate both intra-modal and inter-modal information fusion. Building on this, Shao et al. [13] proposed MOD-YOLO, which enhances feature fusion through the introduction of the Cross-Stage Partial CFT (CSP-CFT) module, and refines the detection head by incorporating the VoV-GSCSP structure. Furthermore, they designed a novel SIoU loss function to improve localization accuracy.
Other recent advancements include Dual-YOLO [18] by Bao et al., which implements attention-based fusion, and GMD-YOLO [10] by Sun et al., which introduces multi-scale modulation techniques to strengthen modality complementarity and facilitate small-object detection. Additionally, Zhang et al. [19] proposed SuperYOLO, combining super-resolution learning with multi-modal fusion to address challenges posed by complex scenes and small targets.
Inspired by the progress in this field, our work presents a refined dual-stream detection framework that enhances both intra-modal representation and inter-modal collaboration. This is realized through a novel channel exchange strategy coupled with a Spatial-Channel Attention Fusion Module, designed to optimize feature integration across modalities.

2.2. Lightweight Models for Object Detection

While deep learning techniques have achieved significant advancements in image processing tasks, their practical deployment on edge or embedded devices is often hindered by substantial computational costs and large parameter counts. To address these limitations, lightweight model design has become a key area of research. Current model compression and acceleration strategies can be broadly categorized into four main approaches: designing efficient network architectures [20], applying quantization techniques [21], utilizing knowledge distillation [22], and performing network pruning [23].
Quantization reduces the precision of model parameters and activations, enabling faster and more memory-efficient inference. Traditional neural networks typically operate with 32-bit floating-point (FP32) representations. By converting these to lower-precision formats such as FP16 or INT8, quantization can significantly decrease model size and computation demands, often with minimal loss in performance.
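As a concrete illustration of these two precision levels (independent of PONet's own pipeline), the sketch below uses PyTorch's built-in utilities to obtain an FP16 copy of a model and a dynamically INT8-quantized copy; the toy model and layer choices are ours.

```python
import copy
import torch
import torch.nn as nn

# Illustrative stand-in for a trained detector; any nn.Module can be treated the same way.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

# FP16: halves weight/activation storage; usually executed on a GPU/NPU.
model_fp16 = copy.deepcopy(model).half()

# Dynamic INT8 quantization of the linear layers (PyTorch CPU inference path).
model_int8 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 64, 64)
print(model_int8(x).shape)  # torch.Size([1, 10]); accuracy loss is typically small
```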
Knowledge distillation is a training paradigm where a compact student network learns from the outputs or intermediate features of a larger, well-trained teacher model. Instead of relying solely on hard labels, the student is guided by the teacher’s soft predictions, which contain richer information about class relationships and decision boundaries, resulting in improved learning efficiency.
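A minimal sketch of this training objective is shown below, blending the teacher's temperature-softened predictions with the usual hard-label loss; the temperature and weighting values are illustrative, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend soft-target distillation with the standard hard-label cross-entropy."""
    # Soft targets: the student matches the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```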
Network pruning focuses on eliminating redundant or non-essential components of a network, such as neurons, channels, or even layers. By identifying and removing these components, pruning reduces both memory usage and inference latency, while typically preserving overall model accuracy.
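The sketch below illustrates both unstructured and structured pruning on a single convolution using torch.nn.utils.prune; the pruning ratios are arbitrary examples.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Unstructured: remove the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Structured: zero out 25% of entire output channels (dim=0) by L2 norm.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(conv, "weight")
print(float((conv.weight == 0).float().mean()))  # fraction of zeroed weights
```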
Lightweight model architecture design aims to create efficient neural networks from the ground up by reducing the number of parameters and operations required. This approach is particularly important for real-time applications on devices with limited resources, such as mobile phones or embedded systems.
Among existing lightweight architectures, MobileNetV1 [24] introduced depthwise separable convolutions, which break down standard convolution operations into depthwise and pointwise components, achieving a significant reduction in computation and parameters. MobileNetV2 [25] refined this design by incorporating inverted residuals and linear bottlenecks, as well as residual connections inspired by ResNet. MobileNetV3 [26] further advances this series by leveraging neural architecture search (NAS) and incorporating optimizations such as Squeeze-and-Excitation (SE) modules, h-swish activation, and refined channel configurations to enhance both accuracy and speed.
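The depthwise separable convolution at the heart of this family can be sketched in a few lines of PyTorch; the block below is a generic MobileNet-style layer, not code from the cited papers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: per-channel 3x3 conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

block = DepthwiseSeparableConv(32, 64)
print(block(torch.randn(1, 32, 56, 56)).shape)              # torch.Size([1, 64, 56, 56])
print(sum(p.numel() for p in block.parameters()))           # far fewer than a dense 3x3 conv
```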
ShuffleNet [27,28] employs group convolutions combined with channel shuffle operations to reduce computational burden while maintaining effective information exchange across channels. The group convolution decreases the number of operations, and the shuffle mechanism compensates for limited cross-channel interaction, thus preserving representational capacity.
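The channel shuffle operation itself is a simple reshape-transpose-reshape, as the following sketch shows (a generic re-implementation, not ShuffleNet's released code).

```python
import torch

def channel_shuffle(x, groups):
    """ShuffleNet-style shuffle: interleave channels across groups so that
    subsequent group convolutions can mix information between groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group axes
    return x.view(b, c, h, w)                  # flatten back to (B, C, H, W)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```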
GhostNet [29] introduces an innovative strategy that generates additional feature maps through cheap linear transformations, rather than relying entirely on standard convolutions. This reduces redundancy in feature extraction and leads to a more efficient network structure, making GhostNet a popular choice in the domain of lightweight model design.
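A simplified Ghost-style module can be sketched as a small primary convolution followed by cheap depthwise operations whose outputs are concatenated; the layer sizes below are illustrative only and differ from the official GhostNet implementation.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost-style block: a few 'primary' feature maps from a standard conv, plus
    'ghost' maps generated by cheap depthwise operations, concatenated together."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        primary_ch = out_ch // ratio
        cheap_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, 3, padding=1, groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        primary = self.primary(x)
        return torch.cat([primary, self.cheap(primary)], dim=1)

print(GhostModule(16, 32)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 32, 32, 32])
```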
In addition to convolutional optimizations, recent works have begun to explore efficient attention mechanisms as a means to improve performance while keeping computational costs low. Lightweight attention modules such as Squeeze-and-Excitation (SE) [30], Efficient Channel Attention (ECA) [31], and Coordinate Attention (CA) [32] enhance feature selection and inter-channel modeling with minimal overhead. These modules are particularly well-suited for multi-modal fusion tasks, as they can selectively emphasize informative features across modalities while preserving efficiency. The integration of such attention mechanisms into dual-stream or transformer-based frameworks forms a promising direction for designing high-performance, low-complexity object detection models—an idea we further explore in our proposed method.
Despite significant progress in multi-modal object detection, existing methods commonly suffer from two main drawbacks. First, many approaches adopt heavy backbone networks or complex fusion modules, resulting in high computational costs and limiting their applicability on edge devices with constrained resources. Second, simple fusion strategies, such as direct concatenation of RGB and IR features, often cause redundancy and fail to fully exploit complementary information, which hampers detection robustness and accuracy. To address these challenges, our proposed PONet integrates a lightweight Polarized Self-Attention mechanism that efficiently models spatial and channel-wise dependencies, coupled with an effective fusion module optimized for low computational complexity. This design achieves a favorable balance between detection performance and inference speed, making it suitable for real-time vehicle detection in resource-limited remote sensing scenarios.

3. Methods

3.1. Overall Architecture

The proposed architecture is designed to exploit the complementary characteristics of RGB and infrared (IR) imagery for robust multimodal object detection. An overview of the network is presented in Figure 1. Given an aligned pair of RGB and IR images, we employ two modality-specific convolutional backbones to extract hierarchical features at three semantic levels: shallow, middle, and deep. These multi-scale features encode diverse representations ranging from low-level texture to high-level semantic cues, which are essential for capturing objects under complex environmental conditions, including low illumination and occlusion.
To effectively integrate the multimodal features, each level’s RGB and IR features are processed through a lightweight yet expressive Fusion Module, which performs cross-modal attention-driven enhancement and residual blending. The resulting fused features from all levels are concatenated and fed into a Path Aggregation Feature Pyramid Network (PAFPN), which aggregates and refines multi-scale representations by combining low-resolution semantically strong features with high-resolution spatially rich features.
In the final neck stage (denoted as N1), we introduce a Polarized Self-Attention (PSA) module to further recalibrate the feature maps. Unlike conventional attention mechanisms, PSA decouples attention computation into channel and spatial domains, thereby facilitating fine-grained feature discrimination. The enhanced features are subsequently passed to the detection head, which performs object localization and classification.
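To make the data flow concrete, the toy sketch below mirrors the structure of Figure 1 with placeholder modules: a dummy backbone and simple concatenation stand in for the Fusion Module, PAFPN, PSA, and detection head, which are detailed in the following subsections. None of this is the authors' implementation.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Stand-in backbone emitting shallow/middle/deep feature maps (placeholder only)."""
    def __init__(self, chs=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else chs[i - 1], c, 3, stride=2, padding=1)
            for i, c in enumerate(chs)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [shallow, middle, deep]

# Schematic of the dual-stream flow: per-level fusion of RGB and IR features,
# whose outputs would then go to the PAFPN, the PSA module, and the detection head.
rgb_backbone, ir_backbone = DummyBackbone(), DummyBackbone()
fuse = lambda fr, fi: torch.cat([fr, fi], dim=1)   # placeholder for the Fusion Module
rgb, ir = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
fused = [fuse(fr, fi) for fr, fi in zip(rgb_backbone(rgb), ir_backbone(ir))]
print([f.shape for f in fused])
```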

3.2. Fusion Module

To effectively capture complementary cues from RGB and infrared (IR) modalities while maintaining the unique semantic and structural properties of each domain, we introduce a parameter-efficient yet highly expressive Fusion Module. This module is specifically designed to enhance cross-modal interaction without compromising computational efficiency, making it well-suited for deployment in resource-constrained environments such as drones and embedded systems.
At its core, the Fusion Module leverages a cross-modality Squeeze-and-Excitation (SE) attention mechanism [30], which allows one modality to modulate the spatially aligned features of the other. This is achieved by extracting global channel-wise descriptors from one branch and using them to reweight the features of the other, thereby enabling the model to dynamically emphasize informative features across modalities. Specifically, let $F_{\mathrm{RGB}}$ and $F_{\mathrm{IR}} \in \mathbb{R}^{C \times H \times W}$ denote the RGB and IR feature maps, respectively. The fusion is conducted through the following mechanism:
$$\tilde{F}_{\mathrm{RGB}} = \mathrm{Conv}_{3\times 3}\big(\mathrm{SE}(F_{\mathrm{IR}}) \odot F_{\mathrm{RGB}}\big) + w_{\mathrm{RGB}} \cdot F_{\mathrm{RGB}}, \qquad \tilde{F}_{\mathrm{IR}} = \mathrm{Conv}_{3\times 3}\big(\mathrm{SE}(F_{\mathrm{RGB}}) \odot F_{\mathrm{IR}}\big) + w_{\mathrm{IR}} \cdot F_{\mathrm{IR}}.$$
Here, the SE blocks are modality-specific but operate in a cross-guided fashion: $\mathrm{SE}(F_{\mathrm{IR}})$ computes channel-wise attention vectors from the IR stream and applies them to modulate the RGB features, and vice versa. The operation $\odot$ denotes element-wise multiplication across corresponding spatial positions and channels, allowing each channel to be scaled according to its global relevance from the complementary modality. $\mathrm{Conv}_{3\times 3}$ refers to a standard convolution block comprising a 3 × 3 convolution, batch normalization, and ReLU activation, collectively denoted as Update.
The inclusion of learnable residual weights w RGB and w IR is a crucial component of the design. These scalar parameters enable the model to adaptively control the strength of the identity mapping from each modality, ensuring that salient modality-specific information is preserved even as cross-modal features are blended. In practice, we initialize these weights to 1 and allow the model to fine-tune them during training, leading to a soft gating behavior where the network learns how much to trust the original features versus the attended ones. The overall architecture of the proposed fusion module is illustrated in Figure 2.
The core of our fusion design lies in the Squeeze-and-Excitation Attention (SEAttention) module [30], which introduces a lightweight yet effective method for channel-level feature recalibration. Rather than modeling spatial relationships, the SE module captures channel interdependencies, enabling the network to adaptively emphasize important feature channels.
The SE block comprises two main operations: squeeze and excitation. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, the squeeze operation performs global average pooling to obtain a compact descriptor:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \qquad c = 1, \ldots, C.$$
This generates a global context vector $z \in \mathbb{R}^{C}$. The excitation step then passes this through a bottleneck MLP:
$$s = \sigma\big(W_2 \cdot \delta(W_1 \cdot z)\big),$$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are learnable weights, $\delta(\cdot)$ denotes the ReLU activation, $\sigma(\cdot)$ is the sigmoid function, and $r$ is the reduction ratio (typically $r = 16$). Finally, the attention weights $s$ are applied back to the feature map by channel-wise multiplication:
$$F'_c = s_c \cdot F_c.$$
This mechanism adaptively highlights discriminative channels based on global context, which is especially useful for cross-modality scenarios where certain channels may become prominent depending on environmental conditions (e.g., thermal signature vs. visible color). The detailed pipeline of the SEAttention block is depicted in Figure 3.
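For concreteness, the squeeze and excitation steps above can be written as a compact PyTorch module; this is a generic SE implementation following [30], with layer names chosen by us.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation block: global average pooling (squeeze), a bottleneck
    MLP with sigmoid output (excitation), and channel-wise reweighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: z
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),                          # delta(W1 z)
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                                   # s = sigma(W2 ...)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                        # channel-wise recalibration

se = SEAttention(64)
print(se(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```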
In our design, we extend SEAttention into a cross-guided form to enable each modality to conditionally recalibrate the other. This ensures that the fusion process remains context-aware while preserving modality-specific structures.
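A minimal sketch of this cross-guided fusion, implementing the fusion equation above with learnable residual weights initialized to 1, is given below; the module and variable names are ours, and the exact layer configuration of the released PONet code may differ.

```python
import torch
import torch.nn as nn

def conv_bn_relu(channels):
    # "Update" block from the text: channel-preserving 3x3 conv + BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

class CrossGuidedSEFusion(nn.Module):
    """Channel attention computed from one modality gates the other, followed by an
    Update block and a weighted identity shortcut (w_rgb, w_ir initialized to 1)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        def gate():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, channels // reduction, bias=False), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels, bias=False), nn.Sigmoid())
        self.se_from_ir, self.se_from_rgb = gate(), gate()
        self.update_rgb, self.update_ir = conv_bn_relu(channels), conv_bn_relu(channels)
        self.w_rgb = nn.Parameter(torch.ones(1))
        self.w_ir = nn.Parameter(torch.ones(1))

    def forward(self, f_rgb, f_ir):
        s_ir = self.se_from_ir(f_ir).unsqueeze(-1).unsqueeze(-1)     # SE(F_IR)
        s_rgb = self.se_from_rgb(f_rgb).unsqueeze(-1).unsqueeze(-1)  # SE(F_RGB)
        out_rgb = self.update_rgb(s_ir * f_rgb) + self.w_rgb * f_rgb
        out_ir = self.update_ir(s_rgb * f_ir) + self.w_ir * f_ir
        return out_rgb, out_ir

fuse = CrossGuidedSEFusion(128)
r, i = torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64)
print([t.shape for t in fuse(r, i)])
```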
This fusion design presents several key advantages:
  • Cross-Modality Enhancement: By employing SE blocks in a cross-guided manner, each modality benefits from the global context of the other. This promotes the emergence of features that are both locally discriminative and globally coherent.
  • Modality-Specific Preservation: The residual connections, modulated by learnable weights, ensure that each modality retains its intrinsic characteristics, which is critical in early fusion stages to avoid over-blending.
  • Computational Efficiency: The SE blocks and update modules are lightweight, relying only on global pooling, small fully connected layers, and depth-preserving convolutions.
  • Implicit Attention Alignment: Our symmetrical design fosters bidirectional conditioning, improving modality alignment without the need for explicit warping or matching.
After fusion, the refined feature maps F ˜ RGB and F ˜ IR are concatenated along the channel dimension and forwarded to the PAFPN for multiscale semantic enhancement and detection head supervision. The empirical results in Table 1 validate the effectiveness of this design, demonstrating significant improvements in both accuracy and robustness over conventional fusion strategies.
In summary, the proposed Fusion Module achieves a favorable balance between precision and efficiency by integrating modality-aware attention, residual learning, and lightweight convolutional refinement. It forms the core of our multispectral backbone, enabling robust object detection across heterogeneous visual spectra in real-time scenarios.

3.3. Polarized Self-Attention

While the Fusion module enhances cross-modal alignment, effective spatial and channel-wise discrimination is still critical for accurate detection, especially in challenging scenarios such as camouflage, background clutter, and sensor noise. To address this, we integrate a Polarized Self-Attention (PSA) mechanism [33] into the final feature aggregation stage. PSA explicitly separates the attention computation into two orthogonal pathways: channel-only and spatial-only self-attention, thereby enabling fine-grained and complementary recalibration of the fused features.
Formally, let $X \in \mathbb{R}^{B \times C \times H \times W}$ be the input feature map. The channel attention is formulated as a global descriptor weighted by spatial relationships:
$$Y_{\mathrm{ch}} = \sigma\Big(\mathrm{LN}\big(\mathrm{Conv}_c\big(X_v^c \cdot \mathrm{Softmax}(X_q^c)\big)\big)\Big) \odot X,$$
where $X_v^c = \mathrm{Conv}_{1\times 1}^{v}(X)$, $X_q^c = \mathrm{Conv}_{1\times 1}^{q}(X)$, and $\mathrm{Conv}_c$ is a transformation that restores the channel dimension. $\mathrm{LN}(\cdot)$ and $\sigma(\cdot)$ denote layer normalization and the sigmoid activation, respectively.
The spatial attention path operates by pooling channel descriptors and applying global-to-local weighting:
$$Y_{\mathrm{sp}} = \sigma\big(X_q^s \cdot \mathrm{Softmax}(X_v^s)\big) \odot X,$$
where $X_v^s$ and $X_q^s$ are again $1 \times 1$ convolutions over $X$, with global pooling applied to $X_q^s$ to form query descriptors.
The final output of the PSA module is the summation of both attention-enhanced branches:
$$Y_{\mathrm{out}} = Y_{\mathrm{ch}} + Y_{\mathrm{sp}}.$$
By decomposing the attention mechanism and leveraging global-context encoding in both domains, PSA enhances critical features while suppressing irrelevant or redundant information. This leads to more discriminative and robust feature representations, which is particularly beneficial in multimodal detection scenarios with subtle appearance variations. The overall architecture of the PSA module is illustrated in Figure 4.
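The following sketch implements the two polarized branches along the lines of the equations above and the reference PSA design [33]; the internal channel width of C/2 and the layer names are implementation choices on our part, not necessarily identical to the authors' code.

```python
import torch
import torch.nn as nn

class PolarizedSelfAttention(nn.Module):
    """Parallel polarized self-attention: a channel-only and a spatial-only branch,
    each gating the input, summed to give Y_out."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        # channel-only branch
        self.ch_v = nn.Conv2d(channels, mid, 1)
        self.ch_q = nn.Conv2d(channels, 1, 1)
        self.ch_restore = nn.Conv2d(mid, channels, 1)
        self.ln = nn.LayerNorm(channels)
        # spatial-only branch
        self.sp_v = nn.Conv2d(channels, mid, 1)
        self.sp_q = nn.Conv2d(channels, mid, 1)
        self.softmax = nn.Softmax(dim=-1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        mid = c // 2
        # ---- channel-only attention: Y_ch ----
        v = self.ch_v(x).view(b, mid, h * w)                     # (B, C/2, HW)
        q = self.softmax(self.ch_q(x).view(b, 1, h * w))         # spatial softmax
        z = torch.bmm(v, q.transpose(1, 2)).view(b, mid, 1, 1)   # global channel descriptor
        w_ch = self.ln(self.ch_restore(z).view(b, c)).view(b, c, 1, 1)
        y_ch = self.sigmoid(w_ch) * x
        # ---- spatial-only attention: Y_sp ----
        v = self.sp_v(x).view(b, mid, h * w)                     # (B, C/2, HW)
        q = self.sp_q(x).mean(dim=(2, 3))                        # global pooling -> (B, C/2)
        q = self.softmax(q.view(b, 1, mid))                      # channel softmax
        w_sp = torch.bmm(q, v).view(b, 1, h, w)                  # spatial weight map
        y_sp = self.sigmoid(w_sp) * x
        return y_ch + y_sp                                       # Y_out = Y_ch + Y_sp

psa = PolarizedSelfAttention(256)
print(psa(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```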

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

The Vehicle Detection in Aerial Imagery (VEDAI) dataset consists of cropped images taken from the larger Utah Automated Geographic Reference Center (AGRC) collection [34]. Each AGRC image is approximately 16,000 × 16,000 pixels, captured at the same altitude with a resolution of about 12.5 cm × 12.5 cm per pixel. Each scene is provided in two co-registered modalities, RGB and IR. We exclude classes with fewer than 50 instances, such as plane, motorcycle, and bus, leaving eight object categories to detect, following the same protocol as YOLO-Fine [35] and SuperYOLO [19].

4.1.2. Evaluation Metrics

The evaluation metrics used in the experiments are precision, recall, mean average precision (mAP), the number of parameters (Params), and frames per second (FPS).
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
AP refers to the area under the PR curve, which is the average precision across different recall points. The calculation formula is as follows:
$$AP = \int_{0}^{1} p(r)\, dr$$
where:
  • AP: Average Precision, which is the area under the precision-recall curve.
  • p(r): Precision at recall value r.
  • The integral from 0 to 1 indicates the process of calculating precision at different recall levels and then summing them to compute the overall average precision.
mAP@0.5 refers to the mAP value when the Intersection over Union (IoU) value is set to 0.5. It is calculated by finding the Average Precision (AP) value of each class and then averaging them to get the mAP. The formula is as follows:
$$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i$$
where:
  • mAP: Mean Average Precision, which is the average of the Average Precision (AP) values for all classes.
  • K: The total number of classes in the dataset.
  • APi : The Average Precision for class i, which is calculated by integrating the precision-recall curve for each class.
  • The formula computes the mean of the AP values for all K classes to obtain the mAP value, which is a common evaluation metric in object detection tasks.
mAP@0.95 (i.e., mAP@0.5:0.95) refers to the mAP averaged over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
where True Positive (TP) is the number of positive samples correctly identified as positive, True Negative (TN) is the number of negative samples correctly identified as negative, False Positive (FP) is the number of negative samples incorrectly identified as positive, and False Negative (FN) is the number of positive samples incorrectly identified as negative.
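The sketch below shows, in simplified form, how these quantities combine numerically: precision and recall from TP/FP/FN counts, AP as the (monotonically enveloped) area under the precision-recall curve, and mAP as the class-wise mean. Real detection toolkits derive the PR points from ranked detections and IoU matching; the toy PR points here are purely illustrative.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-12), tp / (tp + fn + 1e-12)

def average_precision(recalls, precisions):
    """Area under the PR curve, AP = integral of p(r) dr, approximated numerically
    over (recall, precision) points sorted by increasing recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # enforce a monotonically decreasing precision envelope (common convention)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """mAP = (1/K) * sum of AP_i over the K classes."""
    return float(np.mean(ap_per_class))

# Toy example with short PR curves for two classes.
ap_car = average_precision(np.array([0.2, 0.5, 0.9]), np.array([1.0, 0.8, 0.6]))
ap_truck = average_precision(np.array([0.3, 0.7]), np.array([0.9, 0.5]))
print(mean_average_precision([ap_car, ap_truck]))
```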

4.2. Implementation Details

The experiments were conducted on a Windows 11 system with an Nvidia RTX 4070Ti Super 16 GB GPU, utilizing the PyTorch framework version 2.2.2. All models were trained for 150 epochs with a batch size of 8, without pretrained weights, to ensure a fair comparison. For testing, the IOU threshold was set to 0.7, and the batch size was increased to 16 to achieve higher detection accuracy. The optimization process employed Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. A cosine annealing learning rate schedule was applied with a final learning rate factor (lrf) of 0.2. A warm-up phase of 3 epochs was used, during which the learning rate and momentum were gradually ramped up from 0.1 and 0.8, respectively. The loss function consisted of a combination of box regression loss (weighted by 0.05), objectness loss (1.0), and classification loss (0.5), without the use of focal loss (i.e., fl_gamma = 0.0).
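For reference, the stated training settings can be collected into a single YOLOv5-style hyperparameter dictionary; the key names below follow common conventions and may not match the authors' configuration files exactly.

```python
# Training configuration as described above (key names are illustrative).
train_cfg = {
    "epochs": 150,
    "batch_size": 8,          # increased to 16 at test time
    "pretrained": False,      # no pretrained weights, for a fair comparison
    "optimizer": "SGD",
    "lr0": 0.01,              # initial learning rate
    "lrf": 0.2,               # final LR factor for cosine annealing
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,
    "warmup_bias_lr": 0.1,    # learning rate ramped up from 0.1
    "warmup_momentum": 0.8,   # momentum ramped up from 0.8
    "box": 0.05,              # box regression loss weight
    "obj": 1.0,               # objectness loss weight
    "cls": 0.5,               # classification loss weight
    "fl_gamma": 0.0,          # focal loss disabled
    "test_iou_thres": 0.7,    # IoU threshold used at test time
}
```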

4.3. Algorithm Performance Experiment

To comprehensively evaluate the performance and practical value of our proposed PONet, we conducted extensive experiments on the VEDAI dataset—a challenging benchmark comprising high-resolution aerial images annotated with multiple vehicle categories under varied lighting and spectral conditions. The evaluation framework included both unimodal (RGB-only and IR-only) baselines and a wide range of existing multispectral fusion models for fair and thorough comparison. The visual comparison results are presented in Figure 5, while quantitative metrics are summarized in Table 1 and Table 2.
As shown in Table 2, PONet achieves a strong mAP@50 of 82.20%, which is nearly on par with the current best-performing model, Multispectral DETR [43] (82.70%). However, PONet notably surpasses Multispectral DETR in terms of mAP across multiple IoU thresholds (denoted simply as mAP), achieving 52.70%, which sets a new state-of-the-art in overall detection accuracy on the VEDAI dataset. This indicates that PONet is not only effective in detecting easy cases (IoU@0.5), but also maintains high consistency and precision across stricter evaluation thresholds, which is critical in real-world applications requiring robust object localization.
In terms of model efficiency, PONet offers a substantial improvement in compactness. With only 3.76 M parameters, it is significantly more lightweight than most existing fusion-based methods. For example, it is over 19 times smaller than ICAFusion [41] (120.2 M), and much more efficient than GHOST [42] (9.7 M) and Multispectral DETR (73.0 M), without sacrificing detection accuracy. This highlights the architectural merit of our model, which leverages a lightweight yet expressive cross-modal fusion design, making it ideal for edge-computing platforms and power-constrained deployment environments.
A more fine-grained analysis is provided in Table 1, which reports the per-class average precision at IoU 0.5 (AP@50) for different vehicle categories. Compared to the unimodal YOLOv5s baselines, PONet achieves substantial improvements across most categories. For example, relative to the infrared-only model, PONet improves detection on the “tractor” class by +26.5%, “boat” by +17.2%, and “other” by +22.2%, reflecting its superior ability to leverage complementary RGB and thermal cues for enhanced recognition of low-contrast or partially occluded targets. Notably, PONet attains the highest AP scores in seven of the eight categories, such as “car” (95.2%), “camping car” (97.9%), and “truck” (94.7%), demonstrating its robustness across diverse object types with varying spectral signatures and sizes.
To verify the individual contribution of our proposed Polarized Self-Attention (PSA) module, we conducted an ablation study comparing PONet with and without PSA. The baseline model (without PSA) attains an AP@50 of 81.5%, whereas the full model with PSA achieves 82.2%, marking a clear performance gain. More importantly, the class-wise performance reveals that the PSA module improves detection particularly for thermally ambiguous or partially visible targets, suggesting its effectiveness in refining modality-aware spatial alignment and enhancing cross-scale feature interaction.
In addition to accuracy, inference speed is critical for real-world deployment. Our optimized implementation of PONet achieves a remarkable 158 FPS on the evaluation hardware, which is substantially faster than the 104 FPS of the baseline model without PSA. This speed gain results from the efficient design of our fusion module and attention mechanisms, which prioritize channel-wise computation and minimize redundant spatial operations. The achieved speed-accuracy trade-off demonstrates that PONet is not only accurate but also highly responsive—satisfying the stringent requirements of real-time multispectral perception tasks such as drone-based monitoring, mobile robotics, and autonomous surveillance.
In summary, our experimental results establish that PONet strikes an excellent balance between detection accuracy, model compactness, and inference speed. It achieves state-of-the-art detection performance with significantly fewer parameters and higher real-time efficiency than existing methods. These advantages collectively position PONet as a highly viable and scalable solution for embedded multispectral vision systems deployed in resource-constrained environments.

5. Application on OrangePi AIpro 20T

To evaluate the real-time performance and deployment practicality of our proposed multispectral fusion detection framework, we deploy the full pipeline on the OrangePi AIpro 20T—an edge computing platform equipped with the Huawei Ascend 310B NPU. Thanks to its high computing power and low power consumption, this device is well-suited for scenarios like drone-based monitoring and autonomous perception tasks.
The trained model is exported to ONNX and converted to the .om format using Huawei’s ATC tool, and inference is performed via the ais_bench InferSession API. Our deployed system covers all stages of the detection pipeline: decoding of RGB and IR inputs, forward pass through the fusion module, detection head computation, and post-processing including non-maximum suppression.
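A sketch of this deployment flow is given below. The ATC flags and the ais_bench InferSession calls follow Huawei's public documentation, but the file names, input names, and the soc_version string are placeholders to be adapted to the target device.

```python
# Offline model conversion (shell), run once on the host or device:
#   atc --model=ponet.onnx --framework=5 --output=ponet \
#       --input_shape="rgb:1,3,1024,1024;ir:1,3,1024,1024" --soc_version=Ascend310B1
#
# On-device inference via the ais_bench inference tool (exact API may vary by version).
import numpy as np
from ais_bench.infer.interface import InferSession

session = InferSession(device_id=0, model_path="ponet.om")

# Stand-ins for a preprocessed, aligned RGB/IR frame pair.
rgb = np.random.rand(1, 3, 1024, 1024).astype(np.float32)
ir = np.random.rand(1, 3, 1024, 1024).astype(np.float32)

outputs = session.infer([rgb, ir])          # raw detection head outputs
print([o.shape for o in outputs])
# Decoding and non-maximum suppression then run as CPU post-processing.
```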

5.1. Deployment Performance

When running on the OrangePi AIpro 20T, our model achieves an average processing speed of 34 frames per second (FPS) with input resolution set to 1024×1024. This frame rate accounts for the entire pipeline—from preprocessing to postprocessing—and verifies that our approach is capable of real-time execution. Importantly, the detection accuracy remains consistent with that observed during GPU-based inference, indicating minimal performance loss after deployment.
  • Platform: OrangePi AIpro 20T (Ascend 310B);
  • Model Format: .om (converted from ONNX using ATC);
  • Input Resolution: 1024 × 1024 (RGB+IR);
  • Average Inference FPS: 34;
  • End-to-End Latency: ∼29.4 ms per frame.

5.2. Discussion

These results highlight the practical deployability of our fusion model. Thanks to the use of lightweight, channel-efficient attention modules, the additional computation introduced by our cross-modal SE-based fusion is kept to a minimum. Even with dual-stream RGB and IR inputs, the system maintains high efficiency without compromising detection accuracy—an essential requirement for real-time embedded applications.
Achieving 34 FPS with 1024 × 1024 resolution inputs is particularly notable, as it includes all stages from data preparation to final post-processing. The entire pipeline runs smoothly on the OrangePi without requiring any hardware-specific optimizations beyond standard model compilation, making the solution easy to reproduce and adapt to other Ascend-based platforms.
Beyond speed, our fusion strategy offers enhanced robustness under challenging imaging conditions. Unlike basic early fusion methods, our design employs bidirectional attention to model interactions between the RGB and IR branches. This allows the network to focus dynamically on thermally salient regions—especially useful in low-light or high-noise scenarios—while retaining key structural information from the RGB input. As a result, the model performs reliably in diverse environmental conditions.
The inclusion of modality-aware residual pathways also contributes to the model’s adaptability. These connections help preserve domain-specific information throughout the fusion process, which proves effective when handling variations such as partial occlusion, cluttered backgrounds, or spectral noise. In deployment, this translates to improved stability, fewer false detections, and more precise localization in complex scenes.
In summary, deploying our method on the OrangePi AIpro 20T confirms not only its effectiveness for multispectral object detection but also its real-world readiness. The system runs efficiently, delivers accurate results, and adapts well to new environments—making it a strong candidate for embedded applications such as aerial thermal inspection, border security, and search-and-rescue missions. Figure 6 presents sample outputs from actual deployment, illustrating the model’s visual accuracy and real-time performance in the field.

6. Conclusions

In this work, we introduce PONet, a lightweight and efficient multi-modal object detection framework designed for real-time edge deployment. By integrating RGB and infrared (IR) modalities through a novel Fusion Module and incorporating a low-cost Polarized Self-Attention mechanism, PONet effectively captures complementary features while maintaining high computational efficiency.
Extensive experiments on the VEDAI dataset validate the effectiveness of our approach, where PONet achieves a competitive detection accuracy of 82.2% mAP@0.5 with only 3.76 M parameters and 10.2 GFLOPs. Moreover, real-world deployment on the resource-constrained OrangePi AIpro 20T demonstrates its practical value, achieving a real-time throughput of 34 FPS on 1024 × 1024 resolution inputs.
These results highlight PONet’s strong potential for edge-oriented applications such as aerial surveillance, autonomous navigation, and all-weather monitoring. Future work will explore the integration of temporal cues, transformer-based backbone enhancements, and cross-domain adaptation to further improve performance in diverse and dynamic environments.

Author Contributions

Conceptualization, J.H. and R.L.; methodology, J.H. and F.C.; software, J.H. and J.C.; validation, J.H., F.C., and J.Y.; formal analysis, J.C. and F.C.; investigation, J.H. and J.L.; resources, Q.S.; data curation, J.C. and J.L.; writing—original draft preparation, J.H.; writing—review and editing, R.L. and Q.S.; visualization, F.C.; supervision, R.L. and Q.S.; project administration, R.L.; funding acquisition, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 42222106, 42271345).

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. The source code and trained model are publicly available at https://github.com/Bigyu-777/PONet (accessed on 24 July 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  2. Payghode, V.; Goyal, A.; Bhan, A.; Iyer, S.S.; Dubey, A.K. Object detection and activity recognition in video surveillance using neural networks. Int. J. Web Inf. Syst. 2023, 19, 123–138. [Google Scholar] [CrossRef]
  3. Yang, B.; Li, J.; Zeng, T. A Review of Environmental Perception Technology Based on Multi-Sensor Information Fusion in Autonomous Driving. World Electr. Veh. J. 2025, 16, 20. [Google Scholar] [CrossRef]
  4. Zhao, H.; Chu, K.; Zhang, J.; Feng, C. YOLO-FSD: An improved target detection algorithm on remote-sensing images. IEEE Sens. J. 2023, 23, 30751–30764. [Google Scholar] [CrossRef]
  5. Hussain, M. Yolov1 to v8: Unveiling each variant–a comprehensive review of yolo. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  6. Li, Y.; Hu, Z.; Zhang, Y.; Liu, J.; Tu, W.; Yu, H. DDEYOLOv9: Network for detecting and counting abnormal fish behaviors in complex water environments. Fishes 2024, 9, 242. [Google Scholar] [CrossRef]
  7. Krišto, M.; Ivasic-Kos, M.; Pobar, M. Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 8, 125459–125476. [Google Scholar] [CrossRef]
  8. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  9. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  10. Sun, J.; Yin, M.; Wang, Z.; Xie, T.; Bei, S. Multispectral object detection based on multilevel feature fusion and dual feature modulation. Electronics 2024, 13, 443. [Google Scholar] [CrossRef]
  11. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks. In Proceedings of the ESANN, Bruges, Belgium, 27–29 April 2016; Volume 587, pp. 509–514. [Google Scholar]
  12. Wang, Y.; Tang, C.; Shi, Q. Cross-Modality Fusion Deformable Transformer for Multispectral Object Detection. In Proceedings of the International Conference on Guidance, Navigation and Control, Changsha, China, 9–11 August 2024; pp. 372–382. [Google Scholar]
  13. Shao, Y.; Huang, Q.; Mei, Y.; Chu, H. MOD-YOLO: Multispectral object detection based on transformer dual-stream YOLO. Pattern Recognit. Lett. 2024, 183, 26–34. [Google Scholar] [CrossRef]
  14. Meng, F.; Chen, X.; Tang, H.; Wang, C.; Tong, G. B2MFuse: A Bi-branch Multi-scale Infrared and Visible Image Fusion Network based on Joint Semantics Injection. IEEE Trans. Instrum. Meas. 2024, 73, 5037317. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Yu, H.; He, Y.; Wang, X.; Yang, W. Illumination-guided RGBT object detection with inter-and intra-modality fusion. IEEE Trans. Instrum. Meas. 2023, 72, 2508013. [Google Scholar] [CrossRef]
  16. Fu, L.; Gu, W.b.; Ai, Y.b.; Li, W.; Wang, D. Adaptive spatial pixel-level feature fusion network for multispectral pedestrian detection. Infrared Phys. Technol. 2021, 116, 103770. [Google Scholar] [CrossRef]
  17. Gallagher, J.E.; Oughton, E.J. Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements, Applications Furthermore, Challenges. IEEE Access 2025, 13, 7366–7395. [Google Scholar] [CrossRef]
  18. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef]
  19. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
  20. Xu, J.; Tan, X.; Luo, R.; Song, K.; Li, J.; Qin, T.; Liu, T.Y. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1933–1943. [Google Scholar]
  21. Liu, Y.; Zhang, W.; Wang, J. Zero-shot adversarial quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 1512–1521. [Google Scholar]
  22. Zhu, J.; Tang, S.; Chen, D.; Yu, S.; Liu, Y.; Rong, M.; Yang, A.; Wang, X. Complementary relation contrastive distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 9260–9269. [Google Scholar]
  23. Wimmer, P.; Mehnert, J.; Condurache, A. Interspace pruning: Using adaptive filter representations to improve training of sparse cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12527–12537. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017. [Google Scholar] [CrossRef]
  25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  26. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  28. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  29. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
  34. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  35. Pham, M.T.; Courtrai, L.; Friguet, C.; Lefèvre, S.; Baussard, A. YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images. Remote Sens. 2020, 12, 2501. [Google Scholar] [CrossRef]
  36. Betti, A.; Tucci, M. YOLO-S: A lightweight and accurate YOLO-like network for small target detection in aerial imagery. Sensors 2023, 23, 1865. [Google Scholar] [CrossRef]
  37. Ju, M.; Niu, B.; Jin, S.; Liu, Z. SuperDet: An efficient single-shot network for vehicle detection in remote sensing images. Electronics 2023, 12, 1312. [Google Scholar] [CrossRef]
  38. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-based object detection method for remote sensing images. IEEE Access 2023, 11, 125122–125137. [Google Scholar] [CrossRef]
  39. Ren, Z. Enhanced YOLOv8 Infrared Image Object Detection Method with SPD Module. J. Theory Pract. Eng. Technol. 2024, 1, 1–7. [Google Scholar]
  40. Shen, L.; Lang, B.; Song, Z. Infrared object detection method based on DBD-YOLOv8. IEEE Access 2023, 11, 145853–145868. [Google Scholar] [CrossRef]
  41. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913. [Google Scholar] [CrossRef]
  42. Zhang, J.; Lei, J.; Xie, W.; Li, Y.; Yang, G.; Jia, X. Guided hybrid quantization for object detection in remote sensing imagery via one-to-one self-teaching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614815. [Google Scholar] [CrossRef]
  43. Zhu, J.; Chen, X.; Zhang, H.; Tan, Z.; Wang, S.; Ma, H. Transformer based remote sensing object detection with enhanced multispectral feature extraction. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5001405. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed multimodal detection architecture. Features are extracted at multiple levels from RGB and IR branches, fused via attention-based modules, aggregated in the PAFPN structure, and refined using Polarized Self-Attention (PSA) before prediction.
Figure 2. Structure of the proposed multimodal fusion module.
Figure 3. Structure of the SEAttention module.
Figure 4. Structure of the PSA module.
Figure 5. Comparison of Detection Performance with Existing Models on the VEDAI Dataset.
Figure 6. Real-world detection results captured on the OrangePi AIpro 20T platform using our proposed multi-modal PONet. Each frame (ad) represents inference on fused RGB and IR input at 1024 × 1024 resolution in real-time, showcasing robustness under diverse conditions including day/night, fog, and occlusion.
Table 1. Ablation study of PONet on the VEDAI dataset (mAP@0.5).
| Methods | All | Car | Pickup | Camping | Truck | Other | Tractor | Boat | Van | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s-RGB | 80.0 | 94.7 | 90.9 | 95.7 | 91.7 | 77.6 | 92.1 | 48.5 | 49.0 | – | – |
| YOLOv5s-IR | 71.8 | 91.4 | 88.5 | 92.6 | 87.4 | 66.3 | 68.2 | 31.5 | 48.3 | – | – |
| Baseline (without PSA) | 81.5 | 92.4 | 92.0 | 93.7 | 93.8 | 79.9 | 85.5 | 43.8 | 70.9 | 4.13 | 104 |
| PONet (ours) | 82.2 | 95.2 | 92.1 | 97.9 | 94.7 | 88.5 | 94.7 | 48.7 | 45.9 | 3.76 | 158 |
Table 2. Comparison of different methods for multispectral object detection on the VEDAI dataset, with the corresponding mAP and computational costs. “–” denotes unavailable data.
| Method | Input Modality | mAP@50 (%) | mAP@95 (%) | Params (M) |
|---|---|---|---|---|
| YOLO-S [36] | RGB | 70.40 | – | 78.0 |
| SuperDet [37] | RGB | 77.60 | – | – |
| DS-YOLOv8 [38] | RGB | 78.90 | 51.10 | – |
| SPD-YOLOv8 [39] | Thermal | 63.70 | 52.10 | – |
| DBD-YOLOv8 [40] | Thermal | 76.00 | – | – |
| SuperYOLO [19] | RGB+Thermal | 75.09 | – | 7.0 |
| ICAFusion [41] | RGB+Thermal | 76.62 | 44.93 | 120.2 |
| GHOST [42] | RGB+Thermal | 80.31 | 49.05 | 9.7 |
| Multispectral DETR [43] | RGB+Thermal | 82.70 | 50.80 | 73.0 |
| PONet (Ours) | RGB+Thermal | 82.20 | 52.70 | 3.76 |
