Article

GLNet-YOLO: Multimodal Feature Fusion for Pedestrian Detection

1 College of Electrical and Electronic Engineering, National University of Defense Technology, Hefei 230000, China
2 School of Mechanical and Automotive Engineering, Anhui Polytechnic University, Wuhu 241000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
AI 2025, 6(9), 229; https://doi.org/10.3390/ai6090229
Submission received: 23 July 2025 / Revised: 4 September 2025 / Accepted: 9 September 2025 / Published: 12 September 2025

Abstract

In the field of modern computer vision, pedestrian detection technology holds significant importance in applications such as intelligent surveillance, autonomous driving, and robot navigation. However, single-modal images struggle to achieve high-precision detection in complex environments. To address this limitation, this study proposes GLNet-YOLO, a framework based on cross-modal deep feature fusion that aims to improve pedestrian detection performance in complex environments by fusing feature information from visible light and infrared images. Extending the YOLOv11 architecture, the framework adopts a dual-branch network structure to process the visible light and infrared modal inputs separately, and it introduces the FM module to realize global feature fusion and enhancement and the DMR module to accomplish local feature separation and interaction. Experimental results show that on the LLVIP dataset, compared with the single-modal YOLOv11 baseline, the fused model improves mAP@50 by 9.2% over the visible-light-only model and by 0.7% over the infrared-only model. These gains translate into higher detection accuracy and stronger robustness under low-light and complex-background conditions, and the framework’s effectiveness is further verified on the KAIST dataset.

1. Introduction

Object detection technology plays a pivotal role in domains such as intelligent surveillance, autonomous driving, and robot navigation. Specifically, the accuracy and reliability of pedestrian detection directly impact system performance and safety [1]. In intelligent surveillance, accurate identification of pedestrians can prevent safety accidents; in autonomous driving, real-time detection is the core for avoiding collisions; and in robot navigation, the perception of pedestrians can ensure the efficiency of obstacle avoidance and interaction. Traditional single-modal approaches have obvious limitations: although visible light images provide rich texture details, their imaging quality drops sharply under harsh environments such as low illumination and haze, leading to increased rates of missed detections and false detections; infrared imaging captures pedestrian contours relying on thermal radiation, performing excellently in low-light scenarios but lacking detailed textures, making precise positioning difficult [2]. The inherent defects of single-modal technologies make pedestrian detection in complex environments a key challenge in the field of computer vision.
Multimodal fusion technology provides an effective path to break through the above limitations, and researchers have explored the fusion of visible light and infrared images along this line. Early methods mostly adopted simple concatenation or weighted fusion, which failed to fully exploit the complementarity of the modalities and therefore yielded limited gains. Driven by deep learning, fusion mechanisms such as cross-modal attention modules and Transformer architectures have since emerged, but shortcomings remain: late fusion timing leads to attenuation and loss of original features; one-sided fusion mechanisms struggle to balance global associations and local refinement; and complex structures inflate computational costs and degrade real-time performance.
To address this, this paper proposes a pedestrian detection method based on multimodal feature fusion and constructs a novel multimodal object detection framework, namely GLNet-YOLO (Global-Local Network YOLO). By leveraging deep feature fusion technology, this framework achieves efficient integration of features from visible light and infrared images, thereby enhancing pedestrian detection performance in complex environments. Specifically, the FM module within the framework is responsible for global feature fusion, while the DMR module focuses on local feature separation and interaction. These two modules work in synergy to realize deep fusion of dual-modal features and optimized feature representation.
The key contributions of this paper are as follows:
  • Construct an early-fusion dual-branch architecture: Based on YOLOv11, a dual-branch structure is designed, with the fusion node moved forward to the modality input stage. This enables interaction to be completed before feature attenuation, preserving modality-specific learning while enhancing modality complementarity in complex scenes.
  • Propose a global–local collaborative mechanism: The FM module optimizes global features through “interaction–enhancement–attention,” while the DMR module refines local interactions via “slice separation + residual connection.” The collaboration between the two improves the depth and precision of fusion.
  • Verify the effectiveness of the lightweight framework: Systematic experiments on the LLVIP and processed KAIST datasets validate the superiority of the framework, providing an efficient and practical technical paradigm for the field of multimodal object detection.

2. Related Work

In recent years, multimodal object detection technology has emerged as a research hotspot in the field of computer vision. Its core advantage lies in significantly improving detection accuracy and robustness in complex scenarios by fusing complementary information from different modal images [3]. Specifically, visible light images can provide rich texture and structural details in well-lit environments, but under conditions of low light, backlighting, or strong shadows, target features are prone to blurring due to degraded imaging quality. Infrared images, on the other hand, can stably capture target contours in all lighting scenarios by virtue of thermal radiation information, and they exhibit outstanding detection capabilities for thermal radiation sources such as pedestrians in nighttime or low-light environments. However, their imaging lacks the detailed textures of the visible light band, making it difficult to distinguish the appearance features of targets [4]. The differences in these modal characteristics make the design of an efficient fusion mechanism crucial: it is necessary not only to avoid information redundancy caused by simple feature superposition but also to achieve complementary advantages through cross-modal interaction. Therefore, how to construct an effective fusion mechanism to realize efficient modal complementarity is currently a hotspot and difficulty in research [5].
Fusion methods based on deep learning have gradually become mainstream. By leveraging the feature learning capabilities of neural networks, they achieve deep feature extraction and fusion of visible light and infrared images, effectively improving detection accuracy and robustness in complex scenarios within fields such as pedestrian detection and target tracking [6]. Researchers have explored fusion mechanisms from multiple dimensions. In terms of cross-modal information interaction, Sun et al. proposed an adaptive frequency-domain gating mechanism to dynamically learn the dependence of RGB and infrared data in frequency filtering features for integrating complementary information [7]; Chen et al. iteratively optimized nighttime image fusion through a closed-loop structure consisting of brightness adjustment and feedback networks [8]; Du et al. strengthened information transformation between modalities using a mutual information transfer module and a mutual promotion training paradigm [9]; Tang et al. introduced an illumination perception subnetwork and combined a cross-modal difference perception module with an intermediate fusion strategy, enabling the model to dynamically adjust fusion weights according to illumination conditions [10]; and Meng et al. built a multimodal fusion network based on YOLOv5 that fuses visible light and thermal imaging with an adaptive weighting module to enhance detection robustness in relevant scenarios [11]. In terms of balancing global and local features, Rao et al. combined a Transformer with adversarial learning to model global fusion relationships while preserving infrared target characteristics and visible light details [12]; Hui et al. achieved multi-scale feature aggregation through a dense-block encoding network [13]; and Tang et al.’s Y-shaped dynamic Transformer module strengthens the preservation of modal details and structural consistency [14].
In addition, lightweight initiatives such as Hwang et al.’s IC-Fusion, which employs multi-scale feature distillation [15], task-collaborative frameworks such as Zhang et al.’s MRFS with fusion–segmentation cascades [16], and dynamic weight allocation mechanisms such as Hu et al.’s ASAF module [17] have also made strides in boosting efficiency and performance. However, overall, existing methods still grapple with common challenges: some adopt complex network structures in the pursuit of accuracy, resulting in high computational resource consumption and inadequate real-time performance; under extreme adverse conditions (e.g., heavy smoke and torrential rain), the complementarity of modal features is weakened, making fusion strategies ineffective in enhancing robustness; furthermore, most rely on manually designed fusion rules, lacking the ability to adaptively adjust to dynamic scenes and involving cumbersome optimization processes.
Compared with existing research, the key distinctions of this study lie in two aspects: First, the forward-looking nature of fusion timing. Existing methods, such as YOLO-MS [18], predominantly perform fusion after the convolution or bottleneck modules in the backbone, which can cause the attenuation of original features due to multiple rounds of convolution. In contrast, GLNet-YOLO accomplishes fusion via the FM and DMR modules before the two modalities enter their respective channels, thereby reducing information loss at the source. Second, the synergetic and adaptive design of the fusion mechanism. On the one hand, existing methods either focus on global fusion (e.g., Rao et al.’s Transformer architecture [12]) or concentrate on local interactions (e.g., Hu et al.’s attention mechanism [17]), while our framework adopts a progressive design of “global fusion (FM module) → local refinement (DMR module)” to balance cross-modal commonalities and modality-specific features. On the other hand, GLNet-YOLO achieves adaptive adjustment to complex environments through dynamic interactions between the FM (adaptive global feature weighting) and DMR (dynamic local detail refinement) modules, avoiding cumbersome manual parameter tuning.
To tackle the aforementioned issues, this paper extends the YOLOv11 network architecture and proposes a novel multimodal object detection framework—GLNet-YOLO. This framework leverages the lightweight improvements of YOLOv11 and, through the innovative FM and DMR modules, realizes a lightweight structure of “global fusion first, then local refinement,” enabling efficient fusion of visible light and infrared images directly within the detection framework. Without the requirement for complex optimization processes, it significantly reduces computational resource consumption while effectively enhancing detection performance in complex environments [19].

3. Research Methodology

3.1. GLNet-YOLO: Improved Network Framework

Existing multimodal models in the YOLO series typically perform feature fusion after the Conv or CSP bottleneck module variants in the backbone and then split the features into their respective modal pathways [20]. However, this conventional operation has significant drawbacks: as the number of convolution operations increases, feature information may gradually be lost, thereby affecting the final detection performance.
To address this issue, the GLNet-YOLO framework, built upon the YOLOv11 network, proposes an innovative improvement strategy. As illustrated in Figure 1, the framework draws inspiration from early feature fusion methods [21], whose core principle is to fuse features before they undergo attenuation caused by multiple rounds of convolution, thereby preserving more of the initial cross-modal information before it is processed by the deep feature extractors. Before the two modal images enter their respective channels, they are first directly concatenated in the channel dimension and processed through the FM (Feature Mixer) and DMR (Dual-Modality Refiner) modules. Specifically, the FM module achieves efficient integration and optimization of global features through feature interaction, enhancement, and attention mechanisms, initially exploring the commonalities and complementary information between the two modalities at a holistic level. In contrast, the DMR module focuses on refined interaction and refinement of local features via feature slicing separation and residual connection strategies, aiming to retain modality-specific details. After processing by these two modules, the data are fed into their respective channels. This “global fusion first, then local refinement” design ensures that effective fusion and optimization are completed before feature information undergoes attenuation due to multiple rounds of convolution. Consequently, the original feature information is effectively preserved, information loss is reduced, and detection accuracy is significantly improved.
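For illustration, the following minimal PyTorch sketch outlines this input-stage fusion flow; it is a reconstruction for readability rather than the released implementation, and the FMModule and DMRModule classes stand in for the FM and DMR blocks detailed in Section 3.2 and Section 3.3.

```python
import torch
import torch.nn as nn

class EarlyFusionStem(nn.Module):
    """Input-stage fusion sketch: concatenate RGB and IR, fuse globally (FM),
    then separate and refine locally (DMR) before the two YOLOv11 branches."""

    def __init__(self, fm_module: nn.Module, dmr_module: nn.Module):
        super().__init__()
        self.fm = fm_module    # global feature fusion and enhancement (Section 3.2)
        self.dmr = dmr_module  # local feature separation and interaction (Section 3.3)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        # (B, 3, H, W) + (B, 3, H, W) -> (B, 6, H, W): channel concatenation
        x = torch.cat([rgb, ir], dim=1)
        x = self.fm(x)                   # global fusion first ...
        rgb_feat, ir_feat = self.dmr(x)  # ... then local refinement and separation
        return rgb_feat, ir_feat         # fed into the respective modality branches
```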

3.2. FM Module: Global Feature Fusion and Enhancement

Effective multimodal feature fusion is crucial for improving detection performance in pedestrian detection tasks. To address this, this study introduces the Feature Mixer (FM) module, designed to optimize feature representation and enhance the robustness of pedestrian detection. The framework of the FM module is illustrated in Figure 2.
The FM module consists of three core components: a feature interaction part, a feature enhancement part, and an attention mechanism part. Each component plays a distinct role in the feature fusion process, and they work together to enhance the feature representation capability and the detection performance of the model. To simplify the model description and improve readability, some general operations widely used in the module are defined as follows:
  • Convolution operation (Conv2d):
    $\mathrm{Conv}(X) = W_{\mathrm{Conv}} \ast X + b_{\mathrm{Conv}}$
  • Batch normalization (BatchNorm2d):
    $\mathrm{BN}(X) = \dfrac{X - \mu}{\sqrt{\sigma^{2} + \epsilon}} \times \gamma + \beta$
  • ReLU activation function:
    $\mathrm{ReLU}(X) = \max(0, X)$
where $W_{\mathrm{Conv}}$ is the convolution kernel weight, $b_{\mathrm{Conv}}$ is the bias term, and $X$ is the input feature map; $\mu$ and $\sigma^{2}$ are the mean and variance of the mini-batch, respectively; $\epsilon$ is a small constant that prevents division by zero; and $\gamma$ and $\beta$ are the learnable scaling and offset parameters.
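These three operations compose the Conv→BN→ReLU building block used throughout the FM and DMR modules; the sketch below shows one possible PyTorch realization (an assumption about the concrete layer configuration, not the authors’ code).

```python
import torch.nn as nn

def conv_bn_relu(channels: int, kernel_size: int) -> nn.Sequential:
    """Conv2d -> BatchNorm2d -> ReLU with padding chosen to keep the spatial size."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(channels),  # (X - mu) / sqrt(sigma^2 + eps) * gamma + beta
        nn.ReLU(inplace=True),     # max(0, X)
    )
```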
The input feature maps of the FM module are derived from the channel concatenation of visible light and infrared multimodal data: visible light images provide 3-channel texture information, while infrared images provide 3-channel thermal radiation contour information. After concatenation, these two types of data form a 6-channel feature map. To ensure the coherence of feature dimensions—especially to meet the subsequent requirement of the DMR module to separate the features into 3-channel visible light and 3-channel infrared—the convolution layers of the feature interaction and feature enhancement components all maintain consistent input and output channel counts, except for the attention mechanism, which requires temporary dimension reduction to optimize computation. Additionally, all convolution operations maintain the spatial size of the feature map at 640 × 640 through padding design, preventing dimensional changes from affecting the detection process.
The feature interaction part serves as the foundation of the FM module. The module utilizes 1 × 1 convolution to achieve inter-channel information integration and optimizes feature representation through batch normalization with the ReLU activation function [22], and the computational process can be expressed as follows:
$X_{\text{interact}}^{(\mathrm{FM})} = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}(X)\big)\big)$
In the formula, the 1 × 1 convolution maintains a 6→6 channel mapping, ensuring that the number of channels of the 6-channel input feature map remains unchanged. Meanwhile, it preserves the 640 × 640 spatial size through padding, which not only enables the initial interaction of cross-modal information but also provides input with consistent dimensions for subsequent feature enhancement.
The feature enhancement module consists of two consecutive 3 × 3 convolutional layers, each followed by batch normalization. A ReLU activation function is introduced between the convolutional layers to enhance the consistency of the feature space and the expressive power of the model. Padding = 1 is applied to each convolutional layer to preserve the size of the feature maps. The computational process can be expressed as follows:
$X_{\text{enhanced}} = \mathrm{BN}\big(\mathrm{Conv}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}(X_{\text{interact}}^{(\mathrm{FM})})\big)\big)\big)\big)$
In the formula, both convolutions maintain a 6→6 channel mapping, and padding = 1 ensures the spatial size remains unchanged. The final output is still 640 × 640 × 6, which meets the input dimension requirements of the attention mechanism.
The attention mechanism section dynamically assigns channel weights to the FM module by perceiving global information. It highlights significant features while suppressing insignificant ones. This effectively enhances the model’s focus on features relevant to object detection, thereby improving the accuracy and robustness of detection [23]. The computational process can be expressed as follows:
$A = \sigma\big(\mathrm{Conv}\big(\mathrm{ReLU}\big(\mathrm{Conv}\big(\mathrm{APool}(X_{\text{enhanced}})\big)\big)\big)\big)$
$X_{\text{output}} = X_{\text{enhanced}} \odot A$
where $\mathrm{APool}$ is the adaptive average pooling operation (AdaptiveAvgPool), $\sigma$ is the Sigmoid activation function, $\odot$ denotes element-wise multiplication, $A$ is the attention weight, $X_{\text{enhanced}}$ is the output of the feature enhancement part, and $X_{\text{output}}$ is the final output of the module.
In the formula, after APool (adaptive average pooling) compresses the spatial dimensions, two 1 × 1 convolutions sequentially perform dimension reduction from 6 to 3 channels and elevation from 3 back to 6 channels. The resulting channel attention weight A is broadcast over the spatial dimensions during the element-wise multiplication, so the reweighted output matches the 640 × 640 × 6 dimensions of the input feature map.
The design of the FM module integrates feature interaction, enhancement, and attention mechanisms within a concise yet efficient structure, ensuring optimal feature fusion. This design effectively enhances feature representation while maintaining computational efficiency. Moreover, the module’s flexibility enables seamless integration into diverse network architectures, facilitating a broad range of computer vision applications.
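To make the structure concrete, a minimal PyTorch sketch of the FM module is given below. It follows the channel counts stated above (6→6 interaction and enhancement convolutions, 6→3→6 attention branch) and is an illustrative reconstruction rather than the released implementation.

```python
import torch
import torch.nn as nn

class FMModule(nn.Module):
    """Global feature fusion and enhancement: interaction, enhancement, attention."""

    def __init__(self, channels: int = 6, reduced: int = 3):
        super().__init__()
        # Feature interaction: 1x1 convolution with a 6 -> 6 channel mapping
        self.interact = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Feature enhancement: two 3x3 convolutions, padding=1 preserves spatial size
        self.enhance = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Channel attention: spatial squeeze, then 6 -> 3 -> 6 with a Sigmoid gate
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.interact(x)   # X_interact^(FM)
        x = self.enhance(x)    # X_enhanced
        a = self.attention(x)  # per-channel weights A, shape (B, 6, 1, 1)
        return x * a           # X_output = X_enhanced ⊙ A (broadcast spatially)
```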

3.3. DMR Module: Local Processing and Mode Separation

After the FM module completes the global feature processing, the DMR module further separates the features of the two modalities and refines them locally, enhancing the model’s ability to learn modality-specific features. This design significantly improves feature representation and detection performance through the integration of feature interaction, separation, and residual connections [24]. The framework is shown in Figure 3.
The feature interaction part is one of the core components of the DMR module, implemented by two sequential 1 × 1 convolutional layers. The first layer is followed by batch normalization (BatchNorm2d) and the ReLU activation function. The second convolutional layer is followed only by batch normalization, ensuring feature stability and consistency. The computational process can be represented as
$X_{\text{interact}}^{(\mathrm{DMR})} = \mathrm{BN}\big(\mathrm{Conv}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}(X)\big)\big)\big)\big)$
Built on feature interaction, feature separation splits the input feature map into two branches: visible light (RGB) and infrared (IR). This process is achieved through slice indexing, which extracts the first three channels and the last three channels of the input feature map respectively, thereby preserving the feature details of both modalities [25]. In the actual PyTorch implementation, we strictly adhere to the standard [batch, channel, height, width] convention. However, to maintain consistency with Figure 1, all equations adopt the [height, width, channel] format, where the third dimension corresponds to channels (3 for visible light + 3 for infrared). This ensures consistency between the visual and mathematical descriptions. The calculation process can be expressed as follows:
$X_{1}, X_{2} = X_{\text{interact}}^{(\mathrm{DMR})}[:, :, :3],\; X_{\text{interact}}^{(\mathrm{DMR})}[:, :, 3:]$
where $X_{1}$ corresponds to the visible-light features (the first three channels) and $X_{2}$ corresponds to the infrared features (the last three channels), as extracted by the slicing operation above.
To enhance feature representation, the residual linking mechanism combines the original input feature map with the processed features through element-wise summation to further enhance the expression of characteristics. This operation reinforces critical features while retaining detailed information from the original feature map [26]. The computational process can be represented as
$X_{1}' = X_{1} + X[:, :, :3]$
$X_{2}' = X_{2} + X[:, :, 3:]$
where $X_{1}'$ and $X_{2}'$ are the final visible and infrared features, respectively.
By setting different parameter values, the model can choose to output the feature maps of the visible branch or the infrared branch, providing flexibility for the model to adapt to different task requirements.
The design of the DMR module balances efficiency and flexibility, and realizes the deep fusion of feature interaction, separation, and residual connectivity through a concise but effective network structure. This design not only enhances the feature representation capability, but also maintains a high computational efficiency, enabling easy integration into various network architectures for a wide range of visual tasks.
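Analogously, a minimal PyTorch sketch of the DMR module is given below, assuming the 6-channel layout described above (first three channels visible light, last three infrared). It is again an illustrative reconstruction rather than the authors’ code.

```python
import torch
import torch.nn as nn

class DMRModule(nn.Module):
    """Local processing and modality separation: interaction, slicing, residuals."""

    def __init__(self, channels: int = 6):
        super().__init__()
        # Two sequential 1x1 convolutions; the second is followed by BN only
        self.interact = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor):
        y = self.interact(x)  # X_interact^(DMR)
        # Slice separation (NCHW layout): first 3 channels -> visible, last 3 -> infrared
        x1, x2 = y[:, :3], y[:, 3:]
        # Residual connections back to the corresponding raw input channels
        x1 = x1 + x[:, :3]
        x2 = x2 + x[:, 3:]
        return x1, x2         # select the visible or infrared branch as needed
```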

4. Experimental Results and Analysis

4.1. Experimental Environment

The experiments were conducted on the Windows operating system with hardware configurations including an Intel (R) Core (TM) i9-14900HX CPU and an NVIDIA GeForce RTX 4090 graphics card, and the experimental environment was built on the PyTorch 2.6.0 framework with dependencies specified as Python 3.9.21 and CUDA 12.6. Two categories of baseline models were designed for comparative validation, where unimodal baselines used YOLOv11n to independently process visible light and infrared images (aiming to verify the performance improvement of multimodal fusion over unimodal detection), and the multimodal baseline was a dual-branch structure constructed based on YOLOv11n (serving as the reference model for comparing with the improved GLNet-YOLO in ablation experiments). Hyperparameters adhered to the default specifications of the YOLO series: batch size = 16; workers = 8; training epochs = 200; and initial learning rate = 0.01. Input images were uniformly resized to 640 × 640, and data augmentation adopted YOLO’s native strategies, including random horizontal flipping, random cropping and scaling, and Mosaic augmentation.
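For reference, these hyperparameters correspond to a standard Ultralytics-style training call such as the sketch below for the unimodal YOLOv11n baselines; the dataset YAML path is a hypothetical placeholder, and the dual-branch GLNet-YOLO training pipeline itself is not reproduced here.

```python
from ultralytics import YOLO

# Unimodal YOLOv11n baseline (visible light or infrared split of LLVIP/KAIST)
model = YOLO("yolo11n.pt")
model.train(
    data="llvip_visible.yaml",  # hypothetical dataset config for one modality
    epochs=200,                 # training epochs
    batch=16,                   # batch size
    workers=8,                  # dataloader workers
    imgsz=640,                  # input images resized to 640 x 640
    lr0=0.01,                   # initial learning rate
)
```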

4.2. Dataset

This experiment uses the LLVIP dataset and the preprocessed KAIST dataset. The LLVIP dataset contains 15,488 pairs of visible–infrared images captured in low-light scenarios, among which 12,025 pairs are used for training and 3463 pairs for testing. The image pairs in this dataset are strictly aligned in time and space with precise pedestrian annotations, serving as a commonly used benchmark dataset in the field of low-light vision [27]. After screening and processing, the processed KAIST dataset retains 7601 pairs of training images and 2252 pairs of testing images, with labels simplified to the “person” category. It features high data quality and is suitable for pedestrian detection task scenarios [28].

4.3. Evaluation Indicators

To systematically evaluate the model performance, this study uses mean average precision (mAP) as the primary metric. mAP@50 reflects the detection capability under regular precision by calculating the average precision at an IoU threshold of 0.5; mAP@50-95 more comprehensively reflects the model’s performance under multiple thresholds by calculating the average precision at IoU thresholds from 0.5 to 0.95, and is especially advantageous in robustness evaluation [29].
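The relationship between the two metrics can be summarized by the sketch below, which assumes a hypothetical helper average_precision(iou_threshold) returning the detector’s AP at a given IoU threshold; in practice, COCO-style evaluation tooling performs this computation.

```python
import numpy as np

def map50(average_precision) -> float:
    # mAP@50: average precision at a single IoU threshold of 0.5
    return average_precision(0.50)

def map50_95(average_precision) -> float:
    # mAP@50-95: mean of the APs at IoU thresholds 0.50, 0.55, ..., 0.95
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([average_precision(t) for t in thresholds]))
```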

4.4. Model Testing and Evaluation

To validate the effectiveness and superiority of the proposed model, multiple comparative experiments were conducted on the LLVIP dataset, including (1) comparison with state-of-the-art multimodal detection models and (2) comparison with unimodal detection methods. All models were first retrained and tested under this study’s unified experimental setup to ensure fair comparison. For core baselines (unimodal models such as YOLOv8n and YOLOv11n and state-of-the-art multimodal models such as CSAA and RSDet), the re-implemented performance is reported directly, as it falls within reasonable variation of the published results; for non-core auxiliary baselines (e.g., early multimodal models such as DenseFuse), the re-implemented performance in our setup was significantly lower than reported in the original literature because of unavoidable discrepancies in legacy code adaptation and historical training configurations. To avoid drawing misleading conclusions from underperforming re-implementations, the literature figures are cited for these models instead. All comparisons adhere to the same evaluation metrics, with results summarized in Table 1.
YOLOv11 was selected as the baseline in this study, primarily because of its extensive validation in unimodal detection tasks. It provides a stable foundation for transformation into a multimodal framework, with a well-defined network structure and performance benchmarks that can reliably support the evaluation of the fusion effects of the FM and DMR modules. Although newer versions such as YOLOv12 have been released, their improvements focus on unimodal backbone networks and they lack mature application cases in visible light and infrared fusion scenarios; they are therefore not included in the current study. Future research will extend to newer versions to verify the universality of the proposed fusion mechanism.
The metrics described in the previous section are adopted as the benchmark for evaluating model performance, with the size of the model weight file introduced as an auxiliary indicator. The size of the weight file is directly related to the computational complexity and storage requirements of the model, and it is a key factor in measuring the deployment efficiency of the model. Ablation experiments on the proposed model were conducted under the same environment, and the results are shown in Table 2, where all YOLOv11 models adopt a dual-branch structure.
To further verify the generalization ability of the model in different scenarios, auxiliary validation experiments were conducted on the processed KAIST dataset, and the results are shown in Table 3.
On the LLVIP dataset, a visual analysis was conducted on the detection effects of the YOLOv11 and GLNet-YOLO models, with partial results shown in Figure 4. The figure selects typical complex scenarios such as low light, dense pedestrians, and occlusion to demonstrate the pedestrian detection performance of the two types of models. It intuitively presents the performance differences between the models when coping with different environmental challenges, facilitating a deeper understanding of their advantages and disadvantages.

4.5. Summary of Experimental Results

Based on the experimental environment, datasets, and evaluation metrics specified earlier, this section summarizes the core experimental results of this study.
On the LLVIP dataset, as shown in Table 1, the proposed GLNet-YOLO model with RGB + IR multimodal input achieves an mAP@50 of 96.6% and an mAP@50-95 of 65.9%. Compared to the baseline YOLOv11n dual-branch model (RGB + IR input, mAP@50 = 96.9%, mAP@50-95 = 62.9%), GLNet-YOLO shows a 0.3% decrease in mAP@50 but a 3.0% increase in mAP@50-95; it also outperforms other mainstream multimodal models.
Ablation experiments in Table 2 further validate the impact of individual modules. Using the YOLOv11n dual-branch model (mAP@50 = 96.9%, mAP@50-95 = 62.9%, model size = 8.16 MB) as the baseline, adding only the FM module results in a 0.1% drop in mAP@50 (to 96.8%) and a 2.8% rise in mAP@50-95 (to 65.7%), with the model size increasing to 8.17 MB. Adding only the DMR module leads to a larger 0.4% decline in mAP@50 (to 96.5%) and a 2.1% increase in mAP@50-95 (to 65.0%), while keeping the model size at 8.17 MB. When both FM and DMR modules are integrated (i.e., GLNet-YOLO), mAP@50 decreases by 0.3% (to 96.6%), mAP@50-95 rises by 3.0% (to 65.9%), and the model size only increases by 0.02 MB to 8.18 MB, confirming that all improved modules cause a slight mAP@50 decline but drive significant mAP@50-95 improvement.
In the preprocessed KAIST dataset in Table 3, GLNet-YOLO’s RGB + IR input outperforms YOLOv11n’s single-modal inputs: mAP@50 increases from 43.6% (RGB) and 50.7% (IR) to 54.4%, and mAP@50-95 rises from 13.7% (RGB) and 17.2% (IR) to 18.3%, demonstrating stable performance in outdoor scenes with variable lighting.
Combined with the visualization results in Figure 4, these findings are further supported: in low-light scenes, YOLOv11n exhibits missed detections of partially occluded pedestrians, while GLNet-YOLO achieves complete detection without omissions; in dense pedestrian scenes, YOLOv11n shows obvious bounding box offsets for overlapping targets, whereas GLNet-YOLO maintains more accurate localization; in strong-backlight scenes, YOLOv11n produces false detections of bright spots (misclassified as pedestrians), while GLNet-YOLO effectively avoids such errors.

5. Discussion

5.1. The Trade-Off Between Detection Quantity and Quality

The “mAP@50 decline and mAP@50-95 improvement” observed in Table 1 and Table 2 is rooted in the mismatch between the model’s design goals and the evaluation logic of the two metrics. mAP@50 uses a relaxed IoU threshold (≥0.5) that prioritizes “target detection” over localization precision—any bounding box with ≥50% overlap with the real target is counted as valid. The YOLOv11n dual-branch model uses direct channel concatenation to retain raw multimodal features (visible light textures and infrared contours), which easily generates bounding boxes meeting the IoU 0.5 criterion (even with minor offsets), explaining its higher mAP@50. In contrast, mAP@50-95 evaluates precision across 10 IoU thresholds (0.5–0.95), demanding high localization accuracy at strict thresholds (e.g., 0.8–0.95). Unfiltered feature fusion in YOLOv11n introduces background noise (e.g., low-light stray light), degrading performance at high IoUs—while GLNet-YOLO’s modules address this by refining features for precise localization.
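The IoU criterion underlying both metrics can be illustrated with the short sketch below (boxes given as (x1, y1, x2, y2) corners; illustrative only, not the evaluation code): a prediction with a minor offset still passes the 0.5 threshold counted by mAP@50 but fails the stricter thresholds that dominate mAP@50-95.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)          # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A 100 x 100 ground-truth box versus a prediction shifted by 15 pixels:
score = iou((0, 0, 100, 100), (15, 15, 115, 115))    # ~0.57
print(score >= 0.5)   # True  -> counted as valid by mAP@50
print(score >= 0.8)   # False -> penalized at the strict thresholds of mAP@50-95
```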
The ablation data in Table 2 clarifies how modules drive these changes. The FM module’s ‘1 × 1 convolution interaction → dual 3 × 3 convolution enhancement → attention weighting’ applies mild feature filtering: it strengthens effective features (e.g., pedestrian contours) and slightly suppresses low-overlap bounding boxes, causing only a 0.1% mAP@50 drop while boosting mAP@50-95 by 2.8%. The DMR module’s ‘1 × 1 convolution → slice separation → residual connection’ uses stricter local feature screening: it prunes more blurred, low-overlap bounding boxes (key contributors to mAP@50), leading to a 0.4% mAP@50 decline but a 2.1% mAP@50-95 rise. When combined, the FM module’s global feature optimization reduces the DMR module’s mis-pruning of ‘effective low-overlap boxes,’ limiting mAP@50 decline to 0.3% while maximizing mAP@50-95 improvement to 3.0%—among this gain, the slight 0.2% improvement (from FM-only 65.7% to FM+DMR 65.9%) is likely attributed to the gains from local refinement. This validates the ‘global fusion → local refinement’ design.
This trade-off is practically valuable for core applications. In autonomous driving, higher mAP@50-95 reduces collision risks from localization errors (even minor offsets can misjudge pedestrian distance), while the 0.3% mAP@50 drop does not compromise critical target detection (96.6% mAP@50 remains high). In intelligent surveillance, improved mAP@50-95 minimizes identity confusion from bounding box offsets, while high mAP@50 ensures no key targets are missed.

5.2. Synergistic Value of FM and DMR Modules

GLNet-YOLO’s FM and DMR modules address the limitations of existing multimodal fusion methods, which tend to emphasize either global association or local interaction in isolation. In contrast, the FM module first establishes cross-modal global associations (e.g., linking visible light textures to infrared contours) via attention weighting, while the DMR module refines local modality-specific details (e.g., using infrared contours to correct blurry visible light boundaries) within this global framework. This “global-focus + local-refinement” synergy avoids one-sided optimization, explaining why GLNet-YOLO outperforms existing models in mAP@50-95.

5.3. Model Limitations and Deployment Challenges

While GLNet-YOLO demonstrates significant advancements, its practical deployment is subject to several limitations that must be addressed. A primary challenge is the inherent “black-box” nature of deep learning models. Without a clear understanding of the model’s internal logic, diagnosing failure cases or trusting its predictions in safety-critical scenarios remains difficult. This is compounded by the observed performance trade-off, where the model improves localization precision (mAP@50-95) at the slight cost of recall for lower-quality detections (mAP@50). While advantageous for precision-focused tasks, this behavior must be carefully managed depending on the application’s tolerance for false negatives versus inaccurate localization. Furthermore, the model’s robustness is constrained by data scarcity in specialized scenarios. Although trained on large datasets, performance can degrade when faced with unusual pedestrian postures or extreme weather conditions not well represented in the training data, a common bottleneck in real-world applications. Finally, real-world performance is also subject to contextual factors beyond the immediate sensor data. Just as sensor positioning affects acoustic monitoring in manufacturing, environmental variables like camera angles and dynamic backgrounds can influence detection reliability, posing an ongoing challenge for robust deployment.

5.4. Broader Context and Future Directions

This research should be viewed within the broader context of applying data-driven models to solve complex systems where first-principles modeling is intractable. Similar to challenges in advanced manufacturing, where data-driven approaches are essential for predicting outcomes in high-frequency pulsed processes, multimodal perception in unstructured environments presents a level of complexity that is well-suited for learned, rather than manually engineered, solutions.
The primary future direction, therefore, is not just to improve accuracy but to enhance interpretability. The limitations of the “black-box” approach can be directly addressed by integrating Explainable AI (XAI) techniques. The successful use of methods such as Grad-CAM [39] to connect a model’s focus to physical process dynamics in electrochemical machining provides a clear blueprint [40]. Applying XAI to GLNet-YOLO would allow us to validate the fusion mechanism, diagnose failures, and build trust in the model’s decisions, moving the field towards more transparent and reliable AI.
This work also highlights a broader shift towards task-specific tuning and evaluation. The trade-off between mAP@50 and mAP@50-95 shows that a single metric cannot capture a model’s full utility. The future of model deployment lies in tailoring architectures and evaluation criteria to specific operational needs—prioritizing precision for autonomous driving or recall for general surveillance. Our findings advocate for a more nuanced approach to model assessment, ensuring that the deployed solution is truly optimized for its intended task.

6. Conclusions

To address the challenge of pedestrian detection in complex environments, this study proposes the GLNet-YOLO framework based on multimodal feature fusion. By integrating the FM and DMR modules, the framework achieves significant improvements in detection accuracy and robustness under low-light conditions and complex backgrounds. Its core innovations lie in three aspects: a dual-branch network structure that processes visible light and infrared inputs in parallel, enabling full exploitation of cross-modal complementarity; the integration of feature interaction, enhancement, and attention mechanisms to generate high-quality feature representations; and the preservation of inter-modal interaction information via feature separation and residual connections, which optimizes fusion performance. These enhancements boost model capability while maintaining high computational efficiency, rendering it easily integrable into diverse network architectures and suitable for a wide range of visual tasks.
In practical deployment, GLNet-YOLO still confronts multiple challenges. Beyond extreme environments (e.g., heavy smoke, torrential rain, and sandstorms) that may diminish the complementarity of modal features, other key limitations include the high hardware costs and spatiotemporal synchronization requirements of multimodal devices, the stringent real-time performance demands of dynamic scenes, and the model’s limited generalization ability in data-scarce scenarios. Although validation on large-scale datasets such as LLVIP and KAIST demonstrates the model’s robustness, insufficient sample data in special scenarios—such as unusual pedestrian postures and extreme weather—can lead to performance degradation. Data collection in these scenarios is inherently difficult, and annotation costs are prohibitive, posing common bottlenecks in real-world applications.
Future research will focus on tackling the aforementioned challenges: optimizing the feature fusion mechanism to enhance modal robustness in extreme environments; exploring low-cost hardware adaptation schemes to reduce deployment thresholds; further improving real-time performance through lightweight computing strategies; and leveraging explainable AI technologies (e.g., class activation mapping) to analyze the model’s feature dependence mechanism in complex scenes. This will deepen our understanding of the collaborative functionality of the dual modules and provide theoretical support for further optimizing fusion strategies. Concurrently, efforts will be directed toward exploring data augmentation and virtual sample generation techniques (e.g., drawing on methods like BLS-VSG [41]) to enhance the model’s generalizability in data-scarce scenarios and alleviate sample insufficiency in specialized tasks. Additionally, the framework’s application will be extended to multimodal tasks such as image fusion and target tracking, facilitating the practical advancement of related technologies to better adapt to complex and dynamic real-world demands.

Author Contributions

Conceptualization, Y.Z. and X.L.; methodology, Y.Z. and Q.Z.; software, X.X.; validation, Y.S., J.R. and S.G.; formal analysis, H.Z.; investigation, X.L.; resources, Z.Z.; data curation, Y.Z.; writing—original draft preparation, Q.Z.; writing—review and editing, X.L. and Q.Z.; visualization, Y.Z.; supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, W.-Y.; Jovanov, L.; Philips, W. Multimodal pedestrian detection based on cross-modality reference search. IEEE Sens. J. 2024, 24, 17291–17306. [Google Scholar] [CrossRef]
  2. Dasgupta, K.; Das, A.; Yogamani, S. Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15940–15950. [Google Scholar] [CrossRef]
  3. Zhao, R.; Hao, J.; Huo, H. Research on Multi-Modal Pedestrian Detection and Tracking Algorithm Based on Deep Learning. Future Internet 2024, 16, 194. [Google Scholar] [CrossRef]
  4. Hamdi, S.; Sghaier, S.; Faiedh, H.; Souani, C. Robust pedestrian detection for driver assistance systems using machine learning. Int. J. Veh. Des. 2020, 83, 140–170. [Google Scholar] [CrossRef]
  5. Chen, J.; Ding, J.; Ma, J. HitFusion: Infrared and visible image fusion for high-level vision tasks using transformer. IEEE Trans. Multimed. 2024, 26, 10145–10159. [Google Scholar] [CrossRef]
  6. Gupta, I.; Gupta, S.; Mishra, A.K.; Diwakar, M.; Singh, P.; Pandey, N.K. Deep learning-enabled infrared and visual image fusion. In Proceedings of the International Conference Machine Learning, Advances in Computing, Renewable Energy and Communication; Springer: Singapore, 2024. [Google Scholar]
  7. Sun, X.; Yu, Y.; Cheng, Q. Adaptive multimodal feature fusion with frequency domain gate for remote sensing object detection. Remote Sens. Lett. 2024, 15, 133–144. [Google Scholar] [CrossRef]
  8. Chen, J.; Yang, L.; Liu, W.; Tian, X.; Ma, J. LENFusion: A joint low-light enhancement and fusion network for nighttime infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2024, 73, 5018715. [Google Scholar] [CrossRef]
  9. Du, K.; Li, H.; Zhang, Y.; Yu, Z. CHITNet: A complementary to harmonious information transfer network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2025, 74, 5005917. [Google Scholar] [CrossRef]
  10. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83–84, 79–92. [Google Scholar] [CrossRef]
  11. Meng, S.; Liu, Y. Multimodal Feature Fusion YOLOv5 for RGB-T Object Detection. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022; pp. 2333–2338. [Google Scholar] [CrossRef]
  12. Rao, D.; Xu, T.; Wu, X.-J. TGFuse: An infrared and visible image fusion approach based on transformer and generative adversarial network. IEEE Trans. Image Process. 2023; early access. [Google Scholar] [CrossRef]
  13. Hui, L.; Wu, X.-J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
  14. Tang, W.; He, F.; Liu, Y. YDTR: Infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans. Multimed. 2023, 25, 5413–5428. [Google Scholar] [CrossRef]
  15. Hwang, S.; Han, D.; Jeon, M. Multispectral Detection Transformer with Infrared-Centric Feature Fusion. arXiv 2025, arXiv:2505.15137. [Google Scholar] [CrossRef]
  16. Zhang, H.; Zuo, X.; Jiang, J.; Guo, C.; Ma, J. MRFS: Mutually Reinforcing Image Fusion and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 26964–26973. [Google Scholar] [CrossRef]
  17. Hu, Z.; Kong, Q.; Liao, Q. Multi-Level Adaptive Attention Fusion Network for Infrared and Visible Image Fusion. IEEE Signal Process. Lett. 2025, 32, 366–370. [Google Scholar] [CrossRef]
  18. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.-M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]
  19. Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. PC-YOLO11s: A lightweight and effective feature extraction method for small target image detection. Sensors 2025, 25, 348. [Google Scholar] [CrossRef]
  20. Xie, Y.; Zhang, L.; Yu, X.; Xie, W. YOLO-MS: Multispectral object detection via feature interaction and self-attention guided fusion. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2132–2143. [Google Scholar] [CrossRef]
  21. Deng, J.; Bei, S.; Su, S.; Zuo, Z. Feature fusion methods in deep-learning generic object detection: A survey. In Proceedings of the IEEE 9th Joint International Conference IT Artificial Intelligence (ITAIC), Chongqing, China, 11–13 December 2020; pp. 431–437. [Google Scholar] [CrossRef]
  22. Pan, L.; Diao, J.; Wang, Z.; Peng, S.; Zhao, C. HF-YOLO: Advanced Pedestrian Detection Model with Feature Fusion and Imbalance Resolution. Neural Process. Lett. 2024, 56, 90. [Google Scholar] [CrossRef]
  23. Du, Q.; Wang, Y.; Tian, L. Attention module based on feature normalization. In Proceedings of the 4th International Conference on Intelligent Computing Human-Computer Interaction (ICHCI), Guangzhou, China, 4–6 August 2023; pp. 438–442. [Google Scholar] [CrossRef]
  24. Sun, J.; Yin, M.; Wang, Z.; Xie, T.; Bei, S. Multispectral Object Detection Based on Multilevel Feature Fusion and Dual Feature Modulation. Electronics 2024, 13, 443. [Google Scholar] [CrossRef]
  25. Li, Y.; Sun, L. FocusNet: An infrared and visible image fusion network based on feature block separation and fusion. In Proceedings of the 17th International Conference on Advanced Computer Theory and Engineering (ICACTE), Hefei, China, 13–15 September 2024; pp. 236–240. [Google Scholar] [CrossRef]
  26. Liu, J.; Zhang, W.; Tang, Y.; Tang, J.; Wu, G. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2356–2365. [Google Scholar] [CrossRef]
  27. Jia, J.X.; Zhu, C.; Li, M.; Tang, W.; Liu, S.; Zhou, W. LLVIP: A Visible-infrared Paired Dataset for Low-light Vision. arXiv 2021, arXiv:2108.10831. [Google Scholar] [CrossRef]
  28. Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation. arXiv 2018, arXiv:1808.04818. [Google Scholar] [CrossRef]
  29. Zheng, X.; Zheng, W.; Xu, C. A multi-modal fusion YoLo network for traffic detection. Comput. Intell. 2024, 40, e12615. [Google Scholar] [CrossRef]
  30. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  31. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  33. Chen, Y.-T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 139–158. [Google Scholar]
  34. Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal Object Detection by Channel Switching and Spatial Attention. In Proceedings of the IEEE/CVF Conference Computer Vision Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 403–411. [Google Scholar] [CrossRef]
  35. Zhao, T.; Yuan, M.; Jiang, F.; Wang, N.; Wei, X. Removal then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection. arXiv 2024, arXiv:2401.10731. [Google Scholar] [CrossRef]
  36. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A Generative Adversarial Network With Multiclassification Constraints for Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5005014. [Google Scholar] [CrossRef]
  37. Zhang, H.; Ma, J. SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion. Int. J. Comput. Vis. 2021, 129, 2761–2785. [Google Scholar] [CrossRef]
  38. Tang, L.; Xiang, X.; Zhang, H.; Gong, M.; Ma, J. DIVFusion: Darkness-free infrared and visible image fusion. Inf. Fusion 2022, 91, 477–493. [Google Scholar] [CrossRef]
  39. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  40. Wu, M.; Yao, Z.; Verbeke, M.; Karsmakers, P.; Gorissen, B.; Reynaerts, D. Data-driven models with physical interpretability for real-time cavity profile prediction in electrochemical machining processes. Eng. Appl. Artif. Intell. 2025, 160, 111807. [Google Scholar] [CrossRef]
  41. Ge, J.; Yao, Z.; Wu, M.; Almeida, J.H.S., Jr.; Jin, Y.; Sun, D. Tackling data scarcity in machine learning-based CFRP drilling performance prediction through a broad learning system with virtual sample generation (BLS-VSG). Compos. Part B Eng. 2025, 305, 112701. [Google Scholar] [CrossRef]
Figure 1. The GLNet-YOLO framework optimizes the feature fusion process by introducing a two-branch network structure and FM and DMR modules in the YOLOv11 architecture.
Figure 2. FM module integrating feature interaction, feature enhancement and attention mechanisms in order to optimize multimodal feature fusion and improve target detection performance.
Figure 3. The DMR module further refines the multimodal features processed by the FM module through the mechanisms of feature interaction, separation and residual linkage, thus enhancing the model’s ability to learn visible and infrared modal features.
Figure 4. Visualization comparison between GLNet-YOLO and YOLOv11 for pedestrian detection on the LLVIP dataset.
Table 1. Performance comparison of different detectors on the LLVIP dataset.

| Model | Input | mAP@50 | mAP@50-95 |
|---|---|---|---|
| YOLOv3 (Darknet53) [30] | RGB | 85.9 | 43.3 |
| YOLOv3 (Darknet53) [30] | IR | 89.7 | 52.8 |
| YOLOv5 (CSPD53) [31] | RGB | 90.8 | 50.0 |
| YOLOv5 (CSPD53) [31] | IR | 94.6 | 61.9 |
| Faster R-CNN [32] | RGB | 91.4 | 49.2 |
| Faster R-CNN [32] | IR | 96.1 | 61.1 |
| YOLOv8n | RGB | 88.0 | 49.9 |
| YOLOv8n | IR | 96.3 | 65.2 |
| YOLOv11n | RGB | 87.4 | 49.7 |
| YOLOv11n | IR | 95.9 | 63.1 |
| ProbEn [33] | RGB + IR | 93.4 | 51.5 |
| CSAA [34] | RGB + IR | 94.3 | 59.2 |
| RSDet [35] | RGB + IR | 95.8 | 61.3 |
| GANMcC [36] | RGB + IR | 87.8 | 49.8 |
| DenseFuse [13] | RGB + IR | 88.5 | 50.4 |
| SDNet [37] | RGB + IR | 86.6 | 50.8 |
| DIVFusion [38] | RGB + IR | 89.8 | 52.0 |
| YOLOv11n | RGB + IR | 96.9 | 62.9 |
| GLNet-YOLO (ours) | RGB + IR | 96.6 | 65.9 |
Table 2. Performance comparison of different improvement methods.

| Model Configuration | mAP@50 | mAP@50-95 | Model Size (MB) | Core Changes |
|---|---|---|---|---|
| YOLOv11n (RGB + IR) | 96.9% | 62.9% | 8.16 | Basic dual branch |
| +FM module | 96.8% (−0.1%) | 65.7% (+2.8%) | 8.17 | Verify FM module |
| +DMR module | 96.5% (−0.4%) | 65.0% (+2.1%) | 8.17 | Verify DMR module |
| +FM+DMR modules | 96.6% (−0.3%) | 65.9% (+3.0%) | 8.18 | Verify FM+DMR synergy |
Table 3. Performance comparison of different detectors on the preprocessed KAIST dataset.

| Model | Input | mAP@50 | mAP@50-95 |
|---|---|---|---|
| YOLOv11 | RGB | 43.6 | 13.7 |
| YOLOv11 | IR | 50.7 | 17.2 |
| GLNet-YOLO (ours) | RGB + IR | 54.4 | 18.3 |
