Infrared Temporal Differential Perception for Space-Based Aerial Targets

Guo, Lan; Chen, Xin; Gao, Cong; Zhao, Zhiqi; Rao, Peng

doi:10.3390/rs17203487

Open AccessArticle

Infrared Temporal Differential Perception for Space-Based Aerial Targets

by

Lan Guo

^1,2

,

Xin Chen

^1,2,

Cong Gao

^1,2,

Zhiqi Zhao

^1,2 and

Peng Rao

^1,2,*

¹

National Key Laboratory of Infrared Detection Technologies, Shanghai Institute of Technical Physics, Chinese Academy of Sciences, 500 Yutian Road, Shanghai 200083, China

²

Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(20), 3487; https://doi.org/10.3390/rs17203487

Submission received: 4 September 2025 / Revised: 14 October 2025 / Accepted: 15 October 2025 / Published: 20 October 2025

(This article belongs to the Special Issue Deep Learning-Based Small-Target Detection in Remote Sensing)

Download

Browse Figures

Versions Notes

Abstract

Highlights

What are the main findings?

Proposed an event-triggered Infrared Temporal Differential Detection (ITDD) method to enhance dim targets under complex backgrounds by capturing pixel-level transient variations and generating sparse event streams that suppress static background, reduce redundancy, and enhance weak dynamic signals.
Developed an irradiance-based ITDD detection sensitivity model and an infrared (IR) differential event stream generation method, converting conventional IR frames into differential event streams.
Proposed a lightweight multi-modal fusion network combining differential event frames and IR images for detecting dim small aerial targets in cluttered space-based backgrounds.

What are the implications of the main findings?

Simulation experiments demonstrate that ITDD achieves a 3.59% event-triggered rate, compresses data to 0.1% of the original volume, and improves the SNR by 4.21-fold.
The fusion network with 2.44 M parameters achieves a detection rate of 99.31% and a false alarm rate of $1.97 \times 10^{- 5}$ on the SITP-QLEF dataset, confirming the effectiveness of ITDD with multi-modal fusion for moving target detection.

Abstract

Space-based infrared (IR) detection, with wide coverage, all-time operation, and stealth, is crucial for aerial target surveillance. Under low signal-to-noise ratio (SNR) conditions, however, its small target size, limited features, and strong clutters often lead to missed detections and false alarms, reducing stability and real-time performance. To overcome these issues of energy-integration imaging in perceiving dim targets, this paper proposes a biomimetic vision-inspired Infrared Temporal Differential Detection (ITDD) method. The ITDD method generates sparse event streams by triggering pixel-level radiation variations and establishes an irradiance-based sensitivity model with optimized threshold voltage, spectral bands, and optical aperture parameters. IR sequences are converted into differential event streams with inherent noise, upon which a lightweight multi-modal fusion detection network is developed. Simulation experiments demonstrate that ITDD reduces data volume by three orders of magnitude and improves the SNR by 4.21 times. On the SITP-QLEF dataset, the network achieves a detection rate of 99.31%, and a false alarm rate of

1.97 \times 10^{- 5}

, confirming its effectiveness and application potential under complex backgrounds. As the current findings are based on simulated data, future work will focus on building an ITDD demonstration system to validate the approach with real-world IR measurements.

Keywords:

aerial target; space-based platform; infrared temporal differential detection; event stream characterization; multi-modal fusion detection

1. Introduction

Infrared (IR) technology is widely used for moving aerial target detection, with space-based systems playing crucial roles in military surveillance and civilian applications. Such targets typically have the characteristics of small size, insufficient texture, sparse distribution, uncertain trajectory, and high speed [1,2].

The performance of space-based IR detection is degraded by several factors [3]. Dim targets often vanish in clutter, as their radiation is comparable to or weaker than that of backgrounds such as urban areas or sea surfaces. Inherent detector noise further reduces the signal-to-noise ratio (SNR). Motion blur and trailing complicate high-speed target detection, with the frame rate constrained by onboard storage and downlink capacity. Satellite platform jitter further reduces accuracy. In addition, IR integral imaging introduces motion aliasing, noise, and quantization errors during acquisition and sampling. Addressing these challenges is essential for reliable detection of space-based IR aerial targets.

1.1. Biomimetic Vision Perception Technology

Traditional IR imaging with inter-frame sampling suffers from temporal information loss, background redundancy, and heavy data loads, limiting real-time detection. Inspired by biological vision, bio-inspired perception models have been proposed to enhance dim target detection and overcome these limitations.

Event-based sensors (EBSs) output asynchronous events when local intensity changes exceed a threshold, each event recording pixel coordinates, a timestamp, and polarity [4,5,6,7,8,9,10,11,12]. Offering microsecond resolution, a >100 dB dynamic range, low power consumption, and reduced data volume, EBSs are well suited for moving target observation [7].

Applications include military surveillance, autonomous driving [13], drones [14], and space situational awareness, with demonstrations ranging from ground-based Low Earth Orbit (LEO) and Geosynchronous Earth Orbit (GEO) target detection [15,16] to the Falcon Neuro system on the International Space Station [17]. However, no public reports have yet confirmed their use in space-based aerial target detection.

Vidar, an integrating visual sensor, encodes light intensity into inter-pulse intervals, preserving temporal variations but showing strong threshold dependence [18,19]. This method preserves temporal variations in light intensity and enables information extraction through threshold adjustment. Low thresholds can cause strong background signals to obscure weak targets, while high thresholds may suppress faint signals. Although effective in dynamic, high-temporal-resolution scenes, Vidar is less suitable for space-based IR aerial target observation. In contrast, EBSs, with higher sensitivity to local intensity changes and wider dynamic range, show greater potential.

Most commercial EBSs are designed for visible light, whereas IR EBSs remain in early development with few datasets and instruments. Prototypes include a

64 \times 64

long-wavelength infrared (LWIR) EBS (<8 mW power, 20 fps event rate) [20]. In 2021, the Defense Advanced Research Projects Agency (DARPA) launched the FENCE program to develop military IR EBSs and processing algorithms [21], and in 2022, Semiconductor Devices (SCD) introduced a 640 × 512 short-wavelength infrared (SWIR) EBS (600–1700 nm) with a parallel event and frame outputs [22,23]. Research groups in China, including Huazhong University of Science and Technology and Peking University, are actively advancing the design of IR EBS readout circuits [24,25].

1.2. Multi-Modal Data Fusion Method

Front-end imaging aims to support high-level vision tasks such as detection, tracking, and recognition. Multi-modal data fusion leverages complementary features to enhance perception robustness in complex environments and challenging scenarios.

To leverage complementary information across modalities, extensive research has focused on multi-modal fusion algorithms. These methods aim to extract critical features from heterogeneous data to construct more comprehensive and robust representations, overcoming single-modality limitations and improving vision task performance. Common fusion strategies include IR–visible image fusion [26], joint IR–radar perception [27], and dual-band IR image fusion [28,29].

Designing effective multi-modal fusion algorithms requires analysis of the differences and complementarities among modalities in terms of texture, contrast, intensity distribution, and illumination response. Traditional image fusion approaches primarily encompass spatial-domain, transform-domain, sparse-representation, and multiscale-transform methods [30,31,32,33] but rely on handcrafted features and often lack adaptability in dynamic scenes. Deep learning-based approaches address these limitations by learning salient and complementary features directly from data [34]. Common architectures include autoencoder-based compression–reconstruction frameworks, CNN-based end-to-end models, GAN-based contrast-enhancing generators, and Transformer-based global modeling methods [35,36,37,38], which improve fusion quality, edge detail preservation, and semantic consistency by modeling complex nonlinear relationships.

Although most algorithms focus on pixel-level features, the ultimate goal of fusion is to provide accurate and effective inputs for high-level vision tasks. Task-driven fusion methods integrate fusion with detection or recognition tasks to enhance performance. Examples include multi-spectral composite detection and recognition in IR bands [39], cascaded fusion-detection networks [40], saliency-enhanced IR–visible fusion [41], and dual-channel fusion of event streams with visible images for video restoration [42].

Evaluation of fusion algorithms relies on metrics related to information content, image features, correlation, structure, and human perception [43,44]. Task-driven fusion further uses downstream metrics such as detection rate (DR), false alarm rate (FR), recognition rate, and recall. Despite progress, challenges remain in data labeling, multi-modal feature extraction, and registration accuracy. This paper focuses on task-driven fusion for space-based IR aerial target detection, aiming to enhance the effectiveness and accuracy of multi-source data fusion in practical applications.

The key contributions of this study are as follows:

An event-triggered Infrared Temporal Differential Detection (ITDD) model is proposed to overcome the limitations of traditional energy integration-based methods in enhancing dim targets under complex backgrounds. By monitoring pixel-level transient variations, it generates sparse event streams that suppress static background, reduce data redundancy, and improve the perception of weak dynamic signals.
To address the issue of a low SNR in the temporal dimension, an irradiance-based detection sensitivity model and a space-based IR differential event stream generation method for aerial targets are developed. By coupling parameters such as threshold voltage, wavelength, and aperture, conventional frames are transformed into differential event streams with intrinsic noise characteristics. Experiments show a 3.59% event-triggered rate, data compression to 0.1% of the original volume, and a 4.21-fold SNR increase, while real-time software enables timestamp-aligned reconstruction.
A lightweight multi-modal fusion detection method based on differential events and IR images is proposed for dim, small aerial targets in cluttered space-based backgrounds. The network, with 2.44 million parameters, achieves a DR of 99.31%, and an FR of $1.97 \times 10^{- 5}$ on the SITP-QLEF dataset, demonstrating the effectiveness of the ITDD combined with multi-modal fusion for moving target detection.

The structure of this paper is as follows. Section 2 introduces the basic concept and principle of the ITDD, focusing on the theoretical model for detection sensitivity based on irradiance. Section 3 and Section 4 present the generation of differential event streams and a multi-modal fusion network for target detection, respectively. Section 5 evaluates the performance of the ITDD and the fusion network on multiple datasets, and Section 6 concludes the paper with discussions on future research directions.

2. Infrared Temporal Differential Detection (ITDD)

Traditional differential detection methods generally follow an “integrate-then-subtract” scheme, in which multiple total irradiance measurements, each containing both signal and background, are independently acquired and subsequently processed via electronic circuits or digital algorithms to extract the target signal. In contrast, Infrared Differential Detection (IDD) directly senses differences in the target’s physical quantities at the physical layer [45,46]. ITDD extends this approach by capturing radiative changes over micro-temporal intervals, triggering events whenever these changes exceed a contrast threshold. This mechanism produces sparse and asynchronous outputs with a higher dynamic range and faster response. The following presents a comparison of ITDD with existing EBSs from three perspectives:

Imaging principles and spectral differences: Traditional EBSs typically operate in the visible spectrum and rely on changes in reflected light under limited illumination, whereas ITDD images are based on intrinsic IR radiation changes. These differences lead to fundamentally distinct target characteristics, event features, and fabrication processes.
Signal response mechanism: Unlike conventional EBSs, ITDD does not employ logarithmic compression in its electrical signal conversion, preserving system sensitivity. Its linear response is particularly suitable for detecting dim, space-based targets.
Application scenarios: ITDD is designed for detecting dim, moving targets against complex space-based backgrounds, whereas EBS has not been reported to perform effectively in such conditions. The subsequent sections present a systematic assessment of ITDD’s feasibility and performance through modeling, simulation, and data analysis.

Figure 1 shows the ITDD system flowchart, and the model with signal-level detection sensitivity is described below.

2.1. Mathematical Model of ITDD

When an aerial target moves across the field of view, ITDD selectively amplifies the target signal by distinguishing abrupt pixel responses from the slowly varying background. Each pixel independently senses the local rate of change in radiation. The temporal change in the photocurrent per pixel is expressed as

Δ I_{pho} = I_{pho} (t_{0} + Δ t) - I_{pho} (t_{0}) = A Δ t + o (Δ t),

(1)

where

I_{pho}

is the photocurrent per pixel,

o (Δ t)

is an infinitesimal of a higher order, A is the differential coefficient, and

A Δ t

is the differentiation of the photocurrent at

t_{0}

.

d I_{pho}

can be approximately expressed as a linear computation of

d t

:

d I_{pho} = A d t .

(2)

The spectral radiance of an object, as received by a space-based IR detection system at the entrance pupil, is expressed as

L_{opt} (λ) = L (λ) τ_{a} (λ),

(3)

where

L (λ)

is the spectral radiance of the object and

τ_{a}

is the atmospheric transmittance.

The spectral irradiance at the entrance pupil is expressed as follows:

E_{opt} (λ) = L_{opt} (λ) ω,

(4)

where

ω

is the solid angle of the instantaneous field of view.

The spectral radiant flux per pixel is expressed as

Φ_{\det} (λ) = Φ_{opt} (λ) τ_{o} = L_{opt} (λ) ω A_{o} τ_{o},

(5)

where

Φ_{opt} (λ)

is the spectral radiant flux at the entrance pupil,

A_{o}

denotes the area of the entrance pupil,

τ_{o}

is the optical transmittance;

ω A_{o} = \frac{A_{d}}{f^{2}} \cdot \frac{π D^{2}}{4}

, where

A_{d}

is the pixel area, f is the focal length, and D is the optical aperture. The F number is defined as

F = \frac{f}{D}

. Equation (5) can be rewritten as

Φ_{\det} (λ) = \frac{π τ_{o} A_{d}}{4 F^{2}} \cdot L_{opt} (λ) .

(6)

The photon radiant flux per pixel is as follows:

Φ_{\det, q} = \int_{λ_{s}}^{λ_{s} + Δ λ} Φ_{\det} (λ) \cdot \frac{λ}{h c} d λ \approx \frac{\bar{λ}}{h c} Φ_{\det},

(7)

where h is Planck’s constant, c is the velocity of light in vacuum,

λ_{s}

is the initial wavelength of the IR sensor,

Δ λ

is the spectral interval, and

\bar{λ} = λ_{s} + Δ λ / 2

. The number of electrons produced during the integration period can be expressed as

N_{\det} = Φ_{\det, q} η Δ t_{int},

(8)

where

η

is the quantum efficiency and

Δ t_{int}

is the interval between the current and previous events, that is, the effective integration time.

The photocurrent per pixel is calculated as follows:

I_{pho} = \frac{N_{\det} \cdot e}{Δ t_{int}} = K_{\det} L_{opt} (λ_{s}, Δ λ),

(9)

where e is the electron charge and

K_{\det} = \frac{π τ_{o} A_{d} η e (2 λ_{s} + Δ λ)}{8 F^{2} h c}

.

The temporal change in the photocurrent per pixel is expressed as

Δ I_{pho} (u_{k}, t_{k}) = I_{pho} (u_{k}, t_{k}) - I_{pho} (u_{k}, t_{k} - Δ t_{k}),

(10)

where

u_{k}

is the position of the event in terms of the x–y coordinates. The first-order Taylor expansion of Equation (10) is obtained as follows:

Δ I_{pho} (u_{k}, t_{k}) \approx \frac{\partial I_{pho}}{\partial t} (u_{k}, t_{k}) Δ t_{k} = K_{\det} \frac{\partial L_{opt}}{\partial t} (u_{k}, t_{k}) Δ t_{k} .

(11)

According to Equations (2) and (11), the differential coefficient is expressed as

A = K_{\det} \frac{\partial L_{opt}}{\partial t} (u_{k}, t_{k}) .

(12)

In theory, the ITDD method operates at a high sampling frequency, allowing the differential signal to be approximated by the temporal difference of the photocurrent. An IR differential event is triggered whenever this temporal change exceeds a contrast threshold. Each event includes the pixel coordinates, timestamp, and polarity:

e_{k} = {u_{k}, t_{k}, p_{k}}_{k = 1, \dots, N},

(13)

where

t_{k}

denotes the timestamp of an event,

p_{k}

indicates the polarity (positive for an increase and negative for a decrease in IR radiation), and N is the number of events. An event is triggered with positive polarity when the temporal increment exceeds the threshold, negative polarity when the decrement reaches the threshold, and no event otherwise.

p_{k} = f (Δ I_{pho} (u_{k}, t_{k}), C) = \{\begin{matrix} + 1, & i f Δ I_{pho} (u_{k}, t_{k}) \geq C \\ - 1, & i f Δ I_{pho} (u_{k}, t_{k}) \leq - C, \\ 0, & o t h e r w i s e \end{matrix}

(14)

where C is the contrast threshold.

2.2. Detection Sensitivity Model of ITDD

According to the relative motion between the target and the detector, larger target radiation leads to greater temporal variation. In this study, the detectable irradiance at the pupil entrance is adopted to evaluate the ITDD sensitivity. Accordingly, the number of electrons per pixel is expressed as follows based on Equation (8):

N_{\det} = \frac{\bar{λ} A_{o} τ_{o} η Δ t_{int}}{h c} E_{opt} .

(15)

In the readout circuit, if voltage is taken as the criterion for logical judgment, the signal voltage of each pixel can be expressed as

V_{s} = N_{\det} G = N_{\det} \cdot \frac{e}{c_{int}} = \frac{\bar{λ} A_{o} τ_{o} η Δ t_{int} e}{h c \cdot c_{int}} E_{opt},

(16)

where G is the transformation gain and

c_{int}

is the capacitance.

The event-triggered mechanism is defined as

|Δ V_{s} (u, t)| = |V_{s} (u, t_{0} + Δ t) - V_{s} (u, t_{0})| \geq V_{th},

(17)

where

t_{0}

is the timestamp of the previous event, u is the pixel position, and

Δ V_{s} (u, t)

is the voltage change during

Δ t

. The threshold voltage is

V_{th} = α V_{\max}

, where

α

is the turnover factor and

V_{\max}

is the maximum voltage.

To determine the detectable limit, the voltage change is set to

V_{s}

and the instrument background of the IR imager is removed through differential processing. A positive-polarity event is generated when the voltage variation satisfies:

V_{s} \geq V_{th} .

(18)

Based on Equation (15), the above formula is expressed as

E_{opt} \geq \frac{h c \cdot c_{int}}{e \bar{λ} A_{o} τ_{o} η Δ t_{int}} \cdot V_{th} .

(19)

2.3. Factors Affecting ITDD Sensitivity

With fixed optical system and detector parameters, Equation (19) shows that the detectable irradiance is linearly dependent on the threshold voltage. The following section analyzes the influencing factors and evaluates the performance of representative photoelectric instruments.

Due to atmospheric attenuation, space-based IR observations are typically conducted within atmospheric windows [47]. Accordingly, the experimental bands are chosen as 3–5 µm (mid-wavelength infrared, MWIR) and 8–10 µm (LWIR). The radiant intensity of an aerial target in cruise is assumed to range from 25 to 2500

W / sr

[48,49]. At a detection distance of 500 km, the target can be regarded as a point source on the imaging plane. The normal incident irradiance of a point target at the entrance pupil is given by

E_{opt} = \frac{I τ_{a}}{l^{2}},

(20)

where I is the target’s radiant intensity as seen by the detector,

τ_{a}

is the atmospheric transmittance, and l is the detection distance.

According to Equation (20), the irradiance range is

10^{- 14}

to

10^{- 12}

W·

{cm}^{- 2}

. Lower detectable irradiance corresponds to higher system sensitivity.

Based on Equation (19), a controlled variable approach is used to examine how optical aperture, optical transmittance, integration time, and capacitance affect detection sensitivity with respect to the threshold voltage. Relevant parameters are listed in Table 1, and the threshold voltage range is varied from 70 to 350 mV.

Figure 2a,b illustrate the relationship between optical aperture and logarithmic irradiance. For a target with a radiant intensity of 25 W/sr, the required aperture is calculated using the system parameters in Table 1 and Equations (19) and (20). In the MWIR band, the minimum aperture is 500 mm, and in the LWIR band, it is 330 mm. For targets with irradiance from

10^{- 14}

to

10^{- 12}

W·

{cm}^{- 2}

, the MWIR aperture should be 115–500 mm, and the LWIR aperture 75–330 mm.

As shown in Figure 2c,d, to meet detection requirements of the target, the MWIR system optical transmittance should be 0.2–1, and and the LWIR system 0.1–1. This ensures effective capture of target IR radiation while maintaining adequate detection sensitivity.

As shown in Figure 2e,f, with other parameters fixed, the MWIR system requires an integration time of 25–500 µs, while the LWIR system requires 11–220 µs.

Figure 2g,h show that, at a given threshold voltage, smaller integration capacitance lowers the detectable irradiance, increasing sensitivity. At fixed capacitance, reducing the threshold voltage also decreases detectable irradiance. To achieve irradiance levels of

10^{- 14}

to

10^{- 12}

W·

{cm}^{- 2}

, the MWIR system requires 5–97 fF, and the LWIR system 11–218 fF.

For typical aerial target detection, integration time and optical aperture are key factors in photoelectric design. Based on Equation (19), Figure 3a,c show that the parameter combinations meeting detection requirements lie within the area enclosed by the two curves. A higher threshold voltage expands the feasible parameter range and reduces implementation complexity but also lowers the event-triggered rate, potentially affecting faint target detection. Thus, practical design must balance threshold voltage, integration time, optical aperture, and system sensitivity to optimize performance.

In summary, selecting detection system parameters requires consideration of the specific task and target radiant intensity, balancing performance and engineering complexity. Using the entrance pupil irradiance sensitivity model, this section analyzes the coupling of parameters such as threshold voltage, detection band, and optical aperture, providing a valuable reference for the design and optimization of photoelectric instruments.

3. Infrared Differential Event Stream Simulation

This paper presents a method for simulating IR differential event streams by converting real and simulated image sequences. The approach enables analysis of the ITDD performance under diverse observation conditions and backgrounds, incorporating system noise to enhance realism. Based on this algorithm, software was developed to acquire event frames in real time when paired with an IR camera. The workflow is illustrated in Figure 4.

3.1. Detection Scene Image Generation

The IR differential event stream simulation method can use input images acquired either by frame-based IR cameras or generated from simulated scenes. According to Equation (14), when the temporal change in light intensity exceeds the threshold, the continuous images are converted into discrete events.

Detection scenes can be efficiently simulated by constructing moving targets based on IR imaging and remote sensing principles. This involves defining target and background radiation models, a target–background synthesis model, an atmospheric transmission model, and an IR detection system model. Parameters such as target size, temperature, emissivity, and velocity; background reflectivity and solar altitude; and system properties including spectral band, field of view, detection distance, and optical transmittance are set to generate diverse scenarios. System coupling, detector noise, and other factors degrade image quality, with grayscale intensity used as input for subsequent processing.

3.2. Infrared Differential Event Stream Generation

Taking a single pixel as an example, the simulation and generation of the IR differential event stream are described below.

3.2.1. Contrast Threshold and Threshold Noise

The contrast threshold, which strongly influences detection (Equation (19)), was adjusted for various scenarios. To simulate realistic conditions, per-pixel variations from manufacturing and other factors were modeled as time-invariant, spatially normally distributed threshold noise. The contrast threshold is expressed as

C_{th} \sim N (C, {σ_{C}}^{2}),

(21)

where

σ_{C}

represents the standard deviation of the contrast threshold and C is the initial threshold,

C > 0

.

To determine the initial threshold C, the first five frames of the image sequence are differenced to generate the corresponding differential images:

Δ I (u) = I_{i + 1} (u) - I_{i} (u), i = 1, 2, 3, 4 .

(22)

The mean and standard deviation of the pixel intensities are then calculated for each differential image:

μ (u) = \sum_{i = 1}^{4} Δ I_{i} (u) / 4,

(23)

where

μ (u)

denotes the mean intensity of pixel

u = (x, y)

across the four differential images, and

σ (u)

represents the corresponding standard deviation.

The initial threshold is calculated by setting a threshold adjustment parameter:

C (u) = μ (u) + k \times σ (u),

(24)

where k is a constant used to adjust the initial threshold.

3.2.2. Grayscale Initialization and Logical Judgment

There are two types of grayscale intensities:

I (u, t)

from the scene simulation module and

I_{cur} (u, t)

updated by event triggering. At the initial time

t_{0}

,

I_{cur} (u, t) = I (u, t)

. At the next instant, the difference between

I (u, t)

and

I_{cur} (u, t)

is calculated as follows:

Δ I (u, t) = I_{cur} (u, t) - I (u, t) .

(25)

If the absolute value is below the threshold, no event is generated:

|Δ I (u, t)| < C_{th} (u),

(26)

where

C_{th} (u)

is the threshold for pixel u.

If the absolute value is above the threshold,

I (u, t)

is assigned to

I_{cur} (u, t)

:

|Δ I (u, t)| \geq C_{th} (u),

(27)

I_{cur} (u, t) = I (u, t) .

(28)

Furthermore, if the intensity change is positive, the event polarity is set to

p (u, t) = 1

. Otherwise, the polarity is negative,

p (u, t) = - 1

.

3.2.3. System Noise

Considering detector manufacturing variations and circuit leakage, system noise is inevitable. This module models both thermal-pixel noise and temporal noise.

Thermal-pixel noise comes from fixed pixels that trigger events continuously, even without radiation changes, and is assumed to follow a normal spatial distribution. Temporal noise fluctuates over time, resembling “flickering pixels” [50], producing irregular high-frequency triggers, and is also modeled as spatially normal.

The IR differential event stream records pixel position, a timestamp, and polarity according to Equation (13). For easier visualization and processing, the asynchronous stream is reconstructed into synchronous event frames. The pseudocode for generating the IR differential event stream is provided in Algorithm 1.

Algorithm 1 IR differential event stream generation

Input: $I (u, t)$
Output: $p (u, t)$ , $I_{cur} (u, t)$
1:
Calculate the contrast threshold $C_{th}$ via Equation (21).
2:
Initialization: $t = t_{0}$ , $I_{cur} (u, t) = I (u, t)$ .
3:
while $t < t_{max}$ do
4:
   Calculate $Δ I (u, t)$ via Equation (25).
5:
   if $|Δ I (u, t)| \geq C_{th} (u)$ then
6:
      $I_{cur} (u, t) = I (u, t)$
7:
     if $Δ I (u, t) > 0$ then
8:
         $p (u, t) = 1$
9:
     else
10:
         $p (u, t) = - 1$
11:
     end if
12:
   end if
13:
   Add the system noise.
14:
end while

4. Event Frame–Infrared Image Fusion Detection

Based on the preceding analysis, target detection algorithms using IR differential event frames face two main limitations:

Limited target events features: IR events encode thermal variations, with each pixel carrying only polarity information. Event data primarily capture target motion trajectories but lack grayscale and texture details, which may limit detection accuracy.
Background interference: In complex scenes, such as urban environments, background regions can trigger a high number of events. These background events can obscure target information, leading to increased false alarms.

To address these bottlenecks, this paper leverages the complementary strengths of event frames and IR images. Event frames provide high temporal resolution (TR) and suppress static backgrounds but suffer from low spatial resolution (SR) and lack grayscale and texture details. IR images, in contrast, offer higher SR, with grayscale values reflecting radiative properties such as temperature and reflectivity, effectively capturing target–background contrast, though their TR is lower.

Image fusion algorithms are typically classified into pixel-level, feature-level, and decision-level fusion based on processing hierarchy [44]. Pixel-level fusion preserves the most information but has high computational demands, poor real-time performance, and strong dependence on registration accuracy (Figure 5a). Feature-level fusion reduces data volume, improves fault tolerance, and balances information retention with computational efficiency (Figure 5b). Decision-level fusion processes the least information and has the highest fault tolerance but incurs relatively greater information loss (Figure 5c).

Considering the characteristics of these fusion methods and the practical needs of space-based aerial target perception, this paper adopts a combination of feature-level and decision-level fusion. As shown in Figure 5d, a task-driven fusion algorithm for target detection is proposed. By constructing a neural network, the algorithm fully exploits multi-dimensional target information and learns deep correlations between event frames and IR images within the same scene, enabling dynamic adaptation to scene variations and improving detection robustness and accuracy.

4.1. Registration of Event Frames and Infrared Images

Image registration is essential for image fusion. It aligns images with overlapping regions captured from different viewpoints, times, or sensors. Event frames and IR images originate from different sensors, and differences in imaging principles and target characteristics make registration challenging [51]. Accurate temporal and spatial alignment is required before fusion to ensure reliable results.

4.1.1. Temporal Alignment

According to Section 2 and Section 3, ITDD captures only dynamic changes, yielding smaller data volume, lower latency, and higher TR than traditional IR detection. This necessitates temporal alignment between event frames and IR images. As shown in Figure 6, with a frame rate ratio of

n : 1

(ITDD to IR), n event frames correspond to one IR image. As noted in [52], temporal energy accumulation enhances detection network performance. Accordingly, event frames are aligned with IR images based on the frame rate ratio and accumulated within the same interval to improve fusion. Based on the parameters of the near-infrared EBS SWIFT [23] and the IR EBS currently being developed by our research team, in this paper, the frame rate ratio of event frames to IR images is set to 16:1.

4.1.2. Spatial Alignment

Real IR EBSs are difficult to obtain, so this paper converts conventional IR image sequences into event frames using the algorithm from Section 3, ensuring both share the same field of view. Due to the ITDD characteristics, target events capture spatial and temporal variations. In Figure 6, target regions overlap between the IR image and accumulated event frame within the same interval. Therefore, temporal alignment also ensures spatial alignment. Furthermore, data-driven fusion algorithms tolerate camera calibration errors, and training on large datasets effectively improves generalization.

4.2. Multi-Modal Fusion Detection Network

Figure 7 illustrates the multi-modal fusion framework combining accumulated event frames and IR images, with fusion focused on target detection. The network is based on YOLOv8 and consists of a backbone, neck, and head.

Feature-level fusion of accumulated event frames and IR images is performed in the backbone network. The IR image (channel 1) and accumulated event frame (channel 2) form dual-channel inputs with six channels in total. Each channel is processed through convolution and a JSCA block [52] to extract features, which are then concatenated. The JSCA block uses a spatial-channel joint attention mechanism to enhance feature representation.

The combined and original features are fused three times to produce fused features 1–3. Fused feature 3 is input to the SPPF block, which fuses multi-scale features to improve robustness and generalization.

In the neck, upsampling, C2f, and CBS blocks integrate fused features 1 and 2 with intermediate feature maps to form composite representations. Feature maps at multiple scales are passed to detection heads to predict target positions and categories. Bounding box loss, classification loss, and distribution focal loss are used to optimize detection performance and generate final results.

5. Experiments and Results

5.1. Infrared Differential Event Stream Simulation

According to the event stream simulation method described in Section 3, the real and simulated IR image sequences are converted into event frames for visualization and performance evaluation.

5.1.1. Datasets and Experimental Settings

As shown in Table 2, six datasets were used for IR aerial target detection, including four real sequences and two synthetic ones. According to the International Society for Optics and Photonics, a small target is defined as one that occupies less than 0.15% of the image area. Real data were collected from satellite-borne LWIR and MWIR cameras (Seqs. 1 to 4), covering airliner observations in orbit, on runways, and over sea, sky, and port backgrounds, with 270 to 500 frames each. Synthetic datasets (Seqs. 5 and 6) were generated by combining real backgrounds with simulated targets of varying sizes, numbers, and speeds across diverse scenes such as cities, clouds, land, and wharfs. Each scenario contained 6 to 20 moving targets, with 300 or 500 frames per scenario. In total, 41,570 images were evaluated.

Note that real-scene images do not undergo preprocessing, such as non-uniformity correction or bad pixel compensation, which reduces the overall complexity of data processing.

5.1.2. Qualitative Demonstration

(1): Visualization of IR Differential Event Stream

Figure 8 shows the IR differential event streams generated from the six scenes in Table 2. Figure 8a shows a three-dimensional voxel grid of the events, visualizing the spatial (x–y) and temporal (time) dimensions. Positive- and negative-polarity events are marked in red and blue, respectively, with the corresponding original image displayed at the bottom. The sparse event distribution qualitatively illustrates the reduced data volume.

Figure 8b,c indicate that targets can be clearly distinguished from backgrounds through their trajectories. When grayscale variations are below the contrast threshold, as in the sea region of Seq. 2 and the sky region of Seq. 3, few events are triggered. Even with complex backgrounds, ITDD effectively captures the moving targets.

Overall, the ITDD method suppresses static backgrounds and regions with little IR radiation variation, thereby highlighting moving targets. In addition, the stationary bright point observed in Seq. 4 (Figure 8b) corresponds to thermal-pixel noise.

(2): Target Characteristics in Event Frames

For visualization, IR event frames were reconstructed from discrete events by timestamp alignment, with pixel values denoting event polarity. Taking Seq. 1 as an example, Figure 6a shows an LWIR observation of an aerial target under an urban background. Due to temperature and emissivity differences, the target exhibits lower radiance and grayscale intensity than the background (Figure 9a–d), appearing as a small, dim object with a local SNR of 3 and a

5 \times 5

scale. Numerous ground signals of similar size and intensity act as clutters. In Figure 9c,d, the target profile approximately follows a Gaussian distribution caused by optical attenuation.

Since IR events are triggered by temporal radiance changes, the event stream exhibits spatiotemporal properties. In staring mode, static backgrounds generate almost no events, while dynamic backgrounds produce edge-like events correlated with platform motion and clutter variations (Figure 9e). Noise appears as isolated, random signals lacking spatiotemporal coherence. In the target region (Figure 9f–h), both positive- and negative-polarity events are observed, reflecting motion and radiance contrast. These spatiotemporal characteristics of target, background, and noise were subsequently leveraged to suppress interference and enhance target extraction.

To obtain real-time event streams, software based on the simulation algorithm was developed to capture IR camera data and reconstruct event frames. Figure 10a shows an outdoor aerial detection experiment. As seen in Figure 10b,c, the target events conform to the expected polarity distribution. The IR image (Figure 10d), with lower TR, contains brighter distractors that obscure the target. Nevertheless, the reconstructed event frame (Figure 10e) yields a local SNR of 8, making the target more distinguishable.

5.1.3. Measurement Metrics

Given that most event sequences are synthesized from conventional IR images and cannot capture the full range of noise and variations inherent to real sensors, the evaluation of metrics such as data reduction and SNR improvements accordingly rests on simulation until validated with physical devices.

(1): Data Reduction

Grayscale images (8–32 bit) trade detail for data volume. In space-based detection, bandwidth limits require the balancing of real-time performance and transmission efficiency. ITDD mitigates this by removing redundant static information while retaining moving target signals. The raw data can be expressed as:

b_{frame} = N_{p} \times k \times N_{frame},

(29)

where

N_{p}

is the number of pixels on the focal plane array, k is the bit depth, and

N_{frame}

is the number of frames.

The data compression ratio between the raw images and events is defined as

ρ = \frac{b_{event}}{b_{frame}},

(30)

where

b_{event}

represents the data volume of the event.

The event-triggered rate is defined as

R = \frac{event rate}{N_{p}},

(31)

where event rate is the number of events per unit time.

After the frame-to-event conversion, the data volume is reduced by approximately three orders of magnitude, as shown in Table 3. The average event-triggered rate is 3.59%, confirming the sparsity of differential events. This demonstrates that ITDD effectively reduces data redundancy, thereby facilitating efficient and timely information transmission.

(2): Image Quality Evaluation

The image SNR is defined as an index for evaluating the image quality and relative intensity of noise:

SNR = μ / σ, \bar{SNR} = \frac{1}{N} \sum_{i = 1}^{N} {SNR}_{i},

(32)

where

μ

is the mean intensity of an image and

σ

is the standard deviation.

\bar{SNR}

is the average SNR of N images. A higher SNR indicates better image quality. To compare the noise levels between the original frames and the event frames, the SNR gain (SNRG) is defined as follows:

SNRG = \frac{\bar{{SNR}_{EF}}}{\bar{{SNR}_{o}}},

(33)

where

\bar{{SNR}_{o}}

and

\bar{{SNR}_{EF}}

are the average SNRs of the original and event frames, respectively.

Because the event polarity is represented by a single bit (positive or negative), unlike the higher bit depth of raw images, both the original images and event frames were normalized before SNR calculation. As shown in Table 3, the average SNR of the raw images is lower than that of the event frames, with an average SNRG of 4.21. This demonstrates the effectiveness of background suppression by ITDD.

(3): Target Detection Ability

The average probability of detection,

\bar{P_{d}}

, is used to discuss the target detectability.

\bar{P_{d}} = \frac{1}{N_{tar}} \sum_{i = 1}^{N_{tar}} \frac{N_{t}}{T_{t}},

(34)

where

N_{t}

and

T_{t}

are the number of detected and ground truth targets, respectively.

N_{tar}

is the number of frames containing targets in the field of view.

Note that the event frames used in the evaluation were not denoised or enhanced; thus, the target detection ability was lower than that after algorithm processing. For small, dim targets,

\bar{P_{d}}

reaches 93.97% (Table 3), showing ITDD’s sensitivity. However, for slow targets, such as Seq. 3 with a motion speed of no more than 1 pixel per frame, temporal changes may not trigger events, reducing

\bar{P_{d}}

to 80.96%. Because event triggering is pixel-based, system parameters, particularly thresholds, should be adaptively set according to scene conditions. Consequently, ITDD is less effective for stationary or very slow targets.

To address these inherent limitations, several strategies can be considered. At the system level, combining ITDD with frame cameras and LiDAR enables a multi-modal perception system with complementary information. At the processing level, integration or generative modeling can be employed to reconstruct static information from event streams. At the platform level, controlled motion, such as scanning or jittering, can activate event responses in otherwise static scenes.

(4): Noise Analysis

This section investigates the impact of event noise on various performance metrics. Since Seqs. 5 and 6 contain a large number of frames (approximately 40,000 in total), a reduced subset was used for efficient evaluation. Specifically, Set. 1 includes all data from Seqs. 1–4, along with two sequences (300 frames each) selected from Seq. 5 and two sequences (500 frames each) selected from Seq. 6. Sets. 2 and 3 were then derived by adding event noise of different intensities to Set. 1.

In Table 4,

σ_{C}

is the threshold standard deviation,

σ_{h}

is the thermal-pixel noise density, and

σ_{t}

is the temporal noise density. The analysis indicates that increasing the noise level introduces more spurious triggering events, leading to a decrease in

\bar{{SNR}_{EF}}

and SNRG. Although the R and

\bar{P_{d}}

slightly improve due to the higher number of triggered events, this comes at the cost of an increased false trigger rate.

Specifically, without added noise, Set. 1 achieves a

\bar{P_{d}}

of 95.43%, while after introducing noise, the

\bar{P_{d}}

of Sets. 2 and 3 rises marginally to 95.47% and 95.48%, accompanied by a corresponding increase in false trigger rate by 0.04% and 0.05%, respectively. Therefore, in the design of ITDD systems, it is crucial to effectively control both the number of noisy event electrons and the false trigger rate to maintain reliable detection performance.

5.2. Event Frame–Infrared Image Fusion Detection

5.2.1. Datasets and Evaluation Metrics

The experiments in this section are conducted on three datasets, comprising a total of 31,052 images, as detailed in Table 5 [52].

Dataset 1 is derived from an MWIR on-orbit camera, covering diverse scenarios such as urban areas, clouds, sea, and land backgrounds. The image resolution is

512 \times 512

pixels, and the target size is

11 \times 11

pixels. Each scene contains 1–3 aerial targets moving at 1–1.4 pixels per frame, with an average SNR of 2. To simulate relatively stationary observation conditions, the platform jitter speed is set to 1 pixel per frame. In total, the dataset contains 58 sequences, each consisting of approximately 200 frames.

Datasets 2 and 3 share backgrounds captured by the QLSAT-2 LWIR camera, with an image resolution of

256 \times 256

pixels. Both datasets cover representative scenes such as urban areas, land, and cloud backgrounds, each containing 67 sequences of 200 frames. The platform motion speed does not exceed three pixels per frame. In Dataset 2, the target size is

11 \times 11

pixels with an average SNR of 4. Dataset 3 (SITP-QLEF dataset) features smaller targets of

5 \times 5

pixels, with the same average SNR of 4, and serves as a publicly available space-based IR aerial target event dataset.

To prevent data leakage, all datasets are split by sequence into training, validation, and test sets at an approximate ratio of 8:1:1, ensuring that the training and test scenes share no overlapping backgrounds.

The commonly used evaluation metrics for detection are precision and recall:

Precision = \frac{TP}{TP + FP},

(35)

Recall = \frac{TP}{TP + FN},

(36)

where TP is true positive, FP is false positive, and FN is false negative.

The F1 score is defined as

F 1 = \frac{2 \times Precision \times Recall}{Precision + Recall} .

(37)

Intersection over Union (IoU) is used to evaluate both classification and localization by comparing predicted and true target regions:

IoU = \frac{B_{p} \cap B_{r}}{B_{p} \cup B_{r}},

(38)

where

B_{p}

and

B_{r}

denote the areas of the predicted and ground truth (GT) bounding boxes. IoU ranges from 0 to 1, with values closer to 1 indicating higher accuracy. A detection is considered positive when IoU exceeds a specified threshold. Since precision and recall often trade off, mean average precision (mAP) is used to evaluate their overall balance and serves as a key metric in deep learning-based target detection:

mAP = \frac{\sum_{i = 1}^{m} A P_{i}}{m},

(39)

where m represents the number of classes, and

A P

is the average precision, calculated as the area under the precision–recall curve. Evaluation metrics are mAP@0.5 and mAP@0.5:0.95, computed at IoU 0.5 and averaged over 0.5–0.95 in 0.05 increments.

DR and FR are common evaluation metrics, calculated as follows:

DR = \frac{N_{D}}{N_{T}},

(40)

FR = \frac{N_{F}}{N_{p}},

(41)

where

N_{D}

is the number of correctly detected targets,

N_{T}

is the total number of true targets,

N_{F}

is the number of false detections, and

N_{p}

is the total number of pixels in the detected image. The model’s efficiency is further assessed by parameter count (PC), inference time, and frames per second (FPS).

5.2.2. Quantitative Results

To comprehensively assess the performance of the proposed network, several fusion algorithms are selected for comparison, including methods that fuse data before detection and those driven by object detection.

LatLRR fuses images by decomposing them into low-rank (global) and salient (local) components, combining them, and reconstructing the final image [53]. FGMC uses sparse regularization with non-convex penalties to solve an inverse problem efficiently [54]. The fused images from both methods are then input to YOLOv8n for object detection.

DEYOLO is designed for object detection in low-light conditions using dual-input IR and visible-light images. It optimizes multi-modal semantic and spatial feature fusion through dual-semantic and dual-spatial enhancement modules, reducing interference between modalities. The bidirectional decoupling focus module expands the receptive field and strengthens feature extraction, improving detection accuracy and robustness [55].

The experimental algorithms include lightweight target detection based on event frames (LTDEFs) [52], LatLRR+YOLOv8n, FGMC+YOLOv8n, DEYOLO, and the proposed algorithm. All experiments were run on an Intel® Core™ i9-14900HX 2.20 GHz CPU and an NVIDIA GeForce RTX 4060 Laptop GPU. Neural networks were trained for 300 epochs with early stopping to prevent overfitting, using stochastic gradient descent with a learning rate of 0.01.

The efficiency comparison of two “fusion-then-detection” approaches is summarized as follows. For Comparison Method 2 (LatLRR+YOLOv8n), the average fusion times per frame on the three datasets are 56.13 s, 11.32 s, and 11.43 s, respectively, while the subsequent YOLOv8n detection achieves 293.89, 289.98, and 283.62 FPS. For Comparison Method 3 (FGMC+YOLOv8n), the per-frame fusion times are 6.00 s, 1.21 s, and 1.27 s, with detection speeds of 286.21, 297.27, and 291.48 FPS, respectively (Table 6, Table 7 and Table 8). These results indicate that the high computational cost of the fusion stage severely limits the system’s real-time processing capability. In contrast, the proposed end-to-end dual-modal fusion network achieves 248.83, 246.20, and 247.62 FPS on the three datasets, respectively. It maintains comparable detection performance while providing significantly higher overall efficiency than the stepwise fusion approaches.

Table 6 and Figure 11a show that the proposed algorithm achieves an mAP@0.5 of 99.00% and an F1 score of 98.24% on Dataset 1, outperforming all comparison methods. At an FR of

3.78 \times 10^{- 8}

, the DR reaches 96.27%. However, across IoU thresholds from 0.5 to 0.95, its mAP is lower than DEYOLO and LTDEF. All models exhibit a notable drop in localization accuracy (mAP@0.5:0.95), mainly due to the complexity of Dataset 1 and the characteristics of event frame annotations. With a low SNR of 2, targets are closely coupled with the background, making event triggering difficult, while platform jitter introduces additional background events. Occupying only 0.046% of the frame, targets provide limited structural or textural cues, and their discrete polarities with blurred boundaries (Figure 9f) deviate from the quasi-Gaussian features of IR images. Bounding boxes are often larger than the actual targets to capture all events, but this inherent ambiguity further constrains localization accuracy.

Figure 11a shows that the performance of the two combined algorithms is the lowest, even well below LTDEF. This indicates that for targets with an extremely low SNR in complex backgrounds, fusing images before detection may introduce more background interference than pure event-based detection, making accurate target identification more difficult.

The proposed algorithm uses 90.8%, 40.63%, and 83.14% of the parameters of YOLOv8n, DEYOLO, and LTDEF, respectively. This demonstrates that the network achieves high accuracy while remaining lightweight, making it suitable for deployment in diverse environments and helping to reduce device power consumption.

Figure 11 illustrates that all five detection methods achieve superior performance on Dataset 2 compared with the other datasets. As detailed in Figure 11b and Table 7, the proposed algorithm outperforms all comparison methods in mAP, F1 score, and DR. The mAP@0.5 improves by 8.27% and 3.65% compared to FGMC+YOLOv8n and LatLRR+YOLOv8n, respectively. The mAP@0.5:0.95 reaches 99.40%, surpassing LTDEF and DEYOLO by 10.69% and 2.05%. The F1 score achieves 99.00%, exceeding FGMC+YOLOv8n and LTDEF by 13.36% and 2.87%, respectively. Moreover, a comparison of mAP@0.5:0.95 between Datasets 1 and 2 demonstrates that a higher SNR substantially improves model robustness under strict IoU thresholds.

Figure 11c shows that both the proposed algorithm and DEYOLO achieve higher mAP and F1 scores than LTDEF and the two fusion-before-detection methods. Compared with LTDEF, at an FR of

1.97 \times 10^{- 5}

, the DR of the proposed algorithm reaches 99.31%, indicating that detection-driven fusion is more effective for weak moving targets in complex backgrounds than early fusion or pure event frame detection.

As shown in Table 8, the proposed algorithm attains an mAP@0.5 of 98.90%, which is 31.17% and 2.70% higher than FGMC+YOLOv8n and LTDEF, respectively. The mAP@0.5:0.95 reaches 95.00%, improving 45.71% and 25.66% over LatLRR+YOLOv8n and LTDEF. Its F1 score is 96.09%, exceeding FGMC+YOLOv8n and LTDEF by 36.98% and 3.29%. Across Table 6, Table 7 and Table 8, the proposed algorithm requires only 40.63% of DEYOLO’s parameters while achieving comparable or superior detection performance, demonstrating its overall efficiency and effectiveness. As shown in Figure 12, the precision–recall curves further highlight the superiority of our approach, exhibiting robust performance across a wide range of confidence thresholds.

The experiments show that exploiting complementary information between accumulated event frames and IR images improves target detection, making the proposed method effective for detecting dim targets in complex space-based backgrounds. The robustness of the proposed method to spatial misalignment was further evaluated, as detailed in Appendix A.

5.2.3. Visual Results

Figure 13 shows the detection performance of each fusion algorithm across the three datasets. LatLRR+YOLOv8n detects only one target in Datasets 1 and 3, indicating a low recall. FGMC+YOLOv8n correctly detects one target in Datasets 1 and 3 but produces many false positives, resulting in a high FR and low DR. Both combination algorithms perform better on Dataset 2. DEYOLO detects all targets in Datasets 1 and 2 but also generates many false positives, and it detects only one target in Dataset 3. In contrast, the proposed algorithm accurately detects all targets across all datasets, achieving the best performance among the compared methods.

5.2.4. Ablation Experiment

To assess the contributions of different modules in the proposed network, three key components were analyzed, as shown in Table 9: module A (IR image input), module B (accumulated event frame input), and module C (JSCA block). Dataset 3, which includes dim and small targets in complex backgrounds, is used as a representative example.

Using only IR images, the mAP@0.5 reaches 97.5%, but performance drops sharply at higher IoU thresholds due to the small target size and limited grayscale information. Using only accumulated event frames improves mAP@0.5:0.95 by 16.89% compared to IR-image-only input, although overall detection performance declines. Combining IR images and event frames increases the DR to 96.62%. Compared to single-channel inputs, fusion improves mAP@0.5 by 0.82% and 9.83%, mAP@0.5:0.95 by 37.41% and 17.56%, and F1 score by 0.22% and 11.17%, demonstrating that complementary multi-modal features enhance both classification and localization. Adding the JSCA block further improves all metrics, reducing false alarms and boosting detection accuracy.

5.3. Statistical Reliability and Confidence Interval Analysis

In space-based scenarios with few targets (Table 2 and Table 5), assessing the statistical variability of model performance is essential. This paper applied bootstrap resampling to compute 95% confidence intervals for key metrics, providing a distribution-free estimate of their variability. Specifically, for Seqs. 5 and 6 in Table 2, which have relatively small sample sizes (50 samples each), 50,000 bootstrap resamples were performed, with the mean and interval bounds reported in Table 10 and Table 11. For Datasets 1–3 in Table 5, which have larger test sets (1110, 1010, and 1088 samples, respectively), 10,000 resamples were used, with the results presented in Table 12.

Analysis shows that the mean performance metrics (Base) in Table 3, Table 6, Table 7, and Table 8 mostly fall within their respective confidence intervals, indicating stable and reliable estimates. Minor deviations in Dataset 1 arise from subtle differences between our mAP implementation and YOLO’s official code, but all other metrics remain consistent, confirming the overall robustness of the results.

6. Conclusions

This paper focuses on the ITDD mechanism modeling, parameter optimization, data simulation, and detection of dim, small moving targets, proposing corresponding algorithms and solutions to enhance space-based IR aerial target detection. The main contributions are as follows:

This paper proposes an ITDD model based on event triggering and develops an irradiance sensitivity model to characterize aerial target radiation in complex space-based scenarios. Key parameters, including threshold voltage and optical aperture, are analyzed to guide the design and optimization of photoelectric instruments.
Simulation results of the IR differential event stream demonstrate ITDD’s advantages in data compression, background suppression, and moving target sensitivity. Experiments show the event-triggered rate is reduced to 3.59%, data volume is compressed to one-thousandth, and SNR improves 4.21-fold. ITDD effectively eliminates static background redundancy, highlights moving targets, and, with low latency and high sensitivity, enables rapid capture of high-speed targets.
This paper proposes a detection-driven fusion network that integrates accumulated event frames with IR images. In space-based IR staring mode, the network achieves an mAP@0.5 of 99.0%, DR of 96.27%, and FR of $4 \times 10^{- 8}$ for $11 \times 11$ targets with a mean SNR of 2. For $11 \times 11$ targets with a mean SNR of 4 and platform motion under 3 pixels per frame, performance improves to 99.5% mAP@0.5, 98.33% DR, and $1.4 \times 10^{- 7}$ FR. Even for smaller $5 \times 5$ targets with a mean SNR of 4 and jitter under 3 pixels per frame, it maintains a 98.9% mAP@0.5, 99.31% DR, and $1.97 \times 10^{- 5}$ FR. These results demonstrate that multi-modal fusion of event and IR data substantially improves dim moving target detection in complex space-based backgrounds.
Future work will address annotation challenges in event frames by combining bounding boxes, polygons, and orientation features for small, dim targets while incorporating annotation uncertainty modeling to improve localization robustness. As this paper relies on simulated data with perfect spatiotemporal alignment and locally curated datasets, its applicability to real multi-sensor systems is limited, where registration errors and real-world variability may affect performance. To overcome these limitations, we plan to design a learnable feature registration module to compensate for inter-sensor misalignments, apply random spatial transformations during training to enhance robustness, and develop an ITDD demonstration system to collect real data for validating models and optimizing fusion detection algorithms under practical conditions.

Author Contributions

Conceptualization, P.R. and L.G.; methodology, L.G. and X.C.; software, L.G. and Z.Z.; validation, L.G. and C.G.; formal analysis, C.G.; investigation, L.G.; resources, P.R.; data curation, X.C.; writing—original draft preparation, L.G.; writing—review and editing, P.R. and X.C.; visualization, L.G. and Z.Z.; supervision, X.C.; project administration, P.R.; funding acquisition, P.R. and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (Grant No.62175251), Talent Plan of Shanghai Branch, Chinese Academy of Sciences (No. CASSHB-QNPD-2023-007), and Shanghai 2022 “Science and Technology Innovation Action Plan” Outstanding Academic/Technical Leader Program Project (No. 22XD1404100).

Data Availability Statement

The SITP-QLEF dataset is available at https://github.com/Joyce-Lan88/SITP-QLEF (accessed on 27 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Effect of Spatial Misalignment

In practice, multi-sensor systems are often affected by registration mismatches, which can degrade fusion quality. Evaluating the robustness of the proposed method described in Section 4.2 to such misalignments is therefore essential. To further investigate this aspect, a sensitivity analysis was conducted on spatial misalignment using the parameters already trained on Datasets 1–3. Specifically, rotational offsets were artificially introduced into the IR images of the test sets, simulating clockwise and counterclockwise rotations of ±0.5°, ±1°, ±2°, and ±3°. Qualitative evaluations were then performed based on the resulting detection outcomes.

The results are summarized in Figure A1. As shown, the proposed method maintains high detection accuracy under small rotations (

{\leq 1}^{°}

), demonstrating strong robustness to slight spatial perturbations. For Datasets 1 and 2, performance remains generally stable even with rotations up to

3^{°}

, although occasional missed detections begin to occur. In contrast, Dataset 3 exhibits noticeable misses and false alarms at

- 2^{°}

, likely due to its smaller target size, which makes it more sensitive to positional displacement.

Although the present analysis does not encompass all potential forms or degrees of spatial misalignment, it provides a clear indication of how registration accuracy influences fusion-based detection performance. The primary experiments, as illustrated in the first row of Figure A1, were conducted under an assumption of perfect spatiotemporal alignment between event and IR modalities, wherein all targets were accurately detected. While this assumption facilitates controlled validation, it simultaneously delineates a limitation of the current study.

Figure A1. Effect of spatial misalignment on algorithm robustness.

References

Chen, L.; Rao, P.; Chen, X.; Huang, M. Local Spatial–Temporal Matching Method for Space-Based Infrared Aerial Target Detection. Sensors 2022, 22, 1707. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Zhao, H.; Gu, X.; Yang, L.; Bai, B.; Jia, G.; Li, Z. Analysis of space-based observed infrared characteristics of aircraft in the air. Remote Sens. 2023, 15, 535. [Google Scholar] [CrossRef]
Zhu, H.; Li, Y.; Hu, T.; Rao, P. Key parameters design of an aerial target detection system on a space-based platform. Opt. Eng. 2018, 57, 023107. [Google Scholar] [CrossRef]
Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128 × 128 120 dB 30 mW asynchronous vision sensor that responds to relative intensity change. In Proceedings of the 2006 IEEE International Solid State Circuits Conference-Digest of Technical Papers, San Francisco, CA, USA, 6–9 February 2006; pp. 2060–2069. [Google Scholar]
Lichtensteiner, P.; Posch, C.; Delbruck, T. A 128 × 128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
Posch, C.; Matolin, D.; Wohlgenannt, R. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J. Solid-State Circuits 2010, 46, 259–275. [Google Scholar] [CrossRef]
Brandli, C.; Berner, R.; Yang, M.; Liu, S.C.; Delbruck, T. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits 2014, 49, 2333–2341. [Google Scholar] [CrossRef]
Scheerlinck, C.; Barnes, N.; Mahony, R. Continuous-time intensity estimation using event cameras. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 308–324. [Google Scholar]
Cadena, P.R.G.; Qian, Y.; Wang, C.; Yang, M. Spade-e2vid: Spatially-adaptive denormalization for event-based video reconstruction. IEEE Trans. Image Process. 2021, 30, 2488–2500. [Google Scholar] [CrossRef]
Kodama, K.; Sato, Y.; Yorikado, Y.; Berner, R.; Mizoguchi, K.; Miyazaki, T.; Tsukamoto, M.; Matoba, Y.; Shinozaki, H.; Niwa, A.; et al. 1.22 μm 35.6 Mpixel RGB hybrid event-based vision sensor with 4.88 μm-pitch event pixels and up to 10K event frame rate by adaptive control on event sparsity. In Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 19–23 February 2023; pp. 92–94. [Google Scholar]
Schon, G.; Bourke, D.; Doisneau, P.A.; Finateu, T.; Gonzalez, A.; Hanajima, N.; Hitana, T.; Van Vuuren, L.J.; Kadry, M.; Laurent, C.; et al. A 320 × 320 1/5” BSI-CMOS stacked event sensor for low-power vision applications. In Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 11–16 June 2023; pp. 1–2. [Google Scholar]
Iddrisu, K.; Shariff, W.; Corcoran, P.; O’Connor, N.; Lemley, J.; Little, S. Event Camera based Eye Motion Analysis: A survey. IEEE Access 2024, 12, 136783–136804. [Google Scholar] [CrossRef]
Li, Y.; Moreau, J.; Ibanez-Guzman, J. Emergent visual sensors for autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4716–4737. [Google Scholar] [CrossRef]
Shariff, W.; Dilmaghani, M.S.; Kielty, P.; Moustafa, M.; Lemley, J.; Corcoran, P. Event Cameras in Automotive Sensing: A Review. IEEE Access 2024, 12, 51275–51306. [Google Scholar] [CrossRef]
Afshar, S.; Nicholson, A.P.; Van Schaik, A.; Cohen, G. Event-based object detection and tracking for space situational awareness. IEEE Sens. J. 2020, 20, 15117–15132. [Google Scholar] [CrossRef]
Afshar, S.; Ralph, N.; Xu, Y.; Tapson, J.; Schaik, A.v.; Cohen, G. Event-based feature extraction using adaptive selection thresholds. Sensors 2020, 20, 1600. [Google Scholar] [CrossRef]
McHarg, M.G.; Balthazor, R.L.; McReynolds, B.J.; Howe, D.H.; Maloney, C.J.; O’Keefe, D.; Bam, R.; Wilson, G.; Karki, P.; Marcireau, A.; et al. Falcon Neuro: An event-based sensor on the International Space Station. Opt. Eng. 2022, 61, 085105. [Google Scholar] [CrossRef]
Dong, S.; Huang, T.; Tian, Y. Spike camera and its coding methods. arXiv 2021, arXiv:2104.04669. [Google Scholar] [CrossRef]
Zhu, L.; Yan, W.; Chang, Y.; Tian, Y.; Huang, H. Simultaneous Learning Intensity and Optical Flow from High-speed Spike Stream. IEEE Trans. Circuits Syste. Video Technol. 2024, 35, 5126–5139. [Google Scholar] [CrossRef]
Posch, C.; Matolin, D.; Wohlgenannt, R.; Maier, T.; Litzenberger, M. A microbolometer asynchronous dynamic vision sensor for LWIR. IEEE Sens. J. 2009, 9, 654–664. [Google Scholar] [CrossRef]
Cox, J.C. Integration of Event-Based Sensing with Traditional Imaging Approaches. Ph.D. Thesis, The University of Arizona, Tucson, AZ, USA, 2023. [Google Scholar]
Jakobson, C.; Fraenkel, R.; Ari, N.B.; Dobromislin, R.; Shiloah, N.; Argov, T.; Freiman, W.; Zohar, G.; Langof, L.; Ofer, O.; et al. Event-based SWIR sensor. In Infrared Technology and Applications XLVIII; SPIE: Bellingham, WA, USA, 2022; Volume 12107, pp. 7–12. [Google Scholar]
Fraenkel, R.; Jakobson, C.; Ari, N.B.; Dobromislin, R.; Shiloah, N.; Goldfarb, M.; Langof, L.; Ofer, O.; Elishkov, R.; Shunem, E.; et al. SWIFT EI: Event-based SWIR sensor for tactical applications. In Infrared Technology and Applications XLIX; SPIE: Bellingham, WA, USA, 2023; Volume 12534, pp. 12–19. [Google Scholar]
Su, Y.; Li, H.; Jian, Z.; Li, K.; Hu, A.; Liu, D. An 8 × 8 Event-Based Vision Infrared Sensor ROIC With Adaptive Threshold Voltage Generation Circuit. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 4708–4712. [Google Scholar] [CrossRef]
He, M.; Lu, W.; Zhang, Y.; Zhuo, Y.; Yu, S.; Chen, Z. Event-Triggered Infrared Focal Plane Array Readout Circuit and Infrared Detection Device. Patent CN115086584B, 11 June 2024. [Google Scholar]
Wu, Y.; Yang, R.; Lv, Q.; Tang, Y.; Zhang, C.; Liu, S. Infrared and Visible Image Fusion: Statistical Analysis, Deep Learning Approaches and Future Prospects. Laser Optoelectron. Progr. 2024, 61, 42–60. [Google Scholar]
Cormack, D.; Schlangen, I.; Hopgood, J.R.; Clark, D.E. Joint registration and fusion of an infrared camera and scanning radar in a maritime context. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 1357–1369. [Google Scholar] [CrossRef]
Lin, S.; Yang, F.; Zhou, X.; Li, W. Progress on Fusion Technology for Imaging of Dual-color Mid-wave Infrared. Infrared Technol. 2012, 34, 217–223. [Google Scholar]
Luo, L. Research and Implementation of Dual-Band Infrared Image Fusion Processing Technology. Master’s Thesis, Nanjing University of Science and Technology, Nanjing, China, 2020. [Google Scholar]
Liu, S.; Huang, X.; Liu, L.; Xie, Y.; Zhang, J.; Yang, J. Infrared and Visible Image Fusion Under Photoelectric Loads. Comput. Eng. Appl. 2024, 60, 28–39. [Google Scholar]
Zhang, X.; Dai, X.; Zhang, X.; Jin, G. Joint principal component analysis and total variation for infrared and visible image fusion. Infrared Phys. Technol. 2023, 128, 104523. [Google Scholar] [CrossRef]
Li, Y.; Liu, G.; Bavirisetti, D.P.; Gu, X.; Zhou, X. Infrared-visible image fusion method based on sparse and prior joint saliency detection and LatLRR-FPDE. Digit. Signal Process. 2023, 134, 103910. [Google Scholar] [CrossRef]
Bhatnagar, G.; Wu, Q.J.; Liu, Z. Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Trans. Multimed. 2013, 15, 1014–1024. [Google Scholar] [CrossRef]
Tang, L.; Zhang, H.; Xu, H.; Ma, J. Deep learning-based image fusion: A survey. J. Image Graph. 2023, 28, 3–36. [Google Scholar] [CrossRef]
Wang, Z.; Wang, J.; Wu, Y.; Xu, J.; Zhang, X. UNFusion: A unified multi-scale densely connected network for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 3360–3374. [Google Scholar] [CrossRef]
Liu, Y.; Chen, X.; Cheng, J.; Peng, H.; Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavel. Multiresolut. Inf. Process. 2018, 16, 1850018. [Google Scholar] [CrossRef]
Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
Li, H.; Wu, X.J. CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach. Inf. Fusion 2024, 103, 102147. [Google Scholar] [CrossRef]
Kou, T.; Zhou, Z.; Liu, H.; Yang, Y. Multi-band composite detection and recognition of aerial infrared point targets. Infrared Phys. Technol. 2018, 94, 102–109. [Google Scholar] [CrossRef]
Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4003–4011. [Google Scholar]
Wang, D.; Liu, J.; Liu, R.; Fan, X. An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. Inf. Fusion 2023, 98, 101828. [Google Scholar] [CrossRef]
Sun, S.; Wenqi, R.; Cao, X. Event-fusion-based Spatial Attentive and Temporal Memorable Network for Video Deraining. J. Softw. 2024, 35, 2220–2234. [Google Scholar]
Yang, F.; Guo, H.; Yang, F. Study of Evaluation Methods on Effect of Pixel-level Image Fusion. J. Test Meas. Technol. 2002, 276–279. [Google Scholar]
Zhang, H.; Yang, H.; Zheng, F.; Wang, J.; Zhou, X.; Wang, H.; Xu, Y. Review of Feature-Level Infrared and Visible Image Fusion. Comput. Eng. Appl. 2024, 60, 17–31. [Google Scholar]
Guo, L.; Rao, P.; Li, Y. A characterization method of aerial target and complex background based on infrared differential analytic factor. In Proceedings of the Earth and Space: From Infrared to Terahertz (ESIT 2022), Nantong, China, 17–19 September 2022; Volume 12505, pp. 616–628. [Google Scholar]
Guo, L.; Rao, P.; Chen, X.; Li, Y. Infrared differential detection and band selection for space-based aerial targets under complex backgrounds. Infrared Phys. Technol. 2024, 138, 105172. [Google Scholar] [CrossRef]
Rao, G.; Mahulikar, S. Effect of atmospheric transmission and radiance on aircraft infared signatures. J. Aircr. 2005, 42, 1046–1054. [Google Scholar] [CrossRef]
Mahulikar, S.P.; Sonawane, H.R.; Rao, G.A. Infrared signature studies of aerospace vehicles. Progr. Aerosp. Sci. 2007, 43, 218–245. [Google Scholar] [CrossRef]
Huang, Y.; Cui, X. Modulation of infrared signatures based on anisotropic emission behavior of aircraft skin. Opt. Eng. 2015, 54, 123112. [Google Scholar] [CrossRef]
Gross, W.; Hierl, T.; Schulz, M.J. Correctability and long-term stability of infrared focal plane arrays. Opt. Eng. 1999, 38, 862–869. [Google Scholar] [CrossRef]
Yang, J.; Ding, Z.; Chen, X. Review of Key Technologies of Image Registration. Navig. Control 2020, 19, 77–84. [Google Scholar]
Guo, L.; Rao, P.; Gao, C.; Su, Y.; Li, F.; Chen, X. Adaptive Differential Event Detection for Space-Based Infrared Aerial Targets. Remote Sens. 2025, 17, 845. [Google Scholar] [CrossRef]
Li, H.; Wu, X. Infrared and visible image fusion using latent low-rank representation. arXiv 2018, arXiv:1804.08992. [Google Scholar]
Anantrasirichai, N.; Zheng, R.; Selesnick, I.; Achim, A. Image fusion via sparse regularization with non-convex penalties. Pattern Recognit. Lett. 2020, 131, 355–360. [Google Scholar] [CrossRef]
Chen, Y.; Wang, B.; Guo, X.; Zhu, W.; He, J.; Liu, X.; Yuan, J. DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection. In Proceedings of the International Conference on Pattern Recognition, Viña del Mar, Chile, 1–4 December 2025; pp. 236–252. [Google Scholar]

Figure 1. Illustration of the ITDD system for space-based aerial targets.

Figure 2. Detectable irradiance versus optical aperture: (a) MWIR band and (b) LWIR band. Detectable irradiance versus optical transmittance: (c) MWIR band and (d) LWIR band. Detectable irradiance versus integration time: (e) MWIR band and (f) LWIR band. Detectable irradiance versus capacitance: (g) MWIR band and (h) LWIR band.

Figure 3. Joint influence of integration time and optical aperture on irradiance. (a,c) Three-dimensional characterization in the MWIR and LWIR bands, respectively, when the threshold voltage is 70 mV. (b,d) Two-dimensional projection in the MWIR and LWIR bands, respectively, when the threshold voltage is 70–350 mV.

Figure 4. Modular overview of the IR differential event stream simulation.

Figure 5. Image fusion algorithm classification: (a) pixel-level fusion, (b) feature-level fusion, (c) decision-level fusion, and (d) fusion driven by target detection tasks.

Figure 6. Temporal registration of event frames, accumulated event frames, and IR images.

Figure 7. Overall structure of accumulated event frame–IR image fusion network.

Figure 8. Visualization of differential event stream. (a) Three-dimensional voxel grid. (b,c) Zoomed target views. Red and blue points indicate positive- and negative-polarity events.

Figure 9. Comparison between event frame and IR image. (a,b) IR image and the zoomed-in view of the space-based aerial target. (c,d) Grayscale intensity distribution of the target. (e,f) IR event frame and zoomed-in view of the aerial target. (g,h) Polarity distribution of the target.

Figure 10. Real-time acquisition of IR event frames. (a) Experimental scenario for aerial target observation with the sky as the background. (b,c) Polarity distribution of the event frame with labeled target events. (d,e) Grayscale distribution of an IR image with a labeled target.

Figure 11. Comparison of mAP and F1 across target detection algorithms on three datasets: (a) Dataset 1, (b) Dataset 2, and (c) Dataset 3.

Figure 12. Precision–recall curves’ comparison: (a) Dataset 1, (b) Dataset 2, and (c) Dataset 3.

Figure 13. Visualization of fused results.

Table 1. Parameters of a space-based IR detection system.

Symbol	Quantity	Value
$c_{int}$ /fF	Capacitance	30
$\bar{λ}$ /µm	Center wavelength	4/9
D/mm	Optical aperture	200
$τ_{o}$	Optical transmittance	0.6
$η$	Quantum efficiency	0.7
$Δ t_{int}$ /µs	Effective integration time	80
$V_{\max}$ /V	Maximum voltage	1.4
l/km	Detection distance	500

Table 2. Details of IR target datasets used in the experiments.

	Image Size	Frame	Bit Depth	Target Scale	Target Number	Background	Scene	Data Sources
Seq. 1	$320 \times 256$	$270 \times 1$	32	$5 \times 5$	1	City, clouds	Real	QLSAT-2
Seq. 2	$320 \times 256$	$300 \times 1$	16	$3 \times 3$	2	City, sea	Real	On-orbit
Seq. 3	$320 \times 256$	$500 \times 1$	16	$8 \times 4$	1	Sea, sky, wharf	Real	Ground-based
Seq. 4	$640 \times 512$	$500 \times 1$	8	$32 \times 12$	1	Sky	Real	Ground-based
Seq. 5	$256 \times 256$	$300 \times 50$	16	$3 \times 3$ – $5 \times 5$	6–20	City, sea, wharf, clouds, suburbs	Synthetic	QLSAT-2
Seq. 6	$512 \times 512$	$500 \times 50$	16	$3 \times 3$ – $5 \times 5$	6–20	City, sea, wharf, clouds, suburbs	Synthetic	On-orbit

Table 3. Quantitative indexes’ comparison between the raw images and event frames.

	$b_{frame}$ (10⁸)	$b_{event}$ (10⁶)	$ρ$ (10⁻³)	R (%)	$\bar{{SNR}_{o}}$	$\bar{{SNR}_{EF}}$	SNRG	$\bar{P_{d}}$ (%)
Seq. 1	7.08	1.32	1.86	5.98	3.63	9.97	2.75	100
Seq. 2	3.93	2.87	7.30	11.71	2.45	3.36	1.37	99.28
Seq. 3	6.55	0.50	0.76	1.21	1.98	10.66	5.39	80.96
Seq. 4	13.1	0.47	0.36	0.29	1.87	19.08	10.20	100.00
Seq. 5	3.15	0.23	0.74	1.19	3.45	9.87	2.94	93.03
Seq. 6	21.0	1.49	0.71	1.14	4.25	10.40	2.58	90.54
Average	9.13	1.15	1.95	3.59	2.94	10.56	4.21	93.97

Table 4. Quantitative analysis of the effects of noise.

	$σ_{C}$ (%)	$σ_{h}$ (%)	$σ_{t}$ (%)	$b_{event}$ (10⁶)	$ρ$ (10⁻³)	R (%)	$\bar{{SNR}_{EF}}$	SNRG	$\bar{P_{d}}$ (%)
Set. 1	0	0	0	1.02	1.64	2.97	10.39	3.84	95.43
Set. 2	0.03	0.01	0.003	1.03	1.65	2.99	10.25	3.78	95.47
Set. 3	0.3	0.1	0.03	1.11	1.72	3.10	9.50	3.48	95.48

Table 5. Parameters of datasets.

Dataset	SNR	Target Size	Target Number	Target Speed (Pixel/Frame)	Platform Speed (Pixel/Frame)	Resolution	Training	Validation	Test	Total
1	2	$11 \times 11$	1–3	1–1.4	1	$512 \times 512$	8402	1110	1110	10,622
2	4	$11 \times 11$	1–3		0–3	$256 \times 256$	7953	1062	1010	10,025
3	4	$5 \times 5$	1–3				8123	1194	1088	10,405

Table 6. The test results on Dataset 1.

	mAP@0.5 (%)	mAP@0.5:0.95 (%)	F1 (%)	DR (%)	FR ( $10^{- 6}$ )	PC (M)	Inference (ms)	FPS
LTDEF	96.80	85.20	93.62	97.06	7.37	2.93	2.4	285.14
LatLRR+YOLOv8n	41.50	24.40	43.47	35.60	1.66	2.68	2.5	293.89
FGMC+YOLOv8n	39.60	26.70	41.40	57.25	16.06	2.68	2.5	286.21
DEYOLO	98.60	87.60	96.94	97.64	1.07	6.00	7.1	122.16
ours	99.00	76.70	98.24	96.27	0.04	2.44	3.1	248.83

The best values are indicated in bold.

Table 7. The test results on Dataset 2.

	mAP@0.5 (%)	mAP@0.5:0.95 (%)	F1 (%)	DR (%)	FR ( $10^{- 6}$ )	PC (M)	Inference (ms)	FPS
LTDEF	96.50	89.80	96.24	95.54	24.33	2.93	1.0	573.06
LatLRR+YOLOv8n	96.00	93.70	91.69	90.05	2.19	2.68	2.6	289.98
FGMC+YOLOv8n	91.90	88.10	87.33	84.19	3.29	2.68	2.5	297.27
DEYOLO	98.90	97.40	96.44	97.44	2.27	6.00	7.1	121.33
ours	99.50	99.40	99.00	98.33	0.14	2.44	3.1	246.20

The best values are indicated in bold.

Table 8. The test results on Dataset 3.

	mAP@0.5 (%)	mAP@0.5:0.95 (%)	F1 (%)	DR (%)	FR ( $10^{- 6}$ )	PC (M)	Inference (ms)	FPS
LTDEF	96.30	75.60	93.02	97.53	43.24	2.93	0.8	632.41
LatLRR+YOLOv8n	76.40	65.20	71.42	65.99	4.85	2.68	2.5	283.62
FGMC+YOLOv8n	75.40	65.30	70.15	61.96	3.44	2.68	2.6	291.48
DEYOLO	98.80	96.80	96.19	96.39	1.54	6.00	7.0	124.19
ours	98.90	95.00	96.09	96.62	1.85	2.44	3.1	247.62

The best values are indicated in bold.

Table 9. Ablation experiment results on Dataset 3.

A	B	C	mAP@0.5 (%)	mAP@0.5:0.95 (%)	F1 (%)	DR (%)	FR ( $10^{- 6}$ )
✓	×	×	97.50	68.70	95.47	94.27	1.02
×	✓	×	89.50	80.30	86.07	80.93	1.81
✓	✓	×	98.30	94.40	95.68	96.62	2.27
✓	✓	✓	98.90	95.00	96.09	96.62	1.85

The best values are indicated in bold.

Table 10. Bootstrap-based 95% confidence intervals of metrics on Seq. 5.

	$b_{event}$ (10⁶)	$ρ$ (10⁻³)	R (%)	$\bar{{SNR}_{o}}$	$\bar{{SNR}_{EF}}$	SNRG	$\bar{P_{d}}$ (%)
CI	0.23 [0.21, 0.25]	0.73 [0.67, 0.80]	1.18 [1.08,1.28]	3.46 [3.34, 3.58]	9.91 [9.47, 10.36]	2.95 [2.74, 3.16]	93.08 [92.39, 93.79]
Base	0.23	0.74	1.19	3.45	9.87	2.94	93.03

Table 11. Bootstrap-based 95% confidence intervals of metrics on Seq. 6.

	$b_{event}$ (10⁶)	$ρ$ (10⁻³)	R (%)	$\bar{{SNR}_{o}}$	$\bar{{SNR}_{EF}}$	SNRG	$\bar{P_{d}}$ (%)
CI	1.49 [1.33, 1.66]	0.71 [0.63, 0.79]	1.14 [1.01, 1.27]	4.24 [3.98, 4.51]	10.41 [9.74, 11.16]	2.59 [2.35, 2.85]	90.61 [89.76, 91.46]
Base	1.49	0.71	1.14	4.25	10.40	2.58	90.54

Table 12. Bootstrap-based 95% confidence intervals of detection metrics on Datasets 1–3.

	CI (mAP@0.5)	Base (mAP@0.5)	CI (mAP@0.5:0.95)	Base (mAP@0.5:0.95)
Dataset 1	97.90% [97.52%, 98.30%]	99.00%	74.10% [73.68%, 74.57%]	76.70%
Dataset 2	99.33% [99.10%, 99.61%]	99.50%	99.14% [98.89%, 99.41%]	99.40%
Dataset 3	98.76% [98.47%, 99.07%]	98.90%	94.79% [94.41%, 95.20%]	95.00%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Guo, L.; Chen, X.; Gao, C.; Zhao, Z.; Rao, P. Infrared Temporal Differential Perception for Space-Based Aerial Targets. Remote Sens. 2025, 17, 3487. https://doi.org/10.3390/rs17203487

AMA Style

Guo L, Chen X, Gao C, Zhao Z, Rao P. Infrared Temporal Differential Perception for Space-Based Aerial Targets. Remote Sensing. 2025; 17(20):3487. https://doi.org/10.3390/rs17203487

Chicago/Turabian Style

Guo, Lan, Xin Chen, Cong Gao, Zhiqi Zhao, and Peng Rao. 2025. "Infrared Temporal Differential Perception for Space-Based Aerial Targets" Remote Sensing 17, no. 20: 3487. https://doi.org/10.3390/rs17203487

APA Style

Guo, L., Chen, X., Gao, C., Zhao, Z., & Rao, P. (2025). Infrared Temporal Differential Perception for Space-Based Aerial Targets. Remote Sensing, 17(20), 3487. https://doi.org/10.3390/rs17203487

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Infrared Temporal Differential Perception for Space-Based Aerial Targets

Abstract

Highlights

Abstract

1. Introduction

1.1. Biomimetic Vision Perception Technology

1.2. Multi-Modal Data Fusion Method

2. Infrared Temporal Differential Detection (ITDD)

2.1. Mathematical Model of ITDD

2.2. Detection Sensitivity Model of ITDD

2.3. Factors Affecting ITDD Sensitivity

3. Infrared Differential Event Stream Simulation

3.1. Detection Scene Image Generation

3.2. Infrared Differential Event Stream Generation

3.2.1. Contrast Threshold and Threshold Noise

3.2.2. Grayscale Initialization and Logical Judgment

3.2.3. System Noise

4. Event Frame–Infrared Image Fusion Detection

4.1. Registration of Event Frames and Infrared Images

4.1.1. Temporal Alignment

4.1.2. Spatial Alignment

4.2. Multi-Modal Fusion Detection Network

5. Experiments and Results

5.1. Infrared Differential Event Stream Simulation

5.1.1. Datasets and Experimental Settings

5.1.2. Qualitative Demonstration

5.1.3. Measurement Metrics

5.2. Event Frame–Infrared Image Fusion Detection

5.2.1. Datasets and Evaluation Metrics

5.2.2. Quantitative Results

5.2.3. Visual Results

5.2.4. Ablation Experiment

5.3. Statistical Reliability and Confidence Interval Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Effect of Spatial Misalignment

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI