Article

A Low-Latency Dynamic Object Detection Algorithm Fusing Depth and Events

1 GNSS Research Center, Wuhan University, Wuhan 430079, China
2 School of Electronic Information, Wuhan University, Wuhan 430072, China
3 School of Computer Science, Wuhan University, Wuhan 430072, China
4 Hubei Luojia Laboratory, Wuhan 430079, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2025, 9(3), 211; https://doi.org/10.3390/drones9030211
Submission received: 16 February 2025 / Revised: 11 March 2025 / Accepted: 13 March 2025 / Published: 15 March 2025

Abstract

Existing RGB image-based object detection methods achieve high accuracy when objects are static or in quasi-static conditions but demonstrate degraded performance with fast-moving objects due to motion blur artifacts. Moreover, state-of-the-art deep learning methods, which rely on RGB images as input, necessitate training and inference on high-performance graphics cards. These cards are not only bulky and power-hungry but also challenging to deploy on compact robotic platforms. Fortunately, the emergence of event cameras, inspired by biological vision, provides a promising solution to these limitations. These cameras offer low latency, minimal motion blur, and non-redundant outputs, making them well suited for dynamic obstacle detection. Building on these advantages, a novel methodology was developed through the fusion of events with depth to address the challenge of dynamic object detection. Initially, an adaptive temporal sampling window was implemented to selectively acquire event data and supplementary information, contingent upon the presence of objects within the visual field. Subsequently, a warping transformation was applied to the event data, effectively eliminating artifacts induced by ego-motion while preserving signals originating from moving objects. Following this preprocessing stage, the transformed event data were converted into an event queue representation, upon which denoising operations were performed. Ultimately, object detection was achieved through the application of image moment analysis to the processed event queue representation. The experimental results show that, compared with the current state-of-the-art methods, the proposed method has improved the detection speed by approximately 20% and the accuracy by approximately 5%. To substantiate real-world applicability, the authors implemented a complete obstacle avoidance pipeline, integrating our detector with planning modules and successfully deploying it on a custom-built quadrotor platform. Field tests confirm reliable avoidance of an obstacle approaching at approximately 8 m/s, thereby validating practical deployment potential.

1. Introduction

The autonomous navigation of robots [1] is a popular and challenging research area with a broad range of applications, including freight logistics, aerial photography, and disaster relief. A fundamental and crucial function of an autonomous navigation system is the ability to detect and avoid obstacles, ensuring the safety of both the robot and its surroundings. Currently, both academic and industrial communities have designed and manufactured a plethora of robots with rich features. These robots are equipped with advanced sensors and algorithms to perceive their environment and make informed navigation decisions. However, when confronted with fast-moving obstacles, nearly all existing consumer-grade products have underperformed [2].
This challenge involves issues related to perception, planning, and control, with particular emphasis being placed on the perceptual component. This focused approach was justified by two principal considerations: Firstly, within the modular architecture of perception–planning–control systems, it was empirically determined that the perceptual algorithms demonstrate the most significant computational resource utilization while simultaneously introducing the greatest latency. Secondly, the cascading effects of compromised perceptual accuracy were identified as particularly detrimental, wherein erroneous perception inevitably leads to suboptimal planning decisions, which subsequently result in inadequate control execution, ultimately culminating in collision scenarios. Therefore, the authors concentrate on enhancing the detection of dynamic objects. Unlike static or quasi-static objects, detecting fast-moving objects hinges on low-latency sensors and algorithms [3,4,5]. However, traditional frame-based cameras paired with deep learning-based detection algorithms cannot fulfill this task for the following reasons: (i) Traditional cameras require tens of milliseconds for exposure. (ii) Deep learning-based object detection methods need hundreds of milliseconds for inference. Moreover, for obstacle avoidance tasks, the category information of obstacles is redundant, which introduces additional latency. (iii) Traditional cameras suffer from motion blur, and blurry images fed into the network can potentially cause missed detections, leading to collisions. (iv) When performing obstacle avoidance tasks, multiple image frames are usually required to predict the future trajectory of obstacles. Therefore, these limitations prompt us to seek alternative solutions. Fortunately, the advent of the event camera [6], a novel visual sensor inspired by biological vision, has shed light on this dilemma.
Unlike traditional frame-based cameras, which sample scenes based on an internal clock independent of motion (i.e., output images at a fixed frame rate), event cameras continuously and asynchronously record the motion information of the scene (assuming no change in illumination). Therefore, within a fixed time interval, event cameras produce a stream of events that encode motion information spatiotemporally, rather than a set number of images. More specifically, the generation of an event $e = (x, y, p, t)$ represents a logarithmic change in intensity (an increase when $p = +1$ or a decrease when $p = -1$) beyond a predefined threshold that occurs at pixel $(x, y)$ and timestamp t. This unique operating principle offers numerous advantages over traditional frame-based cameras, such as low latency, minimal motion blur, high dynamic range, and low power consumption, making event cameras ideally suited for dynamic object detection. Accordingly, algorithms must be tailored to harness their full potential. For a detailed overview of event cameras, including algorithms and applications, refer to [7].
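To make this data format concrete, the following minimal Python sketch shows one way an event and its generation condition can be represented; the field names and the contrast threshold value are illustrative assumptions, not part of any specific camera driver API.

```python
from dataclasses import dataclass

@dataclass
class Event:
    x: int      # pixel column
    y: int      # pixel row
    p: int      # polarity: +1 (brightness increase) or -1 (decrease)
    t: float    # timestamp in seconds

def fires_event(log_i_prev: float, log_i_now: float, contrast_threshold: float = 0.2) -> int:
    """Return +1/-1 if the log-intensity change exceeds the threshold, else 0.

    The contrast threshold value here is an illustrative assumption; real
    sensors expose it as a configurable bias.
    """
    delta = log_i_now - log_i_prev
    if delta >= contrast_threshold:
        return +1
    if delta <= -contrast_threshold:
        return -1
    return 0
```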
Inspired by the research in [2,8], a novel methodology for dynamic object detection has been developed through the fusion of event data with depth information. As depicted in Figure 1, the core idea of the proposed method is to filter out events caused by the camera’s ego-motion while retaining those generated by the object’s motion. Initially, the authors designed a variable time window to balance the signal-to-noise ratio and latency. Subsequently, the authors perform ego-motion compensation to warp events. Although the Contrast Maximization (CMax) framework [9,10] has been proven to be applicable for ego-motion estimation, it involves significant computational complexity due to its optimization-based iterative nature. Therefore, the authors introduce additional sensors to provide motion information. Finally, warped events are converted into an event queue image, and after some subsequent processing (normalization, thresholding, denoising), the detection results are obtained. The authors conduct quantitative and qualitative experiments in various dynamic scenarios to validate the proposed object detection method. These experiments provide a solid foundation for subsequent obstacle avoidance experiments.
The contributions of this paper can be summarized as follows:
  • The authors design a strategy to adaptively select the size of the time window for event processing based on the motion within a scene. This approach more effectively balances the signal-to-noise ratio and latency compared to using a fixed time window.
  • The authors propose an ego-motion compensation algorithm to eliminate events caused by the camera’s ego-motion while retaining events generated by the objects.
  • The authors construct a “First in, First out” event queue and perform detection on the image derived from it, which helps reduce computational complexity.
  • The authors construct a complete obstacle avoidance pipeline and deploy it on a custom-built quadrotor for validation in a real environment.
The rest of the paper is organized as follows: Section 2 reviews related work on event-based detection, Section 3 describes in detail the proposed detection framework, and Section 4 introduces the obstacle avoidance process of drones. An extensive experimental evaluation was conducted and is presented in Section 5. Conclusions are drawn in Section 6.

2. Related Work

Existing works on event-based object detection can be broadly divided into two branches. One of the branches is data-driven approaches, i.e., deep learning-based methods. This type of approach first transforms asynchronous event streams into synchronous event frames, feeds those frames into a well-designed network architecture, and finally iteratively trains that network until it converges. Liu et al. [11] reported an object detection and tracking algorithm that combines a simple Convolutional Neural Network (CNN)-based detector with Regions of Interest (ROIs) from the clustering of events. Iacono et al. [12] explored how event data can be converted into a format that traditional deep neural networks can handle in order to benefit from existing detection methods. Tomy et al. [13] used a voxel grid representation to process the event data and fed it and the corresponding images into a two-stream network to extract features. Experimental results show that their method has better robustness in the face of image corruption and different weather conditions compared to traditional RGB-based detection methods. Recently, ref. [14] introduced a novel object detection backbone network called Recurrent Vision Transformers (RVTs) for event cameras, which exhibits high performance in both detection accuracy and inference speed. It is highlighted by a multi-stage design: first, a convolutional prior is used as the conditional positional embedding; second, spatial feature interactions are performed using local and dilated global self-attention; finally, recursive temporal feature aggregation is used to minimize latency while preserving temporal information. Although the above methods demonstrate good performance in terms of accuracy, they rely on high-performance graphics cards for inference, making it difficult to deploy them on an embedded device.
The other branch is model-driven approaches, with the CMax framework [9,10,15,16] being the most representative. The first step of CMax is to set the warping parameters of the motion (or scene) to align events that are triggered by the same scene edge at different times. Warped events are then transformed into the Image of Warped Events (IWE), and a reward function is utilized to measure the quality of the warping (the contrast of the IWE). Finally, the warping parameters are updated, typically with optimization algorithms such as non-linear conjugate gradient, and the above steps iterate until convergence. The CMax framework has been widely used in optical flow estimation [17,18], rotational motion estimation [19,20], and motion segmentation [21,22,23]. However, the disadvantage of the CMax framework is its higher computational cost; the resulting perceptual delay may lead to the failure of obstacle avoidance.

3. Obstacle Detection

This section delineates the methodology developed for detecting fast-moving objects through the fusion of events with depth information. The comprehensive processing pipeline, as illustrated in Figure 2, consists of four sequential stages. Initially, raw event data and supplementary information are acquired through a temporal sampling strategy and subsequently stored in a dedicated buffer. Following this, a motion compensation process is implemented to eliminate events attributable to ego-motion, based on an established motion model. The retained events, corresponding to the moving object of interest, are then transformed into an event queue representation, upon which dynamic region segmentation is performed. Finally, the segmented image undergoes denoising and detection processes, culminating in the generation of a rectangular bounding box for the identified dynamic object. The subsequent subsections provide detailed technical specifications and implementation considerations for each of these processing stages.

3.1. Adaptive Time Window

Existing works [2,8,24] typically execute methods within a fixed time window Δ t . However, this is not an optimal approach. For instance, if Δ t is too small, it may result in an insufficient number of events to accomplish the task, causing the algorithm to run too frequently and waste computational resources and energy. Conversely, if Δ t is too large, latency increases, not only in event collection but also in processing, potentially leading to obstacle avoidance failure. To remedy this shortcoming, the authors design an adaptive time window sampling strategy that dynamically generates time windows of varying sizes for different scenarios.
As illustrated in Figure 3, the operational procedures are demonstrated through the examination of the i-th time window $\Delta t_i$, with the specific implementation steps being delineated as follows: Firstly, a sampling time window $\Delta t_s$ is implemented for pre-evaluation (i.e., counting the number of events C within it). This approach is fundamentally grounded in the observable phenomenon that event frequency demonstrates significant amplification in the presence of rapidly moving objects within the visual field. Given the inherent temporal ordering of event data, computational efficiency is enhanced through the application of the lower_bound function, which facilitates rapid event counting and processing:
$C = \mathrm{lower\_bound}(E_i, \Delta t_s).$
Two distinct operational scenarios were subsequently examined: (i) if an obstacle was detected in the previous time window $\Delta t_{i-1}$, it is clear that $\Delta t_i$ should be determined by both C and $\Delta t_{i-1}$ as follows:
$\Delta t_i = \max\!\left( \frac{N}{\max(C, N)} \times \Delta t_{i-1},\ \Delta t_{\min} \right);$
(ii) if no obstacles were detected in the previous time window $\Delta t_{i-1}$, $\Delta t_i$ is determined solely by C as follows:
$\Delta t_i = \min\!\left( \frac{N}{\min(C, N)} \times \Delta t_{\min},\ \Delta t_{\max} \right),$
where N is a threshold that indicates the presence of obstacles within $\Delta t_i$, and $\Delta t_{\min}$ and $\Delta t_{\max}$ are the preset minimum and maximum time windows, respectively.
This strategic implementation achieves an optimal balance between the signal-to-noise ratio and system latency while simultaneously minimizing computational overhead and energy expenditure to the greatest extent feasible.
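A minimal sketch of the adaptive window selection described above is given below, assuming a time-sorted event buffer; all numeric parameter values and the function name are illustrative assumptions rather than values from the paper.

```python
import bisect

def adaptive_window(event_timestamps, t_now, dt_prev, detected_prev,
                    dt_s=0.002, N=2000, dt_min=0.005, dt_max=0.05):
    """Choose the next processing-window size from the recent event rate.

    event_timestamps: sorted event times (s); dt_s, N, dt_min, dt_max are the
    sampling window, event-count threshold, and window bounds. The numeric
    defaults here are illustrative assumptions.
    """
    # Count events inside the sampling window [t_now - dt_s, t_now]; the buffer
    # is time-ordered, so a binary search (lower_bound style) suffices.
    lo = bisect.bisect_left(event_timestamps, t_now - dt_s)
    C = max(len(event_timestamps) - lo, 1)  # guard against an empty window

    if detected_prev:
        # An obstacle was seen in the previous window: many events (C > N)
        # shrink dt_prev, but never below dt_min.
        return max((N / max(C, N)) * dt_prev, dt_min)
    # No obstacle previously: few events (C < N) stretch dt_min toward dt_max.
    return min((N / min(C, N)) * dt_min, dt_max)
```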

3.2. Ego-Motion Compensation

Upon the determination of a specific temporal window Δ t (with the subscript i omitted for notational simplicity), both event data and supplementary information within this interval are stored in a dedicated buffer for subsequent processing stages. As mentioned earlier, events are generated when there is relative motion between the camera and the scene (with constant illumination). Evidently, this may be due to the ego-motion of the camera, the motion of the object in the scene, or both. Since our goal is to detect the moving object, events triggered by a camera’s ego-motion need to be filtered out.
Some optimization-based methods (collectively known as the CMax framework) can only use raw events to complete ego-motion compensation. More specifically, the trajectory of events within a small time interval in the spatiotemporal realm can be described by a geometric model, whose parameters are optimized to spatially align events. Events are warped to a reference time according to this model, generating an IWE. A reward function then evaluates the IWE quality, which reflects the alignment degree of warped events (i.e., spatial gradient consistency). These steps are repeated until the maximum reward is achieved. However, minimizing latency is critical for dynamic object detection, whereas the CMax framework’s computational demands make it unsuitable for this task. To address this limitation, three components are integrated: a depth camera to provide depth data, an inertial measurement unit (IMU) to measure angular velocity, and the Visual–Inertial Navigation System Fusion (VINS-Fusion [25]) as the state estimation framework. These data streams are used to warp events.
Under the assumption of proper camera calibration (with known intrinsic parameters and negligible lens distortion), ego-motion compensation is systematically applied to each individual event $e_i = (x_i, y_i, t_i)$, $i \in \{0, 1, \ldots, n\}$ (polarity is ignored as it is unused). According to [24], the trajectory of a 3D point projected onto the pixel coordinate system can be parameterized by the camera’s known 6-DOF motion (rotation and translation) and the depth of this 3D point relative to a reference view.
The compensation calculation for rotation is divided into the following three steps: (i) calculate the time-averaged angular velocity from the IMU measurements, $\bar{\omega} = \frac{1}{n}\sum_{\Delta t} \omega$; (ii) use the Rodrigues rotation formula to compute the rotation matrix from $\bar{\omega}\,\Delta t$; (iii) warp each event $e_i$ within $\Delta t$ in the image plane via $\bar{\omega}\,(t_i - t_0)$.
The compensation calculation for translation is divided into the following two steps: (i) compute the time-averaged linear velocity from the VINS-Fusion odometry, $\bar{v} = \frac{1}{n}\sum_{\Delta t} v$; (ii) translate each event $e_i$ by $\bar{v}\,(t_i - t_0)$, where $t_0$ denotes the timestamp of the first event in $\Delta t$.
Note that the rotation and translation parameters originate from two different reference views. Ideally, when the event camera and depth camera are rigidly mounted (fixed relative positions), their extrinsics remain constant over time, i.e., $R_{DE}$ and $t_{DE}$ are time-invariant. Furthermore, by downsampling depth maps to the event camera’s resolution, spatial correspondence between the pixel coordinate systems of the event camera and depth camera is established as follows:
$[x, y, 1]_E^T = T_{DE} \cdot [x, y, 1]_D^T,$
where $T_{DE}$ is a $3 \times 3$ homography transformation matrix determined by the intrinsics $K_E$ and $K_D$ and the extrinsics $R_{DE}$ and $t_{DE}$ of the two cameras.
In the homogeneous coordinate system, the ego-motion compensation process is formulated as follows. Each event point $e_i$ on the pixel plane is projected into the event camera coordinate system as follows:
$d\,[x, y, 1]^T = K_E \cdot X_E,$
where d is the depth at pixel $(x, y)$, $K_E$ is the intrinsic matrix of the event camera, and $X_E$ is the event’s position in the event camera coordinate system.
Since the translation is expressed in the depth camera frame, it needs to be transformed into the event camera coordinate system as follows:
$\begin{bmatrix} t_E \\ 1 \end{bmatrix} = \begin{bmatrix} R_{DE} & t_{DE} \\ 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} t_D \\ 1 \end{bmatrix},$
where $t_D = \bar{v}\,(t_i - t_0)$. The Rodrigues rotation formula is subsequently applied to derive the rotation matrix $R_E$, which is expressed as follows:
$R_E = e^{[\bar{\omega}]_\times (t_i - t_0)},$
where $[\bar{\omega}]_\times$ is the $3 \times 3$ skew-symmetric matrix of the average angular velocity $\bar{\omega}$.
After obtaining $R_E$ and $t_E$, all events can be systematically projected back to the reference view through the following transformation:
$\begin{bmatrix} X'_E \\ 1 \end{bmatrix} = \begin{bmatrix} R_E & t_E \\ 0 & 1 \end{bmatrix}^{-1} \cdot \begin{bmatrix} X_E \\ 1 \end{bmatrix},$
where $X'_E$ is the position of the compensated event in the event camera coordinate system. Finally, the transformation of $X'_E$ onto the pixel coordinate system is achieved through the following projection:
$d'\,[x', y', 1]^T = K_E \cdot X'_E,$
where $d'$ is the depth at pixel $(x', y')$.
Through Equations (4)–(9), the position of an event on the event camera pixel plane, after ego-motion (including rotation and translation) compensation, is restored (inversely mapped) from $(x, y)$ to $(x', y')$. This transformation process is formally characterized as a warping operator $W: \mathbb{R}^3 \to \mathbb{R}^3$, defined as follows:
$(x', y', t_0) = W((x, y, t);\ \bar{\omega}, \bar{v}, t - t_0).$
All compensated events within $\Delta t$ are formally represented as a set C, defined as follows:
$C = \{(x', y', t) \mid t \in \Delta t\}.$
To balance computational constraints, $\bar{v}$ and $\bar{\omega}$ are updated at millisecond intervals, facilitating the computation of the transformation parameters $R_E$ and $t_E$.
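The warping geometry above can be summarized in a short sketch. The code below assumes rectified pixels, a depth value already associated with each event, and average angular and linear velocities already expressed in the event-camera frame (the paper transforms them there via the depth-to-event extrinsics); it illustrates the geometry rather than the authors’ implementation.

```python
import numpy as np
import cv2

def warp_events(events, depths, K_E, omega_bar, v_bar, t0):
    """Warp events (x, y, t) back to the reference time t0.

    events: (N, 3) array of [x, y, t]; depths: (N,) depth per event (m);
    K_E: 3x3 event-camera intrinsics; omega_bar, v_bar: mean angular and
    linear velocity over the window (assumed to be in the event-camera frame).
    """
    K_inv = np.linalg.inv(K_E)
    warped = np.empty((len(events), 2))
    for i, (x, y, t) in enumerate(events):
        dt = t - t0
        # Back-project the event pixel into 3D: d * [x, y, 1]^T = K_E * X_E.
        X = depths[i] * (K_inv @ np.array([x, y, 1.0]))
        # Incremental rotation over dt via the Rodrigues formula.
        R, _ = cv2.Rodrigues(np.asarray(omega_bar, dtype=np.float64) * dt)
        t_vec = np.asarray(v_bar, dtype=np.float64) * dt
        # Undo the ego-motion: apply the inverse rigid transform [R t; 0 1]^{-1}.
        X_ref = R.T @ (X - t_vec)
        # Re-project onto the pixel plane of the reference view.
        p = K_E @ X_ref
        warped[i] = p[:2] / p[2]
    return warped
```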

3.3. Event Queue Image

Traditional computer vision algorithms cannot process events directly. To harness their potential, a natural approach is to convert the compensated events into an event image. Ref. [24] defined the event count image as the image whose pixel value counts the number of compensated events falling within that pixel, and the time image as the image whose pixel value is the average timestamp of all compensated events at that pixel.
However, through empirical analysis, it was observed that the time image representation introduces temporal information degradation through the averaging of timestamps from both recent and historical events, consequently diminishing the salience of dynamic regions. Furthermore, the inclusion of historical events was found to contribute significantly to computational inefficiency. To address these identified limitations, the Most Recent Event Queue (MREQ) was developed as a pixel-wise buffering mechanism that retains only the most recent K events for each pixel location.
The MREQ is implemented as a “First in, First out” structure $Q(u, v, k) \in \mathbb{R}^{W \times H \times K}$. The matrix Q is initialized to 0 as the foundational step in the process. Then, each compensated event $e_i = (x_i, y_i, t_i) \in C$ is utilized to update Q in the following manner:
$Q(u_i, v_i, K) = Q(u_i, v_i, K-1),$
$Q(u_i, v_i, K-1) = Q(u_i, v_i, K-2),$
$\ldots$
$Q(u_i, v_i, 2) = Q(u_i, v_i, 1),$
$Q(u_i, v_i, 1) = t_i.$
Since the compensated event position $(x_i, y_i)$ lies on the discretized image plane in $\mathbb{R}^2$, $(u_i, v_i) \in \mathbb{N}^2$ denotes its integer pixel coordinates. To enhance the spatial representation, each coordinate $(x_i, y_i)$ is projected onto its four nearest integer coordinate positions. Subsequently, the event queue image T is formally defined, where each pixel value is the average of the timestamps mapped to the event queue by the compensated events:
$T(u, v) = \operatorname{mean}_{k=1}^{K} Q(u, v, k).$
Normalizing T yields the following:
$N(u, v) = \dfrac{T(u, v) - \min(T)}{\max(T) - \min(T)}.$
Following [2,8], background suppression in the normalized event queue image N is achieved through an adaptive thresholding approach. Unlike conventional implementations employing static threshold values for event source discrimination, the proposed method dynamically determines the threshold parameter θ through the computation of the mean intensity value across N :
$\theta = \operatorname{mean}(N).$
Each pixel value $N(u, v)$ of N is compared with $\theta$; if $N(u, v)$ is greater than or equal to $\theta$, a moving object appears at this pixel; otherwise, the pixel belongs to the background. Through this thresholding, the filtered normalized event queue image I is obtained via the following transformation:
$I(u, v) = \begin{cases} N(u, v), & N(u, v) \ge \theta \\ 0, & N(u, v) < \theta \end{cases}.$
Figure 4 describes the complete pipeline for transforming warped events into an event queue image. Please note that the areas with dynamic obstacles have been retained.
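The MREQ update and the subsequent averaging, normalization, and thresholding can be sketched as follows; the queue depth K is an assumed value, and the projection of each event onto its four nearest integer pixels is omitted for brevity (events are simply rounded to the nearest pixel here).

```python
import numpy as np

def build_event_queue_image(warped_events, width, height, K=5):
    """Build the filtered normalized event queue image from warped events.

    warped_events: iterable of (u, v, t) with integer pixel coordinates and
    time-ordered timestamps. K is the per-pixel queue depth (assumed value).
    """
    Q = np.zeros((height, width, K), dtype=np.float64)
    for u, v, t in warped_events:
        if 0 <= u < width and 0 <= v < height:
            Q[v, u, :] = np.roll(Q[v, u, :], 1)  # FIFO shift: oldest slot drops out
            Q[v, u, 0] = t                       # newest timestamp enters the queue
    T = Q.mean(axis=2)                           # per-pixel mean of the K timestamps
    # Normalize to [0, 1]; pixels with recent (large) timestamps stand out.
    N = (T - T.min()) / max(T.max() - T.min(), 1e-9)
    theta = N.mean()                             # adaptive threshold
    I = np.where(N >= theta, N, 0.0)             # suppress background pixels
    return I
```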

3.4. Noise Reduction and Detection

To enhance detection accuracy, the filtered normalized event queue image I undergoes a preprocessing stage involving denoising operations.
Firstly, I is passed through a mean filter as follows:
$I(u, v) = \dfrac{1}{w \times h} \sum_{i=-\frac{w-1}{2}}^{\frac{w-1}{2}} \sum_{j=-\frac{h-1}{2}}^{\frac{h-1}{2}} I(u+i, v+j),$
where w × h is the size of the filter; note that w and h need to be odd.
Then, a morphological opening operation is applied to I, combining both erosion and dilation processes to achieve enhanced smoothing and noise reduction, as expressed by the following transformation:
$\operatorname{Opening}(I) = (I \ominus K) \oplus K,$
where
$K = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix}_{m \times n}$
is the structuring element and $\ominus$ and $\oplus$ represent erosion and dilation, respectively, defined as follows:
$(I \ominus K)(u, v) = \min_{(u', v') \in K} I(u+u', v+v'),$
$(I \oplus K)(u, v) = \max_{(u', v') \in K} I(u-u', v-v').$
Finally, the elements in the image I are normalized to [0, 255].
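Since OpenCV is already among the platform’s software dependencies (Table 1), the mean filtering, morphological opening, and rescaling steps above can be sketched with standard OpenCV calls; the kernel sizes chosen here are illustrative assumptions.

```python
import cv2
import numpy as np

def denoise_event_image(I, blur_size=(5, 5), open_size=(3, 3)):
    """Mean-filter, morphologically open, and rescale the event queue image.

    blur_size and open_size are illustrative kernel sizes (odd, as required).
    """
    I = cv2.blur(I.astype(np.float32), blur_size)            # mean (box) filter
    kernel = np.ones(open_size, np.uint8)                     # all-ones structuring element
    I = cv2.morphologyEx(I, cv2.MORPH_OPEN, kernel)           # erosion followed by dilation
    # Normalize the result to the range [0, 255] for the detection stage.
    return cv2.normalize(I, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```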
Following the completion of these preprocessing stages, object detection is subsequently performed on I. Algorithm 1 summarizes our detection algorithm for finding the optimal Region of Interest (ROI). Imitating [8], the search starts from the initialized ROI $S(C_{p_0}, L_0)$, where $C_{p_0} = (x, y)$ is the center of the ROI (i.e., the location of the maximum value in I) and $L_0 = (a, b)$ represents the width and length of the ROI (here, a and b are half the width and length of the image I). The algorithm iteratively calculates the zero-order, first-order, and second-order moments of the ROI to obtain the center of mass and the increment of the ROI (i.e., to update $C_{p_i}$ and $L_i$) until the optimum $S(C_p^*, L^*)$ is obtained.
Under the following conditions, the algorithm ceases to generate detection outputs: (i) absence of moving objects within the surveillance field; (ii) excessive distance between moving targets and the UAV, resulting in insufficient event-triggered data; or (iii) a region of interest (ROI) smaller than the predefined area threshold.
Algorithm 1 Detection
Require: $S(C_{p_0}, L_0)$, $N \in \mathbb{Z}^+$.
Ensure: $S(C_p^*, L^*)$.
1: Initialize: $S(C_{p_0}, L_0)$.
2: $i \leftarrow 0$.
3: while $i < N$ do
4:   $\mu \leftarrow \operatorname{moments}(S(C_{p_i}, L_i))$.
5:   $\bar{x} \leftarrow \mu_{10}/\mu_{00}$, $\bar{y} \leftarrow \mu_{01}/\mu_{00}$.
6:   $\epsilon_x \leftarrow 4\,\mu_{20}/\mu_{00}$, $\epsilon_y \leftarrow 4\,\mu_{02}/\mu_{00}$.
7:   if $\epsilon_x == 0$ or $\epsilon_y == 0$ then
8:     $L_{i+1} \leftarrow (0, 0)$.
9:     $C_{p_{i+1}} \leftarrow (0, 0)$.
10:    break.
11:  end if
12:  $L_{i+1} \leftarrow (\epsilon_x, \epsilon_y)$.
13:  $C_{p_{i+1}} \leftarrow (\bar{x} - \epsilon_x/2,\ \bar{y} - \epsilon_y/2)$.
14:  $i \leftarrow i + 1$.
15: end while
16: $C_p^* \leftarrow C_{p_i}$, $L^* \leftarrow L_i$.
17: return $S(C_p^*, L^*)$
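A compact rendering of Algorithm 1 using OpenCV’s moment computation is sketched below. The iteration count, the minimum-mass check, and the center-based ROI bookkeeping are assumptions, and cv2.moments stands in for the unspecified moments(·) routine; this is an illustration of the idea, not the authors’ implementation.

```python
import cv2
import numpy as np

def detect_roi(I, max_iter=10, min_mass=50.0):
    """Iteratively fit an ROI to the bright (dynamic) region of image I.

    Follows the spirit of Algorithm 1: recompute image moments inside the ROI
    and update its center and size. max_iter and min_mass are assumed values.
    """
    h, w = I.shape
    cy, cx = np.unravel_index(np.argmax(I), I.shape)  # start at the peak pixel
    a, b = w // 2, h // 2                             # half-size initial ROI
    for _ in range(max_iter):
        x0, y0 = max(cx - a // 2, 0), max(cy - b // 2, 0)
        roi = I[y0:y0 + b, x0:x0 + a]
        m = cv2.moments(roi.astype(np.float32))
        if m["m00"] < min_mass:                       # too little signal: no object
            return None
        x_bar, y_bar = m["m10"] / m["m00"], m["m01"] / m["m00"]
        eps_x = 4.0 * m["m20"] / m["m00"]             # size update from 2nd moments
        eps_y = 4.0 * m["m02"] / m["m00"]
        if eps_x == 0 or eps_y == 0:
            return None
        cx, cy = int(x0 + x_bar), int(y0 + y_bar)     # shift center to the centroid
        a = min(max(int(eps_x), 1), w)                # clamp the window to the image
        b = min(max(int(eps_y), 1), h)
    return (cx - a // 2, cy - b // 2, a, b)           # bounding box (x, y, w, h)
```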

4. Obstacle Avoidance

To provide a comprehensive validation of the proposed detection methodology’s efficacy, a simplified planning module was developed and implemented, enabling autonomous quadrotor navigation in real-world environments with the capability of avoiding high-speed dynamic obstacles.

4.1. Obstacle Position Estimation

For effective obstacle avoidance in real-world quadrotor navigation, the accurate three-dimensional spatial localization of obstacles is essential. However, the event queue image provided by a monocular event camera inherently lacks depth information. Consequently, depth data acquisition is achieved through the integration of a depth imaging system to supplement the event-based visual information.
Due to the heterogeneous sensor origins of event streams and depth data, cross-modal spatiotemporal alignment requires explicit implementation. In fact, the temporal synchronization process has already been completed at each time window, and spatial alignment can be achieved through geometric calibration. Specifically, the intrinsic and extrinsic parameters of both sensors enable direct projection mapping between event queue images and depth frames.
In practical implementation, a resolution reduction in the depth camera was implemented to ensure the consistent generation of depth images within each temporal window. To mitigate potential bias effects, the depth value d ( · , · ) for detected obstacles was determined through the computation of the mean depth value across all pixels contained within the corresponding bounding box:
$d_o = \operatorname{mean}_{(u, v) \in S(C_p^*, L^*)} \{ d(u, v) \}.$
Upon acquiring depth information, the obstacle’s center point is projected onto the 3D camera coordinate system using the perspective projection model in homogeneous coordinates:
$d_o\,[u_c, v_c, 1]^T = K \cdot X_c,$
where
$K = \underbrace{\begin{bmatrix} 1 & 0 & c_x \\ 0 & 1 & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{2D Translation}} \times \underbrace{\begin{bmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{2D Scaling}} \times \underbrace{\begin{bmatrix} 1 & s/f_x & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{2D Shear}}$
is the intrinsic camera matrix, $X_c$ is the Cartesian coordinate of the center point of the obstacle in the camera frame, and $u_c$ and $v_c$ are the coordinates of the image projection on the pixel plane. Then, $X_c$ is converted to the world coordinate system as follows:
$\begin{bmatrix} X_w \\ 1 \end{bmatrix} = T \cdot \begin{bmatrix} X_c \\ 1 \end{bmatrix},$
where
$T = \underbrace{\begin{bmatrix} I \mid t \end{bmatrix}}_{\text{3D Translation}} \times \underbrace{\begin{bmatrix} R & 0 \\ 0^T & 1 \end{bmatrix}}_{\text{3D Rotation}}$
is the extrinsic matrix.
To establish correspondence between instances of the same obstacle, the following strategy is implemented: upon the appearance of a new detection result, its temporal disparity and spatial deviation are computed with respect to the preceding detection to assess if the two bounding boxes correspond to identical obstacles.
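The depth lookup and back-projection described above amount to only a few lines. The sketch below assumes a depth image already aligned with the event-camera pixels, ignores lens distortion, and uses illustrative variable names (e.g., T_wc for the camera-to-world transform).

```python
import numpy as np

def obstacle_position_world(bbox, depth_img, K, T_wc):
    """Estimate the obstacle's world position from its bounding box.

    bbox: (x, y, w, h) in the event-camera pixel frame; depth_img is assumed
    to be spatially aligned with it. K: 3x3 intrinsics; T_wc: 4x4 camera-to-world
    transform (e.g., taken from VINS-Fusion odometry).
    """
    x, y, w, h = bbox
    patch = depth_img[y:y + h, x:x + w]
    valid = patch[patch > 0]                       # ignore invalid (zero) depth pixels
    if valid.size == 0:
        return None
    d_o = float(valid.mean())                      # mean depth inside the bounding box
    u_c, v_c = x + w / 2.0, y + h / 2.0            # bounding-box center
    # Back-project the center pixel: d_o * [u, v, 1]^T = K * X_c.
    X_c = d_o * (np.linalg.inv(K) @ np.array([u_c, v_c, 1.0]))
    # Transform from camera to world coordinates in homogeneous form.
    X_w = T_wc @ np.append(X_c, 1.0)
    return X_w[:3]
```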

4.2. Obstacle Velocity Estimation

Once the obstacle’s position is determined, a Kalman Filter (KF) [26] is employed for velocity estimation. This approach is motivated by two considerations: (i) KFs improve reliability against the noise and misdetection that occur during the detection phase; (ii) the obstacle’s velocity dictates both the activation of obstacle avoidance and the formulation of corresponding maneuvers.
The system state is defined as the position p and velocity v , with the observed position serving as the measurement input to update
$p_k = p_{k-1} + v_{k-1}\,\delta t, \quad k = 1, 2, \ldots, \quad \delta t = t_k - t_{k-1}.$
The linear modeling of (25) yields the following:
$x = \begin{bmatrix} p \\ v \end{bmatrix}, \qquad x_{k+1} = A x_k + w_k,$
where $x_k \in \mathbb{R}^6$ is the state vector at $t_k$, $A \in \mathbb{R}^{6 \times 6}$ is the state transfer matrix for x from $t_k$ to $t_{k+1}$, and $w_k \in \mathbb{R}^6$ is the process noise, which obeys a Gaussian distribution with a mean of 0 and a covariance matrix of $Q \in \mathbb{R}^{6 \times 6}$.
Observations of the random variable x are modeled as follows:
$z_k = H x_k + v_k,$
where $z_k \in \mathbb{R}^3$ is the observation of the position at $t_k$, $H \in \mathbb{R}^{3 \times 6}$ is the state observation matrix, and $v_k \in \mathbb{R}^3$ is the observation noise, which obeys a Gaussian distribution with a mean of 0 and a covariance matrix of $R \in \mathbb{R}^{3 \times 3}$.
For optimal minimum mean square error estimation, the system noise is modeled as a Gaussian process. Under the assumption of time-invariant covariance matrices for both process and measurement noise, the KF equations are derived as follows:
$Q = E[w_k w_k^T], \qquad R = E[v_k v_k^T], \qquad P_k = E[e_k e_k^T] = E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T],$
where $\hat{x}_k$ denotes the optimal estimate of $x_k$ (also known as the a posteriori state estimate), $e_k = x_k - \hat{x}_k$ denotes the a posteriori state error, and $P_k$ is the error covariance matrix between the true value and the optimal estimate, satisfying $P_k = P_k^T$.
Assuming that $\hat{x}_k^-$ denotes the prior state estimate of $\hat{x}_k$, and combining the state estimate with the measured data, the updating equation for the state estimate is defined as follows:
$\hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-),$
where $K_k$ is the Kalman gain, which weighs the model prediction error against the measurement error in the state-optimal estimation process, and $z_k - H \hat{x}_k^-$ represents the measurement residual.
Combining (27)–(29) yields the following:
$P_k = P_k^- - K_k H P_k^- - P_k^- H^T K_k^T + K_k (H P_k^- H^T + R) K_k^T,$
where $P_k^-$ represents the prior estimate of $P_k$.
The trace of the error covariance matrix $P_k$ is the sum of the mean square errors, so the minimum mean square error can be obtained by minimizing the trace of $P_k$. Taking the partial derivative of the trace of $P_k$ in (30) with respect to $K_k$ and setting it to zero yields the Kalman gain matrix under optimal estimation conditions:
$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}.$
Let $S_k = H P_k^- H^T + R$. The updated equation for the error covariance matrix $P_k$ using the optimal gain can be obtained by combining (30) and (31) as follows:
$P_k = P_k^- - K_k H P_k^- - K_k S_k K_k^T + K_k S_k K_k^T = (I - K_k H) P_k^-.$
According to Equations (29), (31) and (32), the predictive equation for the state can be obtained as follows:
$\hat{x}_{k+1}^- = A \hat{x}_k.$
In order to predict the state x recursively, it is necessary to predict $P_{k+1}^-$ at $t_{k+1}$. The a priori error at $t_{k+1}$ is the following:
$e_{k+1}^- = x_{k+1} - \hat{x}_{k+1}^- = A x_k + w_k - A \hat{x}_k = A e_k + w_k,$
where $w_k$ is the cumulative noise from $t_k$ to $t_{k+1}$, $e_k$ is the error before $t_k$, and $e_k$ and $w_k$ are uncorrelated. Therefore, the a priori error covariance matrix at $t_{k+1}$ is computed as follows:
$P_{k+1}^- = E[e_{k+1}^- (e_{k+1}^-)^T] = A\,E[e_k e_k^T]\,A^T + E[w_k w_k^T] = A P_k A^T + Q.$
The velocity estimation framework implemented through a KF is demonstrated in the preceding analysis, with its recursive estimation process visually summarized in Figure 5.
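The constant-velocity KF above reduces to the familiar predict/update recursion. The sketch below fixes the state to [p, v] in 3D with position-only measurements; the noise covariance values are illustrative assumptions, not the ones used on the quadrotor.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter for obstacle position and velocity."""

    def __init__(self, q=1.0, r=0.05):
        self.x = np.zeros(6)                       # state: [p (3), v (3)]
        self.P = np.eye(6)                         # error covariance
        self.Q = q * np.eye(6)                     # process noise (assumed value)
        self.R = r * np.eye(3)                     # measurement noise (assumed value)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe position only

    def predict(self, dt):
        A = np.eye(6)
        A[:3, 3:] = dt * np.eye(3)                 # p_k = p_{k-1} + v_{k-1} * dt
        self.x = A @ self.x
        self.P = A @ self.P @ A.T + self.Q
        return self.x

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R                # innovation covariance S_k
        K = self.P @ self.H.T @ np.linalg.inv(S)               # Kalman gain K_k
        self.x = self.x + K @ (z - self.H @ self.x)            # state correction
        self.P = (np.eye(6) - K @ self.H) @ self.P             # covariance update
        return self.x
```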

4.3. Artificial Repulsive Field

For dynamic obstacle avoidance applications, low-latency response is critical, precluding the use of computationally intensive optimization methods. Consequently, the Artificial Potential Field (APF) method [27] is adopted. In this method, an obstacle acts as a source of a repulsive potential field. When the quadrotor enters the repulsive range, it experiences a ’bouncing away’ effect, thereby achieving obstacle avoidance.
The repulsive force is formulated as follows:
$F_r = \begin{cases} k_r \left( 1 - \left( \dfrac{\eta}{\eta_0} \right)^{\gamma} \right) \theta, & 0 \le \eta \le \eta_0 \\ 0, & \eta > \eta_0 \end{cases},$
where $k_r$, $\gamma\ (>1)$, and $\eta_0$ are hyperparameters; $\eta$ is the Euclidean distance between the quadrotor and the obstacle; and $\theta$ is a unit direction vector perpendicular to the obstacle’s velocity vector, with its z-axis component constrained to positive values.
In contrast to conventional repulsive functions, $F_r$ is a concave function that allows for a more rapid response. Figure 6a shows the variation curves corresponding to different $\gamma$ values, while Figure 6b provides a visualization of the direction of $\theta$.
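Under the repulsive-force form given above (a reconstruction of the garbled equation), the avoidance command is a few lines of vector algebra. In the sketch below, the hyperparameter values and the particular construction of the perpendicular direction θ (projecting the world z-axis onto the plane perpendicular to the obstacle velocity) are illustrative assumptions.

```python
import numpy as np

def repulsive_force(p_uav, p_obs, v_obs, k_r=4.0, gamma=2.0, eta_0=3.0):
    """Repulsive command pushing the UAV away from an approaching obstacle.

    k_r, gamma (>1), and eta_0 (repulsion radius, m) are illustrative values.
    theta is built as a unit vector perpendicular to the obstacle velocity with
    a non-negative z component, following the description in the text.
    """
    eta = np.linalg.norm(p_uav - p_obs)            # distance between UAV and obstacle
    if eta > eta_0:
        return np.zeros(3)                         # outside the repulsive range
    v = v_obs / (np.linalg.norm(v_obs) + 1e-9)     # obstacle velocity direction
    # Project the world z-axis onto the plane perpendicular to v (assumed choice).
    z = np.array([0.0, 0.0, 1.0])
    theta = z - np.dot(z, v) * v
    norm = np.linalg.norm(theta)
    if norm < 1e-6:                                # obstacle moving straight up/down
        theta = np.array([1.0, 0.0, 0.0])
    else:
        theta /= norm
    # Concave magnitude profile: a strong response as soon as the range is entered.
    magnitude = k_r * (1.0 - (eta / eta_0) ** gamma)
    return magnitude * theta
```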

5. Experiments

To evaluate the performance of the entire pipeline in the real world, the authors built a quadrotor platform equipped with an iniVation DAVIS346 event camera and an Intel Realsense D455 depth camera. The odometry, detection, and avoidance modules are implemented on an NVIDIA Jetson Xavier NX embedded computer, interconnected via ROS middleware. The output from the avoidance phase is a high-level velocity command, which is subsequently sent to the CUAV Nora Autopilot Flight Controller to execute obstacle avoidance maneuvers. The main hardware configuration and software dependencies of the quadrotor are illustrated in Figure 7 and specified in Table 1.

5.1. Datasets

For an empirical validation of the proposed moving obstacle detection algorithm, six distinct real-world datasets were constructed, each encompassing a 2 s time span. The datasets’ configurations are systematically categorized in Table 2.
As depicted in Figure 8, Scenes 1–4 represent indoor experimental configurations, whereas Scenes 5–6 correspond to outdoor environments. The obstacle’s velocity is approximated by dividing its displacement by the temporal interval between its entry and exit from the sensor’s field of view.
In Scene 1, the quadrotor is in hover and a basketball is moving from right to left at 7 m/s. In Scene 2, the quadrotor is in hover and a bottle is moving from right to left at 6 m/s. In Scene 3, the quadrotor is in motion and a basketball is moving from left to right at 5 m/s. In Scene 4, the quadrotor is in motion and a bottle is moving from right to left at 6 m/s. In Scene 5, the quadrotor is in hover and a board is moving from right to left at 5 m/s. In Scene 6, the quadrotor is in motion and a football is moving toward it at 8 m/s.

5.2. Results and Analysis

In this section, we quantitatively compare our proposed detection method with [2] (denoted as Falanga) and [8] (denoted as FAST), followed by further analysis.

5.2.1. Time Cost of Constructing Event Images

In order to compare the time cost of the three methods in constructing the event image to be detected, the authors define the following metrics:
$\mathit{min} = \frac{1}{N} \sum_{1}^{N} t_{\min}, \qquad \mathit{avg} = \frac{1}{N} \sum_{1}^{N} t_{\mathrm{avg}}, \qquad \mathit{max} = \frac{1}{N} \sum_{1}^{N} t_{\max}, \qquad \mathit{total} = \frac{1}{N} \sum_{1}^{N} t_{\mathrm{total}},$
where N denotes the total number of times the algorithm was run, $t_{\min}$, $t_{\mathrm{avg}}$, and $t_{\max}$ denote the minimum, average, and maximum values of the time cost for the algorithm to be run once in a given time window, respectively, and $t_{\mathrm{total}}$ denotes the total time cost of the algorithm.
The comparative results in Table 3 reveal two key observations: (i) Falanga performs optimally in Scene 1, Scene 2, and Scene 5; this is because the number of events generated by ego-motion is low when the quadrotor is in hover, and Falanga does not consider translation compensation, thus reducing the computational complexity. (ii) In Scene 3, Scene 4, and Scene 6, our method is optimal in terms of min, avg, and total metrics. This is due to the fact that ego-motion generates a large number of events, and our proposed adaptive time window can effectively deal with different situations, thus improving the efficiency of processing events. However, our method is not optimal in terms of max metrics, as it includes translation compensation to enhance the accuracy of subsequent detection.

5.2.2. Performance Comparison of Detection

In dynamic obstacle avoidance applications, algorithmic latency must be minimized to ensure real-time responsiveness. Furthermore, segmentation accuracy between obstacles and the background in event camera frames is critical for reliable detection, necessitating high contrast ratios in ROI to facilitate subsequent avoidance maneuvers. To this end, the authors employ the relative contrast [8], which defines the bounding box region of manually labeled moving objects in the event image as M and the other regions of the image as B ; then, the relative contrast χ is defined as follows:
$\chi = \dfrac{\max_{(i,j) \in M} M_{i,j} - \max_{(i,j) \in B} B_{i,j}}{\max_{(i,j) \in M} M_{i,j}}.$
Since the calculation of relative contrast is only meaningful when there is a moving object in the scene, only time windows in which the moving object appears in the sensor’s field of view are selected for the calculation of χ .
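The relative contrast can be computed directly from a labeled mask. The sketch below assumes the manually labeled bounding box is supplied as a boolean mask over the event image; this interface is an assumption for illustration.

```python
import numpy as np

def relative_contrast(event_image, obj_mask):
    """Relative contrast between the labeled object region and the background.

    event_image: 2D event queue image; obj_mask: boolean mask that is True
    inside the manually labeled bounding box (assumed labeling format).
    """
    m_max = float(event_image[obj_mask].max())      # brightest pixel on the object
    b_max = float(event_image[~obj_mask].max())     # brightest background pixel
    return (m_max - b_max) / m_max                  # as defined above (often reported in %)
```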
The results of the experiment are shown in Table 4, where the third through fifth columns indicate the minimum, average, and maximum time required to run the algorithm once on the dataset, respectively.
For the running time of the detection algorithm, the authors have the following conclusions: (i) Our method is optimal in terms of the min, avg, and total metrics; this is attributed to the fact that the adaptive time window can limit the number of events processed by the algorithm at a time and the total number of algorithm runs. Moreover, the event images of [2,8] retain all the events within the time window, while our event queue image discards some of the “old” events, further reducing the runtime. (ii) Although Falanga does not consider translation compensation, they use the DBSCAN clustering algorithm [28] for moving object detection, which has a higher time complexity compared to our image moment-fitting method. (iii) Our method underperforms on the max metric, which reflects the situation when there are no moving objects in the field of view.
For the relative contrast of the detection algorithm, the authors have the following conclusions: (i) Our event queue image that focuses only on the most recent events performs as expected, as evidenced by the best on the avg and max metrics and the worst on the min metric. (ii) When the quadrotor is in motion, the ego-motion generates a lot of noise, resulting in a significant drop in the relative contrast.

5.2.3. Accuracy of Moving Object Detection

When there is a dynamic object in the field of view, the detection accuracy is defined as follows:
$acc = \dfrac{N_{OD}}{N_{OR}},$
where $N_{OR}$ represents the total number of times the detection algorithm is triggered within a time window, while $N_{OD}$ represents the number of times the detection algorithm is triggered and detects a moving obstacle within that time window.
Table 5 presents the comparative results of the experiments. Overall, our method not only reduces the number of runs of the detection algorithm, thereby reducing the risk of misdetection, but also leads in terms of accuracy. Figure 9 presents some of the detection results for the six scenes when an object is present in the field of view.

5.2.4. Obstacle Avoidance Test in a Real Environment

To validate the performance of our proposed pipeline, the authors conducted experiments in a real-world environment. The specific scenarios are detailed as follows: as shown in Figure 10, the quadrotor is kept in a hovering state under stable lighting conditions, and a person throws a football of unknown size with a speed of approximately 8 m/s towards it from a distance of 5 to 8 m. Numerous tests show that our pipeline achieves more than an 80% success rate in obstacle avoidance. Additionally, even when the quadrotor is moving at low speeds, our pipeline can still achieve a high success rate for obstacle avoidance. However, when the quadrotor operates at higher speeds, the performance of our pipeline declines significantly, with frequent false detections that hinder effective obstacle avoidance. Addressing this challenge is a key focus for future research.

6. Conclusions

This paper presents a novel and efficient detection framework that integrates events with depth information for dynamic object detection. The proposed methodology incorporates an adaptive sampling strategy and an innovative event-based image construction technique, specifically designed to minimize latency while enhancing detection accuracy. Extensive experimental evaluations demonstrate the framework’s superior performance compared to existing state-of-the-art approaches, achieving significant improvements with approximately 20% faster detection speeds and around 5% higher accuracy across various real-world scenarios. Furthermore, a comprehensive obstacle avoidance pipeline has been developed and successfully implemented on a quadrotor platform. Field experiments conducted in outdoor environments confirm the system’s capability to reliably detect and avoid obstacles within a 5-to-8-m range, even when encountering threats approaching at velocities of approximately 8 m/s. The framework’s modular architecture facilitates seamless integration with additional functional modules, enabling adaptation to diverse downstream applications. The authors maintain that this approach constitutes an effective solution for dynamic obstacle detection while demonstrating practical viability for real-world quadrotor platform deployment.

Author Contributions

Conceptualization, D.C.; methodology, D.C. and L.Z.; software, L.Z.; validation, D.C. and L.Z.; formal analysis, C.G., D.C. and L.Z.; investigation, D.C. and L.Z.; resources, C.G.; data curation, D.C. and L.Z.; writing—original draft preparation, D.C.; writing—review and editing, D.C.; visualization, L.Z.; supervision, C.G.; project administration, C.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hubei Province Science and Technology Major Project (2021AA010).

Data Availability Statement

The dataset involved in this study is a private dataset. Interested readers may obtain the original dataset and source code via email (dwchen@whu.edu.cn).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, G. Visual-inertial navigation: A concise review. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9572–9582.
  2. Falanga, D.; Kleber, K.; Scaramuzza, D. Dynamic obstacle avoidance for quadrotors with event cameras. Sci. Robot. 2020, 5, eaaz9712.
  3. Rodríguez-Gómez, J.P.; Tapia, R.; Garcia, M.d.M.G.; Martínez-de Dios, J.R.; Ollero, A. Free as a bird: Event-based dynamic sense-and-avoid for ornithopter robot flight. IEEE Robot. Autom. Lett. 2022, 7, 5413–5420.
  4. Forrai, B.; Miki, T.; Gehrig, D.; Hutter, M.; Scaramuzza, D. Event-based agile object catching with a quadrupedal robot. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 12177–12183.
  5. Falanga, D.; Kim, S.; Scaramuzza, D. How fast is too fast? The role of perception latency in high-speed sense and avoid. IEEE Robot. Autom. Lett. 2019, 4, 1884–1891.
  6. Posch, C.; Serrano-Gotarredona, T.; Linares-Barranco, B.; Delbruck, T. Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output. Proc. IEEE 2014, 102, 1470–1484.
  7. Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 154–180.
  8. He, B.; Li, H.; Wu, S.; Wang, D.; Zhang, Z.; Dong, Q.; Xu, C.; Gao, F. Fast-dynamic-vision: Detection and tracking dynamic objects with event and depth sensing. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3071–3078.
  9. Gallego, G.; Rebecq, H.; Scaramuzza, D. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3867–3876.
  10. Stoffregen, T.; Kleeman, L. Event cameras, contrast maximization and reward functions: An analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12300–12308.
  11. Liu, H.; Moeys, D.P.; Das, G.; Neil, D.; Liu, S.C.; Delbrück, T. Combined frame- and event-based detection and tracking. In Proceedings of the 2016 IEEE International Symposium on Circuits and Systems (ISCAS), Montreal, QC, Canada, 22–25 May 2016; pp. 2511–2514.
  12. Iacono, M.; Weber, S.; Glover, A.; Bartolozzi, C. Towards event-driven object detection with off-the-shelf deep learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–9.
  13. Tomy, A.; Paigwar, A.; Mann, K.S.; Renzaglia, A.; Laugier, C. Fusing event-based and RGB camera for robust object detection in adverse conditions. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 933–939.
  14. Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13884–13893.
  15. Gallego, G.; Gehrig, M.; Scaramuzza, D. Focus is all you need: Loss functions for event-based vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12280–12289.
  16. Peng, X.; Gao, L.; Wang, Y.; Kneip, L. Globally-optimal contrast maximisation for event cameras. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3479–3495.
  17. Shiba, S.; Klose, Y.; Aoki, Y.; Gallego, G. Secrets of Event-based Optical Flow, Depth and Ego-motion Estimation by Contrast Maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7742–7759.
  18. Paredes-Vallés, F.; Scheper, K.Y.; De Croon, G.C. Unsupervised learning of a hierarchical spiking neural network for optical flow estimation: From events to global motion perception. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2051–2064.
  19. Gallego, G.; Scaramuzza, D. Accurate angular velocity estimation with an event camera. IEEE Robot. Autom. Lett. 2017, 2, 632–639.
  20. Kim, H.; Kim, H.J. Real-time rotational motion estimation with contrast maximization over globally aligned events. IEEE Robot. Autom. Lett. 2021, 6, 6016–6023.
  21. Stoffregen, T.; Gallego, G.; Drummond, T.; Kleeman, L.; Scaramuzza, D. Event-based motion segmentation by motion compensation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 7244–7253.
  22. Zhou, Y.; Gallego, G.; Lu, X.; Liu, S.; Shen, S. Event-based motion segmentation with spatio-temporal graph cuts. IEEE Trans. Neural Networks Learn. Syst. 2021, 34, 4868–4880.
  23. Mondal, A.; Giraldo, J.H.; Bouwmans, T.; Chowdhury, A.S. Moving object detection for event-based vision using graph spectral clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 876–884.
  24. Mitrokhin, A.; Fermüller, C.; Parameshwara, C.; Aloimonos, Y. Event-based moving object detection and tracking. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–9.
  25. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
  26. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
  27. Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robot. Res. 1986, 5, 90–98.
  28. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231.
Figure 1. Visualization results of different stages in the detection process. (a) is the RGB image provided by the D455 camera, for display purposes only; (b) is the visualization of all events captured by the DAVIS346 within a time window, where blue denotes positive events and red indicates negative events. Note that it is not strictly aligned with the RGB image in space and time; (c) is the event queue frame obtained after compensating for the camera’s ego-motion; (d) is the final detection result obtained after a series of processing steps, which filters out the noise and retains the true obstacles compared to (c); (e) is the heatmap distribution for (c,d), showing an increasing trend from left (minimum) to right (maximum).
Figure 2. The overview of the proposed detection pipeline. For each time window, the events and corresponding data are first saved to a buffer. Then, ego-motion compensation, construction, and the filtering of event queue images and detection based on image moments are executed sequentially. Finally, the bounding box of the obstacle on the image is outputted.
Figure 3. Adaptive time window sampling strategy. Note that the size of the current time window (blue dashed box) depends not only on the sampling time window (green dashed box) but also on the previous time window (pink dashed box).
Figure 4. The warped events update the MREQ (which is initialized to zero) according to the “First in, First out” principle. Once all the warped events are incorporated into the MREQ, averaging, normalization, and filtering operations are performed.
Figure 5. The iterative process of estimating speed.
Figure 6. Visualization of our proposed repulsive force.
Figure 7. Quadrotor flight platform and the main hardware it carries.
Figure 8. Six different experimental scenarios. Note that the RGB images are only used for illustration and are not used as inputs for the detection algorithm.
Figure 9. Visualization of some of the detection results in six scenes, where the correspondence between colors follows Figure 1e.
Figure 10. Experiments in the real world. (a) demonstrates our experiment of throwing a soccer ball toward the quadcopter from a distance of approximately 8 meters; (b,c) show the process of the quadcopter detecting the object and performing an obstacle avoidance maneuver; (d) illustrates its successful evasion of the obstacle.
Table 1. Software dependencies.
Name           Software Dependency    Version
VINS-Fusion    Cmake                  3.22.1
               Ceres Solver           2.0.0
               Eigen                  3.3.7
PX4            python                 3.9.13
               OpenCV                 4.2.0
               make                   4.1
               gazebo                 9.0
Table 2. Datasets’ configurations.
Topic                           Frequency (Hz)    Data Format
/dvs/events                     -                 dvs_msgs/EventArray
/mavros/imu/data                190               sensor_msgs/Imu
/camera/color/image_raw         30                sensor_msgs/Image
/camera/depth/image_rect_raw    45                sensor_msgs/Image
/vins_fusion/odometry           15                nav_msgs/Odometry
Table 3. Computational overhead of constructing the image to be detected. Please note that we have bolded the optimal results.
Indicators    Methods    Scene 1    Scene 2    Scene 3    Scene 4    Scene 5    Scene 6
min (ms)      Ours       0.12       0.12       0.17       0.11       0.17       0.18
              Falanga    0.03       0.07       0.28       0.29       0.06       0.23
              FAST       0.22       0.19       0.39       0.41       0.25       0.26
avg (ms)      Ours       0.15       0.32       0.43       0.32       0.52       0.48
              Falanga    0.07       0.24       0.67       0.47       0.37       0.51
              FAST       0.27       0.49       0.89       0.68       0.69       0.59
max (ms)      Ours       0.82       0.83       1.85       1.94       0.91       1.15
              Falanga    0.33       0.55       1.27       0.96       0.63       0.67
              FAST       0.48       1.35       1.57       1.24       1.11       0.73
total (ms)    Ours       8.79       9.42       26.25      18.86      17.34      28.86
              Falanga    3.92       4.79       28.29      21.39      10.35      30.07
              FAST       16.36      17.63      52.98      40.97      23.41      33.24
Table 4. Performance comparison of detection. Please note that we have bolded the optimal results.
Scene      Methods    Running Time (ms)                     Relative Contrast (%)
                      min     avg     max     total         χ_min    χ_avg    χ_max
Scene 1    Ours       0.73    0.92    2.85    55.59         22.7     81.9     100
           Falanga    0.82    1.18    2.14    62.83         26.4     70.1     100
           FAST       0.97    1.25    2.52    75.24         29.1     75.3     100
Scene 2    Ours       0.69    0.91    3.19    56.05         19.2     71.8     95.6
           Falanga    0.88    0.98    2.59    59.91         23.9     51.2     97.6
           FAST       0.91    1.23    2.74    75.53         26.8     60.4     100
Scene 3    Ours       0.95    1.37    2.61    83.31         0.7      27.1     54.2
           Falanga    1.04    1.59    3.62    89.23         3.8      18.7     45.1
           FAST       1.19    1.85    2.47    112.9         4.3      20.6     48.9
Scene 4    Ours       0.84    1.18    4.62    71.32         4.3      27.5     52.9
           Falanga    0.95    1.35    4.36    81.28         6.2      14.6     35.5
           FAST       1.12    1.59    3.17    95.59         7.4      16.9     43.4
Scene 5    Ours       1.02    1.28    2.09    78.86         8.9      35.1     94.2
           Falanga    1.17    1.39    1.61    80.59         10.4     25.9     80.3
           FAST       1.21    1.53    1.74    83.27         12.1     30.2     90.5
Scene 6    Ours       1.03    1.40    3.55    83.96         8.5      29.7     59.6
           Falanga    1.21    1.56    3.61    90.18         9.1      20.2     40.5
           FAST       1.57    1.69    2.47    87.55         9.3      26.9     54.1
Table 5. Accuracy of moving object detection. Please note that we have bolded the optimal results.
Method     Indicators    Scene 1    Scene 2    Scene 3    Scene 4    Scene 5    Scene 6
Ours       N_OD          14         20         18         16         13         25
           acc           93%        90%        85%        75%        92%        73%
Falanga    N_OD          16         24         20         16         15         28
           acc           89%        85%        70%        67%        85%        64%
FAST       N_OD          16         24         20         16         15         28
           acc           92%        89%        80%        75%        82%        68%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
