ESVIO: Event-Based Stereo Visual-Inertial Odometry

The emerging event cameras are bio-inspired sensors that can output pixel-level brightness changes at extremely high rates, and event-based visual-inertial odometry (VIO) is widely studied and used in autonomous robots. In this paper, we propose an event-based stereo VIO system, namely ESVIO. Firstly, we present a novel direct event-based VIO method, which fuses events’ depth, Time-Surface images, and pre-integrated inertial measurement to estimate the camera motion and inertial measurement unit (IMU) biases in a sliding window non-linear optimization framework, effectively improving the state estimation accuracy and robustness. Secondly, we design an event-inertia semi-joint initialization method, through two steps of event-only initialization and event-inertia initial optimization, to rapidly and accurately solve the initialization parameters of the VIO system, thereby further improving the state estimation accuracy. Based on these two methods, we implement the ESVIO system and evaluate the effectiveness and robustness of ESVIO on various public datasets. The experimental results show that ESVIO achieves good performance in both accuracy and robustness when compared with other state-of-the-art event-based VIO and stereo visual odometry (VO) systems, and, at the same time, with no compromise to real-time performance.


Introduction
Visual odometry (VO) is an emerging and hot research topic in the fields of robotics and computer vision [1], which aims to make the real-time estimation of camera poses and motion trajectory using only visual sensors (such as monocular camera [2], stereo camera [3], panoramic camera [4], depth camera [5], etc.). However, due to the limitation of imaging modalities, standard camera-based VO methods are not robust enough in some challenging scenarios, such as fast-moving, high dynamic range (HDR), rapid illumination changing, etc. In recent years, a new type of bio-inspired vision sensor, namely event camera, has gradually attracted the attention of researchers. It works differently from the standard camera and has independent pixels which can capture brightness changes individually and output this information as "event" asynchronously [6]. Compared to standard cameras, event cameras have the higher temporal resolution, wider dynamic range, lower latency, lower power consumption, and bandwidth [7,8]. With these advantages, researchers have developed a series of event-based VO methods [8][9][10][11][12] to provide correct camera state estimation in the challenging scenarios, as mentioned above. However, due to the hardware limitation of event cameras (such as noise and dynamic effects, low spatial resolution, etc.) and related research is still in the exploration stage, the event-based VO still faces the problems of low accuracy and low robustness in some scenarios.
In the traditional VO field, introducing a low-cost inertial measurement unit (IMU) to improve the accuracy and robustness of camera state estimation is a high-efficiency solution, which is called visual-inertial odometry (VIO) and has been a research highlight in the past decade [13][14][15][16][17][18]. For the event-based VO, integrating inertial measurement can also assist the system in estimating motion and recovery scale. At present, a few works have explored eventbased VIO methods and achieved accurate state estimation results, such as [19][20][21][22][23][24]. However, two common challenges exist in the event-based VIO field: (1) how to process the novel event data and fuse them with IMU measurement to estimate camera motion. Most of the existing event-based VIO methods (except for [22]) convert event stream to event image frames and execute feature-based VIO pipelines based on these frames. These methods will extract and track specific visual features from consecutive event image frames. Then, they will minimize feature reprojection errors and IMU propagation errors to estimate camera motion. However, due to the working principle and hardware limitations of event cameras, most event image frames have the disadvantages of high noise and fewer texture details, which is not conducive to visual feature extraction and tracking. (2) How to solve initialization parameters by fusing event data and IMU measurement in the initial stage of the VIO system. Visual-inertial initialization is crucial for VIO, which can help the system obtain parameters required for state estimation, including scale, gravity direction, IMU bias, etc. However, for existing event-based VIO systems, there are no initialization methods that can effectively fuse event data and IMU measurement together to estimate initialization parameters.
To this end, firstly, for the challenge (1), we propose a novel direct event-based VIO method for better fusion of event data and IMU measurement. This VIO method can directly estimate the state of the system based on event inverse depth frames, Time-Surface (TS) images, and pre-integrated inertial measurement in a sliding window non-linear optimization framework without feature tracking. Secondly, for challenge (2), we propose a semi-joint event-inertial initialization method to improve the performance of event-based VIO state estimation. This initialization method can obtain an up-to-scale camera trajectory based on the event-only initialization (since the trajectory estimated by event-based stereo VO methods, such as ESVO [8], always has a scale error, so the trajectory here is deemed up-to-scale) and solve initialization parameters using the event-inertial initial optimization. Thirdly, referring to ESVO [8], we implement an event-based stereo VIO system based on the above two proposed methods, namely ESVIO. Finally, we evaluate ESVIO on the public datasets and the experimental results show that ESVIO achieves better performance in both accuracy and robustness in different scenarios. To conclude, our contributions can be summarized as below: • We present a direct event-based stereo VIO method for the first time, which uses the sliding window tightly coupled optimization method to directly estimate the optimal state of the system based on the event and IMU data without feature tracking. • We propose a semi-joint event-inertial initialization method to estimate initialization parameters, including scale, gravity direction, initial velocities, accelerometer, and gyroscope biases, in two steps (event-only initialization and event-inertial initial optimization). • We implement the event-based stereo VIO system, ESVIO, in C++ and evaluate it on four public datasets [25][26][27][28]. The results demonstrate that our system achieves good accuracy and robust performance when compared with the state-of-the-art, and, at the same time, with no compromise to real-time performance.
The rest of the paper is organized as follows. Section 2 gives a brief review of related works. The system overview of ESVIO is presented in Section 3, followed by the preliminaries in Section 4. The proposed event-inertial initialization method is introduced in Section 5, and the detailed direct event-based VIO pipeline for state estimation is given in Section 6. Section 7 provides experimental evaluations. Finally, our conclusion is drawn in Section 8.

Visual-Inertial Odometry Methods
Considering different sensor fusion methods, VIO methods can be divided into looselyand tightly coupled methods [29]. For the loosely-coupled method, firstly, it estimates the state of the system based on visual information and IMU measurement separately in two independent estimators, and, then, it solves the system state based on these two state estima-tion results. Loosely-coupled methods are common in the early stages of VIO development, generally through an Extended Kalman Filter (EKF) to achieve sensor fusion and the final state estimation, such as [30,31]. For the tightly coupled method, it directly estimates the system state based on visual information and IMU measurement together in one estimator. Compared with the loosely coupled method, the tightly coupled method needs to consider the interaction between different sensor data, and the algorithm implementation is more complicated, but it will obtain more accurate state estimation results. Until now, tightly coupled methods mainly achieve sensor fusion and state estimation through filter-based estimation methods (e.g., EKF) [14,32,33] or optimization-based estimation methods (e.g., graph optimization) [13,15,17,[34][35][36][37].
Considering visual measurement processing, VIO methods can also be divided into two different paradigms: feature-based VIO methods and direct VIO methods. For the feature-based method, firstly, it extracts and tracks specific visual features from consecutive camera frames through feature descriptors. Then, it will minimize feature reprojection errors and IMU propagation errors to estimate the system state. At present, most of the existing VIO methods are feature-based methods due to their robustness and maturity, such as [13,15,33,[38][39][40]. Unlike the feature-based method, the direct VIO method directly uses the intensity value of pixels to construct the photometric error between two camera frames, and then estimates the system state by minimizing photometric errors and IMU propagation errors without the procedure of feature extraction and tracking. Compared with featurebased methods, direct VIO methods have faster processing speed and better performance in textureless scenarios [41], such as [16,18,42,43]. However, direct methods typically require a higher frame rate than feature-based methods to maintain good performance [44].

Event-Based Visual Odometry and SLAM
For event-based VO/SLAM methods, they can be divided into monocular ones and stereo ones. Mueggler et al. [9] presented an onboard event-based monocular system for 6-Degrees of Freedom (DoF) localization with high speed and fast rotation. Kim et al., [10] can perform real-time 3D reconstruction, 6-DoF localization, and scene intensity estimation by EKF based on a monocular event camera. Kueng et al. [45] implemented a monocular VO system by event-based feature detection. Rebecq et al. [11] proposed EVO that is implemented by an image alignment tracker via semi-dense mapping. Bryner et al. [12] introduced a maximum-likelihood framework to estimate the camera motion. Nguyen et al. [46] used the Stacked Spatial LSTM Network to estimate camera poses with event images as input. However, monocular methods always face the scale consistency problem, and stereo methods perform better in this aspect thanks to the known depth information.
For event-based stereo VO/SLAM methods, Zhou et al. [8] proposed a stereo eventbased VO system called ESVO. The system achieved semi-dense 3D reconstruction [25] and 3D-2D registration tracking [47] via inverse depth map and TS. Jiao et al. [26] analyzed and compared the performance of ESVO under different event representation methods. Compared with monocular methods, stereo ones allow for more accurate and robust state estimation. However, due to the hardware limitation of the event camera, it is still difficult to obtain accurate state estimation results for VO/SLAM methods based on the event camera only in some scenes [8]. Therefore, fusing more sensor information, such as IMU measurement, is a solution to improve the performance of event-based VO/SLAM methods.

Event-Based Visual-Inertial Odometry
Until now, there are not many VIO methods based on the event camera. Zhu et al. [19] presented the event-based VIO system, namely EVIO, to provide accurate metric tracking of a camera's full 6-DoF pose. EVIO uses the sensor state and event information to track visual features within the event frame over time and fuses visual feature tracks with the IMU measurement to update the system state employing an EKF with a structureless vision model. Rebecq et al. [20] also proposed a feature-based method, but fused event and inertial measurements using a keyframe-based non-linear optimization framework to estimate camera motion. Based on [20], Vidal et al. [21] implemented the first VIO state estimation method that fuses events, standard frames, and inertial measurements in a tightly coupled manner to achieve accurate and robust state estimation. Mueggler et al. [22] proposed a VIO method with an event camera under the continuoustime framework which allows direct integration of the asynchronous events with microsecond accuracy and inertial measurements at high frequency. Gentil et al. [23] presented an event-based VIO method which uses line features for the first time and constructs different line feature constraints to assist in state estimation. Chen et al. [24] proposed an event-based stereo VIO pipeline with sliding windows optimization, and further extended it to include standard frames, which tightly integrated stereo event streams, stereo image frames, and IMU together, to achieve accurate feature-based state estimation.
However, most of the above event-based VIO methods (except for [24]) are designed for monocular systems, and most of them (except for [22]) belong to the category of the feature-based VIO method. Compared with the standard camera frame, the event image frame constructed from raw event data has the disadvantages of low resolution, high noise, and fewer texture details due to the working principle and hardware limitations of event cameras, which makes the event-based feature extraction and tracking not as accurate and stable as needed [44]. Therefore, our proposed ESVIO can obtain events' depth by stereo constraint and adopt direct VIO method to implement the fusion of event data (event depth and event image frame) and inertial measurement without feature tracking.

Visual-Inertial Initialization Methods
To start a VIO system, some parameters need to be estimated in an initialization process, including scale, gravity direction, initial velocity, accelerometer and gyroscope biases, etc. A fast and accurate initialization is critical for the subsequent visual-inertial state estimation. Based on the classification of initialization in [48], we divide visual-inertial initialization methods into three categories: joint, disjoint, and semi-joint initialization.
Joint initialization is similar to the tightly-couple method, which directly fuses raw visual and inertial measurements and estimates all initialization parameters in one step. The joint initialization method was first proposed by Martinelli [49], who presented a closed-form solution to jointly solve feature depth (scale), gravity, accelerometer bias, and initial velocity by linear least squares. Subsequent joint initialization methods are basically extended based on this method, including [50][51][52]. The advantage of joint initialization is that the interaction between the parameters is fully considered, and the disadvantage is that the real-time performance and the convergence of the solution process are not stable.
Disjoint initialization is similar to the loosely coupled method, which processes visual and inertial information separately, and solves initialization parameters step by step. Disjoint initialization is based on the assumption that the up-to-scale camera trajectory can be estimated very accurately from pure monocular vision, and then use this trajectory to estimate the initialization parameters [48]. Disjoint visual-inertial initialization was pioneered by Mur-Artal et al. in ORB-SLAM-VI [35] and later adopted by Qin et al. in VINS-Mono [15,53]. All subsequent disjoint initialization methods are improvements to these two methods, such as [54][55][56]. Compared to joint initialization, disjoint initialization has faster solution speed and better robustness. However, due to insufficient consideration of the interaction between initialization parameters and possible visual measurement noise, disjoint initialization does not perform well in estimation accuracy.
Semi-joint initialization can be seen as a compromise between joint initialization and non-joint initialization. It first solves partial parameters based on the visual measurement and then estimates all initialization parameters in one step. This idea was first implemented by Yang et al. [34], but matured by Campos et al. in ORB-SLAM-3 [39,48]. In ORB-SLAM-3, visual-inertial initialization was formulated as two maximum a posteriori (MAP) estimations: visual-only MAP estimation responsible for obtaining up-to-scale camera poses and inertial-only MAP estimation responsible for solving all initialization parameters. Semi-joint initialization performs well in terms of real-time performance and accuracy performance.
In this paper, we design a semi-joint event-inertial initialization method for eventbased VIO to achieve better VIO state estimation results.

System Overview of ESVIO
The system overview of ESVIO is shown in Figure 1. The proposed ESVIO can be divided into four components: the measurements processing unit, the initialization unit, the depth estimation unit, and the VIO state estimation unit. Figure 1. System overview of ESVIO. The measurements of stereo event cameras and IMU will be handled by the Measurements Processing Unit, and then the processed data will be sent to the Initialization Unit for event-only initialization and event-inertial initial optimization. After initialization, the Depth Estimation Unit will solve events' depth and fuse the inverse depth frame. Finally, the VIO State Estimation Unit will estimate the camera's 6-DoF pose, velocity, and IMU biases based on inverse depth frames, negative TS images, and pre-integrated IMU measurement.
The measurements processing unit is responsible for processing sensor measurements, including event processing and IMU measurement processing. For event processing, ESVIO converts event streams to TS images for subsequent depth estimation and state estimation (Section 4.2). For IMU measurements processing, ESVIO adopts the IMU pre-integration algorithm to handle IMU measurements which can efficiently reduce the complexity and calculation time cost of the IMU propagation model (Section 4.3).
The initialization unit is responsible for solving initialization parameters to improve the performance of VIO state estimation, including event-only initialization and eventinertial initial optimization. The first part will estimate an up-to-scale camera trajectory based on event data (Section 5.1). Additionally, the second part will solve initialization parameters by fusing event and IMU data in a sliding window optimization framework (Section 5.2). The depth estimation unit is responsible for inverse depth frame construction. This unit estimates the single event's depth by stereo event matching and fuses these asynchronous depth points to construct an inverse depth frame (Section 4.4).
The VIO state estimation unit is responsible for camera state estimation by fusion of event data and inertial measurement. Firstly, it selects a depth points subset to reconstruct the inverse depth frame (Section 6.1). Secondly, it generates negative TS images for state estimation (Section 6.2). Thirdly, it estimates the state of the VIO system (including the camera's 6-DoF pose, velocity, and IMU biases) using a novel tightly coupled direct VIO pipeline in a sliding window non-linear optimization framework based on inverse depth frames, negative TS images and pre-integrated inertial measurement (Section 6.3).

Notations
We denote w (·) as the world frame. b (·) is the body (IMU) frame. c (·) is the camera frame, which is equivalent to the inverse depth frame used in ESVIO. In particular, we denote cp (·) and cp(r) (·) as the left and right camera image plane of the camera frame c (·). Note that the camera frames used in the paper are all left camera frames. We use both rotation matrix R and quaternion q to represent rotation and use translation vector p to represent translation. We consider x m n or X m n as the vector x or matrix X from frame n to frame m. m t is the frame m with timestamp t. We also use m i to represent the frame m while taking the i-th inverse depth frame.

Event Presentation Based on Time-Surface
The output of the event camera is a stream of asynchronous events. We use e k = (u k , v k , t k , p k ) to represent the k-th event in event stream, containing the pixel coordinate x k (u k , v k ) ∈ R 2 , timestamp t k and polarity p k . Referring to ESVO [8], ESVIO adopts an alternative event representation method called Time-Surface (TS). TS is a twodimensional (2D) camera size map where each pixel stores the time information. At time t, it can be defined by where t last means the timestamp of the latest event at the pixel cp t x( cp t u, cp t v) ∈ R 2 , and ϕ is the constant decay rate parameter (e.g., 30 ms). Based on Equation (1), we can convert a stream of events to a TS image whose pixel's intensity is the value of the corresponding pixel in the TS map re-scaled from [0, 1] to the range [0, 255].
TS can be considered as a kind of fixed time window event processing method. The length of the time window is determined by the TS image generation frequency and the average brightness (intensity) of the TS image is determined by ϕ in Equation (1). However, we find that the setting of ϕ with a fixed value cannot handle different camera motions (different occurrence frequencies of events), especially at a low speed. When the camera moves slowly (low event occurrence frequency), there are few pixels in the TS image that can provide enough intensity information for subsequent depth estimation and state estimation. In order to tackle this problem, we have made improvements to the TS method used in ESVO to cope with slow camera motion. Based on Equation (1), we count the number of events across the image plane in the latest time window as n, and define the improved TS value: where N τ depends on the number of events for depth estimation and we set N τ as N in Section 4.4, ϕ N τ is defined by where t N τ is the timestamp of the latest N τ -th events across the image plane, and t w is the length of the time window of TS. The improved TS also needs to be rescaled to the range [0, 255] to create the corresponding TS image. From Equations (2) and (3), we can find that: when events occur at a normal or high frequency (normal or fast camera motion, n ≥ N τ ), the improved TS will work as the TS used in ESVO; when events occur at a low frequency (slow camera motion, n < N τ ), there are at least N τ events which can provide enough intensity information in the TS image (when t last ( cp t x) = t N τ , the intensity of x's corresponding pixel is 255 × exp(−t w /ϕ)) for depth and state estimation. Figure 2 shows the processes of different TS methods under slow camera motion in an office scene. Figure 2. Different TS methods under slow camera motion in an office scene. When the camera is in slow motion, the improved TS will change the decay rate parameter based on t N τ , while the ESVO's will not. The left column is the intensity frames from the DAVIS346 event camera of the scene. The middle three columns are the TS processes: the first column shows the determination of t N τ and the events which can provide enough intensity information for TS images (in the yellow boxes); the second and third columns show how events are converted into pixels of different intensities under fixed ϕ (ESVO's TS) and unfixed ϕ N τ (improved TS). The right column is the output TS images.

IMU Pre-Integration
IMU pre-integration [57] is a computationally efficient alternative to the standard inertial measurement integration. It performs the discrete integration of the inertial measurement dynamics in a local frame of reference, thus preventing the need to reintegrate the state dynamics at each optimization step [41]. In our ESVIO, we use IMU pre-integration methods proposed in [15] to process IMU measurement.
A low-cost IMU, generally including a 3-axis accelerometer and a 3-axis gyroscope, can measure the acceleration a and the angular velocity ω of the IMU. The raw measurements from IMU,â andω, can be represented by where b b a , b b g and b n a , b n g are the biases and additive noises from the accelerometer and the gyroscope in the IMU frame, respectively, w g is the gravity vector in the world frame. Referring to [15], we assume that the additive noises b n a and b n g are Gaussian, b n a ∼ N (0, σ a 2 ), b n g ∼ N (0, σ g 2 ). Acceleration bias and gyroscope bias are modeled as random walk, whose . Given two IMU frames b i and b i+1 that correspond to two consecutive inverse depth frames, position, velocity, and rotation in the IMU frame b i can be computed by integrating the kinematics equation of IMU-driven algorithm within the duration t ∈ [t i , t i+1 ]: where where ∆t is the duration between time interval t ∈ [t i , t i+1 ], ⊗ is the quaternion multiplication, · × is the skew-symmetric matrix of a 3D vector.
in Equation (6) are called pre-integration terms. Pre-integration terms can be understood as the relative motion from IMU frame b i to b i+1 without the change of the state of IMU frame b i .
Based on IMU pre-integration terms in Equation (6), we can establish the linear Gaussian error state recursion equation of IMU measurements, and derive the equation corresponding covariance matrix and Jacobian matrix in continuous-time and discrete-time for state estimation referring to [15]. Finally, we write down the IMU measurement model which will be used to construct the IMU residual in the initialization method (Section 5) and the VIO method (Section 6):

Depth Estimation Based on Time-Surface
When stereo TS images are fed into the depth estimation unit, ESVIO will follow the depth estimation method proposed by ESVO [8] to estimate events' depth and fuse an inverse depth frame with the same timestamp as the corresponding TS image. At time t, the depth of event e t−ε = ( cp t−ε x, t − ε, p) (with ε ∈ [0, δt], cp t−ε x ∈ R 2 ) on the left image plane can be estimated based on the stereo temporal consistency criterion across space-time neighborhoods of the events presented in [8]. The depth estimation optimization objective function of e t−ε is defined by where c t−ε ρ means the inverse depth of the event in the camera frame c t−ε at time t − ε, I le f t ( cp t (·), t) and I right ( cp(r) t (·), t) are the left and the right TS with the timestamp t, respectively, T c t c t−δt = (R c t c t−δt |p c t c t−δt ) is camera transformation matrix between camera frame c t−δt and c t computed from the VIO state estimation results, and the depth residual r i ( c t−ε ρ) is defined as where cp t x 1,i ∈ R 2 and cp(r) t x 2,i ∈ R 2 are pixel coordinates on the event (e t−ε ) corresponding patches cp t W 1 and cp(r) t W 2 of the left and the right TS at time t, respectively.
Based on Equations (9) and Equation (10), we can estimate the depth of the latest N events which occur at or before time t. In ESVIO, we set N as 1500 (MVSEC dataset [27]) and 1000 (RPG datasets [25] and ESIM dataset [26]) based on the corresponding setting in [8] for different sensor resolutions to obtain a trade-off result between accuracy performance and time cost. Then we will fuse these asynchronous event depth points into an inverse depth frame with timestamp t based on the probabilistic model of estimated inverse depth and inverse depth filters proposed in [8]. In the real implementation of ESVIO, we fuse the inverse depth frame with timestamp t based on 3N ∼ 5N event depth points, where N event depth points are associated with timestamp t and the rest are associated with several timestamps before t. In this way, we can construct inverse depth frames containing historical depth information for a short period of time.

Event-Inertial Initialization Method
Initialization is important for VIO to obtain a series of parameters, such as scale, gravity direction, and IMU bias, etc. These initialization parameters can help the system align with the world coordinate system and the scale of the real world, improving the convergence speed and estimation accuracy of state optimization in the VIO system. In the traditional VIO field, the initialization method, which combines visual information and IMU measurement, has been developed maturely, such as the disjoint initialization method used in VINS-Mono [15,53], the semi-joint method used in ORB-SLAM-3 [39,48], etc. However, for the event-based VIO field, there is no initialization method that fuses event data and IMU measurement together to estimate initialization parameters.
Therefore, we present an event-inertial initialization method for ESVIO, which can estimate scale, gravity direction, accelerometer, and gyroscope biases in the initial stage of ESVIO. This method is a kind of semi-joint initialization method and performs well in terms of real-time performance and accuracy performance. Our initialization method can be split into two steps: event-only initialization and event-inertial initial optimization.

Event-Only Initialization
The initialization procedure of ESVIO starts with the event-only initialization to obtain an up-to-scale camera trajectory. This trajectory can provide translational and rotational constraints for the event-inertial initial optimization to improve the accuracy and time performance of the optimization.
Firstly, ESVIO will apply a modified Semi-Global Matching (SGM) method [58] provided in ESVO [8] to generate an initial inverse depth frame based on the stereo TS images. Secondly, ESVIO will execute ESVO's tracking and mapping procedure bootstrapped by the initial inverse depth frame. At this period, ESVIO will maintain a sliding window that contains a series of up-to-scale camera poses estimated by the event-based stereo VO process (ESVO). The construction method of the sliding window can refer to Section 6.3. Note that, we find that the camera motion trajectory estimated by ESVO always with a scale error as 10∼15%, so the camera poses and the trajectory obtained here are called up-to-scale ones, and the subsequent event-inertial initial optimization also takes the scale factor as one of the initialization parameters. The up-to-scale camera poses are defined as T c 0 where we set the first camera frame c 0 (·) in the sliding window as the reference frame and n is the length of the sliding window. Suppose we have the extrinsic parameters [R b c |p b c ] between the event camera (left) and the IMU, we can translate poses from camera frame to IMU frame by where s ∈ R + is the initialization parameter scale which can align the up-to-scale trajectory to the metric-scale trajectory. We will solve this parameter along with other initialization parameters in the next step.

Event-Inertial Initial Optimization
This step aims to solve an optimal estimation of the initialization parameters for ESVIO. Due to the mutual coupling between the initialization parameters, it is difficult to decouple all parameters and solve them sequentially in multiple steps. Therefore, we solve all initialization parameters through one-step optimization to obtain more accurate initialization results. When the camera motion in the sliding window maintained by the previous step (Section 5.1) is sufficient, ESVIO will carry out the event-inertial initial optimization using the sliding window optimization method based on the up-to-scale camera trajectory in the sliding window. We follow the method proposed in [53] to judge whether the camera motion is sufficient by checking the variation of pre-integration items in the sliding window. Before event-inertial initial optimization, ESVIO will make a rough estimation of the gravity direction based on velocity pre-integration items β in the sliding window. Considering two consecutive IMU frames b i and b i+1 in the sliding window, the velocity pre-integration item in Equation (5) can be converted to the reference frame c 0 (·) as: where R c 0 b i can be obtained by Equation (11). Adding all velocity pre-integration items as shown in Equation (12) we can obtain where t is the total duration of the sliding window. When we ignore the camera velocity changes during the sliding window, we can obtain a rough estimation of the gravity direction by normalizing the result in Equation (13).
Based on the up-to-scale camera trajectory and the rough gravity direction, ESVIO will start the event-inertial initial optimization by the sliding window optimization method. The optimal variables in the sliding window are defined by where R c 0 w is the gravity direction such that the gravity in the camera frame c 0 (·) is expressed as c 0 g = R c 0 w w g, with w g = (0, 0, G) being G the Gravitational Constant, x i is the camera initial velocity and the IMU biases in body frame at the time while taking the i-th data frame. n is the number of data frames in the sliding window.
Then, the optimization cost function C init for the event-inertial initial optimization can be designed by where , X ) is the residual for IMU measurement between consecutive IMU frames b i and b i+1 , r p is the prior residual for the accelerometer bias as r p = ∑ || b i b a 2 || referring to the initialization method [48]: if the motion does not contain enough information to estimate b a , this prior will keep its estimation close to zero; if the motion makes b a observable, its estimation will converge towards its true value.
is the covariance matrix for the IMU measurement residual and can be computed iteratively with IMU propagation as proposed in [15]. P p is the covariance matrix for the prior residual and can be defined as the diagonal matrix of noise σ b a 2 . As to the gyroscope bias b g , it is observable from estimated camera orientations and gyroscope readings, so we do not need to set a prior factor for it [48].
For the IMU measurement residual , X ), it can be defined according to the IMU measurement model defined in Equation (8) and the up-to-scale camera poses defined in Equation (11) and has the same form as Equation (25) with different position, velocity and rotation residuals: can be obtained from the up-to-scale camera trajectory estimated in the event-only initialization (Section 5.1), [·] xyz means the real part of a quaternion which is used to approximate the three-dimensional rotational error.
The solution to the non-linear optimization problem in Equation (15) is similar to the state estimation in Equation (19), and a detailed description can be found in Section 6.3. Additionally, the Jacobian matrices of the IMU measurement residual can be found in Appendix A.1.

Direct Event-Based VIO Method
In this section, we propose a novel direct event-based VIO method for ESVIO to directly estimate camera state based on event data and IMU measurement without feature tracking. Firstly, the proposed method reconstructs the inverse depth frame to simplify the redundant depth information. Then, it generates the negative TS image for state estimation. Thirdly, based on inverse depth frames, negative TS images, and pre-integrated IMU measurement, the method estimates the optimal state of the system in a sliding window non-linear optimization framework. In the following, we will describe in detail the inverse depth frame reconstruction, the negative TS image generation, and the direct VIO state estimation.

Inverse Depth Frame Reconstruction
For ESVIO, the inverse depth frame provided by the depth estimation unit sometimes contains too many depth points with redundant depth information, which will bring additional computation overhead. To tackle this problem, we propose an inverse depth frame reconstruction method to filter the depth point subset according to the distribution of depth points and reconstruct the inverse depth frame based on these depth points. As illustrated in Figure 3, the method can be achieved following the below four steps: Step 4: Select depth points from each connected region according to the distribution of depth points to reconstruct the inverse depth frame.

•
Step 1: Project all depth points of the inverse depth frame to its corresponding TS image (same timestamp) to enhance the intensity information in this TS image.

•
Step 2: Use EDLines [59] to extract line segments in the enhanced TS image to obtain the edge map (taking less than 1.5 ms per TS image in datasets used in Section 7). • Step 3: Extract connected regions in the edge map based on the contour retrieval algorithm [60]. According to the connected regions and projected coordinates in the TS image, cluster the depth points into each connected region. • Step 4: According to the proportion of the number of depth points in each connected region to the total number of depth points, determine the number of points to select in each connected region, and then randomly select this number of depth points from each connected region to reconstruct the inverse depth frame.

Negative TS Image Generation
In the TS image, the pixel with a large intensity value represents the recently triggered event. Additionally, in a certain direction of this pixel, the pixel's intensity value will decrease as the distance increases, representing the same event triggered at multiple previous moments. Due to this characteristic, the TS image can be interpreted as a kind of anisotropic distance field which is usually used in edge-based VO systems [8]. Referring to ESVO [8], we invert the TS image to the negative one and utilize it to construct the minimization optimization problem for the VIO state estimation. The negative TS image can be created from a TS map by then re-scale the pixel's value from [0, 1] to the range [0, 255]. In ESVIO, the negative TS image will be used to construct event measurement residual and participates in the state estimation.

Direct VIO State Estimation Based on Event Data and IMU Measurement
For the direct method in the traditional VIO field, it constructs the visual measurement residual by the photometric error between two camera frames, which will utilize the whole image information for state estimation. However, the event can only reflect the brightness change, not the brightness intensity, and event-based VIO can hardly construct visual residual by the photometric error. Therefore, inspired by ESVO [8], for the direct method in ESVIO, we construct the event data residual based on the projection error of the inverse depth frame on the negative TS image, which can utilize the information of event inverse depth frames and event image frames (negative TS images) more comprehensively than feature-based methods.
After achieving the inverse depth frame reconstruction (Section 6.1) and the negative TS image generation (Section 6.2), based on the reconstructed inverse depth frame, ESVIO will construct data frame which contains the corresponding negative TS image with the same timestamp and pre-integrated IMU measurement between two consecutive inverse depth frames. The sliding window will consist of a fixed number of data frames as shown in Figure 4. Note that, referring to [8], we only use the left negative TS for VIO because incorporating the right negative TS does not significantly increase accuracy while it doubles the computation cost.
The full state variables in the sliding window while taking the i-th data frame are defined by X = [x 0 , x 1 , . . . , x n ] where x i is described as the system position, velocity, and orientation in the world frame at the time while taking the i-th data frame and the IMU biases in the corresponding IMU frame. n is the number of data frames in the sliding window. We achieve the optimization of the state estimation based on inverse depth frames, negative TS images, and pre-integrated IMU measurement. All the state variables are optimized in the sliding window by minimizing the sum of the cost terms from all the measurement residuals, including IMU measurement residuals and event data residuals. We design the optimization cost function C state for state by where , X ) is the residual for IMU measurement and B is the set of pre-integrated IMU measurement in the sliding windows. r D,T (ẑ c l j , X ) is the residual for event data. D and T are the sets of inverse depth frames and corresponding negative TS images in the sliding window, respectively. P can be computed iteratively with IMU propagation as proposed in [15], and P c l j is set as the identity matrix during the implementation of ESVIO. ρ is the Huber loss function [61], which is used to down-weight the large event data residual to improve the efficiency and reduce the impacts of noise events or incorrect depth points, defined as where k is the scale of the Huber loss and we set k as 20 in ESVIO by trial-and-error method. ESVIO uses Levenberg-Marquard (LM) algorithm to solve the minimization nonlinear optimization problem in Equation (19). The optimal state vector X will be found by iteratively minimizing the Mahalanobis distance of all the residuals. At each LM iteration step, the state increment δX can be solved by following equation based on Equation (19): where H (·) is the Hessian matrix of residuals vector r (·) , we have H (·) = J (·) P −1 (·) J (·) and b (·) = −J (·) P −1 (·) r (·) , where J (·) is the Jacobian matrix of residuals vector r (·) with respect to X , and P (·) is the covariance matrix of measurements.
For position, velocity, and bias items of state vector X , the update operator and increments at each LM iteration step can be defined as For the rotation item of state vector X , since the four-dimensional rotation quaternion q is over-parameterized, we use a pertubation δθ ∈ R 3 as the rotation increment referring to [15]. Therefore, the update operator of q at each LM iteration step can be defined as We can also write the Equation (23) as the rotation matrix form: Considering the real-time performance, we do not set a specific marginalization step to marginalize old states. Formulations of measurements' residuals will be defined in the following, and the Jacobian matrices can be found in Appendix A.2.
For the IMU measurement residual between two consecutive frames b i and b i+1 in the sliding window, according to the IMU measurement model defined in Equation (8), it can be defined by For the event data residual between two data frames, inspired by ESVO [8], it can be defined as the depth projection error of one's inverse depth frame on another's negative TS image. Considering the j-th inverse depth frame and the l-th negative TS image (j < l) in the sliding window, the event data residual can be defined by where c j P m ∈ R 3 is the m-th depth point in the j-th inverse depth frame, c l P m ∈ R 3 is the same depth point but in the l-th camera frame coordinate system. Additionally, ( cp l u m , cp l v m ) is the projected pixel location of c l P m in the image plane of the l-th negative TS image. π is the projection function that turns a 3D point to a 2D pixel location using event camera intrinsic parameters.

Experimental Evaluation
We implement the ESVIO system referring to the implementation of the ESVO system [8] based on Robot Operating System (ROS) [62] in C++. In the ESVIO system, there are three different independent threads, namely time_surface thread, depth_estimation thread, and VIO thread. They run concurrently to achieve the VIO state estimation. time_surface thread implements the conversion of the event stream into TS images, which is a part of the measurements processing unit (in Figure 1). depth_estimation thread implements the depth estimation unit (in Figure 1). VIO thread implements the pre-integration of IMU measurement (another part of the measurements processing unit shown in Figure 1), the initialization unit (in Figure 1), and the VIO state estimation unit (in Figure 1). Each thread has a different running rate accordingly to ensure reliable operation and the real-time performance of the whole system.
To evaluate the proposed ESVIO system, we perform corresponding quantitative and qualitative experimental evaluations on the public RPG dataset [25] and MVSEC dataset [27]. The data provided by the RPG dataset were collected with a handheld stereo event camera in an indoor environment. Additionally, the sequences we used for 6-DoF pose estimation from MVSEC dataset were collected using a stereo event camera mounted on a drone flying in a capacious indoor environment. We also evaluate the ESVIO on the ESIM dataset, which is generated by Jiao et al. [26] using the event camera simulator ESIM [63]. In addition, we try to evaluate ESVIO on the driving dataset DSEC [28] where the data were collected from a stereo event camera mounted on a driving car with a higher resolution. These four datasets all provide stereo event stream, IMU measurement, intrinsic parameters of event cameras, extrinsic parameters between event cameras and the IMU. The ground truth 6-DoF poses are also provided except DSEC dataset. However, none of these four datasets provide ground truth IMU bias, which can be used for the initialization evaluation. We list the parameters of sensors in each dataset in Table 1. Based on the above datasets, firstly, we show the performance of the event-inertial initialization proposed in ESVIO. Secondly, we compare ESVIO with state-of-the-art methods in terms of accuracy performance. Thirdly, we show the improvement brought by the addition of IMU to ESVIO. Fourthly, we show the results of ESVIO on sequences from the driving dataset DSEC. Finally, we analyze the real-time performance of our system. Considering the randomness of the algorithm results [26], the results of quantitative evaluations (motion estimation, scale error, and time cost) are all average results of 10-trial tests. Additionally, considering the possible scale error in the adopted comparison methods, we align the estimated trajectory with ground truth using Sim(3) Umeyama alignment [64] in all evaluations. All experiments run on a computer equipped with Intel Core™ i9-10900K CPU @ 3.70 GHz and Ubuntu 20.04 LTS operation system.

Performance of the Event-Inertial Initialization
To prove the effectiveness of our proposed event-inertial initialization method for ESVIO, we compare ESVIO and its variant without the proposed event-inertial initialization (simply "ESVIO without initialization") on the RPG and MVSEC datasets which are collected in the real world. The ESVIO without initialization is achieved by the direct VIO state estimation method provided in Section 6 with the initial guesses for gravity direction (solved by Equation (13)), accelerometer bias (zero), gyroscope bias (zero), velocities, and depth (recovered from the event-only initialization method in Section 5.1).
Because RPG and MVSEC datasets do not provide IMU calibration parameters and ground truth IMU bias, we choose the scale error and the mean absolute translational error (ATE) as the metric for evaluations. We compare the scale error at the beginning of the VIO system (5 s after the system starts running) and at the end of the system for ESVIO and ESVIO without initialization. The scale error at the beginning of the system can directly reflect the influence of initialization on scale recovery. The scale error at the end of the system and the mean ATE can represent the effect of initialization on the VIO state estimation. As shown in Table 2, ESVIO demonstrates lower scale errors and better accuracy performance than ESVIO without initialization in most sequences. The results in Table 2 show that the event-inertial initialization can effectively recover the scale and improve the accuracy performance for ESVIO state estimation. In addition, note that although the lack of initialization will make the VIO system have a scale error in the initial stage, due to the addition of IMU, the scale error will decrease after the system works for a while. Table 2. Quantitative comparison results of ESVIO and ESVIO without initialization on the RPG and MVSEC datasets. The scale error and the mean absolute translational error (ATE) are used as the metric. The scale error includes the error at the beginning of the system (5s) and at the end of the system. The best result is highlighted in bold.

Sequence
Duration (

Evaluation of the VIO State Estimation Accuracy for ESVIO
To show the state estimation accuracy performance of ESVIO, we perform quantitative and qualitative evaluation experiments on the public ESIM, RPG, MVSEC datasets.
Firstly, the state estimation results of ESVIO evaluated on ESIM, RPG, and MVSEC datasets are shown in Figure 5. As shown in Figure 5, the trajectories (the fourth row) estimated by ESVIO have a good accuracy performance compared to the ground truth. Furthermore, the TS images (the second row) and inverse depth frames (the third row) from Figure 5 show the performance of ESVIO for event processing and depth estimation.
Secondly, we compare ESVIO with the state-of-the-art event-based stereo VO methods, including ESVO [8] and variants of ESVO with different event processing methods (refer to [26]), on ESIM, RPG, and MVSEC datasets. Table 3 shows the mean ATE under 12 sequences for ESVIO, ESVO, and variants of ESVO. The subscript of ESVO represents the different event-processing methods. TS represents TS (ESVO TS equals to the original ESVO). EM2000, EM3000, EM4000, EM5000 represent Event Map with 2000, 3000, 4000, 5000 events per map. EMTS represents a combination of TS and Event Map. We refer to [26] for the results of ESVO and variants of ESVO. As shown in Table 3, ESVIO demonstrates better accuracy performance in terms of motion estimation on all sequences compared with the state-of-the-art event-based stereo VO methods. In addition, we plot the absolute transla-tional and rotational error statistics of ESVIO and ESVO in Figure 6. As shown in Figure 6, for absolute translational error, ESVIO performs better than ESVO on all sequences. As to the rotational error, ESVIO performs better than ESVO in both rotational error median and distribution on most sequences. From Figure 6, the experimental results demonstrate that the proposed ESVIO not only has the advantages of accuracy performance, but also has the advantages of robustness compared with the event-based stereo VO method. Figure 5. Results of ESVIO evaluated on ESIM, RPG, and MVSEC datasets. The first row shows intensity frames from the event camera, the second and the third rows show the TS images and the inverse depth frames generated by ESVIO, and the fourth row shows the trajectories estimated by ESVIO. Inverse depth frames are color-coded from red (close) to blue (far) over a black background.
Thirdly, we compare ESVIO with the state-of-the-art event-based VIO method. At present, event-based VIO methods [19][20][21][22][23] are all designed for monocular systems and evaluated on the public dataset [65] which only provides monocular event data. Among these methods, only Ultimate SLAM [21] is open source method which is a feature-based VIO method and provides different VIO modes based on different sensor combinations. Therefore, we compare ESVIO and Ultimate SLAM on the same sequences used in Table 3, and the results are shown in Table 4. When we run Ultimate SLAM, we use its all parameter settings except the sensor calibration parameters. The results of Ultimate SLAM are collected under "event (E) + IMU" and "event (E) + IMU + frame (Fr)" two modes. As shown in Table 4, the performance of Ultimate SLAM is not as good as ESVIO. In most sequences, Ultimate SLAM cannot correctly estimate the state due to the failure of feature tracking (features can be extracted, but cannot be tracked stably). In the remaining sequences, although state estimation can be performed, the accuracy performance of Ultimate SLAM is still not as good as ESVIO due to the large estimation error in the initial stage. From the results in Table 4, we can find that ESVIO has better accuracy and robustness performances compared with Ultimate SLAM. Table 3. Accuracy performance with ESVIO and different event-based stereo VO methods. The public RPG, MVSEC, ESIM datasets are adopted for comparison. The mean ATE (cm) is used as the metric. The best result is highlighted in bold and the second best result is highlighted with underline.

Sequence
Duration (

Evaluation of the Impact of Adding IMU to ESVIO
To validate the improvement brought by the addition of IMU to ESVIO, we present the quantitative and qualitative experimental evaluations in this subsection. We evaluate ESVIO, ESVIO without IMU (state estimation implemented based on only event measurement residual), and ESVO on the public ESIM, RPG, and MVSEC datasets. The mean ATE and scale error are chosen as the evaluation metric. The evaluation results are shown in Table 5. The results show that ESVIO with IMU has more accurate state estimation accuracy and lower scale error, which demonstrates that the addition of IMU improves the state estimation accuracy performance for ESVIO. Additionally, we notice that ESVIO will not reduce to ESVO when no IMU measurement is used due to the different event processing methods (improved TS vs. TS) and different state estimation methods (sliding window optimization framework vs. 3D-2D registration between two consecutive frames). Note that, because [26] does not provide the scale error of ESVO, the results of ESVO in Table 5 are obtained by our own evaluation. Then we choose two sequences RPG_monitor_edited and MVSEC_indoor_flying1_edited to show the details of translational and rotational errors of ESVIO and ESVIO without IMU measurement. The RPG_monitor_edited sequence contains more rotations, while the MVSEC_indoor_flying1_edited sequence is collected by a drone in a larger indoor space with a faster camera motion. As shown in Figure 7, the top subfigures are the estimated trajectories, the middle subfigures are the translational errors over time, and the bottom subfigures are the rotational errors over time. From the trajectories subfigures, we can discover that the trajectories estimated by ESVIO have a lower scale error, which are more consistent with the ground truth trajectories. From the translational error subfigures and the rotational error subfigures, we can intuitively find that ESVIO always has a lower value of the translational and rotational errors. Furthermore, from Figure 7, we can discover that the addition of IMU can significantly reduce rotation estimation error when the camera keeps rotating (e.g., 5-10 s in the RPG_monitor_edited sequence) or make a sharp turn (e.g., 3-7 s in the MVSEC_indoor_flying1_edited sequence).

Experiments on DSEC Dataset
To show the performance of ESVIO in large-scale scenes, we perform qualitative evaluation experiments on the public driving dataset DSEC. Driving scenarios are challenging for event-based sensors because forward motions typically produce considerably fewer events in the center of the image (where apparent motion is small) than in the periphery [66]. Additionally, the higher sensor resolution (640 × 480), the large-scale outdoor scenes and the dynamic objects (moving cars) are also challenging for ESVIO. Since the DSEC dataset does not provide the ground truth 6-DoF poses, we only show the results of ESVIO for the DSEC dataset sequences zurich_city_04 and zurich_city_11_a in Figure 8. As shown in Figure 8g-l, ESVIO can successfully achieve the camera pose estimation not only on the medium-length driving sequences ((g), (h), (j), (l)), but also on the long driving sequences of up to more than 300 m ((i), (k)). Furthermore, the reconstructed 3D maps and estimated trajectories from Figure 8m,n show that ESVIO can complete semi-dense depth estimation and state estimation in the case of sparse and dense events (events in zurich_city_04_a and zurich_city_04_b are denser than in zurich_city_11_a, due to the richer texture). However, because of the very heavy events load, ESVIO can not run in real-time on the DSEC dataset, so we slow down the playback of the rosbag when performing the corresponding experiments in this subsection.

Real-Time Performance Analysis
In this section, we analyze the real-time performance of our proposed ESVIO system. Firstly, we present the time cost of the initialization of ESVIO. For the event-only initialization, ESVIO will take about 12 ms to execute the modified SGM method to generate an initial inverse depth frame. Then ESVIO will take about 10 ms and 46 ms to execute the tracking and mapping procedure of ESVO, respectively, for estimating up-to-scale camera poses. For event-inertial initialization, ESVIO will take less than 1 ms for the rough estimation of the gravity direction and about 95 ms for the event-inertial initialization optimization in the slide window optimization framework (including 10 data frames). Considering the visual observation conditions required for the modified SGM method and the collection and accumulation time of the data frames in the sliding window, the initialization will be completed within 2 s after the ESVIO starts.
Secondly, we show the real-time performance of ESVIO. Table 6 lists the detailed execution times of different threads of ESVIO under MVSEC_ indoor_ flying1_ edited sequence. As shown in Table 6, for ESVIO, the time_surface thread takes about 12 ms (∼ 83 Hz) to create the TS image, the depth_estimation thread takes about 46 ms (∼ 22 Hz) to estimate 1500 events' depth and fuse 6000 depth points to the inverse depth frame, the VIO thread takes about 24 ms (∼ 42 Hz) to perform state estimation in the slide window optimization framework (including 6 data frames) based on 500 depth points per reconstructed inverse depth frame. Regarding the choice for the size of the sliding window (6 as mentioned above), we justify it by showing its influence on the state estimation accuracy and computation cost. As shown in Table 7, we have investigated the effect of sliding window size (4, 6, 8, and 10 frames, respectively) in terms of state estimation accuracy and computation cost of the VIO thread under MVSEC_ indoor_ flying1_ edited sequence. The results show that the sliding window with 4 data frames presents the worst performance in accuracy, and the sliding window with 6/8/10 data frames demonstrates similar accuracy performance. However, the sliding window with 8/10 data frames has a larger computation cost. Considering the real-time performance, we set the sliding window size to 6 in the state estimation. Regarding the choice for the number of depth points selected in the state estimation (500 as mentioned above), we implement the same experiments under MVSEC_ indoor_ flyin-g1_ edited sequence. Table 8 shows the state estimation accuracy and computation cost of the VIO thread with selecting 300, 500, 1000, and 1500 depth points per reconstructed inverse depth frame. As shown in Table 8, while increasing the number of depth points can effectively improve the state estimation accuracy, however, it also brings more computation cost. Considering the real-time performance, we choose to select 500 depth points per reconstructed inverse depth frame in the state estimation. Regarding the number of events for depth estimation and the number of the depth points for depth fusion in the depth_estimation thread (1500 and 6000 as mentioned above), we refer to the corresponding settings in ESVO [8] and make some adjustments on this basis to obtain inverse depth frames with more depth information on different datasets.

Conclusions
This paper proposes a novel event-based stereo VIO system, namely ESVIO. Firstly, we present a direct event-based stereo VIO pipeline for the first time, which can directly estimate camera motion based on event data and IMU measurement in a tightly coupled sliding window non-linear optimization framework. Secondly, we design a semi-joint event-inertial initialization method that can solve initialization parameters in two steps at the initial stage of the VIO system. Based on the VIO and the initialization methods, we implement the ESVIO system. Corresponding experimental evaluations are conducted to prove the effectiveness of the proposed system, and the results show that ESVIO achieves good accuracy and robustness performance when compared with other event-based monocular VIO and stereo VO systems, and, at the same time, with no compromise to real-time performance.
However, ESVIO still has some limitations. Firstly, the performance of ESVIO is still limited by the hardware of event cameras. Secondly, ESVIO lacks the long-term data association mechanism, such as loop closure, which enables relocalization and drifts elimination.
In the future, we will introduce the standard camera frame to enhance the performance of ESVIO and implement loop closure for ESVIO. We will also implement a specific marginalization step for ESVIO. We would also like to record an event camera dataset that is more suitable for event-based VIO systems and event-inertial initialization methods to test and evaluate.  Acknowledgments: The authors would like to thank Yi Zhou for sharing their code of ESVO and edited rosbag files of RPG and MVSEC datasets. We also thank Jianhao Jiao for providing the ESVO and the variants of ESVO baselines [26] and edited rodbag files of ESIM dataset. We also thank Antoni Rosinol Vidal for sharing the code of Ultimate SLAM.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Appendix A.1. Jacobian Matrices in Initialization
The Jacobian matrices of the IMU measurement residual with respect to state increments δx i and δx i+1 are, respectively, computed by where [·] L is the left-quaternion-product matrices [67], J the Jacobian matrices of pre-integration terms with respect to accelerometer bias increment and gyroscope bias increment, respectively, referring to [15], such as J α The Jacobian matrices of the IMU measurement residual with respect to the gravity direction increment δR c 0 w and the scale increment δs are, respectively, computed by (A2) During the non-linear optimization, the Jacobian matrix of the IMU measurement residual with respect to the state increment δx i is computed by where [·] R is the right-quaternion-product matrices [67]. The Jacobian matrix of the IMU measurement residual with respect to the state increment δx i+1 is computed by Appendix A.

Jacobian Matrices of Event Data Residual
The Jacobian matrices of the event measurement residual with respect to system state increments δx j and δx l can be solved by the chain rule: where P is the projection matrix of the event camera. Finally, the third part of Equation (A5) can be, respectively, defined by