Event Collapse in Contrast Maximization Frameworks

Contrast maximization (CMax) is a framework that provides state-of-the-art results on several event-based computer vision tasks, such as ego-motion or optical flow estimation. However, it may suffer from a problem called event collapse, an undesired solution in which events are warped into too few pixels. As prior works have largely ignored the issue or proposed workarounds, it is imperative to analyze this phenomenon in detail. Our work demonstrates event collapse in its simplest form and proposes collapse metrics using first principles of space–time deformation, based on differential geometry and physics. We show experimentally on publicly available datasets that the proposed metrics mitigate event collapse and do not harm well-posed warps. To the best of our knowledge, regularizers based on the proposed metrics are, among the alternatives considered, the only effective solution against event collapse in our experimental settings. We hope that this work inspires further research to tackle more complex warp models.

The main idea of CMax and similar event alignment frameworks [27,28] is to find the motion and/or scene parameters that align corresponding events (i.e., events that are triggered by the same scene edge), thus achieving motion compensation. The framework simultaneously estimates the motion parameters and the correspondences between events (data association). However, in some cases CMax optimization converges to an undesired solution where events accumulate into too few pixels, a phenomenon called event collapse (Figure 1). Because CMax is at the heart of many state-of-the-art event-based motion estimation methods, it is important to understand the above limitation and propose ways to overcome it. Prior works have largely ignored the issue or proposed workarounds without analyzing the phenomenon in detail. A more thorough discussion of the phenomenon is overdue, which is the goal of this work.
Contrary to the expectation that event collapse occurs when the event transformation becomes sufficiently complex [16,27], we show that it may occur even in the simplest case of one degree-of-freedom (DOF) motion. Drawing inspiration from differential geometry and electrostatics, we propose principled metrics to quantify event collapse and discourage it by incorporating penalty terms in the event alignment objective function. Although event collapse depends on many factors, our strategy aims at modifying the objective's landscape to improve the well-posedness of the problem and be able to use well-known, standard optimization algorithms. In summary, our contributions are:

1. A study of the event collapse phenomenon in regard to event warping and objective functions (Sections 3.3 and 4).
2. Two principled metrics of event collapse (one based on flow divergence and one based on area-element deformations) and their use as regularizers to mitigate the above-mentioned phenomenon (Sections 3.4 to 3.6).
3. Experiments on publicly available datasets that demonstrate, in comparison with other strategies, the effectiveness of the proposed regularizers (Section 4).
To the best of our knowledge, this is the first work that focuses on the paramount phenomenon of event collapse, which may arise in state-of-the-art event-alignment methods. Our experiments show that the proposed metrics mitigate event collapse while they do not harm well-posed warps.

Contrast Maximization
Our study is based on the CMax framework for event alignment (Figure 2, bottom branch). The CMax framework is an iterative method with two main steps per iteration: transforming events and computing an objective function from such events. Assuming constant illumination, events are triggered by moving edges, and the goal is to find the transformation/warping parameters θ (e.g., motion and scene) that achieve motion compensation (i.e., alignment of events triggered at different times and pixels), hence revealing the edge structure that caused the events. Standard optimization algorithms (gradient ascent, sampling, etc.) can be used to maximize the event-alignment objective. Upon convergence, the method provides the best transformation parameters and the transformed events, i.e., the motion-compensated image of warped events (IWE).
The first step of the CMax framework transforms events according to a motion or deformation model defined by the task at hand. For instance, camera rotational motion estimation [5,29] often assumes constant angular velocity (θ ≡ ω) during short time spans, hence events are transformed following 3-DOF motion curves defined on the image plane by candidate values of ω. Feature tracking may assume constant image velocity θ ≡ v (2 DOFs) [7,30], hence events are transformed following straight lines.

Figure 2. Proposed modification of the contrast maximization (CMax) framework in [12,13] to also account for the degree of regularity (collapsing behavior) of the warp. Events are colored in red/blue according to their polarity. Reprinted/adapted with permission from Ref. [13], 2019, Gallego et al.

In the second step of CMax, several event-alignment objectives have been proposed to measure the goodness of fit between the events and the model [10,13], establishing connections between visual contrast, sharpness, and depth-from-focus. Finally, the choice of iterative optimization algorithm also plays a big role in finding the desired motion-compensation parameters. First-order methods, such as non-linear conjugate gradient (CG), are a popular choice, trading off accuracy and speed [12,21,22]. Exhaustive search, sampling, or branch-and-bound strategies may be affordable for low-dimensional (DOF) search spaces [14,29]. As will be presented (Section 3), our proposal consists of modifying the second step by means of a regularizer (Figure 2, top branch).
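The two-step iteration (warp, then score) can be sketched as a generic search loop; the function names, the toy 1-D warp, and the dispersion-based score in the usage below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def cmax_search(events, warp_fn, objective_fn, candidates):
    """Generic CMax loop: for each candidate parameter, (1) transform the
    events and (2) score their alignment; keep the best-scoring parameter.
    `events` is an (N, 3) array of (x, y, t); all names are illustrative."""
    best_theta, best_score = None, -np.inf
    for theta in candidates:
        warped = warp_fn(events, theta)    # step 1: transform events
        score = objective_fn(warped)       # step 2: event-alignment objective
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score
```

For instance, with a 1-D constant-velocity warp and a score that rewards low spread of the warped positions, exhaustive search over a parameter grid recovers the velocity that best aligns the events, mirroring the low-DOF grid-search strategies mentioned above.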

Event Collapse
In which estimation problems does event collapse appear? At first glance, it may seem that event collapse occurs when the number of DOFs in the warp becomes large enough, i.e., for complex motions. Event collapse has been reported in homographic motions (8 DOFs) [27,31] and in dense optical flow estimation [16], where an artificial neural network (ANN) predicts a flow field with 2N_p DOFs (N_p pixels), whereas it does not occur in feature flow (2 DOFs) or rotational motion flow (3 DOFs). However, a more careful analysis reveals that this is not the entire story, because event collapse may occur even in the case of 1 DOF, as we show.
How did previous works tackle event collapse? Previous works have tackled the issue in several ways, such as: (i) initializing the parameters sufficiently close to the desired solution (in the basin of attraction of the local optimum) [12]; (ii) reformulating the problem, changing the parameter space to reduce the number of DOFs and increase the well-posedness of the problem [14,31]; (iii) providing additional data, such as depth [27], thus changing the problem from motion estimation given only events to motion estimation given events and additional sensor data; (iv) whitening the warped events before computing the objective [27]; and (v) redesigning the objective function and possibly adding a strong classical regularizer (e.g., Charbonnier loss) [10,16]. Many of the above mitigation strategies are task-specific because it may not always be possible to consider additional data or reparametrize the estimation problem. Our goal is to approach the issue without the need for additional data or changing the parameter space, and to show how previous objective functions and newly regularized ones handle event collapse.

Method
Let us present our approach to measure and mitigate event collapse. First, we revise how event cameras work (Section 3.1) and the CMax framework (Section 3.2), which was informally introduced in Section 2.1. Then, Section 3.3 builds our intuition on event collapse by analyzing a simple example. Section 3.4 presents our proposed metrics for event collapse, based on 1-DOF and 2-DOF warps. Section 3.5 specifies them for higher DOFs, and Section 3.6 presents the regularized objective function.

How Event Cameras Work
Event cameras, such as the Dynamic Vision Sensor (DVS) [2,3,32], are bio-inspired sensors that capture pixel-wise intensity changes, called events, instead of intensity images. An event e_k ≐ (x_k, t_k, p_k) is triggered as soon as the change in logarithmic intensity L at a pixel exceeds a contrast sensitivity C > 0, i.e., |L(x_k, t_k) − L(x_k, t_k − ∆t_k)| ≥ C, where x_k ≐ (x_k, y_k)ᵀ, t_k (with µs resolution) and polarity p_k ∈ {+1, −1} are the spatiotemporal coordinates and sign of the intensity change, respectively, and t_k − ∆t_k is the time of the previous event at the same pixel x_k. Hence, each pixel has its own sampling rate, which depends on the visual input.
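The trigger condition can be illustrated with a toy single-pixel simulator; the function, its names, and the threshold value are our illustrative assumptions, not the DVS circuit:

```python
import numpy as np

def simulate_events(log_I, ts, C=0.25):
    """Toy single-pixel DVS model: emit an event each time the log-intensity
    signal `log_I` (sampled at times `ts`) moves by the contrast sensitivity
    C away from the level stored at the last event. Returns (t_k, p_k)."""
    events, ref = [], log_I[0]
    for L, t in zip(log_I[1:], ts[1:]):
        while L - ref >= C:   # positive change crossed the threshold
            ref += C
            events.append((t, +1))
        while ref - L >= C:   # negative change crossed the threshold
            ref -= C
            events.append((t, -1))
    return events
```

A linear ramp of log-intensity produces events at a constant rate, illustrating that the sampling rate of each pixel follows the visual input.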

Mathematical Description of the CMax Framework
The CMax framework [12] transforms the events in a set E = {e_k}_{k=1}^{N_e} geometrically according to a motion model W, producing a set of warped events E′ = {e′_k}_{k=1}^{N_e}. The warp x′_k = W(x_k, t_k; θ) transports each event along the point trajectory that passes through it (Figure 2, left), until t_ref is reached. The point trajectories are parametrized by θ, which contains the motion and/or scene unknowns. Then, an objective function [10,13] measures the alignment of the warped events E′. Many objective functions are given in terms of the count of events along the point trajectories, which is called the image of warped events (IWE):

I(x; θ) ≐ Σ_{k=1}^{N_e} b_k δ(x − x′_k).    (3)

Each IWE pixel x sums the values b_k of the warped events x′_k that fall within it: b_k = p_k if polarity is used or b_k = 1 if polarity is not used. The Dirac delta δ is in practice replaced by a smooth approximation [33], such as a Gaussian, δ(x − µ) ≈ N(x; µ, ε²) with ε = 1 pixel. A popular objective function G(θ) is the visual contrast of the IWE (3), given by the variance

G(θ) ≐ Var(I(x; θ)) = (1/|Ω|) ∫_Ω (I(x; θ) − µ_I)² dx,    (4)

with mean µ_I ≐ (1/|Ω|) ∫_Ω I(x; θ) dx and image domain Ω. Hence, the alignment of the transformed events E′ (i.e., the candidate "corresponding events", triggered by the same scene edge) is measured by the strength of the edges of the IWE. Finally, an optimization algorithm iterates the above steps until the best parameters are found:

θ* = arg max_θ G(θ).    (5)
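A discretized version of the IWE and its variance objective can be sketched as follows; the truncated 3×3 Gaussian vote approximating the smooth delta and all names are our simplifications:

```python
import numpy as np

def iwe(xw, yw, shape, sigma=1.0):
    """Image of warped events: accumulate each warped event with a Gaussian
    of width `sigma` pixels approximating the Dirac delta (polarity ignored,
    i.e., b_k = 1). A 3x3 support keeps the sketch simple."""
    H, W = shape
    img = np.zeros((H, W))
    for x, y in zip(xw, yw):
        cx, cy = int(round(x)), int(round(y))
        for v in range(cy - 1, cy + 2):
            for u in range(cx - 1, cx + 2):
                if 0 <= u < W and 0 <= v < H:
                    img[v, u] += np.exp(-((u - x)**2 + (v - y)**2)
                                        / (2.0 * sigma**2))
    return img

def variance_objective(img):
    """Visual contrast of the IWE: Var(I) = mean((I - mean(I))^2)."""
    return np.mean((img - img.mean())**2)
```

Events that warp to the same pixel stack into a sharp peak, so well-aligned events score a higher variance than misaligned (spread-out) ones.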

Simplest Example of Event Collapse: 1 DOF
To analyze event collapse in the simplest case, let us consider an approximation to a translational motion of the camera along its optical axis Z (1-DOF warp). In theory, translational motions also require knowledge of the scene depth. Here, inspired by the 4-DOF in-plane warp in [20] that approximates a 6-DOF camera motion, we consider a simplified warp that does not require knowledge of the scene depth. In terms of data, let us consider events from one of the driving sequences of the standard MVSEC dataset [34] (Figure 1). For further simplicity, let us normalize the timestamps of E to the unit interval, t ∈ [t_1, t_{N_e}] → t̂ ∈ [0, 1], and assume a coordinate frame at the center of the image plane. The warp W is then given by

x′_k = W(x_k, t_k; θ) = (1 − t̂_k h_z) x_k,    (6)

where θ ≡ h_z. Hence, events are transformed along the radial direction from the image center, which acts as a virtual focus of expansion (FOE) (cf. the true FOE is given by the data).
Letting the scaling factor in (6) be s_k ≐ 1 − t̂_k h_z, we observe the following: (i) s_k cannot be negative, since that would imply that at least one event has flipped the side on which it lies with respect to the image center; (ii) if s_k > 1, the warped event moves away from the image center ("expansion" or "zoom-in"); and (iii) if s_k ∈ [0, 1), the warped event moves closer to the image center ("contraction" or "zoom-out"). The equivalent conditions in terms of h_z are: s_k ≥ 0 ⟺ h_z ≤ 1/t̂_k, s_k > 1 ⟺ h_z < 0, and s_k ∈ [0, 1) ⟺ 0 < h_z ≤ 1/t̂_k (for t̂_k > 0). Intuitively, event collapse occurs if the contraction is large (0 < s_k ≪ 1) (see Figures 1C and 3a). This phenomenon is not specific to the image variance; other objective functions lead to the same result. As we see, the objective function has a local maximum at the desired motion parameters (Figure 1B), whereas the optimization over the entire parameter space converges to a global optimum that corresponds to event collapse.
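The warp and its scaling factor can be written compactly; a minimal sketch, assuming image-centered coordinates and normalized timestamps as in the text:

```python
import numpy as np

def zoom_warp(xy, t_norm, h_z):
    """1-DOF zoom in/out warp of eq. (6): x' = (1 - t̂ h_z) x, with
    image-centered coordinates `xy` of shape (N, 2) and timestamps
    `t_norm` normalized to [0, 1]. Returns the warped points and the
    scaling factors s_k = 1 - t̂_k h_z."""
    s = 1.0 - t_norm * h_z
    return s[:, None] * xy, s
```

Negative h_z yields s_k > 1 (expansion); h_z ∈ (0, 1] yields s_k ∈ [0, 1) for t̂_k > 0 (contraction), pulling warped events toward the center, which is the collapsing regime discussed above when s_k ≪ 1.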

Discussion
The above example shows that event collapse is enabled (or disabled) by the type of warp. If the warp does not enable event collapse (contraction or accumulation of flow vectors cannot happen due to the geometric properties of the warp), as in the case of feature flow (2 DOF) [7,30] (Figure 3b) or rotational motion flow (3 DOF) [5,29] (Figure 3c), then the optimization problem is well posed and multiple objective functions can be designed to achieve event alignment [10,13]. However, the disadvantage is that the type of warps that satisfy this condition may not be rich enough to describe complex scene motions.
On the other hand, if the warp allows for event collapse, more complex scenarios can be described by such a broader class of motion hypotheses, but the optimization framework designed for non-event-collapsing scenarios (where the local maximum is assumed to be the global maximum) may not hold anymore. Optimizing the objective function may lead to an undesired solution with a larger value than the desired one. This depends on multiple elements: the landscape of the objective function (which depends on the data, the warp parametrization, and the shape of the objective function), and the initialization and search strategy of the optimization algorithm used to explore such a landscape. The challenge in this situation is to overcome the issue of multiple local maxima and make the problem better posed. Our approach consists of characterizing event collapse via novel metrics and including them in the objective function as weak constraints (penalties) to yield a better landscape.

Divergence of the Event Transformation Flow
Inspired by physics, we may think of the flow vectors given by the event transformation E → E′ as an electrostatic field, whose sources and sinks correspond to the locations of electric charges (Figure 4). Sources and sinks are mathematically described by the divergence operator ∇·. Therefore, the divergence of the flow field is a natural choice to characterize event collapse.
Figure 4. From left to right: contraction ("sink", leading to event collapse), expansion ("source"), and incompressible fields. Image adapted from khanacademy.org.

The warp W is defined over the space-time coordinates of the events; hence its time derivative defines a flow field over space-time: f(x, t) ≐ ∂W(x, t; θ)/∂t. For the warp in (6), we obtain f = −h_z x, which gives ∇ · f = −h_z ∇ · x = −2h_z. Hence, (6) defines a constant-divergence flow, and imposing a penalty on the degree of concentration of the flow field amounts to directly penalizing the value of the parameter h_z.
Computing the divergence at each event gives the set {∇ · f_k}_{k=1}^{N_e}, from which we can compute statistical scores (mean, median, min, etc.). To have a 2D visual representation ("feature map") of collapse, we build an image (like the IWE) by taking some statistic of the values ∇ · f_k that warp to each pixel, such as the "average divergence per pixel":

DIWE(x; θ) ≐ (1/N_e(x)) Σ_k (∇ · f_k) δ(x − x′_k),

where N_e(x) is the number of warped events at pixel x (the IWE). Then we aggregate further into a score, such as the mean of the DIWE. In practice, we focus on the collapsing part by computing a trimmed mean: the mean of the DIWE pixels smaller than a margin α (−0.2 in the experiments). Such a margin does not penalize small, admissible deformations.
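A sketch of the divergence-based penalty, assuming nearest-pixel voting in place of the smooth delta and returning the negated trimmed mean so that stronger contraction yields a larger penalty (the paper's exact aggregation and sign conventions may differ):

```python
import numpy as np

def divergence_regularizer(div_per_event, xw, yw, shape, alpha=-0.2):
    """DIWE-based penalty (a sketch): average the per-event flow divergence
    into a per-pixel map via nearest-pixel voting, then return the negated
    trimmed mean of the pixels whose divergence is below the margin `alpha`,
    so only collapsing (strongly contracting) regions contribute."""
    H, W = shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for d, x, y in zip(div_per_event, xw, yw):
        u, v = int(round(x)), int(round(y))
        if 0 <= u < W and 0 <= v < H:
            acc[v, u] += d
            cnt[v, u] += 1
    diwe = np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)
    collapsing = diwe[diwe < alpha]
    return -collapsing.mean() if collapsing.size else 0.0
```

For the 1-DOF warp (6), every event has ∇ · f_k = −2h_z, so a strong contraction (e.g., h_z = 0.9) yields a penalty of 1.8, while a mild one (h_z = 0.05) falls inside the margin and is not penalized.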

Area-Based Deformation of the Event Transformation
In addition to vector calculus, we may also use tools from differential geometry to characterize event collapse. Building on [12], the point trajectories define the streamlines of the transformation flow, and we may measure how they concentrate or disperse based on how the area element deforms along them. That is, we consider a small area element dA = dx dy attached to each point along the trajectory and measure how much it deforms when transported to the reference time: dA′ = |det(J)| dA, with the Jacobian J ≐ ∂W/∂x (see Section 5). The determinant of the Jacobian is the amplification factor: |det(J)| > 1 if the area expands, and |det(J)| < 1 if the area shrinks.
Figure 5. Area deformation of various warps: an area element of dA pix² at (x_k, t_k) is warped to dA′ = |det(J)| dA (12). From left to right, increasing area amplification factor |det(J)| ∈ [0, ∞): contraction, no change of area, expansion.
For the warp in (6), we have the Jacobian J = (1 − t̂ h_z) Id, and so det(J) = (1 − t̂ h_z)². Interestingly, the area deformation around event e_k, J(e_k) ≡ J(x_k, t_k; θ), is directly related to the scaling factor s_k: det(J(e_k)) = s_k². Computing the amplification factors at each event gives the set {|det(J(e_k))|}_{k=1}^{N_e}, from which we can compute statistical scores. For example, the mean R_A ≐ (1/N_e) Σ_k |det(J(e_k))| gives an average score: R_A > 1 for expansion, and R_A < 1 for contraction. We build a deformation map (or image of warped areas (IWA)) by taking some statistic of the values |det(J(e_k))| that warp to each pixel, such as the "average amplification per pixel":

IWA(x; θ) ≐ (1/N_e(x)) Σ_k |det(J(e_k))| δ(x − x′_k).

This assumes that if no events warp to a pixel x_p, then N_e(x_p) = 0, and there is no deformation (IWA(x_p) = 1). Then, we summarize the deformation map into a score, such as the mean. To concentrate on the collapsing part, we compute a trimmed mean: the mean of the IWA pixels smaller than a margin α (0.8 in the experiments). The margin admits small, admissible deformations.
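An analogous sketch for the area-based penalty; empty pixels default to an amplification of 1 (no deformation), and the `1 - trimmed_mean` aggregation is our illustrative choice:

```python
import numpy as np

def area_regularizer(det_per_event, xw, yw, shape, alpha=0.8):
    """IWA-based penalty (a sketch): average |det J| per pixel via
    nearest-pixel voting; pixels receiving no events default to 1 (no
    deformation). Pixels whose amplification falls below the margin
    `alpha` are penalized via 1 - trimmed_mean, so stronger shrinkage
    yields a larger penalty."""
    H, W = shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for d, x, y in zip(det_per_event, xw, yw):
        u, v = int(round(x)), int(round(y))
        if 0 <= u < W and 0 <= v < H:
            acc[v, u] += d
            cnt[v, u] += 1
    iwa = np.where(cnt > 0, acc / np.maximum(cnt, 1), 1.0)
    shrinking = iwa[iwa < alpha]
    return (1.0 - shrinking.mean()) if shrinking.size else 0.0
```

Under the 1-DOF warp, s_k = 0.1 gives det(J(e_k)) = 0.01 and a penalty close to 1, whereas the identity warp (det = 1) is not penalized at all.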

Feature Flow
Event-based feature tracking is often described by the warp W(x, t; θ) = x + (t − t_ref) θ, which assumes constant image velocity θ (2 DOFs) over short time intervals. As expected, the flow for this warp coincides with the image velocity, f = θ, which is independent of the space-time coordinates (x, t). Hence, the flow is incompressible (∇ · f = 0): the streamlines given by the feature flow do not concentrate or disperse; they are parallel. Regarding the area deformation, the Jacobian J = ∂(x + (t − t_ref) θ)/∂x = Id is the identity matrix. Hence |det(J)| = 1; that is, translations on the image plane do not change the area of the pixels around a point.
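These two properties (identity Jacobian, hence unit amplification and zero contribution to area change) can be checked numerically with central finite differences; a small self-contained sketch with illustrative names:

```python
import numpy as np

def feature_warp(x, t, theta, t_ref=0.0):
    """2-DOF feature flow warp: W(x, t; θ) = x + (t - t_ref) θ."""
    return x + (t - t_ref) * theta

def spatial_jacobian(warp, x, t, eps=1e-6):
    """2x2 Jacobian ∂W/∂x by central finite differences."""
    J = np.zeros((2, 2))
    for j in range(2):
        dx = np.zeros(2)
        dx[j] = eps
        J[:, j] = (warp(x + dx, t) - warp(x - dx, t)) / (2 * eps)
    return J
```

For any θ, x, and t the Jacobian is the identity, so |det J| = 1, matching the incompressibility of the feature flow stated above.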
In-plane translation warps, such as the above 2-DOF warp, are well posed and serve as a reference to design the regularizers that measure event collapse. It is sensible for well-designed regularizers to penalize warps whose characteristics deviate from those of the reference warp: zero divergence and unit area amplification factor.

Rotational Motion
As the previous sections show, the proposed metrics designed for the zoom in/out warp produce the expected characterization of the 2-DOF feature flow (zero divergence and unit area amplification), which is a well-posed warp. Hence, if they were added as penalties into the objective function, they would not modify the energy landscape. We now consider their influence on rotational motions, which are also well-posed warps. In particular, we consider the problem of estimating the angular velocity of a predominantly rotating event camera by means of CMax, which is a popular research topic [5,14,27–29]. Using calibrated and homogeneous coordinates, the warp is given by x'^h_k ∼ R(t_k ω) x^h_k, where θ ≡ ω = (ω_1, ω_2, ω_3)ᵀ is the angular velocity, t ∈ [0, ∆t], and R is parametrized by using exponential coordinates (Rodrigues rotation formula [35,36]). Divergence: It is well known that the flow is f = B(x) ω, where B(x) is the rotational part of the feature sensitivity matrix [37]. Hence, ∇ · f = 3(x ω_2 − y ω_1). Area element: Letting r_3 be the third row of R, and using (32)–(34) in [38], det(J_k) = (r_3 x^h_k)^{-3}. Rotations around the Z axis clearly present no deformation, regardless of the amount of rotation, and this is captured by the proposed metrics because: (i) the divergence is zero, thus the flow is incompressible, and (ii) det(J) = 1, since r_3 = (0, 0, 1) and x^h = (x, y, 1)ᵀ. For other, arbitrary rotations, there are deformations, but these are mild if the rotation angle ∆t‖ω‖ is small.
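The amplification factor det(J_k) = (r_3 x^h_k)^{-3} can be verified numerically by differentiating the rotational warp with central finite differences; the helper names below are ours:

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix R = exp(ŵ) via the Rodrigues formula."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = w / th
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])   # cross-product matrix of k
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def rot_warp(x, t, omega):
    """Rotational warp in calibrated coordinates: rotate the homogeneous
    point by R(t ω) and dehomogenize."""
    yh = rodrigues(t * np.asarray(omega, dtype=float)) @ np.array([x[0], x[1], 1.0])
    return yh[:2] / yh[2]

def det_spatial_jacobian(x, t, omega, eps=1e-6):
    """|det J| of the warp w.r.t. x, by central finite differences."""
    J = np.zeros((2, 2))
    for j in range(2):
        dx = np.zeros(2)
        dx[j] = eps
        J[:, j] = (rot_warp(x + dx, t, omega) - rot_warp(x - dx, t, omega)) / (2 * eps)
    return abs(np.linalg.det(J))
```

For a pure Z-rotation the numeric determinant is 1 at every point, matching the incompressibility argument above; for an arbitrary rotation it agrees with (r_3 x^h)^{-3}.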

Planar Motion
Planar motion is the term used to describe the motion of a ground robot that can translate and rotate freely on a flat ground. If such a robot is equipped with a camera pointing upwards or downwards, the resulting motion induced on the image plane, parallel to the ground plane, is an isometry (Euclidean transformation). This motion model is a subset of the parametric ones in [12], and it has been used for CMax in [14,27]. For short time intervals, planar motion may be parametrized by 3 DOFs: linear velocity (2 DOFs) and angular velocity (1 DOF). As the divergence and area metrics show in the Appendix, planar motion is a well-posed warp. The resulting motion curves on the image plane do not lead to event collapse.

Similarity Transformation
The 1-DOF zoom in/out warp in Section 3.3 is a particular case of the 4-DOF warp in [20], which is an in-plane approximation to the motion induced by a freely moving camera. The same idea of combining translation, rotation, and scaling for CMax is expressed by the similarity transformation in [27]. Both 4-DOF warps enable event collapse because they allow for zoom-out motion curves. Formulas justifying it are given in the Appendix.

Augmented Objective Function
We propose to augment previous objective functions (e.g., (5)) with penalties obtained from the metrics developed above for event collapse:

θ* = arg min_θ (−G(θ) + λ R(θ)),    (20)

with weight λ ≥ 0. We may interpret G(θ) (e.g., contrast or focus score [13]) as the data-fidelity term and R(θ) as the regularizer, or, in Bayesian terms, the likelihood and the prior, respectively.
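As a sketch, the augmented objective simply composes the alignment score with weighted penalties; the exact sign convention and the use of two separate weights here are illustrative assumptions:

```python
def augmented_loss(theta, G, R_div, R_area, lmbda=(1.0, 1.0)):
    """Illustrative augmented objective: minimize
    -G(theta) + λ1 * R_div(theta) + λ2 * R_area(theta).
    With λ = 0 this recovers the unregularized objective; larger λ
    penalizes collapsing warps more strongly."""
    return -G(theta) + lmbda[0] * R_div(theta) + lmbda[1] * R_area(theta)
```

The composition is what reshapes the loss landscape: candidate parameters that collapse events pay a penalty, while well-posed warps (zero divergence, unit amplification) are left untouched.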

Experiments
We evaluate our method on publicly available datasets, whose details are described in Section 4.1. First, Section 4.2 shows that the proposed regularizers mitigate the overfitting issue on warps that enable collapse. For this purpose we use driving datasets (MVSEC [34], DSEC [39]). Next, Section 4.3 shows that the regularizers do not harm well-posed warps.
To this end, we use the ECD dataset [40]. Finally, Section 4.4 conducts a sensitivity analysis of the regularizers.

Datasets
The MVSEC dataset [34] is widely used for various vision tasks, such as optical flow estimation [16,18,19,41,42]. Its sequences are recorded on a drone (indoors) or on a car (outdoors), and comprise events, grayscale frames, and IMU data from an mDAVIS346 [43] (346 × 260 pixels), as well as camera poses and LiDAR data. Ground truth optical flow is computed as the motion field [44], given the camera velocity and the depth of the scene (from the LiDAR). We select several excerpts from the outdoor_day1 sequence with forward motion. This motion is reasonably well approximated by collapse-enabled warps such as (6). In total, we evaluate 3.2 million events spanning 10 s.
The DSEC dataset [39] is a more recent driving dataset with a higher resolution event camera (Prophesee Gen3, 640 × 480 pixels). Ground truth optical flow is also computed as the motion field using the scene depth from a LiDAR [41]. We evaluate on the zurich_city_11 sequence, using in total 380 million events spanning 40 s.
The driving datasets (MVSEC, DSEC) and the selected sequences in the ECD dataset have different types of motion: forward (which enables event collapse) vs. rotational (which does not suffer from event collapse). Each sequence serves a different test purpose, as discussed in the next sections.

Metrics
The metrics used to assess optical flow accuracy (MVSEC and DSEC datasets) are the average endpoint error (AEE) and the percentage of pixels with AEE greater than N pixels (denoted by "NPE", for N = {3, 10, 20}). Both are measured over pixels with valid ground-truth values. We also use the FWL metric [50] to assess event alignment by means of the IWE sharpness (the FWL is the IWE variance relative to that of the identity warp).
Following previous works [13,27,28], rotational motion accuracy is assessed as the RMS error of angular velocity estimation. Angular velocity ω is assumed to be constant over a window of events, estimated and compared with the ground truth at the midpoint of the window. Additionally, we use the FWL metric to gauge event alignment [50].

Effect of the Regularizers on Collapse-Enabled Warps
Tables 1 and 2 report the results on the MVSEC and DSEC benchmarks, respectively, using two different loss functions G: the IWE variance (4) and the squared magnitude of the IWE gradient, abbreviated "Gradient Magnitude" [13]. For MVSEC, we report the accuracy within a time interval of dt = 4 grayscale frames (at ≈45 Hz). The optimization algorithm is the Tree-Structured Parzen Estimator (TPE) sampler [51] for both experiments, with a number of sampling points equal to 300 (1 DOF) and 600 (4 DOF). The tables quantitatively capture the collapse phenomenon suffered by the original CMax framework [12] and the whitening technique [27]. Their high FWL values indicate that contrast is maximized; however, the AEE and NPE values are exceedingly high (e.g., > 80 pixels, 20PE > 80%), indicating that the estimated flow is unrealistic.

Table 2. Results on the DSEC dataset [39].

Figure 6. Proposed regularizers and collapse analysis on DSEC [39], boxes_rot [40], and dynamic_rot [40]. The scene motion is approximated by a 1-DOF warp (zoom in/out) for the MVSEC [34] and DSEC [39] sequences, and a 3-DOF warp (rotation) for the boxes and dynamic ECD sequences [40].

By contrast, our regularizers (Divergence and Deformation rows) work well to mitigate the collapse, as observed in smaller AEE and NPE values. Compared with the values of no regularizer or whitening [27], our regularizers achieve more than 90% improvement for AEE on average. The AEE values are high by optical flow standards (4–8 pix in MVSEC vs. 0.5–1 pix [16], or 10–20 pix in DSEC vs. 2–5 pix [41]); however, this is due to the fact that the warps used have very few DOFs (≤4) compared to the considerably higher DOFs (2N_p) of optical flow estimation algorithms. The same reason explains the high 3PE values (standard in [52]): using an end-point error threshold of 3 pix to consider that the flow is correctly estimated does not convey the intended goal of inlier/outlier classification for the low-DOF warps used.
This is the reason why Tables 1 and 2 also report the 10PE and 20PE metrics, as well as the values for the identity warp (zero flow). As expected, for the range of AEE values in the tables, the 10PE and 20PE figures demonstrate the large difference between methods suffering from collapse (20PE > 80%) and those that do not (20PE < 1.1% for MVSEC and < 22.6% for DSEC).
The FWL values of our regularizers are moderately high (≥1), indicating that event alignment is better than that of the identity warp. However, because the FWL depends on the number of events [50], it is not easy to establish a global threshold to classify each method as suffering from collapse or not. The AEE, 10PE, and 20PE are better for such a classification.
Tables 1 and 2 also include the results of the use of both regularizers simultaneously ("Div. + Def."). The results improve across all sequences if the data fidelity term is given by the variance loss, whereas they remain approximately the same for the gradient magnitude loss. Regardless of the choice of the proposed regularizer, the results in these tables clearly show the effectiveness of our proposal, i.e., the large improvements compared with prior works (rows "No regularizer" and [27]).
The collapse results are more visible in Figure 6, where we used the variance loss. Without a regularizer, the events collapse in the MVSEC and DSEC sequences. Our regularizers successfully mitigate overfitting, having a remarkable impact on the estimated motion. Table 3 shows the results on the ECD dataset for a well-posed warp (3-DOF rotational motion). We use the variance loss and the Adam optimizer [53] with 100 iterations. All values in the table (RMS error and FWL), with and without regularization, are very similar, indicating that: (i) our regularizers do not adversely affect the motion estimation algorithm, and (ii) results without regularization are good due to the well-posed warp. This is qualitatively shown in the bottom part of Figure 6. The fluctuations of the divergence and deformation values away from those of the identity warp (0 and 1, respectively) are at least one order of magnitude smaller than for the collapse-enabled warps (e.g., 0.2 vs. 2).

Sensitivity Analysis
The loss landscapes and the sensitivity analysis of λ are shown in Figure 7 for the MVSEC experiments. Without a regularizer (λ = 0), all objective functions tested (variance, gradient magnitude, and average timestamp [16]) suffer from event collapse, which is the undesired global minimum of (20). Reaching the desired local optimum depends on the optimization algorithm and its initialization (e.g., starting gradient descent close enough to the local optimum). Our regularizers (divergence and deformation) change the landscape: the previously undesired global minimum becomes local, and the desired minimum becomes the new global one as λ increases.
Specifically, the larger the weight λ, the smaller the effect of the undesired minimum (at h z = 1). However, this is true only within some reasonable range: a too large λ discards the data-fidelity part G in (20), which is unwanted because it would remove the desired local optimum (near h z ≈ 0). Minimizing (20) with only the regularizer is not sensible.
Observe that, for completeness, we include the average timestamp loss in the last column. However, this loss also suffers from an undesired optimum in the expansion region (h_z ≈ −1). Our regularizers could be modified to also remove this undesired optimum, but investigating this particular loss, which was proposed as an alternative to the original contrast loss, is outside the scope of this work.

Figure 7. Loss landscapes and sensitivity to λ for (a) variance [12], (b) gradient magnitude [13], and (c) mean square of average timestamp [16]. Data from MVSEC [34] with dominant forward motion. The legend weights denote λ in (20).

Computational Complexity
Computing the regularizer(s) requires more computation than the non-regularized objective. However, complexity is linear with the number of events and the number of pixels, which is an advantage, and the warped events are reutilized to compute the DIWE or IWA. Hence, the runtime is less than doubled (warping is the dominant runtime term [13] and is computed only once). The computational complexity of our regularized CMax framework is O(N e + N p ), the same as that of the non-regularized one.

Application to Motion Segmentation
Although most of the results on standard datasets comprise stationary scenes, we have also provided results on a dynamic scene (from dataset [40]). Because the time spanned by each set of events processed is small, the scene motion is also small (even for complicated objects like the person in the bottom row of Figure 6), hence often a single warp fits the scene reasonably well. In some scenarios, a single warp may not be enough to fit the event data because there are distinctive motions in the scene of equal importance. Our proposed regularizers can be extended to such more complex scene motions. To this end, we demonstrate it with an example in Figure 8. Specifically, we use the MVSEC dataset, in a clip where the scene consists of two motions: the ego-motion (forward motion of the recording vehicle) and the motion of a car driving in the opposite direction in a nearby lane (an independently moving object-IMO). We model the scene by using the combination of two warps. Intuitively, the 1-DOF warp (6) describes the ego-motion, while the feature flow (2 DOF) describes the IMO. Then, we apply the contrast maximization approach (augmented with our regularizing terms) and the expectation-maximization scheme in [21] to segment the scene, to determine which events belong to each motion. The results in Figure 8 clearly show the effectiveness of our regularizer, even for such a commonplace and complex scene. Without regularizers, (i) event collapse appears in the ego-motion cluster of events and (ii) a considerable portion of the events that correspond to ego-motion are assigned to the second cluster (2-DOF warp), thus causing a segmentation failure. Our regularization approach mitigates event collapse (bottom row of Figure 8) and provides the correct segmentation: the 1-DOF warp fits the ego-motion and the feature flow (2-DOF warp) fits the IMO.

Conclusions
We have analyzed the event collapse phenomenon of the CMax framework and proposed collapse metrics using first principles of space-time deformation, inspired by differential geometry and physics. Our experimental results on publicly available datasets demonstrate that the proposed divergence and area-based metrics mitigate the phenomenon for collapse-enabled warps and do not harm well-posed warps. To the best of our knowledge, our regularizers are the only effective solution among the alternatives considered (the unregularized CMax framework and whitening). Our regularizers achieve, on average, more than 90% improvement in optical flow endpoint error (AEE) on collapse-enabled warps.
This is the first work that focuses on the paramount phenomenon of event collapse. No prior work has analyzed this phenomenon in such detail or proposed new regularizers without additional data or reparameterizing the search space [14,16,27]. As we analyzed various warps from 1 DOF to 4 DOFs, we hope that the ideas presented here inspire further research to tackle more complex warp models. Our work shows how the divergence and area-based deformation can be computed for warps given by analytical formulas. For more complex warps, like those used in dense optical flow estimation [16,18], the divergence or area-based deformation could be approximated by using finite difference formulas.

Appendix A. Warp Models, Jacobians and Flow Divergence
Appendix A.1. Planar Motion - Euclidean Transformation on the Image Plane, SE(2)

If the point trajectories of an isometry are x(t), the warp is given by [27] x(t) = R(t ω_Z) x(0) + t v, where v, ω_Z comprise the 3 DOFs of a translation and an in-plane rotation. The in-plane rotation is R(φ) = [[cos φ, −sin φ], [sin φ, cos φ]]. Since R^{-1}(φ) = R(−φ), in Euclidean coordinates the warp is x′_k = R(−t_k ω_Z)(x_k − t_k v). The Jacobian and its determinant are J_k = R(−t_k ω_Z) and det(J_k) = 1.
Hence, for small angles |t ω_Z| ≪ 1, the divergence of the flow vanishes. In short, this warp has the same determinant and approximately zero divergence as the 2-DOF feature flow warp (Section 3.5.1), which is well behaved. Note, however, that the trajectories are not straight in space-time.

Appendix A.2. Rotational Motion

Using calibrated and homogeneous coordinates, the warp is given by [5,12] x'^h_k ∼ R(t_k ω) x^h_k, where θ = ω = (ω_1, ω_2, ω_3)ᵀ is the angular velocity, and R (3 × 3 rotation matrix in space) is parametrized using exponential coordinates (Rodrigues rotation formula [35,36]).
Connection between divergence and deformation maps. If the rotation angle t_k ω is small, using the first two terms of the exponential map we approximate R(t_k ω) ≈ Id + (t_k ω)^∧, where the hat operator ∧ in SO(3) represents the cross-product matrix [54]. Then, r_3,k x^h_k ≈ (−t_k ω_2, t_k ω_1, 1) · (x_k, y_k, 1)ᵀ = 1 + (y_k ω_1 − x_k ω_2) t_k. Substituting this expression into (A12) and using the first terms of the Taylor expansion of (1 + z)^{-3} ≈ 1 − 3z + 6z² around z = 0 (convergent for |z| < 1) gives det(J_k) ≈ 1 + 3(x_k ω_2 − y_k ω_1) t_k. Notably, both the divergence (18) and the approximate amplification factor depend linearly on 3(x_k ω_2 − y_k ω_1). This resemblance is seen in the divergence and deformation maps of the bottom rows in Figure 6 (ECD dataset).
The flow corresponding to (A13) is given by its time derivative, and its divergence follows analogously. Particular cases of this warp recover the previous motion models. Using a couple of approximations of the exponential map in SO(2), we obtain that φ plays the role of a small angular velocity ω_Z around the camera's optical axis Z, i.e., an in-plane rotation.