Auto-Exposure Algorithm for Enhanced Mobile Robot Localization in Challenging Light Conditions

The success of robot localization based on visual odometry (VO) largely depends on the quality of the acquired images. In challenging light conditions, specialized auto-exposure (AE) algorithms that purposely select camera exposure time and gain to maximize the image information can therefore greatly improve localization performance. In this work, an AE algorithm is introduced which, unlike existing algorithms, fully leverages the camera’s photometric response function to accurately predict the optimal exposure of future frames. It also features feedback that compensates for prediction inaccuracies due to image saturation and explicitly balances motion blur and image noise effects. For validation, stereo cameras mounted on a custom-built motion table allow different AE algorithms to be benchmarked on the same repeated reference trajectory using the stereo implementation of ORB-SLAM3. Experimental evidence shows that (1) the gradient information metric appropriately serves as a proxy of indirect/feature-based VO performance; (2) the proposed prediction model based on simulated exposure changes is more accurate than using γ transformations; and (3) the overall accuracy of the estimated trajectory achieved using the proposed algorithm equals or surpasses classic exposure control approaches. The source code of the algorithm and all datasets used in this work are shared openly with the robotics community.


Introduction
The success of robot localization methods based on vision relies on the quality of the camera exposure. While many methods exist to mitigate poor exposure effects after images have been acquired (e.g., motion blur [1][2][3][4][5][6][7], saturation [8], low contrast [9][10][11]), these often jeopardize the real-time capabilities of state estimation. Moreover, the performance of these specialized visual odometry (VO) and simultaneous localization and mapping (SLAM) pipelines can at best match that of equivalent generic pipelines fed with appropriately acquired images. Under challenging light conditions, such as HDR scenes, non-static illumination, or low-light conditions, the appropriate selection of the image exposure time and gain is therefore crucial. However, in comparison to the methods mentioned above, pre-acquisition methods such as auto-exposure (AE) algorithms have received relatively little attention in the robot vision community. This is due in part to the fact that fine-tuning a specialized VO/SLAM pipeline can be achieved using a small set of prerecorded videos, whereas tuning an auto-exposure algorithm to maximize VO performance requires more elaborate testing procedures, such as replicating a camera trajectory multiple times under different parameter settings. As summarized in Table 1, existing AE algorithms mainly differ in the following aspects: (1) the metric optimized by the algorithm; (2) the model used to predict the effect of future changes in gain and exposure; (3) the strategy employed to balance gain and exposure time; (4) the control policy used to update the exposure parameters. As the merits and shortcomings of existing methods can be attributed to each of these individual aspects, they are addressed sequentially in the following sections.

Table 1. Overview of existing AE algorithms, compared by reference, year, optimization metric, prediction model, gain/exposure balance strategy, and control policy (first entry: [12], 2009, average intensity deviation from a reference).

Optimization Metrics
In the context of vision-based robot localization, the metric optimized by an AE algorithm acts as a proxy for the overall VO performance. The value of these optimization metrics depends on the exposure of the image. In this work, the exposure is quantified using the exposure level E, defined as

E = log2(t g) = log2 t + (g_dB / 20) log2 10,    (1)

where t and g are the exposure time and gain, and g_dB is the gain in dB. One common optimization metric is the deviation between some reference value and the average pixel intensity over the whole image or some region of interest (ROI) [12,14,21]. Minimizing this deviation is the most common AE approach available on most commercial off-the-shelf cameras. Although the metric is fast to compute, it is more useful for acquiring visually appealing images than for VO applications. Instead, AE algorithms for robot vision typically maximize the content of the image which is specifically relevant for feature detection and tracking. For instance, early work by Pan [22] in the field of autonomous driving maximizes the mean difference in intensity between lane markings and the road. A more general algorithm was later proposed by Lu [13] which maximizes image entropy, the assumption being that images with high entropy are well exposed. The image entropy metric M_e is defined in this context as

M_e = −Σ_{i=0}^{N_l − 1} P_i log2 P_i,    (2)

where P_i is the proportion of pixels in the image with an intensity value i out of the N_l possible levels. Shim [16,23] later observed that images with strong gradients are more likely to result in features being detected and matched.
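As a concrete illustration, the image entropy metric M_e can be computed directly from the intensity histogram. The sketch below is a minimal numpy version, not the paper's C++ implementation:

```python
import numpy as np

def image_entropy(img, n_levels=256):
    """Shannon entropy of the intensity histogram: M_e = -sum_i P_i log2 P_i.

    `img` is a 2-D array of integer intensities in [0, n_levels).
    Empty histogram bins (P_i = 0) contribute nothing to the sum.
    """
    hist = np.bincount(img.ravel(), minlength=n_levels).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                      # avoid log2(0)
    return float(-(p * np.log2(p)).sum())
```

A constant image has zero entropy, while an image using all 256 levels equally reaches the maximum of 8 bits.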
To balance the relative weight of weak and strong image gradients and to limit sensitivity to noise, Shim proposed that the gradient information m*_i of a pixel i should be defined as

m*_i = (1/N) log(λ(m_i − δ) + 1) if m_i ≥ δ, and m*_i = 0 otherwise,    (3)

where m_i is the gradient magnitude of the i-th pixel (normalized on a unit scale), δ ∈ R+ is the activation threshold, λ ∈ R+ is a shaping parameter, and N is a normalization factor defined as

N = log(λ(1 − δ) + 1).    (4)

Finally, the gradient information metric M_g is defined as

M_g = (1/N_p) Σ_{i=1}^{N_p} m*_i,    (5)

where N_p is the total number of pixels in the image. As proposed in Shim's original paper, λ was set in this work to 1000 [16]. The activation threshold δ was set to 0.30, which is larger than the value of 0.06 originally used by Shim. Indeed, a desired feature of the metric is that it should be independent of image noise. As image noise increases with gain, the metric value should not vary when acquiring the image of a static scene at a given exposure level E, even if the image is acquired with different combinations of exposure time and gain. When using an activation threshold of 0.06, experimentation showed that the image noise in dark frames causes M_g to vary greatly for different gains (see Figure 1). In comparison, a threshold of 0.30 decouples metric values from image noise while the optimal exposure level remains the same. Zhang [15] later introduced a closely related metric labeled the "soft" gradient percentile. It approximates a certain percentile of the image pixel gradients (e.g., the median gradient) and, unlike gradient information, it is differentiable with respect to exposure time. This property proves useful in deriving a controller policy based on gradient descent.
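The gradient information metric M_g can be sketched as follows. This is an illustrative numpy version: central differences stand in for the Sobel operators used in the paper's implementation, and gradient magnitudes are simply clipped to a unit scale, which is an assumption of this sketch:

```python
import numpy as np

def gradient_information(img, delta=0.30, lam=1000.0):
    """Shim's gradient information metric M_g for an 8-bit image.

    m*_i = log(lam*(m_i - delta) + 1) / N if m_i >= delta, else 0,
    with N = log(lam*(1 - delta) + 1); M_g averages m*_i over all pixels.
    """
    I = img.astype(float) / 255.0
    gy, gx = np.gradient(I)                    # central differences (Sobel in the paper)
    m = np.clip(np.hypot(gx, gy), 0.0, 1.0)    # gradient magnitude on a unit scale
    N = np.log(lam * (1.0 - delta) + 1.0)      # normalization factor
    mstar = np.where(m >= delta,
                     np.log(lam * np.maximum(m - delta, 0.0) + 1.0) / N,
                     0.0)
    return float(mstar.mean())                 # average over the N_p pixels
```

With δ = 0.30, the weak gradients produced by sensor noise fall below the activation threshold and contribute nothing, which is the decoupling property discussed above.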
However, M_e, M_g, and the median gradient can be sensitive to noise. To avoid the problem, Kim [17] proposed to weight the gradient of each pixel i by a factor which depends on the entropy e_i of the pixel, defined as

e_i = −P(I_i) log2 P(I_i),

where P(I_i) is the proportion of pixels in the image with the same intensity as pixel i. The weight W_i of a pixel is then computed from the normalized deviation of e_i from the mean pixel entropy ē, where ē and σ_e are, respectively, the mean pixel entropy and its standard deviation. The authors then define the entropy-weighted gradient m̌_i of a pixel as the image gradient modulated by this weight through an activation function π(·), where α ∈ R+ and τ ∈ R+ are shaping factors. For the present work, e_thresh = 0.05, α = 32 and τ = 4, as in [17]. The entropy-weighted gradient metric M_ewg is then computed as the sum of m̌_i over all pixels. As an alternative to Kim's weighted gradient, Shin [18] proposed that the noise σ_noise of an image I be directly approximated as

σ_noise = sqrt(π/2) · (1/(6 N_s)) Σ_i H_i U_i |(I ∗ M)_i|,

where H_i and U_i are the i-th entries of binary matrices masking out, respectively, the non-homogeneous and the saturated regions of the image, N_s is the number of pixels that are both homogeneous and unsaturated, and M is the noise estimation kernel proposed by Immerkaer [24]. The approximation σ_noise of the image noise is then incorporated in a hybrid image quality metric M_q, which combines the gradient information metric, the image entropy, and the standard deviation s_g of the gradient information metric evaluated individually over each cell of a grid, and from which the estimated noise is subtracted using weighting factors α ∈ (0, 1) and β ∈ R+. The present work uses the authors' original values of α = 0.4 and β = 0.4 proposed in [18].
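Immerkaer's noise estimator at the core of Shin's metric can be sketched as below. For brevity this version omits the homogeneity and saturation masks H and U of [18] and applies the kernel to the whole image, which is an assumption of this sketch:

```python
import numpy as np

def immerkaer_noise(img):
    """Fast noise-sigma estimate of Immerkaer [24] (maskless variant).

    Convolves the image with the Laplacian-difference kernel M and averages
    the absolute responses: sigma ~ sqrt(pi/2) * sum|I*M| / (6*(H-2)*(W-2)).
    """
    I = img.astype(float)
    M = np.array([[ 1, -2,  1],
                  [-2,  4, -2],
                  [ 1, -2,  1]], dtype=float)
    H, W = I.shape
    resp = np.zeros((H - 2, W - 2))
    for dy in range(3):                      # valid 3x3 convolution by shifting
        for dx in range(3):
            resp += M[dy, dx] * I[dy:dy + H - 2, dx:dx + W - 2]
    return float(np.sqrt(np.pi / 2.0) * np.abs(resp).sum()
                 / (6.0 * (H - 2) * (W - 2)))
```

The kernel annihilates locally linear intensity ramps, so smooth structure contributes nothing and only pixel-level fluctuations (noise, fine texture) remain; the masks in [18] exist precisely to exclude textured regions that would otherwise be mistaken for noise.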
More recently, Tomasi [20] also proposed to directly maximize the number of detected and successfully matched features across frames as a proxy for VO performance through a self-supervised learning method. However, from the current state of the literature, it remains unclear how these metrics compare to one another as substitutes for VO performance, since direct cross-correlation analyses are rarely offered. Furthermore, other metrics, such as the Lowe ratio, have not yet been incorporated into an AE algorithm; the merits of the Lowe ratio as a proxy of VO performance are therefore explored in this work.

Prediction Models
While purely reactive approaches (e.g., [12,13,18]) do not require any characterization of the camera, they do require the slow process of sampling real-world images to converge. Instead, predictive AE algorithms use a model to predict the effect of future exposure parameters on the optimization metric. For learned control policies (e.g., [14,17,20]), a predictive model is implicitly embedded in the policy. In contrast, explicit predictive methods offer more interpretability and leverage known information about the camera's image acquisition process. For instance, Shim [16,23] proposed to use discrete γ transformations to predict the effects of future variations of exposure parameters. For each pixel intensity I_in ∈ {0, 1, . . . , 255} of an image and for any given γ ∈ R+, the predicted pixel intensity I_out ∈ {0, 1, . . . , 255} is mapped as

I_out = 255 (I_in / 255)^γ.    (15)

This approach does not require any camera characterization, but there exists no direct link between γ transformations and changes in camera exposure, other than that γ < 1.0 simulates a more exposed image while γ > 1.0 simulates a less exposed one. To avoid this limitation, Zhang [15] proposed to leverage the photometric response function (PRF) of the camera as a way to make better predictions. This function f_PRF maps the exposure E of the camera sensor to the intensity I_out of the image:

I_out = f_PRF(E).

The PRF of a camera can be found through a simple calibration procedure [25] or be estimated online [26], but only up to an offset which depends on the unknown scene irradiance. While Zhang leverages this PRF within a gradient descent step to select the next exposure parameters, it has not been incorporated into an explicit prediction step similar to Shim's algorithm.
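In practice, the discrete γ transformation (15) is applied through a 256-entry lookup table built once per γ value. The sketch below assumes the power-law form above, with γ < 1 brightening the image:

```python
import numpy as np

def gamma_lut(gamma):
    """256-entry lookup table for I_out = 255 * (I_in / 255) ** gamma."""
    levels = np.arange(256) / 255.0
    return np.round(255.0 * levels ** gamma).astype(np.uint8)

def apply_gamma(img, gamma):
    """Apply the discrete gamma transformation to an 8-bit image."""
    return gamma_lut(gamma)[img]   # fancy indexing = per-pixel table lookup
```

Because the table is precomputed, simulating one exposure change costs a single indexing pass over the image, which is what makes sampling many candidate transformations per frame affordable.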

Gain/Exposure Balance Strategies
From the definition of the exposure level E given in (1), there are infinitely many combinations of exposure time and gain that will result in the same exposure level. Most existing AE algorithms (e.g., [12,15,16,19]) disambiguate this choice using an "exposure priority". With this approach, the exposure time is always adjusted first while maintaining gain at a fixed low value. When exposure time reaches an upper limit, gain is increased to meet the required exposure level, thus minimizing image noise. In low-light conditions, this method is seriously prone to motion blur. One way to solve the issue is to impose some fixed relationship between g and t (e.g., g = kt, where k ∈ R + ) [13]. However, such a relationship is suboptimal as it does not allow motion blur and noise to be balanced dynamically based on the current motion of the camera.

Control Policies
Almost every algorithm summarized in Table 1 employs a different control policy. Yet, the merits of each (apart from allowing some predictive control or leveraging some learning method) cannot realistically be compared in isolation from the choice of optimization metric, prediction model, and gain/exposure balance strategy. Among these, Shim's AE algorithm [16,19] stands apart by its use of an explicit prediction model which enables the quick convergence of the camera exposure parameters with a simple feedback law. It is also the method closest to the one proposed in this work.
For every incoming frame, the authors simulate changes in exposure by applying a sequence of discrete γ transformations (15) to the input image. The gradient information metric of each simulated image is then computed and a polynomial function f_fit(γ) is fit to the resulting data. The optimal transformation is then approximated as γ* = arg max f_fit(γ). As there exists no direct relationship between γ transformations and changes in exposure level, the authors then set the exposure level of the next frame E_{k+1} using a nonlinear update whose step size grows with the deviation of γ* from unity and is shaped by a proportional gain K_p and a nonlinearity parameter d.

Proposed Approach
The aim of the present work is to detail and support the development of an auto-exposure algorithm for the purpose of vision-based robot localization in challenging light conditions. Unlike other methods, the algorithm detailed in Section 2.1 fully leverages the camera's PRF to predict the exposure that maximizes gradient information. It also incorporates feedback on the error between the actual and predicted gradient information metrics to compensate for PRF inaccuracies due to image saturation. Finally, it balances gain and exposure time based on time-varying predictions of motion blur intensity. Using the setup and testing procedure detailed in Section 2.2, the overall performance of the algorithm is assessed through extensive experimental validation. First, a cross-correlation analysis (Section 3.1) over a wide range of optimization metrics supports the use of gradient information as an appropriate proxy of VO performance. A convergence analysis (Section 3.2) then demonstrates the respective effects of using PRF-based predictions and of using feedback to compensate for prediction errors due to saturation. Finally, the AE algorithm's ability to reduce robot localization error is assessed experimentally in Section 3.3, demonstrating that the proposed approach outperforms other exposure control approaches.

Proposed AE Algorithm
The proposed algorithm actively adjusts the camera gain and exposure time to improve VO performance by maximizing the image gradient information metric (5). While the algorithm can handle any proxy of VO performance, this choice of metric is supported by the detailed comparison included in Section 3.1. A schematic and pseudocode summarizing the algorithm are provided in Figure 2 and Algorithm 1. The C++ source code and a ROS wrapper for this algorithm are made publicly available (https://github.com/MIT-Bilab/voautoexpose accessed on 30 November 2021). The code also includes options to experiment with the different alterations of the algorithm tested in this work. It supports, for instance, Shim's prediction model based on γ transformations by reusing some portions of the open-source code shared by Mehta [19].

Figure 2. Schematic of the AE algorithm proposed in the present work. A rough optical flow is first computed from the previous frame; Step 1 produces image predictions, Step 2 finds the maximum and predicts the best exposure level, and the gain g_{k+1} and exposure time t_{k+1} of the new frame are then balanced.
Algorithm 1 AE for challenging light conditions.
1: function ROUGHOPTICALFLOW(I_{k−1}, I_k)
2:   Ǐ_{k−1}, Ǐ_k ← Downsize I_{k−1} and I_k (e.g., 90 × 68)
3:   Δx, Δy ← Compute Farneback optical flow from Ǐ_k and Ǐ_{k−1}
4: end function
5: for every new frame I_k do
6:   I_k ← Apply median blur to I_k
7:   for i ← 1, n_predictions do                              ▷ Step 1: Image predictions
8:     I_predict ← Predict image based on ΔE_i using lookup table i
9:     p_i ← Compute gradient information from I_predict using Sobel operators and (5)
10:  end for
11:  f_fit(ΔE) ← Compute 6th-degree polynomial least-squares approximation of p = f(ΔE)
12:  ΔE* ← arg max f_fit(ΔE) using Newton's method            ▷ Step 2: Find maximum
13:  M_k ← Compute gradient information of I_k according to (5) ▷ Step 3: Saturation feedback
14:  M*_{k+1} ← Predict gradient information of next frame as f_fit(ΔE*)
15:  E_{k+1} ← Compute according to (22)
16:  g_{k+1}, t_{k+1} ← Balance gain and exposure time
17: end for

For every new camera frame I_k, the main algorithm loop consists of first predicting the best exposure level E_{k+1} of the next frame. This process can be broken down into three main steps. In Step 1, n_predictions discrete changes in exposure level are artificially applied to the image. Unlike Shim's method, which uses predictions based on γ transformations, changes in exposure level can only be predicted if the camera's PRF is available. Figure 3a, for instance, shows the PRF of the camera used in this work. It relates the exposure level of any pixel to its intensity and can be obtained from a simple static calibration procedure [25]. The function is only defined up to some offset in E which depends on the illumination of the scene. Given the camera's PRF and for any given change in exposure level ΔE, one can approximate a monotonically increasing function g_exp, similar to the γ transformation (15), which maps the intensity value I_in of every pixel in the input image to its predicted value I_out:

I_out = g_exp(I_in, ΔE) = f_PRF( f_PRF^{−1}(I_in) + ΔE ).

Examples of g_exp transformations for different changes in exposure level are provided in Figure 3b. The figure also includes examples of γ transformations to illustrate the difference.
The justification for using g_exp instead of γ transformations is provided by the detailed comparison included in Section 3.2.
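Like the γ transformation, g_exp can be tabulated once per candidate ΔE. The sketch below composes a PRF with its numerical inverse; the sigmoidal f_prf used here is a hypothetical stand-in, since a real camera's PRF comes from photometric calibration [25] and is known only up to an offset in E:

```python
import numpy as np

def f_prf(E):
    """Hypothetical sigmoidal PRF mapping exposure level E to intensity."""
    return 255.0 / (1.0 + np.exp(-1.5 * E))

def g_exp_lut(delta_E, n_samples=4096):
    """Lookup table for I_out = f_PRF(f_PRF^-1(I_in) + delta_E)."""
    E = np.linspace(-8.0, 8.0, n_samples)    # tabulate the PRF on a dense grid
    I = f_prf(E)                             # monotonically increasing
    I_in = np.arange(256, dtype=float)
    E_in = np.interp(I_in, I, E)             # numerical inverse f_PRF^-1
    I_out = np.interp(E_in + delta_E, E, I)  # shift exposure, map back
    return np.clip(np.round(I_out), 0, 255).astype(np.uint8)
```

Saturated intensities sit on the flat ends of the PRF, where the inverse is ill-defined; this is exactly the source of the prediction errors that Step 3 of the algorithm compensates for.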
Step 1 terminates with the calculation of the gradient information metric of each simulated image I_predict. The metric of the i-th simulated image is denoted as p_i. Step 2 consists of estimating the change in exposure level ΔE* that maximizes gradient information. A least-squares 6th-degree polynomial approximation f_fit is fit through the metrics of the simulated images. Newton's iterative method initialized at the origin is then used to approximate ΔE* as the maximum argument of f_fit. This step is largely inspired by Shim's approach to finding the optimal γ* transformation, as described in Section 1.4. Finally, Step 3 aims to compensate for prediction errors due to image saturation. Indeed, saturated pixels cannot accurately be mapped by g_exp, which is problematic for large and sudden changes in lighting conditions (e.g., lights turning on/off). Under such circumstances, the g_exp transformation systematically underestimates changes in exposure level which could greatly improve gradient information by unsaturating some part of an image. This leads to smaller steps |ΔE| and a longer convergence time. To circumvent the issue, the proposed strategy is to artificially increase |ΔE| when the improvement in the gradient information metric from one frame to the next is substantially better than predicted (e.g., Figure 4). Let r be the ratio of the actual change in the gradient information metric over the predicted one:

r = (M_k − M_{k−1} + ε) / (M*_k − M_{k−1} + ε),

where ε ∈ R+ is a relatively small number (e.g., ε = 0.1) and M*_k = f_fit(ΔE*). Then, the exposure level E_{k+1} of the next frame is selected as

E_{k+1} = E_k + f_d ΔE*  if r > r_th,   E_{k+1} = E_k + ΔE*  otherwise,    (22)

where f_d > 1.0 is a constant factor on the step size and r_th ∈ R+ is a threshold on r (e.g., f_d = 1.5 and r_th = 1.1 were used in this work).
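Steps 2 and 3 can be sketched as below. The polynomial fit and Newton search follow the text; the exact regularized form of the ratio r is an assumption of this sketch (the ε-padded ratio of actual to predicted metric change), with f_d, r_th, and ε set to the values given above:

```python
import numpy as np

def best_delta_E(dE_samples, p_samples, degree=6, iters=20):
    """Step 2: fit f_fit and find its maximizer with Newton's method from 0."""
    coeffs = np.polyfit(dE_samples, p_samples, degree)  # least-squares fit
    d1 = np.polyder(coeffs)                             # f_fit'
    d2 = np.polyder(d1)                                 # f_fit''
    x = 0.0
    for _ in range(iters):                              # Newton on f_fit' = 0
        curv = np.polyval(d2, x)
        if abs(curv) < 1e-12:
            break
        x -= np.polyval(d1, x) / curv
    return float(np.clip(x, dE_samples.min(), dE_samples.max()))

def next_exposure_level(E_k, dE_star, M_prev, M_k, M_pred,
                        f_d=1.5, r_th=1.1, eps=0.1):
    """Step 3: saturation feedback of Eq. (22) on the exposure step."""
    r = (M_k - M_prev + eps) / (M_pred - M_prev + eps)  # actual vs. predicted gain
    step = f_d * dE_star if r > r_th else dE_star       # inflate step if r > r_th
    return E_k + step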
Once the desired exposure level E_{k+1} for the next frame is set, explicit values for the camera gain g_{k+1} and exposure time t_{k+1} still need to be selected. It is well known that image noise increases with gain, that motion blur increases with exposure time, and that both negatively affect localization. The relative importance of each effect largely depends on the specific VO/SLAM algorithm used. Feature-based algorithms, such as ORB-SLAM [27] and VINS-Fusion [28], for instance, are more strongly affected by motion blur and less affected by noise than methods such as DSO [29], which minimizes a photometric error. The proposed algorithm exploits a simple way to balance gain g_{k+1} and exposure time t_{k+1} based on a single constant factor w ∈ R+ weighting the relative importance of image noise and motion blur:

arg min_{t_{k+1}, g_{k+1}}  (d̄ t_{k+1})² + w (g_{k+1} + k_offset),    (23)

where k_offset ∈ R+ is a constant scalar (k_offset = 8 in this work), d̄ is the average speed of image points (pixels/second), which can be approximated with Farneback's optical flow method [30], and, from (1),

g_{k+1} = 2^{E_{k+1}} / t_{k+1}.

Hence, the procedure associates a cost that grows quadratically with the average motion blur length and linearly with gain. This choice of exponents for the cost function is consistent with the experimental characterization of ORB-SLAM3 included in Section 3.1, which shows that ORB-SLAM3 is more sensitive to motion blur than to image noise (gain) and that the rate at which it degrades increases with exposure time. The cost function (23) is thus specific to ORB-SLAM3 and might not be appropriate for direct VO methods such as DSO [29]; one should recharacterize the sensitivity of the chosen method with respect to noise and motion blur before deciding on specific exponents. The choice of the hyperparameter w also depends on the selected VO algorithm. In this work, w was hand-tuned for ORB-SLAM3: starting with a unit value, w was gradually decreased over multiple test sequences until peak performance was reached around w = 0.02.
Indeed, the AE algorithm can become unstable for small values of w, as the exposure parameters then vary too quickly. When using the AE algorithm with other VO methods, the same tuning procedure should be repeated; a direct method such as DSO [29], being more sensitive to image noise than feature-based methods like ORB-SLAM3, would likely call for a different balance.
If the minimization problem (23) is feasible, then it admits the unique solution

t_{k+1} = ( w 2^{E_{k+1}} / (2 d̄²) )^{1/3},   g_{k+1} = 2^{E_{k+1}} / t_{k+1}.

This solution depends on the average speed of image points d̄ determined by optical flow.
For small values ofd, the method selects images with a low gain, thus minimizing noise.
For large values ofd, the method selects images with a low exposure time, thus minimizing motion blur. Unlike existing AE methods, the one proposed thus leverages optical flow to select exposure parameters that are optimal given the current motion of points in the image.
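The balance step can be sketched as follows, assuming the cost of (23) with a quadratic motion-blur term and linear gain term; the exposure-time limits t_min and t_max are illustrative values, not the paper's:

```python
import numpy as np

def balance_gain_exposure(E_next, d_flow, w=0.02, t_min=1e-4, t_max=7.5e-3):
    """Split a target exposure level E_next into exposure time and gain.

    Minimizes J(t) = (d_flow * t)**2 + w * (2**E_next / t + k_offset);
    setting dJ/dt = 0 gives t = (w * 2**E_next / (2 * d_flow**2))**(1/3).
    The gain then absorbs the remainder of the exposure level, per (1).
    """
    target = 2.0 ** E_next                 # required product t * g
    if d_flow <= 0.0:                      # static scene: longest exposure, least noise
        t = t_max
    else:
        t = (w * target / (2.0 * d_flow ** 2)) ** (1.0 / 3.0)
        t = float(np.clip(t, t_min, t_max))
    g = target / t
    return t, g
```

For small d̄ the cube root grows, pushing the solution toward long exposures and low gain; for large d̄ it shrinks, trading gain (noise) for a shorter exposure (less blur), exactly as described above.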

Experimental Setup
The experimental setup used in this work and shown in Figure 5 comprises two monochrome machine vision cameras (FLIR BFS-U3-16S2M) 80 mm apart and simultaneously triggered at 60 Hz. Images are acquired at a 720 × 540 resolution using a 2 × 2 binning in order to increase light sensitivity. All experiments are performed with the cameras mounted on a custom three-axis (xy-yaw) motion table. Each axis is actuated by a NEMA 23 stepper motor. The drives used to power each motor (Tinkerforge silent stepper bricks) also provide a ground truth trajectory with a precision of ~0.1 mm. The cameras communicate over USB3 with a separate computer that runs the proposed AE algorithm in real time at 60 Hz on a single core of an AMD Ryzen 7 3700x CPU.
A top-view schematic of the static scene observed throughout the experiments is shown in Figure 6a. Targets with a unique texture (e.g., AprilTags), such as shown in Figure 6b, were plastered throughout the room in order to prevent tracking failures of the VO algorithm. The distance between the camera and these targets varies during the recording between approximately 0.2 m and 2.5 m.

Benchmarking Proxies of VO Performance
In order to benchmark the different proxies of VO performance introduced in Section 1.1, the motion table was commanded to execute a preset path. The maximum linear speed reached over the trajectory is 100 mm/s and the maximum rotation speed is 100 deg/s. Each frame of the left camera video feed was preprocessed with a median blur filter and a static γ transformation of 0.3 to enhance image contrast. Four different static exposure settings were used and, for each setting, the trajectory was repeated four times. The video feed of the cameras for each repetition was then post-processed 10 times with ORB-SLAM3 [27] in stereo mode without loop closures. Each estimated trajectory was then compared against the ground-truth trajectory, and VO performance was measured using the mean translation relative pose error (RPE) computed over 20 mm sub-trajectories [31]. The VO performance for the four static camera exposure settings is presented in Figure 7. It shows that increasing the exposure time increases both the median RPE and the spread of the results. It also shows that ORB-SLAM3 is relatively insensitive to the image noise caused by high gain. This supports choosing an AE algorithm with a gain/exposure balance strategy that favors low exposure times (small w). The best and worst trajectories estimated by the VO algorithm are also overlaid over the ground-truth trajectory of the left camera in Figure 8. Then, for each image of the video sequence recorded by the left camera, the following metrics were computed: the gradient information metric M_g, the gradient median (similar to Zhang's "soft" percentile metric [15]), the entropy metric M_e, the entropy-weighted gradient metric M_ewg, and the quality metric M_q. In addition, from the synchronized left and right camera frames, the number of good stereo matches and the Lowe ratio for these matches were also computed.
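The RPE evaluation above can be sketched as follows. This is a simplified, translation-only, 2-D version; the benchmark of [31] also accounts for rotations and uses full SE(3) poses:

```python
import numpy as np

def translation_rpe(gt, est, segment=20.0):
    """Mean translation RPE over fixed-length sub-trajectories.

    For each pose, find the later pose ~`segment` mm of traveled distance
    ahead and compare the relative displacement of the estimate against the
    ground truth. `gt` and `est` are (N, 2) position arrays in mm.
    """
    steps = np.linalg.norm(np.diff(gt, axis=0), axis=1)
    dist = np.concatenate(([0.0], np.cumsum(steps)))   # traveled distance
    errors = []
    for i in range(len(gt)):
        j = np.searchsorted(dist, dist[i] + segment)   # end of sub-trajectory
        if j >= len(gt):
            break
        errors.append(np.linalg.norm((est[j] - est[i]) - (gt[j] - gt[i])))
    return float(np.mean(errors)) if errors else float("nan")
```

Because only relative displacements are compared, a constant offset between the two trajectories contributes no error, while scale errors and local drift do.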
For each sequence, the median of the metric over the whole sequence is plotted in Figure 9 against the corresponding average translation RPE. The gradient information metric (a) and Lowe ratio (g) show the best correlations with VO error (r = −0.73 for both). As computing the Lowe ratio is more computationally expensive, Shim's gradient information was selected as the optimization metric for the AE algorithm proposed in this work. Although the median gradient (Figure 9b) is closely related to the gradient information metric, image noise can artificially increase its value, which explains the lower correlation (r = −0.37). In comparison, the thresholding function (3) used in the definition of the gradient information metric mitigates this bias, as detailed in Section 1.
In [13], the authors show that image entropy is maximized when the number of under- or overexposed pixels of an image is minimized. The authors then show that under static conditions, images selected based on entropy lead to better localization compared to some static exposure parameters. However, out of all the tested metrics in the present work, image entropy (Figure 9c) has the worst cross-correlation with VO error (r = 0.26). This indicates that while the metric might limit the number of saturated pixels, it fails to properly capture the detrimental effects of motion blur and noise. Another explanation for this poor result is that, in some cases, it might also be beneficial to allow some parts of the image to be under- or overexposed in order to highlight more informative regions of the image.
The entropy-weighted gradient M_ewg (Figure 9d) was initially proposed in [17] to minimize noise effects on the image gradient. As such, the cross-correlation achieved by the metric (r = −0.40) is slightly better than that achieved with the median gradient. Yet, it is still far from the cross-correlation achieved with the gradient information metric. This supports the conclusion that a thresholding function such as (3) is more effective at removing noise effects than weighting the image gradient with entropy.
Another way to limit sensitivity to noise was proposed in [18]. As detailed in Section 1.1, the authors introduced the quality metric M_q (Figure 9e). This metric is a hybrid between gradient information and entropy from which a weighted estimate of the image noise intensity is subtracted. Here, again, the poor cross-correlation achieved by the metric (r = −0.26) indicates that a thresholding function such as (3) is more effective at removing noise effects than subtracting the estimated noise intensity from the image metric. This low cross-correlation is also explained, in part, by M_q directly incorporating the entropy metric M_e, which does not correlate with VO error.
For the test sequences used in the present work, the cross-correlation between the number of good stereo matches (Figure 9f) and VO error was almost nonexistent (r = 0.02).
These good stereo matches were determined by first detecting ORB features in each pair of corresponding left and right images. Each feature in the left image was then matched to one in the right image using the k-nearest neighbors algorithm. Matches were labeled as "good" if the stereo epipolar error was smaller than 1 pixel and if the Lowe ratio of the match was smaller than 0.7. Despite this outlier rejection scheme, image noise was still found to have a large impact on the number of good matches. However, feature tracking performance largely depends on the saliency of the features. Hence, unlike the raw number of good stereo matches, the median Lowe ratio of those good matches (Figure 9g) strongly correlates with VO error (r = −0.73).
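The ratio-test part of this matching scheme can be sketched as below. For self-containment this version uses float descriptors with L2 distance rather than ORB's binary descriptors with Hamming distance, and it omits the epipolar check:

```python
import numpy as np

def good_matches(desc_left, desc_right, ratio_th=0.7):
    """Lowe ratio test: keep a match only if the nearest neighbor is clearly
    better than the second nearest (distance ratio below `ratio_th`).

    Returns a list of (left_index, right_index, ratio) tuples.
    """
    # Pairwise L2 distances, shape (n_left, n_right)
    d = np.linalg.norm(desc_left[:, None, :] - desc_right[None, :, :], axis=2)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])                 # two nearest neighbors
        best, second = d[i, order[0]], d[i, order[1]]
        ratio = best / second if second > 0 else 1.0
        if ratio < ratio_th:
            matches.append((i, int(order[0]), float(ratio)))
    return matches
```

A low ratio means the feature is distinctive, so the median ratio over the retained matches serves as a saliency score for the frame, which is the quantity correlated with VO error above.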

Convergence Analysis
The AE algorithm proposed in this work uses a prediction model based on the camera's PRF. It also incorporates feedback to correct for some of the prediction inaccuracies due to image saturation. As shown by the camera's response to a step in ambient light (Figure 10), these choices have a drastic effect on the camera's response time. For instance, without saturation feedback, the proposed algorithm can take up to 100 frames to converge, compared to about 15 frames with feedback. Similar convergence speeds can be achieved using Shim's control method [16,19], which incorporates a prediction model based on γ transformations. However, the two methods do not converge to the same exposure level. Figure 10. Step response of the proposed AE algorithm subject to an instantaneous increase in ambient light (from ~1 to ~150 lux). Frames are acquired and processed in real time at 60 Hz.
To investigate how γ transformations affect the predicted optimal exposure level, images of a static scene were acquired at different exposure levels (E). The gradient information metric (M_g) was then computed for each image, as well as the optimal transformation γ* predicted by Shim's method. Both are plotted as a function of exposure level in Figure 11. To simulate three different camera PRFs, the same procedure was repeated with static γ transformations of 0.6 and 0.3 applied to the incoming images. The idea behind Shim's approach is that M_g should also be maximum at the exposure level corresponding to γ* = 1. However, as can be seen from Figure 11, the procedure systematically underestimates the optimal exposure level. In comparison, the true optimal exposure level only varies slightly for the different camera PRFs. A prediction model using PRF-based transformations therefore avoids this bias and selects exposure levels that are closer to the true optimal ones. Figure 11. Gradient information metric for different exposure levels of the same static image. Each peak (marked with discontinuous vertical lines) corresponds to the true optimal exposure level. Peaks for the bottom graph (marked with continuous vertical lines) correspond to the optimal exposure levels predicted by Shim's γ transformations [16]. Each colored curve represents a different camera PRF.

VO Performance
The performance of the proposed AE algorithm was tested and directly compared to using fixed exposure parameters, the camera's built-in AE algorithm, and Mehta's open-source implementation of Shim's algorithm [19]. A supplementary video illustrating this comparison is available online (https://youtu.be/Guvhvb-uQpE accessed on 30 November 2021). The fixed exposure parameters were hand-tuned such that tracking would not be lost due to under- or over-saturation. The reference pixel intensity value tracked by the built-in AE algorithm was set to 20% of the maximum pixel intensity value; using higher target values would result in some frames being overexposed and tracking being lost. To allow a fair comparison, Mehta's original code was altered to use the gain/exposure balance strategy of this work. Settings for the nonlinear controller were also selected to obtain a convergence time similar to the method proposed in this work (i.e., k_p = 1.6 and d = 0.1, as demonstrated in Section 3.2).
For these tests, the maximum exposure time of the camera was set to 7.5 ms, which represents about half of the cameras' sampling time at 60 Hz. All incoming images were again preprocessed with a median blur filter and a contrast-enhancing γ transformation of 0.3. Each exposure method was tested in two scenarios. For both scenarios, the camera underwent the same trajectory and the objects in the scene remained the same. However, in scenario (a), lighting varied greatly between the different regions of the image (1-217 lux), while in scenario (b), lighting remained relatively low and constant (2-4 lux).
As can be seen in Figure 12, the proposed AE algorithm systematically produces images with higher gradient metrics than the other active methods. The mean VO accuracy of each method is also compared in Figure 13, demonstrating that the proposed algorithm achieves a lower tracking error. While the static parameters result in a performance similar to the proposed method in scenario (a), the same static parameters in (b) result in suboptimal performance. It should be mentioned that the test conditions (a) and (b) were chosen such that static parameters would generate images that a VO algorithm can track. Reusing the same parameters in drastically different light conditions (e.g., in sunlight) instead systematically results in VO failure. Finally, the exposure parameters selected by each exposure control method are compared in Figure 14 for scenario (a). As underlined in Section 3.2, Shim's control method (which uses predictions based on γ transformations) underestimates the optimal camera exposure, leading to suboptimal VO performance.

Optimization Metrics
One of the main differences between existing AE algorithms is the metric being (implicitly or explicitly) optimized. Even though the gradient information metric (5) first proposed by Shim can be sensitive to high noise levels, it was found in this work to be an acceptable proxy for VO performance: compared to the other metrics tested, it exhibits the best linear correlation with the mean VO localization error. This conclusion contrasts with some of the existing literature advocating the superiority of other metrics, but to the authors' knowledge, no other work has previously compared metrics through an extensive direct cross-correlation with VO performance. For instance, Kim's [17] benchmark involves comparing the saturation rate of images defined as optimal according to the different metrics. Zhang [15] compares metrics based on the number of FAST features detected in the "best" image of different standard datasets (where the "best" image is the one with the highest score and thus varies according to the metric used); this approach still does not directly relate metric values to VO performance. Shin [18] uses the approach closest to this work by directly comparing the absolute pose error associated with images selected according to the different metrics. The authors conclude that the quality metric M_q is a better proxy of VO performance, as the best images predicted by other metrics tend to be highly noisy. However, the underlying assumption is that AE algorithms optimize a metric over the whole parameter space. In practice, most AE algorithms, including this one, avoid the problem by first optimizing the metric over the exposure level and then using a separate strategy to balance gain and exposure time.
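For reference, a Shim-style gradient information metric can be sketched as follows. Gradient magnitudes below an activation threshold σ contribute nothing (suppressing low-level noise), while larger magnitudes are compressed logarithmically so that a few strong edges do not dominate. The parameter values λ and σ below are illustrative defaults, not necessarily those of (5).

```python
import math

def gradient_info(img, lam=1e3, sigma=0.06):
    """Sketch of a gradient information metric. `img` is a 2D list of
    intensities in [0, 1]; forward differences approximate the gradient."""
    h, w = len(img), len(img[0])
    norm = math.log(lam * (1.0 - sigma) + 1.0)   # normalizes weights to [0, 1]
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]
            gy = img[y + 1][x] - img[y][x]
            m = min(1.0, math.hypot(gx, gy))
            if m >= sigma:
                total += math.log(lam * (m - sigma) + 1.0) / norm
    return total
```

A uniformly exposed but textureless (or fully saturated) image scores zero, while a well-exposed image with many visible edges scores high, which is why the metric tracks feature-based VO performance.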

Prediction Models
As underlined in Section 1.2, existing AE algorithms use different models to predict the effects of future exposure parameters. Results presented in this work support using a prediction model based on the camera's PRF rather than γ transformations, owing to the bias introduced by the latter. Although Zhang [15] also relies on the camera's PRF to predict optimal changes in exposure level, the authors only evaluate the gradient of the metric at the current exposure level to inform the size of a gradient descent step. Similar to Shim's approach, the proposed algorithm instead uses a set of discrete mappings based on the PRF (20), which greatly increases the range over which predictions are valid.
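The discrete-mapping idea can be sketched as a set of precomputed lookup tables, one per candidate exposure ratio, each mapping a current 8-bit intensity to its predicted intensity. The `prf`/`prf_inv` arguments below are placeholders for a calibrated response function and its inverse; the exact mappings of (20) may differ.

```python
def build_exposure_luts(prf, prf_inv, ratios, levels=256):
    """Precompute one `levels`-entry lookup table per candidate exposure
    ratio. Applying a table to every pixel simulates the image that the
    camera would capture at that exposure, including saturation."""
    luts = {}
    for r in ratios:
        lut = []
        for v in range(levels):
            x = prf_inv(v / (levels - 1))    # recover normalized irradiance
            y = prf(min(1.0, r * x))         # scale exposure, clip at saturation
            lut.append(round(y * (levels - 1)))
        luts[r] = lut
    return luts
```

The metric can then be evaluated on each simulated image, and the ratio with the best predicted score selected, without ever capturing the candidate exposures.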

Computational Efficiency
Despite using a rough optical flow and a set of discrete simulations to drive the selection of exposure parameters, the C++ implementation of the proposed algorithm is able to run at 60 Hz on a single core of an AMD Ryzen 7 3700x (3.60 GHz) using a downsized image resolution of 360 × 270 pixels for simulations and 90 × 68 pixels for the optical flow. This real-time performance is competitive with other standard AE algorithms. Indeed, this is similar to the performance obtained by Shim [16], who reported achieving 70 Hz using an Intel Core i5-6260U (1.80 GHz) for a downsized image resolution of 320 × 240 pixels. While Tomasi [20] was able to reach a processing rate of 640 Hz using a trained CNN, the algorithm was implemented on an NVIDIA GeForce GTX 1050 Ti GPU. Reimplementing the proposed algorithm on a GPU (which are known to be 1-2 orders of magnitude faster than CPUs for image processing) would likely yield a similar processing rate.
Some AE algorithms can, however, achieve significantly lower processing times, which might prove more useful for applications that require high frame rates with limited computational power (e.g., high-speed VO on drones). For these applications, the higher frame rate enabled by the quicker AE algorithm might offset the limitation of a suboptimal selection of the camera exposure parameters. For instance, the AE algorithm built into most modern cameras requires negligible run time, as it implements a PI controller on the difference between the average pixel intensity and some reference value. Similarly, the AE control policies proposed by Kim [17] (Gaussian inference) and Shin [18] (Nelder-Mead optimization) also require minimal processing time relative to the computation of the actual optimization metric employed. For instance, Shin reported a computation time of <0.01 ms for a step of the Nelder-Mead method, compared to the 3.23 ms required to compute the gradient-based metric of an image downsized to 800 × 600 px on an i7-7700HQ (2.80 GHz). Both methods, however, have the downside of requiring query images before they can converge. These query images lead to large, oscillating changes in exposure parameters, which can be detrimental to VO. Finally, while Zhang [15] does not provide details on the computational performance of the method, the algorithm does involve a few more steps, such as transforming each pixel intensity with the inverse of the camera's PRF, computing the gradients of both the image and the transformed image, and ordering the list of the derivatives of the gradient magnitudes at each pixel.
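Such a built-in mean-intensity controller can be sketched in a few lines; the class name, gains, and reference value below are illustrative, not taken from any camera datasheet.

```python
class MeanIntensityPI:
    """PI controller of the kind built into most cameras: drives the
    image's mean intensity toward a reference value by scaling the
    exposure level multiplicatively (which keeps it positive)."""

    def __init__(self, ref=0.2, kp=0.8, ki=0.1):
        self.ref, self.kp, self.ki = ref, kp, ki
        self.integral = 0.0

    def update(self, mean_intensity, exposure_level):
        err = self.ref - mean_intensity      # positive if image too dark
        self.integral += err                 # accumulated error (I term)
        return exposure_level * (1.0 + self.kp * err + self.ki * self.integral)
```

Its run time is negligible because it only needs the image mean, but it optimizes brightness rather than any measure of gradient information, which is why it underperforms in HDR scenes.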

Saturation Feedback
To the authors' knowledge, no other AE algorithm incorporates feedback on the difference between the predicted and the actual image metric to compensate for prediction errors due to saturated pixels. These prediction errors are especially pronounced when the scene undergoes large and sudden changes in illumination. For instance, saturation feedback was shown to improve the algorithm's speed of convergence after the lights in a room are turned on or off. However, one limitation of the method is that it adjusts the change in exposure level between the current and the next frame based on the difference between the current image metric and the one previously predicted. This one-frame delay can lead the proposed AE algorithm to sporadically overshoot the optimal exposure level, especially for larger values of the step size control parameter f_d.
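One possible form of this feedback, shown purely as a sketch (the exact update rule and the precise role of f_d in the paper may differ), shrinks the next exposure-level step in proportion to how far the realized metric fell short of the prediction:

```python
def saturation_feedback_step(step, m_actual, m_predicted, f_d=0.5):
    """Sketch of saturation feedback: if the previous prediction
    overestimated the achievable metric (typically because pixels
    saturated), reduce the next exposure-level step accordingly."""
    if m_predicted <= 0.0:
        return step
    error = (m_predicted - m_actual) / m_predicted  # relative prediction error
    # Positive error => the image turned out worse than predicted: back off.
    return step * max(0.0, 1.0 - f_d * error)
```

Because `m_actual` only becomes available one frame after the prediction was made, any correction necessarily lags by a frame, which is the source of the sporadic overshoot noted above.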

Conclusions
Overall, the proposed AE algorithm was shown through experimental validation to perform at least as well as (and, on average, better than) other exposure control approaches under different challenging light conditions. One limitation of this validation is that it relies on the use of ORB-SLAM3 [27], which is known to be relatively insensitive to image noise. Future work should therefore include validation of the algorithm with VO/SLAM pipelines that are more sensitive to noise, such as direct methods like DSO [29]. Indeed, some of the design choices made in this work, including the hyperparameter w and the order of the exponents associated with the terms in (23), were made specifically for ORB-SLAM3 and may not be applicable to other VO methods. Another limitation of this work is that it assumes that a robot relies entirely on vision for localization. In practice, a suite of sensors (e.g., IMU, wheel encoders, LiDAR) can compensate for some of the inaccuracies of VO pipelines. Future work should therefore also explore the contribution of the proposed AE algorithm when information from other sensing modalities is also present.

Data Availability Statement:
The data presented in this study are openly available and can be found here: (https://github.com/MIT-Bilab/vo-autoexpose accessed on 30 November 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

AE	Auto-exposure
CNN	Convolutional neural network
HDR	High dynamic range
IMU	Inertial measurement unit
PRF	Photometric response function
ROI	Region of interest
SLAM	Simultaneous localization and mapping
VO	Visual odometry