Study on the Influence of Image Noise on Monocular Feature-Based Visual SLAM Based on FFDNet

Noise appears in images captured by real cameras. This paper studies the influence of noise on monocular feature-based visual Simultaneous Localization and Mapping (SLAM). First, an open-source synthetic dataset with different noise levels is introduced. Then the images in the dataset are denoised using the Fast and Flexible Denoising convolutional neural Network (FFDNet), and the matching performances of Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), which are commonly used in feature-based SLAM, are compared. The results show that ORB achieves a higher correct matching rate than SIFT and SURF, and that denoised images have a higher correct matching rate than noisy images. Next, the Absolute Trajectory Error (ATE) of the noisy and denoised sequences is evaluated on ORB-SLAM2, and the results show that the denoised sequences outperform the noisy sequences at every noise level. Finally, the completely clean sequence in the dataset and the sequences of the KITTI dataset are denoised and compared with the original sequences through comprehensive experiments. For the clean sequence, the Root-Mean-Square Error (RMSE) of the ATE after denoising decreases by 16.75%; for the KITTI sequences, 7 out of 10 denoised sequences have a lower RMSE than the original sequences. These results show that denoised images can achieve higher accuracy in monocular feature-based visual SLAM under certain conditions.


Introduction
Simultaneous Localization and Mapping (SLAM) has long been an important research direction in computer vision and robotics. It is the basic module for many localization applications, such as mobile robots, micro aerial vehicles, autonomous driving, virtual reality, augmented reality, and so forth. Various sensors can implement a SLAM algorithm, such as GPS, LiDAR, IMU, and cameras. A camera based on image sensors has the advantages of small size, low weight, low power, and low cost; it also provides rich information and is easy to use. Visual SLAM methods using image sensors have seen tremendous improvements in accuracy, robustness, and efficiency, and have gained increasing popularity over recent years [1]. They can be classified into monocular, binocular, and RGB-D visual SLAM according to the camera used. A monocular visual SLAM system uses only a single camera, which makes it a pure vision problem. In low-cost monocular visual SLAM, image noise, as well as its denoising, is an important issue that needs more investigation.
In visual SLAM, there are three most popular formulations of visual odometry, namely direct, semi-direct, and feature-based methods. In the representative work of these methods, the DSO [2,3] and LSD-SLAM [4,5] are direct methods, the SVO [6,7] is a semi-direct method and the ORB-SLAM [8,9] is a feature-based method. There are lots of problems that have a close influence on the performance of SLAM, such as photometric calibration, motion bias, rolling shutter effect [1], and camera noise.
To better understand the influence of noise on a SLAM system, the other conditions are fixed as control variables. Existing methods deal with image noise in different ways. In LSD-SLAM [4,5], the image noise is assumed to be Gaussian, but there is no quantitative analysis of how the noise intensity affects the system. In DSO [2], geometric and photometric noise are studied, and the research shows that DSO, which models the noise, is more robust to photometric noise than ORB-SLAM, because feature matching tends to fail when the photometric noise is high. In SVO (a semi-direct method) and ORB-SLAM [8,9] (a feature-based method), the feature points are assumed to be robust to noise, so image noise is not discussed separately. Image denoising is an effective remedy in visual SLAM: Cho et al. denoised the image sequences first and improved visual SLAM performance in a turbid environment [10]. Chen et al. used deep convolutional neural networks (DCNNs) for semantic segmentation, restricting feature matching to the pixel area of the same object in different frames and thereby reducing the influence of noise on the matching [11]. Liang proposed a precise Iterative Closest Point (ICP) algorithm to overcome noise and outliers and complete precise point cloud registration [12]. Zhang et al. performed feature matching against two past frames to reduce the sensitivity of the features to noise [13].
Generally, visual SLAM methods are based on the assumption that the images have a low noise level. Under normal conditions, noise is difficult to notice, even with standard cameras; however, it rises quickly when the exposure time has to be long, especially in dark scenes. This is very challenging for existing monocular feature-based SLAM: existing visual SLAM methods all suffer from accuracy degradation, or even failure, once the image noise reaches a certain intensity. How, then, does image noise affect monocular feature-based visual SLAM, and what happens to accuracy and robustness after the images are denoised?
To answer these questions, a dataset containing images with different noise levels is needed. Many datasets exist to evaluate visual SLAM algorithms, such as the TUM dataset [14][15][16], EuRoC dataset [17], KITTI dataset [18,19], and ICL-NUIM RGB-D dataset [20]. However, the images in these datasets all have low noise levels and do not provide different levels of noise. Therefore, a dataset for evaluating the effect of noise on visual SLAM methods is extended from the previous publication WHU-RSVI [21]. WHU-RSVI provides only one typical noise level, whereas studying the noise of monocular visual SLAM requires different noise levels and types. Based on the noise model and the clean images introduced in [21], 33 sequences with different noise levels and types are generated, and the method to add image noise with customized noise types and levels is open-sourced. To reduce the impact of noise on the visual SLAM system, this paper uses the Fast and Flexible Denoising convolutional neural Network (FFDNet) [22] to denoise the images, and then the performance of the original and denoised sequences is evaluated.
Scale Invariant Feature Transform (SIFT) [23,24], Speeded Up Robust Features (SURF) [25], and Oriented FAST and Rotated BRIEF (ORB) [26] are the three most commonly used methods for feature extraction and matching in visual SLAM. The SIFT descriptor is scale invariant and can select the correct match for a key point from a large database of other key points, but its computation is relatively expensive. SURF is an accelerated robust feature method that uses integral images for image convolution; it is also a scale and rotation invariant feature description method. ORB addresses the lack of orientation of the FAST [27,28] detector and uses BRIEF [29] to describe the features. Compared with SIFT and SURF, ORB is rotation invariant and resistant to noise, and it greatly reduces the time required for feature extraction.
To evaluate the impact of different levels of noise on matching performance, the three commonly used features were evaluated on the proposed dataset. Then, according to the evaluation, a method using FFDNet to improve ORB-SLAM2 is proposed.
The contributions of this paper are as follows: (1) An open-source dataset extended from WHU-RSVI [21] for evaluating the noise of Monocular visual SLAM systems or denoising tasks. (2) An open-source software for adding image noise with customized noise types and levels.
(3) Quantitative evaluation of the influence of different levels of noise on the feature matching of SIFT, SURF, and ORB. (4) Accuracy improvement of ORB-SLAM2 system due to the image denoising.
To evaluate the influence of noise on visual SLAM methods, a dataset with different levels of image noise is generated. The previous work on this dataset is WHU-RSVI [21], where the open-source ray-tracing software POV-Ray is used to generate the images.
The scene model is the "office" from "The Persistence of Ignorance" (http://www.ignorancia.org/), and the size of the model is 800 × 500 × 250 cm³. The resolution of the images in the dataset is 640 × 480, the camera's field of view is 90 degrees, and the intrinsic matrix of the camera in right-handed coordinates can be expressed as

    K = [ 320    0  320 ]
        [   0  320  240 ]    (1)
        [   0    0    1 ]

The function that maps the irradiance at an image pixel to its brightness value is called the Camera Response Function (CRF); it covers the variety of linear and non-linear relationships that cameras apply during imaging. Real CRFs were analyzed in detail by Grossberg and Nayar [30], who collected a diverse database of real-world camera response functions. A linear CRF is a common response model [31], and the CRF is set to linear in this paper.
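As a quick check on these numbers, the focal length implied by a 90-degree horizontal field of view at 640 × 480 resolution can be computed directly from the pinhole model. A minimal sketch (the helper name is ours, not from the paper):

```python
import math

def intrinsics_from_fov(width, height, fov_deg):
    """Pinhole intrinsics for a camera with the given horizontal field of view.

    f = (width / 2) / tan(fov / 2); the principal point is the image center.
    """
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    cx, cy = width / 2.0, height / 2.0
    return [[f, 0.0, cx],
            [0.0, f, cy],
            [0.0, 0.0, 1.0]]

# For a 90-degree FOV, tan(45 degrees) = 1, so f = 320 pixels.
K = intrinsics_from_fov(640, 480, 90.0)
```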
Unlike images captured by real cameras, the images generated by POV-Ray are noise-free. There are many types of noise in an image, such as photon shot noise, dark current shot noise, offset fixed pattern noise, dark current fixed pattern noise, source follower noise, quantization noise, and sense node reset noise [32]. The noise model is simplified by Liu et al. [33], and it can be expressed as

    I = f(L + n_s + n_c) + n_q,    (2)

where I is the pixel brightness from 0 to 255, f(·) represents the CRF, L represents the irradiance, n_s collects all noise components that depend on irradiance, n_c denotes the irradiance-independent noise before correction, and n_q is the additional quantization noise. Since most cameras achieve very low quantization noise, n_q is ignored in the model of this paper. The expectation of the irradiance-dependent noise is E(n_s) = 0 and its variance is Var(n_s) = L·σ_s²; the expectation of the irradiance-independent noise is E(n_c) = 0 and its variance is Var(n_c) = σ_c². Grossberg and Nayar [30] normalized the irradiance L and the RGB values to the range 0 to 1; in this paper, the irradiance L and the RGB values are discrete from 0 to 255 as mentioned in [34], and σ_s and σ_c are dimensionless values.
By taking different σ_s and σ_c, images with different noise levels can be obtained according to Equation (2). It is assumed that the noise levels are the same on the three channels of R, G, and B, and the image noise is divided into three types: irradiance-independent, irradiance-dependent, and a mixture of the two. Irradiance-independent noise is the familiar Gaussian noise; noisy images of different levels are obtained by adjusting σ_c. In this paper, σ_c ranges from 0.005 to 0.10: except for the value σ_c = 0.005, σ_c is incremented in steps of 0.01. The irradiance-dependent noise is obtained by adjusting σ_s, whose value ranges from 0.01 to 0.20: except for the value σ_s = 0.01, the step size is 0.02. For the noise mixing the irradiance-independent and irradiance-dependent components, the value of σ_s is twice the value of σ_c, and the steps are the same as when the two are taken separately. Images at different noise levels from the final sequences are shown in Figure 1.
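Under these definitions, adding noise of a chosen type and level to a clean image reduces to sampling n_s with variance L·σ_s² and n_c with variance σ_c² per pixel. A minimal NumPy sketch of that sampling, working in a normalized [0, 1] irradiance range (one common convention [30]; the function name and the normalization choice are ours, not the paper's released tool):

```python
import numpy as np

def add_noise(clean, sigma_s=0.0, sigma_c=0.0, rng=None):
    """Add noise per I = f(L + n_s + n_c) with a linear CRF f.

    clean: float array of irradiance L normalized to [0, 1].
    n_s ~ N(0, L * sigma_s**2)  (irradiance-dependent),
    n_c ~ N(0, sigma_c**2)      (irradiance-independent, i.e. Gaussian).
    """
    rng = np.random.default_rng(rng)
    L = np.asarray(clean, dtype=np.float64)
    n_s = rng.normal(0.0, 1.0, L.shape) * np.sqrt(L) * sigma_s
    n_c = rng.normal(0.0, sigma_c, L.shape)
    return np.clip(L + n_s + n_c, 0.0, 1.0)

# Mixed noise at the dataset's first level: sigma_s = 0.01, sigma_c = 0.005.
noisy = add_noise(np.full((480, 640), 0.5), sigma_s=0.01, sigma_c=0.005, rng=0)
```

Setting only `sigma_c` gives the pure Gaussian type, and setting only `sigma_s` gives the pure irradiance-dependent type.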

Noise Dataset
By modeling the trajectories of the sequence through B-spline curves, the ground truth of the trajectories of each sequence can be obtained. The format of the ground truth in the dataset is compatible with the TUM dataset. All sequences with different noise levels in the dataset have the same ground truth and the statistical information of the sequence is shown in Table 1.

Image Denoising
Image denoising is an important research direction in low-level vision. Noise inevitably occurs during the imaging process, which will seriously affect the quality of the image. Therefore, image denoising is an essential step in many image processing and computer vision tasks.
FFDNet [22,35] is used to denoise the images in this paper, with the model trained by Zhang et al. [22] using the open-source machine learning framework PyTorch (https://pytorch.org/). FFDNet is a recently developed image denoising method based on a convolutional neural network, proposed as a successor to DnCNN (denoising convolutional neural network) [36]. Compared with the learning-based DnCNN [36] and the classical BM3D [37], FFDNet has a faster execution time and a smaller memory footprint, and a single network model can deal with various noise levels. It has a low computational cost and can handle spatially variant noise, as shown in Figure 2. The input tensor is fed into a series of 3 × 3 convolutional layers built from three operations: Conv, ReLU (rectified linear unit), and BN (Batch Normalization) [38]. Specifically, the first layer is Conv + ReLU, the middle layers are Conv + BN + ReLU, and the last layer is Conv only. After each convolution, zero padding is used so that the size of the feature maps does not change. After the CNN, an upscaling operation reverses the downsampling performed at the input stage, producing a denoised image with the same size as the original input.
FFDNet takes a tunable noise level map as input, which makes the denoising model flexible to the noise level. To improve the efficiency of the denoiser, a reversible downsampling operator reshapes the input image of size W × H × C into four downsampled sub-images of size W/2 × H/2 × 4C, where W and H are the width and height of the input image and C is the number of channels (C = 1 for grayscale images, C = 3 for color images). The downsampling significantly improves the training speed without reducing the modeling capacity. Balancing complexity and performance, the number of convolution layers is empirically set to 15 for grayscale images and 12 for color images, and the number of feature-map channels to 64 for grayscale images and 96 for color images. Besides, unlike DnCNN, denoising the downsampled sub-images effectively enlarges the receptive field while keeping a suitable network depth. After the downsampling operation, the sub-images are fed into the CNN together with the adjustable noise level map, so that images with different noise levels can be processed. Denoised images are shown in (m)-(p) of Figure 3.
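The reversible downsampling can be sketched in NumPy as a "pixel unshuffle": each 2 × 2 block of the input becomes four values along the channel axis, and the inverse reshaping recovers the image exactly. A minimal sketch of the reshaping only (function names are ours; this is not the FFDNet implementation itself):

```python
import numpy as np

def pixel_unshuffle(img):
    """Reshape H x W x C into H/2 x W/2 x 4C; invertible, no information lost."""
    H, W, C = img.shape
    assert H % 2 == 0 and W % 2 == 0
    out = img.reshape(H // 2, 2, W // 2, 2, C)  # split each spatial axis into (coarse, fine)
    out = out.transpose(0, 2, 1, 3, 4)          # -> H/2, W/2, 2, 2, C
    return out.reshape(H // 2, W // 2, 4 * C)

def pixel_shuffle(sub):
    """Inverse: H/2 x W/2 x 4C back to H x W x C (the upscaling step after the CNN)."""
    h, w, c4 = sub.shape
    C = c4 // 4
    out = sub.reshape(h, w, 2, 2, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(2 * h, 2 * w, C)

img = np.arange(480 * 640 * 3, dtype=np.float32).reshape(480, 640, 3)
sub = pixel_unshuffle(img)  # shape (240, 320, 12)
```

Because the operation only rearranges pixels, `pixel_shuffle(pixel_unshuffle(img))` reproduces the input bit-for-bit, which is what makes the downsampling "reversible".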
Zhang et al. [22] evaluated the running time of FFDNet (the evaluation was performed in a Matlab (R2015b) environment on a computer with a six-core Intel(R) Core(TM) i7-5820K CPU @ 3.3 GHz, 32 GB of RAM, and an Nvidia Titan X Pascal GPU). They found that for a grayscale image of size 512 × 512, the denoising time is 0.012 s (about 83.3 Hz). The image size of our dataset is 640 × 480, which is close to 512 × 512 in terms of data size. This indicates that, on a computer with a high-performance GPU, FFDNet can meet the real-time requirements of visual SLAM. The average Peak Signal to Noise Ratio (PSNR) of the noisy sequences and denoised sequences is shown in Table 2.
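The PSNR values in Table 2 are the standard measure over 8-bit images; a minimal sketch of how a per-image value can be computed (averaging over a sequence is then a plain mean; the function name is ours):

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak Signal to Noise Ratio in dB between two 8-bit images."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 5 gray levels gives MSE = 25 and PSNR of about 34.15 dB.
ref = np.zeros((480, 640), dtype=np.uint8)
print(round(psnr(ref, ref + 5), 2))  # 34.15
```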

Feature Matching
The extraction and matching algorithms provided in OpenCV (https://opencv.org/) are used to compare the matching performance of the three features on the noise dataset. SIFT, SURF, and ORB are each used to extract approximately 500 features per image. SIFT and ORB can extract a specified number of features, whereas SURF cannot: its feature count is determined by the Hessian threshold, and the same threshold yields different feature counts on different images. The Hessian threshold is therefore adjusted per image to extract about 500 features in this paper.
Brute-force matching is performed after feature extraction: the distances between the descriptors of the features in the two images are measured and sorted, and the closest pair is taken as a match. The distance between two descriptors expresses the similarity between the two features. For SIFT and SURF, the distance is generally the Euclidean distance; for ORB, the Hamming distance is used since ORB employs a binary descriptor. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.
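A pure-Python sketch of this brute-force nearest-neighbour step for binary descriptors (the same computation OpenCV's BFMatcher performs with NORM_HAMMING; the descriptor values below are toy data):

```python
def hamming(d1, d2):
    """Hamming distance between two equal-length binary descriptors (bytes)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(d1, d2))

def brute_force_match(desc1, desc2):
    """For each descriptor in desc1, return (query index, index in desc2, distance)
    of its nearest neighbour; ties are broken by the first occurrence."""
    matches = []
    for i, d in enumerate(desc1):
        j, dist = min(((j, hamming(d, e)) for j, e in enumerate(desc2)),
                      key=lambda t: t[1])
        matches.append((i, j, dist))
    return matches

# Toy 2-byte "descriptors": the second candidate differs from the query by one bit.
desc1 = [bytes([0b10101010, 0b11110000])]
desc2 = [bytes([0b00000000, 0b00000000]), bytes([0b10101010, 0b11110001])]
print(brute_force_match(desc1, desc2))  # [(0, 1, 1)]
```

Real ORB descriptors are 32 bytes (256 bits); the logic is identical.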
The ground truth is used to verify the quality of feature matching. The depth of the images is obtained by MegaPOV (http://megapov.inetart.net/), and the projection model of the pinhole camera can be expressed as

    d_1 · p_1 = K (R_1 P + t_1),

where K is the camera's intrinsic matrix, p_1 = (u_1, v_1, 1)^T is the homogeneous pixel of the feature in the first image, P = (x, y, z)^T is the point in space, R_1 is the rotation matrix of the first image, t_1 is the translation vector, and d_1 is the depth of the pixel. This is a system of three linear equations in the unknowns x, y, and z. According to Cramer's rule, with

    D = det(K R_1)

a nonzero determinant, the system has a unique solution, whose individual values are given by

    x = D_x / D,    y = D_y / D,    z = D_z / D,

where D_x, D_y, and D_z are the determinants obtained by replacing the corresponding column of K R_1 with the right-hand side d_1 p_1 − K t_1. According to x, y, z and the known d_2, R_2, and t_2, the pixel of the point P in the second image can be calculated by

    p_2' = (1 / d_2) K (R_2 P + t_2).

The error between the matched points is then

    e = ||p_2' − p_2||,

the Euclidean pixel distance between the projected point p_2' and the matched point p_2. It is assumed that when the error is within 4 pixels, the match is correct. Correctly matched points are counted as inliers; otherwise, they are counted as outliers.
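A sketch of this ground-truth verification under the same pinhole model: back-project the matched pixel in image 1 with its known depth, transform the 3D point into the second camera, project, and accept the match if the pixel error is within 4 pixels (function and variable names are ours; poses are assumed to map world coordinates to camera coordinates):

```python
import numpy as np

def verify_match(K, R1, t1, d1, px1, R2, t2, px2, thresh=4.0):
    """Return (reprojection error in pixels, is_inlier) for a putative match px1 <-> px2."""
    p1 = np.array([px1[0], px1[1], 1.0])
    # Back-project: d1 * p1 = K (R1 P + t1)  =>  P = R1^T (d1 * K^-1 p1 - t1)
    P = R1.T @ (d1 * np.linalg.inv(K) @ p1 - t1)
    # Project into image 2 and dehomogenize.
    q = K @ (R2 @ P + t2)
    proj = q[:2] / q[2]
    err = float(np.linalg.norm(proj - np.asarray(px2, float)))
    return err, err <= thresh

K = np.array([[320.0, 0, 320], [0, 320, 240], [0, 0, 1.0]])
I3, z3 = np.eye(3), np.zeros(3)
# With identical poses the point must reproject exactly onto itself.
err, ok = verify_match(K, I3, z3, 2.5, (400, 250), I3, z3, (400, 250))
```

Here the linear system is solved with a matrix inverse rather than Cramer's rule; the solution is the same.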

Results and Discussions
The experiment platform of this paper is Intel Core i7-7700HQ (4 cores @ 2.80 GHz), 16 GB memory, and NVIDIA GeForce GTX 1050Ti. The mean execution time of FFDNet on the proposed dataset is 0.153 s per frame, and the tracking time of the ORB-SLAM is 0.025 s per frame.

Results of Feature Matching
Multiple pairs of images from different scenes in the dataset are selected for image matching, and the statistical analysis on irradiance-independent noise, irradiance-dependent noise, and mixed noise of the two is performed.
The results for irradiance-independent Gaussian noise are shown in Figure 4. The matching rate of ORB is the highest of the three, SURF is second, and SIFT is the lowest. The results for irradiance-dependent noise and the mixture of the two are shown in Figures 5 and 6. As with Gaussian noise, ORB achieves the highest matching rate for both irradiance-dependent noise and mixed noise. Unlike Gaussian noise, however, the matching rate for irradiance-dependent noise and mixed noise decreases more significantly as the noise intensity increases. For the denoised images, the matching rate is higher than for the original images, whether with ORB, SURF, or SIFT. The results of feature matching on the original images and the denoised images are shown in Table 3, which shows that the denoised images achieve a higher correct matching rate.

Results of Trajectories
ORB-SLAM2 [9] is used to evaluate the dataset sequences. 1000 features are extracted from each image, the scale factor of the image pyramid is set to 1.2, and the number of pyramid levels is set to 8. The image pyramid downsamples the image at different levels to obtain images of different resolutions.
The three types of image sequences: irradiance-independent noise, irradiance-dependent noise, and mixed noise are evaluated and compared with the sequence after image denoising.

Noised Sequences
For the noisy sequences in the dataset, each sequence is evaluated 10 times, and the mean values after excluding outliers represent the evaluation results. The irradiance-independent noisy sequences can be tracked successfully when σ_c is 0.05 or below, while tracking failure occurs when σ_c is greater than 0.05. As the noise intensity increases, the RMSE of the ATE increases significantly: from σ_c = 0.005 to σ_c = 0.05, it rises from 6.7 cm to 26.2 cm, an increase of 19.5 cm. For the denoised images, all sequences in this dataset can be tracked successfully, and the ATE stays at about 6.6 cm. This also reflects that FFDNet performs well on irradiance-independent Gaussian noise. The evaluation results are shown in Table 4.
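The RMSE of the ATE used throughout these comparisons is the root-mean-square of the per-frame translational errors between the estimated and ground-truth trajectories after time association (and, for monocular results, alignment). A minimal sketch on already-associated, already-aligned trajectories (toy data; the actual evaluation uses the TUM/evo tooling):

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE of the translational ATE for time-associated, aligned trajectories.

    gt, est: (N, 3) arrays of camera positions in meters.
    """
    err = np.linalg.norm(np.asarray(gt, float) - np.asarray(est, float), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Four poses, each 6 cm from the ground truth -> RMSE = 0.06 m.
gt = np.zeros((4, 3))
est = np.array([[0.06, 0, 0], [0, 0.06, 0], [0, 0, 0.06], [0.06, 0, 0]])
print(round(ate_rmse(gt, est), 3))  # 0.06
```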
For irradiance-dependent noise, the noisy sequences keep the ATE stable when the noise intensity σ_s < 0.04. As the noise intensity increases further, the RMSE of the ATE increases significantly: it is 7.3 cm at σ_s = 0.01 and rises to 27.5 cm at σ_s = 0.08, an increase of 20.2 cm. The denoised sequences can all be tracked successfully up to a noise intensity of σ_s = 0.16, and the RMSE remains stable over this range, as shown in Table 4.
The mixed noisy sequences are evaluated after the irradiance-independent and irradiance-dependent sequences; the results are shown in Table 4. Tracking failure occurs when the noise intensity exceeds σ_s = 0.06, σ_c = 0.03. The RMSE of the ATE is 7.1 cm at the noise intensity σ_s = 0.01, σ_c = 0.005 and increases to 14.8 cm at σ_s = 0.04, σ_c = 0.02, which indicates that the mixed noise has the largest effect on the robustness of ORB-SLAM2.
The evaluation results show that the denoised sequences have significantly improved robustness. In terms of accuracy, no matter the noise level, the accuracy of the denoised sequences is higher than that of the noisy sequences.

Clean Sequence
As shown in Figure 7, the clean images (noise intensity zero) differ from the clean images after denoising. The denoised image generally looks smoother and softer: because FFDNet estimates the noise level of an image, some content that is not noise is treated as noise and removed. As a result, the clean images after denoising by FFDNet are not identical to the original clean images. The clean sequence is evaluated separately 20 times, and the results are shown in Table 5. The evo tools [39] are used to plot the estimated trajectories, and the two trajectories with the RMSE closest to the average are selected for plotting. A visual comparison of the estimated trajectories of the original and denoised sequences is shown in Figure 8, and a per-axis comparison of the clean sequence and the denoised clean sequence is shown in Figure 9. Although the difference between the two is small, it can be seen from the z-axis that the denoised clean sequence has a smaller error than the original clean sequence. The statistics in Table 5 show that, for the denoised sequence, the maximum of the ATE decreased by 10.68%, the mean by 17.23%, the median by 19.55%, the minimum by 6.35%, and the RMSE by 16.75%. To analyze whether the denoised sequence performs better than the clean sequence, both are evaluated 20 times: the denoised sequence performs better in 15 runs and the original clean sequence in 5.

Table 5. Estimation statistics of the clean sequence.

Trajectory | Max Err (m) | Mean Err (m) | Median Err (m) | Min Err (m) | RMSE Err (m) | Std Err (m)
The probability P of the original sequence performing better in k of n tests, when it performs better with probability p in each test, is given by the binomial distribution:

    P(k) = C(n, k) · p^k · (1 − p)^(n − k).

It can then be checked whether the original sequence performs better than the denoised sequence. Null hypothesis (H_0): the original sequence performs better than the denoised sequence, with probability p = 0.5.
Under H_0, the probability of the original sequence performing better in at most 5 of the 20 runs is 2.07%. The null hypothesis is rejected at the 0.05 level: this result (2.07% < 5%) is significant enough to reject H_0, so the denoised sequence can be said to perform better than the original sequence.
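The quoted 2.07% is the binomial tail probability of seeing at most 5 "original better" outcomes in 20 runs when p = 0.5. A minimal sketch of that computation (the same tail applies, with different k and n, to the later tests):

```python
from math import comb

def binom_tail_at_most(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Original clean sequence better in 5 of 20 runs under H0: p = 0.5.
p_value = binom_tail_at_most(5, 20)
print(f"{100 * p_value:.2f}%")  # 2.07%
```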

KITTI Sequences
The KITTI dataset [18,19] combines images, LiDAR scans, GPS measurements, and inertial measurement unit accelerations recorded from a mobile platform while driving around Karlsruhe, Germany. Table 6 reports the results of the denoised sequences and the original sequences; note that only the RMSE after translation and scale alignment with the ground-truth trajectory is reported. A typical visual comparison between the original and denoised sequence (KITTI 00) is shown in Figure 10a, and the corresponding per-axis comparison is shown in Figure 10b.
The 10 sequences of the KITTI dataset are each evaluated 10 times, and the results are weighted by the number of frames. In the 100 evaluations, the denoised sequences perform better on 64 occasions and the original sequences on 36.
Similarly, it can be checked whether the original sequences perform better than the denoised sequences. Null hypothesis (H_0): the original sequences perform better than the denoised sequences, with probability p = 0.5.
Under H_0, the probability of the original sequences performing better in at most 36 of the 100 evaluations is 0.33%. The null hypothesis is rejected at the 0.05 level: this result (0.33% < 5%) is significant enough to reject H_0, so the denoised sequences can be said to perform better than the original sequences.

Feature Stability
Feature matching underlies many computer vision problems. In feature-based SLAM, higher matching accuracy yields better tracking. To find out why the matching rate of the denoised sequences is higher, the descriptors of matched image pairs are analyzed. The features in two images are matched by the distance between their descriptors: SIFT and SURF use the Euclidean distance, ORB the Hamming distance. The closer the distance, the higher the similarity of the features.
Within the matched pairs, if the two feature descriptors of a match are completely identical, they are considered the same feature with better stability. These features are called "stable features". The greater the number of "stable features", the better the stability of the feature matching.
In this paper, the clean sequence and the ORB descriptor are used for testing, and statistics are collected over all 1514 adjacent image pairs. Among the 1514 matched pairs, the average number of "stable features" per pair is 354.02 in the denoised sequence and 346.94 in the original clean sequence. After denoising, 1156 pairs have more "stable features", 302 pairs have fewer, and the remaining 56 pairs have the same number.
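Counting "stable features" amounts to counting matched pairs whose descriptors are bit-for-bit identical, i.e. whose Hamming distance is zero. A minimal sketch with toy descriptors (the helper name is ours):

```python
def count_stable(matched_pairs):
    """Number of matches whose ORB descriptors are identical (Hamming distance 0).

    matched_pairs: iterable of (descriptor_a, descriptor_b) byte strings.
    """
    return sum(1 for a, b in matched_pairs if a == b)

pairs = [(b"\xaa\xf0", b"\xaa\xf0"),   # identical -> stable
         (b"\xaa\xf0", b"\xaa\xf1"),   # one bit differs -> not stable
         (b"\x00\x00", b"\x00\x00")]   # identical -> stable
print(count_stable(pairs))  # 2
```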
The difference between the "stable features" of the denoised pairs and the original clean pairs is also counted, the results are shown in Figure 11.
It can be seen from Figure 11 that the denoised pairs usually have 4-20 more "stable features" than the original clean pairs. Similarly, it can be checked whether the original pairs perform better than the denoised pairs. Null hypothesis (H_0): the original clean pairs perform better than the denoised pairs, with probability p = 0.5. Under H_0, the probability of the observed outcome is 4.55 × 10⁻¹¹⁸; the null hypothesis is therefore rejected at the 0.05 level, and the features in the denoised pairs can be said to be more stable.

When images are to be co-registered, the noise in the images often differs in type and level. To study whether the denoised sequence has more stable features when the noise intensity differs, the original image and the denoised image in the clean sequence are used as references, and images with different noise levels are matched against each reference image. The number of stable features is shown in Figure 12: the denoised images have more stable features than the originals. To some extent, it can be said that denoising is helpful when co-registering images at different noise levels.

Conclusions
(1) This paper presents an open-source synthetic dataset to evaluate the noise of the visual SLAM methods by extending the WHU-RSVI dataset. The dataset contains 33 noisy sequences with various noise levels and the method to add image noise with customized noise types and levels is open-sourced.
(2) This paper uses FFDNet to denoise images in the dataset, and then SIFT, SURF, and ORB are employed for feature extraction and matching respectively. ORB has the highest matching rate at all noise levels, followed by SURF, and finally SIFT. At all noise levels, the denoised images always have a higher matching rate than the original image.
(3) This paper evaluates the performance of the noisy sequences and the denoised sequences under the ORB-SLAM2. The results show that the denoised sequences have a smaller ATE than the original image sequences. For the irradiance-independent Gaussian noise, the ATE of the denoised sequences is basically unaffected, while for the irradiance-dependent noise and mixed noise, the ATE of the denoised sequences decreases slightly. The denoised sequences are more robust than the original sequences.
(4) The clean sequence in the dataset is denoised by FFDNet and then evaluated with ORB-SLAM2. Compared with the original sequence, the RMSE of the ATE of the denoised clean sequence decreased by 16.75%, indicating that when the features are sufficient, denoising even a clean sequence can improve the performance of current monocular feature-based visual SLAM methods.
(5) The commonly used KITTI dataset sequences are evaluated, and compared with the original sequences, 7 out of 10 sequences have lower RMSE than the original sequences.
For an image without any noise, denoising will lose some information. However, it is found that after image denoising, the tracking accuracy of ORB-SLAM2 improves regardless of whether there is noise in the image. According to the analysis in Section 3.3, this is because the information affecting the feature descriptors is removed, so the features are usually more stable after denoising.
The work of this paper is loosely coupled with existing SLAM methods and can be easily applied to them. Moreover, other denoising methods may also be useful (we used DnCNN to denoise the clean sequence and evaluated it with ORB-SLAM2; 16 of the 20 evaluation results were better than the original sequence (p-value: 0.59% < 5%), and the RMSE is 0.00660 m, an average drop of 32.83%). The image denoising in this paper belongs to image pre-processing in the SLAM system, with an effect similar to image enhancement; therefore, an image denoising module can be added between the image capture module and the feature tracking module when engineers develop relevant projects. FFDNet generally needs extra time to denoise images, but in offline feature-based SLAM, denoising the image sequences with FFDNet can improve the accuracy of the system. In structure from motion, accuracy is usually more important than running time, so researchers can also use FFDNet to pre-process the images first.
In future work, we will convert FFDNet's PyTorch model to TorchScript and run it in C++, so that FFDNet can be integrated with ORB-SLAM2 and the denoised images can be fed directly into ORB-SLAM2 for feature extraction and matching. Visual SLAM is computationally intensive, and it is difficult to achieve real-time performance in resource-limited environments such as micro aerial vehicles and mobile robots. We therefore plan to build a cloud computing service platform: images captured by the camera will be transferred to the platform immediately, and image denoising, feature matching, bundle adjustment, loop detection, and other computationally intensive tasks can be performed in the cloud. Besides, we will study the performance of the method under fast motion with motion blur. The noise dataset in this paper is monocular; binocular and RGB-D sequences are considered for future extensions.

Author Contributions: L.C. and X.X. conceived the idea and designed the methods and experiments; L.C. performed the experiments, analyzed the data, and wrote the paper; J.L. and X.X. revised the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Key R&D Program of China (2018YFB2100903).

Acknowledgments:
The authors would like to thank Jaime Vives Piqueres who provided us with the open-source office scene.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

SLAM    Simultaneous Localization and Mapping
SIFT    Scale Invariant Feature Transform
SURF    Speeded Up Robust Features
ORB     Oriented FAST and Rotated BRIEF
FFDNet  Fast and Flexible Denoising convolutional neural Network
DnCNN   Denoising Convolutional Neural Network
CRF     Camera Response Function
PSNR    Peak Signal to Noise Ratio
ATE     Absolute Trajectory Error
RMSE    Root-Mean-Square Error
ICP     Iterative Closest Point