Fast Sparse Coding for Range Data Denoising with Sparse Ridges Constraint

Light detection and ranging (LiDAR) sensors have been widely deployed on intelligent systems such as unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) to perform localization, obstacle detection, and navigation tasks. Thus, research into range data processing with competitive performance in terms of both accuracy and efficiency has attracted increasing attention. Sparse coding has revolutionized signal processing and led to state-of-the-art performance in a variety of applications. However, dictionary learning, which plays the central role in sparse coding techniques, is computationally demanding, resulting in its limited applicability in real-time systems. In this study, we propose sparse coding algorithms with a fixed pre-learned ridge dictionary to realize range data denoising by leveraging the regularity of laser range measurements in man-made environments. Experiments on both synthesized data and real data demonstrate that our method obtains accuracy comparable to that of sophisticated sparse coding methods, but with much higher computational efficiency.


Introduction
In addition to their increasing proliferation and the growing importance of their roles in numerous applications (including 3D reconstruction and landscape surveying), light detection and ranging (LiDAR) sensors have recently been widely deployed on intelligent systems such as unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) to perform localization, obstacle detection, and navigation tasks. Consequently, research into range data processing with competitive performance in terms of both accuracy and efficiency has attracted increasing attention. Despite the simplicity and directness of obtaining 3D information, such range data often suffer from noise due to the reflectance properties of object surfaces, or due to electrical and mechanical disturbances. Therefore, efficient denoising of range data remains a critical problem.
Some range data denoising methods directly originate from their counterparts for images by regarding the depth value of range data as intensity, and readers can refer to [1] for detailed discussions. However, such methods typically cause vertex-drifting artifacts, which reduce the geometric regularity [2,3]. Techniques designed specifically for range data (e.g., 3D point clouds) can be roughly classified into point-based [4,5] and mesh-based [6,7] methods. However, estimating normal vectors from noisy data for mesh denoising is a chicken-and-egg problem and remains an open topic [8,9]. In [10,11], range data and color images were fused for denoising, and promising results were reported. However, the prerequisites of these methods only hold when expensive sensors are applied or extra effort is devoted to registering the range data and images.
Sparse coding (SC) has revolutionized signal processing and led to state-of-the-art performance in a variety of applications, including image or video denoising, inpainting [12,13], restoration [14], and synthesis [15], face recognition [16], and anomaly detection [17]. These successes are mainly due to the fact that signals (images or image patches) have naturally sparse representations with respect to appropriate bases [18]. In recent years, such SC techniques have been applied to range data [19,20] and have obtained competitive results. However, dictionary learning, which plays the central role in SC techniques, is computationally demanding, resulting in its limited applicability in real-time systems [21,22].
Based on the work of [20], in which reflectance information was synergized with depth data to facilitate the enforcement of an adaptive sparsity constraint (therein, the depth and reflectance information, which are complementary to each other, work together to estimate the informative level of each patch, followed by adaptive dictionary learning which assigns more atoms to represent the patches with more information), we propose SC algorithms without dictionary learning to realize range data denoising by leveraging the regularity of laser range measurements in man-made environments. Specifically, a second-order differential transformation is applied to extract ridges, which are then reconstructed and enhanced using a pre-learned ridge dictionary with a sparsity constraint. Instead of refining the coefficients and dictionary alternately and iteratively [12,14,19,20,22], our method achieves a one-step closed-form solution, resulting in much improved efficiency. Experiments on both synthesized data and real data were conducted to verify the effectiveness of our method, together with performance comparisons against other state-of-the-art methods.

Notation and Preliminaries
Matrices and vectors are shown in bold capital and bold lower-case fonts, respectively. Sets are denoted with calligraphic fonts (e.g., S), and their cardinality is denoted as |S|. For a vector x ∈ R^m, x[i] is the i-th entry, and x_S is the subvector of x corresponding to the entries with indices in S. For a matrix X ∈ R^(m×n) and an index set S, X_S is the submatrix containing only the rows of X with indices in S. vec(X) denotes the column-wise vectorization of matrix X. Letters d and r in superscript refer to the components related to the depth and reflectance information, respectively. Letters v and h in subscript refer to the components related to the vertical and horizontal directions, respectively.
Ridge detection for range measurements. We first take a scan from a 2D laser scanner as an example to introduce our ridge detection operator, and then extend the approach to 3D depth profiles. In Figure 1, we plot the depth profile z ∈ R^n of a typical indoor environment, obtained using a standard planar range finder measured at (known) discrete angles. Clearly, the profile is sufficiently regular and contains only a few corners. Considering three consecutive points at coordinates (x_{i−1}, z_{i−1}), (x_i, z_i), and (x_{i+1}, z_{i+1}), there is a corner at i if the slopes of the two segments meeting at (x_i, z_i) differ, i.e., (z_i − z_{i−1})/(x_i − x_{i−1}) ≠ (z_{i+1} − z_i)/(x_{i+1} − x_i). In the following, we assume that x_i − x_{i−1} = 1 for all i (this assumption comes without loss of generality, since we can define it at an arbitrary resolution). Therefore, corner detection for the 2D profile amounts to finding those indices i such that z_{i−1} − 2z_i + z_{i+1} ≠ 0. To make the notation more compact, we introduce the second-order difference operator D_2nd ∈ R^((n−2)×n), whose i-th row contains the stencil [1, −2, 1] at columns i, i+1, i+2 and zeros elsewhere (Equation (1)). Thus, the simple operation D_2nd · z detects corners. For a 3D depth profile Z ∈ R^(r×c), we estimate the ridges of the vertical and horizontal directions, R_v and R_h, respectively, as R_v = D_v2nd Z and R_h = Z D_h2nd^T (Equation (2)), where the matrices D_v2nd and D_h2nd are the same as D_2nd in Equation (1), but with suitable dimensions.
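As a concrete illustration, the corner detector above can be sketched in a few lines of NumPy (the helper name `second_order_diff_matrix` and the toy depth profile are ours, not from the paper):

```python
import numpy as np

def second_order_diff_matrix(n):
    """Build the (n-2) x n operator D_2nd whose rows hold the stencil
    [1, -2, 1]; row i of (D @ z) equals z[i] - 2*z[i+1] + z[i+2]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

# Piecewise-linear depth profile with a single corner at index 4.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0])
D = second_order_diff_matrix(len(z))
curvature = D @ z                       # zero on straight segments
corners = np.flatnonzero(np.abs(curvature) > 1e-9) + 1  # row i tests point i+1
assert list(corners) == [4]
```

On straight segments the second difference vanishes exactly, so a simple threshold on |D_2nd · z| isolates the corner.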
To combine the two equations of (2), we reformulate them as vec(R) = D_vh · vec(Z), with D_vh = [I_c ⊗ D_v2nd; D_h2nd ⊗ I_r] (Equation (3)). Here, ⊗ is the Kronecker product, and D_vh ∈ R^(2(r·c−r−c)×(r·c)).
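Under the column-wise vectorization convention of Section 2, the stacked operator can be assembled with Kronecker products. The sketch below (our own construction, with illustrative dimensions) checks the row count 2(r·c − r − c) and the equivalence with applying the directional operators to Z directly:

```python
import numpy as np

def d2nd(n):
    """(n-2) x n second-order difference matrix with stencil [1, -2, 1]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

r, c = 5, 6
Dv, Dh = d2nd(r), d2nd(c)

# With column-wise vec(.), vec(Dv @ Z) = (I_c kron Dv) vec(Z) and
# vec(Z @ Dh.T) = (Dh kron I_r) vec(Z), so stacking both gives D_vh.
D_vh = np.vstack([np.kron(np.eye(c), Dv),   # vertical ridges of each column
                  np.kron(Dh, np.eye(r))])  # horizontal ridges of each row
assert D_vh.shape == (2 * (r * c - r - c), r * c)

Z = np.random.randn(r, c)
Rv = Dv @ Z                    # vertical second differences
Rh = Z @ Dh.T                  # horizontal second differences
ridges = D_vh @ Z.flatten(order="F")       # column-wise vec(Z)
assert np.allclose(ridges, np.concatenate([Rv.flatten(order="F"),
                                           Rh.flatten(order="F")]))
```

The dimension count matches the text: (r−2)·c vertical rows plus r·(c−2) horizontal rows equals 2(r·c − r − c).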

Adaptive SC and Its ℓ0, ℓ1 Solutions
For each patch r_i (in vectorized form) in the depth ridge map, we formulate the following adaptive SC model (Equation (4)):

min_{x_ji} ||r_ji − D_L x_ji||_2^2  s.t.  ||x_ji||_0 ≤ k_ji,  j ∈ {v, h},

where r_vi and r_hi represent the patches from the vertical and horizontal ridge maps, respectively. D_L is the fixed pre-learned dictionary shown in Figure 2, which is obtained by applying the basic SC algorithm [19] to 40 pre-selected ridge maps of man-made scenarios (here, we display our pre-learned dictionary with the optimal parameter setting obtained in Section 3.1.1, namely 1024 dictionary atoms, each of size 8 × 8). In [23], different dictionaries (including off-the-shelf ones such as the discrete cosine transform (DCT) basis and wavelet bases, and dictionaries obtained via learning, either pre-learned or learned from the data itself) were tested for image denoising. Although the dictionary learned from the data itself gave the best denoising accuracy, it was also the most computationally demanding. In addition to our pre-learned dictionary, Figure 2 also displays the DCT dictionary and the Gabor wavelets dictionary. The parameter tuning of such dictionaries and their denoising performance are discussed in Section 3.1.1. x_vi and x_hi are the corresponding sparse coefficient vectors, and k_vi and k_hi are the given sparsity controlling parameters. Instead of setting these parameters as constants for all patches, we set them adaptively according to the informative level of each patch (see Section 3.1.1 for details), in a manner similar to that of [20]. Given k_vi and k_hi, the popular orthogonal-matching-pursuit (OMP) algorithm is applied to solve this ℓ0 optimization problem; moreover, the batch-OMP (BOMP) technique is exploited to further improve efficiency. As the problem in (4) is non-convex and NP-hard, we also formulate its convex relaxation (Equation (5)):

min_{x_ji} ||r_ji − D_L x_ji||_2^2 + λ_ji ||x_ji||_1,  j ∈ {v, h}.

Here, λ_vi and λ_hi, which balance the representation fidelity term and the sparsity penalty term, are again estimated adaptively (see Section 3.1.1 for details).
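For readers unfamiliar with OMP, a minimal (non-batch) version is sketched below in NumPy; the function name `omp` and the orthonormal toy dictionary are ours, and a production implementation would use the batch-OMP variant mentioned above:

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: select at most k atoms of D
    to approximate y, re-solving least squares after each selection."""
    n = D.shape[1]
    residual = y.copy()
    support = []
    x = np.zeros(n)
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

# Toy check with an orthonormal dictionary, where exact recovery is guaranteed.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
x_true = np.zeros(64)
x_true[[10, 40]] = [2.0, -1.5]
y = Q @ x_true
x_hat = omp(Q, y, k=2)
assert np.allclose(x_hat, x_true)
```

With a fixed dictionary, this greedy loop is all that is needed per patch, which is why removing the dictionary-learning stage yields such a large speedup.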
By stacking all the patches, we obtain Equation (6):

min_{X_j} Σ_i ( ||r_ji − D_L x_ji||_2^2 + λ_ji ||x_ji||_1 ),  j ∈ {v, h},

in which the neighboring patches are extracted with half-overlapping. To solve this ℓ1 optimization problem, the least-absolute-shrinkage-and-selection-operator (LASSO) algorithm is applied, which provably finds the global optimizer. Clearly, Equations (4)-(6) differ from previous methods [12,14,19,20,22] in one key aspect: D_L is fixed. Thus, X_v and X_h can be obtained with a one-step closed-form solution, instead of refining the coefficients and dictionary alternately and iteratively, resulting in much improved efficiency. With X_v and X_h in hand, we reconstruct each patch r̂_vi and r̂_hi. When all patches have been processed, the average value is used for locations covered by multiple patches, and we obtain the refined ridge maps R̂_v and R̂_h. Next, we can recover Ẑ by performing the inverse operation of Equation (3). However, because D_vh is rank-deficient (its null space contains, e.g., constant and linearly varying profiles), the system is under-determined. Therefore, we need to incorporate boundary conditions to recover Ẑ from its refined second-order difference, as in Equation (7):

[D_vh; A] · vec(Ẑ) = [vec(R̂_v); vec(R̂_h); z_B],

where A is a selection matrix that incorporates the available N boundary points z_B; six boundary points are used in our work. A more rigorous theoretical proof of the minimum number of necessary boundary points is out of the scope of this work. Performing the inverse operation on (7), we recover Ẑ as in Equation (8):

vec(Ẑ) = (M^T M)^{−1} M^T b,  with M = [D_vh; A] and b the right-hand side of (7).

Here, the matrices A and D_vh, their transposes, and the inverse can all be pre-computed to significantly reduce the overall processing time. We summarize all the previous operations as Algorithm 1. In fact, our algorithm includes two versions of implementation: Our-BOMP and Our-LASSO.
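The role of the boundary conditions in Equations (7) and (8) can be illustrated on a 1D profile (our own toy setup, using just the two endpoints rather than the six boundary points used in the paper):

```python
import numpy as np

def d2nd(n):
    """(n-2) x n second-order difference operator."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

# D alone is rank-deficient (constants and linear ramps lie in its null
# space), so a few known boundary samples are needed to pin the solution.
n = 50
z = np.sin(np.linspace(0.0, 3.0, n))       # ground-truth profile
D = d2nd(n)
ridge = D @ z                              # (refined) second-order difference

A = np.zeros((2, n)); A[0, 0] = 1.0; A[1, -1] = 1.0   # boundary selector
M = np.vstack([D, A])
b = np.concatenate([ridge, [z[0], z[-1]]])

# One-step closed-form solution; (M^T M)^{-1} M^T depends only on the grid
# and boundary layout, so it can be pre-computed, as noted in the text.
M_pinv = np.linalg.solve(M.T @ M, M.T)
z_hat = M_pinv @ b
assert np.allclose(z_hat, z, atol=1e-6)
```

Because `M_pinv` is data-independent, the per-frame cost reduces to one matrix-vector product, which is the source of the efficiency gain claimed above.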
In Algorithm 2, the convergence criterion is simply reaching the sparsity limit ||x||_0 ≤ k. In Algorithm 3, the convergence criterion is that the relative update difference of x_k is less than 10^−3.

Experiments and Analysis
We performed experiments on both synthesized data and real data recorded with a laser scanner. In the synthesized data experiments, where the ground truth was available, the settings of the important parameters were investigated first, followed by the denoising evaluation. In the real data experiments, the results are also shown for visual assessment, in addition to the performance comparison. Similar to [14,20], the peak signal-to-noise ratio (PSNR) is employed to evaluate performance. We clarify that all our experiments were run on an ASUS Pro notebook with a 2.4 GHz Intel quad-core i7-4500U processor and 12 GB RAM, executing MATLAB code in parallel.

Experiments on Synthesized Data
Similar to [20,26], we designed a scene of a square box having sides of one meter in a room, as shown in Figure 3, where all information is known as the ground truth.

Important Parameters
To achieve plausible performance in terms of both accuracy and efficiency, the following parameters play central roles: the sparsity controlling parameters k_vi, k_hi, λ_vi, and λ_hi; the dictionary dimension d (which means the patch size is √d × √d pixels); and the number of atoms n_D (r = n_D/d is the redundancy of the dictionary).
Applying the same parameter tweaking strategy as in [20], we recommend the sparsity controlling parameters shown in Table 1. For each patch P_j (j = v or h, from the vertical or horizontal ridge map, respectively), the parameters α_vi, β_vi (or α_hi, β_hi) are related to its informative-level measurement (InM), which is estimated as below. Here, GradMag(P^r) is the gradient magnitude map of patch P^r, and σ(·) is the sigmoid function σ(t) = 1/(1 + exp(−t)). For input in (−∞, +∞), the output of σ(·) is in the range (0, 1). In our work, as the value of R_en cannot be negative, the output of σ(R_en) is in the range [0.5, 1), and the output of exp(σ(R_en(P^r)) − 0.5) is in the range [1, e^0.5). With InM(P^d_v) and InM(P^d_h) for all patches in hand, we determine the sparsity controlling parameters by computing α_vi, β_vi, α_hi, and β_hi. In Table 1, the positions of α_ji and β_ji are exchanged between Our-BOMP and Our-LASSO because k_ji and λ_ji enforce the sparsity constraint in different manners, adaptively determining the number of dictionary atoms permitted for patches with different informative levels. For example, to assign more atoms to represent a patch with more information, k_ji should increase to raise the sparsity limit, whereas λ_ji should decrease to penalize the sparsity term to a lower degree.
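The range claims for σ(·) above can be verified with a short sketch (this checks only the sigmoid and re-weighting ranges; the full InM formula is not reproduced here):

```python
import math

def sigmoid(t):
    """Standard logistic sigmoid, sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# For non-negative inputs (as with R_en), sigma stays in [0.5, 1), so the
# exponential re-weighting exp(sigma(.) - 0.5) stays in [1, e^0.5).
assert sigmoid(0.0) == 0.5
assert 0.5 <= sigmoid(4.2) < 1.0
w = math.exp(sigmoid(0.0) - 0.5)
assert 1.0 <= w < math.exp(0.5)
```

The lower bound 1 at R_en = 0 means the re-weighting never suppresses a patch; it only amplifies patches with stronger reflectance evidence.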
In addition to k_ji and λ_ji, we tested different combinations of patch size and redundancy to find the most reasonable settings of d and r. Specifically, √d can be 4, 8, 16, or 24, and r can be 4, 8, 16, or 24; thus, a total of 16 combinations were tested. As shown in Figure 4a, dictionaries with redundancy 16 consistently achieved better denoising results than dictionaries with redundancy 4, 8, and 24 when their patch sizes were the same. Figure 4b shows the computation time. Clearly, the time increased dramatically with increasing patch size and redundancy. Taking both denoising performance and computation time into account, we conclude that the dictionary setting of √d = 8 and r = 16 (namely 1024 atoms) yields decent outputs while allowing fast computation. These parameters are used throughout this paper, except where indicated. In Table 2, we further report the denoising results using different fixed dictionaries: our pre-learned dictionary, the DCT dictionary, and the Gabor wavelets dictionary. Clearly, the results using our pre-learned dictionary were better than those using the off-the-shelf ones.
Table 1. Setting of sparsity controlling parameters. BOMP: batch orthogonal-matching-pursuit; LASSO: least-absolute-shrinkage-and-selection-operator.

Parameters | Our-BOMP | Our-LASSO

Denoising Results
The denoising results of our work on the synthesized data (see Figure 3d, in which dense Gaussian noise, with standard deviation 0.02 m for the range data and 5 gray-scale levels for the reflectance data, was added at 100% of positions) are reported in Table 3, together with a comparison against available competitive methods, including the Trilateral filter [26], Basic SC [19], and Adaptive SC [20]. Moreover, in addition to the Gaussian noise, sparse outliers (5% of the total positions, randomly distributed) were added to the scene, as shown in Figure 3e, and the corresponding denoising results are also reported in Table 3. In addition to the PSNR criterion, the root-mean-square (RMS, in millimeters) error between the recovered range data and the ground truth is also included, since it gives readers a more direct indication of how well the contaminated range information was restored by these methods. The computation time of each method is also reported. As shown in Table 3, our methods outperformed the Trilateral filter and Basic SC, and obtained results comparable to Adaptive SC in terms of denoising accuracy (the best results were obtained by Adaptive SC, with PSNR = 33.83 and RMS = 1.98), while achieving significantly higher efficiency than Adaptive SC. Moreover, Our-BOMP worked even slightly more efficiently than Our-LASSO, achieving a processing speed of approximately 13 fps for range images with resolution 800 × 800.
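For reference, sketches of the two evaluation metrics are given below; the exact peak value used for PSNR in the paper is not specified here, so `peak` is left as a parameter:

```python
import numpy as np

def rms_mm(z_hat, z_gt):
    """Root-mean-square error in millimeters, for depths given in meters."""
    return 1000.0 * float(np.sqrt(np.mean((z_hat - z_gt) ** 2)))

def psnr_db(z_hat, z_gt, peak):
    """Peak signal-to-noise ratio in dB for a given peak signal value."""
    mse = float(np.mean((z_hat - z_gt) ** 2))
    return 10.0 * float(np.log10(peak ** 2 / mse))

z_gt = np.zeros((4, 4))
z_hat = z_gt + 0.002                 # uniform 2 mm deviation
assert abs(rms_mm(z_hat, z_gt) - 2.0) < 1e-9
```

Reporting RMS in millimeters alongside PSNR is what makes the restoration quality directly interpretable as a physical distance error.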

Denoising Data from Laser Scanner
This set of experiments used the Brown range image database [27], captured with a Riegl LMS-Z210 laser scanner whose field of view (FoV) is 80° vertically and 259° horizontally. We focused only on the 15 indoor scenes (e.g., classrooms and theatres). To deal with (and benefit from) such a large horizontal FoV, similar to the multiple-local-projection method in [19], we partitioned each scene into several parts with 80° FoV vertically and horizontally. Note that in the Brown range image database, the depth of each measuring point is saved sequentially, left to right and top to bottom; thus, the image location only indicates the order of saving, and it was necessary to perform local projection to obtain reasonable images. The consecutive parts were half-overlapping along the horizontal direction, and each part was projected onto an image plane oriented toward its center. Finally, the average of all the restored parts was computed to restore the whole scene. In the following, the denoising results on such real data are reported and discussed. Similar to Section 3.1, Gaussian noise was added to the depth and reflectance data, and the results on three representative scenes are shown in Figure 5.
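The half-overlapping partition of the 259° horizontal FoV into 80° parts can be sketched as follows (the helper `partition_fov` is ours; the paper does not specify how the final, partially overlapping part is aligned, so here it is simply placed flush with the end of the FoV):

```python
def partition_fov(total_deg=259.0, part_deg=80.0):
    """Start angles of half-overlapping parts covering the horizontal FoV.
    Consecutive parts advance by half a part width (illustrative helper)."""
    step = part_deg / 2.0
    starts = []
    s = 0.0
    while s + part_deg < total_deg:
        starts.append(s)
        s += step
    starts.append(total_deg - part_deg)  # final part flush with the end
    return starts

parts = partition_fov()
# Every consecutive pair of parts overlaps (start gap <= half a part width).
assert all(b - a <= 40.0 + 1e-9 for a, b in zip(parts, parts[1:]))
assert parts[-1] + 80.0 == 259.0       # full FoV is covered
```

Each part is then projected and denoised independently, and the overlap regions are averaged when the whole scene is reassembled.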
As can be observed in Figure 5, the results of the Trilateral filter were good in planar regions but contained a great deal of noise in edge regions. Our method, Basic SC, and Adaptive SC are all based on SC and obtained consistent results in all regions. Moreover, the results of our method in the third column were slightly worse than those in columns 1 and 2, because the curved structure of the scene in the third column poses a greater challenge to our sparse-ridge assumption. In Table 4, the average denoising results over all 15 indoor scenes of the dataset are reported. Similar to the results on the synthesized data, our method consistently outperformed the Trilateral filter and Basic SC, and obtained results comparable to those of Adaptive SC in terms of denoising accuracy, while achieving significantly higher efficiency than Adaptive SC. Compared with the results in Table 3, the results in Table 4 deteriorated because the structure of the real scenes is more complex than that of the synthetic scene. As in the previous section, in addition to the Gaussian noise, sparse outliers were added to the scenes, and those denoising results are also reported in Table 4.

Method | Only Gaussian Noise | Gaussian Noise and Outliers

Conclusions
In this study, based on our previous work [20], we proposed SC-based algorithms in which a pre-learned ridge dictionary is applied to realize range data denoising by leveraging the regularity of laser range measurements in man-made environments. Experiments on both synthesized data and real data demonstrate that our method obtains accuracy comparable to that of sophisticated SC methods with much higher efficiency, achieving approximately 13 fps (at a resolution of 800 × 800) on a lightweight laptop computer. Therefore, our method can be applied in real-time systems operating in man-made environments. Furthermore, we are working on a GPU-accelerated implementation to further increase efficiency and conserve the limited payload of unmanned systems, especially UAVs.

Conflicts of Interest:
The authors declare no conflict of interest.