1. Introduction
As one of the most fundamental low-level vision problems, image denoising has been widely studied in computer vision, serving as the foundation and precondition for higher-level image processing tasks such as visual saliency detection, image segmentation, and image classification. In general, image denoising aims at recovering the clean image from its noise-corrupted observation while preserving, as much as possible, the vital image features. During the past few decades, image denoising has drawn much research attention, resulting in a variety of efficient methods. Traditional image denoising methods include the median filter, the Gaussian filter, total-variation-based methods, wavelet thresholding methods, etc. However, most of these methods frequently ignore fine image details such as structure, texture, and edge features. To some extent, this neglect of image details makes these methods suffer from many defects, including over-smoothing, artifacts, loss of structure and texture features, and blurred edges.
Motivated by the defects in traditional image denoising methods, significant progress has been made in recent years. As image denoising is typically an ill-posed problem, its solution might not be unique. By learning good image priors from noise-corrupted images or clean natural images, numerous methods have been proposed to obtain a better solution to the image denoising problem. In particular, nonlocal self-similarity (NSS) and sparsity are two popular image priors with great potential that lead to state-of-the-art performance. Based on the fact that local image details may appear multiple times across the entire image, NSS has prompted a series of excellent algorithms for image denoising. The nonlocal means (NLM) [1] algorithm computed a noise-free pixel as the weighted average of pixels with similar neighborhoods in a fixed-size rectangular searching window, and achieved a significant enhancement in denoising performance. Inspired by the success of the NLM method, Dabov et al. [2] proposed a remarkable collaborative image denoising scheme, called block-matching and 3D filtering (BM3D). In this scheme, nonlocal similar patches were grouped into a 3D cube and collaborative filtering was conducted in the sparse 3D transform domain. The BM3D algorithm ranks among the best performing methods, yet its implementation is complex. Furthermore, the BM3D algorithm is based on classical fixed orthogonal dictionaries, thus lacking data-adaptability. Building on the principle of sparse and redundant representations [3], another category of methods has been developed, which can learn data-adaptive dictionaries for denoising. The K-SVD algorithm [4] boosted denoising performance significantly. Mairal et al. [5] proposed the learned simultaneous sparse coding (LSSC) algorithm, which used NSS to improve sparse models with simultaneous sparse coding. In Reference [6], Chatterjee et al. clustered an image into $K$ groups to enhance the sparse representation via locally learned dictionaries, which took advantage of the geometrical structure feature and the NSS prior in the spatial domain. Subsequently, Dong et al. [7] observed that the difference between the representation coefficients of the original and degraded images was sparse, and added a restriction to minimize the $\ell_1$ norm of this difference; this model achieved good results in image denoising. By assuming that the matrix of nonlocal similar patches has a low-rank structure, low-rank minimization based methods [8,9] also achieved very competitive denoising results. Zhang et al. [10] later proposed a patch group (PG) based NSS prior learning scheme to learn explicit NSS models from natural images for high performance denoising, resulting in high peak signal-to-noise ratio (PSNR) measurements.
Though NSS in low-level vision cues has been widely utilized to improve image denoising performance in most existing methods, we argue that such utilizations of NSS are not sufficiently effective. These methods learn the NSS prior by clustering local size-fixed patches extracted from an image, which may neglect the image edge and structural features to some extent. Moreover, in most existing methods, the NSS prior is usually exploited by searching for nonlocal patches similar to a local patch within a size-fixed square searching window, which leads to a heavy computational workload and ignores the similarities between pairs of patches with large spatial distances. Since most existing methods do not make full use of the NSS prior in low-level vision cues, it is necessary to learn the prior in mid-level vision cues for a better exploration of the NSS prior.
With the above considerations, this paper proposes to further learn NSS in mid-level vision cues via superpixel clustering using the sparse subspace clustering method. Since superpixel edges adhere to the image edges and reflect the image structural features, structural and edge information can be considered for a better exploration of the NSS prior by superpixel clustering. Furthermore, this paper proposes an improved image denoising algorithm that takes advantage of multiple priors to achieve a better denoising performance, including the NSS prior in the spatial domain and the sparse transform domain, as well as sparsity, structure, and edge priors. In the proposed algorithm, we first divided the image into multiple superpixels by the simple linear iterative clustering (SLIC) method and grouped superpixels into irregular regions by the sparse subspace clustering method with local features. Regarding these regions as searching windows, we sought the first $L$ most similar patches to each local patch. Next, a data-adaptive dictionary for each region was learned to obtain the initial sparse coefficients of the local patches extracted from the images. Finally, to improve the effectiveness of the sparse coefficient for each patch, a weighted sparse coding model was constructed by adding the weighted average sparse coefficient of the first $L$ most similar patches to the sparse representation model. Once the final sparse coefficients for all patches were acquired, the noise-free image was obtained. The proposed algorithm achieves enhanced image denoising performance thanks to two factors. First, learning the NSS in mid-level vision cues via superpixel clustering promotes a better exploitation of the NSS prior. Moreover, regarding similar superpixel regions as searching windows avoids neglecting similarities between pairs of patches with large spatial distances, which also contributes to the better exploitation of the NSS prior. Second, a weighted sparse coding model was established, which reduced the impact of noise on the sparse representation and improved the accuracy of the sparse coefficient for each patch.
The rest of this paper is organized as follows. In Section 2, the basics of the proposed algorithm are described, including the SLIC algorithm, the sparse subspace clustering algorithm, and the sparse representation algorithm. In Section 3, we introduce the proposed denoising model in detail. Next, experimental results and discussion are presented in Section 4. Finally, we draw our conclusions in Section 5.
2. Basics of Superpixel Clustering and Sparse Representation (SC-SR) Algorithm
This paper implements a novel image denoising algorithm based on the NSS prior in mid-level vision cues and weighted sparse coding. Before analyzing the proposed denoising algorithm, it is essential to introduce three basic algorithms that play vital roles in realizing it.
2.1. Simple Linear Iterative Clustering Algorithm for Segmentation
In this paper, the simple linear iterative clustering (SLIC) algorithm was selected to generate compact and nearly uniform superpixels, since it offers better running speed, superpixel compactness, and contour preservation in comparison with other methods [11]. SLIC generates superpixels by clustering pixels based on their similarity in color and proximity in the image plane [12]. The method applies seamlessly to color as well as grayscale images by replacing color similarity with gray-level similarity. In this paper, experiments were conducted on grayscale images.
SLIC is essentially a local k-means clustering method. Given the total number of image pixels $N$ and the desired number of superpixels $K$, cluster centers are sampled on a regular grid with interval $S = \sqrt{N/K}$. To speed up the generation of superpixels, SLIC assigns each pixel to the nearest cluster center within a local region around the pixel rather than searching the whole image plane. The size of this local region is $2S \times 2S$, taking the pixel as its center.
For grayscale images, gray and spatial information are taken into consideration to describe a pixel as a vector. Instead of using a simple Euclidean norm, the distance measure $D$ is defined as follows:

$$D = \sqrt{d_g^2 + \left(\frac{d_s}{S}\right)^2 m^2},$$

where $d_g$ is the gray distance and $d_s$ is the plane distance normalized by the grid interval $S$. A variable $m$ is introduced to control the compactness of a superpixel. The greater the value of $m$, the more spatial proximity is emphasized and the more compact the cluster; we empirically set $m$ to 10 [12]. The values of $d_g$ and $d_s$ are obtained as follows:

$$d_g = \left| g_i - g_j \right|, \qquad d_s = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2},$$

where $g_i$ and $g_j$ are the gray values of pixels $i$ and $j$; $x_i$ and $x_j$ are the $x$-coordinate values of pixels $i$ and $j$; and $y_i$ and $y_j$ are the $y$-coordinate values of pixels $i$ and $j$.
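As an illustration, the combined distance above can be sketched in a few lines of Python (the function name and argument layout are our own; the compactness weight $m = 10$ follows the text):

```python
import numpy as np

def slic_distance(g_i, g_j, xy_i, xy_j, S, m=10.0):
    """Combined SLIC distance D between a pixel and a cluster center
    for grayscale images.

    g_i, g_j   : gray values of the pixel and the center
    xy_i, xy_j : (x, y) coordinates of the pixel and the center
    S          : grid interval sqrt(N / K)
    m          : compactness weight (the text empirically uses m = 10)
    """
    d_g = abs(g_i - g_j)                                   # gray distance
    d_s = np.hypot(xy_i[0] - xy_j[0], xy_i[1] - xy_j[1])   # plane distance
    return np.sqrt(d_g ** 2 + (d_s / S) ** 2 * m ** 2)
```

A larger `m` inflates the spatial term relative to the gray term, which is exactly why bigger values yield more compact superpixels.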
Given the desired number of superpixels $K$, SLIC begins by sampling $K$ regularly spaced cluster centers, which are denoted by feature vectors $C_k = [g_k, x_k, y_k]^T$, $k = 1, \dots, K$. To avoid placing a cluster center at an edge pixel or a noisy pixel, each cluster center is moved to the lowest gradient position in its $3 \times 3$ neighborhood. Next, each pixel in the image is associated with the nearest cluster center within the local area of the pixel by adopting the k-means clustering method.
The segmentation results of SLIC on some benchmark test images are displayed in Figure 1. As shown, SLIC generates compact and nearly uniform superpixels and achieves a good description of the image edges.
2.2. Sparse Subspace Clustering for Noisy Data
Sparse subspace clustering is an outstanding clustering method based on sparse representation, which is able to handle missing or corrupted data. It is based on the fact that each data point in a union of subspaces can be represented as a linear or affine combination of the other points. Points are determined to lie in the same subspace by searching for the sparsest combination, which leads to a sparse similarity matrix [13]. Based on the similarity matrix, spectral clustering is adopted to obtain the final clustering result.
Given a data matrix $Y = [y_1, y_2, \dots, y_N] \in \mathbb{R}^{D \times N}$, the sparse representation of each point $y_i$ can be obtained by the following optimization problem:

$$\min_{c_i} \left\| c_i \right\|_1 \quad \text{s.t.} \quad y_i = Y_{\hat{i}}\, c_i.$$

Matrix $Y_{\hat{i}}$ is defined as the matrix obtained from $Y$ by removing its $i$-th column, where the subscript $\hat{i}$ means that there is no $y_i$. Point $y_i$ has a sparse representation with respect to the matrix $Y_{\hat{i}}$. For affine subspaces, the optimization problem is equivalent to the following problem:

$$\min_{c_i} \left\| c_i \right\|_1 \quad \text{s.t.} \quad y_i = Y_{\hat{i}}\, c_i, \quad \mathbf{1}^T c_i = 1,$$

where $\mathbf{1}$ is a vector of all ones. Taking noise into consideration, let $\bar{y}_i = y_i + z_i$ be the $i$-th corrupted data point, where $z_i$ is additive noise and $\sigma^2$ is the noise variance. The optimization problem turns into:

$$\min_{c_i} \left\| c_i \right\|_1 + \frac{\lambda}{2} \left\| \bar{y}_i - \bar{Y}_{\hat{i}}\, c_i \right\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^T c_i = 1.$$

The Lasso optimization method [14] is used for recovering the optimal sparse solution from the above equation. With the sparsest representation of each point, a coefficient matrix $C = [c_1, c_2, \dots, c_N]$ (with $c_{ii} = 0$) can be obtained, which describes the connections between the points. Taking the matrix $C$ to establish a directed graph $G = (V, E)$, the vertices of the graph $G$ are the $N$ data points and the adjacency matrix is the coefficient matrix $C$. Moreover, to make $G$ balanced, a new adjacency matrix $\tilde{C} = |C| + |C|^T$ is built. Subsequently, the Laplacian matrix $\mathcal{L}$ is formed by $\mathcal{L} = \Delta - \tilde{C}$, where $\Delta$ is a diagonal matrix that meets the condition $\Delta_{ii} = \sum_j \tilde{C}_{ij}$. Finally, the spectral clustering method is applied to the eigenvectors of the Laplacian matrix to obtain the final clustering result for the whole set of data points.
In this paper, sparse subspace clustering is adopted to effectively cluster superpixels into regions of similar geometric structure, aiming to further learn the NSS prior in mid-level vision cues even in the presence of noise. Better learning of the NSS prior can improve the performance of denoising algorithms based on structure clustering and sparse representation. Instead of clustering size-fixed patches that are inflexibly extracted from images, clustering superpixels takes image edge and structure features into consideration, which leads to a better learning result of the NSS prior.
2.3. Sparse Representation Based Image Denoising
In general, sparse representation works in a patch-based framework, where an image is represented by the sparse coefficients of its overlapping patches. The recovery of the image is then achieved by averaging the sparse representations of all overlapping patches.

Given an image $y$ with noise, the column vector of the $\sqrt{n} \times \sqrt{n}$ patch at location $i$ is denoted as $y_i = R_i\, y$, where $R_i$ is a binary matrix that extracts an overlapping patch and converts it to a column vector. The sparse coefficient $\alpha_i$ of a patch is realized by the optimal solution to the following equation:

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \left\| y_i - D \alpha_i \right\|_2^2 + \lambda \left\| \alpha_i \right\|_1,$$

where $D$ is known as the dictionary, and $\lambda$ is a regularization parameter that keeps a balance between sparsity and reconstruction error. Once $\hat{\alpha}_i$ is found, the denoising result $\hat{x}_i$ of the image patch $y_i$ can be computed by $\hat{x}_i = D \hat{\alpha}_i$. A challenging issue in finding the sparse coefficient $\hat{\alpha}_i$ is the choice of the dictionary. In brief, a dictionary is a matrix that is usually obtained by a specific transformation or learned from a large set of clean or noisy patches. When the dictionary and sparse coefficients are obtained, the denoised image $\hat{x}$ can be reconstructed by aggregating all of the sparse representations of the patches as follows [9]:

$$\hat{x} = \left( \sum_i R_i^T R_i \right)^{-1} \sum_i R_i^T D \hat{\alpha}_i.$$
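The extract/code/aggregate loop can be sketched as follows; `code_fn` is a placeholder for any sparse coder, and dividing the accumulated patch estimates by the per-pixel coverage count implements the aggregation formula above for overlapping patches:

```python
import numpy as np

def denoise_by_patch_averaging(noisy, D, code_fn, p=8):
    """Patch-based sparse-representation framework sketch.

    Every overlapping p x p patch y_i = R_i y is coded over dictionary D
    by code_fn, reconstructed as D alpha_i, and the overlapping estimates
    are averaged: (sum_i R_i^T R_i)^{-1} sum_i R_i^T D alpha_i.
    """
    H, W = noisy.shape
    acc = np.zeros_like(noisy, dtype=float)   # sum of patch estimates
    cnt = np.zeros_like(noisy, dtype=float)   # how many patches cover each pixel
    for r in range(H - p + 1):
        for c in range(W - p + 1):
            y_i = noisy[r:r + p, c:c + p].reshape(-1)   # R_i y as a column
            alpha = code_fn(D, y_i)                     # sparse coefficients
            x_i = (D @ alpha).reshape(p, p)             # D alpha_i
            acc[r:r + p, c:c + p] += x_i
            cnt[r:r + p, c:c + p] += 1.0
    return acc / cnt
```

With a trivial coder (identity dictionary, `alpha = y_i`) the framework reproduces the input exactly, which is a convenient sanity check before plugging in a real sparse coder.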
Nonetheless, employing only the local sparse representation model in the denoising problem may not lead to a sufficiently good solution. Herein, this paper combines good image priors from the noise-corrupted image with sparse representation for a better solution to the image denoising problem. The NSS prior in the spatial domain was learned by superpixel clustering, which also generated irregular regions consisting of similar superpixels. By regarding these regions as searching windows, patches similar to a local patch were found within each region. For the NSS prior in the sparse transform domain, the weighted average of the sparse coefficients of the acquired similar patches constrains the sparse coefficient of each local patch toward an optimal solution. The NSS priors in both the spatial domain and the sparse transform domain were added into the sparse representation model to enhance the image denoising performance of our proposed algorithm. Further details of the proposed algorithm are explained in the next section.
3. Proposed SC-SR Algorithm
In this section, the image denoising model based on superpixel clustering and sparse representation is presented. We first utilized the SLIC algorithm to generate superpixels that include similar pixels. Next, the NSS prior was further exploited by making use of the sparse subspace clustering algorithm with local features to cluster similar superpixels into groups. Within each irregular region of grouped similar superpixels, we extracted overlapping patches for training the dictionary and sought similar patches to each local patch. Finally, weighted sparse coding was adopted for a better sparse representation. The following is a detailed introduction of the proposed algorithm.
3.1. Superpixel Clustering
At the outset, SLIC is used to generate $K$ superpixels, and each superpixel is described as a column vector with several features. Each image pixel within a superpixel can be represented by a seven-dimensional feature vector $f$:

$$f = \left[\, g,\; \left| I_x \right|,\; \left| I_y \right|,\; \left| I_{xx} \right|,\; \left| I_{yy} \right|,\; \omega x,\; \omega y \,\right]^T,$$

where $g$ is the gray value of the pixel; $I_x$, $I_y$, $I_{xx}$, and $I_{yy}$ are the corresponding first- and second-order derivatives of the image intensities along the $x$ and $y$ axes; and $x$, $y$ are the coordinates of the pixel in the image. The parameter $\omega$ aims to balance the image gray, gradient, and spatial features. If we use equal weight ($\omega = 1$) for the spatial features, similarities between pairs of patches with large spatial distances may be lost. Thus, we empirically chose a smaller value of $\omega$ to alleviate this problem [15]. In addition, the image spatial, gray, and gradient values were normalized.
For a given superpixel, we computed the mean vector $\mu$ of all pixels within it as its feature vector:

$$\mu = \frac{1}{n} \sum_{k=1}^{n} f_k,$$

where $n$ is the size of the superpixel and $f_k$ indicates the feature vector of the $k$-th pixel within the superpixel.
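A possible NumPy sketch of the per-pixel features and the superpixel mean; the derivative operator (`np.gradient`) and the value of $\omega$ are assumptions on our part:

```python
import numpy as np

def pixel_features(img, omega=0.5):
    """Seven-dimensional per-pixel feature maps for superpixel description:
    gray value, absolute first/second derivatives in x and y, and
    omega-weighted coordinates (omega < 1 down-weights the spatial part;
    the exact value is illustrative). Returns an (H, W, 7) array.
    """
    gy, gx = np.gradient(img)        # first-order derivatives (axis 0, axis 1)
    gyy, _ = np.gradient(gy)
    _, gxx = np.gradient(gx)         # second-order derivatives
    H, W = img.shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    return np.dstack([img, np.abs(gx), np.abs(gy),
                      np.abs(gxx), np.abs(gyy),
                      omega * xx, omega * yy])

def superpixel_mean(features, mask):
    """Mean feature vector mu over the pixels of one superpixel
    (mask is a boolean map of the superpixel's pixels)."""
    return features[mask].mean(axis=0)
```

In practice the feature maps would be normalized before clustering, as the text notes.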
We considered these $K$ superpixels as a collection of data points drawn from a union of independent affine subspaces. The sparse subspace clustering method was adopted to cluster the collection of data points into $k$ groups. After transforming every superpixel into a column vector, the collection of data points $Y = [\mu_1, \mu_2, \dots, \mu_K]$ for clustering was obtained. For each superpixel, its covariance matrix $C$ was calculated by:

$$C = \frac{1}{n-1} \sum_{k=1}^{n} \left( f_k - \mu \right) \left( f_k - \mu \right)^T,$$

where $f_k^{(i)}$ indicates the $i$-th feature of the $k$-th pixel of the current superpixel and $\mu^{(i)}$ indicates the $i$-th element of the feature vector of the superpixel.

For two superpixels, their similarity can be computed based on the corresponding covariance matrices $C_1$ and $C_2$:

$$\rho\left( C_1, C_2 \right) = \exp\left( -\frac{\sum_i \ln^2 \lambda_i\left( C_1, C_2 \right)}{2 \delta^2} \right),$$

where $\rho$ is the similarity of the two superpixels; $\lambda_i(C_1, C_2)$ are the generalized eigenvalues from $\lambda_i\, C_1 x_i = C_2 x_i$; and $\delta$ is a small constant (we experimentally set $\delta$ to 0.5 in our experiments) [15].
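Under this reading of Eq. (13), the similarity can be computed from the generalized eigenvalues with SciPy; the exponential form and the constant $\delta$ follow our reconstruction of the formula, so treat the sketch as illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def covariance_similarity(C1, C2, delta=0.5):
    """Similarity between two superpixels from their feature covariance
    matrices. A Forstner-style distance is built from the generalized
    eigenvalues of the pair (C1, C2) and mapped through an exponential.
    """
    lam = eigh(C1, C2, eigvals_only=True)   # generalized eigenvalues
    d2 = np.sum(np.log(lam) ** 2)           # sum_i ln^2 lambda_i
    return np.exp(-d2 / (2.0 * delta ** 2))
```

Identical covariances give all generalized eigenvalues equal to one, hence a similarity of exactly 1; the similarity decays as the two covariance structures diverge. Note that `ln^2` is invariant under swapping the two matrices, since the eigenvalues of the swapped pair are reciprocals.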
To better characterize the relationship between the superpixels and alleviate the sensitivity of sparse coding to noise, a Laplacian regularization term [15] based on the similarity matrix $W$ (with $W_{ij} = \rho(C_i, C_j)$) was introduced into Equation (6) to ensure the similarity of sparse coefficients among similar superpixels. Equation (6) then becomes the following problem:

$$\min_{C} \left\| C \right\|_1 + \frac{\lambda}{2} \left\| Y - YC \right\|_F^2 + \beta\, \mathrm{tr}\!\left( C L C^T \right) \quad \text{s.t.} \quad \mathrm{diag}(C) = 0, \quad \mathbf{1}^T C = \mathbf{1}^T,$$

where $L$ is the Laplacian matrix defined as $L = \Delta - W$, and $\Delta$ is the diagonal matrix of row sums, $\Delta_{ii} = \sum_j W_{ij}$. The parameter $\beta$ is the weighting parameter that balances the effect of the Laplacian regularization term ($\beta$ was empirically set to 0.2 in our experiments) [15]. By solving Equation (15), a sparse coefficient matrix $C$ was obtained, which was not symmetric. Therefore, we updated it by $C \leftarrow \left( |C| + |C|^T \right) / 2$ to make it symmetric. A graph $G = (V, E)$ was built by utilizing the sparse coefficient matrix $C$. The vertices of the graph $G$ were the data set $Y$, and the new adjacency matrix was the symmetrized $C$. The spectral clustering algorithm was used to segment the graph $G$ to obtain the result of clustering the superpixels.
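The bookkeeping steps, building the unnormalized Laplacian from a similarity matrix and symmetrizing the coefficient matrix, can be sketched as (function names are ours):

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = Delta - W, with Delta the
    diagonal matrix of row sums of the similarity matrix W."""
    Delta = np.diag(W.sum(axis=1))
    return Delta - W

def symmetrize(C):
    """Make the sparse coefficient matrix symmetric, as done after
    solving Eq. (15): C <- (|C| + |C|^T) / 2."""
    A = np.abs(C)
    return (A + A.T) / 2.0
```

A quick property check: every row of a graph Laplacian sums to zero, and the symmetrized matrix is a valid undirected adjacency for spectral clustering.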
Superpixel clustering learned the NSS prior, which contributed to the preservation of the structural information of the denoised image. When the dictionary was learned, the NSS prior obtained by superpixel clustering added structural and edge features into the atoms of the dictionary. The richer the features of the dictionary, the stronger its ability to reconstruct the original image.
3.2. Learning Sub-Dictionaries for Each Cluster of Superpixels
Within each region of similar superpixels, we extracted overlapping patches centered on each pixel, where the overlapping patches were all of size $\sqrt{n} \times \sqrt{n}$ ($\sqrt{n}$ is odd). As for the pixels on the boundary of the image, we extended the image by $(\sqrt{n}-1)/2$ pixels in the horizontal and vertical directions by mirroring pixels, so as to obtain the patches centered on them. As shown in Figure 2, the matrix was extended by two elements in the horizontal and vertical directions by mirroring extension, and the extended matrix was obtained. After every patch was transformed into a column vector, $k$ sub-datasets $\{Y_1, Y_2, \dots, Y_k\}$ were formed for training the sub-dictionaries.
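The mirroring extension and per-pixel patch extraction might look like this in NumPy; `np.pad` with `mode='reflect'` reproduces the mirroring described above:

```python
import numpy as np

def extract_patches_with_mirror(img, p):
    """Extract one p x p patch centered on every pixel (p odd),
    mirror-extending the image by (p - 1) / 2 pixels on each side so
    that boundary pixels also get full patches; each patch becomes a
    column of the returned (p*p, H*W) matrix.
    """
    assert p % 2 == 1, "patch size must be odd"
    h = (p - 1) // 2
    ext = np.pad(img, h, mode='reflect')        # mirroring extension
    H, W = img.shape
    cols = np.empty((p * p, H * W))
    k = 0
    for r in range(H):
        for c in range(W):
            cols[:, k] = ext[r:r + p, c:c + p].reshape(-1)
            k += 1
    return cols
```

Each column then joins the sub-dataset of whichever superpixel cluster its center pixel belongs to.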
Since the number of patches in each cluster was limited and the patches in $Y_k$ had similar patterns, it was not necessary to learn an over-complete dictionary for each cluster [16]. Therefore, we used the principal component analysis (PCA) method to learn a compact sub-dictionary for each cluster.

For each sub-dataset $Y_k$, the proposed algorithm applied PCA to compute the principal components, with the purpose of constructing the sub-dictionary $D_k$. $D_k$ can be constructed by the optimal solution of the following formulation:

$$\min_{D_k, \Lambda_k} \left\| Y_k - D_k \Lambda_k \right\|_F^2 + \lambda \left\| \Lambda_k \right\|_1,$$

where $\Lambda_k$ is the sparse coefficient matrix of $Y_k$ over $D_k$. The rank of $Y_k$ is denoted by $r$, and the covariance matrix of $Y_k$ is denoted by $\Omega_k$. We obtained the result of the matrix factorization $\Omega_k = P \Sigma P^T$, where $P$ is the orthogonal transformation matrix and $\Sigma$ is a diagonal matrix taking the eigenvalues of $\Omega_k$ as its diagonal elements. If we regard $P$ as the dictionary and set $\Lambda_k = P^T Y_k$, we obtain $\left\| Y_k - P \Lambda_k \right\|_F^2 = 0$. Therefore, Equation (16) is only determined by the sparsity regularization term $\left\| \Lambda_k \right\|_1$, which is constant in this case. To obtain better sub-dictionaries for sparse representation, we extracted the first $r'$ most important eigenvectors in $P$ to form a dictionary $P_{r'} = [p_1, p_2, \dots, p_{r'}]$, $1 \le r' \le r$, instead of regarding $P$ as the dictionary. The sparse coefficient matrix is denoted by $\Lambda_{r'} = P_{r'}^T Y_k$. It is known that $\left\| \Lambda_{r'} \right\|_1$ increases when $r'$ increases, while $\left\| Y_k - P_{r'} \Lambda_{r'} \right\|_F^2$ decreases. The optimal solution can be obtained by solving the following problem:

$$\hat{r} = \arg\min_{r'} \left\| Y_k - P_{r'} \Lambda_{r'} \right\|_F^2 + \lambda \left\| \Lambda_{r'} \right\|_1,$$

and the sub-dictionary is set to $D_k = P_{\hat{r}}$. Consequently, we obtained sub-dictionaries for all sub-datasets. With regard to Equation (17), it has been verified that some noise can be successfully removed while computing the local PCA transform of each image patch. In this paper, PCA was applied to each sub-dataset to construct the dictionary, which reduced not only the computational cost of dictionary training but also the noise introduced into the dictionary.
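A compact sketch of the PCA sub-dictionary selection, scanning $r'$ and keeping the value that minimizes the Eq. (17)-style objective; centering the patches first and the value of $\lambda$ are our own choices:

```python
import numpy as np

def pca_subdictionary(Yk, lam=1.0):
    """Learn a compact PCA sub-dictionary for one cluster of patches.

    The eigenvectors of the patch covariance are sorted by eigenvalue,
    and the first r' of them are kept by minimizing the trade-off
    ||Yk - P_r' Lam_r'||_F^2 + lam * ||Lam_r'||_1 over r'.
    Returns the dictionary and the patch mean that was subtracted.
    """
    mean = Yk.mean(axis=1, keepdims=True)
    Yc = Yk - mean
    cov = Yc @ Yc.T / max(Yk.shape[1] - 1, 1)
    evals, evecs = np.linalg.eigh(cov)
    P = evecs[:, np.argsort(evals)[::-1]]      # descending eigenvalue order
    best_r, best_obj = 1, np.inf
    for r in range(1, P.shape[1] + 1):
        Pr = P[:, :r]
        Lam = Pr.T @ Yc                        # coefficients over P_r
        obj = np.sum((Yc - Pr @ Lam) ** 2) + lam * np.sum(np.abs(Lam))
        if obj < best_obj:
            best_obj, best_r = obj, r
    return P[:, :best_r], mean
```

Because the patches in one cluster share a pattern, most of the energy concentrates in the leading eigenvectors, so the selected $\hat{r}$ is typically small.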
3.3. Sparse Representation Model for Image Denoising
It has been observed that similar patches share the same dictionary elements in their sparse decompositions [5]. In other words, there is NSS in the sparse transform domain as well. The proposed algorithm takes advantage of that fact to achieve a better sparse representation and improve the performance of image denoising.

In the previous subsection, we learned a PCA dictionary for each cluster. Given the dictionaries, the initial sparse coefficients $\alpha_i$ of all patches within each cluster are found by solving the minimization problem of Equation (7) with $D = D_k$. It is known that the $\ell_0$ pseudo-norm regularization term usually has better reconstruction performance than the $\ell_1$ norm regularization term [17]. Since the $\ell_0$ minimization problem is NP-hard, greedy approaches are employed to achieve an approximate solution. One of the most widely used greedy approaches is orthogonal matching pursuit (OMP), which successively selects the best atom of the dictionary minimizing the representation error until a stopping criterion is satisfied. In our algorithm, we employed a modified version of OMP, referred to as generalized OMP (GOMP) [18], to obtain the initial sparse coefficients $\alpha_i$; GOMP allows the selection of multiple atoms per iteration. The simultaneous selection of multiple atoms reduces the number of iterations and achieves a better sparse representation.
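A bare-bones GOMP sketch, assuming unit-norm dictionary atoms; the per-iteration atom count and the stopping rule are illustrative, not the paper's exact settings:

```python
import numpy as np

def gomp(D, y, n_atoms_per_iter=2, max_iter=10, tol=1e-6):
    """Generalized OMP: at every iteration the S atoms most correlated
    with the current residual are selected at once, then the coefficients
    on the whole selected support are refit by least squares.
    """
    residual = y.copy()
    support = []
    for _ in range(max_iter):
        corr = np.abs(D.T @ residual)
        corr[support] = -1.0                    # ignore already-chosen atoms
        picks = np.argsort(corr)[::-1][:n_atoms_per_iter]
        support.extend(int(p) for p in picks)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha
```

Selecting several atoms per iteration is exactly what distinguishes GOMP from plain OMP: fewer iterations are needed to reach the same support size.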
After obtaining the initial sparse coefficients $\alpha_i$, we exploited the NSS prior in the sparse transform domain to produce a better estimate of the original image. To begin, we looked for the first $L$ most similar patches to each patch across the region of similar superpixels. Each pixel in a patch was denoted as a column vector computed by Equation (9), and the patch of size $\sqrt{n} \times \sqrt{n}$ was described by its feature matrix. The similarity between two patches was measured by Equation (13) with their covariance matrices. After obtaining the patches similar to each local patch, we computed the weighted average of the initial sparse coefficients associated with the similar patches as

$$\beta_i = \sum_{l=1}^{L} w_{i,l}\, \alpha_{i,l}.$$

Next, $\beta_i$ was exploited to constrain the sparse coefficient of the local patch, to reduce the impact of noise on the sparse representation and achieve a better solution. A better sparse representation coefficient for each patch $y_i$ can be found by:

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \left\| y_i - D \alpha_i \right\|_2^2 + \lambda \left\| \alpha_i - \beta_i \right\|_1,$$

where $\beta_i$ is the weighted average of the sparse coefficients of the first $L$ most similar patches to the patch $y_i$, and $\lambda$ is a regularization parameter making a balance between sparsity and reconstruction error. The weight $w_{i,l}$ of the two patches was computed as follows, based on their covariance matrices $C_i$ and $C_{i,l}$:

$$w_{i,l} = \frac{1}{Z_i}\, \rho\left( C_i, C_{i,l} \right),$$

where $Z_i$ is a normalization factor ensuring that the weights sum to one.
Based on the iterative soft-thresholding (IST) method [19], the solution to Equation (18) was computed by:

$$\alpha_i^{(t+1)} = \beta_i + S_{\lambda / (2c)}\!\left( \alpha_i^{(t)} + \frac{1}{c}\, D^T \left( y_i - D \alpha_i^{(t)} \right) - \beta_i \right),$$

where $S_{\tau}(\cdot)$ is the soft-thresholding function with threshold $\tau$; $t$ denotes the iteration index; and $c$ is a constant that guarantees the strict convexity of the surrogate optimization problem, with $c > \left\| D^T D \right\|_2$ [20]. After $T$ iterations, we obtain preferable sparse coefficients. Each patch can then be estimated by $\hat{x}_i = D \hat{\alpha}_i$. Since every pixel admits multiple estimates, its value can be computed by averaging all of them. When the sparse coefficients for all patches within the image were obtained, the whole original image $\hat{x}$ was recovered by Equation (8). By using the weighted average sparse coefficients of the similar patches to restrain the sparse decomposition process of the image patches, the accuracy of the sparse coefficients was improved. As a result, the reconstructed image was closer to the original image.
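The IST update for the weighted model can be sketched as follows; substituting $u = \alpha - \beta$ reduces Eq. (18) to a standard $\ell_1$ proximal-gradient iteration, and the step-size handling here is our own:

```python
import numpy as np

def soft_threshold(v, tau):
    """Component-wise soft-thresholding S_tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def weighted_sparse_code(D, y, beta, lam=0.1, n_iter=200):
    """IST-style solver for  min ||y - D a||_2^2 + lam ||a - beta||_1.

    beta is the weighted average of the coefficients of the similar
    patches; c > ||D^T D||_2 keeps the surrogate strictly convex and
    the iteration convergent.
    """
    c = np.linalg.norm(D.T @ D, 2) * 1.01
    a = beta.copy()
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)               # gradient of the data term
        a = beta + soft_threshold(a - grad / c - beta, lam / (2.0 * c))
    return a
```

With `beta = 0` the iteration reduces to ordinary $\ell_1$ sparse coding; a nonzero `beta` pulls the solution toward the nonlocal average, which is precisely the constraint that suppresses noise in the coefficients.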
The proposed algorithm is summarized in Algorithm 1. It makes full use of the NSS prior in both the spatial and sparse transform domains. Clustering in the spatial domain with mid-level visual cues better exploits the image structure and edge priors in comparison with directly clustering size-fixed local patches, which also contributes to the preservation of image edges and textures. As for the sparse transform domain, to achieve a better solution, the weighted average of the sparse coefficients of the patches similar to a local patch was utilized to constrain the sparse coefficient of that local patch. All of these contribute to the high denoising performance of the proposed algorithm.
Algorithm 1. The proposed algorithm, called SC-SR.

Input: image $y$ with white Gaussian noise. Set parameters: noise variance $\sigma^2$, superpixel number $K$, cluster number $k$, patch size $\sqrt{n} \times \sqrt{n}$, the number $L$ of the first most similar patches, and the regularization parameters $m$, $\omega$, $\delta$, $\beta$, $\lambda$, $c$.

1. Adopt SLIC to generate $K$ superpixels.
2. Utilize the sparse subspace clustering method to group the superpixels into $k$ clusters, forming the sub-datasets $\{Y_1, \dots, Y_k\}$.
3. Outer loop, for each cluster:
 ① Given the sub-dataset $Y_k$, train a dictionary $D_k$ by PCA.
 ② Initialize the sparse coefficients $\alpha_i$ for each patch over its specific dictionary by GOMP.
 ③ Inner loop, for $t = 1, \dots, T$: seek the first $L$ most similar patches in the cluster for each patch, compute the weighted average $\beta_i$ of the sparse coefficients of the acquired similar patches, and update the sparse coefficients $\alpha_i^{(t+1)}$ for the patch by Equations (18) and (19).
 ④ After $T$ iterations, obtain the final sparse coefficients and the sparse representations of all patches.
4. Reconstruct the image, and output the denoised image $\hat{x}$.