Image Error Concealment Based on Deep Neural Network

Abstract: In this paper, we propose a novel spatial image error concealment (EC) method based on a deep neural network. Considering that natural images have local correlation and non-local self-similarity, we use the local information to predict the missing pixels and the non-local information to correct the predictions. The deep neural network we utilize can be divided into two parts: the prediction part and the auto-encoder (AE) part. The first part utilizes the local correlation among pixels to predict the missing ones. The second part extracts image features, which are used to collect similar samples from the whole image. In addition, a novel adaptive scan order based on the joint credibility of the support area and reconstruction is also proposed to alleviate the error propagation problem. The experimental results show that the proposed method can reconstruct corrupted images effectively and outperform the compared state-of-the-art methods in terms of objective and perceptual metrics.


Introduction
Reliably delivering high-quality multimedia data is a significant task for applications such as television broadcasting. However, the transmission channel is not always satisfactory. When multimedia data are transmitted over error-prone or bandwidth-limited channels, packet loss greatly reduces the quality of the received multimedia data. A straightforward way to alleviate this problem is to retransmit the lost data. However, retransmission is unavailable in many practical applications, especially under real-time constraints such as live broadcast and multicast. Therefore, it is crucial to develop error concealment techniques that reconstruct the erroneously received multimedia data in order to guarantee the quality of transmission.
Image error concealment (EC), as a post-processing method, reconstructs the missing pixels without the need to modify the encoder or change the channel conditions [1]. The basic idea behind EC is to predict the missing pixels from the correctly received ones in the current frame or in adjacent frames based on the spatial or temporal correlations. According to which kind of correlation is utilized, EC methods can be classified into three categories: spatial EC (SEC) [2][3][4][5][6][7][8][9][10][11][12][13][14][15], temporal EC (TEC) [16][17][18][19][20][21][22], and spatial-temporal EC (STEC) [23][24][25][26]. When neighboring frames are not available, the SEC methods use only the information extracted from the neighboring pixels of the missing ones in the current frame. In contrast, the TEC methods purely take advantage of the temporal correlation: the missing blocks are replaced with similar areas in the previously decoded frames. STEC methods can be considered a combination of SEC and TEC, which exploit the correlations in both the spatial and temporal domains.

Problem Formulation
Similar to other SEC methods, we reconstruct the missing pixels using the correctly received ones. More specifically, in this paper we use the local information to predict the missing pixels and the non-local information to correct the predictions.
Let O be the original image and X be the corresponding corrupted image; X can be divided into an available part S and an unavailable part U, that is, X = U ∪ S. The EC problem can therefore be considered as reconstructing the unavailable part U by utilizing the information from the available part S. Without loss of generality, we suppose that only one pixel is reconstructed at a time. Define y_i as a pixel group that contains pixel i, which is located on the contour of U, as shown in Figure 1. Note that y_i can be regarded as a combination of y_i^s and y_i^u. y_i^u is the pixel set that contains the missing pixels; here, it is actually just the missing pixel i, since we reconstruct only one pixel at a time. y_i^s is a pixel set formed by the adjacent and available neighbors of y_i^u. We call y_i^s the support area of y_i^u, since it can be regarded as the spatial context of y_i^u. Obviously, we can estimate y_i^u from y_i^s by utilizing the local correlation of natural images. Considering that natural images also have the non-local self-similarity property, our method additionally uses the non-local information to correct the predictions. The reconstruction of the missing pixel y_i^u is defined as:

ŷ_i^u = φ(y_i^s) + λ·F(S_i)    (1)

where y_i^s is the support area of y_i^u and S_i is a set of samples that are similar to y_i^s. The model φ(·) predicts the missing pixels from the corresponding support areas, the function F(·) uses the similar sample set S_i to correct the predictions of φ(·), λ is a factor that balances the corrections and the predictions, and ŷ_i^u is the final reconstruction that replaces the missing pixel.
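As a toy illustration of Formula (1), the sketch below combines a local prediction with a λ-weighted non-local correction. The predictor and the correction value are stand-ins, not the trained network described later in the paper.

```python
# Sketch of Formula (1): the reconstruction of a missing pixel is the local
# prediction phi(y_s) plus a lambda-weighted non-local correction derived
# from similar samples. Both phi and the correction value are toy stand-ins.
def reconstruct_pixel(support, phi, correction, lam=0.5):
    return phi(support) + lam * correction

# Toy predictor: the mean of the support-area pixels.
phi = lambda s: sum(s) / len(s)
print(reconstruct_pixel([100, 104, 96, 100], phi, correction=2.0))  # 101.0
```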

In addition, under the sequential framework, the corrupted images are reconstructed pixel by pixel in sequence. This sequence, often called the scan order, is very important to the reconstruction performance, since it determines the available context of each missing pixel. After each pixel is reconstructed, we update the available part S and the unavailable part U. The EC task is accomplished when the unavailable part U becomes an empty pixel set.
Therefore, the focus of our work is to build a model φ(·) for predicting the missing pixels, to find an approach for searching out useful non-local information, and to use that information to correct the predictions. Determining an appropriate scan order is another important issue in our work.
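The sequential framework described above can be sketched as a loop: a contour pixel of U with available context is reconstructed, then moved from U into S so it can serve as context for later pixels. The mean-of-neighbours predictor below is a placeholder for the model φ(·), not the paper's network.

```python
# Minimal sketch of the sequential EC framework: reconstruct one contour pixel
# of the unavailable part U at a time, then move it into the available part S.
# The mean-of-neighbours predictor is a placeholder for the real model phi.
def conceal(img, missing):
    h, w = len(img), len(img[0])
    missing = set(missing)
    while missing:  # EC is done when U is empty
        for (r, c) in sorted(missing):
            # available 4-neighbours form the (tiny) support area
            nbrs = [img[r + dr][c + dc]
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < h and 0 <= c + dc < w
                    and (r + dr, c + dc) not in missing]
            if nbrs:                               # pixel lies on the contour of U
                img[r][c] = sum(nbrs) / len(nbrs)  # placeholder prediction
                missing.remove((r, c))             # update S and U
                break
    return img

img = [[10, 10, 10], [10, 0, 10], [10, 10, 10]]
print(conceal(img, {(1, 1)}))  # the centre pixel is filled in
```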

Our Proposal
In this work, we propose an EC method that takes into account both the local and the non-local information in the reconstruction of corrupted images. More specifically, we exploit the local information to predict the missing pixels and the non-local information to correct the predictions. In our method, a deep neural network is designed to serve as the prediction model φ(·) to achieve the EC purpose. The designed neural network can be divided into two parts: the prediction part and the auto-encoder (AE) part. The prediction part utilizes the local correlation among pixels to predict the missing ones. The AE part is used to extract image features; through those extracted features, we collect similar samples from the whole image. Since the designed neural network can extract image features through the AE part and predict missing pixels using the prediction part, we call it an AE-P network.
As illustrated in Figure 2, given a corrupted image X, we first collect all the possible samples from the available area to serve as the training set T = [t_1, t_2, ..., t_n]; each sample t in the set T can be regarded as a combination of an available part t^s and a missing part t^u. It should be pointed out that the true value of t^u in the training set T is known. Therefore, the training set T can be divided into two subsets, T^s and T^u, which are composed of t^s and t^u, respectively. Then, the collected training set T is used to train the designed AE-P network. More specifically, we train the AE part through unsupervised learning and the prediction part through supervised learning. Let y_i^u be the current missing pixel on the contour of the unavailable region and y_i^s be the corresponding support area. When we input y_i^s into the trained AE-P network, on the one hand, we obtain the prediction φ(y_i^s) of the missing pixel through the prediction part. On the other hand, the AE part extracts the features of y_i^s; by comparing similarities in the feature domain and the pixel domain, we search for samples similar to y_i^s from the whole image, which serve as a similar sample set S, as illustrated in the blue box in Figure 2. Then, we use these similar samples to correct the prediction φ(y_i^s) based on the non-local self-similarity of natural images. The final reconstruction ŷ_i^u is therefore determined by combining the prediction and the correction, as shown in Formula (1). In this way, both the local and the non-local information are taken into account for reconstruction. In addition, to alleviate the error propagation problem, we propose an adaptive scan order based on the joint credibility of the support area and the reconstruction.


Design of AE-P Neural Network
In this paper, a deep neural network named the AE-P network is designed to achieve the EC purpose. The network accomplishes two tasks. One is to predict the missing pixels by using the local correlation between the missing pixels and the available ones. The other is to extract image features so as to search for samples with similar features. The second task is accomplished through an auto-encoder (AE), as proposed by Hinton [30] in 2006. The AE is an unsupervised learning method that reconstructs the input signal at the output side so as to realize the dimensionality reduction of complex data. A typical AE network consists of an encoder and a decoder: the encoder maps the input layer to the bottleneck layer, and the decoder maps the bottleneck layer to the output layer. The bottleneck layer between them is exactly the data representation after dimensionality reduction; we focus on it in our method, since it is the most compact feature representation of the input data. Compared with traditional dimension-reduction methods such as principal component analysis (PCA), the AE network can extract high-level features of the data due to the excellent non-linear representation power of the deep neural network.
To train the AE-P network, we first use the subset T^s to train the AE part through unsupervised learning. Suppose that the samples in T^s are N-dimensional; then the dimensions of the input layer and the output layer of the AE network are correspondingly N. We then need to determine the dimension of the bottleneck layer, which is a significant issue in the design of the AE network. In general, the lower the dimension of the bottleneck layer, the higher the coding efficiency, at the expense of more information loss. It is therefore necessary to balance coding efficiency against information loss. In our method, we set the dimension of the bottleneck layer to half that of the input layer. The structure of the deep AE network in our method is shown in Figure 3a, and the optimization of the AE network is defined as:

argmin ‖t^s − t̂^s‖²    (2)

where t^s is one sample of the subset T^s and t̂^s is the corresponding output of the AE network. As can be seen from the formula, the objective of the optimization of the AE network is to minimize the mean square error between the input and the output.
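The objective in Formula (2) can be sketched with a toy linear auto-encoder: an encoder down to a bottleneck of dimension N/2 and a decoder back to N, trained by gradient descent on the mean square error. The layer sizes, data, and learning rate below are illustrative choices, not the paper's actual configuration.

```python
import numpy as np

# Toy linear auto-encoder for Formula (2): minimise the MSE between the input
# t_s and its reconstruction through a bottleneck of dimension N/2.
rng = np.random.default_rng(0)
N = 8
X = rng.random((100, N))                  # stand-in support-area samples
W_enc = rng.normal(0, 0.1, (N, N // 2))   # encoder: input -> bottleneck
W_dec = rng.normal(0, 0.1, (N // 2, N))   # decoder: bottleneck -> output

def mse(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = mse(X, W_enc, W_dec)
for _ in range(300):                      # plain gradient descent on the MSE
    H = X @ W_enc                         # bottleneck features
    G = 2 * (H @ W_dec - X) / X.size      # d(loss)/d(output)
    g_dec = H.T @ G                       # gradient w.r.t. the decoder
    g_enc = X.T @ (G @ W_dec.T)           # gradient w.r.t. the encoder
    W_dec -= 0.5 * g_dec
    W_enc -= 0.5 * g_enc
loss_after = mse(X, W_enc, W_dec)
print(loss_before, loss_after)            # the reconstruction error shrinks
```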
Next, we use the subset T^u as the corresponding label set of T^s to train the prediction part network through supervised learning. Figure 3b shows the structure of the prediction network. The input of the prediction network, t^s, is the available part of the training samples, which is the same as the input of the AE network. The output is the prediction of the unavailable pixels of the corresponding samples. Since we reconstruct only one missing pixel at a time, the output layer dimension of the prediction network is one, representing the prediction of the missing pixel. The optimization of the prediction network is defined as:

argmin ‖t^u − t̂^u‖²    (3)

where t^u represents the true values of the missing pixels and t̂^u represents the corresponding predictions of the prediction network.
Considering that the inputs of the two aforementioned networks are exactly the same, the front parts of the two networks are designed to be identical. Thus, a new network, which we call the AE-P network, is formed by combining the two networks and sharing this common part. The structure of the AE-P network is shown in Figure 4. The trained AE-P network can predict the missing pixels and extract image features at the same time. As can be observed from the structure, the front part of the AE-P network is the AE, and the network has two different output branches. We first train the AE network through unsupervised learning; this corresponds to the first branch of the AE-P network. Then, we keep the parameters of the encoder fixed and only update the parameters of the prediction part to train the prediction network through supervised learning, which corresponds to the second branch of the AE-P network.
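The two-branch structure can be sketched as one shared encoder whose bottleneck features feed both a decoder head (the AE branch) and a scalar prediction head (the prediction branch). All three functions below are toy stand-ins for the trained layers.

```python
# Sketch of the AE-P structure: a shared encoder feeds two output branches,
# a decoder that reconstructs the input (AE part) and a one-dimensional
# prediction head for the missing pixel. All functions are toy stand-ins.
def encode(x):                  # shared front part -> bottleneck features
    return [v * 0.5 for v in x]

def decode(features):           # branch 1: reconstructs the input
    return [v * 2.0 for v in features]

def predict(features):          # branch 2: scalar missing-pixel prediction
    return sum(features) / len(features)

x = [10.0, 12.0, 14.0, 16.0]
f = encode(x)                   # features are computed once and shared
print(decode(f))                # AE branch output, equal to x here
print(predict(f))               # prediction branch output: 6.5
```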


Training Data Collection
For training the designed AE-P network, we need a large number of samples. In order to collect appropriate training data, a template matching scheme is utilized to match and collect samples from the whole image. A challenging problem of the scheme is how to design the template shape so as to obtain the maximum amount of available information. As in many SEC methods, templates with a square shape, as shown in Figure 5a, are widely used in training data collection. The corresponding context of the missing pixel is shown in the blue dotted box in Figure 5b. It can be observed that only the available pixels in area A, which is part of the context, are used to reconstruct the current missing pixel. However, the pixels in areas B and C also have a high correlation with the missing pixel, so the available pixels in areas B and C should be taken into account in the image reconstruction as well. Since more available information is considered, the reconstruction of the missing pixel will be more reliable.
Figure 5. The red square is the current missing pixel, the gray squares are the available pixels, and the black squares are the unavailable pixels. (a) The template with square shape. (b) The context of the current missing pixel through matching the square template.
In order to get more available information from the context, eight templates with different shapes are designed in our method, as shown in Figure 6. These eight templates can be embedded around the missing pixels to augment the information collected from the context. Unlike the square template, which can only match one support area from the context for each missing pixel, we can collect multiple support areas. As illustrated in Figure 7, four support areas around the current missing pixel y_i^u can be collected through matching the eight templates in the context. The available information obtained by these four support areas is shown in the blue box in Figure 7. For each support area, a reconstruction is generated through the trained AE-P network. By combining the reconstructions of these support areas, the final reconstruction is therefore determined through all the available pixels of these support areas. The final reconstruction of the missing pixel y_i^u is determined as:

ŷ_i^u = (1/n) Σ_{j=1}^{n} ŷ_i^{u_j}    (4)

where ŷ_i^{u_j} is the reconstruction generated from the j-th support area y_i^{s_j} and n is the number of collected support areas.

Figure 6. Eight templates with different shapes are utilized to collect training data for training the AE-P network in the proposed method.
However, the relative locations of the missing pixel and the corresponding support area differ among the collected samples. In order to train the AE-P network, we normalize all the collected training samples into a standard shape through rotating and flipping. As shown in Figure 6, we define the down-left template shape as the standard shape and transform all the collected training samples into it. For example, the procedure of transforming the up-left shape into the standard shape is shown in Figure 8. In our method, we collect the training data from the available regions of the input corrupted images. Specifically, we first use the designed eight templates to match all the possible samples in the available region. Next, we normalize the collected samples into the standard shape through the aforementioned processing. Then, the normalized samples serve as the training data for training the AE-P network.
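The normalisation step can be sketched with numpy flips: each collected patch is mirrored so that its missing pixel ends up in the standard down-left position. The orientation labels and the flip mapping below are illustrative assumptions, not the paper's exact eight-template set.

```python
import numpy as np

# Sketch of sample normalisation: flip each collected patch so that its
# missing-pixel corner lands in the standard down-left orientation.
# The orientation names and the mapping are illustrative assumptions.
def normalize(patch, orientation):
    ops = {
        "down-left": lambda p: p,                       # already standard
        "down-right": np.fliplr,                        # mirror left-right
        "up-left": np.flipud,                           # mirror up-down
        "up-right": lambda p: np.flipud(np.fliplr(p)),  # mirror both ways
    }
    return ops[orientation](patch)

patch = np.array([[1, 2],
                  [3, 4]])
print(normalize(patch, "up-right"))   # [[4 3], [2 1]]
```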

Similar Data Collection
Error concealment is an ill-posed inverse problem, as the true values of the missing pixels are unavailable in practice. Thus, prediction errors, which reduce the reconstruction performance, are inevitable. In the proposed method, we utilize the AE-P network to reconstruct the missing pixels. Considering that the AE-P network produces similar outputs for similar inputs, we assume that the prediction errors for similar inputs are also similar. Based on this assumption, we search for data similar to the current input of the AE-P network; the network's prediction errors on those data can then be used to correct the prediction for the current input, which reduces the prediction error and improves the reconstruction performance. Many methods, such as those in [8,28], collect similar samples in pixel space. However, measuring the similarity of samples in pixel space is not reliable, since pixel values only represent gray levels: the collected samples may be similar in pixel value but not in features, especially in texture regions. To avoid this drawback, we define a feature space and collect similar samples in it. The feature space is composed of the sample features determined through the bottleneck layer of the trained AE-P network. In feature space, the feature similarity between samples is easy to measure, so we can collect samples with more similar features. In addition, the computational complexity is greatly reduced, since the dimension of the feature space is much lower than that of the pixel space.
The Euclidean distance is used to measure the similarity between the samples and the current support area in feature space. We define p as the current support area of the current missing pixel and q as a collected sample with the same shape as p. f_p and f_q are the corresponding feature representations determined through the bottleneck layer of the trained AE-P network; then, the Euclidean distance D_f(p, q) is used to measure the similarity of p and q as:

D_f(p, q) = √( Σ_{i=1}^{k} (f_p^i − f_q^i)² )    (5)

where f_p^i and f_q^i are the i-th values of f_p and f_q, respectively, and k is the dimension of the feature space. The formula shows that the smaller the distance, the more similar p and q are in feature space. However, Formula (5) may fail at measuring the sample similarity in pixel space, since some samples are similar in feature space but very different in pixel space. Therefore, we also require the similar samples determined by Formula (5) to be similar to p in pixel space. We again use the Euclidean distance to measure the similarity between the collected samples and the current support area in pixel space. In order to make this measurement more reliable, some adjacent pixels of the missing pixel are added to the similarity calculation; as illustrated in Figure 9, the green pixels are the added ones. The similarity between p and q in pixel space is defined as follows:

D_p(p, q) = √( Σ_{i=1}^{n} (p̄^i − q̄^i)² )    (6)

where p̄ and q̄ are the augmented pixel sets corresponding to p and q, respectively, and n is the pixel number of p̄. The formula shows that the smaller the distance, the more similar p and q are in pixel space.
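Formulas (5) and (6) are both plain Euclidean distances, applied once to the bottleneck features and once to the augmented pixel sets. A minimal sketch with toy vectors:

```python
import numpy as np

# Euclidean distance as used in Formulas (5) and (6): applied to bottleneck
# feature vectors for D_f and to augmented pixel sets for the pixel-space
# distance. The vectors here are toy values, not real AE-P features.
def euclidean(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

print(euclidean([0.0, 3.0], [4.0, 0.0]))     # 5.0 (feature-space example)
print(euclidean([10, 10, 10], [11, 12, 8]))  # 3.0 (pixel-space example)
```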
Therefore, in our method, we first use Formula (5) to measure the similarity of the samples to the current support area p in feature space and collect the first n most similar samples to form the set S_j^n = [s_1, s_2, s_3, ..., s_n]:

S_j^n = { s_i | D_f(p, s_i) < τ_1 }    (7)

where s_i is a sample collected through matching the support area p, and τ_1 is the threshold selected in practice such that the first n = 500 closest samples are collected. Then, we use Formula (6) to calculate the similarity between the collected samples and the current support area p in pixel space, and we select only the first m most similar samples to serve as the similar sample set S_j = [s_1, s_2, s_3, ..., s_m]:

S_j = { s_i | D_p(p, s_i) < τ_2, s_i ∈ S_j^n }    (8)

where s_i is a sample from the set S_j^n, and τ_2 is the threshold selected in practice such that the first m = 50 closest samples are selected from the set S_j^n. In this way, the similar samples we collect are similar both in feature space and in pixel space.
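The two-stage selection can be sketched as two argsort passes: the n feature-space nearest samples first (Formula (7)), then the m pixel-space nearest among them (Formula (8)). The data below are random stand-ins, and n and m are scaled down from the paper's 500 and 50.

```python
import numpy as np

# Two-stage similar-sample selection: first the n nearest in feature space
# (Formula (7)), then the m nearest of those in pixel space (Formula (8)).
# Random stand-in data; n and m are scaled down from 500 and 50.
rng = np.random.default_rng(1)
feats = rng.random((40, 4))     # bottleneck features of the candidate samples
pixels = rng.random((40, 9))    # augmented pixel sets of the same samples
f_p = rng.random(4)             # features of the current support area p
p_bar = rng.random(9)           # augmented pixel set of p

n, m = 10, 3
d_f = np.linalg.norm(feats - f_p, axis=1)            # Formula (5)
cand = np.argsort(d_f)[:n]                           # Formula (7)
d_p = np.linalg.norm(pixels[cand] - p_bar, axis=1)   # Formula (6)
similar = cand[np.argsort(d_p)[:m]]                  # Formula (8)
print(similar)                  # indices of the final m similar samples
```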

Prediction Error Correction
Since samples that are similar to the current support area can be collected as described in the last section, those similar samples are used to correct the prediction of the current support area. Let y_i^u be the current missing pixel, and suppose that n support areas y_i^{s_1}, ..., y_i^{s_n} can be collected through matching the templates in the context. For each support area, we collect similar samples to correct the corresponding prediction. We define S_j = [s_1, s_2, s_3, ..., s_m] as the similar sample set corresponding to the support area y_i^{s_j}. Suppose that s_{j,k} is the k-th sample in the set S_j; then, the prediction error e_k of sample s_{j,k} is defined as:

e_k = s_{j,k}^u − φ(s_{j,k}^s)    (9)

where s_{j,k}^s is the available part of s_{j,k} and s_{j,k}^u is the missing part of s_{j,k}. φ(s_{j,k}^s) is the prediction of s_{j,k}^u through the trained AE-P network, s_{j,k}^u is the true value, and e_k is the prediction error produced by the network on the input s_{j,k}^s. Let the set E_j = [e_1, e_2, e_3, ..., e_m] represent the error set corresponding to the input data set S_j; then, the correction of the prediction φ(y_i^{s_j}) is derived from the following formula:

a_j = Σ_{k=1}^{m} c_k · e_k    (10)

where c_k is the proportion of the prediction error e_k in the whole correction a_j, and c_k is determined by the similarity between the sample s_k^s and y_i^{s_j}. c_k obeys the following Formulas (11) and (12):

Σ_{k=1}^{m} c_k = 1    (11)

c_{n1} / c_{n2} = D_f(s_{n2}^s, y_i^{s_j}) / D_f(s_{n1}^s, y_i^{s_j})    (12)

where c_{n1} and c_{n2} are the proportions corresponding to e_{n1} and e_{n2}, y_i^{s_j} is the current support area, and D_f(·) calculates the Euclidean distance in feature space. Then, the reconstruction ŷ_i^{u_j} determined through the support area y_i^{s_j} is:

ŷ_i^{u_j} = φ(y_i^{s_j}) + λ·a_j    (13)

For each support area, there is a corresponding reconstruction generated by Formula (13). The final reconstruction ŷ_i^u of the missing pixel y_i^u is the average of those reconstructions, as shown in Formula (4).
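Formulas (9)-(13) can be sketched in a few lines: the known prediction errors of the similar samples are weighted by inverse feature distance (so that the c_k sum to one) and folded back into the prediction. φ here is a toy mean predictor, and the distances are made-up values, not real AE-P outputs.

```python
import numpy as np

# Sketch of the correction step, Formulas (9)-(13). phi is a toy mean
# predictor and the feature distances are made-up values.
phi = lambda s: float(np.mean(s))

samples_s = np.array([[10.0, 12.0], [11.0, 13.0], [9.0, 11.0]])  # available parts
samples_u = np.array([12.0, 13.0, 11.0])     # true values of the missing pixels
dists = np.array([1.0, 2.0, 4.0])            # D_f to the current support area

e = samples_u - np.array([phi(s) for s in samples_s])  # Formula (9)
w = 1.0 / dists                              # closer samples weigh more
c = w / w.sum()                              # Formulas (11)-(12): sums to 1
a = float(np.sum(c * e))                     # Formula (10)
y_hat = phi(np.array([10.0, 12.0])) + 0.5 * a  # Formula (13), lambda = 0.5
print(y_hat)                                 # 11.5
```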

Adaptive Scan Order
Within the framework of sequential reconstruction, the previously reconstructed pixels are used in the subsequent pixel reconstruction; hence, prediction errors accumulate and propagate to later reconstructions. The scan order, which determines the available context of each missing pixel, plays a critical role in the reconstruction performance. A common idea, as used in [8], is to first reconstruct the missing pixels whose support areas contain more available pixels. Although this scan order achieves fairly good performance, it still has deficiencies. For example, the scan order will be exactly the same for two different missing areas that have the same shape. Thus, it is neither flexible nor conducive to the extension of edges.
In our method, we propose a novel adaptive scan order based on the joint credibility of the support area and the reconstruction to alleviate the error propagation problem. The scan order depends not only on the credibility of the support area, but also on the credibility of the missing pixels' reconstruction. Let p(x) stand for the confidence of pixel x. The confidences of the pixels in the corrupted image are initialized as:

p(x) = 1, if x is a correctly received pixel; p(x) = 0, if x is a missing pixel    (14)

where the constant 1 indicates that the pixel x was correctly received, and 0 indicates that it is missing.
We updated the confidence of the missing pixels to 1 after reconstructing them. The confidences of the pixels in the received image were initialized as shown in Figure 10. For each missing pixel, we found all the possible support areas around it by matching the eight templates. The sum of the confidences of all the non-overlapping available pixels in these support areas represents the credibility of the support area of this missing pixel, as shown in the red box in Figure 10. In the process of reconstruction, we first reconstructed the missing pixel whose support area had the highest credibility. However, the support-area credibilities are frequently equal; as can be seen in Figure 10, the credibilities of the support areas of the current pixels y_1, y_2, y_3, and y_4 are the same.
In this case, we used the credibility of the reconstructions of the current missing pixels to determine the scan order. According to the non-local self-similarity of natural images, similar samples have similar pixel values at the same position. Therefore, we used the deviation between the reconstruction and the pixel value in the same location of the corresponding similar sample to measure the credibility: the lower the deviation, the higher the credibility.
Let y^u_i be the i-th missing pixel located on the contour of the available region, and suppose that the n support areas y^{s_1}_i, y^{s_2}_i, . . . , y^{s_n}_i can be collected by matching the templates. We collected the corresponding similar sample sets S_1, S_2, . . . , S_n from the whole image. Then, the credibility of the reconstruction ŷ^u_i can be represented as:

C(ŷ^u_i) = 1 / ( (1/(nm)) Σ_{j=1}^{n} Σ_{k=1}^{m} |ŷ^u_i − s^u_{j,k}| + ε )   (15)

where ŷ^u_i is the reconstruction of y^u_i, and s^u_{j,k} is the true value corresponding to the missing pixel y^u_i in the k-th similar sample of set S_j. ε is a constant that ensures that the denominator is not 0. The credibility of the reconstructions of the missing pixels is determined through Formula (15): the higher the credibility, the more reliable the prediction; hence, these missing pixels are reconstructed first.
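A minimal sketch of this credibility measure follows. The exact deviation (mean absolute difference) is an assumption: the text only states that credibility decreases with the deviation between the reconstruction and the corresponding sample values, with ε keeping the denominator nonzero.

```python
import numpy as np

def reconstruction_credibility(recon, sample_truths, eps=1e-8):
    """Credibility of a reconstructed pixel (sketch of Formula 15).

    recon         : float, the reconstruction y-hat of the missing pixel.
    sample_truths : true values s^u_{j,k} at the same position in all
                    collected similar samples (flattened over j and k).
    Lower deviation from the similar samples means higher credibility.
    """
    deviation = np.abs(recon - np.asarray(sample_truths, dtype=float)).mean()
    return 1.0 / (deviation + eps)
```

A reconstruction that agrees with its similar samples scores much higher than one that deviates from them, so it is scheduled earlier in the scan order.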

Experiments
In this section, comparative experiments verify the performance of the proposed algorithm. We first analyzed the influence of the proposed correction and adaptive scan order on the final EC performance. Then, we compared the proposed method with other state-of-the-art EC methods [4,6,7,9,10,27,31,32].
Similar to the work of others, three kinds of block loss were considered in our experiments: 16 × 16 regular isolated block losses (≈22% loss rate), 16 × 16 regular consecutive block losses (≈50% loss rate), and 16 × 16 random consecutive block losses (≈25% loss rate). These three loss modes are shown in Figure 11. For convincing comparisons, 13 widely used images were used as the test set in this paper. Note that we only considered grayscale images; since color images contain multiple channels, color images could be reconstructed by reconstructing the channels separately. All of the test images were 256 × 256 in size, as illustrated in Figure 12.
In this paper, the size of the designed templates was set to 7 × 7 + 1 (that is, a combination of a 7 × 7 square and one pixel to be predicted). The training samples were collected by matching the templates on the test images. Each sample was normalized to the range [0, 1] to match the active range of 'tanh'. The two parts of the AE-P network were both 11-layer fully connected networks. We used the 'elu' activation function after each layer except for the last one, and the 'tanh' activation function after the last layer in order to normalize the output to [0, 255], which is the grayscale range. The overview of the AE-P network can be seen in Table 1. We used Adam [33] for optimization with the learning rate of the two networks set to 0.001. The batch size was set to 600. In addition, our work was implemented on the Python-TensorFlow platform on Windows 10. The hardware platform used an Intel i5 7300H CPU, 8 GB of RAM, and an NVIDIA GTX 1060 GPU. Table 1. Architectures of the designed AE-P network. After each layer except for the last one we used the 'elu' activation function, and after the last layer, we used the 'tanh' activation function.
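The activation scheme described above ('elu' after every hidden layer, 'tanh' after the last) can be sketched as a plain NumPy forward pass. This is not the authors' network: the layer widths in Table 1 are not reproduced here, so the weight shapes below are placeholders, and training (Adam, learning rate 0.001, batch size 600) is omitted.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential linear unit: identity for x > 0, alpha*(exp(x)-1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected network in the spirit of AE-P.

    weights/biases: lists of per-layer parameters; 'elu' is applied after
    every layer except the last, 'tanh' after the last layer.
    """
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = elu(h @ W + b)
    W, b = weights[-1], biases[-1]
    return np.tanh(h @ W + b)
```

In the paper this forward pass is realized twice, once for the prediction part and once for the AE part, each as an 11-layer network.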

Layer / Number of Neurons

In order to compare the quality of reconstruction, the widely used peak signal-to-noise ratio (PSNR) is chosen to serve as an objective metric to measure the image quality in our experiments. For a better comparison of structural similarities, the structural similarity (SSIM) index [33] is also used in this paper.
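PSNR, the objective metric used throughout the experiments, can be computed directly from the mean squared error; a short reference implementation for 8-bit images follows (SSIM is more involved and is typically taken from an image-processing library).

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, two identical images give infinite PSNR, while an all-zero reference against an all-255 image gives 0 dB.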

Comparative Studies
In our method, in order to reduce the prediction error and improve the accuracy of the AE-P network, we corrected the predictions based on the non-local self-similarity of natural images. Moreover, we proposed an adaptive scan order based on the joint credibility of the support area and reconstruction to alleviate the error propagation problem. To evaluate the influence of the proposed correction and scan order on the final EC performance, we conducted three groups of comparative experiments, corresponding to the three loss modes, on the test images. In every group, there were three different scenarios of our method: a scenario that did not use the proposed correction, a scenario that did not use the proposed adaptive scan order, and a scenario that utilized both the proposed correction and the adaptive scan order. For convenience, we named these three scenarios 'Cor(off)-Ord(on)', 'Cor(on)-Ord(off)', and 'Cor(on)-Ord(on)'. 'Cor' and 'Ord' correspond to the proposed correction and scan order, while 'on' and 'off' in the brackets indicate whether or not they were used. The results of the comparative experiments are shown in Figures 13-15.
As can be seen from Figures 13-15, in all three loss modes, scenario 'Cor(on)-Ord(on)' performed best on almost all of the test images in terms of PSNR and SSIM. In particular, scenario 'Cor(on)-Ord(on)' outperformed scenario 'Cor(off)-Ord(on)' by a large margin under all loss modes and test images. This great improvement in reconstruction performance is due to the proposed correction, which reduces the prediction error effectively. Moreover, it can be observed that the proposed adaptive scan order also improves the reconstruction performance. Compared with the proposed correction, the scan order brings a smaller performance improvement. This is because the scan order obtains a better reconstruction of the structure, which sometimes produces false edges and reduces the PSNR. For example, the scan order fails to improve the reconstruction performance under PSNR on Hat in the random loss mode, as shown in Figure 15.
For further evaluation of the influence of the proposed correction and adaptive scan order on the final EC performance, subjective comparisons are given in Figures 16-18, corresponding to the three different loss modes. It can be observed that scenario 'Cor(on)-Ord(on)' achieved the most pleasant visual quality and the highest PSNR and SSIM. Specifically, by comparing scenario 'Cor(off)-Ord(on)' with scenario 'Cor(on)-Ord(on)', we can observe that the proposed correction greatly improves the reconstruction performance. As shown in the red box in Figure 16c, the bracket inside the red box is disconnected, while the bracket in Figure 16e is connected. We can also observe, by comparing scenario 'Cor(on)-Ord(off)' with scenario 'Cor(on)-Ord(on)', that the proposed scan order improves the reconstruction of edges, as shown in the red boxes in Figure 18d,e, respectively.

Objective and Subjective Performance Comparison
In order to verify the performance of the proposed method, eight other state-of-the-art EC methods were compared with ours: POCS [4], MRF [27], nonnormative SEC for H.264 (AVC) [31], the content adaptive technique (CAD) [32], OAI [6], VC [7], sparse linear prediction (SLP) [9], and KMMSE [10]. The source code of all the above methods is based on a third-party implementation [34]. The results of the EC performance comparison of the nine competing methods are given in Tables 2-4. As can be seen from the tables, our method is superior to the other eight methods in average PSNR and SSIM under all three loss modes. Table 2 illustrates the reconstruction performance of the compared methods on 16 × 16 isolated block losses. As can be observed from Table 2, the proposed method outperformed all of the other methods in average PSNR by a considerable margin. Compared with the very recent image EC method KMMSE, the average PSNR gain was 0.59 dB. Compared with the well-known OAI method, our method achieved up to 0.59 dB higher PSNR and 0.0069 higher SSIM. When compared with the POCS, VC, SLP, and MRF methods, our method obtained gains of 3.73 dB, 0.98 dB, 1.42 dB, and 1.46 dB in terms of PSNR and gains of 0.0550, 0.0143, 0.0062, and 0.0285 in terms of SSIM, respectively. Table 3 shows the quantitative comparison on 16 × 16 regular consecutive block losses. Under this loss mode, the proposed method performed better than the remaining eight methods on all of the test images in terms of both PSNR and SSIM. The average gain over the second-best method was over 0.52 dB in terms of PSNR and 0.0034 in terms of SSIM. Similarly, compared with the POCS, VC, SLP, and MRF methods, our method obtained gains of 3.16 dB, 1.84 dB, 1.59 dB, and 1.1 dB in terms of PSNR and gains of 0.0967, 0.0674, 0.0129, and 0.0261 in terms of SSIM, respectively. Moreover, compared with the recent image EC method KMMSE, the average PSNR gain was 0.67 dB.
Finally, we compared the reconstruction quality of our method with the other methods on random consecutive block losses. As illustrated in Table 4, the proposed method achieved the best performance in average PSNR and SSIM. Specifically, compared with KMMSE, SLP, and AVC, the average PSNR gains were 0.32 dB, 1.19 dB, and 1.3 dB, respectively. To further illustrate the performance of the proposed method, subjective comparisons are also given in Figures 19-21. As can be observed from the figures, the proposed method produced the most visually pleasant results among all compared methods. Figure 19 compares the performance of the proposed method with the others under isolated block loss. Severe blocking artifacts were produced by POCS, AVC, CAD, and VC, and a blurred and lumpy boundary could be observed in MRF and OAI. Figure 20 presents the comparison results on regular consecutive block losses, which have a high block loss rate. POCS, CAD, and MRF produced very serious lumps. It can also be observed that the CAD and VC methods produced many false edges. Only the proposed method and the recent KMMSE method produced natural reconstructions. Figure 21 presents the comparison results on random block losses; under this loss mode, the EC task is more challenging since many missing blocks may cluster together, making it difficult to find a regular and reliable neighborhood. It can be seen that some lost pixels cannot be estimated very well. Only the proposed method can restore the major edges and textures.
Regarding the run time of our proposal, since we needed to train an AE-P network for each corrupted image, the training time of the network was included in the entire image processing time. In addition, the algorithm that we implemented is not optimized; for example, we used an exhaustive search to collect similar samples. These two factors made our algorithm time-intensive. More specifically, our algorithm required about half an hour per corrupted image under the 16 × 16 isolated loss mode and a 256 × 256 image size. Therefore, our algorithm is computationally prohibitive for online applications. Although the proposed algorithm requires more time than the compared methods, its reconstruction quality is better in terms of average PSNR and SSIM, as shown in Tables 2-4. Therefore, our future work will improve the algorithm in two ways. One is to optimize the algorithm and reduce its computational complexity. The other is to use a pre-trained network to avoid training a network for each image.

Conclusions
In this paper, we developed a novel image EC method based on the AE-P neural network. Both the local correlation and non-local self-similarity of natural images were taken into account in reconstructing the missing pixels. We used the local correlation to predict the missing pixels and the non-local information to correct the predictions. The designed neural network could be divided into two parts: the prediction part and the auto-encoder (AE) part. The prediction part utilized the local correlation among pixels to predict the missing ones. The AE part extracted image features, which were used to collect similar samples from the whole image. The predictions of the missing pixels were corrected through the collected similar samples. In addition, we proposed a novel adaptive scan order based on the joint credibility of the support area and reconstruction to alleviate the error propagation problem. The experimental results showed that the proposed algorithm could reconstruct corrupted images effectively and outperform the compared state-of-the-art methods in terms of objective and perceptual metrics.
Author Contributions: Z.Z. designed and performed the experiments, analyzed the data, and wrote the paper with contributions from all authors; R.H., F.H., and Z.W. supervised the study and verified the findings of the study. All the authors read and approved the submitted manuscript, agreed to be listed, and accepted this version for publication.