Sub-Pixel Mapping Model Based on Total Variation Regularization and Learned Spatial Dictionary

In this research study, we deal with remote sensing data analysis over high dimensional space formed by hyperspectral images. This task is generally complex due to the large spectral, spatial richness, and mixed pixels. Thus, several spectral un-mixing methods have been proposed to discriminate mixing spectra by estimating the classes and their presence rates. However, information related to mixed pixel composition is very interesting for some applications, but it is insufficient for many others. Thus, it is necessary to have much more data about the spatial localization of the classes detected during the spectral un-mixing process. To solve the above-mentioned problem and specify the spatial location of the different land cover classes in the mixed pixel, sub-pixel mapping techniques were introduced. This manuscript presents a novel sub-pixel mapping process relying on K-SVD (K-singular value decomposition) learning and total variation as a spatial regularization parameter (SMKSVD-TV: Sub-pixel Mapping based on K-SVD dictionary learning and Total Variation). The proposed approach adopts total variation as a spatial regularization parameter, to make edges smooth, and a pre-constructed spatial dictionary with the K-SVD dictionary training algorithm to have more spatial configurations at the sub-pixel level. It was tested and validated with three real hyperspectral data. The experimental results reveal that the attributes obtained by utilizing a learned spatial dictionary with isotropic total variation allowed improving the classes sub-pixel spatial localization, while taking into account pre-learned spatial patterns. It is also clear that the K-SVD dictionary learning algorithm can be applied to construct a spatial dictionary, particularly for each data set.


Introduction
The recent technological progress in hyperspectral imaging has led to the emergence of a gradually increasing source of spectral information showing the characteristics of hyperspectral image acquisition via numerous available spectral bands. However, many difficulties appear while extracting valuable data for the final user of different applications. The most important problem is the presence of mixed pixel [1] occurring when the size of two or more classes of land cover classes may be larger than the pixel. This difficulty appears in case where the size of two land cover classes may be larger than the pixel size, but some parts of their boundaries are localized within a single pixel [2]. As the traditional hard classification algorithms cannot be applied to solve mixed pixels problem [3,4], the best solution consists in using spectral unmixing [5,6] and fuzzy classification [7] in order to specify the endmembers (classes) and their presence ratio, while their precise spatial localization cannot be determined. In 1997, sub-pixel mapping methods were first proposed by Atkinson [8] to approximate the spatial locations at a sub-pixel scale from coarse spatial resolution hyperspectral data. These techniques are based on either the original hyperspectral image or use the findings obtained by soft classification (i.e., spectral unmixing) as input. They have been utilized in several fields, such as forestry [9], water mapping [10], burned areas [11], target detection [12], and rural land cover objects [13].
In fact, SPM (Sub-Pixel Mapping) aims essentially at improving the images spatial resolution relying on spatial dependence assumption between and within pixels [14]. Despite the importance of this method, a great effort has to be made to construct more reliable and more efficient approach in order to satisfy the requirements of this large number of applications. SPM methods presented in the literature can be classified into three categories [15].
The first one includes techniques relying on spatial dependence assumption and utilizing, as input, a hyperspectral image. As instance of these methods, we can mention linear optimization [16], pixel swapping algorithm [17], techniques relying on spatial attraction at a sub-pixel scale [18,19], Cokriging indicator [20], and Markov random field [21][22][23], as well as artificial-intelligence based on some techniques, such as differential evolution [24], genetic algorithm [25], and methods relying on hopfield neural network [26]. Indeed, technique belonging to the first class are generally applied to exploit the spatial correlation of the elements between and within a mixed pixel, based on the assumption that close elements have higher correlation than distant ones. In fact, they do not provide a single solution, but they depend on the initialization step. In Reference [27], Arun et al. suggested a technique relying on pixel-affinity and semi-variogram. Besides, authors, in Reference [20], presented a method using semi-variogram approximated from fine spatial resolution training images.
The second class involves techniques used to inject a priori knowledge in the form of additional data to improve SPM accuracy. To remedy the problem of insufficient information, several methods that use an additional source of information have been proposed in the literature. Authors in Reference [28,29] have presented methods that add additional data by considering some parameters, such as spatial resolution (through the integration of panchromatic images [30] or fused images) and spatial shift [14,28,31,32].
The third class contains techniques relying on spatial prior model. These methods allow transforming sub-pixel mapping into an inverse well-posed problem with a unique solution to form a fine spatial resolution map. Researchers, in Reference [33], developed a subpixel mapping technique relying on MAP (Maximum A Posteriori) model, and winner takes all as strategy to choose the adequate class that should be considered in the classification problem. Fend et al. [34,35] constructed also the DCT (Discret Cosine Transform) dictionary to obtain more sub-pixel configurations.
On the other hand, authors in Reference [36] classified sub-pixel mapping techniques into three classes according to the size and shape of the object [37], which can be zonal [26,38,39], linear [13], or a point [40].
The previously-mentioned methods (first category), relying on the spatial dependence assumption, are based on the assumption that the near sub-pixels are more similar than the distant one, which is not always the case. Another limitation of these techniques is the problem of insufficient information at the sub-pixel level, which affects negatively the mapping accuracy. It is worth noting that the total variation, used as a regularization, cannot provide an optimal distribution of the classes at the sub-pixel level. In addition, the discrete cosine transform dictionary does not depend on the input data. Thus, its atoms cannot be adapted to the image and cannot be employed in all possible output configurations. Unlike this standard dictionary that can be employed with all input images, that proposed in our work is a spatial one that can be adapted to each input image in order to ensure a spatial modeling that is very close to reality and can provide more possible configurations at sub-pixel scale.
To solve this problem and to adapt the atoms of the dictionary to the utilized data set, we propose, in this study, the K-SVD (K-singular value decomposition) algorithm applied in image compression [41] or feature extraction [42]. We also depict the mixed pixel with a linear combination of atoms produced by the K-SVD algorithm without taking into account any relation between classes through various mixed pixels which representing the isotropic TV (Total Variation) [43] used as a regularization constraint for the characterization of the relation among neighboring sub-pixels in all abundance cards. Our proposal based on sparse representation [44,45] allows also the transformation of the ill-posed sub-pixel mapping problem into a well-posed one, which makes possible the convergence of the algorithm into a unique minimum of the cost function having unique optimal solution.
The remaining of this manuscript is organized as follows: Section 2 presents some mathematical notations. In Section 3, we define the sub-mapping model. Section 4 shows the features of the proposed K-SVD learning process. Section 5 describes the introduced sub-pixel mapping technique relying on the learned dictionary and isotropic total variation minimization. Then, in Section 5, we discuss the experimental results. Section 6 is a short conclusion where we show our future works.

Mathematical Notation
The mathematical notations used in this paper are presented below.
Notation Description p norm p u the input data, and abundances fractions of classes. u c abundances fraction of class c or low spatial resolution image of class c. x c high spatial resolution image of class c. X sub-pixel mapping result or high spatial resolution image with non-smooth edges. X F final sub-pixel mapping result or high spatial resolution image with smooth edges. E down-sampling vector or matrix. D spatial dictionary. α and β sparse coefficients. S scale factor. TV Total Variation ITV Isotropic Total Variation

Sub-Pixel Mapping Definition
Sub-pixel mapping (SPM) is the process of estimating the spatial location of different land cover classes in a mixed pixel. This technique uses original hyperspectral image or the result of soft classification to predict the spatial distribution of endmembers estimated by spectral unmixing [46]. In 1997, the sub-pixel mapping principle, referring to near observations which are more similar than the most distant ones [8], was first introduced by Atkinson. In fact, sub-pixel mapping techniques are generally used to transform coarse spatial resolution fraction images into a high spatial resolution map based on predefined scale factor S. By applying these methods, each pixel is sub-divided into S 2 sub-pixels.

Sub-Pixel Mapping Model
To construct the subpixel mapping mapping model, we assumed that the abundances maps u contains M 1 rows and M 2 columns with C dimension (C designates the number of endmembers or classes). The low spatial resolution map u was transformed into a fine spatial resolution classified map X. The dimension of X is (S × M1)(S × M2). Each pixel in u can be transformed into S × S sub-pixels.
The first sub-pixel observation model for all abundances maps is shown in the equation below: u EX. When classes are treated separately, the model is written as follows: For a single pixel, u c denotes the abundance value of class c where E is a downsampling vector. The values of E are equal to 1 S 2 and x c is a row vector groups S 2 of sub-pixels equivalent to a single pixel in the low spatial resolution map.
Considering that the scale factor S is equal to 3 and admitting that for a single mixed pixel u = (0.50, 0.25, 0.25), then u 1 = 0.50, u 2 = 0.25 and u 3 = 0.25. The final result X is a column vector, with the binary value of X j is shown in Equation (1). . (2)

Method
In this work, the coarse spatial resolution hyperspectral image u is transformed into a high spatial resolution map X F . The proposed SMKSVD-TV shown in Figure 1 was applied following the steps described below: • A downsampling step: It was applied to high spatial resolution image in order to provide an image with coarse spatial resolution. Then, the low spatial resolution map is used as the input of spectral un-mixing step. • Generate the abundance maps u from the input hyperspectral image by applying an appropriate spectral unmixing algorithm (in our study, we used a fully-constraint Least Square algorithm). • Dictionary learning : Build a spatial dictionary D using K-SVD dictionary learning algorithm. • Sparse representation of image as a linear combination of dictionary atoms utilized to obtain a high spatial resolution map: X.

•
Apply the total variation regularization model to make edges smooth. The obtained result is a super-resolution map X F .

Dictionary Learning
In this section, the dictionary D is learned using q low spatial resolution signals of z. Figure 2 shows an example of dictionary learning process. S = 3 is the scale factor. The size of the patch is equal to 3 × 3. Our main objective is to obtain a sparse representation and one dictionary for all the abundances maps. In order to obtain z, we considered all abundances maps, generated in the spectral un-mixing step. Each map was decomposed into patches (S by S blocks or image patches), each of which is transformed into vector. The obtained vectors were then concatenated to form z, which is the input of dictionary learning process. • q is the number of small image patches. It is larger than k. • k is the number of atoms in the incomplete dictionary.
For relaxation, the p norm was fixed to l 1 norm so that the sparsity would be equal to 1.
• D is the dictionary shown in Figure 3.
q the number of signals. Each signal j has its own representation α j . To learn the dictionary D and optimize the sparse code A at the same time, the four following steps were applied: 1. Initialize the dictionary: k signals were picked randomly from q signals of z. 2. Sparse coding: Given the dictionary, the sparse code for each signal was obtained.
For every signal, the sparse code α j was provided.
3. Update the dictionary: As the sparse code is known, we update the dictionary D and the atoms one by one. We picked randomly one atom d a at a time and all the signals using it to make it more appropriate for these signals.
Er is the error, . F denotes the Frobenius norm, and d a designates an atom. 4. Repeat step 2 and step 3 until a prefix number of iterations is obtained.

Sparse Modeling of High Spatial Resolution Data Using the Dictionary
The input abundance maps presented by u is the low spatial resolution abundances maps containing M1 × M2 pixels with C endmembers or classes. In order to construct the suggested sub-pixel mapping model, u was converted into a fine spatial resolution classified map X. A pixel in the low spatial resolution abundance map is converted into S 2 sub-pixels. The Dictionary created at pixel level is used at sub-pixel level, as demonstrated in Figure 4. Because β is sparse, mixed pixel is built as a linear combination of few atoms from the learned dictionary D. These atoms were then utilized to provide the sub-pixel spatial distribution according to the following Equation (7): where β denotes the corresponding sparse coefficient, and u is the input fraction values. The final result X was calculated employing the sparse coefficient β and the dictionary D as follows: where: is the down-sampling matrix, and • D ∈ R S 2 ×k is the learned spatial dictionary, using K-SVD dictionary learning algorithm. D1, D2, D3, D4, D5, and D6 represent examples of the dictionary atoms shown in Figure 3. The size of each atom in the dictionary D is equal to S × S, where Di designates a spatial patch, and X corresponds to the (3 × 3 sub-pixels ) sub-pixel mapping result.
where u ∈ R M×C is the observation matrix (low spatial resolution abundances maps). Abundances maps u were obtained by applying soft classification (i.e., spectral unmixing). X is the high spatial resolution result of SPM in which a pixel was converted into a certain number of sub-pixels equal to S 2 in order to give more spatial details than those of the input abundances maps u. The size of u is M = M 1 × M 1 . M designates the number of pixels in the low spatial resolution image.

Spatial Regularization Using Isotropic Total Variation
X is the high spatial resolution map with non-smooth edges. The total variation was used as a spatial regularization term. The minimization Equation (9) is written as follows: In Equation (9), there are two basic components.
• The first component 1 2 X F − X 2 2 is called the error term. In our study, it relates two fine spatial resolution images.

•
The second component, named the prior regularization term U(X F ), is the main element used in the proposed formulation.
Equation (10) was applied to calculate the ITV of X F . To simplify the writing of X F , we replaced it by F.
X F is a discrete map in which the size is N 1 × N 2 . The pixel value in the row n1 and the column n2 was determined by x c [n1, n2]. The ITV(X F ) = ITV(F) is the isotropic total variation computed using pixels values in horizontal and vertical direction. Obviously, minimizing the isotropic TV yielded nice and smooth edges. An important parameter utilized Equation (10) is the regularization parameter λ TV > 0. It takes a positive value and controls the balance between two components: data fidelity and spatial prior term. The latter was calculated as a function of the isotropic total variation until it converges to a single optimal solution:X F . The total variation was derived by computing the gradient in horizontal ∇ h F and vertical ∇ v F directions to better detect edges and spatial details at the sub-pixel scale.
The gradient of X F was computed as follows:

Experimental Results
To measure the performance of the proposed approach (SMKSVD-ITV), we tested it on three types of hyperspectral image. The first image is rural which is Jasper ridge hyperspectral data; the second image is urban which is HYDICE urban and the third image represents the Pavia university hyperspectral data. For qualitative analysis, our approach was compared to three other methods proposed in the literature. The two first were introduced in Reference [33]. They are based on the MAP model and Winner takes all class determination strategy. The first method applies the total variation as a regularization model (AMCDSM-TV), while the second one uses the Laplacian model (AMCDSM-L). The third method is based on spare modeling and DCT dictionary (ASSM-TV).
For each data set, three parameters were calculated: the overall accuracy, the average accuracy and the Kappa coefficient. Most of the SPM methods utilize the abundances maps (results of spectral un-mixing step) as input data. The spectral un-mixing algorithm applied in our experiment is the Fully Constraint Least Square (FCLS) [47]. For each hyperspectral image, the original high spatial resolution images were classified by the Spectral Angle Mapper Algorithm for Remote Sensing Image Classification (SAM) as the reference image. Then, a downsampling process was used to transform the high spatial resolution image into low spatial resolution data, with the existence of mixed pixels. Afterwards, a spectral un-mixing step was applied to extract endmembers and their abundances fractions. Then, three study areas were selected to evaluate the performance of the proposed method.

Example 1: Jasper Ridge Hyperspectral Image
The utilized first hyperspectral image is the Jasper ridge composed by 100 × 100 pixels and 198 spectral bands showed in Figure 5a. The downsampling applied to the image with high spatial resolution provided an image with coarse spatial resolution, as demonstrated in Figure 5a. After the spectral unmixing shown in Figure 6, four classes (road, soil, water, and tree) were identified. Figure 5d reveals the impact of using a pre-built spatial dictionary on the final solution. To measure the sub-pixel mapping performance, we calculated the confusion matrix presented in Table 1 which summarizes the relation between the predicted class and the actual one. From this matrix, the overall accuracy, the average accuracy and omission errors ( Table 2) and kappa index ( Table 3) were calculated and compared with those provided by three other existing methods which use total variation or laplacian as a regularization term (Figure 5e-g).

Quantitative Analysis
The calculated parameters demonstrate the good performance of the pre-constructed learned dictionary.
Omission error ( Table 2) refers to the pixels omitted from the right classes. We can see that our method gave the minimum values of omission error for all classes of images.

Sensitivity Analysis
A sensitivity analysis was carried out by changing the value of the regulation parameter λ. The experiment shows an optimal value of l which is equal to λ which is λ = 0.0019.   Table 4 presents the optimal values of λ: 4 × 10 −3 , 10 −4 , 10 −3 , and 19 × 10 −4 for AMCDSM-TV, AMCDSM-L, ASSM-TV, and the SMKSVD-TV method, respectively.

Example 2: Pavia University
Pavia University hyperspectral image acquired using ROSIS sensor contains 610 × 340 pixels. The number of spectral bands is 103 with a spectral coverage ranging from 0.43 to 0.86 nm and a spatial resolution of 1.3 m per pixel. The original data set contains 9 different classes of land cover. As shown in Figure 8a, in the performed experiments, we used only 300 × 300 pixels with 8 classes: shadow, asphalt, meadows, gravel, trees, bare soil, bitumen, and brick. The original data contains only pure pixels. To create mixed pixels, the image was down-sampled using a scale factor equal to 3. We obtained 100 × 100 pixels with a spatial resolution of 14.   Sub-pixel mapping results provided applying AMCDSM-TV, AMCDSM-L, and ASSM-TV methods are described in Figure 8b-d. Figure 8e shows sub-pixel mapping result obtained by our proposed SMKSVD-TV method. From Table 5, we can see that SMKSVD-TV outperforms AMCDSM-TV, AMCDSM-L, and ASSM-TV by about 0.51% in overall accuracy and by 2.03% in Kappa coefficient. It is also obvious, from Table 6, that the introduced method has a better Overall Accuracy value (86.55%) and kappa value (56.44%). From Tables 6 and 7, we can conclude that SMKSVD-TV has the lowest omission and commission error values, compared with AMCDSM-TV, AMCDSM-L, and ASSM-TV for all land cover classes.

Example 3: Hydice Urban Hyperspectral Data
The last experiment was performed with HYDICE urban image. It contains 300 × 300 pixels and 187 bands.The number of the detected land cover classes is six: two types of roads, two types of roofs, grass, and trees. The first step of the developed approach is to degrade the spatial resolution of the high resolution image by a factor of three. Figure 10a represents the image low spatial resolution with 100 × 100 pixels. The second step consists in generating the abundances maps of the different classes using Fully Constraint Least Square algorithm, and the results are identified in Figure 11. Figure 10f illustrates the result provided applying SMKSVD-TV.
From the confusion matrix (Table 8), we calculate: the Kappa coefficient, the overall accuracy and the average accuracy ( Table 9). The Kappa coefficient (Table 9) showed that our mapping is almost 51.26% better than the random classification of sub-pixels. This difference between the overall (86.46%) accuracy and the kappa coefficient (51.26%) for the HYDICE urban hyperspectral data set is due to the omission error which is high for particular classes (Roof1 and Roof2).
From Tables 10 and 11, we can conclude that SMKSVD-TV has the lowest omission and commission error values, compared with AMCDSM-TV, AMCDSM-L, and ASSM-TV for all land cover classes. From Table 12, you can see the accuracies of each class.   The classification image obtained by SAM is used to evaluate the performance of the SMKSVD-TV approach as a reference card shown in Figure 10b with corresponding thematic classes. Figure 10c-f display results of AMCDSM-TV, AMCDSM-L, ASSM-TV, and our proposed SMKSVD-TV.
The classification image obtained by SAM was used to evaluate the performance of the SMKSVD-TV approach as a reference card shown in Figure 10b with corresponding thematic classes. Figure 10c-f display the results obtained by applying AMCDSM-TV, AMCDSM-L, ASSM-TV, and our proposed SMKSVD-TV. The comparison of the overall accuracy provided by all the existing methods (Table 9) shows that the performance of our technique is the best. For instance, our approach has an accuracy rate of 86.46%.
The graph below ( Figure 12) summarizes the detected omission error rates for the different classes of the HYDICE urban hyperspectral image. The yellow color shows the rates provided by applying our proposed approach. Roof1 and Roof 2 classes have the highest error rates, compared to the other classes. In the confusion matrix, we notice that the error rates resulted mainly from the asphalt road class which merged with Roof 2, concrete road, and grass. This confusion was due to errors in the spectral un-mixing process.

Discussion
The line graph ( Figure 13) depicts the change of the overall accuracy rate depending on the variable λ. Four representative curves were gathered in the same graph, showing a comparative study of our approach with the three other existing methods (AMCDSM-TV, AMCDSM-L, ASSM-TV). The red curve represents the accuracy rates obtained applying the SMKSVD-TV, while the others show those provided by the three other techniques. It is clear that the overall accuracy increases sharply until it reaches a particular value. The optimal overall accuracy rates are equal to 4 × 10 −3 , 10 −4 , 10 −3 , and 19 × 10 −4 for AMCDSM-TV, AMCDSM-L, ASSM-TV, SMKSVD-TV, respectively. The mentioned value was achieved at an optimum lambda value. After that, each line shows a decreasing overall accuracy as the value of λ rises. Despite the fact that the different methods gave curves with similar trends, the curve obtained by applying our proposed SMKSVD-TV stands out among the other ones thanks to a very high optimum overall accuracy. Besides, the overall accuracy in the SMKSVD-TV method is always the highest whatever the value of lambda is. Table 13 illustrates the average of the overall accuracy obtained using different sets of data and applying various methods. The experimental study uses three data sets (Jasper ridge, Pavia university, and urban hyperspectral images) to show the overall accuracy for each of the already-mentioned techniques. A comparative study of the different employed methods and data sets proves that the SMKSVD-TV technique provided the highest overall accuracy, compared to the three existing method, even when changing the used data.  This rate is 86.95%, for Jasper ridge hyperspectral image; 86.46%, for Urban image; and 86.55% for Pavia university. With regard to the average of the overall accuracy equal to 86.65% for the different data sets, the proposed SMKSVD-TV method gave the best value (86.65). This average is 85.07%, 84.26%, and 83.67% for AMCDSM-TV, AMCDSM-TV, and ASSM-TV, respectively. The above results attest that the SMKSVD-TV is very efficient in terms of overall accuracy.

Conclusions
In this work, we suggested a novel sub-pixel mapping algorithm based on a preconstructed spatial dictionary, built using K-SVD dictionary learning algorithm, and total variation as a regularization term. This algorithm was applied to regularize the ill-posed sub-pixel mapping issue and to further enhance the efficiency of sub-pixel mapping. The SMKSVD-TV approach was used, in the performed experiments, to convert the sub-pixel mapping problem into a regularization problem and integrate the isotropic total variation as a prior model applied to the abundance maps. To check the efficacy of the sub-pixel mapping approach, we carried out three experiments using the actual hyperspectral images and compared the proposed algorithm to several typical sub-pixel mapping algorithms. The experimental findings indicated that the efficiency of the introduced algorithm is better qualitatively and quantitatively than that of the traditional methods. Further analysis must then be done to choose the regularization parameters in an adaptive way. The code and test data are available by contacting the authors. Future research will focus on making an adaptive parameters selection to enhance the proposed SPM method by incorporating a CNN (Convolutional Neural Networks)-like or other option.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The