Automatic Color Correction for Multisource Remote Sensing Images with Wasserstein CNN

In this paper a non-parametric model based on Wasserstein CNN is proposed for color correction. It is suitable for large-scale remote sensing image preprocessing from multiple sources under various viewing conditions, including illumination variances, atmosphere disturbances, and sensor and aspect angles. Color correction aims to alter the color palette of an input image to a standard reference which does not suffer from the mentioned disturbances. Most of current methods highly depend on the similarity between the inputs and the references, with respect to both the contents and the conditions, such as illumination and atmosphere condition. Segmentation is usually necessary to alleviate the color leakage effect on the edges. Different from the previous studies, the proposed method matches the color distribution of the input dataset with the references in a probabilistic optimal transportation framework. Multi-scale features are extracted from the intermediate layers of the lightweight CNN model and are utilized to infer the undisturbed distribution. The Wasserstein distance is utilized to calculate the cost function to measure the discrepancy between two color distributions. The advantage of the method is that no registration or segmentation processes are needed, benefiting from the local texture processing potential of the CNN models. Experimental results demonstrate that the proposed method is effective when the input and reference images are of different sources, resolutions, and under different illumination and atmosphere conditions.


Introduction
Large-scale remote sensing content providers aggregate remote sensing imagery from different platforms, providing a vast geographical coverage with a range of spatial and temporal resolutions.One of the challenges is that the color correction task becomes more complicated due to the wide difference in viewing angles, platform characteristics, and light and atmosphere conditions (see Figure 1).For further processing purposes, it is often desired to perform color correction to the images.Histogram matching [1,2] is a cheap way to address this when a reference image with no color errors is available that shares the same coverage of land and reflectance distribution.
To gain a deeper insight, first we would like to place histogram matching in a broader context as the simplest form of color matching [3].These methods try to match the color distribution of the input images to a reference, also known as color transferring.They can either work by matching low order statistics [3][4][5] or by transferring the exact distribution [6][7][8].Matching the low order statistics is sensitive to the color space selected [9].The performances of both methods are highly related to the similarity between the contents of the input and the reference.Picking an appropriate reference requires manual intervention and may become the bottle neck for processing.A drawback of such methods is that the colors on the edges of the targets would be mixed up [10][11][12].Methods exploiting the spatial information were proposed to migrate the problem, but segmentation, spatial matching, and alignment are required [13,14].Matching the exact distribution is not sensitive to the color space selection, but has to work in an iterative fashion [8].Both the segmentation and the iteration increase the computation burden and are not suitable for online viewing and querying.For video and stereo cases, extra information from the correlation between frames can be exploited to achieve better color harmony [15,16].The holography method is introduced into color transfer to eliminate the artifacts [17].Manifold learning is an interesting framework to find the similarity between the pixels, so that the output color can be more natural and it can suppress the color leakage as well [18].Another perspective to comprehend the problem is image-to-image translation.Convolutional neural networks have proven to be successful for such applications [19], for example, the auto colorization of grayscale images [20,21].Recently, deep learning shows its potential and power in hyper-spectral image understanding applications [22].
Remote Sens.2017, 9, 483 2 of 16 is sensitive to the color space selected [9].The performances of both methods are highly related to the similarity between the contents of the input and the reference.Picking an appropriate reference requires manual intervention and may become the bottle neck for processing.A drawback of such methods is that the colors on the edges of the targets would be mixed up [10][11][12].Methods exploiting the spatial information were proposed to migrate the problem, but segmentation, spatial matching, and alignment are required [13,14].Matching the exact distribution is not sensitive to the color space selection, but has to work in an iterative fashion [8].Both the segmentation and the iteration increase the computation burden and are not suitable for online viewing and querying.For video and stereo cases, extra information from the correlation between frames can be exploited to achieve better color harmony [15,16].The holography method is introduced into color transfer to eliminate the artifacts [17].Manifold learning is an interesting framework to find the similarity between the pixels, so that the output color can be more natural and it can suppress the color leakage as well [18].Another perspective to comprehend the problem is image-to-image translation.Convolutional neural networks have proven to be successful for such applications [19], for example, the auto colorization of grayscale images [20,21].Recently, deep learning shows its potential and power in hyper-spectral image understanding applications [22].Unfortunately, for large-scale applications, it is too strict a requirement that the whole reflectance distribution should be the same between the reference image and the ones to be processed.As a result, such reference histograms are usually not available and have greatly restricted the applications of these sample-based color matching methods.In [23] the authors choose a color correction plan that minimizes the color discrepancy between it and both the input image and the reference image.This is a good solution in stitching applications.However, the purpose of this paper is to eliminate the errors raised by atmosphere, light, etc., so that the result can be further employed in ground reflectance retrieval or atmosphere parameters retrieval.We hope that the output is as close as possible to the reference images, rather than modifying the ground truth values as in [23].Since it is usually infeasible to find such a reference, a natural question is, can we develop a universal function which can automatically determine the references directly according to the input images?Once this function is obtained, we can combine it with simple histogram matching or other color Unfortunately, for large-scale applications, it is too strict a requirement that the whole reflectance distribution should be the same between the reference image and the ones to be processed.As a result, such reference histograms are usually not available and have greatly restricted the applications of these sample-based color matching methods.In [23] the authors choose a color correction plan that minimizes the color discrepancy between it and both the input image and the reference image.This is a good solution in stitching applications.However, the purpose of this paper is to eliminate the errors raised by atmosphere, light, etc., so that the result can be further employed in ground reflectance retrieval or atmosphere parameters retrieval.We hope that the output is as close as possible to the reference images, rather than modifying the ground truth values as in [23].Since it is usually infeasible to find such a reference, a natural question is, can we develop a universal function which can automatically determine the references directly according to the input images?Once this function is obtained, we can combine it with simple histogram matching or other color transfer methods into a very powerful algorithm.In this paper, a Wasserstein CNN model is built to infer the reference histograms for remote sensing image color correction applications.The model is completely data driven, and no registration or segmentation is needed in both the training phase and the inferring phase.Besides, as will be explained in Section 2, the input and the reference can be of different scales and sources.In Section 2, the details of the proposed method are elaborated in an optimal transporting framework [24,25].In Section 3, the experiments are conducted to validate the feasibility of the proposed method, in which images from the GF1 and GF2 satellites are used as the input and the reference datasets accordingly.Section 4 comprises the discussions and comparisons with other color matching (correcting) methods.And finally, Section 5 gives the conclusion and points out our future works.

Analysis
Given an input image I and a reference image I with N c channels, an automatic color matching algorithm aims to alter the color palette of I to that of I , the reference.Some of the algorithms require that the reference image is known, which are called sample-based methods.Of course an ideal algorithm should work without knowing I .The matching can be operated either in the N c -dimensional color space at once, or in each dimension separately [8,26].The influence of the light and the atmosphere conditions and other factors can be included into a function h(I , x, y) that acts on the grayscale value of the pixel located at (x, y).Under such circumstances, the problem is converted to learning an inverse transfer function f (I, x, y) that maps the grayscale values of the input image I back to that of the reference image I , where (x, y) denotes the location of the target pixel inside I.
When the input image is divided into patches that each possess a relatively small geographical coverage, the spatial variance of the color discrepancy inside each patch is usually small enough to be neglected.Thus h(I , x, y) should be the same with h(I , x , y ) as long as (x, y) and (x , y ) share the same grayscale values.Let u x,y and v x,y be the grayscale values of the pixels located at (x, y) in I and I accordingly, and h(I , x, y) can be rewritten as h(I , v x,y ), because the color discrepancy function is not related to the location of the pixel but only to its value.The three assumptions of the transformation from the input images to the reference images are made as follows, and some properties which f should satisfy can be derived from them.
Assumption 1 suggests that when two pixels in I have the same grayscale value, so do the corresponding pixels in I.This assumption is straight forward since in general cases the cameras are well calibrated and the inhomogeneity of light and atmosphere is usually small within a small geographical coverage.It is true that when severe sensor errors occur this assumption may not hold, however that is not the focus of this paper.
Assumption 2 indicates that when two pixels in I have the same grayscale, so are their corresponding pixels in I .The assumption is based on the fact that the pixel value the sensor recorded is not related to its context or location, but only to its raw physical intensity.
Assumption 3 implies that the transformation is order preserving, or a brighter pixel in I should also be brighter in I , and vice versa.
According to the above assumptions, we expect the transfer function f to possess the following properties.Property 1: u x,y = u x ,y ⇒ f (I, u x,y ) = f (I, u x ,y ) Property 2: u x,y > u x ,y ⇔ f (I, u x,y ) > f (I, u x ,y ) , or f is order-preserving Property 3: Consider that even when two pixels inside I 1 and I 2 share the same grayscale values, the corrected values can still be different according to their ground truth values in the references.Property 3 is to say that f should be content related.In other words, for different input images, the transfer function values should be different to maintain the content consistency.To better explain the point, consider that two input images having different contents, the grassland and the lake so to speak, happen to be of similar color distributions.The pixel in the lake should be darker and the other pixel in the grassland should be brighter in the corresponding reference images.If f is only related to the grayscale values while discarding the input images (the contexts of the pixels), this cannot be done because similar pixels in different input images have to be mapped to similar output levels.
An issue to take into account is whether the raw image or its histogram of the input and reference images should be made use of for the matching.Table 1 lists all possible cases, each of which will be discussed.Scheme A is the case when both the input and reference are histograms, and this is essentially histogram matching.Many previous studies employ this scheme for simplicity, for example, histogram matching and low order statistics matching in various color spaces.Since histograms do not contain the content information, the corresponding histogram matching is not content related.Concretely speaking, two pixels that belong to two regions with different contents but with the same grayscale fall into the same bin of the histogram, and have to be assigned to the same grayscale value in the output image, which does not meet Property 3. In order for one distribution with different contexts to be correctly matched to different corresponding distributions, we cannot enclose different transformations in one unified mapping (see Figure 2).This should not be appropriate for large scale datasets that demand a high degree of automation.
Scheme B corresponds to the case where both the input and output are images, which is usually referred to as image to image translation.The image certainly contains much more information than its histogram, thus providing a possibility that the mapping is content related.Although Property 3 can be satisfied, this scheme emphasizes the content of the image, and the consequence is that the pixels with same grayscales may be mapped to different grayscales as their contexts could be different, and in this case Property 1 is violated (see Figure 3).Matching algorithms of "scheme A" take both input and reference in the form of histograms.As this scheme is not content related, two similar distributions with different contexts could be not be mapped to their corresponding reference with one unified mapping.

Figure 3.
Matching algorithms of "scheme B" take both input and reference in the form of images.Similar distributions could be mapped to different corresponding references, as the scheme is content based.However, the same grayscales could be mapped to different grayscales when they are in different contexts, violating Property 1.
Scheme C is the case where the input is an image and the output is a histogram.As mentioned above, scheme A does not satisfy Property 3 because the context of the image is not used, while scheme B violates Property 1. Mapping one image to another, with constraints that the pixels with the same grayscales also have the same grayscale values in the output, is essentially a grayscale to grayscale transforming process.Under such circumstances, the output of scheme B is always equivalent to that of scheme C. Since scheme C automatically possesses Properties 1 and 3, the task has been now converted to devise the algorithm so that it possesses Property 2 as well (see Figure 4).The task is addressed under an optimal transporting framework, which will be elaborated in Section 2.2.Matching algorithms of "scheme A" take both input and reference in the form of histograms.As this scheme is not content related, two similar distributions with different contexts could be not be mapped to their corresponding reference with one unified mapping.

Figure 2.
Matching algorithms of "scheme A" take both input and reference in the form of histograms.As this scheme is not content related, two similar distributions with different contexts could be not be mapped to their corresponding reference with one unified mapping.

Figure 3.
Matching algorithms of "scheme B" take both input and reference in the form of images.Similar distributions could be mapped to different corresponding references, as the scheme is content based.However, the same grayscales could be mapped to different grayscales when they are in different contexts, violating Property 1.
Scheme C is the case where the input is an image and the output is a histogram.As mentioned above, scheme A does not satisfy Property 3 because the context of the image is not used, while scheme B violates Property 1. Mapping one image to another, with constraints that the pixels with the same grayscales also have the same grayscale values in the output, is essentially a grayscale to grayscale transforming process.Under such circumstances, the output of scheme B is always equivalent to that of scheme C. Since scheme C automatically possesses Properties 1 and 3, the task has been now converted to devise the algorithm so that it possesses Property 2 as well (see Figure 4).The task is addressed under an optimal transporting framework, which will be elaborated in Section 2.2.Scheme C is the case where the input is an image and the output is a histogram.As mentioned above, scheme A does not satisfy Property 3 because the context of the image is not used, while scheme B violates Property 1. Mapping one image to another, with constraints that the pixels with the same grayscales also have the same grayscale values in the output, is essentially a grayscale to grayscale transforming process.Under such circumstances, the output of scheme B is always equivalent to that of scheme C. Since scheme C automatically possesses Properties 1 and 3, the task has been now converted to devise the algorithm so that it possesses Property 2 as well (see Figure 4).The task is addressed under an optimal transporting framework, which will be elaborated in Section 2.2.The scheme of type D corresponds to the case where the input is the histogram and the output is the image.Since it is nearly impossible to determine a transformation mapping of a histogram to an image, we do not take this case into consideration.

Optimal Transporting Perspective of View
Denote u and v as the input and the reference color distributions, then mapping that transforms u to v. The total cost of ( ) , T u v can be defined as ( ) where ( )  ,  d x y , the p-order Wasserstein distance can be defined as [25,27]: Finding the transformation ( ) , T u v that minimizes the total cost ( ) , C u v is known as the Monge's optimal transportation problem, or the MK problem.The solution to the problem is the gradient of some convex function [25,27,28]: Specifically in one dimensional cases, this statement is equivalent to monotonicity, as consequence meets Property 2.
For high dimensional problems, the solution of the MK problem is intractable.In this paper, the distributions of the c N channels are matched separately.The Wasserstein distance between the inferred values and the ideal values can be calculated in the following way: first sort the pixels on a 1-D axis, and then calculate the distance between each pair of inferred pixels and the ideal pixels The scheme of type D corresponds to the case where the input is the histogram and the output is the image.Since it is nearly impossible to determine a transformation mapping of a histogram to an image, we do not take this case into consideration.

Optimal Transporting Perspective of View
Denote u and v as the input and the reference color distributions, then T : R N c → R N c is a mapping that transforms u to v. The total cost of T(u, v) can be defined as C(u, v) [25][26][27]: where c(x, y) is the cost of transporting one unit of mass from x to y, and π(u, v) is the joint probability measure of R N c + × R N c + , having u and v as its marginal distributions.Again, N c indicates the number of color channels and Π(u, v) is the collection of every feasible π(u, v).
When c(x, y) is defined as a distance d(x, y), the p-order Wasserstein distance can be defined as [25,27]: Finding the transformation T(u, v) that minimizes the total cost C(u, v) is known as the Monge's optimal transportation problem, or the MK problem.The solution to the problem is the gradient of some convex function [25,27,28]: Specifically in one dimensional cases, this statement is equivalent to monotonicity, as consequence meets Property 2.
For high dimensional problems, the solution of the MK problem is intractable.In this paper, the distributions of the N c channels are matched separately.The Wasserstein distance between the inferred values and the ideal values can be calculated in the following way: first sort the pixels on a 1-D axis, and then calculate the distance between each pair of inferred pixels and the ideal pixels accordingly.This is equivalent to using a stacked histogram (see Figure 5).The Wasserstein distance when p equals 2 can be formulated as: , where f is the cumulative frequency (4) Remote Sens.2017, 9, 483 7 of 16 accordingly.This is equivalent to using a stacked histogram (see Figure 5).The Wasserstein distance when p equals 2 can be formulated as: , where f is the cumulative frequency (4) Figure 5. Calculation method of the Wasserstein distance between the inferred histograms and the ground-truth reference.STEP 1: stack the histograms on the frequency axis; STEP 2: subtract the stacked histograms, and integrate with respect to the cumulative frequency.

The Model Structure
The transformation can be fitted by a CNN model, where the Wasserstein distance plays the role of the loss function.To reduce the memory and computation burden, we used a modified version of Squeeze-net v1.1 [29] (see Figures 6 and 7).In this section we will first introduce the basic modules and then go on to state the major modifications.

The Model Structure
The transformation can be fitted by a CNN model, where the Wasserstein distance plays the role of the loss function.To reduce the memory and computation burden, we used a modified version of Squeeze-net v1.1 [29] (see Figures 6 and 7).In this section we will first introduce the basic modules and then go on to state the major modifications.accordingly.This is equivalent to using a stacked histogram (see Figure 5).The Wasserstein distance when p equals 2 can be formulated as: , where f is the cumulative frequency (4) Figure 5. Calculation method of the Wasserstein distance between the inferred histograms and the ground-truth reference.STEP 1: stack the histograms on the frequency axis; STEP 2: subtract the stacked histograms, and integrate with respect to the cumulative frequency.

The Model Structure
The transformation can be fitted by a CNN model, where the Wasserstein distance plays the role of the loss function.To reduce the memory and computation burden, we used a modified version of Squeeze-net v1.1 [29] (see Figures 6 and 7).In this section we will first introduce the basic modules and then go on to state the major modifications.

Basic Modules
The Squeeze-net is a light-weight convolutional neural network.The basic modules of the squeeze-net are called the "fire" modules [29], and each consists of two convolution layers, the "squeeze" layer and the "expand" layer.The kernels in the "squeeze" layers are all of 1 × 1 sizes to maximally lessen the parameters inside the model and reduce the computational burden.Two types of kernels, 1 × 1 and 3 × 3 filters, comprise the "expand" layer.The "fire" modules prove to be computationally efficient, and also make the network less likely to be over fitted, as it "squeezes" the amount of parameters to a much smaller scale.In our experiment, the final global average pooling layer and the softmax layer of the squeeze-net was removed, and the rest of the parts were used to extract the features from the raw input images.

The Multi-Scale Concatenation and the Histogram Predictors
As stated in Section 2.3.1, we used a modified version of Squeeze-net to extract features from the input images.The layers at different levels in the CNN model extract features at different scales, and each level has its own characteristics.In general, the former layers in the CNN model are more associated with the raw pixels, while the latter ones are more meaningful in semantic senses [30,31].Besides, the scales of the former feature maps are also different from the latter ones.
To utilize the information from different scales and semantic levels, we used a concatenating structure.In order for the feature maps to be concatenated, average pooling and deconvolution operations were applied to resize them to a unified shape (27 × 27).All the padding modes in the pooling layers were "valid", so that the residual parts which could not fill up the pooling kernel were

Basic Modules
The Squeeze-net is a light-weight convolutional neural network.The basic modules of the squeeze-net are called the "fire" modules [29], and each consists of two convolution layers, the "squeeze" layer and the "expand" layer.The kernels in the "squeeze" layers are all of 1 × 1 sizes to maximally lessen the parameters inside the model and reduce the computational burden.Two types of kernels, 1 × 1 and 3 × 3 filters, comprise the "expand" layer.The "fire" modules prove to be computationally efficient, and also make the network less likely to be over fitted, as it "squeezes" the amount of parameters to a much smaller scale.In our experiment, the final global average pooling layer and the softmax layer of the squeeze-net was removed, and the rest of the parts were used to extract the features from the raw input images.

The Multi-Scale Concatenation and the Histogram Predictors
As stated in Section 2.3.1, we used a modified version of Squeeze-net to extract features from the input images.The layers at different levels in the CNN model extract features at different scales, and each level has its own characteristics.In general, the former layers in the CNN model are more associated with the raw pixels, while the latter ones are more meaningful in semantic senses [30,31].Besides, the scales of the former feature maps are also different from the latter ones.
To utilize the information from different scales and semantic levels, we used a concatenating structure.In order for the feature maps to be concatenated, average pooling and deconvolution operations were applied to resize them to a unified shape (27 × 27).All the padding modes in the pooling layers were "valid", so that the residual parts which could not fill up the pooling kernel were discarded.The strides and kernel sizes within each pooling layer were the same.All the resized shapes were 27 × 27, except for the input, whose output was 28 × 28.Its last row and column were trimmed in order to be consistent with the other tensors to be concatenated.The concatenated feature maps were then flattened into a 2-dimensional tensor of 725,355 length, and then was fed into three fully-connected layers separately, one for each channel (blue, green, and red).The fully-connected layer was then attached by a softmax head each to infer the corrected color distribution.

Data Augmentation
Data augmentation was performed on the original inputs to avoid over fitting as well as to enclose more patterns of color discrepancy into the model.The augmentation operations include: 1.
Random cropping: A patch of 227 × 227 is cropped at a random position from each 256 × 256 sample.It is worth noting that this implies that no registration is needed in the training process.2.
Random flipping: Each sample in the input batch is randomly horizontally and vertically flipped by a chance of 50%.

3.
Random color augmentation: The brightness, saturation, and gamma values of the input color are randomly shifted.Small perturbations are added to each color channel.Figure 8 shows an example of such transformation of the color distribution.
discarded.The strides and kernel sizes within each pooling layer were the same.All the resized shapes were 27 × 27, except for the input, whose output was 28 × 28.Its last row and column were trimmed in order to be consistent with the other tensors to be concatenated.The concatenated feature maps were then flattened into a 2-dimensional tensor of 725,355 length, and then was fed into three fully-connected layers separately, one for each channel (blue, green, and red).The fully-connected layer was then attached by a softmax head each to infer the corrected color distribution.

Data Augmentation
Data augmentation was performed on the original inputs to avoid over fitting as well as to enclose more patterns of color discrepancy into the model.The augmentation operations include:

Algorithm Flow Chart
The entire model can be trained in an end-to-end fashion with the gradient descent algorithm, as displayed in Algorithm 1.

Algorithm Flow Chart
The entire model can be trained in an end-to-end fashion with the gradient descent algorithm, as displayed in Algorithm 1 ( Algorithm flow of the training process).

Results
We had our algorithm evaluated with satellite images from GF1 and GF2 that cover the same areas.The GF2 images were chosen as the reference.The parameters of the data are listed in Table 2.The direct outputs of WCNN are the inferred distributions (or histograms, see Figure 9) based on the contents of the input images.The corrected images are obtained by histogram matching (see Figure 10).The reference images are only used in the training process and are unnecessary in practical applications, as the purpose of the WCNN model is to generate the reference histogram when there are no available ones.It is worth noting that the patches were only roughly sliced according to the longitude and the latitude information within the GeoTIFF files, so registration was not necessary, and neither was pre-segmentation.

Results
We had our algorithm evaluated with satellite images from GF1 and GF2 that cover the same areas.The GF2 images were chosen as the reference.The parameters of the data are listed in Table 2.The direct outputs of WCNN are the inferred distributions (or histograms, see Figure 9) based on the contents of the input images.The corrected images are obtained by histogram matching (see Figure 10).The reference images are only used in the training process and are unnecessary in practical applications, as the purpose of the WCNN model is to generate the reference histogram when there are no available ones.It is worth noting that the patches were only roughly sliced according to the longitude and the latitude information within the GeoTIFF files, so registration was not necessary, and neither was pre-segmentation.

Comparison between KL Divergence and Wasserstein Distance
As has been mentioned in Section 2.2, the Wasserstein distance is a natural choice to represent the difference between two color distributions.The Kullback-Leibler divergence (also known as KL divergence) is another commonly used measure (but not a metric) in such circumstances.The definition of KL divergence [27] is: and the definition of 2-Wasserstein distance is:

Comparison between KL Divergence and Wasserstein Distance
As has been mentioned in Section 2.2, the Wasserstein distance is a natural choice to represent the difference between two color distributions.The Kullback-Leibler divergence (also known as KL divergence) is another commonly used measure (but not a metric) in such circumstances.The definition of KL divergence [27] is: and the definition of 2-Wasserstein distance is: x − y 2 dπ(x, y) Consider two simple distributions, u 1 ∼ U(−0.5, 0.5) and u 2 ∼ U(−0.5 + a, 0.5 + a), as shown in Figure 11.The Kullback-Leibler divergence should be: And the Wasserstein distance is: Because both the Wasserstein metric and the KL divergence are fully differentiable, there is no difference in the back-propagation pipeline between the two losses.From the above discussion, however, we could see that the Wasserstein distance is more numerically stable compared to the KL divergence., as shown in Figure 11.The Kullback-Leibler divergence should be: ( ) And the Wasserstein distance is: Because both the Wasserstein metric and the KL divergence are fully differentiable, there is no difference in the back-propagation pipeline between the two losses.From the above discussion, however, we could see that the Wasserstein distance is more numerically stable compared to the KL divergence.

Connection and Comparison with Other Color Matching Methods
Histogram matching can be regarded as the simplest case of color matching.It is widely used in seamless mosaic workflows.The method requires that a reference image is selected for each input, which certainly puts restriction on the applications with large scale datasets.Wasserstein CNN is able to directly predict the corrected color distribution, and the histogram matching is the final step in the workflow of our proposed method (but not the only choice, other sample-based color matching methods would also do).
Matching low order statistics faces similar problems.Its performance is closely related to the similarity between the input images and the reference images.To handle low similarity cases, the images may have to be segmented and the color needs to be transferred part to part.Besides, for images with complex contents, color leakage on the edges could be a problem, and the image quality will degrade.Considering these restrictions, such methods may not be appropriate for automatic color matching in remote sensing applications.Matching the exact distribution is more precise than just matching the low order statistics, but is also more complex and computationally expensive.To match two non-Gaussian distributions, iterative approaches have to be exploited, as there are no closed-form solutions [8].The Wasserstein CNN method is non-iterative, and is more suitable for large scale processing.
Poisson image editing (PIE) is another well-known color matching method.Rather than directly matching the color distributions, the PIE method tries to preserve the gradients of the input image

Connection and Comparison with Other Color Matching Methods
Histogram matching can be regarded as the simplest case of color matching.It is widely used in seamless mosaic workflows.The method requires that a reference image is selected for each input, which certainly puts restriction on the applications with large scale datasets.Wasserstein CNN is able to directly predict the corrected color distribution, and the histogram matching is the final step in the workflow of our proposed method (but not the only choice, other sample-based color matching methods would also do).
Matching low order statistics faces similar problems.Its performance is closely related to the similarity between the input images and the reference images.To handle low similarity cases, the images may have to be segmented and the color needs to be transferred part to part.Besides, for images with complex contents, color leakage on the edges could be a problem, and the image quality will degrade.Considering these restrictions, such methods may not be appropriate for automatic color matching in remote sensing applications.Matching the exact distribution is more precise than just matching the low order statistics, but is also more complex and computationally expensive.To match two non-Gaussian distributions, iterative approaches have to be exploited, as there are no closed-form solutions [8].The Wasserstein CNN method is non-iterative, and is more suitable for large scale processing.
Poisson image editing (PIE) is another well-known color matching method.Rather than directly matching the color distributions, the PIE method tries to preserve the gradients of the input image and matches the pixel values on the border to those in the reference image.The problem is equivalent to solving a Poisson equation.However, in our case, this idea might not be very appropriate, because the gradients between the input image and the reference image can be very different, especially when the atmosphere visibility is low (see the PIE result in Figure 12).
Comparisons between the color matching methods are displayed in Figure 12.The ground truth was not included in the training set, as it was supposed as an unknown in the color matching problem.Because the PIE, statistics transferring, and the histogram matching methods are all sample-based, an image must be selected from the training set to act as the reference.However, all that the WCNN model needs is the input image, thus it can operate without selecting such a reference.As the reference is not likely to be exactly the same as the ground truth, we can see the color discrepancy between the output and the ground truth in the results of PIE, statistics transferring, and histogram matching in Figure 12.Also, several features and descriptors were computed for all input images, output images, and the ground truth images in the test set, including the Oriented FAST and Rotated BRIEF (ORB) descriptor, the Scale-Invariant Feature Transform (SIFT) descriptor, and the Binary Robust Invariant Scalable Keypoints (BRISK) descriptor.To be a representation of similarity, the distances between the features of the output and the ground truth are computed, and are displayed in the boxplots in Figure 13.
Remote Sens.2017, 9, 483 13 of 16 and matches the pixel values on the border to those in the reference image.The problem is equivalent to solving a Poisson equation.However, in our case, this idea might not be very appropriate, because the gradients between the input image and the reference image can be very different, especially when the atmosphere visibility is low (see the PIE result in Figure 12).Comparisons between the color matching methods are displayed in Figure 12.The ground truth was not included in the training set, as it was supposed as an unknown in the color matching problem.Because the PIE, statistics transferring, and the histogram matching methods are all samplebased, an image must be selected from the training set to act as the reference.However, all that the WCNN model needs is the input image, thus it can operate without selecting such a reference.As the reference is not likely to be exactly the same as the ground truth, we can see the color discrepancy between the output and the ground truth in the results of PIE, statistics transferring, and histogram matching in Figure 12.Also, several features and descriptors were computed for all input images, output images, and the ground truth images in the test set, including the Oriented FAST and Rotated BRIEF (ORB) descriptor, the Scale-Invariant Feature Transform (SIFT) descriptor, and the Binary Robust Invariant Scalable Keypoints (BRISK) descriptor.To be a representation of similarity, the distances between the features of the output and the ground truth are computed, and are displayed in the boxplots in Figure 13.From Figure 13 we can see that generally the processed images are closer to the ground truth, in regards to the distances of the feature descriptors, except for the PIE method.One of the reasons why PIE fails to generate high quality results is that the low atmosphere visibility may deteriorate the gradients, resulting in a significant difference between the gradients of the input image and the ground truth.The WCNN model results achieve the maximum similarity to the ground truth, and the model is also the most stable one.From Figure 13 we can see that generally the processed images are closer to the ground truth, in regards to the distances of the feature descriptors, except for the PIE method.One of the reasons why PIE fails to generate high quality results is that the low atmosphere visibility may deteriorate the gradients, resulting in a significant difference between the gradients of the input image and the ground truth.The WCNN model results achieve the maximum similarity to the ground truth, and the model is also the most stable one.
Figure 13.Boxplots of L1-norm distances between the processed images and the ground truth with respect to left: ORB; middle: SIFT, and right: BRISK feature descriptors.The distances represent the dissimilarity between the processed results and the ground truth (the smaller the better).There are five horizontal line segments in each patch, indicating five percentiles of the distances within the processed images by the corresponding method; from top to bottom: the maximum (worst) distance, the worst-25% distance, the median distance, the best-25% distance, and the minimum (best) distance.

Processing Time and Memory Comsumption
The processing time of 512 patches with a size of 227 × 227 × 3 on a single NVIDIA ® GeForce ® GTX 1080 graphics processing unit is 0.408 s, or

Conclusions
This paper presents a nonparametric color correcting scheme in a probabilistic optimal transport framework, based on the Wasserstein CNN model.The multi-scale features are first to be extracted from the intermediate layers, and then are used to infer the corrected color distribution which minimizes the errors with respect to the ground truth.The experimental results demonstrate that the method is able to handle images of different sources, resolutions, and illumination and atmosphere conditions.With high efficiency in computing speed and memory consumption, the proposed method shows its prospects for utilization in real time processing of large-scale remote sensing datasets.
We are currently extending the global color matching algorithm to take the local inhomogeneity of illumination into consideration, in order to enhance the precision.Local histogram matching of each band could serve for reflectance retrieval and atmospheric parameter retrieval purposes, and the preliminary results are encouraging.The distances represent the dissimilarity between the processed results and the ground truth (the smaller the better).There are five horizontal line segments in each patch, indicating five percentiles of the distances within the processed images by the corresponding method; from top to bottom: the maximum (worst) distance, the worst-25% distance, the median distance, the best-25% distance, and the minimum (best) distance.

Processing Time and Memory Comsumption
The processing time of 512 patches with a size of 227 × 227 × 3 on a single NVIDIA ® GeForce ® GTX 1080 graphics processing unit is 0.408 s, or 0.8 × 10 −3 s for a single patch, which means that the method could handle images as large as 2000 × 2000 in real time.A total of 6990 MB memory is consumed for 512 patches, or 13.7 MB for each.

Conclusions
This paper presents a nonparametric color correcting scheme in a probabilistic optimal transport framework, based on the Wasserstein CNN model.The multi-scale features are first to be extracted from the intermediate layers, and then are used to infer the corrected color distribution which minimizes the errors with respect to the ground truth.The experimental results demonstrate that the method is able to handle images of different sources, resolutions, and illumination and atmosphere conditions.With high efficiency in computing speed and memory consumption, the proposed method shows its prospects for utilization in real time processing of large-scale remote sensing datasets.
We are currently extending the global color matching algorithm to take the local inhomogeneity of illumination into consideration, in order to enhance the precision.Local histogram matching of each band could serve for reflectance retrieval and atmospheric parameter retrieval purposes, and the preliminary results are encouraging.

Figure 1 .
Figure 1.Color discrepancy in remote sensing images.(a,b) Digital Globe images on different dates from Google Earth; (c,d) Digital Globe (bottom, right) and NASA (National Aeronautics and Space Administration) Copernicus (top, left) images on the same date from Google Earth; (e) GF1 (Gaofen-1) images from different sensors, same area and date.

Figure 1 .
Figure 1.Color discrepancy in remote sensing images.(a,b) Digital Globe images on different dates from Google Earth; (c,d) Digital Globe (bottom, right) and NASA (National Aeronautics and Space Administration) Copernicus (top, left) images on the same date from Google Earth; (e) GF1 (Gaofen-1) images from different sensors, same area and date.

Figure 2 .
Figure 2. Matching algorithms of "scheme A" take both input and reference in the form of histograms.As this scheme is not content related, two similar distributions with different contexts could be not be mapped to their corresponding reference with one unified mapping.

Figure 2 .
Figure 2. Matching algorithms of "scheme A" take both input and reference in the form of histograms.As this scheme is not content related, two similar distributions with different contexts could be not be mapped to their corresponding reference with one unified mapping.

Figure 3 .
Figure 3. Matching algorithms of "scheme B" take both input and reference in the form of images.Similar distributions could be mapped to different corresponding references, as the scheme is content based.However, the same grayscales could be mapped to different grayscales when they are in different contexts, violating Property 1.

Figure 4 .
Figure 4. Matching algorithms of "scheme C" take images as inputs and histograms as references in the form of images.Similar distributions could be mapped to different corresponding references, as the scheme is content related.

,
cx y is the cost of transporting one unit of mass from x to y, and ( ) u and v as its marginal distributions.Again, N c indicates the number of color channels and ( ), u v Πis the collection of every feasible ( ) is defined as a distance ( )

Figure 4 .
Figure 4. Matching algorithms of "scheme C" take images as inputs and histograms as references in the form of images.Similar distributions could be mapped to different corresponding references, as the scheme is content related.

Figure 6 .
Figure 6.Structure of the "fire module" in the Squeeze-net.

Figure 5 .
Figure 5. Calculation method of the Wasserstein distance between the inferred histograms and the ground-truth reference.STEP 1: stack the histograms on the frequency axis; STEP 2: subtract the stacked histograms, and integrate with respect to the cumulative frequency.

Figure 6 .
Figure 6.Structure of the "fire module" in the Squeeze-net.Figure 6. Structure of the "fire module" in the Squeeze-net.

Figure 6 .
Figure 6.Structure of the "fire module" in the Squeeze-net.Figure 6. Structure of the "fire module" in the Squeeze-net.

Figure 7 .
Figure 7. Model structure of the proposed model.

Figure 7 .
Figure 7. Model structure of the proposed model.

1 .
Random cropping: A patch of 227 × 227 is cropped at a random position from each 256 × 256 sample.It is worth noting that this implies that no registration is needed in the training process.2. Random flipping: Each sample in the input batch is randomly horizontally and vertically flipped by a chance of 50%.3. Random color augmentation: The brightness, saturation, and gamma values of the input color are randomly shifted.Small perturbations are added to each color channel.Figure 8 shows an example of such transformation of the color distribution.

Figure 8 .
Figure 8. Color transforming curves in the random augmentation process.

Algorithm 1 . 1 :
Algorithm flow of the training process.Algorithm 1. Training Process of the Automatic Color Matching WCNN, our proposed algorithm Notations: θ , the parameters in the WCNN model; g θ , the gradients w.r.t.θ ; ( ) h  , the predicted color distribution; r , the reference color distribution; ( , ) w    , the Wasserstein loss.Required constants: α , the learning rate; m, the batch size.Required initial values: 0 θ , the initial parameters.while θ has not converged do

Figure 8 .
Figure 8. Color transforming curves in the random augmentation process.

Algorithm 1 .∼ P in a batch from the input data 3 :∼ P re f a batch from the reference data 4 :
Training Process of the Automatic Color Matching WCNN, Our Proposed Algorithm.Notations: θ, the parameters in the WCNN model; g θ , the gradients w.r.t.θ; h(•), the predicted color distribution; r, the reference color distribution; L w (•, •), the Wasserstein loss.Required constants: α, the learning rate; m, the batch size.Required initial values: θ 0 , the initial parameters.1: while θ has not converged do 2: Sample x (i) m i=1 Sample y (i) m i=1 Apply random augmentation to x (i) m i=1 5:

Figure 9 .
Figure 9. Results of matching the color palette of GF1 to GF2.Bars: histograms of input patches; solid lines with color: predicted histograms of our model; dashed lines in black: histograms of reference images; from top to bottom: histograms of images of the same area, but under different illumination and atmospheric conditions.

Figure 9 .
Figure 9. Results of matching the color palette of GF1 to GF2.Bars: histograms of input patches; solid lines with color: predicted histograms of our model; dashed lines in black: histograms of reference images; from top to bottom: histograms of images of the same area, but under different illumination and atmospheric conditions.

Figure 10 .
Figure 10.Color matching results of GF1 and GF2.From top to bottom: satellite images of the same area, but under different illumination and atmospheric conditions; left: input images; middle: output images with the predicted color palette; right: reference images, only needed in the training process to calculate the loss function.The model is able to infer the corrected color palette based on the content of the input images in the absence of a reference, when the model is fully trained.

Figure 10 .
Figure 10.Color matching results of GF1 and GF2.From top to bottom: satellite images of the same area, but under different illumination and atmospheric conditions; left: input images; middle: output images with the predicted color palette; right: reference images, only needed in the training process to calculate the loss function.The model is able to infer the corrected color palette based on the content of the input images in the absence of a reference, when the model is fully trained.
single patch, which means that the method could handle images as large as 2000 × 2000 in real time.A total of 6990 MB memory is consumed for 512 patches, or 13.7 MB for each.

Figure 13 .
Figure13.Boxplots of L1-norm distances between the processed images and the ground truth with respect to left: ORB; middle: SIFT, and right: BRISK feature descriptors.The distances represent the dissimilarity between the processed results and the ground truth (the smaller the better).There are five horizontal line segments in each patch, indicating five percentiles of the distances within the processed images by the corresponding method; from top to bottom: the maximum (worst) distance, the worst-25% distance, the median distance, the best-25% distance, and the minimum (best) distance.

Table 1 .
Different color matching schemes according to the input form and the reference form.

Table 2 .
Parameters of the GF1 and GF2 data in the experiment.

Table 2 .
Parameters of the GF1 and GF2 data in the experiment.