1. Introduction
Colour correction algorithms usually convert camera-specific RGB values into camera-independent colour spaces such as sRGB [1] or CIE XYZ [2]. In Figure 1, we plot the spectral sensitivity functions of the Nikon D5100 camera and the CIE XYZ colour matching functions. If there existed a linear transform which mapped the Nikon (or any other camera) sensitivity curves exactly onto the XYZ matching functions, then the same linear transform would perfectly correct the camera's RGB responses to the corresponding XYZ tristimuli. However, no commercial photographic camera meets this linear transform condition, and so camera RGBs can only be approximately converted to XYZs.
An illustration of the colour correction problem is shown in Figure 2. Here, RAW RGB Nikon D5100 camera responses are converted by linear colour correction to the sRGB [1] colour space. The image shown is drawn from the Foster et al. hyperspectral image set [3], with the RGB and sRGB images calculated by numerical integration. Both images have the sRGB non-linearity applied.
The most common approach to colour correction maps RGB data to corresponding XYZs using a 3 × 3 matrix (found by regression) such that:

$\underline{x}^T = \underline{\rho}^T M \quad (1)$

where $\underline{\rho}$ and $\underline{x}$ represent the RAW RGB camera response vector and XYZ tristimulus, respectively. Polynomial [4] and root-polynomial [5] approaches can also be used for colour correction. In each case, the RGB values are expanded according to the order of the polynomial (normal or root) and a higher-order regression is used to determine the regression transform. As an example, the second-order root-polynomial expansion maps $(R, G, B)^T$ to $(R, G, B, \sqrt{RG}, \sqrt{GB}, \sqrt{RB})^T$ ($^T$ denotes transpose) and the correction matrix M is 6 × 3.
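As a minimal sketch of the two regressions just described, the snippet below fits both a linear 3 × 3 matrix and a second-order root-polynomial matrix by least squares. The data are synthetic stand-ins for real camera RGBs and XYZs; only the structure of the computation follows the text.

```python
import numpy as np

def root_poly2(rgb):
    """Second-order root-polynomial expansion (six terms per pixel).

    rgb: (n, 3) array of R, G, B values.
    Returns an (n, 6) array: R, G, B, sqrt(RG), sqrt(GB), sqrt(RB).
    """
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return np.stack([r, g, b,
                     np.sqrt(r * g), np.sqrt(g * b), np.sqrt(r * b)], axis=1)

# Toy data: 100 "pixels" with made-up camera RGBs and target XYZs.
rng = np.random.default_rng(0)
P = rng.random((100, 3))                                   # camera responses (rows = pixels)
X = P @ rng.random((3, 3)) + 0.01 * rng.random((100, 3))   # pretend tristimuli

# Linear correction: solve min ||P M - X||_F by least squares (M is 3 x 3).
M_lin, *_ = np.linalg.lstsq(P, X, rcond=None)

# Root-polynomial correction: expand first, then regress (M is 6 x 3).
Q = root_poly2(P)
M_rp, *_ = np.linalg.lstsq(Q, X, rcond=None)

# Exposure invariance: every root-polynomial term is degree-1 homogeneous,
# so scaling the input RGBs by k scales the corrected output by k.
k = 7.0
scaled_ok = np.allclose(root_poly2(k * P) @ M_rp, k * (Q @ M_rp))
```

Note that the analogous check fails for an ordinary polynomial expansion, since squared terms scale by k² rather than k.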
In addition to regression methods, neural networks have been used for colour correction. In the literature, there are several shallow network approaches [6,7,8], and more recently, convolutional neural networks have been proposed [9,10,11]. However, the recent literature focuses on the problem of correcting colours captured underwater, where the correction problem is different to the one we address here (e.g., it must deal with the attenuation of colour due to the distance between the subject and the camera).
Recently, MacDonald and Mayer [12] designed a Neural Net (NN) with a fully connected Multilayer Perceptron (MLP) structure for colour correction and demonstrated that the network delivered colour correction that was better than the linear approach (Equation (1)). In the first part of this paper, we investigate the performance of the NN algorithm versus regression methods more generally. Broadly, we confirm the finding that the NN approach is significantly better than linear regression, but we find that the polynomial [4] and root-polynomial [5] regressions actually deliver significantly better colour correction than the NN [13].
As well as delivering poorer performance than the best regression methods, the NN approach was also found not to be exposure invariant [14]. That is, a network trained to map RGBs to XYZs at a given exposure level delivered relatively poor colour correction when the exposure level changed. The polynomial colour correction algorithm [15] suffers from the same exposure problem; polynomial regression works very well for a fixed exposure but less well when exposure changes [5]. Indeed, the existence of this problem led to the development of the root-polynomial correction algorithm (which, by construction, is exposure invariant) [5].
Let us now run a quick experiment to visually understand the problem of exposure in polynomial regression (we obtain similar results for the Neural Net). From the UEA dataset of spectral reflectance images [16], we sampled four reflectances. The RAW RGB responses of the Nikon D5100 are shown at the top of Figure 3a. Next, (b) shows the actual true sRGB image, rendered for a D65 whitepoint. In (c), we render these reflectances using the Nikon camera sensitivities and correct the four RGBs to the corresponding sRGB values using a second-order polynomial expansion. In detail, the second-order expansion has nine terms, $(R, G, B, R^2, G^2, B^2, RG, GB, RB)^T$, and the colour correction matrix is 9 × 3. In both the sRGB and fitted camera image, the maximum over all pixel values (across all three colour channels) is scaled to 1.
Now, we multiply the Nikon RGBs and the corresponding sRGB triplets by 7. As before, we calculate the second-order polynomial expansion of the RGBs and then apply the same colour correction matrix found for the exposure = 1 condition. After colour correction, we again scale the maximum (across all three channels) to 1. The resulting colour patches are shown at the bottom of Figure 3. It is clear that the colours of the patches have changed significantly and that the colour correction is more accurate for the colours rendered under the same exposure conditions (panel (c) is more similar to (b) than (d) is to (b)).
In the second part of this paper, we try to solve this problem: we seek to make neural network colour correction exposure invariant. We investigate two approaches. First, we augment the training data used to define the neural network with data drawn from many different exposure levels. Second, we design a new network, which, by construction, is exposure invariant. Our new network has two components. The chromaticity component network attempts to map camera rgb chromaticity to colorimetric xyz chromaticity. In the second component, we linearly correct R, G and B to predict X + Y + Z (mean colorimetric brightness). Given this summed brightness and the target xyz chromaticity, we can calculate XYZ. By construction, the combination of the chromaticity-correcting network and the linear brightness predictor generates XYZs in an exposure-invariant manner. Experiments demonstrate that both of our exposure-invariant networks continue to deliver better colour correction than a 3 × 3 linear matrix.
In prior work in this area, colour correction algorithms were assessed with respect to a single dataset. For example, the 1995 reflectances of the SFU set [17] are widely used. Often k-fold cross-validation is used, i.e., the reflectance set is split into k similar-sized folds. Each fold is then used in turn as the test set, the other k − 1 folds are used for training the algorithms, and the performance of an algorithm is averaged over the k test sets. However, when a single set is used, the statistics of all the folds are often similar, and so training on a subset of a given dataset can be almost as good as training on the whole set. Thus, an important contribution of this paper is to run a cross-validation experiment where the folds are different datasets (with no known a priori statistics in common).
In Section 2, we summarise colour correction and introduce the regression and Neural Net (NN) algorithms. We show how NN methods can be made exposure invariant in Section 3 and report on a comprehensive set of colour correction experiments in Section 4. The paper finishes with a short conclusion.
2. Background
Let $Q_k(\lambda)$ denote the k-th camera spectral response function and $\underline{Q}(\lambda)$ denote the vector of these functions as in Figure 1. The camera response to a spectral power distribution $E(\lambda)$ illuminating the j-th reflectance $S_j(\lambda)$ is written as:

$\underline{\rho}_j = \int_{\omega} E(\lambda) S_j(\lambda) \underline{Q}(\lambda) \, d\lambda \quad (2)$

where $\omega$ denotes the visible spectrum (400 to 700 nm) and $\underline{\rho}_j$ denotes the vector of RGB responses. Similarly, given the XYZ colour matching functions $\underline{\bar{x}}(\lambda)$, the tristimulus response $\underline{x}_j$ is written as:

$\underline{x}_j = \int_{\omega} E(\lambda) S_j(\lambda) \underline{\bar{x}}(\lambda) \, d\lambda \quad (3)$
Suppose the $n \times 3$ matrices P and X record (in rows) the camera responses and tristimuli of n surface reflectances, respectively. To find the $3 \times 3$ matrix M (that is, the best linear map) in Equation (1), we minimise:

$\min_M \| PM - X \|_F \quad (4)$

where $\| \cdot \|_F$ denotes the Frobenius norm [18]. We can solve for M in closed form using the Moore–Penrose inverse [19]:

$M = P^{+} X = (P^T P)^{-1} P^T X \quad (5)$
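The closed-form solution can be written in a couple of lines of NumPy. The data below are synthetic placeholders; the point is only that the pseudoinverse and the normal equations give the same M when P has full column rank.

```python
import numpy as np

# Toy stand-ins: P holds n camera responses (rows), X the target tristimuli.
rng = np.random.default_rng(1)
P = rng.random((50, 3))
X = rng.random((50, 3))

# Closed-form least-squares solution M = (P^T P)^{-1} P^T X,
# computed here via the Moore-Penrose pseudoinverse.
M = np.linalg.pinv(P) @ X

# The same answer via the normal equations (valid for full column rank P).
M_normal = np.linalg.solve(P.T @ P, P.T @ X)
```

In practice `np.linalg.lstsq` is the numerically preferred route, but the pseudoinverse form mirrors the equation in the text.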
To extend the regression method, we define a basis function $f_e^o()$, where the subscript e denotes the type of expansion (here e = p and e = r, respectively, denote polynomial and root-polynomial expansions) and the superscript o denotes the order of the expansion. As an example, if we are using the second-order root-polynomial expansion [5], then we write:

$f_r^2(\underline{\rho}) = (R, G, B, \sqrt{RG}, \sqrt{GB}, \sqrt{RB})^T \quad (6)$

Again, we can use Equations (4) and (5) to solve for the regression matrix M, though M will be non-square (and depend on the number of terms in the expansion). For our second-order root-polynomial expansion, the columns of P will be the six terms in the root-polynomial expansion (P is an n × 6 matrix) and M will be 6 × 3. See, for example, [5] for details of higher-order expansions.
Optimising for the Frobenius norm in Equation (4) may be undesirable because Euclidean differences in the XYZ colour space do not correspond to perceived differences in colour. Instead, it is more desirable to optimise for differences in perceptually uniform colour spaces such as CIELAB [2] or using colour difference formulas such as CIE Delta E 2000 [20]. Let us denote the magnitude of the difference vector between a mapped camera response vector and its corresponding ground truth CIELAB value as:

$\Delta_j = \| C(M^T f_e^o(\underline{\rho}_j), \underline{w}) - C(\underline{x}_j, \underline{w}) \| \quad (7)$

where C() maps input vectors according to the CIELAB function to corresponding Lab triplets and the superscripts e and o are as before. The parameter $\underline{w}$ denotes the XYZ tristimulus of a perfect white diffuser and is required to calculate CIELAB values. To find the best regression matrix, we seek to minimise:

$\min_M \sum_{j=1}^{n} \Delta_j \quad (8)$

Unfortunately, there is no closed-form solution to the optimisation described in Equation (8). Instead, a search-based strategy such as the Nelder–Mead simplex method [21] can be used to find M (though there is no guarantee that the global optimum is found; [21] is a local minimiser).
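The search-based optimisation can be sketched with SciPy's Nelder–Mead implementation. The CIELAB conversion below is the standard CIE formula; the camera data and white point are synthetic stand-ins, and the closed-form Frobenius solution is used to seed the search.

```python
import numpy as np
from scipy.optimize import minimize

def xyz_to_lab(xyz, white):
    """CIE 1976 L*a*b* from XYZ values (both (n, 3) arrays)."""
    t = xyz / white
    d = 6.0 / 29.0
    f = np.where(t > d**3, np.cbrt(t), t / (3 * d**2) + 4.0 / 29.0)
    L = 116.0 * f[:, 1] - 16.0
    a = 500.0 * (f[:, 0] - f[:, 1])
    b = 200.0 * (f[:, 1] - f[:, 2])
    return np.stack([L, a, b], axis=1)

rng = np.random.default_rng(2)
P = rng.random((30, 3))          # toy camera responses
X = P @ rng.random((3, 3))       # toy target XYZs
white = X.max(axis=0)            # stand-in white point tristimulus

def mean_lab_error(m_flat):
    """Mean Euclidean CIELAB error for a flattened 3 x 3 matrix M."""
    M = m_flat.reshape(3, 3)
    err = xyz_to_lab(P @ M, white) - xyz_to_lab(X, white)
    return np.linalg.norm(err, axis=1).mean()

# Seed the simplex search with the closed-form Frobenius solution, then refine.
M0 = np.linalg.lstsq(P, X, rcond=None)[0]
res = minimize(mean_lab_error, M0.ravel(), method='Nelder-Mead',
               options={'maxiter': 5000, 'xatol': 1e-8, 'fatol': 1e-10})
```

As the text notes, Nelder–Mead is a local minimiser: `res.fun` is never worse than the seed's error, but the global optimum is not guaranteed.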
In this paper, we will also be interested in minimising CIE Delta E 2000 [20]. Here, it is not possible to model the colour difference as the Euclidean distance between triplets (which are non-linear transforms of the XYZs and regressed RGBs). For the Delta E 2000 case, we write the error as:

$\Delta_{00,j} = \Delta E_{00}(M^T f_e^o(\underline{\rho}_j), \underline{x}_j, \underline{w}) \quad (9)$

(the subscript identifies the errors as CIE Delta E 2000) and the minimisation to solve for M is again given by Equation (8). As before, we minimise Delta E 2000 using the search-based algorithm. In Equation (9), $\Delta E_{00}()$ denotes the function that calculates the CIE 2000 error. This function takes three inputs: the regressed RGB ($M^T f_e^o(\underline{\rho}_j)$), the corresponding XYZ ($\underline{x}_j$) and the XYZ for the illuminant ($\underline{w}$). For details of $\Delta E_{00}()$, the reader is referred to [20].
As an alternative to regression methods, colour correction can also be implemented as an artificial neural network. MacDonald and Mayer's [12] recently published neural network is illustrated in Figure 4 and is a leading method for neural network colour correction.
This Neural Net has 3189 'connections', indicating that the cost of colour correction is on the order of 3189 multiplications and additions (the number of operations applied as data flows from left to right). In comparison, the second-order root-polynomial correction requires three square root operations and (when the 6 × 3 correction matrix is applied) 18 multiplications and 15 additions, i.e., it is two orders of magnitude quicker to compute. In part, the practical utility or otherwise of the Neural Net approach will rest on the trade-off between how well it improves colour correction (say, compared to the linear method) and its higher computational cost. The Neural Net is trained to minimise the Delta E 2000 errors.
3. Exposure-Invariant Neural Nets
Abstractly, we can think of a neural network as implementing a vector function $N()$ such that:

$\underline{x} = N(\underline{\rho}) \quad (10)$

where $\underline{\rho}$ and $\underline{x}$ denote the RAW RGB camera response vector and the XYZ tristimulus, respectively. When exposure changes (for example, if we double the quantity of light), then, physically, the RGB and XYZ responses also double in magnitude. We would like a colour correction function to be exposure invariant:

$N(k\underline{\rho}) = kN(\underline{\rho}) \quad (11)$

where k in Equation (11) is a positive scalar. This homogeneity property is actually rare in mathematical functions. It holds for linear transforms ($\underline{x}^T = \underline{\rho}^T M$ implies $k\underline{x}^T = (k\underline{\rho})^T M$) and for root-polynomials, but it is not true for polynomial expansions [5]. A Neural Net, in order not to collapse to a simple affine transformation, uses non-linear activation functions. These non-linearities, while an essential aspect of the network, make it difficult to attain homogeneity, and homogeneity is required if colour correction is to be invariant to a changing exposure level. The MacDonald and Mayer network is found not to be exposure invariant. In fact, this is entirely to be expected; there is no reason why NNs should exhibit the homogeneity property and every reason why they should not. Significantly, this variation in performance with exposure is a problem. In the experimental section, we show that colour correction performance drops markedly when there are large changes in exposure.
Now, let us consider alternative methods for enhancing the robustness of neural networks to exposure variations or achieving exact exposure invariance. In neural network research, if we observe poor performance for some input data, then the trick is to retrain the network where more of the problematic data are added to the training set. In neural network parlance, we augment the training data set. Here, we have the problem that a network trained for one light level delivers poor colour correction when the light levels change (e.g., when there is double the light illuminating a scene). So, to achieve better colour correction as exposure levels change, we will augment our colour correction training data—the corresponding RGBs and XYZs for a single exposure level—with corresponding RGBs and XYZs for several exposure levels. Our retrained MacDonald and Mayer Network using the exposure level augmented dataset is our first (more) exposure-invariant neural network solution to colour correction.
Perhaps a more elegant approach to solving the exposure problem is to redesign the network so it is, by construction, exactly exposure invariant. We show such an architecture in Figure 5. In the top network we learn (using MacDonald and Mayer's NN) the mapping from the input r, g and b chromaticities to the x, y and z chromaticities. When the camera and tristimulus responses are denoted $(R, G, B)$ and $(X, Y, Z)$, the corresponding chromaticities are defined as $r = R/(R+G+B)$, $g = G/(R+G+B)$ and $b = B/(R+G+B)$; and $x = X/(X+Y+Z)$, $y = Y/(X+Y+Z)$ and $z = Z/(X+Y+Z)$. In the 'intensity' network (bottom of Figure 5) we map R, G and B to predict $X+Y+Z$ by using only a linear activation function. Multiplying the estimated $(x, y, z)$ chromaticities by the estimated $X+Y+Z$ returns an estimated $(X, Y, Z)$.
In other words, when we change the scalar k, the input of the first network remains unchanged, as it depends on the chromaticities. Consequently, the output also remains the same. The only elements that change are the input and output of the second (intensity) network, as it utilises the actual RGB input and calculates the sum of the actual XYZ. However, this change is linear since it involves a dot product; there are no activation functions or biases, only three multiplications. Thus, for the given input RGB × k, the multiplication of the output of the first network (xyz chromaticities) and the second network (intensity × k) yields (XYZ × k). As a result, the system is exposure invariant, meaning that chromaticities do not change at different illumination levels, and inputs and outputs scale linearly with the scalar k.
Given that the intensity network consists of only three connections with a linear activation function and no biases, the additional computational cost compared to the original network is quite low: two additions and six multiplications. Specifically, three multiplications and two additions are used for calculating the intensity, while the other three multiplications multiply the intensity with the x, y and z chromaticities in order to obtain the corresponding XYZ values.
Informally, let us step through an example to show that the network is exposure invariant. That is, we want to show that if an RGB $\underline{\rho}$ is mapped to an estimated XYZ $\underline{x}$, then $k\underline{\rho}$ is mapped to $k\underline{x}$. Let us consider the RGB vector [10, 50, 40]. Dividing the RGB values by their sum yields the r, g, b chromaticities: [0.1, 0.5, 0.4]. Suppose our chromaticity network outputs [0.3, 0.4, 0.5] (the estimates of the x, y, z chromaticities) and the second network (the bottom one in Figure 5) returns 50 as the prediction of X + Y + Z. Now, we multiply the output x, y, z chromaticities by 50, and we generate the XYZ output: [15, 20, 25].
Now, let us double the RGB values: [20, 100, 80]. Clearly, the chromaticities are unchanged ([0.1, 0.5, 0.4]). The output of the second network is a simple linear dot-product; thus, the output must be equal to 100 (as opposed to 50 before the exposure doubling). Finally, we multiply the estimated x, y, z chromaticities, [0.3, 0.4, 0.5], by 100 and the final output is [30, 40, 50] (which is exactly double as before). This simple example demonstrates that if the exposure changes by a scalar k then the output of the network also scales by k and so our new network is exposure invariant.
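The worked example can be mirrored in code. In the sketch below, the chromaticity "network" is a made-up fixed mapping standing in for the trained MLP, and the intensity weights are invented; only the two-branch structure follows the architecture in the text.

```python
import numpy as np

def chroma_net(rgb_chroma):
    """Stand-in for the trained chromaticity MLP: a toy fixed mapping
    whose output is renormalised so the xyz chromaticities sum to 1."""
    m = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
    out = m @ rgb_chroma
    return out / out.sum()

# The 'intensity' branch: three weights, linear activation, no bias.
intensity_weights = np.array([1.1, 0.9, 1.0])   # invented values

def correct(rgb):
    chroma = rgb / rgb.sum()                 # r, g, b chromaticities
    xyz_chroma = chroma_net(chroma)          # estimated x, y, z
    intensity = intensity_weights @ rgb      # estimated X + Y + Z (linear)
    return intensity * xyz_chroma            # estimated XYZ

rho = np.array([10.0, 50.0, 40.0])
out = correct(rho)
out_doubled = correct(2.0 * rho)   # exactly 2 x out, by construction
```

Because the chromaticity branch sees only ratios and the intensity branch is a bias-free dot product, the composite output scales exactly with any exposure factor k.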
4. Experiments
4.1. Preparation of Datasets
In our experiments, we used four spectral datasets: the Simon Fraser University (SFU) reflectance set [17], the Ben-Gurion University Dataset (BGU) [22], the Columbia University Dataset (CAVE) [23], and the Foster et al. Dataset (FOSTER) [3]. The spectral sensitivities of the Nikon D5100 camera [24] and the D65 viewing illuminant [25] are also used in all experiments. All input RGB values and their corresponding target XYZ values are calculated using numerical integration, without any gamma correction.
The SFU reflectance set [17] comprises 1995 spectral surface reflectances, including the 24 Macbeth colour checker patches, 1269 Munsell chips, 120 Dupont paint chips, 170 natural objects, and 407 additional surfaces. In Figure 6, we plot the xy chromaticities of the SFU dataset in the CIE 1931 chromaticity diagram. We also show the gamut of colours achievable using Rec 709 primaries (white triangle). It is evident that the SFU reflectance set comprises a wide range of colours.
The BGU Dataset comprises 201 multi-spectral outdoor images of various sizes. To make each image equally important (statistically), we resized them using bilinear resampling to 1000 × 1000 in size. In all our experiments, we used D65 as our viewing illuminant. However, the BGU images have radiance spectra and we would like the light component of each radiance spectrum to be D65. Thus, we set out to manually identify achromatic surfaces in each of the 201 scenes with the additional constraint that we judged the image to be predominantly lit by one light. We then used the corresponding spectrum (for the achromatic surface) to be the spectrum of the prevailing light. Dividing by the prevailing light and multiplying by D65 re-renders the scene for D65 illumination.
We were only confident that 57 of the 201 scenes met the two constraints of having a clearly identifiable achromatic surface and being lit by a single prevailing light. Thus, only 57 of the BGU images re-rendered to D65 were actually used. Finally, each image was scaled so that the maximum value in the image is 1.
The CAVE Dataset comprises 32 indoor reflectance images, segregated into five distinct categories: stuff, skin and hair, food and drinks, real and fake, and paints. Each image has dimensions of 512 × 512 pixels. We found it necessary to exclude one image, entitled “watercolors”, due to the presence of missing data. Consequently, we were left with a total of 8,126,464 pixels for analysis (31 × 512 × 512).
Lastly, we used the FOSTER hyperspectral image set [3], which consists of eight different images. Similar to the BGU Dataset, the various-sized images were resized to 1000 × 1000, yielding 8,000,000 pixels (8 × 1000 × 1000).
As the CAVE and FOSTER images contain only reflectance data, we multiply each spectral measurement in each image and per pixel by the D65 illuminant spectrum.
4.2. Algorithms
The following algorithms are investigated in this paper:
- (i) LS: Least Squares Regression.
- (ii) LS-P: Least Squares Polynomial Regression. Here, we use the second-order expansion, which maps each three-element vector to a ten-element vector.
- (iii) LS-RP: Least Squares Root-Polynomial Regression. Again, a second-order expansion is used, which for root-polynomials has six terms.
- (iv) LS-Opt.
- (v) LS-P-Opt.
- (vi) LS-RP-Opt.
- (vii) NN: MacDonald and Mayer's Neural Net [12].
- (viii) NN-AUG: the NN trained on data augmented with different exposure levels.
- (ix) NN-EI: our exposure-invariant approach, which uses two neural networks: the first learns to predict the chromaticities and the second the sum of XYZ.
Opt denotes optimisation: we use the CIELAB or CIE Delta E 2000 loss for training, depending on the experiment (we make clear which is used in which experiment), with the Nelder–Mead simplex method [21] (see Equations (7) and (8)) to solve for the linear (iv), polynomial (v) and root-polynomial (vi) regressions. The regressions (i) through (iii), which minimise error in the XYZ tristimulus space, are found in closed form using the Moore–Penrose inverse (Equations (5) and (6)).
As suggested in MacDonald and Mayer's original paper, all the colour correction NNs are trained to minimise the CIE Delta E 2000 error.
Apart from the cross-dataset experiments, our colour correction algorithms are tested on the SFU dataset using a five-fold cross-validation methodology for the fixed and different exposures experiments. Here, the reflectance dataset is split into five equal-sized folds (399 reflectances per fold). Each algorithm is trained using four of the folds and then tested on the remaining fold to generate error statistics. The process is repeated five times (so every fold is the test set exactly once). According to this methodology, the reported error statistics are averages over the five experiments.
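The five-fold protocol can be sketched in a few lines. The split below uses a fixed random permutation of the 1995 SFU reflectances; the training and testing calls are placeholders for whichever algorithm is being evaluated.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Randomly split n sample indices into five near-equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

folds = five_fold_indices(1995)   # five folds of 399 reflectances each
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ...train the algorithm on train_idx, evaluate on test_idx, and
    # average the error statistics over the five runs...
```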
4.3. Details about How the Neural Net Was Trained
For NN, we used MacDonald and Mayer's [12] neural network, which takes RGB values as input and XYZ as target: the 3 × 79 × 36 × 3 fully connected MLP architecture with two hidden layers shown in Figure 4. As in the original paper, we used the Adam optimiser with a learning rate of 0.001 to train the network to minimise CIE Delta E 2000 [20]. We had to raise the number of epochs from 65 (used in the original study) to 500 for the neural network to develop a successful mapping because we were working on relatively small datasets. We also used mini-batch gradient descent with a batch size of 8. Our model used 20% of the training data as the validation set and used early stopping: training ends automatically (via a call-back function) if there is no improvement in the validation loss after a specified number of epochs (in our model, 100). We chose the best model based on the validation loss. The NN-AUG and NN-EI networks were trained using the same methodology.
4.4. Details about Exposure Experiments
The colour correction performance for all algorithms was first calculated for a fixed reference exposure (exposure = 1) level. Then, we tested our models under different exposure levels to understand their performance when the exposure level changed. We use exposure values of 0.2, 0.5, 1, 2 and 5 (e.g., 0.2 and 5, respectively, meaning the amount of light was 1/5 and 5 times the reference condition of exposure 1).
In NN-AUG, in order to achieve successful results at different exposure levels, we augmented the training data with different exposure factors, which are 0.1, 0.2, 0.5, 2, 5 and 10 times the original samples. Then, we tested the models with the test samples with an original exposure level.
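The augmentation step amounts to appending exposure-scaled copies of every (RGB, XYZ) training pair. A minimal sketch, using the exposure factors listed above and synthetic data in place of the real training set:

```python
import numpy as np

def augment_exposures(rgb, xyz, factors=(0.1, 0.2, 0.5, 2.0, 5.0, 10.0)):
    """Append exposure-scaled copies of the (RGB, XYZ) training pairs.

    Both RGBs and XYZs are scaled by the same factor, since doubling the
    light doubles both the camera response and the tristimulus.
    """
    rgbs = [rgb] + [k * rgb for k in factors]
    xyzs = [xyz] + [k * xyz for k in factors]
    return np.concatenate(rgbs), np.concatenate(xyzs)

# Synthetic stand-ins for the real training pairs.
rng = np.random.default_rng(4)
rgb = rng.random((100, 3))
xyz = rng.random((100, 3))
rgb_aug, xyz_aug = augment_exposures(rgb, xyz)
```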
The NN-EI is, by construction, exposure invariant; thus, it was only trained using the data for the reference exposure level.
4.5. Details about Cross-Dataset Experiments
In the cross-dataset section, we trained our algorithms with a single dataset and individually tested them with the remaining datasets. We repeated this process four times (i.e., every dataset was used for training once and for testing three times). Additionally, for the NN method, we allocated 20% of the training data as a validation set. The whole training set was used for the regression colour correction methods.
Even our relatively small image sets comprise millions of pixels. Training with millions of pixels is computationally expensive, especially for the NN and search-based regression methods. To mitigate training complexity, we reduced the multi-spectral images of the BGU, CAVE, and FOSTER datasets to thumbnails of size 40 × 40 × 31 (using nearest neighbour downsampling). Additionally, for all our datasets (including SFU), we sampled the set of images and included an observed spectrum if and only if it was at least 5 degrees apart from every spectrum already in the set (we built the set incrementally, adding a new spectrum only if it was at least 5 degrees away from all previously selected members).
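The incremental angular filtering can be written as a single greedy pass. The sketch below normalises each spectrum to a unit vector and keeps it only if it lies at least the threshold angle away from everything already kept; the data in the usage note are illustrative.

```python
import numpy as np

def angular_subsample(spectra, threshold_deg=5.0):
    """Greedily keep a spectrum only if it is at least threshold_deg
    degrees (in vector angle) away from every spectrum already kept.

    spectra: (n, d) array of non-zero spectra. Returns the kept subset.
    """
    cos_thresh = np.cos(np.deg2rad(threshold_deg))
    units = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    kept = []
    for i, u in enumerate(units):
        # cos(angle) > cos_thresh means the pair is closer than the threshold
        if all(u @ units[j] <= cos_thresh for j in kept):
            kept.append(i)
    return spectra[kept]
```

For example, with the three vectors [1, 0, 0], [0, 1, 0] and [0.999, 0.04, 0], the third is only about two degrees from the first and so is filtered out, leaving two spectra.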
Thumbnail creation and angular threshold filtering resulted in, respectively, 523, 354, 4188 and 4331 spectra for the SFU, BGU, CAVE and FOSTER datasets. All our colour correction algorithms were trained for these small sets of spectra. However, testing was conducted on the full-sized spectral data, which were 1995 samples for SFU, 57,000,000 for BGU (57 × 1000 × 1000), 8,126,464 for CAVE (31 × 512 × 512) and 8,000,000 for FOSTER (8 × 1000 × 1000).
6. Conclusions
Recently, it has been proposed that neural networks can be used to solve the colour correction problem [12]. Indeed, in line with previous work, we found the NN approach delivered a modest performance increment compared to the (almost) universally used linear correction method (at least where training and testing data contain similar spectral data) [13]. However, we also found that the NN approach was not exposure invariant. Specifically, a network trained for one light level could deliver poor colour correction as the exposure changed (when there was more or less light in the scene).
However, we showed that NNs could be made robust to changes in exposure through data augmentation by training the NNs with data drawn from many different light levels. In a second approach, we redesigned the neural network architecture so that, by construction, it was exactly exposure invariant. Experiments demonstrated that both exposure-invariant networks continued to outperform linear colour correction. However, a classical method—the simple exposure-invariant root-polynomial regression method—worked best overall (outperforming the NN by about 25%).
Finally, we carried out cross-dataset experiments to test the performance of our algorithms. This is a stringent test as the spectral statistics of the datasets used are quite different from one another. Our results showed that there is not a clear ranking in the performance of regression-based methods for the cross-dataset condition. However, as for the exposure change test, we found that the NN method performed worse overall.
The general conclusion of this paper is that—at least for now—classical colour correction regression methods outperform the tested NN algorithm.