Article

CNN-Based Illumination Estimation with Semantic Information

by Ho-Hyoung Choi, Hyun-Soo Kang and Byoung-Ju Yun
1 Advanced Dental Device Development Institute, School of Dentistry, Kyungpook National University, 2177, Dalgubeol-daero, Jung-gu, Daegu 41940, Korea
2 College of Electrical and Computer Engineering, School of Information and Communication Engineering, Chungbuk National University, 1, Chungdae-ro, Seowon-gu, Cheongju-si, Chungcheongbuk-do 28644, Korea
3 IT College, School of Electronics Engineering, Kyungpook National University, 80, Daehak-ro, Buk-gu, Daegu 41566, Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2020, 10(14), 4806; https://doi.org/10.3390/app10144806
Submission received: 8 June 2020 / Revised: 5 July 2020 / Accepted: 7 July 2020 / Published: 13 July 2020
(This article belongs to the Section Applied Industrial Technologies)

Abstract

For more than a decade, both academia and industry have focused on computer vision and, in particular, computational color constancy (CVCC). CVCC is a fundamental preprocessing task in a wide range of computer vision applications. While the human visual system (HVS) has the innate ability to perceive constant surface colors of objects under varying illumination spectra, color constancy remains a natural challenge for computer vision. Accordingly, this article proposes a novel convolutional neural network (CNN) architecture based on the residual neural network, which consists of pre-activation, atrous (dilated) convolution, and batch normalization. The proposed network automatically decides what to learn from input image data and how to pool without supervision. On receiving input image data, the proposed network crops each image into image patches prior to training. Once the network begins learning, local semantic information is automatically extracted from the image patches and fed to its novel pooling layer. As a result of the semantic pooling, a weighted map, or mask, is generated. Simultaneously, the extracted information is estimated and combined to form global information during training. The novel pooling layer enables the proposed network to distinguish between useful data and noisy data, and thus to efficiently remove noisy data during learning and evaluation. The main contribution of the proposed network is to raise the accuracy and efficiency of CVCC by adopting the novel pooling method. The experimental results demonstrate that the proposed network outperforms its conventional counterparts in estimation accuracy.

Graphical Abstract

1. Introduction

The appearance of an object's color is often influenced by surface spectral reflectance, illumination conditions, and relative position, which makes it very challenging for computer vision to recognize an object in both still images and video. However, computer vision can benefit from adopting computational color constancy (CVCC) as a pre-processing step, which enables the recorded colors of an object to stay relatively constant under different illumination conditions. Obviously, color plays a large part in the performance of computer vision applications such as human computer vision, color feature extraction, and color appearance models [1,2]. However, it is imperative to cope with the undesirable effects arising from the significant impact of the illumination color on the perceived color of an object in a real-life scene. While the human visual system (HVS) has the innate ability to recognize the actual color of an object even under different light source colors, computer vision finds it tough and challenging to identify the actual color of an object under changing illumination conditions. In an effort to mimic the HVS, CVCC is designed to predict the actual color of an object in a real-world scene independent of varying illumination conditions. CVCC algorithms roughly fall into three categories: statistics-based, physics-based, and learning-based methods.
For several decades, the statistics-based method has dominated CVCC technology, and the three best-known algorithms are the gray-world [3], the shades-of-gray [4], and the max-RGB (red, green, blue) [5] or the gray-edge [6]. They rest on strong empirical assumptions about the statistics of real color images. There are some other statistics-based techniques which have also contributed to solving the color constancy problem of computer vision [7,8,9]. The physics-based method has mainly evolved from the dichromatic reflectance model of Shafer [10,11]. This method uses accurate reflection models but still requires complicated additional steps such as specularity estimation [10] or image segmentation [11]. The learning-based method includes gamut mapping methods [12,13,14,15] and a recent patch-based approach [16]. The recent patch-based approach estimates the color of the light source in local regions. In this approach, the network is given a set of ground truth regions and is designed to learn and minimize their differences from the local regions. The learning-based algorithms produce state-of-the-art results but come with several drawbacks. For instance, to implement such a network, the computer is required to have enough memory to store thousands of patches. In addition, they need to take complicated steps to estimate local and global light sources, such as segmentation, feature extraction, and calculation of the nearest neighbor to the training set. However, Finlayson proposed the fastest learning-based method [17]. The key idea of his method is to apply the traditional gray-world assumption to estimating the color of the light source and to devise a matrix that corrects the resulting estimation error. The network builds the matrix through dataset learning: it learns the colors and edge moments of a given image and, as a result, generates the elements of the matrix. Furthermore, Bianco and his colleagues [18] achieved state-of-the-art color constancy results by introducing a new method which uses a convolutional neural network. Their network has three parts: one convolutional layer with max pooling, one fully connected layer, and three output nodes. With their network, illumination estimation and fine-tuning are conducted on an image basis, not on a patch basis, and the purpose of fine-tuning is to minimize learning loss. This approach achieves a successful outcome on one specific dataset only, so it needs further experiments with more datasets. In addition, Lou et al. [19] proposed a deep convolutional neural network (DCNN) that is pre-trained to classify the large labeled ImageNet dataset. The performance of the network is assessed against hand-crafted color constancy algorithms. In the DCNN, ground truth labels are used to fine-tune on each single dataset.
As overviewed above, there has been a decent amount of color constancy research and a number of proposed approaches. Given the structural nature of computer vision, some challenges remain unsolved. More recently, Gijsenij et al. [20] proposed a scene-semantics-based color constancy method where natural image statistics are used to identify the most important characteristics of color images. Akbarinia et al. [21] suggested a color constancy method that overlaps two asymmetric Gaussian kernels of different sizes, in a way similar to changing the receptive field (RF); the kernels come in different sizes depending on the contrast of surrounding pixels. Hu and colleagues [22] introduced a color constancy method which uses AlexNet and SqueezeNet to estimate illumination. Their color constancy method outperforms conventional methods by delivering state-of-the-art results. Despite its cutting-edge performance, the methodology still has some inherent problems such as overfitting, gradient degradation, and vanishing gradients. Hussain et al. [23] proposed a color constancy method in which a histogram-based algorithm is used to determine an appropriate number of segments and efficiently split an input image into its key color variation areas. Zhan and colleagues [24] researched convolutional neural networks (CNNs) which use a cross-level architecture for color constancy.
In this light, this article proposes a new network-architecture-based approach; the new architecture uses a residual neural network which consists of pre-activation, atrous (dilated) convolution, and batch normalization. When receiving input image data, the proposed network crops each image into image patches before training. Once the network begins training, local semantic information is automatically extracted from the image patches and fed to its novel pooling layer. Simultaneously, the extracted information is estimated and combined to form global information during training. While conventional patch-based CNNs handle patches sequentially and individually, the proposed network takes all image patches into account simultaneously, which makes it more efficient and simpler for the network to compare and learn patches during training. The illumination estimation with the use of image patches is formulated in this work.
Among the CNN-based color constancy approaches, some methods estimate illumination based on local image patches, like the proposed approach in this work, while others rely on the full image data in its entirety. In the latter case, the full image data comes in the form of various chroma histograms. When the network takes the full image data as chroma histograms, the convolutional filters learn to assess and identify possible illumination color solutions in the chroma plane. However, spatial information is only weakly encoded in these histograms, and thus semantic context is largely ignored. When considering semantic information only at the global level, it is difficult for the network to learn and discern the significance of a semantically valuable local region. To compensate, researchers have proposed conventional convolutional networks [18,22] designed to extract and pool local features. In particular, in a study by Hu and colleagues [22], the authors proposed a pooling method that extracts the local confidence region from the original image and thus forms a weighted map, or mask. By using fully connected CNNs, their color constancy method shows better performance relative to its conventional counterparts. Yet the estimation accuracy of the weighted map in their approach remains open to challenge. In the fully connected layer method, each convolutional layer receives as input all the features combined from the output of an earlier layer, and each convolutional layer relies on local spatial coherence with a small receptive field. On the other hand, fully connected layers have several well-known vital problems and incur incredibly high computational cost. Motivated to solve these problems, the proposed CNN method uses the residual network to improve the estimation accuracy and reduce the expensive computational cost. In addition, the proposed network employs a pooling mechanism to reduce estimation ambiguities, as in previous studies [18,22].
With patch processing and semantic pooling together, the proposed network is able to distinguish between useful data and noisy data during training and evaluation. In the proposed network, semantic pooling, designed to extract local semantic information from the original image, is performed to form a mask, and the resulting image becomes a weighted map. By enabling the network to learn the semantic information in local regions and remove noisy data, the proposed color constancy approach becomes more robust to estimation ambiguities. In addition, the proposed network features end-to-end training, direct processing of arbitrarily sized images, and faster computation.
To the best of our knowledge, the proposed approach is the first study to investigate and use residual-network-based CNNs to achieve color constancy. In particular, the novelty of this approach is the use of the residual network, which is the main distinction from its conventional CNN-based counterparts. The residual network allows the proposed architecture to predict the scene illuminant on local regions, as opposed to many previous approaches where features are extracted from the entire image to obtain statistics and estimate the overall illuminant. In addition, the dilated convolution of the residual network is designed to handle multiscale appearance, contributing to efficiency. While only a few methods have been proposed to estimate spatially varying illuminants, the proposed approach has significance and the potential to advance CNN-based illumination estimation accuracy. The experimental results demonstrate that the proposed network stays ahead of other state-of-the-art techniques in predicting illumination and is less likely to cause large estimation errors than the conventional methods. Moreover, the proposed scheme is further applicable to solving other computer vision problems because of its strength of aggregating local estimates to determine a global estimate.

2. Technical Approach

In recent years, deep learning techniques have become remarkably advanced and have contributed to addressing computer vision challenges. The proposed network is one of the cutting-edge deep learning techniques based on ResNet [25]. ResNet is composed of several units: pre-activation, atrous convolution, batch normalization, and layers. ResNet performs better on ImageNet classification when its layers use skip connections. ResNet allows researchers and developers to design much deeper networks without gradient degradation and to acquire much larger receptive fields, often with highly distinct features. On receiving different input images (or values), the proposed network crops each input image into image patches, which automatically carry different semantic information. Next, the network learns and applies the semantic information in its novel pooling layer, where all local semantic information is estimated and combined to form global information.
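As a concrete illustration of this cropping step, the short sketch below extracts overlapping fixed-size patches from an input image. The 512 × 512 input size and 224 × 224 patch size follow the experimental setup in Section 3, while the stride value is an assumption, since the text does not state the amount of overlap.

```python
import numpy as np

def extract_patches(image, patch_size=224, stride=96):
    """Crop an HxWx3 image into overlapping square patches.
    Returns an array of shape (num_patches, patch_size, patch_size, 3)."""
    h, w, _ = image.shape
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

# Example: a 512 x 512 input (as in Section 3) yields a 4 x 4 grid of patches.
image = np.random.rand(512, 512, 3)
print(extract_patches(image).shape)   # (16, 224, 224, 3) with stride 96
```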

2.1. Color Constancy Approach

In general, given an RGB image $F$, the color constancy approach is designed to estimate the global illuminant color $I_g = (r, g, b)$ (or color cast) using a canonical light source color, usually perfect white $(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}})^T$, and to normalize the estimated global illuminant color as $\hat{I}_g = I_g / \|I_g\|$. The approach then replaces the estimated global illuminant color with the normalized global illuminant color. However, real-life scenes may contain multiple illuminants, which can affect the perceived color of an object. To address this problem, conventional methods attempt to estimate a single global illuminant color. Similarly, the proposed approach is designed to estimate $f_\theta$ and obtain the replacement $f_\theta(F) = \hat{I}_g$, and it notably uses a convolutional neural network (CNN) to estimate $f_\theta$, which brings the replacement closer to the ground truth illuminant color. Here, $\theta$ refers to the network parameters.
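To make the role of the estimated illuminant concrete, the following minimal sketch applies a standard diagonal (von Kries-style) correction once $\hat{I}_g$ has been estimated; the function name, the $\sqrt{3}$ rescaling toward the canonical white, and the final clipping are illustrative assumptions rather than part of the proposed network itself.

```python
import numpy as np

def correct_image(image, illuminant):
    """Diagonal (von Kries-style) correction: normalize the estimated illuminant
    to unit length and divide each RGB channel by it, so the color cast is mapped
    toward the canonical white (1/sqrt(3), 1/sqrt(3), 1/sqrt(3))^T.
    `image` is an HxWx3 float array in [0, 1]; `illuminant` is a length-3 vector."""
    i_hat = np.asarray(illuminant, dtype=np.float64)
    i_hat = i_hat / np.linalg.norm(i_hat)          # I_hat_g = I_g / ||I_g||
    corrected = image / (np.sqrt(3.0) * i_hat)     # per-channel gain
    return np.clip(corrected, 0.0, 1.0)            # keep values displayable
```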
Let $\hat{I}_g^*$ be the ground truth illuminant color. During dataset learning, the CNN minimizes a loss function. The loss function represents the angular error (in degrees) between the estimated color $\hat{I}_g$ and the ground truth illuminant color $\hat{I}_g^*$, described as follows:
$L(\hat{I}_g) = \frac{180}{\pi} \cos^{-1}\!\left(\hat{I}_g \cdot \hat{I}_g^*\right)$  (1)
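A minimal NumPy sketch of the angular error in Equation (1) is given below; in the actual network the same quantity would be computed with differentiable tensor operations, and the clipping of the cosine is an added numerical safeguard.

```python
import numpy as np

def angular_error_deg(estimate, ground_truth):
    """Angular error (degrees) between an estimated and a ground-truth
    illuminant, Equation (1). Both vectors are normalized to unit length first."""
    e = estimate / np.linalg.norm(estimate)
    g = ground_truth / np.linalg.norm(ground_truth)
    cos_sim = np.clip(np.dot(e, g), -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos_sim))

# Example: a slightly greenish estimate against a neutral ground truth.
print(angular_error_deg(np.array([0.55, 0.65, 0.52]), np.array([1.0, 1.0, 1.0])))
```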
In the CNN, $f_\theta$ is the estimate over all the semantically informative regions, ideally avoiding any repercussion of ambiguous light. Equation (2) explains how to calculate the final global illumination estimate. Let $R = \{R_1, R_2, R_3, \ldots, R_n\}$ be the local regions in $F$, and let $g(R_i)$, $i = 1, 2, 3, \ldots, n$, be the output of the regional illuminant color estimation for $R_i$. Then $f_\theta(F)$ is the normalization of the sum of the products of the semantic information $c(R_i)$ and the regional illumination estimates $g(R_i)$, and as a result delivers the final global illuminant estimate as follows [22]:
$f_\theta(F) = \hat{I}_g = \mathrm{norm}\!\left(\sum_{i \in R} c(R_i)\, g(R_i)\right)$  (2)
Intuitively, supposing that $R_i$, $i = 1, 2, 3, \ldots, n$, are local regions that contain useful semantic information for illuminant estimation, $c(R_i)$ should be large values.
In detail, the semantic pooling in Equation (2) is described as follows:
$\hat{I}_g = \sum_{i \in R} c_i \hat{I}_i, \quad i = 1, 2, 3, \ldots, n$  (3)
where
$c_i = \begin{cases} \sum_{x}^{N} \big(\hat{I}_i\big)^2, & \text{if } c_{\mathrm{mean},i} < \mathit{color\_threshold} \\ 0, & \text{otherwise} \end{cases}$  (4)
where $x$ is the coordinate within the local region of the image, $N$ is the total number of pixels in the local region, and $c_{\mathrm{mean},i}$ refers to the mean of the local semantic information.
$\hat{I}_g = \frac{I_g}{\|I_g\|_2} = \frac{1}{\|I_g\|_2} \sum_{i \in R} c_i \hat{I}_i, \quad i = 1, 2, 3, \ldots, n$  (5)
where $c_i$ and $\hat{I}_i$ refer to the semantic information and the local illuminant estimation, respectively.
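To illustrate Equations (2) to (5), the sketch below pools per-patch illuminant estimates with their semantic weights, zeroing out patches according to the threshold rule of Equation (4); the array shapes, variable names, and default threshold are assumptions made for illustration.

```python
import numpy as np

def semantic_pool(local_estimates, semantic_info, mean_info, color_threshold=0.5):
    """Confidence-weighted (semantic) pooling of local illuminant estimates.
    local_estimates: (n, 3) array of per-patch estimates I_hat_i.
    semantic_info:   (n,) array of per-patch semantic weights c_i.
    mean_info:       (n,) array of per-patch mean semantic values c_mean,i.
    Patches failing the threshold rule of Equation (4) receive weight zero; the
    remaining estimates are summed and normalized as in Equations (3) and (5)."""
    c = np.where(mean_info < color_threshold, semantic_info, 0.0)
    pooled = (c[:, None] * local_estimates).sum(axis=0)   # sum_i c_i * I_hat_i
    return pooled / np.linalg.norm(pooled)                # I_g / ||I_g||_2
```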
Using the chain rule on Equation (5), the gradient of the loss with respect to $\hat{I}_i$ is given by Equation (6) below:
$\frac{\partial L(\hat{I}_g)}{\partial \hat{I}_i} = \frac{c_i}{\|I_g\|_2} \times \frac{\partial L(\hat{I}_g)}{\partial \hat{I}_g}$  (6)
In Equation (6), the estimate $\hat{I}_i$ receives gradients of different magnitudes for different semantic information $c_i$. In estimating local illuminants, the semantic information serves as a mask over the salient region, which helps prevent the proposed network from learning noisy data. Similarly, the gradient with respect to the semantic information $c_i$ is calculated as follows:
$\frac{\partial L(\hat{I}_g)}{\partial c_i} = \frac{1}{\|I_g\|_2} \times \frac{\partial L(\hat{I}_g)}{\partial \hat{I}_g} \times \hat{I}_i$  (7)
Intuitively, a global estimate that aggregates the local illumination estimation colors is expected to come closer to the ground truth illumination color.
Figure 1 depicts a block diagram of the proposed color constancy method. As shown below, running the proposed DCNN architecture generates feature maps. The feature maps turn into weighted maps, or masks, through semantic pooling, where the proposed network distinguishes between useful data and noisy data. The semantic pooling is formulated in Equations (2) to (7). To achieve color constancy and improve performance, it is important to pay close attention to the estimation accuracy of the proposed DCNN architecture, since it has a significant impact on the accuracy of the weighted maps and eventually on the accuracy of the global illumination estimation. In this respect, the proposed method adopts the proposed DCNN architecture to accurately estimate the local semantic information, and this accuracy is demonstrated in the experimental results and evaluation section. The next subsection focuses on the proposed DCNN architecture.

2.2. The Proposed DCNN Architecture

A deep convolutional neural network (DCNN) is a major breakthrough in image classification. The DCNN naturally incorporates low-, mid-, and high-level image features and classifiers in an end-to-end multi-layer form. Its depth, or the number of stacked layers, can enrich the "level" of image features. However, some deeper networks like AlexNet and VGG-16 have a degradation problem: increasing the depth of the network accelerates accuracy degradation as well as accuracy saturation. That is, an increasing number of layers causes some deeper network models to make more training errors. This is why the proposed method adopts the well-known residual block to solve the issues with AlexNet and VGG-16, such as expensive computational cost, gradient degradation, and vanishing gradients, which are common issues in handling deep convolutional neural networks. In the proposed approach, a residual network is composed of multiple convolutional layers. With the input $y_{i-1}$, the output of the $i$th block is recursively defined as follows:
$y_i \equiv f_i(y_{i-1}) + y_{i-1}$  (8)
Let each layer take the sequential steps of convolution $f_i(x)$, batch normalization, and a rectified linear unit (ReLU) as the nonlinearity, where $f_i(x)$ is defined as follows:
$f_i(x) \equiv W_i' \cdot \sigma\big(B\big(W_i \cdot \sigma(B(x))\big)\big)$  (9)
where $W_i$ and $W_i'$ are weight matrices, $\cdot$ denotes convolution, $B(x)$ is batch normalization, and $\sigma(x) \equiv \max(x, 0)$. In the proposed ResNet architecture, the resolution of the feature maps drops to a quarter of the input resolution after passing through the first three layers. This allows the architecture to aggregate context and train faster. However, smaller feature maps constrain the architecture from learning high-resolution features, which are useful and required at later stages. To support the learning of high-resolution features, the proposed network has an additional convolution layer with a 3 × 3 kernel before the first convolution layer. This enables the network to learn high-resolution features without increasing the inference time by much. Furthermore, down-sampling principally reduces the resolution of feature maps. Although deconvolution layers are able to up-sample low-resolution feature maps, they cannot recover all the details completely. In addition, this procedure requires higher computational cost as well as intensive memory. To address such problems, the proposed method uses atrous convolution, also called dilated convolution [26]. Atrous convolution widens the kernel and simulates a larger field of perception. For a 1D input signal $x[i]$ with a filter $w[k]$ of length $K$, the atrous convolution is described as follows:
$y[i] = \sum_{k=1}^{K} x[i + r \times k]\, w[k]$  (10)
The rate $r$ refers to a stride with regard to sampling the input signal. For instance, a rate of 2 corresponds to a convolution on a 2 × 2 pooled feature map. The proposed network changes the stride of the last convolution from 2 to 1 and sets the others at $r = 2$. In this way, the smallest resolution is down-sampled 16 times, not 32 times, yet still preserves the higher-resolution details and aggregates the usual amount of context. Every object in a scene potentially varies in size, distance, and position, and DCNN filters usually do not fit this multiscale appearance. This has motivated researchers to investigate how a DCNN [27,28] learns multiscale features; their finding is that the DCNN must be given multi-resolution input images, which thus incurs a higher computational cost. To reduce the expensive computational cost and increase estimation efficiency, the proposed method builds the ResNet blocks from several atrous convolutions of different scales with $r > 1$. In this way, the network is enabled to learn multiscale features in every block. Furthermore, the concatenation preserves all the features within the block so that the network can learn to combine features generated at different scales.
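The 1D form of atrous convolution in Equation (10) can be written out directly; in the sketch below, the restriction to the "valid" output range and the example filter are illustrative assumptions.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1D atrous (dilated) convolution in the spirit of Equation (10):
    y[i] = sum_k x[i + rate * k] * w[k] (filter taps indexed from 0 here),
    computed only where the dilated filter fits inside the signal."""
    K = len(w)
    span = rate * (K - 1)                        # reach of the dilated filter
    return np.array([np.dot(x[i:i + span + 1:rate], w)
                     for i in range(len(x) - span)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, rate=1))   # ordinary 3-tap convolution
print(atrous_conv1d(x, w, rate=2))   # same kernel sampling every other input
```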
Figure 2 depicts the proposed DCNN architecture. To explain in more detail, the top half of the figure illustrates the whole process of the proposed DCNN architecture. The blue boxes are not all residual networks; there are six residual networks, two consisting of four layers and four consisting of three layers. A residual network is marked with its structure at its top right, which looks like a superscript. The bottom half of the figure gives explanatory notes and illustrates the two types of residual networks in detail. As in the explanatory notes, a convolutional layer is described in black: the top s stands for the stride and the bottom n indicates the n × n filter kernel size, with a symbol * to the middle left. A dilated convolution is described in red: the top d stands for a dilated convolution with a stride of 1 and the bottom n indicates the n × n filter kernel size, with a symbol * to the middle left. For instance, 1 and 1 with a symbol * in black translates as a convolutional layer with a stride of 1 and a 1 × 1 filter kernel. As another example, 2 and 3 with the symbol * in red mean that the rate r of the dilated convolution is 2, as in Equation (10), and that the filter kernel size is 3 × 3. The emphasis of the proposed DCNN architecture is on increasing the accuracy of estimating the local semantic information, which is vital to the final performance of the network, and on training the network to optimally combine the local estimates by adaptively using the corresponding g and c, as in Equation (2), for each local region, which suppresses the impact of ambiguous patches.
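A minimal Keras sketch of the pre-activation residual block of Equations (8) and (9), expressed with the layer notation of Figure 2 (black s * n for a stride-s convolution with an n × n kernel, red d * n for a dilated convolution with rate d and stride 1), is given below; the filter count and the assumption that the input already has a matching number of channels are illustrative choices, not the paper's exact configuration.

```python
from tensorflow.keras import layers

def preact_residual_block(x, filters, dilation_rate=2):
    """Pre-activation residual block, y_i = f_i(y_{i-1}) + y_{i-1}, with
    f_i(x) = W_i' . sigma(B(W_i . sigma(B(x)))) as in Equations (8) and (9).
    dilation_rate > 1 gives the red "d * 3" layers of Figure 2; dilation_rate=1
    corresponds to an ordinary black "1 * 3" layer.
    Assumes `x` already has `filters` channels so the skip connection adds cleanly."""
    h = layers.BatchNormalization()(x)                         # B(x)
    h = layers.ReLU()(h)                                       # sigma(.)
    h = layers.Conv2D(filters, 3, padding="same",
                      dilation_rate=dilation_rate)(h)          # W_i (3 x 3 kernel)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.Conv2D(filters, 3, padding="same",
                      dilation_rate=dilation_rate)(h)          # W_i'
    return layers.Add()([h, x])                                # skip connection

# A black "2 * 3" box (stride 2, 3 x 3 kernel) outside the residual blocks would be:
downsample = layers.Conv2D(64, 3, strides=2, padding="same")
```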

3. Experimental Results and Evaluations

A feasibility study uses two standard benchmark datasets: the reprocessed Color Checker Dataset [15] and the NUS 8-Camera Dataset [8]. These datasets consist of 568 and 1736 raw images, respectively. The 768 × 384 input image in Figure 2 is resized to 512 × 512 pixels and then cropped into overlapping 224 × 224 image patches. There is a trade-off between patch coverage (and accuracy) and efficiency: with more patches, the CNN achieves higher coverage and accuracy but lower efficiency. Through additional pooling, the proposed network combines patch-based estimates to obtain a global illuminant. The proposed network is trained in an end-to-end fashion with back-propagation. For the proposed network, Adam [29] is used to optimize the parameter setting for all layers, which reduces overfitting and improves performance. The experiment with the proposed network compares total training losses at four different learning rates over 10,000 iterations (or epochs), using a server with a Titan XP GPU and taking 1.5 days. Figure 3 illustrates the comparative experimental results and finds that $4 \times 10^{-4}$ is the optimal base learning rate. The symbol “1.00E-03” represents a learning rate of $1 \times 10^{-3}$. Likewise, other parameters are optimized, including a dropout probability of 0.5 for the 6 × 6 × 64 convolution layer in Figure 2, a batch size of 16, a weight decay of $5 \times 10^{-5}$, and so forth.
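A hedged sketch of the reported training configuration (Adam with a base learning rate of $4 \times 10^{-4}$, batch size 16, weight decay of $5 \times 10^{-5}$, and a dropout probability of 0.5) is shown below; realizing the weight decay as an L2 kernel regularizer and the placeholder model and dataset names are assumptions, since the text does not state how these pieces are implemented.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Hyperparameters reported in the text.
BASE_LEARNING_RATE = 4e-4
BATCH_SIZE = 16
WEIGHT_DECAY = 5e-5
DROPOUT_PROB = 0.5

optimizer = tf.keras.optimizers.Adam(learning_rate=BASE_LEARNING_RATE)

# One way to realize the weight decay is an L2 kernel regularizer on the
# convolution layers (an assumption; the paper does not state the mechanism).
regularized_conv = layers.Conv2D(64, 3, padding="same",
                                 kernel_regularizer=regularizers.l2(WEIGHT_DECAY))
dropout = layers.Dropout(DROPOUT_PROB)  # applied at the 6 x 6 x 64 convolution stage

# Training-loop sketch; `model`, `angular_error_loss`, and `patch_dataset`
# are placeholders standing in for the pieces defined elsewhere.
# model.compile(optimizer=optimizer, loss=angular_error_loss)
# model.fit(patch_dataset.batch(BATCH_SIZE), epochs=...)
```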
Figure 4 compares median angular errors with and without semantic information, recording the errors every 20 iterations (or epochs). As a result, the errors sharply drop with semantic information. From an illuminant estimation point of view, the choice of semantic information has the effect of improving computational color constancy significantly.
Figure 5 shows Shi's re-processed dataset [30] and the resulting images from implementing the proposed network in TensorFlow [31]. The proposed DCNN architecture is focused on increasing the accuracy of estimating the local semantic information to improve performance. As a result, in Figure 5e, the greenish blue illuminant of the original image is efficiently removed, and the true colors of objects are well represented without color distortion compared with the original image in Figure 5a.
The proposed network is compared with 27 different state-of-the-art methods, which include both unitary and combinational methods. The 27 different methods are benchmarked from several sources. Specifically, AlexNet-FC and SqueezeNet-FC are benchmarked from [22]; and, except for DS-Net, the other 22 methods are from [32,33,34,35,36,37,38,39]. DS-Net is cited from [40]. In this comparative study, the source codes of AlexNet-FC and SqueezeNet-FC are downloaded from GitHub [41], and DS-Net is downloaded from GitHub as well [42]. The source codes of CCATI [23] and Zhan et al. [24] are implemented in MATLAB and TensorFlow, with parameters fixed as suggested by those articles.
For quantitative comparison purposes, Table 1 compares the proposed method with previous mainstream algorithms in terms of illuminant estimation accuracy. It reports several standard metrics: the mean, median, trimean, mean of the best quarter (best 25%), and mean of the worst quarter (worst 25%) of the angular error (Equation (1)). This comparative study uses the well-known Gehler and Shi dataset [15], which contains 568 images of people, places, and objects in indoor and outdoor scenes, where the Macbeth color checker chart is placed at a known location in every scene. This dataset includes both single- and multiple-illuminant natural images. The proposed network surpasses all its conventional counterparts in trimean and worst 25%.
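The statistics reported in Table 1, together with the RMSE used later in Figure 7, can be computed from a list of per-image angular errors as in the sketch below; the quartile-weighted trimean formula is an assumption, since the paper does not spell out its definition.

```python
import numpy as np

def error_statistics(angular_errors):
    """Summary statistics over per-image angular errors (in degrees):
    mean, median, trimean, mean of the best 25%, mean of the worst 25%, and RMSE."""
    e = np.sort(np.asarray(angular_errors, dtype=np.float64))
    q1, q2, q3 = np.percentile(e, [25, 50, 75])
    n = len(e)
    return {
        "mean": e.mean(),
        "median": q2,
        "trimean": (q1 + 2.0 * q2 + q3) / 4.0,     # quartile-weighted average
        "best_25": e[: max(1, n // 4)].mean(),     # lowest quarter of errors
        "worst_25": e[-max(1, n // 4):].mean(),    # highest quarter of errors
        "rmse": np.sqrt(np.mean(e ** 2)),
    }
```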
Figure 6 is an angular error (AE) histogram comparison between the proposed network and several best-performing conventional methods selected from Table 1: the CNN-based method, CCATI, ExemplarCC, the ensemble of decision (ED) tree based method, Zhan et al., DS-Net, SqueezeNet-FC, and AlexNet-FC. Joze and Drew [16] proposed an exemplar method which estimates the local source illuminant by finding the neighboring surfaces in the training data, which consist of weak color constant RGB values and texture features. The ensemble of decision tree based method [38] is a discrete version of gamut mapping which uses a correlation matrix, instead of the canonical gamut, for the considered illuminants and uses the image data to calculate the probability that the illumination in the test image is caused by each of the known illuminants. Shi et al. [40] proposed a branch-level ensemble of neural networks consisting of two interacting sub-networks: a hypotheses network and a selection network. The selection network picks confident estimations from the plausible illuminant estimations generated by the hypotheses network. Shi's method produces accurate results, but the model size is huge and its processing speed is slow. That is, when the CNNs go deeper by adding layers, fully connected layers have several well-known vital problems, including incredibly expensive computation. To solve these problems, the proposed CNN method adopts the residual network to improve the estimation accuracy and reduce the expensive computational cost. Further, the pooling mechanism employed by the proposed network contributes to reducing estimation ambiguities.
In estimating illuminants, the proposed network stays ahead of the state-of-the-art methods, with 76.41% of the tested images under an angular error of 3° and 97% under an angular error of 6°. Figure 7 compares the root mean square error (RMSE) results among the CNN-based, exemplar-based, and ensemble of decision (ED) tree based methods, AlexNet-FC, CCATI, Zhan et al., DS-Net, SqueezeNet-FC, and the proposed network (proposed), with the angular error as input. The proposed network records a lower RMSE than its conventional counterparts. Therefore, the proposed network is deemed to be robust and generates lower AE and RMSE in estimating the illumination of a wide range of image scenes.
To further verify the proposed method, additional experiments were conducted using the SFU-lab dataset [33] and the gray-ball dataset [43]. The SFU-lab dataset contains four different subsets: objects with minimal specularities (22 scenes, 223 images in total), objects with at least one clear dielectric specularity (9 scenes, 98 images in total), objects with metallic specularities (14 scenes, 149 images in total), and objects with fluorescent surfaces (6 scenes, 59 images in total). A commonly used subset in the literature is the union of the first two subsets. Furthermore, the gray-ball dataset [43] has a total of 11,340 images of 360 × 240 pixels from a range of scenarios, taken under natural single- or mixed-illuminant lighting conditions with a gray ball placed in front of the video camera; thus, many of the images are nearly identical scenes. Figure 8 illustrates the comparative results of the median angular errors using 321 different SFU-lab images, and Figure 9 depicts the comparative results of the mean angular errors using 500 different gray-ball dataset images. In both experiments, the proposed network records the lowest angular error in terms of the median and mean, respectively. Therefore, the proposed method gets ahead of the conventional methods.
The NUS 8-Camera Dataset [8] was additionally chosen to assess the camera-invariant performance of the proposed method. The NUS 8-Camera Dataset is the most recent and well-known color constancy dataset, consisting of 210 individual scenes captured by eight cameras, or a total of 1736 images. With the NUS 8-Camera image dataset, 11 conventional methods and the proposed network are evaluated and compared. Table 2 displays the camera-wise performance comparison of the proposed network with the 11 conventional methods. As a result, the proposed network outperforms its 11 conventional counterparts. Accordingly, the proposed method is deemed robust regardless of camera conditions.

4. Conclusions

A color constancy algorithm is designed to remove color casts from images and reveal the actual colors of objects, as well as to preserve a constant distribution of the light spectrum across digital images, in an effort to address the challenges that computer vision algorithms and methods naturally face.
Accordingly, this article presents a novel network architecture that uses a residual neural network composed of pre-activation, atrous (dilated) convolution, and batch normalization. The proposed network is intended to enable image patches to carry different semantic information automatically upon receiving different input values. The network learns and applies semantic information in its novel pooling layer for global estimation.
As shown in the comparative experimental results for AE, the proposed network achieves much higher accuracy than its state-of-the-art counterparts. In the comparative AE histogram, the proposed network gets ahead of its state-of-the-art counterparts, with 76.41% of the images under an AE of 3° and 97% under an AE of 6°. In the RMSE comparison as well, the proposed network records the lowest value. Therefore, the proposed network proves to be robust and yields lower AE and RMSE in estimating the illumination of a wide range of image scenes. Furthermore, through additional experiments with two more datasets of different semantic information levels, the SFU-lab and gray-ball datasets, the proposed network also achieves lower median and mean angular errors, respectively. In addition, the proposed network is evaluated on the NUS 8-Camera Dataset to verify its camera-invariant performance. As a result, the proposed method outperforms its conventional counterparts as a camera-invariant color constancy model by obtaining competitive results under uniform, non-uniform, and multiple-illuminant conditions. Notwithstanding, the preprocessing method and CNN structure still need to advance in estimating the color casts of light sources regardless of illumination conditions as well as camera sensitivity. To this end, this study will continue to work on advancing illumination estimation accuracy.

Author Contributions

Conceptualization, H.-H.C., B.-J.Y., and H.-S.K.; data curation, H.-H.C.; formal analysis, H.-H.C. and B.-J.Y.; funding acquisition, B.-J.Y. and H.-S.K.; investigation, H.-H.C.; methodology, H.-H.C., B.-J.Y., and H.-S.K.; project administration, B.-J.Y. and H.-S.K.; resources, H.-H.C. and B.-J.Y.; software, H.-H.C.; supervision, B.-J.Y. and H.-S.K.; validation, H.-H.C. and B.-J.Y.; visualization, H.-S.K. and H.-H.C.; writing—original draft, H.-H.C. and B.-J.Y.; writing—review and editing, B.-J.Y. and H.-S.K. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Education (NRF-2019R1I1A3A01061844), in part by the NRF of Korea funded by the Ministry of Education (NRF-2018R1D1A1B07040457), and in part by the research project "Development of IoT Infrastructure Technology for Smart Port" financially supported by the Ministry of Oceans and Fisheries, Korea.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bianco, S.; Cusano, C.; Schettini, R. Single and Multiple Illuminant Estimation Using Convolutional Neural Networks. arXiv 2015, arXiv:1508.00998v2.
2. Kulkarni, S.G.; Kamalapur, S.M. Color Constancy Techniques. Int. J. Eng. Comput. Sci. 2014, 3, 9147–9150.
3. Buchsbaum, G. A spatial processor model for object colour perception. J. Frankl. Inst. 1980, 310, 1–26.
4. Finlayson, G.; Trezzi, E. Shades of gray and colour constancy. In Proceedings of the Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, AZ, USA, 9 November 2004; pp. 37–41.
5. Funt, B.; Shi, L. The rehabilitation of MaxRGB. In Proceedings of the 18th Color and Imaging Conference, San Antonio, TX, USA, 12 April 2010; pp. 256–259.
6. van de Weijer, J.; Gevers, T.; Gijsenij, A. Edge-based color constancy. IEEE Trans. Image Process. 2007, 16, 2207–2214.
7. Gao, S.; Han, W.; Yang, K.; Li, C.; Li, Y. Efficient color constancy with local surface reflectance statistics. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Volume 8690, pp. 158–173.
8. Cheng, D.; Prasad, D.K.; Brown, M.S. Illuminant estimation for color constancy: Why spatial-domain methods work and the role of the color distribution. J. Opt. Soc. Am. A 2014, 31, 1049–1058.
9. Yang, K.-F.; Gao, S.-B.; Li, Y.-J. Efficient illuminant estimation for color constancy using gray pixel. In Proceedings of the Computer Vision Foundation Conference: CVPR 2015, Boston, MA, USA, 7–12 June 2015; pp. 2254–2261.
10. Tan, R.T.; Nishino, K.; Ikeuchi, K. Color constancy through inverse-intensity chromaticity space. J. Opt. Soc. Am. A 2004, 21, 321–334.
11. Finlayson, G.; Schaefer, G. Solving for colour constancy using a constrained dichromatic reflectance model. Int. J. Comput. Vis. 2001, 42, 127–144.
12. Gijsenij, A.; Gevers, T.; van de Weijer, J. Generalized gamut mapping using image derivative structures for color constancy. Int. J. Comput. Vis. 2010, 86, 127–139.
13. Forsyth, D.A. A novel algorithm for color constancy. Int. J. Comput. Vis. 1990, 5, 5–35.
14. Finlayson, G.D.; Hordley, S.D.; Hubel, P.M. Color constancy. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1209–1221.
15. Gehler, P.V.; Rother, C.; Blake, A.; Minka, T.; Sharp, T. Bayesian color constancy revisited. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 26 June 2008; pp. 1–8.
16. Joze, H.R.V.; Drew, M.S. Exemplar-based color constancy and multiple illumination. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 860–873.
17. Finlayson, G.D. Corrected-moment illuminant estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, NSW, Australia, 1–8 December 2013; pp. 1904–1911.
18. Bianco, S.; Cusano, C.; Schettini, R. Color constancy using CNNs. In Proceedings of the Deep Vision: Deep Learning in Computer Vision (CVPR Workshop), Boston, MA, USA, 11 June 2015.
19. Lou, Z.; Gevers, T.; Hu, N.; Lucassen, M. Color constancy by deep learning. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015.
20. Gijsenij, A.; Gevers, T. Color constancy using natural image statistics and scene semantics. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 687–698.
21. Akbarinia, A.; Parraga, C.A. Color constancy beyond the classical receptive field. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2081–2094.
22. Hu, Y.; Wang, B.; Lin, S. FC4: Fully convolutional color constancy with confidence-weighted pooling. In Proceedings of the CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 4085–4094.
23. Hussain, M.A.; Akbari, A.S.; Abbott-Halpin, E. Color constancy for uniform and non-uniform illuminant using image texture. IEEE Access 2019, 7, 72964–72978.
24. Zhan, H.; Shi, S.; Huo, Y. Computational colour constancy based on convolutional neural networks with a cross-level architecture. IET Image Process. 2019, 13, 1304–1313.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016.
26. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
27. Ghiasi, G.; Fowlkes, C.C. Laplacian pyramid reconstruction and refinement for semantic segmentation. In Proceedings of the ECCV, Amsterdam, The Netherlands, 8–16 October 2016.
28. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv 2016, arXiv:1606.00915.
29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
30. Shi, L.; Funt, B. Re-Processed Version of the Gehler Color Constancy Dataset of 568 Images. Simon Fraser University, 2010. Available online: http://www.cs.sfu.ca/~colour/data/ (accessed on 15 November 2019).
31. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467.
32. Land, E.H. The Retinex Theory of Color Vision. Sci. Am. 1977, 237, 108–129.
33. Gijsenij, A.; Gevers, T. Color Constancy: Research Website on Illumination Estimation. 2011. Available online: http://colorconstancy.com (accessed on 15 November 2019).
34. Xiong, W.; Funt, B. Estimating illumination chromaticity via support vector regression. J. Imaging Sci. Technol. 2006, 50, 341–348.
35. Zakizadeh, R.; Brown, M.S.; Finlayson, G.D. A hybrid strategy for illuminant estimation targeting hard images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 16–23.
36. Bianco, S.; Cusano, C.; Schettini, R. Automatic color constancy algorithm selection and combination. Pattern Recognit. 2010, 43, 695–705.
37. van de Weijer, J.; Schmid, C.; Verbeek, J. Using high-level visual information for color constancy. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
38. Cheng, D.; Price, B.; Cohen, S.; Brown, M.S. Effective learning-based illuminant estimation using simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1000–1008.
39. Finlayson, G.D.; Hordley, S.D.; Hubel, P.M. Color by correlation: A simple, unifying framework for color constancy. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1209–1221.
40. Shi, W.; Loy, C.C.; Tang, X. Deep specialized network for illuminant estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 371–387.
41. Available online: https://github.com/yuanming-hu/fc4 (accessed on 24 May 2020).
42. Available online: https://github.com/swift-n-brutal/illuminant_estimation (accessed on 24 May 2020).
43. Ciurea, F.; Funt, B. A large image database for color constancy research. In Proceedings of the 11th Color Imaging Conference Final Program, Scottsdale, AZ, USA, 4–7 November 2003; pp. 160–164.
Figure 1. Block diagram of the proposed method.
Figure 2. Proposed deep convolutional neural network (DCNN) architecture with pre-activation, atrous convolution, and batch-normalization.
Figure 3. Total training loss comparison in the logarithm space of four different learning rates to optimize the base learning rate.
Figure 4. Comparison of median angular errors with and without semantic information (SI).
Figure 5. Shi’s re-processed dataset and their resulting images: (a) original image, (b) illumination estimation map, (c) weighted map, (d) image × weighted map, and (e) corrected image.
Figure 6. Comparative angular error (AE) histogram of convolutional neural network (CNN)-based, exemplar-based, ensemble of decision (ED) tree based methods, DS-Net, AlexNet-FC, SqueezeNet-FC, CCATI, Zhan et al., and the proposed network (proposed net.) with Shi’s re-processed dataset and the NUS 8-Camera Dataset.
Figure 7. Comparison of root mean square error (RMSE) results of CNN-based, exemplar-based, ensemble of decision (ED) tree based methods, DS-Net, AlexNet-FC4, SqueezeNet-FC4, CCATI, Zhan et al., and the proposed network (proposed) with the angular error as input.
Figure 8. Comparison of the median angular errors of 321 different SFU-lab images.
Figure 9. Comparison of the mean angular errors of 500 different gray-ball images.
Table 1. Comparative statistical metrics between the proposed network and conventional methods with Shi’s re-processed dataset and the NUS-8 Camera Dataset (the lower, the better).
Methods                                        Mean    Median  Trimean  Best 25%  Worst 25%
Statistics-Based Methods
White patch [3]                                7.55    5.68    6.35     1.42      16.12
Gray-world [32]                                6.36    6.28    6.28     2.33      10.58
1st-order grey edge [6]                        5.33    4.52    4.73     1.86      10.03
2nd-order grey edge [6]                        5.13    4.44    4.62     2.11      9.26
Shades of grey [4]                             4.93    4.01    4.23     1.14      10.20
General grey world [6]                         4.66    3.48    3.81     1.00      10.09
Modified white patch [7]                       3.87    2.84    3.15     0.92      8.38
Bright-and-dark color PCA [32]                 3.52    2.14    2.47     0.50      8.74
Local surface reflectance [34]                 3.31    2.80    2.87     1.14      6.39
CCATI [23]                                     2.34    1.60    1.91     0.49      5.28
Learning-Based Methods
SVR regression [32]                            8.08    6.73    7.19     3.35      14.89
Edge-based Gamut [34]                          6.52    5.04    5.43     1.90      13.58
Bayesian [34]                                  4.82    3.46    3.88     1.26      10.46
Pixel-based Gamut [34]                         4.20    2.33    2.91     0.50      10.72
Intersection-based Gamut [34]                  4.20    2.39    2.93     0.51      10.70
CART-based combination [12]                    3.90    2.91    3.21     1.02      8.27
Spatio-spectral [36]                           3.59    2.96    3.10     0.95      7.61
Bottom-up + top-down [38]                      3.48    2.27    2.61     0.84      8.01
ExemplarCC [38]                                2.89    2.27    2.42     0.82      5.97
19-edge corrected-moment [17]                  2.86    2.04    2.22     0.70      6.34
CNN-based method [18]                          2.75    1.99    2.14     0.74      6.05
Ensemble of decision tree based method [39]    2.42    1.65    1.75     0.38      5.87
Zhan et al. [24]                               2.29    1.90    2.03     0.57      4.72
DS-Net [40]                                    2.24    1.46    1.68     0.48      6.08
SqueezeNet-FC [22]                             2.23    1.57    1.72     0.47      5.15
AlexNet-FC [22]                                2.12    1.53    1.64     0.48      4.78
Proposed network                               2.09    1.42    1.60     0.35      4.65
Table 2. Performance comparison between grey world (GW) [32], white patch (WP) [3], shades of grey (SoG) [4], general grey world (GGW) [6], 1st-order grey edge (GE1) [6], 2nd-order grey edge (GE2) [6], local surface reflectance statistics (LSR) [34], pixels-based Gamut (PG) [35], Bayesian framework (BF) [35], spatio-spectral statistics (SS) [37], natural image statistics (NIS) [20], and the proposed network (PN) with NUS dataset.
Statistics-based methods: GW, WP, SoG, GGW, GE1, GE2, LSR. Learning-based methods: PG, BF, SS, NIS, PN.

Camera       GW     WP     SoG    GGW    GE1    GE2    LSR    PG     BF     SS     NIS    PN
Mean Angular Error
Canon1Ds     5.16   7.99   3.81   3.16   3.45   3.47   3.43   6.13   3.58   3.21   4.18   3.18
Canon600D    3.89   10.96  3.23   3.24   3.22   3.21   3.59   14.51  3.29   2.67   3.43   2.35
FujiXM1      4.16   10.20  3.56   3.42   3.13   3.12   3.31   8.59   3.98   2.99   4.05   3.10
NikonD5200   4.38   11.64  3.45   3.26   3.37   3.47   3.68   10.14  3.97   3.15   4.10   2.35
OlympEPL6    3.44   9.78   3.16   3.08   3.02   2.84   3.22   6.52   3.75   2.86   3.22   2.47
LumixGX1     3.82   13.41  3.22   3.12   2.99   2.99   3.36   6.00   3.41   2.85   3.70   2.46
SamNX2000    3.90   11.97  3.17   3.22   3.09   3.18   3.84   7.74   3.98   2.94   3.66   2.32
SonyA57      4.59   9.91   3.67   3.20   3.35   3.36   3.45   5.27   3.50   3.06   3.45   2.33
Median Angular Error
Canon1Ds     4.15   6.19   2.73   2.35   2.48   2.44   2.51   4.30   2.80   2.67   3.04   2.71
Canon600D    2.88   12.44  2.58   2.28   2.07   2.29   2.72   14.83  2.35   2.03   2.46   2.19
FujiXM1      3.30   10.59  2.81   2.60   1.99   2.00   2.48   8.87   3.20   2.45   2.96   2.82
NikonD5200   3.39   11.67  2.56   2.31   2.22   2.19   2.83   10.32  3.10   2.26   2.40   1.92
OlympEPL6    2.58   9.50   2.42   2.18   2.11   2.18   2.49   4.39   2.81   2.24   2.17   2.12
LumixGX1     3.06   18.00  2.30   2.23   2.16   2.04   2.48   4.74   2.41   2.22   2.28   1.42
SamNX2000    3.00   12.99  2.33   2.57   2.23   2.32   2.90   7.91   3.00   2.29   2.77   1.32
SonyA57      3.46   7.44   2.94   2.56   2.58   2.70   2.51   4.26   2.36   2.58   2.88   1.65
Tri-mean Error
Canon1Ds     4.46   6.98   3.06   2.50   2.74   2.70   2.81   4.81   2.97   2.79   3.30   2.69
Canon600D    3.07   11.40  2.63   2.41   2.36   2.37   2.95   14.78  2.40   2.18   2.72   2.33
FujiXM1      3.40   10.25  2.93   2.72   2.26   2.27   2.65   8.64   3.33   2.55   3.06   2.88
NikonD5200   3.59   11.53  2.74   2.49   2.52   2.58   3.03   10.25  3.36   2.49   2.77   1.95
OlympEPL6    2.73   9.54   2.59   2.35   2.26   2.20   2.59   4.79   3.00   2.28   2.42   2.18
LumixGX1     3.15   14.98  2.48   2.45   2.25   2.26   2.78   4.98   2.58   2.37   2.67   1.81
SamNX2000    3.15   12.45  2.45   2.66   2.32   2.41   3.24   7.70   3.27   2.44   2.94   1.65
SonyA57      3.81   8.78   3.03   2.68   2.76   2.80   2.70   4.45   2.57   2.74   2.95   1.91
Mean of Best 25%
Canon1Ds     0.95   1.56   0.66   0.64   0.81   0.86   1.06   1.05   0.76   0.88   0.78   0.65
Canon600D    0.83   2.03   0.64   0.63   0.73   0.80   1.17   9.98   0.69   0.68   0.78   0.73
FujiXM1      0.91   1.82   0.87   0.73   0.72   0.70   0.99   3.44   0.93   0.81   0.86   0.75
NikonD5200   0.92   1.77   0.72   0.63   0.79   0.73   1.16   4.35   0.92   0.86   0.74   0.57
OlympEPL6    0.85   1.65   0.76   0.72   0.65   0.71   1.15   1.42   0.91   0.78   0.76   0.80
LumixGX1     0.82   2.25   0.78   0.70   0.56   0.61   0.82   2.06   0.68   0.82   0.79   0.65
SamNX2000    0.81   2.59   0.78   0.77   0.71   0.74   1.26   2.65   0.93   0.75   0.75   0.53
SonyA57      1.16   1.44   0.98   0.85   0.79   0.89   0.98   1.28   0.78   0.87   0.83   0.57
Mean of Worst 25%
Canon1Ds     11.00  16.75  8.52   7.08   7.69   7.76   7.30   14.16  7.95   6.43   9.51   6.67
Canon600D    8.53   18.75  7.06   7.58   7.48   7.41   7.40   18.45  7.93   5.77   5.76   5.29
FujiXM1      9.04   18.26  7.55   7.62   7.32   7.23   7.06   13.4   8.82   5.99   9.37   5.64
NikonD5200   9.69   21.89  7.69   7.53   8.42   8.21   7.57   15.93  8.18   6.90   10.01  4.86
OlympEPL6    7.41   18.58  6.78   6.69   6.88   6.47   6.55   15.42  8.19   6.14   7.46   4.62
LumixGX1     8.45   20.40  7.12   6.86   7.03   6.86   7.42   12.19  8.00   5.90   8.74   5.74
SamNX2000    8.51   20.23  6.92   6.85   7.00   7.23   7.98   13.01  8.62   6.22   8.16   5.55
SonyA57      9.85   21.27  7.75   6.68   7.18   7.14   7.32   11.16  8.02   6.17   7.18   5.12
