Improving CNN-Based Texture Classification by Color Balancing

Abstract: Texture classification has a long history in computer vision. In the last decade, the strong affirmation of deep learning techniques in general, and of convolutional neural networks (CNN) in particular, has allowed for a drastic improvement in the accuracy of texture recognition systems. However, their performance may be dampened by the fact that texture images are often characterized by color distributions that are unusual with respect to those seen by the networks during their training. In this paper we will show how suitable color balancing models allow for a significant improvement in the accuracy in recognizing textures for many CNN architectures. The feasibility of our approach is demonstrated by the experimental results obtained on the RawFooT dataset, which includes texture images acquired under several different lighting conditions.


Introduction
Convolutional neural networks (CNNs) represent the state of the art for many image classification problems [1][2][3]. They are trained for a specific task by exploiting a large set of images representing the application domain. During the training and the test stages, it is common practice to preprocess the input images by centering their color distribution around the mean color computed on the training set. However, when test images have been taken under acquisition conditions unseen during training, or with a different imaging device, this simple preprocessing may not be enough (see the example reported in Figure 1 and the work by Chen et al. [4]).
The most common approach to deal with variable acquisition conditions consists of applying a color constancy algorithm [5], while to obtain a device-independent color description a color characterization procedure is applied [6]. A standard color-balancing model is therefore composed of two modules: the first discounts the illuminant color, while the second maps the image colors from the device-dependent RGB space into a standard device-independent color space. More effective pipelines have been proposed [7,8] that deal with the cross-talk between the two processing modules.
In this paper we systematically investigate different color-balancing models in the context of CNN-based texture classification under varying illumination conditions. To this end, we performed our experiments on the RawFooT texture database [9], which includes images of textures acquired under a large number of controlled combinations of illumination color, direction and intensity.
Concerning CNNs, when the training set is not big enough, an alternative to the full training procedure consists of adapting an already trained network to a new classification task by retraining only a small subset of parameters [10]. Another possibility is to use a pretrained network as a feature extractor for another classification method (nearest neighbor, for instance). In particular, it is common to use networks trained for the ILSVRC contest [11]. The ILSVRC training set includes over one million images taken from the web to represent 1000 different concepts. The acquisition conditions of training images are not controlled, but we may safely assume that they have been processed by digital processing pipelines that mapped them into the standard sRGB color space. We will investigate how different color-balancing models permit adapting images from the RawFooT dataset in such a way that they can be more reliably classified by several pretrained networks.
The rest of the paper is organized as follows: Section 2 summarizes the state of the art in both texture classification and color balancing; Section 3 presents the data and the methods used in this work; Section 4 describes the experimental setup and Section 5 reports and discusses the results of the experiments. Finally, Section 6 concludes the paper by highlighting its main outcomes and by outlining some directions for future research on this topic.

Color Texture Classification under Varying Illumination Conditions
Most of the research efforts on the topic of color texture classification have been devoted to the definition of suitable descriptors able to capture the distinctive properties of the texture images while being invariant, or at least robust, with respect to some variations in the acquisition conditions, such as rotations and scalings of the image, changes in brightness, contrast, light color temperature, and so on [12].
Color and texture information can be combined in several ways. Palm categorized them into parallel (i.e., separate color and texture descriptors), sequential (in which color and texture analysis are consecutive steps of the processing pipeline) and integrative (texture descriptors computed on different color planes) approaches [13]. The effectiveness of several combinations of color and texture descriptors has been assessed by Mäenpää and Pietikäinen [14], who showed how the descriptors in the state of the art performed poorly in the case of a variable color of the illuminant. Their findings have been more recently confirmed by Cusano et al. [9].
In order to successfully exploit color in texture classification, the descriptors need to be invariant (or at least robust) with respect to changes in the illumination. For instance, Seifi et al. proposed characterizing color textures by analyzing the rank correlation between pixels located in the same neighborhood and by using a correlation measure which is related to the colors of the pixels, and is not sensitive to illumination changes [15]. Cusano et al. [16] proposed a descriptor that measures the local contrast: a property that is less sensitive than color itself to variations in the color of the illuminant. The same authors then enhanced their approach by introducing a novel color space where changes in illumination are even easier to deal with [17]. Other strategies for color texture recognition have been proposed by Drimbarean and Whelan, who used Gabor filters and co-occurrence matrices [18], and by Bianconi et al., who used ranklets and the discrete Fourier transform [19].
Recent works suggested that, in several application domains, carefully designed features can be replaced by features automatically learned from a large amount of data with methods based on deep learning [20]. Cimpoi et al., for instance, used Fisher Vectors to pool features computed by a CNN trained for object recognition [21]. Approaches based on CNNs have been compared against combinations of traditional descriptors by Cusano et al. [22], who found that CNN-based features generally outperform the traditional handcrafted ones unless complex combinations are used.

Color Balancing
The aim of color constancy is to make sure that the recorded color of the objects in the scene does not change under different illumination conditions. Several computational color constancy algorithms have been proposed [5], each based on different assumptions. For example, the gray world algorithm [23] is based on the assumption that the average color in the image is gray and that the illuminant color can be estimated as the shift from gray of the averages in the image color channels. The white point algorithm [24] is based on the assumption that there is always a white patch in the scene and that the maximum values in each color channel are caused by the reflection of the illuminant on the white patch, so that they can be used as the illuminant estimate. The gray edge algorithm [25] is based on the assumption that the average color of the edges is gray and that the illuminant color can be estimated as the shift from gray of the averages of the edges in the image color channels. Gamut mapping assumes that for a given illuminant, one observes only a limited gamut of colors [26]. Learning-based methods also exist, such as Bayesian [27], CART-based [28], and CNN-based [29,30] approaches, among others.
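As a minimal sketch of the first two assumptions above, the following Python functions estimate the illuminant with the gray world and white patch rules on a list of RGB pixels. The diagonal (Von Kries-style) correction and the choice of normalizing the gains to the green channel are assumptions made for illustration, not details taken from the cited algorithms; all function names are hypothetical.

```python
def gray_world(pixels):
    """Estimate the illuminant as the per-channel mean (gray world assumption)."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def white_patch(pixels):
    """Estimate the illuminant as the per-channel maximum (white patch assumption)."""
    return tuple(max(p[c] for p in pixels) for c in range(3))


def diagonal_correction(pixels, illuminant):
    """Scale each channel so that the estimated illuminant becomes neutral.

    Gains are normalized to the green channel (an illustrative choice).
    """
    gains = [illuminant[1] / max(e, 1e-12) for e in illuminant]
    return [tuple(g * v for g, v in zip(gains, p)) for p in pixels]


# Toy pixels with a reddish cast (hypothetical values).
pixels = [(0.4, 0.2, 0.1), (0.8, 0.4, 0.2)]
balanced = diagonal_correction(pixels, gray_world(pixels))
```

After the correction the three channel averages coincide, which is exactly what the gray world assumption asks for; swapping in `white_patch` changes only how the illuminant is estimated.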
The aim of color characterization of an imaging device is to find a mapping between its device-dependent and a device-independent color representation. The color characterization is performed by recording the sensor responses to a set of colors and the corresponding colorimetric values, and then finding the relationship between them. Numerous techniques have been proposed to find this relationship, ranging from empirical methods requiring the acquisition of a reference color target (e.g., a GretagMacbeth ColorChecker [31]) with known spectral reflectance [8], to methods needing the use of specific equipment such as monochromators [32]. In the following we will focus on empirical methods, which are the most used in practice since they do not need expensive laboratory hardware. Empirical device color characterization directly relates measured colorimetric data from a color target and the corresponding camera raw RGB data obtained by shooting the target itself under one or more controlled illuminants. Empirical methods can be divided into two classes. The methods belonging to the first class rely on model-based approaches, which solve a set of linear equations by means of a pseudo-inverse approach [6] or constrained least squares [33], exploit a non-maximum ignorance assumption [33,34], use optimization to solve more meaningful objective functions [7,35,36], or lift the problem into a higher dimensional polynomial space [37,38]. The second class instead contains methods that do not explicitly model the relationship between device-dependent and device-independent color representations, such as three-dimensional lookup tables with interpolation and extrapolation [39], and neural networks [40,41].
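To make the simplest model-based approach concrete, the sketch below fits a 3 × 3 color correction matrix by unconstrained least squares (the pseudo-inverse solution) from pairs of raw and target triplets. It is an illustrative sketch only: the helper names are hypothetical, the patch values are made up, and the optimization-based methods cited above use more elaborate objective functions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse_3x3(m):
    """Invert a 3x3 matrix through its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("singular matrix: the patches do not span RGB space")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def fit_color_matrix(raw, target):
    """Least-squares M such that M @ raw_i approximates target_i.

    raw and target are N x 3 lists; the solution is (P^T R)(R^T R)^-1.
    """
    gram = matmul(transpose(raw), raw)       # R^T R, 3 x 3
    cross = matmul(transpose(target), raw)   # P^T R, 3 x 3
    return matmul(cross, inverse_3x3(gram))


# Hypothetical raw triplets and a made-up ground-truth matrix.
true_m = [[1.5, -0.3, -0.2], [-0.1, 1.2, -0.1], [0.0, -0.4, 1.4]]
raw = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
target = [[sum(m * v for m, v in zip(row, r)) for row in true_m] for r in raw]
estimated = fit_color_matrix(raw, target)
```

With noise-free targets the fit recovers the generating matrix; with a real color target, at least three linearly independent patches are needed for the Gram matrix to be invertible.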

RawFooT
The development of texture analysis methods heavily relies on suitably designed databases of texture images. In fact, many of them have been presented in the literature [42,43]. Texture databases are usually collected to emphasize specific properties of textures, such as the sensitivity to the acquisition device, the robustness with respect to the lighting conditions, and the invariance to image rotation or scale. The RawFooT database has been especially designed to investigate the performance of color texture classification methods under varying illumination conditions [9]. The database includes images of 68 different samples of raw foods, each one acquired under 46 different lighting conditions (for a total of 68 × 46 = 3128 acquisitions). Figure 2 shows an example for each class. Images have been acquired with a Canon EOS 40D DSLR camera. The camera was placed 48 cm above the sample to be acquired, with the optical axis perpendicular to the surface of the sample. The lens had a focal length of 85 mm and the aperture was set to f/11.3; each picture has been taken with an exposure time of four seconds. For each 3944 × 2622 acquired image, a square region of 800 × 800 pixels has been cropped in such a way that it contains only the surface of the texture sample, without any element of the surrounding background. Note that, while the version of the RawFooT database that is publicly available includes a conversion of the images to the sRGB color space, in this work we use the raw format images, which are thus encoded in the device-dependent RGB space.
To generate the 46 illumination conditions, two computer monitors have been used as light sources (two 22-inch Samsung SyncMaster LED monitors). The monitors were tilted by 45 degrees facing down towards the texture sample, as shown in Figure 3. By illuminating different regions of one or both monitors it was possible to set the direction of the light illuminating the sample. By changing the RGB values of the pixels it was also possible to control the intensity and the color of the light sources. To do so, both monitors have been preliminarily calibrated using an X-Rite i1 spectral colorimeter by setting their white point to D65.
With this setup it was possible to approximate a set of diverse illuminants. In particular, 12 illuminants have been simulated, corresponding to 12 daylight conditions differing in the color temperature. The CIE-xy chromaticities corresponding to a given temperature T have been obtained by applying the following equations [44]:

$$x = a_0 + a_1 \frac{10^3}{T} + a_2 \frac{10^6}{T^2} + a_3 \frac{10^9}{T^3}$$

$$y = -3x^2 + 2.87x - 0.275$$

where $a_0 = 0.244063$, $a_1 = 0.09911$, $a_2 = 2.9678$, $a_3 = -4.6070$ if 4000 K ≤ T ≤ 7000 K, and $a_0 = 0.23704$, $a_1 = 0.24748$, $a_2 = 1.9018$, $a_3 = -2.0064$ if 7000 K < T ≤ 25,000 K. The chromaticities were then converted to the monitor RGB space [45], with a scaling of the color channels in such a way that the largest value was 255. The twelve daylight color temperatures that have been considered are: 4000 K, 4500 K, 5000 K, . . ., 9500 K (we will refer to these as D40, D45, . . ., D95).
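The chromaticity computation can be sketched in a few lines of Python. The sketch assumes that the equations referred to above are the standard CIE daylight locus polynomials (a cubic in 1000/T for x, followed by a quadratic in x for y) with the coefficients listed in the text; the function names are hypothetical, and the conversion from chromaticities to monitor RGB is not reproduced because it depends on the monitor calibration.

```python
def daylight_xy(temperature_k):
    """CIE-xy chromaticity of a daylight illuminant (assumed CIE daylight locus)."""
    if 4000 <= temperature_k <= 7000:
        a0, a1, a2, a3 = 0.244063, 0.09911, 2.9678, -4.6070
    elif 7000 < temperature_k <= 25000:
        a0, a1, a2, a3 = 0.23704, 0.24748, 1.9018, -2.0064
    else:
        raise ValueError("temperature outside the 4000 K to 25,000 K range")
    k = 1000.0 / temperature_k
    x = a0 + a1 * k + a2 * k ** 2 + a3 * k ** 3
    y = -3.0 * x * x + 2.870 * x - 0.275
    return x, y


def scale_to_255(rgb):
    """Scale an RGB triplet so that its largest channel equals 255."""
    peak = max(rgb)
    return tuple(round(255 * c / peak) for c in rgb)


# The twelve daylight temperatures: 4000 K, 4500 K, ..., 9500 K.
temperatures = list(range(4000, 10000, 500))
chromaticities = {t: daylight_xy(t) for t in temperatures}
```

As a sanity check, a temperature of about 6504 K gives chromaticities close to those of the D65 standard illuminant, the white point to which the monitors were calibrated.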
Similarly, six illuminants corresponding to typical indoor light have been simulated. To do so, the CIE-xy chromaticities of six LED lamps (six variants of SOLERIQ® S by Osram) have been obtained from the data sheets provided by the manufacturer. Then, again, the RGB values were computed and scaled to 255 in at least one of the three channels. These six illuminants are referred to as L27, L30, L40, L50, L57, and L65 in accordance with the corresponding color temperature.
Figure 4 shows, for one of the classes, the 46 acquisitions corresponding to the 46 different lighting conditions in the RawFooT database. In this work we are interested in particular in the effects of changes in the illuminant color. Therefore, we limited our analysis to the 12 illuminants simulating daylight conditions, and to the six simulating indoor illumination.
Besides the images of the 68 texture classes, the RawFooT database also includes a set of acquisitions of a color target (the Macbeth color checker [31]). Figure 5 shows these acquisitions for the 18 illuminants considered in this work.

Color Balancing
An image acquired by a digital camera can be represented as a function ρ mainly dependent on three physical factors: the illuminant spectral power distribution I(λ), the surface spectral reflectance S(λ), and the sensor spectral sensitivities C(λ). Using this notation, the sensor responses at the pixel with coordinates (x, y) can be described as:

$$\rho(x, y) = \int_{\omega} I(\lambda)\, S(x, y, \lambda)\, C(\lambda)\, d\lambda$$

where ω is the wavelength range of the visible light spectrum, and ρ and C(λ) are three-component vectors. Since the three sensor spectral sensitivities are usually more sensitive respectively to the low, medium and high wavelengths, the three-component vector of sensor responses $\rho = (\rho_1, \rho_2, \rho_3)$ is also referred to as the sensor or camera raw RGB triplet. In the following we adopt the convention that ρ triplets are represented by column vectors. As previously said, the aim of color characterization is to derive the relationship between device-dependent and device-independent color representations for a given device. In this work, we employ an empirical, model-based characterization. The characterization model that transforms the i-th input device-dependent triplet $\rho^{IN}_i$ into a device-independent triplet $\rho^{OUT}_i$ can be compactly written as follows [46]:

$$\rho^{OUT}_i = \left(\alpha\, I\, M\, \rho^{IN}_i\right)^{\gamma}$$

where α is an exposure correction gain, M is the color correction matrix, I is the illuminant correction matrix, and $(\cdot)^{\gamma}$ denotes an element-wise operation.
Traditionally [46], M is fixed for any illuminant that may occur, while α and I compensate for the illuminant power and color respectively, i.e.,

$$\rho^{OUT}_{i,j} = \left(\alpha_j\, I_j\, M\, \rho^{IN}_{i,j}\right)^{\gamma} \qquad (6)$$

The model can thus be conceptually split into two parts: the former compensates for the variations of the amount and color of the incoming light, while the latter performs the mapping from the device-dependent to the device-independent representation. In the standard model (Equation (6)) $\alpha_j$ is a single value, $I_j$ is a diagonal matrix that performs the Von Kries correction [47], and M is a 3 × 3 matrix.
In this work, different characterization models have been investigated together with Equation (6) in order to assess how the different color characterization steps influence the texture recognition accuracy. The first tested model does not perform any kind of color characterization, i.e.,

$$\rho^{OUT}_{i,j} = \rho^{IN}_{i,j} \qquad (7)$$

The second model tested performs just the compensation for the illuminant color, i.e., it balances image colors as a color constancy algorithm would do:

$$\rho^{OUT}_{i,j} = I_j\, \rho^{IN}_{i,j} \qquad (8)$$

The third model tested uses the complete color characterization model but, differently from the standard model given in Equation (6), it estimates a different color correction matrix $M_j$ for each illuminant j. The illuminant is compensated for both its color and its intensity but, differently from the standard model, the illuminant color compensation matrix $I_j$ for the j-th illuminant is estimated by using a different luminance gain $\alpha_{i,j}$ for each patch i:

$$\rho^{OUT}_{i,j} = \left(\alpha_{i,j}\, I_j\, M_j\, \rho^{IN}_{i,j}\right)^{\gamma} \qquad (9)$$

The fourth model tested is similar to the model described in Equation (9) but uses a larger color correction matrix $M_j$ by polynomially expanding the device-dependent colors:

$$\rho^{OUT}_{i,j} = \left(\alpha_{i,j}\, I_j\, M_j\, T\!\left(\rho^{IN}_{i,j}\right)\right)^{\gamma} \qquad (10)$$

where T(·) is an operator that takes as input the triplet ρ and computes its polynomial expansion. Following [7], in this paper we use $T(\rho) = \left(\rho(1), \rho(2), \rho(3), \sqrt{\rho(1)\rho(2)}, \sqrt{\rho(1)\rho(3)}, \sqrt{\rho(2)\rho(3)}\right)$, i.e., the rooted second-degree polynomial [38]. Summarizing, we have experimented with five color-balancing models. They all take as input the device-dependent raw values and process them in different ways:

1. device-raw: it does not make any correction to the device-dependent raw values, leaving them unaltered from how they are recorded by the camera sensor;
2. light-raw: it performs the correction of the illuminant color, similarly to what is done by color constancy algorithms [5,30,48] and chromatic adaptation transforms [49,50]. The output color representation is still device-dependent, but the effect of the illuminant color is discounted;
3. dcraw-srgb: it performs a full color characterization according to the standard color correction pipeline. The chosen characterization illuminant is the D65 standard illuminant, while the color mapping is linear and fixed for all illuminants that may occur. The correction is performed using the DCRaw software (available at http://www.cybercom.net/~dcoffin/dcraw/);
4. linear-srgb: it performs a full color characterization according to the standard color correction pipeline, but using a different illuminant color compensation and a different linear color mapping for each illuminant;
5. rooted-srgb: it performs a full color characterization according to the standard color correction pipeline, but using a different illuminant color compensation and a different color mapping for each illuminant. The color mapping is no longer linear: it is performed by polynomially expanding the device-dependent colors with a rooted second-degree polynomial.
The main properties of the color-balancing models tested are summarized in Table 1.
Table 1. Main characteristics of the tested color-balancing models. A dash denotes that the given model does not have the corresponding mapping property.

Model | Equation | Illuminant color compensation | Mapping type | Number of mappings
Device-raw | (7) | none | - | -
Light-raw | (8) | for each illuminant | - | -
Dcraw-srgb | (6) | fixed for D65 | Linear | 1
Linear-srgb | (9) | for each illuminant | Linear | 1 for each illuminant
Rooted-srgb | (10) | for each illuminant | Rooted 2nd-degree polynomial | 1 for each illuminant
All the correction matrices for the compensation of the variations of the amount and color of the illuminant, and for the color mapping, are found from the set of acquisitions of the Macbeth color checker available in the RawFooT database, using the optimization framework described in [7,36]. An example of the effect of the different color characterization models on a sample texture class of the RawFooT database is reported in Figure 6.
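The following Python sketch shows how a single raw triplet could be processed by the tested models. It assumes that the model has the form out = (α I M T(ρ))^γ with the matrices applied from right to left and I diagonal; the numeric values, the function names and the clipping of negative values before the power are illustrative assumptions, not the matrices or the implementation used in the experiments.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rooted_expansion(rho):
    """Rooted second-degree polynomial expansion of an RGB triplet."""
    r, g, b = rho
    return [r, g, b, math.sqrt(r * g), math.sqrt(r * b), math.sqrt(g * b)]


def balance(rho, alpha, illum_diag, mapping, gamma, expand=False):
    """Assumed model: (alpha * I * M * T(rho)) ** gamma, element-wise power.

    illum_diag is the diagonal of I; mapping is M (3x3, or 3x6 if expand).
    """
    terms = rooted_expansion(rho) if expand else list(rho)
    mapped = mat_vec(mapping, terms)
    corrected = [alpha * d * m for d, m in zip(illum_diag, mapped)]
    # A linear mapping can produce negative values; clip them before the power.
    return [max(c, 0.0) ** gamma for c in corrected]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
NO_GAIN = [1.0, 1.0, 1.0]

# Hypothetical raw triplet and illuminant gains, for illustration only.
raw = [0.2, 0.4, 0.1]
illum = [2.0, 1.0, 4.0]

device_raw = balance(raw, 1.0, NO_GAIN, IDENTITY, 1.0)   # values left unaltered
light_raw = balance(raw, 1.0, illum, IDENTITY, 1.0)      # illuminant color only
```

The linear and rooted models differ only in the `mapping` argument (a 3 × 3 matrix, or a 3 × 6 matrix with `expand=True`) and in the use of a gamma different from 1.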

Experimental Setup
Given an image, the experimental pipeline includes the following operations: (1) color balancing; (2) feature extraction; and (3) classification. All the evaluations have been performed on the RawFooT database.

RawFooT Database Setup
For each of the 68 classes we considered 16 patches obtained by dividing the original texture image, of size 800 × 800 pixels, into 16 non-overlapping squares of size 200 × 200 pixels. For each class we selected eight patches for training and eight for testing, alternating them in a chessboard pattern. We formed subsets of 68 × (8 + 8) = 1088 patches by taking the training and test patches from images taken under different lighting conditions.
In this way we defined several subsets, grouped into three texture classification tasks.

1. Daylight temperature: 132 subsets obtained by combining all the 12 daylight temperature variations. Each subset is composed of training and test patches with different light temperatures.

2. LED temperature: 30 subsets obtained by combining all the six LED temperature variations. Each subset is composed of training and test patches with different light temperatures.

3. Daylight vs. LED: 72 subsets obtained by combining 12 daylight temperatures with six LED temperatures.
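The patch layout and the subset counts above can be reproduced with a short Python sketch. It assumes that the chessboard pattern alternates on the parity of row plus column in the 4 × 4 grid, that the daylight and LED subsets are ordered (training, test) pairs of different illuminants, and that each daylight vs. LED subset pairs one daylight illuminant with one LED illuminant; these are readings of the text, and the names are hypothetical.

```python
from itertools import permutations, product

PATCH = 200  # patch side in pixels
GRID = 4     # 800 / 200 patches per side


def chessboard_split(grid=GRID, patch=PATCH):
    """Return (train, test) lists of patch boxes (left, top, right, bottom)."""
    train, test = [], []
    for row in range(grid):
        for col in range(grid):
            box = (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)
            (train if (row + col) % 2 == 0 else test).append(box)
    return train, test


DAYLIGHT = ["D%d" % (t // 100) for t in range(4000, 10000, 500)]  # D40 ... D95
LED = ["L27", "L30", "L40", "L50", "L57", "L65"]

train_boxes, test_boxes = chessboard_split()
daylight_subsets = list(permutations(DAYLIGHT, 2))  # training and test illuminants differ
led_subsets = list(permutations(LED, 2))
mixed_subsets = list(product(DAYLIGHT, LED))
```

Under these assumptions the three lists contain 132, 30 and 72 pairs, matching the subset counts reported for the three tasks.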

Visual Descriptors
For the evaluation we select a number of descriptors from CNN-based approaches [51,52]. These descriptors are obtained as the intermediate representations of deep convolutional neural networks originally trained for scene and object recognition. The networks are used to generate a visual descriptor by removing the final softmax nonlinearity and the last fully-connected layer. All feature vectors are L2-normalized (each feature vector is divided by its L2-norm). We select the most representative CNN architectures in the state of the art [53] by considering different accuracy/speed trade-offs. All the CNNs are trained on the ILSVRC-2012 dataset using the same protocol as in [1]. In particular we consider the following visual descriptors [10,54]:

• BVLC AlexNet (BVLC AlexNet): this is the AlexNet trained on ILSVRC 2012 [1].

• Fast CNN (Vgg F): it is similar to that presented in [1] with a reduced number of convolutional layers and the dense connectivity between convolutional layers. The last fully-connected layer is 4096-dimensional [51].

• Medium CNN (Vgg M): it is similar to the one presented in [55] with a reduced number of filters in the fourth convolutional layer. The last fully-connected layer is 4096-dimensional [51].

• Medium CNN (Vgg M-2048-1024-128): three modifications of the Vgg M network, with a lower-dimensional last fully-connected layer. In particular we use feature vectors of size 2048, 1024 and 128 [51].

• Slow CNN (Vgg S): it is similar to that presented in [56], with a reduced number of convolutional layers, fewer filters in layer five, and local response normalization. The last fully-connected layer is 4096-dimensional [51].

• Vgg Very Deep 16 and 19 layers (Vgg VeryDeep 16 and 19): the configuration of these networks has been achieved by increasing the depth to 16 and 19 layers, which results in substantially deeper networks than the previous ones [2].

• ResNet 50: a residual network. Residual learning frameworks are designed to ease the training of networks that are substantially deeper than those used previously. This network has 50 layers [52].
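The L2 normalization applied to every descriptor amounts to a division by the Euclidean norm. A minimal Python sketch follows; the function name is hypothetical, and returning an all-zero vector unchanged is an assumption made only to avoid a division by zero.

```python
import math


def l2_normalize(features):
    """Divide a feature vector by its L2 norm (zero vectors are returned as is)."""
    norm = math.sqrt(sum(v * v for v in features))
    if norm == 0.0:
        return list(features)
    return [v / norm for v in features]


# Toy 2-dimensional example; real descriptors have up to 4096 components.
unit = l2_normalize([3.0, 4.0])
```

After normalization every descriptor has unit length, so distances between descriptors depend only on their direction and not on the overall magnitude of the network activations.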

Texture Classification
In all the experiments we used the nearest neighbor classification strategy: given a patch in the test set, its distance with respect to all the training patches is computed. The prediction of the classifier is the class of the closest element in the training set. For this purpose, after some preliminary tests with several descriptors in which we evaluated the most common distance measures, we decided to use the L2-distance: $d(x, y) = \sum_{i=1}^{N} (x(i) - y(i))^2$, where x and y are two feature vectors. All the experiments have been conducted under the maximum ignorance assumption, that is, no information about the lighting conditions of the test patches is available for the classification method and for the descriptors. Performance is reported as classification rate (i.e., the ratio between the number of correctly classified images and the number of test images). Note that more complex classification schemes (e.g., SVMs) would have been viable. We decided to adopt the simplest one in order to focus the evaluation on the features themselves and not on the classifier.
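The classification strategy described above can be sketched as follows. The sketch uses the sum of squared differences as the distance, as in the formula given in the text; the function names, the class labels and the toy feature vectors are hypothetical.

```python
def squared_l2(x, y):
    """Sum of squared differences between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest_neighbor(train, query):
    """Return the label of the training item closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    return min(train, key=lambda item: squared_l2(item[0], query))[1]


def classification_rate(train, test):
    """Fraction of test items whose predicted label matches the true label."""
    correct = sum(1 for feat, label in test if nearest_neighbor(train, feat) == label)
    return correct / len(test)


# Toy 2-dimensional features with hypothetical class labels.
train = [([0.0, 0.0], "class_a"), ([1.0, 1.0], "class_b")]
test = [([0.1, 0.0], "class_a"), ([0.9, 1.0], "class_b"), ([0.2, 0.1], "class_b")]
rate = classification_rate(train, test)
```

Because the square root is monotonic, omitting it does not change which training patch is the closest, so the predictions are the same as with the Euclidean distance.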

Results and Discussion
The effectiveness of each color-balancing model has been evaluated in terms of texture classification accuracy. Table 2 shows the average accuracy obtained on each classification task (daylight temperature, LED temperature and daylight vs. LED) by each of the visual descriptors combined with each balancing model. Overall, the rooted-srgb and linear-srgb models achieve better performance than the other models, with a minimum improvement of about 1% and a maximum of about 9%. In particular, the rooted-srgb model performs slightly better than linear-srgb. The improvements are more visible in Figure 7, which shows, for each visual descriptor, the comparison between all the balancing models. Each bar represents the mean accuracy over all the classification tasks. ResNet-50 is the best-performing CNN-based visual descriptor, with a classification accuracy of 99.52%, which is about 10% better than the poorest CNN-based visual descriptor. This result confirms the power of deep residual nets compared to sequential network architectures such as AlexNet and VGG.
To better show the usefulness of color-balancing models we focused on the daylight temperature classification task, where we have images taken under 12 daylight temperature variations from 4000 K to 9500 K with an increment of 500 K. To this end, Figure 8 shows the accuracy behavior (y-axis) with respect to the difference (∆T, measured in kelvin) of daylight temperature (x-axis) between the training and the test sets. The value ∆T = 0 corresponds to no variations. Each graph shows, given a visual descriptor, the comparison between the accuracy behaviors of each single model. There is an evident drop in performance for all the networks when ∆T is large and no color balancing is applied. The use of color balancing makes the performance of all the networks uniform, independently of the difference in color temperature. The dcraw-srgb model represents the most similar conditions to those of the ILSVRC training images. This explains why this model obtained the best performance for low values of ∆T. However, since dcraw-srgb does not include any kind of color normalization, for high values of ∆T we observe a severe loss in terms of classification accuracy. Both linear-srgb and rooted-srgb are able, instead, to normalize the images with respect to the color of the illumination, making all the plots in Figure 8 almost flat. The effectiveness of these two models also depends on the fact that they work in a color space similar to those used to train the CNNs. Between the linear and the rooted models, the latter performs slightly better, probably because its additional complexity increases the accuracy in balancing the images.

Conclusions
Recent trends in computer vision seem to suggest that convolutional neural networks are so flexible and powerful that they can substitute in toto traditional image processing/recognition pipelines. However, when it is not possible to train the network from scratch due to the lack of a suitable training set, the achievable results are suboptimal. In this work we have extensively and systematically evaluated the role of color balancing, including color characterization, as a preprocessing step in color texture classification in the presence of variable illumination conditions. Our findings suggest that to really exploit CNNs, an integration with a carefully designed preprocessing procedure is a must. The effectiveness of color balancing, in particular of the color characterization that maps device-dependent RGB values into a device-independent color space, has not been completely proven, since the RawFooT dataset has been acquired using a single camera. As future work we would like to extend the RawFooT dataset and our experimentation by acquiring the dataset using cameras with different color transmittance filters. This new dataset will make more evident the need for accurate color characterization of the cameras.

Figure 1. Example of correctly predicted image and mis-predicted image after a color cast is applied.

Figure 2. A sample for each of the 68 classes of textures composing the RawFooT database.

Figure 3. Scheme of the acquisition setup used to take the images in the RawFooT database.

Figure 4. Example of the 46 acquisitions included in the RawFooT database for each class (here the images show the acquisitions of the "rice" class).

Figure 5. The Macbeth color target, acquired under the 18 lighting conditions considered in this work.

Figure 7. Classification accuracy obtained by each visual descriptor combined with each model.

Table 2. Classification accuracy obtained by each visual descriptor combined with each model; the best result is reported in bold.