Comparative Evaluation of Hand-Crafted Image Descriptors vs. Off-the-Shelf CNN-Based Features for Colour Texture Classification under Ideal and Realistic Conditions

Abstract: Convolutional Neural Networks (CNN) have brought spectacular improvements in several fields of machine vision, including object, scene and face recognition. Nonetheless, the impact of this new paradigm on the classification of fine-grained images, such as colour textures, is still controversial. In this work, we evaluate the effectiveness of traditional, hand-crafted descriptors against off-the-shelf CNN-based features for the classification of different types of colour textures under a range of imaging conditions. The study covers 68 image descriptors (35 hand-crafted and 33 CNN-based) and 46 compilations of 23 colour texture datasets divided into 10 experimental conditions. On average, the results indicate a marked superiority of deep networks, particularly with non-stationary textures and in the presence of multiple changes in the acquisition conditions. By contrast, hand-crafted descriptors were better at discriminating stationary textures under steady imaging conditions and proved more robust than CNN-based features to image rotation.


Introduction
Colour texture analysis and classification play a pivotal role in many computer-vision applications such as surface inspection, remote sensing, medical image analysis, object recognition, content-based image retrieval and many others. It is no surprise, then, that texture has been an area of intense research activity for at least forty years and that a huge variety of descriptors has been proposed in the literature (see [1][2][3] for comprehensive and up-to-date reviews).
In recent years, the advent of convolutional neural networks has dramatically changed the outlook in many areas of computer vision and has led to astonishing improvements in tasks like object, scene and face recognition [4][5][6][7][8]. The structure of a CNN differs from that of traditional, hand-designed descriptors in that the former contains a large number of parameters, which are to be determined through suitable training procedures. The training process is usually carried out on huge datasets (containing millions of images), which enables the nets to "learn" very complex image-to-feature and/or image-to-class mappings. More importantly, there is evidence that such mappings are amenable to being transferred from one domain to another, making networks trained on certain classes of images usable in completely different contexts [5,9]. The consequences of this are far-reaching: although datasets large enough to train a CNN entirely from scratch are rarely available in practical tasks, pre-trained networks can in principle be used as off-the-shelf feature extractors in a wide range of applications.
Inevitably, the CNN paradigm is changing the approach to texture analysis as well. However, though there is general consensus that convolutional networks are superior to hand-designed descriptors in tasks such as object and scene recognition [4][5][6], this superiority is still not quite clear when it comes to dealing with fine-grained images, i.e., textures. As we discuss in Section 2, some recent results seem to point in that direction, but it is precisely the aim of this work to investigate the matter further. To this end, we comparatively evaluated the performance of a large number of classic and more recent hand-designed, local image descriptors against a selection of off-the-shelf features from the latest generation of CNN. We assessed the performance under both ideal and realistic conditions, with special regard to different degrees of intra-class variability. As we detail in Section 2, intra-class variability can be the consequence of the intrinsic structure of the texture (which can be more or less stationary) and/or of variations in the imaging conditions (e.g., changes in illumination, rotation, scale and/or viewpoint).
On the whole, image features from pre-trained networks outperformed hand-crafted descriptors, but with some interesting exceptions. In particular, hand-crafted methods were still better than CNN-based features at discriminating between very similar colour textures under invariable imaging conditions and proved slightly more robust to rotation, whereas the results were split in the presence of variations in scale.
Networks were markedly superior in all the other cases, particularly with non-stationary colour textures, and also emerged as more robust to multiple and uncontrolled changes in the imaging conditions, which are harder to model and compensate for in a priori feature design.
In the remainder of the paper, we first put the work in the context of the recent literature (Section 2), then describe the datasets and image descriptors used in our study (Sections 3 and 4). We detail the experimental set-up in Section 5, discuss the results in Section 6 and conclude the paper with some final considerations in Section 7.

Related Research
A number of papers have addressed the problem of comparatively evaluating image descriptors for colour texture analysis (e.g., [10][11][12][13][14]), though none of these considers CNN-based features. Yet, results from recent studies seem to suggest that CNN-based descriptors can be effective for texture classification as well, in most cases outperforming hand-designed descriptors.
Cimpoi et al. [15,16] compared a number of hand-designed image descriptors, including LM filters, MR8 filters, LBP and SIFT, against a set of CNN-based features in texture and material recognition, and concluded that in most cases the latter group outperformed the former. Notably, their findings are mainly based on the results obtained on the Describable Textures Dataset (DTD; more on this in Section 3), which, to a great extent, is composed of very irregular and non-stationary textures acquired "in the wild"; by contrast, their results look rather saturated and levelled on other datasets (i.e., the Columbia-Utrecht Reflectance and Texture Database (CUReT), UMD and UIUC). Cusano et al. [17] investigated colour texture classification under variable illumination conditions. Their study included a large number of hand-designed texture descriptors (e.g., Gabor filters, wavelets and LBP), descriptors for object recognition (dense SIFT) and CNN-based features. Experimenting on a new dataset of colour texture images (RawFooT, also included in the present study), they concluded that features based on CNN gave significantly better results than the other methods. Liu et al. [1] evaluated a large selection of LBP variants and CNN-based features for texture classification tasks. In their experiments, CNN-based features outperformed LBP variants in six datasets out of eleven; in this case, however, all the LBP variants considered were grey-scale descriptors, whereas CNN by default operates on colour images. The presence/absence of colour information may account, at least in part, for the difference in performance. Recently, Napoletano [18] comparatively evaluated a number of hand-crafted descriptors and CNN-based features over five datasets of colour images and found that, on average, the latter group outperformed the former.
In summary, there is mounting evidence that off-the-shelf CNN-based features can be suitable for texture classification tasks and may, in certain cases, outperform (and therefore potentially replace) traditional, hand-designed descriptors. Interestingly, both [1,17] seem to suggest that CNN-based methods tend to perform better than hand-designed descriptors in the presence of complex textures and intra-class variations, though neither reference investigated this point further. There remains a need to clarify under what circumstances CNN-based features can replace traditional, hand-crafted descriptors and what the pros and cons of the two strategies are.

Materials
We based our experiments on 23 datasets of colour texture images (Section 3.1) arranged into 46 different experimental conditions (Section 3.2). We subdivided the datasets into ten different groups (Sections 3.2.1-3.2.10) based on the following two properties of the images contained therein (see also Table 1 and Figures 1-10):
(a) The stationariness of the textures;
(b) The presence/absence and/or the type of variation in the imaging conditions.
Property (b) signals whether the samples of a given class were acquired under steady or variable imaging conditions in terms of illumination, rotation, scale and/or viewpoint. In the remainder, we use the following naming convention to indicate the dataset used:
<source>-<no.-of-classes>-<prop-a>-<prop-b>
where:
• <source> indicates the name of the source dataset the images come from (e.g., Amsterdam Library of Textures (ALOT), KTH-TIPS, etc., as detailed in Section 3.1);
• <no.-of-classes> indicates the number of colour texture classes in the dataset;
• <prop-a> indicates the stationariness of the textures, which can be either S or NS, respectively indicating Stationary and Non-Stationary textures;
• <prop-b> indicates the presence/absence and/or the type of intra-class variation in the imaging conditions. This can be either N, I, R, S or M, respectively indicating No variations (steady imaging conditions), variations in Illumination, variations in Rotation, variations in Scale and Multiple variations (i.e., combined changes in illumination, scale, rotation and/or viewpoint).

Amsterdam Library of Textures
This is a collection of stationary and non-stationary colour textures representing 250 classes of heterogeneous materials including chip, fabric, pebble, plastics, seeds and vegetables [20,21]. Each class was acquired under 100 different conditions obtained by varying the viewing direction, the illumination direction and the rotation angle. The dataset comes in full, half or quarter resolution (respectively 1536 px × 1024 px, 768 px × 512 px and 384 px × 256 px): we chose the first for our experiments.

Coloured Brodatz Textures
Coloured Brodatz Textures (CBT) is an artificially-colourised version of Brodatz's album [22,23]. There are 112 classes with one image sample per class. The dimension of the images is 640 px × 640 px; we subdivided each image into four non-overlapping sub-images of dimensions 320 px × 320 px.

Columbia-Utrecht Reflectance and Texture Database
The Columbia-Utrecht Reflectance and Texture Database (CUReT) contains sixty-one classes representing different types of materials such as aluminium foil, artificial grass, brick, cork, cotton, leather, quarry tile, paper, sandpaper, styrofoam and velvet [24,25]. In the original version, there are 205 image samples for each class, corresponding to different combinations of viewing and illumination directions. Some of these images, however, cannot be used because they contain only a small portion of texture, while the rest is background. The version used here is a reduced one [26] maintained by the Visual Geometry Group at the University of Oxford, United Kingdom. In this case, there are 92 images per class, corresponding to those imaging conditions that ensure a sufficiently large texture portion to be visible across all materials. The dimension of each image sample is 200 px × 200 px.

Drexel Texture Database
This consists of stationary colour textures representing 20 different materials such as bark, carpet, cloth, knit, sandpaper, sole, sponge and toast [27,28]. The dataset features 1560 images per class, which are the result of combining 30 viewing conditions (generated by varying the object-camera distance and the angle between the camera axis and the imaged surface) and 52 illumination directions. The images have a dimension of 128 px × 128 px.

Describable Textures Dataset
DTD is comprised of highly non-stationary and irregular textures acquired under uncontrolled imaging conditions (or, as the authors say, "in the wild" [29,30]). The images are grouped into 47 classes representing attributes related to human perception such as banded, blotchy, cracked, crystalline, dotted, meshed and so forth. There are 120 samples per class, and the dimension of the images varies between 300 px × 300 px and 640 px × 640 px.

Fabrics Dataset
This is comprised of 1968 samples of garments and fabrics [31,32]. Herein, we considered each sample as a class on its own, though the samples are also grouped by material (e.g., wool, cotton, polyester, etc.) and garment type (e.g., pants, shirt, skirt and the like). The images were acquired in the field (i.e., at garment shops) using a portable photometric device and have a dimension of 400 px × 400 px. Each sample was acquired under four different illumination conditions; the samples also have uncontrolled in-plane rotation.

Forest Species Database
The Forest Species Database (ForestSpecies) is comprised of 2240 images representing samples from 112 hardwood and softwood species [33,34]. The images were acquired through a light microscope with 100× zoom and have a dimension of 1024 px × 768 px.

KTH-TIPS
This is comprised of ten types of materials such as aluminium foil, bread, cotton and sponge [35,36]. Each material sample was acquired under nine different scales, three rotation angles and three illumination directions, giving 81 images for each class. The dimension of the images is 200 px × 200 px.

KTH-TIPS2b
KTH-TIPS2b is an extension of KTH-TIPS, which adds one more class, three more samples per class and one additional illumination condition [36,37]. As a result, there are 432 image samples per class instead of 81, whereas the image dimension is the same as in KTH-TIPS.

Kylberg-Sintorn Rotation Dataset (KylbergSintorn)
The Kylberg-Sintorn Rotation Dataset (KylbergSintorn) is comprised of twenty-five colour texture classes representing common materials such as sugar, knitwear, rice, tiles and wool [38,39]. There is one sample for each class, which was acquired at nine in-plane rotation angles, i.e., 0°, 40°, ..., 320° (in steps of 40°).

LMT
LMT is comprised of one hundred eight colour texture classes belonging to the following nine super-classes: blank glossy surfaces, fibres, foams, foils and papers, meshes, rubbers, stones, textiles and fabrics, and wood [40,41]. The images were acquired using a common smartphone, and each material sample was captured under 40 different illumination and viewing conditions. The dimension of the images is 320 px × 480 px.

MondialMarmi
This is comprised of twenty-five classes of commercial polished stones [42,43]. Four samples per class (corresponding to as many tiles) were acquired under steady and controlled illumination conditions and at 10 in-plane rotation angles, i.e., 0°, 10°, 20°, ..., 90°.

Multiband Texture Database
The Multiband Texture Database (MBT) [44,45] is comprised of one hundred fifty-four colour texture images, each obtained by taking three grey-scale textures from the Normalized Brodatz Texture database [46] and assigning one of them to each of the R, G and B channels. There is one sample for each class, and the image size is 640 px × 640 px.

New BarkTex
New BarkTex is a collage of different types of tree bark [47,48] derived from the BarkTex database [49]. This dataset includes six classes with 68 samples per class. The dimension of the images is 64 px × 64 px.

Parquet
This is comprised of fourteen commercial varieties of finished wood for flooring and cladding [52,53]. Each variety also has a number of grades, ranging from 2 to 4, which we considered as independent classes, yielding a total of 38 classes. The number of samples per class varies from 6 to 8, and the dimension of the images ranges from 1200 px to 1600 px in width and from 500 px to 1300 px in height, as a consequence of the different sizes of the specimens.

Plant Leaves Database
This is comprised of twenty classes of plant leaves from as many plant species [54,55]. Three images of dimensions 128 px × 128 px were acquired from the regions of minimum texture variance within each of 20 leaves per species, making a total of 20 × 20 × 3 = 1200 images. The acquisition was carried out using a planar scanner at a spatial resolution of 1200 dpi.

Robotics Domain Attribute Database
The Robotics Domain Attribute Database (RDAD) is comprised of fifty-seven classes of objects and materials such as asphalt, chocolate, coconut, flakes, pavingstone, rice and styrofoam [56]. The dataset includes a variable number of image samples per class (from 20 to 48), all captured "in the wild". The dimension of each image is 2592 px × 1944 px.

Raw Food Texture Database
The Raw Food Texture Database (RawFooT) is comprised of sixty-eight classes of raw food such as chickpeas, green peas, oat, chilli pepper, kiwi, mango, salmon and sugar [57,58]. Each class was acquired under 46 different illumination conditions obtained by varying the type, the direction and the intensity of the illuminant; other imaging conditions such as scale, rotation and viewpoint remained invariable. The images have a dimension of 800 px × 800 px.

Salzburg Texture Image Database
The Salzburg Texture Image Database (STex) is comprised of four hundred seventy-six colour texture images acquired "in the wild" around the city of Salzburg, Austria [59]. They mainly represent objects and materials like bark, floor, leather, marble, stones, walls and wood. The dataset comes in two resolutions (1024 px × 1024 px and 512 px × 512 px), of which the second was used in our experiments. We further subdivided the original images into 16 non-overlapping sub-images of dimensions 128 px × 128 px.

USPTex
USPTex [60,61] is very similar to STex (Section 3.1.20) as regards content, structure and imaging conditions. In this case, there are 191 classes representing materials, objects and scenes such as food, foliage, gravel, tiles and vegetation. There are 12 samples per class, and the image dimension is 128 px × 128 px.

VisTex Reference Textures
The VisTex reference textures are part of the Vision Texture Database [62]. They represent 167 classes, which are further subdivided into 19 groups, e.g., bark, buildings, food, leaves, terrain and wood. For each class, there is one image sample of dimensions 512 px × 512 px, which we subdivided into four non-overlapping samples of 256 px × 256 px.

VxC TSG Database
VxC TSG is comprised of fourteen commercial classes of ceramic tiles with three grades per class [63]. We considered each grade as a class on its own, which gives 42 classes in total. The images were acquired in a laboratory under controlled and invariable conditions. The number of samples per class varies from 14 to 30, but in our experiments we only retained 12 samples per class. Since the original images are rectangular, we cropped them to a square shape, retaining the central part. The resulting images have a dimension ranging between 500 px × 500 px and 950 px × 950 px.

Datasets Used in the Experiments
From the source datasets (Section 3.1), we derived the datasets used in the experiments. The classification into stationary or non-stationary textures was performed manually by two of the authors (R.B.-C., >2 years of experience in texture analysis, and F.B., >10 years). Those images on which no consensus was reached were discarded.

• ALOT-95-S-N: Ninety-five stationary textures from the ALOT dataset (Section 3.1.1), with images taken from the "c1I3" group. Six samples per class were obtained by subdividing the original images into non-overlapping sub-images of dimensions 256 px × 256 px (a subdivision sketch is given after this list).
• Drexel-18-S-N: Eighteen stationary textures from the Drexel dataset (Section 3.1.4), with images taken from the "D1_IN00_OUT00" group. Four samples per class were obtained by subdividing the original images into non-overlapping sub-images of dimensions 64 px × 64 px.
• KylbergSintorn-25-S-N: All 25 colour textures in the KylbergSintorn dataset (Section 3.1.10), with images taken from the 0° group. Each image was subdivided into 24 non-overlapping images of dimensions 864 px × 864 px.
As in RawFooT-68-S-N, there are four samples for each imaging condition, therefore a total of 4 × 4 = 16 samples per class.
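As mentioned in the list above, several of the derived datasets were obtained by cutting each source image into non-overlapping square sub-images of a given side (e.g., 256 px for ALOT-95-S-N, 64 px for Drexel-18-S-N). The following is a minimal sketch of such a subdivision, not the authors' code; the function name and NumPy-based interface are illustrative assumptions.

```python
# Minimal sketch: subdivide an image into non-overlapping square tiles.
import numpy as np

def subdivide(image, side):
    """image: (H, W, C) array.  Returns a list of (side, side, C) tiles."""
    h, w = image.shape[:2]
    return [image[r:r + side, c:c + side]
            for r in range(0, h - side + 1, side)
            for c in range(0, w - side + 1, side)]
```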

Methods
We considered 35 hand-designed and 33 CNN-based descriptors, as detailed in Sections 4.1 and 4.2 (see also Tables 2 and 3 for a round-up). We subdivided the hand-designed methods into three subgroups:

Purely Spectral Descriptors
• Mean (Mean): Average of each of the R, G and B channels (three features).
• Mean + standard deviation (Mean + Std): Average and standard deviation of each of the R, G and B channels (six features).

Histograms of Equivalent Patterns
We included six LBP variants, also referred to as histograms of equivalent patterns [67]. Each of these methods was used to obtain a multiple-resolution feature vector by concatenating the rotation-invariant vectors (e.g., LBP ri) computed at resolutions of 1 px, 2 px and 3 px [74]. For each resolution, we used non-interpolated neighbourhoods [75] composed of a central pixel and eight peripheral pixels, as shown in Figure 11.
• Grey-Level Co-occurrence Matrices (GLCM): Five global statistics (i.e., contrast, correlation, energy, entropy and homogeneity) from grey-level co-occurrence matrices [76] computed using displacement vectors of lengths 1 px, 2 px and 3 px and orientations of 0°, 45°, 90° and 135° (5 × 3 × 4 = 60 features). A rotation-invariant version (GLCM DFT) based on discrete Fourier transform normalisation [77] was also considered (a GLCM computation sketch is given after this list).
• Image Patch-Based Classifier, Joint version (IPBC-J): Local image patches aggregated over a dictionary of visual words (Section 4.3), as proposed by Varma and Zisserman [79]. The image patches were captured at resolutions of 1 px, 2 px and 3 px using the same neighbourhood configuration shown in Figure 11. The resulting feature vectors were concatenated into a single vector. Further pre-processing involved zero-mean and unit-variance normalisation of the input image and contrast normalisation of each patch through Weber's law, as recommended in [79].
• Gabor features (Gabor): Mean and standard deviation of the magnitude of the Gabor-transformed images from a bank of filters with five frequencies and seven orientations (a Gabor feature sketch is given after this list). The other parameters of the filter bank were: frequency spacing of half an octave, spatial-frequency bandwidth of one octave and aspect ratio of 1.0. We considered both the raw and the contrast-normalised response: in the second case, the magnitudes at each point over all frequencies and rotations were normalised to sum to one. This option is indicated with the subscript "cn" in the remainder. For both options, a DFT-based rotation-invariant version [80] (superscript "DFT" in the remainder) was also included in the experiments. In all the versions, the number of features was 2 × 5 × 7 = 70.
• Wavelets (WSF + WCF): Statistical features (mean and standard deviation) and co-occurrence features (same as in GLCM) from a three-level wavelet decomposition, as described in [81]. We used the Haar and bi-orthogonal wavelets, respectively indicated with the subscripts "haar" and "bior22" in the remainder. The number of statistical features was 2 × 4 × 3 = 24, and that of the co-occurrence features was 6 × 4 × 3 = 60, making a total of 84 features.
• VZ classifier with MR8 filters (VZ-MR8): Filter responses from a bank of 36 anisotropic filters (first- and second-derivative filters at six orientations and three scales) plus two rotationally-symmetric ones (a Gaussian and a Laplacian of Gaussian). Only eight responses are retained: for each of the six anisotropic filter-scale combinations, the maximum response across all orientations, plus the responses of the two rotationally-symmetric filters [82]. The filter responses were aggregated over a dictionary of 4096 visual words (see Section 4.3).
• Dense SIFT (SIFT-BoVW): Spatial histograms of local gradient orientations computed every two pixels over a neighbourhood of radius 3 px. The resulting 128-dimensional local features were aggregated over a dictionary of 4096 visual words, as described in Section 4.3.
• LCC: The probability distribution (histogram) of the angle between the colour vector of the central pixel in the neighbourhood and the average colour vector of the peripheral pixels [85]. Following the settings suggested in [85], we used histograms of 256 bins for each resolution (i.e., 1 px, 2 px and 3 px) and concatenated the results. Concatenation with LBP gives a total of 108 + 256 × 3 = 876 features.
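As a concrete illustration of the GLCM features described above, the following is a minimal sketch based on scikit-image (version 0.19 or later for the graycomatrix/graycoprops names). The function name is an assumption, entropy is computed by hand because graycoprops does not provide it, and the rotation-invariant GLCM DFT variant is not shown.

```python
# Minimal sketch of the GLCM statistics described above (not the authors' code).
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_u8):
    """gray_u8: 2-D uint8 image.  Returns a 5 x 3 x 4 = 60-dimensional vector."""
    distances = [1, 2, 3]                                   # displacement lengths in px
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]       # 0, 45, 90, 135 degrees
    glcm = graycomatrix(gray_u8, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    feats = []
    for prop in ("contrast", "correlation", "energy", "homogeneity"):
        feats.append(graycoprops(glcm, prop).ravel())       # 3 x 4 values per statistic
    # Entropy is not offered by graycoprops, so derive it from the normalised matrices.
    p = glcm
    entropy = -np.sum(p * np.log2(p, where=p > 0, out=np.zeros_like(p)), axis=(0, 1))
    feats.append(entropy.ravel())
    return np.concatenate(feats)
```
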
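Similarly, a minimal sketch of the Gabor features (mean and standard deviation of the filter-response magnitude over 5 frequencies × 7 orientations) is given below, using scikit-image's gabor filter. The maximum frequency and the half-octave spacing formula are assumptions, and the contrast-normalised ("cn") and rotation-invariant ("DFT") variants are not shown.

```python
# Minimal sketch of the Gabor mean/std features described above (illustrative settings).
import numpy as np
from skimage.filters import gabor

def gabor_features(gray, n_freqs=5, n_orients=7, f_max=0.327):
    """Mean and std of the Gabor magnitude: 2 x 5 x 7 = 70 features."""
    feats = []
    for k in range(n_freqs):
        frequency = f_max / (np.sqrt(2) ** k)        # half-octave spacing (assumed f_max)
        for j in range(n_orients):
            theta = j * np.pi / n_orients            # orientations spanning 180 degrees
            real, imag = gabor(gray, frequency=frequency, theta=theta)
            magnitude = np.hypot(real, imag)
            feats += [magnitude.mean(), magnitude.std()]
    return np.asarray(feats)
```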

• Local Colour Vector Local Binary Patterns (LCVBP): Concatenation of Colour Norm Patterns (CNP) and Colour Angular Patterns (CAP), as proposed by Lee et al. [86]. In CNP, the colour norm of a pixel in the periphery is thresholded at the value of the central pixel; in CAP, the thresholding is based on the angle that the projections of the colour vectors form in the RG, RB and GB planes. Since the CNP feature vector is the same length as LBP's and CAP's is three times longer, their concatenation produces 108 × 4 = 432 features.
• Opponent Colour Local Binary Patterns (OCLBP): Local Binary Patterns computed on each colour channel separately and on the R-G, R-B and G-B pairs of channels [87] (a sketch of the opponent-pair computation is given after this list). The other settings (type of neighbourhood and features) were the same as in grey-scale LBP (Section 4.1.2). The resulting feature vector is six times longer than LBP's; therefore, we have 108 × 6 = 648 features.
• Improved Opponent Colour Local Binary Patterns (IOCLBP): An improved version of OCLBP in which the thresholding is point-to-average instead of point-to-point [88]. This can be considered a colour version of ILBP (see Section 4.1.2).
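The opponent-pair thresholding used by OCLBP can be sketched as follows (illustrative only, not the authors' implementation): plain 256-bin codes at a single 1 px resolution are used here, whereas the descriptors above use rotation-invariant, multi-resolution encodings.

```python
# Minimal sketch of OCLBP: the 8-neighbourhood of one channel is thresholded at the
# central pixel of the other channel; within-channel LBP is the special case A = B.
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def cross_channel_lbp_hist(chan_a, chan_b):
    """LBP codes of chan_a thresholded against the centre pixel of chan_b."""
    h, w = chan_a.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = chan_b[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbour = chan_a[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbour >= centre).astype(np.uint8) << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def oclbp_features(rgb):
    """Concatenate within-channel and opponent-pair LBP histograms (6 x 256 bins)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    pairs = [(r, r), (g, g), (b, b), (r, g), (r, b), (g, b)]
    return np.concatenate([cross_channel_lbp_hist(a, c) for a, c in pairs])
```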

• VGG-F, VGG-M and VGG-S [93]
• VGG-VD-16 and VGG-VD-19 [6]
• VGG-Face [7]
Twelve of the above models were trained for object recognition and the remaining one (VGG-Face) for face recognition. Each network was used as a generic feature extractor, and the resulting features were passed on to a standard classifier (see Section 5). Following the strategy suggested in recent works [16,17,57], we considered the following two alternative types of features (see Section 4.3 and Figure 12):

• The order-sensitive output of the last fully-connected layer;
• The aggregated, orderless output of the last convolutional layer.
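A minimal sketch of both feature types, using a pre-trained ResNet-50 from torchvision, is given below. The model choice, input size, preprocessing and file name are illustrative assumptions; the study's own pipeline (13 models, MATLAB-based tooling) is not reproduced here, and the exact layers used for each network may differ.

```python
# Minimal sketch of off-the-shelf CNN feature extraction (illustrative only).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained backbone; replacing the classifier with the identity makes the forward
# pass return the 2048-d vector that would feed the classification layer.
fc_extractor = models.resnet50(weights="IMAGENET1K_V1")   # torchvision >= 0.13 assumed
fc_extractor.fc = torch.nn.Identity()
fc_extractor.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),                        # resize to the network's input field
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

img = Image.open("texture.png").convert("RGB")   # hypothetical file name
with torch.no_grad():
    x = preprocess(img).unsqueeze(0)             # shape (1, 3, 224, 224)
    fc_features = fc_extractor(x).squeeze(0).numpy()   # order-sensitive descriptor

# Orderless features: keep only the convolutional trunk and treat each spatial
# position of the last feature map as a local descriptor to be aggregated (Section 4.3).
conv_trunk = torch.nn.Sequential(*list(fc_extractor.children())[:-2])
conv_trunk.eval()
with torch.no_grad():
    fmap = conv_trunk(x)                                       # shape (1, 2048, 7, 7)
    local_descriptors = fmap.squeeze(0).reshape(2048, -1).T.numpy()   # 49 x 2048
```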

Learned vs. Unlearned Methods: Aggregation
All the methods considered in the experiments (and, more generally, all local image descriptors) can be either learned or unlearned. Methods belonging to the first group are also referred to as a posteriori and those belonging to the second as a priori (for a discussion on this point, see also [67,94]). In the first group, the image-to-feature mapping is the result of a preliminary learning stage aimed at generating a dictionary of visual words upon which the local features are aggregated. By contrast, unlearned methods do not require any training phase, since the image-to-feature mapping is a priori and universal.

Aggregation
Aggregation (also referred to as pooling [16]) is the process by which local image features (for instance, the output of a bank of filters or that of a layer of a convolutional network) are grouped around a dictionary of visual words in order to obtain a global feature vector suitable for use with a classifier [95].
The first step of the aggregation process is the definition of the dictionary, which usually consists of vector-quantizing the local features into a set of prototypes. Key factors in this phase are:

• The dimension of the dictionary;
• The algorithm used for clustering;
• The set of images used for training.
In our experiments, the dimension of the dictionary depended on the feature encoder used, as discussed below. The algorithm for clustering was always k-means, whereas for training we followed the same approach as in [57], i.e., we used a set of colour texture images from an external source [96]. To avoid overfitting and possibly biased results, these images were different from those contained in any of the datasets detailed in Section 3.
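A minimal sketch of this dictionary-learning step is given below: local descriptors extracted from the external training images are pooled and clustered with (mini-batch) k-means, and the cluster centres become the visual words. The function name, dictionary size and batch size are assumptions, not the settings of the original study.

```python
# Minimal sketch of dictionary learning by k-means over pooled local descriptors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_dictionary(local_descriptor_list, n_words=4096, seed=0):
    """local_descriptor_list: list of (n_i, D) arrays, one per external training image."""
    all_descriptors = np.vstack(local_descriptor_list)
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=seed, batch_size=10_000)
    kmeans.fit(all_descriptors)
    return kmeans.cluster_centers_          # (n_words, D) visual words
```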
For the aggregation, we considered the following two schemes [16,95] (for dimensionality reasons, aggregation over the DenseNet, GoogLeNet and ResNet models was limited to BoVW):
• Bag of Visual Words (BoVW);
• Vectors of Locally-Aggregated Descriptors (VLAD).
This choice was based on recent works [16,57] and was also the result of a trade-off between accuracy and dimensionality (recall that for a D-dimensional feature space and a dictionary with K visual words, the number of features respectively generated by BoVW and VLAD is K and K × D).
Convolutional networks also have built-in aggregation modules: the Fully-Connected (FC) layers. However, whereas BoVW and VLAD implement orderless aggregation (i.e., they discard the spatial configuration of the features), the aggregation provided by fully-connected layers is order-sensitive. The number of features produced by the FC layers depends on the network's architecture and is therefore fixed. For a fair comparison between the three aggregation strategies (FC, BoVW and VLAD), we chose a number of visual words for BoVW and VLAD that produced a number of features as close as possible to that produced by FC.
Post-processing involved L1 normalization of the BoVW features and L2 normalization of the individual VLAD vectors and of the vectors of FC features [16,57].
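The two encoders and the post-processing can be sketched as follows (hard assignment to the nearest visual word; illustrative only). Whether the L2 normalisation is applied per visual word or to the whole VLAD vector is a detail not spelled out above; the sketch normalises the whole vector.

```python
# Minimal sketch of BoVW and VLAD encoding over a fixed dictionary of visual words.
import numpy as np

def bovw_encode(descriptors, words):
    """descriptors: (n, D); words: (K, D).  Returns an L1-normalised K-dim histogram."""
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                          # index of the nearest visual word
    hist = np.bincount(assign, minlength=len(words)).astype(float)
    return hist / max(hist.sum(), 1.0)                  # L1 normalisation

def vlad_encode(descriptors, words):
    """Sum of residuals to each visual word: K x D features, L2-normalised."""
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    vlad = np.zeros_like(words, dtype=float)
    for k in range(len(words)):
        sel = descriptors[assign == k]
        if len(sel):
            vlad[k] = (sel - words[k]).sum(axis=0)
    vlad = vlad.ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)      # L2 normalisation
```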

Experiments
We comparatively evaluated the discrimination accuracy and computational demand of the methods detailed in Section 4 on a set of supervised image classification tests using the datasets described in Section 3. In the remainder, "Experiment #N" indicates the experiment run on the colour texture images of Group #N. Following the same approach as in recent related works [1,14,17,18], we used the nearest-neighbour classification strategy (with L1 distance) in all the experiments.
Performance evaluation was based on split-sample validation with stratified sampling. The fraction of samples of each class used to train the classifier was 1/2 for Experiments #1 and #2 and 1/8 for all the other experiments. In the first two cases, the choice was dictated by the low number of samples available (as few as four per class in some datasets); in the others, we opted for a lower training ratio in order to better estimate the robustness of the methods to intra-class variability. The figure of merit ("accuracy" in the remainder) was the ratio between the number of test samples classified correctly and the total number of test samples. For a stable estimation, the value was averaged over 100 different subdivisions into training and test sets.
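A minimal sketch of this validation protocol, assuming feature matrices are already available, is given below: stratified split-sample validation, a 1-NN classifier with L1 (Manhattan) distance, and accuracy averaged over repeated splits. Function and parameter names are assumptions.

```python
# Minimal sketch of the evaluation protocol described above (illustrative only).
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

def mean_accuracy(X, y, train_fraction=0.5, n_splits=100, seed=0):
    """X: (n_samples, n_features) features; y: class labels."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits,
                                      train_size=train_fraction,
                                      random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")   # 1-NN, L1 distance
    accuracies = []
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies), np.array(accuracies)
```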
For each experiment, a ranking of the methods was obtained by comparing all the image descriptors pairwise and assigning +1, −1 or 0 each time a method was significantly better than, worse than or not significantly different from the other. Statistical significance (α = 0.05) was assessed through the Wilcoxon-Mann-Whitney rank-sum test [97] over the accuracy values resulting from the 100 splits.
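The pairwise ranking scheme can be sketched as follows, using SciPy's two-sided rank-sum test over the per-split accuracies; the dictionary-based interface is an assumption.

```python
# Minimal sketch of the pairwise +1/-1/0 ranking scheme described above.
from scipy.stats import ranksums

def pairwise_ranking(per_split_accuracies, alpha=0.05):
    """per_split_accuracies: dict {method_name: array of accuracies over the splits}."""
    names = list(per_split_accuracies)
    score = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            stat, p = ranksums(per_split_accuracies[a], per_split_accuracies[b])
            if p < alpha:                       # significant difference at alpha = 0.05
                better, worse = (a, b) if stat > 0 else (b, a)
                score[better] += 1
                score[worse] -= 1
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)
```
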

Accuracy
Tables 4 and 5 respectively report the relative performance, in terms of ranking (as defined in Section 5), of the ten best hand-crafted descriptors and the ten best CNN-based features. The results depict a scenario which, on the whole, was dominated by CNN-based methods. Among them, the three ResNet models outperformed the other networks by far, and, interestingly, the FC configuration emerged as the best of the three strategies considered for extracting CNN-based features (FC, BoVW and VLAD).
Conversely, the hand-crafted descriptors came lower in the standings and were dominated by colour variants of LBP (e.g., IOCLBP, OCLBP and LCVBP).

Table 4. Hand-crafted descriptors: relative performance of the best ten methods at a glance. For each method, the columns #1-#10 show the rank (first row) and average accuracy (second row, in parentheses) by experiment. The next-to-last column reports the average rank and accuracy over all the experiments and the last column the overall position in the placings.

Experiments #1 and #2 (Tables 6-8) show that, under steady imaging conditions, hand-crafted descriptors were competitive only with stationary textures, whereas CNN-based features proved clearly superior with non-stationary ones (Figures 13-15). In Experiment #1 (Tables 6 and 7), the best-performing method belonged to the first group in four datasets out of 14; the reverse occurred in eight datasets, whereas in the remaining two the difference did not reach statistical significance (Figure 16). Interestingly, there was a marked gap when it came to classifying fine textures with a high degree of similarity, such as in the Parquet-38-S-N and V×CTSG-42-S-N datasets (Figure 14): in this case, the hand-crafted descriptors outperformed CNN-based features by a good margin. With non-stationary textures (Experiment #2), CNN-based features proved generally better, outperforming hand-crafted descriptors in six datasets out of nine, whereas the reverse occurred in only two datasets. Under variable illumination conditions (Experiments #3 and #4, Tables 9 and 10), CNN-based descriptors seemed able to compensate for changes in illumination better than hand-crafted descriptors did. This result is in agreement with the findings of Cusano et al. [57].
Datasets: (1) ALOT-95-S-I; (2) Outex-192-S-I; (3) RawFooT-68-S-I-1; (4) RawFooT-68-S-I-2; (5) RawFooT-68-S-I-3.
The results were, however, rather split in the presence of rotation (Experiments #5 and #6, Tables 11 and 12). Here, the hand-designed descriptors were significantly better than the CNN-based features in three datasets out of six, whereas the reverse occurred in two datasets. This parallels the results reported in [1] and is likely related to the directional nature of the learned filters in convolutional networks.
A similar trend emerged with variations in scale (Experiments #7 and #8, Tables 13 and 14). In this case, the hand-designed descriptors (particularly the 3D colour histogram) were significantly better than CNN-based features in two datasets, while the reverse occurred in the other two. In the presence of multiple and uncontrolled changes in the imaging conditions, including variations in illumination, scale and viewpoint (Experiments #9 and #10, Tables 15 and 16), the hand-crafted descriptors were simply not competitive: CNN-based features proved superior in all the datasets considered. The difference was more noticeable in those datasets (e.g., RDAD) where the intra-class variability was higher. Particularly interesting were the results obtained with the Describable Textures Dataset: here, CNN-based features surpassed hand-crafted descriptors by ≈30 percentage points. On the same dataset, the 60.8% accuracy achieved by ResNet-50 was equally remarkable in absolute terms. Another interesting outcome is the relative performance of the descriptors within the two classes of methods. The ranking of the CNN-based features was rather stable across all the experiments, with the three ResNet models invariably sitting in the first places of the standings. Conversely, hand-crafted descriptors showed a higher degree of variability: LBP colour variants (e.g., OCLBP, IOCLBP and LCVBP), for instance, which were among the best methods overall, did not perform well under variable illumination (as one would expect) and were in fact surpassed by grey-scale methods (e.g., SIFT and ILBP).

Computational Demand
Table 17 reports, for each image descriptor, the average Feature Extraction time per image (FE) and the average Classification time per subdivision into training and test sets (CL). The figures were recorded from Experiment #1. For a fair comparison, all the features were computed using the CPU only (no GPU acceleration was used). To facilitate a comparative assessment, we subdivided the whole population into quartiles (columns Q FE and Q CL of the table). On the whole, the results indicate that the best-performing hand-crafted descriptors (e.g., OCLBP, LCVBP and ICM) were generally slower than the CNN-based methods in the feature extraction step; in the classification step, however, the situation was inverted in favour of the hand-crafted descriptors, owing to their lower dimensionality.
(1) Haar wavelet; (2) bi-orthogonal wavelet.

Conclusions
We have compared the effectiveness and computational workload of traditional, hand-crafted descriptors against off-the-shelf CNN-based features for colour texture classification under ideal and realistic conditions. On average, the experiments confirmed the superiority of deep networks, albeit with some interesting exceptions. Specifically, hand-crafted descriptors still proved better than CNN-based features when there was little intra-class variability or where this could be modelled explicitly (e.g., rotations). The reverse was true when there was significant intra-class variability, whether due to the intrinsic structure of the images and/or to changes in the imaging conditions, and in general in all the other cases.
Of the three aggregation techniques used for extracting features via pre-trained CNN (i.e., FC, BoVW and VLAD), the first outperformed the other two in all the conditions considered. This finding is in agreement with the results recently published by Cusano et al. [17], but differs from those presented by Cimpoi et al. [16], in which VLAD (and to some extent BoVW) performed either better than or at least as well as FC. Note, however, that in our comparison we kept the number of features approximately equal for the three methods, whereas in [16] VLAD's feature vectors were significantly longer than those of FC and BoVW. Furthermore, consider that in our experiments the aggregation was performed over an external, dataset-independent dictionary, whereas [16] used dataset-specific (internal) dictionaries. Incidentally, it is worth noting that the FC configuration is the only one that allows a genuine off-the-shelf reuse of the networks in the strict sense, requiring only a resizing of the input images to fit the input field of the net.
Among the hand-crafted descriptors, colour LBP variants such as OCLBP, IOCLBP and LCVBP gave the best results under stable illumination, whereas dense SIFT proved the most effective method in the presence of illumination changes. Pure colour descriptors (i.e., full and marginal colour histograms) were the best methods to deal with variations in scale.
On the other hand, the performance of CNN-based features was rather stable across all the datasets and experiments, with the three ResNet models emerging as the best descriptors in nearly all experimental conditions.
As for the computational cost, the best CNN-based features were approximately as fast to compute as their hand-crafted counterparts (Table 17). The feature vectors, however, are at least twice as long (Tables 2 and 3), which implies a higher computational demand in the classification step.
Finally, an interesting and rather curious result is the high affinity that emerged between local binary patterns and the Outex database: in most of the experiments in which this dataset was involved, the best descriptor was an LBP variant, a finding that did not go unnoticed by other authors either [16].
Supplementary Materials: The datasets' file names, class labels and links to the original images are available as Supplementary Material. Feature extraction and classification were based on CATAcOMB (Colour And Texture Analysis tOolbox for MatlaB), which is provided in the CATAcOMB.zip file. An on-line version of the toolbox will also be made available at https://bitbucket.org/biancovic/catacomb upon publication.


Figure 11. Neighbourhoods used to compute features through histograms of equivalent patterns and other methods.

Figure 12. Simplified diagram of a generic convolutional neural network. For texture classification, we can either use the order-sensitive output of a fully-connected layer or the orderless, aggregated output of a convolutional layer.

Figure 13. Relative performance of hand-crafted descriptors vs. CNN-based features with stationary (x axis) and non-stationary (y axis) textures under invariable imaging conditions (Experiments #1 and #2, best 13 methods). The plot shows a clear divide between CNN-based methods (mostly clustered in the upper-left part, therefore showing affinity for non-stationary textures) and hand-crafted descriptors (mostly clustered in the lower-right part, therefore showing affinity for stationary textures).

Figure 15. CNN-based features were on the whole better than hand-crafted descriptors at classifying non-stationary textures such as those shown in the picture.

Table 1. Round-up table of the image datasets used in the experiments. ALOT, Amsterdam Library of Textures; CUReT, Columbia-Utrecht Reflectance and Texture Database; CBT, Coloured Brodatz Textures; MBT, Multiband Texture Database; RawFooT, Raw Food Texture Database; VisTex, Vision Texture Database; RDAD, Robotics Domain Attribute Database; STex, Salzburg Texture Image Database; DTD, Describable Textures Dataset; S, Stationary; NS, Non-Stationary; N, No variations; I, variations in Illumination; R, variations in Rotation; S, variations in Scale; M, Multiple variations.

Table 2. Summary table of the hand-crafted image descriptors used in the experiments. VLAD, Vectors of Locally-Aggregated Descriptors.

Table 3. Summary table of the off-the-shelf CNN-based features used in the experiments.

Table 5. CNN-based descriptors: relative performance of the best ten methods at a glance. For each method, the columns #1-#10 show the rank (first row) and average accuracy (second row, in parentheses) by experiment. The next-to-last column reports the average rank and accuracy over all the experiments and the last column the overall position in the placings.
Best five hand-crafted image descriptors and the best five CNN-based features (the other values are provided as Supplementary Material).

Table 6. Results of Experiment #1 (Part 1: Datasets 1-7): stationary textures acquired under steady imaging conditions. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 7. Results of Experiment #1 (Part 2: Datasets 8-14): stationary textures acquired under steady imaging conditions. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 8. Results of Experiment #2: non-stationary textures acquired under steady imaging conditions. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 9. Results of Experiment #3: stationary textures with variations in illumination. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 10. Results of Experiment #4: non-stationary textures with variations in illumination. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 11. Results of Experiment #5: stationary textures with rotation. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 12. Results of Experiment #6: non-stationary textures with rotation. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 13. Results of Experiment #7: stationary textures with variation in scale. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 14. Results of Experiment #8: non-stationary textures with variation in scale. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 15. Results of Experiment #9: stationary textures with multiple variations. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 16. Results of Experiment #10: non-stationary textures with multiple variations. Figures report overall accuracy by dataset. Boldface denotes the highest value; underline signals a statistically-significant difference between the best hand-crafted and the best CNN-based descriptor. Descriptors are listed in ascending order of by-experiment rank (best first).

Table 17. Computational demand: FE = average Feature Extraction time per image; CL = average Classification time per problem; Q FE and Q CL are the corresponding quartiles. Values are in seconds. Note: feature extraction and classification for the DenseNet-161 and DenseNet-201 models were carried out on a machine different from the one used for all the other descriptors. Consequently, computing times for the DenseNets are not directly comparable to those of the other descriptors.