Image Representation Using Stacked Colour Histogram

Image representation plays a vital role in the realisation of a Content-Based Image Retrieval (CBIR) system. Representation is necessary because pixel-by-pixel matching for image retrieval is impracticable as a result of the rigid nature of such an approach. In CBIR, therefore, colour, shape, texture and other visual features are used to represent images for effective retrieval. Among these visual features, colour and texture are particularly important in defining the content of an image. However, combining these features does not necessarily guarantee better retrieval accuracy, due to image transformations such as rotation, scaling and translation that an image may have undergone. Moreover, a feature vector representation that takes ample memory space affects the running time of the retrieval task. To address these problems, we propose a new colour scheme called the Stacked Colour Histogram (SCH), which inherently extracts colour and neighbourhood information into a descriptor for indexing images. SCH performs recurrent mean filtering of the image to be indexed. The recurrent blurring works by repeatedly filtering (transforming) the image: the output of one transformation serves as the input of the next, and in each case a histogram is generated. The histograms are summed bin-by-bin and the resulting vector is used to index the image. Because the blurring process uses each pixel's neighbourhood information, the proposed SCH captures the inherent textural information of the indexed image. The SCH was extensively tested on the Coil100, Outext, Batik and Corel10K datasets. The Coil100, Outext and Batik datasets are generally used to assess image texture descriptors, while Corel10K is used for heterogeneous descriptors.
The experimental results show that our proposed descriptor significantly improves retrieval and classification rates, both for images with textural features and in representing heterogeneous images, when compared with the state-of-the-art descriptors CMTH, MTH, TCM, CTM and NRFUCTM.


Introduction
The success of a Content-Based Image Retrieval (CBIR) system hinges mainly on the efficacy of the descriptors used to represent images. Colour, shape and texture are widely used image descriptors [1]. Most CBIR methods depend to a large extent on the colour and texture of the image. The colour histogram is the simplest method of representing colour features for a CBIR task [2,3]. A comprehensive survey on colour histogram image retrieval is given in [4-10]. The most significant advantages of the colour histogram are its simplicity [10], fast computation [11] and robustness to rotations and transformations of the image [12]. However, it cannot capture the spatial information of colours [13] in an image, and therefore texture feature extraction techniques are widely used to extract this information [14,15].
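To illustrate the simplicity of the colour histogram mentioned above, the following is a minimal sketch of a global colour histogram; the four-levels-per-channel quantisation (giving 64 bins) is an illustrative choice, not a parameter fixed by the literature cited here:

```python
import numpy as np

def colour_histogram(image, bins_per_channel=4):
    # Quantise each RGB channel (values 0..255) into bins_per_channel levels,
    # giving a bins_per_channel**3 dimensional global colour descriptor.
    levels = (image.astype(np.uint32) * bins_per_channel) // 256
    # Combine the three per-channel indices into a single bin index.
    idx = (levels[..., 0] * bins_per_channel + levels[..., 1]) * bins_per_channel + levels[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # normalise so the bins sum to 1

# Example: a random 32x32 RGB image indexed by a 64-bin histogram.
img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
h = colour_histogram(img)
```

Note that the descriptor ignores where colours occur: any permutation of the pixels yields the same histogram, which is exactly the lack of spatial information discussed above.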
Texture descriptors are significant for bridging the semantic gap in CBIR. Local Binary Pattern (LBP) descriptors [16,17], the Gray-Level Co-occurrence Matrix (GLCM) [18], Gabor filters [14,18] and Hidden Markov random fields [19] are widely used to extract texture information for indexing images. Texture descriptors have not always yielded optimal results; hence, combining them with other descriptors for effective CBIR is common practice. Combining colour and texture in a single descriptor is common, though combinations of shapes, colours, edge orientation and texture can also be found. Such combinations either fuse several feature vectors into one, or use a feature extraction approach that inherently captures two or more features in a single representation vector. The authors of [20] fused colour feature information and pixel orientation into an image representation vector, while [20-25] integrated both colour and texture information in a single descriptor using textons. Textons are structural elements used to capture the run lengths in an image at a particular orientation. Textons generally produce good results. However, their structure is somewhat rigid, which sometimes leads to poor performance when the orientation of the images changes [26]. These changes also make texton feature extraction produce different feature maps or vectors for the same image. Moreover, feature extraction approaches such as bag-of-words, edge histograms and the GLCM, which incorporate neighbourhood or spatial information, may be complicated to implement or too rigid at some point.
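As a concrete example of one of the texture descriptors named above, a basic 3 × 3 LBP can be sketched as follows; this is an illustrative NumPy implementation of the standard 8-neighbour operator, not code from any of the cited works:

```python
import numpy as np

def lbp_codes(img):
    """3x3 Local Binary Pattern codes for the interior pixels of a
    greyscale image; the histogram of the codes is the texture descriptor."""
    c = img[1:-1, 1:-1]  # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Neighbour plane shifted by (dy, dx), same shape as the centre.
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        # Neighbours >= centre contribute one bit of the 8-bit code.
        code |= (neigh >= c).astype(np.uint8) << bit
    hist = np.bincount(code.ravel(), minlength=256)
    return code, hist
```

The rigidity discussed above is visible here: the fixed neighbour ordering means a rotated image produces different bit patterns, and hence a different histogram, unless a rotation-invariant code mapping is added.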
This paper proposes a new feature extraction technique that extracts colour and neighbourhood information into a descriptor for indexing images. The presented scheme sums several histograms generated from the transformation of images.
The contributions of this work are summarised as follows: (a) The proposed SCH method for an image representation vector can effectively deal with texture and colour heterogeneous images; (b) In the SCH method, the inherent colour extraction and neighbourhood information based on recurrent transformations provide more discrimination of colour and texture features; (c) The novel descriptor improves the image recognition rate with minimal memory space and retrieval time.
Section 2 in this paper presents related works that are relevant to our own. Section 3 introduces our proposed SCH method. Section 4 contains the experimental results, evaluations and analysis of the results and the last section concludes this work.

Related Works
Several approaches to image representation using colour and texture exist. In this section, five closely related works are reviewed: the Texton Co-occurrence Matrix (TCM) [23], Multi-Texton Histogram (MTH) [27], Colour Difference Histogram (CDH) [24], Complete Texton Matrix (CTM) [28] and Complete Multi-Texton Histogram (CMTH) [29], which are largely used to index images with strong textural properties.

Texton Co-Occurrence Matrix (TCM)
Liu and Yang [23] developed TCM to measure texture in a given image. They defined image features through the spatial correlation of pixels as a statistical function of textons. Results showed that TCM provides a rich description of colour, texture and shape. However, TCM has been recommended for texture only, making it inappropriate for representing a heterogeneous image dataset. More importantly, simplifying the TCM co-occurrence matrix into third-order moments results in the loss of useful information.

Multi-Texton Histogram (MTH)
Minarno et al. [27] extended the concept of the original TCM work [23] and put forward an enhanced version called the Multi-Texton Histogram (MTH) [27]. The MTH represents heterogeneous images, unlike TCM, which extracts textural information only. MTH leverages the strengths of the co-occurrence matrix and the colour histogram, as it contains information about the spatial relationship of colours and the orientations of texture. MTH assumes that two adjacent pixels with the same colour lie in the same direction and vice versa. However, such an assumption is not always applicable, and it leads to other types of textons in the image being disregarded.

Colour Difference Histogram
Liu and Yang [24] proposed an image representation method called the Colour Difference Histogram (CDH) for retrieving images. The CDH approach improves on the Multi-Texton Histogram (MTH). Its most significant improvement is the perception of uniform colour differences in feature representation, analogous to human colour perception. The scheme achieves feature representation by combining colour, colour difference, orientation and spatial distribution in an image. It uses a vector size of 108 for image retrieval, which means the index requires relatively high memory space compared with approaches that use sizes such as 48 and 64.

Complete Texton Matrix (CTM)
Kumari et al. [28] proposed a complete feature representation of heterogeneous images based on textons. The proposed CTM is derived from TCM and MTH. The paper used 11 textons to increase the information represented in the image, unlike the four textons used in TCM. The principle of CTM is straightforward: the method first quantises the original image into 256 colours and calculates the gradient of the RGB colours. Then, the statistical information of the eleven derived textons on a 2 × 2 grid, in a non-overlapping fashion, is calculated to describe the image features more accurately. CTM does not include gradient or edge orientation information but is limited to the texton co-occurrence matrix. This restriction results in weak representations at some points.

Complete Multi-Texton Histogram (CMTH)
Based on the works of [23,27,28], Khaldi et al. [29] proposed the Complete Multi-Texton Histogram (CMTH), which combines four texton-based features, namely MTH, TCM, CTM and NRFUCTM. This scheme uses eleven textons to analyse the correlation between neighbouring textons, colour and edge orientations. The movement of each texton over the image is time-consuming. In addition, many valuable texture patterns go unobserved due to the rigid structure of the textons [26].

Stacked Colour Histogram (SCH)
Spatial information is critical when extracting features for indexing images for retrieval tasks. This component is generally missing in feature extraction techniques such as the Conventional Colour Histogram (CCH), though that approach is invariant to scale, rotation and translation in images. The Colour Correlogram (CC), Colour Coherence Vector (CCV), Joint Histogram, Colour Moment (CM), Local Colour Histogram (LCH), Dominant Colour Descriptor (DCD) and Colour Difference Histogram (CDH) are approaches that turn spatial information into feature vectors for representing images. The quest to integrate spatial information into a feature vector leads to two main challenges, which this work seeks to address:

•	The feature vector used to represent images sometimes takes up too much space, which affects the retrieval speed.
•	The approach used for the extraction may be rigid, which sometimes leads to poor performance when the orientation of images changes.
SCH is a summation of several histograms generated from transformations of an image. In this work, mean filtering is used to transform images iteratively. The filtering, which employs a window size of n × n, is performed N times. In each instance of the filtering process, the output of the previous filtering is used as the input to the current filtering. A histogram is generated from each filtered image, and the histograms are later added together to estimate the SCH. Because the filtering process uses neighbourhood information, the estimated histogram contains information about the distribution of pixels in the image. The steps below detail the recurrent filtering and the estimation of the histogram at each instant:
1. Read the image I to be indexed.
2. count ← N (the number of recurrent transformations).
3. sH ← 0 (the accumulator vector).
4. H ← Histogram of Image (I).
5. sH ← sH + H.
6. I ← Mean Filter (I). // Perform mean filtering of the image in variable I
7. count ← count − 1.
8. Repeat steps 4-7 while count > 0.
9. Store sH as the index for the read image.
The steps generate N histograms. If all the histograms are stored in a matrix H, with the number of bins in the histograms as the number of columns and the number of histograms as the number of rows, the SCH is a vector defined as V = (V1, V2, V3, . . . , Vm). Each Vi can be computed using Equation (1).
Vi = ∑ H(j, i), summed over j = 1, . . . , N,        (1)
where H is the matrix containing all the histograms. Images in the database are represented using the SCH. The recurrent filtering transforms images into new ones, but images with fewer edges, i.e., with large homogeneous sections, are not transformed significantly and therefore have histograms similar to, or the same as, that of the original image. Images with the same histogram before the transformation will have different subsequent histograms if the distribution of their pixels does not exhibit homogeneous characteristics. Therefore, the shapes of the SCH for images with a similar spatial distribution of pixel values will be similar. Figure 1 presents five different images with the same histogram. The last column of Figure 1 presents the SCH of the respective images. All the images exhibit different SCH shapes, which can help to uniquely represent them.
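The recurrent filtering and bin-by-bin summation described above can be sketched as follows; this is a minimal greyscale illustration in pure NumPy, where the window size n, the number of transformations N and the 64-bin histogram are experimental choices rather than values fixed by the method:

```python
import numpy as np

def mean_filter(img, n):
    # n x n mean filter with edge padding (pure NumPy sketch).
    pad = n // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (n * n)

def stacked_colour_histogram(image, n=3, N=5, bins=64):
    """Sum of the histograms of N recurrent mean filterings of a
    greyscale image: the output of one filtering feeds the next, and
    every intermediate histogram is accumulated bin-by-bin."""
    img = image.astype(np.float64)
    sch = np.zeros(bins)
    for _ in range(N):
        hist, _ = np.histogram(img, bins=bins, range=(0, 256))
        sch += hist                # bin-by-bin summation
        img = mean_filter(img, n)  # output feeds the next transformation
    return sch
```

A homogeneous image is nearly unchanged by each filtering pass, so its N histograms, and hence its SCH, stay close to N copies of the original histogram; a textured image drifts with each pass, producing a distinctive stacked shape, which is the discriminative behaviour described above.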
I Mean Filter (I) and assign to I. //Perform mean filtering of the image in variable I 11. count = count -1 12. Store sH as Index for the read image.

Experimental Evaluation
In this section, we demonstrate the performance of the SCH method using four datasets (Batik, Coil100, Outext, and Corel10K). The Batik, Coil100 and Outext datasets are used to evaluate SCH in differentiating texture images. The Corel10K dataset is used to evaluate the heterogeneous image recognition capabilities of the proposed method.
Below is a concise description of datasets.

− Batik [30]: 300 images in total, in 50 classes of six images each. The images are 128 × 128 pixels in JPEG format.
− Coil100 [31]: 7200 colour images categorised into 100 classes; each class contains 72 images of one object in different poses.
− Outext [32]: 11,484 images in total, categorised into 29 classes.
− Corel10K [33]: 10,000 images in total, in 100 classes of 100 images each. The images are 128 × 192 pixels in JPEG format.
Sample images from each dataset are illustrated in Figure 2.
The initial experiment evaluated the accuracy of window sizes and recurrent transformations using different classifiers and different dataset features. Window sizes of 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11 and the numbers of recurrent transformations of 5, 10, 15, 20 and 25 were used for extracting features for this experiment [34].

Machine Learning Approach for Evaluation
K-fold cross-validation was performed to avert the over-fitting of the data during the process of training. The different classifiers with K-fold cross-validation (k = 5) were used to evaluate the performance on multiple and different features of datasets. Using Principal Component Analysis (PCA), we reduced the feature sizes to speed up computation, setting the threshold to 95%.
Five different classifiers, namely the Discriminant Analysis Classifier (DAC), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and k-Nearest Neighbours (KNN), were used to classify images from each dataset. The classification accuracy metric is used to measure the performance of the classifiers. Equation (2) indicates the mathematical expression for classification accuracy.
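The evaluation pipeline described above (PCA at a 95% variance threshold, a classifier, and 5-fold cross-validated accuracy) can be sketched with scikit-learn as follows; the synthetic feature matrix, the KNN choice and all parameter values are illustrative stand-ins, not the paper's actual data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a feature matrix: 100 "images" x 64 histogram
# bins in 5 classes (the real experiments use SCH vectors of the datasets).
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 20)
X = rng.normal(loc=y[:, None], scale=0.5, size=(100, 64))

# PCA retaining 95% of the variance, then a classifier, scored with
# 5-fold cross-validation to limit over-fitting.
model = make_pipeline(PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
mean_accuracy = scores.mean()
```

Passing a float to `n_components` makes PCA keep just enough components to explain that fraction of the variance, which is how the 95% threshold mentioned above is usually expressed.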

Image Retrieval
In this section, the proposed descriptor is evaluated in a classical CBIR task. Precision and recall metrics are used to evaluate the proposed descriptor's performance in the CBIR task. Equations (3) and (4) present the formulae for the precision and recall metrics, respectively.
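Equations (3) and (4) did not survive extraction here; assuming the standard CBIR definitions (precision = relevant retrieved images / retrieved images, recall = relevant retrieved images / relevant images in the database), the metrics can be computed per query as follows (function and variable names are illustrative):

```python
def precision_recall(retrieved, query_class, class_size):
    """Precision and recall for one query in a class-labelled dataset.

    retrieved   : class labels of the retrieved images
    query_class : class label of the query image
    class_size  : number of relevant images in the database
    """
    relevant_retrieved = sum(1 for c in retrieved if c == query_class)
    precision = relevant_retrieved / len(retrieved)
    recall = relevant_retrieved / class_size
    return precision, recall

# Example: 12 images retrieved, 9 share the query's class, class has 100 images.
p, r = precision_recall(['a'] * 9 + ['b'] * 3, 'a', 100)
```

With a fixed number of retrieved images, as in the experiments below, precision rewards returning mostly same-class images while recall is bounded by how few of the class's images can fit in the result list.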

We first present the results of the machine learning approach used for the experiment. Table 1 provides the accuracy results of the DAC classifier on all datasets. The window size 11 × 11 with a recurrent transformation count of 5 provided the highest accuracy of 91% for Coil100. All window sizes with a recurrent transformation count of 5 gave the highest accuracy of 36% for Corel10K. The window sizes 7 × 7 and 9 × 9 with a recurrent transformation count of 5 produced 67% accuracy for the Outext dataset. Mean accuracies of 87.64%, 32.72% and 64.24% were recorded for the Coil100, Corel10K and Outext datasets, respectively.
Table 3 presents the accuracies obtained by applying the NB classifier on all datasets. Most 9 × 9 and 11 × 11 window sizes with recurrent transformation counts of 15, 20 and 25 produced the highest accuracy of 92% for Coil100. The window sizes 5 × 5 and 11 × 11 with recurrent transformation counts of 15 and 25 produced the highest accuracy of 36% for Corel10K. The window size 3 × 3 with recurrent transformation counts of 15 and 20 produced 64% for the Outext dataset, and the window size 5 × 5 with a recurrent transformation count of 5 produced the same accuracy. The mean accuracies for Coil100, Corel10K and Outext were 89.96%, 34.28% and 61.08%, respectively.
Table 4 shows the accuracy outcomes of applying the DT classifier on all datasets. The window size 11 × 11 with recurrent transformation counts of 15 and 25 produced the highest accuracy of 92% for Coil100. The window sizes 5 × 5 and 11 × 11 with recurrent transformation counts of 15 and 25 produced the highest accuracy of 24% for Corel10K. The window size 5 × 5 with recurrent transformation counts of 20 and 25 produced 79% accuracy for the Outext dataset.
Mean accuracies of 88.96%, 22.96%, and 76.32% were recorded for the Coil100, Corel10K, and Outext datasets.
From Tables 1-4, window size 11 × 11 with recurrent transformation counts of 15, 20 and 25 generally produced good results across the four classifiers. Khaldi et al. [29], in their work "Image representation using complete multi-texton histogram", compared their results with the state-of-the-art texture feature extraction techniques for indexing or representing images. The Texton Co-occurrence Matrix (TCM), Multi-Texton Histogram (MTH), Complete Texton Matrix (CTM), Complete Multi-Texton Histogram (CMTH) and Noise Resistant Fundamental Units of Complete Texton Matrix (NRFUCTM) were compared in that paper. Additionally, our SCH method represents textures remarkably well and shows high performance in classifying Outext (SCH yields 64.24%, which far outperforms CMTH with 42.34% for DAC).
We conducted another evaluation using the KNN classifier on all datasets. When k = 1, mean accuracies of 96%, 31.80% and 77.63% were recorded for the Coil100, Corel10K and Outext datasets, respectively. When k = 3, the corresponding accuracies were 95%, 31.35% and 77.33%; when k = 5, they were 94%, 34.03% and 76.04%; and when k = 9, they were 94%, 31.93% and 77.19%. Table 6 shows the mean accuracies of the KNN classifier, with the different values of k, on all datasets. The results obtained were compared with the texton-based features in [29] (Table 7). The accuracy results of our SCH method denote better performance in representing heterogeneous images, and our method also performs relatively well for texture images.
The SCH is primarily proposed for indexing images for CBIR or Query by Image Content (QBIC). We therefore evaluated the proposed descriptor in a conventional CBIR task.
The experiment evaluated the effect of window sizes and the number of recurrent transformations on precision and recall on the identified datasets. Window sizes of 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11, and recurrent transformation counts of 5, 10, 15, 20 and 25, were used for extracting features for this experiment. The numbers of retrieved images from each dataset were 6, 12, 18, 24 and 30. Figures 3-6 present the precision and recall performances of the proposed descriptor on the Corel-10K, Outext, COIL100 and Batik datasets.
Figure 3 presents the average precision values for the five window sizes and the five retrieval counts (6, 12, 18, 24 and 30). The experiment determined which window size and number of recurrent transformations would be adequate for SCH.
From Figure 4, we can see that the different datasets recorded different precision patterns. COIL-100 precision values increased with increasing window size. However, Batik, Corel-10K and Outext did not record such a pattern, with Batik in particular depicting alternating precision with increasing window size. Figure 7 presents the results of SCH together with those of the textons implemented by Khaldi et al. [29] on Corel-10K. Both experiments used retrieval values of 10 to 100 with a step of 10. The proposed SCH outperformed TCM, MTH, CTM, NRFUCTM and CMTH.
The retrieval efficiency of SCH is compared with that of the texton approaches proposed by Khaldi et al. [29] for extracting textural features from an image. Textons were selected largely because the literature has demonstrated that they are very effective for extracting textural information from images [27,35] and can hence be used as a benchmark for evaluating similar feature extraction schemes. From Figure 7, we can see that SCH outperformed all the texton approaches that have been used to index Corel-10K. The obtained result stems from the transformational (rotation, scaling, translation and deformation) invariant nature of SCH.

Conclusions
This paper presents a simple and effective descriptor that represents colour and texture images in a given database. The proposed descriptor utilises colour and neighbourhood information for indexing images. The indexing scheme uses a vector dimension of 64 and is therefore very efficient for image retrieval. The descriptor provides a highly robust feature set for representing the colour and texture of an image. The final experiments were carried out over four public datasets. Three image databases, Batik, Coil100 and Outext, were used to evaluate the texture representation capability of our proposed descriptor, while Corel10K was used to evaluate its ability in heterogeneous image representation. The experimental results show that our proposed method has strong discriminative power over texture and colour features and outperforms well-established texton-based features.
