A Novel Local Structure Descriptor for Color Image Retrieval

A novel local structure descriptor (LSD) for color image retrieval is proposed in this paper. Local structures are defined based on the similarity of edge orientation, and the LSD is constructed from the underlying colors in local structures with similar edge direction. The LSD effectively combines color, texture and shape as a whole for image retrieval. The local structure histogram (LSH) extracted from the LSD integrates the advantages of both statistical and structural texture description methods, and it possesses high indexing capability and low dimensionality. In addition, the proposed feature extraction algorithm does not need to be trained on a large-scale training dataset. Experimental results on the Corel image databases show that the descriptor achieves better image retrieval performance than the other descriptors compared.


Introduction
Images are one of the most popular media formats for communication and understanding in human society. With the rapid development of internet and multimedia techniques, an ever-increasing number of images is available to the public. People therefore want more efficient image indexing tools, and image search has become one of the crucial issues in computer vision. Generally, people retrieve images in three ways. Text-based methods search images by the keywords annotated on them, and they are widely used in the applications of many companies, such as Google and Baidu. However, this approach requires images to be labeled manually, and manually annotating a large collection of images is time-consuming. Furthermore, the retrieval results may be inaccurate. The content-based image retrieval (CBIR) approach computes the similarity between images from low-level features that describe the content of each image, and it returns the images whose similarity falls within a predefined threshold. However, low-level features cannot describe semantic concepts because of the large semantic gap. In the last few years, many semantic-based image retrieval (SBIR) methods have been proposed, but owing to the limits of the relevant techniques, SBIR remains an open problem [1,2]. To this day, CBIR is still one of the most effective image indexing methods.
It is well known that competitive interactions among neurons strengthen human visual attention: a few elements of attention are selected, and other irrelevant material is suppressed by neurons [3]. There are close connections between the visual features of an image and the human visual system, and how to exploit the visual attention mechanism for CBIR is a vital and challenging problem. To simulate the visual processing procedure, we propose a novel local structure descriptor (LSD) for image retrieval in this paper. The novelty of this descriptor lies in three points. (1) Because human visual perception is sensitive to orientation and color, we use the edge orientation and color of each image to describe its local structure, and then use the local structure descriptor to simulate the visual processing procedure. (2) The descriptor not only describes image features, but also combines color, texture, shape and color layout as a whole without any image segmentation, training or learning. (3) The dimensionality of the LSD feature vector is low, its time complexity is linear, and the LSD is easy to implement.
In the LSD, we first focus on how to define local structures. We adopt the feature integration method of the two-stage model to define local structure [4]. In the pre-attentive stage, primitive features are extracted in special modules of feature maps. In the attentive stage, focal attention is required to integrate the separate features into objects. Subsequently, we turn to describing local structures. We use a local structure histogram (LSH) to extract the feature vector of each image in HSV (Hue, Saturation, Value) color space. The local structure descriptor can describe the color, texture and shape of each image simultaneously. Therefore, the LSD combines the advantages of both statistical and structural texture descriptors. Finally, we focus on improving the performance of the LSD by studying the effects of different color spaces, parameter settings, gradient operators and distance metrics. Experimental results on large image databases show that the LSD achieves higher precision and recall than representative visual descriptors.
Our contributions are as follows. (a) We design a new local structure descriptor to simulate the human visual perception mechanism. The descriptor combines color, texture, shape and color layout as a whole. The dimensionality of its feature vector is low, which makes it appropriate for large-scale image retrieval. The descriptor achieves better accuracy on standard benchmarks than the other descriptors compared. (b) The detailed experiments we carried out add to our understanding of how different color spaces, parameter settings and gradient operators affect the studied descriptor.
The rest of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the definition of the LSD. The extraction of the local structure histogram in HSV color space is presented in Section 4. In Section 5, the similarity measurement is presented. In Section 6, the proposed descriptor is evaluated and compared with state-of-the-art descriptors on the Corel image databases. The conclusion of this paper is presented in Section 7.

Related Works
Generally, researchers use different descriptors to represent the color, texture and shape of an image, and they design different algorithms to extract image features and retrieve images. Because of its simplicity and effectiveness, the color histogram has been extensively studied and used for image retrieval. It is also invariant to scale and orientation. However, it is difficult for the color histogram to represent the spatial structure of an image. To exploit spatial information, other color descriptors have been proposed [5,6]. Texture features carry valuable information about the smoothness, coarseness and regularity of objects such as fruits, skins, clouds and trees, and texture-based schemes have been widely studied in CBIR [7]. Manjunath designed the MPEG-7 edge histogram descriptor (EHD) according to the spatial distribution of edges, which is a very effective texture descriptor for image representation [8]. In addition, the outlines of different objects in an image are usually different, and shape features have been widely used in CBIR for their capability to describe the object outline. The edge orientation autocorrelogram (EOAC) [9] is one of the most successful methods.
Local image features have attracted more and more attention in recent years. Various feature algorithms have been presented to highlight different properties of an image, such as gray level and edges. These methods are often based on distributions, and color features can be combined with texture features. Lowe et al. propose the scale-invariant feature transform (SIFT) descriptor, which is an effective algorithm to detect and describe local features of an image [10]. Wang et al. propose a retrieval scheme based on structure elements' descriptors (SEDs) [11]. SED effectively presents the spatial connection of color and texture by extracting the color and texture features of an image. Liu et al. propose an image feature representation method called the color difference histogram (CDH) [12]. CDH is completely different from the existing histogram techniques; it pays more attention to the perceptually uniform color difference between color and edge orientation. Murala et al. utilize the orientation map to obtain the first-order derivatives in the horizontal and vertical directions based on Local Tetra Patterns (LTrPs) for texture image retrieval, and improve retrieval performance [13]. Meng et al. compute the features of an image and their spatial correlation by using salient points in image patches, and they propose a multi-instance learning algorithm; this method improves the average retrieval accuracy [14]. Wang et al. extract the color feature of an image by Zernike color distributed moments and compute the texture feature by the contourlet transform. They then combine color with texture for image retrieval, and the algorithm obtains better image retrieval performance [15]. Local descriptors are tolerant to varied illumination, distortions and transformations, and are robust to occlusion. They can also describe different characteristics of object appearance or shape without any image segmentation. However, developing computational models that describe color, texture, shape and spatial structure simultaneously remains an important challenge.
To further improve image retrieval performance, many feature combination methods and relevance feedback techniques have been proposed recently [16][17][18][19]. Lee et al. propose a retrieval scheme that extracts image features by fusing the Advanced Speed-Up Robust Feature (ASURF) and the Dominant Color Descriptor (DCD). The system runs in real time on an iPhone and finds natural color images for mobile image retrieval [16]. Kafai et al. describe Discrete Cosine Transform (DCT) hashing for creating index structures for face descriptors; this method efficiently reduces the cost of the linear search and improves retrieval efficiency and accuracy [17]. Yang et al. propose a semi-supervised Local Regression and Global Alignment (LRGA) algorithm for data ranking and a semi-supervised long-term Relevance Feedback (RF) algorithm that uses the data distribution and the historical RF information, and they integrate the two algorithms into a multimedia content analysis and retrieval framework [18]. Spyromitros-Xioufis et al. use the framework of the Vector of Locally Aggregated Descriptors (VLAD) and Product Quantization to develop an enhanced framework for large-scale image retrieval; the system significantly improves image retrieval performance [19].
Many recent large-scale image retrieval frameworks use the bag-of-words (BoW) descriptor [20]. In these methods, the retrieval system extracts local features (usually SIFT [10]) from each image and assigns each feature to the nearest visual word of a visual vocabulary. The BoW feature vector of each image is high dimensional and sparse. BoW-based methods achieve good indexing accuracy, but they cannot scale to datasets of more than a few million images on a single machine, owing to high computational complexity and memory limits. Recently, a few scalable algorithms have been developed in [21][22][23]. The feature vectors of these methods are more discriminative than BoW, and they are combined with powerful compression techniques.
One of the most successful algorithms is proposed in [21]. This method replaces the BoW representation of SIFT features with the highly discriminative Fisher vector [22] or its simpler variant, VLAD [23]. By exploiting these optimized vector representations, considerably better retrieval accuracy is achieved.
In recent years, deep learning has produced encouraging results on CBIR tasks. For example, Wan et al. use a deep learning framework for CBIR; they find that deep learning can mimic the human brain, which is organized in a deep architecture and processes information through multiple stages of transformation and representation [24]. Ng et al. extract convolutional features from different layers of deep convolutional networks and adopt VLAD encoding to encode the features into a vector; they conclude that intermediate layers or higher layers with finer scales produce better results for image retrieval [25]. Zhang et al. combine a deep convolutional neural network (CNN) with learned hash functions: the deep CNN is used to train the model, and discriminative image features and hash functions are optimized simultaneously [26]. Lin et al. propose an effective deep learning framework that learns binary hash codes by employing a hidden layer to represent the latent concepts that dominate the class labels in a point-wise manner [27]. Deep learning has many advantages. One of them is that it can learn hierarchical features at different levels of semantic abstraction, which effectively improves the performance of CBIR. However, deep learning requires careful parameter tuning, and the feature dimensionality of the last layer is very high, which may limit its application in practice.

Local Structure Descriptors
Directly comparing image content is impractical for image search because the contents of images differ significantly. However, the structural information of images from the same class often shows some similarity. We may regard the semantics of a natural image as being made up of many common local structures. If these local structures can be extracted and represented effectively, we can use them as a common base for matching different images. It is very important for image retrieval that shape is extracted along with the color, texture and color layout of each image.
One key problem of the local structure descriptor is how to define local structure. Let $g(x, y)$ be a color image of size $M \times N$. First, to detect local structure, we convert the image from RGB color space into HSV color space, quantize it into 72 colors in HSV color space, and use nine structure templates to detect edge orientation. The nine structure templates are nine $2 \times 2$ matrices of different directions, which are displayed in Figure 1. Second, we define the local structure in the edge orientation map. Third, we construct the LSD from the underlying colors in the local structures. Finally, we use the local structure histogram to describe the LSD and represent the image feature. The LSD not only represents each image effectively, but also extracts the shape, texture, color and color layout features of each image simultaneously. We present these steps in detail as follows.

Selection and Quantization of Color Space
HSV space is widely used for image feature extraction; it is made up of three components: Hue (H), Saturation (S) and Value (V). HSV is usually modeled as a cylinder. The H component represents the color type; its value is between 0° and 360°, with red at 0°, green at 120° and blue at 240°. The S component describes the relative purity of the color; its value is between 0 and 1. The V component represents the brightness of the color; its value is also between 0 and 1.
Human color perception can be imitated well by HSV color space. In this paper, we select HSV color space, and the color image is quantized into 72 bins. Specifically, we quantize the H, S and V components into 8, 3 and 3 bins, respectively. Hence, we obtain $8 \times 3 \times 3 = 72$ color combinations. Let $I(x, y)$ be the quantized color image, with $I(x, y) = i$, $i \in \{0, 1, 2, \ldots, 71\}$.
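The quantization above can be sketched as follows. This is a minimal illustration only: the text fixes the bin counts (8, 3, 3) but not the bin boundaries or the ordering of the 72 indices, so the uniform bin edges, the index layout and the helper names `quantize_hsv` and `quantize_rgb` are assumptions of this sketch.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 8, 3, 3  # 8 x 3 x 3 = 72 colors, as in the text


def quantize_hsv(h_deg: float, s: float, v: float) -> int:
    """Map (H in degrees, S in [0, 1], V in [0, 1]) to a color index in 0..71.

    Assumption: uniform bin edges and the layout (h * 3 + s) * 3 + v.
    """
    h_bin = min(int((h_deg % 360.0) / 360.0 * H_BINS), H_BINS - 1)
    s_bin = min(int(s * S_BINS), S_BINS - 1)
    v_bin = min(int(v * V_BINS), V_BINS - 1)
    return (h_bin * S_BINS + s_bin) * V_BINS + v_bin


def quantize_rgb(r: int, g: int, b: int) -> int:
    """Convert an 8-bit RGB pixel to HSV with colorsys, then quantize it."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return quantize_hsv(h * 360.0, s, v)
```

Applying `quantize_rgb` to every pixel of $g(x, y)$ yields the quantized image $I(x, y)$ under these assumptions.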

Edge Direction Detection
Edge direction plays an important role in image recognition. The boundaries and texture structure of objects are represented by the direction map of an image, and the direction map carries a large amount of semantic information about the image. Hence, the detection of edge direction is a vital processing operation. Edge direction can be detected by many existing edge detection operators, such as the Sobel operator and the Canny operator. However, these detectors are designed for gray-level images and cannot be applied directly to a color image, because a color image has three color channels. In this subsection, the following method is used to detect edge orientation for a color image in HSV color space.
In Cartesian space, we define the dot product of vectors $a(x_1, y_1, z_1)$ and $b(x_2, y_2, z_2)$ as follows:

$$a \cdot b = x_1 x_2 + y_1 y_2 + z_1 z_2$$

Because HSV color space uses a cylindrical coordinate system, it should be transformed into the Cartesian coordinate system before the angle between vectors is calculated. Let $(H, S, V)$ represent a point in HSV color space and $(H', S', V')$ represent its transformation in Cartesian space, where $H' = S \cdot \cos(H)$, $S' = S \cdot \sin(H)$ and $V' = V$. The Sobel operator is applied to each of the $H'$, $S'$ and $V'$ channels of the color image in Cartesian space. We use two vectors $a(H'_x, S'_x, V'_x)$ and $b(H'_y, S'_y, V'_y)$ to denote the gradients along the x and y directions, where $H'_x$ denotes the gradient in the $H'$ channel along the horizontal direction, and so on. Their norms and dot product are defined as:

$$|a| = \sqrt{(H'_x)^2 + (S'_x)^2 + (V'_x)^2}$$

$$|b| = \sqrt{(H'_y)^2 + (S'_y)^2 + (V'_y)^2}$$

$$a \cdot b = H'_x H'_y + S'_x S'_y + V'_x V'_y$$

The angle between $a$ and $b$ is:

$$\theta = \arccos\widehat{(a, b)} = \arccos\left(\frac{a \cdot b}{|a| \cdot |b|}\right)$$

After we calculate the edge direction of each pixel, we quantize the edge orientation into $m$ bins, with $m \in \{6, 12, 18, 24, 30, 36\}$. We use $\theta(x, y)$ to represent the edge orientation map, with $\theta(x, y) = \varphi$, $\varphi \in \{0, 1, 2, \ldots, m - 1\}$.
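The orientation computation can be sketched in plain Python as below. It follows the steps in the text (cylindrical-to-Cartesian transform, per-channel Sobel gradients, angle between the two gradient vectors, quantization into $m$ bins). Several details are assumptions of this sketch rather than statements from the paper: borders are handled by clamping, the angle range $[0, \pi]$ is split into $m$ uniform bins, and pixels where either gradient vector vanishes are left in bin 0.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def to_cartesian(h_deg, s, v):
    """(H, S, V) -> (H', S', V') = (S cos H, S sin H, V)."""
    rad = math.radians(h_deg)
    return (s * math.cos(rad), s * math.sin(rad), v)


def sobel(channel, y, x, kernel):
    """3x3 Sobel response at (y, x); borders are clamped (an assumption)."""
    rows, cols = len(channel), len(channel[0])
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yy = min(max(y + dy, 0), rows - 1)
            xx = min(max(x + dx, 0), cols - 1)
            total += kernel[dy + 1][dx + 1] * channel[yy][xx]
    return total


def orientation_map(hsv_image, m=6):
    """Quantized angle between the x- and y-gradient vectors a and b.

    hsv_image is a list of rows of (H in degrees, S, V) tuples.
    """
    rows, cols = len(hsv_image), len(hsv_image[0])
    cart = [[to_cartesian(*px) for px in row] for row in hsv_image]
    channels = [[[cart[y][x][c] for x in range(cols)] for y in range(rows)]
                for c in range(3)]
    theta = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            a = [sobel(ch, y, x, SOBEL_X) for ch in channels]
            b = [sobel(ch, y, x, SOBEL_Y) for ch in channels]
            norm = (math.sqrt(sum(p * p for p in a))
                    * math.sqrt(sum(q * q for q in b)))
            if norm == 0.0:
                continue  # angle undefined; left in bin 0 (assumption)
            cos_ab = sum(p * q for p, q in zip(a, b)) / norm
            angle = math.acos(min(1.0, max(-1.0, cos_ab)))  # in [0, pi]
            theta[y][x] = min(int(angle / math.pi * m), m - 1)
    return theta
```

The clamp on `cos_ab` guards against rounding pushing the cosine slightly outside $[-1, 1]$, which would otherwise make `math.acos` raise an error.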

Definition and Extraction of Local Structure
The human visual system is sensitive to color and direction. Direction is a strong cue to the content of an image, and an obvious direction usually indicates a definite pattern. However, many natural scenes show no strong orientation, clear structure or specific pattern. Although natural images show different contents, they contain common elements. Different combinations and spatial distributions of these fundamental elements lead to different local structures or patterns.
To find local structures with similar attributes such as edge direction and color, we use the edge orientation map $\theta(x, y)$ to detect local structure, because edge direction is not sensitive to color and illumination variation, and it is robust to translation, scaling and small rotation. Let $\theta(x, y)$ be an edge direction map of size $M \times N$. We shift each of the nine local structure templates shown in Figure 1 from top to bottom and left to right throughout the orientation map to detect the fundamental local structures. To obtain a single local structure map of the whole image, we proceed as follows. (1) Beginning from the point (0, 0), we shift the $2 \times 2$ local structure template (a) from top to bottom and left to right throughout the edge direction map $\theta(x, y)$, with a step length of two pixels along both the vertical and horizontal directions. If the values of $\theta(x, y)$ in the corresponding structure template are equal, the values are saved; otherwise, the values are set to zero. This yields a local structure map $C_1(x, y)$. The maps $C_2(x, y)$ to $C_9(x, y)$ are extracted similarly with the remaining templates, and the nine maps are fused to form the final local structure map $C(x, y)$. The extraction and fusion process is illustrated in Figure 2.
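The template scan can be sketched as follows. The nine templates themselves are defined in Figure 1 and are not reproduced in the text, so `PLACEHOLDER_TEMPLATES` below is purely hypothetical: each template is modeled as the set of cells of a $2 \times 2$ block that must share one orientation value. Two further assumptions are made: a sentinel `EMPTY = -1` is used instead of zero so that discarded cells cannot be confused with orientation bin 0, and fusion keeps a cell if any of the maps kept it.

```python
EMPTY = -1  # sentinel for discarded cells; the text sets them to zero

# Hypothetical templates: (dy, dx) cells of a 2x2 block that must agree.
PLACEHOLDER_TEMPLATES = {
    "full_block": ((0, 0), (0, 1), (1, 0), (1, 1)),
    "top_row": ((0, 0), (0, 1)),
    "left_column": ((0, 0), (1, 0)),
    "main_diagonal": ((0, 0), (1, 1)),
}


def detect_structure(theta, template):
    """Slide one 2x2 template over theta with a step of two pixels."""
    rows, cols = len(theta), len(theta[0])
    out = [[EMPTY] * cols for _ in range(rows)]
    for y in range(0, rows - 1, 2):
        for x in range(0, cols - 1, 2):
            values = {theta[y + dy][x + dx] for dy, dx in template}
            if len(values) == 1:  # all template cells share one orientation
                for dy, dx in template:
                    out[y + dy][x + dx] = theta[y + dy][x + dx]
    return out


def fuse(maps):
    """Fuse structure maps: a cell is kept if any map kept it (assumption)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = [[EMPTY] * cols for _ in range(rows)]
    for single in maps:
        for y in range(rows):
            for x in range(cols):
                if single[y][x] != EMPTY:
                    fused[y][x] = single[y][x]
    return fused
```

Because every kept value is copied from the same $\theta(x, y)$, two maps can never disagree on the value of a cell, so the order in which the maps are fused does not matter in this sketch.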

Feature Extraction
After the local structure map is extracted, the next step is to find a way to describe its features. A critical requirement is that the description should simulate the visual system to some extent. The cortical visual system hypothesis is a framework for understanding cortical visual systems and visuomotor systems [4]. The framework is often described in terms of the "what/where" and "what/how" pathways. It depicts two information processing streams that originate in the occipital cortex.
The "what" and "where" pathways are responsible for color and shape, and for location and motion, respectively. The "how" pathway can be regarded as a generalization of the "where" pathway. This analysis gives an insight into the relationship between image feature representation and the visual attention mechanism.
To represent the content of an image by imitating the visual attention mechanism to a certain extent, we consider the pixels with similar color in the local structure images as the stimuli in the visual field. We carry out the LSH method in accordance with the following rules: (1) what the local structure is; (2) where the local structure is; (3) how the local structure correlates with others. Rules (1) and (2) concern the semantics of the image based on perceptive space, and rule (3) concerns the spatial representation of the local structures in the human brain.
After the local structure map $C(x, y)$ is extracted from the edge direction map $\theta(x, y)$, it is used as a mask to obtain the underlying colors from the quantized image $I(x, y)$. We obtain the local structure image by retaining only the colors inside the local structures; the colors outside them are set to empty. In this subsection, we use $f(x, y)$ to denote the local structure image, whose values are $f(x, y) = i$, $i \in \{0, 1, 2, \ldots, 71\}$. We use the local structure histogram (LSH) to extract the feature vector of the image in HSV color space. To extract the color, texture and shape features of $f(x, y)$ simultaneously, the LSH is calculated on every bin in HSV color space. The LSH is extracted in two steps: (1) When $f(x, y) = i$, $i \in \{0, 1, 2, \ldots, 71\}$, count the number of LSDs on the nine LSD maps. In particular, when the local structure descriptor that denotes no direction is counted, the other eight local structure descriptors are counted again, because no direction means that every direction is possible. (2) Calculate the LSH based on the number of LSDs.
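One possible reading of the two-step extraction is sketched below. It produces a $72 \times 9 = 648$-bin histogram, with one count per quantized color and structure map. The text does not fully specify the counting unit or the bin layout, so the following are assumptions of this sketch: retained pixels (rather than whole structures) are counted, the bin layout is `color * 9 + k`, the "no direction" template is taken to be map index 0, and `EMPTY = -1` marks cells outside any structure.

```python
N_COLORS, N_STRUCTURES = 72, 9
EMPTY = -1  # marks cells outside any local structure (assumption)


def mask_colors(structure_map, quantized):
    """f(x, y): keep the quantized color only where a structure was kept."""
    return [[q if c != EMPTY else EMPTY for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(structure_map, quantized)]


def local_structure_histogram(structure_maps, quantized, no_direction=0):
    """Return 648 counts, indexed as hist[color * 9 + k] for map k."""
    hist = [0] * (N_COLORS * N_STRUCTURES)
    for k, c_map in enumerate(structure_maps):
        for row in mask_colors(c_map, quantized):
            for color in row:
                if color == EMPTY:
                    continue
                hist[color * N_STRUCTURES + k] += 1
                if k == no_direction:
                    # "No direction means that every direction is possible":
                    # credit the other eight structures as well.
                    for other in range(N_STRUCTURES):
                        if other != no_direction:
                            hist[color * N_STRUCTURES + other] += 1
    return hist
```

Whether the histogram is normalized before matching is not stated in the text, so the sketch returns raw counts.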
The full color image is composed of local structure images. Local structure maps fit an attribute of the human visual system, because local structure images select only a few points of attention and suppress irrelevant material. Hence, the LSH not only depicts "what" and "how" colors and directions are used, but also designates "where" and "how" the color and direction components are distributed within a certain range of the visual scene. The different combinations and spatial distributions of the local structures can therefore be described by the LSD, which captures strongly discriminative features such as texture, color, shape and color layout.

Similarity Measurement
For each image in a database, an $M$-bin feature vector $T = [T_1, T_2, \ldots, T_M]$ of the LSH is computed according to the method described above and stored in the database. Let $Q = [Q_1, Q_2, \ldots, Q_M]$ be the $M$-bin feature vector of a query image. The similarity measurement between the query image and any image in the database is defined as follows:

$$D(T, Q) = \sum_{i=1}^{M} |T_i - Q_i|$$

The above formula is the $L_1$ distance, which accumulates the absolute differences between the two vectors without any square or square root operation. Hence, its time complexity is linear in $M$, and it is very suitable for large-scale image search.
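The $L_1$ matching step is simple enough to sketch directly. The helper names `l1_distance` and `rank` and the dictionary layout of the database are illustrative assumptions; the default of 12 returned images mirrors the number of retrieved images used in the experiments.

```python
def l1_distance(t, q):
    """Sum of absolute differences between two equal-length feature vectors."""
    if len(t) != len(q):
        raise ValueError("feature vectors must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(t, q))


def rank(query, database, top_n=12):
    """Return the names of the top_n database images closest to the query.

    database maps an image name to its stored feature vector T.
    """
    ordered = sorted(database, key=lambda name: l1_distance(database[name], query))
    return ordered[:top_n]
```

For example, `rank([1, 2, 3], {"a": [1, 2, 3], "b": [9, 9, 9]}, top_n=1)` returns `["a"]`, since the distance to `"a"` is zero.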

Image Database
Two Corel image databases, which are widely used for CBIR, are used to test image retrieval performance in this paper. The first is the Corel-1000 dataset, which includes 10 categories. It contains 1000 natural images with diverse semantics: scenes, horses, elephants, humans, buses, flowers, buildings, mountains, foods and dinosaurs. The second is the Corel-10000 dataset, which includes 100 categories. It contains 10,000 natural images with diverse semantics such as beaches, buses, fishes and sunsets. The experimental images thus cover different topics.
In the following experiments, 10 images from each category of the Corel-1000 dataset are randomly selected and used as query images. The precision and recall percentages are calculated for each category, and the average precision-recall pair is then computed over the 10 random query images. In the Corel-10000 dataset, we randomly choose 20 of the 100 categories. Then, 10 images are randomly selected from each chosen category and used as query images, and the average precision and recall are computed.

Performance Measurements
We adopt Precision and Recall to evaluate the performance of the proposed method; these two metrics are the most commonly used for evaluating image retrieval performance. Precision and Recall are defined as follows:

$$\text{Precision} = \frac{I_N}{N}$$

$$\text{Recall} = \frac{I_N}{M}$$

where $I_N$ is the number of similar images retrieved, $M$ is the total number of similar images in the image database, and $N$ is the total number of images retrieved. In the following experiments, to obtain the precision and recall values in Tables 1-3, $N$ and $M$ are set to 12 and 100, respectively, for both datasets.
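As a small worked illustration of these two definitions, the sketch below computes both values for one query. The function name and the use of image identifiers are assumptions made for the example; the numbers in the usage note are illustrative and are not results from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (I_N / N, I_N / M) for one query.

    retrieved: the N image ids returned by the system.
    relevant:  the M image ids in the database that are similar to the query.
    """
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)  # I_N
    return hits / len(retrieved), hits / len(relevant)
```

With $N = 12$ and $M = 100$ as in the experiments, a query that returns 9 similar images among its 12 results has a precision of $9/12 = 0.75$ and a recall of $9/100 = 0.09$.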

Retrieval Results
In the following experiments, we adopt different numbers of quantization levels for texture and color to evaluate the retrieval performance in RGB and HSV color space. The RGB color space is quantized to 16, 32, 64 and 128 bins; the HSV color space is quantized to 72, 108, 128 and 192 bins; and the texture orientation is quantized to 6, 12, 18, 24, 30 and 36 bins. As can be seen from the results in Tables 1 and 2, the proposed LSD performs better in HSV color space than in RGB color space.
To test the proposed direction detection operator, we use several classical edge detectors to detect gradient magnitude and direction, and the experimental results are listed in Table 3. Note that the proposed operator works on the full color image, while the other four operators work on gray-level images. It can be seen from the retrieval results in Table 3 that the proposed direction operator achieves better results, because it uses the color information that the other detectors ignore in direction detection.
We then validate the performance of the proposed distance metric against other popular distance metrics and similarity measurements. It can be seen from the retrieval results in Table 4 that our distance metric achieves better retrieval results than the other distance metrics or similarity measurements. Comparing the results of the $L_1$ and $L_2$ distances, we note that they have comparable performance, with almost exactly the same precision and recall. However, the $L_1$ distance is very simple to calculate and requires only linear time, whereas the $L_2$ distance requires square or square root operations, which are costly for vector operations. Hence, the $L_1$ distance saves much computational cost and is very suitable for large-scale image databases.
We compare the LSD representation with state-of-the-art representations such as BoW [20], VLAD [22] and the compressed Fisher kernel [23] on the Corel-1000 and Corel-10000 datasets. These representations are all parameterized by a parameter k, which represents the number of centroids for BoW and VLAD, and the number of mixture components in the Fisher kernel representation. We set k = 8000 for BoW and k = 64 for VLAD and the compressed Fisher kernel. We use the same feature detection and description procedure as in [28] (Hessian-Affine extractor and SIFT descriptor), since the code and the features are available online. We reduce the dimensionality of these features from D = 8000 to D = 648 through PCA (Principal Component Analysis). We use the Flickr60k dataset to learn the PCA and the GMM (Gaussian Mixture Model) vocabularies. The Fisher vectors used in our experiments were computed by taking the gradient with respect to the mean parameters only. The average retrieval precision and recall curves of the above local descriptors are plotted in Figure 4. As Figure 4 shows, LSD obtains better retrieval results than BoW [20], VLAD and the compressed Fisher kernel [22,23]. Though BoW is a good local feature representation, it ignores spatial information, and it achieves the lowest retrieval results among the above descriptors. The Fisher kernel and VLAD combine the strengths of generative and discriminative approaches; their feature representations are more discriminative and achieve better results than BoW. LSD imitates the human visual mechanism to some extent, and it combines color, texture, shape and color layout as a whole. Moreover, LSD and its feature vector can be extracted automatically without any image segmentation, training or learning, they are simple to implement, and the dimensionality of the feature vector is low. Therefore, LSD is very suitable for large-scale image retrieval. The descriptor outperforms the previous best reported accuracy on standard benchmarks.
Figures 5 and 6 show retrieval examples on the Corel-1000 and Corel-10000 datasets. In Figure 5, the query image is a bus image with a clear shape feature and similar color. All of the top 12 retrieved images belong to the bus category, which shows a good match of shape and color to the query image. In Figure 6, the query image is a horse image, and all of the top retrieved images belong to the horse category, which shows a good match of texture, shape and color to the query image.

Conclusions
A novel local structure descriptor for color image retrieval is proposed in this paper. LSD can effectively describe the shape, texture, color and color layout of each image without any image segmentation, learning or training. Hence, its implementation is easy and its time complexity is linear. LSD can be regarded as a generalized visual attribute descriptor because it imitates human visual mechanisms to some extent. The dimensionality of the LSD feature vector is only 648, which is very suitable for large-scale image search. Our experimental results on two Corel image databases show that the proposed descriptor has strong discriminative capability for color, texture and shape features and outperforms the BoW, VLAD and compressed Fisher kernel descriptors significantly.

Figure 2. Local structure map extraction and fusion. (a) shows the extracted local structure map $C_1(x, y)$. Maps $C_2(x, y)$, $C_3(x, y)$, $C_4(x, y)$, $C_5(x, y)$, $C_6(x, y)$, $C_7(x, y)$, $C_8(x, y)$ and $C_9(x, y)$ are extracted similarly in (b)-(i). (j) shows the fusion of the nine maps to form the final local structure map $C(x, y)$.

Figure 3. LSH extraction algorithm. (a)-(e) show the process of extracting the nine-bin LSH. The number under each local structure is the count of that local structure in the corresponding bin.

Figure 5. Retrieval result for a bus.

Table 1. The average retrieval performance of the local structure descriptor (LSD) under diverse color and direction quantization levels on Corel-1000 in RGB color space.

Table 2. The average retrieval performance of the LSD under diverse color and direction quantization levels on Corel-1000 in HSV color space.

Table 3. The retrieval results of LSD with different gradient operators for orientation detection.

Table 4. The average results of LSD with different distance metrics.