Color Texture Image Retrieval Based on Local Extrema Features and Riemannian Distance

Abstract: A novel, efficient method for content-based image retrieval (CBIR) is developed in this paper using both texture and color features. Our motivation is to represent and characterize an input image by a set of local descriptors extracted from characteristic points (i.e., keypoints) within the image. The dissimilarity measure between images is then calculated based on the geometric distance between the topological feature spaces (i.e., manifolds) formed by the sets of local descriptors generated from each image of the database. In this work, we propose to extract and use the local extrema pixels as our feature points. The so-called local extrema-based descriptor (LED) is then generated for each keypoint by integrating the color, spatial and gradient information captured by its nearest local extrema. Hence, each image is encoded by an LED feature point cloud, and Riemannian distances between these point clouds enable us to tackle CBIR. Experiments performed on several color texture databases, including Vistex, STex, Colored Brodatz, USPtex and Outex TC-00013, show that the proposed approach provides very efficient and competitive results compared to state-of-the-art methods.


Introduction
Texture and color, together with other features such as shape, edge, surface, etc., play significant roles in content-based image retrieval (CBIR) systems thanks to their capacity to provide important parameters and characteristics not only for human vision but also for automatic computer-based visual recognition. Texture has appeared in most CBIR frameworks, while color has been increasingly exploited to improve retrieval performance, in particular for natural images.
Some recent comprehensive surveys on the CBIR field can be found in [1][2][3]. Among state-of-the-art propositions, a great number of multiscale texture representation and analysis methods using a probabilistic approach have been developed within the past two decades. In [4], Do and Vetterli proposed modeling the spatial dependence of pyramidal discrete wavelet transform (DWT) coefficients using generalized Gaussian distributions (GGD), and the dissimilarity measure between images was derived from the Kullback-Leibler divergences (KLD) between GGD models. Following a similar principle, multiscale coefficients yielded by the discrete cosine transform (DCT), the dual-tree complex wavelet transform (DT-CWT) or its rotated version (DT-RCWT), the Gabor wavelet (GW), etc. were modeled by different statistical models such as GGD, multivariate Gaussian mixture models (MGMM), the Gaussian copula (GC), the Student-t copula (StC), or other distributions like Gamma, Rayleigh, Weibull, Laplace, etc. to perform texture-based image retrieval [5][6][7][8][9][10]. Then, by taking into account color information within these probabilistic approaches, several studies have provided significant improvements for color image retrieval [11,12]. However, one of the main drawbacks of these techniques is their expensive computational time, which has been observed and discussed in several papers [7,9,12].
The second family of methods that has drawn the attention of many researchers and has provided quite effective CBIR performance is the local pattern-based framework. The local binary patterns (LBP) descriptor, which compares neighboring pixels to the center pixel and assigns each of them a 0 or a 1 to form a binary number, was first embedded in a multiresolution and rotation-invariant scheme for texture classification in [13]. Inspired by this work, many propositions have been developed for texture retrieval and classification, such as the local derivative patterns (LDP) [14], the local maximum edge binary patterns (LMEBP) [15], the local ternary patterns (LTP) [16], the local tetra patterns (LTrP) [17], the local tri-directional patterns (LTriDP) [18], etc. These descriptors, particularly the LTrP, provide quite efficient texture retrieval performance. However, because they are applied to gray-scale images, their performance on natural images is limited without exploiting color information. To overcome this issue, several recent strategies have been proposed to combine these local patterns with color features. Some techniques that can be cited here are the joint histogram of color and local extrema patterns (LEP+colorhist) [19], the local oppugnant color texture pattern (LOCTP) [20] and the local extrema co-occurrence pattern (LECoP) [21].
Another family of CBIR systems that has offered very competitive results within the past few years relies on an image compression approach called block truncation coding (BTC). The first BTC-based retrieval scheme for color images was proposed in [22], followed by some improvements a few years later [23,24]. More recently, BTC-based retrieval frameworks have continued to evolve, with the ordered-dither BTC (ODBTC) [25,26], the error diffusion BTC (EDBTC) [27] and the dot-diffused BTC (DDBTC) [28]. Within these approaches, an image is divided into multiple non-overlapping blocks, and one of the BTC-based systems compresses each block into the so-called color quantizer and bitmap image. Then, to characterize the image content, a feature descriptor is constructed using the color histogram feature (CHF) or the color co-occurrence feature (CCF) combined with the bit pattern feature (BPF), which describes edge, shape and texture information of the image. These features are extracted from the above color quantizer and bitmap image to tackle CBIR.
In this work, we would like to develop a powerful color texture image retrieval strategy based on the capacity of characteristic points to capture significant information from the image content. Because natural images usually involve a variety of local textures and structures that do not appear homogeneous within the entire image, an approach taking into account local features could become relevant. This may be the reason why most local feature-based CBIR schemes (e.g., LTrP [17], LOCTP [20], LECoP [21]) or BTC-based approaches [25][26][27][28] (which, in fact, sub-divide each query image into multiple blocks) have achieved better retrieval performance than probabilistic methods, which model the entire image using different statistical distributions [4][5][6][7][8][9],[11,12,29]. Their performance is reported later for comparison within the experimental study. Taking the above remark into consideration, our motivation here is to represent and characterize an input image by a set of local descriptors generated only for characteristic points detected from the image. Recent studies [30][31][32][33][34][35] addressing texture-based image segmentation and classification using very high resolution (VHR) remote sensing images have shown the capacity of the local extrema (i.e., local maximum and local minimum pixels in terms of intensity) to capture and represent textural features from an image. In this work, the idea of using the local extrema pixels is continued and improved to tackle CBIR. By embedding both color and geometric information, as well as gradient features, captured by these feature points, we propose the local extrema-based descriptor (LED) for texture and color description. As a result, an input image can be encoded by a set of LEDs, which is considered as an LED feature point cloud. Then, we propose to exploit a geometric-based distance measure, i.e., the Riemannian distance (RD) [36] between the feature covariance matrices of these point clouds, for dissimilarity measurement. This distance takes into account the topological structure of each point cloud within the LED feature space. Therefore, it becomes an effective measure of dissimilarity in our CBIR scheme.
The remainder of this paper is organized as follows. Section 2 presents the pointwise approach for local texture representation and description using local extrema features and the construction of LED descriptors. The proposed CBIR framework is described in detail in Section 3. Section 4 carries out our experimental study and provides comparative results of the proposed algorithm against existing methods. Conclusions and some perspectives of our work are discussed in Section 5.

Approach
The idea of using the local maximum and local minimum pixels for texture representation and characterization was introduced in [30][31][32] in the scope of VHR optical satellite images. From this point of view, an image texture is formed by a certain spatial arrangement of pixels holding some variations of intensity. Hence, different textures are reflected by different types of spatial arrangements and intensity variations of pixels. These meaningful properties can be approximated by the local maximum and local minimum pixels extracted from the image. These local extrema have been shown to capture the important geometric and radiometric information of the image content, and hence are relevant for texture analysis and description. In this work, we exploit and improve the capacity of local extrema pixels to characterize both texture and color features within the CBIR context. Let us first recall their definition and extraction.
A pixel in a grayscale image is considered a local maximum (resp. local minimum) if it holds the highest (resp. lowest) intensity value within a neighborhood window centered at it. Let $S^{\max}_{\omega}(I)$ and $S^{\min}_{\omega}(I)$ denote the local maximum and local minimum sets extracted from a grayscale image $I$ using an $\omega \times \omega$ search window. Let $p = (x_p, y_p)$ be a pixel located at position $(x_p, y_p)$ on the image plane with intensity value $I(p)$. We have:

$$p \in S^{\max}_{\omega}(I) \Leftrightarrow I(p) = \max_{q \in N_{\omega \times \omega}(p)} I(q), \tag{1}$$

$$p \in S^{\min}_{\omega}(I) \Leftrightarrow I(p) = \min_{q \in N_{\omega \times \omega}(p)} I(q), \tag{2}$$

where $N_{\omega \times \omega}(p)$ represents the set of pixels inside the $\omega \times \omega$ neighborhood window of $p$.
We note that there are many ways to extend the above definition to color images (e.g., detecting on the grayscale version, using the union or intersection of the subsets detected on each color channel, etc.). In this work, the local extrema pixels are extracted from the grayscale version of the color image since this produced superior performance within most of our experiments.
Figure 1 illustrates the capacity of the local max and min pixels to represent, characterize and discriminate different textures extracted from the Vistex image database [37]. Each image patch consists of 100 × 100 pixels. For each one, we display a 3D surface model using the image intensity as the surface height. The local max pixels (in red) and local min pixels (in green) are extracted following Equations (1) and (2) using a 5 × 5 search window. Some green points may be hidden since they are obscured by the surface. The figures show how these local extrema appear within each texture. Their spatial information (relative distance, direction, density, etc.) and intensities can be encoded to describe and discriminate each texture, which constitutes our motivation in this paper.
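The extraction in Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `local_extrema` is hypothetical, and the edge-replicating border handling is an assumption, since the text does not say how image borders are treated. Note that, following the definition literally, every pixel of a perfectly flat region is both a local maximum and a local minimum.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_extrema(gray, omega=3):
    """Return boolean masks (is_max, is_min) for an omega x omega search window.

    A pixel is flagged as a local maximum (resp. minimum) when its intensity
    equals the highest (resp. lowest) value inside the window centred on it.
    """
    gray = np.asarray(gray, dtype=float)
    r = omega // 2
    # Assumption: borders are handled by replicating edge pixels.
    padded = np.pad(gray, r, mode="edge")
    # windows[i, j] is the omega x omega neighbourhood centred on pixel (i, j).
    windows = sliding_window_view(padded, (omega, omega))
    is_max = gray == windows.max(axis=(-2, -1))
    is_min = gray == windows.min(axis=(-2, -1))
    return is_max, is_min
```

Calling `np.argwhere(is_max)` on the returned mask yields the (row, column) coordinates of the detected local maxima.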

Generation of Local Extrema-Based Descriptor (LED)
As previously discussed, our work not only considers the local extrema pixels as characteristic points to represent the image but also exploits them to construct texture and color descriptors. The local extrema-based descriptor (LED) is generated for each keypoint by integrating the color, spatial and gradient features captured by its nearest local maxima and local minima on the image plane. Two strategies can be considered for the search for the nearest local extrema around each keypoint:
1. fix the number (N) of nearest local maxima and nearest local minima for each one;
2. or, fix a window size W × W around each keypoint; then, all local maxima and minima inside that window are considered.
The first strategy takes into account the local properties around keypoints since the implicit neighborhood size considered for each one varies depending on the density of local extrema around it (i.e., the search process goes further to look for enough extrema when their density is sparse). On the other hand, by fixing the window size, the second approach considers equivalent contributions of neighboring environments for all keypoints and hence deals better with outlier points. Moreover, its implementation is less costly than the first one, especially for large images (i.e., the search for N points is more costly within a large image). Nevertheless, our experiments show that both strategies provide similar performance for the studied data sets. In this paper, we describe the approach considering a fixed window size for all keypoints, but it is worth noting that a similar principle can be applied to build the proposed descriptor using the first approach.
The generation of LED descriptors is described as follows. Let $N^{\max}_{W}(p)$ and $N^{\min}_{W}(p)$ be the two sets of local maximum and local minimum pixels inside the W × W window around the keypoint p under study; the following features involving color, spatial and gradient information are extracted for each set. For clarity, Figure 2 displays the geometric and gradient features derived from each local max or local min within these sets. Below are the features extracted from $N^{\max}_{W}(p)$; the feature generation for $N^{\min}_{W}(p)$ is similar.
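The two neighbor-selection strategies described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, extrema are given as an array of (x, y) coordinates, the window is taken as all points within W/2 of the keypoint along each axis, and the N-nearest search is a brute-force sort rather than an optimized spatial index.

```python
import numpy as np


def n_nearest_extrema(extrema_xy, p, N):
    """Strategy 1: the N extrema closest to keypoint p (Euclidean distance)."""
    extrema_xy = np.asarray(extrema_xy)
    dx, dy = (extrema_xy - np.asarray(p)).T
    dist = np.hypot(dx, dy)
    # Brute-force search: cost grows with the total number of extrema.
    return extrema_xy[np.argsort(dist, kind="stable")[:N]]


def extrema_in_window(extrema_xy, p, W):
    """Strategy 2: all extrema inside the W x W window centred on keypoint p."""
    extrema_xy = np.asarray(extrema_xy)
    # Assumption: a point is inside when it lies within W/2 of p on both axes.
    inside = np.all(np.abs(extrema_xy - np.asarray(p)) <= W / 2.0, axis=1)
    return extrema_xy[inside]
```

With the first function the effective neighborhood radius adapts to the local density of extrema, while the second returns a variable number of points from a fixed area, which mirrors the trade-off discussed above.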

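1. Mean and variance of the three color channels:

$$\mu^{\max}_{c}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} I_c(q), \tag{3}$$

$$\sigma^{2,\max}_{c}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} \left( I_c(q) - \mu^{\max}_{c}(p) \right)^2, \tag{4}$$

where $c \in \{red, green, blue\}$ represents the three color components and $|N|$ is the cardinality of the set $N$.

2. Mean and variance of the spatial distances from each local maximum to point p:

$$\mu^{\max}_{d}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} d(p, q), \tag{5}$$

$$\sigma^{2,\max}_{d}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} \left( d(p, q) - \mu^{\max}_{d}(p) \right)^2, \tag{6}$$

where $d(p, q) = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2}$ is the spatial distance between two points p and q on the image plane. We remind readers that $p = (x_p, y_p)$ and $q = (x_q, y_q)$.

3. Circular variance [38] of the angles of the geometric vectors formed by each local maximum and point p:

$$\sigma^{2,\max}_{circ,\alpha}(p) = 1 - \sqrt{\bar{c}_{\alpha}(p)^2 + \bar{s}_{\alpha}(p)^2}, \tag{7}$$

where $\bar{c}_{\alpha}(p)$ and $\bar{s}_{\alpha}(p)$ are the averages of $\cos \alpha(p, q)$ and $\sin \alpha(p, q)$ over $q \in N^{\max}_{W}(p)$, and $\alpha(p, q)$ is the angle of the vector from p to q.

4. Mean and variance of the gradient magnitudes:

$$\mu^{\max}_{\nabla}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} \nabla I(q), \tag{8}$$

$$\sigma^{2,\max}_{\nabla}(p) = \frac{1}{|N^{\max}_{W}(p)|} \sum_{q \in N^{\max}_{W}(p)} \left( \nabla I(q) - \mu^{\max}_{\nabla}(p) \right)^2, \tag{9}$$

where $\nabla I$ is the gradient magnitude image obtained by applying the Sobel filter to the gray-scale version of the image.

5. Circular variance [38] of the gradient orientations:

$$\sigma^{2,\max}_{circ,\theta}(p) = 1 - \sqrt{\bar{c}_{\theta}(p)^2 + \bar{s}_{\theta}(p)^2}, \tag{10}$$

where $\bar{c}_{\theta}(p)$ and $\bar{s}_{\theta}(p)$ are the averages of $\cos \theta(q)$ and $\sin \theta(q)$ over $q \in N^{\max}_{W}(p)$, and $\theta$ is the gradient orientation image obtained together with $\nabla I$ by Sobel filtering.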
All of these features are then integrated into the feature vector $\delta^{\max}(p)$ (Equation (11)), which encodes the local max set around p. The generation of $\delta^{\min}(p)$ is similar. Let $\delta^{LED}(p)$ denote the LED feature vector extracted for p (Equation (12)). The proposed descriptor $\delta^{LED}(p)$ enables us to characterize the local environment around p by describing how local maxima and local minima are distributed and arranged, and also how they capture structural properties (given by gradient features) as well as color information. We note that LED descriptors are invariant to rotation. As observed from the feature computation process, for the two directional features (the angles α and the gradient orientations θ), only the circular variance [38] is taken into account; the circular mean is excluded to ensure the rotation-invariant property.
The feature dimensionality of $\delta^{LED}$ in Equation (12) is equal to 27. This descriptor can readily be improved by modifying or adding other features in the vector. One improvement that can be considered here is to involve other filtering approaches for gradient computation. For example, besides the gradient images $(\nabla I, \theta)$, it is interesting to add $(\nabla I_{\sigma}, \theta_{\sigma})$, which are generated by smoothing the image with a lowpass Gaussian filter ($I_{\sigma} = G_{\sigma} * I$) before applying the Sobel operator. Then, similar gradient features are extracted as in Equations (8)-(10). By inserting these features (computed for the two extrema sets) into $\delta^{LED}(p)$, an enhanced descriptor of dimension 33 is built. Both the 27D and 33D versions of LED descriptors are evaluated later in the paper.
Figure 2. Geometric and gradient information from a local maximum (resp. local minimum) pixel $q = (x_q, y_q)$ within $N^{\max}_{W}(p)$ (resp. $N^{\min}_{W}(p)$) considered for the calculation of the LED descriptor at the studied keypoint $p = (x_p, y_p)$. Here, $d(p, q)$ is the distance between p and q; $\alpha(p, q)$ is the angle of the vector from p to q. We have $d(p, q) = d(q, p)$ but $\alpha(p, q) \neq \alpha(q, p)$. Then, $\nabla I(q)$, $\theta(q)$ are the gradient magnitude and gradient orientation at q.
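The five feature groups listed above can be sketched for a single extrema set as follows. This is a hedged illustration, not the authors' code: the function names, the (x, y) coordinate convention, the use of `arctan2` for the angle of the vector from p to q, and the population variance (division by the set size) are all assumptions. The result is a 12-dimensional vector for one set (6 color, 2 distance, 1 angular, 2 gradient-magnitude and 1 gradient-orientation features).

```python
import numpy as np


def circular_variance(angles):
    """One minus the mean resultant length of a set of angles (radians)."""
    angles = np.asarray(angles, dtype=float)
    return 1.0 - np.hypot(np.cos(angles).mean(), np.sin(angles).mean())


def extrema_set_features(p, pts, color, grad_mag, grad_ori):
    """Features of one extrema set (max or min) around keypoint p = (x, y).

    pts      : (n, 2) integer array of (x, y) extrema positions in the window
    color    : (H, W, 3) color image
    grad_mag : (H, W) gradient magnitude image
    grad_ori : (H, W) gradient orientation image, in radians
    """
    pts = np.asarray(pts)
    xs, ys = pts[:, 0], pts[:, 1]
    rgb = color[ys, xs].astype(float)          # (n, 3) color samples
    dx, dy = xs - p[0], ys - p[1]
    dist = np.hypot(dx, dy)                    # spatial distances d(p, q)
    alpha = np.arctan2(dy, dx)                 # assumed angle convention
    mag = grad_mag[ys, xs]
    theta = grad_ori[ys, xs]
    return np.concatenate([
        rgb.mean(axis=0), rgb.var(axis=0),     # color mean and variance
        [dist.mean(), dist.var()],             # distance mean and variance
        [circular_variance(alpha)],            # circular variance of angles
        [mag.mean(), mag.var()],               # gradient magnitude statistics
        [circular_variance(theta)],            # circular variance of orientations
    ])
```

Because only the circular variance of the directional quantities enters the vector, adding a constant offset to all angles leaves the features unchanged, which is the rotation-invariance argument made in the text.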
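The gradient images $(\nabla I, \theta)$ used by the descriptor come from Sobel filtering. A minimal NumPy sketch is given below; the function name, the edge-replicating padding and the orientation convention (`arctan2` of the vertical over the horizontal response, with the y axis pointing down the rows) are assumptions. For the smoothed pair $(\nabla I_{\sigma}, \theta_{\sigma})$, the same function would simply be applied to a Gaussian-smoothed copy of the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sobel_gradients(gray):
    """Return (magnitude, orientation) images from 3 x 3 Sobel responses."""
    gray = np.asarray(gray, dtype=float)
    # Assumption: borders are handled by replicating edge pixels.
    padded = np.pad(gray, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
    gx = (windows * kx).sum(axis=(-2, -1))      # horizontal derivative
    gy = (windows * kx.T).sum(axis=(-2, -1))    # vertical derivative
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```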

Proposed Framework for Texture and Color Image Retrieval
The proposed retrieval algorithm consists of two primary stages: the extraction of LED texture and color descriptors to characterize each query image from the image database, and the computation of the distance measure for the retrieval process. Each of them is now described in detail. Then, the complete framework is presented.

LED Feature Extraction
The previous section has shown the capacity of the local extrema pixels to represent and characterize both local texture and color information from the image content. In this section, each query image is encoded by a set of LED descriptors generated at characteristic points (i.e., keypoints) extracted from the image. In this work, we propose using the local maxima as our keypoints. It should be noted that other keypoint types can be exploited, such as Harris keypoints, Scale Invariant Feature Transform (SIFT) [39], Speeded Up Robust Features (SURF) [40], etc. (see a survey in [41]). However, we choose local maxima as keypoints since they can be extracted within any kind of texture (i.e., wherever intensity varies). In contrast, keypoints such as Harris, SIFT or SURF normally focus on corners, edges and salient features, so they may not be detected in some quite homogeneous textures within natural scenes such as sand, water, early grass fields, etc.
We propose that our keypoints are the local max pixels detected by a search window of size $\omega_1 \times \omega_1$, and that LED descriptors are generated for each keypoint with the support of the local max and local min sets extracted by another search window of size $\omega_2 \times \omega_2$. Here, we separate the two window sizes since $\omega_2$ should be small enough to ensure sufficient max and min pixels around each keypoint (i.e., dense enough) for descriptor computation. On the other hand, the density of keypoints within the image can be lower than or equal to the density of local extrema by setting $\omega_1 \geq \omega_2$. Our experiments indicate that a coarser density of keypoints (obtained by increasing $\omega_1$) still produces good retrieval results while considerably reducing the computation time. The sensitivity of the proposed method to these window sizes will be analyzed in Section 4.3.3. Subsequently, a set of LED descriptors is constructed for the keypoint set to characterize the image. Given a neighborhood window size W × W, the computation of the LED feature vector for each keypoint is carried out as described in Section 2.2 (see Equations (3)-(12)). Similarly, the algorithm sensitivity to parameter W is discussed in Section 4.3.3.

Dissimilarity Measure for Retrieval
Each query image from the database is now represented and characterized by a set of LED descriptors that can be considered as an LED feature point cloud. The dissimilarity measure between two images becomes the distance between two LED feature point clouds. Several metrics can be used. One may consider that these sets of feature vectors follow a multivariate normal distribution and thus employ the simplified Mahalanobis distance or the symmetric Kullback-Leibler distance for dissimilarity measurement (Equations (13) and (14)). Both are computed from $\mu_1$, $C_1$, $\mu_2$, $C_2$, the estimated means and covariance matrices of the normal distributions for the two point clouds.
Others would prefer to account only for the space of the feature covariance matrices by supposing that these point clouds have the same mean feature vector. Hence, their dissimilarity measure is calculated based on the distance between their covariance matrices. Our proposition in this paper is to use a geometric-based distance between the feature covariance matrices estimated from each set of LED descriptors. Since these feature covariance matrices possess a positive semi-definite structure, their Riemannian distance (also called geodesic distance in some studies) is proposed as the distance measure in our retrieval scheme. The Riemannian distance between two covariance matrices is defined in [36] as follows:

$$d_{RD}(C_1, C_2) = \sqrt{\sum_{\ell=1}^{d} \ln^2 \lambda_{\ell}(C_1, C_2)}, \tag{15}$$

where $\lambda_{\ell}$ is the $\ell$th generalized eigenvalue satisfying $\lambda_{\ell} C_1 \chi_{\ell} - C_2 \chi_{\ell} = 0$, $\ell = 1 \ldots d$, $\chi_{\ell}$ is the eigenvector corresponding to $\lambda_{\ell}$, and d is the feature dimensionality of the LED feature vectors (i.e., 27 or 33 as described in Section 2.2).
Experiments show that the Riemannian distance (RD) outperforms the Mahalanobis and the symmetric Kullback-Leibler distances in terms of both retrieval accuracy and computational time. Moreover, by using the RD, only one feature covariance matrix needs to be computed and stored for each query image. On the other hand, the distances in Equations (13) and (14) also involve the mean feature vector in their computation, and hence require more storage.
Last but not least, to further confirm and validate our choice of the Riemannian distance, we will provide in Section 4.3.4 a detailed comparison (in terms of retrieval performance and computation time) of this metric not only with the mentioned Mahalanobis and Kullback-Leibler metrics but also against some other distance measures between covariance matrices, such as the log-Euclidean, the Wishart-like and the Bartlett distances [42,43].
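As a hedged illustration, the two Gaussian-based distances mentioned above could be computed as below. The exact expressions are assumptions: the Mahalanobis variant uses the sum of the two inverse covariances, and the symmetric Kullback-Leibler distance is the standard sum of the two directed divergences between Gaussians. The simplified variants used in the paper may differ from these by constants or scaling, and the function names are hypothetical.

```python
import numpy as np


def simplified_mahalanobis(mu1, C1, mu2, C2):
    """(mu1 - mu2)^T (C1^-1 + C2^-1) (mu1 - mu2); assumed form."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(diff @ (np.linalg.inv(C1) + np.linalg.inv(C2)) @ diff)


def symmetric_kl(mu1, C1, mu2, C2):
    """KL(N1 || N2) + KL(N2 || N1) for two multivariate normal models."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    k = diff.size
    iC1, iC2 = np.linalg.inv(C1), np.linalg.inv(C2)
    # The log-determinant terms of the two directed divergences cancel.
    trace_term = np.trace(iC2 @ C1) + np.trace(iC1 @ C2) - 2 * k
    return float(0.5 * (trace_term + diff @ (iC1 + iC2) @ diff))
```

Both functions need the mean vector as well as the covariance matrix of each point cloud, which is the extra storage cost the text attributes to these metrics.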
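The Riemannian distance between two covariance matrices, i.e., the square root of the sum of squared logarithms of their generalized eigenvalues, can be sketched with NumPy alone. The sketch whitens $C_2$ by the Cholesky factor of $C_1$, which yields the same eigenvalues as the generalized problem $C_2 \chi = \lambda C_1 \chi$. The function name and the optional ridge parameter `eps` are hypothetical; the code assumes both matrices are strictly positive definite, so a small `eps` may be needed when a covariance estimate is rank-deficient.

```python
import numpy as np


def riemannian_distance(C1, C2, eps=0.0):
    """Geodesic distance between two symmetric positive definite matrices."""
    C1 = np.asarray(C1, dtype=float)
    C2 = np.asarray(C2, dtype=float)
    d = C1.shape[0]
    # Optional ridge (assumption) to keep both matrices positive definite.
    C1 = C1 + eps * np.eye(d)
    C2 = C2 + eps * np.eye(d)
    # With C1 = L L^T, the eigenvalues of L^-1 C2 L^-T are the generalized
    # eigenvalues lambda solving C2 x = lambda C1 x.
    L_inv = np.linalg.inv(np.linalg.cholesky(C1))
    M = L_inv @ C2 @ L_inv.T
    lam = np.linalg.eigvalsh((M + M.T) / 2.0)   # symmetrize against round-off
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

Swapping the two arguments inverts every eigenvalue, which only flips the sign of each logarithm, so the distance is symmetric.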

Proposed Retrieval Framework
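The full proposed framework for texture and color-based image retrieval using LED features and Riemannian distance is outlined in Figure 3. The algorithm can be summarized as follows:
1. Load the query color image $I_{color}$.
2. Convert the image to grayscale I.
3. Extract the keypoints, i.e., the local max pixels of I detected by the $\omega_1 \times \omega_1$ search window.
4. Extract the local max and local min sets of I using the $\omega_2 \times \omega_2$ search window.
5. Generate the LED descriptor of each keypoint using the W × W neighborhood window.
6. Estimate the feature covariance matrix of these LED descriptors.
7. Compute the Riemannian distance (15) between the query and the other images from the database.
8. Sort these distance measures and produce the best matches as the final retrieval result for the query.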
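The last steps of the algorithm (covariance estimation, distance computation and sorting) can be sketched as below. The function names are hypothetical, and `distance` stands for any callable taking two covariance matrices, such as an implementation of the Riemannian distance; nothing here is specific to one metric.

```python
import numpy as np


def covariance_signature(descriptors):
    """Covariance matrix (d x d) of an (n, d) cloud of LED descriptors."""
    return np.cov(np.asarray(descriptors, dtype=float), rowvar=False)


def rank_by_distance(query_cov, database_covs, distance):
    """Return database indices sorted from best to worst match.

    distance : callable(C1, C2) -> float, supplied by the caller.
    """
    dists = np.array([distance(query_cov, C) for C in database_covs])
    order = np.argsort(dists, kind="stable")
    return order, dists[order]
```

Only one covariance matrix per image has to be kept in the index, so the database side of the search reduces to a list of small d × d matrices.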
Vistex is one of the most widely used texture databases for performance assessment and comparative study in the CBIR field. It consists of 40 texture images (i.e., 40 texture classes) of size 512 × 512 pixels. As in most studies in the literature, each texture image is divided into 16 non-overlapping subimages of size 128 × 128 pixels in order to create a database of 640 subimages (i.e., 40 classes × 16 subimages/class). Being much larger, the Stex database is a collection of 476 color texture images captured in the area around Salzburg, Austria under real-world conditions. As for Vistex, each 512 × 512 texture image is divided into 16 non-overlapping patches to build a total of 7616 images. Next, the Colored Brodatz Texture (CBT) database is an extension of the gray-scale Brodatz texture database [45]. This database both preserves the rich textural content of the original Brodatz and possesses a wide variety of color content. Hence, it is relevant for the evaluation of texture and color-based CBIR algorithms. The CBT database consists of 112 textures of size 640 × 640 pixels. Each one is divided into 25 non-overlapping subimages of size 128 × 128 pixels, thus creating 2800 images in total (i.e., 112 classes × 25 images/class). The USPtex database [46] includes 191 classes of both natural scenes (road, vegetation, cloud, etc.) and materials (seeds, rice, tissues, etc.). Each class consists of 12 image samples of 128 × 128 pixels. Finally, the Outex TC-00013 [47] is a collection of heterogeneous materials such as paper, fabric, wool, stone, etc. It comprises 68 texture classes, and each one includes 20 image samples of 128 × 128 pixels. For convenience, we name the above five texture databases Vistex-640, Stex-7616, CBT-2800, USPtex-2292 and Outex-1360, where the number represents the total number of samples in each database. Figure 4 shows some examples from each database and Table 1 provides a summary of their information.

Evaluation Criteria
The main criterion used to assess the performance of the proposed retrieval framework compared to state-of-the-art methods is the average retrieval rate (ARR). Let $N_t$, $N_R$ be the total number of images in the database and the number of relevant images for each query, and for each query image q, let $n_q(K)$ be the number of correctly retrieved images among the K retrieved ones (i.e., K best matches). ARR in terms of the number of retrieved images (K) is given by:

$$ARR(K) = \frac{1}{N_t \times N_R} \sum_{q=1}^{N_t} n_q(K) \Big|_{K \geq N_R}$$

We note that K is generally set to be greater than or equal to $N_R$. By setting K equal to $N_R$, ARR becomes the primary benchmark considered by most studies to evaluate and compare the performance of different CBIR systems. All ARR results reported in this paper are obtained by setting $K = N_R$.
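The ARR criterion can be sketched from a pairwise distance matrix and class labels as follows. The function name is hypothetical, and one assumption is made explicit: each query is itself part of the database, so it appears among its own K best matches and counts as a relevant image.

```python
import numpy as np


def average_retrieval_rate(dist, labels, K):
    """ARR for an (N_t, N_t) distance matrix and N_t class labels.

    For every query q, n_q(K) counts how many of its K best matches share
    its class; the rate is n_q(K) / N_R averaged over all queries.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    # N_R per query: number of database images of the same class
    # (assumption: the query itself is included).
    n_relevant = (labels[:, None] == labels[None, :]).sum(axis=1)
    top_k = np.argsort(dist, axis=1, kind="stable")[:, :K]
    n_correct = (labels[top_k] == labels[:, None]).sum(axis=1)
    return float(np.mean(n_correct / n_relevant))
```

For a database such as Vistex-640, `K` would be set to 16, the number of subimages per class.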

Performance in Retrieval Accuracy
The proposed CBIR system was applied to the five studied databases. For all cases, the local maximum and local minimum sets (i.e., $S^{\max}_{\omega_2}$ and $S^{\min}_{\omega_2}$) were extracted using a 3 × 3 search window ($\omega_2 = 3$). The local max keypoints were detected by a 5 × 5 window ($\omega_1 = 5$) and the neighborhood size for LED descriptor construction (W) was set to 30 × 30 pixels. Here, we set $\omega_1 > \omega_2$ to reduce the computational time. Meanwhile, as discussed in Section 3.1, a denser or coarser density of keypoints (obtained by varying $\omega_1$) could produce stable performance. Our experiments showed that $\omega_1$ set from 3 to 9 could yield very close ARR. Similarly, a value of W from 15 to 40 could provide quite equivalent retrieval results. We discuss the sensitivity of our algorithm to these parameters in detail in Section 4.3.3.
Tables 2-6 show the average retrieval rate (ARR) of the proposed framework on our five databases compared to several state-of-the-art methods in the literature. Then, Table 7 provides the descriptor size (i.e., feature dimension) of some of these methods. We note that some learned features extracted from pre-trained convolutional neural networks (CNNs) [48,49] that have been recently proposed by [50,51] were also investigated to enrich our comparative study. From these tables, the first observation is that recent CBIR propositions have increasingly taken color information into consideration in order to improve their retrieval performance. This trend is reasonable since color components play a significant role in natural images. The second remark is that, as previously mentioned in the introduction, the local pattern-based schemes (such as LTrP [17], LECoP [21], etc.) and the BTC-based systems [26][27][28], as well as the recently proposed learned descriptors based on pre-trained CNNs [50,51], generally provide higher ARR than the wavelet-based probabilistic approaches [4][5][6][7][8][9],[11,12,29]. Then, more importantly, our LED+RD framework (both the 27D version and the improved 33D version) has outperformed all reference methods for all the three databases. We now discuss the results for each database to validate the effectiveness of the proposed strategy.
An ARR improvement of 1.47% (i.e., from 93.23% to 94.70%) and 3.68% (i.e., from 76.40% to 80.08%) is observed for the Vistex-640 and Stex-7616 databases, respectively, using the proposed algorithm. Within the proposed strategy, the improved 33D LED gives a slightly higher ARR than the 27D version (i.e., by 0.06% for Vistex-640 and 0.13% for Stex-7616). This means that our framework could still be enhanced by modifying or incorporating other features for LED construction. Next, an important remark is that, during our experiments, most of the texture classes with strong structures and local features, such as buildings, fabric categories, surfaces of man-made objects, etc., were retrieved with 100% accuracy. Table 3 shows the per-class retrieval rate for each image class in the Vistex-640 database. As observed from the table, about half of the classes (19/40 classes with the 27D LED and 20/40 classes with the 33D LED) were indexed with 100% accuracy. These perfectly retrieved image classes in general consist of many local textures and structures. Similar behavior was also observed for the Stex-7616 database. This result is encouraging since our motivation and expectation in this work are to exploit and emphasize local features inside each image in order to tackle the retrieval problem.
Table 2. Average retrieval rate (ARR) on Vistex-640 database by the proposed method compared to the state-of-the-art methods.
Table 5. Average retrieval rate (ARR) on CBT-2800 database by the proposed method compared to the state-of-the-art methods.

Computation Time
Table 8 displays the computational cost required by the proposed strategy in terms of feature extraction (FE) time and dissimilarity measurement (DM) time performed on the Vistex-640 database.All implementations were carried out using MATLAB 2012 on a laptop machine Core i7-3740QM 2.7GHz, 16GB RAM.In short, a total time of 445.1 s is required by the 27D version to yield an ARR of 94.64% for the Vistex-640 database.For the 33D version, it needs 504.7 s to deliver an accuracy of 94.70%.One remark is that the DM stage only requires 0.035 s (resp.0.044 s) to measure the distance from each query image to 640 candidates from the database.In fact, to compute the Riemannian distance in Equation ( 15), only a generalized eigenvalue problem with matrix size of 27 × 27 (resp.33 × 33) needs to be solved.Our experiments proved that this distance is less costly than other metrics (i.e., except the log-Euclidean) mentioned in Section 3.2.A detailed comparison will be given later in Section 4.3.4.Some other experiments were also conducted to compare the processing time for LED extraction with other popular descriptors including SIFT [39], SURF [40], Binary robust invariant scalable keypoints (BRISK) [53] and Binary robust independent elementary features (BRIEF) [54] .In order to perform a fair comparison, we propose that all descriptors will be constructed from the same number of 100 SURF keypoints extracted from a texture image of size 512 × 512 pixels.Comparative results are reported in Table 9.From the table, we observe that the proposed LED needs more computational time than the other descriptors.However, since the computation time is not the main favor of our work, this issue is not so significant and does not neglect the contribution of the proposed strategy.In addition, we note that our implementation of LED was done in a very standard way without any code optimization.Further improvements in terms of optimizing codes may possibly reduce LED computation time.This subsection 
aims at studying the sensitivity of the proposed method to its parameters.As observed from the full algorithm in Section 3.3, only three parameters are required to perform our retrieval system: the window size to detect keypoints (ω 1 ), the window size to extract local maximum and local minimum pixels (ω 2 ) and the window size to generate LED descriptors (W).Since ω 2 needs to be small enough to ensure a sufficiently good density of local extrema to support the computation of LED descriptor.It is fixed to 3 × 3 pixels in our work.We now investigate the sensitivity to the other two parameters, ω 1 and W.
Figure 5a shows the performance of the LED+RD algorithm (version 27D) obtained by fixing  Similarly, the algorithm sensitivity to W (i.e., window size used for the generation of LED descriptors) can be found in Figure 5b.Again, a stable performance can be observed when W varies from 15 to 40.Here, we set ω 1 to 5 and ω 2 always to 3. The highest ARR is adopted by setting W to 30.However, a smaller window size W = 20 may be more interesting since it reduces only 0.15% of ARR (i.e., reduced from 94.64% to 94.49%) while speeding up by about 21% the feature extraction time (i.e., from 0.661 s to 0.523 s per image).As a result, a total time of only 349.2 seconds is necessary for our framework to produce 94.49% retrieval accuracy for Vistex-640 database.This issue makes the proposed strategy become very effective and competitive in both retrieval performance and computational cost.Furthermore, Figure 5a,b show that the proposed LED method is not very sensitive to its parameters.A stable performance can be adopted with a wide range of parameters:  15) outperformed other metrics that we implemented during the experimentation.In this subsection, we study and compare the performance of the proposed scheme by investigating different distance measures.The experiments were carried out on the Vistex-640 database using the 27D LED descriptors.Comparative results are shown in Table 10 in terms of computational time and retrieval rate.From the table, the Riemannian distance produces the highest ARR (i.e., 94.64%) and the second-lowest computational cost (i.e., 0.0349 s, only greater than the log-Euclidean metric with 0.0316 s) to measure the distance from each query image to 640 candidates.The simplified Mahalanobis and the symmetric Kullback-Leibler distances (see Equations ( 13) and ( 14)) consider that each point cloud follows a multivariate normal distribution with mean vector µ and covariance matrix C.They both yield interesting retrieval results (i.e., 90.48% and 91.82%, 
respectively). On the other hand, by not accounting for the mean feature vectors of those point clouds and by considering the topological space of the feature covariance matrices (Riemannian manifolds), geometric metrics such as the Wishart-like distance and the recommended Riemannian distance could be more relevant. The Wishart-like distance in the second-last row gives a good ARR (92.39%) and requires 0.059 s of computation time per image. However, the recommended Riemannian distance behaves better in both time consumption and retrieval rate. In contrast, the log-Euclidean and Bartlett distances do not appear suitable, as they yield very low retrieval rates. Therefore, our choice of the Riemannian distance for the dissimilarity measurement task is confirmed.

t_data: time for the entire database (640 images); t_image: time for each image; ||C||: the Frobenius norm [42] of matrix C; |C|: the determinant of C.
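As a rough illustration of the covariance-based metrics discussed above, the sketch below computes two of them between the feature covariance matrices of two descriptor point clouds. It assumes the widely used affine-invariant form of the Riemannian distance, sqrt(Σ ln² λ_i) with λ_i the eigenvalues of C1⁻¹C2, and the standard log-Euclidean distance ||log C1 − log C2||; Equation (15) and the exact forms used in the experiments are not reproduced here and may differ in detail. All function names are hypothetical.

```python
import numpy as np


def feature_covariance(points):
    """Covariance matrix of a point cloud given as an (n_points, n_dims) array."""
    return np.cov(np.asarray(points, dtype=float), rowvar=False)


def riemannian_distance(c1, c2):
    """Affine-invariant distance between two SPD matrices (assumed form)."""
    chol = np.linalg.cholesky(c1)                 # C1 = L L^T
    y = np.linalg.solve(chol, c2)                 # L^{-1} C2
    m = np.linalg.solve(chol, y.T).T              # L^{-1} C2 L^{-T}
    m = 0.5 * (m + m.T)                           # remove round-off asymmetry
    lam = np.linalg.eigvalsh(m)                   # same spectrum as C1^{-1} C2
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))


def spd_log(c):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(c)
    return (v * np.log(w)) @ v.T


def log_euclidean_distance(c1, c2):
    """Frobenius norm of the difference of matrix logarithms."""
    return float(np.linalg.norm(spd_log(c1) - spd_log(c2), ord="fro"))
```

In a retrieval setting, one would compute `feature_covariance` once per image and rank the database by the chosen distance to the query covariance. Both functions require symmetric positive-definite inputs, so each point cloud needs more points than descriptor dimensions.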

Conclusions
A novel texture and color-based image retrieval framework has been proposed in this work by exploiting local extrema-based features and the Riemannian metric. In this article, we have demonstrated the capacity of the local extrema pixels to capture significant information from the image content in order to encode color and local textural features. The proposed LED descriptor has proved its efficiency and relevance for CBIR tasks. It is easy to implement, feasible to extend or improve, and not very sensitive to its parameters. Then, by using the Riemannian distance to measure the dissimilarity between the feature covariance matrices of the images, the proposed strategy has produced very competitive results in terms of both computational cost and average retrieval rate. It is expected to become one of the state-of-the-art methods in this field.
Future work can improve the performance of the LED descriptor by exploiting other features in its generation. One may be interested in a deeper study of the LED feature space in order to propose other efficient distance metrics. Others may want to study the generalization of the proposed approach by considering the local max and min pixels within a scale-space context [55], where the proposed LED can be regarded as a special case at one scale. As another perspective, the proposed framework can be applied to other types of images for retrieval and classification tasks in other application fields such as remote sensing imagery, medical imaging, etc. In addition, LED features may be exploited to tackle feature matching tasks and be compared with existing powerful descriptors such as SIFT [39], SURF [40], SYBA [56] or BRIEF [54].

Figure 2. Geometric and gradient information from a local maximum (resp. local minimum) pixel q = (x_q, y_q) within N_W^max(p) (resp. N_W^min(p)) considered for the calculation of the LED descriptor at the studied keypoint p = (x_p, y_p). Here, d(p, q) is the distance between p and q; α(p, q) is the angle of the vector from p to q. We have d(p, q) = d(q, p) but α(p, q) ≠ α(q, p). Then, ∇I(q) and θ(q) are the gradient magnitude and gradient orientation at q.

Figure 3. Proposed framework for color texture image retrieval using the local extrema-based descriptors and the Riemannian distance.

Figure 5. Sensitivity of the proposed method to its parameters in terms of average retrieval rate (%) and feature extraction time (s). Experiments were conducted on the Vistex-640 data set using the 27D LED+RD. (a) Sensitivity to window size ω1 for keypoint extraction; (b) sensitivity to window size W for descriptor generation.

Table 1. Summary of texture databases used in the experimental study.

Table 3. Per-class retrieval rate (%) on the Vistex-640 database using the proposed LED+RD method.

Table 4. Average retrieval rate (ARR) on Stex-7616 database by the proposed method compared to the state-of-the-art methods.

Table 6. Average retrieval rate (%) on USPtex-2292 and Outex-1360 databases by the proposed method compared to some reference methods.

Table 7. Comparison of feature vector length of different methods.

Table 8. Performance of the proposed method in terms of feature extraction (FE) time and dissimilarity measurement (DM) time. Experiments were conducted on the Vistex-640 database.

Table 9. Comparison of feature description time of different methods.

Sensitivity to ω1 (Figure 5a): with W fixed to 30 and ω1 varied from 3 to 11 on the Vistex-640 data, we observe a stable ARR (varying from 94.18% to 94.64%) for 3 ≤ ω1 ≤ 9. The ARR then decreases by 1.45% when ω1 goes from 9 to 11. In terms of computational cost, the feature extraction time is significantly reduced as ω1 increases. Although the best ARR (94.64%) is obtained by setting ω1 = 5, one may prefer setting it to 7 or 9 to speed up the computation. Indeed, using ω1 = 7 saves about 28% of the time (i.e., from 0.661 s to 0.473 s per image) at the cost of only a 0.16% reduction in ARR (i.e., from 94.64% to 94.48%). Similarly, setting ω1 to 9 saves 44% of the time (i.e., from 0.661 s to 0.373 s per image) and yields 94.18% retrieval accuracy (i.e., a reduction of only 0.46%).
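The speed versus accuracy trade-off quoted for ω1 can be checked with a few lines of arithmetic. The sketch below uses only the timings and ARR values reported in the text; the names `BASELINE`, `CANDIDATES` and `trade_off` are hypothetical and introduced purely for illustration.

```python
# Reported values (feature extraction time in s per image, ARR in %).
BASELINE = (0.661, 94.64)                         # omega1 = 5
CANDIDATES = {7: (0.473, 94.48), 9: (0.373, 94.18)}


def trade_off(time_s, arr, baseline=BASELINE):
    """Return (time saving in %, ARR drop in percentage points) vs. the baseline."""
    base_time, base_arr = baseline
    time_saving = 100.0 * (base_time - time_s) / base_time
    arr_drop = base_arr - arr
    return time_saving, arr_drop


for omega1, (time_s, arr) in CANDIDATES.items():
    saving, drop = trade_off(time_s, arr)
    print(f"omega1={omega1}: {saving:.1f}% faster, ARR lower by {drop:.2f} points")
```

Rounded to the nearest percent, the computed savings match the figures of about 28% and 44% given in the text.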