A Novel Discriminating and Relative Global Spatial Image Representation with Applications in CBIR

Abstract: The requirement for effective image search, which motivates the use of Content-Based Image Retrieval (CBIR) and the search of similar multimedia contents on the basis of a user query, remains an open research problem for computer vision applications. The application domains for Bag of Visual Words (BoVW) based image representations are object recognition, image classification and content-based image analysis. In the BoVW model, interest point descriptors are quantized in the feature space, and the final histogram or image signature does not retain any detail about co-occurrences of features in the 2D image space. This spatial information is crucial, and its absence adversely affects the performance of an image classification-based model. The most notable contribution in this context is Spatial Pyramid Matching (SPM), which captures the absolute spatial distribution of visual words. However, SPM is sensitive to image transformations such as rotation, flipping and translation. When images are not well-aligned, SPM may lose its discriminative power. This paper introduces a novel approach to encoding the relative spatial information for histogram-based representation of the BoVW model. This is established by computing the global geometric relationship between pairs of identical visual words with respect to the centroid of an image. The proposed research is evaluated by using five different datasets. Comprehensive experiments demonstrate the robustness of the proposed image representation as compared to the state-of-the-art methods in terms of precision and recall values.


Introduction
In recent years, with the rapid development of imaging technology, searching or retrieving a relevant image from an image archive has been considered an open research problem for computer vision based applications [1][2][3][4]. Higher retrieval accuracy, low memory usage and reduction of the semantic gap are examples of common problems related to multimedia analysis and image retrieval [3,5]. The common applications of multimedia and image retrieval are found in the fields of video surveillance, remote sensing, art collection, crime detection, medical image processing and image retrieval in real-time applications [6]. Most of the retrieval systems, both for multimedia and images, rely on the matching of textual data with the desired query [6]. Due to the existing semantic gaps, the performance of these systems suffers [7]. The appearance of a similar view in images belonging to different image categories results in the closeness of the feature vector values and degrades the performance of image retrieval [6]. The main focus of the research in Content-Based Image Retrieval (CBIR) is to retrieve images that are in a semantic relationship with a query image [8]. CBIR provides a framework that compares the visual feature vector of a query image to those of the images stored in the dataset [9].
The Bag of Visual Words (BoVW), also known as Bag of Features (BoF) [10], is commonly used for video and image retrieval [11]. The local features are extracted at interest points from a group of training images. To achieve a compact representation, the feature space is quantized to construct a codebook, which is also known as a visual vocabulary or visual dictionary. The final feature vector, which consists of histograms of visual words, is orderless with respect to the sequence of co-occurrences in the 2D image space. The performance of the BoVW model suffers from this loss, as spatial information is beneficial in image classification and retrieval-based problems [6,12].
Various approaches have been proposed to enhance the performance of image retrieval, such as soft assignments, computation of larger codebooks and visual word fusion [8]. None of these techniques retains any information about the locations of the visual words in the final histogram-based representation [13]. There are two common techniques that can compute the spatial information from the image. These are based on (1) the construction of histograms from different sub-regions of the image, and (2) visual word co-occurrence [13][14][15]. The first approach is to split the image into different cells for the histogram's computation; it is reported to be robust for content-based image matching applications [16]. Spatial Pyramid Matching (SPM) [16] is considered a notable contribution for the computation of spatial information for BoVW-based image representation. In SPM, an image is divided into rectangular regions of different sizes for the creation of level-0, level-1 and level-2 histograms of visual words. However, SPM is sensitive to image transformations (i.e., rotation, flipping and translation) and loses its discriminative power, resulting in the misclassification of two similar scene images [17,18].
The second approach to the computation of spatial layout is based on relationships among visual words [19][20][21]. This paper proposes a novel approach to extracting the image spatial layout based on the global relative spatial orientation of visual words. This is achieved by computing the angle between identical visual word pairs with respect to the centroid of the image. Figure 1 provides an illustration to better understand the proposed approach. The image in Figure 1 is rotated at varying angles. It can be seen that the same angle is computed between visual words irrespective of the image orientation. The main contributions of this research are the following: (1) the addition of discriminating relative global spatial information to the histogram of the BoVW model and (2) reduction of the semantic gap. An efficient image retrieval system must be capable of retrieving images that meet user preferences and their specific requirements. The reduction of the semantic gap means that related categories are given higher similarity scores than unrelated categories. The proposed representation is capable of handling geometric transformations, i.e., rotation, flipping and translation. Extensive experiments on five standard benchmarks demonstrate the robustness of the proposed approach and a remarkable gain in the precision and recall values over the state-of-the-art methods.
The structure of the paper is as follows. Section 2 contains the literature review and related work; Section 3 is about the BoVW model and proposed research; and Section 4 deals with the experimental parameters and image benchmarks, while also presenting a comparison with the existing state-of-the-art techniques. Section 5 provides a discussion, while Section 6 concludes the proposed research with future directions.

Related Work
According to the literature [6], SIMPLicity, Blobworld and Query by Image Content (QBIC) are examples of computer vision applications that rely on the extraction of visual features such as color, texture and shape. Image Rover and WebSeek are examples of image search systems that rely on a query-based or keyword-based image search [6]. The main objective of any CBIR system is to search for relevant images that are similar to the query image [22]. Overlapping objects, differences in the spatial layout of the image, changes in illumination and semantic gaps make CBIR challenging for the research community [8]. Wang et al. [23] propose the Spatial Weighing BOF (SWBOF) model to extract the spatial information by using three approaches, i.e., local variance, local entropy and adjacent block distance. This model is based on the concept that the different parts of an image object contribute to image categorization in varying ways. The authors demonstrate significant improvement over the traditional methods. Ali et al. [9] extract the visual information by dividing an image into triangular regions to capture the compositional attributes of an image. The division of the image into triangular cells is reported as an efficient method for histogram-based representation. Zeng et al. [24] propose a spatiogram-based image representation that consists of a color histogram that is quantized by using Gaussian Mixture Models (GMMs). The quantized values of GMMs are used as an input for the learning of the Expectation-Maximization (EM). The retrieval is performed on the basis of the closeness of the feature vector values of two spatiograms, which is obtained by using the Jensen-Shannon Divergence (JSD) [24]. Yu et al. [25] investigate the impact of the integration of different mid-level features to enhance the performance of image retrieval. They investigate the impact of the integration of SIFT descriptors with LBP and HOG descriptors, respectively, in order to address the problem of the semantic gap. Weighted k-means clustering is used for quantization, and the best performance is reported with SIFT-LBP integration.
To reduce the semantic gap between the low-level features and the high-level image concepts, Ali et al. [8] propose image retrieval based on the visual words integration of Scale Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). Their approach acquires the strength of both features, i.e., invariance to scale and rotation of SIFT and robustness to illumination of SURF. In another recent work, Ali et al. [26] propose a late fusion of binary and local descriptors, i.e., FREAK and SIFT, to enhance the performance of image retrieval. Filliat et al. [27] present an incremental and interactive localization and map-learning system based on BoW. Hu et al. [28] propose a real-time assistive localization approach that extracts compact and effective omnidirectional image features, which are then used to search a remote image feature-based database of a scene in order to help indoor navigation.
In another recent work, Li et al. [29] propose a hybrid framework of local (BoW) and global image features for efficient image retrieval. According to Li et al. [29], a multi-fusion based on two lines of image representation can enhance the performance of image retrieval. The authors [29] extract the texture information by using Intensity-Based Local Difference Patterns (ILDP) and by selecting the HSV color space. This scheme is selected to capture the spatial relationship patterns that exist in the images. The global color information is extracted by using the H and S components. The final feature vector is constituted by combining the H, S feature space and ILDP histograms. The experimental results validate that the fusion of color and texture information enhances the performance of image retrieval [29]. According to Liu et al. [30], the ranking and incompatibility of image feature descriptors have received little attention in the domain of image retrieval. The authors address the problem of incompatibility by using gestalt psychology theory and manifold learning. A combination of gradient direction and color is used to imitate human visual uniformity. The selection of the proposed feature scheme [30] enhances the image retrieval performance. According to Wu et al. [31], ranking and feature representation are two important factors that can enhance the performance of image retrieval, yet they are considered separately in image retrieval models. The authors propose a texton uniform descriptor and apply an intrinsic manifold structure by visualizing the distribution of image representations on the two-dimensional manifold. This process provides a foundation for subsequent manifold-based ranking and preserves the intrinsic neighborhood structure. The authors apply a Modified Manifold Ranking (MMR) to enhance and propagate adjacent similarity between the images [31]. According to Varish et al. [32], a hierarchical approach to CBIR based on a fusion of color and texture can enhance the performance of image retrieval. The color feature vectors are computed on the basis of the quantized HSV color space, and texture values are computed to achieve rotation invariance on the basis of the Value (V) component of the HSV space. The sub-bands of the Dual Tree Complex Wavelet Transform (DT-CWT) are used to compute the principal texture direction.
Zou et al. [33] propose an effective feature selection approach based on Deep Belief Networks (DBN) to boost the performance of image retrieval. The approach works by selecting more reconstructible discriminative features using an iterative algorithm to obtain the optimized reconstruction weights. Xia et al. [34] perform a systematic investigation to evaluate factors that may affect the retrieval performance of the system. They focus the analysis on the visual feature aspect to create powerful deep feature representations. According to Wan et al. [7], a pre-trained deep convolutional neural network outperforms the existing feature extraction techniques at the cost of high training computations for large-scale image retrieval. It is important to mention that the approaches based on deep networks may not be an optimal selection, as they require large-scale training data and extensive computation to train a classification-based model [21,35].

Proposed Methodology
The basic notations for the BoVW model are discussed in this section. This is then followed by a discussion of the proposed Relative Global Spatial Image Representation (RGSIR) and the details of its implementation.

BoVW Model
The Bag-of-Words (BoW) methodology was first proposed in textual retrieval systems [11] and was further applied in the form of BoVW representation for image analysis. In BoVW, the final image representation is a histogram of visual words. It is termed a bag, as it counts how many times a word occurs in a document. A histogram does not have any order and does not retain any information regarding the location of visual words in the 2D image space [9,16]. The similarity of two images is determined by histogram intersection. In the case of dissimilar images, the result of the intersection is small.
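As an illustration of the similarity measure mentioned above, the following minimal sketch computes the histogram intersection as the sum of bin-wise minima. The example histograms and variable names are hypothetical and are not taken from the experiments.

```python
def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two equal-length histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical visual-word counts for three images (illustrative values only).
query = [4, 0, 3, 1]
similar = [3, 1, 3, 1]
dissimilar = [0, 5, 0, 3]

score_similar = histogram_intersection(query, similar)        # 3 + 0 + 3 + 1
score_dissimilar = histogram_intersection(query, dissimilar)  # 0 + 0 + 0 + 1
```

The intersection is large when the two images share many visual words and small otherwise, but it is unaffected by where those words occur in the image.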
As a first step in BoVW, the local features are extracted from the image $Im$, and the image is represented as a set of image descriptors, $Im = \{d_1, d_2, d_3, \ldots, d_I\}$, where $d_i$ denotes a local image feature and $I$ represents the total number of image descriptors. The feature extraction can be done by applying local descriptors such as SIFT descriptors [36]. The key points can be acquired automatically by using interest point detectors or by applying dense sampling [16].
Consequently, numerous local descriptors are created for each image of a given dataset. The extracted descriptors are vector quantized by applying the k-means [11] clustering technique to construct the visual vocabulary, as in
$$v = \{w_1, w_2, w_3, \ldots, w_K\} \qquad (1)$$
where $K$ is the specified number of clusters or visual words and $v$ denotes the constructed visual vocabulary.
The assignment of each descriptor to the nearest visual word is done by computing the minimum distance as follows:
$$w(d_j) = \arg\min_{w \in v} \mathrm{Dist}(w, d_j) \qquad (2)$$
Here, $w(d_j)$ represents the visual word mapped to the $j$th descriptor and $\mathrm{Dist}(w, d_j)$ denotes the distance between the descriptor $d_j$ and the visual word $w$.
The histogram representation of an image is based on the visual vocabulary. The number of histogram bins equals the number of visual words in the codebook or dictionary (i.e., $K$). Each histogram bin $bin_i$ represents a visual word $w_i$ in $v$ and signifies the number of descriptors mapped to that visual word, as shown in (3):
$$bin_i = \mathrm{Card}(D_i), \quad D_i = \{d_j,\ j \in 1, \ldots, I \mid w(d_j) = w_i\} \qquad (3)$$
Here $D_i$ is the set of descriptors mapped to a particular visual word $w_i$ in an image, and the cardinality of this set is given by $\mathrm{Card}(D_i)$. The final histogram representation for the image is created by repeating the process for each word in the image. The histograms hence created do not retain the spatial context of the interest points.
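The assignment and counting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary is taken as already computed (k-means is not shown), Dist is assumed to be the Euclidean distance, and the toy 2D "descriptors" and function names are hypothetical stand-ins for real 128-dimensional SIFT descriptors.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor.

    Euclidean distance is an assumption; the text only requires some Dist.
    """
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(vocabulary[i], descriptor))


def bovw_histogram(descriptors, vocabulary):
    """One bin per visual word; each bin counts the descriptors mapped to it."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist


# Toy vocabulary with K = 3 words and five toy descriptors (illustrative only).
vocabulary = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 0.5), (0.5, 8.0), (-1.0, 0.0), (11.0, 1.0)]

hist = bovw_histogram(descriptors, vocabulary)
```

The bins of the resulting histogram sum to the number of descriptors, and no positional information about the descriptors survives the counting step.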

The Proposed Relative Global Spatial Image Representation (RGSIR)
In the BoVW model, the final image representation is created by mapping identical image patches to the same visual word. In [20], Khan et al. capture the spatial information by modeling the global relationship between Pairs of Identical visual Words (PIWs). Their approach exhibits invariance to translation and scaling but is sensitive to rotation [20,37], since the relative relationship between PIWs is computed with respect to the x-axis. Anwar et al. [37] propose an approach to acquire rotation invariance by computing angles between Triplets of Identical Visual Words (TIWs). Although the approach of [37] acquires rotation invariance, it significantly increases computational complexity due to the increase in the number of possible triplet combinations. For instance, if the number of identical visual words is 30, the number of distinct pair combinations is 435 and the number of possible distinct triplet combinations is 4060.
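The counts quoted above follow directly from binomial coefficients, as the short check below shows for $n = 30$ identical visual words:

```latex
\binom{30}{2} = \frac{30 \cdot 29}{2} = 435,
\qquad
\binom{30}{3} = \frac{30 \cdot 29 \cdot 28}{6} = 4060.
```

In general, the number of pairs grows as $O(n^2)$ while the number of triplets grows as $O(n^3)$, which is why a pair-based representation is cheaper to compute than a triplet-based one.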
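This paper proposes a novel approach to acquiring spatial information for transformation invariance by computing the global geometric relationship between pairs of identical visual words. This is accomplished by extracting the spatial distribution of these pairs with respect to the centroid of an image, as shown in Figure 2. Hence we define the set of all pairs ($PW$) of identical visual words related to a visual word $w_i$ as:
$$PW_i = \{(a, b) \mid (d_a, d_b) \in D_i^2,\ d_a \neq d_b\}$$
where $a(x_1, y_1)$ and $b(x_2, y_2)$ are the spatial locations of the descriptors $d_a$ and $d_b$, respectively. Since the $i$th histogram bin counts the descriptors in $D_i$, its value $b_i$ gives the total occurrences of the word $w_i$. The cardinality of the set $PW_i$ is $\binom{b_i}{2}$. The centroid $c = (\bar{x}, \bar{y})$ of an image $Im$ of size $R \times C$ is calculated as
$$\bar{x} = \frac{1}{|Im|} \sum_{(x, y) \in Im} x, \qquad \bar{y} = \frac{1}{|Im|} \sum_{(x, y) \in Im} y$$
where $|Im|$ is the number of elements in $Im$. Let $r_{ab}$ be the Euclidean distance between $a$ and $b$; then
$$r_{ab} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
Similarly, the Euclidean distances of $a$ and $b$ from $c$ are calculated as
$$r_{ac} = \sqrt{(x_1 - \bar{x})^2 + (y_1 - \bar{y})^2}, \qquad r_{bc} = \sqrt{(x_2 - \bar{x})^2 + (y_2 - \bar{y})^2}$$
Using the law of cosines, we have
$$\theta = \arccos\left(\frac{r_{ac}^2 + r_{bc}^2 - r_{ab}^2}{2\, r_{ac}\, r_{bc}}\right)$$
where $\theta = \angle acb$.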
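The angle computation can be illustrated with a small sketch. The function names, the example coordinates and the rotation helper are hypothetical; the sketch only demonstrates that the angle subtended at the centroid by a pair of points is unchanged when both points are rotated about that centroid.

```python
import math


def centroid_angle(a, b, c):
    """Angle in degrees at c subtended by points a and b (law of cosines)."""
    r_ab = math.dist(a, b)
    r_ac = math.dist(a, c)
    r_bc = math.dist(b, c)
    if r_ac == 0 or r_bc == 0:
        raise ValueError("a point coincides with the centroid; angle undefined")
    cos_theta = (r_ac ** 2 + r_bc ** 2 - r_ab ** 2) / (2 * r_ac * r_bc)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


def rotate(p, c, degrees):
    """Rotate point p about c by the given angle (helper for the demo only)."""
    t = math.radians(degrees)
    dx, dy = p[0] - c[0], p[1] - c[1]
    return (c[0] + dx * math.cos(t) - dy * math.sin(t),
            c[1] + dx * math.sin(t) + dy * math.cos(t))


c = (50.0, 50.0)                   # assumed image centroid
a, b = (80.0, 50.0), (50.0, 90.0)  # positions of two identical visual words

theta = centroid_angle(a, b, c)
theta_rotated = centroid_angle(rotate(a, c, 37), rotate(b, c, 37), c)
```

Because the three distances are preserved by rotation about the centroid, the two angles agree up to floating-point error. The same holds for a mirror flip about an axis through the centroid, since the law of cosines depends only on distances.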
The $\theta$ angles obtained are then accumulated to create the histogram representation with bins equally distributed between 0° and 180°. The optimal number of bins used for the histogram representation is determined empirically. $RGSIR_i$ represents the spatial distribution for a particular visual word $w_i$. The $RGSIR_i$ obtained from all the visual words in an image are concatenated to create the global image representation. A bin replacement technique is used to transform the BoVW representation to RGSIR. This is achieved by replacing each bin of the BoVW histogram with the associated $RGSIR_i$ related to a particular $w_i$. To add the spatial information while keeping the frequency information intact, the sum of all bins of $RGSIR_i$ is normalized to the size of the bin $bin_i$ of the BoVW histogram that is being replaced. The image representation for RGSIR is hence formulated as:
$$RGSIR = (\alpha_1 RGSIR_1, \alpha_2 RGSIR_2, \ldots, \alpha_K RGSIR_K)$$
where $\alpha_i$, the coefficient of normalization, is given by
$$\alpha_i = \frac{bin_i}{\lVert RGSIR_i \rVert_1}$$
If the size of the visual vocabulary is $K$ and the number of histogram bins is $H$, then the dimension of RGSIR is $K \times H$.
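The bin replacement and normalization steps can be sketched as below. This is an illustrative reading of the description rather than the authors' implementation: the function names, the handling of points that coincide with the centroid, and the treatment of words with fewer than two occurrences (left as all-zero here) are assumptions.

```python
import math
from itertools import combinations


def angle_at(c, a, b):
    """Angle in degrees at c between a and b, or None if undefined."""
    r_ac, r_bc, r_ab = math.dist(a, c), math.dist(b, c), math.dist(a, b)
    if r_ac == 0 or r_bc == 0:
        return None
    cos_t = (r_ac ** 2 + r_bc ** 2 - r_ab ** 2) / (2 * r_ac * r_bc)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def rgsir_for_word(positions, centroid, n_bins=9):
    """H-bin angle histogram for one visual word, rescaled so that its bins
    sum to the BoVW count of that word (the number of positions)."""
    hist = [0.0] * n_bins
    for a, b in combinations(positions, 2):
        theta = angle_at(centroid, a, b)
        if theta is None:
            continue  # assumption: skip pairs with an undefined angle
        idx = min(int(theta / 180.0 * n_bins), n_bins - 1)
        hist[idx] += 1.0
    total = sum(hist)
    if total == 0:
        return hist  # assumption: words without valid pairs stay all-zero
    alpha = len(positions) / total  # bin_i divided by the sum of RGSIR_i
    return [alpha * h for h in hist]


def rgsir(positions_per_word, centroid, n_bins=9):
    """Concatenate the per-word histograms into one K * H vector."""
    signature = []
    for positions in positions_per_word:
        signature.extend(rgsir_for_word(positions, centroid, n_bins))
    return signature


centroid = (0.0, 0.0)
positions_per_word = [
    [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)],  # word 0 occurs three times
    [(2.0, 0.0), (0.0, 2.0)],               # word 1 occurs twice
]
signature = rgsir(positions_per_word, centroid)
```

With $K = 2$ words and $H = 9$ bins the vector has 18 entries, and each block of nine entries sums to the corresponding BoVW count, so the frequency information is preserved alongside the angular distribution.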

Implementation Details
The histogram representations for all of the datasets are created by following the same sequence of steps, as shown in Figure 3. As a preprocessing step, the images are converted to gray-scale mode by using the available standard resolution, and the dense SIFT features are extracted on six scales, i.e., {2, 4, 6, 8, 10, 12}, for the computation of the codebook [38]. A step size of 5 is applied to compute the dense SIFT features [38]. Dense features are selected, as the dense regular grid has been shown to possess better discriminative power [16]. To save computation time for clustering, 40% of the features (per image) are selected by applying a random selection on the training set to compute the codebook. To quantize the descriptors, k-means clustering is applied to generate the visual vocabulary. Since the size of the codebook is one of the major factors that affects the performance of image retrieval, the proposed approach is evaluated by using different sizes of codebook to determine the best retrieval performance. The visual vocabulary is constructed from the training set and the evaluation is done using the test set. The experiments are repeated in 10 trials to remove the ambiguity created by the random initialization of cluster centers by k-means. For each trial, the training and test images are stochastically selected and the average retrieval performance is reported in terms of precision and recall values, which are considered standard image retrieval measures [8,39].
The calculation of RGSIR involves computing subsets of pairs from sets of identical visual words. To accelerate computation, a threshold value is set and a random selection is applied to limit the number of identical words used for creating the pair combinations. We use a nine-bin RGSIR representation for the results presented in Section 4. Figure 4 gives the empirical justification for the number of bins on two different image benchmarks used in our experiments. Support Vector Machine (SVM), a supervised learning technique, is used for classification. The SVM Hellinger kernel is applied to the normalized RGSIR histograms. The optimal value for the regularization parameter is determined by applying 10-fold cross validation on the training dataset. As we have used a classification-based framework for image retrieval, the class of the image is predicted by using the classifier labels; similarity among the images of the same class is determined on the basis of distance in decision values [8]. The results obtained from the evaluation metrics are normalized and average values are reported in tables and graphs. MATLAB is used to simulate the research on a 7th-generation Core i7 processor with 16 GB RAM.
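For reference, the Hellinger kernel between two non-negative, L1-normalized histograms is the sum of the square roots of their bin-wise products. The sketch below shows only this kernel evaluation; the function names and example histograms are hypothetical, and the SVM training itself is not shown.

```python
import math


def l1_normalize(hist):
    """Scale a non-negative histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else list(hist)


def hellinger_kernel(h1, h2):
    """k(h1, h2) = sum_i sqrt(h1[i] * h2[i]) for non-negative histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))


# Illustrative histograms only.
p = l1_normalize([2, 2, 0, 4])
q = l1_normalize([0, 0, 8, 0])

k_pp = hellinger_kernel(p, p)  # a normalized histogram against itself
k_pq = hellinger_kernel(p, q)  # histograms with no overlapping bins
```

Equivalently, taking the element-wise square root of each normalized histogram and then using a linear kernel gives the same values, which is a common way to use this kernel with a linear SVM solver.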

Datasets and Performance Evaluation
This section provides a description of the datasets, measures used for evaluation, and the details of the experiments conducted for the validation of the proposed research.

Dataset Description
To assess the effectiveness of the proposed research for image retrieval, experiments are conducted on the benchmark datasets used extensively in the literature. The first dataset used in our experiments is the Corel-1K [40] image dataset. Wang's image dataset is comprised of a total of 1000 Corel images with diverse contents such as beach, flowers, horses, mountains, food, etc. The images are grouped into 10 categories with image sizes of 256 × 384 or 384 × 256 pixels. The second dataset is the Corel-1.5K image benchmark, comprised of 15 classes with 100 images per category [40]. Figure 5 shows sample images from Corel-1K and Corel-1.5K, respectively. The fourth dataset is the Oliva and Torralba (OT) dataset [41], which includes 2688 images classified into 8 semantic categories. This dataset exhibits high inter- and intra-class variability, as the river and forest scenes are all considered as forest. Moreover, there is no specific sky category, since all the images contain the sky object. The average image size is 250 × 250 pixels and the images are collected from different sources (i.e., commercial databases, digital cameras, websites). This is a challenging dataset, as the images are sampled from different perspectives, varying rotation angles, different spatial patterns and different seasons. Figure 7 shows the photo gallery of images for the OT image dataset. The last dataset used in our experiments is the RSSCN image dataset [33], released in 2015, comprised of images collected from Google Earth. It consists of 2800 images categorized into 7 typical scene categories. There are 400 images per class, and each image has a size of 400 × 400 pixels. It is a challenging dataset, as the images in each class are sampled at 4 different scales, with 100 images per scale under varied imaging angles. Consistent with related work [33], the dataset is stochastically split into two equal image subsets for training and testing, respectively. Example images from this dataset are shown in Figure 8.

Evaluation Measures
Let the database $\{I_1, \ldots, I_n, \ldots, I_N\}$ be a set of images represented by the spatial attributes. To retrieve images similar to the query image $Q$, each image $I_n$ from the database is compared with $Q$ using an appropriate distance function $d(Q, I_n)$. The database images are then sorted based on the distances such that $d(Q, I_{n_i}) \leq d(Q, I_{n_{i+1}})$ holds for each pair of images $I_{n_i}$ and $I_{n_{i+1}}$ in the sequence $I_{n_1}, \ldots, I_{n_i}, \ldots, I_{n_N}$.

Precision
The performance of the proposed method is measured in terms of precision P and recall R, which are the standard measures used to evaluate CBIR. Precision measures the specificity of the image retrieval, and it gives the fraction of retrieved instances that are relevant to the query image. The Precision (P) is defined as
$$P = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}$$

Recall
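The Recall is the ratio of the relevant instances retrieved to the total number of instances of that class in the dataset. It measures the sensitivity of the image retrieval and is given by
$$R = \frac{\text{number of relevant images retrieved}}{\text{total number of relevant images in the dataset}}$$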
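A minimal sketch of the two measures for a single query is given below. It assumes that the ranked list is represented by the class labels of the retrieved images and that relevance means sharing the class of the query; the function name and the example ranking are hypothetical.

```python
def precision_recall_at_k(ranked_labels, query_label, k, n_relevant_in_dataset):
    """Precision and recall over the top-k entries of a ranked result list."""
    top = ranked_labels[:k]
    hits = sum(1 for label in top if label == query_label)
    precision = hits / len(top) if top else 0.0
    recall = hits / n_relevant_in_dataset if n_relevant_in_dataset else 0.0
    return precision, recall


# Hypothetical ranking for a "horse" query, with 100 horse images in the dataset.
ranked = ["horse"] * 15 + ["beach"] * 5 + ["horse"] * 10
p, r = precision_recall_at_k(ranked, "horse", k=20, n_relevant_in_dataset=100)
```

Here 15 of the top 20 results are relevant, so the precision is 15/20 while the recall is 15/100.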

Mean Average Precision (MAP)
Based on P and R values, we also report results in terms of the precision vs. recall curve (P-R curve) and the mean average precision (MAP). The P-R curve represents the tradeoff between precision and recall for a given retrieval approach. It conveys more information about retrieval performance, which is summarized by the area under the curve. If the retrieval system has better performance, the curve is as far from the origin of coordinates as possible. A larger area between the curve and the axes indicates better performance; this area is approximately equal to MAP [42]. In other words, the most common way to summarize the P-R curve in one value is MAP. MAP is the mean of the average precision (AP) scores of all queries and is computed as follows:
$$MAP = \frac{1}{|T|} \sum_{Q \in T} AP(Q)$$
where $T$ is the set of test images or queries $Q$. An advantage of MAP is that it contains both precision and recall aspects and is sensitive to the entire ranking [43].
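The following sketch shows one common way to compute AP and MAP from ranked relevance judgments. The exact AP variant is an assumption, since it is not spelled out in the text: here AP is the mean of the precision values at the ranks where a relevant image appears in the returned list. The function names and example rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant item.

    Assumption: normalization is by the number of relevant items found in
    the ranked list, which is one of several AP conventions in use.
    """
    hits = 0
    precisions = []
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the AP scores over all queries in the test set T."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0


# Two hypothetical queries; True marks a relevant image at that rank.
queries = [
    [True, False, True, False],
    [False, True, True, True],
]
ap_first = average_precision(queries[0])
map_score = mean_average_precision(queries)
```

Because AP rewards relevant images that appear early in the ranking, MAP is sensitive to the order of the entire result list and not only to how many relevant images are returned.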

Performance on Corel-1K Image Dataset
The Corel-1K image benchmark is extensively used to evaluate CBIR research. To ensure fair comparison experiments, the dataset is stochastically partitioned into training and test subsets with a ratio of 0.5:0.5. The image retrieval performance of the proposed image representation is compared with the existing state-of-the-art CBIR approaches. In order to obtain a stable estimate of performance, the mean average precision of RGSIR is evaluated by using visual vocabularies of different sizes, i.e., {50, 100, 200, 400, 600, 800}. The best image retrieval performance for Corel-1K is obtained for a vocabulary of size 600, as can be seen in Figure 9. The class-wise comparison obtained from the proposed research in terms of precision and recall is presented in Tables 1 and 2. It can be seen that the proposed approach outperforms the state-of-the-art image retrieval approaches. The proposed RGSIR provides 17.7% higher precision compared to Yu et al. [25]. Our proposed representation outperforms SWBOF [23] by {13.7%, 2.74%} in terms of average precision and recall values for the top 20 retrievals. RGSIR yields {8.23%, 1.65%} higher performance compared to [8] in terms of average retrieval precision and recall values. The proposed RGSIR results in {1.04%, 0.2%} higher precision and recall values compared to the work of Li et al. [29]. Experimental results validate the robustness of the proposed approach against the state-of-the-art retrieval methods.
The comparative analysis of the proposed research with the existing state-of-the-art verifies the effectiveness of RGSIR for image retrieval. The average precision depends on the total number of relevant images retrieved, and hence is directly proportional to the number of relevant images retrieved in response to a given query image. It is evident from the Figure that the proposed approach retrieves the highest number of relevant images for a given query image as compared to the state-of-the-art approaches. Similarly, the average recall is the ratio of the number of relevant images retrieved to the total number of relevant images of that class present in the dataset, and is therefore also directly proportional to the number of relevant images retrieved. The proposed approach outperforms the state-of-the-art methods by attaining the highest precision and recall values.
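Since both measures are proportional to the number of relevant images retrieved, they are tied together for a fixed retrieval depth. As a hedged check, assuming that recall is computed against all 100 images of a Corel-1K class and that the top 20 results are considered, with $n_{rel}$ denoting the number of relevant images among them:

```latex
P = \frac{n_{rel}}{20}, \qquad R = \frac{n_{rel}}{100}
\quad\Longrightarrow\quad R = 0.2\,P, \qquad \Delta R = 0.2\,\Delta P .
```

Under this assumption, a precision gain of 13.7% corresponds to a recall gain of 2.74%, a gain of 8.23% to about 1.65%, and a gain of 1.04% to about 0.2%, which agrees with the pairs reported for Corel-1K.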
The P-R curve obtained for the Corel-1K image benchmark is shown in Figure 10. The P-R curve demonstrates the ability of the retrieval system to retrieve relevant images from the image database in an appropriate similarity sequence. The area under the curve illustrates how effectively different methods perform in the same retrieval scenario. The results indicate that the proposed spatial features enhance the retrieval performance as compared to the state-of-the-art image retrieval approaches. The image retrieval results for the semantic classes of the Corel-1K image dataset are shown in Figures 11 and 12 (which reflect the reduction of the semantic gap). The image shown in the first row is the query image and the remaining 20 images are images retrieved by applying a similarity measure that is based on image classification score values. Here a classification label is used to determine the class of the image, while the similarity within the same class is calculated on the basis of similarity among the classification scores of images of the same class from the test dataset.
Figure 11 shows that, for a given query image, all images of the related semantic category are retrieved. In Figure 12 it can be seen that in a search based on a flower query image, an image from a different semantic category containing flowers is also displayed in the 3rd row in addition to images from the flower image category. The experimental results demonstrate that the proposed approach achieves much higher performance compared to the state-of-the-art complementary approaches [9,23,25].

Performance on Corel-1.5K Image Dataset
To further assess the effectiveness of the proposed method, experiments are conducted on the Corel-1.5K image benchmark. The image retrieval performance on the Corel-1.5K dataset is analyzed using visual vocabularies of different sizes. The optimal performance is obtained for a vocabulary size of 400. Table 3 provides a comparison of the mean average precision for the top 20 retrievals with the state-of-the-art image retrieval approaches [8,24,26].
It is evident from the table that the proposed RGSIR provides better retrieval performance than the state-of-the-art approaches, with higher retrieval precision values than those of the existing research. Experimental results demonstrate that the proposed approach provides {18.9%, 3.78%} better performance compared to the method without soft assignment, i.e., SQ + Spatiogram [24], and {8.75%, 2.77%} better performance than the probabilistic GMM + mSpatiogram [24] in terms of precision and recall, respectively. The proposed approach based on relative spatial feature extraction achieves 7.9% higher retrieval precision compared to the image retrieval based on visual words integration of SIFT and SURF [8].
Our proposed approach provides {10.25%, 2.05%} better precision and recall results compared to the late fusion based approach [26]. The experimental results demonstrate that our proposed approach significantly improves the retrieval performance compared to the state-of-the-art image retrieval techniques.

Performance on Corel-2K Image Dataset
The optimal performance for the Corel-2K image dataset is obtained for a vocabulary size of 600. Table 4 provides a comparison on Corel-2K with the state-of-the-art image retrieval approaches. It is evident that the proposed approach yields the highest retrieval accuracy. The proposed approach provides 13.68% higher mean retrieval precision compared to the second best method. Figure 13 illustrates the average precision and recall values for the top 20 image retrievals. The experimental results validate the efficacy of the proposed approach for content-based image retrieval. The image retrieval results for the semantic classes of the Corel-2K image dataset are shown in Figures 14 and 15 (which reflect the reduction of the semantic gap). The image displayed in the first row is the query image and the remaining images are the results of the top 20 retrievals selected on the basis of the image classification score displayed at the top of each image.

Image Retrieval Performance While Using Oliva and Torralba (OT-Scene) Dataset
To demonstrate the effectiveness of the proposed research, experiments are performed on the challenging OT image dataset. The best performance for the proposed research is obtained for a vocabulary size of 600. As the proposed approach has been designed on a classification-based framework, Figure 16 provides a class-wise comparison of the classification accuracy of the proposed approach with the recent state-of-the-art classification approaches [46,47]. Shrivastava et al. [46] propose a fusion of color, texture and edge descriptors to enhance the performance of image classification and report an accuracy of 86.4%. Our proposed approach outperforms SPM [48] by 3.85% and yields 1.06% higher accuracy compared to [46]. Zang et al. [47] use the Object Bank (OB) approach to construct powerful image descriptors and boost the performance of OB-based scene image classification. The best mean classification accuracy for the proposed RGSIR is 87.46%, while the accuracy reported by Zang et al. [47] is 86.5%. The proposed approach thus provides 0.96% higher accuracy compared to their work. It is observed that the performance of the proposed approach is low for the natural coast and the open country categories due to high variability in these classes. The proposed approach based on spatial features provides better performance compared to the state-of-the-art retrieval approaches. The comparison of the proposed research with existing research [8] in terms of precision is presented in Table 5. The proposed approach provides 13.17% higher accuracy compared to the second best method in the comparison. The experimental results validate the efficacy of the proposed approach for content-based image retrieval.

Performance on the RSSCN Image Dataset
To evaluate the effectiveness of the proposed approach for scene classification, experiments are conducted on the challenging high-resolution remote sensing scene image dataset. A training-test ratio of 0.5:0.5 is used for the RSSCN image dataset, as is followed in the literature [33]. The training set comprises 1400 stochastically selected images and the remaining images are used to assess the retrieval performance. The optimal retrieval performance is obtained for a visual vocabulary size of 200. As we have used a classification-based framework for image retrieval, it is important to note here that the classification accuracy for the proposed RGSIR is 81.44% and the accuracy reported by the dataset creator is 77%. Our proposed representation provides 4.44% higher accuracy compared to the deep learning technique, i.e., the DBN adopted by Zou et al. [33].
Table 6 provides a comparison of the retrieval performance on RSSCN with the state-of-the-art image retrieval approaches. We have computed MAP for the top 100 retrievals using the proposed RGSIR. Xia et al. [34] perform an extensive analysis to develop a powerful feature representation to enhance image retrieval. They consider different CNN representative models, i.e., CaffeNet [50], VGG-M [51], VGG-VD19 [52] and GoogLeNet [53], in combination with different feature extraction approaches. As our proposed approach is based on mid-level features, we have selected BoW-based aggregation methods for comparison. Mid-level features are more resilient to various transformations such as rotation, scale and illumination [34]. The proposed approach provides 16.63% higher accuracy compared to VGG-M (IFK). The proposed RGSIR outperforms GoogLeNet (BoW), VGG-VD19 (BoW) and CaffeNet (BoW) by 18.56%, 19.2% and 20.88%, respectively. It is important to note here that we have selected the RSSCN image dataset because its images are captured at varying angles and exhibit significant rotation differences. Hence the robustness of the proposed approach to rotation is also illustrated to some extent. The top 20 retrieval results for the "Forest" and "River & Lake" semantic categories of the RSSCN image dataset are shown in Figures 17 and 18.
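The MAP over the top 100 retrievals mentioned above can be sketched as follows. This is one common variant of average precision (the mean of the precision values at the ranks where a relevant image occurs, truncated at k), assumed here for illustration; the exact normalization used in the experiments is not stated in the text. The function names and the toy relevance lists are hypothetical.

```python
def average_precision_at_k(relevance, k=100):
    """Average precision over the top-k retrievals.

    relevance: list of 0/1 flags for the ranked retrievals (best first),
    where 1 means the retrieved image shares the query's semantic class.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalization: number of relevant images found in the top k.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_per_query, k=100):
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_k(rel, k) for rel in relevance_per_query]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy queries (hypothetical relevance flags).
queries = [
    [1, 1, 0, 1],  # AP = (1/1 + 2/2 + 3/4) / 3
    [0, 1, 0, 0],  # AP = (1/2) / 1
]
map_score = mean_average_precision(queries, k=100)
print(map_score)
```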

Discussion
In this paper, we have proposed an image retrieval approach based on relative geometric spatial relationships between visual words. Extensive experiments on challenging image benchmarks demonstrate that the proposed approach outperforms the concurrent and the state-of-the-art image retrieval approaches based on feature fusion and spatial feature extraction techniques [8,23,24].

Factors Affecting the Performance of the System
One of the factors affecting the retrieval performance is the size of the visual vocabulary. We have conducted experiments with visual vocabularies of different sizes to determine the optimal performance of the proposed representation, as discussed in the preceding sections. Another factor affecting the performance of the system is the ratio of the training images used to train the classifier. Figure 19 provides a comparison of different training-test ratios, i.e., 70:30, 60:40 and 50:50, for the Corel-1K image dataset. It can be seen that the performance of the system increases at higher training-test ratios. However, to be consistent with related approaches [8], 50:50 is used to report the precision and recall retrieval results for the experimental comparisons presented in Section 4.
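To make the training-test ratios concrete, the sketch below splits image indices at random for each of the three ratios. It is an illustration under stated assumptions, not the authors' protocol: the split is done per class (stratified), the helper name `split_by_ratio` and the fixed seed are hypothetical, and the toy labels assume a Corel-1K-sized collection of 10 balanced classes with 100 images each.

```python
import random
from collections import defaultdict


def split_by_ratio(labels, train_fraction, seed=0):
    """Randomly split image indices per class; returns (train_idx, test_idx)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])   # images used to train the classifier
        test.extend(indices[cut:])    # images used to assess retrieval
    return sorted(train), sorted(test)


# Assumed toy labels: 10 classes x 100 images.
labels = [c for c in range(10) for _ in range(100)]
for fraction in (0.7, 0.6, 0.5):
    train_idx, test_idx = split_by_ratio(labels, fraction)
    print(fraction, len(train_idx), len(test_idx))
```

Splitting per class keeps every semantic category equally represented in the training set, which makes results at different ratios easier to compare.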

Invariance to Basic Transformations
Spatial Pyramid Matching (SPM) [16] is the most notable contribution to incorporate spatial context into the BoVW model. SPM captures the absolute spatial distribution of visual words. However, SPM is sensitive to image transformations such as rotation, flipping and translation. For images that are not well-aligned, SPM may lose its discriminative power. An object may rotate by any angle on the image plane (rotation), it may be flipped horizontally or vertically (flipping), or the object may appear anywhere in an image (translation). The proposed approach is capable of addressing these transformations by encoding the global relative spatial orientation of visual words. This is achieved by computing the angle between identical visual word pairs with respect to the centroid of the image. Figure 20 provides an illustration to better understand our approach. The upper region, Figure 20a-c, represents the idea of histograms constructed with SPM [16], while the lower region, Figure 20d-f, demonstrates the proposed approach. Figure 20a,d show the original image, Figure 20b,e the image rotated by 90°, and Figure 20c,f the vertically flipped image. The performance of SPM [16] degrades in this case, as the objects occupy different regions in the original and transformed images. In Figure 20a the identical visual words are located in the 3rd and 4th regions, in Figure 20b they are found in the 2nd and 4th regions, while in Figure 20c they are in the 1st and 2nd regions. Hence the three histogram representations will be different for the same image. In the case of the proposed RGSIR, the same histogram representation will be generated for the original and for the transformed images, as the angle between identical visual words with respect to the centroid remains the same.
Figure 21 presents a graphical comparison of the average precision for the top 20 retrievals with the concurrent state-of-the-art approaches. Chathurani et al. [54] propose a Rotation Invariant Bag of Visual Words (RIBoW) approach to encode the spatial information using circular image decomposition in combination with a simple shifting operation using global image descriptors. They report improved performance over existing BoVW approaches. Although SPM [16] encodes the spatial information, it is sensitive to rotation, translation and scale variance of an image. The circular decomposition approach [54] partitions the image into sub-images, and the features extracted from each sub-image are used for feature representation. The proposed RGSIR provides 10.4% higher retrieval precision compared to the second best method.
Experimental results demonstrate the superiority of the proposed approach to the concurrent state-of-the-art approaches. It is important to note here that some approaches incorporate the spatial context prior to the visual vocabulary construction step, while others do so after it [9]. The proposed approach adds this information after the visual vocabulary construction step. In the future, we intend to enhance the discriminative power of the proposed approach by extracting rotation-invariant features at the feature extraction step, prior to the construction of the visual vocabulary.
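The invariance argument can be checked numerically with a small sketch. It is not the authors' implementation: it assumes that the centroid is the mean of the keypoint coordinates, that the angle is the unsigned angle subtended at the centroid by a pair of identical visual words, and that angles are quantized into a hypothetical default of 5 bins over [0°, 180°]. The function name `angle_histogram` and the toy keypoints are likewise illustrative.

```python
import math
from itertools import combinations


def angle_histogram(points, words, n_bins=5):
    """Per-word histogram of centroid angles over identical visual word pairs.

    points: list of (x, y) keypoint locations.
    words: visual word index assigned to each keypoint.
    n_bins: number of angle bins over [0, 180] degrees (assumed value).
    """
    # Assumption: the centroid is the mean of the keypoint coordinates.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    hist = {}
    for i, j in combinations(range(len(points)), 2):
        if words[i] != words[j]:
            continue  # only pairs of identical visual words contribute
        ax, ay = points[i][0] - cx, points[i][1] - cy
        bx, by = points[j][0] - cx, points[j][1] - cy
        # Unsigned angle at the centroid between the two vectors, in degrees.
        cross = abs(ax * by - ay * bx)
        dot = ax * bx + ay * by
        theta = math.degrees(math.atan2(cross, dot))
        bin_idx = min(int(theta / (180.0 / n_bins)), n_bins - 1)
        hist.setdefault(words[i], [0] * n_bins)[bin_idx] += 1
    return hist


# Toy keypoints (hypothetical): two occurrences of word 0, three of word 1.
points = [(1, 0), (4, 5), (6, 1), (2, 7), (5, 6)]
words = [0, 0, 1, 1, 1]
rotated = [(-y, x) for x, y in points]   # rotation by 90 degrees
flipped = [(x, -y) for x, y in points]   # vertical flip

original_hist = angle_histogram(points, words)
print(original_hist == angle_histogram(rotated, words))
print(original_hist == angle_histogram(flipped, words))
```

Because rotation and flipping move the centroid together with the keypoints, the angle subtended at the centroid by each pair is unchanged, so the three histograms coincide. A grid-based histogram such as SPM would instead assign the same keypoints to different cells after each transformation.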

Conclusions and Future Directions
The final feature vector for the BoVW model contains no information regarding the distribution of visual words in the 2D image space. For this reason, the performance of a computer vision application suffers, as spatial information of visual words in the histogram-based feature vector enhances the performance of image retrieval. This paper presents a novel approach to image representation that incorporates the spatial information into the inverted index of the BoVW model. The spatial information is added by calculating the global relative spatial orientation of visual words in a transformation-invariant manner. This is established by computing the geometric relationship between pairs of identical visual words with respect to the centroid of an image. The experimental results and quantitative comparisons demonstrate that our proposed representation significantly improves the retrieval performance in terms of precision and recall values. The proposed approach outperforms other concurrent methods and provides competitive performance as compared with the state-of-the-art approaches.
Furthermore, the proposed approach is not confined to the retrieval task but can be applied to other image analysis tasks, such as object detection. This is because we incorporate the invariant spatial layout information into the BoVW image representation, thereby ensuring seamless application of follow-up techniques.
In the future, we would like to enhance the discriminative power of the proposed approach by extracting rotation-invariant low-level features at the descriptor level. We intend to create a unified representation that is tolerant to all kinds of layout variances. As the proposed method has shown excellent results on five image benchmarks, we also aim to apply a pre-trained deep convolutional neural network to compute the histogram of visual words used to train the classifier on a large-scale image dataset. Combining our image representation with a complementary absolute feature extraction method and enriching it with other cues such as color and shape is another possible direction for future research.

Figure 1. Angle between identical visual word pairs with respect to the centroid. Here (a) represents the original image, (b) the image rotated by 120°, and (c) the image rotated by 180°.

Figure 2. Angle between identical visual word pairs with respect to the centroid.

Figure 3. Block diagram of the proposed work.

Figure 4. The influence of the number of bins on the performance of RGSIR.

Figure 5. Randomly selected images from each class of Corel-1K and Corel-1.5K image datasets [40].

The third dataset used to validate the efficacy of the proposed RGSIR is the Corel-2K image benchmark. Corel-2K is a subset of the Corel image dataset and comprises 2000 images classified into 20 semantic categories. Example images from this dataset are shown in Figure 6.

Figure 9. Average Precision as a function of vocabulary size.

Figure 11. Result of image retrieval for the semantic class "Dinosaurs".

Figure 12. Result of image retrieval for the semantic class "Flowers".


Figure 14. Result of image retrieval for the semantic class "Lizards".

Figure 15. Result of image retrieval for the semantic class "Antique Furniture".

Figure 16. Class-wise comparison of the proposed research with the state-of-the-art methods for the OT scene image dataset.

Figure 17. Results of image retrieval for the semantic class "Forest".

Figure 18. Results of image retrieval for the semantic class "River & Lake".

Figure 21. Average precision comparison of the proposed RGSIR with the state-of-the-art approaches for the Corel-1K image benchmark.

Table 1. Comparison of precision when using Corel-1K image dataset.

Table 2. Comparison of recall when using Corel-1K image dataset.

Table 3. Comparison of Average Retrieval Precision and Recall when using Corel-1.5K image dataset.

Table 4. Comparison of the mean average precision using Corel-2K image benchmark.

Table 5. Comparison of the mean average precision using OT-Scene image benchmark.

Table 6. Comparison of the mean average precision when using RSSCN image benchmark.