This section describes the processes that compose our RSIR system, including feature extraction, descriptor aggregation, and image matching. First, it introduces the DEep Local Feature (DELF) [4] architecture, the different attention modules utilized in this work, and the training procedure. Second, the formulation of the VLAD representation is summarized. The third part presents the image matching and retrieval process. Additionally, a query expansion strategy is introduced that uses only the global representations and requires no user input. The simplified system architecture is depicted in Figure 2.
3.1. Feature Extraction
The first step in any CBIR system is to find meaningful and descriptive representations of the visual content of images. Our system makes use of DELF, which is based on local, attentive convolutional features that achieve state-of-the-art results in difficult landmark recognition tasks [
4,
29]. The feature extraction process can be divided into two steps. First, during feature generation, dense features are extracted from images by means of a fully convolutional network trained for classification. We use the ResNet50 model [
51] and extract the features from the third convolutional block. The second step involves the selection of attention-based key points. The dense features are not used directly; instead, an attention mechanism is exploited to select the most relevant features in the image. Given the nature of RS imagery, irrelevant objects belonging to different classes commonly occur in the images. For example, an image depicting an urbanized area of medium density may also include other semantic objects like vehicles, roads, and vegetation. The attention mechanism learns which features are relevant for each class and ranks them accordingly. The output of the attention mechanism is the weighted sum of the convolutional features extracted by the network. The weights of this function are learned in a separate step after the features have been fine-tuned. Noh et al. [
4] formulate the attention training procedure as follows. A scoring function $\alpha(\mathbf{f}_n; \boldsymbol{\theta})$ should be learned per feature vector, with $\boldsymbol{\theta}$ representing the parameters of the function $\alpha(\cdot)$, $\mathbf{f}_n$ the $n$-th feature vector, and $\mathbf{f}_n \in \mathbb{R}^d$, with $d$ denoting the size of the feature vectors. Using cross-entropy as the loss, the output of the network is given by

$$\mathbf{y} = \mathbf{W} \left( \sum_{n} \alpha(\mathbf{f}_n; \boldsymbol{\theta}) \cdot \mathbf{f}_n \right), \quad (1)$$

with $\mathbf{W} \in \mathbb{R}^{M \times d}$ being the weights of the final prediction layer for $M$ classes. The authors use a simple CNN as the attention mechanism. This smaller network deploys the softplus [52] activation function to avoid learning negative weighting of the features. We denote this attention mechanism as multiplicative attention, since the final output is obtained by multiplication of the image features and their attention scores. The result of Equation (1) completely suppresses many of the features in the image if the network does not consider them sufficiently relevant. As an alternative, we also study a different attention mechanism denoted as additive attention. In the case of additive attention, the output of the network is formulated as

$$\mathbf{y} = \mathbf{W} \left( \sum_{n} \left( \mathbf{f}_n + \alpha(\mathbf{f}_n; \boldsymbol{\theta}) \cdot \mathbf{f}_n \right) \right). \quad (2)$$
While removal of irrelevant objects is desired, it is possible that for RS imagery, the multiplicative attention mechanism also removes useful information about the context of the different semantic objects. An attention mechanism that enhances the relevant features, while preserving information about their surroundings, may prove useful in the RSIR task. We therefore propose additive attention, which strengthens the salient features without neutralizing all other image features, as the multiplicative mechanism does.
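To make the difference between the two attention variants concrete, the following minimal PyTorch sketch shows how a small score network could weight a dense feature map under the two pooling schemes of Equations (1) and (2). The module name, the two-layer score CNN, the hidden width, and the class count are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each local feature and pool the dense feature map.

    mode='multiplicative': y = W (sum_n alpha(f_n) * f_n)        -- Equation (1)
    mode='additive':       y = W (sum_n f_n + alpha(f_n) * f_n)  -- Equation (2)
    """
    def __init__(self, dim=1024, num_classes=21, mode='multiplicative'):
        super().__init__()
        self.mode = mode
        # Small score CNN; softplus keeps the attention scores non-negative.
        self.score = nn.Sequential(
            nn.Conv2d(dim, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1, kernel_size=1), nn.Softplus(),
        )
        self.classifier = nn.Linear(dim, num_classes)  # prediction-layer weights W

    def forward(self, features):
        # features: dense convolutional features of shape (B, dim, H, W)
        alpha = self.score(features)                      # (B, 1, H, W) attention scores
        if self.mode == 'multiplicative':
            pooled = (alpha * features).sum(dim=(2, 3))   # suppresses low-scored features
        else:
            pooled = (features + alpha * features).sum(dim=(2, 3))  # boosts salient features
        return self.classifier(pooled), alpha             # logits for cross-entropy, scores
```

Both variants are trained with the same cross-entropy objective; only the pooling of the feature map changes.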
The network architectures for feature extraction, multiplicative attention, and additive attention modules are depicted in
Figure 3. For training, we follow a similar procedure as in [
4]. The feature extraction network is trained as a classifier using cross-entropy as the loss. The images are resized to a fixed input resolution and used for training. We do not take random crops at this stage, as the objective is to learn better features for RS images rather than feature localization and saliency. The attention modules are trained in a semi-supervised manner. In this case, the input images are randomly cropped and the crops are resized to a fixed resolution. Cross-entropy remains the loss used to train the attention. In this way, the network should learn to identify which regions and features are more relevant for specific classes. This procedure requires no additional annotations, since only the original image label is used for training. Another advantage of this architecture is the possibility of extracting multi-scale features in a single pass. By resizing the input images, it is possible to generate features at different scales. Furthermore, manual image decomposition into patches is not required. The authors of [4] observe better retrieval performance when the feature extraction network and the attention module are trained separately. The feature extraction network is trained first as in [4]. Then, the weights of that network are reused to train the attention network.
After training, the features are generated for each image by removing the last two convolutional layers. Multi-scale features are extracted by resizing the input image and passing it through the network. As in the original paper, we use 7 different scales starting at 0.25 and ending at 2.0, with a separation factor of $\sqrt{2}$. The features produced by the network are vectors of size 1024. In addition to the convolutional output, each DELF descriptor includes its location in the image, the attention score, and the scale at which the feature was found. For descriptor aggregation, we utilize only the most attentive features. We store 300 features per image, ranked by their attention score across all 7 scales. This corresponds to roughly 21% of all the extracted descriptors with a non-zero attention score. Note that the features are not equally distributed across scales. For example, on the UCMerced dataset, 53% of the features belong to the third and fourth scales (1.0 and 1.4142). The remaining features are not exploited.
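The per-image feature selection can be summarized by the short NumPy sketch below, which merges the descriptors produced at the 7 scales and keeps the 300 highest-scoring ones. The function name and the array layout are assumptions made for illustration only.

```python
import numpy as np

# Seven extraction scales from 0.25 to 2.0, separated by a factor of sqrt(2).
SCALES = [0.25 * (2 ** 0.5) ** i for i in range(7)]

def select_top_features(per_scale_feats, per_scale_scores, per_scale_locs, top_k=300):
    """Keep the top_k most attentive DELF descriptors across all scales.

    per_scale_feats:  list of (n_i, 1024) arrays, one per scale
    per_scale_scores: list of (n_i,) attention scores
    per_scale_locs:   list of (n_i, 2) keypoint locations in image coordinates
    """
    feats = np.concatenate(per_scale_feats, axis=0)
    scores = np.concatenate(per_scale_scores, axis=0)
    locs = np.concatenate(per_scale_locs, axis=0)
    scales = np.concatenate([np.full(len(f), s) for f, s in zip(per_scale_feats, SCALES)])

    order = np.argsort(scores)[::-1][:top_k]   # rank by attention score, keep the best
    return feats[order], scores[order], locs[order], scales[order]
```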
3.2. Descriptor Aggregation
DELF features are local, meaning that direct feature matching is computationally expensive. Each query image may contain a large number of local features that should be matched against an equally large number of features in each database image. A common solution to this problem in image retrieval literature is descriptor aggregation. This produces a compact, global representation of the image content. The most common aggregation method in RSIR literature is BoW. However, modern image retrieval systems apply other aggregation methods such as FK [
7], VLAD [
6] or Aggregated Selective Match Kernels [
53]. In this work, we use VLAD to produce a single vector that encodes the visual information of the images because it outperforms both BoW and FK [
6].
Similar to BoW, a visual dictionary or codebook with $k$ visual words is constructed using the k-means algorithm. The codebook $C = \{\mathbf{c}_1, \ldots, \mathbf{c}_k\}$ is used to partition the feature space and accumulate the statistics of the local features related to each cluster center. Consider the set of $d$-dimensional local features $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ belonging to an image. Per visual word $\mathbf{c}_i$, the difference between the features associated with $\mathbf{c}_i$ and the visual word is computed and accumulated. The feature space is partitioned into $k$ different Voronoi cells, producing subsets of features $X_i \subset X$, where $X_i$ denotes the features closest to the visual word $\mathbf{c}_i$. The VLAD descriptor $\mathbf{v}$ is constructed as the concatenation of $k$ components, each computed by

$$\mathbf{v}_i = \sum_{\mathbf{x} \in X_i} (\mathbf{x} - \mathbf{c}_i), \quad (3)$$

with $\mathbf{v}_i$ being the vector corresponding to the $i$-th visual word, resulting in a $(k \times d)$-dimensional vector. The final descriptor $\mathbf{v}$ is $\ell_2$-normalized. The VLAD aggregation process is visualized in Figure 4.
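As a concrete reference, the following NumPy sketch implements the aggregation of Equation (3) for a single image, assuming the codebook has already been learned; the function name and the brute-force nearest-word assignment are illustrative choices rather than our exact implementation.

```python
import numpy as np

def vlad(features, codebook):
    """Aggregate local features (N, d) into a VLAD vector following Equation (3).

    codebook: (k, d) array of visual words obtained with k-means.
    Returns an L2-normalized vector of size k * d.
    """
    k, d = codebook.shape
    # Assign every feature to its nearest visual word (Voronoi cell).
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)

    v = np.zeros((k, d))
    for i in range(k):
        members = features[assignment == i]
        if len(members) > 0:
            v[i] = (members - codebook[i]).sum(axis=0)   # accumulate residuals per cell

    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```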
An important difference between the BoW and VLAD representations is that the latter deploys visual dictionaries with relatively few visual words. In [
6], the visual dictionaries for FK and VLAD descriptors contain either 16 or 64 visual words. Meanwhile, BoW-based systems require significantly larger values of
k. In [
9,
12], 50,000 and 1500 visual words are utilized for the aggregated descriptor, respectively. The final VLAD descriptor size is dependent on both the number of visual words
k and the original descriptor dimension
d. In the case of our system, the extracted local descriptors are of size 1024. To avoid extremely large descriptors, we limit the value of k to 16 visual words, resulting in VLAD vectors of 16 × 1024 = 16,384 dimensions.
Regarding the training of our codebook, we use only a subset of the extracted DELF features. As mentioned in the previous section, only 300 features are preserved per image. However, training a codebook with a large number of high-dimensional descriptors is computationally expensive, even for low values of k. To solve this problem, we select a fraction of all available descriptors. The N most attentive DELF descriptors of every image are selected for training of the visual dictionary. This results in roughly 200K to 300K descriptors per training dataset. We provide specific values of the N most attentive descriptors in the next section. Although it is possible to train the feature extraction module on other datasets, both the codebook and feature extraction module are trained individually per dataset. Thus, we obtain the most descriptive representations possible. Once the codebook training is complete, a VLAD descriptor is generated per image. These descriptors are used in the matching module to identify visually similar images. The feature extraction and descriptor aggregation processes are performed offline for all images.
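The codebook construction described above could look as follows in a NumPy/scikit-learn sketch. The use of MiniBatchKMeans and the placeholder value of N (n_per_image) are our own illustrative assumptions; the exact values of N per dataset are reported in the next section.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_codebook(per_image_descriptors, per_image_scores, n_per_image=50, k=16, seed=0):
    """Learn a k-word visual dictionary from the N most attentive descriptors per image.

    per_image_descriptors: list of (300, 1024) arrays of stored DELF features
    per_image_scores:      list of (300,) attention scores
    """
    selected = []
    for feats, scores in zip(per_image_descriptors, per_image_scores):
        top = np.argsort(scores)[::-1][:n_per_image]   # N most attentive descriptors
        selected.append(feats[top])
    training_set = np.concatenate(selected, axis=0)    # roughly 200K-300K descriptors

    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed)
    kmeans.fit(training_set)
    return kmeans.cluster_centers_                     # (k, 1024) codebook
```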
3.3. Image Matching and Query Expansion
Once each database image has a VLAD representation, it is possible to perform CBIR. Given a query image, we compute a similarity metric between its descriptor and the descriptor of every database image. The database images are ranked by this metric and the ones most similar to the query image are returned to the user. The similarity is determined by the Euclidean distance between the query and database vectors, where a smaller distance indicates a more similar image.
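A minimal NumPy sketch of this ranking step is given below; the function name and the number of returned candidates are arbitrary choices for illustration.

```python
import numpy as np

def retrieve(query_vlad, db_vlads, top_n=20):
    """Rank database images by Euclidean distance to the query VLAD descriptor.

    query_vlad: (D,) global descriptor of the query image
    db_vlads:   (N, D) matrix with one VLAD descriptor per database image
    Returns the indices of the top_n most similar images (smallest distance first).
    """
    dists = np.linalg.norm(db_vlads - query_vlad[None, :], axis=1)
    return np.argsort(dists)[:top_n]
```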
As mentioned previously, conventional image retrieval uses specialized techniques to improve the quality of a candidate set, two popular techniques being geometric verification and query expansion. Geometric verification is commonly performed when the positions of local features are available, and consists of matching the features found in the query image against those of the retrieved set. However, geometric verification is difficult in RS imagery, due to the significant appearance variations present in Land Use/Land Cover datasets. In contrast, geometric verification is possible in traditional image retrieval, because transformations can be found between the local features of the query and those of the candidate images. For example, consider images of the same object or scene taken from different views or with partial occlusions. In these cases, a number of local descriptors will be successfully matched between the query and retrieved images. In the datasets used in this work, images belonging to the same semantic class were acquired at different geographical locations. Consequently, geometric verification becomes infeasible, since the features do not belong to the same objects. Furthermore, geometric verification is computationally expensive, so only a few candidates can be evaluated in this manner. The other technique, query expansion, enhances the features of the query with the features of correct candidates. In image retrieval, this is commonly done by expanding the visual words of the query with the visual words of correct matches and then repeating the query. However, good matches are generally identified using geometric verification, which, as mentioned earlier, is difficult to perform in RS imagery. Nevertheless, we propose a query expansion method for our system that does not rely on geometric verification and has low complexity.
After retrieval, we select the top-three candidate images and combine their descriptors to improve the image matching process. We assume that the top-three images consistently provide the visually closest matches and the most relevant features for query expansion. This procedure would normally require access to the local descriptors of the images to be aggregated. However, this implies additional disk access operations, since local features are not loaded into memory during retrieval. To avoid this, we exploit the properties of VLAD descriptors and use Memory Vector (MV) construction to generate the expanded query descriptor. MV construction is a global descriptor aggregation method, which produces a single group descriptor. Unlike visual words, these representations have no semantic meaning. A further difference between VLAD and MV is that the latter does not require learning a visual dictionary.
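The complete expansion step can be sketched as follows, assuming the p-sum construction introduced below and Euclidean-distance ranking; whether the query descriptor itself is stacked together with the three candidates, as well as the re-normalization of the expanded vector, are illustrative choices.

```python
import numpy as np

def expand_and_requery(query_vlad, db_vlads, top_n=20):
    """Automatic query expansion with the top-three retrieved images.

    The query and its three nearest database descriptors are fused into a
    p-sum memory vector, which is then used to re-rank the whole database.
    """
    # Initial ranking by Euclidean distance.
    dists = np.linalg.norm(db_vlads - query_vlad[None, :], axis=1)
    top3 = np.argsort(dists)[:3]

    # Rows of the descriptor matrix: the query plus its top-three candidates.
    X = np.vstack([query_vlad, db_vlads[top3]])

    # p-sum memory vector: plain sum of the global representations.
    expanded = X.sum(axis=0)
    expanded /= np.linalg.norm(expanded)

    # Re-query with the expanded descriptor.
    new_dists = np.linalg.norm(db_vlads - expanded[None, :], axis=1)
    return np.argsort(new_dists)[:top_n]
```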
Let us now consider how various VLAD global descriptors can be combined into an MV representation. The purpose is to illustrate how the MV representation can prevent accessing the local descriptors of the images that expand the query. Iscen et al. [8] have proposed two different methods for the creation of MV representations. The first is denoted p-sum and is defined as

$$\mathbf{m}_{\text{sum}}(X) = X^{T} \mathbf{1}_m, \quad (4)$$

with $X$ being an $m \times n$ matrix of global representations to be aggregated, where $m$ is the number of global representations, $n$ is their dimensionality, and $\mathbf{1}_m$ is a vector of $m$ ones. Although naive, this construction is helpful when used with VLAD descriptors. It should be borne in mind that each VLAD descriptor is constructed as the concatenation of $k$ vectors generated using Equation (3). Assume that $\mathbf{v}_a$ and $\mathbf{v}_b$ are the VLAD representations of two different images, and also the row elements of the descriptor matrix $X$. The p-sum vector is then given by

$$\mathbf{m}_{\text{sum}}(X) = \left[ \mathbf{v}_{a,1} + \mathbf{v}_{b,1}, \ldots, \mathbf{v}_{a,k} + \mathbf{v}_{b,k} \right], \quad (5)$$

with $\mathbf{v}_{a,i}$, $\mathbf{v}_{b,i}$ the individual VLAD vector components of $\mathbf{v}_a$ and $\mathbf{v}_b$, respectively. From Equation (5) we observe that the p-sum representation is an approximation of the VLAD vector of the features contained in both images. Instead of generating a VLAD vector from the union of the features belonging to the query and its top-three matches, we can produce an approximated representation by exploiting MV construction. The results of direct VLAD aggregation and p-sum aggregation are not identical, because the values used in the latter have been $\ell_2$-normalized. However, this approximation allows us to use only the global representations of the images. The second construction method, denoted p-inv, is given by

$$\mathbf{m}_{\text{pinv}}(X) = X^{+} \mathbf{1}_m, \quad (6)$$

where $X^{+}$ is the Moore-Penrose pseudo-inverse [54] of the descriptor matrix $X$. This construction down-weights the contributions of repetitive components in the vectors. In this work, we perform query expansion with both MV construction methods and compare their performance.
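Both constructions reduce to a few lines of NumPy, as in the sketch below; the function name is ours, and the matrix X is assumed to store one global descriptor per row, as in Equations (4) and (6).

```python
import numpy as np

def memory_vector(X, method='psum'):
    """Build a memory vector from an (m, n) matrix X of global descriptors.

    'psum': m(X) = X^T 1_m, the plain sum of the m descriptors, Equation (4).
    'pinv': m(X) = X^+ 1_m, with X^+ the Moore-Penrose pseudo-inverse, Equation (6);
            this down-weights components shared by many of the descriptors.
    """
    ones = np.ones(X.shape[0])
    if method == 'psum':
        return X.T @ ones
    if method == 'pinv':
        return np.linalg.pinv(X) @ ones
    raise ValueError(f'unknown construction method: {method}')
```

For query expansion, X would contain the VLAD descriptors of the query and its top-three matches, and the resulting memory vector replaces the original query descriptor in the ranking step.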