Article

A Sketch-Based Cross-Modal Retrieval Model for Building Localization Without Satellite Signals

by Haihua Du 1, Jiawei Fan 2, Yitao Huang 3, Longyang Lin 4 and Jiuchao Qian 2,*
1 Department of Science and Technology Development, Guangzhou Shiyuan Electronic Technology Co., Ltd., Guangzhou 510700, China
2 School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai 201100, China
3 General Manager Office, Guangzhou Kindlink Intelligent Technology Co., Ltd., Guangzhou 510663, China
4 School of Microelectronics, Southern University of Science and Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3936; https://doi.org/10.3390/electronics14193936
Submission received: 7 September 2025 / Revised: 27 September 2025 / Accepted: 28 September 2025 / Published: 4 October 2025
(This article belongs to the Special Issue Recent Advances in Autonomous Localization and Navigation System)

Abstract

In existing non-satellite navigation systems, visual localization is widely adopted for its high precision. However, in scenarios with highly similar building structures, traditional visual localization methods that rely on direct coordinate prediction often suffer from decreased accuracy or even failure. Moreover, as scene complexity increases, their robustness tends to decline. To address these challenges, this paper proposes a Sketch Line Information Consistency Generation (SLIC) model for indirect building localization. Instead of regressing geographic coordinates, the model retrieves candidate building images that correspond to hand-drawn sketches, and these retrieved results serve as proxies for localization in satellite-denied environments. Within the model, the Line-Attention Block and Relation Block are designed to extract fine-grained line features and structural correlations, thereby improving retrieval accuracy. Experiments on multiple architectural datasets demonstrate that the proposed approach achieves high precision and robustness, with mAP@2 values ranging from 0.87 to 1.00, providing a practical alternative to conventional coordinate-based localization methods.

1. Introduction

Localization plays an important role in fields ranging from transportation and outdoor exploration to agriculture and the military, covering a wide range of applications [1,2,3,4,5,6,7,8]. It enables accurate navigation in diverse environments, improving efficiency, safety, and productivity. Moreover, as a crucial component of localization technology, semantic localization considers not only the user's intended destination but also the surrounding context and semantic information, offering a more comprehensive and user-centric experience [9,10,11,12,13,14,15,16]. Compared with traditional localization systems, semantic localization is more intelligent and delivers a significantly enhanced, personalized experience with greater accuracy.
Among the forms of semantic information used for localization, sketches are a crucial component. Sketches are simplified, rapid, unfinished drawings designed to capture the main features of ideas, concepts, or observations, emphasizing personal expression. Consequently, sketches inherently contain a wealth of high-level semantic information. With the wide use of touch-screen devices such as mobile phones and tablets, recording information in the form of sketches is effortless and convenient. Therefore, by combining sketches with localization, i.e., using simple strokes to depict the target location and matching them with the real scene, positioning and navigation can be accomplished effectively. Furthermore, even when satellite signals are unavailable or unreliable, building a sketch-location database for a specific area allows the localization and navigation task to be completed at lower cost and with higher stability [17].
Meanwhile, sketch-based image retrieval (SBIR) [18,19,20,21,22,23]—matching natural photos with free-hand sketches—has attracted considerable interest. Various SBIR architectures have evolved from several foundational concepts, including multiple independent networks [24], multi-layer networks [25], and semi-heterogeneous networks [26]. However, these networks target a broad range of object categories, mostly animals and everyday objects, so their accuracy often falls short when retrieval is oriented toward architectural objects. To address this problem, this paper proposes a Sketch Line Information Consistency Generation model, which uses an attention mechanism to extract fine texture features such as lines from both sketches and buildings.
Therefore, we propose a novel semantic localization method based on line information. This method accomplishes localization tasks by matching hand-drawn sketches with actual building groups, as shown in Figure 1. The fundamental principle of this localization method revolves around achieving alignment between the image and the sketch. The primary challenges lie in the cross-domain matching problem and the issue of high similarity features. To address these challenges, we employ a two-branch structured generative adversarial network in the overall network architecture. This network aims to discover two types of mapping functions simultaneously, mapping the original image and sketch into a shared semantic space with matching distributions, thus enhancing the matching process. Furthermore, the second challenge involves the precision of locating highly similar buildings. Although the two mapping functions ensure a shared space, the distance between feature vectors from different categories may not be well preserved, leading to confusion in locating similar buildings. To tackle this issue, we introduce the L-A and R-A modules within the neural network’s training process. These modules utilize the distinctive line information in buildings as the foundation for feature vector convergence, maximizing the differentiation between different categories. In addition, we leverage Gradient-Weighted Class Activation Mapping (Grad-CAM) [27] to visualize and analyze class-discriminative regions, which provides guidelines for abstracting architectural sketches into effective line representations. This integration helps explain why line features are emphasized in our design and further justifies the role of sketches in cross-modal localization. Lastly, we consider the selection of sketch datasets for training the network. The dataset is curated from individuals without a drawing background, simulating architectural outline information obtained through recall, thus enhancing the model’s universality.
In summary, the main contributions of this paper can be concluded as follows.
  • A semantic location method based on sketch retrieval is introduced as an innovative approach for achieving precise positioning and navigation, without relying on satellite technology. The method involves matching hand-drawn architectural sketches created by individuals without any prior art skills to real-world buildings. The result is a remarkably robust and effective location-based navigation system, which is validated through retrieval accuracy experiments on multiple architectural datasets in Section 3.4.
  • To derive a set of universal sketching principles, sketch-original datasets encompassing various architectural styles were constructed, and corresponding sketching experiments were carried out. The Grad-CAM method is used to visualize the position of key lines and further simplify the way sketches are drawn. Ultimately, three fundamental principles for retaining essential lines were formulated based on distinctive line characteristics, and their impact on retrieval accuracy is demonstrated in the sketch simplification experiments of Section 3.4.
  • In the context of aligning sketches with their corresponding original drawings, a novel approach involves the use of a two-branch generative network that relies on line information. Two types of modules, the L-A Block and the R-A Block, are proposed to address the matching problem of buildings with similar height. The former extracts the line features and content information of the building, while the latter deeply associates this information by exploiting the characteristics of the relation network. The effectiveness of these modules is confirmed by the ablation studies in Section 3.5.
The rest of this article is organized as follows. Section 2 presents the proposed system framework and details the sketch line information consistency generative model. Section 3 reports the experimental results on multiple architectural datasets. Finally, Section 4 concludes the paper.

2. Materials and Methods

In this work, we introduce the SLIC model for sketch-based building localization. The main framework is built upon a generative network based on CycleGAN, aiming to establish a shared semantic space for retrieval and matching. Simultaneously, an auxiliary framework combines attention mechanisms to align the original sketches with the common semantic space, effectively separating highly similar buildings. The complete architecture of our model is illustrated in Figure 2.
The SLIC model is a generative model that effectively aligns the line information of original buildings and sketch buildings while ensuring consistency in spatial features during the alignment process. It begins by establishing a common semantic space for mapping architectural sketches and original images. The auxiliary modules in SLIC, namely the L-A module and the R-A module, play crucial roles: The L-A module is utilized for extracting line features specific to each type of architectural standard. These features are instrumental in executing alignment operations during mapping, thus accomplishing the preliminary task of building positioning. Meanwhile, the R-A module delves deeper into the correlation of line features, particularly when dealing with highly similar buildings. Its purpose is to achieve high-precision positioning by refining the alignment process, ensuring accurate mappings even in cases of architectural similarity.
We define $D = \{X, Y\}$ as the dataset comprising sketches and images belonging to known categories. The dataset includes sketch images $X = \{x_i\}_{i=1}^{N}$ and natural images $Y = \{y_i\}_{i=1}^{N}$, where $N$ represents the number of categories in the sketch-image datasets. It is worth noting that sketches and images share the same labeling convention, simplifying the association of corresponding pairs. The auxiliary information $L$ is defined as the output of the line attention network. This information effectively directs the attention of the generation network towards texture details, enhancing matching accuracy, particularly in scenarios where real image texture information is faint. The primary objective of our model is to learn two functions, denoted as $G_s(\cdot)$ and $G_i(\cdot)$, for mapping sketches and natural images into a shared semantic space. When provided with a sketch and an image from distinct domains, the proposed functions $G_s: \mathbb{R}^d \to \mathbb{R}^M$ and $G_i: \mathbb{R}^d \to \mathbb{R}^M$, respectively, map them to the same semantic space where retrieval and matching tasks can be efficiently performed. Following the mapping process described above, we obtain features ($I_{ie}$ and $I_{se}$) from both the sketch and original images. These features are then fed into the R-A module for similarity matching, which ultimately yields the final positioning results.
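For illustration, the following minimal PyTorch sketch shows the interface of $G_s$ and $G_i$ as encoders into a shared $M$-dimensional space; the layer widths and the embedding size are illustrative assumptions and do not reflect the exact backbone configuration used in our experiments.

import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    # Encodes an input image or sketch into the shared M-dimensional semantic space.
    def __init__(self, in_channels: int = 3, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)  # projection into the shared space R^M

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x).flatten(1)
        return self.fc(h)

G_s = SemanticEncoder()  # maps sketches X into the shared semantic space
G_i = SemanticEncoder()  # maps natural images Y into the shared semantic space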

2.1. Cycle-Consistent Generative Model

When training the depth mapping functions for sketches and images, we adopt a cycle-consistent generative framework. Each branch in the framework is aligned with a common discriminator. The cycle consistency not only ensures that the sketch and image are mapped to the semantic space but also ensures that spatial locations remain consistent when mapped back to the original space. Therefore, only an additional loss on the category information is needed to generate discriminative features.
The unique semantic information of architecture lies in its rich texture; whether for a sketch or an original photograph, contour and line information should be the main focus. Our main goal is to enable the two mapping functions to learn this texture information separately and transfer it to the common semantic space. However, different from Zhu et al. [28], who adopt two types of images in different domains to achieve cyclic consistency, we use the line feature information in the common semantic space as an intermediary to achieve cyclic consistency for sketches and images, respectively. By training $G_s(\cdot)$, the architectural sketch $X$ is mapped to $L$ so that $\hat{l}_i = G_s(x_i)$, where $l_i \in L$ is the line information used to complete the adversarial training. Thereby, the trained $G_s(\cdot)$ is able to transform the modality $X$ into a modality $\hat{L}$ that satisfies the distribution of $L$.
For the adversarial loss, there exist three classes of adversarial loss regarding the sketch, the image, and the semantics. We have designed image and sketch generators $G_s: X \to L$, $G_i: Y \to L$, $F_s: L \to X$, and $F_i: L \to Y$, and three corresponding adversarial discriminators $D_s(\cdot)$, $D_i(\cdot)$, and $D_{se}(\cdot)$, where $D_{se}$ is used to distinguish the semantic information of the original lines $\{l\}$ from the semantic information of the sketches $\{G_s(x)\}$ and the images $\{G_i(y)\}$. $D_s$ is used to distinguish the original sketch information $\{x\}$ from the sketch information transformed back from the semantic space $\{F_s(l)\}$. Similarly, $D_i$ is used to distinguish $\{y\}$ from $\{F_i(l)\}$. For $G_s$ and $G_i$, the objective function is as follows:
\mathcal{L}_{ad}(G_s, G_i, D_{se}, x, y, l) = 2\,\mathbb{E}[\log D_{se}(l)] + \mathbb{E}[\log(1 - D_{se}(G_s(x)))] + \mathbb{E}[\log(1 - D_{se}(G_i(y)))]
Among them, $G_s$ and $G_i$ intend to minimize the objective function, while $D_{se}$ tries to maximize it. Similarly, for $F_s$ and $D_s$, the objective function is as follows:
\mathcal{L}_{ad}(F_s, D_s, x, l) = \mathbb{E}[\log D_s(x)] + \mathbb{E}[\log(1 - D_s(F_s(l)))]
$F_s$ minimizes the objective function, while the discriminator $D_s$ intends to maximize it. The same holds for $F_i$ and $D_i$:
\mathcal{L}_{ad}(F_i, D_i, y, l) = \mathbb{E}[\log D_i(y)] + \mathbb{E}[\log(1 - D_i(F_i(l)))]
For the cycle consistency loss, while adversarial learning can help reduce the gap between the sketch domain and image domain, it does not guarantee the alignment of information in terms of position when they are transformed into the common semantic space. To address this concern, we employ the cycle consistency approach. This approach maps both the sketch and the image to the corresponding common semantic space and then transforms them back to their original feature spaces, ensuring spatial alignment.
Specifically, the two transformations $G_s: X \to L$ and $F_s: L \to X$ are trained to be inverse operations of each other. The objective function is as follows:
\mathcal{L}_{cy}(G_s, F_s) = \mathbb{E}\big[\|F_s(G_s(x)) - x\|_1\big] + \mathbb{E}\big[\|G_s(F_s(l)) - l\|_1\big]
Similarly, we apply the cycle-consistency loss to the mappings $G_i: Y \to L$ and $F_i: L \to Y$:
\mathcal{L}_{cy}(G_i, F_i) = \mathbb{E}\big[\|F_i(G_i(y)) - y\|_1\big] + \mathbb{E}\big[\|G_i(F_i(l)) - l\|_1\big]
For the classification loss, the adversarial loss and cycle consistency loss only ensure that corresponding sketches and images are closer to each other in the new domain and retain certain position information. However, they do not guarantee that the trained generators have category-level discrimination. Therefore, we add a classification loss to the features generated by the generator:
\mathcal{L}_{cl}(G_s) = -\mathbb{E}\big[\log P(c_a \mid G_s(x); \theta)\big]
where $c_a$ stands for the class label. Similarly, the classification loss is added to the generator $G_i$ as well:
\mathcal{L}_{cl}(G_i) = -\mathbb{E}\big[\log P(c_a \mid G_i(y); \theta)\big]
Through the loss setting of the above cyclic architecture, two types of features, $I_{ie}$ and $I_{se}$, can be obtained for matching. Further feature extraction is carried out by the subsequent R-A network to achieve accurate matching of buildings with high similarity.
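As a concrete reference for the loss terms above, the following hedged PyTorch sketch computes the adversarial, cycle-consistency, and classification losses. The generator, discriminator, and classifier modules are passed in as arguments and are assumptions about the surrounding training code (each discriminator is assumed to end in a sigmoid); this is not the authors' released implementation.

import torch
import torch.nn.functional as F

def adversarial_losses(G_s, G_i, F_s, F_i, D_se, D_s, D_i, x, y, l, eps=1e-8):
    # Semantic-space discriminator against both generators (first objective).
    l_ad_se = (2 * torch.log(D_se(l) + eps)
               + torch.log(1 - D_se(G_s(x)) + eps)
               + torch.log(1 - D_se(G_i(y)) + eps)).mean()
    # Sketch- and image-domain discriminators against the inverse mappings.
    l_ad_sk = (torch.log(D_s(x) + eps) + torch.log(1 - D_s(F_s(l)) + eps)).mean()
    l_ad_im = (torch.log(D_i(y) + eps) + torch.log(1 - D_i(F_i(l)) + eps)).mean()
    return l_ad_se, l_ad_sk, l_ad_im

def cycle_losses(G_s, G_i, F_s, F_i, x, y, l):
    # L1 reconstruction after a forward and a backward mapping through L.
    l_cy_s = F.l1_loss(F_s(G_s(x)), x) + F.l1_loss(G_s(F_s(l)), l)
    l_cy_i = F.l1_loss(F_i(G_i(y)), y) + F.l1_loss(G_i(F_i(l)), l)
    return l_cy_s, l_cy_i

def classification_losses(G_s, G_i, cls_head, x, y, labels):
    # Cross-entropy (negative log-likelihood) on the generated semantic features.
    return (F.cross_entropy(cls_head(G_s(x)), labels)
            + F.cross_entropy(cls_head(G_i(y)), labels))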

2.2. L-A Block and R-A Block

The proposed cyclic consistency framework can effectively align the two heterogeneous modalities while preserving spatial information. However, its performance can degrade when buildings are poorly separated from complex backgrounds (e.g., nighttime scenes) or when buildings are highly similar to one another (e.g., similar colors or outline profiles). Therefore, we design a network module that enhances the distinctive line features of buildings. The module consists of two stages: the first stage (L-A Block) extracts standard features with sketch-like style and image content, which determine the alignment criteria for the common space; the second stage (R-A Block) deeply explores the correlations of features in the common space, which is used to distinguish similar types of buildings.

2.2.1. Line-Attention Block

The L-A Block draws inspiration from the ideas presented in SANET [29], characterized by its ability to take both a content image and a style image as input, generating a stylized image that combines semantic structures from the former with characteristics from the latter. In our approach, we utilize the pre-trained VGG-16 network to extract style features from both the sketch and the original image. As shown in Figure 3, the output from different VGG layers serves as the basis for extracting angle and length features, and an attention mechanism is used to efficiently extract line information, even in complex environments. The network is instrumental in improving the model’s ability to handle challenging scenarios where building lines are less distinct from their surroundings.
The fourth and fifth convolutional blocks of the VGG-16 network are selected as the main feature extractors for length and angle information and are denoted $VGG_4$ and $VGG_5$, respectively. The sketch is defined as the style image and the original photograph as the content image; both are fed into the two feature extractors, and the features obtained from the different layers are denoted as follows.
F_x^4 = VGG_4(x), \quad F_y^4 = VGG_4(y), \qquad F_x^5 = VGG_5(x), \quad F_y^5 = VGG_5(y)
After extracting feature information from both types of images, the primary objective is to calculate the weight matrix within the attention mechanism. With a focus on the straightforward line style of the sketch and an emphasis on content extraction from the original image, the following operations are conducted across different layers.
For the extraction of length information, the relatively shallow fourth layer is employed. The attention weight matrix is derived through matrix multiplication of the sketch and original image features. Subsequently, the weight matrix is multiplied with the original image features to obtain features that emphasize length information ($L_{len}$).
L_{len} = \frac{1}{C(F^4)} \sum_j \exp\!\big(f(\bar{F}_x^4)^{\mathsf{T}} g(\bar{F}_y^4)_j\big)\, h\big((F_y^4)_j\big)
where $f(\cdot)$ denotes $W F$, with $W$ the learned weight matrix implemented as a $1 \times 1$ convolution. Meanwhile, $\bar{F}$ denotes a mean-variance channel-normalized version of $F$, and $C(F^4) = \sum_j \exp\!\big(f(\bar{F}_x^4)^{\mathsf{T}} g(\bar{F}_y^4)_j\big)$.
Likewise, the angle information ($L_{ang}$) can be obtained as follows.
L_{ang} = \frac{1}{C(F^5)} \sum_j \exp\!\big(f(\bar{F}_x^5)^{\mathsf{T}} g(\bar{F}_y^5)_j\big)\, h\big((F_y^5)_j\big)
We combine the above two output features as
L_{emb} = \mathrm{conv}_{3\times 3}\big(L_{len} + \mathrm{upsampling}(L_{ang})\big)
Finally, the stylized line information feature $L$ is synthesized by feeding $L_{emb}$ into the decoder:
L = \mathrm{Decoder}(L_{emb})
After obtaining both types of line information, the synthesized feature is integrated into the sketch–image matching model and serves as the benchmark for optimizing matching performance.
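A minimal PyTorch sketch of one attention stage of the L-A Block is shown below. It assumes SANet-style $1 \times 1$ projections $f$, $g$, and $h$, mean-variance normalization, and a softmax that plays the role of the normalization term $C(F)$; the channel counts and the fusion layer are illustrative assumptions rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_var_norm(feat, eps=1e-5):
    # Mean-variance normalization of a (B, C, H, W) feature map per channel.
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    return (feat - mean) / std

class LineAttention(nn.Module):
    # One attention stage of the L-A Block: f, g, h are the 1x1 convolutions
    # mentioned above; the softmax normalizes the exponentiated correlations.
    def __init__(self, channels: int = 512):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 1)
        self.g = nn.Conv2d(channels, channels, 1)
        self.h = nn.Conv2d(channels, channels, 1)

    def forward(self, F_sketch, F_image):
        B, C, H, W = F_image.shape
        q = self.f(mean_var_norm(F_sketch)).flatten(2)   # (B, C, HW) sketch query
        k = self.g(mean_var_norm(F_image)).flatten(2)    # (B, C, HW) image key
        v = self.h(F_image).flatten(2)                   # (B, C, HW) image value
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, HW, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(B, C, H, W)
        return out

# Fusing the two stages: the deeper (angle) map is upsampled to the shallower
# (length) resolution and combined by a 3x3 convolution, as described above.
fuse = nn.Conv2d(512, 512, 3, padding=1)

def fuse_line_features(L_len, L_ang):
    return fuse(L_len + F.interpolate(L_ang, size=L_len.shape[-2:], mode="nearest"))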

2.2.2. Relation Block

The task of building localization based on sketches can be preliminarily accomplished using the modules described above. However, when dealing with buildings that are highly similar in appearance, such as those with rectangular shapes and comparable color schemes, there is a risk of confusion and misidentification. To overcome this challenge, a deep information relation module is proposed, designed to extract more refined and distinctive architectural relations from the data.
Inspired by the relational network approach proposed in [30], we incorporate a relational network module after extracting the standard feature vectors to better distinguish between similar architectural features. In this work, $C(\cdot, \cdot)$ denotes the depth-wise concatenation of two standard features. The relational network $R_\psi$ consists of two fully connected (FC) layers, each followed by a ReLU activation function and a dropout layer. This configuration produces a matching score in the range $(0, 1)$, which represents the degree of similarity between the input features.
R(I_{ie}, I_{se}) = \mathrm{sigmoid}\big(R_\psi(C(I_{ie}, I_{se}))\big)
This framework differs from the correlation of global image features used in the original literature [30]. The previous approach consumes a higher amount of computational resources, whereas the proposed method utilizes deep information mining and reasoning on the extracted standard features to achieve more efficient retrieval in cases where computation is limited.
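The R-A Block can be sketched in PyTorch as follows; the hidden width, dropout rate, and the placement of the final sigmoid are illustrative assumptions consistent with the description above.

import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    # Two FC layers with ReLU and dropout, followed by a sigmoid that squashes
    # the output to a (0, 1) matching score.
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, I_ie: torch.Tensor, I_se: torch.Tensor) -> torch.Tensor:
        score = self.net(torch.cat([I_ie, I_se], dim=-1))  # depth-wise concatenation
        return torch.sigmoid(score)                        # similarity in (0, 1)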

2.3. Overall Cost Function

To effectively align sketch and image modalities while ensuring spatial consistency, category discrimination, and robust cross-modal retrieval, we design a composite cost function that integrates adversarial alignment, cycle consistency, category classification, and relation-based retrieval losses. These components jointly minimize the discrepancy between the model predictions and the ground-truth targets at multiple levels of representation. Specifically, to reduce the distribution gap between sketches, images, and the shared semantic line-space $L$, we employ adversarial training with generators $G_s: X \to L$ and $G_i: Y \to L$, inverse mappings $F_s$ and $F_i$, and discriminators $D_{se}$, $D_s$, and $D_i$ for the semantic, sketch, and image domains, respectively. The adversarial objective is defined as
\mathcal{L}_{ad} = \mathcal{L}_{ad}^{se} + \mathcal{L}_{ad}^{sk} + \mathcal{L}_{ad}^{im},
where
\mathcal{L}_{ad}^{se} = \mathbb{E}_{\ell \sim p(L)}[\log D_{se}(\ell)] + \mathbb{E}_{x \sim p(X)}[\log(1 - D_{se}(G_s(x)))] + \mathbb{E}_{y \sim p(Y)}[\log(1 - D_{se}(G_i(y)))],
\mathcal{L}_{ad}^{sk} = \mathbb{E}_{x}[\log D_s(x)] + \mathbb{E}_{\ell}[\log(1 - D_s(F_s(\ell)))],
\mathcal{L}_{ad}^{im} = \mathbb{E}_{y}[\log D_i(y)] + \mathbb{E}_{\ell}[\log(1 - D_i(F_i(\ell)))].
This adversarial loss encourages the generated distributions from $G_s$ and $G_i$ to become indistinguishable from the real distributions in their respective domains. To further ensure that the mappings preserve spatial and structural information, we incorporate an $L_1$ cycle consistency loss that reconstructs sketches and images after forward and backward mappings:
\mathcal{L}_{cy} = \mathbb{E}_{x}\|F_s(G_s(x)) - x\|_1 + \mathbb{E}_{\ell}\|G_s(F_s(\ell)) - \ell\|_1 + \mathbb{E}_{y}\|F_i(G_i(y)) - y\|_1 + \mathbb{E}_{\ell}\|G_i(F_i(\ell)) - \ell\|_1.
This ensures semantic alignment without introducing spatial distortion by enforcing invertibility between mappings. Additionally, to retain category-level discriminative information in the semantic space, we apply a standard cross-entropy classification loss to both sketch and image features:
\mathcal{L}_{cl} = -\mathbb{E}_{(x,c)}\log P(c \mid G_s(x)) - \mathbb{E}_{(y,c)}\log P(c \mid G_i(y)),
where $P(c \mid \cdot)$ is the predicted probability of class $c$. For sketch–image retrieval, we use a relation head $R(\cdot, \cdot)$ that outputs a similarity score $s \in (0, 1)$ between sketch features $I_{se}$ and image features $I_{ie}$ and train it with binary cross-entropy against the ground-truth pair label $z \in \{0, 1\}$:
\mathcal{L}_{rel} = -\mathbb{E}\big[z \log s + (1 - z)\log(1 - s)\big].
Finally, the overall cost function is expressed as
\mathcal{L}_{total} = \lambda_{ad}\mathcal{L}_{ad} + \lambda_{cy}\mathcal{L}_{cy} + \lambda_{cl}\mathcal{L}_{cl} + \lambda_{rel}\mathcal{L}_{rel},
where $\lambda_{ad}$, $\lambda_{cy}$, $\lambda_{cl}$, and $\lambda_{rel}$ are hyperparameters that balance the contributions of each loss component. During training, the total loss is minimized with respect to the generators, classifiers, and relation network, while the adversarial components are maximized with respect to the discriminators, following a standard GAN min–max optimization scheme. This formulation explicitly connects the differences between ground-truth and predicted outputs at the distributional, reconstruction, categorical, and pairwise retrieval levels within a unified optimization objective.
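The min–max scheme can be organized as one alternating optimization step, sketched below. It reuses the encoder and loss helpers sketched in the previous subsections; the loss weights, optimizer choice, and learning rates are illustrative assumptions rather than the settings used in the experiments.

import torch

# G_s, G_i, F_s, F_i, D_se, D_s, D_i, cls_head, and relation are assumed to be
# modules analogous to those sketched above; the *_losses helpers are from the
# sketch after Section 2.1.
lambda_ad, lambda_cy, lambda_cl, lambda_rel = 1.0, 10.0, 1.0, 1.0

gen_params = (list(G_s.parameters()) + list(G_i.parameters())
              + list(F_s.parameters()) + list(F_i.parameters())
              + list(cls_head.parameters()) + list(relation.parameters()))
disc_params = list(D_se.parameters()) + list(D_s.parameters()) + list(D_i.parameters())
opt_g = torch.optim.Adam(gen_params, lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc_params, lr=2e-4, betas=(0.5, 0.999))

def train_step(x, y, l, labels, pair_labels):
    # 1) Generators, classifier, and relation head minimize L_total.
    l_ad = sum(adversarial_losses(G_s, G_i, F_s, F_i, D_se, D_s, D_i, x, y, l))
    l_cy = sum(cycle_losses(G_s, G_i, F_s, F_i, x, y, l))
    l_cl = classification_losses(G_s, G_i, cls_head, x, y, labels)
    s = relation(G_i(y), G_s(x)).squeeze(-1)
    l_rel = torch.nn.functional.binary_cross_entropy(s, pair_labels)
    l_total = lambda_ad * l_ad + lambda_cy * l_cy + lambda_cl * l_cl + lambda_rel * l_rel
    opt_g.zero_grad(); l_total.backward(); opt_g.step()
    # 2) Discriminators maximize the adversarial objective, i.e., minimize its negation.
    l_ad_d = -sum(adversarial_losses(G_s, G_i, F_s, F_i, D_se, D_s, D_i, x, y, l))
    opt_d.zero_grad(); l_ad_d.backward(); opt_d.step()
    return float(l_total)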

2.4. Algorithm Description

To enhance the reproducibility of our method, we summarize the key steps of the proposed Sketch Line Information Consistency Generation (SLIC) model using pseudocode. The training procedure integrates the adversarial loss, cycle-consistency loss, and classification loss defined in Equations (1)–(7), while the inference stage describes the sketch-to-image retrieval process. The overall optimization objective follows Equation (8).
The training algorithm (Algorithm 1) ensures that sketches and images are aligned in a shared semantic space by jointly optimizing adversarial, cycle consistency, and classification losses. The retrieval algorithm (Algorithm 2) performs inference by matching a query sketch to candidate building images, thereby enabling indirect localization in satellite-denied environments.
Algorithm 1 Training procedure of the SLIC model
Require: Sketch dataset $X$, image dataset $Y$, line-attention features $L$
Ensure: Trained model parameters
1: Initialize the parameters of generators $G_s$, $G_i$, $F_s$, $F_i$ and discriminators $D_s$, $D_i$, $D_{se}$
2: for each training iteration do
3:   Sample a mini-batch $x \in X$, $y \in Y$, $l \in L$
4:   Adversarial loss (Equations (1)–(3)): compute $\mathcal{L}_{ad}(G_s, G_i, D_{se}, x, y, l)$, $\mathcal{L}_{ad}(F_s, D_s, x, l)$, $\mathcal{L}_{ad}(F_i, D_i, y, l)$
5:   Cycle-consistency loss (Equations (4) and (5)): compute $\mathcal{L}_{cy}(G_s, F_s)$ and $\mathcal{L}_{cy}(G_i, F_i)$
6:   Classification loss (Equations (6) and (7)): compute $\mathcal{L}_{cl}(G_s)$ and $\mathcal{L}_{cl}(G_i)$
7:   Total loss (Equation (8)): $\mathcal{L}_{total} = \lambda_{ad}\mathcal{L}_{ad} + \lambda_{cy}\mathcal{L}_{cy} + \lambda_{cl}\mathcal{L}_{cl}$
8:   Update $G_s$, $G_i$, $F_s$, $F_i$ by minimizing $\mathcal{L}_{total}$
9:   Update $D_s$, $D_i$, $D_{se}$ by maximizing the adversarial objectives
10: end for
11: return Trained model parameters
Algorithm 2 Sketch-to-Image Retrieval with SLIC
Require: Query sketch $S_q$, image database $\mathcal{I}$, trained model parameters $\theta$
Ensure: Retrieved building image $I^*$
1: Extract sketch features $S_e = \mathrm{VGG\text{-}16}(S_q)$
2: for each image $I \in \mathcal{I}$ do
3:   Extract image features $I_e = \mathrm{VGG\text{-}16}(I)$
4:   Compute the similarity score $\mathrm{sim}(S_e, I_e)$ using the R-A Block
5: end for
6: Rank all images by similarity score
7: return the Top-1 (or Top-K) retrieved image $I^*$
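A compact Python rendering of Algorithm 2 is given below for clarity. Here the trained mapping functions and the relation head sketched earlier stand in for the full feature-extraction pipeline, which is an assumption made for illustration.

import torch

@torch.no_grad()
def retrieve(query_sketch, image_db, G_s, G_i, relation, top_k=2):
    # Embed the query sketch and every database image, score each pair with the
    # relation head, and return the indices of the top-K candidate buildings.
    s_feat = G_s(query_sketch.unsqueeze(0))            # sketch feature I_se
    scores = []
    for img in image_db:
        i_feat = G_i(img.unsqueeze(0))                 # image feature I_ie
        scores.append(relation(i_feat, s_feat).item()) # similarity in (0, 1)
    order = torch.tensor(scores).argsort(descending=True)
    return order[:top_k].tolist()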

3. Results

3.1. Simulation Setup

Retrieval performance was evaluated on buildings captured in different environments, together with ablation experiments. To establish a standardized set of sketching rules, key-stroke experiments were conducted, in which unnecessary strokes were selectively removed from the sketch dataset subject to matching-accuracy thresholds. To confirm the universality of the stroke rules and the network, we explored a range of architectural styles and conducted matching experiments, covering images of buildings from different regions and at various time points. To assess the impact of the line attention network on overall retrieval performance, ablation experiments were conducted, with consideration given to the matching of buildings at different times of day. All experiments were carried out using the Python programming language (version 3.8, Python Software Foundation, Wilmington, DE, USA) and PyTorch (version 1.9, Meta Platforms, Inc., Menlo Park, CA, USA) and were run on an Intel Core i7-12700KF CPU (Intel Corporation, Santa Clara, CA, USA) with 32 GB of RAM (Kingston Technology, Fountain Valley, CA, USA).

3.2. Datasets

In research of localization, there is still limited exploration of semantic localization based on architectural sketches, which has resulted in a lack of publicly available datasets. To more effectively validate our model and establish reliable sketching guidelines, we constructed both a sketch dataset and a building image dataset, as illustrated in Figure 4. The building image dataset consists of 576 original building photographs, covering four categories of building groups with distinct architectural line styles. To ensure diversity and robustness, the images were captured under both daytime and nighttime conditions. As shown in Figure 5, we collected 2466 sketches corresponding to the 576 building photographs, with each image described from multiple viewing angles. All sketches were drawn by 21 participants without prior drawing experience, ensuring diverse drawing styles that closely resemble real human navigation scenarios. This comprehensive dataset design not only allows for a rigorous evaluation of our model’s robustness across varying environments but also provides a valuable resource for future research in sketch-based localization.
The dataset focuses on building groups with relatively similar outlines and line structures to emphasize fine-grained discrimination, and this choice inevitably limits domain adaptability. Moreover, the current dataset was collected from adult participants without prior drawing experience, which reflects the target user group of typical navigation tasks but does not capture other possible factors such as hurried drawing conditions, contributions from children, or sketches containing uncontrolled noise. In future work, we plan to expand the dataset by including a broader range of architectural styles, as well as more diverse participant conditions, so as to further improve both generalization and realism.

3.2.1. Sketch Dataset

To better simulate real human semantic navigation, the sketch dataset was collected under strict criteria. The sketches were drawn mainly by people without any drawing training, which ensures that the model can learn line characteristics of diverse styles, in contrast to a network trained only on neat sketches produced by trained artists. In addition, because no guidelines for drawing architectural sketches were provided initially, both the external outline and internal detail features of the buildings were preserved to some extent. Subsequently, the number of strokes in each building sketch was controlled to simplify the drawing process and to retain only the key features required for building classification. Finally, these relatively complex sketches were pruned through the subsequent experiments so as to simplify the sketches as much as possible.

3.2.2. Buildings Dataset

For the building images, four distinct architectural styles were selected to ensure adaptability to different building types: SJTU, PARIS, OXBUILD, and structures around OXFORD. For example, OXBUILD's buildings exhibit intricate backgrounds and a wide array of architectural lines, offering a complex and diverse visual landscape. PARIS, on the other hand, is dominated by larger structures, resulting in simpler backgrounds with predominantly straight architectural lines. During image acquisition, it was essential to replicate the real-world scenario of architectural navigation: the spaces between buildings were traversed to capture images from various angles, so the images were obtained along the path, each showing the buildings from a different perspective. Furthermore, when capturing color images of the buildings, care was taken to exclude foreground objects such as trees and vehicles. This prevents foreground elements from occupying a significant portion of the image, which could otherwise diminish the prominence of the building. Such a setup ensures that the building stands out distinctly and does not hinder the subsequent matching experiments between the color image and the sketch representation.

3.2.3. Night Dataset

In addition to the daytime data, a set of nighttime test data was also collected for selected buildings. Since nighttime navigation is also in demand, verifying retrieval performance on nighttime data indicates whether a sufficient success rate can be guaranteed for nighttime navigation. Considering that the line attention structure is an important part of extracting edge information from color pictures, the nighttime dataset and the ablation experiments also provide indirect verification of this module's effectiveness.
In the final stage of preprocessing for both color images and stick figures within each class, the objective is to prepare the data for subsequent tasks, such as model training. The images undergo normalization to standardize their pixel values, and all images are uniformly resized to 256 × 256 pixels. This standardization ensures consistent input dimensions for the neural network. Furthermore, to establish a balanced training dataset, the paper employs a random division approach, allocating 70% of the data to the training set. This random division strategy is employed to mitigate potential biases and the undue influence of peculiarities in the training results.
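The preprocessing described above can be expressed with standard torchvision transforms, as in the following sketch; the normalization statistics (ImageNet means and standard deviations) and the fixed random seed are assumptions made for illustration.

import random
from torchvision import transforms

# Resize to 256x256, normalize pixel values, and split 70%/30% at random.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def split_dataset(samples, train_ratio=0.7, seed=0):
    # Randomly allocate 70% of the sketch-image pairs to the training set.
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    cut = int(train_ratio * len(samples))
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test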

3.3. Experiments on the Principles of Core Sketching

To establish general rules for sketching strokes, a stroke deletion experiment was conducted to identify key strokes. The primary objective was to extrapolate shared features from a single building class to most building classes. SJTU was chosen as the experimental dataset, and sketches with different stroke counts were drawn for SJTU's uniform architectural style. To preliminarily identify the key focus areas of the proposed framework, we leverage the Grad-CAM algorithm. Grad-CAM uses gradient backpropagation to assign weights to the feature maps of the convolutional layers and then aggregates these weighted contributions to obtain the importance of different spatial locations in the input image. According to Figure 6, we can observe the line-angle characteristics in the sketch drawings: the core internal structure of the building and the outline of the building tend to become the focus of attention. Based on this observation, further sketch-line optimization experiments can be conducted.
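For reference, a minimal Grad-CAM sketch in PyTorch is given below. It uses an ImageNet-pretrained VGG-16 as a stand-in for the trained classifier and hooks its last convolutional layer, which is a common but assumed choice of target layer.

import torch
from torchvision import models

def grad_cam(model, image, class_idx, target_layer):
    # Forward/backward hooks capture the target layer's activations and gradients;
    # the gradients are global-average-pooled into channel weights and combined
    # into a ReLU-rectified, normalized heat map.
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)
    cam = torch.relu((weights * activations["a"]).sum(dim=1))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0]  # (H', W') importance map over spatial locations

vgg = models.vgg16(pretrained=True).eval()  # newer torchvision: weights="IMAGENET1K_V1"
# Example: heat = grad_cam(vgg, preprocessed_image, class_idx=0, target_layer=vgg.features[28])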
To assess the impact of stroke deletion, a reduction principle was applied: reduce the sketch’s complexity while maintaining consistent sketch retrieval results and accurate single-point positioning. The results of our experiment are as follows: the average stroke count for stick figure drawings was kept under eight strokes, and three fundamental drawing principles were derived.
Three representative types of redundant strokes are illustrated in Figure 7. In each subfigure, the first column shows the original building image, the second column presents the corresponding sketch containing redundant strokes, and the third column displays the simplified sketch after applying the drawing principles. As shown in Figure 7a, both the inner and outer contour lines that run in parallel are simplified into a single outline. The contrast is particularly visible on the right, where redundant contour features at the top and interior of the structure have been removed. In Figure 7b, only one instance of repeated internal features is preserved. Specifically, among multiple ‘M’-shaped structures, a single representative is retained to avoid redundancy. Finally, Figure 7c demonstrates the removal of internal dividing lines. Overlapping architectural features are simplified to retain only the outer outline, and extension lines on the right side are eliminated to achieve a cleaner representation.
Through the guidance of Grad-CAM heatmaps and systematic stroke simplification experiments, we established a quantitative sketching principle that identifies and preserves key structural strokes essential for retrieval accuracy. As shown in Table 1, we progressively removed different stroke types and evaluated the resulting performance. When non-critical strokes were removed, the retrieval accuracy remained above the robustness threshold of 0.90, indicating that these strokes could be safely simplified without harming localization. In contrast, removing key strokes—such as parallel inner/outer contours, repeated architectural features, or internal dividing lines—caused accuracy to drop below 0.90 (e.g., mAP@all decreased to 0.84–0.87), confirming their importance for preserving discriminative structures. This quantitative analysis validates that the model effectively leverages critical line information for matching and remains robust to variations in drawing style, as long as key strokes are preserved. Furthermore, we confirmed the generalizability of these sketching principles by evaluating them on three additional architectural categories beyond the SJTU dataset.

3.4. Experiments on Universality of Drawing Principles and Network Structures

To verify the universality of the model and the sketching principles mentioned above, four types of architectural scenes were selected with distinct styles for testing the model’s adaptability. As shown in Figure 8, these scenes include SJTU, PARIS, OXBUILD, and OXFORD, each representing different architectural characteristics. For instance, PARIS images are characterized by simplicity, featuring mostly straight lines or simple curves with minimal internal texture details. In contrast, OXFORD buildings exhibit more complex line information, with their outlines primarily composed of curves and intricate internal textures, including repetitive patterns. Matching and positioning experiments were conducted on these architectural scenes with varying background complexity and line features.
We evaluated the model’s performance using single-point matching accuracy and mAP (mean Average Precision) as the assessment criteria to confirm the model’s universality. Furthermore, three drawing principles were employed as a basis to create architectural sketches with fewer strokes for these scenes. In order to ensure that the model exhibits both navigation and positioning accuracy, as well as fault tolerance, two images with the smallest Euclidean distance between the generated sketch and the original image from the architectural database are selected. These selected images are then regarded as the final matching objects. Moreover, three different sketches for a building are drawn. Successful matching is determined by the identification of a correct corresponding building between the two sketches with the smallest Euclidean distance, and the matched building’s name is labeled accordingly.
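The evaluation protocol described above can be sketched as follows: rank database image features by Euclidean distance to each sketch feature, keep the two nearest candidates, and average the precision over queries. This follows a common definition of mAP@k; the exact averaging convention is an assumption for illustration.

import torch

def map_at_k(sketch_feats, image_feats, sketch_labels, image_labels, k=2):
    # Rank database images by Euclidean distance to each query sketch feature and
    # average the precision of correct hits among the k nearest candidates.
    dists = torch.cdist(sketch_feats, image_feats)      # (num_queries, num_images)
    aps = []
    for q in range(dists.size(0)):
        order = dists[q].argsort()[:k]                  # indices of the k nearest images
        hits, precisions = 0, []
        for rank, idx in enumerate(order.tolist(), start=1):
            if image_labels[idx] == sketch_labels[q]:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / max(hits, 1))
    return sum(aps) / len(aps)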
In Table 2, the actual matching results for nine buildings from four different categories are presented. The experimental findings indicate that, in each building category, there is at most one mismatch, indicating the versatility of the network structure across various architectural styles. Furthermore, the best performance is observed in the PARIS dataset, which can be attributed to several factors. On the one hand, most of the buildings in the PARIS dataset are large, and they have fewer redundant background elements, thereby reducing interference. On the other hand, the overall building outlines in this dataset are composed of simple lines, making them easier to sketch. In contrast, the OXBUILD dataset contains several challenging cases, where different buildings exhibit similar tower-and-arch structures. As illustrated in Figure 9, simplified sketches drawn by participants may correspond to multiple buildings with similar outlines, leading to ambiguous retrieval results. These “bad cases” highlight a key limitation of purely outline-based representations and explain the few mismatches observed in the experiments.
Additionally, in order to assess the overall accuracy of the model and ensure the effectiveness of positioning, the corresponding mAP@2 values were calculated. For three types of buildings with relatively complex backgrounds or containing intricate interiors, the sketching principles established in previous experiments were applied, and the sketches were simplified. The mAP@2 values for these three building categories are 0.90, 0.88, and 0.87, respectively. The above experiments demonstrate that the model maintains high positioning accuracy across various architectural styles. Additionally, the applicability of the sketching rules has been validated, indicating their robustness across different environments. In terms of efficiency, the end-to-end retrieval for a single query takes approximately 16 ms on an NVIDIA RTX 4080 GPU, which meets real-time requirements.

3.5. Comparative Experiment and L-A Block Ablation Experiment

To further validate the network’s localization accuracy capability, three different types of retrieval networks were selected for comparison. The first category comprises traditional feature extraction operators, which extract features across different domains. The second category consists of neural network feature extraction operators, but the matching still occurs across different domains. The third category involves domain-transformed neural network algorithms that fuse features from different domains.
From Table 3, we can observe that for the building localization task, the neural network operators outperformed the traditional operators in terms of feature extraction, maintaining a certain advantage in retrieval time, and the VGG module achieved the highest precision, making it a suitable image feature extraction operator. Meanwhile, the SOTA models in the sketch retrieval task also achieved high precision on the building localization task, indicating that domain fusion can enhance the matching precision in tasks that utilize sketches for localization. However, it was noted that the failure cases typically involved similar types of buildings. The line-attention matching model proposed in this paper specifically addresses this problem, achieving the highest precision in building localization and maintaining stable performance as the retrieval quantity increases, thereby demonstrating strong robustness, as analyzed in the following section.
For evaluating the effect of the line attention network on the overall network, a comparison was made between the original network and a simplified network that excluded the attention network. Moreover, in adverse shooting conditions, like low light, the building’s outline tends to blur, posing challenges for accurate matching. We aspire to conduct a more in-depth examination of the line attention network’s performance under such circumstances. Therefore, a dataset of buildings during the night was introduced, as shown in Figure 10, which is characterized by the complex background weakening the texture information in the buildings.
In comparing the experimental results of day and night data under the simplified network, it can be observed that when nighttime images are used as the test set, both retrieval accuracy and single-point localization performance degrade significantly. As shown in Table 4, mAP@3 decreases from 0.88 to 0.65, mAP@2 decreases from 0.82 to 0.64, and the single-point localization accuracy drops from 0.85 to 0.62. This indicates that the difficulty of feature extraction is markedly increased when dealing with nighttime images. Furthermore, the ablation results reveal that removing the attention block leads to relative performance drops of 5.6%, 8.9%, and 3.3% for mAP@1, mAP@2, and mAP@3 under daytime conditions, and more substantial drops of 16.2%, 13.5%, and 13.3% under nighttime conditions. These findings highlight that the attention mechanism plays a critical role in enhancing robustness, particularly in low-light scenarios.
However, the network with the attention mechanism shows greatly improved overall performance compared with the simplified network, especially in scenes with complex backgrounds or unprominent architectural textures. It improves mAP@3 by about 3% over the simplified network during the day and by about 10% at night. In addition, we also visualize the vector distances of the final matches, where a smaller vector distance between the sketch and the image indicates a better match. To make the contrast more obvious, we apply post-processing in the 0–1 range, where yellow tends toward 0 and purple toward 1. As shown in Figure 11, the yellow squares are concentrated on the diagonal for the network with the attention module, which means the corresponding matches are completed well. For the network without the attention module, the yellow squares are more scattered, resulting in poorer results. Therefore, the network with the attention mechanism has a strong ability to extract line information features, which ensures good performance even when line information is difficult to discern in the real image.

4. Conclusions

When seeking effective localization methods under satellite failure conditions, traditional visual localization approaches face limitations in both accuracy and robustness, especially in complex environments. To address these challenges, we introduce the Sketch Line Information Consistency Generation model, which incorporates sketch information to reduce environmental influences and enhance robustness. Notably, the sketches are created by students with no drawing experience to ensure they possess the natural features people draw from memory. The model utilizes a line-based generative adversarial network to align sketches with original images and improves localization accuracy through attention mechanisms and correlation modules. The versatility and effectiveness of our model are validated through a series of experiments, including tests on overall building structures, stroke counting, and the line attention modules. Quantitative evaluations further confirm its advantages: the proposed method achieves a top-1 retrieval accuracy (mAP@1) of 0.90 and maintains high precision (mAP@3 of 0.91) across different architectural styles. In addition, the end-to-end retrieval latency remains within 16 ms on an NVIDIA RTX 4080 GPU, satisfying real-time localization requirements. The experiments demonstrate that the model can be effectively applied to semantic localization tasks across multiple styles and environments, and the established sketching principles ensure strong robustness during localization.

Author Contributions

Conceptualization, J.Q.; methodology, H.D. and J.F.; software, J.F. and L.L.; validation, J.F. and Y.H.; formal analysis, J.Q.; investigation, H.D. and Y.H.; resources, Y.H. and J.Q.; data curation, J.F. and L.L.; writing—original draft preparation, J.Q.; writing—review and editing, H.D., J.F., Y.H., L.L. and J.Q.; visualization, J.Q. and L.L.; supervision, J.Q.; project administration, J.Q.; funding acquisition, J.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

Author Haihua Du was employed by the company Guangzhou Shiyuan Electronic Technology Co., Ltd. Author Yitao Huang was employed by the company Guangzhou Kindlink Intelligent Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ye, H.; Huang, H.; Liu, M. Monocular direct sparse localization in a prior 3d surfel map. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 8892–8898. [Google Scholar]
  2. Ji, X.; Liu, P.; Niu, H.; Chen, X.; Ying, R.; Wen, F. Object SLAM Based on Spatial Layout and Semantic Consistency. IEEE Trans. Instrum. Meas. 2023, 72, 2528812. [Google Scholar] [CrossRef]
  3. Li, J.; Chu, J.; Zhang, R.; Hu, H.; Tong, K.; Li, J. Biomimetic navigation system using a polarization sensor and a binocular camera. J. Opt. Soc. Am. A 2022, 39, 847–854. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, S.; Dong, Y.; Wang, H.; Wang, S.; Zhang, Y.; He, B. Bifocal-Binocular Visual SLAM System for Repetitive Large-Scale Environments. IEEE Trans. Instrum. Meas. 2022, 71, 5018315. [Google Scholar] [CrossRef]
  5. Zhang, H.; Jin, L.; Ye, C. An RGB-D camera based visual positioning system for assistive navigation by a robotic navigation aid. IEEE/CAA J. Autom. Sin. 2021, 8, 1389–1400. [Google Scholar] [CrossRef]
  6. Xu, Z.; Zhan, X.; Chen, B.; Xiu, Y.; Yang, C.; Shimada, K. A real-time dynamic obstacle tracking and mapping system for UAV navigation and collision avoidance with an RGB-D camera. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 10645–10651. [Google Scholar]
  7. Wang, L.; Shen, Q. Visual inspection of welding zone by boundary-aware semantic segmentation algorithm. IEEE Trans. Instrum. Meas. 2020, 70, 5001309. [Google Scholar] [CrossRef]
  8. Chen, X.; Liu, Y.; Achuthan, K. WODIS: Water obstacle detection network based on image segmentation for autonomous surface vehicles in maritime environments. IEEE Trans. Instrum. Meas. 2021, 70, 7503213. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Liu, Q.; Wang, Y.; Yu, G. CSI-based location-independent human activity recognition using feature fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5503312. [Google Scholar] [CrossRef]
  10. de Lucca Siqueira, F.; Plentz, P.D.M.; De Pieri, E.R. Semantic trajectory applied to the navigation of autonomous mobile robots. In Proceedings of the 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), Agadir, Morocco, 29 November–2 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–8. [Google Scholar]
  11. Aotani, Y.; Ienaga, T.; Machinaka, N.; Sadakuni, Y.; Yamazaki, R.; Hosoda, Y.; Sawahashi, R.; Kuroda, Y. Development of autonomous navigation system using 3D map with geometric and semantic information. J. Robot. Mechatronics 2017, 29, 639–648. [Google Scholar] [CrossRef]
  12. Zender, H.; Mozos, O.M.; Jensfelt, P.; Kruijff, G.J.; Burgard, W. Conceptual spatial representations for indoor mobile robots. Robot. Auton. Syst. 2008, 56, 493–502. [Google Scholar] [CrossRef]
  13. Crespo, J.; Barber, R.; Mozos, O. Relational model for robotic semantic navigation in indoor environments. J. Intell. Robot. Syst. 2017, 86, 617–639. [Google Scholar] [CrossRef]
  14. Ruiz-Sarmiento, J.R.; Galindo, C.; Gonzalez-Jimenez, J. Building multiversal semantic maps for mobile robot operation. Knowl.-Based Syst. 2017, 119, 257–272. [Google Scholar] [CrossRef]
  15. Drouilly, R.; Rives, P.; Morisset, B. Semantic representation for navigation in large-scale environments. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1106–1111. [Google Scholar]
  16. He, G.; Zhang, Q.; Zhuang, Y. Online semantic-assisted topological map building with LiDAR in large-scale outdoor environments: Toward robust place recognition. IEEE Trans. Instrum. Meas. 2022, 71, 8504412. [Google Scholar] [CrossRef]
  17. Tripathi, A.; Dani, R.R.; Mishra, A.; Chakraborty, A. Sketch-Guided Object Localization in Natural Images. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 532–547. [Google Scholar]
  18. Chen, J.; Fang, Y. Deep cross-modality adaptation via semantics preserving adversarial learning for sketch-based 3d shape retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 605–620. [Google Scholar]
  19. Yelamarthi, S.K.; Reddy, S.K.; Mishra, A.; Mittal, A. A zero-shot framework for sketch based image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
20. Liu, L.; Shen, F.; Shen, Y.; Liu, X.; Shao, L. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2862–2871.
21. Zhang, J.; Shen, F.; Liu, L.; Zhu, F.; Yu, M.; Shao, L.; Shen, H.T.; Van Gool, L. Generative domain-migration hashing for sketch-to-image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 297–314.
22. Song, J.; Yu, Q.; Song, Y.Z.; Xiang, T.; Hospedales, T.M. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5551–5560.
23. Yang, Z.; Zhu, X.; Qian, J.; Liu, P. Dark-aware network for fine-grained sketch-based image retrieval. IEEE Signal Process. Lett. 2020, 28, 264–268.
24. Besbas, W.; Artemi, M.; Salman, R. Content based image retrieval (CBIR) of face sketch images using WHT transform domain. In Proceedings of the 2014 3rd International Conference on Informatics, Environment, Energy and Applications IPCBEE, Shanghai, China, 27–28 March 2014; Volume 66.
25. Zhou, R.; Chen, L.; Zhang, L. Sketch-based image retrieval on a large scale database. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 973–976.
26. Gaidhani, P.A.; Bagal, S. Implementation of sketch based and content based image retrieval. Int. J. Mod. Trends Eng. Res. 2016, 3, hal-01336894.
27. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
28. Changpinyo, S.; Chao, W.L.; Gong, B.; Sha, F. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5327–5336.
29. Park, D.Y.; Lee, K.H. Arbitrary style transfer with style-attentional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5880–5888.
30. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208.
31. Rybski, P.E.; Roumeliotis, S.; Gini, M.; Papanikopoulos, N. Appearance-based mapping using minimalistic sensor models. Auton. Robot. 2008, 24, 229–246.
32. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893.
33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
35. Dutta, A.; Akata, Z. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5089–5098.
36. Lin, F.; Li, M.; Li, D.; Hospedales, T.; Song, Y.Z.; Qi, Y. Zero-shot everything sketch-based image retrieval, and in explainable style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23349–23358.
Figure 1. Sketch-based retrieval process for indirect building localization, where user-drawn sketches are matched with candidate building images. A check mark (✓) indicates a correct retrieval match between the sketch and the building image, while a cross mark (✗) indicates an incorrect or failed match.
Figure 2. Overall architecture of the proposed SLIC model. (1) Image and sketch features are extracted using VGG-16. (2) During training, the L-A Block enhances line-style consistency between modalities and guides semantic space alignment (dashed arrows). This block is not used during retrieval. (3) Generators and discriminators map both modalities into a shared semantic space. (4) The R-A Block analyzes cross-modal relations. (5) The similarity matrix is computed for final retrieval.
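To make the data flow in Figure 2 concrete, the following minimal PyTorch sketch illustrates the retrieval-time pipeline under stated assumptions: a VGG-16 trunk extracts features from both modalities, and lightweight linear heads stand in for the generators that map each modality into the shared semantic space. Class names, dimensions, and the use of a shared backbone are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CrossModalEncoder(nn.Module):
        """Illustrative sketch of the SLIC retrieval path: VGG-16 features -> shared space."""
        def __init__(self, embed_dim=256):
            super().__init__()
            vgg = models.vgg16(weights=None)          # backbone; pretrained weights optional
            self.backbone = vgg.features              # convolutional trunk, 512 output channels
            self.pool = nn.AdaptiveAvgPool2d(1)
            # Linear heads standing in for the modality-specific generators.
            self.sketch_head = nn.Linear(512, embed_dim)
            self.image_head = nn.Linear(512, embed_dim)

        def encode(self, x, head):
            f = self.pool(self.backbone(x)).flatten(1)        # (B, 512)
            return nn.functional.normalize(head(f), dim=1)    # unit-norm embedding

        def forward(self, sketches, images):
            return self.encode(sketches, self.sketch_head), self.encode(images, self.image_head)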
Figure 3. Line-Attention (L-A) Block. VGG-16 conv4/conv5 features from the image and sketch streams are projected by 1 × 1 convolutions and weighted by learned matrices to yield length-aware (L_shallow) and orientation-aware (L_deep) attentions; a decoder aggregates them into the line-style standard.
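The caption above summarizes the L-A Block only at a high level; the snippet below sketches one plausible reading of it, in which 1 × 1 convolutions project the conv4 (shallow) and conv5 (deep) feature maps, learned weights produce the two attention maps, and a small decoder fuses them into a single line-style feature. Channel sizes, the sigmoid gating, and the concatenation-based decoder are assumptions for illustration only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LineAttentionBlock(nn.Module):
        """Illustrative L-A Block: fuse shallow (conv4) and deep (conv5) line cues."""
        def __init__(self, c_shallow=512, c_deep=512, c_mid=128):
            super().__init__()
            self.proj_shallow = nn.Conv2d(c_shallow, c_mid, kernel_size=1)   # 1x1 projection
            self.proj_deep = nn.Conv2d(c_deep, c_mid, kernel_size=1)
            self.attn_shallow = nn.Conv2d(c_mid, 1, kernel_size=1)   # length-aware map (L_shallow)
            self.attn_deep = nn.Conv2d(c_mid, 1, kernel_size=1)      # orientation-aware map (L_deep)
            self.decoder = nn.Conv2d(2 * c_mid, c_mid, kernel_size=3, padding=1)

        def forward(self, feat4, feat5):
            s = self.proj_shallow(feat4)
            d = F.interpolate(self.proj_deep(feat5), size=s.shape[-2:],
                              mode="bilinear", align_corners=False)   # match spatial resolutions
            a_s = torch.sigmoid(self.attn_shallow(s))
            a_d = torch.sigmoid(self.attn_deep(d))
            return self.decoder(torch.cat([s * a_s, d * a_d], dim=1))  # fused line-style feature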
Figure 4. Building and sketch pairs in different styles. The first and second rows show landmark buildings and the corresponding sketches from the Shanghai Jiao Tong University campus, while the third and fourth rows show building and sketch pairs from the Paris and Oxbuild scenes.
Figure 5. Dataset composition with building images (576) and sketches (2466), along with their attributes.
Figure 6. Sketch focus regions obtained with the Grad-CAM algorithm. Black lines denote the original building contours, while red lines indicate the parts experimentally verified as removable.
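Grad-CAM [27], used to produce the focus maps in Figure 6, can be reproduced with a few lines of PyTorch hooks. The function below is a generic, minimal version; the model, target layer, and class index are placeholders rather than the configuration used in the paper.

    import torch

    def grad_cam(model, target_layer, x, class_idx):
        """Minimal Grad-CAM: weight the target layer's activations by pooled gradients."""
        acts, grads = {}, {}
        h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
        h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
        try:
            scores = model(x)                                    # (1, num_classes)
            scores[0, class_idx].backward()
            weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
            cam = torch.relu((weights * acts["a"]).sum(dim=1))   # (1, H, W) heat map
            cam = cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
        finally:
            h1.remove()
            h2.remove()
        return cam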
Figure 7. Three types of redundant strokes in sketches. Red lines indicate unnecessary strokes.
Figure 8. Four types of buildings with different backgrounds and different interiors.
Figure 9. Examples of difficult retrieval cases. Buildings (a,b) have similar line structures, leading to ambiguous matches with the simplified sketch (c).
Figure 10. Buildings at night, where contour line features are weakened.
Figure 11. Euclidean distance of matching similarity under different network structures (the red box indicates an improvement in matching similarity).
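The similarity comparison visualized in Figure 11 reduces to pairwise Euclidean distances between sketch and image embeddings, with smaller distances indicating better matches. A generic sketch of that scoring step, assuming embeddings such as those produced by the encoder sketched after Figure 2, is:

    import torch

    def retrieve(sketch_emb, image_emb, k=2):
        """Rank gallery images for each sketch query by Euclidean distance."""
        dists = torch.cdist(sketch_emb, image_emb, p=2)        # (num_sketches, num_images)
        topk = torch.topk(dists, k=k, dim=1, largest=False)    # smallest distances first
        return topk.indices, topk.values                       # candidate image ids and distances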
Table 1. Accuracy degradation after removing different types of key strokes.
Removed Stroke Type              mAP@all   mAP@3
Full sketch (baseline)           0.90      0.91
Parallel inner/outer outlines    0.87      0.88
Multiple identical features      0.85      0.86
Internal dividing lines          0.84      0.84
Table 2. Comparison of sketch retrieval performance across different architectural styles, where S, P, F, and B represent SJTU, PARIS, OXFORD, and OXBUILD scenes, respectively. A check mark (✓) indicates a correct retrieval, and a cross mark (✗) indicates a failed retrieval.
Place      Scenarios                                                                                     mAP@2
SJTU       S_field, S_jiang, S_gym, S_dining, S_east, S_library, S_statue, S_gate, S_middle              0.90
PARIS      P_square, P_tower, P_triangle, P_windmill, P_clock, P_palace, P_eiffel, P_dome, P_gate        1.00
OXFORD     F_tower, F_dome, F_square, F_bridge, F_steep, F_pillar, F_palace, F_mindwill, F_glass         0.88
OXBUILD    B_palace, B_column, B_highrise, B_tower, B_bridge, B_dome, B_church, B_build, B_railway       0.87
Table 3. SJTU-style comparison results. The bold values indicate the best performance in each column.
Method            mAP@1   mAP@2   mAP@3
DAISY [31]        0.61    0.59    0.57
HOG [32]          0.74    0.72    0.68
RESNET [33]       0.82    0.85    0.79
VGG [34]          0.85    0.83    0.80
SEM-PCYC [35]     0.85    0.83    0.81
ZSE [36]          0.87    0.84    0.82
Ours              0.90    0.90    0.91
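The mAP@k scores reported in Tables 1–4 average a per-query precision over the top-k retrieved images. Since the exact evaluation protocol is not restated in this section, the snippet below implements one common reading of mAP@k and is meant only as an illustration of the metric:

    import numpy as np

    def map_at_k(ranked_labels, query_labels, k):
        """ranked_labels: (Q, N) gallery labels sorted by ascending distance for each query."""
        aps = []
        for ranks, q in zip(ranked_labels, query_labels):
            hits = np.asarray(ranks[:k]) == q                        # relevance of each top-k result
            if hits.sum() == 0:
                aps.append(0.0)
                continue
            precision_at_i = np.cumsum(hits) / (np.arange(len(hits)) + 1)
            aps.append(float((precision_at_i * hits).sum() / hits.sum()))
        return float(np.mean(aps))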
Table 4. Ablation experiments at different times of day. Drop is the relative decrease compared with the attention-enabled network.
Time     Network          mAP@1   mAP@2   mAP@3   Drop@1   Drop@2   Drop@3
day      attention        0.90    0.90    0.91    -        -        -
day      w/o attention    0.85    0.82    0.88    5.6%     8.9%     3.3%
night    attention        0.74    0.74    0.75    -        -        -
night    w/o attention    0.62    0.64    0.65    16.2%    13.5%    13.3%
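As a consistency check on Table 4, the Drop columns follow directly from the two mAP rows above them; for example, the day-time Drop@1 entry is

    \mathrm{Drop@}k = \frac{\mathrm{mAP@}k_{\text{attention}} - \mathrm{mAP@}k_{\text{w/o attention}}}{\mathrm{mAP@}k_{\text{attention}}}, \qquad
    \mathrm{Drop@}1_{\text{day}} = \frac{0.90 - 0.85}{0.90} \approx 5.6\%.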