Abstract
Building function identification provides fundamental data for urban planning, management, and various intelligent applications. Building function identification methods using Street View Images (SVIs) have made significant progress. However, these methods infer building functions from the visual features of SVIs alone, ignoring the contributions of the multiple potential semantics of SVIs and resulting in suboptimal identification accuracy. To address this issue, this study proposes a multi-semantic semi-supervised building function identification (MS-SS-BFI) method, which integrates multi-level visual semantics and spatial contextual semantics to improve building function identification from SVIs. Specifically, a location mapping module was designed to align SVIs with buildings, a multi-level visual semantic extraction module was developed to integrate the visual semantics and visual-textual semantics of SVIs, and a semi-supervised spatial interaction module was designed to characterize the spatial context of buildings. Extensive experiments on the Brooklyn dataset show that the proposed method achieves a 7.98% improvement in F1-score over the state-of-the-art baseline, demonstrating superior performance and robustness. This work explores a novel approach to building function identification and provides a methodological reference for various SVI-based applications.
1. Introduction
Building function refers to the land use undertaken by an individual building unit [1]. Automatic identification of building functions plays a pivotal role across multiple domains, including urban planning, environmental monitoring, population statistics, disaster emergency response, and topographic mapping [2,3,4]. In practice, building functions are shaped by many factors, including the environment and human activities, making it challenging to identify them accurately [5,6].
Street view images (SVIs) have demonstrated great potential in building function identification [7,8], benefiting from their multi-view characteristics and rich embedded socioeconomic semantics. For example, Kang et al. [9] proposed a convolutional neural network (CNN)-based method for building instance function classification using Google Street View (GSV), overcoming the limitation of remote sensing images (RSIs) that capture only rooftops. Wang et al. [10] used instance segmentation to identify urban soft-story buildings in SVIs, aiming to improve the efficiency of structural safety analysis and management. Murdoch et al. [11] employed CNNs to extract building features from SVIs and classify building categories. However, these methods follow the fully supervised learning paradigm, which requires extensive labeled image samples for model training. Annotating samples is labor-intensive and time-consuming, especially in regions where buildings are structurally complex and functionally diverse.
Recently, researchers have explored semi-supervised learning methods in building function identification to address the challenge of insufficient labeled samples. These methods leverage a small number of labeled samples in conjunction with a large volume of unlabeled samples for model training. For instance, Xie et al. [12] proposed a semi-supervised framework that combines geometric, POI, and spatial features to improve building function recognition with limited labels. Skuppin et al. [13] employed a U-Net model with incomplete OSM labels to classify building functions, achieving reliable performance despite limited annotations. For SVI-based building function identification, Li et al. [14] proposed a geometry-aware semi-supervised building function recognition method that enhances classification accuracy while addressing class imbalance. However, this method infers building functions from the visual features of SVIs alone, ignoring the multiple potential semantics of SVIs and thus yielding suboptimal identification accuracy.
To alleviate this issue, this paper proposes a multi-semantic semi-supervised building function identification (MS-SS-BFI) method, which leverages multi-level visual semantics and spatial contextual semantics to improve building function identification. Specifically, a location mapping module (LoMM) is designed to align SVIs with the buildings they depict. A multi-level visual semantic extraction module (ML-VSEM) is developed to integrate visual semantics and visual-textual semantics from SVIs. A semi-supervised spatial interaction module (S3IM) is introduced to characterize the spatial context of buildings. The main contributions of this paper can be summarized as follows:
- (1) This study highlights the importance of multiple semantics derived from SVIs for identifying building functions, and it proposes integrating the multiple semantics of SVIs with a semi-supervised model to enhance identification accuracy.
- (2) A network is developed that identifies building functions by integrating multi-level visual semantics from SVIs with spatial contextual semantics. This study offers a methodological reference for urban applications with insufficient labeled samples.
- (3) Experimental results demonstrate that the proposed method significantly improves the identification accuracy of building functions, even when the number of labeled SVIs is small.
The rest of this paper is structured as follows. Section 2 reviews related studies. Section 3 provides a detailed description of the proposed method. Section 4 outlines the experimental setting. Section 5 presents the experimental results and discussions. Section 6 summarizes this study and outlines prospects for future work.
2. Related Work
The work that relates to this paper can be classified into two categories: building function identification and semi-supervised learning methods for urban applications.
2.1. Building Function Identification
Numerous works have used RSIs to identify building functions, inferring them by analyzing visual features such as shape, size, texture, and spatial layout. For example, Du et al. [15] proposed a building function classification method based on RSIs, which combines traditional machine learning techniques with spectral and texture features extracted from the images. Xu et al. [16] achieved building function classification based on features extracted from RSIs by deep learning models. Vasavi et al. [17] introduced a classification approach integrating U-Net and ResNet architectures, which substantially enhances classification accuracy on very-high-resolution imagery. In practice, the diversity of building structures and the complexity of urban environments make it difficult to identify building functions using RSIs taken from a distance.
SVIs are typically ground-level images that provide visual information captured close to buildings, along with rich embedded socioeconomic semantics [18,19], giving them a unique advantage in building instance function identification. Laupheimer et al. [20] proposed a CNN-based approach that uses SVIs to classify building facades and identify five categories of building functions. Li et al. [14] proposed a geometry-aware framework for fine-grained building function identification based on SVIs, improving accuracy by incorporating geometric relationships between GIS data and SVIs. However, these methods focus on identifying building functions based on the visual features of SVIs, ignoring the multiple potential semantics of SVIs that are closely related to building functions.
2.2. Semi-Supervised Learning Method for Urban Applications
Semi-supervised models have made significant progress in urban applications, especially in scenarios with few labeled samples. For example, Protopapadakis et al. [21] proposed a semi-supervised deep neural network based on stacked autoencoders to efficiently extract buildings from near-infrared satellite images. This approach significantly reduces the need for labeled samples while achieving accuracy close to that of state-of-the-art methods. Liu et al. [22] introduced an innovative semi-supervised learning framework that leverages graph convolutional networks and multimodal autoencoders to automatically classify geotagged social media content. Fouedjio et al. [23] developed a geostatistical semi-supervised learning method, which combines a small amount of labeled spatial data with a large volume of unlabeled samples to enhance spatial prediction accuracy and uncertainty estimation using pseudo-labels and geostatistical simulations. However, previous building function identification methods have not yet fully exploited the multiple semantics of unlabeled SVIs, which deserves further exploration.
3. Methodology
This study aims to improve building function identification by integrating multiple semantics of SVIs with the spatial context of buildings, and it proposes the multi-semantic semi-supervised building function identification (MS-SS-BFI) method. As depicted in Figure 1, the proposed method comprises four key modules: a location mapping module (LoMM) that is designed to align SVIs with buildings, a multi-level visual semantic extraction module (ML-VSEM) that extracts visual semantics and visual-textual semantics from SVIs, a semi-supervised spatial interaction module (S3IM) that integrates multi-semantics and spatial context between buildings via graph-based aggregation, and a model optimization module that trains the model and predicts the function category of each building.
Figure 1.
An overview of the proposed method. The method consists of four modules: (a) location mapping module (LoMM), (b) multi-level visual semantic extraction module (ML-VSEM), (c) semi-supervised spatial interaction module (S3IM), and (d) model optimization.
3.1. Location Mapping Module
The recorded geographic coordinates of SVIs refer to the location of the camera, since SVIs are typically captured by vehicles traveling along roadways [24]. That is, the image coordinates reflect the shooting locations (i.e., where the SVI was taken), not the building locations (i.e., the true locations of the buildings). To align the image objects with building objects, the proposed method develops a location mapping algorithm. As illustrated in Figure 2, the algorithm is a three-step process.
Figure 2.
Location mapping from SVIs to buildings.
Step 1. Calculating the building azimuth $\alpha$.
The building azimuth $\alpha$ represents the angle, measured clockwise from true north, of the direction from the shooting location toward the building, describing the orientation of the building facade in the image. The values of $\alpha$ typically range from 0° to 360°, where 0° corresponds to true north, 90° to east, 180° to south, and 270° to west. $\alpha$ can be inferred from the SVI's shooting direction and the building's position within the image.
The pixel position of the building in an SVI is taken as the center of its bounding box annotation, as shown in Equation (1):

$$x_c = x_0 + \frac{w}{2}, \qquad y_c = y_0 + \frac{h}{2} \quad (1)$$

where $x_0$ and $y_0$ denote the x- and y-coordinates of the bounding box's top-left corner, respectively, and $w$ and $h$ represent the width and height of the bounding box, respectively.
The angular deviation $\Delta\theta$ between the building and the SVI's shooting direction can be computed using Equation (2):

$$\Delta\theta = \left(\frac{x_c}{W_{img}} - \frac{1}{2}\right) \cdot \mathrm{HFOV} \quad (2)$$

where $\mathrm{HFOV}$, estimated through multi-angle observations, denotes the horizontal field of view of the image, and $W_{img}$ is the total number of horizontal pixels in the image. In practice, $\mathrm{HFOV}$ is estimated from the ratio of the difference in building azimuths between two observations to the difference in the x-coordinates of the corresponding bounding boxes' top-left corners.
The building azimuth $\alpha$ is then derived as follows:

$$\alpha = \left(\theta_0 + \Delta\theta\right) \bmod 360^{\circ}$$

where $\theta_0$ represents the azimuth angle of the SVI's shooting direction.
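For illustration, the following minimal Python sketch implements Equations (1)–(3) under the bounding-box-center convention described above; the variable names (x0, w, img_width, hfov_deg, heading_deg) are ours and not taken from the paper.

```python
import math

def building_azimuth(x0, w, img_width, hfov_deg, heading_deg):
    """Step 1 sketch: infer the building azimuth from a bounding box and the
    SVI shooting direction (Equations (1)-(3))."""
    x_c = x0 + w / 2.0                          # horizontal center of the bounding box (Eq. 1)
    delta = (x_c / img_width - 0.5) * hfov_deg  # angular deviation from the shooting direction (Eq. 2)
    return (heading_deg + delta) % 360.0        # building azimuth relative to true north (Eq. 3)

# Example: a box right of the image center, camera heading 90 degrees (east)
print(building_azimuth(x0=620, w=200, img_width=1024, hfov_deg=90, heading_deg=90))
```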
Step 2. Estimating the distance between shooting and building location.
The distance between the shooting location and the building, d, is closely related to the absolute value of the horizontal offset angle $|\Delta\theta|$. When $|\Delta\theta|$ approaches 0°, i.e., the SVI shooting direction aligns with the street's longitudinal axis, the buildings appearing near the center of the image tend to be farther away, as they lie along the depth of the street. Conversely, when the offset angle approaches 90°, indicating a side-view perspective perpendicular to the street's longitudinal direction, the buildings are located on either side of the street and much closer to the camera, with the distance approaching approximately half the street width ($W_s/2$).
To account for the fact that buildings captured at the far end of a street are not infinitely distant, this study introduces a hyperparameter, D, to denote the maximum distance from the SVI to the building. A nonlinear quadratic function of $|\Delta\theta|$ is employed to represent the distance variation as the building's position shifts from the image edges toward the center, and the distance d is computed by Equation (4), where $W_s$ is a hyperparameter representing the street width and $\Delta\theta$ is the azimuthal deviation of the building relative to the shooting direction.
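Since the exact quadratic used in Equation (4) is not reproduced here, the sketch below shows one quadratic function that satisfies the constraints stated above (d approaches D as the offset angle approaches 0° and $W_s/2$ as it approaches 90°); it is an assumption, not the authors' definitive formula.

```python
def shooting_to_building_distance(delta_deg, max_dist=60.0, street_width=30.0):
    """Step 2 sketch: quadratic interpolation between the maximum distance D
    (building near the image center) and half the street width (side view).
    The exact functional form is an assumption."""
    t = min(abs(delta_deg), 90.0) / 90.0        # normalized offset angle in [0, 1]
    half_street = street_width / 2.0
    return half_street + (max_dist - half_street) * (1.0 - t) ** 2

print(shooting_to_building_distance(0.0))       # ~60.0 m (far end of the street)
print(shooting_to_building_distance(90.0))      # 15.0 m (half the street width)
```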
Step 3. Mapping shooting location to building location.
Given the geographic coordinates of the SVI shooting location, the building azimuth $\alpha$, and the estimated distance d, the actual geographic location of the building can be inferred by solving the forward geodesic problem. The central angle $\delta$ and the azimuth $\alpha_r$ (in radians) are calculated by Equations (5) and (6):

$$\delta = \frac{d}{R} \quad (5)$$

$$\alpha_r = \alpha \cdot \frac{\pi}{180} \quad (6)$$

where R is the radius of the Earth, d is the distance between the shooting location and the building, and $\alpha$ is the building azimuth.
The building's latitude $\varphi_b$ and longitude $\lambda_b$ are then derived using the Haversine-based forward geodesic formulas:

$$\varphi_b = \arcsin\left(\sin\varphi_s \cos\delta + \cos\varphi_s \sin\delta \cos\alpha_r\right)$$

$$\lambda_b = \lambda_s + \arctan2\left(\sin\alpha_r \sin\delta \cos\varphi_s,\ \cos\delta - \sin\varphi_s \sin\varphi_b\right)$$

where $\varphi_s$ and $\lambda_s$ denote the latitude and longitude of the shooting location where the SVI was taken, respectively.
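A minimal sketch of Step 3, assuming a spherical Earth (R ≈ 6371 km) and the standard forward geodesic formulas above:

```python
import math

def map_to_building_location(lat_s, lon_s, azimuth_deg, d, R=6371000.0):
    """Step 3 sketch: move distance d (meters) from the shooting location
    along the building azimuth on a spherical Earth."""
    delta = d / R                               # central angle (Eq. 5)
    alpha_r = math.radians(azimuth_deg)         # azimuth in radians (Eq. 6)
    phi_s, lam_s = math.radians(lat_s), math.radians(lon_s)
    phi_b = math.asin(math.sin(phi_s) * math.cos(delta)
                      + math.cos(phi_s) * math.sin(delta) * math.cos(alpha_r))
    lam_b = lam_s + math.atan2(math.sin(alpha_r) * math.sin(delta) * math.cos(phi_s),
                               math.cos(delta) - math.sin(phi_s) * math.sin(phi_b))
    return math.degrees(phi_b), math.degrees(lam_b)   # building latitude and longitude

# Example: 60 m due east of a shooting location in Brooklyn
print(map_to_building_location(40.6782, -73.9442, azimuth_deg=90.0, d=60.0))
```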
The complete location mapping algorithm is presented in Algorithm 1.
Algorithm 1. Location Mapping Algorithm.
3.2. Multi-Level Visual Semantic Extraction Module
Visual semantics from SVIs effectively characterize building functional features, while visual-textual semantics, derived from image captions [25], provide the scene knowledge of the SVI from pre-trained large models. These two kinds of visual semantics jointly provide complementary semantics for building function identification. As shown in Figure 3, visual semantics are first extracted using the Vision Transformer (ViT) model [26]. Subsequently, a pre-trained image captioning model generates descriptive text captions for each SVI to embed latent semantic information. Finally, the generated captions are processed by BERT [27] to extract contextualized visual-textual semantics.
Figure 3.
An illustration of multi-level visual semantic extraction from SVIs.
Specifically, the input SVI is denoted as $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the image height, width, and number of channels, respectively. The image is first divided into N non-overlapping patches, each of size $P \times P$. Each patch is flattened into a 1-D vector $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ and the resulting patch sequence is represented as follows:

$$x_p = \left[x_p^1;\ x_p^2;\ \ldots;\ x_p^N\right]$$
The embedding layer transforms these image patches into D-dimensional vectors:

$$z_0 = \left[x_p^1 E;\ x_p^2 E;\ \ldots;\ x_p^N E\right] + E_{pos}, \qquad E \in \mathbb{R}^{\left(P^2 \cdot C\right) \times D}$$

where $E$ is the patch embedding matrix and $E_{pos}$ is the positional encoding.
The embedding vectors are processed through a Transformer encoder consisting of L identical layers. Each layer employs a residual connection and layer normalization (LN) around a multi-head self-attention (MSA) mechanism, followed by a position-wise feed-forward network (MLP):

$$z_l^{\prime} = \mathrm{MSA}\left(\mathrm{LN}\left(z_{l-1}\right)\right) + z_{l-1}, \qquad l = 1, \ldots, L$$

$$z_l = \mathrm{MLP}\left(\mathrm{LN}\left(z_l^{\prime}\right)\right) + z_l^{\prime}, \qquad l = 1, \ldots, L$$
The visual semantic feature vector $f_v$ is taken from the output $z_L$ of the last Transformer layer.
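As an illustrative sketch (the ViT-B/16 checkpoint and the use of torchvision are assumptions for the example; the paper only states that a pre-trained ViT is used), visual features can be extracted as follows:

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pre-trained ViT and expose its feature output.
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()          # drop the classification head to obtain features
vit.eval()

with torch.no_grad():
    svi = torch.rand(1, 3, 224, 224)     # one resized SVI as a (B, C, H, W) tensor
    f_v = vit(svi)                       # visual semantic feature vector f_v (here 768-D)
print(f_v.shape)                         # torch.Size([1, 768])
```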
Subsequently, an image captioning model generates a textual description T for the SVI. The generation of the word $w_t$ at time step t is modeled as follows:

$$p\left(w_t \mid w_{<t}, I\right) = \mathrm{SoftMax}\left(W_o h_t + b_o\right)$$

where $h_t$ denotes the hidden state of the LSTM at time step t, $W_o$ represents the parameters of the output layer, and $b_o$ is the bias term.
The textual description T is fed into a pre-trained BERT model to encode it into a semantic embedding. Let $\mathrm{BERT}(\cdot)$ denote the BERT encoding function; the resulting vector representation is

$$f_t = \mathrm{BERT}(T)$$
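A minimal sketch of the visual-textual branch, assuming the Hugging Face bert-base-uncased checkpoint and an illustrative caption (in the proposed method, the caption comes from a pre-trained image captioning model):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

# Illustrative caption; in the proposed method it is generated for each SVI.
caption = "a red brick apartment building with shops on the ground floor"
inputs = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
f_t = outputs.last_hidden_state[:, 0]    # [CLS] embedding as the caption representation (768-D)
print(f_t.shape)                         # torch.Size([1, 768])
```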
Following Yang et al. [28], a concatenation-based fusion strategy integrates the visual semantic feature $f_v$ and the visual-textual semantic feature $f_t$ of each SVI. The experimental results in Section 5.5 suggest that this strategy is effective.
For each building b, let $S_b$ denote its corresponding SVI set. The final building-level feature is obtained by aggregating the fused features of all SVIs in $S_b$, as sketched below.
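The sketch below assumes concatenation of $f_v$ and $f_t$ per SVI (the "feature splicing" strategy evaluated in Section 5.5) followed by mean pooling over a building's SVIs; the pooling operator is our assumption, and the resulting 1536-D vector matches the input dimension reported in Section 4.2.

```python
import torch

def building_feature(f_v_list, f_t_list):
    """Concatenate visual (f_v) and visual-textual (f_t) features per SVI,
    then average over all SVIs of the same building."""
    per_svi = [torch.cat([f_v, f_t], dim=-1) for f_v, f_t in zip(f_v_list, f_t_list)]
    return torch.stack(per_svi).mean(dim=0)   # one 1536-D vector per building (768 + 768)

# Example: a building observed in two SVIs
f_b = building_feature([torch.rand(768), torch.rand(768)],
                       [torch.rand(768), torch.rand(768)])
print(f_b.shape)                              # torch.Size([1536])
```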
3.3. Semi-Supervised Spatial Interaction Module
A semi-supervised spatial interaction graph convolutional network (S3I-GCN) is designed to integrate multi-semantics with the spatial context between buildings. Specifically, a spatial weight matrix $M \in \mathbb{R}^{n \times n}$ is constructed to represent the spatial relationships among buildings, where n is the number of buildings. The matrix is defined as follows:

$$M_{ij} = \begin{cases} w_{ij}, & \text{if building } b_j \text{ is spatially related to building } b_i \\ 0, & \text{otherwise} \end{cases}$$

where $M_{ij} \neq 0$ denotes that building $b_i$ is spatially related to building $b_j$, and $w_{ij}$ represents the spatial weight between them. The spatial weight is computed using a K-nearest neighbor (KNN) spatial weighting scheme [29], where $w_{ij}$ is inversely proportional to the distance between $b_i$ and $b_j$, typically defined as follows:

$$w_{ij} = \frac{1}{d\left(b_i, b_j\right)^{\beta}}$$
where $d\left(b_i, b_j\right)$ is the distance between the two buildings, and $\beta$ is a hyperparameter controlling the weight decay with distance, proportional to the scaling exponent of the spatial interaction, set within the range of 0 to 3 according to spatial autocorrelation theory [30].
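A sketch of the KNN spatial weight matrix under the definitions above; Euclidean distance over planar coordinates and the small epsilon guarding zero distances are assumptions.

```python
import numpy as np

def knn_spatial_weights(coords, k=5, beta=1.0, eps=1e-6):
    """Build the spatial weight matrix M: each building is linked to its k
    nearest neighbors with an inverse-power distance weight."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    M = np.zeros((n, n))
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]             # k nearest buildings, excluding i itself
        M[i, neighbors] = 1.0 / (dist[i, neighbors] + eps) ** beta
    return M

M = knn_spatial_weights(np.random.rand(10, 2), k=3, beta=1.0)
print(M.shape)                                               # (10, 10)
```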
As illustrated in Figure 4, once the spatial weight matrix M is established, a spatial interaction graph $G$ is constructed. Each node corresponds to a building instance and is associated with a feature vector $v_i$. Each edge captures the spatial context between the corresponding building instances and is weighted by $M_{ij}$.
Figure 4.
Semi-supervised spatial interaction modeling.
The S3I-GCN network stacks three GCN layers to capture the relationships between nodes and their spatial interactions in graph $G$. Each layer learns node representations by aggregating features from neighboring nodes, as formulated in Equations (20) and (21):

$$\hat{M} = D^{-\frac{1}{2}} M D^{-\frac{1}{2}} \quad (20)$$

$$H^{(l+1)} = \sigma\left(\hat{M} H^{(l)} W^{(l)}\right) \quad (21)$$

where $H^{(l)}$ is the node feature matrix at the l-th layer and $W^{(l)}$ is the trainable weight matrix of that layer; M is the adjacency matrix of the graph, indicating the connections between nodes; D is the degree matrix of the graph; $\sigma$ is the activation function; and $\hat{M}$ is the normalized adjacency matrix, obtained by normalizing the adjacency matrix M. In semi-supervised training, the graph structure enables the propagation of label information from labeled to unlabeled nodes through feature aggregation using the matrix $\hat{M}$.
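A minimal sketch of one propagation step following Equations (20) and (21); ReLU matches the activation stated in Section 4.2, while adding self-loops (as in Kipf and Welling [34]) is a common variant not shown here.

```python
import torch

def gcn_layer(H, M, W):
    """One S3I-GCN layer: symmetric normalization of the spatial weight matrix M
    (Eq. 20), then neighborhood aggregation and a learnable transform (Eq. 21)."""
    deg = M.sum(dim=1).clamp(min=1e-12)          # node degrees from the weighted adjacency
    D_inv_sqrt = torch.diag(deg.pow(-0.5))       # D^{-1/2}
    M_hat = D_inv_sqrt @ M @ D_inv_sqrt          # normalized adjacency (Eq. 20)
    return torch.relu(M_hat @ H @ W)             # propagation rule with ReLU (Eq. 21)

# Example: 10 buildings, 1536-D fused features, 32-D first hidden layer
H1 = gcn_layer(torch.rand(10, 1536), torch.rand(10, 10), torch.rand(1536, 32))
print(H1.shape)                                  # torch.Size([10, 32])
```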
3.4. Model Optimization
In graph $G$, the feature matrix V, which incorporates both visual and visual-textual semantic features, is fed into the S3I-GCN. The hidden representations $H^{(k)}$ from the k-th layer are passed through a SoftMax classifier:

$$\hat{Y} = \mathrm{SoftMax}\left(H^{(k)}\right)$$

where $H^{(0)} = V$ denotes the input of the first layer.
The network is trained using a cross-entropy loss function, defined as

$$\mathcal{L} = -\sum_{p \in Y_L} \sum_{i=1}^{C} Y_{pi} \ln \hat{Y}_{pi}$$

where $Y_L$ represents the set of labeled buildings in the training set, C is the number of building function categories, and $Y_{pi}$ is the ground-truth label for node p and class i. Finally, the SoftMax layer outputs a probability distribution of each node over all building function categories.
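A sketch of the semi-supervised optimization step: logits are produced for every building node, but the cross-entropy loss is evaluated only on the labeled subset. Tensor shapes and the 20% labeling ratio follow Section 4; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4125, 6, requires_grad=True)   # k-th layer output for all building nodes
labels = torch.randint(0, 6, (4125,))                # ground-truth function categories
labeled_mask = torch.rand(4125) < 0.2                # ~20% labeled nodes, as in Section 4.2

loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])  # loss on labeled nodes only
loss.backward()                                      # gradients reach the whole S3I-GCN
probs = F.softmax(logits, dim=1)                     # per-node distribution over categories
```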
4. Experimental Setting
4.1. Study Area and Datasets
This study selects Brooklyn, one of the administrative districts of New York City, as the study area. The buildings in Brooklyn are diverse in terms of type and spatial distribution. The SVIs of the study area are sourced from “Zone 2” of the OmniCity dataset [31], which provides detailed building functional categories. As shown in Figure 5, the study area covers approximately 20.09 square kilometers, and the Brooklyn dataset contains a total of 5197 SVIs. After processing by the location mapping module, 4125 unique buildings are mapped to SVIs. The SVIs and buildings are divided into a training set and a test set at a ratio of roughly 3:1: the training set consists of 4156 images covering 3086 buildings, and the test set comprises 1041 images covering 1039 buildings. It should be noted that this paper seeks to develop a semi-supervised training approach that enhances the model's generalization capability when labeled data are limited. In the experiments, the proposed method assumes that the spatial regions of the training and testing data do not overlap; that is, the data used for prediction did not participate in training. Even in production scenarios where vast amounts of prediction data are available, the proposed method does not increase the training dataset size, thus alleviating the computational bottlenecks of semi-supervised models.
Figure 5.
Spatial distribution of buildings.
The building functional categories in the study area include six categories: 1/2 Family Buildings, Walk-Up Buildings, Elevator Buildings, Industrial/Transportation/Utility, Mixed Residential/Commercial, and Office Buildings. A detailed description of each category is listed in Table 1. Figure 6 illustrates each category with four samples.
Table 1.
Building function categories of Brooklyn.
Figure 6.
Illustration of building function.
4.2. Implementation Details
The ViT model pre-trained on the ImageNet dataset was employed to extract visual feature vectors from the SVIs. The S3I-GCN stacks three GCN layers, connected using the ReLU non-linear activation function. The cross-entropy loss function was used to train the proposed model. The dimensions of the input layer, first hidden layer, second hidden layer, and output layer were set to 1536, 32, 16, and 8, respectively.
During training, a fixed random seed of 1,234,567 was used to ensure the reproducibility of results. The network was trained for 300 epochs using the Adam optimizer with an initial learning rate of 0.005 and a weight decay of 0.0005. In this study, K is set to 5, and $\beta$ is set to 1. The hyperparameters D and $W_s$ are set to 60 m and 30 m, respectively. Following the common setup of semi-supervised learning models, 20% of the SVIs from the training set were randomly sampled as labeled samples, and the remaining images were treated as unlabeled samples.
All the experiments were conducted using the PyTorch 2.1.1 framework and performed on an NVIDIA 3090 GPU.
4.3. Baselines
This study selected four methods as baselines to investigate the performance of the proposed method, including VGG [32], ResNet [33], StructCNN [9], and GASSL [14]. VGG and ResNet are two classical deep convolutional networks for vision tasks. VGG is known for its deep architecture that effectively captures both local and global visual features. ResNet incorporates residual learning through skip connections, allowing it to extract complex patterns of urban environments. StructCNN is a task-specific method designed for building instance classification using SVIs. GASSL is a state-of-the-art semi-supervised model of building function identification. These baselines are selected to include both classical vision models and state-of-the-art methods of building instance identification, contributing to a comprehensive evaluation of the proposed method.
4.4. Evaluation Metrics
This study adopts three widely used evaluation metrics, including Precision (Pre.), Recall (Rec.), and F1-score (F1.). For a given class, Pre. indicates the proportion of correctly predicted instances of that class among all instances predicted as that class. Rec. measures the proportion of actual instances of the class that were correctly identified by the model. F1. serves as the harmonic mean of Pre. and Rec., providing a balanced assessment of the model’s performance on that specific class.
For each class c, the evaluation metrics are defined as follows:

$$\mathrm{Pre}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Rec}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{F1}_c = \frac{2 \cdot \mathrm{Pre}_c \cdot \mathrm{Rec}_c}{\mathrm{Pre}_c + \mathrm{Rec}_c}$$

where, for each class c, $TP_c$, $FP_c$, and $FN_c$ represent the number of true positives, false positives, and false negatives, respectively.
The overall macro-averaged metrics are then computed as follows:

$$\mathrm{Pre} = \frac{1}{n}\sum_{c=1}^{n} \mathrm{Pre}_c, \qquad \mathrm{Rec} = \frac{1}{n}\sum_{c=1}^{n} \mathrm{Rec}_c, \qquad \mathrm{F1} = \frac{1}{n}\sum_{c=1}^{n} \mathrm{F1}_c$$

where n denotes the total number of classes.
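For reference, a small sketch of the per-class and macro-averaged metrics defined above:

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Compute per-class Precision/Recall/F1 and their macro averages."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pre, rec, f1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        pre.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return np.mean(pre), np.mean(rec), np.mean(f1)
```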
5. Results and Discussion
5.1. Overall Results
As shown in Table 2, the proposed method outperforms all the baseline models in all the evaluation metrics. Specifically, the method achieves an average Pre. of 30.45%, an average Rec. of 27.32%, and an average F1. of 27.74%, surpassing the second-best model (GASSL) by margins of 10.43%, 6.75%, and 7.98%, respectively. Additionally, the weighted F1. of the proposed method stands at 37.8%, indicating that class imbalance has affected the overall average scores. The results demonstrate the superior capability of the proposed method in accurately identifying building functions. Furthermore, the proposed method outperforms the baselines across all six building function categories in terms of Pre.; specifically, it achieves 40.00% in category C3 and 53.49% in category C2, outperforming the second-best methods by substantial margins. The improvements observed in some categories, such as C3 (i.e., Elevator Buildings), which is very challenging due to its high visual ambiguity, further highlight the effectiveness of the multiple semantics.
Table 2.
Comparison results in the Brooklyn dataset.
To further examine the performance of the proposed method, the identification results of the various methods on the Brooklyn dataset are visualized in Figure 7. As illustrated in the figure, the proposed method achieves better performance than the baseline methods. Specifically, the baseline methods, shown in (a) to (d), made several misidentifications. In contrast, as shown in (e), the proposed method produces more accurate building category predictions, highlighting its superior ability to integrate multiple semantics and spatial contextual information.
Figure 7.
Visualized results of (a) ResNet, (b) VGG, (c) StructCNN, (d) GASSL, (e) the proposed method, (f) ground truth in the Brooklyn dataset.
5.2. Ablation Study
To further investigate the effect of each component within the proposed method, ablation experiments were conducted by systematically removing the following key components: (1) the location mapping module (w/o LoMM); (2) the visual semantic features (w/o $f_v$); (3) the visual-textual semantic features (w/o $f_t$); and (4) the semi-supervised spatial interaction GCN (w/o S3I-GCN). Here, “w/o” indicates that the component is removed from the proposed method. The results are reported in Figure 8.
Figure 8.
Ablation experimental results on the Brooklyn dataset.
As shown in Figure 8, when the LoMM is removed, the F1. drops from 27.74% to 21.72%, indicating the importance of spatial alignment between SVIs and buildings. Removing $f_v$ decreases performance only slightly, and the model retains a relatively high Rec. of 26.17%, suggesting that textual semantics alone offer some resilience in classification. In contrast, removing $f_t$ causes a pronounced decline, with the F1. dropping to 19.55%, highlighting the critical role of textual descriptions in enhancing semantic discrimination. Notably, removing the S3I-GCN module leads to the most substantial performance decline, with the F1. dropping sharply to 9.26%, thereby confirming its vital role in the proposed framework. This demonstrates the critical importance of modeling spatial relationships among buildings to support robust function identification. Overall, the full model achieves the best performance, confirming that each component makes a meaningful contribution and that their integration substantially boosts the effectiveness of the proposed method.
5.3. Effect of the Training Sample Size
Given that the number of labeled training samples may affect model performance, we conducted experiments with five labeling ratios: 10%, 20%, 30%, 40%, and full supervision (i.e., all training samples labeled). The detailed results are presented in Table 3.
Table 3.
Experimental results for different proportions of the labeled sample.
As shown in Table 3, the proposed method achieved its highest F1. of 27.74% when 20% of the training samples were labeled. Remarkably, even with only 10% of the training samples labeled, it reached 80% of the fully supervised performance, highlighting its strong applicability in low-resource scenarios. However, when the annotation ratio increased to 40%, all the evaluation metrics declined, with the F1. dropping to 23.11%. We argue that this may be because training on a large proportion of samples from the training region leads to overfitting, and consequently to lower accuracy in the test region. The results validate the effectiveness of the proposed method under limited-annotation conditions.
5.4. Effect of GNN Models
In this study, GNNs are employed to capture and model the spatial interactions of buildings within a graph network. To evaluate the effect of different GNN models, we conduct experiments using four widely used models: Graph Convolutional Network (GCN) [34], GraphSAGE [35], Graph Attention Network (GAT) [36], and TransGNN [37].
As shown in Figure 9, GraphSAGE achieves a Rec. of 23.22% and a Pre. of 20.77%, resulting in an F1. of 18.96%, while TransGNN obtains a Rec. of 24.40% and a Pre. of 22.48%, resulting in an F1. of 21.06%; both fall below the GCN-based model. Overall, these results demonstrate that GCN substantially improves the model performance across Pre., Rec., and F1. by effectively capturing the spatial dependencies among buildings.
Figure 9.
Experimental results for different spatial relationship modeling strategies.
5.5. Effect of Semantic Fusion Strategy
To evaluate the influence of semantic fusion strategies, we experiment with three widely used fusion methods: feature splicing, adaptive weighted average, and feature aggregation. Their results are presented in Table 4.
Table 4.
Experimental results for different semantic fusion strategies.
Among the three strategies, feature splicing achieves the best overall performance, with a Pre. of 30.45%, a Rec. of 27.32%, and the highest F1. of 27.74%. This indicates that directly concatenating visual and textual semantic features helps retain the complementary information of the two semantics, thereby enhancing their discriminative capacity. In contrast, the adaptive weighted average strategy yields a slightly lower performance, with an F1. of 23.66%. The feature aggregation strategy performs the worst, with an F1. of only 18.40%. The results show that the feature splicing strategy effectively preserves the multiple semantic features, making it more suitable for building function identification tasks.
5.6. Effect of Spatial Weight
As discussed in Section 3.3, the S3I-GCN captures the influence of surrounding geographic areas on building functions by assigning weights to the edges between nodes. To investigate the effect of calculating spatial weights, we compare two commonly used methods: KNN [29] and Queen Contiguity [38]. The results are summarized in Table 5.
Table 5.
Experimental results for different spatial weight calculation algorithms.
Table 5 shows that KNN achieves a significantly higher F1. of 27.74%, with a Pre. of 30.45% and a Rec. of 27.32%. The improvement can be attributed to the flexibility of KNN in modeling irregular spatial structures by dynamically selecting the most relevant neighbors based on distance, thereby preserving local spatial context more effectively. In contrast, the Queen Contiguity method yields an F1. of only 22.96%, with a lower Pre. of 22.99% and Rec. of 25.06%. This suggests that relying solely on topological adjacency may overlook subtle yet relevant spatial interactions, particularly in urban environments under complex spatial distributions.
5.7. Effect of Parameter D
To evaluate the influence of parameter D on the proposed method, experiments were conducted by varying D from 40 m to 80 m in 10 m increments. The detailed results are presented in Table 6. As shown in the table, the highest F1. of 27.74% is obtained when D is set to 60 m. Performance declines slightly for smaller or larger values of D, indicating that an overly small or overly large distance limit increases mapping noise.
Table 6.
Experimental results for different values of parameter D.
5.8. Experiments in Different Areas
To evaluate the robustness of the proposed method, an additional experiment was conducted in a different area, Wuhan. The Wuhan dataset covers a bounded area of the city and comprises a total of 537 buildings. Each building is manually classified into one of four functional categories: Retail (C1), Apartment (C2), Office Building (C3), and Others (C4). In this experiment, the model was trained directly on samples with four-category labels.
The experiments were conducted using the same settings as those for the Brooklyn dataset. As shown in Table 7, the proposed method outperforms the baselines, achieving overall Pre., Rec., and F1. values of 32.21%, 32.56%, and 31.27%, respectively. The experimental results suggest that the proposed approach is robust.
Table 7.
Comparison results in the Wuhan dataset.
6. Conclusions
This study highlights the importance of multiple semantics derived from SVIs for identifying building functions and proposes integrating the multiple semantics of SVIs with a semi-supervised model to enhance identification accuracy. Specifically, a location mapping module was designed to align SVIs with buildings. Additionally, a multi-level visual semantic extraction module was developed to integrate the visual semantics and visual-textual semantics of SVIs, and a semi-supervised spatial interaction module was designed to characterize the spatial context of buildings. Extensive experiments demonstrate that MS-SS-BFI not only achieves superior prediction accuracy but also exhibits strong robustness across various scenarios. The proposed method contributes a methodological reference for SVI-based practical applications in urban analytics, especially in situations where only a small number of labeled samples are available.
Future work will focus on developing improved fusion strategies to integrate multimodal data with SVIs, thereby further enhancing the model’s identification accuracy. Additionally, it is well worth exploring the possibility of extending the proposed model by more deeply incorporating large pre-trained models that have shown strong capabilities in various visual tasks.
Author Contributions
Conceptualization, Fang Fang and Shengwen Li; methodology, Fang Fang and Nan Min; software, Nan Min; validation, Nan Min; formal analysis, Nan Min; investigation, Nan Min; resources, Fang Fang; data curation, Nan Min, Yuxiang Zhao, Yu Wang and Sishi Gong; writing—original draft preparation, Nan Min; writing—review and editing, Fang Fang, Shengwen Li, and Shunping Zhou; visualization, Nan Min; supervision, Shengwen Li; project administration, Fang Fang; funding acquisition, Shengwen Li. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (grant No. 42371420).
Data Availability Statement
The data that support the findings of this study are available at https://figshare.com/s/85e58426fad9c8eefd11, accessed on 15 July 2025.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Steiniger, S.; Lange, T.; Burghardt, D.; Weibel, R. An Approach for the Classification of Urban Building Structures Based on Discriminant Analysis Techniques. Trans. GIS 2008, 12, 31–59.
- Fonte, C.C.; Minghini, M.; Antoniou, V.; Patriarca, J.; See, L. Classification of Building Function Using Available Sources of VGI. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2018, IV-4, 209–215.
- Smith, D.; Crooks, A. From Buildings to Cities: Techniques for the Multi-Scale Analysis of Urban Form and Function; Centre for Advanced Spatial Analysis (UCL): London, UK, 2010; Unpublished work.
- Xiao, C.; Xie, X.; Zhang, L.; Xue, B. Efficient Building Category Classification with Façade Information from Oblique Aerial Images. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2020, V-2-2020, 1309–1313.
- Kang, S.; Hong, T. A Feasibility Study of Occupancy Measurement Using Environmental Sensors for Building Energy Management. Appl. Energy 2015, 149, 18–35.
- Zhong, L.; Weng, Q. Building Function Classification from High-Spatial Resolution Imagery and Geographic Data Using Deep Learning. ISPRS J. Photogramm. Remote Sens. 2019, 149, 104–116.
- Sarmadi, B.; Javed, U.; Gu, Y.; Wang, S.; Cao, S. Robust Building Identification from Street Views Using Deep Convolutional Neural Networks. Buildings 2023, 13, 578.
- Zhao, K.; Liu, Y.; Hao, S.; Lu, S.; Liu, H.; Zhou, L. Bounding Boxes Are All We Need: Street View Image Classification via Context Encoding of Detected Buildings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 13–19 June 2020; pp. 1072–1081.
- Kang, J.; Körner, M.; Wang, Y.; Taubenböck, H.; Zhu, X. Building Instance Classification Using Street View Images. ISPRS J. Photogramm. Remote Sens. 2018, 145, 44–59.
- Wang, C.; Hornauer, S.; Yu, S.X.; McKenna, F.; Law, K.H. Instance Segmentation of Soft-Story Buildings from Street-View Images with Semiautomatic Annotation. Earthq. Eng. Struct. Dyn. 2023, 52, 2520–2532.
- Murdoch, R.; Al-Habashna, A. Residential Building Type Classification from Street-View Imagery with Convolutional Neural Networks. Signal Image Video Process. 2024, 18, 1949–1958.
- Xie, X.; Liu, Y.; Xu, Y.; He, Z.; Chen, X.; Zheng, X.; Xie, Z. Building Function Recognition Using the Semi-Supervised Classification. Appl. Sci. 2022, 12, 9900.
- Skuppin, N.; Hoffmann, E.J.; Shi, Y.; Zhu, X.X. Building Type Classification with Incomplete Labels. Remote Sens. 2022, 14, 567.
- Li, W.; Yu, J.; Chen, D.; Lin, Y.; Dong, R.; Zhang, X.; He, C.; Fu, H. Fine-Grained Building Function Recognition with Street-View Images and GIS Map Data via Geometry-Aware Semi-Supervised Learning. Int. J. Appl. Earth Obs. Geoinf. 2025, 137, 104386.
- Du, S.; Zhang, F.; Zhang, X. Semantic Classification of Urban Buildings Combining VHR Image and GIS Data: An Improved Random Forest Approach. ISPRS J. Photogramm. Remote Sens. 2015, 105, 107–119.
- Xu, Y.; He, Z.; Xie, X.; Xie, Z.; Luo, J.; Xie, H. Building Function Classification in Nanjing, China, Using Deep Learning. Trans. GIS 2022, 26, 2145–2165.
- Vasavi, S.; Somagani, H.S.; Sai, Y. Classification of Buildings from VHR Satellite Images Using Ensemble of U-Net and ResNet. Egypt. J. Remote Sens. Space Sci. 2023, 26, 937–953.
- Li, Z.; Su, Y.; Zhu, C.; Zhao, W. BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Models. arXiv 2024, arXiv:2409.19527.
- Liang, X.; Xie, J.; Zhao, T.; Stouffs, R.; Biljecki, F. OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery. arXiv 2025, arXiv:2504.02866.
- Laupheimer, D.; Tutzauer, P.; Haala, N.; Spicker, M. Neural Networks for the Classification of Building Use from Street-View Imagery. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2018, IV-2, 177–184.
- Protopapadakis, E.; Doulamis, A.; Doulamis, N.; Maltezos, E. Stacked Autoencoders Driven by Semi-Supervised Learning for Building Extraction from Near Infrared Remote Sensing Imagery. Remote Sens. 2021, 13, 371.
- Liu, P.; De Sabbata, S. A Graph-Based Semi-Supervised Approach to Classification Learning in Digital Geographies. Comput. Environ. Urban Syst. 2021, 86, 101583.
- Fouedjio, F.; Talebi, H. Geostatistical Semi-Supervised Learning for Spatial Prediction. Artif. Intell. Geosci. 2022, 3, 162–178.
- Wilson, D.; Alshaabi, T.; Van Oort, C.; Zhang, X.; Nelson, J.; Wshah, S. Object Tracking and Geo-Localization from Street Images. Remote Sens. 2022, 14, 2575.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
- Yang, Y.-H.; Huang, T.E.; Sun, M.; Rota Bulò, S.; Kontschieder, P.; Yu, F. Dense Prediction with Attentive Feature Aggregation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2023; pp. 97–106.
- Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN Model-Based Approach in Classification. In OTM Confederated International Conferences on the Move to Meaningful Internet Systems; Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996.
- Chen, Y. The Distance-Decay Function of Geographical Gravity Model: Power Law or Exponential Law? Chaos Solitons Fractals 2015, 77, 174–189.
- Li, W.; Lai, Y.; Xu, L.; Xiangli, Y.; Yu, J.; He, C.; Xia, G.-S.; Lin, D. OmniCity: Omnipotent City Understanding with Multi-Level and Multi-View Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17397–17407.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1025–1035.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
- Zhang, P.; Yan, Y.; Zhang, X.; Li, C.; Wang, S.; Huang, F.; Kim, S. TransGNN: Harnessing the Collaborative Power of Transformers and Graph Neural Networks for Recommender Systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 1285–1295.
- Anselin, L. Spatial Econometrics: Methods and Models; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1988.
© 2025 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).