Building Block Extraction from Historical Maps Using Deep Object Attention Networks

Abstract: The geographical feature extraction of historical maps is an important foundation for realizing the transition from human map reading to machine map reading. The current methods for building block extraction from historical maps have many problems, such as low accuracy and poor scalability. Moreover, the high cost of annotating historical maps further limits their application. In this study, a method for extracting building blocks from historical maps is proposed based on the deep object attention network. Based on the OCRNet framework, multiple attention mechanisms were used to improve the ability of the network to extract the contextual information of the target. Moreover, through the optimization of the feature extraction network structure, the impact of the down-sampling process on local information and boundary contours was reduced in order to improve the network's ability to capture boundary information. Subsequently, the transfer learning method was used to jointly train the network model on both remote sensing datasets and few-shot historical map datasets to further improve the feature learning ability of the network, which overcomes the constraints of small sample sizes. The experimental results show that the proposed method can effectively improve the extraction accuracy of building blocks from historical maps.


Introduction
Historical maps preserve the natural landscape and the traces of human activity on the Earth's surface over extended periods [1,2]. Such maps are valuable for historical and cultural heritage, and they provide an essential source for the description and communication of geographical features and their spatial relationships. Thus, historical maps are of great reference value for analyzing regional developments and changes. With the continuous development of digital technology, many countries have digitized paper historical maps and established different types of digital historical map archives to better preserve and utilize these historical documents (https://ngmdb.usgs.gov/topoview/viewer/ (accessed on 12 November 2022), https://www.oldmapsonline.org/ (accessed on 12 November 2022), http://www.map-cn.com/ (accessed on 12 November 2022), https://www.swisstopo.admin.ch/en/maps-geodata-online.html (accessed on 12 November 2022)). Although these digital historical maps are available to the public and can be easily read and understood by humans, an enormous amount of geographic information remains locked in the images, which makes it challenging to perform quantitative calculations and spatial analysis through automatic machine reading [3]. In geographic feature extraction, each pixel is assigned an accurate classification label (e.g., roads, waters, building blocks, and map annotations) such that computers can autonomously "read" the maps. Buildings are important man-made structures and essential places for people's lives and activities. Due to the limitations of the era and of mapping technology, the automatic extraction and identification of building block features from historical maps faces the following challenges (Figure 1). (1) The sizes and styles of building blocks vary, but they can be simplified either as single buildings or as blank areas with boundaries. Thus, the multi-scale expression of features is a great challenge. (2) Building blocks can overlap and nest with each other, which increases the difficulty of filtering and screening boundaries. (3) Building blocks can be covered by other map contents, which interfere with the texture features of the building blocks. (4) Finally, the integrity of building blocks can be affected by stains, creases, and map damage.
With the rapid development of deep learning techniques in the field of computer vision, Convolutional Neural Networks (CNNs) have been widely applied in the field of geospatial information processing [4][5][6]. Compared with common images, building blocks account for only a small part of historical map images (Figure 2). Thus, local information may be missing, or the boundaries of the building blocks may be blurred due to down-sampling in the training process [7], which affects the extraction results. The majority of the existing CNNs connect sub-networks in series, going from high resolution to low resolution and then back to high resolution (e.g., ResNet [8], GoogLeNet [9], and VGGNet [10]). This network structure affects the ability to extract the features of building blocks to a certain extent. In addition, many public datasets are available for the semantic segmentation of natural images, such as ImageNet [11], COCO [12], and Cityscapes [13], which have greatly improved the accuracy of semantic segmentation. In comparison, benchmark datasets for the semantic segmentation of historical maps are relatively rare. Therefore, improving the accuracy of semantic segmentation based on limited label data is of great significance for feature extraction from historical map images [14].
To fully utilize the advantages of deep learning in feature extraction from historical maps, the object context features were incorporated with the attention mechanism in this study. Specifically, based on HRNet [15] and OCRNet [16], a Deep Object Attention Network (DOANet) was developed to extract fine-grained geographic features from historical maps under the condition of limited training samples. DOANet makes full use of the deep features in the limited training samples, and the attention mechanism captures the global contextual information, thereby improving the ability of the model to learn the features of building blocks and suppressing its response to other categories of features. Moreover, the transfer learning method was integrated into training to extract building blocks from the established few-shot historical map dataset. The results show that the proposed model can effectively reduce the cost of manual annotation under the condition of limited map samples and has good accuracy in historical map building block extraction.
The remainder of this paper is organized as follows: Section 2 reviews the related work, Section 3 describes the idea of the algorithm, and Section 4 presents the implementation of the algorithm and compares and analyzes the experimental results. Finally, Section 5 concludes the paper with remarks on future work.

Related Works
As a critical task of digital map processing, information extraction from historical maps has received much attention from researchers in various fields [17][18][19][20][21]. Early studies mainly used color segmentation, template matching, shape descriptors, mathematical morphological operators, and other methods to extract geographic features, such as roads, contour lines, and buildings [22][23][24][25][26][27][28]. However, these methods often use customized processes and parameter configurations for specific map types or geographic elements, and thus, they suffer from low automation. As the scale of digital map data has increased, the cost of data processing has increased significantly. In recent years, CNN-based object detection, scene classification, and semantic segmentation methods have been applied to extract geospatial data from remote sensing images, and these approaches have shown better results than traditional methods [5,29,30]. As both remote sensing images and historical map images are pixel-based and both are objective representations of geographical elements, they have similarities at the semantic level. Hence, many studies have been carried out to apply deep learning methods to the extraction of information from historical maps [31][32][33][34].
To reduce the impact of scale on the feature extraction results of deep learning networks, Duan et al. [35] proposed a georeferencing method based on reinforcement learning. Through the automatic alignment of contemporary vector data and georeferenced historical maps, the precise locations of geographic features on scanned maps were annotated. In addition, Generative Adversarial Networks (GANs) have been used to generate data for historical maps. For example, Li [36] proposed an automatic method to generate a dataset from OpenStreetMap for training text detection systems that work with historical maps. Andrade et al. [4] synthesized satellite-like urban images based on historical maps. Although these methods can increase the scale of datasets to a certain extent and improve the feature learning ability of deep learning networks, they still require a large amount of historical map data, and deviations between vector data and historical maps remain.
Saeedimoghaddam et al. [33] used Faster R-CNN to extract the intersection points of single-lane and double-lane roads from the United States Geological Survey (USGS) historical map series; they also used Inception-ResNet-V2 pre-trained on the Microsoft Common Objects in Context (COCO) dataset to improve the accuracy of the network. Because the target objects in the COCO dataset are quite different from geospatial elements in terms of the scale, direction, and shape of the data [37], the use of geospatial data (e.g., remote sensing images) for pre-training and transfer learning has been proposed.
Heitzler et al. [38] segmented single buildings from the Swiss Siegfried map using U-Net and used methods based on contour tracing and orientation-based clustering to vectorize the segmentation results. Uhl et al. [3,7,39] studied the effects of different network structures on extracting the footprint of human settlements from the historical USGS topographic map series and used a weakly supervised CNN to address the high cost of manual annotation. They found that for the semantic segmentation of historical maps, the accuracy of the feature extraction network had a significant influence on the segmentation performance. It is worth noting that the above studies only focused on buildings in small-scale topographic maps that were represented by small rectangles with a regular shape and simple texture (most of them filled with a single color). However, in large-scale maps, building blocks have complex contours and different types of textures, or even texture-free blanks, which poses a great challenge to the feature extraction capabilities of the algorithm.

Network Model
When training samples are limited, the number of target features (i.e., building blocks) in a map is small. Hence, improving the feature extraction ability of a network is necessary, and the problem of missing details during down sampling must be addressed. In this study, the encoding and decoding structures of HRNet were optimized to capture multi-scale deep features, and OCRNet was introduced to obtain contextual information in the samples. Then, the deep features were incorporated with the contextual information to increase the ability of the network to learn the building block features in few-shot datasets.

Architecture of DOANet
The architecture of the proposed DOANet for building block extraction from historical maps based on the attention mechanism is shown in Figure 3. Based on OCRNet, DOANet is composed of a feature extraction module and an object attention module. In particular, the object attention module is further divided into the criss-cross attention module and the object context module. Specifically, the criss-cross attention module uses a large receptive field to obtain spatial distribution information and learns important features while ignoring irrelevant features. The object context module is designed to fuse receptive fields of different sizes to capture detailed contextual information. Then, the object attention module aggregates the spatial distribution information and object contextual information to enhance feature representation. The deep features extracted by the feature extraction module are further optimized by the criss-cross attention module, and the optimized deep features and the coarse classification results of the intermediate layers are taken as the input of the object context module to obtain the object region features. Then, the optimized deep features and object region features are combined by the criss-cross attention module to obtain the object context features. Finally, the optimized deep features are spliced with the object context features to obtain the final feature representation with enhanced contextual information.
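The step that turns deep features and coarse classification results into object region features can be illustrated with a minimal sketch. It assumes the aggregation used by OCRNet [16], in which each class region is a spatially normalized, weighted sum of pixel features; the function name, the list-based data layout, and the use of a spatial softmax are assumptions made for illustration and are not taken from the implementation in this study.

```python
import math

def object_region_features(pixel_features, coarse_scores):
    """Aggregate pixel features into one feature vector per class (assumed OCRNet-style).

    pixel_features: list of N feature vectors, one per pixel.
    coarse_scores: list of K lists, each holding N coarse class scores.
    """
    dim = len(pixel_features[0])
    regions = []
    for scores in coarse_scores:
        # Spatial softmax: how strongly each pixel belongs to this class region.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the pixel features gives the region representation.
        regions.append([
            sum(w * f[d] for w, f in zip(weights, pixel_features))
            for d in range(dim)
        ])
    return regions
```

With two classes (background and building), the result holds two region vectors, which a later attention step can relate back to every pixel.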

Feature Extraction Module
Unlike most feature extraction networks, HRNet is composed of parallel high-resolution and low-resolution sub-networks and uses repeated multi-scale feature fusion. Low-resolution feature maps of the same depth and similar levels are used to improve the high-resolution features, enabling the network to capture local information robustly. Based on HRNet, DOANet consists of six encoders and six decoders (Figure 4), which reduces the connections of the same-scale layers in the original network and strengthens the connections between layers of different scales. Moreover, due to the small number of encoders, fewer decoders are needed, thereby reducing the size of the network. Each encoder uses leaky-ReLU as the activation function, supplemented by batch normalization operations to improve the stability of the model parameters. Because the extraction of building information from historical maps is a binary classification problem, i.e., the labels only include the background and buildings, cross entropy is used as the loss function L:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \quad (1)$$
where $N$ is the number of samples, $y_i$ is the label of sample $i$, which is one for positive classification and zero for negative classification, and $p_i$ is the probability that sample $i$ will be positively predicted.
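The binary cross-entropy loss described above can be written as a short sketch in plain Python. Averaging over the samples and the clamping constant `eps` are assumptions added here for numerical safety; they are not stated in the text.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy for labels in {0, 1} and predicted probabilities."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clamp the probability so that log(0) can never occur (assumed safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A confident correct prediction, such as `binary_cross_entropy([1], [0.99])`, gives a loss close to zero, whereas an uninformative prediction of 0.5 gives log 2 per sample.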

Attention Module
The criss-cross attention module [40] shown in Figure 5 was used in this study. The module captures the dependence of each pixel on the other pixels in the image, thereby improving the ability of the network to extract contextual information. First, the dimension of the feature map $\mathbf{H}$ of size $C \times W \times H$ is reduced by two 1 × 1 convolutions to obtain two feature maps, $Q$ and $K$. Then, for any pixel $u$ on the feature map $Q$, a channel vector $Q_u$ with a size of $1 \times 1 \times C'$ is obtained, and, following [40], all pixels of $K$ in the same row and column as pixel $u$ are used to construct a feature vector $\Omega_u$ with a size of $(H + W - 1) \times C'$. Next, the affinity $d_{i,u}$ between $Q_u$ and each element of $\Omega_u$ is calculated through the affinity operation:
$$d_{i,u} = Q_u \Omega_{i,u}^{\top} \quad (2)$$
where $\Omega_{i,u}$ denotes the $i$-th channel vector of $\Omega_u$. The attention map $A$ of size $(H + W - 1) \times W \times H$ is obtained after the SoftMax layer. In addition, the feature map $V$ of size $C \times W \times H$ is obtained through another 1 × 1 convolution of the feature map $\mathbf{H}$. For each pixel $u$, the feature vectors $\Psi_{i,u}$ of $V$ in the same row and column as $u$ are weighted by the attention values $A_{i,u}$ at the corresponding positions and summed to obtain the residual aggregation feature at that position, which is then added to the original feature vector $\mathbf{H}_u$ to obtain the feature vector $\mathbf{H}'_u$ with a stronger feature representation ability:
$$\mathbf{H}'_u = \sum_{i=1}^{H+W-1} A_{i,u}\Psi_{i,u} + \mathbf{H}_u \quad (3)$$
Because a single criss-cross attention module only considers elements on the same row and column as a pixel, two criss-cross attention modules are connected in sequence to obtain the contextual information at all positions.
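The criss-cross attention step described above can be sketched in plain Python for small feature maps. The sketch assumes that the 1 × 1 convolutions have already been applied, so the reduced maps `Q` and `K`, the map `V`, and the input map `X` are passed in directly. It also assumes, following [40], that the row-and-column vectors are taken from `K`. All names are illustrative, and the nested-list layout `[channel][row][column]` is chosen only for readability.

```python
import math

def criss_cross_attention(Q, K, V, X):
    """One criss-cross attention pass over maps indexed [channel][row][column].

    Returns X plus the attention-weighted sum of V along each pixel's row and column.
    """
    height, width = len(X[0]), len(X[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in X]
    for y in range(height):
        for x in range(width):
            # Criss-cross path of pixel u = (y, x): its row plus its column,
            # with u itself counted once, i.e. H + W - 1 positions.
            path = [(y, j) for j in range(width)]
            path += [(i, x) for i in range(height) if i != y]
            # Affinity: dot product of Q_u with each K vector on the path.
            d = [sum(Q[c][y][x] * K[c][i][j] for c in range(len(Q)))
                 for i, j in path]
            # Softmax over the path yields the attention weights.
            top = max(d)
            exps = [math.exp(v - top) for v in d]
            total = sum(exps)
            attn = [e / total for e in exps]
            # Residual aggregation: weighted sum of V on the path plus the input.
            for c in range(len(V)):
                agg = sum(a * V[c][i][j] for a, (i, j) in zip(attn, path))
                out[c][y][x] = X[c][y][x] + agg
    return out
```

A single pass only mixes information along one row and one column. Running a second pass on the output, with `Q`, `K`, and `V` recomputed from it, lets every pixel reach every other pixel through an intermediate position, which is why two modules are chained.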

Figure 5. Structure of the attention module.

Transfer Learning
Transfer learning is a technique that converts the learning processes in the source domain, including the training data, model parameters, and tasks, into knowledge and then transfers them to the target domain to facilitate the learning of the prediction function in the target domain [41]. The transfer learning method used in this study is shown in Figure 6. A public dataset was used as the training dataset in the source domain; that is, the sample dataset in the source domain was imported into DOANet for learning, and the parameters and features of the source domain network were shared with the target domain through network replication. The target domain network was initialized with the network parameters in the source domain, while freezing the batch normalization layer in the target domain. The dataset in the target domain was used for training to fine-tune the network parameters in the target domain, thereby realizing the knowledge transfer.
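The initialization and fine-tuning procedure described above can be outlined with a framework-independent sketch. Parameters are modeled as a dictionary of name to list of floats, batch normalization parameters are identified by a hypothetical `"bn"` name prefix, and the update is a plain gradient step without momentum or weight decay. These are all simplifying assumptions and do not reproduce the PaddlePaddle implementation used in this study.

```python
def init_from_source(source_params):
    """Copy every source-domain parameter into a new target-domain parameter dict."""
    return {name: list(values) for name, values in source_params.items()}

def fine_tune_step(params, grads, lr, frozen_prefixes=("bn",)):
    """One plain SGD step that leaves parameters with a frozen name prefix untouched."""
    prefixes = tuple(frozen_prefixes)
    updated = {}
    for name, values in params.items():
        if name.startswith(prefixes):
            # Frozen (e.g. batch normalization): keep the source-domain values.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated
```

In use, `init_from_source` would be called once with the parameters learned on the source-domain dataset, and `fine_tune_step` would then be applied repeatedly with gradients computed on the target-domain historical map samples.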

Experiments
To evaluate the performance of DOANet and compare the results with those of the existing semantic segmentation algorithms, the dataset for task 1 (building block detection) in the ICDAR2021 Competition on Historical Map Segmentation [42] was used to train, validate, and test the algorithms. The dataset consists of large-scale urban maps of Paris dating from 1860 to 1940 collected by the French National Library, and it includes one training image, one validation image, and three test images. The resolution of each image is at least 8000 × 6000. The building blocks in the training and validation images were manually annotated. A 512 × 512 sliding window with a step size of 200 was used to crop the training image and the validation image (including the corresponding annotated images), and the cropped image blocks were then used as the dataset in the experiment. The dataset was then divided into a training set and a validation set at a ratio of 8:2. In total, 2237 training samples and 559 validation samples were obtained. Prior to training, each training sample was mirrored vertically and horizontally and then randomly rotated once by 45°.
ICDAR2021 provides a standard indicator to evaluate the test results, which is calculated as follows:
$$PQ = \underbrace{\frac{\sum_{(p,g)\in TP} IoU(p,g)}{|TP|}}_{SQ} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{RQ} \quad (4)$$
where PQ is the aggregated score, SQ is the mean Intersection Over Union (mIoU) of the matched pairs $(p, g)$ of predicted and ground-truth regions, RQ is the F-score, and TP, FP, and FN represent true positive, false positive, and false negative, respectively.
The test platform was a 64-bit Ubuntu 18.04 operating system equipped with eight GeForce RTX™ 2080 Ti GPUs (11 GB VRAM). The PaddlePaddle v2.1 framework was used to build the algorithm. The batch size was set to 8 according to the characteristics of the GPUs, the initial learning rate of the network was 0.0125, and Stochastic Gradient Descent (SGD) was used as the optimizer. The momentum was 0.9, the weight decay rate was 4 × 10⁻⁵, and the number of training epochs was set to 10,000.
The open source WHU building dataset [43] was used as the source domain dataset, and the divided dataset was used as the target domain dataset to train DOANet. Building block extraction was carried out on the three test images after training, and the results were evaluated using the above indicators in Equation (4).
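The evaluation indicator can be illustrated with a minimal sketch that computes PQ, SQ, and RQ from the IoU values of candidate prediction and ground-truth pairs. The rule that a pair counts as a true positive only when its IoU exceeds 0.5 is the convention of the standard panoptic quality metric and is an assumption here, as are the function and parameter names.

```python
def panoptic_quality(pair_ious, num_pred, num_gt, iou_threshold=0.5):
    """Return (PQ, SQ, RQ) given IoUs of candidate prediction/ground-truth pairs.

    pair_ious: IoU of each predicted region with its best ground-truth candidate.
    num_pred, num_gt: total number of predicted and ground-truth regions.
    """
    # Assumed matching rule: a pair is a true positive if its IoU exceeds the threshold.
    tp_ious = [iou for iou in pair_ious if iou > iou_threshold]
    tp = len(tp_ious)
    fp = num_pred - tp  # predictions without a match
    fn = num_gt - tp    # ground-truth regions without a match
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                # mean IoU of matched pairs
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)  # F-score of the matching
    return sq * rq, sq, rq
```

For example, with four predictions, three ground-truth blocks, and pair IoUs of 0.9, 0.7, and 0.4, two pairs are matched, so SQ is 0.8 and RQ is 2 / 3.5.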

Results
The scores of DOANet were compared with the official results of ICDAR2021 (Table 1). The PQ of DOANet was higher than that of the other methods for all three test images. The visualization results of the method are shown in Figure 7. Compared with the other methods, DOANet increased the PQ by at least 7.2%, suggesting that the feature extraction method based on the attention mechanism effectively aggregated the local features lying outside the receptive field and reduced the impact of the boundary blurring caused by scale changes. Thus, the proposed algorithm achieved a high level of detection accuracy. However, the method still handles overlapping and nested building blocks poorly, as shown by the second result in Figure 7, where DOANet incorrectly merges several single buildings (shaded sections) into a whole.

Ablation Analysis
To further analyze the role of the object attention module and transfer learning in the extraction of geographic features from historical maps, five algorithms were designed to investigate the modular performance of DOANet: ANet: the feature extraction module was unchanged, and only the criss-cross attention module was retained.
ONet: the feature extraction module was unchanged, and only the object context module was retained.
OOANet: the object attention module was unchanged, the number of encoders and decoders in the feature extraction module was changed, and the original HRNet structure (Figure 8, left) was used [15]. ROANet: the object attention module was unchanged, and the feature extraction network ResNet (Figure 8, right), which has a series structure, was used [8].
DOANet-: the network structure was unchanged, and the transfer learning module was removed.
The quantitative evaluation indicators of each experiment are shown in Table 2. Due to the absence of part of the core modules, both ANet and ONet yielded poor building block extraction results. OOANet has a deeper network structure and should theoretically yield better extraction results, yet the results showed that OOANet failed to enhance the features. Moreover, the results of ROANet were also inferior to those of DOANet due to the loss of local information, which indirectly demonstrates the advantages of parallel networks in the extraction of geographical features from historical maps. In addition, in the absence of external knowledge transfer, the performance of DOANet-was affected to a certain extent due to the limited number of samples. As shown in Figure 9, when the number of training samples was small, the scores of DOANet-without transfer learning decreased. As the training dataset increased, the difference between DOANet and DO-ANet-gradually decreased. To achieve a score of >75%, DOANet-required 1200 training samples, whereas DOANet only needed 600 image samples. Hence, the method proposed in this study can effectively deal with the problem of limited training samples and thereby reduce the cost of annotation.  ROANet: the object attention module was unchanged, and the feature extraction network ResNet (Figure 8, right), which has a series structure, was used [8].

Conclusions
To address the problems associated with information extraction from historical maps, a building block extraction network for historical maps, DOANet, was developed based on the attention mechanism. Built upon the HRNet and OCRNet structures, DOANet uses parallel subnetworks to fuse multi-scale features and includes a criss-cross attention module. Moreover, the transfer learning method is integrated with DOANet to improve the detection accuracy of the model in the case of limited training samples. The experimental results show that the PQ of the proposed method increased by at least 7.2% compared with that of existing algorithms. The proposed method effectively solves the problem of poor network performance caused by insufficient training samples in the task of building block extraction from historical maps and provides a reference for extracting other features from historical maps. However, the ability of the method to solve the problem of overlapping nesting within building blocks leaves much to be desired, and we will focus on solving this problem in our next work. In addition, applying the method to different styles of maps (e.g., different languages, different time periods) will also be part of our future work in order to improve the generalizability of the method in this paper.

Conflicts of Interest:
The authors declare no conflict of interest.
