Article
Peer-Review Record

Urban Architectural Style Recognition and Dataset Construction Method under Deep Learning of street View Images: A Case Study of Wuhan

ISPRS Int. J. Geo-Inf. 2023, 12(7), 264; https://doi.org/10.3390/ijgi12070264
by Hong Xu 1,2,*, Haozun Sun 1, Lubin Wang 3, Xincan Yu 1 and Tianyue Li 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 5 April 2023 / Revised: 13 June 2023 / Accepted: 29 June 2023 / Published: 2 July 2023

Round 1

Reviewer 1 Report (Previous Reviewer 1)

After reviewing the authors' response to my previous review, I am pleased to note that they have made significant improvements to the manuscript, addressing all of my concerns and greatly enhancing the clarity and comprehensibility of the paper. The study is both innovative and meaningful, and the proposed approach has the potential to make a valuable contribution to the field of urban architectural style recognition and dataset construction. As such, I recommend this paper for publication.

 

That being said, there are some minor English expression errors that require correction, including the extra use of the term "courtyard" in the 16th row of Table 1, "The most typical representative is the courtyard courtyard in Beijing." Furthermore, I would like to ask why "Jing Residence" is listed separately and whether this type of architecture exists in Wuhan, which is within the research area. Additionally, it would be helpful to know if there is a category for "Wuhan Residence." I kindly request that the authors carefully revise the English expression before the paper is officially published.

Author Response

Response to Reviewer 1 Comments

Point #1: That being said, there are some minor English expression errors that require correction, including the extra use of the term "courtyard" in the 16th row of Table 1, "The most typical representative is the courtyard courtyard in Beijing."

Response: Thanks for your comments. We apologize for these expression errors; we have corrected them in the manuscript.

Point #2: Furthermore, I would like to ask why "Jing Residence" is listed separately and whether this type of architecture exists in Wuhan, which is within the research area. Additionally, it would be helpful to know if there is a category for "Wuhan Residence."

Response: Thanks for your comments.

Jing Residence represents the architectural style of the Beijing region, serving not only as a representative of northern Chinese architecture but also holding a significant position in traditional Chinese architecture, given its role as the capital and ancient imperial city of China. It exerts influence on various architectural styles throughout China. We have also endeavored to identify buildings in the Wuhan region that have been influenced by Jing Residence. According to the experimental findings, there is indeed a discernible number of buildings in the Jing Residence style in the Wuhan region.

Moreover, we find your suggestion regarding the categorization of residential buildings in Wuhan to be highly intriguing. Undoubtedly, this would be a meaningful research endeavor. We wholeheartedly appreciate your inquiry, as it will be one of the main directions for our subsequent research.

Point #3: I kindly request that the authors carefully revise the English expression before the paper is officially published.

Response: Thanks very much for your comments. We apologize for the poor language of our manuscript. We have revised the paper using the MDPI editing service. We really hope that the flow and language level have been substantially improved.

 

Author Response File: Author Response.docx

Reviewer 2 Report (New Reviewer)

This study is very meaningful, and I really enjoyed reading this article. I only have a few minor suggestions:

 

(1) I presume that the authors spent a lot of time on this study, especially in manually labeling the images. In the article, the authors mention that 43,670 images were manually labeled and that the quality of the label data determines the accuracy of the trained model. Could the authors please provide some information on who completed the labeling work and whether their professional background was sufficient to support the labeling work in this study? How was the accuracy of the labeled data measured? Were the results labeled by different people validated to ensure consistency?

 

(2) In Figure 7, there is a huge difference in the number of labels for different types of buildings. Does this affect the accuracy of the model in identifying different types of buildings? Is it true that the more labels there are, the higher the recognition accuracy?

 

(3) The English writing in this paper needs to be strengthened, and grammar errors need to be checked one by one.

 

(4) There are some duplicate or formatting issues with the references, such as references 17 and 22 being repeated.

 

Author Response

Response to Reviewer 2 Comments

Point #1: I presume that the author spent a lot of time on this study, especially in manually labeling the images. In the article, the author mentions that 43,670 images were manually labeled and the quality of the label data determines the accuracy of the trained model. Could the author please provide some information on who completed the labeling work and whether their professional background was sufficient to support the labeling work in this study? How was the accuracy of the labeled data measured? Were the labeled results by different people validated to ensure consistency?

Response: Thanks for your comments.

For the image annotation task, all 43,670 street view images were annotated by undergraduate and graduate students who have received a professional architectural education. We also conducted regular quantitative inspections of the completed annotations to ensure accuracy. After manual checking and confirmation, the accuracy of the label key reached 100%.

Point #2: In Figure 7, there is a huge difference in the number of labels for different types of buildings. Does this affect the accuracy of the model in identifying different types of buildings? Is it true that the more labels there are, the higher the recognition accuracy?

Response: Thanks for your comments.

Yes, there is a large difference in the number of labels across building types, and it does affect the model's accuracy in identifying them. The paper reports the detection accuracy of each architectural style after model training and discusses the issue as follows:

For the Functionalism style, which has a large number of labels, the three models show good recognition accuracy, indicating that a sufficient number of labels helps improve the model's accuracy. However, some traditional architectural styles such as Byzantium, Jing, and Tang have far fewer labels than Functionalism yet still show good recognition accuracy. Our analysis suggests that styles with pronounced visual characteristics have an advantage, because their style attributes can be judged from external features. This may be why they achieve good recognition accuracy without an advantage in the number of labels. Therefore, for styles without eye-catching features, increasing the number of annotations may be the way to obtain higher recognition accuracy.

Table 3 and the 22×22 confusion matrix (Figure 11) cover 22 architectural styles, of which 18 have an accuracy above 0.5 and 10 reach at least 0.76; no style has an accuracy below 0.25.
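
The counts quoted in this response (18 styles, 10 styles, and none below 0.25) can be checked against the per-style AP values listed in Table 3 of this record. The following is a minimal sketch, not the authors' code. It assumes the thresholds are meant as "above 0.5" and "at least 0.76", and that each Q range is closed on its lower bound; the function names are illustrative.

```python
# Per-style AP values as listed in Table 3 of this record (fractions in [0, 1]).
AP = {
    "Art Deco": 0.4128, "New Chinese": 0.4361, "French Classicism": 0.4385,
    "Baroque": 0.4828, "Expressionism": 0.5379, "Han": 0.5717,
    "Gothic": 0.6316, "Western-style": 0.6539, "Chu": 0.6923, "Su": 0.7219,
    "Ancient Rome": 0.7336, "Yuan": 0.7361, "Ancient Greece": 0.7697,
    "Ming": 0.7852, "Folk Residence": 0.7861, "Qing": 0.7959,
    "Functionalism": 0.8412, "Hui": 0.8437, "Song": 0.8673,
    "Jing Residence": 0.8835, "Byzantium": 0.8921, "Tang": 0.8943,
}


def range_label(ap):
    """Map an AP fraction to the Q1-Q4 range used in Table 3.

    Assumption: each range includes its lower bound (0.50 falls in Q2).
    """
    if ap < 0.25:
        return "Q4"
    if ap < 0.50:
        return "Q3"
    if ap < 0.75:
        return "Q2"
    return "Q1"


def summarize(ap_by_style):
    """Count how many styles fall into each range."""
    counts = {"Q4": 0, "Q3": 0, "Q2": 0, "Q1": 0}
    for ap in ap_by_style.values():
        counts[range_label(ap)] += 1
    return counts


counts = summarize(AP)
above_half = sum(1 for ap in AP.values() if ap > 0.5)
at_least_076 = sum(1 for ap in AP.values() if ap >= 0.76)
print(counts, above_half, at_least_076)
```

Under these assumptions the tallies agree with the figures in the response: no style in Q4, 18 styles above 0.5, and 10 styles at 0.76 or higher.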

Point #3: The English writing in this paper needs to be strengthened, and grammar errors need to be checked one by one.

Response: Thanks very much for your comments. We apologize for the poor language of our manuscript. We have revised our manuscript using MDPI language editing services. We really hope that the flow and language level have been substantially improved.

Point #4: There are some duplicate or formatting issues with the references, such as references 17 and 22 being repeated.

Response: Thanks very much for your comments. We have revised them according to your comments.

Author Response File: Author Response.doc

Reviewer 3 Report (New Reviewer)

This paper attempts to construct a database of each building in Wuhan with attribute information about architectural styles in 23 detailed categories by utilizing techniques such as image classification, object detection, and semantic segmentation. Although the purpose of the research is clear, it is difficult for me to give a high evaluation to this paper. The reasons are as follows.

(1) This paper has few academic and technical contributions. The machine learning models used are basically existing ones such as VGG16, Faster R-CNN, and DeepLabV3, but the results do not show high accuracy.

(2) Insufficient consideration and discussion were given to the fact that high discrimination results were not obtained. This is a very important process to improve the study results. For example, Table 3 is considered to be one of the important results showing the validity of this method, but the authors' discussion on this table is hardly mentioned. The authors should not only show the AP, but also discuss the reasons why the AP is low (around 40%) for certain architectural styles, and what kind of misclassification was frequently observed by creating a 23x23 confusion matrix.

(3) I do not understand the significance and validity of the statistics (Fig. 11) and the spatial distribution of the estimation results (Figs. 12-Fig. 17) for the architectural style, which are presented without the above considerations and efforts to improve the accuracy of the estimation. The impact of estimation errors on these results should be considered. Also, it should be possible to discuss the validity of the estimation results in terms of spatial distribution by generating correct data in the architectural styles, even if only for a specific small area.

(4) In addition, there are many other areas where I feel that explanations are insufficient or ambiguous. For example:

- The rationale and validity of the classification into 23 different architectural styles.

- The cost of collecting images through web crawling and copyright issues.

- The conditions of the buildings to be labeled with building forms using Labelimg. That is, the size of the extracted building in the image is also presumed to affect the accuracy of manual labeling.

- The validity of Table 2, which verifies the accuracy of mapping with only 150 artificially extracted buildings.

- What does "the field distance between the two SVI pairs is 50-150m" on p.16 mean? What is the difference from "adjacent two images"?

- What is the data source of the buildings to be mapped? What is the accuracy of the Baidu Map location information? This information is expected to have a significant impact on the accuracy of the matching.

- Impact of the bias in the composition ratio of the annotated buildings, as shown in Fig. 7. Are the authors training the model after downsampling?

Author Response

Response to Reviewer 3 Comments

Point #1: This paper has few academic and technical contributions. The machine learning models used are basically existing ones such as VGG16, Faster R-CNN, and DeepLabV3, but the results do not show high accuracy.

Response: Thanks for your comments.

We summarize the contributions as follows in the revised paper:

“This paper proposes an approach for building an urban architectural style dataset under deep learning for SVIs. The contributions of this paper are summarized as follows. First, this paper summarized 22 architectural styles in the study area, which could be used to define and describe urban architectural styles in most Chinese urban areas. Second, this paper implemented a Faster-RCNN general framework of architectural style classification with a VGG-16 backbone network, which is the first machine learning approach for identifying architectural styles in Chinese cities. Third, this paper introduces an approach for constructing urban architectural style datasets by mapping the identified architectural style in continuous street view imagery and vector map data of building top-down contour maps. This is valuable for urban landscape planning and maintaining in sustainable and smart cities.”

Point #2: Insufficient consideration and discussion were given to the fact that high discrimination results were not obtained. This is a very important process to improve the study results. For example, Table 3 is considered to be one of the important results showing the validity of this method, but the authors' discussion on this table is hardly mentioned. The authors should not only show the AP, but also discuss the reasons why the AP is low (around 40%) for certain architectural styles, and what kind of misclassification was frequently observed by creating a 23x23 confusion matrix.

Response: Thanks for your comments.

Following your comments, we have modified Table 3, adding the accuracy ranges Q4 (0-25%), Q3 (25%-50%), Q2 (50%-75%), and Q1 (75%-100%), and constructed a 22x22 confusion matrix. Indeed, many architectural styles are well identified; the styles with low AP (around 40%) are categories with low differentiation and a small total number of 3186. We have also discussed how to improve the recognition accuracy for this part of the architectural styles.

Table 3. The detection accuracy and range of each architectural style after model training.

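Category             AP/%      Range
None                           Q4 (0-25%)
Art Deco             0.4128    Q3 (25-50%)
New Chinese          0.4361    Q3 (25-50%)
French Classicism    0.4385    Q3 (25-50%)
Baroque              0.4828    Q3 (25-50%)
Expressionism        0.5379    Q2 (50-75%)
Han                  0.5717    Q2 (50-75%)
Gothic               0.6316    Q2 (50-75%)
Western-style        0.6539    Q2 (50-75%)
Chu                  0.6923    Q2 (50-75%)
Su                   0.7219    Q2 (50-75%)
Ancient Rome         0.7336    Q2 (50-75%)
Yuan                 0.7361    Q2 (50-75%)
Ancient Greece       0.7697    Q1 (75-100%)
Ming                 0.7852    Q1 (75-100%)
Folk Residence       0.7861    Q1 (75-100%)
Qing                 0.7959    Q1 (75-100%)
Functionalism        0.8412    Q1 (75-100%)
Hui                  0.8437    Q1 (75-100%)
Song                 0.8673    Q1 (75-100%)
Jing Residence       0.8835    Q1 (75-100%)
Byzantium            0.8921    Q1 (75-100%)
Tang                 0.8943    Q1 (75-100%)
mAP                  0.6868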

Figure 11. 22×22 confusion matrix diagram. The diagram shows the actual and predicted values of the 22 styles and the false predictions for each.

Table 3 and the 22×22 confusion matrix (Figure 11) cover 22 architectural styles, of which 18 have an accuracy above 0.5 and 10 reach at least 0.76; no style has an accuracy below 0.25. The potential factors behind the lower-precision styles (Art Deco, New Chinese, French Classicism, and Baroque) are as follows. First, for French Classicism and Baroque, the number of buildings of these styles within the city limits is small. They nevertheless show good validity in the confusion matrix, which indicates that these styles are identifiable; the data should be expanded to improve accuracy when conditions permit. For Art Deco, the confusion matrix shows that a considerable number of Art Deco buildings were incorrectly identified as Western-style. Art Deco originated at the Paris Exposition and matured during the construction of skyscrapers in the United States in the 1920s and 1930s, and its characteristics include some European styles. Western-style buildings, however, are rarely misidentified as Art Deco. Therefore, extracting feature elements such as setbacks and diverse geometric patterns can significantly improve Art Deco's accuracy. The confusion matrix also shows that New Chinese buildings are mostly misidentified as Functionalism. Judged by appearance alone, the two styles share some similarities, such as simple, streamlined, function-oriented features. New Chinese is a style that adds abstract Chinese cultural symbols to modernist architectural design. One representative symbol is the Chinese roof, so adjusting the model to focus more on the representative features of the building may be the key to improving the accuracy of New Chinese style detection. Finally, part of the style is defined as background architecture, which greatly affects the New Chinese style and all other styles.
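
The misclassification patterns described in this paragraph (Art Deco read as Western-style, New Chinese read as Functionalism) are the kind of result one obtains by ranking the off-diagonal cells of a confusion matrix. The sketch below illustrates that reading. The counts are hypothetical and invented purely for illustration; they are not the paper's data, and `top_confusions` is an illustrative helper name.

```python
def top_confusions(matrix, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count).

    matrix[i][j] is the number of buildings whose true style is labels[i]
    and whose predicted style is labels[j].
    """
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((labels[i], labels[j], count))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:k]


# Hypothetical counts, invented for illustration only (not the paper's data).
labels = ["Art Deco", "Western-style", "New Chinese", "Functionalism"]
matrix = [
    [40, 35, 0, 5],    # true Art Deco
    [2, 90, 0, 8],     # true Western-style
    [0, 1, 45, 50],    # true New Chinese
    [0, 3, 6, 300],    # true Functionalism
]

ranked = top_confusions(matrix, labels)
for true_style, predicted, count in ranked:
    print(f"{true_style} -> {predicted}: {count}")
```

In this toy matrix the confusion is asymmetric: many Art Deco samples land in the Western-style column, while few Western-style samples land in the Art Deco column, which mirrors the asymmetry the response describes.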

Point #3: I do not understand the significance and validity of the statistics (Fig. 11) and the spatial distribution of the estimation results (Figs. 12-Fig. 17) for the architectural style, which are presented without the above considerations and efforts to improve the accuracy of the estimation. The impact of estimation errors on these results should be considered. Also, it should be possible to discuss the validity of the estimation results in terms of spatial distribution by generating correct data in the architectural styles, even if only for a specific small area. 

Response: Thanks for your comments.

The purpose of Figures 12-17 is mainly to show the spatial distribution of different architectural styles. Figures 13-16 below are the updated versions of these figures. Architectural classifications can provide guidance and reference for urban planning and design. By understanding the distribution of different architectural styles in the city, urban planners and designers can obtain advice and decision support on using different styles appropriately. The effectiveness of this classification lies in closely linking architectural styles to urban planning and design, supporting the sustainable development and livability of cities.

Figure 13. Illustration of the dataset generation results matching SVIs. It includes the styles of Qing, Hui, Functionalism, New Chinese, and Folk Residence.

Figure 14. Schematic diagram of dataset validity. Pink is the correct prediction monomer and black is the incorrect one.

Figure 15. Enlarged schematic view of the area along the river. We can see a large number of red Western-style buildings and ancient Roman and ancient Greek style buildings.

Figure 16. Schematic map of East Lake Ecotourism Landscape Park.

For the issue of correctness, we added a subsection to discuss it as follows:

3.3. Dataset validity

Six hundred and fifty-seven distinct building shapes were chosen from the generated results, and their architectural attributes were manually identified as prior knowledge to examine the dataset's correctness objectively. The classification accuracy of the dataset representing Wuhan's architectural styles is presented in Figure 17; it was produced by cross-validating the experimental results against these manually identified categories. Verification information for the Yuan, Song, Han, Su, French Classicism, Byzantium, Gothic, and Baroque styles was lacking in the experimental area, so these eight architectural styles were ignored. As a result, the average classification accuracy of the dataset was 57.8%, the average recall rate was 80.91%, and the average F1 score was 0.634. These results show that the accuracy of the dataset created in this experiment for Wuhan's architectural styles needs to be further improved. However, the dataset can reflect the geographic distribution of Wuhan's architectural styles, and datasets for other cities can be created with only minor adjustments. In addition, the accuracy of the Ming and Tang styles is much lower than the average. Considering geographical location, however, the Ming and Tang styles are more widely distributed in the north, such as in Beijing or Xi'an, and may show better recognition performance if evaluated in those areas. According to the Faster-RCNN model test results, combined with precision, recall, F1 score, and other indicators, the visual analysis of the architectural style dataset shows that styles with poor classification accuracy either have few training samples or are difficult to identify because of their style characteristics. Manual inspection shows that the model recognizes distinctive styles such as Hui, Ancient Greece, and Ancient Rome well. Although the identification of Functionalism is efficient, the model's ability to distinguish among the three styles (Functionalism, New Chinese style, and other Western-style high-rises) still needs further improvement.

Figure 17. Dataset style accuracy histogram. Although the accuracy rate reflects overall correctness, it is not a good indicator of the result when samples are unbalanced. Therefore, we evaluate the accuracy of the dataset we constructed by comparing precision, recall, and F1 score.
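
The point made in this caption, that overall accuracy can look good on unbalanced samples while per-class metrics do not, can be illustrated with a small confusion matrix. The sketch below computes per-class precision, recall, and F1 and then averages them over classes (a macro average). It is only an illustration: the counts are hypothetical, and whether the averages reported in Section 3.3 were computed as macro averages is an assumption, not something stated in the response.

```python
def per_class_metrics(matrix):
    """Compute (precision, recall, f1) per class from a confusion matrix.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    n = len(matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        predicted_c = sum(matrix[r][c] for r in range(n))  # column sum
        actual_c = sum(matrix[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 over classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


# Hypothetical, deliberately unbalanced counts (illustration only).
matrix = [
    [90, 10, 0],   # majority class: 100 samples
    [5, 4, 1],     # minority class: 10 samples
    [3, 0, 2],     # minority class: 5 samples
]

metrics = per_class_metrics(matrix)
macro_p, macro_r, macro_f1 = macro_average(metrics)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
# accuracy is about 0.835, while macro F1 is about 0.581
```

Because the macro F1 is the mean of per-class F1 scores, it generally differs from the F1 computed from the averaged precision and averaged recall, so the three reported averages need not satisfy the F1 formula among themselves.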

Point #4:

(1) The rationale and validity of the classification into 23 different architectural styles.

Response: Thanks for your comments.

Architecture in most cities in China can be classified into three categories: traditional Chinese style, Western style, and modern style. The proposed classification system considers various historical periods, regions, and national architectural styles within these three categories, making it highly versatile. Chinese and Western-style buildings exemplify the historical evolution and cultural heritage of architecture. This classification system's effectiveness lies in its ability to showcase the diversity and uniqueness of urban architecture by presenting various architectural styles. For instance, the Western-style architecture in Wuhan is predominantly concentrated along the river, symbolizing the city's colonial history. Table 1 provides a comprehensive understanding of the distinguishing features and characteristics of different architectural styles, which serves as a foundation for subsequent in-depth research.

(2) The cost of collecting images through web crawling and copyright issues.

Response: Thanks for your comments.

We sincerely apologize for the issues that arose during the translation process of our article. The image data used for research in this paper was not obtained through web crawling. Instead, we utilized the APIs provided by Baidu Map to access street view images, as this approach is open and free of charge. Therefore, the methodology proposed in this paper does not involve any cost or copyright-related concerns. We have already completed the necessary revisions in the relevant sections of the article.

(3) The conditions of the buildings to be labeled with building forms using Labelimg. That is, the size of the extracted building in the image is also presumed to affect the accuracy of manual labeling.

Response: Thanks for your comments.

We understand the reviewer's concern. The size of the extracted building does affect the accuracy of the labeling to some extent. The street view imagery we acquired was preprocessed so that each SVI has a resolution of 2048×1024 pixels. To ensure the accuracy and validity of our annotations, we keep the dimensions of the candidate boxes in the range of 200x100 to 1000x500.
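
A size constraint of this kind can be expressed as a simple filter over annotation boxes. The sketch below is one possible reading, not the authors' procedure: it assumes the limits are width x height in pixels, applied independently to each dimension, and that boxes are stored as corner coordinates. The `Box` type and `is_valid_box` are illustrative names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical storage format)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


IMAGE_W, IMAGE_H = 2048, 1024   # SVI resolution stated in the response
MIN_W, MIN_H = 200, 100         # smallest accepted box (assumed width x height)
MAX_W, MAX_H = 1000, 500        # largest accepted box (assumed width x height)


def is_valid_box(box: Box) -> bool:
    """Accept a box only if it lies inside the image and within the size limits."""
    width = box.x_max - box.x_min
    height = box.y_max - box.y_min
    inside = (0 <= box.x_min < box.x_max <= IMAGE_W
              and 0 <= box.y_min < box.y_max <= IMAGE_H)
    return inside and MIN_W <= width <= MAX_W and MIN_H <= height <= MAX_H


candidates = [
    Box(100, 100, 500, 400),     # 400 x 300: accepted
    Box(0, 0, 150, 90),          # too small
    Box(0, 0, 1200, 500),        # too wide
    Box(1900, 100, 2100, 300),   # extends past the right edge
]
kept = [b for b in candidates if is_valid_box(b)]
print(kept)
```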

(4) The validity of Table 2, which verifies the accuracy of mapping with only 150 artificially extracted buildings.

Response: Thanks for your comments.

Table 2 originally tested the accuracy of only 150 extracted buildings, which makes it difficult to demonstrate the validity of the table. In the updated manuscript, we expanded the number of building inspections, checked the mapping of 500 buildings, and recorded the information in Table 2.

(5) What does "the field distance between the two SVI pairs is 50-150m" on p.16 mean? What is the difference from "adjacent two images"?

Response: Thanks for your comments.

We revised it as follows:

Each SVI has a resolution of 2048×1024 pixels, and the distance between two adjacent image locations is 8-20 meters.
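
A stated spacing such as 8-20 meters between adjacent image locations can be verified from the sampling coordinates. The sketch below uses the haversine great-circle distance on a spherical Earth. It assumes the image locations are available as an ordered list of (latitude, longitude) pairs in degrees; the coordinates shown are hypothetical points, and the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def spacing_violations(points, lo=8.0, hi=20.0):
    """Return (index, distance) for consecutive pairs outside [lo, hi] metres."""
    violations = []
    for i in range(len(points) - 1):
        d = haversine_m(*points[i], *points[i + 1])
        if not lo <= d <= hi:
            violations.append((i, d))
    return violations


# Hypothetical sampling points along a north-south street (illustration only).
points = [
    (30.5928, 114.3055),
    (30.5929, 114.3055),
    (30.5930, 114.3055),
    (30.5935, 114.3055),   # a gap larger than the stated range
]
violations = spacing_violations(points)
print(violations)
```

At the scale of tens of meters the spherical approximation is adequate for a sanity check; a projected coordinate system would be the more usual choice for precise work.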

(6) What is the data source of the buildings to be mapped? What is the accuracy of the Baidu Map location information? These information are expected to have a significant impact on the accuracy of the matching.

Response: Thanks for your comments.

We understand the reviewer's concern. The building data and SVIs used in the study were all obtained from Baidu Map, which is one of the top three map providers in the world. The building data and SVIs obtained from Baidu Map are processed in the same spatial coordinate system (WGS1984) to ensure the accuracy of the data.

(7) Impact of the bias in the composition ratio of the annotated buildings, as shown in Fig. 7. Are the authors training the model after downsampling?

Response: Thanks for your comments.

We did not train the model after downsampling. Achieving a balanced number of building annotations for each type would significantly enhance the model training process. We have diligently annotated each type of architectural style in Wuhan and presented the counts in Figure 7. In reality, however, the distribution of architectural styles in Wuhan is not uniform. To mitigate the impact of imbalanced image counts, we opted to horizontally flip the images of categories with a limited number of samples, thereby augmenting the data.
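
Horizontal flipping for object detection has to mirror the bounding boxes together with the pixels. The sketch below shows that step in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: images are represented as lists of pixel rows, boxes as (x_min, y_min, x_max, y_max) pixel-edge coordinates, and the rule "flip every sample whose style has fewer labels than a threshold" is a hypothetical selection criterion.

```python
def flip_image(pixels):
    """Mirror an image stored as a list of rows (each row a list of pixels)."""
    return [list(reversed(row)) for row in pixels]


def flip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box within an image of given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def augment_minority(samples, counts, threshold):
    """Append a flipped copy of each sample whose style has fewer than
    `threshold` labels (hypothetical selection rule)."""
    augmented = list(samples)
    for pixels, box, style in samples:
        if counts[style] < threshold:
            width = len(pixels[0])
            augmented.append((flip_image(pixels), flip_box(box, width), style))
    return augmented


# Tiny hypothetical example: 2 x 4 "images" with one box each.
samples = [
    ([[1, 2, 3, 4], [5, 6, 7, 8]], (0, 0, 1, 2), "Gothic"),
    ([[9, 9, 9, 9], [9, 9, 9, 9]], (1, 0, 3, 2), "Functionalism"),
]
counts = {"Gothic": 120, "Functionalism": 20000}   # invented label counts
augmented = augment_minority(samples, counts, threshold=1000)
print(len(augmented), augmented[-1])
```

Flipping doubles the minority samples at most, so it reduces but does not remove a large imbalance between styles.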

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report (New Reviewer)

Sorry for the delay in responding.

Previously, I determined that this paper should be rejected.

However, since I have confirmed that the author has properly addressed the points, I do not see any problem with accepting this paper.

Again, I apologize for any inconvenience caused. 

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

This is an interesting and innovative study that develops a machine learning-based method to identify the architectural style of the cities based on the use of street view maps data. After carefully reading the manuscript, I think the topic of this paper is meaningful. I have a few suggestions for improving the manuscript.

1. In Section 1.2, after presenting the related literature about the deep learning of architectural style recognition, the authors claim that “the research on combining deep learning technology and SVI to identify urban architectural styles and construct architectural style datasets automatically is still lacking compared with other fields”. The SVI dataset contains a large volume of photos. If the existing models can deal with one photo well, they should deal with many photos equally well, only needing more time. The reader would wonder why the existing models cannot be applied to the SVI. What are the differences between this work and the other related works mentioned in this Section? As such, I suggest: (1) Expound the research question and your innovative idea clearly. (2) Don’t simply list the literature. It needs to be sorted into categories according to idea or method principle. Then you can identify the difference between your idea and theirs.

2. The SVI collection points in Figure 1 are not easy to see. They should be represented with a more conspicuous map symbol.

3. The sequencing words are confusing. There is a “Sencond” in Line 185. But where is the “First” word? And more, the “First” word in Line 189 should be “Firstly”.

4. The architectures are classified into 23 categories. (1) the key identifying characteristics of each category should be expounded. This would help to understand how to label 23 architectural styles manually. (2) the sample pictures of each style in Table 1 should be extracted from SVI.

5. In Section 2.2.2 Line 282, the authors mention that the anchor box was added to the model, which is a new concept. The idea and function of this anchor box should be explained.

6. Lines 313-316 list the name of 23 styles. They are not necessary because Table 1 includes this information.

7. Figure 5 shows that a geodetic coordinate system was used to locate the building in photos. Does it have any positioning errors? If so, please evaluate them in the results. Furthermore, the title of Fig. 5 is too long. Some information can be moved to the main text.

8. In Section 2.3, the author mentions that “The dataset should attempt to satisfy the four conditions of sufficient light, weak distortion, complete content, and good resolution performance of the photo to improve the recognition accuracy”. Does it mean the model was built with good-quality SVIs? How was the data augmentation performed? What is the generalization performance of the model?

9. Some problems with the maps: Figure 13 lacks legend, scale, and north arrow. Figure 15 is not clear.

10. A Discussion Section should be added to this paper. Discuss the model sensitivity and the contributions to the field, and compare them with related work.


Reviewer 2 Report

The manuscript is well organized. 

Figure 10 at the bottom right, windows are mistakenly detected as buildings. It'll be good to acknowledge this.

Line 447-447, why use these hyperparameters?

My biggest concern is Fig 9, where no convergence is observed at 50,000 iterations.

Additionally, the English writing can use some improvements. 

Also some related papers are missing in the literature review, e.g., 

Detecting and geolocating city-scale soft-story buildings by deep machine learning for urban seismic resilience;

Rapid visual screening of soft-story buildings from street view images using deep learning classification;

Machine learning-based regional scale intelligent modeling of building information for natural hazard risk management;

Visual Perception of Building and Household Vulnerability from Streets;

Instance segmentation of soft‐story buildings from street‐view images with semiautomatic annotation;

AdaLN: A Vision Transformer for Multidomain Learning and Predisaster Building Information Extraction from Images;

 

Reviewer 3 Report

The submitted manuscript uses street view images and deep learning to extract architectural styles for a city in China.

This approach, i.e., using street view images and deep learning to extract aspects of the urban landscape (green visibility, sky coverage, building function, facade colour), has been studied countless times in recent years. Of course, the deep learning was additionally trained on the authors' original dataset. This method is already approaching the practical application stage rather than the academic research stage.

A review of the submitted manuscript from this perspective shows a very similar approach, although the authors have a different target: architectural styles. What did the authors add or improve on their existing knowledge?

In addition, when extracting architectural styles from street view images, the space between the camera viewpoint and the building contains elements such as street trees, cars, people, and anti-crossing fences that are not necessary for the computer to understand the architectural style. It is unclear how the authors addressed these unnecessary elements (techniques have been developed to remove them automatically). 

Finally, there is a previously published paper by the authors with a similar title, but this paper is not cited as a reference, and furthermore, the differences from this previously published paper are not explained.

Haozun SUN, Hong XU and Quanfeng WEI
The Classification Method of Urban Architectural Styles Based on Deep Learning and Street View Imagery
Hydraulic and Civil Engineering Technology VII
M. Yang et al. (Eds.)
© 2022 The authors and IOS Press.
doi:10.3233/ATDE220940

Based on the above, the reviewer could not confirm the novelty of this research as an academic study.
