Improved Mask R-CNN for Rural Building Roof Type Recognition from UAV High-Resolution Images: A Case Study in Hunan Province, China

Abstract: Accurate roof information of buildings can be obtained from UAV high-resolution images. The large-scale accurate recognition of roof types (such as gabled, flat, hipped, complex and mono-pitched roofs) of rural buildings is crucial for rural planning and construction. At present, most UAV high-resolution optical images only have red, green and blue (RGB) band information, which aggravates the problems of inter-class similarity and intra-class variability of image features. Furthermore, the roofs of rural buildings are complex, spatially scattered and easily covered by vegetation, which leads to the low accuracy of roof type identification by existing methods. In response to these problems, this paper proposes a method for identifying the roof types of complex rural buildings based on visible high-resolution remote sensing images from UAVs. First, the fusion of deep learning networks with different visual features is investigated to analyze how different combinations of the visible difference vegetation index (VDVI), Sobel edge detection features and UAV visible images affect model recognition of rural building roof types. Second, an improved Mask R-CNN model is proposed that learns more complex features of different building roof types by using the ResNet152 feature extraction network with transfer learning. After obtaining roof type recognition results in two test areas, we evaluated their accuracy using the confusion matrix and reached the following conclusions: (1) the model using RGB images combined with Sobel edge detection features has the highest accuracy and recognizes the roof types of morphologically different rural buildings more completely and more accurately; its Kappa coefficient (KC) is on average 0.115 higher than that of the model using RGB images alone; (2) compared with the original Mask R-CNN, U-Net, DeeplabV3 and PSPNet deep learning models, the improved Mask R-CNN model has the highest accuracy in recognizing the roof types of rural buildings, with F1-score, KC and OA averaging 0.777, 0.821 and 0.905, respectively. The method can obtain clear and accurate profiles and types of rural building roofs, and can be extended to green roof suitability evaluation, rooftop solar potential assessment and other building roof surveys, management and planning.


Introduction
The accurate identification of rural building roof types is significant in natural resource surveys [1], beautiful countryside planning and construction [2], detection of illegal roofs [3], assessment of rooftop solar photovoltaic power generation potential [4,5] and disaster emergency management (e.g., detection of damaged rooftop areas after earthquakes and landslides) [6]. Compared with urban buildings, rural buildings have their own unique characteristics, which are mainly reflected in the following: firstly, the lack of unified planning and management leads to a chaotic building layout; secondly, the design of houses is mostly based on the construction experience of rural artisans, which makes the roof types of rural buildings complex and diverse [7]. However, the current building identification research mainly focuses on the extraction of large scale urban buildings [8], while less attention is paid to the more difficult and complex identification of multiple roof types of rural buildings. Therefore, there is an urgent need to develop methods for fine investigation and identification of rural roof types on a large scale.
Traditional building survey methods often require considerable manpower and material resources for field mapping and surveying, which entails a heavy workload and high cost, especially in rural areas [9]. With the launch of high-resolution remote sensing satellites (such as Worldview-2 and GF-2), more and more scholars and mapping departments use high spatial resolution remote sensing images to extract building information [10,11]. With a spatial resolution reaching the sub-meter level, these images present buildings with richer detail, such as internal structure, geometric contours and texture patterns, and the differences in geometric dimensions and texture features are the fundamental basis for identifying different categories of building roofs. However, due to the presence of a large number of shadows, features with spectral characteristics similar to buildings (such as roads) and mixed pixels within classes, traditional remote sensing image classification methods based on pixel features cannot recognize different roof types effectively, correctly and completely [12]. Because the pixel size of high spatial resolution remote sensing images is closer to that of natural scenes photographed on the ground, and thus more in line with human visual perception than low and medium resolution images [13], many scholars have used machine learning methods to extract different roof types, such as object-oriented classification, support vector machine (SVM) classification and random forest (RF) classification [14,15]. 
However, the performance of object-oriented methods depends mainly on the segmentation results of images, and the results of classifier methods such as SVM and RF often depend on the selection of a large set of valid samples, whose shallow structure makes the deeper information about buildings unavailable and not generalizable across images of different regions, which makes machine learning classification methods face great challenges in terms of reliability and generalizability in accurately identifying the roof types of buildings [16,17]. Although some studies have also combined LiDAR point cloud data and satellite image data using SVM and RF models to identify multiple roof types (e.g., flat, gabled, hipped, pyramidal and skillion roof types) to improve the accuracy of roof category identification by machine learning models [18], the high cost of acquiring LiDAR point cloud data prevents the effective achievement of the accurate identification of roof types on a large scale.
Similarly, large-scale high-resolution satellite remote sensing images have many shortcomings in roof type identification, for example, the long revisit period makes the time interval of different simultaneous data acquisition and processing longer, which makes the real-time update of roof database and disaster emergency monitoring impossible to be guaranteed [19]. In addition, compared with urban areas, rural areas are more likely to produce more cloudy weather, which makes the quality of satellite imaging lower and thus limits the accurate identification of roof types of complex buildings such as small areas in rural areas [20]. Low-altitude remote sensing technology, represented by UAV technology, can overcome the above shortcomings due to its advantages of high flexibility, high timeliness, low cost and not being restricted by geographic environment conditions, and it can provide centimeter-level ultra-high-resolution remote sensing images, which makes the spatial structure, surface texture features and edge feature information of the features on the images more clear [21]. However, the UAV remote sensing images with significantly larger image data have higher requirements for classification methods than those of general high-resolution images [22].
With the rapid development of big data and high-performance computers [23], the application of deep learning technology in the field of automatic image recognition is expanding. Deep learning models, represented by Convolutional Neural Network (CNN), can automatically learn more complex abstract high-dimensional features from the low-level features of the input image [24], which obviously brings great advantages for acquiring complex spectral, geometric and texture features in ultra-high-resolution remote sensing images [25]. Based on this, many researchers are now using semantic segmentation frameworks (e.g., VGG-F [26], U-Net [27], SegNet [28,29], etc.) to identify building roof types in ultra-high-resolution remote sensing images. However, semantic segmentation suffers from the problems of difficulty in distinguishing different objects of the same range and easily connecting different building roof types at the edges [30], which is not conducive to the application and research of complex rural roof type recognition. In contrast to image semantic segmentation, instance segmentation can identify multiple objects of the same broad category as different individual entities and assign a pixel-level semantic category to each entity on this basis [31], which is ideal for the recognition of complex rural building roof types. Among the existing instance segmentation methods, Mask R-CNN has been proved to be a powerful and adaptable deep learning model in different domains [32] and consists of a combination of target detection and semantic segmentation techniques to segment objects into prediction frames by predicting the bounding boxes of target objects and finally outputting high precision vector segmentation results [33]. A large number of scholars have applied it to the recognition of buildings [34], for example, Stiller et al. 
[35] used fine-tuned Mask R-CNN to extract large-scale buildings in urban areas of Chile, while for more complex recognition of different building roof types Mask R-CNN is less applied at present. Most of the above studies, however, have focused on building extraction in urban areas, and less attention has been paid to the more difficult problem of identifying complex rural roof types. At the same time, the selection of deep learning feature extraction model also has a significant impact on the recognition accuracy of complex roof types [36]. As an important feature extraction structure of Mask R-CNN, the deep residual network [37] enables the model to extract more complex image features without decreasing the accuracy by increasing the number of residual convolution layers, and the traditional Mask R-CNN uses ResNet50 or ResNet101 as the feature extraction layer [38]. Whereas, for complex building roof type recognition in UAV remote sensing images with large data volume, the above deep residual networks are not able to extract very complex building roof type features at a deeper level [39], and ResNet152 [40], which is currently one of the best performers in classification, can solve the above problem and can be deployed in Mask R-CNN by migration learning.
Due to payload limitations [41], most UAV ultra-high-resolution remote sensing images only have red, green and blue (RGB) bands, making the inter-class similarity and intra-class variability of different features in the images more obvious [42]. Deep learning, while already the best method available in terms of automation and accuracy, has limitations in the recognition of low reflectance, features with similar spectral characteristics to buildings and complex building roof types, such as similar roof lawns and grasses, similar concrete roofs and floors, and dark gaps between different roof types [43]. It has been shown that adding more visual features to deep learning models can better address these problems [44]. Boonpook et al. [45] combined RGB data from UAV remote sensing imagery with the visible band vegetation index (VDVI) and digital surface model (DSM) to extract complex buildings using the SegNet deep learning method. The results show that the RGB combination with VDVI features can improve the separability of building areas from vegetation, the RGB combination with DSM features helps to separate buildings from ground objects, and the RGB combination with both features can identify small buildings that are low and obscured by vegetation, and the extraction results of each feature combination are higher than those of RGB only. However, the DSM data only contains Remote Sens. 2022, 14, 265 4 of 25 the height information of the ground objects and cannot distinguish the internal structure (e.g., surface texture features, etc.) of more complex different roof types. Zhang et al. [46] extracted buildings from high-resolution remote sensing images by fusing the Sobel edge detection algorithm and Mask R-CNN algorithm, and the results showed that using Sobel edge detection algorithm to segment building boundaries solved the problems of boundary texture extraction and object internal integrity in deep learning. 
The above studies show that adding VDVI feature bands can effectively distinguish buildings from green areas in UAV visible images and improve the inter-class similarity problem. What is more, adding Sobel edge detection features can clearly show the gradient, texture and boundary features of building roof surfaces, enhance the distinguishability of different building roof types and thus improve the intra-class variability problem.
In summary, to address the problems of extracting complex roof type features and easily confusing the building roofs of low reflectance with vegetation, roads and other objects of similar spectral features in existing methods, this paper proposes the improved Mask R-CNN method for rural building roof type recognition from UAV visible high-resolution remote sensing imagery. The improved Mask R-CNN model based on different feature combinations can fully extract the more complex features of different building roof types, effectively improve the differentiation between buildings and vegetation, as well as different building roof types, accelerate the convergence speed of the model and achieve a large range of high-precision recognition of building roof types in UAV visible remote sensing images. The main sections of this paper are organized as follows: Section 2 introduces the study areas and pre-processing of the experimental UAV image dataset. Section 3 describes the main methodological process of this study, including the visual feature extraction method of UAV visible images and the improvement and implementation of Mask R-CNN model. Section 4 shows the results of the rural building roof type recognition of this model. Section 5 mainly discusses the influences of different feature combinations and the number of ResNet layers on the training results, as well as the future probable improvement of this study. Finally, a summary of the conclusions of this study is given in Section 6.

Study Area
To test the performance of the proposed model, we used a UAV to obtain ultra-high-resolution remote sensing images covering Luxi County, Xiangxi Prefecture, Hunan Province, China. The selected study areas all contain relatively dense rural buildings, as shown in Figure 1. Luxi County is located in the northwestern part of Hunan Province, with mountainous terrain, a well-developed water system, high annual precipitation and obvious climatic differences within the region. It is an agricultural area with a predominantly ethnic minority population. The roofs of rural buildings in the selected areas are mainly flat and sloped, but there are also numerous roof types that are hard to distinguish, which poses great challenges to the task of building roof type identification. Although open building datasets already exist (such as the WHU building dataset and the ISPRS Vaihingen dataset) that provide various patterns and styles of architectural landscapes [47], few ultra-high-resolution remote sensing building datasets have been proposed for rural areas in China. Building patterns in rural China differ greatly from those in urban areas, and also from urban and suburban buildings in western countries. Therefore, when studying supervised learning methods, the representative rural buildings in the test areas of this study can be used to improve the generalizability of the model in identifying the roof types of buildings of different styles. The total area of the seven areas used in this study is 62.34 km², of which the training area is 48.55 km² and the test area is 13.79 km².

Data Acquisition and Preprocessing
This study used a six-rotor UAV (KPM-28, Hunan Kunpeng Zhihui UAV Technology Co., Ltd., Changsha, China) to acquire ultra-high-resolution true color aerial imagery of Luxi County, Hunan Province, in May 2018. The UAV has a wheelbase of 1.6 m, a payload weight of 8 kg, a cruise speed of 8 m/s and an endurance of about 60 min, and it was equipped with a SHARE-101S tilt photography camera. The camera consists of five complementary metal oxide semiconductor (CMOS) sensors (23.5 mm × 15.6 mm) with an effective pixel count of 24.3 megapixels, a tilt angle of 45°, a storage capacity of 320 GB and lens focal lengths of 35 mm × 4 and 25 mm for mapping. The UAV orthophoto data acquisition and processing were carried out with Pix4D software and mainly include four steps: laying image control points, developing the flight plan, field UAV image acquisition and orthophoto generation. Luxi County is located in a mountainous area with complex terrain. To ensure the final image accuracy of the survey area, image control points must be laid evenly in advance in areas with relatively flat terrain and clear feature points. Since the remote sensing images acquired by the UAV in each flight cover only a small area, the survey area of Luxi County must be manually divided into several small areas before the flight. The flight was designed for a ground resolution of 20 cm, with a heading overlap of 60% and a side overlap of 40%, and the average flight height was within the range of 200-250 m. The output image is in visible RGB mode. The images acquired by different UAV sorties are stitched together to reduce the influence of weather and light on the images and to obtain image data for the whole county. After adding ground control points to the stitched images, an aerial triangulation leveling quality report is generated to confirm that production accuracy requirements are met [48]. 
Finally, after setting the CGCS2000 coordinate system, the UAV orthophoto can be generated. Zhang et al. [49] classified roofs into six categories (flat, gable, gambrel, half hip, hip and pyramid) based on roof edges. In this study, the types of roof samples in orthophotos obtained by UAVs were classified into five types: gabled, flat, hipped, complex and mono-pitched, based on the roof survey standards of local mapping departments and the surface texture and shape characteristics of roofs and the overall morphology of buildings in existing data sets. The typical roof types used in this paper are shown in Figure 2. Table 1 shows the number and percentage of training and test data for each type of roof.

Methods
This paper proposes an improved Mask R-CNN based on different visual feature combinations for recognizing rural building roof types in UAV visible high-resolution remote sensing images. The overall workflow is shown in Figure 3 and mainly includes two steps: (1) the VDVI spectral feature and the Sobel edge detection spatial feature of the UAV visible remote sensing images are calculated, and the two visual features are combined with the RGB images into different feature combinations that serve as the input datasets of the deep learning model; (2) the Mask R-CNN model based on ResNet transfer learning is trained on the sample datasets with different feature combinations, and the rural building roof types in the T1 and T2 test areas are then identified and the accuracy of the results is evaluated.
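The accuracy evaluation in step (2) is reported elsewhere in the paper as overall accuracy (OA), Kappa coefficient (KC) and F1-score derived from a confusion matrix. The following is a minimal sketch of how these three quantities are commonly computed. The function name `accuracy_metrics`, the row/column orientation and the per-class F1 output are assumptions of this sketch; the paper does not state how its F1-scores are averaged.

```python
import numpy as np

def accuracy_metrics(cm):
    """Return OA, Cohen's kappa and per-class F1 from a confusion matrix.

    Assumption: rows are reference classes and columns are predicted classes.
    Every class is assumed to occur at least once in both rows and columns,
    otherwise precision or recall would divide by zero.
    """
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / total                                   # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    precision = diag / cm.sum(axis=0)
    recall = diag / cm.sum(axis=1)
    f1 = 2.0 * precision * recall / (precision + recall)
    return oa, kappa, f1

# Hypothetical two-class example, not data from the study.
oa, kappa, f1 = accuracy_metrics([[40, 10], [5, 45]])
```

Kappa discounts the agreement expected by chance, which is why it is usually lower than OA for the same confusion matrix.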

UAV Optical Image Visual Feature Extraction Methods
The restriction to visible RGB bands, together with the high spatial resolution of the UAV remote sensing images, significantly increases the inter-class similarity and intra-class variability of the features in the images. Inter-class similarity intensifies the confusion between different classes of features, while intra-class variability makes the subclasses of the same kind of feature present more complex image characteristics, which makes it difficult for the model to extract subclass features [50]. Therefore, the key to identifying rural building roof types from UAV visible high-resolution remote sensing images is to select visual features that, when added to the RGB bands, enhance the difference between building roofs and other features and highlight roof characteristics so that different roof types are easier to identify. The UAV remote sensing image data used in this paper cover a rural area with dense vegetation, where many buildings are obscured by dark vegetation and the vegetation itself can easily be misclassified as buildings. Removing the influence of vegetation is therefore very important for building identification. The existing vegetation indices for visible UAV remote sensing images mainly include NGRDI [51], VDVI [52] and EXG [53]. Among these, VDVI has proved to be the most effective in extracting green vegetation and can effectively distinguish vegetation from other features [54]. This study mainly classifies the types of rural building roofs based on the texture features of the roof surface. To make it easier for the model to extract these texture features, image enhancement methods can be used, which improve the visual effect of the image and highlight its texture or boundary information. These methods can usually be divided into two categories: spatial domain enhancement and transform domain enhancement [55]. The Sobel edge detection algorithm [56], a commonly used spatial domain enhancement algorithm, performs well on images with grayscale gradients and considerable noise, and it is sensitive to edges in both the horizontal and vertical directions, which reduces the blurring of image edges. It has been shown to be a more effective image edge enhancement algorithm than edge detection algorithms such as Canny [57] and Laplacian [58]. In this paper, two visual features are introduced, VDVI as a spectral visual feature and Sobel edge detection as a spatial visual feature, and their applicability is compared across different feature combinations with the UAV visible band images.
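As an illustration of what a "feature combination" can look like in practice, the sketch below appends single-band feature images to an RGB array as extra channels. This is only one plausible way to compose the inputs. The helper `stack_features`, the placeholder band values and the four combinations listed are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def stack_features(rgb, extra_bands=()):
    """Append single-band feature images to an (H, W, 3) RGB image as channels."""
    layers = [rgb]
    for band in extra_bands:
        if band.shape != rgb.shape[:2]:
            raise ValueError("feature band must match the RGB image size")
        layers.append(band[..., np.newaxis])
    return np.concatenate(layers, axis=-1)

# Placeholder 4 x 4 tiles standing in for real orthophoto data (hypothetical values).
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
vdvi_band = np.full((4, 4), 128, dtype=np.uint8)   # VDVI rescaled to 8 bits
sobel_band = np.full((4, 4), 30, dtype=np.uint8)   # Sobel edge strength

combos = {
    "RGB": stack_features(rgb),
    "RGB+VDVI": stack_features(rgb, [vdvi_band]),
    "RGB+Sobel": stack_features(rgb, [sobel_band]),
    "RGB+VDVI+Sobel": stack_features(rgb, [vdvi_band, sobel_band]),
}
```

A network that consumes such arrays needs its first convolution adapted to the channel count (3, 4 or 5), which is a design decision outside the scope of this sketch.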

Calculation of Visible Difference Vegetation Index (VDVI)
To improve the separability of rural building roofs from vegetation and avoid misclassification caused by the similarity between rural building roofs (e.g., roofs with green vegetation or roofs shaded by dark vegetation) and ground vegetation [20], this paper introduces a vegetation index as an additional spectral visual feature. Xu et al. introduced NDVI as a feature band to extract buildings from ultra-high-resolution color infrared remote sensing images [59], and the results showed that adding NDVI could further distinguish buildings from green areas. However, vegetation indices such as NDVI require multispectral information and cannot be applied to visible UAV high-resolution remote sensing images. This paper therefore uses the visible difference vegetation index (VDVI), a modification of NDVI that extracts vegetation information from the visible RGB bands alone. Ma et al. [60] demonstrated that the use of VDVI as an additional spectral feature for visible UAV remote sensing imagery can effectively reduce the interference of vegetation on building roof information extraction. VDVI can be calculated according to Equation (1), and its results range from −1 to 1.

$$\mathrm{VDVI} = \frac{2\rho_{green} - \rho_{red} - \rho_{blue}}{2\rho_{green} + \rho_{red} + \rho_{blue}} \tag{1}$$

where $\rho_{green}$, $\rho_{red}$ and $\rho_{blue}$ denote the values of the visible green, red and blue bands of the UAV orthophoto, respectively. In this study, the VDVI of the UAV orthophoto was calculated using the band operation of ENVI 5.3, and the grayscale image obtained after the index calculation was input into the training data as an additional visual feature together with the RGB image.
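For readers without ENVI, the same index can be computed with a few lines of NumPy. This is a minimal sketch of the standard VDVI definition, (2G − R − B) / (2G + R + B). The function names, the handling of all-black pixels and the linear rescaling to 8 bits are assumptions of this sketch, not the authors' processing chain.

```python
import numpy as np

def vdvi(rgb):
    """Visible difference vegetation index for an (H, W, 3) array in R, G, B order."""
    rgb = rgb.astype(np.float64)            # avoid uint8 overflow in the sums
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = 2.0 * g - r - b
    den = 2.0 * g + r + b
    # All-black pixels give 0/0; this sketch defines the index as 0 there.
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

def vdvi_to_uint8(index):
    """Linearly map VDVI from [-1, 1] to an 8-bit grayscale band (assumed scaling)."""
    return np.round((index + 1.0) * 127.5).astype(np.uint8)

# One row of test pixels: pure green, pure red, neutral gray, black.
pixels = np.array([[[0, 255, 0], [255, 0, 0], [100, 100, 100], [0, 0, 0]]],
                  dtype=np.uint8)
index = vdvi(pixels)
```

Pure green maps to the upper bound of the index and pure red to the lower bound, while neutral gray gives zero, which matches the stated range of −1 to 1.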

Calculation of Sobel Edge Detection Features
The Sobel edge detection algorithm uses a discrete differential operator to approximate the gradient of the image grayscale; the larger the gradient, the more likely the pixel lies on an edge [61]. Deep learning networks can learn the complex characteristics of features in high-resolution remote sensing images, but they still have deficiencies in extracting building roof type features from very complex UAV ultra-high-resolution remote sensing images, mainly in capturing the differences between roof types at image edges and in preserving the integrity of the target [62]. To solve these problems, we combine the image features calculated by Sobel edge detection, as additional spatial visual features, with the UAV RGB images to improve the discriminability of different building roof types in UAV visible images. The Sobel operator can smooth out the building boundaries in the filtered images, making the surface texture and shape features of different building roof types more prominent while reducing the interference of background noise. The Sobel algorithm consists of two 3 × 3 matrices that are convolved with the image along the x-axis (from left to right) and the y-axis (from top to bottom), respectively, to obtain approximations of the horizontal and vertical luminance differences. If $f(x, y)$ is the gray value at the point $(x, y)$ of the image, $S_x$ and $S_y$ represent the gray values of the horizontal and vertical edge detection results, respectively. With the Sobel kernels in their standard form, they are given by

$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * f(x, y), \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * f(x, y)$$

The approximate gradient $M$ and the gradient direction $\theta$ of the grayscale at each pixel of the image are calculated by combining the horizontal and vertical values at that point and applying the square root:

$$M = \sqrt{S_x^2 + S_y^2}, \qquad \theta = \arctan\left(\frac{S_y}{S_x}\right)$$
If the approximate gradient M is greater than a certain threshold, the point (x, y) is considered an edge point. The Sobel edge detection feature of the UAV image in this study was calculated in ENVI 5.3. After several experiments and comparisons, the image enhancement parameter was set to linear 0-255 and the filter parameter was set to a sharpening degree of 18, which displays the image boundaries more clearly.
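The gradient and threshold test described above can be sketched in NumPy as follows. This is an illustrative implementation of the standard Sobel operator, not a reproduction of the ENVI 5.3 filter used in the study. The kernel sign convention, the edge-replicated border, the helper names and the threshold value are all assumptions of this sketch.

```python
import numpy as np

# Standard Sobel kernels (sign convention assumed; it does not affect the magnitude).
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float64)

def filter3x3(gray, kernel):
    """Slide a 3x3 kernel over the image (cross-correlation, replicated border)."""
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_edges(gray, threshold):
    """Return gradient magnitude M, direction theta and the thresholded edge mask."""
    sx = filter3x3(gray, KX)
    sy = filter3x3(gray, KY)
    magnitude = np.sqrt(sx**2 + sy**2)
    direction = np.arctan2(sy, sx)   # quadrant-aware form of arctan(Sy / Sx)
    return magnitude, direction, magnitude > threshold

# Synthetic grayscale tile with a vertical step edge (hypothetical data).
tile = np.zeros((5, 6))
tile[:, 3:] = 10.0
magnitude, direction, edges = sobel_edges(tile, threshold=20.0)
```

On this tile only the two columns adjacent to the step exceed the threshold, and the gradient direction there is horizontal, as expected for a vertical edge.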

Improved Mask R-CNN for UAV Image Roof Type Recognition
This section describes the specific process of applying deep learning algorithms and theories, including an overview of the network architecture of Mask R-CNN based on ResNet152, the production and processing of sample sets, model implementation and training.

Transfer Learning Deployment of ResNet152
When deep learning is used for building roof type recognition, the number of network layers of the neural network is crucial for the extraction of roof type features, which can extract higher-level abstract features such as texture, shape and color features when the pattern of the roof type features is not obvious. However, simply increasing the network depth can easily lead to gradient disappearance and network degradation problems, which make the image classification accuracy decrease rapidly after saturation [63]. These issues are addressed by ResNet [64], which can also reduce training errors while deepening the network by introducing identity mapping between layers. ResNet152 is one of the networks with deeper layers in the ResNet, which can effectively use the multilayer information of the network even though the number of layers is deeper, due to its lower complexity and better ability to extract features. Therefore, this study uses ResNet152 as the base network for complex feature extraction of roof types of buildings in the countryside of UAV images. Figure 4 shows the ResNet152 network structure, where from the first group of convolutional blocks up to the fifth group are residual modules. After inputting the image with an image size of 224 × 224, the final group output size is reduced to 7 × 7 by learning the features extracted from the training residual network, then the trained image is input to the average pooling layer to take the average, and finally, the softmax function of the fully connected layer is used to classify the image categories. Transfer learning is a very effective method proposed to solve the problem of overfitting in the training process of neural network learning for small data volumes. It improves the efficiency and accuracy of small data classification problems by saving the feature parameters pre-trained in large datasets (such as ImageNet, etc.) 
and then applying them to the new target classification task to be solved, through the portability of feature model weights between different classification datasets. The two main common migration learning methods are feature migration and model migration [65]. In this study, we use model migration to migrate the ResNet152 pre-trained model, which is fully trained in the ImageNet dataset, to the feature extraction layer of Mask R-CNN, and then re-initialize the parameters of the last layer of the ResNet152 pre-trained model, while the other layers directly use the weight parameters of the pre-trained network and freeze them, and then use the rural building roof as a landmark. The model is then fine-tuned using the rural building roof type dataset to achieve optimal training of the building roof type recognition model.
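As a quick check of the 224 × 224 to 7 × 7 reduction described above, the following sketch traces the feature-map side length through the five groups of a ResNet. It assumes the standard ResNet layer settings (a 7 × 7 stride-2 stem convolution, a 3 × 3 stride-2 max pooling layer, and a stride-2 step at the start of the third to fifth groups, modelled here as a 3 × 3 convolution with padding 1); these settings are not spelled out in this paper, and the function names are illustrative only.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_feature_sizes(input_size=224):
    """Trace the feature-map side length through the five ResNet groups."""
    sizes = {}
    s = conv_out(input_size, kernel=7, stride=2, padding=3)  # stem convolution
    sizes["conv1"] = s
    s = conv_out(s, kernel=3, stride=2, padding=1)           # max pooling
    sizes["conv2_x"] = s                                     # this group keeps the resolution
    for name in ("conv3_x", "conv4_x", "conv5_x"):
        # The first block of each later group halves the resolution.
        s = conv_out(s, kernel=3, stride=2, padding=1)
        sizes[name] = s
    return sizes


sizes = resnet_feature_sizes(224)
print(sizes)
# {'conv1': 112, 'conv2_x': 56, 'conv3_x': 28, 'conv4_x': 14, 'conv5_x': 7}
```

Each stride-2 step halves the side length, so five halvings take 224 down to 7, which matches the final group output size quoted in the text.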

Construction of Improved Mask R-CNN Model
Mask R-CNN is a widely used and efficient multi-task instance segmentation framework that integrates target detection and semantic segmentation and builds on R-CNN [66], Fast R-CNN [67] and Faster R-CNN [68]. Mask R-CNN adds to Faster R-CNN a branch that uses a fully convolutional network (FCN) to predict the segmentation mask, in parallel with the original bounding box and classification branches, so that it can accurately detect the class and location of targets in the image. In addition, Mask R-CNN uses region of interest (RoI) Align to correct the spatial misalignment caused by the RoI pooling layer: by introducing bilinear interpolation, each RoI is better aligned with the pixel locations of the original input image, which achieves accurate pixel-level target segmentation. The network structure of Mask R-CNN used in this paper is shown in Figure 5, and the steps of building roof type recognition based on the improved Mask R-CNN are as follows:
(1) Input a pre-processed UAV remote sensing image of a specific size into the pre-trained ResNet152 network to obtain the corresponding feature maps.
(2) Assign a fixed number of RoIs to each point on the feature map, resulting in multiple candidate RoIs.
(3) Pass these candidate RoIs to the region proposal network (RPN) for binary classification (foreground and background) and for fine-tuning of the location and size of the bounding box, to obtain a more accurate bounding box that fits the target better. At the same time, filter out some of the candidate RoIs by using non-maximum suppression.
(4) Run the RoI Align operation on the remaining RoIs, that is, first establish the correspondence between the pixels of the feature map and the original image, and then map the feature map to fixed-size features.
(5) Finally, these RoIs are subjected to multi-category classification, bounding box regression, and mask generation by the FCN in the sub-network.
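The filtering in step (3) can be illustrated with a minimal greedy non-maximum suppression sketch. This is not the implementation used in the paper: the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5 and the example boxes are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A candidate survives only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidate RoIs: two near-duplicates on one roof, one on another roof.
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.85, so it is suppressed, while the third box does not overlap either and is kept.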

Model Implementation and Training
(1) Software and hardware environment configuration: The computer used in this experiment is equipped with a 3.7 GHz octa-core Intel Core i9-10900K CPU, an 11 GB NVIDIA GeForce GTX 2080 Super graphics card, 32 GB of memory and Windows 10 as the operating system. The neural network framework used in this paper is the PyTorch deep learning framework.
(2) Construction of a sample dataset of rural building roof types: We cropped the seven images, calculated the spectral and spatial visual features of the sample area images according to the method described in Section 3.1, and combined them with the original UAV visible band images to form the different feature combinations. We used ArcGIS Pro 2.8 to label samples of each representative roof type in these feature combinations by manual visual interpretation, and the labels were cross-checked by multiple people to ensure the accuracy of the sample types: the gabled type is labeled 1, the flat type 2, the hipped type 3, the complex type 4 and the mono-pitched type 5. The labeled images are converted to GeoTIFF format and used as the reference standard for the training sample data and for the accuracy verification of the deep learning model. Owing to hardware limitations in training the deep learning network, the images need to be split into small tiles. Based on the random strategy [69], 224 × 224 areas are randomly cropped from the manually labeled sample area as the input images for training the model.
(3) Model training: After several parameter selections, the improved Mask R-CNN deep learning network uses the average binary cross entropy as the loss function, which generates a mask for each class without inter-class competition. The weight decay coefficient is 0.0001, the momentum coefficient is 0.9, the activation function is sigmoid, the batch size is set to 8, the number of epochs is set to 20, the initial learning rate is 0.001, and the optimization method is stochastic gradient descent (SGD), which can accelerate the convergence of the network.
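To make the loss description concrete, the sketch below computes an average binary cross entropy over a per-class mask with a sigmoid activation, in the spirit of the Mask R-CNN mask branch: only the mask predicted for the ground-truth class contributes, so classes do not compete. The data layout (a dict from class label to a 2D list of logits) and the toy values are assumptions for illustration, not the paper's code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_bce_loss(logits, target, true_class, eps=1e-12):
    """Average per-pixel binary cross entropy on the mask of the true class.

    logits: dict mapping class label -> 2D list of raw mask scores
    target: 2D list of 0/1 ground-truth mask values
    """
    pred = logits[true_class]  # masks of all other classes are ignored
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for z, t in zip(row_p, row_t):
            p = sigmoid(z)
            total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
            count += 1
    return total / count


# Toy 2 x 2 example with two of the roof labels (1 = gabled, 2 = flat).
logits = {
    1: [[0.0, 0.0], [0.0, 0.0]],    # uninformative mask: every pixel at p = 0.5
    2: [[5.0, -5.0], [5.0, -5.0]],  # confident mask that matches the target
}
target = [[1, 0], [1, 0]]
loss_gabled = mask_bce_loss(logits, target, true_class=1)
loss_flat = mask_bce_loss(logits, target, true_class=2)
print(round(loss_gabled, 4), round(loss_flat, 4))
```

Because each class mask is scored independently with a sigmoid rather than jointly with a softmax, a confident prediction for one class does not lower the loss of another.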

Accuracy Evaluation Method
In this paper, the evaluation of the model includes two aspects: first, an accuracy evaluation of the improved Mask R-CNN classification results in terms of their agreement with the true values; second, a feature applicability evaluation to determine the impact of different visual feature combinations on the accuracy of roof type recognition for rural buildings in UAV images. According to the combination of the true category and the category assigned by the model, the results fall into four cases: true positive (TP), false negative (FN), false positive (FP) and true negative (TN). TP is the number of positive pixels correctly classified as positive; TN is the number of negative pixels correctly classified as negative; FP is the number of negative pixels incorrectly classified as positive; and FN is the number of positive pixels incorrectly classified as negative. These values can be calculated using the pixel-based confusion matrix [70]. Based on these values, we use five accuracy evaluation indices, namely Precision, Recall, F1-score, Overall Accuracy (OA) and Kappa coefficient (KC), to check the overall prediction performance of the algorithm for different roof types. Precision is the ratio of the number of correctly classified positive samples to the number of all samples classified as positive by the classifier. Recall is the ratio of the number of correctly classified positive samples to the number of all actual positive samples. In practice, Precision and Recall sometimes conflict, so we use the F1-score, which is the harmonic mean of Precision and Recall. OA is the probability that the classified result is consistent with the actual type of the region on the ground. KC is used for consistency testing and can be a better measure of classification accuracy.
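The specific formula for each accuracy evaluation index is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
$$KC = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+} x_{+i}}{N^{2} - \sum_{i=1}^{r} x_{i+} x_{+i}}$$
In the equation of KC, $r$ is the total number of categories in the confusion matrix; $N$ is the total number of pixels used for accuracy evaluation; $x_{ii}$ is the number of correctly extracted pixels on the diagonal of the confusion matrix; and $x_{i+}$ and $x_{+i}$ are the total numbers of pixels in row $i$ and column $i$ of the confusion matrix, respectively.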
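The five indices can be computed directly from a pixel-based confusion matrix. The sketch below uses the standard definitions of Precision, Recall, F1-score, OA and the Kappa coefficient; the function name and the 2 × 2 example matrix are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(cm):
    """Per-class Precision/Recall/F1 plus OA and Kappa from a confusion matrix.

    cm[i][j] is the number of pixels of true class i predicted as class j.
    """
    r = len(cm)
    n = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(r))
    row_tot = [sum(cm[i]) for i in range(r)]                          # x_{i+}
    col_tot = [sum(cm[i][j] for i in range(r)) for j in range(r)]     # x_{+i}

    per_class = []
    for i in range(r):
        tp = cm[i][i]
        fp = col_tot[i] - tp   # predicted as class i but actually another class
        fn = row_tot[i] - tp   # actually class i but predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})

    oa = diag / n
    chance = sum(row_tot[i] * col_tot[i] for i in range(r))
    kappa = (n * diag - chance) / (n * n - chance) if n * n != chance else 0.0
    return per_class, oa, kappa


# Hypothetical two-class example (rows: true class, columns: predicted class).
cm = [[50, 10],
      [5, 35]]
per_class, oa, kappa = classification_metrics(cm)
print(round(per_class[0]["f1"], 4), oa, round(kappa, 4))
```

For this toy matrix, OA is 0.85 while KC is about 0.69, which shows how KC discounts the agreement expected by chance.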

Experimental Results
To verify the superiority of the improved Mask R-CNN model, the accuracy of the recognition results and the applicability of different combinations of visual features for the recognition of roof types of buildings in the countryside in UAV images were compared and evaluated. First, to investigate the applicability of different input feature combinations for the recognition of different roof types, we evaluate the effects of different feature combinations on the recognition results of single rural building roof types and the overall recognition results of roof types using the improved Mask R-CNN model, respectively, to verify the positive effects of different visual feature combinations in the recognition of complex roof types. Second, to evaluate the performance of the improved Mask R-CNN model, we trained the model on the feature combination images with the highest accuracy of roof type recognition and compared it with the original Mask R-CNN, U-Net [71], DeeplabV3 [72] and PSPNet [73] models, and we also verified the impact of different models on the roof type recognition results of single rural buildings and the overall roof type.

Comparison of Accuracy of Roof Type Recognition Results of Single Rural Buildings with Different Feature Combinations
Spectral information and spatial information are significant features for remote sensing image classification and recognition. Based on the improved Mask R-CNN model, this paper compares the roof type recognition effects of two visual features, Sobel and VDVI, combined with UAV visible images and evaluates the impact of spectral-and spatial-based visual features on the recognition accuracy of complex building roof types. The improved Mask R-CNN model is used to conduct four sets of feature combination comparison tests: RGB, RGBS (RGB + Sobel), RGBV (RGB + VDVI) and RGBVS (RGB + VDVI + Sobel). RGB is the orthoimages in the visible band acquired by UAV, and VDVI and Sobel features are calculated from RGB images. Figure 6 shows the recognition results of single building roof types in the T1 and T2 test areas. From Figure 6d, it can be seen that the feature combination of RGBS has stronger sensitivity to the boundaries of single building roof categories, which can accurately outline the outlines of single buildings and correctly separate the boundaries of adjacent roof types, while the recognition results of RGB band only have the problems of broken boundaries and incomplete extraction of internal information. From Figure 6d,e, it can be seen that the feature combination of RGBS and RGBV can identify flat roofs well, and in the case of vegetation distribution, the feature combination of RGBV is more advantageous than that of RGBS for distinguishing vegetation and buildings, while the recognition results of both RGBVS and RGB band only have some degree of under-recognition phenomenon.
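The four feature combinations compared above can be sketched as channel stacking. The code below is an illustration under stated assumptions rather than the paper's pipeline: VDVI uses the commonly cited form (2G − R − B)/(2G + R + B), the Sobel magnitude is computed on a simple channel-mean grayscale image, border pixels are left at zero, and all function names and the toy image are hypothetical.

```python
import math

# Sobel kernels: KX responds to horizontal intensity changes (vertical edges),
# KY to vertical intensity changes (horizontal edges).
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def vdvi(r, g, b):
    """Visible-band difference vegetation index for one pixel."""
    denom = 2 * g + r + b
    return (2 * g - r - b) / denom if denom else 0.0


def to_gray(rgb):
    """Channel mean; an assumption, since the grayscale step is not specified here."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb]


def sobel_magnitude(gray):
    """Gradient magnitude; border pixels are left at 0 for simplicity."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = gray[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * v
                    gy += KY[dy][dx] * v
            out[y][x] = math.hypot(gx, gy)
    return out


def build_features(rgb, combo):
    """Stack channels for combo in {'RGB', 'RGBS', 'RGBV', 'RGBVS'}."""
    edges = sobel_magnitude(to_gray(rgb)) if "S" in combo else None
    stacked = []
    for y, row in enumerate(rgb):
        new_row = []
        for x, (r, g, b) in enumerate(row):
            px = [float(r), float(g), float(b)]
            if "V" in combo:
                px.append(vdvi(r, g, b))      # spectral feature
            if edges is not None:
                px.append(edges[y][x])        # spatial (edge) feature
            new_row.append(px)
        stacked.append(new_row)
    return stacked


# Toy 4 x 4 image: left half vegetation-like green, right half grey roof.
green, roof = (40, 160, 40), (120, 120, 120)
img = [[green, green, roof, roof] for _ in range(4)]
feats = build_features(img, "RGBVS")
print(len(feats[1][1]), feats[1][1][3], feats[1][1][4])
```

In this toy image the VDVI channel is 0.6 on the green pixels and 0 on the grey roof, while the Sobel channel is non-zero only along the boundary between them, which is the complementary behaviour the comparison of RGBV and RGBS relies on.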
The recognition accuracy of roof types of single rural buildings with different feature combinations is shown in Table 2. When RGB is combined with VDVI or with Sobel features, the Precision, Recall and F1-score of each roof type improve to some extent, with the RGBS feature combination performing better. Specifically, the F1-score of RGBS in the T1 test area improved for each roof type by a minimum of 0.03 and a maximum of 0.18 compared to the results for the RGB bands only. In the T2 test area, RGBS shows the highest average F1-score, with superior recognition of gabled and flat roof types in particular and better recognition of the other complex roof types in the area. The RGBV feature combination is also slightly better than the RGB-only results, with F1-score improvements of at least 0.02 and at most 0.11, but the accuracy of RGBV is slightly lower than that of RGBS for the gabled and flat roof types. In addition, the RGBVS feature combination shows a certain degree of accuracy degradation for each roof type, which indicates that adding more features does not necessarily improve recognition accuracy and may even lead to an overall decrease in accuracy.
The overall recognition results of roof types of rural buildings with different feature combinations using the improved Mask R-CNN are shown in Figures 7 and 8. It can be seen that the RGBS feature combination in the T1 and T2 regions can identify more roof types and preserve the number of extracted roofs in the region, while all other feature combinations show a considerable degree of missed extraction. From Figures 7c and 8c, it can be seen that the RGB band features by themselves can maintain high accuracy in extracting different building roof types, but they cannot accurately depict the gaps between different building roof types in dense building areas, and there are cases in which farmland plots are misclassified as gabled roofs. The RGBS feature combination improves this situation: Figures 7d and 8d demonstrate its high performance in identifying medium-sized building roofs, extracting the shape of each building roof type well and separating adjacent roofs. However, where Sobel feature detection is insufficient, there are also some recognition errors, as shown in Figure 8d, where roof types shaded by vegetation may not be identified accurately. RGBV, by contrast, can distinguish each roof type from vegetation in this case: as shown in Figure 8e, the RGBV feature combination improves the recognition of roof types that are heavily shaded by vegetation.
The overall recognition accuracies of roofs with different feature combinations are shown in Table 3. The results show that the RGBS feature combination has the highest roof type recognition accuracy in both the T1 and T2 test areas, with improvements in F1-score, KC and OA over the RGB band features of 0.105, 0.115 and 0.061 in T1 and 0.05, 0.115 and 0.075 in T2, respectively. Roof type identification with the RGBV feature combination also has a higher F1-score, KC and OA than the RGB band features, improving by 0.023, 0.042 and 0.028, and 0.042, 0.097 and 0.078, respectively. In addition, the roof recognition accuracy of the RGBVS feature combination in both test areas is significantly lower than that of the RGB band features, indicating that too many feature inputs may instead hinder the model from extracting image features, resulting in low recognition accuracy. In contrast, using the right combination of features can improve the recognition accuracy of roof types in UAV visible band images to a certain extent.
The RGBS feature combination dataset, which has the highest accuracy for roof type recognition, is fed to the input layer of the improved Mask R-CNN model for training, and the performance of the improved Mask R-CNN model is evaluated by comparing its recognition results and accuracy with those of the original Mask R-CNN, U-Net, DeeplabV3 and PSPNet. The original Mask R-CNN uses ResNet50 as the feature extraction layer, which can also obtain good results in roof type recognition. U-Net is widely applied in medical image recognition and has also achieved good results in building recognition from remote sensing images. DeeplabV3 uses the ASPP module to mine convolutional features and image-level features at different scales and has a wide range of applications in high-resolution remote sensing image classification [74]. PSPNet is able to aggregate global contextual information from different sub-region images and is suitable for image segmentation of buildings in different complex scenes.
The evaluation metrics for the recognition results of roof types of single buildings with different deep learning models in the T1 and T2 test areas are shown in Table 4. The mean F1-score of the improved Mask R-CNN is higher than that of the other models in the recognition of gabled, flat, hipped and complex roof types, indicating that it has a greater advantage in recognizing different roof types. Although the recognition accuracy of the original Mask R-CNN model for different building roof types is not as high as that of the improved Mask R-CNN, its accuracy is more stable and it also maintains a high recognition accuracy. On the other hand, the U-Net, DeeplabV3 and PSPNet models all show very low recognition accuracy on hipped, complex and mono-pitched roof types, indicating that there are still limitations in using only semantic segmentation networks to recognize complex building roof types.
The overall recognition accuracy of different deep learning models on the roof types of rural buildings is shown in Table 5, and the results show that the model proposed in this paper has higher evaluation indexes than the other deep learning models in both the T1 and T2 test areas, with F1-score, KC and OA improving, respectively, by 0. [...] and missed identifications for roof types with smaller sample sizes and more complex features, resulting in lower accuracy of roof type recognition, and thus these models have shortcomings in robustness and generalizability.
To verify the effect of different feature combinations on the training curves of the improved Mask R-CNN model used in this paper, we trained the model for 20 epochs on sample datasets with different feature combinations and obtained the loss curves of the training and validation sets during training. As shown in Figure 9, in terms of training efficiency, the training curve of the RGBS feature combination exhibits a faster convergence rate, which is 40% higher than that of the other feature combinations for the same number of epochs, greatly improving the model training efficiency. In terms of model stability, the training curve of the RGBS feature combination has the least fluctuation and is highly stable, which can reduce the occurrence of overfitting. In terms of training accuracy, the training and validation loss values of the RGBS feature combination are closer to 0.5 than those of the other feature combinations. It can be seen that the combination of Sobel features and UAV RGB images is more conducive to improving the training efficiency, stability, and accuracy of the model.

Effect of Different Feature Extraction Layers of ResNet on the Accuracy of Results
Feature extraction is the key for deep learning models to maintain high accuracy in recognition results. To investigate the effect of ResNet backbones of different typical depths on the accuracy of Mask R-CNN models in recognizing complex rural building roof types, we used transfer learning to deploy pre-trained ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152 as the Mask R-CNN feature extraction layer, trained each model on the RGBS feature combination sample dataset, which has the highest recognition accuracy, and compared their recognition accuracy and efficiency for rural building roof types on the T1 and T2 test areas. As shown in Table 6, using ResNet152 as the feature extraction layer of Mask R-CNN obtained a much higher roof type recognition accuracy than ResNet18, ResNet34 and ResNet50. Although the recognition accuracy of the Mask R-CNN model based on ResNet101 is very close to that of ResNet152, it consumes much more training time than ResNet152, which may be because ResNet152 has more residual blocks, which reduces the complexity of feature extraction for the model and thus improves the feature extraction capability and efficiency.
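For reference, the nominal depths of the five backbones compared above follow from the number of residual blocks per group. The sketch below uses the standard ResNet block configurations (two convolutions per basic block, three per bottleneck block, plus the stem convolution and the final fully connected layer); these configurations come from the general ResNet design rather than from this paper, and the names are illustrative.

```python
# (block type, residual blocks in groups conv2_x .. conv5_x)
RESNET_CONFIGS = {
    "ResNet18": ("basic", [2, 2, 2, 2]),
    "ResNet34": ("basic", [3, 4, 6, 3]),
    "ResNet50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet152": ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def residual_blocks(name):
    """Total number of residual blocks in the network."""
    return sum(RESNET_CONFIGS[name][1])


def nominal_depth(name):
    """Weighted layers: convolutions in all blocks + stem conv + final FC layer."""
    kind, blocks = RESNET_CONFIGS[name]
    return CONVS_PER_BLOCK[kind] * sum(blocks) + 2


for name in RESNET_CONFIGS:
    print(name, residual_blocks(name), nominal_depth(name))
```

Under these configurations ResNet152 has 50 residual blocks against 33 for ResNet101, the difference lying entirely in the third and fourth groups.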

Analysis of the Limitations of Roof Type Identification Methods for Complex Rural Buildings
With the wide application of UAV high-resolution remote sensing images, the accurate recognition of rural building roof types has gradually become possible. However, there are a large number of complex rural building roof types in UAV visible images and other features that are easily confused with building roof types, which poses a great challenge to the existing methods. Therefore, this paper proposes an improved Mask R-CNN model based on the combination of different visual features, which can effectively improve the recognition accuracy of complex rural building roof types in UAV visible images and provide a feasible reference solution for rural roof surveys. Specifically, the RGBS feature combination uses Sobel edge detection features to highlight the surface texture, shape and boundaries of different rustic building roof types, making it easier for the model to extract important features of complex building roof types, and the results show that the model based on this feature combination can completely and accurately segment the vector contours of different building roof types, clearly separating the gaps between buildings. The RGBV feature combination, on the other hand, uses VDVI features to highlight vegetation areas to distinguish them from buildings, reducing misclassification of dense areas of rural buildings, such as small building roofs covered by trees and green pads. In addition, we compare the accuracy of the improved Mask R-CNN model with other deep learning models for roof type recognition, and the results also show that the improved Mask R-CNN model exhibits the highest accuracy and the best robustness, with good performance in coping with different complex scenarios for recognizing roof types of rural buildings. 
However, the method proposed in this paper also has some false recognition and missed recognition for more complex building types and their morphologies (e.g., large buildings, irregularly shaped buildings, dark buildings, etc.), resulting in model recognition accuracy that is not very high, so we analyze the causes of these errors and their improvement methods.
(1) Uneven sample size across roof types
The uneven sample size of different roof types is an important source of error in identifying the roof types of complex buildings. In this paper, the roof types of buildings in UAV images are divided into five categories, but since rural buildings are far less numerous and dense than urban buildings, the sample sizes of these five categories cannot be guaranteed to be evenly distributed. The hipped, complex and mono-pitched roof type datasets, which have smaller sample sizes, were therefore trained separately in a second pass, and the results showed more accurate recognition than training the five categories together. We combined the results of training on the full five-class dataset and on the three-class dataset with smaller sample sizes to obtain the overall building roof type recognition results. However, the recognition results and accuracy still did not reach a high level. This may be due to the imbalance among the five categories of roof samples, which makes the model pay more attention to the gabled and flat categories with large data volumes; the network parameters are optimized mainly based on the losses of these two categories, resulting in much lower test accuracy for the remaining categories. For example, large irregular buildings are rare in rural areas and have little training sample data, which reduces the model's ability to recognize the roof types of such buildings (hipped and complex roofs). Mono-pitched roofs, which also have a small sample size, generally include more dark buildings and tend to cling to the sides of taller buildings, so they are obscured by shadows and by other building walls; this also prevents the model from accurately extracting the features of this roof type, resulting in low identification accuracy for mono-pitched roofs.
In addition, the category with larger sample data does not mean that higher recognition accuracy can be obtained; the more samples of the category, the higher its recall rate will be, and therefore a certain accuracy will be lost accordingly [75]. For example, in the T2 test area, the recognition results of RGBV feature combinations have a higher Recall for flat type roofs (Table 2 and Figure 8e), but the Precision is 24.9% different, indicating a decrease in the F1-score. The recognition error problem caused by the imbalance between samples can be improved by increasing the sample size of complex roof types in other larger regions or by setting higher weights of network parameters for the categories with more complex features and smaller sample sizes.
(2) Limitations of different visual feature extraction methods
Different visual features also have limitations for extracting features of building roof types in complex scenes. The Sobel edge detection algorithm in the RGBS feature combination used in this paper can only detect edges in the horizontal and vertical directions of the image, which often gives low detection accuracy in more complex scenes; the detected image edges are coarse, so edge points cannot be located precisely and additional background noise may be generated. Furthermore, while VDVI in the RGBV feature combination distinguishes vegetation from buildings, it also suppresses building roof shapes and textures in densely built-up areas, so that the model cannot effectively extract the features of different building roof types. The Sobel algorithm can be improved to refine the detected edge features and to raise its edge detection accuracy in complex scenes.
(3) Mask R-CNN structure problem
The structure of Mask R-CNN itself makes inadequate use of the features of roof types at each scale. Although this paper uses the transfer learning-based ResNet152 as the feature extraction layer of Mask R-CNN, two problems remain in the structure of Mask R-CNN itself. First, the path between the highest-level features and the lowest-level features is too long, which easily leads to loss of feature information during transfer and prevents effective use of the lower-level features. Second, the feature map input to the RPN carries information only about itself and the higher-level features, which does not make full use of the feature information at each scale, resulting in lower detection accuracy [76]. These problems prevent the improved Mask R-CNN model from effectively using the extracted features of complex building roof types at all scales, thus reducing the accuracy of the model in recognizing complex building roof types. Future research can improve the network structure of the Mask R-CNN model (e.g., the FPN network) to shorten the path from low-level features to high-level mapping, reduce the loss of feature information in the transfer process, and improve the feature utilization efficiency of the model, thus improving its performance.
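The re-weighting remedy suggested under (1) can be illustrated with inverse-frequency class weights, one common way of giving under-represented categories more influence on the loss. This is a sketch of that general technique, not the paper's procedure; the sample counts below are hypothetical and are not the dataset's actual sizes.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count).

    A perfectly balanced dataset gives every class a weight of 1.0;
    rarer classes receive proportionally larger weights.
    """
    total = sum(counts.values())
    k = len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}


# Hypothetical sample counts per roof type (illustrative only).
counts = {"gabled": 500, "flat": 300, "hipped": 80, "complex": 70, "mono-pitched": 50}
weights = inverse_frequency_weights(counts)
for cls, w in weights.items():
    print(f"{cls}: {w:.3f}")
```

With these illustrative counts the mono-pitched class receives ten times the weight of the gabled class, so an error on a rare roof type costs the model correspondingly more during training.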

Conclusions
Rural areas are home to nearly half of China's population, and the survey of rural building roof types is of great significance for the planning and construction of beautiful villages in China. Three problems currently hinder such surveys: most UAV high-resolution remote sensing images only have visible bands, existing methods have difficulty extracting the features of complex roof types, and ground objects with similar spectral features, such as low-reflection objects, obscuring vegetation and concrete roads, are easily confused with building roofs. To address these problems, this paper proposes a method to identify rural building roof types in UAV visible images based on different combinations of visual features, and uses an improved Mask R-CNN deep learning model to improve the recognition accuracy of complex building roof types. VDVI features based on spectral vision and Sobel edge detection features based on spatial vision are combined with UAV visible images to form different feature datasets, which are then applied to a deep learning model for roof type recognition. We evaluate the recognition results of the models with four feature combinations (RGB, RGB + Sobel, RGB + VDVI and RGB + VDVI + Sobel) and also compare the accuracy of the improved Mask R-CNN with that of the original Mask R-CNN, U-Net, DeeplabV3 and PSPNet deep learning models.
The results show that adding either Sobel features or VDVI features to the UAV visible RGB images can improve the accuracy of the model in recognizing the roof types of rural buildings. Firstly, adding Sobel features to RGB images allows the model to identify the types and contours of different building roofs more clearly, especially in dense building areas, and to show the gaps between different buildings well. Secondly, combining RGB images with VDVI features effectively distinguishes buildings from vegetation areas and improves the recognition accuracy of buildings obscured by vegetation. In contrast, when RGB images are combined with both VDVI and Sobel features, the recognition accuracy of the model for roof types decreases, indicating that too many feature combinations may not benefit the recognition of building roof types. In addition, the F1-score, KC and OA of the rural building roof type recognition results of the improved Mask R-CNN used in this paper are higher than those of the other deep learning models, showing the highest accuracy and robustness in the test areas.
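For readers who wish to reproduce the accuracy measures named above, the following minimal sketch computes OA, KC and per-class F1-score from a confusion matrix using their standard definitions. It is not the evaluation code used in this paper. The function names are hypothetical and the example matrix is invented purely for illustration, and how the per-class F1-scores are averaged is left to the reader.

```python
import numpy as np


def overall_accuracy(cm):
    """OA: correctly classified samples divided by all samples."""
    cm = np.asarray(cm, dtype=np.float64)
    return np.trace(cm) / cm.sum()


def kappa_coefficient(cm):
    """KC (Cohen's kappa): agreement corrected for chance agreement."""
    cm = np.asarray(cm, dtype=np.float64)
    n = cm.sum()
    observed = np.trace(cm) / n
    # Chance agreement from the products of row and column totals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (observed - expected) / (1.0 - expected)


def f1_per_class(cm):
    """Per-class F1 = 2*TP / (row total + column total)."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    # Classes that never occur and are never predicted get an F1 of 0.
    return np.divide(2.0 * tp, denom, out=np.zeros_like(tp), where=denom != 0)


# Hypothetical two-class confusion matrix, for illustration only.
example_cm = [[40, 10],
              [5, 45]]
example_oa = overall_accuracy(example_cm)
example_kc = kappa_coefficient(example_cm)
example_f1 = f1_per_class(example_cm)
```

All three measures are unchanged if the confusion matrix is transposed, so the sketch does not depend on whether rows hold the reference or the predicted classes.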