Article

Color-Coated Steel Sheet Roof Building Extraction from External Environment of High-Speed Rail Based on High-Resolution Remote Sensing Images

1 MOE Key Laboratory of Optoelectronic Imaging Technology and System, Beijing Institute of Technology, Beijing 100081, China
2 Institute of Computing Technologies, Chinese Academy of Railway Sciences, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(16), 3933; https://doi.org/10.3390/rs15163933
Submission received: 2 June 2023 / Revised: 4 August 2023 / Accepted: 6 August 2023 / Published: 8 August 2023

Abstract:
The identification of color-coated steel sheet (CCSS) roof buildings in the external environment is of great significance for the operational security of high-speed rail systems. While high-resolution remote sensing images offer an efficient approach to identifying CCSS roof buildings, accurate extraction is challenging due to the complex background in remote sensing images and the extensive scale range of CCSS roof buildings. This research introduces the deformation-aware feature enhancement and alignment network (DFEANet) to address these challenges. DFEANet adaptively adjusts the receptive field to effectively separate the foreground and background, facilitated by the deformation-aware feature enhancement module (DFEM). Additionally, a feature alignment and gated fusion module (FAGM) is proposed to refine boundaries and preserve structural details; it ameliorates the misalignment between adjacent features and suppresses redundant information during the fusion process. Experimental results on remote sensing images along the Beijing–Zhangjiakou high-speed railway demonstrate the effectiveness of DFEANet. Ablation studies further underscore the enhancement in extraction accuracy due to the proposed modules. Overall, DFEANet was verified as capable of assisting in securing the external environment of high-speed rail.

1. Introduction

High-speed rail is an important part of modern urban transportation [1,2,3,4], and it is essential to ensure its operational safety [5,6,7,8,9]. External environment security is one of the key points of high-speed rail operational security [10,11,12], and unstable objects in the external environment can easily enter the rail area and result in serious security accidents. These objects pose safety hazards around railways. Common categories of safety hazard sources include color-coated steel sheet (CCSS) roof buildings, plastic greenhouses, and dust-proof nets [13].
CCSSs are rolled from color-painted galvanized (or aluminized) steel sheets and have been widely used in construction. CCSS roof buildings, as a typical type of temporary construction, have been built in large numbers during the process of urban expansion due to their corrosion resistance [14,15], easy construction, and low cost. They have also increased in number and size around high-speed rail lines that run through cities. However, their ease of construction leads to the instability of CCSS roofs. Since CCSS roofs are light and the components in CCSS roof buildings are mainly bolted or welded [16], the CCSS roofs can easily be blown onto the high-speed rail line by high winds, causing operational safety issues. Therefore, it is important to investigate CCSS roof buildings constructed in the external environment regularly for the operational security of high-speed rails.
In practical work, the investigation of CCSS roof buildings surrounding high-speed rails relies on manual field investigations, conducted within 500 m on both sides of the high-speed railway every seven days. This approach is labor-intensive and subject to terrain and weather constraints. The development of remote sensing imaging technology provides a new approach; earth observation systems can quickly acquire large-scale, high-resolution remote sensing images without being constrained by terrain. Identification of CCSS roof buildings in remote sensing images greatly reduces human costs. However, existing CCSS roof building identification still relies on professional visual interpretation. Despite reducing the workload of fieldwork, this method cannot make effective use of massive remote sensing data due to slow processing speeds and requires professional operators.
Over the past years, many studies have been devoted to extracting CCSS roof buildings based on spectral and texture features from high-resolution remote sensing images. Guo et al. [17] proposed a new spectral index for blue CCSS roof buildings based on Landsat-8 images to map industrial areas; Samat et al. [18] designed several spectral indexes to enhance and identify blue and red CCSS roof buildings from the Sentinel-2 images and analyzed the correlation between their construction area and urban population; Zhang et al. [19] established a decision tree model, combining spectral and texture features, to study the spatiotemporal change rule of CCSS roof buildings in Kenya, Africa.
Although these studies performed well on images with obvious selected features, the handcrafted features can be affected by sunlight, seasons, and sensors, making it difficult to use these methods on massive multi-source remote sensing imagery. Thus, it is natural to apply deep-learning technology as a more generalized and intelligent method. CCSS roof building extraction, as a pixel-wise classification task, can be regarded as semantic segmentation in computer vision. Although there has been a lot of work on object extraction from remote sensing images based on deep learning methods, the extraction of CCSS roof buildings is more challenging. First, the scale and density of CCSS roof buildings are related to the scene. For example, in industrial zones and construction land, CCSS roof buildings are large in scale and have high aggregation; the floor area could be tens of thousands of square meters, while in towns and urban villages, they are scattered and occupy only tens of square meters or even less. The great scale variation leads to holes and omissions in extraction masks. Second, the CCSS roof buildings in remote sensing images mostly appear as irregular shapes, which poses challenges to models that have fixed-shape receptive fields. Third, the locations of CCSS roofs are highly diverse due to the convenience of their construction, including industrial zones, urban villages, construction sites, etc. Moreover, in some places, people have illegally built sheds on top of buildings or reinforced the roofs with CCSS, making it more challenging to distinguish CCSS-roofed buildings from complex backgrounds.
In this paper, we are devoted to developing a method that can address the above issues and assist in ensuring high-speed rail external environment security. Our main contributions can be summarized as follows:
  • We propose a deformation-aware feature enhancement and alignment network (DFEANet) to realize intelligent CCSS roof building identification in the external environment of high-speed rails.
  • A deformation-aware feature enhancement module (DFEM) is proposed to solve the problem associated with the multiple scales and irregular shapes of CCSS roof buildings. It adjusts the spatial sampling locations of convolutional layers according to the input feature and uncovers implicit spectral features, thus separating these features from the complex background.
  • A feature alignment and gated fusion module (FAGM) is proposed to suppress interference from the background and maintain structural integrity and details. It mitigates the spatial misalignment between adjacent semantic feature maps and guides the fusion process, thereby reducing the introduction of redundant information.
  • High-resolution remote sensing images collected from the SuperView-1 satellite are used to evaluate the effectiveness of DFEANet. Compared with six classical and state-of-the-art (SOTA) deep-learning methods, DFEANet achieved competitive performance.

2. Related Work

2.1. Deep Learning-Based Research on CCSS Roof Buildings

With the accelerated advancements in computing capabilities, deep learning has gained significant scholarly interest due to its superior generalization performance. This has facilitated its extensive deployment within the domain of remote sensing. Specific applications include, but are not limited to, change detection [20,21], pan-sharpening [22], and hyperspectral image classification [23,24,25,26].
Some studies have applied deep learning to the research on CCSS roof buildings. Hou et al. [27] implemented several classical networks, such as ResNet, VGG, and DenseNet, to classify image patches with CCSS roof buildings; Yu et al. [28] proposed an improved YOLO v5 to detect illicitly constructed CCSS roof structures using rectangular bounding boxes from Unmanned Aerial Vehicle (UAV) images. To accomplish pixel-wise object extraction, Sun et al. [29] utilized DeepLab v3+ for extracting blue CCSS roof buildings within the Nanhai District, Foshan, China, and analyzed the correlation between their distribution and industrial production. Despite the effectiveness of previous works, limitations persist. Some studies were unable to obtain pixel-wise semantic information, while others employed models originally conceived to process natural images as opposed to remote sensing images. Such models overlook the issues prompted by the characteristics of CCSS roofs.
To improve the CCSS roof building extraction accuracy, Pan et al. [13] proposed a texture-enhanced ResUNet that applied histogram quantization to enhance boundary details. However, performing simple grayscale enhancement solely within the spatial dimension is insufficient to address issues stemming from intricate backgrounds and irregular shapes. In this paper, we initially enhance both spatial and spectral features, followed by the suppression of irrelevant features through the regulation of information flow during feature fusion. The proposed method effectively improves extraction accuracy.

2.2. Multi-Scale Feature Extraction and Fusion

Large-scale variation is the main characteristic of CCSS roof buildings in remote sensing images because of the diversity of their functions and locations. Inadequate learning of CCSS roof building features typically results in holes within the predictions of large-scale CCSS roof buildings and irregular boundaries in the predictions of smaller ones. Many researchers have aimed to improve the algorithm’s robustness to scale variation by adjusting the receptive field or optimizing the multi-level feature fusion strategy.
The receptive field indicates the size of the area in the input image that corresponds to each pixel in the feature map. The bigger the receptive field, the more global information, especially long-distance dependencies, is captured by each pixel. Although the receptive field in the encoder–decoder structure increases gradually with the encoder depth, it is often hard to generate complete predictions for large-scale CCSS roof buildings. Dilated convolution and spatial pyramid pooling are the main ways to enlarge the receptive field. The former changes the receptive field by adjusting the dilated rate without computational cost; the latter adopts a pooling operation with different window sizes. In the field of target extraction from high-resolution remote sensing images, many studies have embedded them in bottlenecks to improve the extraction accuracy of large-scale targets. Zhu et al. [30] inserted multiple dilated convolution branches in parallel at the end of the encoder to extract roads in remote sensing images; Zheng et al. [31] proposed a large kernel pyramid pooling module to parse urban scene images, which adopted asymmetric dilated convolutional layers to counter the grid problem and connected multiple branches with different receptive fields in parallel to flexibly adjust the receptive field; Wang et al. [32] proposed a global feature information awareness module to improve the accuracy of building extraction, which enlarged the receptive field to a global scope by adopting dilated convolutional layers with different dilation rates to process the highest level feature map and combine non-local block to capture the global context. However, the receptive fields following fixed shapes in these studies cannot fit the shape of the targets well, which may cause a large number of extraneous pixels to be sampled.
The main methods used to fuse multi-level feature maps are to up-sample the high-level feature map and then combine it with the low-level feature map by concatenating along the channel dimension or directly adding, such as U-Net [33] and FPN [34]. However, blind fusion without information selection considers all fused information as equally important, which may introduce useless or interfering information to the output. To wisely utilize multi-level features, some studies adopted a gating mechanism to guide the feature fusion process and improve the utilization of valuable information. Ding et al. [35] used recurrent neural networks to learn the gate map for each level feature map, then fused all level feature maps according to them; GFF [36] learned gate maps for all feature maps extracted by the backbone and fused every feature map with the others via the gating mechanism; GSMC [37] adopted an input gate and a state gate to screen important spatial structure and semantic states in different spatial scales. These studies controlled the information flow in the feature fusion process efficiently but ignored the misalignment between adjacent feature maps, resulting in the omission of spatial details.
To summarize, expanding the receptive field and optimizing multi-level feature fusion methods are common ways to deal with large-scale variations in objects in high-resolution optical remote sensing images. However, the fixed-shape receptive field is inflexible to the geometrical variation of CCSS roof buildings, and the misalignment between adjacent feature maps should also be considered. In this paper, the DFEANet adopts DFEMs to adaptively adjust the receptive field according to feature deformation and FAGMs to align adjacent feature maps and guide the feature fusion process, outperforming the classical and SOTA deep-learning methods.

3. Data and Study Area

3.1. Study Area

In this study, the surrounding area of the Beijing–Zhangjiakou high-speed railway was taken as the study area. Completed at the end of 2019, the Beijing–Zhangjiakou high-speed rail undertook important transportation tasks during the 2022 Beijing Winter Olympics and has promoted the integrated development of the Beijing–Tianjin–Hebei region [38]. As shown in Figure 1, the main line is 174 km long, connecting Zhangjiakou and Beijing, which differ considerably in altitude and climate.
The study area covers Qiaoxi District, Qiaodong District, Xiahuayuan District, Xuanhua District, and Huailai County in Zhangjiakou City, as well as Yanqing District, Changping District, and Haidian District in Beijing. In particular, Zhangjiakou is the focus of high-speed rail operational security because it is located in the transition zone between the North China Plain and the Inner Mongolia Plateau, which is a mountainous area and windy all year round. Thus, identifying CCSS roof buildings in the external environment is significant for guaranteeing the safe operation of the Beijing–Zhangjiakou high-speed railway. Comprehensively considering the application and sample requirements, a 5 km wide area on both sides of the Beijing–Zhangjiakou high-speed railway was selected to study CCSS roof building extraction.

3.2. Data

The remote sensing images used in this study were provided by the SuperView-1 satellite. SuperView-1 is China's first commercial multi-sensor satellite constellation, launched on 9 January 2018. It operates on a highly agile platform and provides four collection modes, including long-strip, multi-strip, multi-point, and stereoscopic collection. The imagery provided by SuperView-1 has a 0.5 m panchromatic resolution and a 2 m multispectral resolution. The SuperView-1 satellite specifications are listed in Table 1 [39]. The images were collected from January to November 2019, covering all four seasons. The images underwent radiometric and geometric correction, followed by pan-sharpening, image mosaicking, and color balancing. As a result, a 3-band image with a 0.5 m resolution covering the main line of the Beijing–Zhangjiakou high-speed railway was obtained.

3.3. Self-Annotated CCSS Roof Building Dataset

To verify the effectiveness of the proposed method in CCSS roof building extraction, a self-annotated CCSS roof building dataset was built based on the satellite image mentioned above. This dataset captures the external environmental characteristics of the high-speed rail under different seasonal conditions.
Figure 2 shows the samples of the self-annotated CCSS roof building dataset. The CCSS roof buildings are commonly blue; however, some are white or red. They are distributed in industrial zones, construction sites, or residential areas with high population density. The scale and density of CCSS roof buildings in different areas vary greatly; large-scale CCSS roof buildings are usually built as factories or temporary dormitories, while small-scale ones are scattered in residential areas. The dataset, covering an area of 896 km2 with a 0.5 m spatial resolution, was annotated by professionals through visual interpretation and underwent a round of annotation checks. CCSS roof buildings were annotated in shapefile format and saved as a tag image file format (TIFF). The images and labels were cropped into tiles with 512 × 512 pixels in 400-pixel steps. When testing, the results were obtained by averaging the scores of overlapping areas. To balance the positive and negative samples, image tiles without labels were removed, and 12,156 image tiles were finally obtained. The cropped images and labels were divided into a training set, a validation set, and a test set according to a ratio of 3:1:2. Finally, the training set contained 6078 tiles, the validation set contained 2026 tiles, and the test set contained 4052 tiles.
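As an illustration of the tiling step, the sketch below crops an image/label pair into 512 × 512 tiles with a 400-pixel stride and drops tiles without any CCSS roof pixels. This is a simplified sketch rather than the authors' preprocessing code; reading the TIFF rasters and the 3:1:2 split are omitted, and the function and variable names are ours.

```python
import numpy as np

def tile_image_and_label(image, label, tile=512, step=400):
    """image: (H, W, 3) array, label: (H, W) binary mask. Returns lists of tiles,
    keeping only tiles that contain at least one positive (CCSS roof) pixel."""
    h, w = label.shape
    img_tiles, lbl_tiles = [], []
    for y in range(0, max(h - tile, 0) + 1, step):
        for x in range(0, max(w - tile, 0) + 1, step):
            lbl_patch = label[y:y + tile, x:x + tile]
            if lbl_patch.sum() == 0:          # remove tiles without labels
                continue
            img_tiles.append(image[y:y + tile, x:x + tile])
            lbl_tiles.append(lbl_patch)
    return img_tiles, lbl_tiles
```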

4. Methodology

In this section, we propose a method to improve the performance of CCSS roof building extraction from high-resolution remote sensing images by accurately separating the target from the complex background and reserving structural details. Concretely, we first give an overview of DFEANet. Then the deformable convolution adopted in DFEANet is introduced. Finally, two proposed modules are described in detail.

4.1. Model Overview

Effective feature representation of multi-scale CCSS roof buildings from high-resolution remote sensing images is essential to improving the extraction accuracy of CCSS roof buildings in the external environment of a high-speed rail. The encoder–decoder structure has been verified as effective in coping with scale variance. It extracts multi-level features via the encoder and then fuses them via the decoder to make predictions. In the feature extraction process, as the depth of the encoder increases, the resolution of the feature maps reduces while the semantic information increases. Features of different scales are included in different-level feature maps. Specifically, features of small objects and details are contained in low-level feature maps, while features of large-scale objects are obtained in high-level feature maps. To accurately extract CCSS roof buildings of different scales from complex backgrounds and preserve more detailed information, DFEANet was proposed. The overview of DFEANet is shown in Figure 3. DFEANet adopts ResNet as the encoder. Deformable convolution is adopted in DFEMs to fit the deformation of CCSS roof buildings in different-level feature maps and separate their features from the complex background. The processed feature maps are then adjusted and fused from top to bottom by FAGMs to reduce the spatial deviation. The gating mechanism is adopted to guide the fusion process and suppress redundant information. Finally, the multi-level feature maps obtained by FAGMs are sampled to a unified scale, concatenated along the channel dimension, and then input into the segment head to generate the final prediction.
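To make the data flow in Figure 3 concrete, a schematic PyTorch sketch is given below. It is not the authors' released implementation; the module interfaces (DFEM, FAGM, and the segmentation head, sketched in the following subsections) and the exact wiring are assumptions based on the description above.

```python
import torch.nn as nn

class DFEANetSketch(nn.Module):
    """Schematic wiring of DFEANet: encoder -> per-level DFEMs -> top-down FAGMs
    -> segmentation head. All sub-modules are passed in, so this only shows the flow."""
    def __init__(self, encoder, dfems, fagms, seg_head):
        super().__init__()
        self.encoder = encoder               # ResNet backbone returning stage0..stage4 features
        self.dfems = nn.ModuleList(dfems)    # one DFEM per encoder level (stages 1-4)
        self.fagms = nn.ModuleList(fagms)    # three FAGMs applied top-down
        self.seg_head = seg_head             # segmentation head with boundary alignment

    def forward(self, x):
        f0, f1, f2, f3, f4 = self.encoder(x)                       # multi-level features
        e1, e2, e3, e4 = [m(f) for m, f in zip(self.dfems, (f1, f2, f3, f4))]
        d3 = self.fagms[0](high=e4, low=e3)                        # align + gated fusion
        d2 = self.fagms[1](high=d3, low=e2)
        d1 = self.fagms[2](high=d2, low=e1)
        # unify scales, concatenate, align with stage0, and predict the mask
        return self.seg_head([d1, d2, d3, e4], f0, out_size=x.shape[-2:])
```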

4.2. Deformable Convolution

In this study, deformable convolution is used to establish the relationship between pixels in irregular local areas according to the feature shapes, improving the integrity of feature extraction. Traditional convolution samples a rectangular region around the pixel in the input image; thus, the receptive field is mostly rectangular. However, in reality, CCSS roofs often have different shapes, and a regular sampling grid limits the exploration of interpixel relationships at different distances. Therefore, it is desirable to adaptively adjust receptive field sizes to establish interpixel relationships more efficiently and improve feature representation. To this end, Zhu et al. [40,41] proposed deformable convolution. It adds learnable two-dimensional offsets to the regular grid-like sampling locations in the standard convolution to flexibly adjust each sample location and modulates them with a learnable feature amplitude, which can control both the spatial distribution and relevance of samples.
In standard convolution, taking a 3 × 3 convolution with dilation 1 as an example, the set of sample locations relative to the central one can be expressed as

$$K = \{(-1, -1), (-1, 0), \ldots, (1, 0), (1, 1)\}$$

For each location $q_0$ on the output $y$, $y(q_0)$ is obtained by sampling the input $x$ as expressed in [40]:

$$y(q_0) = \sum_{n=1}^{N} w(q_n)\, x(q_0 + q_n), \quad N = 3 \times 3$$

where $N$ is the number of sample locations, $w$ denotes the weight for each sample location, and $q_n$ enumerates the locations in $K$.
Deformable convolution [41] introduces learnable offsets and modulation to the regular grid sample locations, and the $y(q_0)$ obtained by the deformable convolution can be expressed as

$$y(q_0) = \sum_{n=1}^{N} w(q_n)\, x(q_0 + q_n + \Delta q_n)\, \Delta m_n, \quad N = 3 \times 3$$

where $\Delta q_n$ and $\Delta m_n$ are learnable offsets and modulation scalars, respectively. $\Delta q_n$ adjusts the sample location according to the input features, while $\Delta m_n$, which lies in the range [0, 1], modulates the feature amplitude to suppress irrelevant information.
Figure 4 illustrates the workflow of deformable convolution. For each location, the offsets $\Delta q_n$ and the modulation scalars $\Delta m_n$ are learned from the input feature map by a separate convolution layer. This layer outputs a tensor with the same height and width as the input and $3N$ channels: the offsets $\Delta q_n$ in the x and y directions are recorded in the first $2N$ channels, and the remaining $N$ channels correspond to the modulation scalars $\Delta m_n$. Deformable convolution then calculates the output following the deformable formulation above. In DFEANet, deformable convolutional layers are adopted in DFEM to enhance the spatial representation of multi-level features and alleviate incomplete predictions.
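As a concrete illustration, recent versions of torchvision expose modulated deformable convolution as torchvision.ops.DeformConv2d. The sketch below (an illustration under our own assumptions, not the authors' code) predicts the $3N$-channel offset/modulation tensor with a plain convolution, as described above, and applies a sigmoid to the last $N$ channels.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d  # modulated (v2) deformable convolution

class DeformConvBlock(nn.Module):
    """3 x 3 modulated deformable convolution. A separate plain convolution predicts
    a 3N-channel tensor: the first 2N channels are the (x, y) offsets and the last N
    channels are the modulation scalars, squashed to [0, 1] by a sigmoid."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.n = k * k                                  # N sample locations per kernel
        self.offset_mask = nn.Conv2d(in_ch, 3 * self.n, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = torch.split(om, [2 * self.n, self.n], dim=1)
        mask = torch.sigmoid(mask)                      # modulation scalars in [0, 1]
        return self.deform(x, offset, mask)
```

A quick shape check: for a 3 × 3 kernel, the predicted offset tensor has 18 channels and the modulation mask has 9, matching the $3N$-channel layout described above.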

4.3. Deformation-Aware Feature Enhancement Module

The extraction of large-scale CCSS roof buildings often results in incompleteness, while small ones may be omitted from the extraction masks. These issues are usually related to the inefficient separation of different-scale CCSS roof buildings from the background in multi-level feature maps. The complex background in remote sensing images disrupts CCSS roof building extraction. In the encoder–decoder structure, if the multi-level feature maps extracted by the encoder are delivered directly to the decoder, the features of CCSS roof buildings and noise may be transferred indistinguishably, causing the features to become confused by background noise. This confusion is particularly challenging for CCSS roof buildings in remote sensing images, given their irregular shapes and extensive scale range, making them even more difficult to distinguish from the background. Generally, strong dependencies exist among pixels of the same object; capturing more interpixel dependencies leads to more accurate feature extraction. Therefore, it is feasible to improve extraction accuracy by effectively capturing interpixel dependencies. DFEM is proposed to dig out the implicit spatial and spectral relationships between features and the background. Specifically, deformable convolution is adopted in the spatial enhancement and feature extraction parts of DFEM to capture the spatial morphology of the CCSS roof building features.
As shown in Figure 5, DFEM digs out the relationship between the target and background in the multi-level feature maps across two dimensions: channel and space. Spectral signatures, as the most prominent features of CCSS roof buildings, are primarily contained in the channel dimension of the image. Within this context, the channel branch works to enhance the implicit spectral signatures by weighting the channels. Global average pooling (GAP) [42] is first adopted to compress the input feature map $S \in \mathbb{R}^{C \times H \times W}$ along the spatial dimension to a vector $U \in \mathbb{R}^{C \times 1 \times 1}$. The input feature map can be considered as a set of single-channel feature maps expressed as $S = \{s_1, s_2, \ldots, s_C\}$. According to the definition of GAP in [42], the $m$-th channel of vector $U$ can be obtained by calculating the mean of all pixels in $s_m$:

$$u_m = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} s_m(i, j)$$

Then, two 1 × 1 convolutional layers and a ReLU function are applied to obtain the channel weight vector $W_c \in \mathbb{R}^{C \times 1 \times 1}$, which is subsequently limited to the range [0, 1] using a sigmoid function. The feature map with enhanced spectral signatures is obtained by multiplying the original input by the channel weight vector.
The spatial branch enhances texture and geometric features by capturing interpixel relationships. To establish these interpixel dependencies, the input feature map $S \in \mathbb{R}^{C \times H \times W}$ is compressed along the channel dimension using a 1 × 1 convolutional layer, resulting in a single-channel tensor with spatial information embedded. A 3 × 3 deformable convolution is then applied to obtain the spatial weight matrix $W_s \in \mathbb{R}^{1 \times H \times W}$, which helps to differentiate targets from the background. Notably, since deformable convolution can adjust the sampling positions of the convolution kernel based on the input features, it can capture relationships between pixels at varying distances. A sigmoid function is used to constrain the values in the spatial weight matrix to the range [0, 1], and the spatially enhanced feature map is obtained by multiplying the original input feature map by the spatial weight matrix.
To integrate crucial information from both spatial and channel dimensions, the outputs of the two branches are concatenated along the channel dimensions and then passed through a 1 × 1 convolutional layer to reduce dimensionality. Subsequently, a deformable convolution layer is employed to aggregate the pixels belonging to CCSS roof buildings and further distinguish the features from the background. This results in the feature map E, embedded with the relationship between the target and the background, in which the receptive field can adapt according to the input feature.
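The following sketch assembles the two branches of DFEM as described above. It is illustrative rather than the authors' code: the channel-reduction ratio is an assumption, and it reuses the DeformConvBlock sketch from Section 4.2.

```python
import torch
import torch.nn as nn

class DFEMSketch(nn.Module):
    """Sketch of the deformation-aware feature enhancement module.
    Channel branch: GAP -> two 1x1 convs (ReLU between) -> sigmoid channel weights.
    Spatial branch: 1x1 conv to one channel -> 3x3 deformable conv -> sigmoid map.
    Fusion: concatenate both enhanced maps -> 1x1 conv -> deformable conv."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = max(channels // reduction, 1)              # reduction ratio is an assumption
        self.channel_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global average pooling
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.Sigmoid())
        self.spatial_squeeze = nn.Conv2d(channels, 1, 1) # compress channel dimension
        self.spatial_deform = DeformConvBlock(1, 1)      # 3x3 deformable conv (Section 4.2 sketch)
        self.reduce = nn.Conv2d(2 * channels, channels, 1)
        self.level_deform = DeformConvBlock(channels, channels)

    def forward(self, s):
        c = s * self.channel_branch(s)                   # spectral (channel) enhancement
        w = torch.sigmoid(self.spatial_deform(self.spatial_squeeze(s)))
        p = s * w                                        # spatial enhancement
        e = self.reduce(torch.cat([c, p], dim=1))
        return self.level_deform(e)                      # deformation-aware output E
```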

4.4. Feature Alignment and Gated Fusion Module

Multi-level feature fusion is the primary approach in tackling large differences in CCSS roof building scales. Many feature fusion methods overlook the semantic gap caused by down-sampling operations. They either directly add the low-level feature maps to the up-sampled high-level feature maps or concatenate them along the channel dimension. This can lead to the misclassification of boundaries and small objects. Specifically, the loss of spatial details and locations during down-sampling can result in misalignment between low-level and high-level feature maps. Moreover, important information regarding different-scale CCSS roof buildings is often embedded in the differences among multi-level feature maps. Indiscriminate fusion of adjacent feature maps may inundate them with excessively redundant information. To handle these problems, we propose FAGM. Similar to [43,44], we consider that the relative offset between feature maps resembles the optical flow between adjacent video frames, and we align the feature maps accordingly. Furthermore, a gating mechanism is employed to select important information and guide the fusion process.
Figure 6 shows the structure of FAGM. In this structure, the high-level feature map $H$ is first up-sampled to match the size of the low-level feature map $L$ and then concatenated with $L$ along the channel dimension. Subsequently, the concatenated feature map passes concurrently through two branches, generating two offset maps, $\Delta H$ and $\Delta L$. Each offset map has two channels corresponding to offsets in the x and y directions. $H$ and $L$ are aligned by a warping function based on $\Delta H$ and $\Delta L$, which resamples the feature maps using bilinear interpolation. The aligned features are then concatenated along the channel dimension and evaluated by the gate map $G$, whose values are restricted to the range [0, 1] via a sigmoid function. The high-level and low-level feature maps are fused by this gating mechanism, which controls the information flow in the fusion process, thereby improving fusion efficiency. According to the structure illustrated in Figure 6, the FAGM can be expressed as

$$A = G \times \mathrm{warp}(\mathrm{upsample}(H), \Delta H) + (1 - G) \times \mathrm{warp}(L, \Delta L)$$

where $\mathrm{upsample}(\cdot)$ denotes bilinear interpolation and $\mathrm{warp}(\cdot)$ denotes the warping function.
Three FAGMs integrate features of adjacent feature maps from top to bottom, passing the high-level semantics down. Figure 7 visualizes the feature maps and offset maps in the third FAGM, where the aligned feature maps preserve a clearer structure, leading to more consistent representations of CCSS roof buildings. Moreover, in the output feature map, the gating mechanism effectively suppresses background noise while preserving the boundaries.
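The alignment-and-gating logic of the FAGM formula above can be sketched with a flow-style warp built on grid_sample. This is a hedged sketch: the offset-predictor layers, kernel sizes, and the assumption that $H$ and $L$ share the same channel count are ours, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, offset):
    """Resample feat (B, C, H, W) with a per-pixel 2-channel offset (B, 2, H, W)
    using bilinear interpolation, in the spirit of optical-flow warping."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + offset   # (B, 2, H, W)
    # normalize to [-1, 1]; grid_sample expects a (B, H, W, 2) grid in (x, y) order
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1),
                         mode="bilinear", align_corners=True)

class FAGMSketch(nn.Module):
    """Sketch of feature alignment and gated fusion: predict offset maps for the
    up-sampled high-level map and the low-level map, warp both, then fuse them with
    a gate map G as A = G * warp(up(H)) + (1 - G) * warp(L)."""
    def __init__(self, channels):
        super().__init__()
        self.offset_h = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.offset_l = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, high, low):
        high_up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                                align_corners=False)
        cat = torch.cat([high_up, low], dim=1)
        h_aligned = warp(high_up, self.offset_h(cat))     # align high-level features
        l_aligned = warp(low, self.offset_l(cat))         # align low-level features
        g = torch.sigmoid(self.gate(torch.cat([h_aligned, l_aligned], dim=1)))
        return g * h_aligned + (1.0 - g) * l_aligned      # gated fusion
```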

4.5. Segmentation Head

To generate the final prediction, as shown in Figure 8, the multi-level feature maps are first up-sampled to a uniform size and then concatenated along the channel dimension, denoted as $U$. To improve the precision of boundaries, the feature map $F_0$, generated by stage0 as depicted in Figure 3, serves to guide the generation of the output mask via an alignment module. Specifically, similar to the FAGM, the up-sampled $U$ and $F_0$ are first concatenated along the channel dimension. Subsequently, an offset map $\Delta U$ is produced, which the warp function utilizes to align the boundaries of $U$ with $F_0$. Ultimately, a prediction mask with the same size as the input image is generated through a series of layers.
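For completeness, the segmentation head can be sketched in the same style. Layer widths are assumptions, and `warp` refers to the warping sketch in Section 4.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHeadSketch(nn.Module):
    """Sketch of the segmentation head: up-sample the multi-level maps to a common
    size, concatenate them into U, align U with the stage0 feature map F0 via an
    offset/warp step, then predict a single-channel mask."""
    def __init__(self, in_channels, f0_channels):
        super().__init__()
        self.offset = nn.Conv2d(in_channels + f0_channels, 2, kernel_size=3, padding=1)
        self.classifier = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, 1))

    def forward(self, feats, f0, out_size):
        size = f0.shape[-2:]
        u = torch.cat([F.interpolate(f, size=size, mode="bilinear",
                                     align_corners=False) for f in feats], dim=1)
        delta_u = self.offset(torch.cat([u, f0], dim=1))   # offset map ΔU
        u = warp(u, delta_u)                               # align boundaries of U with F0
        logits = self.classifier(u)
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)
```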

4.6. Loss Function

Considering the imbalance between foreground and background in remote sensing images, we selected focal loss [45] as the loss function. Focal loss is a variant of the cross-entropy loss that focuses training on hard examples:

$$L_{focal}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \quad \alpha = 0.25, \ \gamma = 2$$

where $p_t$ denotes the predicted probability of the ground-truth class. Here we use the same values for $\alpha$ and $\gamma$ as in [45].
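A compact sketch of the binary focal loss is given below. The $\alpha_t$ weighting ($\alpha$ for positive pixels, $1 - \alpha$ for negative pixels) follows the common $\alpha$-balanced formulation of [45]; the rest matches the formula above.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: (1 - p_t)^gamma down-weights easy pixels so training
    focuses on hard foreground/background examples. `targets` is a float mask in {0, 1}."""
    prob = torch.sigmoid(logits)
    p_t = torch.where(targets > 0.5, prob, 1.0 - prob)          # probability of the true class
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(prob, alpha),
                          torch.full_like(prob, 1.0 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```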

5. Experiments and Analysis

In this section, the proposed method is compared with classical and SOTA deep learning methods on the self-annotated CCSS roof building dataset to validate its learning ability and generalization performance. The effectiveness of the proposed modules is evaluated through ablation experiments. Additionally, the contribution of deformable convolution in DFEM is also experimentally investigated. Finally, different multi-level feature fusion methods are compared to further verify the effectiveness of FAGM.

5.1. Evaluation Metrics

To quantitatively evaluate the performance of DFEANet in CCSS roof building extraction, four evaluation metrics were adopted, including precision, recall, F1-score [46,47], and Intersection over Union (IoU) [48]. These metrics are commonly used for per-pixel classification tasks [20,30,49,50,51]. Precision is the proportion of true positive samples among all positive predictions. The recall represents the probability of correctly identifying a positive sample as positive. Generally, when an algorithm has a higher recall, its precision is likely to be lower. The F1-score is the harmonic mean of precision and recall. IoU is defined as the area of the intersection divided by the area of the union between a predicted mask and a ground-truth mask. The above evaluation metrics can be computed from the following formulas:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP represents the number of correctly classified CCSS roof building pixels, FP represents the number of background pixels that are incorrectly classified as CCSS roof building pixels, and FN represents the number of CCSS roof building pixels incorrectly classified as background pixels.
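These metrics can be computed directly from the confusion counts, as in the short sketch below (illustrative only; a small epsilon guards against division by zero for empty masks).

```python
import numpy as np

def pixel_metrics(pred, gt):
    """Compute precision, recall, F1-score, and IoU from two boolean masks of the
    same shape: `pred` (model output) and `gt` (ground truth)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-8                                    # avoid division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```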

5.2. Experimental Settings

The proposed DFEANet was implemented using the PyTorch [52] framework. The SGD optimizer was adopted with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 5 × 10−4. The poly policy was employed, in which the initial learning rate is multiplied by $(1 - \frac{epoch}{total\_epochs})^{0.9}$ at each epoch. All models in the experiments were trained on a single NVIDIA RTX 3090 GPU with the batch size set to 6, and all compared methods were trained until convergence. Data augmentations were applied during training, including random brightness, random scaling, and horizontal flipping. Concretely, images and labels were randomly resized with scale factors varying from 0.5 to 1.25 and then cropped to 512 × 512 pixels. Each tile had a probability of 0.5 of being horizontally flipped.
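The optimizer and poly schedule described above translate into PyTorch roughly as follows. This is a sketch: the placeholder model and the total epoch count are assumptions, since the text does not state the number of training epochs.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)    # placeholder standing in for DFEANet
total_epochs = 100                       # assumed; not stated in the text

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# "poly" policy: multiply the initial LR by (1 - epoch/total_epochs)^0.9 each epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: (1 - e / total_epochs) ** 0.9)

for epoch in range(total_epochs):
    # ... one training pass over 512 x 512 tiles, batch size 6, with random
    #     brightness, random scaling (0.5-1.25), and horizontal flips ...
    optimizer.step()                     # stands in for the per-batch updates
    scheduler.step()                     # poly decay applied per epoch
```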

5.3. Results and Analysis

5.3.1. CCSS Roof Buildings of the External Environment of the Beijing–Zhangjiakou High-Speed Railway

The extraction results of the proposed method in the study area are illustrated in Figure 9. This process involved the addition of geometric information to the binary prediction masks, which were subsequently converted into the Shapefile format for integration with raster data. In practical applications, the utilization of these Shapefiles enables railway personnel to simultaneously determine the locations, quantities, and areas of CCSS roof buildings. According to their distance from the railway line, these buildings can be further divided into several risk levels, which is of great significance for CCSS roof building inspection in high-speed rail’s external environment. Our method has effectively enhanced the statistical efficiency related to CCSS roof buildings, thereby offering significant assistance to the safety assurance endeavors concerning the external environment of high-speed railways.
The map in Figure 9 was exported by ArcGIS. To illustrate it more clearly, four regions were selected as representative samples, including Huailai County and Qiaodong District in Zhangjiakou City and Changping District and Yanqing District in Beijing. The sample regions involve industrial zones, urban villages, and residential areas, which are typical areas where CCSS roof buildings are found.
As shown in Figure 9, the number and density of CCSS roof buildings in Zhangjiakou City and Beijing exhibit significant differences. As a first-tier city, Beijing has well-established infrastructure, and the number of CCSS roof buildings is relatively low, with most of them located in urban villages or industrial zones. In contrast, Zhangjiakou is a developing city in the midst of rapid urban construction, resulting in a large number of CCSS roof buildings that cover a wide range of sizes and are densely distributed. Furthermore, the area of CCSS roof buildings varies depending on their location. In the city’s suburbs, CCSS roof buildings are primarily large-scale and densely built factory buildings, while in densely populated areas, they are mostly temporary dormitories on construction sites or sheds built in residential areas.

5.3.2. Comparisons with SOTA Methods

We conducted quantitative and qualitative comparisons of the proposed method with classical and SOTA methods on the test set. Six methods were selected, including U-Net [33], SegNet [53], PSPNet [54], DeepLab v3+ [55], HRNet v2 [56], and FLANet [57].
Among the compared methods, both U-Net and SegNet are basic encoder–decoder structural models. PSPNet utilizes pyramid pooling to amplify the receptive field. DeepLab v3+ obtains multi-scale features via an atrous spatial pyramid pooling module and integrates low-level features during the output mask generation to improve boundary accuracy. HRNet v2 adheres to a multi-branch parallel architecture with the primary objective of maintaining high-resolution representations throughout the feature extraction phase. FLANet encodes both spatial and channel attentions via a fully attentional block to simultaneously enhance the segmentation accuracy for both large and small targets, which has been demonstrated to attain state-of-the-art (SOTA) performance in semantic segmentation.
Quantitative comparison: The quantitative comparison results on the CCSS roof building dataset are shown in Table 2. Our DFEANet outperforms the other six methods, achieving the highest IoU and F1-score, demonstrating the effectiveness of DFEANet in CCSS roof building extraction. Compared with DeepLab v3+, the IoU and F1-score were increased by 1.84% and 1.09%, respectively. We think this is attributable to the fact that the FAGMs effectively suppress background noise, while concurrently, the DFEMs facilitate a more accurate extraction of the spatial morphology of CCSS roof buildings.
Qualitative comparison: To intuitively analyze the improvement of our DFEANet in CCSS roof building extraction, we qualitatively compared DFEANet with the comparison methods on the self-annotated CCSS roof building dataset. Some CCSS roof building extraction samples of all compared methods on the test set are presented in Figure 10 and Figure 11.
The predictions of DFEANet can be observed to be more complete, with finer boundaries. The spectral signature is regarded as the most prominent feature of CCSS roofs in high-resolution optical remote sensing images, and deep-learning networks tend to prioritize learning this feature. However, the proportion of blue CCSS roofs is usually higher than that of other types, leading to incomplete extraction of non-blue CCSS roof buildings and misclassification of blue objects. In DFEANet, the DFEMs enhance spectral signatures, enabling the model to capture latent characteristics and facilitating a heightened ability to identify CCSS roof buildings. Samples 1 and 3 in Figure 10 demonstrate this. DFEANet successfully resisted color interference, thereby avoiding the misidentification of the blue basketball court as a CCSS roof building. Moreover, it accurately extracted the red and white CCSS roof buildings in the lower left corner, which were overlooked by most comparison methods. Another challenge is to differentiate objects close to each other in high-resolution remote sensing images due to their complex backgrounds. As evidenced in Sample 2, regarding the two rows of CCSS roof buildings in close proximity, their masks remain unseparated in the extraction results obtained by all other compared methods. In contrast, DFEANet successfully separated them from the background and extracted each entity, facilitated by the employment of DFEMs.
To explicitly illustrate the accuracy of the proposed method in edge extraction, we have utilized different color markers to denote true positives, false positives, false negatives, and true negatives in Figure 11. It can be observed that, in comparison to other methods, the extraction results of DFEANet more closely approximate the reference values at the edges, which indicates that the proposed method can well retain the edge details while extracting multi-scale features.
The results of the qualitative comparison demonstrate that DFEANet can dig out the implicit relationship between the background and the target and effectively eliminate redundant information.
Although some studies focus on CCSS roof building extraction based on traditional methods, in this study, we only compared our method with deep-learning models. To our knowledge, the traditional methods are mainly based on manually designed spectral features, and their performance can be affected by factors such as sensors, landscape types, and seasons [18]. However, the main objective of this paper is to propose an intelligent method that utilizes large quantities of high-spatial-resolution remote sensing data to assist in CCSS roof building identification without relying on handcrafted feature engineering. From another perspective, the traditional methods do not require a large number of training samples, making them effective in areas where training samples are scarce.

5.4. Ablation Study

To analyze the effectiveness of the proposed DFEM and FAGM modules, we conducted ablation experiments on the self-annotated CCSS roof building dataset. The network settings and quantitative results are listed in Table 3, with D representing DFEM and F representing FAGM. In these experiments, ResNet50 and FPN were chosen as baselines, resulting in an IoU of 80.62%. When we applied DFEM to the baseline, the IoU increased by 1.74%, demonstrating that the implicit spectral signatures and the inter-pixel dependences in multi-level feature maps were enhanced by the DFEMs. Replacing the original adding operation with FAGM led to a 1.69% improvement in IoU compared to the baseline, indicating that FAGM facilitated effective information screening and preserved important details while suppressing redundant background noise. By simultaneously adding both modules, the IoU increased to 86.46%, further highlighting the effectiveness of the proposed modules.

5.4.1. Effect of Deformable Convolutions in DFEM

Deformable convolution can aggregate the pixels associated with CCSS roof buildings in the feature maps by adaptively adjusting sample positions according to the input feature. In the proposed DFEM, two deformable convolutions are employed: one enhances multi-level features along the spatial dimension, and the other strengthens the implicit foreground–background relationships in the enhanced feature map. To further analyze the contribution of deformable convolutions, we compared different settings of deformable convolutions in DFEM. First, we replaced all deformable convolutional layers with conventional 3 × 3 convolutional layers, denoted as ‘Without Deform’. Then, only the deformable convolutional layer in the spatial branch was retained, denoted as ‘Spatial Deform’. Next, only the deformable convolutional layer after feature enhancement was retained, denoted as ‘Level Deform’. Finally, the full structure of DFEM was used, denoted as ‘Both’.
The evaluation results are listed in Table 4. As shown in Table 4, applying a deformable convolutional layer after feature enhancement increased the IoU and F1-score by 0.86% and 0.62%, respectively. This shows that the deformable convolutional layer separated targets and backgrounds by aggregating pixels associated with CCSS roof buildings at different distances. ‘Spatial Deform’ achieved 90.22% in the F1-score and 83.49% in the IoU, which is higher than ‘Level Deform’. We think that this may be because the spatial weight matrix embedded the distinction and connection between the foreground and the background, thus assigning a higher weight to the CCSS roof buildings in multi-level feature maps. The original DFEM achieved the highest IoU and F1-score, demonstrating that the proposed DFEM improves the integrity of predicted masks by effectively separating the CCSS roof buildings from the complex background.
Figure 12 shows the samples of extraction results for ‘Without Deform’ and ‘Both’. It can be seen that without deformable convolutional layers, the prediction masks appear irregular, with holes and messy spots. In rows 2 and 3 of Figure 12b, the shadows were incorrectly classified as parts of CCSS roof buildings. In rows 1 and 4 of Figure 12b, varying degrees of defects appeared in the prediction masks. However, after applying the deformable convolutional layers in DFEM, the ability of the network to distinguish between targets and backgrounds was enhanced, and the shadows were classified correctly. Meanwhile, the spatial dependence between target pixels was strengthened, and the prediction masks became intact with regular edges. This demonstrates the effectiveness of the DFEM.

5.4.2. Visualization of FAGM

To intuitively illustrate the impact of FAGM, we visualized the maps generated by it, specifically focusing on the feature maps and offset maps in the FAGM with the highest resolution output.
Visualization of offset maps: FAGM can generate offset maps for the inputted high-level and low-level feature maps before alignment, containing two-dimensional offsets of the feature maps. In Figure 13, these offset maps were visualized in two forms and bilinearly interpolated to match the image size. As shown in Figure 13, the offsets of high-level feature maps are more pronounced than those of low-level ones. This means that the high-level feature maps tend to align with the structural details in the low-level feature maps. In Figure 13c,e, we observe that the high-level features in the FAGM tend to diffuse from the center to the edge, addressing the lack of structure.
Visualization of feature maps: Figure 14 shows the feature maps both before and after alignment. As shown in Figure 14, the aligned feature maps exhibit more coherent structures compared to those before alignment. This coherence facilitates precise boundary extraction, ultimately improving the extraction accuracy for smaller CCSS roof buildings. The aligned feature maps were subsequently fused by the gating mechanism. As shown in Figure 14f, this mechanism effectively maintains the important information from the two feature maps while avoiding redundant noise.

5.4.3. Comparison of Different Multi-Level Feature Fusion Methods

We also compared different feature fusion methods and quantitatively analyzed the effect of FAGM; the quantitative results are listed in Table 5. Four feature fusion schemes were chosen, including channel concatenate, pixel addition, feature alignment and fusion (FAF), and the proposed FAGM. FAF represents replacing the gating mechanism in FAGM by directly adding the aligned feature maps. According to the results, FAF improved both the IoU and the F1-score compared with channel concatenation and pixel addition, which indicates that the details in low-level feature maps, especially boundaries and small buildings, were maintained by feature alignment. The FAGM achieved the highest IoU and F1-scores. We think this is probably because FAGM sifted out the important information from adjacent feature maps before fusion, lightening the influence caused by noise and improving the fusion efficiency.

5.5. Complexity of DFEANet

DFEANet employs deformable convolutions to establish irregular local pixel relationships, which requires learning more parameters than traditional convolution. Furthermore, offset maps are generated to align adjacent features prior to fusion, which introduces additional parameters compared with direct addition. To comprehensively illustrate the superior performance of the proposed method, we further compared the number of trainable parameters and floating-point operations (FLOPs) with those of the other methods. The experimental results are listed in Table 6.
We also provided a more intuitive comparison of the complexity and performance among all the compared methods in Figure 15. The number of trainable parameters and IoU serve as respective indicators of the complexity and performance of each method, and the radius of the blue circle represents the size of the model file. Despite the introduction of additional trainable parameters by the proposed DFEM and FAGM, compared with other related methods, the proposed method maintains a higher degree of accuracy and a reduced level of complexity.
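For reference, the trainable parameter count used in such complexity comparisons can be obtained with a one-line reduction over the model's parameters, as sketched below; FLOPs are typically measured with external profilers and are omitted here. The example model is a placeholder, not one of the compared networks.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Number of trainable parameters, as used for the complexity comparison."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example: report the count in millions for any model under comparison.
backbone = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU())
print(f"{count_trainable_params(backbone) / 1e6:.2f} M trainable parameters")
```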

6. Conclusions

The utilization of deep learning for the extraction of CCSS roof buildings from remote sensing images can effectively aid in the intelligent inspection of such buildings within the high-speed rail external environment. Previous research utilized models designed based on natural imagery, thereby overlooking the unique characteristics of CCSS roof buildings within remote sensing images. In this study, DFEANet was proposed to improve the accuracy of CCSS roof building extraction. To improve the integrity of irregular roof extraction, DFEM was proposed to perform feature enhancement and adaptive receptive field adjustment. Moreover, FAGM was proposed to preserve boundary details while suppressing background noise.
Quantitative and qualitative analysis based on the remote sensing images of the Beijing–Zhangjiakou high-speed railway demonstrates the superior performance of DFEANet in the extraction of CCSS roof buildings. It can accurately identify CCSS roof structures while ensuring edge accuracy and effectively striking a balance between model complexity and accuracy. The ablation experiments and visualizations further verified the effectiveness of the proposed two modules. In practical applications, by deploying the proposed method on a big data platform and integrating the extracted results with geographic information, it is possible to carry out statistical analysis of potential risk targets. This assists railway personnel in promptly identifying and rectifying safety hazards, thus ensuring the safety of the external environment of the high-speed rail.

Author Contributions

Conceptualization, Y.L. and W.J.; methodology, Y.L.; validation, Y.L., S.Q. and D.Z.; formal analysis, Y.L. and D.Z.; writing—original draft preparation, Y.L.; writing—review and editing, W.J., S.Q. and J.L.; visualization, Y.L.; supervision, S.Q.; project administration, W.J.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, Grant No. 2020YFF0304104.

Data Availability Statement

The data are not publicly available because the authors do not have permission to share data.

Acknowledgments

The authors are grateful to the China Center for Resources Satellite Data and Application for the data support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Wang, Y.; Zhao, C. How do high-speed rails influence city carbon emissions? Energy 2023, 265, 126108. [Google Scholar] [CrossRef]
  2. Tang, R.; De Donato, L.; Bešinović, N.; Flammini, F.; Goverde, R.M.P.; Lin, Z.; Liu, R.; Tang, T.; Vittorini, V.; Wang, Z. A literature review of Artificial Intelligence applications in railway systems. Transp. Res. Part C Emerg. Technol. 2022, 140, 103679. [Google Scholar] [CrossRef]
  3. Zheng, Y.; Gao, C.; Huang, Y.; Sheng, W.; Wang, Z. Evolutionary ensemble generative adversarial learning for identifying terrorists among high-speed rail passengers. Expert Syst. Appl. 2022, 210, 118430. [Google Scholar] [CrossRef]
  4. Pan, X.; Liu, S. Modeling travel choice behavior with the concept of image: A case study of college students’ choice of homecoming train trips during the Spring Festival travel rush in China. Transp. Res. Part A Policy Pract. 2022, 155, 247–258. [Google Scholar] [CrossRef]
  5. Li, T.; Rong, L. A comprehensive method for the robustness assessment of high-speed rail network with operation data: A case in China. Transp. Res. Part A Policy Pract. 2020, 132, 666–681. [Google Scholar] [CrossRef]
  6. Lu, C.; Cai, C. Overview on safety management and maintenance of high-speed railway in China. Transp. Geotech. 2020, 25, 100397. [Google Scholar] [CrossRef]
  7. Cao, Y.; An, Y.; Su, S.; Xie, G.; Sun, Y. A statistical study of railway safety in China and Japan 1990–2020. Accid. Anal. Prev. 2022, 175, 106764. [Google Scholar] [CrossRef]
  8. Ren, T.; Liu, Y.; Gao, Z.; Qiao, Z.; Li, Y.; Li, F.; Yu, J.; Zhang, Q. Height Deviation Detection of Rail Bearing Platform on High-Speed Railway Track Slab Based on Digital Image Correlation. Opt. Lasers Eng. 2023, 160, 107238. [Google Scholar] [CrossRef]
  9. Huang, Y.; Zhang, Z.; Tao, Y.; Hu, H. Quantitative risk assessment of railway intrusions with text mining and fuzzy rule-based bow-tie model. Adv. Eng. Inform. 2022, 54, 101726.
  10. Hoerbinger, S.; Obriejetan, M.; Rauch, H.P.; Immitzer, M. Assessment of safety-relevant woody vegetation structures along railway corridors. Ecol. Eng. 2020, 158, 106048.
  11. Wang, H.; Tian, Y.; Yin, H. Correlation Analysis of External Environment Risk Factors for High-Speed Railway Derailment Based on Unstructured Data. J. Adv. Transp. 2021, 2021, 6980617.
  12. Meng, H.; Wang, S.; Gao, C.; Liu, F. Research on Recognition Method of Railway Perimeter Intrusions Based on Φ-OTDR Optical Fiber Sensing Technology. IEEE Sens. J. 2021, 21, 9852–9859.
  13. Pan, X.; Yang, L.; Sun, X.; Yao, J.; Guo, J. Research on the Extraction of Hazard Sources along High-Speed Railways from High-Resolution Remote Sensing Images Based on TE-ResUNet. Sensors 2022, 22, 3784.
  14. Abdel-Gaber, A.M.; Nabey, B.A.A.-E.; Khamis, E.F.; Abdelattef, O.A.; Aglan, H.; Ludwick, A.G. Influence of natural inhibitor, pigment and extender on corrosion of polymer coated steel. Prog. Org. Coat. 2010, 69, 402–409.
  15. Dong, Z.; Wang, X.; Tang, L. Color-Coating Scheduling with a Multiobjective Evolutionary Algorithm Based on Decomposition and Dynamic Local Search. IEEE Trans. Autom. Sci. Eng. 2021, 18, 1590–1601.
  16. Li, Y. Fire Safety Distance Analysis of Color Steel Sandwich Panel Houses in Different Meteorological Conditions. Master's Thesis, Chongqing University, Chongqing, China, 2016.
  17. Guo, Z.; Yang, D.; Chen, J.; Cui, X. A new index for mapping the 'blue steel tile' roof dominated industrial zone from Landsat imagery. Remote Sens. Lett. 2018, 9, 578–586.
  18. Samat, A.; Gamba, P.; Wang, W.; Luo, J.; Li, E.; Liu, S.; Du, P.; Abuduwaili, J. Mapping Blue and Red Color-Coated Steel Sheet Roof Buildings over China Using Sentinel-2A/B MSIL2A Images. Remote Sens. 2022, 14, 230.
  19. Zhang, W.; Liu, G.; Ding, L.; Du, M.; Yang, S. Analysis and Research on Temporal and Spatial Variation of Color Steel Tile Roof of Munyaka Region in Kenya, Africa. Sustainability 2022, 14, 14886.
  20. Xue, J.; Xu, H.; Yang, H.; Wang, B.; Wu, P.; Choi, J.; Cai, L.; Wu, Y. Multi-Feature Enhanced Building Change Detection Based on Semantic Information Guidance. Remote Sens. 2021, 13, 4171.
  21. Bai, B.; Fu, W.; Lu, T.; Li, S. Edge-Guided Recurrent Convolutional Neural Network for Multitemporal Remote Sensing Image Building Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5610613.
  22. Zhang, L.-B.; Zhang, J.; Ma, J.; Jia, X. SC-PNN: Saliency Cascade Convolutional Neural Network for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9697–9715.
  23. Yang, K.; Sun, H.; Zou, C.; Lu, X. Cross-Attention Spectral–Spatial Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5518714.
  24. Luo, F.; Zhou, T.; Liu, J.; Guo, T.; Gong, X.; Ren, J. Multiscale Diff-Changed Feature Fusion Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502713.
  25. Duan, Y.; Luo, F.; Fu, M.; Niu, Y.; Gong, X. Classification via Structure-Preserved Hypergraph Convolution Network for Hyperspectral Image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5507113.
  26. Guo, T.; Wang, R.; Luo, F.; Gong, X.; Zhang, L.; Gao, X. Dual-View Spectral and Global Spatial Feature Fusion Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5512913.
  27. Hou, D.; Wang, S.; Xing, H. A novel benchmark dataset of color steel sheds for remote sensing image retrieval. Earth Sci. Inform. 2021, 14, 809–818.
  28. Yu, J.; Shun, L. Detection Method of Illegal Building Based on YOLOv5. Comput. Eng. Appl. 2021, 57, 236–244.
  29. Sun, M.; Deng, Y.; Li, M.; Jiang, H.; Huang, H.; Liao, W.; Liu, Y.; Yang, J.; Li, Y. Extraction and Analysis of Blue Steel Roofs Information Based on CNN Using Gaofen-2 Imageries. Sensors 2020, 20, 4655.
  30. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A Global Context-aware and Batch-independent Network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365.
  31. Zheng, X.; Huan, L.; Xia, G.; Gong, J. Parsing very high-resolution urban scene images by learning deep ConvNets with edge-aware loss. ISPRS J. Photogramm. Remote Sens. 2020, 170, 15–28.
  32. Wang, Y.; Zeng, X.; Liao, X.; Zhuang, D. B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery. Remote Sens. 2022, 14, 269.
  33. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
  34. Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  35. Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2393–2402.
  36. Li, X.; Zhao, H.; Han, L.; Tong, Y.; Tan, S.; Yang, K. Gated Fully Fusion for Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
  37. Xu, L.; Li, Y.; Xu, J.; Guo, L. Gated Spatial Memory and Centroid-Aware Network for Building Instance Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4402214.
  38. Wang, T.Y. The Intelligent Beijing–Zhangjiakou High-Speed Railway. Engineering 2021, 7, 1665–1672.
  39. Available online: http://en.spacewillinfo.com/english/Satellite/SuperView_1/#main (accessed on 10 May 2023).
  40. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
  41. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9300–9308.
  42. Lin, M.; Chen, Q.; Yan, S. Network in Network. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014; pp. 1–10.
  43. Li, X.; You, A.; Zhu, Z.; Zhao, H.; Yang, M.; Yang, K.; Tong, Y. Semantic Flow for Fast and Accurate Scene Parsing. arXiv 2020, arXiv:2002.10120.
  44. Huang, Z.; Wei, Y.; Wang, X.; Shi, H.; Liu, W.; Huang, T.S. AlignSeg: Feature-Aligned Segmentation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 550–557.
  45. Lin, T.-Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  46. Van Rijsbergen, C.J. Information Retrieval, 2nd ed.; Butterworths: Waltham, MA, USA, 1979.
  47. Manning, C.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
  48. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
  49. Cao, Y.; Huang, X. A full-level fused cross-task transfer learning method for building change detection using noise-robust pretrained networks on crowdsourced labels. Remote Sens. Environ. 2023, 284, 113371.
  50. Hu, C.; Zhang, S.; Barnes, B.B.; Xie, Y.; Wang, M.; Cannizzaro, J.P.; English, D.C. Mapping and quantifying pelagic Sargassum in the Atlantic Ocean using multi-band medium-resolution satellite data and deep learning. Remote Sens. Environ. 2023, 289, 113515.
  51. Hertel, V.; Chow, C.; Wani, O.; Wieland, M.; Martinis, S. Probabilistic SAR-based water segmentation with adapted Bayesian convolutional neural network. Remote Sens. Environ. 2023, 285, 113388.
  52. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.
  53. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  54. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
  55. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018.
  56. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
  57. Song, Q.; Li, J.; Li, C.; Guo, H.; Huang, R. Fully Attentional Network for Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 2280–2288.
Figure 1. The study area of color-coated steel sheet (CCSS) roof building extraction.
Figure 2. Samples of CCSS roof building dataset. (a,c) are images and (b,d) are corresponding ground truth labels.
Figure 3. The structure of deformation-aware feature enhancement and alignment network (DFEANet).
Figure 4. The workflow of deformable convolution.
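For readers who want to experiment with the operation sketched in Figure 4, the snippet below is a minimal illustration (not the authors' implementation): an ordinary convolution predicts a pair of (dx, dy) offsets for every kernel sampling position, and torchvision's DeformConv2d then convolves the input over the deformed sampling grid. All module names and sizes here are chosen for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Minimal deformable convolution: a plain conv predicts one (dx, dy)
    offset per kernel position, and DeformConv2d samples at those locations."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 2 * k * k offset channels: (dx, dy) for each of the k*k kernel positions
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)        # (N, 2*k*k, H, W)
        return self.deform_conv(x, offsets)  # convolution over the deformed grid

x = torch.randn(1, 64, 128, 128)
print(DeformBlock(64, 64)(x).shape)  # torch.Size([1, 64, 128, 128])
```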
Figure 5. The structure of the deformation-aware feature enhancement module (DFEM).
Figure 6. The structure of the feature alignment and gated fusion module (FAGM).
Figure 7. Visualization of feature maps and offset maps in FAGM. The offset maps are visualized by color coding and arrowhead illustrations, where the orientation and magnitude of offset vectors are represented by hue and saturation, respectively, similar to [43].
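The color coding used in Figure 7 (and again in Figure 13) follows the common flow-visualization convention: hue encodes offset orientation and saturation encodes magnitude. The sketch below is a generic reproduction of that convention, not the authors' plotting code; the function name and normalization are assumptions.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def color_code_offsets(dx, dy):
    """Map a 2-D offset field to RGB: hue = orientation, saturation = magnitude."""
    angle = np.arctan2(dy, dx)                  # orientation in [-pi, pi]
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    hue = (angle + np.pi) / (2 * np.pi)         # normalize to [0, 1]
    sat = magnitude / (magnitude.max() + 1e-8)  # normalize to [0, 1]
    val = np.ones_like(hue)                     # constant brightness
    return hsv_to_rgb(np.stack([hue, sat, val], axis=-1))

dx, dy = np.random.randn(2, 64, 64)
rgb = color_code_offsets(dx, dy)  # (64, 64, 3), ready for plt.imshow
```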
Figure 8. Generation of final prediction.
Figure 9. Distribution of CCSS roof buildings in the external environment along the Beijing–Zhangjiakou high-speed railway. The red rectangles correspond to the four sample areas below.
Figure 10. Samples of CCSS roof building extraction results. (a) Original images; (b) Ground truth labels; (c) PSPNet; (d) FLANet; (e) SegNet; (f) HRNet v2; (g) U-Net; (h) DeepLab v3+; (i) Ours.
Figure 11. Samples of CCSS roof building extraction results; true positives, false positives, true negatives, and false negatives are marked in different colors. (a) Original images; (b) Ground truth labels; (c) PSPNet; (d) FLANet; (e) SegNet; (f) HRNet v2; (g) U-Net; (h) DeepLab v3+; (i) Ours.
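Composites such as Figure 11 are obtained by comparing the predicted mask with the ground truth pixel by pixel. The sketch below is a generic example of such an error map; the specific colors are placeholders and are not necessarily those used in the figure.

```python
import numpy as np

def error_map(pred, gt):
    """Color-code a binary prediction against ground truth.
    Illustrative colors: TP white, TN black, FP red, FN blue."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    out = np.zeros((*pred.shape, 3), dtype=np.uint8)
    out[pred & gt] = (255, 255, 255)  # true positive
    out[pred & ~gt] = (255, 0, 0)     # false positive
    out[~pred & gt] = (0, 0, 255)     # false negative
    # true negatives stay black (0, 0, 0)
    return out
```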
Figure 12. Samples of extraction results for ‘Without Deform’ and ‘Both’. (a) Original image, (b) results of ‘Without Deform’, (c) results of ‘Both’, (d) visualization of (b), (e) visualization of (c).
Figure 13. Visualization of offset maps in the FAGM. (a) Original images, (b,d) show the color-coded offset maps of the high-level feature maps and the low-level feature maps, respectively, which were generated by the FAGM with the highest resolution output. The offset maps follow the color coding in Figure 7. (c,e) are the arrowhead visualizations of offset maps in (b,d).
Figure 14. Visualization of feature maps in the FAGM with the highest resolution output. (a) Original image. (b,c) show the high-level (b) and low-level (c) feature maps before alignment. (d,e) show the high-level (d) and low-level (e) feature maps after alignment. (f) is the output of FAGM.
Figure 15. Comparison with other methods in terms of complexity and efficiency.
Table 1. Superview-1 satellite specifications.
Parameter              | Specification
Orbit type             | Sun-synchronous
Altitude               | 530 km
Design life            | 8 years
Spectral bands         | Blue: 450–520 nm; Green: 520–590 nm; Red: 630–690 nm; NIR: 770–890 nm; PAN: 450–890 nm
Swath width            | 12 km
Ground sample distance | PAN: 0.5 m; MS: 2 m
Dynamic range          | 11 bit
Revisit time           | 2 days (by twin satellites)
Table 2. Quantitative comparison results with classical and state-of-the-art (SOTA) methods on the CCSS roof building dataset. The first and second places are in bold and underlined, respectively.
Method       | IoU (%) | Recall (%) | Precision (%) | F1-Score (%)
PSPNet       | 70.88   | 77         | 83.47         | 80.10
FLANet       | 77.63   | 84.40      | 87.33         | 85.84
SegNet       | 81.78   | 87.91      | 90.12         | 89
HRNet v2     | 82.80   | 87.40      | 92.35         | 89.81
U-Net        | 84.24   | 87.09      | 95.19         | 90.96
DeepLab v3+  | 84.64   | 87.98      | 94.59         | 91.16
Ours         | 86.48   | 91.46      | 93.05         | 92.25
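The metrics in Tables 2, 4 and 5 follow the standard pixel-wise definitions. As a reference only (this is not the paper's evaluation code), they can be computed from TP/FP/FN counts as in the sketch below.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU, recall, precision and F1 (in %) for binary masks,
    computed from pixel-wise TP/FP/FN counts."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    iou = tp / (tp + fp + fn + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return {"IoU": 100 * iou, "Recall": 100 * recall,
            "Precision": 100 * precision, "F1": 100 * f1}
```

As a consistency check, the reported precision (93.05%) and recall (91.46%) of DFEANet give F1 = 2 × 93.05 × 91.46 / (93.05 + 91.46) ≈ 92.25%, matching the table.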
Table 3. The quantitative results of the ablation study on the CCSS roof building dataset.
Method     | Baseline (ResNet50 + FPN) | DFEM | FAGM | IoU (%)
Ablation 1 | ✓                         |      |      | 80.62
Ablation 2 | ✓                         | ✓    |      | 82.86
Ablation 3 | ✓                         |      | ✓    | 82.31
Ablation 4 | ✓                         | ✓    | ✓    | 86.48
Table 4. Evaluation results of the deformable convolutions in DFEM. The first places are in bold.
Method         | IoU (%) | Recall (%) | Precision (%) | F1-Score (%)
Without Deform | 82.29   | 89.67      | 89.05         | 89.36
Level Deform   | 83.15   | 90.61      | 89.35         | 89.98
Spatial Deform | 83.49   | 90.61      | 89.83         | 90.22
Both           | 86.48   | 91.46      | 93.05         | 92.25
Table 5. Comparison of different feature fusion schemes. The first places are in bold.
Method              | IoU (%) | Recall (%) | Precision (%) | F1-Score (%)
Pixel Addition      | 82.86   | 86.43      | 93.78         | 89.96
Channel Concatenate | 84.57   | 89.98      | 91.98         | 90.97
FAF                 | 85.61   | 90.92      | 92.44         | 91.67
FAGM                | 86.48   | 91.46      | 93.05         | 92.25
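For context, the two baseline schemes in Table 5 correspond to the generic fusion operations sketched below; the module names and channel sizes are illustrative. FAF and FAGM additionally involve the feature alignment and gating steps of the proposed network and are not reproduced here.

```python
import torch
import torch.nn as nn

class AddFuse(nn.Module):
    """'Pixel Addition' baseline: element-wise sum of two feature maps."""
    def forward(self, low, high):
        return low + high

class ConcatFuse(nn.Module):
    """'Channel Concatenate' baseline: stack along channels, then a 1x1 conv."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, low, high):
        return self.proj(torch.cat([low, high], dim=1))
```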
Table 6. Comparison with other methods in parameters and FLOPs.
Table 6. Comparison with other methods in parameters and FLOPs.
MethodBackboneIoU (%)Parameters (M)FLOPs (G)
PSPNetResNet 5070.8846.3627.47
FLANetResNet 5077.6369.99451.37
SegNet-81.7829.44160.68
HRNet v2-82.8065.8593.83
U-Net-84.2434.53262.106
DeepLab v3+ResNet 5084.6463.26157.85
OursResNet 5086.4827.6044.16
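Parameter counts such as those in Table 6 can be reproduced by summing tensor sizes in PyTorch. The snippet below is a generic illustration using a torchvision ResNet-50 classifier (not the segmentation networks of the table); the FLOPs-profiling tool mentioned in the comment is an assumption, as the paper does not state which one was used.

```python
import torch
import torchvision

# Parameter count in millions for a torchvision ResNet-50, for illustration only.
model = torchvision.models.resnet50()
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Parameters: {params_m:.2f} M")  # ~25.56 M for a bare ResNet-50

# FLOPs are commonly measured with a profiler, e.g. fvcore (assumed here):
# from fvcore.nn import FlopCountAnalysis
# flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 512, 512)).total() / 1e9
```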
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
