Global–Local Deep Fusion: Semantic Integration with Enhanced Transformer in Dual-Branch Networks for Ultra-High Resolution Image Segmentation

Abstract: The fusion of global contextual information with local cropped block details is crucial for segmenting ultra-high resolution images. In this study, a novel fusion mechanism termed global–local deep fusion (GL-Deep Fusion) is introduced, based on an enhanced transformer architecture that efficiently integrates global contextual information and local details. Specifically, we propose the global–local synthesis network (GLSNet), a dual-branch network in which one branch processes the entire original image while the other takes cropped local patches as input. Features from the two branches are fused through GL-Deep Fusion, which significantly enhances the accuracy of ultra-high resolution image segmentation. The model is particularly effective at identifying small overlapping objects. To optimize GPU memory utilization, the dual-branch architecture is designed so that the features it extracts are fed directly into the enhanced transformer framework of GL-Deep Fusion. Benchmarks on the DeepGlobe and Vaihingen datasets demonstrate the efficiency and accuracy of the proposed model. On the DeepGlobe dataset, it reduces GPU memory usage by 24.1% while improving segmentation accuracy by 0.8% over the baseline model. On the Vaihingen dataset, our model delivers a mean F1 score of 90.2% and achieves a mIoU of 90.9%, highlighting its memory efficiency and segmentation precision.


Introduction
Image segmentation, a crucial and difficult problem in artificial intelligence and computer vision, assigns a semantic class label to each pixel in an image [1]. It divides the image into distinct regions with semantic information, providing crucial scene understanding and semantic context. These semantic insights are essential in many cutting-edge sectors, including autonomous driving, remote sensing, and medical imaging, where ultra-high resolution images deliver unparalleled detail and information [2,3].
The development of deep convolutional neural networks (CNNs) has significantly improved the reliability of image segmentation models. Notable examples include DeepLab [4][5][6][7], UNet [8], BSNet [9], PSPNet [10], SegNet [11], ICNet [12], RefineNet [13], EncNet [14], etc. With advances in autonomous driving and remote sensing, the widespread use of ultra-high resolution images has posed new challenges for image segmentation. Currently, image datasets can be divided into different categories at the pixel level: 2 K image resolution is at least 2048 × 1080 (approximately 2.2 M) [15], 4 K image resolution is at least 3840 × 1080 (approximately 4.1 M) [16], and 4 K ultra-high definition is at least 3840 × 2160 (approximately 8.3 M) [17]. The enormous number of pixels is a considerable barrier to algorithm efficiency, especially given GPU memory constraints. Downsampling is recognized as an effective way to reduce the number of pixels in an image, addressing the problem of excessive GPU memory usage in ultra-high resolution image segmentation tasks. Nevertheless, excessive downsampling can compromise local details. GLNet has achieved good progress by using a multilevel feature pyramid network (FPN) [18] to fuse global contextual information from the downsampled input image with details from cropped local patches. Figure 1(1) displays an image from the Vaihingen dataset [19], and its segmentation is presented in Figure 1(2). This dataset includes ultra-high resolution orthophotos and digital surface models produced by dense image-matching technologies. The large number of freestanding and small multi-story structures in the collection is noteworthy. On this dataset, we employed GLNet [20] and DeepLabv3 [6] for prediction. Their results are displayed in Figure 1(4,5), respectively. One can observe that the latter performs better in handling segmentation details, especially for overlapping cars (zoomed-in panels (a), (b), (c), (d)). GLNet shows some discrimination ability but still cannot accurately segment each car. This reflects the limited ability of traditional FPNs to maintain the relationship between global contextual information and local details. Since the fused features are overly reliant on imprecise and one-sided global context information, significant boundary details are missing in the prediction results.
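The pixel counts quoted above, and the reason downsampling relieves memory pressure so quickly, can be checked with a few lines of arithmetic. The following is a minimal sketch; the downsampling factor of 4 is an illustrative assumption and is not a value taken from this paper.

```python
# Pixel counts for the resolutions quoted in the text, plus the effect of
# downsampling both sides by an integer factor (the factor is illustrative).

RESOLUTIONS = {
    "2K": (2048, 1080),
    "4K": (3840, 1080),
    "4K UHD": (3840, 2160),
}


def megapixels(width, height):
    """Return the pixel count in millions."""
    return width * height / 1e6


def downsampled_pixels(width, height, factor):
    """Pixel count after shrinking both sides by an integer factor."""
    return (width // factor) * (height // factor)


for name, (w, h) in RESOLUTIONS.items():
    print(f"{name}: {megapixels(w, h):.1f} M pixels")

# Shrinking each side by 4 keeps only 1/16 of the pixels, which is why
# aggressive downsampling saves memory but discards local detail.
w, h = RESOLUTIONS["4K UHD"]
kept = downsampled_pixels(w, h, 4) / (w * h)
print(f"fraction of pixels kept after 4x downsampling: {kept:.4f}")
```

Because the pixel count falls with the square of the downsampling factor, even a modest factor removes most of the fine detail that small objects such as cars depend on.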
Since the vision transformer (ViT) [21] introduced the transformer architecture into visual tasks, various state-of-the-art models such as the masked-attention mask transformer (Mask2Former) [22,23], BSNet [9], and EfficientUNetTransformer [24] have demonstrated the effectiveness of encoder-decoder structures and attention mechanisms. Thus, we constructed a unique global-local deep fusion (referred to as GL-Deep Fusion) and used an improved transformer structure to better represent the connection between local details and global contextual information. Based on this, the global-local synthesis network (GLSNet) is proposed, featuring a dual-branch network structure with GL-Deep Fusion serving as the fusion module. By creating a local deep branch and a global shallow branch, more complex global contextual information and finer local details can be captured. As shown in Figure 1(3), GLSNet performs excellently in segmenting overlapping cars and object boundaries. Aside from segmentation accuracy, the GPU memory usage introduced by the transformer is also a point of concern. Traditional feature pyramid networks (FPNs) [18] usually require stacking multiple layers to fuse branch information, which can lead to higher memory usage. In comparison, the transformer has greater potential. UN-EPT [25] employs an efficient pyramid transformer structure for semantic segmentation tasks, resulting in a considerable reduction in GPU memory utilization, which greatly inspired us. In particular, the computational capacity of GPUs and CPUs is constrained by memory access delay [26][27][28], which significantly hampers the operational speed of transformers [29,30]. The memory inefficiency of element-wise functions in multi-head self-attention (MHSA) and frequent tensor reshaping can be greatly reduced. These studies reveal that there are methods to significantly reduce memory access time without compromising overall system efficiency. Based on this analysis, the proposed GL-Deep Fusion utilizes a dual-encoder and single-decoder attention mechanism. This mechanism, combined with the dual-branch structure, significantly reduces GPU memory usage. This structure demonstrates gains in accuracy and GPU memory usage on the Vaihingen and DeepGlobe [3] datasets.
Our contributions are summarized as follows:

• We introduce GL-Deep Fusion, which effectively captures the correlation between global semantics and ultra-high resolution image details through its integrated feature representation.

• The global contextual information and local cropped block details captured by the dual-branch structure can be fed crosswise directly into the dual encoders of the GL-Deep Fusion module, thereby avoiding redundant feature computations.

• Our proposed GLSNet significantly improves GPU memory utilization and segmentation accuracy in ultra-high resolution image segmentation. Compared to GLNet (baseline), it reduces GPU memory usage by 24.1% on the DeepGlobe dataset [3]. It also achieves strong results on the Vaihingen dataset [19].
The organization of this paper is as follows: Section 2 presents an overview of the current state of the related research. Section 3 outlines the network architecture and fusion mechanism that have been designed. The results of our experiments are presented in Section 4, followed by a discussion in Section 5. Finally, Section 6 presents our conclusions and future work.

Image Segmentation
Advancements in image segmentation have been pivotal to the field of computer vision, with models like FCN [31], U-Net [8,32,33], and DeepLab [4][5][6] laying the groundwork for modern techniques. Recent innovations, such as BSNet [9], MaskFormer [22], and Mask2Former [23], have further pushed the boundaries by introducing novel attention mechanisms for segmentation tasks. These models have demonstrated remarkable efficacy in various applications, from biomedical imaging to autonomous vehicle perception. However, they also present challenges when applied to ultra-high resolution images, particularly in terms of GPU memory requirements and processing speed.
In this context, it is crucial to consider the balance between segmentation accuracy and computational efficiency. Figure 1, which provides a visual comparison of segmentation results on the Vaihingen dataset, is particularly instructive. It showcases the performance of GLSNet alongside DeepLabv3 [6] and GLNet [20], highlighting the ability of different models to capture semantic details and boundary precision. The source image and segmentation labels are depicted, with distinct colors representing different semantic classes: white for "impervious surfaces", dark blue for "buildings", light blue for "low vegetation", green for "trees", and yellow for "cars". The segmentation results from GLSNet demonstrate a superior ability to handle boundary details, as evidenced by the zoomed-in panels (a), (b), (c), and (d), which reveal the nuances of each model's performance. The visual evidence presented in Figure 1 underscores the significance of our approach, which aims to achieve high segmentation accuracy without compromising on computational efficiency. This balance is essential for the practical deployment of segmentation models in real-world applications.

Segmentation of Ultra-High Resolution Images: Efficiency and Quality
As the dependency on image segmentation for real-time and low-latency tasks increases, the need to segment ultra-high resolution images efficiently and with high quality becomes paramount. ENet [34] successfully reduces floating-point computations by adopting an asymmetric encoder-decoder structure and early downsampling. ICNet [12] integrates multi-resolution feature maps for model compression to enhance efficiency. Recently, context aggregation has been a key tactic for overcoming the difficulties associated with ultra-high resolution image segmentation tasks. ParseNet [35] pools scene contexts globally at various levels to apply context aggregation techniques. To aggregate global context and high-resolution details, deep and shallow branches were integrated into ContextNet [36], BiSeNet [37], and GUN [38]. However, these models are not specifically tailored for ultra-high resolution images. The challenge of balancing memory usage and segmentation accuracy remains unresolved. In contrast to the aforementioned studies, our objective is to develop a customized model that tackles the challenges in ultra-high resolution image segmentation tasks.

Overview
The overview of the entire network structure is shown in Figure 2.

GL-Deep Fusion
As noted in Section 3.1, the integration of features from both branches is of paramount importance for the performance of GLSNet. To tackle this issue, the global-local deep fusion method (GL-Deep Fusion) was designed. The core idea is to leverage an enhanced transformer architecture that employs a dual-encoder and single-decoder mechanism to combine global and local features effectively. Unlike traditional transformer frameworks that require extensive computations to derive the matrices of queries, keys, and values, the proposed encoders take the features provided by the dual-branch network directly as input, thereby significantly reducing computational overhead and memory consumption. This approach enables a memory-efficient fusion process that capitalizes on the distinct advantages of both the global and local branches, yielding a more robust and accurate model for feature integration.

Transformer attention function [39]: The attention function is defined for matrices of queries $Q$, keys $K$, and values $V$, where the keys and values have dimensions $d_k \times n$ and $d_v \times n$, respectively. Here, $n$ denotes the number of elements within the set, while $d_k$ and $d_v$ are the dimensionalities associated with the keys and values. This function calculates the dot products between the queries and keys, scales them by $1/\sqrt{d_k}$ to stabilize the softmax operation, and subsequently uses the softmax function to generate a weighted distribution across the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $QK^{T}$ corresponds to the dot product of the query matrix with the transpose of the key matrix. The division by $\sqrt{d_k}$ serves as a form of normalization to maintain the balance of the softmax output.
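To illustrate why the division by the square root of d_k matters, the following minimal sketch compares softmax weights computed with and without the scaling. The toy vectors and the choice d_k = 64 are assumptions made only for illustration; they are not values from this paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, scale=True):
    """Softmax over query-key dot products, optionally divided by sqrt(d_k)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


# Toy example (hypothetical values): one query and three similar keys.
d_k = 64
query = [1.0] * d_k
keys = [[1.0] * d_k, [0.9] * d_k, [0.8] * d_k]

scaled = attention_weights(query, keys, scale=True)
unscaled = attention_weights(query, keys, scale=False)
print("scaled:  ", [round(w, 3) for w in scaled])
print("unscaled:", [round(w, 3) for w in unscaled])
```

Without scaling, the dot products grow with d_k and the softmax collapses onto a single key; with scaling, the three similar keys all keep a meaningful share of the weight.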
Branch-Attention (B2A): B2A (Figure 3 right) facilitates the interaction between two distinct branches, allowing one to query information from the other. This is particularly useful for establishing relationships between different data representations within the model:

$$\mathrm{B2A}(\mathrm{branch1}, \mathrm{branch2}) = \mathrm{Attention}(\mathrm{branch1}, \mathrm{branch2}, \mathrm{branch2})$$

where branch1 supplies the query matrix, and branch2 supplies the key and value matrices.

As shown in Figure 3, GL-Deep Fusion is based on branch-attention and is designed with a dual-cross-encoder and single-decoder structure. Information sequences from both the global branch and the local branch are taken directly as input sequences, denoted as $F_{global}$ and $F_{local}$. This design significantly reduces redundant computation and makes fuller use of the dual-branch structure. The first encoder takes queries from $F_{global}$ and key-value pairs from $F_{local}$, generating a global-local sequence that encompasses the local relevance information. It can be represented as

$$F_{global\text{-}local} = \mathrm{B2A}(F_{global}, F_{local})$$

In parallel and symmetrically, the second encoder takes queries from $F_{local}$ and key-value pairs from $F_{global}$, generating a local-global sequence with globally relevant information. Its representation is

$$F_{local\text{-}global} = \mathrm{B2A}(F_{local}, F_{global})$$

Finally, the decoder merges the global and local perspectives by applying the B2A function once more, treating the global-local and local-global sequences as distinct yet complementary branches. This integration is formulated as follows:

$$F_{fusion} = \mathrm{B2A}(F_{global\text{-}local}, F_{local\text{-}global})$$

This fusion process produces a set of features that capture both local details and global semantics. By deliberately excluding the feed-forward network (FFN) layer, our model streamlines the architecture, focusing on the efficient integration of features through the B2A mechanism, which is well suited to the dual-branch structure of GLSNet. This decision reflects a strategic choice to prioritize memory efficiency without compromising the model's ability to understand complex scenes.
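The dual-encoder, single-decoder flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the authors' implementation: sequences are stored as lists of feature vectors (one row per element), no learned projections, normalization, or residual connections are applied, and the argument order in the decoder (global-local as queries) is assumed from the order in which the text names the two sequences. The function names `b2a` and `gl_deep_fusion` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def b2a(branch1, branch2):
    """Branch-attention: branch1 supplies queries, branch2 keys and values.

    Both inputs are lists of feature vectors of equal width d. The features
    are used directly, without learned query/key/value projections
    (an assumption based on the description in the text).
    """
    d = len(branch1[0])
    output = []
    for q in branch1:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in branch2]
        weights = softmax(scores)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, branch2)) for j in range(d)]
        )
    return output


def gl_deep_fusion(f_global, f_local):
    """Two cross encoders followed by one decoder, with no FFN layer."""
    global_local = b2a(f_global, f_local)   # encoder 1: global queries local
    local_global = b2a(f_local, f_global)   # encoder 2: local queries global
    return b2a(global_local, local_global)  # decoder (argument order assumed)


# Toy sequences (hypothetical values): 2 global and 3 local elements, width 4.
f_global = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
f_local = [[0.9, 0.1, 0.4, 0.3], [0.2, 0.8, 0.1, 0.0], [0.3, 0.3, 0.7, 0.6]]

fused = gl_deep_fusion(f_global, f_local)
print(len(fused), "fused elements of width", len(fused[0]))
```

In this sketch the fused sequence has one element per query of the decoder's first argument, so its length follows the global sequence; a real module would additionally reshape the result back into a spatial feature map.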

Global Shallow Branch and Local Deep Branch
The global shallow branch and local deep branch of GLSNet are compatible with various backbone network structures. In the present study, the standard convolution-based ResNet [40] backbone was utilized, in its ResNet18 and ResNet50 variants with 18 and 50 layers, respectively (see Table 1 for network structures). For large-scale images, employing shallower neural networks can effectively extract global features without incurring significant computational overhead. Accordingly, the shallow branch architecture is ResNet18. Notably, the original ultra-high resolution images are used as input directly, without any preliminary downsampling, in the global shallow branch of GLSNet. This design enables the extraction of global contextual information, which encompasses a wide array of background environments and the semantic content of the entire image. Using a shallow neural network in the global branch improves segmentation accuracy and memory utilization compared to deep neural network designs. In addition, ResNet50 is used as the architecture of the local deep branch.
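Since the local deep branch operates on cropped patches while the global branch sees the whole image, the cropping step can be sketched as a sliding window over the image. The patch size and stride below are hypothetical and are not taken from this paper; only the 2448 × 2448 image size matches the DeepGlobe images. The helper names are likewise illustrative.

```python
def patch_origins(length, patch, stride):
    """Start offsets of windows of size `patch` that cover [0, length)."""
    if patch > length:
        raise ValueError("patch must not exceed the image side length")
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        # Add a final window flush with the border so no pixels are missed.
        origins.append(length - patch)
    return origins


def crop_boxes(width, height, patch, stride):
    """Return (left, top, right, bottom) boxes for every local patch."""
    return [
        (x, y, x + patch, y + patch)
        for y in patch_origins(height, patch, stride)
        for x in patch_origins(width, patch, stride)
    ]


# Hypothetical settings: 512-pixel patches with a 512-pixel stride.
boxes = crop_boxes(2448, 2448, patch=512, stride=512)
print(len(boxes), "local patches; first:", boxes[0], "last:", boxes[-1])
```

The last window in each row and column overlaps its neighbour slightly, which is one simple way to cover an image whose side is not a multiple of the patch size.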

DeepGlobe [3]: The DeepGlobe dataset, known for its ultra-high resolution and challenging content, comprises 803 remote sensing images, each with a pixel dimension of 2448 × 2448. This dataset has been partitioned into training, validation, and test sets, containing 455, 207, and 142 images, respectively. Such a division facilitates efficient model training and evaluation and helps verify the robustness of the model's performance. The diversity of land cover categories covered by DeepGlobe, including urban, agricultural, pastoral, forested, fire-stricken, waste, and unclassified areas, presents a comprehensive spectrum of seven distinct classes. These categories, in conjunction with the dataset's higher resolution compared to its predecessors, offer a more complex and realistic evaluation ground for the GLSNet model, particularly in handling fine details and diverse land cover types.
Vaihingen [19]: The Vaihingen dataset, with its 33 high-resolution images, each averaging 2494 × 2064 pixels and featuring a spatial resolution of 9 cm, provides rich data for urban and natural environment analysis. The inclusion of red, green, and near-infrared (NIR) channels allows for the capture of a diverse range of visual and spatial attributes, which are crucial for accurate segmentation tasks. The six distinct categories represented in the Vaihingen dataset (impervious surfaces, buildings, low vegetation, trees, cars, and background) require the model to exhibit high precision in distinguishing common urban and natural land cover types. The high spatial resolution and channel diversity of the Vaihingen dataset make it an ideal testbed for evaluating GLSNet's capability to accurately segment complex scenes with intricate details.
The characteristics of both the DeepGlobe and Vaihingen datasets, with their high resolution, diverse categories, and realistic scenarios, make them particularly suitable for evaluating the performance of GLSNet. These datasets not only challenge the model's ability to process and analyze large amounts of detailed spatial information but also verify its accuracy in segmenting various land cover types, which is essential for applications in autonomous driving, remote sensing, and urban planning.

Implementation Details
The model is optimized using the focal loss function [41], set with a weight of 1.0 and a γ value of 6, which serves as our primary objective to address class imbalance and focus on hard-to-classify examples. Similar to the methods employed in GLNet [20], we integrate two auxiliary losses and apply a regularization coefficient λ set to 0.15 to further refine the model's performance. As shown in Figure 4, the loss curves demonstrate a good fit, with both training and validation losses consistently decreasing and starting to plateau around the 20th epoch. This stabilization indicates that the model has effectively learned the underlying patterns in the data without overfitting, as evidenced by the convergence of the two curves to a similar minimum loss value. To assess and optimize the graphics processing unit (GPU) utilization of our model, we employed the "gpustat" command-line utility. This tool provides detailed insights into GPU usage, which is crucial for enhancing our model's computational efficiency. All training and testing are conducted on a single NVIDIA 1080Ti GPU card. This approach not only eliminates the need for gradient computation across multiple devices but also supports the replicability of our results. Moreover, to balance the GPU load and ensure stable training dynamics, we use a batch size of six for all training iterations. Our experiments are executed within the PyTorch framework [42], chosen for its flexibility and powerful dynamic computational capabilities. For optimization, we selected the Adam optimizer [43], renowned for its robust performance in handling sparse gradients.
Following GLNet [20], which demonstrates the benefits of assigning distinct learning rates to local and global branches for improved training outcomes, we have integrated this strategy into our model. Specifically, we set the global branch learning rate to $\beta_1 = 1 \times 10^{-4}$ and the local branch learning rate to $\beta_2 = 2 \times 10^{-5}$. This differential rate allows each branch to learn at a pace that is well matched to the nature of the data it processes.
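To make the role of γ concrete, the sketch below evaluates the focal loss for a single pixel, assuming the standard form from [41], namely the cross-entropy term multiplied by (1 − p_t)^γ and a constant weight. The probability vectors are hypothetical; only the weight of 1.0 and γ = 6 come from the text.

```python
import math


def focal_loss(probs, target, gamma=6.0, weight=1.0):
    """Focal loss for one pixel: -weight * (1 - p_t)**gamma * log(p_t)."""
    # Clamp to avoid log(0) when the predicted probability is exactly zero.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(probs, target):
    """Plain cross-entropy for one pixel, for comparison."""
    return -math.log(max(probs[target], 1e-12))


# Hypothetical class probabilities for two pixels whose true class is 0.
easy = [0.9, 0.05, 0.05]   # confidently correct prediction
hard = [0.2, 0.5, 0.3]     # misclassified prediction

for name, probs in (("easy", easy), ("hard", hard)):
    print(
        f"{name}: cross-entropy={cross_entropy(probs, 0):.4f}, "
        f"focal={focal_loss(probs, 0):.6f}"
    )
```

With a large γ, well-classified pixels contribute almost nothing to the loss, so the gradient is dominated by hard examples; setting γ to 0 recovers the ordinary weighted cross-entropy.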

Evaluation Metrics
The performance of the proposed GLSNet is assessed using three widely used metrics: overall accuracy (OA), F1 score, and mean intersection over union (mIoU). The OA evaluates the accuracy of pixel classification as the ratio of correctly classified pixels to the total number of pixels. The F1 score can be calculated for every category:

$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{recall} = \frac{TP}{TP + FN}$$

where TP is the number of true positives, and FP and FN are the numbers of false positives and false negatives, respectively. The mean F1 score is determined by averaging the F1 scores of all categories. The IoU formula can be written as follows:

$$IoU = \frac{TP}{TP + FP + FN}$$

Next, we calculate the mean intersection over union (mIoU) by averaging the IoU values across all semantic categories to facilitate comparison.
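The three metrics can be computed from a confusion matrix built over all pixels. The following is a minimal sketch in plain Python; the label lists are a hypothetical toy example and the function names are illustrative, not part of any evaluation toolkit used in this paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def class_counts(matrix, c):
    """Return (TP, FP, FN) for class c."""
    tp = matrix[c][c]
    fp = sum(row[c] for row in matrix) - tp   # predicted c, but another class
    fn = sum(matrix[c]) - tp                  # class c, but predicted otherwise
    return tp, fp, fn


def overall_accuracy(matrix):
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def f1_score(tp, fp, fn):
    # Equivalent to 2 * precision * recall / (precision + recall).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(tp, fp, fn):
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


# Toy example with three classes and six pixels (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

counts = [class_counts(cm, c) for c in range(3)]
mean_f1 = sum(f1_score(*c) for c in counts) / len(counts)
miou = sum(iou(*c) for c in counts) / len(counts)
print(f"OA={overall_accuracy(cm):.3f}, mean F1={mean_f1:.3f}, mIoU={miou:.3f}")
```

Per-class IoU is never larger than per-class F1 for the same counts, which is worth keeping in mind when comparing the two averaged scores.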

Experimental Results
The advantages of GLSNet were verified through experiments comparing it with various models on the Vaihingen and DeepGlobe datasets.
Table 2 presents a detailed comparison of these methods. GLSNet demonstrates superior performance across the overall metrics, with a mean F1 score of 90.2%, an OA of 90.9%, and a mIoU of 90.9%. These results showcase the effectiveness of our model in segmenting various object categories with high precision. For instance, GLSNet outperformed BSNet by 2.3% in accuracy for impervious surfaces and by 3.7% for cars. It is also worth noting that GLSNet's performance exceeds that of the popular Mask2Former method introduced in recent years. Compared with Mask2Former, GLSNet achieves a higher mean F1 score and overall accuracy, indicating its robustness and potential for practical applications in scenarios requiring precise segmentation of ultra-high resolution images. However, we also observed that in scenarios with complex backgrounds or low-contrast objects, GLSNet's segmentation accuracy is slightly lower than in scenarios with clear object boundaries and distinct features. This observation indicates the need to improve the model's ability to adapt to challenging conditions. One potential avenue for improvement is expanding the diversity of the training data to include more examples of difficult-to-segment objects, which could help the model learn to better distinguish between similar textures and backgrounds.
In conclusion, the Vaihingen dataset evaluation highlights GLSNet's strengths in accurately segmenting a wide variety of objects. However, the analysis also reveals areas for improvement, particularly in the category of low vegetation, where GLSNet's performance was weaker than for other object categories. This suggests that the model faces challenges in distinguishing objects with similar textural features from their surroundings. Addressing these challenges, such as enhancing the model's adaptability to complex segmentation tasks and optimizing memory usage without affecting accuracy, will guide the future development of GLSNet.
As shown in Table 3, all methods achieved improved mIoU results with the incorporation of a global branch, as opposed to relying solely on local patches. However, this enhancement came at the cost of a significant increase in GPU memory consumption. The majority of the methods struggled to balance segmentation accuracy with GPU memory usage. Among the listed approaches, only GLNet, which employs global-local information sharing, achieved a higher mIoU with reduced memory consumption, and it was therefore selected as the benchmark model for this dataset.

In contrast to the baseline model GLNet, the proposed GLSNet achieved significant improvements in both mIoU and GPU memory usage. The mIoU score reached 72.4%, marking a 0.8% improvement over the baseline model. Most notably, the GPU memory usage decreased by 451 MB, a reduction of 24.1%. These advancements position GLSNet as a more advantageous model in terms of operational speed and resource utilization, offering enhanced potential for practical applications. We conducted an in-depth analysis to explain the performance differences between GLSNet and the baseline GLNet. The dual-branch structure and the enhanced transformer attention mechanism of GLSNet allow for more efficient feature extraction and fusion, leading to superior segmentation accuracy while maintaining low memory usage. Specifically, the optimized architecture of GLSNet minimizes redundant computations and capitalizes on the strengths of both global and local information, resulting in a notable increase in segmentation performance.
Furthermore, we have considered the practical implications of our findings, recognizing that the balanced approach of GLSNet to accuracy and efficiency could be beneficial for real-world applications, especially in environments with constrained computational resources. The performance of GLSNet suggests that it may offer a viable solution for 3D object detection tasks that require a balance between precision and resource utilization. We believe that the results achieved by GLSNet contribute valuable insights for ongoing research and may support the development of more effective 3D object detection systems in the future.

Ablation Experiments

The Effects of Shallow-Deep Branch and GL-Deep Fusion
As shown in Table 4, we designed three models: shallow-deep, shallow-shallow, and deep-deep. These three models differ in their design for the global backbone, local backbone, and fusion module. They are used to evaluate the impact of the shallow-deep branch collaborative strategy and GL-Deep Fusion structure on ultra-high resolution image segmentation. Note that the benchmark model, GLNet, is also included for comparison. On the DeepGlobe dataset, we conducted ablation studies to evaluate the impact of different network architectures on mIoU, GPU memory usage, and the frames per second (FPS) metric, which provides additional insight into the models' performance. As shown in

Discussion
The exceptional performance of our GLSNet method, as reflected in the comparative Table 7, is fundamentally attributed to the architectural design tailored specifically for ultra-high resolution image segmentation. This framework forms the cornerstone of our success. The superior performance metrics observed in comparison with other state-of-the-art methods substantiate the effectiveness of our approach in addressing the complexities of high-resolution imagery.

U-Net (Series) [8,25,32,33]: Utilizes an encoder-decoder architecture with skip connections to integrate features from various levels, particularly effective for medical image segmentation.
GLNet [20]: A dual-branch network that leverages multi-level feature pyramid networks (FPNs) to exchange features between branches, improving feature utilization.
MaskFormer (Series) [22,23]: Introduces transformer decoders and proposes a mask classification model that unifies semantic, instance, and panoptic segmentation tasks by predicting a set of binary masks.
Our Method: Utilizes a global shallow branch and a local deep branch in conjunction with GL-Deep Fusion based on branch-attention, achieving complete collaboration within the dual-branch structure, which is highly suitable for ultra-high resolution image segmentation.

Conclusions
GLSNet, a segmentation model optimized for ultra-high resolution images that prioritizes memory efficiency, has been presented. Its network structure is composed of a shallow branch that covers the global context and a deep branch that focuses on local details, ensuring the effective collection of both global and local information. The GL-Deep Fusion module combines global contextual information with local details. Using this method, GLSNet shows competitive performance on both the DeepGlobe and Vaihingen datasets. Specifically, it excels at separating overlapping small objects within an image. We consider it essential to strike an improved balance between GPU memory utilization and segmentation accuracy in ultra-high resolution image research. GLSNet addresses this need and is a powerful tool for ultra-high resolution image segmentation.
Although GLSNet has already shown efficient memory usage, there remains untapped potential for additional optimization. In our upcoming research, we aim to further investigate various prospects related to ultra-high resolution image segmentation. We intend to extend GLSNet to a greater variety of real-life scenarios. We will also explore the fusion of multi-modal data, including the integration of ultra-high resolution images with LiDAR, radar, or hyperspectral imagery, aiming to improve both segmentation accuracy and contextual comprehension. These efforts will contribute to further enhancing the performance and applicability of ultra-high resolution image segmentation technology.

Figure 1. Segmentation results example from the Vaihingen dataset.

Figure 2. Overview of GLSNet. The global shallow branch and local deep branch use shallow neural networks and deep neural networks to capture global contextual information and local detail. Then, the GL-Deep Fusion completes the fusion of global-local information, ultimately completing the segmentation task of ultra-high resolution images. The two main modules proposed are GL-Deep Fusion and the global shallow branch, along with the local deep branch. In Section 3.2, the intricacies of GL-Deep Fusion are explored. Subsequently, Section 3.3 is dedicated to an exploration of the nuances of the global shallow branch and the local deep branch.

Figure 3. Overview of the GL-Deep Fusion module structure. As shown in the left figure, the core structure of GL-Deep Fusion includes a dual-cross encoder and a single decoder. As shown in the right figure, the two input sequences branch1 and branch2 correspond to the information sequences generated after processing by the global branch and local branch in the left figure, respectively. They are cross-used as the input sequences of the two encoders, yielding the global-local and local-global sequences. Finally, the two types of sequences are fused into the final feature by the decoder.

Figure 4. The training loss curve of GLSNet on the DeepGlobe dataset.

Figure 5. Comparison of segmentation results between GLSNet, DeepLabv3, and GLNet on the Vaihingen dataset. Note that the red boxes in the figure are mainly used to indicate parts where there are significant differences in segmentation results.

Author Contributions: The authors confirm their contribution to the paper as follows: study conception and design: C.L., K.H. and J.M.; data collection: C.L.; analysis and interpretation of results: C.L.; draft manuscript preparation: C.L., K.H. and J.M. All authors have read and agreed to the published version of the manuscript.

Table 2. Comparison results of various approaches on the Vaihingen dataset.

Table 3. mIoU and inference GPU memory usage for predictions on the DeepGlobe test set.

Table 4. Illustrations of network architectures for various model designs.

Table 5 ,
the deep-deep model, enhanced by the GL-Deep Fusion strategy, achieved a 1%

Table 7. The main contributions of methods already reported in the state of the art.