1. Introduction
Coastal environments are dynamic and ecologically critical systems that provide essential services such as shoreline protection, biodiversity support, and economic benefits [1,2,3,4]. However, these areas are increasingly under pressure due to both natural processes and human activities, including urbanization, infrastructure development, and climate change.
Coastal zones across the European Mediterranean are experiencing growing environmental pressures, with approximately 20% of the total shoreline currently being affected by erosion [5,6]. Greece, which has a coastline exceeding 16,000 km and more than 6000 beach zones [7], faces an even more critical situation: over one-third of its national shoreline is undergoing active erosion, with loose-sediment beaches being the most vulnerable [8]. Future projections under high-emission (SSP5-8.5) IPCC scenarios indicate that up to 99% of these beaches could retreat by more than 20% of their maximum width, around 72% may experience retreat exceeding 50%, and up to 20% could vanish entirely by the end of the century [9,10]. These changes pose significant threats to ecosystems [11], tourism-based economies [12], and cultural heritage [13], underscoring the socio-economic consequences for coastal communities [14] and the need for accurate, scalable coastal monitoring.
Traditional monitoring methods, such as in situ surveys or manual image interpretation, are often expensive, labor-intensive, and spatially constrained. Remote sensing (RS) technologies offer a scalable solution through satellite and aerial imagery, but their effective use is limited by the complexity of coastal landscapes and the need for detailed, localized interpretation.
Recent developments in Artificial Intelligence (AI), particularly in deep learning, have enabled more advanced analysis of geospatial data. Semantic segmentation, which classifies each pixel in an image into predefined categories, has become a valuable tool to extract environmental characteristics from RS images. Convolutional Neural Networks (CNNs) have been widely used in this domain due to their capacity for hierarchical feature learning [15]. Building on these advances, Transformer-based models such as MaskFormer and Mask2Former have recently demonstrated superior performance, owing to their ability to capture long-range dependencies and richer contextual information [16,17].
Despite these advances, coastal applications face a notable bottleneck: the limited availability of high-resolution, annotated datasets specific to coastal environments. Many state-of-the-art models are pretrained on urban or generic datasets (e.g., Cityscapes [18], ADE20K [19]), which do not generalize well to heterogeneous shorelines. The Coast Train dataset [20] represents a significant step toward addressing this, offering labeled coastal imagery from the United States. However, regional data gaps remain, particularly in morphologically diverse coastlines such as Greece, where features like tidal zones, sediment transitions, and dune systems are underrepresented in training data.
Manual annotation of coastal imagery is also a major barrier. It requires detailed pixel-wise labeling, often supplemented by in situ data or expensive sensing techniques such as LiDAR. This limits the operational deployment of AI models, particularly in data-scarce or remote regions.
This study proposes a two-stage semantic segmentation pipeline to address these challenges. Specifically, we achieve the following:
Fine-tune transformer-based semantic segmentation models on the publicly available Coast Train dataset, covering various coastal imagery.
Adapt the pretrained models to the Greek coastline using a limited, manually labeled set of high-resolution aerial images.
Evaluate model performance quantitatively using mean Intersection over Union (mIoU) and qualitatively through visual analysis of segmentation outputs.
Demonstrate the effectiveness of transfer learning and localized fine-tuning in enhancing model accuracy across diverse coastal environments.
2. Related Work
Traditional approaches to mapping coastal environments relied on manual interpretation or classical image processing techniques, which often lack scalability and robustness. Recent progress in Artificial Intelligence has significantly improved the analysis of remote sensing (RS) imagery, enabling more detailed modeling of coastal dynamics [21]. Despite this global trend, the application of machine learning for coastal mapping in Greece remains limited.
Early studies employed CNN-based architectures such as U-Net and its variants, achieving strong performance in sea–land segmentation and shoreline delineation [22,23]. Additional improvements were achieved through attention mechanisms, edge-aware learning, and hybrid modules, as demonstrated in works such as [24,25,26]. These approaches addressed challenges related to complex boundaries, varying resolutions, and low-contrast coastal scenes.
Beyond binary water/land separation, more recent research has focused on multi-class coastal segmentation. For example, [27] applied a U-Net-based model to classify pixels into categories such as water, sand, vegetation, and built structures, highlighting the need for finer-grained coastal information. Other works have incorporated temporal analyses of coastal dynamics, such as multi-year shoreline change detection [24], or focused on multi-resolution robustness [28]. Lightweight architectures have also been developed for real-time segmentation in aquatic or urban water environments [29].
General-purpose models such as the Segment Anything Model (SAM) [30] have attracted interest in coastal mapping due to their prompt-based segmentation capabilities. However, reviews such as [31,32] emphasize that RS-specific spectral variability and resolution differences often require domain adaptation for optimal coastal performance.
Recent advances in transformer-based segmentation address these limitations by leveraging self-attention mechanisms capable of modeling both local and global spatial contexts. This characteristic is particularly beneficial for high-resolution coastal imagery, where intricate boundaries and heterogeneous surface types are common. Building on these developments, our study evaluates three state-of-the-art transformer-based models (SegFormer [33], MaskFormer [34], and Mask2Former [35]) for the multi-class segmentation of coastal remote sensing images.
Transformer architectures have recently attracted substantial attention in remote sensing image segmentation due to their ability to model long-range spatial dependencies and integrate multi-scale contextual information [33,34,35]. The three models evaluated in this work, SegFormer, MaskFormer, and Mask2Former, were selected for their strong performance across diverse segmentation benchmarks and their increasing adoption in Earth observation applications.
SegFormer [33] introduces a hierarchical Transformer encoder that produces multi-resolution feature maps without relying on positional encodings. This design enables effective capture of both fine details and global structure while maintaining computational efficiency. A lightweight MLP decoder fuses the multi-scale features into the final segmentation prediction, making SegFormer suitable for high-resolution remote sensing tasks where efficiency is important.
MaskFormer [34] reconceptualizes semantic segmentation as a mask-classification problem rather than per-pixel labeling. A Transformer decoder processes a set of learnable queries, each predicting a binary mask and a class label. This set-based prediction strategy improves robustness to class imbalance, ambiguous coastlines, and heterogeneous land–water boundaries commonly found in coastal imagery.
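To make the mask-classification idea concrete, the following minimal sketch shows one common way of turning per-query outputs into a semantic label map: each pixel receives the class that maximizes the sum, over queries, of class probability times mask probability. The function name, the nested-list data layout, and the assumption that the "no object" class has already been dropped are illustrative choices, not details taken from this study.

```python
def semantic_from_queries(class_probs, masks):
    """Combine per-query predictions into a semantic label map.

    class_probs[q][c]: probability that query q belongs to class c
                       (assumes the "no object" column was already removed).
    masks[q][y][x]:    mask probability in [0, 1] for query q at pixel (y, x).
    Returns labels[y][x] = argmax_c sum_q class_probs[q][c] * masks[q][y][x].
    """
    num_classes = len(class_probs[0])
    height, width = len(masks[0]), len(masks[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of each class at this pixel, accumulated over all queries.
            scores = [
                sum(p[c] * m[y][x] for p, m in zip(class_probs, masks))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels
```

Because the class decision is made once per query rather than once per pixel, a small region such as a narrow foam band can be assigned consistently as long as one query covers it.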
Mask2Former [35] extends MaskFormer by introducing masked attention, where attention is computed only within spatial regions defined by predicted masks. Combined with a hierarchical backbone such as the Swin Transformer [36] and a multi-scale pixel decoder, this approach enhances boundary delineation and improves segmentation performance in complex scenes, including coastal regions with subtle spectral transitions.
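The masked-attention idea can be illustrated with a small, dependency-free sketch for a single query: attention logits at locations outside the query's predicted mask are set to negative infinity before the softmax, so those locations receive zero weight. This is a simplified illustration, not the Mask2Former implementation; it omits learned projections, scaling, and residual connections, and the 0.5 threshold and the fallback to full attention for an empty mask are assumptions.

```python
import math


def masked_attention(query, keys, values, mask_probs, threshold=0.5):
    """Single-query attention restricted to the predicted mask region.

    query:      list of floats (query embedding)
    keys:       list of key vectors, one per spatial location
    values:     list of value vectors, one per spatial location
    mask_probs: predicted mask probability per location (from the previous layer)
    """
    allowed = [p >= threshold for p in mask_probs]
    if not any(allowed):
        # Assumed fallback: an empty mask attends everywhere to avoid a 0/0 softmax.
        allowed = [True] * len(keys)
    logits = [
        sum(q * k for q, k in zip(query, key)) if ok else -math.inf
        for key, ok in zip(keys, allowed)
    ]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]  # exp(-inf) == 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights
```

Restricting attention in this way keeps each query focused on its own region, which is one plausible reason for sharper boundaries in scenes with subtle transitions.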
3. Study Areas and Data
In this study, we implement and compare three Transformer-based semantic segmentation models, SegFormer, MaskFormer, and Mask2Former, to evaluate their suitability for coastal remote sensing applications. Coastal environments are characterized by strong spatial heterogeneity, elongated linear structures, and gradual transitions between land and water, which pose challenges for traditional pixel-based and local-context models. The self-attention mechanism employed by Transformer architectures enables the modeling of long-range dependencies while preserving fine-scale spatial details, making these models particularly well suited for complex coastal scenes.
The selected architectures jointly capture local texture information (e.g., sediment patterns, vegetation structure, and built-up areas) and global spatial context (e.g., shoreline continuity and coastal morphology). This combination is critical for accurately delineating dynamic coastal features such as surf zones, sediment–water boundaries, and vegetated buffers. By evaluating these models across multi-resolution datasets and distinct geographic regions, we aim to assess their robustness, transferability, and practical applicability for automated coastal monitoring and environmental analysis.
To support the semantic segmentation of coastal environments, we employed two complementary datasets: the publicly available Coast Train dataset [20] from the United States and high-resolution aerial imagery from the Greek coastline, provided by the Hellenic Cadastre [37]. These datasets differ in geographic origin, spatial resolution, and coastal morphology, offering a diverse basis for evaluating geographic generalization and cross-regional model adaptation. The inclusion of morphologically distinct coastlines introduces variability in landforms, textures, and environmental conditions, factors that are critical for training robust segmentation models.
A small subset of the Greek imagery was manually annotated to enable supervised fine-tuning. Both datasets used the same labeling scheme, enabling a unified training pipeline and a consistent evaluation across regions.
The Coast Train dataset [20] is a large-scale, human-labeled collection of orthomosaic and satellite imagery from diverse U.S. coastal environments. It provides pixel-level annotations across several land cover classes relevant to coastal monitoring.
The imagery spans spatial resolutions from 0.05 m (orthomosaics) to 15 m (Sentinel-2 and Landsat-8). This multi-resolution design enables learning across different sensor types and spatial scales, supporting generalization to heterogeneous inputs.
For preprocessing, images smaller than 300 × 300 pixels and non-coastal samples were removed, yielding a curated set of 645 images. All remaining images were resized to 512 × 512 pixels for compatibility with the transformer-based models. The dataset was split into 70% training, 15% validation, and 15% testing, ensuring balanced representation across classes.
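The size filter and the 70/15/15 split described above can be sketched as follows. This is a minimal illustration under stated assumptions: an image is treated as "smaller than 300 × 300" if either side is below 300 pixels, the split is a plain seeded shuffle (it does not enforce the class balance mentioned in the text), and the function names are hypothetical. The manual removal of non-coastal samples is not modeled.

```python
import random

MIN_SIDE = 300  # from the text: images smaller than 300 x 300 px are removed


def keep_image(width, height, min_side=MIN_SIDE):
    """Assumed reading of the size rule: both sides must reach min_side."""
    return width >= min_side and height >= min_side


def split_dataset(items, percents=(70, 15, 15), seed=0):
    """Shuffle and split items into train/validation/test subsets.

    Integer percentages avoid floating-point rounding surprises; the test
    subset takes the remainder so that no item is dropped.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Under this rounding rule, 645 images would be divided into 451 training, 96 validation, and 98 test images; the exact counts used in the study are not stated in the text.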
The class taxonomy is shown in Table 1.
To evaluate cross-regional generalization, a complementary dataset was constructed using high-resolution aerial imagery (25 cm per pixel) from sections of the southwestern Greek coastline, obtained from the Hellenic Cadastre [37]. These images depict coastal morphologies not represented in Coast Train, including narrow beaches, rocky outcrops, mixed sediment types, and dense shoreline vegetation (Figure 1).
Raw images were divided into 512 × 512 patches using zero-padding where necessary to ensure consistent dimensions (Figure 2, Figure 3 and Figure 4). This procedure preserved fine spatial detail while maintaining compatibility with model input requirements.
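A dependency-free sketch of the patch extraction is given below for a single-channel image stored as a list of rows. It assumes non-overlapping tiles and padding applied to the bottom and right edges; the text does not specify where the padding was placed, and the function name is hypothetical.

```python
def tile_with_zero_padding(image, tile=512):
    """Split a 2-D image (list of equal-length rows) into tile x tile patches.

    Patches that extend past the image border are filled with zeros
    (assumed to be on the bottom and right edges).
    """
    height, width = len(image), len(image[0])
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = []
            for y in range(r * tile, (r + 1) * tile):
                if y < height:
                    row = list(image[y][c * tile:(c + 1) * tile])
                    row += [0] * (tile - len(row))  # pad the right edge
                else:
                    row = [0] * tile  # pad the bottom edge
                patch.append(row)
            patches.append(patch)
    return patches
```

Since padded pixels carry no real content, a practical pipeline would likely map them to the "unknown" class or exclude them from the loss; that choice is not described in the text.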
A total of 420 manually annotated patches were generated using the Segments.ai platform. Annotations followed the Coast Train label definitions (Table 1). Special care was taken to accurately capture transitional shoreline zones, areas where automated models typically struggle due to fine-grained boundaries or mixed surface types.
The patches were divided into 60% training, 20% validation, and 20% testing. The training portion was used exclusively for fine-tuning, while validation and test sets served to assess cross-regional adaptation.
The Coast Train dataset served as the primary training source, while the Greek patches enabled targeted fine-tuning and systematic evaluation in a distinct geographic region. This dual-dataset framework supports both large-scale model development and region-specific adaptation.
The pilot site selected for this study lies along the southwestern coastal zone of the Peloponnese, Greece, specifically bordering the Kyparissiakos Gulf. The study area extends approximately 60 km along the shoreline and reaches up to 300 m inland (Figure 5). This zone includes a wide range of geomorphological formations, such as sandy beaches, vegetated backshores, active dune systems, and rocky coastal segments. Settlements including Kyparissia, Filiatra, and Kalo Nero are interspersed along this stretch, introducing diverse land-use patterns and varying levels of human-induced alteration.
The natural variability within the study region, ranging from unaltered coastal ecosystems to semi-urbanized zones, provides a well-suited testbed for evaluating segmentation model performance in real-world settings. Particularly relevant are the multiple hydrological inputs, such as the Neda River and smaller streams that flow into the Gulf, which contribute to sediment dynamics and influence nearshore patterns. Much of the area is included in the Natura 2000 ecological network, which underscores its ecological sensitivity and the need for high-resolution, frequent environmental monitoring to support effective conservation and land-use decision-making.
4. Methodology
4.1. Training Framework
Semantic segmentation of coastal imagery is essential for applications such as environmental monitoring, shoreline assessment, and coastal zone management. In this study, we evaluate three transformer-based models—SegFormer, MaskFormer, and Mask2Former—for pixel-level classification into seven coastal classes: water, sea foam, sediment, vegetation, development, natural terrain, and unknown.
A central challenge lies in the fact that the Greek Coastline dataset contains only a small number of manually annotated samples, limiting its suitability for fully supervised training. To address this, we implement and compare two training strategies: (i) direct fine-tuning solely on the Greek annotations, and (ii) a two-stage domain adaptation process leveraging the large, labeled Coast Train dataset prior to adaptation to the Greek domain.
In the first approach, we directly fine-tune the pre-trained transformer models on the manually annotated subset of the Greek Coastline dataset (Section 5.4 and Section 5.5), without using the Coast Train dataset as an intermediate domain. This method evaluates whether region-specific training data alone are sufficient for effective segmentation (Figure 6).
Fine-tuning on the limited Greek annotations enables the models to adapt to local morphological and spectral patterns, including shoreline geometries, sediment types, and vegetation structures. However, the restricted size of the dataset results in reduced generalization, particularly for visually similar classes (e.g., sediment vs. natural terrain) and transition zones such as sea foam or mixed vegetation–sand regions. Although this approach avoids additional preprocessing steps, it produces lower segmentation accuracy and weaker boundary consistency compared to the two-stage strategy (Section 5.6).
4.2. Classification Scheme
To achieve higher accuracy and improved generalization, we adopt a two-stage fine-tuning strategy grounded in transfer learning principles (Figure 7). This process enables the models to first learn broad coastal characteristics before adapting to region-specific Greek imagery.
In the first stage, the transformer models are fine-tuned on the labeled Coast Train dataset, which includes diverse orthomosaic and satellite imagery from a wide range of U.S. coastal settings. Exposure to heterogeneous landforms, spectral conditions, and coastal structures allows the models to learn generalizable representations of coastal environments.
This stage also bridges the gap between generic pre-training (e.g., ImageNet) and the specialized coastal domain, establishing meaningful semantic priors before adaptation to the Greek region.
In the second stage, the models are fine-tuned on the manually annotated Greek Coastline patches. This step addresses domain shifts arising from differences in terrain morphology, vegetation composition, lighting conditions, and imaging resolution.
Despite the limited number of annotated samples, fine-tuning on this localized dataset significantly improves segmentation performance, particularly for narrow shoreline structures and region-specific spectral signatures. The two-stage strategy therefore provides a practical approach for deploying deep learning models in data-scarce regions by combining large-scale general-domain training with targeted local adaptation.
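The difference between the two training strategies comes down to how checkpoints are chained, which the following sketch illustrates. The stage names, dataset identifiers, and the `train_fn` callback are hypothetical placeholders rather than the actual training code; the iteration counts are those reported in the experimental setup.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str      # hypothetical dataset identifier
    iterations: int


TWO_STAGE = [
    Stage("stage1", "coast_train", 160_000),
    Stage("stage2", "greek_coastline", 40_000),
]
DIRECT = TWO_STAGE[1:]  # baseline: skip the Coast Train stage


def run_pipeline(
    initial_weights: str,
    stages: List[Stage],
    train_fn: Callable[[str, Stage], str],
) -> str:
    """Run stages in order; each stage starts from the previous checkpoint.

    train_fn(weights, stage) is a placeholder for a real training routine
    and is expected to return the identifier of the resulting checkpoint.
    """
    weights = initial_weights
    for stage in stages:
        weights = train_fn(weights, stage)
    return weights
```

Passing `TWO_STAGE` reproduces the generic-to-coastal-to-Greek chain, while `DIRECT` goes straight from the generic pretrained weights to the Greek patches.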
5. Experiments
5.1. Experimental Setup
The experimental framework was designed to ensure reproducibility and rigorously evaluate the transferability of Transformer-based models to diverse coastal geographies. All experiments were conducted on the Kaggle cloud computing platform using an NVIDIA Tesla P100 GPU (16 GB of VRAM), manufactured by NVIDIA Corporation (Santa Clara, CA, USA). Model training and evaluation were implemented using the PyTorch deep learning framework (version 1.13.1; Meta Platforms, Inc., Menlo Park, CA, USA) and the MMSegmentation library.
5.1.1. Architectural Configurations
Three existing Transformer architectures, SegFormer (2021), MaskFormer (2021), and Mask2Former (2022), serve as the backbones; their selection is predicated on their ability to capture long-range dependencies, which is critical for the continuity of linear coastal features. We utilized the “Large” variants (e.g., SegFormer-B5, Mask2Former-L) to maximize the feature extraction capacity during the two-stage domain adaptation.
5.1.2. Hyperparameters and Optimization
To maintain consistency across architectures and training stages, the following hyperparameters were employed:
Optimizer: AdamW, with a weight decay of 0.01.
Learning Rate: A base learning rate of was used, governed by a poly learning rate schedule with a power of 0.9 to ensure smooth convergence.
Batch Size: A total batch size of 8 (4 images per GPU iteration).
Training Duration: Stage 1 (Coast Train) was conducted for 160,000 iterations, while Stage 2 (Greek Adaptation) was performed for 40,000 iterations to prevent overfitting on the smaller Greek dataset.
Data Augmentation: Standard techniques, including random horizontal flipping, random scaling (0.5× to 2.0×), and random cropping to pixels, were applied to enhance model robustness.
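The poly learning rate schedule named in the list above follows a simple closed form, sketched here in plain Python. The base learning rate is left as a parameter because its value is not given in the text, and any minimum learning rate or warm-up that the training library may apply is not modeled.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Poly decay: base_lr * (1 - iteration / max_iterations) ** power."""
    progress = min(iteration, max_iterations) / max_iterations
    return base_lr * (1.0 - progress) ** power


# Illustration with a placeholder base learning rate of 1.0, so the values
# read as fractions of whatever base rate is actually used.
STAGE1_ITERS = 160_000  # from the text
halfway_fraction = poly_lr(1.0, STAGE1_ITERS // 2, STAGE1_ITERS)
```

With a power of 0.9 the decay is close to linear, dropping slightly more slowly early on and reaching zero at the final iteration.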
5.1.3. Loss Function
To address the inherent class imbalance in coastal scenes (e.g., the dominance of ‘Water’ over ‘Sea foam’), we utilized a combination of Cross-Entropy Loss and Dice Loss. In Stage 2, a coastal-class weighted focal loss was integrated to prioritize the accurate delineation of high-contrast boundaries, such as the sediment–water interface.
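A minimal sketch of a combined Cross-Entropy and Dice objective is shown below, operating on flattened per-pixel class probabilities. The equal weighting of the two terms, the soft Dice formulation, and the function names are assumptions, since the text does not specify them; the Stage 2 class-weighted focal term is not included.

```python
import math


def cross_entropy(probs, labels, eps=1e-7):
    """Mean negative log-likelihood; probs[n][c] is the softmax output for pixel n."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes, eps=1e-7):
    """One minus the mean soft Dice coefficient over classes."""
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denominator = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        total += (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - total / num_classes


def combined_loss(probs, labels, num_classes, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return (ce_weight * cross_entropy(probs, labels)
            + dice_weight * dice_loss(probs, labels, num_classes))
```

The Dice term is computed per class and then averaged, so a rare class such as sea foam contributes as much to it as a dominant class such as water, which is the usual motivation for pairing it with Cross-Entropy under class imbalance.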
5.2. Evaluation Methodology
Model performance was assessed using the Mean Intersection over Union (mIoU), a standard metric for multi-class semantic segmentation that quantifies agreement between predicted and ground-truth labels.
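For a given class $i$, the Intersection over Union (IoU) is defined as follows:
$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$
where $TP_i$, $FP_i$, and $FN_i$ denote true positives, false positives, and false negatives, respectively.
The overall mIoU across all $k$ classes is computed as follows:
$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{IoU}_i$$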
This metric is particularly informative for coastal segmentation, where classes may exhibit fine-scale boundaries and strong spatial imbalance. Reporting mIoU for both training strategies allows for a comprehensive comparison of generalization capability, region-specific adaptation, and the overall effectiveness of the proposed framework.
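The IoU and mIoU definitions in this subsection translate directly into code. The sketch below works on flattened label sequences and counts true positives, false positives, and false negatives per class. Skipping classes that appear in neither the prediction nor the ground truth is an assumption made here to avoid division by zero; it deviates from a strict average over all k classes.

```python
def per_class_iou(pred, target, num_classes):
    """Return IoU per class, or None for classes absent from both inputs."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denominator = tp + fp + fn
        ious.append(tp / denominator if denominator else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average the per-class IoU values over the classes that are present."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid)
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, giving an mIoU of 7/12.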
5.3. Quantitative Results
This section presents a comprehensive evaluation of the proposed two-stage training pipeline across three transformer-based architectures—SegFormer, MaskFormer, and Mask2Former. Models were first fine-tuned on the Coast Train dataset (Stage 1) and subsequently adapted to the Greek Coastline dataset via targeted fine-tuning (Stage 2). Performance is reported using mean Intersection over Union (mIoU), mean accuracy, and F1 score. We additionally evaluate a baseline condition in which models are trained directly on the Greek Coastline dataset without Stage 1 pretraining.
Overall, the results demonstrate (i) strong Stage 1 generalization from Coast Train, (ii) limited effectiveness when training directly on Greek imagery, and (iii) substantial improvements following Stage 2 fine-tuning, particularly for models pretrained on Cityscapes.
5.4. Stage 1: Training on the Coast Train Dataset
Table 2, Table 3 and Table 4 summarize the Stage 1 results. Across all architectures, Cityscapes pretraining consistently outperforms ADE20K pretraining, reflecting the closer visual and structural similarity between Cityscapes and coastal scenes. Among the SegFormer variants, SegFormer-B5 pretrained on Cityscapes achieved the best performance (82.69% mIoU). MaskFormer-Large reached 82.18% mIoU, while the Mask2Former-Large model pretrained on Cityscapes attained the highest Stage 1 score of 84.06% mIoU.
The pretraining dataset comparison (Table 5) reveals that Cityscapes initialization provides consistent advantages across architectures, with the largest gain observed for SegFormer-B5 (+5.32%). This finding suggests that pretraining on datasets with similar visual characteristics to the target domain (in this case, structured scenes with clear boundaries between land cover types) facilitates more effective transfer learning than pretraining on diverse but visually dissimilar datasets like ADE20K.
Across all architectures, Mask2Former-Large (Cityscapes) provides the strongest foundation for downstream fine-tuning, establishing this as the baseline for subsequent experiments.
5.5. Direct Training on the Greek Coastline Dataset
Direct fine-tuning on Greek imagery, without Stage 1, provides insight into how well models adapt using only local, limited annotations. Results are reported in Table 6. While functional, performance is consistently lower than the two-stage approach. Mask2Former-Large pretrained on Cityscapes achieved the best direct score (82.42% mIoU), outperforming SegFormer and MaskFormer variants in this baseline setting.
The reduced performance of direct training compared to the Stage 1 results (82.42% vs. 84.06% mIoU for Mask2Former) demonstrates the value of large-scale training on diverse coastal imagery. The limited size of the Greek dataset (420 patches) is insufficient for models to learn robust coastal representations on its own, even when initialized with general-purpose weights from Cityscapes or ADE20K.
5.6. Stage 2: Fine-Tuning on the Greek Coastline Dataset
The second training stage adapts each model to the Greek coastal domain using a limited set of manually annotated high-resolution aerial images. Fine-tuning consistently improves performance across all architectures, with Mask2Former-Large achieving the strongest results (85.43% mIoU). Table 7 summarizes Stage 2 results.
Comparing Stage 1 and Stage 2 reveals clear gains from geographic adaptation. Mask2Former-Large improves by +1.37% mIoU after exposure to even a small number of Greek training patches, highlighting the efficiency of targeted fine-tuning. The improvement is particularly notable given that Stage 2 training uses only 252 images (60% of 420 patches), demonstrating that strategic transfer learning can achieve substantial performance gains with minimal additional annotation effort.
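As a quick consistency check, the figures quoted in this section can be written out explicitly. The last line uses the direct-training score of 82.42% reported in Section 5.5. All differences are plain differences between reported mIoU values, in percentage points, rather than relative improvements.

```latex
\begin{align*}
0.60 \times 420 &= 252 && \text{(Stage 2 training patches)}\\
85.43 - 84.06 &= 1.37 && \text{(Stage 2 vs. Stage 1, Mask2Former-Large)}\\
85.43 - 82.42 &= 3.01 && \text{(Stage 2 vs. direct training, Mask2Former-Large)}
\end{align*}
```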
Table 8 presents a consolidated view of the final Stage 2 performance across all three architectures. The results clearly demonstrate Mask2Former’s superiority, with a 4.84 percentage point advantage over MaskFormer and an 8.62 percentage point advantage over SegFormer.
8. Conclusions
This study introduced a comprehensive two-stage semantic segmentation framework tailored to coastal environments, systematically evaluating three transformer-based architectures (SegFormer, MaskFormer, Mask2Former) combined with a transfer learning strategy. The study makes two key contributions. First, we establish that the Mask2Former architecture with masked attention provides superior performance for coastal segmentation, achieving 85.43% mIoU on challenging Greek coastal imagery and outperforming MaskFormer and SegFormer. This systematic architectural comparison, absent in the prior coastal segmentation literature, provides empirical guidance for method selection in operational coastal monitoring systems. Second, we validate a two-stage domain adaptation strategy that achieves efficient transfer from large-scale U.S. coastal data (Coast Train) to a data-scarce Greek coastal region. This approach outperforms both direct training on limited Greek data (+3.01% mIoU) and single-stage training on Coast Train alone (+1.37% mIoU). The results demonstrate that strategic transfer learning enables state-of-the-art performance with only 420 manually annotated patches, addressing a critical barrier to AI deployment in regions lacking extensive labeled datasets.
The findings highlight the effectiveness of hybrid training pipelines that balance generalization and locality, demonstrating that strong performance can be obtained even with limited manual annotation. The success of transformer-based models, particularly Mask2Former’s masked attention mechanism, further underscores the growing relevance of attention-based architectures in geospatial and environmental applications where both local fine-grained detail and global spatial context are essential.
Broader Implications
The proposed pipeline offers a scalable and cost-effective solution for high-resolution coastal monitoring, supporting climate-resilient planning, shoreline change detection, and sustainable land-use management. As pressures on coastal zones intensify, such AI-based tools provide a robust alternative to traditional field surveys, enabling more frequent, consistent, and wide-area assessments.
The methodological contributions—systematic transformer architecture evaluation and validated two-stage adaptation strategy—are immediately applicable to operational coastal management. National mapping agencies, environmental protection authorities, and coastal zone planners can leverage this framework to establish automated monitoring systems that complement or replace labor-intensive manual interpretation. The achieved segmentation accuracy enables the precise quantification of shoreline retreat rates, critical for erosion risk assessment and coastal adaptation planning under sea-level rise scenarios.
From a research perspective, this work establishes important benchmarks: (i) the first systematic comparison of transformer architectures for coastal segmentation, and (ii) validated cost–benefit metrics for transfer learning strategies in geographic adaptation scenarios. These contributions provide a foundation for future research in multimodal integration (combining optical, radar, and elevation data), temporal modeling (multi-year change detection), and global-scale coastal monitoring systems.
Future extensions incorporating multi-temporal data, multispectral features, or ancillary geospatial layers (bathymetry, wave exposure indices, historical shoreline positions) may further enhance model reliability and support informed decision-making at multiple governance levels. The demonstrated effectiveness of transformer architectures and strategic transfer learning suggests that comprehensive, automated coastal monitoring at national or continental scales is now technically feasible, with the remaining challenges primarily occurring in operational deployment, computational optimization, and multi-regional validation rather than fundamental methodological limitations.
In conclusion, this work advances the state-of-the-art in AI-driven coastal remote sensing by demonstrating that careful attention to architecture selection and training strategy can achieve robust performance even under challenging conditions of limited labeled data and significant domain shift. The practical applicability of these methods to pressing coastal management challenges positions this research as a step toward operational, scalable, and cost-effective automated coastal monitoring systems that can support evidence-based environmental policy and climate adaptation planning.