Article

RoadMark-cGAN: Generative Conditional Learning to Directly Map Road Marking Lines from Aerial Orthophotos via Image-to-Image Translation

by Calimanut-Ionut Cira 1,*, Naoto Yokoya 2,3, Miguel-Ángel Manso-Callejo 1, Ramon Alcarria 1, Clifford Broni-Bediako 3, Junshi Xia 3 and Borja Bordel 4

1 Departamento de Ingeniería Topográfica y Cartografía, Escuela Técnica Superior de Ingenieros en Topografía, Geodesia y Cartografía, Universidad Politécnica de Madrid, C/Mercator 2, 28031 Madrid, Spain
2 Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi 277-8561, Japan
3 Geoinformatics Team, RIKEN Center for Advanced Intelligence Project (AIP), Mitsui Building 15th Floor, 1-4-1 Nihonbashi, Tokyo 103-0027, Japan
4 Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingenieros de Sistemas Informáticos, Universidad Politécnica de Madrid, C/Alan Turing s/n, 28031 Madrid, Spain
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 224; https://doi.org/10.3390/electronics15010224
Submission received: 24 October 2025 / Revised: 10 December 2025 / Accepted: 24 December 2025 / Published: 3 January 2026

Abstract

Road marking lines can be extracted from aerial images using semantic segmentation (SS) models; however, in this work, a conditional generative adversarial network, RoadMark-cGAN, is proposed for direct extraction of these representations with image-to-image translation techniques. The generator features residual and attention blocks added in a functional bottleneck, while the discriminator features a modified PatchGAN, with an optimized encoder and an attention block added. The proposed model is improved in three versions (v2 to v4), in which dynamic dropout techniques and a novel “Morphological Boundary-Sensitive Class-Balanced” (MBSCB) loss are progressively added to better handle the high class imbalance present in the data. All models were trained on a novel “RoadMarking-binary” dataset (29,405 RGB orthoimage tiles of 256 × 256 pixels and their corresponding ground truth masks) to learn the distribution of road marking lines found on pavement. The metrical evaluation on the test set containing 2045 unseen images showed that the best proposed model achieved average improvements of 45.2% and 1.7% in the Intersection-over-Union (IoU) score for the positive, underrepresented class when compared to the best Pix2Pix and SS models, respectively, trained for the same task. Finally, a qualitative, visual comparison was conducted to assess the quality of the road marking predictions of the best models and their mapping performance.

1. Introduction

Detailed and up-to-date road cartography is essential for improving road safety. Accurately achieving this with automatic methods can provide comprehensive and timely information about road conditions and infrastructure. Such information facilitates better maintenance, leading to better planning, more efficient resource management, and informed decision-making.
Additionally, as autonomous vehicles become more widespread in society, the demand for high-quality geospatial road data becomes increasingly critical. This accentuates the need for very detailed road mappings [1] and incentivizes the extraction of horizontal road marking lines instead of road surface area representations (as pavement markings provide safer and more effective navigation for autonomous vehicles and better support their need for real-time decision-making).
Aerial imagery has been widely adopted by computer vision researchers for tasks such as land use analysis, geospatial object classification, and extraction [2] with deep learning (DL) algorithms. In these remote sensing applications, semantic segmentation (SS) techniques (supervised operations) are generally applied to assign a land cover class to every pixel of an image. However, these operations are notably challenging due to noise and occlusions inherently present in remotely sensed imagery, and the complex nature of some geospatial objects.
The automatic mapping of the road networks from aerial imagery is a topic of high interest in machine vision, where SS operations are widely used to extract linear objects such as road surface areas [3] as well as other point-based [4] or polygonal [5] geospatial elements with high levels of performance. In this study, the road extraction task focuses on using conditional generative learning, specifically, image-to-image (I2I) translation.
The I2I translation task was introduced by Isola et al. with the Pix2Pix model [6]—a specific type of generative model that features adversarially trained generator and discriminator networks and relies on paired, concatenated input–output examples for training. The generator receives a sample from the source domain and outputs an image tensor that is compared with the target image by the discriminator to compute the adversarial loss function, which evaluates whether the synthetic predictions match the target. Guided by the adversarial feedback received from the discriminator and the reconstruction L1 loss, the generator iteratively improves its predictions to better align them with the target domain.
The I2I translation delivers impressive qualitative results, but its performance evaluation in the literature is often centered on style and consistency and is not generally assessed with performance metrics specific to SS, such as the Intersection-over-Union (IoU) score. In [6], a Pix2Pix-based model, heavily modified for computational efficiency (92% and 61% reductions in the number of parameters of the generator and discriminator networks, respectively), was proposed to improve the SS predictions for road mapping purposes using a dataset of 6784 image tiles. Average improvements of 11% in the IoU score were observed with respect to the initial representations extracted. More recently, in 2023, the I2I translation technique was applied to postprocess the road surface area representations extracted using SS at a large scale [7] (state-level, covering a land area of 8637 km2 of the Spanish territory), and average improvements of 15.5% were achieved in the IoU score when compared to the initial unprocessed predictions.
However, in this work, a novel conditional generative adversarial network (cGAN) architecture, RoadMark-cGAN, is proposed to directly extract representative road marking lines found on the pavement using aerial orthophotos from a novel dataset “RoadMark-binary” (introduced in Section 4).
The training goal was to conditionally generate new synthetic images that resemble the target samples closely enough to allow a direct mapping of representative road lines. The training data is highly imbalanced, since road marking lines typically cover, on average, less than 5% of the image area (the remainder is considered background). To better handle this class imbalance, dynamic dropout techniques and a novel loss function were incorporated into RoadMark-cGAN-v2 and -v3, respectively, and combined in RoadMark-cGAN-v4.
In addition, Pix2Pix and two popular SS models were also trained on the same data. Afterwards, the generalization capacity of the trained models was evaluated on the same unseen test data both metrically (with performance metrics specific to SS) and visually, to assess the quality of the predicted road marking lines representations.
Our contributions are summarized as follows:
  • The RoadMark-cGAN model is proposed for mapping road marking lines identified on pavement from aerial orthoimages. The Generator G is based on the U-Net architecture, where residual and attention blocks are added as a functional bottleneck. The Discriminator D is a modified version of PatchGAN that features an attention block. The model also features enhanced normalization and exponential decay added to the training procedure. This model achieved an average increase of 38.7% in the IoU score for the positive class over the best Pix2Pix model trained.
  • RoadMark-cGAN was further enhanced in RoadMark-cGAN-v2, where the regularization was improved with a dynamic dropout technique applied to the residual and attention blocks. RoadMark-cGAN-v3 features a new “Morphological Boundary-Sensitive Class-Balanced” (MBSCB) loss function, combining morphological, Dice, adversarial, and $L_{1_{raw}}$ loss components to better extract thin structure representations. RoadMark-cGAN-v4 represents a combination of RoadMark-cGAN-v2 and RoadMark-cGAN-v3 and achieved average performance gains for extracting the positive, underrepresented class of approximately 45.15% and 1.64% when compared to the best Pix2Pix and SS models, respectively, that were considered.
  • The dataset, “RoadMark-binary” [8], containing 29,405 real pairs of orthoimage tile—ground truth mask, manually labelled, was split into training (27,360 pairs of images) and test (2045 pairs of images) sets and is available under a CC-BY 4.0 license.
  • The appropriateness of conditional generative techniques for directly extracting road markings was statistically evaluated using performance metrics specific to SS and visually evaluated on random test samples to assess prediction quality. The results indicate that the generator can produce high-quality synthetic data without direct access to ground-truth masks and can successfully map the source domain distribution (orthoimage tiles) to the target domain distribution (corresponding maps) through I2I translation techniques.
The remainder of the manuscript is organized as follows. Section 2 presents the I2I translation task with conditional generative learning from a mathematical perspective. In Section 3, related works are described. The data used to train and test the models are presented in Section 4. Details of the proposed RoadMark-cGAN model and training procedure are presented in Section 5, while its three improved variants (RoadMark-cGAN-v2 to RoadMark-cGAN-v4) are described in Section 6. The experiments carried out and the performance on the test set are presented and analyzed in Section 7. Section 8 contains a qualitative evaluation of the test predictions, while the implications of the findings are discussed in Section 9. Finally, Section 10 presents the conclusions.

2. Problem Description

For the I2I translation task tackled in this work, cGANs take a supervised conditional approach. Here, the conditional data $x$ from the source domain $X$ (aerial orthoimages to be mapped to representations of road markings) is the generator’s input.
The generator processes the input into a synthetic image $G(x) = \hat{y}$, and the goal of the generator is to learn a transformation function ($G: x \rightarrow \hat{y}$) capable of converting the input image $x$ into a synthetic image $\hat{y}$ that is indistinguishable from the target image $y$ from the target domain $Y$ (road marking maps). It is important to note that the generator does not have direct access to the target data $y$.
The conditional data is then paired with the generated image, $(x, G(x))$, and fed to the Discriminator together with the conditional data concatenated with the corresponding target labels, $(x, y)$. The discriminator learns to distinguish between real target images and the generated images $G(x)$ by examining the pairs of real data $(x, y)$ and synthetic data $(x, G(x))$. This is used to compute the adversarial loss, $L_{cGAN}$, which measures how well the synthetic prediction $G(x)$ is classified as real.
During training, the generator is guided not only by the discriminator’s feedback, $L_{cGAN}$, but also by a reconstruction $L_1$ loss, $L_{L1}$, that measures how close the generated image is to the actual target image, ensuring that the generated outputs, $G(x)$, are realistic and semantically aligned with the target domain. The combined objective function of the cGAN model is defined by the adversarial setting in Equation (1).
$\min_G \max_D \; L_{cGAN}(G, D) + \lambda \times L_{L1}(G)$ (1)
The $L_{cGAN}(G, D)$ and $L_{L1}(G)$ terms from Equation (1) are defined in Equations (2) and (3), respectively.
$L_{cGAN}(G, D) = \mathbb{E}_{x, y \sim p_{\text{data}}}[\log D(x, y)] + \mathbb{E}_{x \sim p_{\text{data}}}[\log(1 - D(x, G(x)))]$ (2)
The first term of Equation (2), $\log D(x, y)$, ensures the correct classification of real pairs $(x, y)$ by the discriminator, while the second term, $\log(1 - D(x, G(x)))$, penalizes the discriminator when it incorrectly classifies synthetic pairs as target domain samples. At the same time, the second term encourages the generator to produce more realistic synthetic samples. The adversarial loss stimulates realism in the synthetic data produced, while the reconstruction loss, $L_{L1}(G)$, ensures a mapping that preserves the structural and semantic properties of $x$.
$L_{L1}(G) = \mathbb{E}_{x, y \sim p_{\text{data}}}[\| y - G(x) \|_1]$ (3)
The $L_{L1}(G)$ loss from Equation (3) measures the absolute difference between the generated image $G(x)$ and the ground truth $y$ at the pixel level. This loss component penalizes $G$ so that the structural and semantic similarities of the target images are retained in the synthetic predictions $G(x)$.
In Equation (1), the $\lambda$ (lambda) hyperparameter associated with the $L_{L1}(G)$ loss controls the trade-off between the adversarial and reconstruction losses. A larger $\lambda$ value prioritizes accurate reconstructions, while a smaller $\lambda$ value gives more weight to the adversarial loss (centered on “fooling” the discriminator). Identifying a balanced $\lambda$ value is important, as a dominant adversarial loss generally leads to synthetic data with less semantic context, while an overly dominant $L_1$ loss can lead to blurry outputs. A $\lambda$ value of 100 is applied for training the standard Pix2Pix model [9].
The discriminator maximizes both terms of Equation (2) to improve its accuracy on real data, $D(x, y)$, and fake data, $D(x, G(x))$, while the generator minimizes the second term of Equation (2) to generate data that $D$ misclassifies as “real” and combines it with the reconstruction $L_1$ loss. It is also important to note that the generator does not have access to the target domain data and only iteratively improves its predictions using the feedback received from the discriminator and the reconstruction loss. A simplified illustration of this cGAN training procedure for I2I translation is presented in Figure 1.
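To make the objective in Equations (1)–(3) concrete, the sketch below shows how the two loss terms can be computed with TensorFlow (the framework used for the implementation in Section 7). The tensor names, the use of logits, and the helper functions are illustrative assumptions rather than the authors' implementation; lam = 100 corresponds to the standard Pix2Pix setting discussed above.

```python
# Illustrative sketch (not the published implementation) of the combined cGAN
# objective from Equations (1)-(3), assuming discriminator outputs are logits.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(disc_real_output, disc_fake_output):
    # log D(x, y): real pairs should be classified as 1
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    # log(1 - D(x, G(x))): synthetic pairs should be classified as 0
    fake_loss = bce(tf.zeros_like(disc_fake_output), disc_fake_output)
    return real_loss + fake_loss

def generator_loss(disc_fake_output, generated, target, lam=100.0):
    # Adversarial term: G tries to make D classify (x, G(x)) as real
    adv_loss = bce(tf.ones_like(disc_fake_output), disc_fake_output)
    # Reconstruction term: pixel-wise L1 distance to the target map
    l1_loss = tf.reduce_mean(tf.abs(target - generated))
    return adv_loss + lam * l1_loss
```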

3. Related Work

GAN models, introduced by Goodfellow et al. [10] in 2014, have evolved into specialized models applicable to remote sensing tasks [11]. One example is the Deep Convolutional Generative Adversarial Network (DCGAN) [12], which features convolutional neural networks with architectural constraints (e.g., fractionally-strided convolutions for upsampling) that enable extracting representations from unlabeled data. One of the more popular applications of GAN models in remote sensing is image super-resolution [13].
Conditional GANs [14] extend this type of model by applying a condition to the input to produce synthetic data. Pix2Pix is a conditional GAN introduced by Isola et al. [9] as an extension of the GAN architecture for the task referred to as image-to-image (I2I) translation. It features a network based on U-Net [15] as the generator, and the PatchGAN architecture (which evaluates image patches of 32 × 32 for computing the loss function) as the discriminator network.
The I2I translation task involves modeling a function to map an image tensor from a source domain to a target domain, and it is considered one of the most interesting cGAN applications for geospatial element extraction. However, its outputs are most often assessed qualitatively and are not typically used for direct mapping (e.g., turning aerial images into a map). This model evolved in recent years with the introduction of CycleGAN [16], DiscoGAN [17], and DualGAN [18], but cGANs for image generation are still known to deliver more inconsistent, blurry, and distorted content compared to SS models [19].
Road extraction from satellite and aerial imagery is an essential task for many applications, including urban planning, traffic control, and autonomous driving. In recent years, DL techniques have shown remarkable performance for this extraction task, but most works approach the road surface area extraction task and apply SS operations to obtain the road representations. However, it is considered that more fine-grained road representations (such as pavement-level road markings) are required for safer navigation of intelligent transportation systems and autonomous vehicles, or to improve the decision-making process in smart city planning [20].

3.1. cGANs for Augmentation of Road Extraction Data

One of the most significant GAN and cGAN applications for road extraction is the augmentation of the training dataset with synthetic samples. In this line, Hu et al. [21] introduced WSGAN, a weakly supervised framework that generates additional mapping images synthetically (using data from public datasets). The binary images are then post-processed through dilation and erosion methods to improve road extraction from remote sensing images and achieve IoU scores of 0.84.
Similarly, Zhao et al. [22] introduced the AU3-GAN model to augment training road data from historical maps, and subsequently employed a UNet++ network [23] with an attention mechanism to extract roads on the augmented dataset, achieving a 7.6% increase in the mean IoU score over convolutional neural network-based methods. The main limitations are the confusion of roads with contours, rivers, or administrative boundaries, as well as the scarcity and inconsistency of training data in historical maps.
Lu et al. [24] proposed a two-step road extraction method that leverages Pix2Pix to model road images affected by tree occlusions and produce synthetic data where the occlusion issue is reduced. The second step involves using the generated data to train a model, called FusionLinkNet (based on the fusion of the DenseNet [25], ResNet [26], and LinkNet [27]). This approach delivered IoU scores for the positive class of 0.65 and 0.74, on the DeepGlobe [28] and Massachusetts roads [29] datasets, respectively, demonstrating significant improvement in extracting occluded regions.

3.2. Hybrid cGANs for Road Extraction

Another important line of research is centered on applying variations in the GAN architecture to obtain more advanced, hybrid models (for example, featuring multiple generators or multiple discriminators) or by incorporating new learning modules like attention gates that can impact the stability of the training or the quality of the generated data.
In this context, Chen et al. [30] proposed SW-GAN to tackle the challenge of large, elaborately labeled training datasets required for accurate road extraction from remotely sensed imagery. The framework consists of three components: (1) a weakly supervised generator trained on centerline road data from OpenStreetMap, (2) a fully supervised generator trained with both high-quality pixel-wise annotations and the features learned by the weakly supervised generator, and (3) a discriminator trained to distinguish whether the data comes from the high quality labelled data distribution or was generated by the fully supervised generator. The generators are based on ResUnet [31], while the discriminator is a fully convolutional network similar to the PatchGAN architecture used in Pix2Pix’s discriminator. The method produced realistic road network representations.
Zhang et al. [32] designed the MsGAN model featuring two discriminators, each coupled with four “sub-discriminators” that rescale the input to improve the extraction of roads with varying widths, and achieve improved connectivity through multiscale feature fusion. This model was trained with road spectral and topology features and provided an improved completeness of the extracted road network compared to other DL-based methods.
Kyslytsyna et al. [33] proposed a cGAN-based method, ICGA, featuring attention gates added to Pix2Pix’s U-Net-like generator that help detect cracks in the road surface. The method involves separately training the model on (1) data where all pixels outside the roads are treated as noise, and (2) road surface crack data to better capture the distribution and achieve a more accurate shape of the target object.

3.3. Topology-Aware GANs for Road Extraction

As the field advanced, researchers introduced topology-aware GAN models featuring graph components capable of extracting the road network by applying topological constraints to preserve the road structures, while improving continuity and connectivity.
Zhang and Li [34] proposed McGAN, composed of two discriminators (one to process spectral information, and another to improve the topology of the road network), to obtain more complete road network graphs.
Lin et al. [35] proposed MS-AGAN, a GAN model that employs multiscale information fusion and an asymmetric encoder-decoder. The model also introduces topological constraints and uses a structural similarity loss, but challenges remain in extracting roads in areas with occlusions or background clutter associated with urban-rural transitions (resulting in reduced connectivity and usability), as well as the high computational budget required to run the model.
Hartmann et al. trained a GAN model [36] to enrich regions where discontinuities are present (such as intersections or highway ramps) using data from OpenStreetMap. Similarly, Liu et al. [37] proposed TPEGAN, a road extraction approach that combines a segmentation model with a GAN model to enhance road pixels through adversarial techniques and improve prediction consistency. To capture dependencies between different spatial regions and improve road network connectivity, graph convolutions were applied subsequently. The technique achieved IoU scores around 0.66, an improvement of approximately 5% over other widely used methods.

3.4. Other Challenges

Recently, Bai et al. [38] similarly recognized the critical role of attention mechanisms for road extraction from aerial imagery and proposed RISENet, which incorporates the MCSA attention mechanism based on feature fusion. The architecture is composed of a dual-branch fusion encoder, an attention component in which channel features are spatially fused, and a dilation-aware feature decoder. This design enabled a better integration of global and local information, resulting in strong performance across quantitative indicators and robustness in visual quality assessment.
Another well-recognized challenge in SS is class imbalance (which becomes even more pronounced in the case of road marking extraction), and is usually approached with specialized loss functions such as the Dice loss [39]. Dice is a region-based loss that optimizes the overlap between the predicted and ground truth segments, making models less sensitive to the class distribution. Other proposed functions are the Boundary loss [40] designed for improving the segmentation of object boundaries, or the Focal loss [41] that modifies the standard cross-entropy loss to reduce the weight of well-classified examples, allowing the model to focus on the harder (often minority) class samples.
Nonetheless, loss functions that address this challenge by focusing on the morphological properties of the objects (such as the thinness of the road structures) are yet to be fully explored, particularly in the case of cGAN models. A promising direction is the clDice loss [42], developed for the segmentation of tubular structures in medical imaging, which can adapt well to the thin, linear nature of road markings.
The challenges are aligned with broader efforts in thin structure extraction across computer vision domains such as medical imaging (blood vessels, nerve fibers, etc.), industrial inspection (concrete crack detection or weld seam analysis), or document analysis (handwritten or formatted lines). All these tasks have in common thin, elongated structures that are relevant for medical diagnosis, understanding of biological processes, ensuring structural integrity, or digitizing and understanding structured documents—similar to the features present in mapping and infrastructure monitoring through road marking extraction.

4. Data: RoadMark-Binary

The source domain data is based on the latest available information from PNOA (National Plan for Aerial Orthophotography, Spanish: “Plan Nacional de Ortofotografía Aérea”) [43], produced by the National Geographical Institute of Spain (a public agency).
According to the producer, the raw image data was captured with high-resolution, calibrated digital cameras and standardized photogrammetric acquisition procedures. The raw data is acquired at the highest possible frequency (at intervals of no more than two years) under optimal methodological conditions (clear skies, early morning hours), usually in late spring, when vegetation is stable and visibility and lighting conditions are favorable.
Following acquisition, the raw aerial data is orthorectified and radiometrically and topographically corrected with the same standardized procedures throughout the national territory. The processed PNOA data and metadata are then uploaded to the open repository “Download Center” of the National Center for Geographic Information (CNIG [44]) and distributed under a CC-BY 4.0 license. The aerial RGB PNOA orthoimages used in this work feature a spatial resolution of 15 cm per pixel and are referenced in the ETRS89 Geodetic Reference System.
The target domain consists of ground-truth masks containing maps of the road line markings within aerial imagery. This mask was manually tagged using PNOA images throughout the territory (the labeled areas are presented in Figure 2) into two binary classes (“Background” and “Road_Marking”). Geospatial features that were considered as “road markings” are the continuous and discontinuous road lines or marked lines defining highway entrances or exits found on the pavement (Figure 3). Pixels belonging to the positive class were assigned a pixel value of 255 across the three image channels, with the rest of the pixels assigned to the “Background” class and labeled with a value of 0 in the three channels.
For the ground truth masks, a width of three pixels was used for labeling the positive class, as the visual inspection showed it provided the best alignment with the actual lines in the orthophotos. Considering the spatial resolution of the data, this choice is coherent with the average dimensions of the considered representative road lines marked on highways. This labeling strategy was chosen based on a combination of empirical, technical, and operational criteria, as it also enabled a faster and more consistent labeling procedure. Finally, an independent operator reviewed the quality of the labels and eliminated those with interpretation errors or significant inconsistencies.
Following recommendations from [45], aerial orthoimages and the corresponding ground truth masks were cropped into tiles of size 256 × 256 × 3 with an overlap of 12.5%, which helps the models better process edge information. A 25 m buffer around the road marking lines was applied to define the region of interest, and areas of the image outside the buffer were automatically filled with black across the three channels. To avoid processing images with extreme class imbalance, pairs of ground truth masks and image tiles containing fewer than 40 pixels of the positive class were discarded. The resulting training set contains 27,360 image tiles and covers a land surface of 40.34 km2 of representative scenarios; random samples from the training set can be found in Figure 3.
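As an illustration of this tiling strategy, the following sketch (a simplified approximation, assuming the orthoimage mosaic and its mask are already loaded as NumPy arrays of equal size, with the hypothetical tile_pair helper applied per mosaic) crops 256 × 256 tiles with 12.5% overlap and discards tiles with fewer than 40 positive pixels.

```python
# Simplified sketch of the tiling procedure described above; tile size, overlap
# and the 40-pixel positive threshold follow the values given in the text.
import numpy as np

def tile_pair(image, mask, tile=256, overlap=0.125, min_positive=40):
    stride = int(tile * (1 - overlap))  # 12.5% overlap -> stride of 224 pixels
    pairs = []
    for r in range(0, image.shape[0] - tile + 1, stride):
        for c in range(0, image.shape[1] - tile + 1, stride):
            img_tile = image[r:r + tile, c:c + tile]
            msk_tile = mask[r:r + tile, c:c + tile]
            # Discard tiles with extreme class imbalance (< 40 positive pixels)
            if np.count_nonzero(msk_tile[..., 0] == 255) >= min_positive:
                pairs.append((img_tile, msk_tile))
    return pairs
```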
For the test data, a novel area from the Madrid region was labeled in the same way (highlighted in red in Figure 2). This area was not included in the training data. The test aerial imagery and the corresponding labeled data were also divided into RGB tiles of 256 × 256 pixels, but without overlap [46]. The resulting test set contains 2045 tiles and covers an area of 3.02 km2.
Table 1 summarizes the data size and distribution, where a high class imbalance can be identified: the positive class covers approximately 5% of the ground truth mask pixels and is highly underrepresented due to the nature of the object considered. Considering the sample size and following the Central Limit Theorem [47] (which states that the sampling distribution of the mean approximates a normal distribution as the sample size becomes larger, regardless of the actual distribution of the population), it was assumed that the training and test data follow a normal distribution. The training and testing data cover 43.36 km2 and are openly available under a CC-BY 4.0 license in a Zenodo repository [8].

5. Proposed RoadMark-cGAN Model

The RoadMark-cGAN model is inspired by Pix2Pix [9]. However, given the specific task of road marking extraction, with highly imbalanced classes, the model differs significantly from the Pix2Pix implementation. These differences include the architectures of the generator and discriminator, their underlying components, and the training procedure adopted to ensure convergence.
The structural components (Section 5.1), the generator and discriminator networks (Section 5.2 and Section 5.3, respectively), and the training procedure of RoadMark-cGAN (Section 5.4) are detailed next.

5.1. Processing Components

5.1.1. Residual Block

The residual block component consists of two convolutional layers with a kernel size of 3 × 3, with padding applied to maintain the spatial dimensions. The weights of each convolutional layer are initialized from a Gaussian distribution $N(0, 0.02)$ (mean of 0 and a standard deviation of 0.02), and the ReLU activation function [48] is applied after them to introduce nonlinearity in the training procedure.
Batch Normalization [49] was also applied to the convolutional layers to regularize the network and accelerate convergence. Additionally, a dropout layer [50] with a default rate of 0.3 is integrated between the two convolutional layers to reduce overfitting by randomly deactivating 30% of neurons during training.
Finally, the residual connection [26], specific to residual blocks, skips over the two convolutional layers. This enables the direct flow of unchanged information from the input tensor and facilitates learning of the residual mapping that emphasizes feature refinement. The residual path also provides regularization and can be suitable for dense feature learning tasks such as road marking extraction. Unlike RoadMark-cGAN, Pix2Pix does not feature residual connections; each layer processes the input sequentially, and the model relies on the skip connections between the encoder and decoder.
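A minimal Keras sketch of such a residual block is shown below. The exact layer ordering is an assumption based on the description above, and the input is expected to have the same number of channels as filters (as in the 512-filter bottleneck) so that the residual addition is valid.

```python
# Sketch of a residual block: two 3x3 convolutions with N(0, 0.02) initialization,
# batch normalization, ReLU, dropout of 0.3 and a residual (skip) connection.
import tensorflow as tf

def residual_block(x, filters, dropout_rate=0.3):
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    shortcut = x  # residual path carrying the unchanged input
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Dropout(dropout_rate)(y)  # randomly deactivates 30% of activations
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    return tf.keras.layers.Add()([shortcut, y])   # skip over the two convolutions
```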

5.1.2. Attention Block

The attention block integrates a convolutional layer with a kernel size of 1 × 1 and padding, combined with spectral normalization to reduce the dimensionality of the features and generate an attention map. Spectral normalization [51] was shown to stabilize adversarial training by constraining the Lipschitz constant of the convolutional layers and preventing mode collapse.
The sigmoid activation function applied afterwards normalizes the output of this convolution to values between 0 and 1. These values represent the attention weights and enable the network to give more importance to spatial regions with higher relevance to the task and dynamically amplify or suppress features.
To enhance generalization, dropout with a default rate of 0.3 is also applied to the attention map, while Layer Normalization [52] ensures stability and scale invariance in feature processing. The combination of these operations allows the attention block to focus on critical spatial features and to discard noise, improving the quality of feature representation.
The output map is multiplied elementwise with the input features, amplifying the features in salient regions while suppressing irrelevant information. In contrast, Pix2Pix lacks attention mechanisms that explicitly focus on important regions or suppress noise; instead, it relies on convolutional operations to implicitly learn feature importance.
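The following sketch illustrates one possible Keras implementation of the attention block as described above. The placement of the dropout and layer normalization operations is assumed from the text, and tf.keras.layers.SpectralNormalization (available in recent TensorFlow releases) is used as the spectral normalization wrapper.

```python
# Sketch of the attention block: a spectrally normalized 1x1 convolution produces
# an attention map, passed through a sigmoid and dropout, multiplied element-wise
# with the input, and followed by layer normalization (ordering is an assumption).
import tensorflow as tf

def attention_block(x, dropout_rate=0.3):
    attn = tf.keras.layers.SpectralNormalization(
        tf.keras.layers.Conv2D(x.shape[-1], 1, padding="same")
    )(x)
    attn = tf.keras.layers.Activation("sigmoid")(attn)  # attention weights in [0, 1]
    attn = tf.keras.layers.Dropout(dropout_rate)(attn)
    out = tf.keras.layers.Multiply()([x, attn])          # amplify salient regions
    return tf.keras.layers.LayerNormalization()(out)     # stability / scale invariance
```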

5.1.3. Downsample Block

The downsample component reduces the spatial resolution of the input features by a factor of two, while increasing the number of feature channels, enabling the network to capture more abstract, high-level features as the depth increases. To achieve this, a convolutional layer with a kernel size of 4 × 4 and a stride of 2 with zero padding is applied. Its weights are randomly initialized from a Gaussian distribution $N(0, 0.02)$.
Afterwards, spectral normalization regularizes the convolutional weights to improve training stability in the adversarial setup. Optional batch normalization can also be applied to reduce internal covariate shifts (particularly recommended in the case of the generator).
Leaky ReLU [53] introduces nonlinearity afterwards while preventing the “dying neuron” problem of ReLU by allowing small gradients for negative values, and was applied to achieve a more stable training behavior (as recommended in [12]). The configuration of the block promotes efficient feature extraction while retaining important details for postprocessing.
The downsample path of the original Pix2Pix model also uses standard 4 × 4 convolutions with a stride of 2, followed by batch normalization and Leaky ReLU activation, but does not employ spectral normalization, and batch normalization is applied uniformly across all layers to stabilize training and normalize activations.
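A possible Keras realization of RoadMark-cGAN's downsample block, following the description above (spectrally normalized 4 × 4 convolution with stride 2, optional batch normalization, and Leaky ReLU), is sketched below; the Leaky ReLU slope is an assumption.

```python
# Sketch of the downsample block described in Section 5.1.3.
import tensorflow as tf

def downsample(x, filters, apply_batchnorm=True):
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    y = tf.keras.layers.SpectralNormalization(
        tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same",
                               kernel_initializer=init, use_bias=False)
    )(x)
    if apply_batchnorm:  # recommended for the generator to reduce covariate shift
        y = tf.keras.layers.BatchNormalization()(y)
    return tf.keras.layers.LeakyReLU(0.2)(y)  # small gradient for negative values
```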

5.1.4. Upsample Block

The upsampling component restores the resolution of the feature maps by a factor of two using a transposed convolution with a kernel size of 4 × 4 and a stride of 2. Spectral normalization is applied to the weights of the transposed convolution to stabilize the adversarial training.
Normalization is also enhanced, as Batch Normalization and Layer Normalization are combined to regularize the feature distribution and ensure more robust scaling during upsampling. Finally, ReLU introduces non-linearity in the processing block. This architectural design is expected to help preserve the fine details that are necessary to reconstruct accurate outputs.
In comparison, the upsample block in the original Pix2Pix also uses transposed convolutions of 4 × 4 to recover the spatial resolution of the input, followed by ReLU activation and batch normalization, and relies on skip connections from the encoder to combine high-level features with preserved spatial details.
In contrast, RoadMark-cGAN’s upsample implementation features enhanced regularization in terms of layer and spectral normalization, combined with batch normalization applied to the transposed convolution weights. Added alongside batch normalization, layer normalization is also applied in RoadMark-cGAN to independently normalize each feature map and improve robustness and stability for smaller batch sizes. These modifications can help to obtain a more robust upsampling mechanism that preserves fine details even under the challenging class imbalance conditions present in training.
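A corresponding sketch of RoadMark-cGAN's upsample block, again assumed from the description above rather than taken from the published code, could look as follows.

```python
# Sketch of the upsample block: spectrally normalized 4x4 transposed convolution
# (stride 2) with combined batch and layer normalization, followed by ReLU.
import tensorflow as tf

def upsample(x, filters):
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    y = tf.keras.layers.SpectralNormalization(
        tf.keras.layers.Conv2DTranspose(filters, 4, strides=2, padding="same",
                                        kernel_initializer=init, use_bias=False)
    )(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.LayerNormalization()(y)  # per-sample normalization for small batches
    return tf.keras.layers.ReLU()(y)
```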
The processing components introduced in Section 5.1 are summarized in Table 2.

5.2. Generator G

The Generator G network (Table 3) is responsible for synthesizing the target domain distribution and generating similar synthetic data. In the generator’s encoder, the input is passed through four downsampling blocks (described in Section 5.1.3) that progressively reduce the width and height dimensions (from 256 × 256 to 8 × 8) until reaching a functional bottleneck.
In the functional bottleneck, the tensor is passed through a loop of four consecutive residual (described in Section 5.1.1) and attention blocks (described in Section 5.1.2), each with 512 filters, to focus on feature refinement (residual blocks) and to dynamically emphasize spatially important regions in the input features (attention blocks).
In the “functional bottleneck”, the number of channels does not change, and the feature maps have the smallest spatial dimensions (8 × 8). Here, the compressed spatial information is processed with the residual and attention blocks, whose repeated application forces the network to learn highly abstract representations and allows iterative refinement. This processing is expected to enhance the model’s ability to focus on regions containing road markings while reducing interference from irrelevant background features.
Afterwards, the width and height of the tensor are upsampled with four upsample blocks (defined in Section 5.1.4) to 128 × 128. In this part, as in U-Net [15], skip connections between the corresponding encoder and decoder stages were added to further enable the sharing of information across the network and avoid the loss of low-level information.
The output layer is a transposed convolution of 4 × 4 with a stride of 2, where the weights are randomly initialized from a Gaussian distribution $N(0, 0.02)$. Tanh activation is applied to scale the output pixel values to the range [−1, 1] (similar to Pix2Pix) and to obtain the generator’s RGB output tensor of 256 × 256 × 3.
The arrangement of processing blocks in the functional bottleneck was adopted after an iterative process of experimentation, as it was observed that the sequence of operations within the generator significantly influenced training stability and performance in the intermediate experiments.
The generator network of RoadMark-cGAN features approximately 36.5 million parameters, representing a 33% decrease in the number of parameters when compared to the 54.5 million parameters of Pix2Pix’s generator.
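For orientation, the sketch below assembles a generator with the overall topology described in this section from the component sketches in Section 5.1 (downsample, residual_block, attention_block, upsample). The number of encoder stages and the per-stage filter counts are assumptions chosen so that the feature maps reach the 8 × 8, 512-filter bottleneck and the 256 × 256 × 3 tanh output stated in the text; the published architecture (Table 3) may differ in these details.

```python
# Illustrative generator topology sketch built from the earlier component sketches.
import tensorflow as tf

def build_generator():
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    inp = tf.keras.Input(shape=(256, 256, 3))

    # Encoder: downsampling stages chosen to reach the 8 x 8 bottleneck
    skips, y = [], inp
    for filters in (64, 128, 256, 512, 512):   # 256 -> 128 -> 64 -> 32 -> 16 -> 8
        y = downsample(y, filters)
        skips.append(y)

    # Functional bottleneck: four consecutive residual + attention blocks (512 filters)
    for _ in range(4):
        y = residual_block(y, 512)
        y = attention_block(y)

    # Decoder: upsample back to 128 x 128 with U-Net-style skip connections
    for filters, skip in zip((512, 256, 128, 64), reversed(skips[:-1])):
        y = upsample(y, filters)
        y = tf.keras.layers.Concatenate()([y, skip])

    # Output: 4 x 4 transposed convolution (stride 2) with tanh -> 256 x 256 x 3 in [-1, 1]
    out = tf.keras.layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                          activation="tanh", kernel_initializer=init)(y)
    return tf.keras.Model(inp, out, name="RoadMark_cGAN_generator")
```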

5.3. Discriminator D

The discriminator network (Table 4) receives two inputs (the pairs $(x, G(x))$ and $(x, y)$) and analyzes the distribution of samples to determine whether the data is generated (fake) or comes from the target domain (real). This input passes through three downsample blocks and a convolutional block with spectral normalization and Leaky ReLU as the activation function. Afterwards, the tensor is processed through an attention block with 512 filters.
The discriminator $D$ of RoadMark-cGAN is a modified PatchGAN (also called a Markovian discriminator, introduced by Isola et al. [9]) that traverses the image tensor with convolutions to classify each patch of the image and provides a final prediction by averaging the patch responses. All discriminator stages, except for the input, incorporate batch normalization. Instead of pooling operations, strided convolutions (with a stride of 2) and zero padding are utilized. In addition, the modified PatchGAN also features an attention block with a dropout rate of 0.3.
The output is a single grid of 32 × 32 scalars (1024 predictions), where each scalar represents the “real” (1) or “fake” (0) prediction for a processed patch, and the discriminator learning is enabled by the data-driven cGAN loss, $L_{cGAN}(G, D)$.
Pix2Pix’s discriminator has approximately 2.8 million parameters while the discriminator network of RoadMark-cGAN features approximately 3 million parameters (a 7% increase, due to the addition of an attention processing block).
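A sketch of the modified PatchGAN discriminator described above, built on the earlier component sketches, is given below; the filter counts are assumptions, while the three stride-2 downsampling stages reduce the 256 × 256 input to the 32 × 32 patch prediction grid stated in the text.

```python
# Illustrative sketch of the modified PatchGAN discriminator (not the published code).
import tensorflow as tf

def build_discriminator():
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    source = tf.keras.Input(shape=(256, 256, 3))   # orthoimage tile x
    target = tf.keras.Input(shape=(256, 256, 3))   # y or G(x)
    y = tf.keras.layers.Concatenate()([source, target])

    y = downsample(y, 64, apply_batchnorm=False)   # no batch norm on the input stage
    y = downsample(y, 128)                         # 256 -> 128 -> 64 -> 32
    y = downsample(y, 256)

    y = tf.keras.layers.SpectralNormalization(
        tf.keras.layers.Conv2D(512, 4, padding="same", kernel_initializer=init)
    )(y)
    y = tf.keras.layers.LeakyReLU(0.2)(y)

    y = attention_block(y)                         # 512-filter attention stage

    # One logit per 32 x 32 patch ("real" vs "fake")
    out = tf.keras.layers.Conv2D(1, 4, padding="same", kernel_initializer=init)(y)
    return tf.keras.Model([source, target], out, name="RoadMark_cGAN_discriminator")
```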

5.4. Training Procedure

The training procedure of the RoadMark-cGAN model is illustrated in Figure 1, and its objective functions are mathematically detailed in Section 2. RoadMark-cGAN training involves a random weight initialization from a Gaussian distribution with a mean of 0 and a standard deviation of 0.02, $N(0, 0.02)$, for both $G$ and $D$ to introduce variability in the optimization process and ensure that unique features are learned.
An orthoimage tile $x$ and its corresponding road marking map $y$ are sampled from the training data, and data augmentation in the form of random rotation by 90, 180, and 270 degrees is applied identically to both images. The source image $x$ is processed through the generator architecture described in Section 5.2 to output $G(x)$. The generator’s output is then concatenated with $x$ to form the pair $(x, G(x))$, which is fed to the discriminator network.
The discriminator architecture (Section 5.3) receives $(x, G(x))$ and the source image $x$ concatenated with the target image $y$, $(x, y)$, and processes them to obtain patches of 32 × 32 that are classified as “real” (label 1) for $(x, y)$ or “fake” (label 0) for the $(x, G(x))$ pair. This is used to calculate the adversarial loss, $L_{cGAN}(G, D)$, with the discriminator trained to distinguish between these two classes and maximize its ability to correctly classify the real and fake data.
$L_{cGAN}(G, D)$ penalizes $D$ for incorrectly classifying $(x, y)$ as fake or $(x, G(x))$ as real, and the gradients from this loss are only used to update the discriminator’s weights (only the discriminator is trained directly on real and generated images).
Next, $G(x)$ is compared to the ground truth $y$ of the source image $x$ to compute the L1 loss component as the pixel-wise mean absolute error between the generated image $G(x)$ and the target image $y$. The loss of $G$ has two components: (1) the adversarial loss, which encourages $G$ to produce realistic synthetic data that $D$ classifies as real, and (2) the L1 loss (reconstruction loss), which encourages $G$ to deliver road marking representations that are more structurally similar to the target $y$ [54].
Therefore, the RoadMark-cGAN model is updated using a combined objective: the weighted sum of the adversarial loss provided by the discriminator and the L1 loss. The authors of Pix2Pix proposed an empirical weighting of 100:1 in favor of the L1 loss, but our initial experiments showed a significant performance improvement when increasing $\lambda$ to 500. The resulting loss function of RoadMark-cGAN is presented in Equation (4).
$L_{cGAN}(G, D) + \lambda \times L_{L1}(G), \text{ where } \lambda = 500$ (4)
During backpropagation, the weights are repeatedly updated to minimize the cGAN loss, with the model alternating between training the discriminator and the generator. The gradient of the output of the discriminator network with respect to the synthetic data informs how closely the generated data matches the target domain.
During generator training, gradients from the adversarial loss are propagated to the generator through D (whose weights are frozen) to inform G how to adjust its weights to produce outputs that the discriminator classifies as “real” (with a low probability of being fake) [11]. The intuition is that, over time, the generator is forced to create data that are as close as possible to the distribution of the target domain.
The optimal state is reached when the converged G parameters can reproduce the real data distribution, and the discriminator cannot detect the difference. The simplified pseudocode of the training procedure of the RoadMark-cGAN model is presented in Algorithm 1 with the notation used in this section.
Algorithm 1: Training procedure for RoadMark-cGAN
Input: Orthoimage tile ($x$), road marking map ($y$)
Output: Synthetic data $G(x)$ containing the corresponding road marking map produced by $G$
1. Initialize $G$ and $D$ with random weights from a Gaussian distribution, $N(0, 0.02)$.
2. For $i = 1$ to $n$ epochs do:
3.   Sample input image $x$ (orthoimage tile) and the corresponding target image $y$ (road marking map).
4.   Generate the predicted map, $G(x)$, using the generator $G$.
5.   Concatenate input $x$ with both $G(x)$ and $y$.
6.   Process pairs $(x, G(x))$ and $(x, y)$ through the discriminator $D$.
7.   Compute the discriminator loss:
       • The loss is computed at patch level (32 × 32).
       • $D$ minimizes the classification error for real pairs $(x, y)$ predicted as “fake” and synthetic pairs $(x, G(x))$ predicted as “real”.
8.   Compute the adversarial loss, $L_{cGAN}(G, D)$, for the generator.
       • $G$ minimizes the difference between $D$’s predictions for $(x, G(x))$ and the real “labels” derived from $y$.
9.   Get $D$’s prediction for $G(x)$ and $y$.
10.  Compute the L1 loss, $L_{L1}(G)$:
       • $G$ minimizes pixel-wise differences between $G(x)$ and $y$.
11.  Backpropagate and update weights:
       • Update $D$’s weights using gradients from $L_{cGAN}(G, D)$.
       • Update $G$’s weights using gradients from $L_{cGAN}(G, D)$ and $L_{L1}(G)$.
12. End for
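The sketch below condenses Algorithm 1 into a single TensorFlow training step, using the loss formulation of Equation (4) with λ = 500. It assumes the generator and discriminator builders sketched earlier and two pre-built Adam optimizers, and is a simplified approximation of the authors' training loop rather than the published implementation.

```python
# Simplified sketch of one adversarial training step (Algorithm 1, Equation (4)).
import tensorflow as tf

LAMBDA = 500.0
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(generator, discriminator, gen_opt, disc_opt, x, y):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        g_x = generator(x, training=True)
        d_real = discriminator([x, y], training=True)     # pair (x, y)
        d_fake = discriminator([x, g_x], training=True)   # pair (x, G(x))

        # Discriminator loss: classify real pairs as 1 and synthetic pairs as 0
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

        # Generator loss: adversarial term + weighted L1 reconstruction term
        g_adv = bce(tf.ones_like(d_fake), d_fake)
        g_l1 = tf.reduce_mean(tf.abs(y - g_x))
        g_loss = g_adv + LAMBDA * g_l1

    # Only D is updated with the discriminator loss; G with its combined loss
    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    gen_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    return g_loss, d_loss
```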
The model is trained using minibatch stochastic gradient descent with the Adam optimizer [55], where exponential decay was applied to its learning rate. Exponential decay [56] is a scheduling technique that gradually reduces the learning rate during training, which proved to help the model converge more smoothly; the learning rate decreases according to Equation (5).
$\eta_t = \eta_0 \times \text{decay\_rate}^{\lfloor t / \text{decay\_steps} \rfloor}$ (5)
In Equation (5), $\eta_t$ denotes the learning rate at time step $t$, $\eta_0$ is the initial learning rate, decay_rate is the factor by which the learning rate decreases, and decay_steps represents the number of steps after which the decay is applied. The decay is applied in discrete steps, meaning the learning rate remains constant within each decay_steps interval and only updates at those discrete steps.
In the RoadMark-cGAN setup, $G$ starts with $\eta_0 = 0.0001$ and $D$ starts with $\eta_0 = 0.0002$. Different starting learning rates were applied for the generator and the discriminator because previous studies [57] showed that a slightly higher discriminator learning rate can improve GAN convergence.
The decay_rate is set to 0.98 (a 2% reduction per step) and decay occurs every 1000 gradient update steps. To further help stabilize RoadMark-cGAN’s training, the influence of the first momentum estimate, $\beta_1$, was reduced from 0.9 to 0.5 for both $G$ and $D$, while the second momentum estimate, $\beta_2$, remained 0.999 (similar to [6]).
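The optimizer configuration described above can be expressed with the standard Keras staircase exponential-decay schedule, as in the following sketch.

```python
# Sketch of the optimizer setup: Adam with beta_1 = 0.5, beta_2 = 0.999 and a
# staircase exponential decay (2% reduction every 1000 update steps).
import tensorflow as tf

def make_optimizer(initial_lr):
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=initial_lr,
        decay_steps=1000,      # decay applied every 1000 gradient updates
        decay_rate=0.98,       # 2% reduction per decay step
        staircase=True)        # rate stays constant within each interval
    return tf.keras.optimizers.Adam(learning_rate=schedule, beta_1=0.5, beta_2=0.999)

gen_opt = make_optimizer(0.0001)   # generator starts at 1e-4
disc_opt = make_optimizer(0.0002)  # discriminator starts at 2e-4
```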

6. Improved RoadMark-cGAN Variants

The RoadMark-cGAN was further improved in variants v2 to v4, where dynamic dropout was applied, and a new loss function was proposed, together with the simultaneous use of the two mentioned techniques. The training features of the improved RoadMark-cGAN-v2 to v4 variants are presented in Section 6.1, Section 6.2 and Section 6.3.

6.1. RoadMark-cGAN-v2 (Dynamic Dropout)

In highly imbalanced datasets, models often become biased towards the majority class. For this reason, RoadMark-cGAN features a dropout of 0.3 applied to the residual and attention blocks (as found in Table 2). RoadMark-cGAN-v2 introduces a more advanced dropout regularization with dynamic dropout operations.
Dynamic dropout involves dropout rates that are not fixed but vary with the training epoch. A dynamically changing dropout rate can improve the training at different learning stages: a higher dropout rate in early training stages encourages the model to explore a wider range of features, instead of focusing on those from the predominant “Background” class; while a lower dropout rate in later stages allows the model to fine-tune the features that improve the extraction of the minority “Road_Marking” class.
The dynamic dropout technique can be defined as follows. Let $E$ be the total number of training epochs and $e$ the current epoch number, with $0 \le e < E$. The dynamic dropout rates for the residual blocks (present only in $G$) and attention blocks (present in both $G$ and $D$) at epoch $e$, $d_{res}(e)$ and $d_{att}(e)$, are defined in Equations (6) and (7) as a linear decay from the initial rates $d_{res}^{initial}$ and $d_{att}^{initial}$ to the final rates $d_{res}^{final}$ and $d_{att}^{final}$.
$d_{res}(e) = d_{res}^{initial} - (d_{res}^{initial} - d_{res}^{final}) \times \frac{e}{E - 1}$ (6)
$d_{att}(e) = d_{att}^{initial} - (d_{att}^{initial} - d_{att}^{final}) \times \frac{e}{E - 1}$ (7)
In the case of RoadMark-cGAN-v2, the initial dropout rates $d_{res}^{initial}$ and $d_{att}^{initial}$ are 0.3 (the same as for the RoadMark-cGAN model), while the final dropout rates $d_{res}^{final}$ and $d_{att}^{final}$ are 0.01, with the total number of training epochs, $E$, set to 200.
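Equations (6) and (7) reduce to a simple linear interpolation, as in the sketch below; how the updated rate is injected into the Dropout layers at the start of each epoch is implementation-specific and omitted here.

```python
# Linear dropout decay from Equations (6) and (7) with the values used in v2.
def dynamic_dropout_rate(epoch, total_epochs=200, initial_rate=0.3, final_rate=0.01):
    """Linearly decay the dropout rate from initial_rate to final_rate."""
    return initial_rate - (initial_rate - final_rate) * epoch / (total_epochs - 1)

# Example: 0.30 at epoch 0, about 0.15 at epoch 100, 0.01 at epoch 199
rates = [dynamic_dropout_rate(e) for e in (0, 100, 199)]
```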

6.2. RoadMark-cGAN-v3 (Novel MBSCB Loss for $G$)

While the standard Pix2Pix loss is based on the adversarial and L1 reconstruction terms, the generator loss of RoadMark-cGAN-v3 is a Morphological Boundary-Sensitive Class-Balanced (MBSCB) loss based on four components.
The MBSCB loss combines (i) the adversarial loss $L_{cGAN}(G, D)$ defined in Section 2, (ii) a loss that accounts for the morphological thinness property inherent to the extracted object, (iii) the Dice loss, which optimizes the overlap between the predicted and ground truth segments, and (iv) an $L_{1_{raw}}$ loss applied to the raw output logits of $G$ (RoadMark-cGAN’s generator).
The morphological component of the MBSCB loss addresses the inherent thinness of the extracted object and is based on sensitivity to the boundaries of the predicted shapes. Complementing it, the Dice loss (region-based) further reinforces the accuracy and proper placement of these boundaries. This combination makes the model less sensitive to the overall class distribution and can be beneficial for highly imbalanced datasets. The addition of the $L_{1_{raw}}$ loss component further enforces a correct reconstruction in the synthetic data.
The MBSCB loss of RoadMark-cGAN-v3’s generator is presented in Equation (8), and its components are described below.
$L_{MBSCB}(G) = L_{cGAN}(G, D) + \lambda_{morph} \times L_{MorphThinness}(G(x)) + \lambda_{Dice} \times L_{Dice}(y, G(x)) + \lambda_{L1_{raw}} \times L_{L1_{raw}}(y, G(x))$ (8)
The $L_{cGAN}(G, D)$ loss component from Equation (8) is the standard adversarial loss (defined in Equation (2)) that encourages the generator $G$ to produce road marking maps $G(x)$ indistinguishable from real road marking maps $y$ in the context of the input orthoimage $x$, using feedback from the discriminator.
The $\lambda_{morph}$, $\lambda_{Dice}$, and $\lambda_{L1_{raw}}$ terms from Equation (8) represent the weight hyperparameters associated with the morphological thinness, Dice, and $L_{1_{raw}}$ loss components, respectively, and control the importance of each component in the total generator loss. In the case of RoadMark-cGAN-v3, these weight hyperparameters were assigned values of 50, 150, and 100, respectively. It is important to mention that these hyperparameter values were chosen empirically, and there is potential for optimizing them with further experimentation.
The morphological thinness loss component, $L_{MorphThinness}(G(x))$, defined in Equation (9), penalizes the thickness of the predicted road markings in $G(x)$ to emulate the expected thin structure of real road markings in $y$. This loss aims to enforce the morphological thinness property of road markings on $G(x)$ without directly comparing them pixel-wise to the ground truth.
$L_{MorphThinness}(G(x)) = \frac{\sum \left( \text{Dilation}(M_{G(x)}) - \text{Erosion}(M_{G(x)}) \right)}{\sum \text{Dilation}(M_{G(x)}) + \epsilon}$ (9)
In Equation (9), $M_{G(x)}$ represents the map of the generated road markings, binarized with a threshold value of 0.5. The morphological operations are therefore applied to the binarized version of $G(x)$ (the predicted road markings).
The dilation applied to $M_{G(x)}$ uses a 3 × 3 kernel of ones to expand the road pixels (resulting in slightly thicker road markings, as the boundaries are expanded), while the erosion operation, also using a 3 × 3 kernel of ones, shrinks the road pixels (the boundaries are contracted, resulting in slightly thinner road markings).
The morphological thinness loss uses the element-wise difference between these two maps; the sum of these differences indicates the thickness, as a thicker road marking will have more pixels in the dilated version than in the eroded version. The sum of these values approximates the number of boundary pixels, and this difference will be larger for thicker road lines.
Finally, a small constant $\epsilon$ of 0.00001 is added to the denominator to prevent division by zero when the predicted area is extremely small. The formula in Equation (9) therefore calculates a normalized measure of the “thickness” of the predicted road markings, and minimizing it encourages thinner road predictions.
The Dice loss component handles the overlap comparison of $G$’s predictions with the ground truth $y$ and encourages the improvement of the segmented road maps delivered by $G$, using the sigmoid $\sigma$ of $G(x)$ to obtain the probabilities for the Dice calculation, as presented in Equation (10).
$L_{Dice}(y, G(x)) = 1 - \frac{2 \times \sum \left( y \times \sigma(G(x)) \right) + \epsilon}{\sum y + \sum \sigma(G(x)) + \epsilon}$ (10)
In Equation (10), $y$ represents the ground truth map with pixel values scaled to the range [0, 1], while $G(x)$ represents the raw output logits of the generator for the input orthoimage $x$.
The denominator applies the sigmoid activation function, $\sigma$, to the generator’s output to convert it to probabilities in the range [0, 1] (the confidence that a pixel belongs to the “Road_Marking” class) and sums these probabilities together with the ground truth values, representing the total predicted and ground truth road marking areas.
In the numerator, an element-wise multiplication is performed between the ground truth mask and the predicted probability map; the resulting values are large only where both are high, which indicates an overlap of predicted and true road markings. Summing all values of this element-wise product measures the intersection between the predicted and ground truth road regions, weighted by the prediction probabilities, and multiplying it by 2 gives more weight to this intersection.
A small constant, $\epsilon = 1 \times 10^{-7}$, is added to both the numerator and the denominator to prevent division by zero and ensure numerical stability. The Dice loss component maximizes the Dice coefficient for a better overlap between the predicted and ground-truth road marking masks (ranging from 0 when there is no overlap to 1 when there is perfect overlap) and can handle class imbalance.
The final component, the $L_{1_{raw}}$ loss (Equation (11)), also handles the comparison with the ground truth and encourages the raw output of $G$ to have pixel-wise values closer to the scaled ground truth. This loss calculates the pixel-wise absolute difference between the ground truth $y$ (scaled to [−1, 1]) and the raw output logits of the generator $G(x)$, encouraging the generator’s raw output to be numerically close to the scaled ground truth.
$L_{L1_{raw}}(y, G(x)) = \frac{1}{N} \| y - G(x) \|_1$ (11)
In Equation (11), $G(x)$ represents the raw output of the generator for an input orthoimage $x$, while $y$ represents the corresponding ground truth mask with pixel values scaled to the range [−1, 1]. The term $\| y - G(x) \|_1$ computes the sum of absolute pixel differences between the ground truth mask and the raw output logits of $G$, which is divided by the total number of pixels $N$ to obtain the mean absolute error between $y$ and $G(x)$.
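The sketch below illustrates how the MBSCB components of Equations (8)–(11) could be implemented in TensorFlow, assuming y is scaled to [−1, 1] and g_x holds the raw generator logits. Max-pooling and min-pooling with a 3 × 3 window are used as stand-ins for dilation and erosion on the binary map, and the hard binarization is flagged as a point where a soft approximation may be needed in practice; the component weights follow the values given in the text.

```python
# Illustrative sketch of the MBSCB loss components (not the authors' exact code).
import tensorflow as tf

def morph_thinness_loss(g_x, eps=1e-5):
    # Binarize the prediction (threshold 0.5 on sigmoid probabilities).
    # NOTE: a hard threshold blocks gradients; a soft or straight-through
    # approximation may be required in practice.
    m = tf.cast(tf.sigmoid(g_x) > 0.5, tf.float32)
    dilated = tf.nn.max_pool2d(m, ksize=3, strides=1, padding="SAME")   # 3x3 dilation
    eroded = -tf.nn.max_pool2d(-m, ksize=3, strides=1, padding="SAME")  # 3x3 erosion
    # Normalized boundary measure: larger values indicate thicker predicted markings
    return tf.reduce_sum(dilated - eroded) / (tf.reduce_sum(dilated) + eps)

def dice_loss(y, g_x, eps=1e-7):
    y01 = (y + 1.0) / 2.0                    # rescale ground truth from [-1, 1] to [0, 1]
    p = tf.sigmoid(g_x)                      # predicted probabilities
    intersection = tf.reduce_sum(y01 * p)
    return 1.0 - (2.0 * intersection + eps) / (tf.reduce_sum(y01) + tf.reduce_sum(p) + eps)

def l1_raw_loss(y, g_x):
    return tf.reduce_mean(tf.abs(y - g_x))   # mean absolute error on raw logits

def mbscb_generator_loss(adv_loss, y, g_x,
                         lam_morph=50.0, lam_dice=150.0, lam_l1_raw=100.0):
    return (adv_loss
            + lam_morph * morph_thinness_loss(g_x)
            + lam_dice * dice_loss(y, g_x)
            + lam_l1_raw * l1_raw_loss(y, g_x))
```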
The training procedure for RoadMark-cGAN-v3 is presented in Algorithm 2.
Algorithm 2: Training procedure for RoadMark-cGAN-v3
Input: Orthoimage tile ($x$), road marking map ($y$)
Output: Synthetic data $G(x)$ containing the corresponding road marking map produced by $G$
1–9. Steps 1 to 9 are the same as those in “Algorithm 1: Training procedure for RoadMark-cGAN”.
10.  Compute the morphological thinness loss component, $L_{MorphThinness}(G(x))$.
       • Encourage thin road structures in the predictions using morphological operations.
11.  Compute the Dice loss component, $L_{Dice}(y, G(x))$.
       • Optimize the overlap with the target domain and handle class imbalance.
12.  Compute the $L_{1_{raw}}$ loss component, $L_{L1_{raw}}(y, G(x))$:
       • Minimize pixel-wise differences between the target domain and the raw output of $G$.
13.  Compute the total generator loss, $L_G^{RoadMark\text{-}cGAN\text{-}v3}$.
14.  Backpropagate and update weights:
       • Update $D$’s weights using gradients from $L_{cGAN}(G, D)$.
       • Update $G$’s weights using gradients from $L_G^{RoadMark\text{-}cGAN\text{-}v3}$.
15. End for

6.3. RoadMark-cGAN-v4

RoadMark-cGAN-v4 represents the simultaneous addition of the techniques used in RoadMark-cGAN-v2 (Section 6.1) and RoadMark-cGAN-v3 (Section 6.2) on the RoadMark-cGAN presented in Section 5. Therefore, it features both the dynamic dropout applied to the residual and attention block, d r e s i n i t i a l ,   d a t t n i n i t i a l of 0.3 and final dropout rates, d r e s f i n a l , d a t t f i n a l of 0.01, together with the Morphological Boundary-Sensitive Class-Balanced loss (MBSCB loss) defined in Equation (8) and applied using the procedure from Algorithm 2.
As noted, RoadMark-cGAN v2 to v4 are designed as ablation settings of RoadMark-cGAN, in which the dynamic dropout technique, the MBSCB loss, or a combination of both were progressively introduced. This design allowed us to study their contribution to performance, validate the effectiveness of the proposed methods, and identify the setting that yields the best results.

7. Experiments and Results

RoadMark-cGAN and RoadMark-cGAN v2 to v4 were implemented with the TensorFlow [58] DL library, version 2.19, and were trained for 200 epochs on the training set presented in Table 1.
For comparison and evaluation purposes, two popular SS models, U-Net [15] and U-Net [15] with an Inception-ResNet-v2 [59] backbone, were trained from scratch on the same training set. These SS models were implemented using the "Segmentation Models" library, version 1.0.1 [60] (based on Keras version 2.2.4 [61] and TensorFlow version 1.14.0 [62]). Their training procedure is presented in Section 5 and involves applying data augmentation strategies to the training samples.
The loss function is a combination of binary cross-entropy and Jaccard loss components (as defined in Equations (2)–(5) of [46]) to encourage the model to better predict the positive labels. To prevent overfitting and help model convergence, early stopping (if the monitored IoU score metric does not improve after ten epochs) and learning rate reduction strategies (with a factor of 10, down to a minimum of 0.00001) were applied.
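These strategies can be configured in Keras roughly as follows (shown with tf.keras for brevity, although the SS models relied on standalone Keras 2.2.4); the monitored metric name, the restore_best_weights flag, and the plateau patience are assumptions that depend on how the IoU score metric is registered in the specific setup.

```python
from tensorflow import keras

callbacks = [
    # Stop training if the monitored IoU score does not improve for ten epochs.
    keras.callbacks.EarlyStopping(monitor="val_iou_score", mode="max",
                                  patience=10, restore_best_weights=True),
    # Reduce the learning rate by a factor of 10, down to a minimum of 1e-5.
    keras.callbacks.ReduceLROnPlateau(monitor="val_iou_score", mode="max",
                                      factor=0.1, patience=5, min_lr=1e-5),
]
# model.fit(train_images, train_masks, validation_data=..., callbacks=callbacks)
```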
The training of the implemented models considered in this study (training scenarios are presented in Table 5) was repeated three times, as ANOVA (Analysis of Variance) can be applied with three performance samples, with the maximum batch size allowed by the GPU. Training was carried out on an Ubuntu 20.04 server featuring four NVIDIA Tesla V100 SXM2 graphics cards with 16 GB of VRAM and an 80-core Intel Xeon Gold 6148 CPU with 126 GB of RAM.
At the end of the training, the resulting generators and SS models were exported to the HDF5 format and were evaluated on the test set presented in Table 1. At inference time, a decision threshold of 0.5 was applied to separate the positive and negative classes: black was assigned to the negative "Background" class, and white was assigned to the positive "Road_Marking" class. The predictions delivered by the trained models were stored in the lossless PNG format and were compared against the target domain values to compute the associated confusion matrices and the corresponding T P (True Positive), T N (True Negative), F P (False Positive), and F N (False Negative) values.
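A minimal NumPy sketch of this thresholding and confusion-count step (assuming predictions and masks are arrays with values in [0, 1]) is shown below; the function name is hypothetical.

```python
import numpy as np

def confusion_counts(pred_probabilities, ground_truth, threshold=0.5):
    """Binarize the prediction at `threshold` and accumulate TP, TN, FP, FN,
    treating 1 (white) as "Road_Marking" and 0 (black) as "Background"."""
    pred = (pred_probabilities >= threshold).astype(np.uint8)
    gt = (ground_truth >= threshold).astype(np.uint8)
    tp = int(np.sum((pred == 1) & (gt == 1)))
    tn = int(np.sum((pred == 0) & (gt == 0)))
    fp = int(np.sum((pred == 1) & (gt == 0)))
    fn = int(np.sum((pred == 0) & (gt == 1)))
    return tp, tn, fp, fn
```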
The performance of the models was evaluated with the IoU score, F1 score, Cohen's Kappa, geometric mean, and Matthews correlation coefficient. The IoU score is defined in Equation (12); a model obtaining an IoU score greater than 0.5 is considered to have a good performance [63]. The F1 score (Equation (13)) is computed as $2 \times (Precision \times Recall)/(Precision + Recall)$, where $Precision = TP/(TP + FP)$ and $Recall = TP/(TP + FN)$. Cohen's Kappa (or κ, defined in Equation (14)) is a measure of agreement (the extent to which the predictions match the true labels) that accounts for chance and ranges from −1 (worse than chance) to 1 (perfect agreement).
The geometric mean (or G-mean, Equation (15)) is calculated as the square root of the product of the recall and the specificity (computed as $TN/(TN + FP)$) and ranges from 0 (worst) to 1 (perfect classification). It is suitable for imbalanced datasets, as a higher G-mean indicates a model that performs well across both classes, with neither the positive nor the negative class being neglected.
The Matthews correlation coefficient (abbreviated MCC, defined in Equation (16)) ranges from −1 to 1 (1 indicates a perfect prediction, 0 a random prediction, and −1 a perfect mismatch between prediction and ground truth). MCC uses values from all four quadrants of the confusion matrix and is considered a robust performance indicator on imbalanced datasets.
These performance metrics are widely used in SS tasks and are relevant for the road marking extraction task with conditional generative learning approached in this study.
$IoU\ score = \frac{TP}{TP + FP + FN}$  (12)
$F1\ score = \frac{2 \times TP}{2 \times TP + FP + FN}$  (13)
$\kappa = \frac{2 \times (TP \times TN - FN \times FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$  (14)
$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$  (15)
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$  (16)
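For reference, Equations (12)–(16) can be computed from the confusion-matrix counts as in the sketch below; the small eps term is an assumption added only to avoid division by zero in degenerate cases.

```python
import math

def evaluation_metrics(tp, tn, fp, fn, eps=1e-12):
    """Compute the metrics of Equations (12)-(16) from confusion-matrix counts."""
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    kappa = 2 * (tp * tn - fn * fp) / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn) + eps)
    recall = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    g_mean = math.sqrt(recall * specificity)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + eps)
    return {"IoU": iou, "F1": f1, "Kappa": kappa, "G-mean": g_mean, "MCC": mcc}
```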
The ANOVA results for the performance results (only computed for positive “Road_Marking” class) achieved by the models grouped by Training Scenario ID are presented in Table 6.
As shown in Table 6, the lowest performance and the highest standard deviation (less consistent training) were delivered by the standard Pix2Pix model (in the P2P100 scenario), where a mean IoU score of 0.26 was achieved. In P2P500, a 24.64% increase in the IoU score was observed (from 0.2605 to 0.3247); the mean F1 score, κ, G-mean, and MCC values computed in this scenario are 0.4902, 0.4606, 0.6355, and 0.4703, respectively. These values indicate a better learning of the road marking elements and demonstrate that increasing the importance of the reconstruction L1 loss component ($\lambda_{L1} = 500$) can significantly improve the results.
The SS models U-Net and U-Net–Inception-ResNet-v2 delivered mean IoU scores of 0.4637 and 0.4582, respectively (values approaching the 0.5 threshold considered a good performance). The standard U-Net performs slightly better and achieves mean F1 score, κ, and G-mean values of 0.6336, 0.6102, and 0.7481, respectively. Its mean MCC value of 0.6147 indicates that it is the model that best distinguishes the positive class.
RoadMark-cGAN achieves an IoU score for the positive class of 0.4505. The introduction of dynamic dropout in RM2 delivered a 0.75% increase in the IoU score and the lowest standard deviation of all training scenarios (highly stable performance). The introduction of the novel MBSCB loss further improved performance in RoadMark-cGAN-v3, and the combination of dynamic dropout and the MBSCB loss in RoadMark-cGAN-v4 improved it further, indicating that all the modifications to the RM architecture led to improvements in performance. The percentage differences between the mean IoU scores of RM-RM3, RM-RM4, and RM3-RM4 are 3.55%, 4.62%, and 1.03%, respectively.
RoadMark-cGAN-v4 achieves the highest mean IoU score, F1 score, κ, and G-mean values of 0.4713, 0.6407, 0.6145, and 0.7917, respectively. Its average improvements in the IoU score are 45.15% and 1.64% when compared to the best Pix2Pix ($\lambda_{L1} = 500$) and SS (U-Net) models, respectively. The mean MCC value of 0.6145 achieved by this model is only 0.03% lower than the maximum value of 0.6147 achieved by U-Net. It is also noticeable that the mean G-mean of the SS models is lower than that delivered by RoadMark-cGAN and its variants, indicating that the SS models neglected the minority class slightly more during training.
The null hypothesis that there are no differences in performance between the training scenarios is rejected, implying that there are highly significant differences in the models' capacity to extract road markings and that the choice of trained model has a significant impact on performance that is not due to random chance (p-values < 0.001). Figure 4 presents a boxplot comparison of the models in terms of performance metrics.
The performance follows a similar pattern. Although a considerable increase in performance is achieved when increasing $\lambda_{L1}$ to 500, the Pix2Pix models achieve the lowest performance across all performance metrics and the highest variability of the test set values (indicating less consistent training and more overfitting). When focusing on RoadMark-cGAN and its variants and on the SS models, it can be noticed that the performance metrics show less variability (indicating consistent training and better generalization capacity).
The SS models deliver a better average and median performance than RoadMark-cGAN and RoadMark-cGAN-v2 across all metrics except G-mean; however, after introducing the MBSCB loss in RoadMark-cGAN-v3, the advantage shifts in favor of the I2I translation models, and their average performance surpasses that achieved by the popular SS models. The boxplot observations are aligned with those obtained from Table 6 and indicate that the best average performance was achieved by RoadMark-cGAN-v4.
A highly significant difference was identified between the means of the performance grouped by Training Scenario ID (p-value < 0.001) in the ANOVA test from Table 6. However, the F-statistics and their p-values do not reveal which training scenarios differ from the others when a significant difference exists.
For this, a post-hoc test was applied to determine which specific SS and RoadMark-cGAN Training ID groups are significantly different from each other. In this case, the Tukey HSD test with a significance level of 0.05 was applied (given that the test sample sizes are equal) to identify homogeneous subsets of group means that are not significantly different from each other (with 95% confidence that the subsets were not formed by chance). The Tukey HSD post-hoc test compares all possible pairs of means and controls the Type I error across all comparisons. The results are presented in Table 7.
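The statistical procedure can be reproduced with standard Python tooling, as sketched below for three of the training scenarios using the per-experiment IoU scores from Table A1; the availability of scipy and statsmodels is assumed.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Per-experiment IoU scores for three of the training scenarios (Table A1).
scores = {
    "UNet": [0.4657, 0.4683, 0.4570],
    "RM3":  [0.4685, 0.4654, 0.4656],
    "RM4":  [0.4715, 0.4700, 0.4725],
}

# One-way ANOVA across the groups.
f_stat, p_value = f_oneway(*scores.values())
print(f"F = {f_stat:.3f}, p = {p_value:.5f}")

# Tukey HSD post-hoc test at a 0.05 significance level.
values = np.concatenate([np.asarray(v) for v in scores.values()])
groups = np.concatenate([[name] * len(v) for name, v in scores.items()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```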
The homogeneous subsets reported in Table 7 contain models whose mean performance is not significantly different from each other at a level of significance of 5%; the difference is only statistically significant between subsets.
Regarding the IoU score, although U-Net and RoadMark-cGAN-v3 are also present in Subset 4 (with the highest performance), they share the homogeneous Subset 3 with U-Net–Inception-ResNet-v2, while RM4 occupies the highest-performing position of Subset 4 and is not present in any other homogeneous subset. In particular, the more advanced UNetIR is found only in the lower-performing subsets (2 and 3), indicating a significant difference in IoU score compared to RM4. The same trend is observed in the F1 score's homogeneous subsets, which highlights RM4's higher capacity to balance precision and recall in the road marking extraction.
Cohen's Kappa (measuring the prediction agreement with the ground truth beyond chance) shows RoadMark-cGAN-v4 residing in Subset 2, alongside U-Net, RoadMark-cGAN-v3, and U-Net–Inception-ResNet-v2, the latter also being present in Subset 1 (alongside RoadMark-cGAN and its v2 variant). The highest position in the best-performing subset belongs to RoadMark-cGAN-v4, suggesting a performance comparable to that of the SS models.
For the G-mean metric, RoadMark-cGAN-v3 and v4 constitute Subset 3, which is statistically distinct from the rest of the model subsets. This indicates that the introduction of the MBSCB loss achieved a more balanced performance between the minority “Road_Marking” class and the “Background” class, an important aspect when dealing with high class imbalance.
The MCC results show RoadMark-cGAN-v3 and v4 grouped within Subset 3 (with the best performance), together with U-Net and U-Net–Inception-ResNet-v2, and suggest that the last two variants of RoadMark-cGAN are comparable to the best SS models. The Tukey HSD test results support the observations from Table 6 and Figure 4.
Finally, for a more detailed analysis of the performance, the confusion matrices of the best Pix2Pix, SS model, and RoadMark-cGAN model on the test set are presented in Figure 5 in terms of class-level percentages. These values correspond to experiments number 6, 8, and 24, respectively (Appendix A).
In Figure 5, it can be observed that the best Pix2Pix model correctly identified background pixels (lowest false positive rate), but struggled significantly with correctly identifying road markings, as only 42% of these pixels were correctly extracted. This high FN rate of 58% is concerning for the intended task, as it means that a significant portion of the actual road markings would be missed. These aspects indicate a strong bias towards the majority (negative) class, which significantly impacted its performance.
U-Net also demonstrated robust performance in correctly identifying the background. Its FN rate improved to 42%, but the model still correctly segmented only 58% of the true road marking pixels. RoadMark-cGAN-v4 delivers a slightly lower accuracy in identifying the background (a TN rate of 94%), but an improved capacity to correctly identify road markings (a TP rate of 65%, significantly higher than those of the best Pix2Pix and U-Net models). Its FN rate of 35% is also significantly lower than those of the other models and indicates the effectiveness of the proposed architecture for this specific task.

8. Qualitative Evaluation

To analyze what these performance scores mean for road marking extraction, a perceptual validation was conducted using eight random samples from the test set. Figure 6 presents the correspondence between the aerial orthoimage (source domain, x), the ground-truth segmentation mask (target domain, y), the predictions G(x) of the best Pix2Pix and RoadMark-cGAN models, and the predictions ŷ of the best SS model (U-Net).
The best Pix2Pix, SS model, and RoadMark-cGAN and its variants correspond to experiments number 6, 8, 14, 17, 19, and 24, respectively, from Appendix A (where their test performance is presented).
Regarding the general patterns found in the visual evaluation, it was observed that, overall, the quality of the road marking representations is high in typical highway scenarios, where multiple straight lanes that are separated by well-defined lane markings are present (e.g., sample rows 3, 7, and 8). In these scenes, the markings of interest were correctly extracted.
It was also found that highway junctions were generally handled properly by the best models (e.g., sample row 8). The presence of additional pavement signaling drawn with the same type of paint (a gore area, marked with diagonal lines that converge to a central point) did not seem to harm the extraction, and only the intended markings were extracted in the generated maps.
However, common structures such as bridges (e.g., sample row 5) seem to cause errors that could be attributed to the underrepresentation of these objects in the training data. It is recommended to extract such scenes with object detection models like YOLO [64] (for example, YOLOv8 is capable of detecting oriented bounding boxes) and, afterwards, involve a human operator in the extraction process. The same approach can be applied to tunnel entrances and exits.
Additionally, a recurring issue was observed with the extraction of road markings near highway entrances and exits (e.g., sample rows 2 and 6). This is likely due to insufficiently representative training data and could be mitigated by incorporating more examples of such scenarios into the dataset. For the same reason, even the best models present higher error rates when extracting marking information from roundabouts (e.g., sample rows 2 and 6).
The models also seem to be sensitive to fading road marking paint, which causes a lower contrast between the lane separation and the road pavement (e.g., the separation lane markings in sample row 4, or the roundabout in sample row 6). It is recommended to apply additional data augmentation techniques involving changes in image contrast and brightness to expose the models to more variation in the data during training.
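A simple example of such a photometric augmentation step with tf.image is given below; the brightness delta and contrast range are assumptions that would need to be tuned, not values taken from the paper.

```python
import tensorflow as tf

def photometric_augmentation(image):
    """Randomly perturb brightness and contrast of an orthoimage tile (values
    in [0, 1]) to simulate faded, low-contrast road markings (sketch)."""
    image = tf.image.random_brightness(image, max_delta=0.15)
    image = tf.image.random_contrast(image, lower=0.7, upper=1.3)
    return tf.clip_by_value(image, 0.0, 1.0)
```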
Moreover, objects such as cars affect the quality of the predictions, since they can cover the road markings (as found in sample row 3). This prediction outcome is to be expected, as objects that cover patterns important for the predictions remove the representative information needed for a correct extraction.
With respect to the models, it is clear that even the best Pix2Pix achieved the worst results and consistently missed road markings. The RM model presents a significant qualitative improvement in its synthetic predictions, and the techniques applied subsequently in its variants delivered predictions that are closer to the target domain than those of the other image-to-image translation models trained. The best variant, RM4, delivered results comparable to those of the best trained SS models, which proves the suitability of the proposed extraction technique based on the RoadMark-cGAN model; in many cases, it extracted the road markings better than the SS models.
These findings are aligned with the quantitative analysis of the performance metrics carried out in Section 7 and prove the suitability of image-to-image translation with RM for direct mapping purposes, with results comparable to, or even better than, those of SS models applied directly.
Nonetheless, some prediction artifacts specific to GAN models can be encountered (rough road marking edges and difficulties in accurately delineating the road boundaries). A possible solution for this would be to explore higher L1 loss lambda values, as a higher importance of the reconstruction loss generally tends to smooth the edges of the geospatial element extracted. This is in line with the recommendation of Section 6.2, regarding the exploration of more hyperparameter combinations in additional training scenarios to identify the most suitable one.

9. Discussion

The direct mapping from the source to the target domain is complex, as the I2I translation function must be learned with features from imbalanced data (the background pixels significantly outnumber the road marking pixels) and challenging orthoimage information.

9.1. On the Quantitative Evaluation

In the quantitative evaluation from Section 7, the metrics showed that the original I2I translation model, Pix2Pix, was not suitable for the task and that a purely generative approach with a standard L1 loss is insufficient for an accurate extraction of geospatial elements. Even when increasing Pix2Pix’s L1 loss lambda fivefold, the model struggles with precise segmentation, as the achieved IoU score of around 0.32 was still far from the desired value of 0.5 for the positive class (indicating a good performance). The poor performance of the Pix2Pix models raises questions about the suitability of purely generative adversarial networks for precise mapping tasks.
When incorporating the processing blocks described in Section 4 into the novel RoadMark-cGAN architecture (abbreviated RM in the figures and tables of Section 7), the I2I performance was greatly improved to around 0.45 in the IoU score for the positive class. The novel architecture likely enabled G to produce more realistic road markings by better learning to fool the discriminator, which distinguishes between real and generated segmentations. G features residual blocks (which help mitigate the vanishing gradient problem and improve the flow of information) and attention blocks (which enable the model to focus on more relevant features in the input image tiles).
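To make these ideas concrete, a minimal Keras sketch of a residual block and a simple spatial-attention block is shown below; these are generic variants for illustration and do not reproduce the exact RoadMark-cGAN blocks.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convolutions with a skip connection, easing gradient flow."""
    shortcut = x
    if shortcut.shape[-1] != filters:
        # Project the shortcut when the channel count changes.
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def attention_block(x):
    """Spatial attention: a 1x1 sigmoid convolution re-weights feature maps."""
    mask = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(x)
    return layers.Multiply()([x, mask])
```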
Variations of the RoadMark-cGAN architecture suggest that architectural modifications such as dynamic dropout (RoadMark-cGAN-v2), or the introduction of training strategies such as the novel MBSCB loss focusing on the morphological characteristics of the extracted object (RoadMark-cGAN-v3), can significantly impact the segmentation performance and lead to a more accurate extraction. The ablation study confirmed that both dynamic dropout and the MBSCB loss contribute to improved performance, with their combination (RoadMark-cGAN-v4) yielding the best results.
The proposed RM-based models showed stable training behavior with low variability; the best variant outperformed traditional and advanced SS models such as U-Net and U-Net–Inception-ResNet-v2, achieving a maximum IoU score for the positive class of 0.473. The consistently high performance of the RM model and its variants is also an indicator of the appropriateness of applying conditional generative learning for road marking extraction.
ANOVA and Tukey's HSD post-hoc test indicated that RM3 and RM4 consistently perform the best in terms of IoU and F1 scores, although some of the pairwise differences are not statistically significant at the 0.05 level (indicating a good extraction performance in delineating the road marking boundaries and a good balance between precision and recall for the positive class). These models also achieved the highest average Cohen's Kappa (relevant when dealing with class imbalance, where road markings represent a small fraction of the image tile).
Moreover, the G-mean analysis indicated that the RM3 and RM4 models deliver a statistically significantly more balanced performance between the "Road_Marking" class and the background, which is particularly important in extraction tasks with high class imbalance. RM4 consistently ranks at the top and has a significant advantage in G-mean, which provides robust evidence for its superiority over the other models considered for this specific extraction task (including advanced SS models such as U-Net–Inception-ResNet-v2).
The confusion matrices of Figure 5 reveal an important trade-off between the models. Pix2Pix and U-Net are slightly better at not misinterpreting background pixels for road markings, but they miss a significant portion of the actual road markings (incorrectly predicting significantly more “Road_Marking” pixels as background). While RoadMark-cGAN-v4 prioritizes a correct extraction of road markings and delivers a substantially superior performance in correctly classifying road pixels, it presents a slightly higher rate of incorrectly labeled “Background” pixels as “Road_Marking”.
The best RoadMark-cGAN model achieves an absolute increase of 7.1% in correctly identified road marking pixels when compared to the best SS model. This aspect can be considered an important advantage for the intended application, as missing road markings can have safety implications, while a slightly higher rate of incorrectly identifying background as road markings might be less critical (as these errors can be filtered out by subsequent processing).

9.2. On the Qualitative Evaluation

The visual analysis of random test predictions revealed patterns (such as scenes with consistently lower performance) or specific regions where the models struggle (e.g., shadows, occlusions, or confusion with similar objects). This is consistent with the idea that errors are often spatially correlated. First, it was observed that the quality of the representations extracted by the original Pix2Pix model is poor, and the representations delivered by the standard I2I translation model are inadequate for road marking mapping.
The RoadMark-cGAN models can generate realistic road marking representations that are accurate to the context and deliver consistent reconstructions. RoadMark-cGAN-v3 and v4 achieved results comparable to SS models, with v4 surpassing the performance of the best-trained SS model. This could be attributed to the introduction of the novel MBSCB loss function, focusing on the morphological thinness of road markings, which is likely to improve boundary detection. The RoadMark-cGAN model is considered to correctly learn the road marking distribution in the ground truth masks.
However, higher rates of errors were detected near highway entrances and exits, and roundabouts, which suggests that labelling more data from these scenes is needed. Higher error rates were also observed around bridges, tunnel entrances, or exits, requiring further improvements, such as training an object detection model like YOLO [64] to detect those scenes and then involving a human operator in the extraction process.
Another common extraction error was caused by fading road marking colors due to wear of the road painting, and it is recommended to apply more data augmentation techniques (centered on image brightness and contrast) to expose the model to more variations in data. It should also be noted that the I2I translation models fail to extract road markings occluded by objects like cars, but this type of error was also observed in SS models and is a challenge inherent to the extraction of geospatial objects from aerial imagery.
Nonetheless, aligned with the metrical analysis, the visual evaluation revealed that RoadMark-cGAN-v4 (RM4) delivers predictions comparable to state-of-the-art SS models. However, it also indicated the potential benefit of better hyperparameter tuning, as, although the boundary delineation was the closest to the ground-truth masks, the edges were often rougher compared to those of the SS models. It is expected that these prediction artifacts could be reduced by optimizing the lambda parameters for the different loss components (particularly the L1 lambda) to improve the reconstruction loss.

9.3. Limitations and Future Lines of Work

Regarding the challenges, training cGANs for I2I translation can be more complex and computationally intensive compared to traditional supervised models. This type of model also suffers from instability issues, as indicated by the higher standard deviations observed for the standard P2P models.
Even the best-performing RoadMark-cGAN has the potential for improvement, and it is important to conduct further research on the optimal number of training epochs and on the optimal combination of hyperparameters, particularly those related to the importance of the loss components (for example, optimizing the L1 loss lambda for detail preservation). Nonetheless, for thin structure extraction, the starting weight combination for the morphological thinness, Dice, and L1_raw components of the MBSCB loss of 50, 150, and 100, respectively, is recommended, as it delivered good performance in our experiments.
Exploring loss functions adapted to the morphological features of the studied object (e.g., MBSCB loss when extracting thin geospatial elements) is also strongly recommended to ensure a stable adversarial training process. Furthermore, concepts such as nested skip pathways introduced by UNet++ [23] could also improve performance.
In addition, since the available computational budget for this work only allowed for three experiment repetitions for each training scenario, it is recommended to consider additional performance metrics and to conduct more training repetitions in the future to strengthen the statistical significance of the findings. Nonetheless, there is always the risk that the observed phenomena are caused by factors not considered in the statistical analysis. In this line, the uncertainty of the models could be quantified by applying techniques like Bayesian DL [65] or dropout uncertainty [66], which could provide confidence measures for predictions and improve reliability in real-world deployments.
In the qualitative evaluation, it was observed that the training dataset should be further improved, especially with the addition of training samples from areas where higher error rates were detected (e.g., roundabouts and highway entrances and exits). It is also recommended to quantify the errors in terms of percentages of affected test samples from these scenes.
However, it is important to mention that, given the thin nature of road markings, false predictions have a significant impact on the computed performance scores, and high-quality road representations are required as input. The visual analysis also indicated that additional data augmentation strategies should be explored to study whether they improve the generalization capabilities of the models (particularly in areas where the road paint has begun to fade), or whether a multi-model approach, in which a scene detection classifier is followed by a segmentation model, could be applied.
The quality of the road marking extraction, particularly in urban areas where many buildings are present, could be improved by exploring multi-modal fusion techniques with information from LiDAR sources (e.g., with data from PNOA-LiDAR [67], covering the Spanish territory with a density of 5 points/m²), or super-resolved aerial imagery [68].
The model’s scalability could be a limitation when working with hardware that has limited processing capabilities. To find a balance between the performance and the computational resources (including GPU and RAM), future work could explore processing strategies such as efficient data batching [69], or the use of sparse representations to reduce memory footprint, while conducting a comprehensive monitoring and analysis of the training times and memory usage to analyze how changes in dataset size affect resource consumption.
Another interesting line of work would be to investigate the trade-off between model accuracy and speed, as for this application, real-time performance could be crucial. For this, lightweight segmentation models such as TinyU-Net [70], SegDesicNet [71], or ERFNet [72] could be considered in additional training scenarios. From an electronics perspective, deploying these models in autonomous vehicles would require addressing the constraints associated with embedded systems (limited computational resources, energy efficiency, and real-time processing). These challenges highlight the need for optimizing memory usage and inference speed to ensure safe and reliable operation in resource-constrained environments. Therefore, future work should also explore lightweight architectures and model compression techniques to enable deployment without compromising performance or safety-critical requirements.
To offer more insights into the learned representations and why a model delivers a certain prediction, explainable artificial intelligence (XAI) techniques like saliency maps [73], or Grad-CAM [74] could be applied to understand which features of the input orthoimages are the most influential for the road marking extraction in different scenarios. In addition, analyzing the receptive fields of the convolutional layers of the models could help to understand how the global and local context are used to incorporate contextual information and improve performance. This could lead to a deeper understanding of potential cGAN biases.
Nonetheless, the authors strongly believe that applying conditional generative learning for road marking mapping is a viable option and that the processing approach could lead to a better extraction of geospatial elements from aerial imagery.

10. Conclusions

In this work, a novel RoadMark-cGAN model was proposed to directly extract road marking features from aerial imagery—an important task for the expected rise of autonomous vehicles. The training objective was to apply conditional generative learning and achieve important levels of performance when mapping road markings in an I2I translation setting. The generator features enhanced regularization, residual connections, attention blocks, and a 33% decrease in the number of parameters when compared to Pix2Pix’s generator. The discriminator is a modified PatchGAN with an added attention block. The model was trained on the RoadMark-binary dataset with techniques such as exponential decay and achieved an average 39.8% improvement in the IoU score for the positive class over the best Pix2Pix model.
RoadMark-cGAN was improved in versions 2 to 4 by adding regularization techniques such as dynamic dropout (v2), introducing the MBSCB loss (v3) designed for the specific morphological properties of the target object (thin structures), or combining both effective techniques (v4). The best cGAN architecture (RoadMark-cGAN-v4) achieved mean improvements in the IoU score for the positive class of 45.2% and 1.6% on unseen data, compared to the best trained Pix2Pix and SS models, respectively.
The visual evaluation of test predictions indicated that the best model delivers road marking representations that are comparable to, and even surpass, those of popular state-of-the-art SS models. However, cGAN models appeared to be more sensitive to fading road marking paint, and higher error rates were also observed in scenes featuring bridges, tunnels, roundabouts, highway entrances and exits, or objects such as cars covering the road. Nonetheless, it is expected that a better optimization of the lambda hyperparameters of the MBSCB loss, an increase in the number of samples from these regions, the application of more data augmentation, and the introduction of DL models for object detection could help reduce these errors.
Although larger datasets and additional evaluation metrics could further support these findings, the statistical analysis in Section 7 shows that the proposed RoadMark-cGAN model (particularly, version 4) represents a significant advancement in road marking extraction for road cartography purposes. The proposed method has great potential for many remote sensing applications that involve mapping geospatial objects.
As future lines of work, the exploration of additional cGAN architectures and the identification of combinations of hyperparameters that optimize and stabilize the adversarial training are proposed. Proposing and testing additional loss functions that focus on specific morphological characteristics of geospatial objects could lead to better training and can also be considered an active area of research. The end goal would be to integrate the predictions into a road decision support system used to set up road-related policies and automatically monitor and map occurring changes.

Author Contributions

C.-I.C.: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, writing—review and editing; N.Y.: conceptualization, formal analysis, methodology, project administration, validation, visualization, writing—review and editing; M.-Á.M.-C.: data curation, formal analysis, resources, validation, visualization, writing—review and editing; R.A.: formal analysis, validation, visualization, writing—review and editing; C.B.-B.: validation, visualization, writing—review and editing; J.X.: validation, visualization, writing—review and editing; B.B.: validation, visualization, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The training and test data (RoadMark-binary) are distributed under a CC-BY 4.0 license in the Zenodo repository https://zenodo.org/records/15474412 (accessed on 21 May 2025).

Acknowledgments

The first author would like to thank the Geoinformatics Team of RIKEN Advanced Intelligence Project (AIP) for the five-month research stay invitation to participate in the “Creation of High-Definition Maps” project, established between RIKEN-AIP and Universidad Politécnica de Madrid (UPM).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANOVA	Analysis of Variance
cGAN	conditional generative adversarial network
CNIG	National Center for Geographic Information
D	discriminator
DCGAN	Deep Convolutional Generative Adversarial Networks
DL	deep learning
FN	false negative
FP	false positive
G	generator
G-mean	geometric mean
GAN	generative adversarial network
I2I	image-to-image
ID	identifier
IoU	Intersection-over-Union
MBSCB (loss)	Morphological Boundary-Sensitive Class-Balanced (loss)
MCC	Matthews correlation coefficient
PNOA	National Plan for Aerial Orthophotography (Spanish: "Plan Nacional de Ortofotografía Aérea")
ReLU	Rectified Linear Unit
RGB	Red-Green-Blue
Std.	standard
SS	semantic segmentation
TN	true negative
TP	true positive
XAI	explainable artificial intelligence
κ	Cohen's Kappa
λ	lambda

Appendix A

Table A1. Test set performance (IoU score, F1 score, Cohen’s Kappa, G-mean, and MCC) for the “Road_Marking” class obtained by the models trained in the eight training scenarios from Table 5.
Training Scenario ID | Experiment No. | IoU Score | F1 Score | Cohen's Kappa | G-Mean | MCC
P2P100 | 1 | 0.2737 | 0.4298 | 0.3959 | 0.5986 | 0.4025
P2P100 | 2 | 0.2586 | 0.4110 | 0.3796 | 0.5670 | 0.3939
P2P100 | 3 | 0.2492 | 0.3999 | 0.3690 | 0.5498 | 0.3884
P2P500 | 4 | 0.3188 | 0.4835 | 0.4545 | 0.6252 | 0.4666
P2P500 | 5 | 0.3255 | 0.4911 | 0.4610 | 0.6397 | 0.4692
P2P500 | 6 | 0.3297 | 0.4959 | 0.4663 | 0.6416 | 0.4751
UNet | 7 | 0.4657 | 0.6355 | 0.6124 | 0.7475 | 0.6174
UNet | 8 | 0.4683 | 0.6379 | 0.6144 | 0.7547 | 0.6180
UNet | 9 | 0.4570 | 0.6274 | 0.6038 | 0.7420 | 0.6088
UNetIR | 10 | 0.4508 | 0.6214 | 0.5980 | 0.7329 | 0.6047
UNetIR | 11 | 0.4649 | 0.6347 | 0.6116 | 0.7466 | 0.6167
UNetIR | 12 | 0.4588 | 0.6290 | 0.6056 | 0.7423 | 0.6108
RM1 | 13 | 0.4504 | 0.6211 | 0.5961 | 0.7488 | 0.5986
RM1 | 14 | 0.4513 | 0.6219 | 0.5969 | 0.7503 | 0.5992
RM1 | 15 | 0.4497 | 0.6204 | 0.5956 | 0.7466 | 0.5985
RM2 | 16 | 0.4536 | 0.6241 | 0.5991 | 0.7532 | 0.6011
RM2 | 17 | 0.4543 | 0.6248 | 0.5997 | 0.7548 | 0.6016
RM2 | 18 | 0.4537 | 0.6242 | 0.5992 | 0.7539 | 0.6012
RM3 | 19 | 0.4685 | 0.6380 | 0.6118 | 0.7877 | 0.6118
RM3 | 20 | 0.4654 | 0.6352 | 0.6087 | 0.7867 | 0.6087
RM3 | 21 | 0.4656 | 0.6354 | 0.6090 | 0.7850 | 0.6090
RM4 | 22 | 0.4715 | 0.6408 | 0.6148 | 0.7902 | 0.6148
RM4 | 23 | 0.4700 | 0.6395 | 0.6134 | 0.7884 | 0.6134
RM4 | 24 | 0.4725 | 0.6418 | 0.6153 | 0.7965 | 0.6154
Notes: (1) For each training scenario, the experiments were repeated three times. (2) Italics are used to represent the experiment with the best results within the same Training Scenario ID. (3) Bold is used to represent the model with the highest overall performance metrics.

References

  1. Cira, C.-I.; Arranz-Justel, J.-J.; Şuba, E.-E.; Manso-Callejo, M.-Á.; Nap, M.-E.; Sălăgean, T. Evaluation of Multiclass Extraction of Representative Road Lines Found on Highway Pavement Using Supervised Semantic Segmentation Techniques and Aerial Imagery. In Proceedings of the 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), Mahé, Seychelles, 1–2 February 2024; IEEE: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
  2. Vali, A.; Comai, S.; Matteucci, M. Deep Learning for Land Use and Land Cover Classification Based on Hyperspectral and Multispectral Earth Observation Data: A Review. Remote Sens. 2020, 12, 2495. [Google Scholar] [CrossRef]
  3. Cira, C.-I.; Alcarria, R.; Manso-Callejo, M.-Á.; Serradilla, F. A Deep Learning-Based Solution for Large-Scale Extraction of the Secondary Road Network from High-Resolution Aerial Orthoimagery. Appl. Sci. 2020, 10, 7272. [Google Scholar] [CrossRef]
  4. Manso-Callejo, M.A.; Cira, C.-I.; Alcarria, R.; Gonzalez Matesanz, F.J. First Dataset of Wind Turbine Data Created at National Level with Deep Learning Techniques from Aerial Orthophotographs with a Spatial Resolution of 0.5 m/Pixel. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7968–7980. [Google Scholar] [CrossRef]
  5. Manso-Callejo, M.-Á.; Cira, C.-I.; Arranz-Justel, J.-J.; Sinde-González, I.; Sălăgean, T. Assessment of the Large-Scale Extraction of Photovoltaic (PV) Panels with a Workflow Based on Artificial Neural Networks and Algorithmic Postprocessing of Vectorization Results. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103563. [Google Scholar] [CrossRef]
  6. Cira, C.-I.; Manso-Callejo, M.-Á.; Alcarria, R.; Fernández Pareja, T.; Bordel Sánchez, B.; Serradilla, F. Generative Learning for Postprocessing Semantic Segmentation Predictions: A Lightweight Conditional Generative Adversarial Network Based on Pix2pix to Improve the Extraction of Road Surface Areas. Land 2021, 10, 79. [Google Scholar] [CrossRef]
  7. Cira, C.-I.; Manso-Callejo, M.-Á.; Alcarria, R.; Bordel Sánchez, B.B.; González Matesanz, J.G. State-Level Mapping of the Road Transport Network from Aerial Orthophotography: An End-to-End Road Extraction Solution Based on Deep Learning Models Trained for Recognition, Semantic Segmentation and Post-Processing with Conditional Generative Learning. Remote Sens. 2023, 15, 2099. [Google Scholar] [CrossRef]
  8. Manso Callejo, M.A.; Cira, C.I. RoadMark-Binary: 29,405 RGB Aerial Tiles with Ground Truth Masks Labeled with Horizontal Road Line Marking Information 2025. Available online: https://zenodo.org/records/15474412 (accessed on 21 May 2025).
  9. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 5967–5976. [Google Scholar]
  10. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; pp. 2672–2680. [Google Scholar]
  11. Pan, Z.; Yu, W.; Yi, X.; Khan, A.; Yuan, F.; Zheng, Y. Recent Progress on Generative Adversarial Networks (GANs): A Survey. IEEE Access 2019, 7, 36322–36333. [Google Scholar] [CrossRef]
  12. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, PR, USA, 2–4 May 2016; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; [Google Scholar]
  13. Wang, X.; Yi, J.; Guo, J.; Song, Y.; Lyu, J.; Xu, J.; Yan, W.; Zhao, J.; Cai, Q.; Min, H. A Review of Image Super-Resolution Approaches Based on Deep Learning and Applications in Remote Sensing. Remote Sens. 2022, 14, 5423. [Google Scholar] [CrossRef]
  14. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  16. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2242–2251. [Google Scholar]
  17. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR: Birmingham, UK, 2017; Volume 70, pp. 1857–1865. [Google Scholar]
  18. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 2868–2876. [Google Scholar]
  19. Tian, R.; Li, X.; Li, W.; Li, G.; Chen, K.; Dai, H. Using Pix2Pix Conditional Generative Adversarial Networks to Generate Personalized Poster Content: Style Transfer and Detail Enhancement. J. Comput. Methods Sci. Eng. 2025, 25, 1938–1950. [Google Scholar] [CrossRef]
  20. Feng, D.; Shen, X.; Xie, Y.; Liu, Y.; Wang, J. Efficient Occluded Road Extraction from High-Resolution Remote Sensing Imagery. Remote Sens. 2021, 13, 4974. [Google Scholar] [CrossRef]
  21. Hu, A.; Chen, S.; Wu, L.; Xie, Z.; Qiu, Q.; Xu, Y. WSGAN: An Improved Generative Adversarial Network for Remote Sensing Image Road Network Extraction by Weakly Supervised Processing. Remote Sens. 2021, 13, 2506. [Google Scholar] [CrossRef]
  22. Zhao, Y.; Wang, G.; Yang, J.; Li, T.; Li, Z. AU3-GAN: A Method for Extracting Roads from Historical Maps Based on an Attention Generative Adversarial Network. J. Geovis. Spat. Anal. 2024, 8, 26. [Google Scholar] [CrossRef]
  23. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A., Maier-Hein, L., Tavares, J.M.R.S., Bradley, A., Papa, J.P., Belagiannis, V., et al., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. ISBN 978-3-030-00888-8. [Google Scholar]
  24. Lu, W.; Shi, X.; Lu, Z. A New Two-Step Road Extraction Method in High Resolution Remote Sensing Images. PLoS ONE 2024, 19, e0305933. [Google Scholar] [CrossRef]
  25. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
  27. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: New York, NY, USA, 2017; pp. 1–4. [Google Scholar]
  28. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A Challenge to Parse the Earth Through Satellite Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 172–181. [Google Scholar]
  29. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  30. Chen, H.; Peng, S.; Du, C.; Li, J.; Wu, S. SW-GAN: Road Extraction from Remote Sensing Imagery Using Semi-Weakly Supervised Adversarial Learning. Remote Sens. 2022, 14, 4145. [Google Scholar] [CrossRef]
  31. Pu, Y.; Yu, H. ResUnet: A Fully Convolutional Network for Speech Enhancement in Industrial Robots. In Proceedings of the Advances and Trends in Artificial Intelligence. Theory and Practices in Artificial Intelligence—35th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2022, Kitakyushu, Japan, 19–22 July 2022; Proceedings. Fujita, H., Fournier-Viger, P., Ali, M., Wang, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13343, pp. 42–50. [Google Scholar]
  32. Zhang, Y.; Xiong, Z.; Zang, Y.; Wang, C.; Li, J.; Li, X. Topology-Aware Road Network Extraction via Multi-Supervised Generative Adversarial Networks. Remote Sens. 2019, 11, 1017. [Google Scholar] [CrossRef]
  33. Kyslytsyna, A.; Xia, K.; Kislitsyn, A.; Abd El Kader, I.; Wu, Y. Road Surface Crack Detection Method Based on Conditional Generative Adversarial Networks. Sensors 2021, 21, 7405. [Google Scholar] [CrossRef]
  34. Zhang, Y.; Li, X.; Zhang, Q. Road Topology Refinement via a Multi-Conditional Generative Adversarial Network. Sensors 2019, 19, 1162. [Google Scholar] [CrossRef]
  35. Lin, S.; Yao, X.; Liu, X.; Wang, S.; Chen, H.-M.; Ding, L.; Zhang, J.; Chen, G.; Mei, Q. MS-AGAN: Road Extraction via Multi-Scale Information Fusion and Asymmetric Generative Adversarial Networks from High-Resolution Remote Sensing Images under Complex Backgrounds. Remote Sens. 2023, 15, 3367. [Google Scholar] [CrossRef]
  36. Hartmann, S.; Weinmann, M.; Wessel, R.; Klein, R. StreetGAN: Towards Road Network Synthesis with Generative Adversarial Networks. In Proceedings of the International Conference on Computer Graphics, Visualization and Computer Vision, Lisbon, Portugal, 21–23 July 2017. [Google Scholar]
  37. Liu, R.; Li, F.; Jiang, W.; Song, C.; Chen, Q.; Li, Z. Generating Pixel Enhancement for Road Extraction in High-Resolution Aerial Images. IEEE Trans. Intell. Veh. 2024, 9, 6313–6325. [Google Scholar] [CrossRef]
  38. Bai, H.; Ren, C.; Huang, Z.; Gu, Y. A Dynamic Attention Mechanism for Road Extraction from High-Resolution Remote Sensing Imagery Using Feature Fusion. Sci. Rep. 2025, 15, 17556. [Google Scholar] [CrossRef]
  39. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: New York, NY, USA, 2016; pp. 565–571. [Google Scholar]
  40. Kervadec, H.; Bouchtiba, J.; Desrosiers, C.; Granger, E.; Dolz, J.; Ben Ayed, I. Boundary Loss for Highly Unbalanced Segmentation. Med. Image Anal. 2021, 67, 101851. [Google Scholar] [CrossRef]
  41. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
  42. Shit, S.; Paetzold, J.C.; Sekuboyina, A.; Ezhov, I.; Unger, A.; Zhylka, A.; Pluim, J.P.W.; Bauer, U.; Menze, B.H. clDice—A Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 16555–16564. [Google Scholar]
  43. Instituto Geográfico Nacional Plan Nacional de Ortofotografía Aérea (PNOA). Available online: https://pnoa.ign.es/ (accessed on 27 December 2024).
  44. Instituto Geográfico Nacional Centro de Descargas del CNIG (IGN). Available online: http://centrodedescargas.cnig.es (accessed on 3 February 2020).
  45. Cira, C.-I.; Manso-Callejo, M.-Á.; Yokoya, N.; Sălăgean, T.; Badea, A.-C. Impact of Tile Size and Tile Overlap on the Prediction Performance of Convolutional Neural Networks Trained for Road Classification. Remote Sens. 2024, 16, 2818. [Google Scholar] [CrossRef]
  46. Cira, C.-I.; Manso-Callejo, M.-Á.; Alcarria, R.; Iturrioz, T.; Arranz-Justel, J.-J. Insights into the Effects of Tile Size and Tile Overlap Levels on Semantic Segmentation Models Trained for Road Surface Area Extraction from Aerial Orthophotography. Remote Sens. 2024, 16, 2954. [Google Scholar] [CrossRef]
  47. Fischer, H. A History of the Central Limit Theorem: From Classical to Modern Probability Theory; Springer: New York, NY, USA, 2011; ISBN 978-0-387-87856-0. [Google Scholar]
  48. Agarap, A.F. Deep Learning Using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  49. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015; Bach, F.R., Blei, D.M., Eds.; Volume 37, pp. 448–456. [Google Scholar]
  50. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar] [CrossRef]
  51. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. Conference Track Proceedings. [Google Scholar]
  52. Ba, L.J.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  53. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 17–19 June 2013. [Google Scholar]
  54. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X.; Chen, X. Improved Techniques for Training GANs. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 2234–2242. [Google Scholar]
  55. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; 2015. [Google Scholar]
  56. Weisskopf, V.; Wigner, E. Berechnung der natürlichen Linienbreite auf Grund der Diracschen Lichttheorie. Z. für Phys. 1930, 63, 54–73. [Google Scholar] [CrossRef]
  57. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 6626–6637. [Google Scholar]
  58. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2 November 2016; p. 21. [Google Scholar]
  59. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Singh, S.P., Markovitch, S., Eds.; AAAI Press: Washington, DC, USA, 2017; pp. 4278–4284. [Google Scholar]
  60. Yakubovskiy, P. Segmentation Models; GitHub: San Francisco, CA, USA, 2019. [Google Scholar]
  61. Chollet, F. Keras. Available online: https://github.com/fchollet/keras (accessed on 14 May 2020).
  62. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. arXiv 2015, arXiv:1603.04467. [Google Scholar]
  63. Forczmański, P. Performance Evaluation of Selected Thermal Imaging-Based Human Face Detectors. In Proceedings of the 10th International Conference on Computer Recognition Systems CORES 2017, Polanica Zdroj, Poland, 22–24 May 2017; Kurzynski, M., Wozniak, M., Burduk, R., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 578, pp. 170–181, ISBN 978-3-319-59161-2. [Google Scholar]
  64. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
  65. Wilson, A.G.; Izmailov, P. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 4697–4708. [Google Scholar]
  66. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: Birmingham, UK, 2016; Volume 48, pp. 1050–1059. [Google Scholar]
  67. Instituto Geográfico Nacional Especificaciones Técnicas—Plan Nacional de Ortofotografía Aérea. Available online: http://pnoa.ign.es/web/portal/pnoa-lidar/especificaciones-tecnicas (accessed on 18 May 2025).
  68. Karwowska, K.; Wierzbicki, D. Using Super-Resolution Algorithms for Small Satellite Imagery: A Systematic Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3292–3312. [Google Scholar] [CrossRef]
  69. Chen, S.; Fegade, P.; Chen, T.; Gibbons, P.B.; Mowry, T.C. ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines. 2023. [Google Scholar]
  70. Chen, J.; Chen, R.; Wang, W.; Cheng, J.; Zhang, L.; Chen, L. TinyU-Net: Lighter Yet Better U-Net with Cascaded Multi-Receptive Fields. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024—27th International Conference, Marrakesh, Morocco, 6–10 October 2024; Proceedings, Part IX. Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; Volume 15009, pp. 626–635. [Google Scholar]
  71. Verma, S.; Lindseth, F.; Kiss, G. SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, AZ, USA, 26 February–6 March 2025; IEEE: Washington, DC, USA, 2025; pp. 9093–9104. [Google Scholar]
  72. Romera, E.; Álvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2018, 19, 263–272. [Google Scholar] [CrossRef]
  73. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014; Workshop Track Proceedings. Bengio, Y., LeCun, Y., Eds.; [Google Scholar]
  74. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
Figure 1. Simplification of the training procedure of a cGAN architecture for mapping of aerial orthoimagery with image-to-image translation.
Figure 2. Distribution of training and test set areas that were labeled with road marking information (RoadMark-binary [8] dataset).
Figure 3. (a–x) Random training pairs of orthoimage tiles from the source domain and their corresponding ground truth masks from the target domain. Notes: (1) All tiles feature an image size of 256 × 256 × 3. (2) Black is used to represent the “Background” class, while white represents the “Road_Marking” class.
Figure 4. Boxplots of the performance metrics obtained by the models resulting from each training scenario ID on the testing data (from (a–e)). Note: As the RoadMark-cGAN and the SS models achieved considerably higher performance than the Pix2Pix models, a second set of subplots focused on these training scenario IDs is provided for each performance metric (from (a1–e1)).
Figure 5. Confusion matrices (expressed as class-level percentages) of the (a) best Pix2Pix model (with L1 loss λ = 500), (b) best semantic segmentation model (U-Net), and (c) best RoadMark-cGAN model (RoadMark-cGAN-v4) on the test set (experiments 6, 8, and 24 from Appendix A). Note: The confusion matrices were plotted using a green heatmap—a darker green corresponds to percentages closer to 100%, while a lighter green corresponds to percentages closer to 0%.
Figure 6. Eight random test samples with predictions delivered by the best models resulting from Training Scenario IDs P2P500 (c1–c8), RoadMark-cGAN and its variants v2 to v4 (d1–g8), and U-Net (h1–h8), together with the corresponding orthoimage tiles (a1–a8) and ground truth masks (b1–b8) used for the qualitative analysis. Note: Black is used to represent pixels belonging to the “Background” class, while white represents pixels belonging to the “Road_Marking” class.
Table 1. Dimensions of the train and test sets: Number of images, number of pixels labeled, and areas covered with the binary classes of interest from the “RoadMark-binary” dataset.
Set | No. Images | No. Pixels: Class_0 (Background) | No. Pixels: Class_1 (Road_Marking) | Area Covered (km²): Class_0 (Background) | Area Covered (km²): Class_1 (Road_Marking)
Train | 27,360 | 1,709,092,507 | 83,972,453 | 38.45 | 1.89
Test | 2045 | 124,950,170 | 9,070,950 | 2.82 | 0.20
Total | 29,405 | 1,834,042,677 | 93,043,403 | 41.27 | 2.09
Percentage | - | 95.17% | 4.83% | - | -
Notes: (1) The image size of the RGB orthoimage tiles is 256 × 256 pixels, and the spatial resolution is 15 cm. (2) “-” indicates that the value is not applicable or not calculated for that column.
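For reference, the area figures in Table 1 can be reproduced directly from the pixel counts and the 15 cm ground sampling distance of the orthoimages. The short Python sketch below (the variable names are illustrative, not from the original code) performs this conversion.

```python
# Reproduce the "Area Covered (km2)" columns of Table 1 from the pixel counts
# and the 15 cm (0.15 m) spatial resolution of the orthoimage tiles.
PIXEL_AREA_M2 = 0.15 * 0.15  # one pixel covers 0.0225 m2

pixel_counts = {
    "Train": {"Background": 1_709_092_507, "Road_Marking": 83_972_453},
    "Test": {"Background": 124_950_170, "Road_Marking": 9_070_950},
}

for split, classes in pixel_counts.items():
    for name, n_pixels in classes.items():
        area_km2 = n_pixels * PIXEL_AREA_M2 / 1e6  # m2 -> km2
        print(f"{split:5s} {name:13s} {area_km2:6.2f} km2")
# The printed values agree with Table 1 to within rounding
# (e.g., 38.45 and 1.89 km2 for the training set).
```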
Table 2. The “Residual_Block”, “Attention_Block”, “Downsample”, and “Upsample” components used by RoadMark-cGAN.
Component | Input | Operations | Output
Residual_Block | Input_Tensor (x) | N(0, 0.02) initialization applied to Conv2D (3 × 3, same padding) → Batch Normalization → ReLU activation → Dropout → N(0, 0.02) initialization applied to Conv2D (3 × 3, same padding) → Batch Normalization → ReLU activation → Add Input_Tensor (x) to the processed output | Tensor of the same shape as Input_Tensor (x)
Attention_Block | Input_Tensor (x) | Conv2D (1 × 1, same padding) → Spectral Normalization → Sigmoid → Dropout → Layer Normalization → Multiply by Input_Tensor (x) | Tensor of the same shape as Input_Tensor (x) with attention-applied features
Downsample | Input_Tensor (x) | N(0, 0.02) initialization applied to Conv2D (4 × 4, same padding, stride of 2) → Spectral Normalization → Batch Normalization (only for G) → Leaky ReLU activation | Downsampled tensor with half the height and width of Input_Tensor (x)
Upsample | Input_Tensor (x) | N(0, 0.02) initialization applied to Transposed Conv2D (4 × 4, same padding, stride of 2) → Spectral Normalization → Batch Normalization → Layer Normalization → ReLU activation | Upsampled tensor with double the height and width of Input_Tensor (x)
Notes: (1) “N(0, 0.02)” refers to a random weight initialization of the layer from a Gaussian distribution with a mean of 0 and a standard deviation of 0.02. (2) “Conv2D” is a processing layer that applies a 2D convolution operation over the height × width of the input data—for multidimensional data, the convolution is applied at each possible position along the remaining dimensions as well. (3) The number of filters is passed as an argument to all component functions. (4) The Residual_Block and the Attention_Block also take a dropout rate as an input parameter (by default, 0.3). (5) The symbol “→” indicates the sequential flow of data or operations (processing steps) applied to the input tensor within each block.
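To make the block definitions in Table 2 concrete, the following Python sketch shows one possible TensorFlow/Keras implementation of the four components. It is an illustrative reconstruction from the table, not the authors' code; it assumes a recent Keras release that provides layers.SpectralNormalization (older environments may substitute the equivalent wrapper from TensorFlow Addons), and the helper names are our own.

```python
# Hypothetical Keras sketch of the four building blocks in Table 2.
import tensorflow as tf
from tensorflow.keras import layers

INIT = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)  # N(0, 0.02)

def residual_block(x, filters, dropout_rate=0.3):
    """Conv-BN-ReLU-Dropout-Conv-BN-ReLU with an additive skip connection."""
    y = layers.Conv2D(filters, 3, padding="same", kernel_initializer=INIT)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Dropout(dropout_rate)(y)
    y = layers.Conv2D(filters, 3, padding="same", kernel_initializer=INIT)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    return layers.Add()([x, y])  # same shape as the input tensor

def attention_block(x, filters, dropout_rate=0.3):
    """1x1 spectrally normalized sigmoid gate that rescales the input features."""
    gate = layers.SpectralNormalization(
        layers.Conv2D(filters, 1, padding="same", activation="sigmoid")
    )(x)
    gate = layers.Dropout(dropout_rate)(gate)
    gate = layers.LayerNormalization()(gate)
    return layers.Multiply()([x, gate])

def downsample(x, filters, apply_batchnorm=True):
    """Strided 4x4 convolution that halves the spatial resolution."""
    y = layers.SpectralNormalization(
        layers.Conv2D(filters, 4, strides=2, padding="same",
                      kernel_initializer=INIT)
    )(x)
    if apply_batchnorm:  # Table 2: Batch Normalization only for the generator
        y = layers.BatchNormalization()(y)
    return layers.LeakyReLU()(y)

def upsample(x, filters):
    """Transposed 4x4 convolution that doubles the spatial resolution."""
    y = layers.SpectralNormalization(
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same",
                               kernel_initializer=INIT)
    )(x)
    y = layers.BatchNormalization()(y)
    y = layers.LayerNormalization()(y)
    return layers.ReLU()(y)
```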
Table 3. The architecture of RoadMark-cGAN’s generator network.
Stage | Structure | Output Size (Volume)
Input | RGB tensor with aerial orthoimage tile | 256 × 256 × 3
Encoder: Downsample-1 | Downsample (64 filters) | 128 × 128 × 64
Encoder: Downsample-2 | Downsample (128 filters) | 64 × 64 × 128
Encoder: Downsample-3 | Downsample (256 filters) | 32 × 32 × 256
Encoder: Downsample-4 | Downsample (512 filters) | 16 × 16 × 512
Encoder: Downsample-5 | Downsample (512 filters) | 8 × 8 × 512
Functional Bottleneck: Residual_Block-1 | Residual_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Attention_Block-1 | Attention_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Residual_Block-2 | Residual_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Attention_Block-2 | Attention_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Residual_Block-3 | Residual_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Attention_Block-3 | Attention_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Residual_Block-4 | Residual_Block (512 filters, dropout_rate) | 8 × 8 × 512
Functional Bottleneck: Attention_Block-4 | Attention_Block (512 filters, dropout_rate) | 8 × 8 × 512
Decoder: Upsample-1 | Upsample (512 filters) | 16 × 16 × 512
Decoder | Concatenate with Downsample-4 | 16 × 16 × 1024
Decoder: Upsample-2 | Upsample (256 filters) | 32 × 32 × 256
Decoder | Concatenate with Downsample-3 | 32 × 32 × 512
Decoder: Upsample-3 | Upsample (128 filters) | 64 × 64 × 128
Decoder | Concatenate with Downsample-2 | 64 × 64 × 256
Decoder: Upsample-4 | Upsample (64 filters) | 128 × 128 × 64
Decoder | Concatenate with Downsample-1 | 128 × 128 × 128
Final Layer | N(0, 0.02) initialization applied to Transposed Conv2D (4 × 4, stride of 2, 3 filters) → Tanh activation | 256 × 256 × 3
Output | Synthetic data, G(x) | 256 × 256 × 3
Notes: (1) The encoder contains five downsample stages. (2) The functional bottleneck is formed by the combination of Residual and Attention blocks at 8 × 8 resolution with 512 channels, where abstract representations that are important for the predictions are learned. (3) The decoder contains four upsample stages, each with a skip connection to the corresponding encoder stage. (4) The symbol “→” indicates the sequential flow of data or operations (processing steps) applied to the input tensor within each block.
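Building on the previous sketch, a hypothetical assembly of the generator in Table 3 could look as follows; this is again an illustrative reconstruction from the table, reusing the residual_block, attention_block, downsample, and upsample helpers and the INIT initializer defined above.

```python
# Hypothetical sketch of the generator architecture in Table 3.
def build_generator(dropout_rate=0.3):
    inputs = layers.Input(shape=(256, 256, 3))  # RGB orthoimage tile

    # Encoder: five downsample stages.
    d1 = downsample(inputs, 64)    # 128 x 128 x 64
    d2 = downsample(d1, 128)       # 64 x 64 x 128
    d3 = downsample(d2, 256)       # 32 x 32 x 256
    d4 = downsample(d3, 512)       # 16 x 16 x 512
    d5 = downsample(d4, 512)       # 8 x 8 x 512

    # Functional bottleneck: four residual/attention pairs at 8 x 8 x 512.
    x = d5
    for _ in range(4):
        x = residual_block(x, 512, dropout_rate)
        x = attention_block(x, 512, dropout_rate)

    # Decoder: four upsample stages with skip connections to the encoder.
    x = layers.Concatenate()([upsample(x, 512), d4])  # 16 x 16 x 1024
    x = layers.Concatenate()([upsample(x, 256), d3])  # 32 x 32 x 512
    x = layers.Concatenate()([upsample(x, 128), d2])  # 64 x 64 x 256
    x = layers.Concatenate()([upsample(x, 64), d1])   # 128 x 128 x 128

    # Final layer: transposed convolution back to 256 x 256 x 3 with tanh.
    outputs = layers.Conv2DTranspose(
        3, 4, strides=2, padding="same",
        kernel_initializer=INIT, activation="tanh")(x)
    return tf.keras.Model(inputs, outputs, name="roadmark_cgan_generator")
```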
Table 4. The architecture of RoadMark-cGAN’s discriminator network.
Stage | Structure | Output Size (Volume)
Input (pairs of (x, G(x)) and (x, y)) | Concatenated input and target tensors | 256 × 256 × 6
Downsample-1 | Downsample (64 filters) | 128 × 128 × 64
Downsample-2 | Downsample (128 filters) | 64 × 64 × 128
Downsample-3 | Downsample (256 filters) | 32 × 32 × 256
Conv_Block | Conv2D (4 × 4, 512 filters) → Leaky ReLU activation → Spectral Normalization | 32 × 32 × 512
Attention_Block | Attention_Block (512 filters, dropout_rate) | 32 × 32 × 512
Final Layer | Conv2D (4 × 4, 1 channel) with N(0, 0.02) initialization | 32 × 32 × 1
Output | Single grid of 32 × 32 scalars (each scalar represents the “real” (1) or “fake” (0) prediction for a patch) |
Note: The symbol “→” indicates the sequential flow of data or operations (processing steps) applied to the input tensor within each block.
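Similarly, a hypothetical sketch of the PatchGAN-style discriminator in Table 4 is given below, reusing the same helpers; it is a reconstruction from the table rather than the authors' implementation.

```python
# Hypothetical sketch of the discriminator architecture in Table 4.
def build_discriminator(dropout_rate=0.3):
    source = layers.Input(shape=(256, 256, 3))   # orthoimage tile x
    target = layers.Input(shape=(256, 256, 3))   # real mask y or generated G(x)
    x = layers.Concatenate()([source, target])   # 256 x 256 x 6

    x = downsample(x, 64, apply_batchnorm=False)   # 128 x 128 x 64
    x = downsample(x, 128, apply_batchnorm=False)  # 64 x 64 x 128
    x = downsample(x, 256, apply_batchnorm=False)  # 32 x 32 x 256

    # Conv block: 4x4 convolution to 512 channels at 32 x 32 resolution.
    x = layers.SpectralNormalization(
        layers.Conv2D(512, 4, padding="same", kernel_initializer=INIT)
    )(x)
    x = layers.LeakyReLU()(x)                      # 32 x 32 x 512

    x = attention_block(x, 512, dropout_rate)      # 32 x 32 x 512

    # Final layer: one scalar per patch (a 32 x 32 grid of real/fake logits).
    patch = layers.Conv2D(1, 4, padding="same",
                          kernel_initializer=INIT)(x)  # 32 x 32 x 1
    return tf.keras.Model([source, target], patch,
                          name="roadmark_cgan_discriminator")
```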
Table 5. Training scenarios considered in this work.
Training Scenario ID | Trained Model | Description
P2P100 | Pix2Pix with L1 loss λ = 100 | The original Pix2Pix [9] model trained with the λ hyperparameter of 100 for the L1 loss (as proposed by the authors).
P2P500 | Pix2Pix with L1 loss λ = 500 | The standard Pix2Pix [9] model trained with an increased λ hyperparameter of 500 for the L1 loss.
UNet | U-Net from scratch | The original U-Net [15] model trained from scratch for semantic segmentation.
UNetIR | U-Net (InceptionResNet-v2 backbone) from scratch | U-Net [15] model with the InceptionResNet-v2 [60] network as backbone, trained from scratch for semantic segmentation.
RM | RoadMark-cGAN | The model described and proposed in Section 5.1, Section 5.2, Section 5.3 and Section 5.4. It features improved normalization, residual and attention blocks added to a novel functional bottleneck, a learning rate scheduler with exponential decay, and an L1 λ of 500.
RM2 | RoadMark-cGAN-v2 | Improved version of the RoadMark-cGAN model (described and proposed in Section 6.1). It features enhanced regularization in the form of dynamic dropout added to the residual and attention blocks, with an L1 λ of 500.
RM3 | RoadMark-cGAN-v3 | Improved version of the RoadMark-cGAN model proposed and described in Section 6.2. It features the novel MBSCB loss proposed for thin structures, which combines loss components focused on morphological thinness, the Dice coefficient, and L1 applied to the raw G output, with associated lambda parameters λ_morph, λ_Dice, and λ_L1raw of 50, 150, and 100, respectively.
RM4 | RoadMark-cGAN-v4 | Improved version of the RoadMark-cGAN model described in Section 6.3. It combines the techniques introduced in RoadMark-cGAN-v2 and RoadMark-cGAN-v3, applied simultaneously to the proposed RoadMark-cGAN model.
Note: All models were trained and evaluated on the training and test sets described in Table 1.
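For RM3 and RM4, Table 5 only summarizes the MBSCB loss; a heavily hedged Python sketch of a loss of this general form is shown below. The Dice term, the L1 term applied to the raw generator output, and the lambda values follow the table, whereas the morphological thinness term used here (a max-pooling-based morphological gradient that emphasizes thin boundaries) is purely an illustrative assumption and not the exact formulation proposed in Section 6.2.

```python
# Hedged sketch of a combined loss in the spirit of the MBSCB loss (Table 5).
# The morphological term below is an illustrative placeholder, not the paper's.
import tensorflow as tf

LAMBDA_MORPH, LAMBDA_DICE, LAMBDA_L1_RAW = 50.0, 150.0, 100.0  # Table 5 weights

def dice_loss(y_true, y_pred, eps=1e-6):
    # y_true, y_pred: 4-D tensors [batch, height, width, channels] in [0, 1].
    inter = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    union = tf.reduce_sum(y_true + y_pred, axis=[1, 2, 3])
    return 1.0 - tf.reduce_mean((2.0 * inter + eps) / (union + eps))

def morphological_boundary(mask, kernel=3):
    # Morphological gradient (dilation minus erosion) via max pooling.
    dilation = tf.nn.max_pool2d(mask, kernel, 1, "SAME")
    erosion = -tf.nn.max_pool2d(-mask, kernel, 1, "SAME")
    return dilation - erosion

def mbscb_like_loss(y_true, g_raw):
    """y_true: binary mask in [0, 1]; g_raw: raw tanh output of G in [-1, 1]."""
    y_pred = (g_raw + 1.0) / 2.0                                 # map to [0, 1]
    l1_raw = tf.reduce_mean(tf.abs(2.0 * y_true - 1.0 - g_raw))  # L1 on raw G output
    dice = dice_loss(y_true, y_pred)
    morph = tf.reduce_mean(
        tf.abs(morphological_boundary(y_true) - morphological_boundary(y_pred)))
    return LAMBDA_MORPH * morph + LAMBDA_DICE * dice + LAMBDA_L1_RAW * l1_raw
```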
Table 6. Statistical analysis with ANOVA of the road marking extraction models performance (grouped by Training ID) on the test set (n = 2045 orthoimage tiles).
Category (Training Scenario ID) | Statistical Measure | IoU Score | F1 Score | Cohen’s Kappa (κ) | G-Mean | MCC
P2P100 | Mean | 0.2605 | 0.4136 | 0.3815 | 0.5718 | 0.3949
P2P100 | Std. Deviation | 0.0124 | 0.0151 | 0.0136 | 0.0248 | 0.0071
P2P500 | Mean | 0.3247 | 0.4902 | 0.4606 | 0.6355 | 0.4703
P2P500 | Std. Deviation | 0.0055 | 0.0063 | 0.0059 | 0.0090 | 0.0044
UNet | Mean | 0.4637 | 0.6336 | 0.6102 | 0.7481 | 0.6147
UNet | Std. Deviation | 0.0059 | 0.0055 | 0.0056 | 0.0064 | 0.0051
UNetIR | Mean | 0.4582 | 0.6284 | 0.6051 | 0.7406 | 0.6107
UNetIR | Std. Deviation | 0.4637 | 0.6336 | 0.6102 | 0.7481 | 0.6147
RM | Mean | 0.4505 | 0.6211 | 0.5962 | 0.7486 | 0.5988
RM | Std. Deviation | 0.0008 | 0.0008 | 0.0007 | 0.0019 | 0.0004
RM2 | Mean | 0.4539 | 0.6244 | 0.5993 | 0.7540 | 0.6013
RM2 | Std. Deviation | 0.0004 | 0.0004 | 0.0003 | 0.0008 | 0.0003
RM3 | Mean | 0.4665 | 0.6362 | 0.6098 | 0.7865 | 0.6098
RM3 | Std. Deviation | 0.0017 | 0.0016 | 0.0017 | 0.0014 | 0.0017
RM4 | Mean | 0.4713 | 0.6407 | 0.6145 | 0.7917 | 0.6145
RM4 | Std. Deviation | 0.0013 | 0.0012 | 0.0010 | 0.0043 | 0.0010
Inferential Statistics | F-statistic | 561.187 | 506.134 | 621.682 | 177.668 | 1243.265
Inferential Statistics | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Notes: (1) The values correspond to the positive “Road_Marking” class performance; the performance for the “Background” (the majority class) was not considered. (2) Bold is used to point out the highest mean values and highly significant p-values.
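The one-way ANOVA summarized in Table 6 can be reproduced with standard tooling; the sketch below uses scipy.stats.f_oneway on placeholder per-run IoU scores grouped by training scenario ID (the dictionary values are hypothetical and stand in for the experimental results).

```python
# Illustrative one-way ANOVA across training scenarios (placeholder data).
from scipy.stats import f_oneway

iou_by_scenario = {
    "P2P100": [0.2591, 0.2612, 0.2613],  # hypothetical per-run IoU scores
    "P2P500": [0.3241, 0.3250, 0.3250],
    "UNet":   [0.4630, 0.4640, 0.4641],
    "RM4":    [0.4711, 0.4713, 0.4715],
}

f_stat, p_value = f_oneway(*iou_by_scenario.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4g}")
```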
Table 7. Tukey HSD’s homogeneous subsets of the performance metrics achieved by the semantic segmentation and the proposed RoadMark-cGAN and its variants.
IoU Score
Training Scenario ID | Subset 1 | Subset 2 | Subset 3 | Subset 4
RM | 0.4505 | | |
RM2 | 0.4539 | 0.4539 | |
UNetIR | 0.4582 | 0.4582 | 0.4582 |
UNet | | 0.4637 | 0.4637 | 0.4637
RM3 | | | 0.4665 | 0.4665
RM4 | | | | 0.4713
p-value | 0.220 | 0.078 | 0.163 | 0.224

F1 Score
Training Scenario ID | Subset 1 | Subset 2 | Subset 3 | Subset 4
RM | 0.6211 | | |
RM2 | 0.6244 | 0.6244 | |
UNetIR | 0.6284 | 0.6284 | 0.6284 |
UNet | | 0.6336 | 0.6336 | 0.6336
RM3 | | | 0.6362 | 0.6362
RM4 | | | | 0.6407
p-value | 0.218 | 0.075 | 0.161 | 0.233

Cohen’s Kappa (κ)
Training Scenario ID | Subset 1 | Subset 2
RM | 0.5962 |
RM2 | 0.5993 |
UNetIR | 0.6051 | 0.6051
RM3 | | 0.6098
UNet | | 0.6102
RM4 | | 0.6145
p-value | 0.102 | 0.075

G-mean
Training Scenario ID | Subset 1 | Subset 2 | Subset 3
UNetIR | 0.7406 | |
UNet | 0.7481 | 0.7481 |
RM | 0.7486 | 0.7486 |
RM2 | | 0.7540 |
RM3 | | | 0.7865
RM4 | | | 0.7917
p-value | 0.288 | 0.579 | 0.686

MCC
Training Scenario ID | Subset 1 | Subset 2 | Subset 3
RM | 0.5988 | |
RM2 | 0.6013 | 0.6013 |
RM3 | | 0.6098 | 0.6098
UNetIR | | | 0.6107
RM4 | | | 0.6145
UNet | | | 0.6147
p-value | 0.931 | 0.073 | 0.500
Notes: (1) The homogeneous subsets are presented with the observed mean values for groups. (2) The “p-value” rows correspond to the Tukey HSD test’s assessment of whether the means within a subset are statistically different (values > 0.05 indicate that the means within the subset are not significantly different). (3) Bold is used to represent the homogeneous subsets with the best mean performance. (4) The computed mean square errors (MSE) for the performance subsets are 1.51 × 10−5, 1.32 × 10−5, 1.38 × 10−5, 1.90 × 10−5, and 1.11 × 10−5 in terms of IoU score, F1 score, Cohen’s Kappa (κ), G-mean, and MCC, respectively.
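The Tukey HSD post hoc grouping behind Table 7 can likewise be obtained with statsmodels; the sketch below applies pairwise_tukeyhsd to placeholder scores and group labels (the values shown are hypothetical and only illustrate the call).

```python
# Illustrative Tukey HSD post hoc test on placeholder per-run scores.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([0.4505, 0.4506, 0.4539, 0.4540, 0.4713, 0.4714])
groups = np.array(["RM", "RM", "RM2", "RM2", "RM4", "RM4"])

result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)
print(result.summary())  # pairwise mean differences, confidence intervals, rejection flags
```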
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
