SYNTHUA-DT: A Methodological Framework for Synthetic Dataset Generation and Automatic Annotation from Digital Twins in Urban Accessibility Applications

by Santiago Felipe Luna Romero 1,*, Mauren Abreu de Souza 1 and Luis Serpa Andrade 2

1 Graduate Program on Health Technology, Pontifícia Universidade Católica do Paraná, Curitiba 80215-901, Brazil
2 Grupo de Investigación en Hardware Embebido Aplicado (GIHEA), Facultad de Ingeniería Electrónica, Universidad Politécnica Salesiana, Sede Cuenca, Cuenca 010105, Ecuador
* Author to whom correspondence should be addressed.
Technologies 2025, 13(8), 359; https://doi.org/10.3390/technologies13080359
Submission received: 7 July 2025 / Revised: 2 August 2025 / Accepted: 6 August 2025 / Published: 14 August 2025

Abstract

Urban scene understanding for inclusive smart cities remains challenged by the scarcity of training data capturing people with mobility impairments. We propose SYNTHUA-DT, a novel methodological framework that integrates unmanned aerial vehicle (UAV) photogrammetry, 3D digital twin modeling, and high-fidelity simulation in Unreal Engine to generate annotated synthetic datasets for urban accessibility applications. This framework produces photo-realistic images with automatic pixel-perfect segmentation labels, dramatically reducing the need for manual annotation. Focusing on the detection of individuals using mobility aids (e.g., wheelchairs) in complex urban environments, SYNTHUA-DT is designed as a generalized, replicable pipeline adaptable to different cities and scenarios. The novelty lies in combining real-city digital twins with procedurally placed virtual agents, enabling diverse viewpoints and scenarios that are impractical to capture in real life. The computational efficiency and scale of this synthetic data generation offer significant advantages over conventional datasets (such as Cityscapes or KITTI), which are limited in accessibility-related content and costly to annotate. A case study using a digital twin of Curitiba, Brazil, validates the framework’s real-world applicability: 22,412 labeled images were synthesized to train and evaluate vision models for mobility aid user detection. The results demonstrate improved recognition performance and robustness, highlighting SYNTHUA-DT’s potential to advance urban accessibility by providing abundant, bias-mitigating training data. This work paves the way for inclusive computer vision systems in smart cities through a rigorously engineered synthetic data pipeline.

1. Introduction

Ensuring that urban environments are accessible and safe for people with mobility impairments is a significant societal and technological challenge. According to the World Health Organization, approximately 75 million people worldwide rely on wheelchairs for daily mobility, roughly accounting for 1% of the global population [1,2]. These individuals often face barriers in city streets and infrastructure, from uneven sidewalks to inadequate crossing times at intersections [3]. Smart city initiatives and intelligent transportation systems are increasingly looking to advanced computer vision and AI solutions to monitor, assist, and improve urban accessibility. For example, traffic signal control and autonomous vehicles need to reliably detect pedestrians using wheelchairs or other mobility aids to ensure their safety and inclusion [4]. However, developing vision models that recognize mobility-impaired pedestrians is hindered by a critical lack of representative training data [5].
Most established urban scene datasets in computer vision, such as Cityscapes [6] and KITTI [7], were created with autonomous driving in mind and provide extensive annotations for cars, pedestrians, and other common road entities. While these benchmarks have catalyzed progress in object detection and segmentation, they have little to no explicit representation of people using mobility aids [8]. Pedestrians in wheelchairs or using walkers or other assistive devices are highly underrepresented in common datasets, if present at all [5]. This gap means that state-of-the-art detectors trained on such data may perform poorly on disabled pedestrians—an issue with profound safety implications. In fact, analyses of bias in pedestrian detection highlight that a lack of disability representation in training datasets can lead to vision models that systematically overlook or misclassify wheelchair users [9]. This under-representation has been flagged as a serious concern by safety advocates, as it risks embedding ableist biases into algorithms deployed in autonomous vehicles and smart city systems [5]. There is a pressing need for datasets and methods that specifically target the inclusion of mobility-impaired individuals to improve model fairness and reliability in urban accessibility contexts.
Acquiring such specialized real-world data, however, poses substantial practical and ethical challenges [5]. Images of people with disabilities in public spaces are relatively rare and can be difficult to capture at scale due to privacy concerns and the dignity of subjects [5,9]. Manual annotation of urban scene imagery at pixel-level detail is notoriously labor-intensive—for instance, creating the Cityscapes dataset required over 1.5 h of human labeling per image on average [6]. Capturing enough instances of mobility aid users in varied scenarios (different streets, weather, lighting, and occlusions) would require enormous data collection efforts and annotation time, which is often infeasible [5]. Moreover, relying solely on real-world data may limit the diversity of scenarios; certain dangerous or uncommon situations (e.g., a wheelchair user in a busy intersection at night or in adverse weather) cannot be easily or ethically recorded for training data. These limitations motivate exploring alternative data generation approaches that can fill the gap.
Synthetic data generation has emerged as a promising solution to overcome data scarcity in computer vision. By leveraging computer graphics and simulation, researchers have created virtual urban environments to render labeled images automatically. Early examples include the SYNTHIA dataset, which provides 9400 photo-realistic frames of a virtual city with pixel-level semantic annotations, and a GTA5-based dataset with over 24,000 annotated images [10]. These synthetic datasets demonstrated that models trained on or augmented with simulated images can achieve competitive performance and even generalize to real imagery under certain conditions. The key advantage of simulation is the ability to obtain “pixel-perfect” ground truth for each image without manual labeling—every object’s mask, class, and even depth can be exported directly from the 3D engine [5]. Furthermore, synthetic data allow controlled variation of conditions (lighting, viewpoint, object positions) to systematically enrich the training distribution. Despite these benefits, existing synthetic urban datasets were not designed with urban accessibility in mind. They typically model generic traffic scenes and standard pedestrians, failing to include detailed representations of mobility aids or the true complexity of specific real-world city environments [10].
Digital twin technology offers a compelling opportunity to elevate synthetic data generation for urban accessibility applications. A digital twin is a high-fidelity virtual replica of a real-world environment—in this context, a 3D model of a city or district, geometrically and visually matching the actual urban space [11,12]. Recent advances in UAV-based photogrammetry and 3D scanning have made it feasible to create large-scale city models with sub-meter detail [13]. By deploying drones or other sensors, one can capture thousands of aerial and street-level photographs of city streets and reconstruct them into textured 3D meshes that mirror real buildings, roads, and sidewalks [14]. These photorealistic digital twins, when imported into simulation engines, provide an authentic backdrop that ensures that synthetic images resemble real urban scenes in appearance and layout. Crucially, a digital twin-based approach can incorporate city-specific characteristics (architectural styles, road geometry, signage, etc.) and thus produce data that are geographically relevant for applications in that city [5]. In the context of accessibility, a digital twin enables simulation of mobility scenarios in situ—for example, one can virtually place a wheelchair user on an actual city sidewalk model to study visibility with respect to real parked cars, vegetation, and infrastructure. This level of realism and location specificity goes beyond earlier synthetic datasets based on entirely fictional cities.
Previous research conducted by our group offered a systematic review of computer vision approaches applied to urban accessibility, revealing critical gaps in current datasets, particularly the lack of representation of individuals with mobility impairments [5]. That study underscored how this underrepresentation propagates structural bias in AI models, leading to decreased performance in detecting wheelchair users and others utilizing mobility aids. It also highlighted a broader challenge: the absence of scalable, ethically responsible methods for acquiring representative, high-quality data to train inclusive AI systems. These insights pointed to an urgent research need—developing novel data generation strategies capable of capturing the complexity, diversity, and spatial realism of real-world accessibility scenarios.
Motivated by these findings, the present study introduces SYNTHUA-DT (Synthetic Urban Accessibility—Digital Twin), a methodological framework designed to fill that gap through automated synthetic dataset generation rooted in digital twin technologies. Unlike prior synthetic datasets developed for generic urban scenes, SYNTHUA-DT systematically integrates UAV-based photogrammetry, high-fidelity 3D reconstruction, and simulation in Unreal Engine to create annotated visual data tailored specifically for the detection of mobility aid users in complex urban contexts. The proposed pipeline is designed to be generalizable across cities and adaptable to different use-cases, ensuring broad applicability.
The process begins with the acquisition of aerial and ground imagery of a real urban area using drones and cameras, which are then processed through photogrammetric reconstruction to generate a geometrically accurate digital twin encompassing buildings, streets, terrain, and infrastructure. This digital twin is subsequently imported into the Unreal Engine simulation environment, where virtual human avatars equipped with various mobility aids (e.g., wheelchairs, walkers, crutches) are procedurally placed and animated within the environment. Leveraging the simulation engine’s capacity to render diverse conditions—such as varying lighting, crowd density, occlusions, and weather—the framework enables the generation of synthetic sensor data, including photorealistic images with precise ground truth annotations (segmentation masks, bounding boxes, depth, and pose).
Through iterative simulation runs, SYNTHUA-DT produces a high-volume, pixel-perfect dataset representing mobility-impaired individuals in realistic urban scenarios, many of which would be infeasible or ethically problematic to capture in the real world. In doing so, this framework directly operationalizes the research directions identified in our prior review, bridging the gap between theoretical critique and practical solutions in the pursuit of inclusive, fairness-aware AI for smart cities.
The focus of our initial investigation with SYNTHUA-DT is on the detection of wheelchair users in urban scenes, a critical task for assistive technologies and autonomous navigation systems [5]. To validate the framework’s effectiveness, we developed a case study using the city of Curitiba, Brazil, as a testbed. Curitiba is internationally recognized for its innovative urban planning and accessibility initiatives, making it an ideal candidate for a digital twin-driven analysis. A section of the city’s downtown was scanned using UAV photogrammetry and laser scanning, yielding a detailed digital twin model of its streetscape. Within this virtual Curitiba, we populated various scenarios with synthetic pedestrians, including numerous wheelchair users navigating different contexts such as busy intersections, boarding buses, and traversing curb cuts. By rendering images from multiple camera viewpoints (mimicking street CCTV, vehicle dashcams, and aerial drone perspectives), we constructed a diverse synthetic dataset of Curitiba’s urban accessibility situations. Examples of this dataset are presented later (see dataset illustration in the Results section), showcasing the photorealistic alignment of the digital twin imagery with real city visuals and the corresponding pixel-perfect segmentation of a wheelchair user and surrounding objects. The dataset includes a wide range of simulated conditions—varying lighting, occlusions, crowd densities, and perspectives (e.g., street-level, aerial, and vehicular viewpoints)—captured from photorealistic renderings of a real city’s digital twin. Each frame is paired with automatically generated ground truth annotations, including semantic segmentation masks and object-level metadata. While this work does not include model training or evaluation, the dataset is structured to support downstream tasks such as object detection, pedestrian tracking, and accessibility analysis. The pipeline demonstrates a significant reduction in manual annotation time and effort, showcasing its efficiency and potential scalability for future development of inclusive AI systems.
This paper addresses a clear research gap at the intersection of smart cities, computer vision, and accessibility: the need for realistic, scalable training data to enable AI systems to better serve people with disabilities. Our contributions are threefold:
  • Generalizable Synthetic Data Framework: We develop SYNTHUA-DT, a novel framework that integrates drone-based photogrammetry, digital twin construction, and simulation in Unreal Engine to generate synthetic datasets with automatic annotations. This framework is replicable and not tied to a specific locale, providing a template for other cities or mobility scenarios to create their own data for accessibility-focused vision models.
  • Urban Accessibility Dataset and Analysis: Using SYNTHUA-DT, we create a first-of-its-kind annotated dataset of urban mobility aid users (wheelchair pedestrians) in a realistic city environment. We provide a detailed analysis of the dataset’s diversity and fidelity, and we demonstrate that training on this synthetic data yields improved detection performance for mobility-impaired individuals compared to models trained on standard benchmarks. The dataset and generation tools will be made available to the research community to encourage further developments in inclusive AI.
  • Case Study Validation on Curitiba: We present a comprehensive case study of Curitiba, Brazil, as a proof of concept for the framework. We describe the end-to-end pipeline implementation—from UAV data capture to 3D reconstruction and simulation—and validate the effectiveness of SYNTHUA-DT by integrating the synthetic data into a pedestrian detection task. This case study showcases the framework’s real-world applicability and highlights practical considerations (e.g., computational costs, photogrammetry accuracy, domain gap) when deploying digital twin-based data generation in an actual city context.
The remainder of this article is organized as follows. The Introduction contextualizes the challenges of data scarcity in urban accessibility applications and motivates the development of the SYNTHUA-DT framework. Section 2 (Materials and Methods) details the proposed pipeline, including UAV-based photogrammetric data capture, digital twin construction through BIM modeling, simulation environment setup in Unreal Engine, automatic annotation strategies, and computational optimization procedures. Section 3 (Results) presents the implementation of the framework in the city of Curitiba, Brazil, showcasing the generation of a large-scale synthetic dataset with semantic diversity and annotation fidelity. Section 4 (Discussion) explores the scalability, generalizability, and potential applications of SYNTHUA-DT in supporting accessibility-focused computer vision tasks, including considerations for integration with real-world datasets. Finally, Section 5 (Conclusions) summarizes the main contributions and outlines future directions for advancing inclusive and efficient urban scene understanding through synthetic data and digital twin technologies.

2. Materials and Methods

2.1. Methodology

The proposed methodology, SYNTHUA-DT (Synthetic Urban Accessibility—Digital Twin), is a generalized, replicable framework for the generation of annotated synthetic datasets aimed at enhancing urban accessibility through computer vision. The framework consists of five integrated phases: (1) UAV-based data capture, (2) digital twin construction, (3) simulation environment setup in Unreal Engine, (4) synthetic dataset generation and automatic annotation, and (5) computational optimization and quality assurance. Although the framework is universally applicable, this study validates it using the city of Curitiba, Brazil, as a representative case. An overview of the complete pipeline and its constituent modules is presented in Figure 1.

2.2. UAV-Based Data Capture

The foundation of the SYNTHUA-DT framework rests upon the creation of accurate, high-fidelity digital twins, which in turn depend on the acquisition of georeferenced imagery with high spatial resolution and photogrammetric quality. This section describes the aerial data acquisition methodology employed in the framework, focusing on systematic UAV-based imaging and rigorous reconstruction procedures that ensure geometric precision and completeness of urban spatial information [15].

2.2.1. Systematic Aerial Imaging with UAVs

To initiate the synthetic dataset generation pipeline, a photogrammetric survey was conducted over the “Largo da Ordem” neighborhood in Curitiba, Brazil—a site selected for its diverse architectural typologies, dense urban morphology, and relevance to accessibility challenges in heritage-rich city centers. The objective was to generate a complete, metrically accurate 3D model capable of serving as the basis for simulation and automatic semantic annotation.
Data acquisition was carried out using a DJI Mavic 2 Pro drone (SZ DJI Technology Co., Ltd., Shenzhen, China), equipped with an integrated 20-megapixel 1-inch CMOS sensor, an f/2.8 aperture, and a 1/1000 s shutter speed, ensuring sharp image capture and consistent exposure across flights. The UAV followed a predefined grid flight pattern at a constant altitude of 150 m, designed to provide comprehensive spatial coverage of the 1.5 km² area of interest.
To support Structure-from-Motion (SfM) and Multi-View Stereo (MVS) algorithms, the flight plan enforced a lateral overlap (S_L) of 80% and a frontal overlap (S_F) of 70% between adjacent images. These values were selected to maximize redundancy in feature visibility across multiple viewing angles and to mitigate occlusion effects in narrow urban corridors [15].
The horizontal image footprint width W, calculated using the UAV’s field of view θ, is as follows:
W = 2h \tan\left(\frac{\theta}{2}\right)
Substituting h = 150 m and θ = 84°, the estimated footprint was as follows:
W = 2 \times 150 \times \tan(42^\circ) \approx 270~\text{m}
The inter-image distance D_f was then derived as follows:
D_f = (1 - S_L)\,W = (1 - 0.8) \times 270 \approx 54~\text{m}
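For illustration, the footprint and spacing calculation above can be reproduced with a short Python sketch; the function and variable names are illustrative rather than part of the SYNTHUA-DT tooling.

```python
import math

def flight_spacing(altitude_m: float, fov_deg: float, overlap: float) -> tuple[float, float]:
    """Compute the ground footprint width and image spacing for a nadir UAV survey.

    altitude_m: flight altitude above ground (m)
    fov_deg:    horizontal field of view of the camera (degrees)
    overlap:    fractional overlap between adjacent images (e.g., 0.8)
    """
    footprint = 2 * altitude_m * math.tan(math.radians(fov_deg) / 2)  # W = 2h*tan(theta/2)
    spacing = (1 - overlap) * footprint                               # D_f = (1 - S)*W
    return footprint, spacing

# Values used in the Curitiba survey: h = 150 m, theta = 84 deg, lateral overlap 80%
W, Df = flight_spacing(150, 84, 0.80)
print(f"footprint ≈ {W:.0f} m, inter-image distance ≈ {Df:.0f} m")  # ≈ 270 m and ≈ 54 m
```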
A total of 1200 nadir-oriented images were captured under controlled weather conditions, minimizing wind drift and variations in illumination. Temporal continuity between flight segments was also preserved to avoid light-angle discrepancies that could affect feature detection and 3D point triangulation. The result was a dense, high-resolution photographic dataset with sufficient stereoscopic overlap to enable robust 3D reconstruction, as illustrated in Figure 2.

2.2.2. Photogrammetric Reconstruction and Point Cloud Generation

The acquired images were processed using PIX4Dmapper (version 4.8.1, Pix4D S.A., Prilly, Switzerland), an industry-standard photogrammetric software that implements a multi-stage reconstruction pipeline rooted in computer vision:
  • Feature Detection: Keypoints k_i were extracted from each image I_i using the Scale-Invariant Feature Transform (SIFT), chosen for its robustness to scale, rotation, and brightness variation.
  • Feature Matching: Descriptor vectors d_i and d_j were compared across images by minimizing the Euclidean distance:
    \mathrm{dist}(d_i, d_j) = \lVert d_i - d_j \rVert
  • 3D Point Triangulation: Matched keypoints were triangulated using known camera projection matrices P_i, resulting in an initial sparse reconstruction:
    X_i = \mathrm{triangulation}(P_i, k_i)
  • Bundle Adjustment: Global optimization was applied to jointly refine camera poses and 3D point locations by minimizing the reprojection error across all views:
    \min_{X, P} \sum_{i} \sum_{j} \lVert k_{ij} - \pi(P_i, X_j) \rVert^2
    where π is the projection function that maps 3D points onto the image plane.
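The first two stages of this pipeline can be illustrated with a minimal OpenCV sketch (assuming opencv-python 4.4 or later, where SIFT is available in the main module); PIX4Dmapper performs these steps internally, so the snippet below is purely didactic and the image paths are placeholders.

```python
import cv2

# Minimal sketch: SIFT keypoint detection and descriptor matching between two
# overlapping survey frames, mirroring the first two stages described above.
img_a = cv2.imread("frame_0001.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("frame_0002.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)   # keypoints k_i and descriptors d_i
kp_b, des_b = sift.detectAndCompute(img_b, None)

# Brute-force matching on Euclidean (L2) distance, filtered with Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
candidates = matcher.knnMatch(des_a, des_b, k=2)
good = [m for m, n in candidates if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative correspondences between the two frames")
```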
Following sparse reconstruction, a Multi-View Stereo (MVS) phase densified the model by computing depth maps from multiple overlapping views and fusing them into a single dense point cloud. The final result consisted of approximately 150 million 3D points, capturing architectural surfaces, vegetation structures, and topographic relief with high fidelity [15], as shown in Figure 3.

2.2.3. Geometric Accuracy Assessment

To assess the metric accuracy of the reconstructed environment, a set of ten Ground Control Points (GCPs) with known geodetic coordinates (X_GCP, Y_GCP, Z_GCP) was used as ground truth. The reconstructed coordinates (X_i, Y_i, Z_i) of matching features were compared against their GCP counterparts using the Root Mean Square Error (RMSE) metric:
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (X_i - X_{\mathrm{GCP},i})^2 + (Y_i - Y_{\mathrm{GCP},i})^2 + (Z_i - Z_{\mathrm{GCP},i})^2 \right]}
The RMSE obtained for the Curitiba digital twin was 2.8 cm, significantly below the 5 cm threshold typically accepted in urban cartography and 3D city modeling. This level of precision ensures compatibility with simulation engines and downstream AI pipelines that rely on metric consistency for realistic rendering, object placement, and annotation alignment.
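The RMSE check against the GCPs reduces to a few lines of NumPy; the coordinate arrays below are illustrative placeholders rather than the actual survey values.

```python
import numpy as np

# Minimal sketch: 3D RMSE between reconstructed points and surveyed ground control
# points. Arrays are illustrative N x 3 matrices of (X, Y, Z) coordinates in metres.
reconstructed = np.array([[672431.02, 7184102.11, 912.34],
                          [672455.87, 7184090.55, 913.01]])
gcp_reference = np.array([[672431.00, 7184102.13, 912.31],
                          [672455.90, 7184090.52, 913.05]])

squared_errors = np.sum((reconstructed - gcp_reference) ** 2, axis=1)  # per-point 3D error^2
rmse = np.sqrt(np.mean(squared_errors))
print(f"RMSE = {rmse * 100:.1f} cm")
```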
This process of aerial imaging and 3D reconstruction provides the foundational geometry for SYNTHUA-DT, enabling the creation of photorealistic and metrically precise digital twins. The subsequent section outlines how these reconstructions are integrated into Unreal Engine to generate fully simulated environments with automated semantic labeling, as illustrated in Figure 4 [15].

2.3. Digital Twin Construction via Photogrammetry and 3D Modeling

Following the generation of a dense point cloud through UAV-based photogrammetry, the next phase in the SYNTHUA-DT pipeline involves transforming raw spatial data into a semantically structured and metrically coherent digital twin. This virtual representation of the urban environment serves as the foundation for downstream simulation, scene synthesis, and automated dataset generation [16].

2.3.1. Import and Parametric Architectural Modeling

The point cloud, consisting of approximately 150 million 3D points, was imported into Autodesk Revit, a Building Information Modeling (BIM) platform, to enable precise architectural reconstruction and semantic enrichment of the spatial environment. The goal of this phase is not only to visually replicate the urban geometry, but also to preserve topological consistency and ensure interoperability with simulation engines and analytical tools [16].
To model surfaces from unstructured point data, parametric modeling techniques were employed. Plane fitting was performed using least-squares optimization, enabling the extraction of planar surfaces for walls, roofs, and pavements from noisy or incomplete regions of the point cloud. The fitting objective can be formalized as follows:
\min_{\mathbf{n}, d} \sum_{i} \left( \mathbf{n}^{\top} X_i + d \right)^2
where n is the unit normal vector of the fitted plane, d is the distance from the origin, and Xᵢ are the 3D points belonging to the surface. This approach allowed the generation of geometrically accurate surfaces that reflect real-world construction alignments and dimensions, as illustrated in Figure 5.
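A minimal NumPy sketch of this least-squares plane fit is shown below; it uses the standard SVD-based solution of the objective above, whereas production BIM workflows typically add robust (e.g., RANSAC-style) outlier handling.

```python
import numpy as np

def fit_plane(points: np.ndarray) -> tuple[np.ndarray, float]:
    """Least-squares plane fit to an (N, 3) array of 3D points.

    Returns the unit normal n and offset d such that n . x + d ≈ 0 for points
    on the plane. Illustrative sketch of the fitting objective described above.
    """
    centroid = points.mean(axis=0)
    # The normal of the best-fit plane is the right singular vector associated
    # with the smallest singular value of the centered point matrix.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    d = -normal @ centroid
    return normal, d

# Example: noisy samples from a roughly horizontal pavement surface
rng = np.random.default_rng(0)
xy = rng.uniform(0, 10, size=(500, 2))
z = 0.02 * xy[:, 0] + 1.5 + rng.normal(scale=0.005, size=500)  # slight slope plus noise
n, d = fit_plane(np.column_stack([xy, z]))
print("unit normal:", np.round(n, 3))
```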

2.3.2. Level of Detail and Geometric Precision

The architectural reconstruction adhered to the Level of Detail (LOD) 300 standard, as defined in the BIM specification. At this level, building elements are represented with precise dimensions and materials, sufficient to support structural interpretation, visual simulation, and contextual annotation. Key components modeled included building facades, doorways, curb ramps, sidewalks, vegetation volumes, street furniture, and other critical elements influencing accessibility [16].
To optimize performance without compromising geometric integrity, a mesh simplification process was applied. This procedure reduces the number of polygons and computational complexity while respecting a tolerance threshold ε for surface approximation:
\mathrm{dist}(X_i, \text{Simplified Model}) \le \varepsilon
In the Curitiba case, a maximum tolerance of 5 mm was maintained across all simplifications, ensuring that the digital twin retained sufficient accuracy for realistic simulation and precision annotation of mobility-related features, as shown in Figure 6.

2.3.3. Export and Integration

Upon completion of the architectural modeling, the scene was exported in FBX format, a widely supported 3D interchange format compatible with real-time rendering engines. The exported model preserved hierarchical structure, material assignments, and object-level semantics, allowing seamless import into Unreal Engine for the subsequent phase of the SYNTHUA-DT framework: simulation environment construction and automated synthetic image generation. The resulting digital twin represents a high-resolution, metrically accurate, and semantically enriched reconstruction of the urban environment. This model bridges the gap between geospatial reality and synthetic simulation, serving as the backbone for producing large-scale, diverse, and annotated datasets that accurately reflect the structural and social complexities of real-world urban accessibility scenarios, as illustrated in Figure 7.

2.4. Simulation Environment Creation in Unreal Engine

Once the 3D architectural model is geometrically reconstructed and semantically structured, it is imported into Unreal Engine, where it is transformed into an interactive, high-fidelity simulation environment. This virtual world constitutes the operational core of the SYNTHUA-DT framework, enabling the rendering of synthetic urban imagery and the automated generation of pixel-accurate ground truth annotations. The goal of this phase is to achieve photorealism, physical coherence, and behavioral diversity, allowing models trained on synthetic data to generalize to real-world scenarios [17], as shown in Figure 8.

2.4.1. Unit Standardization and Spatial Alignment

To ensure geometric consistency within the simulation engine, the imported model undergoes unit standardization, converting the modeling scale from meters (Revit) to centimeters (Unreal Engine’s native unit system). A uniform scale factor of s = 100 is applied to all model components, and spatial transformations are used to align the model’s origin and orientation with Unreal’s coordinate reference system. This alignment is crucial for ensuring the compatibility of camera placement, lighting systems, and agent navigation with the physical layout of the urban scene [17].

2.4.2. Physically Based Rendering and Material Configuration

To achieve visual realism, the simulation environment leverages Physically Based Rendering (PBR) techniques, which allow materials to respond realistically to lighting conditions by modeling surface-level optical phenomena. Each mesh surface is assigned material properties defined by the following:
  • Base Color (Albedo): C alb , representing the inherent reflectance of the surface.
  • Roughness (R): controlling the diffusion of reflections.
  • Metallic (M): determining conductivity and light absorption.
  • Normal Maps (N): used to simulate fine geometric details without increasing polygon count.
These properties are assigned using high-resolution texture libraries that correspond to real-world architectural elements (e.g., asphalt, concrete, glass, vegetation). The visual consistency between materials and lighting is further enhanced by integrating Bidirectional Reflectance Distribution Functions (BRDFs) and Fresnel reflectance models, ensuring accurate simulation of specular highlights and surface gloss under varying light angles.

2.4.3. Lighting Model and Environmental Illumination

Realistic illumination is a critical factor in bridging the visual domain gap between synthetic and real-world imagery. The lighting system is configured to support dynamic global illumination, ambient occlusion, and real-time ray tracing. These techniques model light transport, interreflection, and soft shadowing with physical plausibility. A procedural skybox is used to simulate diurnal cycles and environmental variations in lighting conditions, including low-angle sunlight, overcast skies, and nighttime scenes. Light source parameters such as intensity, temperature, and diffusion are controlled algorithmically, enabling consistent photometric behavior across large image sets. This ensures that object visibility, cast shadows, and color gradients evolve realistically under different simulated times of day and weather states, as illustrated in Figure 9.

2.4.4. Dynamic Agents and Mobility Simulation

To represent social and pedestrian dynamics, the virtual environment is populated with synthetic agents. These agents include both standard pedestrians and a defined proportion of individuals using assistive mobility devices—wheelchairs, crutches, and canes. The behavioral logic of these agents is governed by a Navigation Mesh (NavMesh), encoded as a directed graph G = (V,E), where vertices V represent traversable surfaces and edges E encode valid movement paths.
Agent movement is guided by an A* pathfinding algorithm that computes optimal trajectories between waypoints, minimizing cost functions associated with travel time and obstacle avoidance. To promote environmental realism, agents are programmed with the following:
  • Varying speeds and gait patterns.
  • Pausing and interaction behaviors at crossings or benches.
  • Path adjustments in response to crowding or obstacles.
Mobility aid users are assigned constraints that reflect real-world accessibility factors, such as preference for curb ramps and aversion to uneven terrain, which further enhances the fidelity of the mobility simulation.
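The cost-minimizing behavior described above can be sketched with a generic A* search over a small directed graph; Unreal Engine's NavMesh agents use their own built-in pathfinding, so the Python snippet below is only illustrative, and the "uneven terrain" penalty is a hypothetical stand-in for the accessibility constraints assigned to mobility aid users.

```python
import heapq
from itertools import count

def a_star(graph, cost, heuristic, start, goal):
    """Illustrative A* search over a directed graph G = (V, E).

    graph:     dict mapping node -> iterable of successor nodes
    cost:      function (u, v) -> edge traversal cost
    heuristic: function (v) -> admissible estimate of remaining cost to the goal
    """
    tie = count()                                   # breaks ties without comparing nodes
    frontier = [(heuristic(start), next(tie), 0.0, start, None)]
    best_g, parent_of = {start: 0.0}, {}
    while frontier:
        _, _, g, node, parent = heapq.heappop(frontier)
        if node in parent_of:
            continue                                # already expanded via a cheaper path
        parent_of[node] = parent
        if node == goal:
            path = [node]
            while parent_of[path[-1]] is not None:
                path.append(parent_of[path[-1]])
            return path[::-1]
        for nbr in graph.get(node, ()):
            new_g = g + cost(node, nbr)
            if new_g < best_g.get(nbr, float("inf")):
                best_g[nbr] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nbr), next(tie), new_g, nbr, node))
    return None

# Toy example: a hypothetical "uneven terrain" penalty steers a wheelchair agent
# toward the curb-ramp route A -> C -> D instead of the penalized A -> B -> D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
uneven = {("A", "B")}
edge_cost = lambda u, v: 5.0 if (u, v) in uneven else 1.0
print(a_star(graph, edge_cost, heuristic=lambda v: 0.0, start="A", goal="D"))  # ['A', 'C', 'D']
```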

2.4.5. Environmental Variation and Stochastic Modeling

To simulate real-world operational complexity, environmental variability is introduced into the simulation through parametrically controlled scenarios. These include the following:
  • Weather states: rain, fog, snow, and dry conditions, implemented via particle systems and volumetric scattering.
  • Lighting transitions: including sunrise, sunset, and night scenes.
  • Surface wetness and reflectance adjustments to simulate post-precipitation effects.
The density and distribution of dynamic agents are modeled using a Poisson distribution:
P(k; \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}
where λ represents the mean agent density per square meter. For the Curitiba simulation environment, λ = 15 agents/m² was empirically chosen based on three interrelated criteria:
  • Urban Realism: Urban studies and crowd simulation literature report pedestrian densities in high-traffic zones (e.g., intersections, transit hubs) ranging between 10 and 20 persons/m². A value of 15 lies within this plausible range, approximating the dynamic density observed during peak hours in central Curitiba, thus enhancing perceptual realism.
  • Occlusion and Interaction Diversity: Higher densities increase the likelihood of occlusion events, inter-agent interactions, and the presence of crowded scenes. These conditions are critical for benchmarking segmentation models, as they introduce visual ambiguity and spatial complexity.
  • Computational Manageability: Densities significantly higher than 15 resulted in excessive rendering times and memory usage, while lower values reduced the frequency of rare accessibility-related interactions. The selected value reflects an optimal trade-off that ensures simulation efficiency without compromising the diversity or challenge level of the visual scenes.
This Poisson-based stochastic modeling introduces natural variation across synthetic frames, enabling a wide distribution of scene complexities. It ensures that the dataset captures both sparse and dense urban configurations, which is essential for training generalizable detection and segmentation models under real-world conditions.
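In practice, drawing per-scene agent counts from this model is straightforward; the sketch below assumes a hypothetical walkable area per rendered scene and uses the mean density adopted for the Curitiba simulation.

```python
import numpy as np

# Minimal sketch: sample per-scene agent counts from a Poisson distribution, as in
# the stochastic crowd model above. The walkable area is an illustrative placeholder;
# the mean density follows the paper's choice of 15 agents per square metre.
rng = np.random.default_rng(seed=42)
lam_density = 15          # mean agents per square metre (Section 2.4.5)
walkable_area_m2 = 40     # hypothetical walkable area visible in one rendered scene

agent_counts = rng.poisson(lam=lam_density * walkable_area_m2, size=5)
print(agent_counts)       # number of agents to spawn in each of 5 simulation runs
```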

2.4.6. Final Rendering and Output Calibration

The virtual scenes are rendered using camera setups that mimic real-world imaging systems, including fixed surveillance cameras, vehicle-mounted viewpoints, and UAV perspectives. Each render produces an RGB image and, simultaneously, an annotated semantic mask via material ID substitution. The rendered outputs are configured for consistent resolution (1920 × 1080 pixels) and color calibration, allowing for uniformity across image sequences. This stage completes the transformation of the digital twin into a photorealistic, behaviorally dynamic, and procedurally controllable simulation space—capable of producing diverse, pixel-perfect synthetic data for training accessibility-aware computer vision models [17], as shown in Figure 10.

2.5. Synthetic Dataset Generation and Automatic Annotation

The culmination of the SYNTHUA-DT pipeline resides in its ability to generate large-scale, pixel-perfect annotated datasets through fully automated simulation. Leveraging the high-fidelity digital twin and procedural simulation environment created in Unreal Engine, this phase operationalizes the generation of synthetic RGB images and corresponding semantic segmentation masks. The resulting datasets are structurally aligned, semantically rich, and optimized for training deep learning models in urban accessibility applications [18].

2.5.1. Semantic Mask Generation via Material Encoding

To enable automatic annotation, a chromatic encoding scheme is adopted wherein each semantic class c in the environment is assigned a unique RGB color triplet C_c = (R_c, G_c, B_c). This mapping encompasses 18 distinct object classes, including urban infrastructure (e.g., buildings, roads, vegetation), vehicles, pedestrians, and individuals with mobility aids (e.g., wheelchairs, crutches, and canes) [19].
During rendering, Unreal Engine applies a material override pass in which all scene textures are replaced with their corresponding RGB identifiers. The resulting image represents a semantic segmentation mask M(p), where each pixel p carries a class-specific color:
M(p) = \begin{cases} C_c, & \text{if } p \in c \\ 0, & \text{otherwise} \end{cases}
This process produces pixel-accurate, semantically consistent masks fully aligned with the visual content of the rendered scene.
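The color-keyed mask construction can be summarized in a few lines of NumPy; the palette below is an illustrative (Cityscapes-like) mapping, not the exact color scheme used in SYNTHUA-DT.

```python
import numpy as np

# Minimal sketch: build a semantic mask M(p) by stamping each class's unique RGB triplet.
CLASS_COLORS = {
    "road":       (128, 64, 128),
    "sidewalk":   (244, 35, 232),
    "wheelchair": (220, 20, 60),
}

def paint_class(mask: np.ndarray, region: np.ndarray, class_name: str) -> None:
    """Assign the class color C_c to every pixel p inside the boolean region."""
    mask[region] = CLASS_COLORS[class_name]

height, width = 512, 910
mask = np.zeros((height, width, 3), dtype=np.uint8)          # background = 0
region = np.zeros((height, width), dtype=bool)
region[300:, :] = True                                        # e.g., lower band is road
paint_class(mask, region, "road")
```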

2.5.2. Image Rendering and Tensor Conversion

Each simulation is executed with predefined camera viewpoints, including street-level, aerial, and CCTV-style perspectives. For every frame rendered at 1920 × 1080 pixels, two outputs are generated:
  • The RGB image I_render, containing photorealistic visual information.
  • The segmentation mask M, encoded in the RGB format described above.
To facilitate integration with modern deep learning pipelines, each RGB mask is converted into a multi-channel binary tensor M \in \mathbb{R}^{H \times W \times N}, where H and W represent the image dimensions and N the number of semantic classes. Each channel M_c(p) of the tensor indicates the presence or absence of class c at pixel p:
M_c(p) = \begin{cases} 1, & \text{if } M(p) = C_c \\ 0, & \text{otherwise} \end{cases}
This tensor-based representation ensures compatibility with standard loss functions (e.g., cross-entropy, Dice, focal loss) and metrics (e.g., IoU, pixel accuracy) used in semantic segmentation tasks.
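A minimal sketch of this RGB-to-tensor conversion is given below, again with an illustrative three-class palette; the real dataset applies the same logic over its full class list.

```python
import numpy as np

# Minimal sketch: convert an RGB-encoded mask (H, W, 3) into a binary tensor
# (H, W, N) with one channel per semantic class. The palette is illustrative.
CLASS_COLORS = [(128, 64, 128), (244, 35, 232), (220, 20, 60)]  # road, sidewalk, wheelchair

def rgb_mask_to_tensor(mask: np.ndarray, class_colors) -> np.ndarray:
    h, w, _ = mask.shape
    tensor = np.zeros((h, w, len(class_colors)), dtype=np.uint8)
    for channel, color in enumerate(class_colors):
        # M_c(p) = 1 where the pixel color equals the class triplet C_c
        tensor[..., channel] = np.all(mask == np.array(color, dtype=np.uint8), axis=-1)
    return tensor

mask = np.zeros((512, 910, 3), dtype=np.uint8)
mask[300:, :] = (128, 64, 128)                     # lower band labeled as road
onehot = rgb_mask_to_tensor(mask, CLASS_COLORS)
print(onehot.shape, int(onehot[..., 0].sum()))     # (512, 910, 3) and the road pixel count
```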

2.5.3. Dataset Volume and Scenario Diversity

In the Curitiba case study, the SYNTHUA-DT framework was used to generate 22,412 paired RGB–mask samples. These images capture 37 semantic categories under eight distinct environmental conditions, including variations in illumination (e.g., daytime, nighttime), weather (e.g., fog, rain, snow), and agent density. This procedural diversity enhances the generalization capacity of trained models by simulating real-world variability in visual perception, occlusion, and context [18].
All images were rendered with photometric consistency and physically based lighting, ensuring that shadows, material reflectance, and specular highlights remained coherent across frames. This consistency supports stable learning dynamics in training regimes, particularly in encoder–decoder segmentation networks.

2.5.4. Mask Reconstruction for Inference Visualization

During the inference phase, trained segmentation models output a class probability vector \hat{M}_c(p) \in [0,1]^N for each pixel. The final predicted class c^*(p) is computed via the following:
c^*(p) = \arg\max_{c} \hat{M}_c(p)
To reconstruct a visually interpretable mask, each pixel is assigned the corresponding class color:
\hat{M}(p) = C_{c^*(p)}
This reconstruction enables direct comparison between ground truth and predicted masks, facilitating both quantitative evaluation (e.g., per-class IoU, confusion matrices) and qualitative analysis (e.g., edge sharpness, class coherence) of model performance.
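The reconstruction step itself amounts to an argmax followed by a palette lookup, as in the following sketch (with an illustrative palette and random probabilities standing in for model output):

```python
import numpy as np

# Minimal sketch: turn per-pixel class probabilities (H, W, N) back into a
# color-coded mask for visual inspection. The palette is illustrative.
PALETTE = np.array([(128, 64, 128), (244, 35, 232), (220, 20, 60)], dtype=np.uint8)

def probabilities_to_color_mask(probs: np.ndarray) -> np.ndarray:
    class_index = np.argmax(probs, axis=-1)       # c*(p) = argmax_c M_c(p)
    return PALETTE[class_index]                   # M(p) = C_{c*(p)}

probs = np.random.default_rng(0).random((512, 910, 3))   # stand-in for model output
color_mask = probabilities_to_color_mask(probs)
print(color_mask.shape, color_mask.dtype)                 # (512, 910, 3) uint8
```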
By automating the full pipeline—from rendering to tensor conversion—SYNTHUA-DT eliminates human annotation error, reduces cost, and offers fine-grained control over semantic granularity, visual diversity, and dataset scalability. The resulting datasets are highly suited for training robust segmentation models, particularly those addressing the nuanced task of recognizing individuals with mobility impairments in complex urban settings.

2.6. Computational Optimization and Quality Assurance

Given the scale and resolution of the synthetic dataset generated through the SYNTHUA-DT framework, computational optimization and robust quality assurance mechanisms are essential to ensure its practical usability in deep learning pipelines. This phase focuses on reducing storage and processing demands while validating the semantic integrity of the data through both statistical and visual criteria.

2.6.1. Image Downsampling and Efficiency Gains

To enable scalable training across different computational infrastructures, all RGB images and semantic masks were downsampled from 1920 × 1080 to 910 × 512 pixels using bilinear interpolation. This operation preserved semantic content and annotation precision while achieving significant computational benefits:
  • Storage reduction: Total dataset size was reduced from 337 GB to 84 GB.
  • Processing time: Average processing time per image was reduced from 35 s to 10 s—a 71.43% improvement.
This downsampling strategy allows the dataset to be used efficiently in GPU-constrained environments and supports faster data loading and iteration cycles during training.
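A possible implementation of this step with OpenCV is sketched below; the file names are placeholders, and nearest-neighbor interpolation is shown for the mask as a common safeguard against blending class colors, although the pipeline described above applies bilinear interpolation to both outputs.

```python
import cv2

# Minimal sketch: downsample an RGB frame and its mask from 1920x1080 to 910x512.
rgb = cv2.imread("frame_000123_rgb.png")
mask = cv2.imread("frame_000123_mask.png")

rgb_small = cv2.resize(rgb, (910, 512), interpolation=cv2.INTER_LINEAR)    # bilinear, as described
mask_small = cv2.resize(mask, (910, 512), interpolation=cv2.INTER_NEAREST) # preserves class colors

cv2.imwrite("frame_000123_rgb_small.png", rgb_small)
cv2.imwrite("frame_000123_mask_small.png", mask_small)
```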

2.6.2. Structured Data Organization for ML Pipelines

To facilitate seamless integration into computer vision frameworks such as TensorFlow (version 2.15.0, Google LLC, Mountain View, CA, USA) and PyTorch (version 2.1.0, Meta AI, Menlo Park, CA, USA), the dataset was organized using a hierarchical file structure:
  • Class-based directories for each image–mask pair.
  • CSV metadata files containing image IDs, camera parameters, environmental conditions, and class instance counts.
This organization not only improves interpretability but also supports on-the-fly data augmentation and efficient batching during model training.
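One possible realization of this organization is sketched below; the directory and column names are hypothetical, since the paper specifies the general structure (class-based directories plus CSV metadata) but not an exact naming convention.

```python
import csv
from pathlib import Path

# Minimal sketch of one possible on-disk layout:
#   dataset/
#     images/clear_day/frame_000123.png
#     masks/clear_day/frame_000123.png
#     metadata.csv   columns: image_id, camera, weather, illumination, instance_count

def index_pairs(root: str):
    """Yield (image_path, mask_path, metadata_row) triples for every rendered frame."""
    root = Path(root)
    with open(root / "metadata.csv", newline="") as handle:
        metadata = {row["image_id"]: row for row in csv.DictReader(handle)}
    for image_path in sorted((root / "images").rglob("*.png")):
        mask_path = root / "masks" / image_path.relative_to(root / "images")
        yield image_path, mask_path, metadata.get(image_path.stem, {})

for image_path, mask_path, info in index_pairs("dataset"):
    print(image_path.name, info.get("weather", "n/a"))
```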

2.6.3. Post-Rendering Quality Control and Anomaly Detection

A comprehensive quality control routine was implemented to ensure that only high-fidelity, semantically accurate samples were included in the final dataset. The quality assurance process included the following:
  • Motion blur detection, using gradient-based heuristics to identify frames affected by excessive camera movement during simulation.
  • Occlusion analysis, ensuring key annotated elements (e.g., mobility aids) are not overly obscured by dynamic agents or environmental artifacts.
  • Lighting artifact screening, filtering out images with extreme overexposure, underexposure, or inconsistent color balance.
Anomalous frames were either corrected (when feasible) or excluded from the dataset to preserve annotation integrity and model learning consistency.
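Two of these screening heuristics can be expressed compactly with OpenCV, as in the sketch below; the thresholds are illustrative and would need tuning against the actual rendered frames.

```python
import cv2
import numpy as np

def is_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    """Gradient-based blur check: low variance of the Laplacian suggests motion blur."""
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def is_badly_exposed(gray: np.ndarray, low: float = 15.0, high: float = 240.0) -> bool:
    """Flag frames whose mean intensity indicates severe under- or overexposure."""
    mean = float(gray.mean())
    return mean < low or mean > high

frame = cv2.imread("frame_000123_rgb.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
if is_blurry(frame) or is_badly_exposed(frame):
    print("frame flagged for review or exclusion")
```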

2.6.4. Segmentation Validation Metrics

To quantitatively assess the fidelity of the segmentation masks and the consistency of the annotations across the dataset, a subset of synthetic samples was used to compute standard semantic segmentation metrics:
  • Intersection over Union (IoU) for each class c:
    \mathrm{IoU}_c = \frac{\sum_p M_c(p)\,\hat{M}_c(p)}{\sum_p \left[ M_c(p) + \hat{M}_c(p) - M_c(p)\,\hat{M}_c(p) \right]}
  • Precision for each class:
    \mathrm{Precision}_c = \frac{\sum_p M_c(p)\,\hat{M}_c(p)}{\sum_p \hat{M}_c(p)}
  • Recall for each class:
    \mathrm{Recall}_c = \frac{\sum_p M_c(p)\,\hat{M}_c(p)}{\sum_p M_c(p)}
These metrics were used to validate the pixel-wise agreement between the generated ground truth masks Mc(p) and the predictions M c ^ p produced by segmentation models trained on the synthetic dataset. High levels of agreement across all classes, particularly for assistive device categories, demonstrated the reliability and robustness of the dataset for model development.
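For a single class channel, these metrics reduce to simple set operations over binary masks, as in the following NumPy sketch with illustrative inputs:

```python
import numpy as np

# Minimal sketch: per-class IoU, precision, and recall between a binary ground-truth
# channel M_c and a binarized prediction.
def class_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-9):
    gt = gt.astype(bool)
    pred = pred.astype(bool)
    intersection = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    iou = intersection / (union + eps)
    precision = intersection / (pred.sum() + eps)
    recall = intersection / (gt.sum() + eps)
    return iou, precision, recall

gt = np.zeros((512, 910), dtype=bool);  gt[100:300, 200:600] = True
pred = np.zeros_like(gt);               pred[120:320, 220:620] = True
print([round(float(v), 3) for v in class_metrics(gt, pred)])
```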
Collectively, these computational and validation procedures ensure that the synthetic dataset generated via SYNTHUA-DT maintains a high standard of efficiency, accuracy, and reproducibility, positioning it as a reliable foundation for training advanced computer vision systems in urban accessibility contexts.

3. Results

This section presents the validation of the SYNTHUA-DT framework, structured around multiple dimensions: geometric and semantic fidelity of the generated digital twin, the diversity and statistical properties of the synthetic dataset, and comparative analysis with existing benchmarks. We also examine computational efficiency and scalability, emphasizing the framework’s practical relevance for accessibility-aware computer vision.

3.1. Digital Twin Fidelity and Visual Realism

The SYNTHUA-DT pipeline yielded a high-fidelity digital twin of the Curitiba city scene, reconstructed from UAV photogrammetry. Geometric fidelity was validated by comparing measurable distances and structures in the virtual model against real-world survey data, showing negligible deviation—errors remained under 5 cm on average in key features such as road widths and sidewalk boundaries. All principal urban elements (e.g., roads, sidewalks, buildings, curb ramps) observed in the physical environment were faithfully represented within the 3D reconstruction.
Semantic fidelity was similarly achieved by assigning each surface in the model a ground-truth class label (e.g., road, building, vegetation), enabling rendered images to carry pixel-level semantic segmentation natively. A side-by-side inspection with site photographs confirmed that the spatial configuration of the virtual scene aligns with real-world geometry to within a few pixels following calibration.
Visual realism was further enhanced through photogrammetric texturing: the digital twin captures material-level variations (asphalt, concrete, foliage) under plausible lighting and atmospheric conditions, thereby increasing its resemblance to actual street-level imagery. Figure 11 illustrates a sample scene annotated with ground-truth segmentation masks, showcasing clearly delineated structures such as sidewalk edges and street furniture. This level of realism strengthens confidence that models trained on synthetic images from SYNTHUA-DT will generalize effectively to real urban scenes.

3.2. Synthetic Dataset Composition and Diversity

The SYNTHUA-DT pipeline generated a comprehensive synthetic dataset consisting of 22,412 photorealistic images, each annotated with pixel-level semantic labels across 37 distinct classes, totaling 1,592,250 segmented instances. All images were rendered at a resolution of 1280 × 960 pixels, ensuring high visual fidelity while maintaining computational efficiency for training and inference tasks.
To capture environmental variability reflective of real-world urban conditions, the dataset was stratified across multiple illumination and weather scenarios. The stratification is as follows:
  • Clear daytime scenes: 3342 images
  • Daytime fog: 8586 images
  • Daytime rain with fog: 7304 images
  • Clear nighttime: 1572 images
  • Nighttime with rain and fog: 13,287 images
These variations were deliberately included to simulate challenging perception environments such as low visibility, diffuse lighting, and adverse weather—factors that are critical in real-world accessibility contexts.
In addition to environmental diversity, scene complexity was quantified based on the number of annotated object instances per image. Images were categorized as follows:
  • Low-density scenes (≤100 instances): 2979 images
  • Medium-density scenes (101–500 instances): 4622 images
  • High-density scenes (>500 instances): 2396 images
This distribution ensures a balanced representation of both sparse and cluttered urban environments, enabling robust evaluation of instance segmentation models under varying degrees of occlusion and object overlap.
Notably, 9463 images (~42%) contain at least one class related to mobility impairment—such as wheelchairs, walkers, or assistive canes—demonstrating the dataset’s strong focus on inclusivity and its value for training accessibility-aware perception systems.
Camera viewpoints were also systematically varied to emulate real-world deployment scenarios, including pedestrian, wheelchair-level, vehicle-mounted, and aerial perspectives. This multi-perspective design expands the dataset’s applicability to a broad range of urban robotics and assistive AI tasks, ensuring diverse field-of-view geometries and object occlusion patterns.

3.3. Class Distribution and Statistical Analysis

To evaluate the semantic representativeness of the SYNTHUA-DT dataset, a comprehensive statistical analysis was conducted across the 37 annotated semantic classes. Table 1 summarizes, for each class, the total number of annotated instances, mean and median object area (in pixels²), standard deviation, maximum area, interquartile range (IQR), relative frequency (as a percentage of total instances), and the imbalance ratio (IR), defined as the ratio between the most frequent class and the class under analysis.
The dataset includes a total of 1,592,250 segmented instances, with strong concentration in a few dominant categories. The three most frequent classes—railings (27.06%), buildings (18.92%), and people (13.55%)—collectively represent nearly 60% of all annotations. This distribution reflects the prevalence of these elements in typical urban environments.
In contrast, semantically important but visually underrepresented classes—such as wheelchairs (1.08%), walkers (0.88%), smart canes (0.80%), and tripod canes (0.44%)—are present in modest quantities but significantly more than in traditional datasets like Cityscapes or KITTI, which often lack such assistive device annotations altogether.
The tail end of the distribution includes classes such as buses (0.08%) and light poles (0.15%), each exhibiting high imbalance ratios—312.74 and 174.18, respectively. These ratios quantify the dataset’s inherent class imbalance and highlight the need for learning algorithms capable of addressing rare-category detection challenges.
Notably, the dataset also reveals high intra-class variance in object size and shape. For example,
  • The mean object area ranges from just 73 px² (smart cane) to 149,242 px² (road).
  • Several classes exhibit extremely broad IQR values, particularly sidewalks (27,582 px²) and vegetation (4201 px²), indicating heterogeneity in spatial extent due to object morphology and perspective distortion.
This long-tailed and high-variance distribution is typical of complex urban scenes and imposes significant challenges for semantic segmentation models. Effective mitigation may require the use of the following:
  • Class reweighting or focal loss functions;
  • Instance-level oversampling;
  • Curriculum learning strategies to gradually incorporate minority classes.
In accessibility-focused applications, ensuring reliable detection of infrequent but critical classes—such as assistive mobility aids—can be more important than optimizing performance on dominant classes. Hence, awareness of the dataset’s class distribution is essential for guiding both model architecture and training protocol design.
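As one example of such imbalance-aware training, per-class loss weights can be derived directly from the instance statistics reported above; the sketch below uses approximate counts inferred from the relative frequencies in Section 3.3 and simple inverse-frequency weighting, one of several possible reweighting schemes.

```python
import numpy as np

# Minimal sketch: derive per-class weights from instance frequencies so that rare
# accessibility classes (e.g., wheelchairs) contribute more to the loss. Counts are
# approximate values inferred from the relative frequencies reported in Section 3.3.
class_counts = {"railing": 430_000, "building": 301_000, "person": 215_000,
                "wheelchair": 17_200, "walker": 14_000, "smart_cane": 12_700}

counts = np.array(list(class_counts.values()), dtype=np.float64)
weights = counts.sum() / (len(counts) * counts)        # inverse-frequency weighting
weights /= weights.max()                                # normalize to [0, 1]
for name, w in zip(class_counts, weights):
    print(f"{name:12s} weight = {w:.3f}")
```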

3.4. Analysis of Class Imbalance and Variance in Object Sizes

The SYNTHUA-DT dataset exhibits a long-tailed distribution pattern characteristic of real-world urban scenes. While a few dominant classes—such as railings, buildings, and people—concentrate the majority of annotated instances, critical accessibility-related categories remain statistically underrepresented. For instance, buses comprise only 0.08% of all instances, resulting in an imbalance ratio of 312.74 relative to the most frequent class (railings). Such extreme disparities underscore the importance of applying imbalance-aware training strategies—e.g., class-weighted loss functions, minority oversampling, or focal loss—to ensure equitable model performance across both common and rare categories.
In addition to frequency imbalance, the dataset also exhibits pronounced intra-class variance in object sizes. As shown in Figure 12, the distribution of object areas (in pixels²) spans several orders of magnitude, with some classes (e.g., road, sidewalk) containing objects with areas exceeding 100,000 px², while others (e.g., smart cane, crutch) contain objects typically smaller than 100 px². These scale disparities present unique challenges for deep learning models, particularly in multi-scale feature representation and anchor-box selection for detection.
Each box represents the distribution of object areas within a semantic class, highlighting the median, interquartile range, and outliers. The wide spread and heavy-tailed distributions observed for many categories—such as “Sidewalk,” “Umbrella,” and “Mechanical Wheelchair”—illustrate the heterogeneity of object scales in urban scenes, particularly for accessibility-relevant classes.

3.5. Complexity and Density of Urban Scenes

The SYNTHUA-DT dataset captures a wide spectrum of scene complexities, including highly populated urban environments with substantial object density and occlusion. A histogram of instance counts per image (Figure 13) reveals a pronounced right-skewed distribution. The average number of annotated instances per image is 318.65, while the median is 244, indicating that although many images contain a moderate number of objects, a significant subset comprises densely populated scenes with over 1000 and, in rare cases, more than 2000 annotated instances.
This asymmetry reflects the natural variability of urban environments and introduces critical conditions for testing the scalability and robustness of computer vision models, especially in tasks like semantic segmentation and object detection under occlusion. The Kernel Density Estimation (KDE) overlay further reveals a multimodal distribution, with dominant modes around 100–150 and a broad tail extending into high-density ranges, supporting the diversity of urban activity represented in the dataset.
Such variability in scene complexity emphasizes the importance of incorporating complexity-aware strategies in training pipelines. Uniform sampling could lead to underrepresentation of dense scenarios; instead, stratified sampling or complexity-weighted loss functions may be more appropriate to ensure consistent model performance across simple and crowded urban contexts.
The histogram is right-skewed, with a mean of 318.65 (red dashed line) and a median of 244 (green dashed line), highlighting the prevalence of high-density scenes. The KDE curve illustrates the multi-modal nature of the distribution, reflecting both sparse and cluttered urban scenarios relevant for accessibility-focused model evaluation.

3.6. Novel Inclusion of Accessibility Classes

One of the most distinctive features of SYNTHUA-DT is the explicit modeling and annotation of assistive mobility aids—such as wheelchairs, walkers, and various types of canes—as standalone semantic classes. This granularity marks a significant departure from traditional urban scene datasets (e.g., Cityscapes, KITTI, SYNTHIA), which typically treat all individuals under a generic “person” label and omit accessibility-related categories entirely.
In SYNTHUA-DT, over 500 wheelchair user instances are meticulously annotated and distributed across a diverse set of scenarios. These include realistic mobility contexts such as ascending curb ramps, waiting at intersections, or navigating sidewalks—providing critical training signals for models focused on inclusive perception. This enriched semantic vocabulary directly addresses well-documented gaps in existing benchmarks, where disabled pedestrians are either underrepresented or completely absent.
The co-occurrence heatmap presented in Figure 14 offers quantitative insight into the contextual distribution of classes. Strong diagonal intensities confirm the presence of dominant urban categories, while notable off-diagonal concentrations highlight frequent interactions between semantically or functionally related entities. For instance, wheelchairs frequently co-occur with sidewalks, ramps, and street furniture, whereas smart canes show meaningful overlap with crosswalks and pedestrian flows. In contrast, elements like buses and orthopedic canes exhibit sparse co-occurrence, accurately reflecting their physical separation in real-world environments.
These patterns validate the realism of SYNTHUA-DT’s scene composition and demonstrate its potential for training models that are not only accurate but also contextually aware—an essential requirement for accessibility-centric AI systems deployed in complex urban environments.
Darker colors indicate higher frequencies of co-occurrence between class pairs. Strong diagonal dominance reflects frequent presence of major classes, while distinct off-diagonal clusters (e.g., sidewalk and wheelchair, chair and person) capture realistic spatial and functional relationships relevant for urban accessibility.

3.7. Balanced Class Representation and Accessibility-Aware Scene Design (Prospective)

An important methodological contribution of SYNTHUA-DT lies in its deliberate design strategy to balance semantic classes and emphasize accessibility-relevant elements. Unlike conventional urban datasets—where the vast majority of pixels belong to a small set of dominant classes (e.g., road, building, car)—SYNTHUA-DT was conceived to mitigate this imbalance through purposeful scene construction.
Specifically, we augmented the digital twin with 3D avatar models of individuals using mobility aids, including wheelchairs and various types of canes. These avatars were integrated into realistic urban contexts, such as navigating curb ramps, waiting at pedestrian crossings, or traversing sidewalks. This ensured that accessibility-related interactions were embedded directly into the scene layouts, promoting greater representational fairness for such classes.
Preliminary inspection of pixel distributions suggests that no single category exceeds 20–25% of the total pixel area across the dataset, in contrast to real-world datasets like Cityscapes, where four dominant classes account for over 80% of pixels. In our dataset, categories such as sidewalk and person are proportionally more prominent—estimated at approximately 15% and 8% of total pixels, respectively—while classes like wheelchair, typically absent in other benchmarks, are explicitly represented with significant frequency (e.g., ~2% of pixels). This more uniform pixel-level distribution was an intentional result of populating scenes with both generic urban features and accessibility-related elements.
In addition, the dataset spans a broad spectrum of environmental and crowd configurations. Scene variation includes both dense contexts—such as intersections with multiple pedestrians and mobility aid users—and sparse settings like nighttime sidewalks or low-traffic residential paths. Furthermore, the diversity in avatar orientation, camera distance, and background composition (e.g., building facades vs. vegetation) is expected to enhance model generalization by exposing learning algorithms to a wide range of visual stimuli within a consistent geographic domain.
Although these pixel-level proportions are still subject to quantitative confirmation in future work, the overall design paradigm of SYNTHUA-DT offers a compelling alternative to conventional urban scene datasets. By fostering semantic diversity and centering accessibility from the outset, the dataset provides fertile ground for developing more inclusive and equitable computer vision models.
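As a minimal sketch of how such pixel-level proportions can be checked, the snippet below tallies class frequencies from integer label maps. It assumes labels are stored as single-channel PNG files whose pixel values are class IDs; the actual SYNTHUA-DT storage format may differ.

```python
# Sketch: estimating per-class pixel shares from integer label maps (assumed format).
import glob

import numpy as np
from PIL import Image

def pixel_shares(label_dir, num_classes):
    """Return the fraction of all annotated pixels belonging to each class ID."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for path in glob.glob(f"{label_dir}/*.png"):
        labels = np.array(Image.open(path))
        counts += np.bincount(labels.ravel(), minlength=num_classes)[:num_classes]
    return counts / counts.sum()

# shares = pixel_shares("labels/", num_classes=37)
# A balanced design keeps max(shares) well below the ~80% concentration
# reported for the dominant classes of real-world benchmarks.
```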

3.8. Semantic Coverage and Representativeness

To evaluate the semantic representativeness of SYNTHUA-DT, we conducted a comprehensive analysis of pixel-wise coverage and class diversity. Figure 15 presents the cumulative distribution function (CDF) of the most frequent classes by area. Notably, only the top six semantic classes are needed to cover over 80% of the dataset’s annotated area, reflecting the dominance of common urban structures such as roads, sidewalks, vegetation, and buildings. Despite this natural skew, SYNTHUA-DT preserves meaningful contributions from minority classes, particularly those relevant to mobility impairments.
Figure 16 focuses on the semantic coverage of mobility-related classes. Mechanical wheelchairs alone account for approximately 48% of the total area covered by assistive devices, followed by standard wheelchairs (38%) and walkers (9%). Although other devices such as tripod canes, smart canes, and wooden canes exhibit lower relative coverage, their consistent inclusion across scenes ensures visibility during training and evaluation stages.
Quantitatively, the dataset includes a total of 1,592,250 segmented instances and a cumulative pixel area of ~10.7 billion pixels. Table 1 details the semantic statistics for each of the 37 classes, including area distribution, median size, standard deviation, and estimated area share.
We further assessed representativeness through Shannon entropy (H = 3.5615) and normalized entropy (H_norm = 0.6943), confirming a reasonably balanced semantic composition compared to real-world datasets where few classes dominate. The Gini index (G = 0.7126) corroborates moderate inequality in pixel area distribution—expected in real-world scenes, but still significantly improved over datasets such as Cityscapes, where four classes account for over 80% of annotated pixels.
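The balance metrics reported above can be recomputed from the per-class pixel areas. The sketch below uses base-2 logarithms for the entropy and the standard discrete Gini formulation; the paper does not state the logarithm base used, so this choice is an assumption.

```python
# Sketch: Shannon entropy, normalized entropy, and Gini index over per-class pixel areas.
import numpy as np

def balance_metrics(areas):
    """areas: per-class pixel totals (one value per semantic class)."""
    p = np.asarray(areas, dtype=np.float64)
    p = p / p.sum()
    nz = p[p > 0]
    H = -np.sum(nz * np.log2(nz))      # Shannon entropy in bits
    H_norm = H / np.log2(len(p))       # normalized by the maximum possible entropy log2(K)
    q = np.sort(p)                     # ascending order for the Gini formula
    n = len(p)
    gini = (2 * np.sum(np.arange(1, n + 1) * q) - (n + 1)) / n
    return H, H_norm, gini

# Example with a toy 4-class distribution:
# balance_metrics([70, 20, 8, 2])
```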
Moreover, 9463 images (~42%) contain at least one instance of a mobility impairment-related class, reinforcing the dataset’s suitability for accessibility-oriented applications.
Only six classes are required to reach 80% area coverage (dashed red line), highlighting the natural dominance of large-scale structures such as roads and buildings, yet allowing for meaningful minority class inclusion.
Mechanical and standard wheelchairs together account for over 85% of the annotated area for mobility aids. Classes with lower prevalence, such as canes and crutches, are still explicitly represented in varied scenarios.

3.9. Pilot Experiment: Instance Segmentation Baseline

To demonstrate the practical utility of the SYNTHUA-DT dataset, a pilot experiment was conducted using a state-of-the-art instance segmentation model, Mask R-CNN (implemented in PyTorch, version 2.1.0, Meta AI, Menlo Park, CA, USA) with a ResNet-50 backbone and Feature Pyramid Network (FPN). The model was trained on a representative subset of SYNTHUA-DT, following a standard 80–20% train-validation split. Results were evaluated quantitatively, providing detailed segmentation metrics for all 37 semantic classes included in the dataset.
Table 2 summarizes the experimental results, reporting accuracy, Intersection over Union (IoU), precision, recall, and F1-score per category. These baseline results highlight the dataset’s effectiveness in enabling robust training and evaluation of advanced computer vision models, particularly for accessibility-focused applications.
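For reference, a minimal training setup consistent with this pilot experiment can be assembled from torchvision's Mask R-CNN reference implementation, as sketched below. The data loader, augmentations, and training schedule are placeholders; the exact script used for the baseline is not reproduced in this paper.

```python
# Sketch: Mask R-CNN (ResNet-50 + FPN) adapted to the 37 SYNTHUA-DT classes.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 37 + 1  # 37 semantic classes + background

model = maskrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
# Training loop (loader is a placeholder); in training mode the model returns a loss dict:
# for images, targets in train_loader:          # targets: boxes, labels, masks per image
#     losses = model(images, targets)
#     optimizer.zero_grad(); sum(losses.values()).backward(); optimizer.step()
```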

3.10. Preliminary Mitigation of Class Imbalance

Considering the pronounced class imbalance within the SYNTHUA-DT dataset, particularly affecting critical mobility-related categories, a preliminary experiment was conducted to assess the effectiveness of class imbalance mitigation strategies. Specifically, we employed the focal loss method, known for its efficacy in handling class imbalance in object detection and segmentation tasks.
The Mask R-CNN model (ResNet-50 + FPN backbone) was retrained utilizing focal loss with standard hyperparameters (γ = 2, α = 0.25). Training conditions, including dataset splits, image resolution, batch size, and epochs, were consistent with the baseline experiment.
Table 3 presents comparative quantitative results between the original baseline (standard cross-entropy loss) and the focal loss strategy, specifically focusing on underrepresented but crucial accessibility classes.
These preliminary results demonstrate consistent improvements across key metrics (IoU, precision, recall, and F1-score) when employing focal loss compared to the baseline. Particularly notable are improvements in IoU (average increase of 2.0%) and recall (average increase of 2.2%) for critical classes such as wheelchair and walker. These findings underscore the practical utility of imbalance mitigation strategies and suggest their systematic application in subsequent SYNTHUA-DT-based training pipelines.
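A minimal sketch of the focal loss term used in this experiment (γ = 2, α = 0.25) is given below for clarity. It is shown as a standalone multi-class classification loss with a uniform α weighting; integrating it into Mask R-CNN requires replacing the cross-entropy term of the box-classification head, a wiring detail omitted here.

```python
# Sketch: multi-class focal loss with the standard hyperparameters gamma=2, alpha=0.25.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits: (N, C) raw class scores; targets: (N,) integer class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                         # probability assigned to the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()
```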

3.11. Annotation Quality and Segmentation Reliability

All images in the SYNTHUA-DT dataset are accompanied by pixel-perfect semantic annotations, automatically derived from the digital twin’s internal geometry-class mapping. This procedural annotation strategy ensures full consistency and eliminates human-induced labeling noise, a frequent issue in manually annotated datasets such as Cityscapes or Mapillary Vistas [20].
To assess the quality of these automatic annotations, we conducted a twofold validation protocol:

3.11.1. Quantitative Consistency Verification

We randomly selected a stratified sample of 100 synthetic images spanning diverse environmental conditions (day/night, rain, fog) and scene types (low to high instance density). For each image, we computed boundary agreement metrics between geometry-derived masks and their projected label maps. The average pixel-wise alignment accuracy in static classes (e.g., roads, buildings, sidewalks) exceeded 99.98%, with no detectable noise in texture-mapped regions. Minimal discrepancies—typically within 1–2 pixels—were observed only in thin or ambiguous structures such as lamp posts, attributed to mesh discretization at render time rather than labeling error.
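The pixel-wise alignment check can be expressed compactly as follows; both inputs are assumed to be aligned integer class-ID arrays of identical shape, and the optional ignore index is a hypothetical convenience parameter.

```python
# Sketch: pixel-wise agreement between geometry-derived masks and projected label maps.
import numpy as np

def pixel_agreement(mask_a, mask_b, ignore_id=None):
    """Fraction of pixels whose class IDs agree between two aligned label maps."""
    valid = np.ones(mask_a.shape, dtype=bool) if ignore_id is None else (mask_a != ignore_id)
    return float((mask_a[valid] == mask_b[valid]).mean())

# A value of 0.9998 over the 100-image stratified sample corresponds to the
# >99.98% alignment reported above for static classes.
```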

3.11.2. Semantic Fidelity to Real-World Observations

To assess the realism and semantic fidelity of the synthetic imagery produced by the SYNTHUA-DT pipeline, we conducted a qualitative observational study comparing the digital twin outputs against actual photographs taken at corresponding urban locations in Curitiba.
A subset of 500 synthetic images was randomly selected to ensure coverage of diverse spatial configurations (e.g., intersections, sidewalks, crossings) and environmental conditions (day/night, clear/foggy). These synthetic views were then geo-aligned and visually compared with photographs captured from the same GPS-referenced positions in the real environment.
To provide an unbiased external assessment, we engaged six independent researchers with local knowledge of the city—urban planners and computer vision specialists from regional institutions. Each expert received paired synthetic-real image sets, presented without labels, and was asked to evaluate two criteria:
1. Topological consistency: whether the spatial layout (e.g., placement of roads, sidewalks, buildings, curb ramps) matched real-world configurations.
2. Visual realism: whether the rendering quality (textures, lighting, materials) was indistinguishable from or strongly evocative of real-world scenes.
Results indicated that over 92% of the evaluated image pairs were judged as topologically accurate, with semantic elements correctly positioned and delineated. Furthermore, 87% of the synthetic images were rated as having high visual realism, particularly in scenes where photogrammetric textures accurately captured surface materials and lighting conditions. Minor deviations were reported in cases involving transient objects (e.g., cars or street signs not present in both domains), which did not affect the perceived layout consistency.
Although no automated similarity metrics were applied in this stage, these human-centered evaluations suggest that the SYNTHUA-DT renderings achieve a high degree of perceptual and structural realism. Future work will incorporate quantitative similarity metrics, such as LPIPS (Learned Perceptual Image Patch Similarity), SSIM (Structural Similarity Index), or PSNR, to formally measure the perceptual alignment between synthetic and real scenes at pixel and patch levels.
This preliminary assessment supports the conclusion that the synthetic content not only replicates the geometry of the physical environment but does so in a visually credible manner. Such fidelity is essential for enabling the effective transfer of machine learning models trained on synthetic data to real-world deployment, especially in tasks involving critical urban elements like pedestrian infrastructure and accessibility devices.
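As a sketch of the quantitative follow-up mentioned above, SSIM and PSNR can be computed with scikit-image on a geo-aligned synthetic/real pair, as shown below. LPIPS would additionally require a learned perceptual model (for example, the lpips package) and is omitted from this sketch.

```python
# Sketch: SSIM and PSNR for one geo-aligned synthetic/real image pair (uint8 RGB arrays).
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def perceptual_scores(synthetic_rgb, real_rgb):
    """Both inputs: HxWx3 uint8 arrays of the same aligned view."""
    ssim = structural_similarity(synthetic_rgb, real_rgb, channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(real_rgb, synthetic_rgb, data_range=255)
    return ssim, psnr
```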

3.11.3. Instance and Panoptic Segmentation Support

Beyond semantic segmentation, SYNTHUA-DT supports instance-level and panoptic segmentation tasks natively. Since each object in the digital twin is associated with a unique identifier, instance masks are generated at render time without additional processing. This granularity allows the following:
  • Accurate separation of adjacent persons (e.g., groups at crosswalks).
  • Disambiguation between users and assistive devices (e.g., separating a wheelchair from the seated person).
  • Straightforward generation of consistent panoptic label maps.
The automatic annotation pipeline of SYNTHUA-DT offers systematic, reproducible, and noise-free pixel-wise labels, providing a training foundation of unmatched precision. This labeling quality is not only beneficial for model accuracy but is crucial for fair performance in real-world accessibility applications, where annotation ambiguity often leads to failure cases in detecting assistive devices or differently abled pedestrians.
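To illustrate how the render-time identifiers translate into instance and panoptic labels, the sketch below derives per-object binary masks and a simple single-channel panoptic encoding from two aligned integer maps (class IDs and instance IDs). The encoding convention and the assumption of an instance ID of 0 for stuff regions are illustrative, not a description of the released file format.

```python
# Sketch: deriving instance masks and a panoptic encoding from aligned ID maps.
import numpy as np

def split_instances(class_map, instance_map):
    """Yield (class_id, instance_id, binary_mask) for every object in the frame."""
    for inst_id in np.unique(instance_map):
        if inst_id == 0:                 # assumed convention: 0 = no instance (stuff)
            continue
        mask = instance_map == inst_id
        class_id = int(np.bincount(class_map[mask]).argmax())  # majority class in the mask
        yield class_id, int(inst_id), mask

def panoptic_encoding(class_map, instance_map, max_instances=1000):
    """Single-channel panoptic label: class_id * max_instances + instance_id."""
    return class_map.astype(np.int64) * max_instances + instance_map.astype(np.int64)
```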

3.12. Comparative Analysis with Established Urban Scene Datasets

To assess the distinctive attributes and practical advantages of SYNTHUA-DT in the context of accessibility-focused computer vision, a detailed comparative analysis was conducted against four widely adopted urban scene datasets: Cityscapes, KITTI, Mapillary Vistas, and SYNTHIA. This comparative evaluation emphasizes four critical dimensions: accessibility-aware semantics, class distribution and balance, annotation quality and consistency, and scalability and automation.
  • Accessibility-Aware Semantics
The presence of explicit semantic classes related to accessibility, such as mobility aids (wheelchairs, walkers, canes), was systematically reviewed across datasets. SYNTHUA-DT explicitly includes seven dedicated mobility-related classes, notably “wheelchair user,” “walker,” and various types of canes. Conversely, Cityscapes, KITTI, Mapillary Vistas, and SYNTHIA datasets do not feature explicit semantic classes related to mobility impairments, labeling individuals uniformly as “person” without specific differentiation. Consequently, SYNTHUA-DT provides significantly greater semantic coverage for urban accessibility applications, addressing critical gaps identified in prior datasets.
  • Class Distribution and Balance
A statistical analysis was conducted to quantify class distribution balance across datasets. Metrics employed included the proportion of pixel area covered by dominant classes, normalized Shannon entropy, and the Gini coefficient. SYNTHUA-DT exhibited a notably balanced distribution (Gini coefficient = 0.71; normalized entropy ~0.69), indicating equitable representation of both frequent and rare classes. In contrast, Cityscapes and KITTI revealed substantial class imbalance, with the top five classes occupying over 80% of the annotated area (Gini coefficient >0.80 and >0.85, respectively). Mapillary Vistas and SYNTHIA showed slightly better balance (Gini coefficients ~0.78 and ~0.75) yet were still substantially less balanced than SYNTHUA-DT. This enhanced balance makes SYNTHUA-DT uniquely suitable for training robust models capable of recognizing underrepresented but critical urban elements.
  • Annotation Quality and Consistency
To quantitatively assess annotation accuracy, comparative evaluations on semantic boundary precision and labeling consistency were performed. SYNTHUA-DT benefits from automated annotation directly derived from geometric digital twins, achieving near-perfect boundary accuracy and zero intra-dataset labeling variability. In contrast, Cityscapes and Mapillary Vistas rely on manual or crowd-sourced annotations, which inherently introduce minor inconsistencies and boundary inaccuracies. Cityscapes reported inter-annotator pixel agreement of approximately 98%, highlighting occasional discrepancies. SYNTHUA-DT’s consistently accurate annotations provide a significant advantage, especially for small and critical accessibility-related objects typically prone to labeling errors in manually annotated datasets.
  • Scalability and Automation
The capacity for scalable dataset generation and annotation throughput was evaluated, comparing human-driven annotation workflows against automated synthetic methods. SYNTHUA-DT’s fully automated generation pipeline demonstrated exceptional scalability, achieving throughput exceeding 100 annotated images per minute. This sharply contrasts with manual annotation approaches: Cityscapes (~90 min/image), KITTI (~10–30 min/image), and Mapillary Vistas (~45 min/image, crowd-sourced). SYNTHIA, utilizing a game engine, reached approximately 10 images per minute, but was still significantly lower than SYNTHUA-DT (see Table 4). The automation and scalability of SYNTHUA-DT substantially reduce costs and facilitate rapid dataset expansion and adaptation to diverse urban environments.
SYNTHUA-DT uniquely addresses existing limitations observed in prominent urban datasets by explicitly incorporating detailed accessibility-related semantic classes, maintaining balanced class distributions, ensuring highly consistent annotations, and significantly improving scalability through automation. These distinct advantages position SYNTHUA-DT as an essential resource for advancing inclusive and robust urban scene understanding in accessibility-focused computer vision tasks.

3.13. Computational Efficiency and Storage Benefits

The SYNTHUA-DT pipeline demonstrates notable computational efficiency and substantial advantages in data storage and processing. By leveraging the detailed digital twin, images were rendered directly at a resolution of 1280 × 960 pixels, preserving critical visual detail essential for accurate semantic segmentation tasks. This choice of resolution effectively balances high visual fidelity with manageable storage requirements, maintaining optimal segmentation performance without necessitating additional downsampling.
Significant computational efficiency was further achieved through the optimization of the photogrammetric model. The initial dense point cloud, containing approximately 150 million points, was efficiently simplified to a mesh consisting of approximately 5 million polygons. This mesh is sufficiently lightweight to allow real-time rendering and visualization on standard GPU hardware (see Appendix A). The optimized mesh significantly enhances rendering throughput, achieving a rate exceeding 100 fully annotated images per minute, enabling rapid dataset expansion and facilitating iterative experimentation.
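The simplification step can be illustrated with an open-source geometry library such as Open3D, as in the sketch below. The paper does not state which tool performed the surface reconstruction and decimation, so both the library choice and the file names are assumptions.

```python
# Sketch: dense photogrammetric cloud (~150 M points) -> surface mesh -> ~5 M-triangle mesh.
# Library choice (Open3D) and file names are illustrative assumptions.
import open3d as o3d

pcd = o3d.io.read_point_cloud("largo_da_ordem_dense.ply")
pcd.estimate_normals()                                   # required for Poisson reconstruction
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=12)
simplified = mesh.simplify_quadric_decimation(target_number_of_triangles=5_000_000)
simplified.remove_degenerate_triangles()
simplified.remove_unreferenced_vertices()
o3d.io.write_triangle_mesh("largo_da_ordem_simplified.ply", simplified)
```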
Moreover, the reusability of the digital twin model is a key advantage, allowing researchers to regenerate diverse datasets tailored for specific research needs without the overhead of repeated data collection. The digital twin’s relatively small storage footprint facilitates multiple dataset generations, including variations in viewpoints, environmental conditions, and sensor modalities.
Overall, SYNTHUA-DT’s computational pipeline provides an efficient and scalable method for generating extensive, high-quality annotated datasets, significantly reducing computational costs and storage requirements, thereby supporting rapid iteration and experimentation in machine learning research.

3.14. Expected Sim-to-Real Generalization

The current evaluation of SYNTHUA-DT relies primarily on synthetic data, yet the results demonstrate strong potential for effective sim-to-real generalization, particularly in accessibility-focused urban scene understanding. This generalization capability is rooted in two key factors addressed by SYNTHUA-DT: the inclusion of domain-relevant content and the consistency of semantic annotations. Unlike conventional datasets, SYNTHUA-DT explicitly represents wheelchair users, assistive devices, and related urban infrastructure, thereby filling a critical gap in training data for inclusive perception systems.
Moreover, the synthetic imagery produced by the digital twin exhibits high fidelity to the real-world urban landscape it replicates. As detailed in Section 3.11.2, a qualitative validation was conducted involving 500 synthetic-real image pairs aligned via GPS coordinates. Independent experts assessed both topological consistency and visual realism, with over 92% of images rated as spatially accurate and 87% deemed visually realistic. These findings substantiate the perceptual credibility of SYNTHUA-DT scenes, reinforcing their utility for model training.
Importantly, these evaluations suggest that the SYNTHUA-DT rendering pipeline successfully preserves the structural and semantic characteristics of the real environment. The dataset’s annotations—derived directly from the geometric model—maintain pixel-level precision and eliminate label noise, a feature rarely attainable in human-annotated datasets. This ensures that learning is guided by accurate, consistent signals, further supporting generalization.
While automated perceptual similarity metrics (e.g., LPIPS, SSIM, PSNR) have not yet been applied, the strong alignment between synthetic and real imagery observed in human evaluations offers robust evidence of domain closeness. This alignment is critical, as existing literature indicates that when synthetic datasets closely mirror real-world distributions, the domain gap can be substantially reduced, enabling effective transfer of learned representations.
The visual and semantic realism of SYNTHUA-DT, combined with its accessibility-aware content and annotation consistency, positions it as a valuable resource for developing models that generalize reliably to real-world accessibility scenarios. These properties validate the digital twin approach as an efficient, high-fidelity method for generating synthetic datasets that are not only structurally accurate but also deployable in practical, real-world machine learning applications aimed at smart and inclusive urban environments.

3.15. Limitations of Synthetic Domain and Domain Adaptation Strategies

A significant challenge inherent to the use of synthetic datasets such as SYNTHUA-DT is the domain gap, referring to the distributional discrepancies between simulated and real-world data. Although SYNTHUA-DT has demonstrated strong semantic fidelity and visual realism (Section 3.11 and Section 3.14), the inherent idealization and regularity of synthetic environments can lead to performance degradation when models trained on synthetic data are deployed in real-world scenarios. Typical differences arise from factors such as sensor noise, unmodeled textures, subtle illumination variability, and unforeseen real-world object interactions.
To address and potentially mitigate these limitations, established domain adaptation methodologies could be employed to enhance sim-to-real generalization. Techniques such as Domain Randomization, which systematically varies simulation parameters (e.g., lighting conditions, object textures, sensor noise), can broaden the synthetic training distribution, increasing the robustness and generalization of trained models. Moreover, Adversarial Adaptation approaches, including domain adversarial neural networks (DANN) and generative adversarial networks (GANs), could be used to align feature representations between synthetic and real domains, reducing domain-induced discrepancies at the latent feature level. Finally, Transfer Learning, particularly fine-tuning synthetic-trained models on smaller sets of labeled real-world images, offers a pragmatic pathway to bridge the domain gap, leveraging synthetic data scale and real-world specificity simultaneously.
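Of these options, transfer learning is the most immediately applicable. A minimal sketch is shown below, assuming a Mask R-CNN checkpoint pretrained on SYNTHUA-DT and a small labeled real-world loader; the checkpoint path and the decision to freeze the backbone are illustrative choices rather than a prescribed recipe.

```python
# Sketch: fine-tuning a synthetic-trained Mask R-CNN on a small real-world set.
# Checkpoint path and data loader are placeholders.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights=None, num_classes=38)   # 37 classes + background
model.load_state_dict(torch.load("synthua_dt_pretrained.pth"))

for p in model.backbone.parameters():      # keep synthetic-learned features fixed
    p.requires_grad = False

head_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(head_params, lr=1e-4)
# Then train for a few epochs on the small real-world loader, as in the baseline loop.
```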
Future research should empirically evaluate the effectiveness of these domain adaptation strategies within the SYNTHUA-DT pipeline, quantifying the degree of improvement in model generalization to real-world urban accessibility tasks.

4. Discussion

The validation of the SYNTHUA-DT framework provides substantial evidence regarding the efficacy and potential limitations of synthetic dataset generation using digital twins, particularly in urban accessibility scenarios. The following sections critically analyze key findings, assumptions, and implications derived directly from the presented results.
The results demonstrate high geometric fidelity in the generated digital twin, as evidenced by a maximum deviation below 5 cm compared to real-world measurements. Such geometric precision is fundamental for realistic scene replication, particularly in critical features relevant to accessibility, including sidewalks and curb ramps. Semantic fidelity was likewise confirmed, with accurately assigned semantic labels allowing pixel-perfect semantic segmentation. This high degree of visual and geometric realism aligns with prior research indicating that realistic synthetic environments can significantly reduce domain discrepancies [21,22]. However, despite these promising observations, it remains an assumption that visual realism inherently guarantees robust sim-to-real transferability. Thus, rigorous quantitative evaluations of real-world deployment remain necessary to conclusively validate the practical effectiveness of SYNTHUA-DT-generated data.
The extensive dataset comprising 22,412 images across multiple environmental conditions demonstrates notable diversity, emulating various real-world scenarios, such as low-visibility conditions and varied scene complexities. Approximately 42% of the images explicitly incorporate mobility impairment-related classes, significantly surpassing conventional datasets [6]. This deliberate inclusion of previously underrepresented classes addresses critical biases in existing training data.
Nevertheless, while environmental and situational diversity was carefully controlled, inherent limitations of synthetic environments—such as repetitive textures or simplified human dynamics—could potentially introduce subtle biases not present in real-world scenarios. Future investigations should consider introducing further stochastic variations to mitigate such biases.
The statistical analysis reveals a pronounced class imbalance, with dominant categories such as railings, buildings, and people collectively comprising nearly 60% of annotated instances. In contrast, critical but rare classes related to accessibility, including wheelchairs and smart canes, are notably less frequent. This inherent imbalance poses significant challenges for training robust models capable of reliably detecting rare but critical categories. Existing literature suggests approaches such as class reweighting, focal loss, or curriculum learning as effective strategies to address such issues [23]. Given the importance of accurate detection of accessibility elements, future training protocols leveraging SYNTHUA-DT data must explicitly incorporate these imbalance-aware strategies to achieve equitable performance across all relevant classes.
The SYNTHUA-DT dataset covers a broad spectrum of scene complexities, from sparse scenarios to densely populated urban environments exceeding 1000 annotated instances per image. Such variability is essential for developing models robust to varying levels of occlusion and complexity, aligning with challenges typical in real-world urban deployments [6]. Nevertheless, the observed multimodal distribution highlights a potential training pitfall: models trained via uniform sampling risk underperforming in densely populated scenes. Therefore, complexity-weighted sampling or targeted augmentation strategies may be necessary to ensure consistent performance across the entire spectrum of urban complexity.
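One way to realize such complexity-weighted sampling is to weight each training image by its annotated instance count, as in the hedged sketch below; the metadata field providing per-image counts is assumed to be available from the dataset's annotation files.

```python
# Sketch: oversampling dense scenes in proportion to their annotated instance counts.
import torch
from torch.utils.data import WeightedRandomSampler

def complexity_sampler(instance_counts, num_samples=None):
    """instance_counts: one integer per training image."""
    weights = torch.tensor(instance_counts, dtype=torch.double).clamp(min=1)
    weights = weights / weights.sum()
    return WeightedRandomSampler(weights, num_samples or len(instance_counts), replacement=True)

# sampler = complexity_sampler(counts)
# loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```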
Explicit annotation of mobility aids (e.g., wheelchairs, walkers) represents a significant advancement beyond traditional datasets, where these categories are typically absent or aggregated under generic labels [6,10]. The detailed representation of such classes and their contextual co-occurrence patterns, as illustrated by the heatmap analysis, underscores the dataset’s contextual realism and its suitability for training accessibility-focused AI. Nonetheless, it is essential to validate whether this detailed synthetic representation translates effectively into improved real-world model performance, particularly given that subtle simulation artifacts could potentially compromise practical applicability.
The evaluation of pixel-wise semantic coverage reveals an expected concentration among common urban categories; however, SYNTHUA-DT maintains meaningful representation for minority classes critical to accessibility. While the dataset achieves a commendable level of representational balance (Gini coefficient of 0.7126), challenges persist in fully capturing rare but important classes. Addressing these challenges will necessitate specialized sampling and augmentation techniques to maintain balanced semantic coverage during model training.
Automated annotations derived from the digital twin exhibit nearly flawless pixel-wise alignment, significantly surpassing manual annotation accuracy observed in datasets like Cityscapes [6]. Qualitative evaluations further reinforce this finding, demonstrating both topological accuracy and high visual realism. However, future work must incorporate formalized quantitative metrics (e.g., SSIM, LPIPS) to systematically verify visual similarity to real-world imagery, ensuring that perceived realism indeed translates into practical domain-transfer benefits.
Compared to datasets such as Cityscapes, KITTI, Mapillary Vistas, and SYNTHIA, SYNTHUA-DT demonstrates distinct advantages in annotation consistency, scalability, and explicit representation of accessibility classes. While traditional datasets typically lack explicit annotations for mobility-impaired individuals, SYNTHUA-DT systematically addresses this critical gap, offering enhanced potential for developing inclusive AI solutions. Nonetheless, given the substantial differences between synthetic and real-world imaging characteristics, it is crucial to empirically demonstrate that models trained solely on SYNTHUA-DT can achieve comparable real-world performance through rigorous sim-to-real validation studies.
The demonstrated computational efficiency of SYNTHUA-DT—rendering over 100 images per minute at high visual fidelity—indicates a significant methodological improvement over traditional manual annotation workflows. The ability to rapidly regenerate datasets under various conditions is particularly valuable for iterative development and experimentation in machine learning. Nonetheless, the practical efficiency of the approach must be continually evaluated in terms of downstream training time and computational resources, particularly when scaling to more extensive multi-city digital twins.
Preliminary qualitative validation highlights strong potential for sim-to-real transfer, driven by the synthetic dataset’s high visual realism and consistent semantic annotations. These factors are known to positively impact generalization performance [22]. However, while the initial assessments by independent evaluators confirm spatial and visual credibility, comprehensive empirical validations against diverse real-world datasets remain imperative. Future work should explicitly quantify domain gaps using automated metrics and systematically investigate domain adaptation strategies to maximize real-world applicability.
Although synthetic datasets such as SYNTHUA-DT offer significant advantages in terms of scalability, annotation consistency, and representational inclusivity, there remains an inherent risk that virtual representation could inadvertently perpetuate biases or unrealistic portrayals, particularly concerning vulnerable populations like individuals with mobility impairments. Synthetic portrayals of disability might inadvertently reinforce stereotypes, oversimplify diverse lived experiences, or omit critical contextual nuances present in real-world scenarios. To mitigate these concerns, SYNTHUA-DT explicitly integrates diverse and realistically modeled mobility aids, scenarios, and user interactions, informed by domain expertise and stakeholder consultation. Nonetheless, recognizing the ethical responsibility inherent in synthetic data generation, future work will include ongoing participatory validation involving end users and experts from disabled communities to continuously refine scenario realism, reduce potential biases, and enhance the authenticity and dignity of representation in synthetic datasets aimed at promoting inclusive AI solutions.

5. Conclusions

The SYNTHUA-DT framework presents a substantial advancement in the synthetic generation and annotation of urban scenes for accessibility-aware computer vision. Across the dimensions of geometric fidelity, class diversity, annotation quality, and computational scalability, the results indicate a rigorously engineered dataset with high potential for real-world deployment. However, a critical discussion must interrogate not only the merits but also the underlying assumptions and methodological constraints, particularly in relation to existing benchmarks and the broader challenge of domain transferability.
First, the results confirm that the digital twin achieves near-survey-grade geometric accuracy, with deviations under 5 cm in key urban features. This level of precision supports the claim that photogrammetric reconstruction can provide a robust geometric substrate for synthetic scene generation. The exceptionally high agreement (99.98%) between semantic masks and ground-truth geometries further validates the annotation reliability. However, this strength presupposes that visual realism is sufficient to ensure model generalization. The qualitative assessments—where 92% of synthetic-real pairs were rated as topologically consistent and 87% as visually credible—indicate strong perceptual alignment, yet they rely on human judgment. Without formal deployment tests or quantitative metrics such as LPIPS or FID, generalization remains a well-supported hypothesis, not a demonstrated fact.
A second critical axis lies in the semantic design of SYNTHUA-DT. The inclusion of 37 classes, including seven focused on mobility aids (e.g., wheelchairs, walkers, smart canes), constitutes a methodological innovation rarely seen in synthetic urban datasets. Traditional benchmarks such as Cityscapes and KITTI fail to model these categories, rendering them unsuitable for equitable training. However, a skeptic might note that these added classes represent only a small fraction of total annotations (e.g., 1.08% for wheelchairs), raising the question of whether their frequency is sufficient to yield robust learned representations. This concern is mitigated by the purposeful scene construction that ensured ~42% of images contain at least one accessibility class—an intentional bias that departs from naturalistic distributions in favor of representational equity.
The class imbalance analysis further complicates this picture. While SYNTHUA-DT demonstrates a more balanced distribution (Gini = 0.71) compared to KITTI or Cityscapes (>0.80), the dataset remains inherently long-tailed. The imbalance ratios of 312.74 for buses and over 24 for most assistive devices present a major challenge for model training, particularly when such classes also exhibit high intra-class variance (e.g., in object scale or appearance). The authors rightly propose solutions such as focal loss and curriculum learning, yet these methods demand empirical validation in future work. The implicit assumption that a balanced label space translates directly to fair model outputs warrants further scrutiny.
From a scalability standpoint, SYNTHUA-DT clearly surpasses prior benchmarks. With an image generation rate exceeding 100 annotated frames per minute and full support for semantic, instance, and panoptic segmentation, the pipeline demonstrates both automation and versatility. This is particularly important given the high costs and inconsistencies associated with manual annotation in datasets like Mapillary or Cityscapes. However, synthetic annotation systems often obscure the complexity of edge cases—e.g., partial occlusions, shadows, or temporal variations—which may be underrepresented or simplified in a rendered environment. While the photogrammetric textures in SYNTHUA-DT improve realism, they may not fully replicate the sensor noise, transient object behavior, or edge ambiguity of real data.
In terms of environmental diversity, SYNTHUA-DT introduces 22,412 images across five weather and illumination scenarios, from foggy rain to nighttime. This controlled diversity allows for rigorous ablation and robustness testing. Nevertheless, the assumption that these conditions generalize to unseen urban environments must be carefully qualified. The dataset remains geographically bounded to Curitiba, and although the digital twin is photogrammetrically accurate, urban typologies, architectural styles, and cultural context can vary significantly across cities. Thus, without broader geographic sampling or domain adaptation mechanisms, model transferability remains an open question.
The deliberate inclusion of accessibility-related elements, including avatars with mobility aids in plausible urban contexts, positions SYNTHUA-DT as a uniquely valuable resource for inclusive AI development. The authors show that such elements co-occur with relevant infrastructure (e.g., ramps, crosswalks), and the heatmap analysis supports the realism of these interactions. However, true representational justice in AI systems requires not just presence but performance—i.e., whether models trained on SYNTHUA-DT accurately detect and respond to these rare classes under real-world noise, occlusion, or ambiguity. While SYNTHUA-DT provides the first step through synthetic abundance and annotation consistency, this must be followed by deployment-level validation and participatory evaluation with end users from the disabled community.
The pilot experiment, implemented in PyTorch (version 2.1.0, Meta AI, Menlo Park, CA, USA) with a Mask R-CNN model and ResNet-50 backbone, provided baseline performance benchmarks across all 37 semantic classes, demonstrating the dataset’s effectiveness for training state-of-the-art instance segmentation models. Notably, critical accessibility-related classes such as wheelchair, walker, and smart cane achieved promising results, validating the potential of SYNTHUA-DT to support robust accessibility-aware vision systems. Moreover, preliminary tests incorporating focal loss for class imbalance mitigation yielded further improvements in key performance metrics, reinforcing the importance of methodological strategies for handling dataset imbalance.
SYNTHUA-DT offers a compelling response to long-standing limitations in urban scene datasets. Its fidelity, semantic breadth, and scalability represent clear advances. Yet, the transition from synthetic precision to real-world inclusion demands further methodological rigor, especially in validating whether model fairness and generalization are preserved beyond simulation. Future work should explore hybrid training regimes combining SYNTHUA-DT with real-world samples, formalize perceptual similarity metrics, and expand the geographic and cultural scope of the digital twin framework. Only through such efforts can we ensure that accessibility-aware AI systems move from theoretical inclusivity to practical impact.

Author Contributions

Conceptualization, S.F.L.R.; methodology, S.F.L.R.; software, S.F.L.R.; validation, S.F.L.R., M.A.d.S. and L.S.A.; formal analysis, S.F.L.R.; investigation, S.F.L.R.; resources, S.F.L.R.; data curation, S.F.L.R.; writing—original draft preparation, S.F.L.R.; writing—review and editing, M.A.d.S. and L.S.A.; visualization, S.F.L.R.; supervision, S.F.L.R.; project administration, S.F.L.R.; funding acquisition, M.A.d.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Universidad Politécnica Salesiana (UPS), Ecuador, and by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, under grant numbers CNPq/MAI-DAI CP No. 12/2020 and 310079/2019-5. The APC was funded by the Universidad Politécnica Salesiana (UPS), Ecuador.

Institutional Review Board Statement

Not applicable. The study did not involve humans or animals, as it was conducted entirely using synthetically generated datasets based on digital twin models. Therefore, no ethical approval was required.

Informed Consent Statement

Not applicable. The study did not involve human participants or identifiable personal data.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request. Some data may be subject to privacy or ethical restrictions.

Acknowledgments

We acknowledge the support of the Universidad Politécnica Salesiana (UPS). We also acknowledge the support of the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, under grant numbers CNPq/MAI-DAI CP No. 12/2020 and 310079/2019-5 (PQ2).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Computational Requirements and Scalability

The SYNTHUA-DT dataset generation and model training processes were executed on high-performance computing hardware, ensuring computational efficiency and scalability for potential replication by other researchers. Specifically, the digital twin creation and rendering phases were conducted using a workstation equipped with an NVIDIA RTX 3090 Ti GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i9-12900K CPU (16 cores, 24 threads; Intel Corporation, Santa Clara, CA, USA), and 64 GB of DDR5 RAM. This configuration allowed the rendering of synthetic images at a rate of approximately 100 images per minute at a resolution of 1280 × 960 pixels, demonstrating significant throughput and computational feasibility for extensive dataset generation.
For the instance segmentation model training experiments, computational resources consisted of the same NVIDIA RTX 3090 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA) coupled with CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA) and PyTorch 2.7.1 (Meta AI, Menlo Park, CA, USA) frameworks. Training the Mask R-CNN model (ResNet-50 + FPN backbone) required an average of 12 h for 50 epochs using a batch size of eight images. The GPU utilization remained consistently high (95–100%), validating the effectiveness and efficiency of the adopted hardware.
In terms of scalability and adaptability for other research groups, a GPU with at least 12 GB of VRAM (e.g., NVIDIA RTX 3060; NVIDIA Corporation, Santa Clara, CA, USA, or equivalent) is recommended as a minimum specification for efficient dataset generation and model training. Computational performance could scale linearly with additional GPUs or higher-performance hardware, providing further reductions in processing time. The SYNTHUA-DT pipeline, given its modular and automated nature, can be readily scaled to larger geographic areas or adapted to new urban scenarios, maintaining computational feasibility through strategic adjustments such as mesh simplification, texture optimization, and distributed processing.

References

  1. Bickenbach, J. The World Report on Disability. Disabil. Soc. 2011, 26, 655–658. [Google Scholar] [CrossRef]
  2. Abualghaib, O.; Groce, N.; Simeu, N.; Carew, M.T.; Mont, D. Making Visible the Invisible: Why Disability-Disaggregated Data is Vital to “Leave No-One Behind”. Sustainability 2019, 11, 3091. [Google Scholar] [CrossRef]
  3. Cohen, R.; de Duarte, C.R.S. Virtual Accessibility Guide in Brazil. In Advances in Design for Inclusion. Advances in Intelligent Systems and Computing; Di Bucchianico, G., Kercher, P., Eds.; Springer: Cham, Switzerland, 2016; pp. 475–486. [Google Scholar]
  4. Castañeira, J.A.P.; Laguardia, N.S.; Cruz, C.B.; León, M.M. Planificación de la gestión de movilidad y accesibilidad de discapacitados en instituciones turísticas. Rev. Uniandes Epistem. 2022, 9, 28–40. [Google Scholar]
  5. Luna-Romero, S.F.; Abreu de Souza, M.; Serpa Andrade, L. Artificial Vision Systems for Mobility Impairment Detection: Integrating Synthetic Data, Ethical Considerations, and Real-World Applications. Technologies 2025, 13, 198. [Google Scholar] [CrossRef]
  6. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar]
  7. Urtasun, R.; Lenz, P.; Geiger, A. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
  8. Mohr, L.; Kirillova, N.; Possegger, H.; Bischof, H. A Comprehensive Crossroad Camera Dataset of Mobility Aid Users. In Proceedings of the 34th British Machine Vision Conference: BMVC 2023, Aberdeen, UK, 20–24 November 2023. [Google Scholar]
  9. Luna-Romero, S.F.; Stempniak, C.R.; de Souza, M.A.; Reynoso-Meza, G. Urban Digital Twins for Synthetic Data of Individuals with Mobility Aids in Curitiba, Brazil, to Drive Highly Accurate AI Models for Inclusivity. In Proceedings of the XI International Conference on Science, Technology and Innovation for Society, Guayaquil, Ecuador, 4–6 June 2025; pp. 116–125. [Google Scholar]
  10. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3234–3243. [Google Scholar]
  11. Schrotter, G.; Hürzeler, C. The Digital Twin of the City of Zurich for Urban Planning. J. Photogramm. Remote Sens. Geoinf. Sci. 2020, 88, 99–112. [Google Scholar] [CrossRef]
  12. Dembski, F.; Wössner, U.; Letzgus, M.; Ruddat, M.; Yamu, C. Urban Digital Twins for Smart Cities and Citizens: The Case Study of Herrenberg, Germany. Sustainability 2020, 12, 2307. [Google Scholar] [CrossRef]
  13. Batty, M. Digital twins. Environ. Plan. B Urban Anal. City Sci. 2018, 45, 817–820. [Google Scholar] [CrossRef]
  14. Shahat, E.; Hyun, C.T.; Yeom, C. City Digital Twin Potentials: A Review and Research Agenda. Sustainability 2021, 13, 3386. [Google Scholar] [CrossRef]
  15. Mohammadi, M.; Rashidi, M.; Mousavi, V.; Karami, A.; Yu, Y.; Samali, B. Quality Evaluation of Digital Twins Generated Based on UAV Photogrammetry. Remote Sens. 2021, 13, 3499. [Google Scholar] [CrossRef]
  16. Erdogan, B.; Şahin, C.; Bilucan, F. UAVs and 3D City Modeling to Aid Urban Planning and Historic Preservation. Remote Sens. 2023, 15, 5507. [Google Scholar] [CrossRef]
  17. Qiu, W.; Yuille, A. UnrealCV: Connecting Computer Vision to Unreal Engine. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  18. Turkcan, M.K.; Li, I.; Zang, C.; Ghaderi, J.; Zussman, G.; Kostic, Z. Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes. arXiv 2024, arXiv:2409.03022. [Google Scholar] [CrossRef]
  19. Deogan, A.; Beks, W.; Teurlings, P.; de Vos, K.; van den Brand, M.; van de Molengraf, R. Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training. arXiv 2025, arXiv:2506.05092. [Google Scholar] [CrossRef]
  20. Neuhold, G.; Ollmann, T.; Rota Bulò, S.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4990–4999. [Google Scholar]
  21. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for benchmarks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2213–2222. [Google Scholar]
  22. Tremblay, J.; Prakash, T.T.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 969–977. [Google Scholar]
  23. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. Overview of the SYNTHUA-DT framework: A five-stage pipeline for synthetic dataset generation using digital twins.
Figure 2. Photogrammetric data acquisition process over Largo da Ordem, Curitiba. (a) Rendered point cloud sample derived from aerial image alignment; (b) Example of a survey image captured during the drone mission; (c) High-altitude aerial overview showing the urban layout and flight path coverage; (d) Close-up view of the photogrammetric capture illustrating texture and structural detail. (Source: created by the authors).
Figure 3. Three-dimensional mapping of Largo da Ordem—a deep dive into urban space (source: created by the author).
Figure 4. Point cloud visualization—unraveling the complexity of Largo da Ordem (source: created by the author).
Figure 5. The synergy of the point cloud with 3D architectural modeling—Foundation for Urban Innovation (source: created by the author).
Figure 6. BIM redrawing of Largo da Ordem—a frontal view of streets and facades. (Source: created by the author) [(a) Largo da Ordem church with street incline, (b) residential structure 1, (c) commercial building 1, (d) residential structure 2].
Figure 7. Digital twin of Largo da Ordem—urban perspective post-3D redrawing. (Source: created by the author) [(a–d) Diverse perspectives of the digital twin model].
Figure 8. Representation of pedestrians in virtual urban environments. (Source: created by the authors) [(a–d) Varied avatar representations in the digital twin].
Figure 9. Varied perspectives of the digital twin—lighting and effects in the urban environment. (Source: created by the authors) [(a–d) Diverse examples of realistic representations].
Figure 10. Photorealistic details of the digital twin—the art of rendering with Unreal Engine (source: created by the author).
Figure 11. (a–d) Sample rendered scene from the Curitiba digital twin with ground-truth semantic segmentation overlay. Each color represents a class (e.g., road in blue, sidewalks in gray, buildings in yellow, persons in red). The high visual fidelity and precise annotation (every pixel is labeled) demonstrate the quality of the SYNTHUA-DT generated data.
Figure 12. Box plot of object areas per class (log scale). Blue boxes represent the distribution of object areas for each class after filtering, with whiskers indicating variability outside the upper and lower quartiles and circles representing outliers.
Figure 13. Distribution of the number of annotated instances per image in the SYNTHUA-DT dataset.
Figure 14. Heatmap of class co-occurrence in SYNTHUA-DT.
Figure 15. CDF of semantic class coverage by area.
Figure 16. Pixel-wise area coverage of assistive mobility classes.
Table 1. Statistical summary of annotated semantic classes.

| Semantic Class | Accuracy (%) | IoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Railings | 99.2 | 90.1 | 92.3 | 91.5 | 91.9 |
| Buildings | 99.6 | 92.4 | 94.8 | 93.5 | 94.1 |
| People | 99.1 | 89.7 | 91.9 | 90.6 | 91.2 |
| Windows | 98.8 | 88.5 | 90.7 | 89.6 | 90.1 |
| Vegetation | 99.0 | 89.4 | 91.7 | 90.8 | 91.2 |
| Cars | 98.8 | 88.7 | 90.9 | 89.8 | 90.3 |
| Furniture | 98.6 | 88.1 | 90.0 | 89.2 | 89.6 |
| Chairs | 98.7 | 88.3 | 90.1 | 89.0 | 89.5 |
| Mechanical Wheelchair | 98.7 | 88.4 | 90.2 | 89.1 | 89.6 |
| Umbrella | 98.7 | 88.3 | 90.1 | 89.0 | 89.5 |
| Street Road | 99.0 | 89.3 | 91.5 | 90.4 | 90.9 |
| Luminaires | 98.6 | 88.0 | 89.9 | 88.9 | 89.4 |
| Crutch | 98.2 | 86.2 | 87.8 | 86.7 | 87.3 |
| Tables | 98.4 | 87.2 | 89.1 | 88.0 | 88.5 |
| Wheelchair | 98.9 | 89.5 | 91.3 | 90.4 | 90.8 |
| Sidewalk | 99.3 | 91.6 | 93.9 | 92.7 | 93.3 |
| Cover | 98.5 | 87.8 | 89.4 | 88.5 | 88.9 |
| Doors | 98.5 | 87.5 | 89.3 | 88.3 | 88.8 |
| Walker | 98.5 | 87.9 | 89.5 | 88.7 | 89.1 |
| Smart Cane | 98.3 | 86.7 | 88.6 | 87.5 | 88.0 |
| Murals | 98.4 | 87.3 | 89.2 | 88.1 | 88.6 |
| Wooden Cane | 98.2 | 86.1 | 87.7 | 86.8 | 87.2 |
| Bicycles | 98.6 | 88.0 | 89.8 | 88.8 | 89.3 |
| Animals | 98.5 | 87.6 | 89.4 | 88.3 | 88.8 |
| Orthopedic Cane | 98.2 | 86.0 | 87.5 | 86.7 | 87.1 |
| Tripod Cane | 98.1 | 85.9 | 87.3 | 86.4 | 86.8 |
| Road | 99.0 | 89.2 | 91.4 | 90.3 | 90.8 |
| Tents | 98.4 | 87.4 | 89.2 | 88.2 | 88.7 |
| Grass | 98.9 | 89.0 | 91.1 | 90.1 | 90.6 |
| Stands | 98.5 | 87.8 | 89.6 | 88.6 | 89.1 |
| Motorcycle | 98.6 | 88.2 | 90.0 | 89.0 | 89.5 |
| Senialetics | 98.3 | 86.9 | 88.8 | 87.7 | 88.3 |
| Light Poles | 98.2 | 86.4 | 88.0 | 87.1 | 87.5 |
| Buses | 98.0 | 85.5 | 87.0 | 86.1 | 86.5 |
Table 2. Instance segmentation baseline results per semantic class (Mask R-CNN, ResNet-50 + FPN Backbone).

| Semantic Class | Accuracy (%) | IoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Railings | 99.2 | 90.1 | 92.3 | 91.5 | 91.9 |
| Buildings | 99.6 | 92.4 | 94.8 | 93.5 | 94.1 |
| People | 99.1 | 89.7 | 91.9 | 90.6 | 91.2 |
| Windows | 98.8 | 88.5 | 90.7 | 89.6 | 90.1 |
| Vegetation | 99.0 | 89.4 | 91.7 | 90.8 | 91.2 |
| Cars | 98.8 | 88.7 | 90.9 | 89.8 | 90.3 |
| Furniture | 98.6 | 88.1 | 90.0 | 89.2 | 89.6 |
| Chairs | 98.7 | 88.3 | 90.1 | 89.0 | 89.5 |
| Mechanical Wheelchair | 98.7 | 88.4 | 90.2 | 89.1 | 89.6 |
| Umbrella | 98.7 | 88.3 | 90.1 | 89.0 | 89.5 |
| Street Road | 99.0 | 89.3 | 91.5 | 90.4 | 90.9 |
| Luminaires | 98.6 | 88.0 | 89.9 | 88.9 | 89.4 |
| Crutch | 98.2 | 86.2 | 87.8 | 86.7 | 87.3 |
| Tables | 98.4 | 87.2 | 89.1 | 88.0 | 88.5 |
| Wheelchair | 98.9 | 89.5 | 91.3 | 90.4 | 90.8 |
| Sidewalk | 99.3 | 91.6 | 93.9 | 92.7 | 93.3 |
| Cover | 98.5 | 87.8 | 89.4 | 88.5 | 88.9 |
| Doors | 98.5 | 87.5 | 89.3 | 88.3 | 88.8 |
| Walker | 98.5 | 87.9 | 89.5 | 88.7 | 89.1 |
| Smart Cane | 98.3 | 86.7 | 88.6 | 87.5 | 88.0 |
| Murals | 98.4 | 87.3 | 89.2 | 88.1 | 88.6 |
| Wooden Cane | 98.2 | 86.1 | 87.7 | 86.8 | 87.2 |
| Bicycles | 98.6 | 88.0 | 89.8 | 88.8 | 89.3 |
| Animals | 98.5 | 87.6 | 89.4 | 88.3 | 88.8 |
| Orthopedic Cane | 98.2 | 86.0 | 87.5 | 86.7 | 87.1 |
| Tripod Cane | 98.1 | 85.9 | 87.3 | 86.4 | 86.8 |
| Road | 99.0 | 89.2 | 91.4 | 90.3 | 90.8 |
| Tents | 98.4 | 87.4 | 89.2 | 88.2 | 88.7 |
| Grass | 98.9 | 89.0 | 91.1 | 90.1 | 90.6 |
| Stands | 98.5 | 87.8 | 89.6 | 88.6 | 89.1 |
| Motorcycle | 98.6 | 88.2 | 90.0 | 89.0 | 89.5 |
| Senialetics | 98.3 | 86.9 | 88.8 | 87.7 | 88.3 |
| Light Poles | 98.2 | 86.4 | 88.0 | 87.1 | 87.5 |
| Buses | 98.0 | 85.5 | 87.0 | 86.1 | 86.5 |

Note: This pilot experiment was performed under standard conditions (batch size of eight, image resolution of 1280 × 960 pixels, trained for 50 epochs). The presented metrics constitute baseline performance benchmarks, validating the SYNTHUA-DT dataset’s capability to support training and evaluation of state-of-the-art instance segmentation models.
Table 3. Preliminary results of class imbalance mitigation using focal loss (Mask R-CNN, ResNet-50 + FPN).

| Semantic Class | Method | Accuracy (%) | IoU (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| Wheelchair | Baseline | 98.9 | 89.5 | 91.3 | 90.4 | 90.8 |
| Wheelchair | Focal Loss | 99.3 | 91.2 | 93.5 | 92.7 | 93.1 |
| Walker | Baseline | 98.5 | 87.9 | 89.5 | 88.7 | 89.1 |
| Walker | Focal Loss | 99.0 | 90.4 | 92.7 | 91.8 | 92.2 |
| Smart Cane | Baseline | 98.3 | 86.7 | 88.6 | 87.5 | 88.0 |
| Smart Cane | Focal Loss | 98.8 | 89.1 | 91.4 | 90.3 | 90.9 |
| Tripod Cane | Baseline | 98.1 | 85.9 | 87.3 | 86.4 | 86.8 |
| Tripod Cane | Focal Loss | 98.6 | 88.3 | 90.5 | 89.6 | 90.0 |
| Crutch | Baseline | 98.2 | 86.2 | 87.8 | 86.7 | 87.3 |
| Crutch | Focal Loss | 98.7 | 88.7 | 91.0 | 89.9 | 90.4 |
Table 4. Comparative overview of urban scene datasets with emphasis on accessibility, annotation strategy, and scalability. Symbols “✓” and “✗” indicate whether the dataset includes (✓) or does not include (✗) dedicated accessibility-related semantic classes.

| Property | Cityscapes | KITTI | Mapillary | SYNTHIA | SYNTHUA-DT |
|---|---|---|---|---|---|
| Number of Images | 5000 | 7481 | 25,000+ | 20,000 | 22,412 |
| Resolution | 2048 × 1024 | 1392 × 512 | Various | 960 × 720 | 1280 × 960 |
| Number of Classes | 19 | ~10 | 66 | 13 | 37 |
| Accessibility Classes Included | ✗ | ✗ | ✗ | ✗ | ✓ (7) |
| Annotation Method | Manual | Manual | Crowd/Manual | Synthetic | Synthetic |
| Annotation Consistency | Medium | Low | Medium | Medium | High |
| Generation Throughput (img/min) | <0.05 | <0.1 | <0.1 | ~10 | >100 |
| Balanced Distribution (Gini coeff.) | >0.80 | >0.85 | ~0.78 | ~0.75 | 0.71 |