Article

Reconstruction of Cultural Heritage in Virtual Space Following Disasters

1 School of Arts, Tiangong University, Tianjin 300387, China
2 School of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
3 Graduate School of Chinese National Academy of Arts, Beijing 100027, China
4 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR 999077, China
5 School of Architecture and Urban Planning, Guangdong University of Technology, Guangzhou 510090, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(12), 2040; https://doi.org/10.3390/buildings15122040
Submission received: 15 May 2025 / Revised: 5 June 2025 / Accepted: 9 June 2025 / Published: 13 June 2025
(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

While previous studies have explored the use of digital technologies in cultural heritage site reconstruction, limited attention has been given to systems that simultaneously support cultural restoration and psychological healing. This study investigates how multimodal, deep learning–assisted digital technologies can aid displaced populations by enabling both digital reconstruction and trauma relief within virtual environments. A demonstrative virtual reconstruction workflow was developed using the Great Mosque of Aleppo in Damascus as a case study. High-precision three-dimensional models were generated using Neural Radiance Fields, while Stable Diffusion was applied for texture style transfer and localized structural refinement. To enhance immersion, Vector Quantized Variational Autoencoder–based audio reconstruction was used to embed personalized ambient soundscapes into the virtual space. To evaluate the system’s effectiveness, interviews, tests, and surveys were conducted with 20 refugees aged 18–50 years, using the Impact of Event Scale-Revised and the System Usability Scale as assessment tools. The results showed that the proposed approach improved the quality of digital heritage reconstruction and contributed to psychological well-being, offering a novel framework for integrating cultural memory and emotional support in post-disaster contexts. This research provides theoretical and practical insights for future efforts in combining cultural preservation and psychosocial recovery.

1. Introduction

Cultural heritage encompasses tangible assets such as historic architecture, archaeological sites, and curated artifacts. This heritage reflects the historical memory and cultural identity of specific societies and holds irreplaceable historical, educational, and cultural value [1]. While reconstruction of cultural heritage following disasters is a task of restoring physical space, it is also a crucial process for rebuilding social cohesion and cultural identity. Threats such as natural disasters, armed conflict, and environmental degradation continue to endanger heritage resources. Traditional physical preservation methods are often costly and are constrained by environmental conditions and geographic limitations, making large-scale implementation challenging.
In response, a new paradigm of digital technologies has emerged, particularly those involving three-dimensional (3D) scanning, photogrammetry, and artificial intelligence, which are increasingly applied in the virtual reconstruction and digital safeguarding of post-disaster heritage. These technologies not only restore the physical form of cultural assets but also allow the reconstruction process to take into account the needs of affected communities, serving as vital tools in the broader recovery of social and cultural structures.
Emerging technologies have created new means of post-disaster urban reconstruction. Studies have shown that digital technologies can help restore damaged heritage sites to their original spatial conditions. For example, following the destruction of the ancient city of Palmyra, international organizations and academic institutions used 3D scanning and modeling technologies to document architectural details, providing digital references for future restoration work [2]. Similarly, Belal demonstrated the effectiveness of Geographic Information Systems in protecting damaged cultural heritage, offering solutions for post-disaster reconstruction and contributing to the development of smart cities [3]. Later, Naser emphasized the role of digital reconstruction in restoring cultural heritage and collective memory, using 3D modeling to virtually recreate the Al-Nuri Mosque, and proposed new methods for its restoration [4]. Meanwhile, Nazarenko and Martyn explored the application of geospatial technologies in post-disaster reconstruction in Ukraine, improving the efficiency and precision of rebuilding efforts while enhancing resource optimization and transparency [5].
Although 3D digital reconstruction has become a well-developed approach in the field of post-disaster urban recovery, existing research reveals a persistent lack of integration between digital reconstruction and the reconfiguration of humanistic values. Three main limitations are evident: (1) traditional reconstruction models emphasize the restoration of physical space, leading to stagnation in theoretical innovation and practical paradigms; (2) the ability to capture high-resolution texture details remains limited, particularly during texture alignment and 3D synthesis, where visual realism is often compromised by issues such as color inconsistency and lighting imbalance [6]; (3) current virtual displays typically focus on static and structured content, lacking multimodal digital support for personalized and immersive user experiences.
To address the aforementioned challenges, this study adopts a multimodal virtual reconstruction approach that integrates surface texture style transfer and localized regeneration using Stable Diffusion, 3D reconstruction based on Neural Radiance Fields (NeRF), and audio restoration through a Vector Quantized Variational Autoencoder (VQ-VAE). Centered on the reconstruction of the Great Mosque of Aleppo in Damascus, Syria, the virtual space is developed to form a multi-layered support system—restoring spatial cognition through visual reconstruction and evoking emotional memory through sound design. This approach offers a novel theoretical and practical multimodal framework for post-disaster cultural heritage recovery.
The contributions of this study are as follows:
(1)
A joint optimization framework combining NeRF-based 3D reconstruction and Stable Diffusion is constructed, enabling semantic-guided style transfer and fine-grained local detail refinement for architectural models.
(2)
VQ-VAE–based audio restoration is employed to embed synthesized ambient sound into the 3D virtual environment, offering users an immersive and personalized experience.
(3)
The system architecture supports dynamic parametric adjustment and incorporates a human–machine collaborative optimization framework based on semantic consistency constraints, providing an interactive design paradigm for the digital restoration of historical architecture.
(4)
Feedback collected through user testing and surveys among refugee participants confirms the effectiveness of the proposed approach in cultural memory translation and humanistic value reconstruction within a hybrid virtual–physical environment, offering a replicable reference for related research.

2. Theoretical Background

Artificial intelligence, particularly in the form of deep learning algorithms, has been increasingly applied to the intelligent restoration and reconstruction of cultural heritage. Concurrently, the rapid development and adoption of virtual reality (VR), augmented reality (AR), and mixed reality (MR) technologies have expanded the applicability of virtual restoration beyond two-dimensional (2D) contexts to include the reconstruction of 3D heritage artifacts, particularly architecture [7,8]. The literature related to the generation of virtual audiovisual models from diverse data sources and scene reconstruction models is reviewed in this section to demonstrate the value and feasibility of applying multimodal technologies in the digital preservation of cultural heritage.

2.1. Restoration of Cultural Heritage

Cultural heritage is not merely a collection of tangible artifacts; it is the foundation of collective memory and social identity. When heritage sites are destroyed by natural disasters or armed conflicts, the damage extends beyond physical loss, disrupting community cohesion and cultural continuity. In his theory of “lieux de mémoire,” Nora noted that the destruction of heritage spaces signifies the collapse of symbolic memory sites, weakening emotional bonds between individuals and history [9].
Digital technologies are emerging as vital tools for the activation and sustainment of cultural memory. Through the e-Heritage project, Ikeuchi demonstrated how laser scanning and 3D modeling can enable spatially precise restoration of cultural assets [10]. Neglia et al. suggested that in the context of war or disaster, digital archives function as “deferred containers” of memory, supporting future restoration and continuity of identity [11]. Digital reconstruction is not just the replication of geometry and texture; it is also a reconstruction of cultural meaning and memory systems. Reading observed that memory spaces in the digital media age are increasingly characterized by dematerialization and globalization, turning digital environments into new cultural domains [12]. From a social science perspective, Pereira et al. reviewed post-disaster practices of collective mourning and rebuilding, emphasizing that digital restoration helps bridge cultural discontinuities [13].
Immersive technologies, such as VR and AR, introduce new sensory dimensions to digital heritage experiences. Georgieva and Georgiev found that the reconstruction of personal memory in virtual settings can alleviate trauma and can support identity recovery [14]. Lin et al. showed that digital roaming systems enhance cultural engagement and evoke collective emotional resonance in post-disaster contexts, reinforcing a sense of belonging [15].
Based on these theoretical and practical perspectives, this study explores how advanced multimodal digital technologies can facilitate the precise reconstruction of cultural heritage while enhancing emotional and mnemonic restoration through sensory experience and interaction design, ultimately contributing to post-crisis socio-cultural recovery.

2.2. Application of Digital Technologies in Cultural Heritage

The digital reconstruction of cultural heritage has become critical in the field of heritage preservation. Through interdisciplinary collaboration, particularly in the areas of algorithmic automation and photorealistic texture reconstruction, ongoing efforts seek to overcome current technical limitations. Building on the extensibility of neural networks, three dimensions are explored to establish a new paradigm for cultural heritage digitization. Accordingly, the potential and pathways of texture style transfer and 3D reconstruction, localized 3D modification, and auditory restoration technologies in enhancing the expressive power of digital cultural heritage are examined in the following sections.

2.2.1. Texture Style Transfer and 3D Reconstruction

The digital reconstruction of cultural heritage requires balancing realistic visual effects with historical authenticity. The integration of image texture style transfer and 3D reconstruction techniques has become a core approach to restoring the appearance of cultural heritage sites [10]. Various methods have been applied in different cultural contexts. Liao et al. proposed a deep feature-driven cross-image semantic matching method, offering a controllable and consistent path for unifying surface textures across visual domains [16], while Kerdreux et al. enhanced interactivity in style transfer via a predictive evaluation mechanism, enabling real-time adjustment and procedural artistic evolution [17]. Building on this, An et al. introduced ArtFlow, an invertible neural flow-based method that improves content fidelity and reversibility during style transfer [18]. Chen et al. developed TANGO, a text-prompted 3D style transfer method that operates without specific dataset training, to support practical 3D transfer applications [19]. Liu et al. proposed StyleGaussian, integrating 2D style features into 3D Gaussian Splatting representations and enabling image-driven 3D style transfer with real-time rendering [20].
In 3D reconstruction, Amico and Felicetti introduced an ontology-based modeling framework for standardized cultural heritage reconstruction [21]. Intelligent real-scene fusion technologies now combine terrestrial laser scanning (TLS), unmanned aerial vehicle (UAV) photogrammetry, and MR to overcome physical constraints and enable high-precision digital restoration [22]. Altaweel et al. applied CycleGAN to 3D heritage reconstruction, resulting in a low-cost, skill-independent solution for cultural preservation and dissemination [23].
Despite these advances, several technical challenges remain. First, achieving multi-view style consistency and coherent texture transfer remains difficult, especially with frame-by-frame image methods [16]. Second, balancing geometric fidelity with artistic style remains problematic, as over-stylization can distort geometry, while under-stylization weakens style expression [17,18]. In light of this, it is essential to develop a model that balances content integrity and stylization intensity to ensure high-fidelity visual restoration while preserving cultural and semantic meaning in practical applications [24].

2.2.2. Detailed Modification of 3D Reconstruction

In virtual cultural reconstruction, local detail editing or style adjustment of 3D models is often necessary to restore intricate structures or meet artistic requirements. For example, carvings or mural fragments in historical buildings may remain coarse or incomplete even after full-model reconstruction, requiring localized refinement. In response, local 3D editing has become a key technique for enhancing detail fidelity and spatial adaptability in cultural heritage restoration.
Several methods have been proposed to achieve precise local editing. Fang et al. presented a digital restoration framework combining 3D point clouds with GANs, using real scan data to reconstruct missing parts [25]. With the development of NeRFs, Song et al. introduced Blending-NeRF, which fuses base and editable NeRFs to support the text-driven editing of specific regions in pretrained models. Text prompts allow targeted adjustments to color and density while preserving original appearance, improving flexibility in 3D content creation and enabling restorers to add architectural elements “as needed” [26].
Point cloud mapping has also been applied to address local damage and stylistic variation. In the Elmina Castle project, Ye et al. mapped color normals onto low-poly mesh textures, recovering detailed surfaces in damaged zones [27]. For local style transfer, follow-up studies introduced multi-view attention to merge structural and style image features, enhancing realism and artistic quality [28]. Locally Stylized NeRF further used dual-branch hash encoding to separate geometry and appearance, applying segmentation and matching loss to precisely transfer styles while maintaining cross-view consistency [29].
Despite these advances, challenges remain. Edited areas must blend seamlessly with unedited regions to avoid visual discontinuities, especially in multi-view rendering. It is still technically demanding to limit edits to defined areas while preserving structural boundaries. Moreover, manual region selection and iterative refinement increase interaction complexity and computational costs.
To address these issues, this project integrates ControlNet-based diffusion model control to support local editing. Zhang et al. showed that adding structural edges or depth as constraints enables spatial control in diffusion models [30]. In heritage reconstruction, contours, sketches, or original features can guide localized generation by combining NeRF and diffusion models, ensuring alignment with the original structure and stylistic consistency and providing an effective toolkit for fine-grained digital restoration.

2.2.3. Audio Reproduction

For damaged cultural spaces, visual reconstruction is only part of the restoration process. Sound, as a key medium for memory and emotion, plays a vital role in virtual heritage restoration. Research shows that sensory combinations affect emotion in different ways and that multisensory integration enhances empathy and immersion [31]. However, most current post-disaster cultural recovery projects remain in the early stages regarding sound, relying on traditional methods such as soundscape recording, sound field measurement, or audio archive playback.
Mediastika et al. built a local historical soundscape archive in Indonesia’s Kotagede region, using interviews and field recordings to preserve disappearing traditional sounds [1]. Gozzi and Grazioli developed an immersive acoustic system for the Maggio Musicale Theatre in Florence, using field impulse responses and binaural rendering to recreate listening experiences from various seats [32]. For large-scale sites, acoustic simulations support sound reconstruction. Kritikos et al. designed a mobile-based AR experience in Chania, Crete, where visitors trigger audio along heritage routes [33]. Zou et al. proposed a virtual music museum using digital twins, combining archived audio with 3D instrument models for interactive playback [34].
Reproducing realistic, layered soundscapes in virtual environments requires multi-scale audio modeling. Chen et al. introduced DPTNet, a dual-path Transformer architecture that improves speech modeling through enhanced context awareness and feature separation [35]. Yang et al. proposed Genhancer, a discrete token-based encoder-decoder that maintains audio accuracy and feature integrity, to address hallucination and feature loss [36].
These studies reveal several limitations in auditory reconstruction for virtual heritage: overreliance on traditional capture and modeling methods, lacking flexibility and fine control [1,32]; static archival playback with minimal environmental or multisensory integration [33]; and absence of multi-scale modeling, limiting the rendering of both macro structures and fine details, resulting in flat, uniform soundscapes [34].
To address these gaps, VQ-VAE-based multi-scale audio modeling with HiFi-GAN is employed, reconstructing immersive historical soundscapes ranging from ambient spatial sounds to ceremonial music and visualizing results to enhance sensory memory.

2.3. Research Motivation and Objective

Drawing on current research in post-disaster reconstruction under emerging technologies, the concept of “media memory” enabled by digital tools offers several possible innovative pathways for the digital reconstruction of cultural heritage. Although significant progress has been made in this area, several challenges still remain. First, existing techniques often emphasize geometric precision while neglecting the reproduction of historical textures and artistic styles. Second, technical barriers persist in the integration of heterogeneous data sources, such as laser scanning and photogrammetry, limiting overall model quality. Furthermore, the lack of standardized workflows and protocols hampers the large-scale and efficient implementation of digital heritage preservation efforts [23].
In addition, virtual reconstruction from the three examined perspectives faces further challenges, including conflicts between preserving geometric detail and fusing artistic styles, stylistic inconsistency and abrupt transitions in local regions under varying viewpoints, and the absence of multi-scale modeling strategies. Hence, this study focuses on audiovisual virtual technologies, aiming to digitally restore post-disaster cultural heritage by integrating these three approaches and reconstructing sensory memory. Furthermore, this study seeks to transcend the boundaries of conventional realism in 3D generation, address the limitations of existing tools, and expand the possibilities for deep learning and visual applications in virtual heritage.

3. Methodology

A multimodal virtual reconstruction framework that uses AI technologies including 3D reconstruction, texture style transfer, and audio enhancement to restore the historical and cultural heritage of the Great Mosque of Aleppo in Damascus, Syria, is presented in this section. Figure 1 shows the framework, whose four stages consist of data preprocessing, texture transfer, structural refinement, and auditory reconstruction.
Relevant architectural and cultural data were collected and preprocessed. Three-dimensional models were selected based on accuracy, completeness, historical authenticity, and compatibility with immersive visualization requirements. Diffusion models were then applied for surface stylization and local editing. Key technologies include NeRF-based 3D reconstruction for achieving optimal accuracy and visual realism, Stable Diffusion for refined texture and structural enhancements, and Jukebox VQ-VAE for synthesizing authentic audio features. The integrated system was implemented in Unity, facilitating immersive and collaborative user interactions.

3.1. Texture Style Transfer and 3D Reconstruction

Surface texture style transfer was conducted on the original buildings to effectively capture and emphasize informational features intrinsic to historical architecture. Specifically, Stable Diffusion was integrated into the 3D Reconstruction pipeline to perform surface structure style transfer, blending material textures extracted from original architectural images with selected reference images featuring distinctive religious motifs. This method employs an iterative noise-reduction process guided by textual prompts describing the desired religious texture styles, ensuring an accurate representation of target aesthetics while preserving inherent material properties. Subsequently, NeRF technology, a neural volumetric rendering method, was integrated for 3D reconstruction, accurately capturing intricate geometric structures, depth details, and realistic lighting effects through volumetric density estimation. This approach facilitated the creation of high-fidelity virtual models, enabling precise restoration and visualization of damaged historical architectures within virtual environments.

3.1.1. Texture Style Transfer

The Stable Diffusion model is used to perform texture style transfer on architectural surfaces [37]. Recognized for strong capabilities in image generation and style adaptation, Stable Diffusion produces detailed, realistic textures. When integrated with 3D reconstructions generated by NeRF, it enables the spatial restoration of architectural forms while enhancing visual realism and historical fidelity through surface-level modifications to preserve the original structural contours.
As a latent diffusion-based generative model, Stable Diffusion progressively denoises encoded variables to generate high-fidelity outputs. In image-to-image translation tasks, it allows stylistic adaptation based on reference images, preserving geometric structure while re-rendering texture and material qualities. This approach enables the generation of images that maintain original architectural outlines while incorporating stylistic features [38], supporting dataset creation and providing input for 3D reconstruction.
To generate textured images of architectural entities, Stable Diffusion is used to produce outputs from specified viewing angles. These images serve as the dataset for NeRF training. The Wan2.1 model with a Low-Rank Adaptation (LoRA) module is used to automate dataset generation from static images [39,40]. This diffusion transformer-based video model generates structurally coherent dynamic videos from images. With LoRA guidance, it produces short sequences, from which multiple frames are extracted as training samples [41].
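To make this texture transfer step concrete, the following is a minimal sketch using the open-source diffusers library for an image-to-image pass over a single architectural photograph; the checkpoint identifier, file names, prompt wording, and strength value are illustrative assumptions rather than the exact settings used in this study.

```python
# Minimal image-to-image texture style transfer sketch (assumed settings).
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a pretrained Stable Diffusion checkpoint (model id is an assumption).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Source photograph of the architectural surface to be restyled.
source = Image.open("minaret_view.png").convert("RGB").resize((512, 512))

# Textual prompts describing the target religious texture style.
prompt = "Umayyad-style carved stone facade, arabesque relief, Kufic calligraphy"
negative_prompt = "distorted geometry, blurry, modern materials"

# Lower strength preserves the original contours; higher strength stylizes more.
stylized = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=source,
    strength=0.5,
    guidance_scale=7.0,
).images[0]

stylized.save("minaret_view_stylized.png")
```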

3.1.2. Three-Dimensional Reconstruction

In previous studies, models generated using 3D-CNNs and GANs often lacked sufficient resolution and structural accuracy for high-fidelity architectural reconstructions [42,43]. To address this issue, NeRF [44] is adopted, which leverages implicit volumetric rendering to reconstruct fine geometric textures and lighting details, meeting the demands of photorealistic 3D modeling in virtual environments.
Training NeRF requires precise camera poses and sparse point clouds. COLMAP is used to extract keypoints from multi-view images and estimate camera positions via feature matching and reprojection error minimization [45]. COLMAP also generates a sparse point cloud through multi-view stereo methods [46], providing structural priors for NeRF and ensuring accurate geometric constraints during volumetric rendering.
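As a minimal sketch of this preprocessing stage, the snippet below drives COLMAP's standard sparse reconstruction from Python; the directory layout is an assumption, and the three commands correspond to the generic feature extraction, matching, and mapping workflow rather than any project-specific configuration.

```python
# Sparse reconstruction with COLMAP's command-line interface (paths are assumptions).
import subprocess

IMAGES = "data/minaret_frames"     # multi-view images extracted from generated videos
DB = "data/colmap/database.db"     # COLMAP feature database
SPARSE = "data/colmap/sparse"      # output folder for camera poses and point cloud

# 1. Detect and describe keypoints in every image.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES], check=True)

# 2. Match features across image pairs.
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", DB], check=True)

# 3. Incremental mapping: estimate camera poses and a sparse point cloud
#    that later serve as geometric priors for NeRF training.
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES,
                "--output_path", SPARSE], check=True)
```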
Positional encoding, which maps raw 3D coordinates into a higher-dimensional space, is applied to further improve geometric and illumination fidelity. This transformation encodes multi-frequency spatial information, enhancing the model’s ability to learn complex lighting and shape variations.
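The encoding follows the standard NeRF formulation, mapping each coordinate p to (sin(2^0·πp), cos(2^0·πp), ..., sin(2^{L-1}·πp), cos(2^{L-1}·πp)). A minimal PyTorch sketch is shown below; the number of frequency bands is a tunable assumption.

```python
# NeRF-style positional encoding of 3D coordinates (PyTorch).
import math
import torch

def positional_encoding(coords: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """Map raw coordinates in [-1, 1] to a multi-frequency representation.

    coords: tensor of shape (..., 3); returns shape (..., 3 * 2 * num_bands).
    """
    frequencies = 2.0 ** torch.arange(num_bands, dtype=coords.dtype, device=coords.device)
    # Shape (..., 3, num_bands): each coordinate scaled by every frequency.
    scaled = coords.unsqueeze(-1) * frequencies * math.pi
    encoded = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return encoded.flatten(start_dim=-2)

# Example: encode a batch of 1024 sampled 3D points.
points = torch.rand(1024, 3) * 2.0 - 1.0
features = positional_encoding(points)   # shape (1024, 60) for num_bands = 10
```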

3.2. Detailed Modification of 3D Reconstruction

To enhance stylistic variation and artistic richness in 3D reconstructed models, inspired by the two-stage generative framework proposed in GNeRF [47], Stable Diffusion is combined with NeRF to build a multi-stage pipeline for localized 3D structure modification. Its five core steps are multi-view image extraction, image-to-image style transfer, semantic control via textual prompts, structural constraints using ControlNet, and NeRF retraining with updated datasets.
Initially, keyframes from the NeRF-generated model are extracted, focusing on geometric and textural features of structural components. Then these views are stylized using a pretrained Stable Diffusion model, guided by textual and negative prompts.
To ensure structural integrity during stylization, ControlNet is used for boundary control, with HED and Canny edge detectors as preprocessors to extract structural contours from reference images [30,48,49]. This enables localized style transfer that respects the outlines of the original architecture.
The resulting stylized images are used to build datasets via the Wan2.1 video model, with LoRA modules providing prompt-driven control. These are used to retrain NeRF, resulting in updated 3D models incorporating local modifications. This process improves the expressiveness and cultural richness of architectural components at a local scale. The method demonstrates strong adaptability and structural stability in the application of cross-cultural styles to virtual reconstructions.
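A compact sketch of the edge-constrained restyling step is given below, combining a Canny-conditioned ControlNet with a Stable Diffusion image-to-image pipeline from the diffusers library; the checkpoints, thresholds, prompts, and file names are assumptions chosen for illustration, not the exact configuration used in this project.

```python
# Edge-constrained local restyling with ControlNet (checkpoints and settings are assumptions).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Canny-conditioned ControlNet attached to a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Extract structural contours from a NeRF-rendered view of the component.
view = Image.open("dome_view.png").convert("RGB").resize((512, 512))
edges = cv2.Canny(np.array(view), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Prompts steer the decorative style; the control image preserves the outline.
result = pipe(
    prompt="lotus dome, arabesque carvings, weathered stone",
    negative_prompt="warped structure, missing edges",
    image=view,
    control_image=control_image,
    strength=0.5,
    guidance_scale=7.0,
    num_inference_steps=20,
).images[0]
result.save("dome_view_restyled.png")
```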

3.3. Audio Reproduction

Feature nodes are extracted from diverse audio recordings containing fragments of collective memory to enhance realism in a virtual utopian environment. These are fused to generate new audio that reconstructs real-world ambiance. As a time-series signal, audio data display distinct multi-scale characteristics [50]; hence, multi-scale feature extraction is used to effectively capture key information across time scales, supporting efficient element retrieval. The multi-scale VQ-VAE includes multiple encoder, quantizer, and decoder modules [51,52], each representing latent vectors at different scales. The top level captures global features such as melody and harmony, the mid-level encodes details such as timbre and rhythm, and the bottom level preserves fine-grained features such as waveform and spectrogram texture. When environmental audio is input, it is first processed by the VQ-VAE encoders and mapped into latent space representations, i.e., [51]
$z_e^l = \mathrm{Encoder}_l(x), \quad l \in \{1, 2, 3\},$ (1)
where $l$ indicates the level of the latent space (e.g., $l = 1$ for the top level), $x$ is the input audio signal, and $z_e^l$ is the latent space representation output by the encoder at level $l$.
The continuous latent representations are then discretized to generate discrete latent codes. Ultimately, through the combined contributions of all latent spaces, the decoder fuses the multi-level latent representations to efficiently and precisely extract features from global structure to local details, outputting the final audio representation $\hat{x}$ that contains comprehensive audio feature information, as illustrated in Equation (2) [51]:
$\hat{x} = \mathrm{Decoder}(z_q^1, z_q^2, z_q^3),$ (2)
where $z_q^1$, $z_q^2$, and $z_q^3$ denote discrete latent representations at different scales, and $\hat{x}$ is the reconstructed output audio that contains comprehensive audio features.
However, during the weighted audio fusion process, the high-frequency components often suffer from detail loss as a result of the distribution of frequency band weights. The absence of high-frequency details may weaken the expression of key emotional information in the audio, particularly the clarity of the timbre and the authenticity of environmental sounds, which are critical for enhancing the emotional resonance and immersive experience of the listener [53]. Therefore, HiFi-GAN technology is further introduced into the post-fusion audio to enhance audio quality and compensate for the loss of detail incurred during the fusion process [54].
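To illustrate the multi-scale encode, quantize, and decode structure described by Equations (1) and (2), the sketch below implements a toy three-level VQ-VAE for one-dimensional audio in PyTorch; the layer sizes, codebook dimensions, and downsampling rates are illustrative assumptions and do not reproduce the Jukebox configuration used in this study.

```python
# Minimal multi-scale VQ-VAE sketch for 1D audio (illustrative sizes only).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e: torch.Tensor) -> torch.Tensor:
        # z_e: (batch, dim, time) -> flatten to (batch * time, dim)
        flat = z_e.permute(0, 2, 1).reshape(-1, z_e.shape[1])
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(codes).reshape(z_e.shape[0], z_e.shape[2], -1).permute(0, 2, 1)
        # Straight-through estimator: copy gradients from z_q back to z_e.
        return z_e + (z_q - z_e).detach()

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, 4, stride=stride, padding=1), nn.ReLU())

class MultiScaleVQVAE(nn.Module):
    """Three encoders/quantizers at decreasing temporal resolution, one fused decoder."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.enc_bottom = conv_block(1, dim, stride=4)    # fine detail (waveform level)
        self.enc_mid = conv_block(dim, dim, stride=4)     # timbre and rhythm
        self.enc_top = conv_block(dim, dim, stride=4)     # global melody / harmony
        self.vq = nn.ModuleList([VectorQuantizer(dim=dim) for _ in range(3)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(3 * dim, dim, 4, stride=4), nn.ReLU(),
            nn.Conv1d(dim, 1, kernel_size=3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc_bottom(x)            # (B, dim, T/4)
        e2 = self.enc_mid(e1)              # (B, dim, T/16)
        e3 = self.enc_top(e2)              # (B, dim, T/64)
        q1, q2, q3 = self.vq[0](e1), self.vq[1](e2), self.vq[2](e3)
        # Upsample coarser codes to the bottom resolution before fusion.
        q2 = nn.functional.interpolate(q2, size=q1.shape[-1])
        q3 = nn.functional.interpolate(q3, size=q1.shape[-1])
        return self.decoder(torch.cat([q1, q2, q3], dim=1))

# Example: reconstruct one second of 16 kHz mono audio.
wave = torch.randn(1, 1, 16384)
recon = MultiScaleVQVAE()(wave)            # same shape as the input waveform
```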

3.4. User Interaction in Virtual Environments

The interactive features of the virtual space are built using the Unity 3D engine. As a real-time engine, Unity provides flexible support for resource management, rendering, and interaction systems. It allows for the efficient organization of model structures, texture loading, lighting management, and user interaction logic, thereby enabling the creation of digital cultural environments with structural coherence and expressive depth [55]. All 3D models are used in .glb format to ensure compatibility with Unity’s structural hierarchy, topology, and coordinate definitions. Texture assets follow the Physically Based Rendering workflow [56], covering base color, normal map, metallic, roughness, and ambient occlusion channels, and are imported in .png format. The Universal Render Pipeline is employed to efficiently map textures [57], enhancing the realism and detail of surfaces.
The system integrates real-time lighting, shadows, reflection probes, and post-processing effects to further enhance visual quality and immersive ambiance. This enhances dynamic adaptability in light behavior, material response, and color transitions. For dynamic updates, externally generated .glb models are saved into Unity directories using Python 3.9.13 scripts, and runtime model loading is handled via Resources.Load.
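A minimal sketch of this hand-off step is shown below; the folder names and Unity project layout are assumptions.

```python
# Copy newly generated .glb models into the Unity project's Resources folder
# so they can be loaded at runtime via Resources.Load; all paths are assumptions.
import shutil
from pathlib import Path

GENERATED_DIR = Path("output/nerf_exports")                      # pipeline output
UNITY_RESOURCES = Path("UnityProject/Assets/Resources/Reconstructions")

UNITY_RESOURCES.mkdir(parents=True, exist_ok=True)
for model in GENERATED_DIR.glob("*.glb"):
    shutil.copy2(model, UNITY_RESOURCES / model.name)
    print(f"Deployed {model.name} to the Unity Resources folder")
```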
Interaction recognition is built using Unity’s EventSystem and physical raycasting, enabling actions such as click-triggered responses and information display. Users can switch building surface textures, trigger pop-up windows, or play ambient sounds, thereby creating semantic connections with scene elements. Character navigation and camera control are supported by Unity’s CharacterController and Cinemachine, enabling smooth movement and dynamic viewpoint adjustment. The completed project is packaged as a standalone executable to support local deployment and interactive exhibition.

4. Experiments

4.1. Virtual Space Construction

The Great Mosque of Aleppo in Syria was selected as the experimental subject. The reconstruction of cultural heritage is achieved through the digital restoration of the mosque’s architectural form. By virtually recreating the religious scenes, the project provides spiritual support to aid the restoration of community identity and mental health, offering psychological comfort for the healing of post-war trauma. A customizable design methodology enables the reconstruction of the 3D model, transfer of surface textures and partial structures, and synchronized presentation of environmental sound effects. Research has shown that spatial configurations, especially the openness and layout of urban elements, significantly influence individuals’ experiences in the digital restoration of the mosque [58]. This insight underscores the importance of spatial fidelity and sensory coherence in virtual reconstructions. Such multidimensional representation reflects individuals’ diverse understanding of a utopian virtual space and enables them to construct diverse and personalized virtual environments through various interactive means.

4.2. Data Collection and Preprocessing

The Great Mosque of Damascus is one of the oldest and most architecturally significant monuments in the Islamic world. Originally constructed on the site of a Roman temple and later a Byzantine church, the mosque reflects a blend of Roman, Byzantine, and early Islamic architectural traditions. Its layout and features, such as the expansive courtyard, arcaded porticoes, grand prayer hall, and monumental minaret, exemplify Umayyad architectural innovation. The structure is primarily constructed from stone and marble, incorporating elements such as carved wooden ceilings, mosaics, and stucco work. Decorative motifs, including geometric patterns and Arabic calligraphy, are seamlessly integrated into structural components, showcasing both aesthetic refinement and cultural symbolism. This architectural heritage provided the foundational reference for the following digital documentation and restoration processes.
During data collection for texture style transfer and 3D reconstruction, over 10 representative objects inside the mosque, such as bells, the minaret, stone walls, and chandeliers, were selected for scanning. Scans were taken from three perspectives—horizontal, 20° upward, and 20° downward—so as to capture the overall architectural profile and ensure detailed structural coverage. Accurate 3D reconstruction is highly dependent on the spatial resolution and consistency of multi-angle image inputs. Prior studies have shown that differences in imaging configurations and processing workflows can significantly affect the dimensional accuracy of reconstructed models [59].
For the local structural redraw aimed at restoring surface textures and enhancing the realism of the Great Mosque of Damascus, reference images from the Syrian Heritage Archive (https://syrian-heritage.org/, accessed on 8 June 2025) were meticulously selected based on criteria such as historical authenticity, stylistic coherence, visual clarity, and alignment with the mosque’s architectural aesthetics. The selected images, featuring Arabic carvings and calligraphic decorations, share common artistic stylistic characteristics of Umayyad design, including strong geometric symmetry, repetitive interlacing patterns, and the use of Kufic and early cursive script forms. These stylistic elements reflect the precision and elegance of early Islamic visual culture and carry deep cultural and religious symbolism. By applying these images to the 3D model’s surfaces and selected structural components, the reconstruction enriches both the visual realism and the immersive quality of user interaction.
In audio feature extraction and fusion, sound elements, including the call to prayer, marketplace chatter, and urban traffic, were selected from the Great Mosque of Aleppo, so as to recreate its auditory ambiance. Approximately 30 audio clips, including eight call-to-prayer recordings across various times, 12 samples of marketplace sounds (vendors, buyers, and transactions), and 13 urban noise samples (vehicles and crowd flow), that reflect the mosque’s surrounding acoustic environment were curated from the Freesound cultural heritage archive.
To ensure cultural and intellectual responsibility in the digital restitution process, all reference materials were sourced from open-access heritage repositories with clearly defined usage rights. Visual assets, including photographs, motifs, and inscriptions, were obtained from the Syrian Heritage Archive under academic reuse provisions. Similarly, audio elements were curated from the Freesound cultural heritage database in accordance with applicable Creative Commons licenses. These measures collectively uphold the legal, ethical, and cultural integrity of the virtual reconstruction work.

4.3. Experimental Procedure

4.3.1. Texture Style Transfer and 3D Reconstruction

Decorative pattern images from the Syrian cultural heritage website were applied to the original dataset of the minaret to enable multi-style presentation on the 3D model surface (Figure 2a). Masking (gray-white areas) was performed on the minaret’s body to confine the style transfer to specific regions (Figure 2b). The masked minaret image was used as the source for Stable Diffusion, while Arabic-style decorative patterns served as reference inputs for the control model (Figure 2c). A reference preprocessor and pretrained model ensured accurate texture feature fusion. The generation process was precisely controlled by refining the prompt text to emphasize carvings and using negative prompts to limit shape distortion. Results of experiments showed that the method produced high-quality textures aligned with Islamic stylistic features (Figure 2d). This confirms that it is feasible to extract surface textures from multiple Arabic carving and pattern images and apply them to architectural skin materials, so as to enhance the visual richness of virtual space construction.
Based on the architectural images with decorative pattern styles (Figure 2d), a dataset of angle-continuous keyframes was constructed using the Wan2.1 model. ComfyUI was employed as the inference framework, and a lightweight Wan2.1 version was used for model execution. During dataset generation, LoRA weights were loaded, and prompt phrases were applied to ensure the creation of high-fidelity and temporally smooth video clips. A series of new minaret keyframes were extracted from these videos for NeRF training.
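As a small illustration of the keyframe extraction step, the sketch below samples evenly spaced frames from a generated clip with OpenCV; the sampling interval and file paths are assumptions.

```python
# Extract evenly spaced keyframes from a Wan2.1-generated clip as NeRF training views.
import cv2
from pathlib import Path

VIDEO = "output/minaret_orbit.mp4"        # generated video; path is an assumption
FRAMES_DIR = Path("data/minaret_frames")
FRAMES_DIR.mkdir(parents=True, exist_ok=True)

capture = cv2.VideoCapture(VIDEO)
frame_index, saved = 0, 0
while True:
    ok, frame = capture.read()
    if not ok:
        break
    if frame_index % 5 == 0:              # keep every 5th frame as a training view
        cv2.imwrite(str(FRAMES_DIR / f"frame_{saved:04d}.png"), frame)
        saved += 1
    frame_index += 1
capture.release()
print(f"Saved {saved} keyframes for COLMAP and NeRF training")
```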
In this stage, the new Wan2.1-generated minaret dataset was used to train the NeRF model (Figure 3a). COLMAP-extracted local features were first aligned to reduce positional and viewpoint shifts, improving NeRF training accuracy. Ray sampling precision was adjusted to ensure sufficient sampling density, which enhanced detail reconstruction. During training, the sparse point cloud from COLMAP was incorporated as a geometric constraint (Figure 3b). After 30,000 iterations, a fine, stylistically updated 3D model of the minaret was produced (Figure 3c) and exported in GLTF format. This result confirms the strength of NeRF in rendering architectural geometry and textures with high realism and fidelity.

4.3.2. Detailed Modification of 3D Reconstruction

After completing NeRF-based 3D reconstruction and texture style transfer, Stable Diffusion and ControlNet were further incorporated to enhance architectural diversity and design expression based on multi-view images. Through prompt-based control and structural guidance, localized 3D redraw experiments were conducted on architectural elements of the Great Mosque of Damascus. This approach explores the integration of diverse cultural architectural styles into 3D models, offering multidimensional style representations for refugee communities.
The entire workflow has five steps: multi-view image acquisition, image-to-image style transfer, prompt-controlled generation, ControlNet-based structural guidance, and NeRF re-training. First, images from various angles were captured from the trained NeRF model, forming a dataset of key architectural parts such as the minaret, dome, capital, column base, mihrab, and window lattice. Stable Diffusion’s image-to-image technique was then applied for local redraws, incorporating both positive and negative prompts to enhance decorative pattern depth. During this process, ControlNet ensured structural integrity by constraining output regions and preserving edge contours. HED and Canny were used as control preprocessors, with the edge weight set to 0.9, 512 × 512 resolution, sampler as DPM++ 2M Karras, CFG scale at 7, denoising strength at 0.5, and 20 sampling steps. Following manual filtering and style consistency checks, the validated images were used with the Wan2.1 model to build a multi-angle training dataset for NeRF re-modeling.
To demonstrate the workflow of 3D localized style transfer based on prompt-controlled generation and structural guidance, Figure 4 shows the example of the Dome of the Treasury in the Umayyad Mosque of Damascus. The process includes the following: (1) extracting original screenshots from the NeRF-generated model; (2) applying semantic control using prompts such as “lotus dome” and “arabesque carvings”; (3) generating a structural control map using HED as input for ControlNet; (4) producing style-enhanced images through Stable Diffusion; and (5) using the processed images for NeRF retraining to output a style-enhanced 3D model with optimized details.
Experiments covered multiple architectural components, including capitals, domes, columns, window lattices, chandeliers, roofs, and spires, generating over 100 images; a detailed table provided on GitHub presents the prompt settings, model choices, and control strategies for each type of component. Locally redrawn images of models such as the Dome of the Treasury and Mosque Capital demonstrated good performance in terms of detail density, structural clarity, and cross-cultural decorative coordination.
Based on the locally redrawn multi-angle images, a new NeRF model was retrained, and the model files were exported. The model demonstrated excellent performance in texture expression, geometric preservation, and visual expressiveness, with no significant morphological drift or damage observed, thus validating the structural stability and style adaptability of the proposed method.

4.3.3. Audio Reproduction

After completing the virtual reconstruction of the mosque, the extraction and fusion of environmental audio features were further explored. Three types of recordings from the Great Mosque of Aleppo were selected—the call to prayer, marketplace conversations, and urban traffic noise—each with distinct spectral, harmonic, and dynamic properties. The signals were divided by frequency bands to capture their hierarchical structure: low (20–200 Hz) for melody, mid (200–2000 Hz) for timbre and rhythm, and high (above 2000 Hz) for fine detail.
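A minimal sketch of this band-splitting step is shown below, using zero-phase Butterworth filters from SciPy with the cutoffs listed above; the input file name is an assumption.

```python
# Split a mono recording into the three analysis bands described above
# (20-200 Hz, 200-2000 Hz, above 2000 Hz).
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

audio, sr = sf.read("call_to_prayer.wav")   # file name is an assumption
if audio.ndim > 1:
    audio = audio.mean(axis=1)              # mix down to mono

def band_filter(signal, low, high, sr, order=4):
    """Zero-phase Butterworth band-pass; high-pass when no upper cutoff is given."""
    if high is None:
        sos = butter(order, low, btype="highpass", fs=sr, output="sos")
    else:
        sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)

low_band = band_filter(audio, 20, 200, sr)     # melodic / ambient foundation
mid_band = band_filter(audio, 200, 2000, sr)   # timbre and rhythm
high_band = band_filter(audio, 2000, None, sr) # fine detail and transients
```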
In the Jukebox processor, each audio clip generates a latent vector (Zs) with encoded features. A Short-Time Fourier Transform was used to produce spectrograms from these vectors; these were visualized with Matplotlib 3.5.2 to highlight frequency characteristics (Figure 5). The mosque audio showed concentrated energy in the mid-frequency range (200–2000 Hz), with a smooth, regular spectrogram structure reflecting stable acoustic properties. Marketplace recordings displayed complex, dense energy across the mid- and high-frequency bands, indicating a rich mix of speech and high-frequency noise. Urban traffic sounds exhibited strong energy in low and mid-frequencies, consisting mainly of sustained low-frequency vibrations and sporadic high-frequency elements.
To more clearly illustrate the audio features and the differences across frequency bands, the spectrograms of each frequency band were visualized (Figure 6a), and K-means clustering was employed to analyze the feature points, followed by 3D spatial visualization (Figure 6b–d). In the point cloud diagrams, colors represent features from different frequency bands, displaying the distribution results of audio signals after feature embedding into the latent space using our method. It can be observed that the features from different frequency bands form distinct clusters in the embedding space, indicating good performance in feature separation.
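The clustering and visualization step can be sketched as follows, assuming the per-frame latent vectors and their band labels have already been exported to disk; the file names, the PCA projection used for plotting, and the number of clusters are illustrative assumptions.

```python
# Cluster latent audio features and visualize their distribution in 3D.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumed inputs: per-frame latent vectors and the frequency band (0, 1, 2)
# from which each vector was extracted.
latents = np.load("latent_features.npy")      # shape (n_frames, latent_dim)
band_labels = np.load("band_labels.npy")      # shape (n_frames,)

# Group the latent vectors into three clusters (one per expected band).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latents)
print("Cluster sizes:", np.bincount(clusters))

# Project to three dimensions purely for plotting.
coords = PCA(n_components=3).fit_transform(latents)

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(projection="3d")
scatter = ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
                     c=band_labels, cmap="viridis", s=8)
ax.set_title("Latent-space distribution of frequency-band features")
fig.colorbar(scatter, label="Frequency band")
fig.savefig("latent_clusters_3d.png", dpi=200)
```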

4.4. Construction of Virtual Space

To realize the integrated reconstruction of post-disaster cultural heritage in a virtual context, architectural components, based on NeRF-trained 3D models and style transfer outputs, were exported in .glb format and embedded into a virtual environment modeled after the Umayyad Mosque of Damascus. Informed by the architectural layout and cultural semantics, this space creates an immersive field supporting multimodal interaction, aiming to evoke emotional memory and cultural ties. The model includes key or severely damaged components, including domes, capitals, window lattices, and chandeliers. Some regions were reshaped using localized 3D redrawing techniques and reintegrated into semantic partitions, displaying features of cross-cultural style fusion. Figure 7 shows the overall layout, where components are arranged by functional logic and visual hierarchy, forming a utopian setting that balances historical coherence with artistic expressiveness.
All 3D models were imported into Unity in .glb format. Given the challenge of restoring original sound sources, a point cloud-based sound visualization mechanism was implemented, converting audio into spatial coordinate points embedded in the virtual scene, where each point corresponds to a specific location, such as a prayer courtyard or marketplace arcade. When users approach or click a point, a “memory switch” triggers playback of audio fragments, such as call-to-prayer bells, marketplace chatter, or ambient hums, eliciting spatial perception tied to place.
In the Unity scene, the structure includes background elements (terrain, skybox), the main architecture, interactive objects (doors, windows, lighting), and character components. By configuring Unity’s EventSystem and raycasting, users can interact with building surfaces to switch styles, display information, or activate audio playback.
The virtual space shows a high degree of scalability and participatory capability. The system provides an open collaborative construction interface, which allows users to autonomously generate new cultural components by uploading images or providing keyword descriptions; these components can be embedded into the virtual space to continuously enrich its content. This mechanism enables users to reshape memories and assert cultural autonomy, transforming the virtual environment from a mere display platform into a field for the reconstruction of collective memory. Related experimental models, data, and procedural results can be accessed (as of 14 June 2025) at https://github.com/tendermango/Digital-Damascus-Minaret-Reconstruction-in-Virtual-Utopias. All project outputs are shared under non-commercial, open-access terms exclusively for academic use. The reconstructed assets are not intended for commercial purposes, and historical authenticity is carefully maintained to avoid misrepresentation or cultural distortion.

5. Discussion

The effectiveness of the multimodal virtual reconstruction system and its application in the alleviation of post-disaster psychological trauma among refugees are explored in this section. Interviews, testing, and surveys were conducted in person, incorporating the Impact of Event Scale-Revised (IES-R) for trauma-related psychological evaluation and the System Usability Scale (SUS) for usability evaluation, so as to quantitatively assess the experiences of participants.

5.1. Participants

Twenty participants were recruited for the study, with ages ranging from 18 to 50 years (mean = 31.4, SD = 8.01), with a balanced gender distribution (10 males and 10 females). All participants had prior exposure to traumatic events, such as natural disasters, and possessed personal memories related to the destruction of cultural heritage. Considering their refugee status, participation was voluntary. Participants were required to be willing to provide materials related to their cultural memories and to actively engage in discussions and testing. All participants were first-time users of such digital experience-oriented devices. To avoid potential psychological harm or identity exposure, relevant background information was not disclosed, and all data were anonymized.

5.2. Evaluation Procedure

Experiments were conducted through in-person discussions and testing, ensuring that participants fully understood and were actively engaged. Through the Instagram platform, each participant submitted an average of eight images related to local architecture and three audio clips reflecting their cultural traditions. Submitted images and audio materials were uniformly preprocessed and standardized to ensure consistent data quality and were used to generate customized virtual environments. In collaboration with the participants, neural generative technologies, as mentioned, including NeRF, Stable Diffusion, and VQ-VAE, were employed to reconstruct 3D architectural models from the provided images, apply texture-transfer and localized modifications, and generate immersive environmental soundscapes based on the audio content. Using Unity, these visual and auditory elements were integrated into an interactive virtual scene for subsequent evaluation.
Participants experienced these virtual scenes through Unity, with a total testing duration of approximately 1 h. First, 10–15 min were allocated for organizing the submitted materials and demonstrating the core functions of the virtual environment. During the model generation period, interviews were conducted, and participants were invited to complete a questionnaire. Then each participant underwent a semi-structured interview and responded to an open-ended questionnaire based on the IES-R, lasting approximately 25 min. The IES-R was administered once during system interaction to capture participants’ immediate psychological state and symptom levels. The interview focused on psychological changes during the virtual experience, paying particular attention to post-traumatic stress symptoms, enhanced cultural identity, and emotional fluctuations. Open-ended questions encouraged participants to freely describe their subjective experiences, supplementing information in the questionnaires. Interviews were audio-recorded and transcribed, and the resulting qualitative data were used for subsequent analysis to contextualize and enrich quantitative findings. After the interview, participants were given approximately 10 min to freely explore the virtual environment and were encouraged to provide immediate subjective feedback. Finally, participants completed the SUS questionnaire to evaluate system usability. The questionnaire included 10 items rated on a five-point Likert scale and took approximately 10 min to complete. The scores were then converted into a standardized 0–100 SUS score according to standard calculation procedures.

5.3. Evaluation Instrument Scale Descriptions

Two measurement dimensions were employed to comprehensively assess the effectiveness of the multimodal virtual reconstruction system in terms of user experience and psychological response. Psychological symptoms were evaluated using a semi-structured interview and open-ended questionnaire based on the IES-R, aiming to explore the system’s impact on participants’ post-disaster psychological state. The SUS was used to evaluate the operational experience and ease of use.

5.3.1. IES-R Trauma Interview Assessment

Given the high risk of post-traumatic stress disorder (PTSD) among populations affected by disaster, the IES-R was incorporated into the evaluation through a structured interview guide and psychological response questionnaire. The IES-R was proposed by Weiss and Marmar and consists of 22 items [60]. The full scale was not directly administered, but its dimensional structure was referenced in the design of a hybrid questionnaire combining structured and open-ended components. The interview focused on three key areas: (1) alleviation of psychological distress; (2) changes in emotional or mental states; and (3) restoration of cultural identity. Each item was rated on a 0–4 scale (0 = not at all; 4 = extremely), with a total possible score ranging from 0 to 88. Results of previous studies have indicated that an IES-R total score greater than or equal to 24 suggests a potential risk of moderate to severe PTSD symptoms [61].

5.3.2. System Usability Scale

System usability was assessed using the SUS, by which participants evaluated the system’s performance and user experience [62,63]. The SUS consists of 10 items that measure aspects of user perception such as ease of use, satisfaction, importance, and usefulness. The scale includes five positively worded items (Items 1–5) and five negatively worded items (Items 6–10). Each item is rated on a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree) [64]. A more intuitive version of the SUS was adopted, featuring statements such as “I think the system is easy to use and helps me reconstruct architectural memories from before the disaster” and “I believe most people could quickly learn to use this system to restore cultural heritage destroyed by conflict.” Responses were converted into a standardized score ranging from 0 to 100. According to industry benchmarks, an SUS score above 68 indicates above-average usability, and a score above 80 represents excellent usability.
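For reference, a small sketch of the score conversion is shown below, following the item polarity described above (Items 1-5 positively worded, Items 6-10 negatively worded) and the conventional 2.5 multiplier that maps the 0-40 raw range onto 0-100; the example response pattern is hypothetical.

```python
# Convert raw SUS responses (1-5 Likert) into a 0-100 score, following the
# item polarity described above: Items 1-5 positively worded, Items 6-10 negatively worded.
def sus_score(responses: list) -> float:
    """responses: ten Likert ratings in questionnaire order (1 = strongly disagree)."""
    assert len(responses) == 10
    positive = sum(r - 1 for r in responses[:5])   # Items 1-5: higher is better
    negative = sum(5 - r for r in responses[5:])   # Items 6-10: lower is better
    return (positive + negative) * 2.5             # raw 0-40 scaled to 0-100

# Hypothetical, mostly favorable response pattern.
example = [4, 5, 4, 4, 5, 2, 1, 2, 2, 1]
print(sus_score(example))   # 85.0
```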

5.4. Evaluation Results

In interviews, most participants expressed strong identification with, and gave positive feedback on, the virtual scenes generated by the system. Regarding visual realism and cultural fidelity, participants generally perceived the reconstructed environments as highly authentic and historically accurate. Several noted that architectural structures, stone textures, and urban layouts closely resembled their pre-disaster memories, evoking emotional resonance such as a “return to childhood.” This demonstrates the potential of the multimodal reconstruction system to restore cultural authenticity.
Participants also expressed a strong emotional connection to the “embodied presence” of their uploaded content within the virtual space, perceiving it not as a generic template but as “my memory space” [65]. This phenomenon of “contextual personalization” aligns with the theory of personalized digital storytelling in recent digital anthropology, which posits that individuals reconstruct their cultural identities through active engagement in digital media processes. As shown in the statistical results in Figure 8, most participants believed the system provided a platform for expressing cultural memory and triggering emotional projection. Engaging in multisensory cultural interactions within the virtual space supported post-disaster cultural identity reconstruction and helped alleviate long-term psychological stress.
Survey results (Figure 9) generally indicated positive experiences with the system.
All participants completed the cross-sectional psychological assessment using the IES-R during the test. The average IES-R score was 24.4 (SD = 5.91), with 11 participants scoring below the moderate PTSD threshold of 24, indicating that overall trauma levels were low to moderate. In the “psychological distress” dimension, approximately 65% of participants scored between 0 and 10 on related items, reflecting mild symptoms such as flashbacks or tension. For the “emotional and psychological state” dimension, the average score was around 7.85, with participants commonly reporting feeling “calmer” and “emotionally relaxed” during the experience. Regarding cultural identity, over 60% of participants stated that the virtual environment helped them “reconnect with cultural memory.”
Regarding system usability (SUS), the 20 participants achieved an average SUS score of approximately 78.5 out of 100, above the 68-point benchmark for above-average usability, indicating that most found the system “easy to use.” Many participants reported that the system was intuitive and straightforward, and while they rarely interacted with high-tech products in daily life, they were able to quickly adapt and successfully reconstruct cultural heritage destroyed by conflict using their photos.
The proportion of treatment barriers was identified through the survey on refugees’ backgrounds and psychological responses (Figure 10). Some participants still faced challenges such as device limitations and uncertainty regarding the realism of the virtual environment, but these obstacles did not significantly diminish their overall emotional connection to the system, nor their willingness to engage with it [66]. The survey also revealed that elements such as sound and architecture within the virtual environment strongly resonated with personal memories, with some participants reporting unprecedented emotional insights through the virtual experience.

6. Limitations and Future Work

The study has several limitations. The sample was drawn exclusively from disaster-affected regions within specific geographic areas, and its cultural homogeneity may limit the generalizability of the findings. Additionally, the current system supports only viewing and perspective switching, without interactive editing or embodied engagement, and the mechanisms underlying the elicited emotional responses remain to be fully understood. The collection of source materials was affected by individual preferences and by technical issues related to device variability; material quality and network transmission stability also influenced the fidelity of model generation. Future research could incorporate multi-channel data collection and cloud-based processing to enhance sample diversity and platform adaptability.
Participants directly engaged with the construction of the virtual space and provided culturally informed modification suggestions, resulting in representations that more accurately reflected the local landscape and the diversity of memory expression among different refugee groups. Future work could explore simple online deployment strategies that allow users to independently perform “immersive roaming,” “gesture mapping,” and “voice annotation” for more personalized experiences. Notably, virtual spaces may offer emotional outlets for bereaved refugees, especially orphans, who face particular challenges in processing and expressing grief [67]. Long-term, systematic studies of specific sensory combinations are needed to better understand the system’s emotional regulation effects. Therefore, the collaborative application of post-disaster virtual spaces in refugee mental health care warrants further investigation.

7. Conclusions

This study explored the application of virtual digital technologies to socio-cultural heritage reconstruction following disasters, integrating 3D reconstruction, multimodal interaction, and collaborative design to pursue the dual objectives of heritage restoration and psychological trauma intervention. To link conceptual analysis with practical design, the application of virtual digital technologies to urban reconstruction was adopted as the core domain. Using Neural Radiance Fields (NeRF), high-precision reconstruction of disaster-damaged cultural heritage was achieved, and immersive content was generated by integrating Vector Quantized Variational Autoencoder (VQ-VAE)-based audio reconstruction with Stable Diffusion-based style transfer, creating an interactive post-disaster virtual environment built with the Unity engine. This open virtual platform empowered refugees to participate in reconstructing cultural symbols and facilitated emotional expression among culturally displaced groups by activating implicit memory through multisensory interaction. Although the effectiveness of these approaches may be limited, they nonetheless provide an innovative pathway for addressing psychological trauma among disaster-displaced populations in the technological era. Empirical evaluation using IES-R and SUS survey data demonstrated the potential of technological empowerment to enhance cultural belonging and support psychological healing. This research therefore offers valuable insights for interdisciplinary post-disaster reconstruction, illustrating a novel approach that integrates cultural preservation with emotional healing. Ultimately, multimodal virtual digital technologies serve not only as tools for the digital preservation of cultural heritage, but also as mechanisms for rebuilding emotional connections through technological empowerment and multimodal interaction.

Author Contributions

Conceptualization, G.C., Y.T. and Y.W. (Yongjin Wu); Methodology, G.C., Y.T. and Y.W. (Yongjin Wu); Software, G.C., Y.T. and Y.W. (Yongjin Wu); Validation, G.C. and Y.T.; Formal Analysis, G.C., Y.T. and Y.W. (Yuwei Wu); Investigation, G.C., Y.W. (Yuwei Wu) and Z.L.; Resources, G.C. and Y.T.; Data Curation, G.C., Y.W. (Yuwei Wu) and Z.L.; Writing—Original Draft Preparation, G.C., Y.T. and Y.W. (Yuwei Wu); Writing—Review and Editing, G.C., Y.T. and Y.W. (Yuwei Wu); Visualization, G.C. and Y.W. (Yuwei Wu); Supervision, J.H.; Project Administration, J.H.; Funding Acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52278014.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the anonymous and non-invasive nature of the procedures involving human participants.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study have been uploaded to a private GitHub repository (https://github.com/tendermango/Digital-Damascus-Minaret-Reconstruction-in-Virtual-Utopias) and are available from the corresponding author upon reasonable request. The authors intend to make the repository publicly accessible on 14 June 2025. All image and audio data used for 3D reconstruction, style transfer, and immersive environmental simulation were obtained from publicly accessible, open-source, and legally authorized platforms, including the Syrian Heritage Archive and open-access sound libraries. Experimental and survey data involving expert input were collected with informed consent and under appropriate ethical approval, ensuring full compliance with legal, institutional, and academic standards.

Acknowledgments

During the preparation of this manuscript, the authors used Stable Diffusion (v2.1) with a Low-Rank Adaptation (LoRA) module to generate synthetic image content for experimental purposes in the virtual reconstruction workflow. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Mediastika, C.E.; Sudarsono, A.S.; Utami, S.S.; Setiawan, T.; Mansell, J.G.; Santosa, R.B.; Wiratama, A.; Yanti, R.J.; Cliffe, L. The sound heritage of Kotagede: The evolving soundscape of a living museum. Built Herit. 2024, 8, 38. [Google Scholar] [CrossRef]
  2. Salman, R. Emotional Aspects of Cultural Heritage Destruction During War Conflicts. Master’s Thesis, Charles University, Prague, Czech Republic, 2023. [Google Scholar]
  3. Belal, A.; Shcherbina, E. Heritage in post-war period: Challenges and solutions. IFAC-PapersOnLine 2019, 52, 252–257. [Google Scholar] [CrossRef]
  4. Naser, S.R. Revealing the Collective Memory in the Digital Reconstruction of the Lost Cultural Heritage—Example of Al-Nuri Mosque in Mosul. Blue Shield Türkiye. 2024. Available online: https://theblueshield.org/wp-content/uploads/2024/03/BST-BSI-EFFECTS-OF-WAR-TRAUMA-ON-STUDENTS-FROM-WAR-AREAS_compressed.pdf#page=41 (accessed on 14 May 2025).
  5. Nazarenko, V.; Martyn, A. Geospatial technologies in post-war reconstruction: Challenges and innovations in Ukraine. Zemleustrìj Kadastr ì Monìtorìng Zemel’ 2024, 2024, 7. [Google Scholar] [CrossRef]
  6. Hutson, J.; Weber, J.; Russo, A. Best practices in digital twins for cultural heritage preservation. J. Cult. Herit. Manag. Sustain. Dev. 2022, 12, 47–62. [Google Scholar]
  7. Pietroni, E.; Ferdani, D. Virtual restoration and virtual reconstruction in cultural heritage: Terminology, methodologies, visual representation techniques and cognitive models. Information 2021, 12, 167. [Google Scholar] [CrossRef]
  8. Abukarki, H.J. Beyond preservation: A survey of the role of virtual reality in experiencing and understanding historical architectural spaces. Buildings 2025, 15, 1531. [Google Scholar] [CrossRef]
  9. Haux, D.H.; Dominicé, A.M.; Raspotnig, J.A. A cultural memory of the digital age? Int. J. Semiot. Law-Rev. Int. Sémiot. Jurid. 2021, 34, 769–782. [Google Scholar] [CrossRef] [PubMed]
  10. Ikeuchi, K. e-Heritage, cyber archaeology, and cloud museum. In Proceedings of the 2013 International Conference on Culture and Computing, Kyoto, Japan, 16–18 September 2013; pp. 1–7. [Google Scholar] [CrossRef]
  11. Neglia, G.; Angrisano, M.; Mecca, I.; Fabbrocino, F. Cultural heritage at risk in world conflicts: Digital tools’ contribution to its preservation. Heritage 2024, 7, 6343–6365. [Google Scholar] [CrossRef]
  12. Reading, A. Memory and digital media: Six dynamics of the globital memory field. In On Media Memory: Collective Memory in a New Media Age; Zelizer, B., Tenenboim-Weinblatt, K., Eds.; Palgrave Macmillan: London, UK, 2011; pp. 241–252. [Google Scholar]
  13. Pereira, E.; Raulin, A.; Menezes, R.; Pereira, E.; de Souza Pinto, D.; Pinheiro, T.M.; Loir-Mongazon, E.; Girard, G.; Goacolou, E.; Fonseca, V.L. Mourning, reconstruction, and the future after heritage catastrophes: A comparative social science perspective agenda. J. Cult. Herit. 2023, 65, 199–205. [Google Scholar] [CrossRef]
  14. Georgieva, I.; Georgiev, G.V. Reconstructing personal stories in virtual reality as a mechanism to recover the self. Int. J. Environ. Res. Public Health 2020, 17, 26. [Google Scholar] [CrossRef]
  15. Lin, Z.; Yang, Z.; Yuan, J. Research on the design and image perception of cultural landscapes based on digital roaming technology. Herit. Sci. 2024, 12, 397. [Google Scholar] [CrossRef]
  16. Liao, J.; Yao, Y.; Yuan, L.; Hua, G.; Kang, S.B. Visual attribute transfer through deep image analogy. ACM Trans. Graph. (TOG) 2017, 36, 120:1–120:15. [Google Scholar] [CrossRef]
  17. Kerdreux, T.; Thiry, L.; Kerdreux, E. Interactive neural style transfer with artists. In Proceedings of the 11th International Conference on Computational Creativity (ICCC), Lisbon, Portugal, 7–11 September 2020; pp. 123–130. [Google Scholar] [CrossRef]
  18. An, J.; Huang, S.; Song, Y.; Dou, D.; Liu, W.; Luo, J. ArtFlow: Unbiased image style transfer via reversible neural flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 862–871. [Google Scholar] [CrossRef]
  19. Chen, Y.; Chen, R.; Lei, J.; Zhang, Y.; Jia, K. TANGO: Text-driven photorealistic and robust 3D stylization via lighting decomposition. Adv. Neural Inf. Process. Syst. 2022, 35, 30923–30936. [Google Scholar] [CrossRef]
  20. Liu, K.; Zhan, F.; Xu, M.; Theobalt, C.; Shao, L.; You, S. StyleGaussian: Instant 3D style transfer with Gaussian splatting. In Proceedings of the SIGGRAPH Asia 2024 Technical Communications, Tokyo, Japan, 3–6 December 2024. [Google Scholar] [CrossRef]
  21. Amico, N.; Felicetti, A. Ontological entities for planning and describing cultural heritage 3D models creation. In Proceedings of the 18th International Conference on Digital Preservation (iPRES), Beijing, China, 19–22 October 2021; pp. 345–352. [Google Scholar] [CrossRef]
  22. Zachos, A.; Anagnostopoulos, C.-N. Using TLS, UAV, and MR methodologies for 3D modelling and historical recreation of religious heritage monuments. ACM J. Comput. Appl. Archaeol. 2024, 17, 56:1–56:23. [Google Scholar] [CrossRef]
  23. Altaweel, M.; Khelifi, A.; Zafar, M.H. Using generative AI for reconstructing cultural artifacts: Examples using Roman coins. J. Comput. Appl. Archaeol. 2024, 7, 1–15. [Google Scholar] [CrossRef]
  24. Wang, S.; Zhang, J.; Tun, A.N.; Sein, K. Research on identification, evaluation, and digitization of historical buildings based on deep learning algorithms: A case study of Quanzhou World Cultural Heritage Site. Buildings 2025, 15, 1843. [Google Scholar] [CrossRef]
  25. Fang, W.; Li, H.; Zhang, Y. A GAN-based approach for 3D point cloud completion in cultural heritage preservation. Remote Sens. 2024, 16, 5542. [Google Scholar]
  26. Song, H.; Choi, S.; Do, H.; Lee, C.; Kim, T.-K. Blending-NeRF: Text-driven localized editing in neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14383–14393. [Google Scholar] [CrossRef]
  27. Ye, X.; Wang, J.; Liu, Z. Point cloud transfer for UV mapping in low-poly 3D models. Sensors 2022, 22, 7718. [Google Scholar]
  28. Song, H.; Kim, J.; Lee, S.-H. Multi-view attention-based 3D local structure fusion for cultural heritage reconstruction. Appl. Sci. 2024, 14, 3230. [Google Scholar]
  29. Pang, H.-W.; Hua, B.-S.; Yeung, S.-K. Locally stylized neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 12345–12354. [Google Scholar] [CrossRef]
  30. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3813–3824. [Google Scholar] [CrossRef]
  31. Magalhães, M.; Melo, M.; Coelho, A.F.; Bessa, M. Affective landscapes: Navigating the emotional impact of multisensory stimuli in virtual reality. IEEE Access 2024, 12, 169955–169972. [Google Scholar] [CrossRef]
  32. Gozzi, A.; Grazioli, G. Listen to the theatre! Exploring Florentine performative spaces. In Proceedings of the 2023 Immersive and 3D Audio: From Architecture to Automotive (I3DA), Bologna, Italy, 5–7 September 2023. [Google Scholar] [CrossRef]
  33. Kritikos, Y.; Giariskanis, F.; Protopapadaki, E.; Papanastasiou, A.; Papadopoulou, E.; Mania, K. Audio augmented reality for cultural heritage outdoors. In Proceedings of the EUROGRAPHICS Workshop on Graphics and Cultural Heritage (Short Papers), Delft, The Netherlands, 28–30 September 2022; pp. 37–40. [Google Scholar] [CrossRef]
  34. Zou, C.; Rhee, S.-Y.; He, L.; Chen, D.; Yang, X. Sounds of history: A digital twin approach to musical heritage preservation in virtual museums. Electronics 2024, 13, 2388. [Google Scholar] [CrossRef]
  35. Chen, J.; Mao, Q.; Liu, D. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2642–2646. [Google Scholar] [CrossRef]
  36. Yang, H.; Su, J.; Kim, M.; Jin, Z. Genhancer: High-fidelity speech enhancement via generative modeling on discrete codec tokens. In Proceedings of the Interspeech 2024, Dublin, Ireland, 1–5 September 2024; pp. 1170–1174. [Google Scholar] [CrossRef]
  37. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  38. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726. [Google Scholar] [CrossRef]
  39. WanTeam; Wang, A.; Ai, B.; Wen, B.; Mao, C.; Xie, W.-C.; Chen, D.; Yu, F.; Zhao, H.; Yang, J.; et al. Wan: Open and Advanced Large-Scale Video Generative Models. 2025. Available online: https://arxiv.org/abs/2503.20314 (accessed on 14 May 2025).
  40. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual Conference, 25–29 April 2022. [Google Scholar] [CrossRef]
  41. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 387–397. [Google Scholar] [CrossRef]
  42. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  43. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  44. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 405–421. [Google Scholar]
  45. Schönberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar] [CrossRef]
  46. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783. [Google Scholar] [CrossRef]
  47. Meng, Q.; Chen, A.; Luo, H.; Wu, M.; Su, H.; Xu, L.; He, X.; Yu, J. GNeRF: GAN-based neural radiance field without posed camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6351–6360. [Google Scholar] [CrossRef]
  48. Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1395–1403. [Google Scholar] [CrossRef]
  49. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [Google Scholar] [CrossRef] [PubMed]
  50. Zhu, Z.; Engel, J.H.; Hannun, A. Learning multiscale features directly from waveforms. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 1305–1309. [Google Scholar] [CrossRef]
  51. van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  52. Dhariwal, P.; Jun, H.; Payne, C.; Kim, J.W.; Radford, A.; Sutskever, I. Jukebox: A Generative Model for Music. 2020. Available online: https://arxiv.org/abs/2005.00341 (accessed on 14 May 2025).
  53. Bregman, A.S. Auditory Scene Analysis: The Perceptual Organization of Sound; MIT Press: Cambridge, MA, USA, 1990. [Google Scholar]
  54. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 17022–17033. [Google Scholar] [CrossRef]
  55. Unity Technologies. Unity Manual: Unity User Manual. 2021. Available online: https://docs.unity3d.com/Manual/index.html (accessed on 14 May 2025).
  56. McAuley, S.; Hill, S.; Hoffman, N.; Gotanda, Y.; Burley, B.; Martinez, A. Practical physically-based shading in film and game production. In Proceedings of the ACM SIGGRAPH 2012 Courses, Los Angeles, CA, USA, 5–9 August 2012. [Google Scholar] [CrossRef]
  57. Unity Technologies. Universal Render Pipeline Overview. 2021. Available online: https://docs.unity3d.com/Packages/com.unity.render-pipelines.universal@10.2/manual/index.html (accessed on 14 May 2025).
  58. Huang, H.; Jie, P. Research on the characteristics of high-temperature heat waves and outdoor thermal comfort: A typical space in Chongqing Yuzhong District as an example. Buildings 2022, 12, 625. [Google Scholar] [CrossRef]
  59. Wang, D.; Shu, H. Accuracy analysis of three-dimensional modeling of a multi-level UAV without control points. Buildings 2022, 12, 592. [Google Scholar] [CrossRef]
  60. Beck, J.G.; Grant, D.M.; Read, J.P.; Clapp, J.D.; Coffey, S.F.; Miller, L.M.; Palyo, S.A. The Impact of Event Scale–Revised: Psychometric properties in a sample of motor vehicle accident survivors. J. Anxiety Disord. 2008, 22, 187–198. [Google Scholar] [CrossRef] [PubMed]
  61. Creamer, M.; Bell, R.; Failla, S. Psychometric properties of the Impact of Event Scale–Revised. Behav. Res. Ther. 2003, 41, 1489–1496. [Google Scholar] [CrossRef] [PubMed]
  62. Lewis, J.R. The system usability scale: Past, present, and future. Int. J. Hum.–Comput. Interact. 2018, 34, 577–590. [Google Scholar] [CrossRef]
  63. Liu, X.; Nikkhoo, M.; Wang, L.; Chen, C.P.C.; Chen, H.-B.; Chen, C.-J.; Cheng, C.-H. Feasibility of a kinect-based system in assessing physical function of the elderly for home-based care. BMC Geriatr. 2023, 23, 495. [Google Scholar] [CrossRef]
  64. Likert, R. A technique for the measurement of attitudes. Arch. Psychol. 1932, 140, 1–55. [Google Scholar]
  65. Linares-Vargas, B.G.P.; Cieza-Mostacero, S.E. Interactive virtual reality environments and emotions: A systematic review. Virtual Real. 2025, 29, 3. [Google Scholar] [CrossRef]
  66. Love, A.W. Progress in understanding grief, complicated grief, and caring for the bereaved. Contemp. Nurse 2007, 27, 73–83. [Google Scholar] [CrossRef] [PubMed]
  67. Segal, R.M. Helping children express grief through symbolic communication. Soc. Casework 1984, 65, 590–599. [Google Scholar] [CrossRef]
Figure 1. Multimodal virtual reconstruction framework for the Great Mosque of Aleppo in Damascus, Syria.
Figure 2. (a) Image from primitive minaret dataset; (b) prompt image and model parameters; (c) control map and ControlNet model parameters; and (d) output image and all parameters.
Figure 3. (a) Wan2.1-generated dataset; (b) sparse point cloud from COLMAP; and (c) 3D Gaussian splatter from NeRF.
Figure 4. Experimental workflow of local style transfer on the Dome of the Treasury.
Figure 5. Visualization of potential audio features in the mosque environment: (a) prayer spectrum; (b) Arabic chat spectrum; and (c) urban environmental sound spectrum.
Figure 6. (a) Spectrogram by frequency band; (b–d) K-means clustering of features (yellow: high; pink: mid; and blue: low).
Figure 7. Spatial layout of stylized structures and sound points in virtual space.
Figure 8. Virtual world system user experience.
Figure 9. (a) IES-R scores of 20 participants; (b) scoring statistics for each question with different scores; and (c) SUS scores of 20 participants.
Figure 10. Proportion of treatment barriers.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
