1. Introduction
Ancient paintings illustrate tangible aspects of heritage, such as styles, characters, and constructions, while serving as an intangible window into the economy, politics, and customs of the past. These illustrations usually show how a peaceful and joyful era might be achieved from the artist’s point of view. Such content was presumably used to promote a widely desired ideal living environment by depicting the joyful activities of civilians. Urban and suburban district markets were painted as miniature depictions of the culture and customs of different dynasties. Although these subjects can be metaphorically exemplified within the current context, almost all of their characteristics have evolved or become obsolete, with only limited contemporary connections remaining. Their contemporary transfer should therefore inspire a new interpretation of experience in heritage preservation, and regeneration and interaction should be developed to connect tangible and intangible modern-day experiences.
1.1. Research Purpose
The aim of this research is to preserve the characters depicted in 2D paintings and to enable interaction with them in 3D using AI and AR. AI generation creates new tangible characters and enhances the comprehension of an intangible culture, while AR interaction makes the tangible characters portable on smartphones. Thus, AI and AR can be combined to activate a new paradigm of cultural heritage preservation.
This research covers the open-access digital collections of traditional Chinese paintings archived at the National Palace Museum, mainly New Year’s Market in a Time of Peace [1] (Figure 1). This painting presents a collection of vendors, street artists, and people of different ages exhibiting different behaviors on a long painted scroll in a 3D parallel perspective.
1.2. Related Studies
Ancient characters represent an intangible window to the styles of the past. Identity in contemporary culture includes interactions between urban spaces and deployed graphic statements in relation to characters. These interactions involve unique spatial structures, and the role of applications extends to the social–spatial dialectic, cultural identity, and collaborative tools and environments.
Public visibility is contingent on the urban environment [2]. Characters and graffiti share a morphological resemblance, while urban redevelopment provides a platform and opportunities for the increased visibility of graffiti [3,4]. The old painting elucidates old lifestyles and living environments through its characters. Graffiti is part of the context of the social–spatial dialectic [5].
An AR character is a tailored form of content that must be reconstructed on a virtual platform ahead of time and deployed afterward. Few studies have been conducted regarding combined AI and AR simulation. Three-dimensional reconstruction requires sufficiently well-pictured imagery to improve the modeling quality. A survey has shown that 3D Gaussian Splatting (3DGS), which represents a paradigm shift in neural rendering, has the potential to become a mainstream method for the 3D representation of characters. It effectively transforms multi-view images into 3D Gaussian representations in real time [6] and produces highly detailed 3D reconstructions. Software can estimate camera poses for arbitrarily long video sequences [7]. In a study on the surface reconstruction of large-scale scenes captured using a UAV, the quality of the surface reconstruction was ensured at heavy computational cost [8]. High-precision real-time rendering reconstruction has also been applied to cultural relics and buildings in large scenes [9], peach orchards [10], and historic architecture using 360° capture [11]. In addition to assisting in the massive 3D digitization of remains after the Notre-Dame de Paris fire [12], neural rendering can be applied to leaf structures [13], the analysis and promotion of dance heritage [14], and an ethical framework for cultural heritage and creative industries [15]. Special renderers have been developed for the effective visualization of point clouds or meshes [16].
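For reference, 3DGS renders each pixel by depth-sorting the projected Gaussians and alpha-blending their contributions; a standard formulation of this compositing step from the 3DGS literature (not a formula given in this paper) is

$$C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right),$$

where $c_i$ and $\alpha_i$ are the color and projected opacity of the $i$-th depth-sorted Gaussian covering the pixel.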
However, the 2D nature of paintings and field-access restrictions usually prevent effective 3D reconstruction. RODIN® applies a generative model for sculpting 3D digital avatars using diffusion [17], with the 3D neural radiance field (NeRF) model exhibiting computational efficiency. Three-dimensional content generation has benefited from advanced neural representations and generative models [18]. In regard to details, generative AI can be used to create accessories for a virtual costume [19].
Three-dimensional digital avatars, together with virtual costumes and accessories, may be used to reconstruct the detailed 2D characters in paintings. This reconstruction should reactivate the tangible heritage of ancient paintings by rebuilding the characters and drawing on modern-day urban social fabrics for a better-placed connection between them.
Both images and physical facilities contribute to preservation. The role of active imagination is important [20] and is missing in many intangible aspects of cultural heritage [21]. Just like the complex interaction that occurs with heritage languages, paintings are a type of visual language that exhibits a fundamental structural difference from the contemporary context. The generation of meaning through images calls for the exploration of specific mechanisms [22]. The adaptive reuse of cultural heritage can be a valid strategy to recover heritage buildings [23], while sustainable conservation can transform buildings into resources in a contemporary context [24]. Activating both images and physical facilities can be challenging in regard to interaction with the urban context. The use of augmented reality and GPS-guided applications has proven successful in reviving cultural heritage [25].
This study takes a new approach to AR in comparison to existing 3D scanning or photogrammetry [26,27]. This approach allows for fast and intuitive simulation using an image-based generative approach with AI and translates the results to AR. Digital technology has evolved alongside the positive effects of street art on the urban landscape [28], and new forms of visual language systems have emerged in the form of stickers for asynchronous communication [29]. Three-dimensional digital characters should create a new type of identity and facilitate cross-cultural communication, with the potential to increase its clarity.
2. Materials and Methods
Characters in ancient 2D paintings were usually differentiated, created, and directed to respond to the urban context through situated interactions. The subjects covered many types, such as gods, goddesses, females, artists, poets, exiled persons, children, street buskers, vendors, soldiers, animals, and ancient beasts. The market painting, New Year’s Market in a Time of Peace, is part of the digital resources archived at the National Palace Museum. The whole roll of the painting was scanned into frames at 600 dpi for open download under the image copyright of “CC BY 4.0@www.npm.gov.tw”. The painting illustrates and interprets subjects through their roles, jobs, living setups, and clothing.
Characters and urban social fabrics are interconnected. A simulation flowchart was developed, following the processes of image collection, image editing, parameterized training and generation, texture referencing, AR editing, and AR publishing (Figure 2). Image editing includes masking, AI generation of missing parts, noise-reduction filtering, and quality enhancement (a minimal sketch of this step is given below), while AR editing includes the adjustment of relative scale and deployment orientation. The scope of our project began with a test subject or a single character and expanded to a family, a vending booth, a stage with an audience, and a street block. The private-to-public classification arrangement was simulated with the support of cloud-accessed AR models and smartphones.
The platform consists of several parts (Figure 2). The humanistic experience was analyzed via deployment on a campus and in a commercial district, verified by first- and second-generation recursive 3D reconstruction, which is sustainable and environmentally friendly. The first-generation reconstruction using RODIN® relied on only one image of the old painting, instead of the general photogrammetry of 3DGS, which requires as many images as possible. The second-generation reconstruction, on the other hand, used (1) smartphones to take images or videos and (2) 3D Zephyr® or similar software to create 3D models of the characters and the urban social fabric from these images or videos. The AR platform manages AR cloud access. Field deployment was facilitated using a smartphone supporting a depth API for a drift-free and focused inspection of the displayed characters. Verification was performed using reconstructed 3D models, screenshots, and side-by-side comparisons.
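A minimal sketch of preparing smartphone video for the second-generation photogrammetric reconstruction (frame sampling before import into Zephyr® or similar software; the file name, sampling rate, and sharpness threshold are hypothetical):

```python
import cv2
import os

VIDEO = "street_walkthrough.mp4"   # hypothetical smartphone capture
OUT_DIR = "frames"
EVERY_N = 15                       # keep every 15th frame for photogrammetry

os.makedirs(OUT_DIR, exist_ok=True)
cap = cv2.VideoCapture(VIDEO)
index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % EVERY_N == 0:
        # Skip blurry frames: variance of the Laplacian as a sharpness proxy.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
        if sharpness > 100.0:      # hypothetical threshold
            cv2.imwrite(os.path.join(OUT_DIR, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
    index += 1
cap.release()
```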
For this painting, about 25 characters were used for training and 3D reconstruction in RODIN® as part of the collaborative tests, alongside another 550 3D models at various levels of acceptance. Field tests were conducted by selecting related screenshots, and a program was used to render outputs of the 780 images and videos taken with smartphones such as the Sony Xperia® 1 II (Sony Group Corp., Tokyo, Japan) and Xiaomi® 15 Ultra (Xiaomi Corp., Beijing, China).
2.1. The Tool for 3D Training and Generation
RODIN® from Hyper3D® (Table 1) was used to reconstruct subjects using AI. The AI-trained models should exhibit features similar to those in the original painting. RODIN® trained characters from an initiated seed benchmark of -1. Parameterized variations were performed to create symbolic or physical details at different levels of similarity, especially for clothes (including wrinkles and curvature), accessories, and animals. Inconsistencies that arose during RODIN® training may have resulted in a variety of versions, each containing certain differences.
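To make the parameterization concrete, the sketch below enumerates seed and CFG-scale combinations for comparison. The GenerationJob structure and all file names are hypothetical stand-ins (the actual RODIN® interface differs); only the seed of -1 and the CFG scale mirror the parameters discussed here and in Section 2.3.

```python
from dataclasses import dataclass

# Hypothetical job description for an image-to-3D service such as RODIN;
# only `seed` and `cfg_scale` correspond to parameters discussed in the text.
@dataclass
class GenerationJob:
    image_path: str
    seed: int = -1          # -1 = benchmark/random seed, as used initially
    cfg_scale: float = 7.0  # classifier-free guidance; higher follows the input more closely

def parameter_sweep(image_path, seeds, cfg_scales):
    """Enumerate seed/CFG combinations to compare generated 3D variants."""
    return [GenerationJob(image_path, seed=s, cfg_scale=c)
            for s in seeds for c in cfg_scales]

# Compare variants of one painted character (hypothetical crop) under
# different settings before choosing the favored seed and CFG value.
jobs = parameter_sweep("vendor_front_view.png", seeds=[-1, 42, 1337],
                       cfg_scales=[4.5, 7.0, 9.5])
for job in jobs:
    print(f"submit: {job.image_path} seed={job.seed} cfg={job.cfg_scale}")
```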
The consistency, capacity, and completeness of the generated models were of concern, especially with only one image being available. The final model featured differently oriented objects that were differentiated or clustered, with people facing forward or backward and with depth between foreground and background objects. Each cluster of people involved a unique training sequence and composition hierarchy. The allocated hierarchy and training completeness allowed for, at most, a cluster of five people with recognizable details.
Follow-up AR interactions and interpretation were conducted on the Augment® (until April 2025) and Sketchfab® platforms through cloud access to specific AR model formats on different sites. AR models with different compositions were produced to rearrange the people in the picture with regard to closeness and facial orientation. Collaborative media were used; for example, the 3D characters were also 3D-printed for physical verification.
2.2. Test Sequence
Most of the unique human figures were painted or composed using different levels of curvature. The experiment started with a simple free-form object, a symmetric object, and an asymmetric object that were or were not related to wood sculptures of Chinese-themed ornaments. In the test sequences, images of general free-form objects, such as flowers, were selected, as well as prototype and non-prototype physical sculptures, prior to creating human characters in different roles or professions. The features ranged from thin rose petals, fine hairs, and the facial features of a lion head to a rather symbolic representation of a crane. Depth layers may or may not have been correctly generated from only one image.
The source images or field photographs consisted of layers of abstract or characteristic forms and compositions, with or without plain backgrounds (Figure 3). Three-dimensional models were reconstructed with or without reference to existing physical forms. The selected urban fabric included two campuses and a business street, which featured a sequence of views that one might find on an everyday walking or driving route.
2.3. Seed Elaboration and Style Fusion
The tests ranged from general to restricted seed-led 3D reconstruction, encompassing abstract or basic shapes as well as AI-simulated characters; the latter were sourced from the painting. In order to represent the possible evolution of characters in context, images were applied, starting with a text-based description and evolving into iconic 3D modeling.
The process included three parts (Figure 4): (a) selecting the most meaningful characters; (b) selecting the preferred 3D reconstructed outcomes; and (c) deciding on the favored seed and the CFG (classifier-free guidance) scale value for material rendering. Geometry generation may assist in balancing the physical details between focused specific parts (hairs, right-hand fingers, or chains, for example) and the whole body and clothes in general (Figure 4b). Seed development was attempted by comparing the painting with 3D physical models or Stable Diffusion® (SD)-generated outcomes. The seeds were also elaborated for multiple views or under fusion, with a satisfactory proportion between different images. Gestures (body language), clothes, and accessories were adjusted under different weights for a realistic style combination; a simple 2D analogy of this weighting is sketched below.
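As a 2D analogy of the weighted fusion (the actual multi-view fusion happens inside RODIN®; this sketch, with hypothetical file names, only illustrates how a sweep of blend weights between two source images can be compared):

```python
from PIL import Image

# Two source depictions of the same character (hypothetical files):
# the painting crop and an SD-generated stylistic variant.
img_a = Image.open("character_painting.png").convert("RGB")
img_b = Image.open("character_sd_variant.png").convert("RGB").resize(img_a.size)

# Sweep the blend weight; alpha=0.0 keeps image A, alpha=1.0 keeps image B.
for alpha in (0.25, 0.50, 0.65, 0.75):
    fused = Image.blend(img_a, img_b, alpha=alpha)
    fused.save(f"fusion_w{int(alpha * 100):02d}.png")
```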
Subjective decisions had to be made during each adjustment process. As mentioned above, a straightforward decision was made regarding the similarity to the features depicted in each character. The final 3D model was not a 100% accurate replica of the painted character, as some part of the body was always missing or hidden behind other objects. The configuration of the shape and texture of clothing was still mainly interpreted and interpolated by RODIN®. In this study, we did not use the tools provided within the software to modify shape.
2.4. AR Model Categories and Field Tests
In this approach, we presume that conversion into modern-day cultural heritage should proceed through interaction between AR characters and the 3D physical urban fabric across distinct urban spaces. Augment® and Sketchfab® provided a cloud-accessed database of AR models, which were created in RODIN® and edited on the AR platforms for on-site download and field deployment (Figure 5).
3. Results
Scenes and screenshots were generated from recordings. The former were smartphone-framed 3D visual frustums; the latter were smartphone screenshots from AR apps with or without control buttons or prompts in Augment® or Sketchfab®. The composition and related poses in AR were applied at three sites on campus.
3.1. Characters
Characters consist of general individuals, vendors, workers, children, street artists, and families (Figure 6). They were purposely adapted into the current-day urban social fabric and are depicted partaking in daily experiences.
3.2. Interaction Patterns
Three-dimensional models of people were used to create a scene of a specific scenario in the spring market. People were positioned relative to each other in various compositions, alone or in a group. In Sketchfab®, three models can be joined to form one. The compositions and related poses were applied in different settings using AR (Figure 7), as presented in the painting (Table 2). This application is similar to today’s role-playing activities, where participants wear ancient costumes, but uses AR instead. Lines of sight frequently intersected and gestures were commonly used, implying interconnection between individuals or within groups.
3.3. Urban Social Fabric Adaptations
Introduction into the environment facilitates rethinking how the old social fabric was arranged in the painting. The scenarios for modern-day urban adaptation consist of themes, roles, spaces, and their interconnections. Themes range from town, community, market, and family to the seasons. For example, spring is the first season of a new year, and the market is a micro-representation of a country’s economy.
New Year’s Market in a Time of Peace features a seasonal metaphor to represent people’s various living patterns through activities occurring at or near the market. This painting by Ding Guanpeng was created during the Qing Dynasty, in 1742 [1], on a paper roll with a landscape layout.
Roles range from family gatherings and vendors to young and old customers, while space setups range from open or staged gatherings to entertainment spaces. One of the most successful scenes was set in a business district full of parked cars and complex advertising panels, in which the characters seemed to have merged with and become part of the pedestrians (Figure 7). The route, spanning from a campus to the outer streets and then back to the campus, allowed for a positive trans-context experience, especially when the scene was enriched by an elementary school field trip on the campus (Figure 8).
3.4. Reconstruction of a Character and Urban Social Fabric
A character [32] and a landscape were filmed (Figure 9a) and reconstructed as proof of presence in Zephyr® and Postshot® (Figure 9b,c).
3.5. Three Dimensions with Stable Diffusion®
Setups of ancient characters do not always merge well with general modern-day scenes. Additional AI tools, such as SD®, were applied to create consistency between characters and styles. Textural consistency relates to style and weaving patterns, such as texture or fibers, while structural consistency consists of the rational modeling of a configuration from the front to the back of invisible parts, such as makeup, hair, jewelry, clothes, accessories, or finishes. This creates a complete set extending around the body, as if it were part of it.
RODIN®, SD®, and Photoshop® have built on traditional image tools, using modern-day training models to interpret and reactivate heritage works. RODIN® is both a generation and an assembly tool capable of 3D modeling from different images for multiple views or fusion. It works with 2D imagery resources or paintings rendered in different styles (Chinese or Western) or 3D tones, from watercolors and generative renderings to unclear images or paintings. It can take abstract line drawings, text descriptions, complicated but non-matching forms, or guiding lines and references in order to reach the required level of realism. Photoshop® generation also infills outputs and diffusive outcomes.
Similarities exist between SD® and RODIN® in the convergence of diffused 3D models. SD® features a different control interface to limit or direct the process towards expected or relatively acceptable answers [33]. AI-assisted image restoration was first conducted using the SD® ControlNet paradigm (Figure 10), using the Canny, OpenPose, MLSD, Lineart, SoftEdge, Scribble/Sketch, Inpaint, InstructP2P, Reference, T2I-Adapter, or IP-Adapter options. Image diffusion with text prompts facilitated more AR interaction in terms of style across dynasties or professions, for example, between officials, workers, and emperors, while RODIN® created the other 3D parts to replace missing information. As a result, a loop was developed from a refined diffusive process constantly fed with heuristics to correct the graphics.
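A minimal sketch of one such ControlNet-conditioned diffusion pass, using the open-source diffusers library (the model IDs are public checkpoints; the input file, prompt, and parameter values are hypothetical, and Canny is just one of the conditioning options listed above):

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Canny-edge conditioning extracted from a scanned character crop
# (hypothetical file name).
gray = cv2.imread("character_crop.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# The text prompt steers style across dynasties or professions, while the
# edge map pins the original structure of the painted character.
result = pipe("a Qing dynasty street vendor, ink-and-color painting style",
              image=control, num_inference_steps=30,
              guidance_scale=7.5).images[0]
result.save("restored_variant.png")
```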
Mixing styles is as important as highlighting individual aspects. The former concerns issues resulting from transferring already known object syntax matched with novel patterns to either highlight or blend differences. Any research tool capable of this can provide sufficient background information after differences are discovered, followed by creating new patterns from old ones. Differences between the characters may include differences between categories and details (Table 3).
When mixing 3D styles, RODIN® combines or blends different details from two images under various weight distribution proportions (Figure 11). Details may include the one or two pieces making up the front of a Chinese jacket, including the aglet and waistband on a robe or slacks. Different weights are designated to generate 3D versions of models at different levels of similarity, which leads to a modularized creation process for the different characters. For example, mixtures of accessories and clothes are generated or trained by combining 3D forms of ancient hairstyles and garments with makeup painted by different artists from different dynasties. The balance can be adjusted and enhanced to form more varieties by choosing the dominating or key image before material generation (from a PBR temperature of 4.5/reference strength of 0.65 to full).
The combined style does not necessarily represent the evolved development of customs; rather, it is a reverse study method to distinguish the similar and different parts of two characters, providing an alternative way of establishing a dress code.
4. Discussion
Activated heritage preservation aims to redefine old subjects in a contemporary context. Seeking a matching context is the first step towards understanding living and artistic compositions. Cross-dynasty composition is useful for cross-referencing characters outside of the complex background of an individual. The tools and setups used to achieve this should also be applicable to accessories, goods, or furniture. This approach also facilitates the study of painting styles using symbolic curves, either smooth or polygonal calligraphy lines, to depict cloth wrinkles.
This novel way of viewing an ancient painting contributes to reinterpretations of the context of old paintings using AI and interaction in AR, which uses a smartphone in the basic setup. This paper is structured to present feasibility data, platforms, devices, and examples. This study intends to reactivate painted characters in a familiar modern-day social fabric. To do this, the interaction was staged in two forms: in a mixed context with mixed characters from a painting, and in a familiar context in the real world with mixed characters.
4.1. Group Rearticulation
Each scene was a picture residing within a larger picture as part of a story. Context was determined from convergent lines of sight, gestures, or orientation. People were seldom isolated or presented without some kind or level of communication with others. Even if a single person was present in the scene, they were observed by others from a distance, in contrast to a portrait of a lone person. As a result, this can be represented as a one-way stage performance surrounded by groups of audience members, for example, maids awaiting a master’s call for service, or as a conversation between two people in a homogeneous or heterogeneous background.
Lines of sight and facial expressions revealed that there was more to uncover regarding the interaction within small groups of people, rather than just considering them unrelated individuals. One scene featured two ladies emerging from a gate to see a man with two children off to the market (Figure 12). Ladies were supposed to stay at home away from the public, and it is interesting that, in the painting, there are only a few females in the field, such as a street artist, a girl, a servant, and an old lady. The traditional social value was rearranged in RODIN® using the front image first, and the missing part was filled in using Photoshop® and generative AI. Then, using AR, the two groups of family members could be augmented together on a modern campus with street artists and vendors.
4.2. Adaptation and Restructuring
Adaptive reuse of national treasures should consider the composition and restructuring of characters. AR provides an interesting immersive experience where characters can be introduced into a contrasting context. The first question considered for urban AR was the selection of a similar or differentiated urban social fabric. Since the living background is different, using AR, the characters were adapted to a new site or scenario, such as the traditional market or campus-like landscape in spring with a street artist or vendor. There are many choices when tailoring a character to a site.
The composition was inevitably restructured through adaptation due to contrasting eras or inconsistent landscapes. The interlaced timelines created differences and, at the same time, forced exchanges between characters and the urban social fabric. Similarly to the dynamics that occurred during ancient festivals, composition and contrast were replicated again in AR as an optional stage-on-demand in a city.
The introduction of an old painting into the current urban social fabric can be achieved in a divergent or convergent manner. In contrast to hanging paintings on museum walls or posting media on webpages, divergent thinking can be applied to promote or reactivate preservation. Graffiti is a good example of where the deployment can either be divergently distributed to as many places as possible or, at the same time, convergently gathered at a hotspot shared with other artists. This is why graffiti was used as a metaphor to describe the combination of AI and AR. SD® also works as an additional method of divergent activation, using controlled alternatives as a research tool to understand styles.
4.3. Three-Dimensional Rationality in Stage Effects
The inconsistency between 3D models is a valuable indicator for determining a painting’s meaning and composition. Contents that are inconsistent or need to be verified enhance inspection in terms of structural and visual details. Yakov Chernikhov, a Russian architect and graphic designer, was known for the constructivist style, frequently adopting a one-point perspective to exaggerate a building’s scale [36]. The same intention was applied in the creation of Chinese paintings, and AI was used to rationalize the pose and relative distribution between characters and accessories. It was found that the characters acted similarly to actors, using the front-facing area to maximize the stage effect. Deployment was conducted in detail to facilitate invisible conversation across regions within a limited canvas space. It is assumed that AI not only reconstructed 3D subjects based on one single image, but also rationalized the supposedly correct configuration of these characters.
The artist’s intentions can be adjusted in AR depending on how the model is deployed. The differentiation between AI-assisted generation and physical appearance was verified visually; the more images that were taken on-site of the same subject, the more consistent the reconstructed model was. The observed differences deviated between supposedly rationalized and purposely irrational deployment outcomes.
Moreover, details were purposely deformed in some way as a stage effect for audiences, as such artifacts used to serve as media in temples or as paintings with an educational purpose. However, this effect is subject to different levels of spatial restriction, such as on a 360-degree panorama stage or a theater stage; in a 2D painting in perspective or parallel perspective, 3D characters are framed or boxed in at the front, with a limited viewing depth. This deployment is important, especially where the subjects were deployed and restricted by a narrowed visual cone or viewing orientation.
4.4. Development of a Theoretical Framework
The assumption was straightforward in that the preserved or activated subjects should be introduced within a modern-day fabric, and not just as a morphological icon or commercial peripheral object. Verification should also be straightforward using an accessible device or equipment.
Theoretical and methodological levels were connected to reconstruct the framework and to reorganize the results. It can be understood as follows: images or videos taken from smartphones are used repeatedly to document and create 3D models. This process is simple and facilitates heritage preservation and activation, and it is assisted by two of the most popular technologies, AI and AR. The success of the reconstruction verifies the feasibility of the theoretical and methodological aspects.
This looped process reconstructed characters from a painting in the urban fabric of the real world, based on the assumption that a city is a museum supporting the on-demand humanistic experience of ancient characteristics. The reconstruction of both the character and the social fabric highlights that this method is a subjective and personal documentation tool for exploring urban areas, and represents a new check-in ritual for reconnecting with the environment. It merges art into our daily lives instead of being a painting on a museum wall.
5. Conclusions
AI and AR activated ubiquitous preservation and interaction. Reconnecting the past with the present was made possible and, at the same time, the context of an ancient society was visualized through the interactions between characters.
Recursive processing of diffused images and reconstructed 3D models facilitated the interaction between the painting and reality. The painting was reactivated and readapted to new urban fabrics using characters that were reconstructed, rearticulated, and recomposed in different combinations of single individuals or groups. Not only was AI able to reconstruct the characters, but AR was also used to reinterpret human relationships with the environment. AR has become a powerful tool to reinterpret traditional contexts in the modern world.
This interaction was modeled in two ways: in a mixed context with mixed characters from a painting, and in a familiar context in the real world with these mixed characters. Deployment of the technology was either sustainable and environmentally friendly, or was merged into a situated scene similar to a location-specific dress code. Using AI and AR is an appropriate method to create an intangible window to the past through either a contrasting or a consistent approach.
This comprehensive critical AI approach used to study old paintings is both a contribution and a limitation. The combination of AI and AR opens up the possibility of adapting platforms, enabling rearticulation and restructuring within the domain of ethics. The 3D rationality presents a positive way to explore detail. One of the most important findings of this study is that collaboration across different AI platforms (SD®, for example) can improve our comprehension of the past.
The theme of the new year’s market was an appropriate example. This study does not limit future studies, but rather enables a collaborative exploration of AI data and tools. Future research will follow the development of AI and AR in order to visualize and merge new character styles from different pictures.