Realistic Virtual Humans for Cultural Heritage Applications

Abstract: Virtual Humans are becoming a commodity in computing technology and have lately been utilized in the context of interactive presentations in Virtual Cultural Heritage environments and exhibitions. To this end, this research work underlines the importance of aligning and fine-tuning Virtual Humans' appearance to their roles and highlights the importance of affective components. Building realistic Virtual Humans has traditionally been a great challenge, requiring a professional motion capture studio and heavy resources in 3D animation and design. In this paper, a workflow for their implementation is presented, based on current technological trends in wearable mocap systems and advancements in software technology for their creation, animation, and visualization. The workflow spans motion recording and segmentation, avatar implementation, retargeting, animation, lip synchronization, face morphing, and integration into a virtual or physical environment. The workflow is tested in a use case for the Mastic Museum of Chios, and the implementation is validated both in a 3D virtual environment accessed through Virtual Reality and on-site at the museum through an Augmented Reality application. The findings of a formative evaluation support the initial hypothesis, and the lessons learned are transformed into a set of guidelines to support the replication of this work.


Introduction
Virtual Humans (VHs) are today considered a commodity in the domains of computer animation, cinema, and games. In a sense, they are changing the way that audio-visual information is presented, and in a gaming context they have contributed to a more humanlike interaction metaphor through Non-Playable Characters (NPCs). Furthermore, recent technical advances in computing technologies have made Virtual and Augmented Reality (VR and AR) easily accessible to the general public through powerful mobile devices and inexpensive VR headsets. In this sense, the evolution of computing technology has made it possible to access high-quality audio-visual content, including 3D representations, from a simple web browser.
AR and VR could benefit from realistic VHs. AR is an interactive experience that uses the real world as a reference and is enhanced by computer-generated sensory stimuli. In the case of visual AR, the goal is to create the illusion that the computer-generated content is a seamless part of the environment. In the case of VR, the pursued illusion is that the virtual environment and everything in it are real, including the VHs. In both cases, lack of realism, manifested as physical inconsistencies of virtual content with the environment (physical or digital), is known to be a key factor of observed fatigue and misperception of distance in all types of audiovisual and Extended Reality (XR) displays [1,2]. For example, when VHs are part of the augmented content, lack of realism in even very simple and ordinary animations produces dislike for the end outcome, due to the "uncanny valley" effect [3,4]. To avoid such inconsistencies, VHs should be as realistic as possible in their appearance, animations, and speech.
Heritage 2021, 4 4149
In the context of digitization and preservation of Cultural Heritage (CH), AR and VR technologies have been widely adopted. Cultural institutions (including museums) all over the world seek compelling ways to reach new audiences and enhance the museum-visiting experience. AR and VR technologies could not be left out of such a context, as they enhance both the physical and the virtual museum-visiting experience. Augmenting interactive exploration essentially allows users to experience culture without the need to come into contact with the real objects and to visit objects that do not have a physical presence or are under maintenance, while also offering additional experiences such as immersion, personalized guidance [5,6], and augmentation of exhibits with multimedia and storytelling [7]. Additionally, such technologies can direct visitors' attention through emphasizing and superimposing techniques, which effectively enhance the learning experience [8]. Indeed, digitally mediated personalization and personalized learning have become prominent trends in museums in recent years. For example, through mobile apps, museums can provide supplementary information about exhibits or the museum itself [7]. In this context, VHs can be a valuable arrow in the curators' quiver, since they can impersonate persons of the past, storytellers, curators, guards, visitors, personal guides, and so on [9] and undertake the role of guiding visitors while providing extra information in the form of multimedia, offering visitors an enhanced museum-visiting experience.
However, little work in the literature presents a methodology for creating realistic VHs that present tangible and intangible CH aspects, such as presenting an exhibit and explaining its usage (tangible), or narrating stories of the past around the exhibits and conveying the life and work of people of a previous era, as an elder would narrate to their grandchildren (intangible), all according to the visitor's interests. Such narrators would not only allow visitors to fully understand what they see in a museum's space but also allow them to mentally travel into the past, providing a deeper understanding of how people lived and worked back then.
In light of the above, this paper presents a cost-effective methodology for creating realistic storyteller VHs for CH applications. The proposed methodology covers all steps of creation and presentation of virtual storytellers in various settings including VR and AR, focusing on their looks, movement, and speech. We do that through a case study in the context of the Mingei European project [10], which aims at representing and making accessible both tangible and intangible aspects of crafts as CH [11]. The presented methodology was used in the Mastic pilot of the Mingei project, which aims at preserving the traditional craft of mastic cultivation and processing [12]. This craft is unique in the world and takes place only on the island of Chios, Greece. As previous research has shown that emotional responses caused by VHs are heavily affected by whether the VH's looks match their profession [13], our case study refers to creating realistic interactive VHs that represent actual workers of the Chios Mastic Grower's Association, which undertakes the processing of mastic tears from their initial form into their final form in the market and also into the famous Greek mastic gum. The final VHs will provide visitors, through narratives, with information about the machines exhibited at the Chios Mastic Museum and the lives of the people of that age. However, the methodology presented is independent of the case study and can be applied to the creation of realistic storyteller VHs for any cultural heritage application.

Background and Related Work
The usage of VHs in Digital Cultural Heritage (DCH) environments has been the subject of several research works [14,15]. For example, [9] studies the affective potential, persuasiveness, and overall emotional impact of VHs with different professional and social characteristics (a curator, a museum guard, and a visitor) in an immersive Virtual Museum (VM) environment. In that study, persuasiveness relates to the VHs' capacity to engage, affect, and stimulate emotional and cognitive responses by employing different narration styles. The authors underline the importance of aligning and fine-tuning narrative styles and contents to VHs, which should correspond in terms of appearance to their roles, and highlight the importance of affective components in their storytelling approach [16]. Testón and Muñoz in [17] undertake the challenge of immersing visitors in the museum in a way that encourages them to discover the hidden stories of the ancient city of Valencia. To do that, they analyze different interfaces to achieve natural and humanized behaviour in a museum visit, concluding that VH solutions emerge as cost-effective, empathic mediums to engage new audiences and highlighting that narratives "represent a new way to discover hidden treasures from the past". In [17], the early stages of virtual guides for onsite museum experiences are presented. The authors used several portraits exhibited in the museum to build VHs representing the corresponding personalities, which were then used to present the exhibits and narrate stories of the past to the visitors. In another work [18], VHs were used for guiding users through virtual CH environments. The VHs provide users with the context and background of virtual exhibits through legends, tales, poems, rituals, dances, customs, etc. VHs have also been used for preserving and simulating cultures [19] and teaching crafts.
For example, [20] utilizes VHs in a Virtual Environment (VE) for teaching the craft of printmaking, while [21] utilizes VR as a tool for communicating the craftsmanship of engraving. In [22], Danks et al. combine interactive television storytelling and gaming technologies to immerse museum visitors in the artifacts on exhibition, engaging the user in the physical space using virtual stories, while [23] describes the use of a VH as a means for providing interactive storytelling experiences at a living history museum.
Research has shown that VHs can affect the virtual experience and stimulate attention and involvement [24–26]; they can therefore make the stories presented in VEs more credible and influence users positively and constructively. Furthermore, they contribute to the suspension of disbelief, which enables users to become immersed and follow the story and its turn of events. The areas that should be paid attention to when designing VEs have been categorized as (i) Information Design, (ii) Information Presentation, (iii) Navigation Mechanism, and (iv) Environment Setting. Information design and presentation, in particular, stress that users should be able to understand the significance of information provided in an engaging way through narratives [27]. Narratives, when successfully used for guiding the user through a VM, motivate visitors to stay longer and see more [28]. Besides, visiting a cultural site in the company of a guide who tells fascinating stories about the exhibits becomes a memorable experience, and when human guides are a scarce resource, or the doors to a museum or gallery are closed, VHs can bring these experiences to a wider audience as well as provide a welcome invitation to discovery [29].
As previous research has shown, a VH's looks, along with their behavior, heavily affect the user's response to them. Thus, it is important for VHs to look, move, and sound natural. In this vein, this paper provides a methodology to address those issues, the tools utilized, and the reasons we chose them. This section provides background on the choice of tools that helped us achieve each of the three main goals for our characters: looks, motion, and speech.

Realistic VHs Creation
As we focus on realistic VHs, tools that aim at creating cartoon-like VHs, such as Ready Player Me [30], were rejected; instead, we examined integrated solutions for building high-resolution realistic ones. Among the latter, Character Creator 3 [31] by Reallusion is a full character creation solution for designers, enabling easy creation and customization of realistic-looking character assets. Another character creation software is DAZ Studio [32], which is aimed at users who are interested in posing human figures for illustrations and animation. MakeHuman [33] is a free, open-source, interactive modeling tool for creating custom 3D human characters; however, according to its official documentation, some of its tools have not yet been created or are in the early stages of development (poses, animation cycles, managing facial expressions, hair, and clothes). Another software that promises realistic VHs is Didimo [34], which focuses on the creation of life-like digital representations of real people; however, it currently only supports the generation of human heads. Having reviewed other solutions in addition to the aforementioned ones, we decided to adopt Reallusion's Character Creator 3 suite, as it creates high-resolution, realistic, whole-body VHs and integrates well with the tools chosen for realizing realistic animation, lip-synching, and facial expressions.

Realistic Presentation of Human Motion
Probably the most effective way of achieving believable human-like animation of VHs is motion capture (mo-cap), which refers to the process of recording the movement of objects or people. It is used in military, entertainment, sports, and medical applications, and for validation of computer vision [35] and robotics [35,36]. The roots of mo-cap technology reach back to the 19th century, when photographer Eadweard Muybridge studied the motion of humans and animals through stop-motion photography [37]. The basic principles of his study would soon serve filmmakers when Max Fleischer invented the Rotoscope in 1915 (Figure 1). In essence, a camera would project a single frame onto an easel so that the animator could draw over it, frame by frame, capturing realistic movement for the on-screen VHs. Rotoscoping was partially used in 1938's 'Snow White and the Seven Dwarves', 'Star Wars', and others [38].
During the following years, organizations working in biomechanics made strides in monitoring and tracking the human body's motions for medical research, introducing the concepts of degrees of freedom and the hierarchical structure of motion control. In the early 1980s, Tom Calvert attached potentiometers to a body to drive computer-animated figures in order to study movement abnormalities [39], while Ginsberg and Maxwell presented the Graphical Marionette, a system using an early optical motion capture system that featured a bodysuit with LEDs on anatomical landmarks and two cameras with photodetectors that returned the 2D position of each LED in their fields of view. The computer then used the position information from the two cameras to drive a 3D stick figure [40].
This technique evolved over the following years and reached a state where it was used in movies, from "The Lord of the Rings" (2001–2003) to "The Polar Express" (2004) and others (Figure 2).
The aforementioned mo-cap techniques, including those based on retroreflective markers, belong to the greater family of "optical" motion capturing systems. They essentially utilize data captured from image sensors to triangulate the 3D position of a subject between two or more cameras calibrated to provide overlapping projections. Such motion capturing systems are very costly, since they require many cameras (a typical system consists of around 2 to 48 cameras). Many companies offer complete optical motion capturing solutions, including suits, markers, cameras, and software; examples are Qualisys [41], Vicon [42], NaturalPoint [43], and Motion Analysis [44]. Markerless optical systems have also been developed [44][45][46][47][48][49][50]. Such systems use the OpenPose system [51] or multiple sensors like Kinect to track body motion; however, they are less accurate. Hybrid approaches have also been developed, enabling marker-based and markerless tracking with the same camera system [52].
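To make the triangulation step concrete, the following is a minimal sketch in Python rather than the method of any of the cited systems. It assumes that each calibrated camera has already converted a detected marker into a viewing ray (a camera centre and a direction in world coordinates) and estimates the marker position as the midpoint of the shortest segment between two such rays. The function name and argument layout are illustrative assumptions.

```python
def _dot(a, b):
    """Dot product of two 3D vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Estimate a 3D point from two camera rays (midpoint method).

    c1, c2: camera centres in world coordinates (assumed known from calibration).
    d1, d2: ray directions towards the observed marker (need not be unit length).
    Returns the midpoint of the shortest segment joining the two rays.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        # Parallel rays carry no depth information.
        raise ValueError("rays are (nearly) parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two hypothetical cameras observing the same marker.
marker = triangulate_midpoint(
    [0.0, 0.0, 0.0], [1.0, 2.0, 5.0],
    [4.0, 0.0, 0.0], [-3.0, 2.0, 5.0],
)
```

With noisy detections the two rays no longer intersect, which is why the midpoint (or a least-squares estimate over more than two cameras) is used; this is also one reason why additional overlapping cameras tend to improve robustness against occlusion.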
Non-optical motion capturing (mo-cap) systems do not use any sort of physical marker. They are much more affordable, since they are mostly software-based and require much less equipment [35]. Essentially, they track the motion of sensors, e.g., inertial sensors, stretch sensors, etc. For example, magnetic systems calculate position and orientation from the relative magnetic flux of three orthogonal coils on the transmitter and each receiver [53]. The relative intensity of the voltage or current allows computers to calculate both range and orientation by mapping the tracking volume. However, magnetic mo-cap systems are susceptible to magnetic and electrical interference from metal objects in the environment, like rebar (steel reinforcing bars in concrete), or from electrical sources such as monitors, lights, cables, and computers [53], making them unsuitable for recording motions in most house and office environments. Other mo-cap systems utilize wearable stretch sensors [54][55][56], which are not affected by magnetic interference and are free from occlusion. However, the data collected are transmitted via Bluetooth or direct input, which severely limits the actors' freedom to move. Such systems are used to detect minute changes in body motion [57] and are rarely used for motion capture. Another solution found in the literature is mechanical motion capture, which requires the user to wear a physical recording device for each joint (an exoskeleton), such as the Gypsy Mocap system [58]. Such systems can suffer from motion data drift, noise, and the limitations of mechanical compared to human motion [59].
The most popular non-optical motion capture technology makes use of inertial sensors. Most of the systems that belong to this category make use of Inertial Measurement Units (IMUs), which contain a combination of gyroscope, magnetometer, and accelerometer. The data collected from such units are wirelessly transmitted to a computer where the motion is translated to a VH. This allows for great freedom in movements, excellent accuracy, ease of use, zero occlusion issues, quick and easy setup, and the ability to record in various environments. These benefits, along with the significantly lower cost than the optical-based methods, make inertial systems increasingly popular amongst game developers [60].

Realistic Presentation of Human Speech
For a VH to speak, two activities have to be synchronized: first, lip movement and face morphing to support facial expressions, and second, audio containing the VH's speech content performed with the VH's voice. Both activities are very important for presenting realistic virtual speech, especially when a very expressive VH is needed. However, in the case of VHs used for narrations in CH sites, more emphasis should be put on the audio part of the speech, as narrators usually do not express very intense emotions (such as surprise, fear, enthusiasm, etc.) that need to be emphasized on the VH's face, as is the case with most cartoonish characters met in games. Most of the time, the lip movement and facial expressions of storyteller VHs do not directly contribute to a better user experience and will hardly be noticed by users, as the respective VHs occupy only a small proportion of the user's view window. More specifically, in AR the VHs are displayed on the relatively small screen of a tablet or mobile phone, while in VR the VH is placed in a peripheral space, as the main exhibit should be placed directly in front of the user's view window to draw their attention. For example, VH narrators for cultural applications can be presented at the side of the user's view window and show mild emotions. In some cases, however, the character is placed in the middle of the user's view window, expressing more vivid emotions [20,61,62].
Face morphing and lip-synching can be automatically generated by software, based on the text or audio that has to be performed, or they can be captured directly from an actor's face. The second option is more accurate and can produce higher-quality facial morphing and lip motion. It immediately rules out the use of automatic Text-to-Speech (TTS) synthesizers, as it is almost impossible to synchronize automatically generated speech with the captured lip motion. Moreover, captured facial expressions would express emotions that synthesized voices cannot fully express, leading to an uncanny effect that would break the suspension of disbelief. Capturing facial expressions should always be accompanied by audio recorded by a human actor, preferably at the same time as the facial capturing. However, precise synching of the recorded audio and the captured lip motion is not an easy task, and it additionally requires re-capturing both audio and facial expressions if the narrated story changes (even if the curator needs to add or remove a single word from the narration). On the other hand, automatically generated lip-synching and face morphing are made possible via specialized software that drives the VH's facial anchors in a way that their expression and lips match the given text or audio. TTS software works by first analyzing the input text, then processing it, and finally converting it to digital audio by concatenating short sound samples from a database [63]. TTS software is cheap, easy to use, and quick to adopt, as it requires no voice recordings and the input text can change at any time, allowing curators to freely edit narrations. However, as the speech is automatically generated, it still faces some drawbacks, for example, inadequate expression of emotions and shortcomings in the naturalness and intelligibility of spontaneous speech, among others [64].
TTS can be used in cases where natural voice recordings are hard or impossible to make, such as in screen readers [65], translators, etc. Voice recordings are a much better solution when it comes to performing specific text parts that do not change frequently. They do pose a serious time overhead, since the recordings should be performed by actors, usually in a studio, and then processed with audio-processing software. Once recorded, it is difficult to correct a wrong pronunciation or change a word, since such changes require time-consuming audio processing or recording a new voice track from scratch. However, voice recordings sound natural, as actors can color their voices and express emotions. They are much more pleasing to hear and are more comprehensible.
Lip-synching is a technical term for matching lip movements with sound. The respective software accepts voice recordings as input, which it analyzes to extract individual sounds, known as phonemes. Then, the program uses a built-in dictionary to select the appropriate viseme (mouth shape) for each sound. Much research work has been done on lip-synching [66–68]; for example, Martino et al. in [69] generate speech-synchronized 3D facial animation that copes with anticipatory and perseverative coarticulation.
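The dictionary lookup described above can be sketched as follows. This is a simplified illustration and not the algorithm of any tool discussed in this paper: the phoneme symbols (ARPAbet-style), the viseme labels, and the function name are all assumptions, and the phoneme timings are assumed to come from a separate audio analysis step.

```python
# Illustrative phoneme-to-viseme dictionary (hypothetical labels, not a standard set).
VISEME_TABLE = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "wide", "IH": "wide", "EH": "wide",
    "OW": "round", "UW": "round",
    "S": "teeth", "Z": "teeth", "T": "teeth", "D": "teeth", "N": "teeth",
}


def phonemes_to_visemes(timed_phonemes, default="rest"):
    """Convert (phoneme, start, end) triples into a viseme track.

    Consecutive phonemes that share a viseme and are contiguous in time are
    merged, so the mouth shape is held instead of being re-triggered.
    """
    track = []
    for phoneme, start, end in timed_phonemes:
        viseme = VISEME_TABLE.get(phoneme.upper(), default)
        if track and track[-1][0] == viseme and track[-1][2] == start:
            # Extend the previous segment rather than adding a duplicate.
            track[-1] = (viseme, track[-1][1], end)
        else:
            track.append((viseme, start, end))
    return track


# Hypothetical timings (in seconds) for the syllable "mast".
track = phonemes_to_visemes(
    [("M", 0.0, 0.1), ("AA", 0.1, 0.3), ("S", 0.3, 0.4), ("T", 0.4, 0.5)]
)
# track == [("lips_closed", 0.0, 0.1), ("open", 0.1, 0.3), ("teeth", 0.3, 0.5)]
```

A production system would additionally blend neighbouring visemes to account for coarticulation, which a plain lookup of this kind ignores.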
Apart from the aforementioned research works, many off-the-shelf solutions promise quality lip-synching for VHs as well. CrazyTalk [70], for example, is a facial animation and lip-synching tool that mainly targets the creation of cartoonish effects; it uses voice and text to vividly animate facial images by defining the facial wireframes in them. Papagayo [71] is another lip-synching program for matching visemes with the actual recorded sound of actors speaking. The lip-synching process is not automated: developers have to provide the text being spoken and drag the words on top of the sound's waveform until they line up with the proper sounds. Salsa LipSync Suite [72] provides automated, high-quality, language-agnostic lip-sync approximation for 2D and 3D characters, offering real-time processing of the input audio files to reduce or eliminate timing lag. It is also capable of controlling eye, eyelid, and head movement and performs random emote expressions, essentially providing realistic face motion for the target 3D characters. In the context of storyteller VHs for CH applications, we propose lip-synching over facial capturing, as it requires no synchronization between audio and face motion, and it is easier for curators to revise narrated texts, either via sound editing or via re-recording the target speech, with no need for face re-capturing.

AR and VR Presentation
The presentation of virtual content within visual representations of a scene captured through a camera has often been achieved using physical markers in designated locations and arrangements. After the proliferation of unique keypoint features in Computer Vision, markers have been substituted by (almost) unique keypoint features occurring at multiple scales in the environment, mainly in the form of visual textures. In the latter case, marker placement is substituted by a prior reconstruction of the environment to find and spatially index keypoints and their arrangement in the 3D environment. When the system is later presented with an image of the environment, it recognizes these points and estimates the pose of the camera. Once the camera pose is estimated, a virtual camera is (hypothetically) placed at the same point and content is ray-traced, visually predicting its (hypothetical) appearance on the surface that is imaged by the camera.
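The final step, placing a virtual camera at the estimated pose, can be illustrated with a simple pinhole projection. The sketch below is an illustration under stated assumptions and not the rendering pipeline of any particular AR framework: the pose is assumed to be given as a world-to-camera rotation matrix and translation vector, and `fx`, `fy`, `cx`, `cy` are assumed pinhole intrinsics in pixels. It only computes where a single anchor point of the virtual content (for example, the VH's feet) appears in the image, rather than rendering the content itself.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates with a pinhole model.

    rotation: 3x3 world-to-camera rotation matrix (list of rows), assumed to
        come from the pose estimation step.
    translation: world-to-camera translation vector.
    fx, fy, cx, cy: focal lengths and principal point in pixels (assumed known).
    Returns (u, v), or None if the point lies behind the camera.
    """
    # Transform the point from world coordinates into the camera frame.
    x, y, z = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:
        return None  # not visible: behind the image plane
    # Perspective division followed by the intrinsic mapping to pixels.
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical example: camera at the world origin looking along +Z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.5, -0.25, 2.0], identity, [0.0, 0.0, 0.0],
                      fx=800.0, fy=800.0, cx=320.0, cy=240.0)
```

Errors in the estimated pose translate directly into pixel offsets of the projected content, which is one way the physical inconsistencies discussed in the Introduction can arise.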
With the advancement of holographic technology, AR headsets have been evolving to include interactive features like gesture and voice recognition, as well as improvements in resolution and field of view. In addition, untethered AR headsets have paved the way for mobile experiences without the need for external processing power from a PC. Such embedded systems are well suited to presenting Virtual Museums (VMs) [73], owing to their lack of cables and enhanced interactive capabilities. VMs are institutional centers in the service of society, open to the public for acquiring and exhibiting the tangible and intangible heritage of humanity for education, study, and enjoyment. In addition, True AR has recently been defined as a modification of the user's perception of their surroundings that, due to its realism, cannot be detected by the user [74]. VHs and objects should blend with their surroundings, supporting the "suspension of disbelief".
In recent years, many approaches to holographic CH applications have emerged, each one focusing on a different aspect of representing holographic exhibits within the real environment. A published survey [75] investigated the impact of VR and AR on the overall visitor experience in museums, highlighting the social presence of AR environments. Papaefthymiou et al. [76] presented a comparison of the latest methods for rapid reconstruction of real humans using RGB and RGB-D images as input. They also introduce a complete pipeline to produce highly realistic reconstructions of VHs and digital assets suitable for VR and AR applications. The InvisibleMuseum project contributed an authoring platform for collaborative authoring of Virtual Museums with VR support [77]. Another project [78] integrates ARCore and ARCore to implement a portal-based AR virtual museum along with gamified tour guidance and exploration of the museum's interior. Storytelling, presence, and gamification are three very important fields that should be considered when creating a Mixed Reality (MR) application for CH. Papagiannakis et al. [79] presented a comparison of existing MR methods for virtual museums and pointed out the importance of these three fields for applications that contribute to the preservation of CH. In [80], fundamental elements for MR applications are presented alongside examples. Another recent example [81] presented two Mixed Reality Serious Games in VR and AR, comparing the two technologies in terms of their capabilities and design principles. Both applications showcased antiquities through interactive mini-games and a virtual/holographic tour of the archaeological site using Meta AR glasses. Abate et al. [82] presented an AR application for visualizing restored ancient artifacts, based on an algorithm that addresses geometric constraints of fragments to rebuild the object from the available parts.

Proposed Methodology
The primary hypothesis of this research work is that, to achieve realistic storytelling animation, VHs must look, move, and sound natural. In this vein, and considering the above analysis, we have decided to use ultra-high-resolution VHs to make them look realistic, mo-cap technologies to ensure that animations look natural, and software that can use voice recordings, lip-synching, and facial expressions. The overall workflow proposed is shown in Figure 3. First, high-resolution VHs have to be created. Then, a motion-capturing suit should be used to record the VH's movement for each of the stories we want them to narrate. Although any mo-cap system could be adequate for this task, we propose the mo-cap suit as a cost-efficient yet effective alternative to any other solution (a cost of 2000 euros per suit, compared with ~200K for a Vicon room). The suit provides accurate recording and can cope with visual occlusions, which are the main drawback of optical systems. Furthermore, we propose that the corresponding voice clips of each narration should be recorded in parallel with the motion capturing, as this will allow for perfect synchronization between the recorded animation and the audio. Then, to retarget the captured animation to the VH, a game engine is used (in our case Unity). Next, human voice recordings are proposed to be used for the narrations instead of automatically generated ones via TTS. Finally, we propose that the face and lips of the VH should be automatically controlled via software and that face-capturing solutions should be avoided in the context discussed. We justify these propositions in Sections 3.1-3.6.

Implementation of VHs
The VHs' bodies and clothes were created to obtain one unified and optimized model, enhancing the visual impact of the characters with texture mapping and material editing. The 3D generation of the virtual bodies also has to take into consideration the total number of polygons used to create the meshes, to keep a balance between the restrictions of real-time 3D simulation and the skin deformation accuracy of the models.
VHs acting as conversational agents are meant to be part of a storytelling scenario. The requirement to use a blend-shape system for the facial animation meant working with software that, on the one hand, supports external BVH files for the animation of the body and, on the other hand, provides tools for controlling the facial animation. The Reallusion pipeline fits all these requirements, since it is a 2D and 3D character creation and animation software suite with tools for digital human creation and animation pipelines that provides a motion controller for the face, body, and hands (see Figures 4 and 5).
Once the character is created in CC3, it can be directly exported to iClone [83], the animation module, where the external motion file can be tested after its conversion inside 3Dxchange [84].

Animation Recording
After carefully analyzing the pros and cons of each of these systems, always considering their cost, we chose the Rokoko suit due to its high accuracy, portability, ease of setup and use, and its overall price-to-quality ratio. The SmartSuite uses 19 Inertial Measurement Unit (IMU) sensors to track the full body, except for the fingers, which are tracked by the SmartGloves. The sensors have a wireless range of up to 100 m and require no external parts. The recordings acquired using the suit can be trimmed within Rokoko Studio and exported in various mainstream formats (FBX, BVH, CSV, C3D) using various skeletons (HumanIK, Mixamo, Biped, and Newton).
Using the Rokoko equipment and software, we put on the suit and the gloves and recorded unique animations for each narration. To make the narration movements more realistic, we also narrated the stories during the recordings and used a voice recording program to capture our voice. In this way, the synchronization of voice and movement in the narration was a lot easier, and it guaranteed a more natural narration. Moreover, to further enhance the realism of the animation, we used male and female "actors" for this process, according to the scenario of the narration and the gender of the VH that the scenario expects as narrator (see Figure 6).
Once the narration animations were recorded, we segmented them [85] and exported them in .fbx format, using the Newton skeleton. This action essentially creates a series of bones, body joints, and muscles, and defines their rotations in 3D space over time.
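To make the idea of "rotations over time" and of segmenting a recording more concrete, the following is a minimal Python sketch of a motion clip stored as per-frame joint rotations, together with a helper that cuts a named segment out of a longer take. It is an illustration only and not the format used by Rokoko Studio or FBX; the class `MotionClip`, the joint names, the frame rate, and the segment boundaries are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Euler rotation in degrees (x, y, z); real exporters may use quaternions instead.
Rotation = Tuple[float, float, float]


@dataclass
class MotionClip:
    """A recorded animation: per-frame rotations for each named joint."""
    name: str
    fps: float
    # frames[i] maps joint name -> rotation at frame i
    frames: List[Dict[str, Rotation]] = field(default_factory=list)

    @property
    def duration(self) -> float:
        """Clip length in seconds."""
        return len(self.frames) / self.fps

    def segment(self, name: str, start_s: float, end_s: float) -> "MotionClip":
        """Return a new clip containing only the frames in [start_s, end_s)."""
        if not 0 <= start_s < end_s <= self.duration:
            raise ValueError("segment bounds must lie inside the clip")
        first = round(start_s * self.fps)
        last = round(end_s * self.fps)
        return MotionClip(name, self.fps, self.frames[first:last])


# Hypothetical 4-second take at 30 fps with two joints.
recording = MotionClip(
    "full_take",
    fps=30.0,
    frames=[{"Hips": (0.0, float(i), 0.0), "Head": (0.0, 0.0, 0.0)} for i in range(120)],
)
intro = recording.segment("Self_introduction", 0.0, 1.5)
narration = recording.segment("Narration", 1.5, 4.0)
```

Keeping each narration as its own clip in this way is what later allows one animation state per narration part.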

Retargeting Recorded Animations to Virtual Humans
The next step is to import the narration animations and the VHs into the Unity game engine. After that, we can add the VH to our scene and define an animator component to control their animations. The controller defines which animations the VH can perform, as well as when to perform them. Essentially, the controller is a diagram that defines the animation states and the transitions among them. In Figure 7, an animation controller is shown where the VH initially performs an idle animation and can transition (arrows) to a state where the character introduces herself ("Self_introduction") or to a state where she narrates a specific process ("Narration about Sifting Process").
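The controller described above is, in essence, a finite state machine. Unity scripts are written in C#, so the Python sketch below is only an engine-agnostic illustration of the idea, not Unity's Animator API. The state names follow the ones mentioned in the text, whereas the class `AnimationController`, the trigger names, and the return transitions to the idle state are assumptions made for this example.

```python
from typing import Dict, Tuple


class AnimationController:
    """A tiny state machine: states are animations, triggers fire transitions."""

    def __init__(self, initial_state: str) -> None:
        self.state = initial_state
        # (current state, trigger name) -> next state
        self._transitions: Dict[Tuple[str, str], str] = {}

    def add_transition(self, source: str, trigger: str, target: str) -> None:
        self._transitions[(source, trigger)] = target

    def fire(self, trigger: str) -> str:
        """Follow the transition for `trigger` if one exists; otherwise stay put."""
        self.state = self._transitions.get((self.state, trigger), self.state)
        return self.state


def build_narrator_controller() -> AnimationController:
    """Build a controller with the three states named in the text."""
    controller = AnimationController("Idle")
    controller.add_transition("Idle", "introduce", "Self_introduction")
    controller.add_transition("Idle", "narrate_sifting", "Narration about Sifting Process")
    # Assumption: each clip returns to Idle once it has finished playing.
    controller.add_transition("Self_introduction", "clip_finished", "Idle")
    controller.add_transition("Narration about Sifting Process", "clip_finished", "Idle")
    return controller


controller = build_narrator_controller()
controller.fire("introduce")      # Idle -> Self_introduction
controller.fire("clip_finished")  # Self_introduction -> Idle
```

A trigger that has no transition from the current state is simply ignored, so a narration cannot interrupt the self-introduction in this sketch.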

Recording Natural Human Voice in Virtual Narrators
Recent studies on the usage of state-of-the-art TTS synthesizers instead of human voice recordings [86] have indicated that a human voice is still preferable, both in terms of exhibited listener facial expressions indicating emotions during storytelling and in terms of non-verbal gestures involving head and arms. Indeed, the main purpose of storyteller VHs is to complement a user's visit to a CH site with audio stories; the VHs remain on the side of the user's view window and help users to understand the exhibit in front of them or to discover more information around it. Thus, the narrator's voice should sound clear and natural, while slight fluctuations in the narrator's tone, speed, and volume can arouse users' interest and draw their attention to what is important. This unfortunately comes at the cost of having to process or re-record the audio clip every time the narration script changes. In this vein, we propose to record natural human voice for every narration part that the VHs have to reproduce, in separate audio clips, so that re-recording (if needed) is easier. In this work, we recorded human voice, and then all audio clips were trimmed and lightly processed to remove excess parts and ensure that the audio level is equal among all the clips.
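The trimming and level-matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing actually applied in this work: clips are assumed to be lists of floating-point samples in the range [-1, 1], silence is detected with a simple magnitude threshold, and "equal audio level" is interpreted as equal RMS level. The function names, the threshold, and the target level are hypothetical.

```python
import math
from typing import List


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]: loud[-1] + 1]


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_level(samples: List[float], target_rms: float) -> List[float]:
    """Scale a clip so that its RMS level equals `target_rms`."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]


# Two hypothetical narration clips recorded at different loudness.
clips = {
    "intro": [0.0, 0.0, 0.2, -0.2, 0.2, 0.0],
    "sifting": [0.0, 0.6, -0.6, 0.6, -0.6, 0.0, 0.0],
}
processed = {
    name: match_level(trim_silence(samples), target_rms=0.1)
    for name, samples in clips.items()
}
```

The sketch does not guard against clipping when a quiet clip is amplified; a production tool would also limit peaks.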

Lip Synchronization and Face Morphing
When it comes to the VHs' facial expressions, we propose to avoid using face-capturing techniques for building storyteller VHs for CH applications. Firstly, in practice, storyteller VHs do not make very vivid facial expressions (such as surprise, fear, overexcitement, etc.), thus mild facial morphs automatically controlled by software will cover our needs. Secondly, building virtual storytellers in the context of CH applications implies that curators will be in charge of providing the scripts that the VHs should narrate, and they should be able to slightly alter those scripts without needing to re-capture facial animations from scratch; they should only need to process or re-record the respective audio clips. In the case of face capturing, the slightest change in the scripts would make the whole face look out of sync, as the VHs would move their lips in a different way than one would expect during narrations. The same would happen for the different languages supported: each language would require a new face capture to look and feel natural. Auto facial morphing and lip-synching provide the advantage of automatically controlling the VH's face based on the provided audio. In terms of quality, automatic face morphing and lip-synching results are inferior to facial capturing; however, as the VH's voice and body language remain the main focus of the users, this is not a problem.
In the light of the above, in this work, we used software for controlling both facial morphs and lip-synching, as these two should comply with each other. We have used the Crazy Minnow Studio's Salsa lip-sync suite [87], as it creates face morphs and lip synchronization from any given audio input, produces realistic results, and is fully compatible with the Unity game engine and the software used for the creation of the VHs. Such compatibility is important because the lip-synch algorithms mix the existing blend-shapes of the VHs, and such blend-shapes differ depending on the software used to create the VHs. Following the aforementioned steps, we have applied Salsa lip-sync to the VHs provided by the Miralab and we assigned to each VH the corresponding recorded animations. The narration stories refer to the mastic chewing gum creation process (Mastic Pilot), the carafe creation process (Glass Pilot), and the ecclesiastical garments creation process (Silk Pilot).
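To illustrate what driving blend-shapes from audio means in principle, the sketch below maps the loudness of short audio windows to a single "mouth open" blend-shape weight between 0 and 100, with simple smoothing to avoid flicker. This is a deliberately simplified stand-in and should not be read as the algorithm used by the Salsa suite, which is not described here; the function names, the window size, the `full_open_level` threshold, and the smoothing factor are all assumptions.

```python
import math
from typing import List


def window_rms(samples: List[float], window: int) -> List[float]:
    """RMS level of consecutive, non-overlapping windows of `window` samples."""
    levels = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        levels.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return levels


def mouth_open_weights(
    samples: List[float],
    window: int,
    full_open_level: float = 0.5,
    smoothing: float = 0.5,
) -> List[float]:
    """Map audio loudness to a 0-100 blend-shape weight, one value per window.

    `smoothing` in [0, 1) blends each value with the previous one so that the
    mouth does not flicker; 0 disables smoothing.
    """
    weights = []
    previous = 0.0
    for level in window_rms(samples, window):
        target = min(level / full_open_level, 1.0) * 100.0
        previous = smoothing * previous + (1.0 - smoothing) * target
        weights.append(previous)
    return weights


# Hypothetical audio: silence, one loud syllable, then silence again.
audio = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
weights = mouth_open_weights(audio, window=4)
```

Because the weights are computed from the audio alone, replacing or re-recording a clip changes the facial animation automatically, which is the property the text relies on.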

Putting Them All Together
The Unity game engine was used to compile VHs, animations, lip-synching, and voice recordings together. The animations and the VHs were imported using the Filmbox (.fbx) format [88], and a humanoid rig was applied to them. Then, animation controllers were built. Each character was bound to one animation controller, which defines which animations he/she will be able to perform and controls the transitions among them using transition parameters. Such parameters can then be triggered via code each time we need a specific animation to be played. Colliders were also used to define proximity areas around the VHs so that specific parameters would be triggered when a character crosses into them. For example, when a player's VH approaches a virtual narrator, the latter greets the player and introduces itself. Such a collider is shown with green lines in Figure 8 (left). On the right part of the image, the collider's properties are shown. Notice that the collider is set to be a trigger because we do not want to stop other objects from entering the collider area, but we need the VH's introduction animation to be triggered when a character crosses the collider's borders.
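The proximity behavior described above can be sketched independently of the engine. The Python example below models a spherical trigger area that never blocks movement and fires a callback only on the frame in which a visitor crosses its border. It is an illustration of the concept rather than Unity code; the spherical shape, the class `SphereTrigger`, the radius, and the parameter name `introduce` are assumptions.

```python
import math
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]


class SphereTrigger:
    """A spherical proximity area that fires a callback when something enters.

    Like a trigger collider, it never blocks movement; it only reports entry.
    """

    def __init__(self, center: Vec3, radius: float, on_enter: Callable[[], None]) -> None:
        self.center = center
        self.radius = radius
        self.on_enter = on_enter
        self._inside = False

    def update(self, position: Vec3) -> None:
        """Call once per frame with the visitor's current position."""
        inside = math.dist(self.center, position) <= self.radius
        if inside and not self._inside:
            self.on_enter()  # fire only on the frame the border is crossed
        self._inside = inside


fired = []
trigger = SphereTrigger(
    center=(0.0, 0.0, 0.0),
    radius=2.0,
    on_enter=lambda: fired.append("introduce"),  # hypothetical parameter name
)
# The visitor walks in, stays inside, leaves, and returns.
for x in (5.0, 3.0, 1.5, 1.0, 4.0, 1.0):
    trigger.update((x, 0.0, 0.0))
```

In this run the callback fires twice, once per entry, and not on every frame spent inside the area, so the greeting is not restarted while the visitor stands next to the narrator.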

Use Case
The Chios Gum Mastic Growers Association is an agricultural cooperative established in 1937 in Chora on the island of Chios, after a long period of crisis in the mastic market. Its establishment helped mastic growers obtain better working conditions with regard to their compensation. The use case study in the context of this research work regards the traditional mastic processing that takes place in the mastic growers association, on the island of Chios, Greece. VHs have been created once (following the aforementioned propositions) and have been deployed in two CH applications. In all applications, the VHs represent actual workers of the association, and they narrate stories from their personal life, their work life, and their duties at the factory. Those narrations are based on real testimonies, as explained below. Through them, the museum visitors can virtually travel back in time to that era and learn how people lived, as well as about the mastic processing stages, the machines' functionality, and more. In this vein, the AR application regards the augmentation of the physical museum exhibits with narrator VHs. The second application enables a virtual visit to the old factory of the traditional Chios mastic chicle. The idea is that users can explore the 3D models of the museum's machines in a 3D virtual environment, while the VH standing in front of each one provides extra information about the machine, the respective step of the process line, and their personal lives. The production line, the VHs, and the scripts that they narrate have been defined after careful examination of the requirements that the museum curators provided and after analyzing the oral testimonies of former and present-day workers of the Association (research of the late 2000s) kept in the PIOP archive.

Production Line at the Warehouse and the Factory
The production line at the Chios Gum Mastic Growers Association is illustrated in Figure 9. The top part of the figure regards the process of collecting, cleaning, and sorting mastic tears based on their quality, resulting in the following byproducts: (a) mastic oil, (b) mastic teardrops, and small tears used for gum production. In the case of chewing gum, the production takes place in the factory. There, workers make the chicle mixture by adding natural mastic, sugar (optional), butter, and cornflour in a blending machine (h). When the chicle dough is ready (i), the workers transfer it to a marble counter on top of which they have sprinkled icing sugar and knead the chicle dough to form 'pies'. The 'pies' are then placed on wooden shelves to cool before being transferred to an automated machine called the 'cutting machine' (j). The cutting machine shapes the pies into sheets and cuts them into gum dragées (k). The dragées are left to cool again on wooden shelves. When they are ready, workers break the sheets to separate the formed dragées (l). If the dragées are not well shaped, they are sent back for heating in the blending machine. If they are well shaped, they are loaded into the candy machine to be coated with sugar (m). After the coating has finished, the dragées are left to cool down once again and are then polished using a revolving cylinder (n). When the chicle dragées are ready, they are packaged and packed in boxes (o) [89].

Stories of the Workers for the Virtual Warehouse and Factory of the Chios Gum Mastic Growers Association
The virtual factory of the Chios Gum Mastic Growers Association is based on the exhibition space of the Chios Mastic Museum on the island of Chios in Greece. The museum's space is designed to showcase the machinery that the Association first used during the 1950s and 1960s. The machines for the exhibition were donated by the Association to the Piraeus Bank Group Cultural Foundation (PIOP) so that a section on the industrial processing of mastic could be included in the Chios Mastic Museum. The museum space aims to introduce visitors to the industrial aspect of mastic processing during the 1960s through its exhibition, and to offer a unique experience of interacting with some of the machines.
In the context of the Mingei project, after completing a series of co-creation meetings, it was decided to enhance the museum-visiting experience through storytellers. Each of these VHs represents a worker of the association who is in charge of the process step carried out by the respective machine. The goal of using VHs is two-fold. Firstly, they can provide information about their personal lives and their work lives, virtually transporting the museum visitors to another era and allowing them to get a deeper understanding of those times. Secondly, it has been observed that the visitors do not follow or understand the order of the steps for processing mastic to produce mastic chewing gum. VHs can fill this gap by providing information about each exhibit around them and by explaining the whole process to the visitors. To create the stories that those VH workers would narrate, it was necessary to go through the PIOP archive and, more specifically, the archive part containing oral testimonies of former and present-day workers of the Association (research of the late 2000s). From these testimonies, we have been able to extract information regarding the gender of the workers in each process, their age, their family background, as well as other information concerning the Chian society, which we then used to create the stories that each VH would narrate. One of the most significant observations was that the majority of workers at the warehouse and factory of the Association were women. The twenty-three interviewees ranged in age from forty to eighty years old. As not all of the interviewees worked at the Association at the same time, it is interesting to spot the differences and similarities in their testimonies to unveil the developments made at those times regarding the working environment and/or their personal lives.

Creation of Stories
The profiles and stories of the VHs are a mix-and-match of the material in the oral testimonies. Eight VHs have been created: seven of them represent people coming from different villages of southern Chios (also known as the mastic villages) and one represents a woman coming from Sidirounta (a northern village). The represented villages were chosen based on their repeated appearance in the archive, either individually or by wider locality (e.g., the village of Kini might have been referenced only once, so the profile of that participant will correspond better with those of participants coming from villages of the same area, for example from Kalamoti). The age of each VH was likewise defined as the middle age of the participants coming from villages of the same area.
In creating the content of the profiles and the stories of the VH workers, we sought to represent what life in the villages was like, how the worker grew up in the village (i.e., education, agricultural life, leisure time, adolescence, and married life), what led them to seek work at the Association in Chora of Chios, what their working life in the Association was like, and in which process(es) they worked. All this information is divided into sections according to (a) family background and early and adult years of life, (b) work-life in the Association, and (c) explanation of the processes. Through this process, eight personas were created. Each persona has a different name, age, work experience, and family life background, which were then imprinted into eight story scenarios. Then, eight VHs were created based on these personas.

Creation of Virtual Humans
As explained previously, based on the requirements analysis and the stories to be narrated, eight VHs were created: seven female and one male. This decision reflects the gender imbalance among the workers at the Chios Gum Mastic Growers' Association, where women were preferred over men in most steps of the production line.
Reallusion's Character Creator 3 (CC3) software was used for creating the VHs. Their outfits were designed to match the actual clothes that workers wore at the factory, mostly a white robe and a white cap, while their facial and body characteristics were designed to match those of an average Greek woman living on Chios island. This step was essential since, as explained previously, making the characters look like actual workers of that time can transport users back in time and prompt them to learn more about life and work conditions back then. Figure 11 shows an example of a VH in different poses, while Figure 12 shows all eight VHs built for the Chios Mastic Museum.

AR Augmentation of the Museum's Machines
In the context of the Mingei project, an AR application has been built to augment the exhibits of the Chios Mastic Museum with VHs, as described in Section 4.2.2. The VHs were created following the methodology proposed in this paper. Viewing the machines through the museum's tablets, visitors will be able to see VHs standing next to them, ready to share their stories and explain the functionality of the respective machines. The exact location of each VH will initially be defined by a museum curator, and visitors will choose the story they are interested in from the left part of the tablet's screen. Due to COVID-19, the app has not yet been installed in the museum (installation was initially planned for early 2021 and is now scheduled for October 2021). Figure 13 shows a screenshot of the application running in our lab.
Figure 13. A VH narrating a story.

A Tour inside a Virtual Mastic Factory
The second application created in this use case is a 3D model of an old mastic factory, where visitors can discover the machines found in the chicle and mastic oil production line and interact with the VHs standing in front of them. Each machine carries out a specific task in the mastic chicle/oil production line. The machines have been reconstructed from those exhibited in the Chios museum, which were thoroughly scanned using a handheld trinocular scanner. The resulting 3D models were then further processed using the Blender 3D creation and editing software [91]. The 3D reconstruction technique works well for objects that comprise flat surfaces that reflect light in a straight manner. However, when it comes to scanning and reconstructing curved or hollow objects, it is very difficult to achieve a decent result. Such objects were post-processed with 3D editing software. Figure 14 displays the reconstruction of the Candy Machine, which features both curved and hollow surfaces. The machine as it was photographed at the museum is displayed on the left-hand side of the image. The middle shows the automatic reconstruction of the machine, and the right shows the machine after post-processing, which is very close to the original.
Overall, seven machines were reconstructed, namely, (i) the sifting machine, (ii) the blending machine, (iii) the cutting machine, (iv) the candy machine, (v) the revolving cylinder, (vi) the printing machine, and (vii) the distillation machine. All of them are found in the chicle production line, except the distillation machine, which is used for producing mastic oil. The final 3D models of these machines are illustrated in Figure 15.
The 3D model of the factory building was initially created in Unity and then imported into Blender for further processing (e.g., windows and doors were cut and a High Dynamic Range (HDR) environment texture was used to provide ambient light in the scene). We have used a "stone tile" texture for the walls and a plain "cement" texture for the floor, as these materials were predominantly used in factory buildings in Greece at that time. The 3D model was then imported back into the Unity game engine and used as the virtual factory's building. We have used Unity 2020 with the High Definition Rendering Pipeline (HDRP) enabled to achieve a high-quality, realistic lighting result.
Inside the virtual factory building, we have placed the 3D machine models in the order in which they appear in the chicle production line. To guide users to visit the machines in the correct order, virtual carpets have been placed on the floor in a way that creates corridors for the visitors to follow, starting from the factory's front door, as shown in Figure 16.
Figure 16. The interior space of the virtual factory, with and without the VHs. Carpet corridors are placed on the floor to guide users to follow the mastic chicle production line. VHs are placed next to each machine, ready to share their stories.
We have used the same VHs as in the AR application. Each VH is placed next to their respective machine and is ready to explain its functionality, describe how the respective process step was performed at the factory before and after the machine was acquired, and narrate stories about their personal and work lives. When the user-controlled camera approaches a VH, the VH starts talking to introduce themselves. The available stories are then presented to the users in the form of buttons, which visitors can press to listen to the respective narrations (Figure 17).
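The approach-then-offer-stories behaviour described above can be illustrated with a small, engine-agnostic sketch. It is not the project's Unity implementation: the `VirtualHuman` class, the `update` function, the 2 m trigger radius, and the story titles in the usage note are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VirtualHuman:
    """A storyteller placed next to a machine (all fields are illustrative)."""
    name: str
    position: Tuple[float, float]            # (x, z) on the factory floor
    stories: List[str] = field(default_factory=list)
    introduced: bool = False                 # has the greeting been triggered?


def update(vh: VirtualHuman, camera_xz: Tuple[float, float],
           trigger_radius: float = 2.0) -> List[str]:
    """Return the story buttons to display for the current frame.

    The first time the camera comes within trigger_radius (an assumed
    value), the VH is marked as introduced; a real application would start
    the greeting clip at that point. While the camera stays in range, the
    story titles are offered as buttons; out of range, none are shown.
    """
    distance = math.dist(vh.position, camera_xz)
    if distance > trigger_radius:
        return []
    if not vh.introduced:
        vh.introduced = True  # the greeting would be played here, once
    return list(vh.stories)
```

For example, a VH at `(0.0, 0.0)` with the hypothetical stories `["Family life", "The candy machine"]` yields no buttons while the camera is at `(5.0, 0.0)` and both buttons once the camera moves to `(1.0, 1.0)`.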

Guidelines for Lessons Learned by This Research Work
The process of defining the proposed methodology has provided insights into the technologies used, which are summarized in a collection of guidelines to be considered by those replicating this research work:
Guideline #1: Review the plethora of available mo-cap systems and select the one most capable of supporting your scenario. For example, if multiple users are to be tracked, optical mo-cap systems may be preferable. Consider the possibility of visual occlusions and their effect on your use case. Examine whether tools or handheld equipment need to be tracked. If so, carefully consider a solution for hand tracking or a solution for inferring tool state from the recorded animation [91,92].
Guideline #2: Consider scale when transferring motion from the tracked system to a VH. Calibrate both the mo-cap system and the VH dimensions accurately to reduce retargeting issues.
Guideline #3: Invest in a high-quality VH to enhance the visual appearance of textures and to support the manipulation of facial morphology.
Guideline #4: Carefully consider TTS versus speech recording based on the requirements of your scenario. The former is cost-efficient but less realistic; the latter is more realistic but requires re-recording the audio for each change in the script.
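To make the scale consideration in Guideline #2 concrete, the following is a minimal sketch of one common simplification: joint rotations are copied unchanged, while recorded root translations are multiplied by the ratio of the VH's height to the performer's height. The function name, the use of overall height as the only calibration measure, and the tuple-based data layout are assumptions for illustration, not the retargeting procedure of the tools used in this work.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def scale_root_motion(root_positions: List[Vec3],
                      performer_height_m: float,
                      vh_height_m: float) -> List[Vec3]:
    """Scale recorded root (hip) translations to the VH's proportions.

    Only translations depend on body size in this simplified model, so each
    recorded position is multiplied by the VH-to-performer height ratio.
    """
    if performer_height_m <= 0 or vh_height_m <= 0:
        raise ValueError("heights must be positive")
    ratio = vh_height_m / performer_height_m
    return [(x * ratio, y * ratio, z * ratio) for x, y, z in root_positions]
```

With inaccurate height measurements the ratio is wrong for every frame, which typically shows up as foot sliding or floating. This is one reason careful calibration of both the suit and the VH dimensions reduces later retargeting work.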

Formative Evaluation
Both applications were planned to be installed in the Chios Mastic Museum in early 2021. However, installation was delayed due to the COVID-19 pandemic and is currently planned for October 2021. Nevertheless, the applications have been completed and we expect the results to be very promising. As a first step towards evaluating the outcome of this research work, an expert-based evaluation was conducted. In this approach, the inspection is conducted by a user experience and usability expert. Such an evaluation is based on prior experience and knowledge of common human factors and ergonomics guidelines, principles, and standards. In this type of inspection, the evaluator looks at the application through the "eyes" of the user, performing common tasks in the application or system while noting any areas of the design that may cause problems for the user. For the two applications, the following issues were reported by the usability expert.
• AR application
    • Make sure that the VH is looking at the user browsing the application.
    • Include subtitles at the bottom of the screen, since in noisy environments the information may not be audible.
    • In some cases, there are some deficiencies in the animations, such as non-physical hand poses and arms overlapping with parts of the body.
• 3D application
    • Improve camera handling by locking the z-axis, which is practically not required by the session.
    • Make sure that the VH is looking at the location of the "user" accessing the application (center of the screen).
The above usability issues were all considered critical in terms of user experience and were corrected before installation at the pilot site.

Conclusions and Future Work
To sum up, this research work contributed a cost-effective methodology for the creation of realistic VHs for CH applications. The VH models have been created using the Character Creator 3 (CC3) software. The animations for the VHs have been recorded using the Rokoko suit and gloves, which utilize inertial sensors. The decision to use this specific suit was made after thoroughly analyzing the mo-cap systems on the market in terms of performance, usability, comfort, cost, ease of use, etc. Each animation recorded using the Rokoko suit and gloves features natural, expressive body moves and gestures corresponding to the story that the VH will narrate, while an audio-recording application was used to simultaneously record a natural voice narrating the desired stories. The VH models, the recorded animations, and the recorded audio clips have then been imported into the Unity game engine. To make the narrator VHs narrate the desired stories, we make use of the SALSA LipSync software, which can produce realistic lip-sync animations and facial expressions and is compatible with both the Unity game engine and the CC3 software used to create the VHs. An automated module is applied to the VHs to prepare them for lip-syncing, and for each narration part, we load the corresponding audio clip at the start of each animation. In this way we achieve on-demand realistic VHs capable of narrating the desired stories, thus offering an enhanced user experience to the Mingei platform users.
The final solution has been tested in two settings: the first is a true AR application that presents VHs as storytellers next to the actual factory machines, and the second is a virtual environment of a mastic factory that runs as a Windows application and can also be experienced in VR.
The validation of the proposed workflow and solutions so far supports our initial hypothesis and has resulted in the production of realistic avatars. The selected technologies have proven suitable for the needs of the project, and minor adjustments to the methodology can further enhance the output. For example, careful calibration of the tracking suit and software with respect to the anthropometric characteristics of the recorded user and those of the created VH can save labor on retargeting and animation tuning. Furthermore, the combination of motion and voice recordings has proven beneficial both for the pursued realism and for development time. It is nevertheless expected that further experimentation will be required when replicating the proposed solution, in order to determine the fine-tuning needed for each hardware and software setup.
Regarding future research directions, the above-mentioned knowledge is expected to be enriched through valuable feedback received from the installation and evaluation with end-users of the two variations in the context of the mastic pilot of the Mingei project. The evaluation will target information quality, educational value, realism, and perceived quality of experience. More specifically, end-users will access the implemented applications and will be asked to fill in a user experience evaluation questionnaire, participate in targeted interviews, and be observed while interacting with the system (observation sessions). It is expected that the analysis of the data acquired by these methods will provide further input for the improvement of the implemented prototypes.