Dialogue Enhanced Extended Reality: Interactive System for the Operator 4.0

Abstract: The nature of industrial manufacturing processes and the continuous need to adapt production systems to new demands require tools to support workers during transitions to new processes. At the early stage of transitions, the human error rate is often high and the impact on quality and production loss can be significant. Over the past years, eXtended Reality (XR) technologies (such as virtual, augmented, immersive, and mixed reality) have become a popular approach to enhance operators' capabilities in the Industry 4.0 paradigm. The purpose of this research is to explore the usability of dialogue-based XR enhancement to ease the cognitive burden associated with manufacturing tasks, through the augmentation of linked multi-modal information available to support operators. The proposed Interactive XR architecture, using the Spoken Dialogue Systems' modular and user-centred architecture as a basis, was tested in two use case scenarios: the maintenance of a robotic gripper and a shop-floor assistant for electric panel assembly. In both cases, we confirmed a high user acceptance rate with efficient knowledge communication and distribution, even for operators without prior experience or with cognitive impairments, therefore demonstrating the suitability of the solution for assisting human workers in industrial manufacturing processes. The results endorse an initial validation of the Interactive XR architecture to achieve a multi-device and user-friendly experience for solving industrial processes, which is flexible enough to encompass multiple tasks.


Introduction
Human Augmentation (HA) can be described as a functional extension of the physical body mediated through technology to increase human productivity or enhance body capabilities [1]. In industry, the enhancement of workstations with augmentation technology can reshape the human presence in the process value chain and support the development of self-awareness and new skills, primarily where manual labour is inevitable [2,3].
The major technological factors impacting the industry sector are cyber-physical systems, big data analytics, collaborative and fully connected robots, and interaction devices that exploit the integration of wireless network capabilities with Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR) [4]. Industry 4.0 is an umbrella term that encompasses a wide range of concepts belonging to

Background
Industrial control and execution processes are carried out by technical operators, shift leaders, field workers, and engineers. These processes, to be carried out properly, require complex management of the domain knowledge, where the creation, distribution, access, and communication of this knowledge is critical [6]. In addition, the process industry requires specific expert knowledge, thus making the training and education of workers a challenging task, usually ad hoc to each product and factory. As a result, domain knowledge is not distributed at all, and it heavily relies on a few expert operators [7]. As immersive technologies are becoming gradually more robust and affordable, new case studies and applications are being explored to enhance workers with new skills, decreasing the skills gap between untrained and expert operators by providing just-in-operation computer-based task assistance. Along the way, Industrial Augmented Reality (IAR) emerged as a research line focused on how these technologies can cognitively enhance workers in industrial processes. IAR differs from traditional systems in terms of quality and reliability, which must comply with those of manufacturing industries.
The following subsection describes the different Industrial AR applications and how they were used to improve and enhance human capabilities, reduce cognitive barriers and improve workplace efficiency in several scenarios. After this, Spoken Dialogue Systems and their modular architecture are presented, and how they have been used as natural, user-centred process solving systems. This background motivates the conceptual system presented in Section 3, in which different AR mechanisms are encompassed in a modular, natural, and user-centred architecture, similar to the one that SDSs employ.

Industrial Augmented Reality Systems
The first AR prototypes explored the human vision senses and tackled the combination of image registration and content visualisation. The first seminal work can be traced back to Thomas Caudell and David Mizell [8] in 1992. Their prototype allowed for a computer-produced diagram to be superimposed and stabilised on a specific position on a real-world object to help the worker with cable wiring. The contributions of Kollatsch et al. [9] led to an industrial prototype for the visualisation of information from control systems (e.g., Programmable Logic Controller and Computer numerical control machines) directly in situ, enabling real-time digital content to overlap real-world objects. To prevent the drawbacks of AR technology in terms of visual inputs, and mainly their limited field-of-view, projection-based approaches have been broadly presented as an alternative to AR and VR. These solutions are referred to as projection mapping or Spatial Augmented Reality (SAR). For example, Sand et al. [10] developed a prototype to project instructions into the physical workspace, which helped end users to find the pieces required to assemble products without prior knowledge. Rodriguez et al. [11] proposed a similar solution in which instructions were directly overlaid with the real world using projection mapping. Similarly, Petersen et al. [12] projected video overlays into the environment at the correct position and time using a piece-wise homographic transform. By displaying a colour overlay of the user's hands, feedback can be given without occluding task-relevant objects. More recently, Álvarez et al. [13] have improved the manufacturing process of a packaging company by integrating an SAR system in a real factory floor to provide assistance to operators during the setup of die cutters. 
Projecting virtual content directly onto physical objects contributed to reducing the mental workload of the operator (no need to interpret the real workspace on a screen) [14] and to alleviating other practical open challenges, such as the need for the worker to hold a tablet or wear HMD devices. However, there are a number of challenging factors of IAR development that need further research. Some challenges are transversal and related to the necessary interdisciplinary knowledge in areas such as computer graphics, artificial intelligence, object recognition, and human-computer interaction [15]. For example, intuitive user interfaces still remain a challenge, particularly in situations where understanding the user's actions and intentions is required for adaptation to unexpected conditions. The use of mid-air gestures enables users to interact with virtual objects using their hands and with varying levels of intuitiveness [16,17]. Techniques grounded on gaze [18,19] (based on head and/or eye movements), electromyography (by analysing muscular activity) [20], electroencephalography (brain electrical signals) [18,20], and hand tracking [21,22] were developed to empower users with new forms of interaction that do not require the hands to be concurrently used for interaction and for performing the task [23]. This type of interaction is particularly helpful in industrial assembly settings where dexterity of the hands is key to performing the task [3]. The popularity of voice-based interfaces is rising as well, although these solutions are often limited to a small set of fixed commands; few of them add a natural layer to allow a more comfortable experience. In Section 2.2, we provide a literature review of voice-driven interaction systems and their modelling processes.

Spoken Dialogue Systems
Spoken Dialogue Systems (SDSs) are voice-enabled Human-Machine Interfaces for natural communication with a computer, robot, and other devices [24,25]. SDSs rely on processing and exploiting domain knowledge to guide users and solve their needs [26][27][28]. One of their main advantages is that they exploit spoken language, rendering a convenient and frictionless system, which communicates in a natural way.
Additionally, these systems facilitate hands-free access to information, thus enabling online information consumption for processes that require manual work. Also, SDSs can be enhanced with error handling techniques that detect interaction breakdowns during the process due to misunderstanding, channel noise, etc. [29,30], improving the communication between the user and the system. As SDSs provide a suitable response for the knowledge management and expertise distribution challenge, it comes as no surprise that they have been used in multiple sectors. To name a few examples, these include: LARRI, a dialogue-based system for support in maintenance and repair activities for aircraft mechanics [28]; the International Space Station procedural assistant [31]; bus scheduling systems [32,33]; tourism booking systems [34]; resident helping with robotic assistants in nursing homes [35]; elderly assistance and coaching systems [36]; educational dialogue systems for tutoring [37][38][39]; and retail assistance [40]. Despite their popularity, past limitations in speech recognition engines and in microphones' noise-cancelling mechanisms prevented the application of SDSs in industrial environments.
To carry out a flexible and natural communication with the operator, the proposed framework emulates the modular and user-centred SDS architecture. Traditional SDSs are a pipeline of specialised technological modules (see Figure 1). The first module encapsulates a Speech To Text (STT) service to transform audio signals into textual transcriptions. Audio transcriptions are processed by a Spoken Language Understanding (SLU) module and encoded as semantic actions (i.e., the communication intent of the operator, which can be understood as a simplification of what the user meant to say), so the linguistic variability of the input is reduced (e.g., "Tell me what to do now", "What's the next step?", and "Now what?" can all be represented as the same semantic codification: [intent=ask, object=next-step]). Semantic actions are passed down to the Dialogue Manager (DM) module, which is in charge of defining the best strategy to give an appropriate answer to the operator. The interaction strategy relies on real-time flow states and existing domain knowledge. In the last process stage, the interaction feedback computed by the DM module is converted into a human-interpretable interactive action by the Response Generator (RG) and synthesised into audio speech by the Text to Speech (TTS) module.
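The SLU step described above can be sketched in a few lines using the paper's own example utterances. The function name and frame fields below are illustrative assumptions, not the system's actual implementation:

```python
# Minimal sketch of the SLU step: several surface forms of the same request
# collapse to a single semantic action (reducing linguistic variability).
# The utterance set and frame fields are illustrative assumptions.

def slu(transcription: str) -> dict:
    """Map a transcription to a normalised semantic action."""
    normalized = transcription.lower().strip(" ?!.")
    ask_next_step = {
        "tell me what to do now",
        "what's the next step",
        "now what",
    }
    if normalized in ask_next_step:
        return {"intent": "ask", "object": "next-step"}
    return {"intent": "unknown", "object": None}

# All three phrasings from the text map to the same semantic codification:
frames = [slu(u) for u in ("Tell me what to do now",
                           "What's the next step?",
                           "Now what?")]
assert all(f == {"intent": "ask", "object": "next-step"} for f in frames)
```

A production SLU module would of course generalise beyond an enumerated phrase list (e.g., via a trained intent classifier), but the contract is the same: free-form text in, a discrete semantic action out.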
The modules that constitute an SDS can be implemented in many different ways and they can integrate a vast range of technologies [41,42]. An in-detail description of these modules exceeds the scope of this paper. Nevertheless, it is noteworthy that modular stacking strategies facilitate the deployment of SDSs both in scenarios where enough labelled data are available and Machine Learning can be successfully applied, and in data-scarce scenarios, where mechanisms such as encoding expert rules or designing dialogue flows can be implemented.

Conceptual System
For HA technology to be effective, it has to complement existing cognitive processes rather than create competing or shared channels of information for them. It must be adapted to the underlying mechanisms and processes of human perception and biology, and avoid cognitive barriers triggered by decisional processes. Most cognitive barriers can be removed when proper information processing methods are applied. For example, the momentary pause required for the worker to process the need for and execute a specific command is a cognitive barrier, which is removed only when workers intuitively understand what to do to overcome it. That being said, invisible interfaces and unclear interactions still represent a potential abandonment or frustration point for workers who cannot figure out the interaction mechanics.
Spoken language is a natural form of communication for humans to interact and share information. A system that is capable of conveying and managing information in real time using natural language as an interface is expected to be an intuitive and frictionless solution against cognitive barriers that may arise due to lack of knowledge about some process to solve. Recent advances in speech recognition and language processing technologies enabled SDSs to gain popularity as voice-guided task solvers in multiple domains [28,40,43], but their combination with and role within XR scenarios is still unclear.

Conceptual Workflow
In this section, we describe an Interactive eXtended Reality (IXR) system that helps operators to carry out a certain task or process through the combination of XR technologies with the SDS process control logic (see Figure 2). The proposed work streamlines multiple input and output XR devices into the logical scheme of SDSs. As a result, we describe a framework that enhances both classical SDSs and XR devices as Human-Machine communication interfaces. A Device Control Layer (DCL) handles machine-dependent data transformations and communication protocols between devices. The IO API is a modular interface for the use of peripheral devices in industrial environments, for example, collaborative robots. Analogous APIs are available for, for example, aural and visual interaction devices. The DCL coordinates with the Interpretation Layer (IL) and the Response Generation Layer (RGL) to gather and send the exchanged data content and format from and to each involved device.
Data processed by the DCL is accessed by the IL and transformed into a normalised semantic encoding, similar to the SLU module in Figure 1. The objective is to discretise the highly-variable data collected by multiple input devices and extract meaning out of it. For example, if the operator asks for the robot's malfunctioning parts, the semantic interpretation of both the operator's audio transcription and the robot sensors' data are extracted in this module.
The Interaction Manager Layer (IML) processes the encoded information of every input device to decide the next system response and communicates it to the RGL. In order to decide the appropriate response to give to the operator, the IML takes into account the semantic intention of the operator (in a similar way to [44]), but also the status of the contextual knowledge stored in the Domain Knowledge Layer, such as data from the involved physical industrial devices, ontologies, operator profiles, and so on. The system response is represented as a set of semantic actions, which is then sent to the RGL. With each user turn, the IML updates the interaction state according to the processed input and the selected system responses.
Finally, the RGL transforms the set of semantic actions in the system response into data formats usable by the target output devices so that they can be communicated to the operator, for example, rendering commands, natural language audio or text, and device movement commands. These device-usable responses are then sent to the DCL to close the communication loop with the operator.
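A minimal sketch of this closed loop (DCL → IL → IML → RGL → DCL), reduced to text I/O, follows. All class, method, and step names here are our own illustration; the paper does not publish an API, and the logic is deliberately reduced to a single hard-coded intent:

```python
# Hedged sketch of the IXR turn loop, with one class per layer.

class InterpretationLayer:
    """Normalise raw device data into a semantic encoding (cf. SLU)."""
    def encode(self, raw: str) -> dict:
        if "next step" in raw.lower():
            return {"intent": "ask", "object": "next-step"}
        return {"intent": "unknown", "object": None}

class InteractionManagerLayer:
    """Decide the next system action from semantics plus interaction state."""
    def __init__(self, steps):
        self.steps = steps
        self.current = 0  # interaction state, kept turn-by-turn

    def decide(self, semantic: dict) -> dict:
        if semantic.get("object") == "next-step" and self.current < len(self.steps):
            action = {"act": "instruct", "step": self.steps[self.current]}
            self.current += 1  # update the state after each user turn
            return action
        return {"act": "clarify"}

class ResponseGenerationLayer:
    """Turn semantic actions into device-usable output (here, plain text)."""
    def render(self, action: dict) -> str:
        if action["act"] == "instruct":
            return f"Next: {action['step']}"
        return "Could you rephrase that?"

# One full turn: device input -> interpretation -> management -> generation.
il = InterpretationLayer()
iml = InteractionManagerLayer(["Open fingers", "Decouple gripper"])
rgl = ResponseGenerationLayer()
reply = rgl.render(iml.decide(il.encode("What's the next step?")))
```

In the actual architecture, the DCL would wrap each end of this loop, translating device-specific audio, projection, or robot-command formats to and from the plain data structures shown here.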

Use Cases
In this section, we describe two use case scenarios that were validated by 20 operators with different profiles who were not familiar with the project, 10 per use case. These operators received minimal instructions on how to use the system and about their given task.

Use Case 1: Universal Robot's Gripper Maintenance
We chose the maintenance of a Universal Robot's gripper as our first use case. Participants were assigned the task of performing a periodic inspection of the robot's gripper (the instruction manual can be consulted online at https://assets.robotiq.com/website-assets/support_documents/document/3-Finger_PDF_20190322.pdf, last accessed 29/05/2020; see Section 7.3, Periodic Inspection), which is necessary to ensure its good condition and safe operation. In the experimental setup, the robot arm was placed over a table, fixed in a pre-set position, with the gripper coupled. The gripper was located at about the chest height of an average person and facing up, with its fingers closed, as shown in Figure 3. The tools needed for the task were placed on the working table.
Before each session began, participants were informed that they would have to complete a form for the system's assessment afterwards. At the beginning of the session, participants were invited to try out and get familiarised with the hardware (in this use case, HoloLens glasses were used for outputting visualisations). While doing this, a researcher would present the participant with the robot, the system, and the task at hand that they had to complete. Once the explanation was done, the task would begin. The completion of the task requires participants to operate with both hands, making it an appropriate task for the integration and testing of the proposed XR architecture. The set of instructions described in the manual is equivalent to a series of processes that require a single action to be performed by the user in each step. The compiled set of steps works as a base dialogue flow where each of the steps corresponds to a dialogue state. We added some extra steps to the flow to accommodate the system's contextual needs: an introductory step to define the state prior to the beginning of the task, and a final step to mark the conclusion of the interaction. Figure 4 shows the sequence of steps required to complete this use case's task. In addition, as the system can exploit the domain knowledge to guide the operator, knowledge about multiple question types that could arise during the task was included: what to do in the current step, how to do it, where the involved parts or items are, and so on. During the maintenance task, the IML decides which AR animation to render (pointing out the location of bolts to unscrew or cables to connect or disconnect, showing the movement to perform to check the gripper's fingers, and so on). These animations are accompanied by aural responses and contextualised according to the current step in the maintenance process; for example, the bolts to unscrew are different in the decoupling step and the finger-opening step.
Furthermore, the operator can explicitly ask to repeat an AR animation as many times as needed.
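The step-contextualised question answering described above can be sketched as a lookup over per-step domain knowledge. The step names, tools, and answers below are illustrative assumptions drawn from the tasks mentioned in the text, not the system's actual ontology:

```python
# Per-step knowledge so the same question resolves differently by context.
# Contents are illustrative assumptions, not the real DKL ontology.

DOMAIN_KNOWLEDGE = {
    "Decouple gripper": {"tool": "hex key",
                         "what": "Unscrew the coupling bolts."},
    "Open fingers":     {"tool": "precision screwdriver",
                         "what": "Open the gripper fingers."},
}

def answer(question: str, current_step: str) -> str:
    """Resolve a question against the knowledge tied to the current step."""
    q = question.lower()
    knowledge = DOMAIN_KNOWLEDGE[current_step]
    if "size" in q or "tool" in q:
        return f"Use the {knowledge['tool']}."
    if "what" in q:
        return knowledge["what"]
    return "I can repeat the animation if you like."

# The same question is disambiguated by the current step:
assert answer("Which size?", "Open fingers") == "Use the precision screwdriver."
assert answer("Which size?", "Decouple gripper") == "Use the hex key."
```

The point of the sketch is the indexing by dialogue state: "what", "how", and "where" questions never need to name their referent explicitly, because the current step supplies it.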
Ten people (50% male, 40% female, 10% would rather not say their gender) participated in the testing of use case 1. A total of 50% were in the age range between 25 and 29, 30% between 35 and 39, 10% between 30 and 34, and 10% between 18 and 24. All the participants had European Spanish as one of their mother tongues. Participants were all volunteers working at the same centre where the prototype was developed and their data were treated anonymously (the operator appearing in Figure 3 gave consent to its use in this publication).

Use Case 2: Industrial Electrical Wiring
The proposed architecture was also validated for cabling tasks, more specifically the assembly of electric panels. This use case was proposed by industrial partners of the project, since it is a process that they carry out on a daily basis. Voice interaction was achieved through noise-cancelling headphones, which participants could choose to wear or leave on the working table. To handle the visual communication, an Optoma projector with a wide-angle lens was installed to handle the projection mapping. Additionally, it could also deliver audio output for users who were not wearing headsets. Figure 5 shows an operator in the described setting during the task. Additionally, a third-party optical reader [45] was added to the Device Control Layer to receive visual input. This reader detects the wires grabbed by the operator, so this information can be used for interaction contextualisation. In this scenario, different uses of the projections and voice-based responses were combined to develop a user-centred communication, adjusted to the operators' needs and impairments. For example, for participants with hearing impairments, aural responses were also projected in the form of captions. The completion of this task required the participants to work with both hands, one to manipulate wires and the other to screw the wire connection to the terminal. The best practices for the wiring sequence recommend operators to connect the longest wires first in order to reduce cluttering at early stages. Each wire is identified with a printed label and has its properties, attributes, and process logic stored in the DKL.
A set of ten participants from an external industrial factory was selectively assigned to the testing group of this use case. The inclusion criterion for this group was the worker's level of disability: only workers with a cognitive or physical disability were selected. Due to the sensitive nature of this particular group of testers, we are not providing further information on their age and gender.

System Implementation
In this section, the sequence of workflow steps involved in the systems implemented for our particular use cases is described in more detail:

1. The DCL is responsible for gathering the operators' input to communicate with the system. Of the proposed use cases, the first one uses only voice input, while the second one also combines additional devices such as optical readers. For the voice interaction, microphones with noise cancellation capture the audio signals. These signals are streamed using media streaming libraries such as FFmpeg into an energy-based automaton that discards any silent or noisy audio segment. The remaining audio segments are sent to the Input Communication API.

2. The IL has to encode the raw, unstructured data into formats that can be interpreted by machines and ontologies. In the presented use cases, the audio segments are transcribed to text by the Speech-to-Text module. The Language Understanding module encodes transcriptions into a semantic structure that can be easily interpreted by the IML. As is commonly done for Spoken Dialogue Systems, we use a scheme based on act-slot-value sets for representing the symbolic semantic meaning of the operator input. The semantic representation of the user action is dispatched to the IML for processing. For the optical reader input, the detected wire's label is transferred to the IML to contextualise the interaction.

3. The DKL works both as the persistence layer and as the semantic ontology for each system. The persistence layer stores the interaction context of each operator, kept turn-by-turn, as well as the specific domain knowledge encoded in an ontology. In both use cases, this ontology describes the know-how of the steps to perform in the process and their sequence, and the physical devices that interact during the task and their properties (e.g., screwdriver size, or a wire's location). In addition, it encodes the physical devices tied to each step of the interaction to handle ambiguous questions. For example, if the operator asks "Which size?" in the Open fingers step of the first use case, the ontology relations are used by the system's IML to understand that the operator refers to the size of the tool required for this specific step, the precision screwdriver, and not to the hex key used in the Decouple gripper step.

4. The IML plans and selects the next action to be performed by the system using both contextual information and the semantic concepts received from the IL. First, it retrieves the task information and the planning rules to complete the task from the Domain Knowledge Layer, and then it defines a strategy to reach the user's objectives. In other words, it consists of a set of expert rules S = {(x, c_t) → (y, c_{t+1}) | x ∈ R, c_t, c_{t+1} ∈ C, y ∈ A} that evaluate events x and the interaction context c_t into system actions y, and also update the context to c_{t+1}, based on the Attributed Probabilistic Finite State Bi-Automata schema as in [33,43,46]. This context, which the IML reads from and writes to the DKL, is maintained and updated throughout the interaction. In our particular tasks, the input x corresponds to the user semantic representation, and the context c takes into account the interaction state (e.g., the current step), the history of shown AR animations, or the selected wire.

5. The RGL translates the IML's output actions into understandable interactions for the users, for example, answering with the information in natural language or augmenting the user's surroundings with visuals. For our particular use cases, the system uses the Text-to-Speech module to generate synthesised audio when the speech modality is required, using the module described in AhoTTS [47]. Additionally, this layer computes suitable visual properties for the action, which include animation selection as well as feedback duration.

6. The commands given by the RGL are dispatched to the DCL, which interfaces with the output devices. For the presented use cases, visual and aural interfaces are used to communicate with the operators.
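The expert-rule formulation in step 4, S = {(x, c_t) → (y, c_{t+1})}, can be sketched as a lookup table over (event, context) pairs. The event and state names below are hypothetical, and the sketch deliberately omits the probabilistic bi-automata machinery of [33,43,46]:

```python
# Each rule maps (event x, context c_t) to (system action y, next context c_{t+1}).
# Event/state names are hypothetical; the real system derives its rules from
# the DKL ontology and a probabilistic bi-automata schema.

RULES = {
    ("ask-next-step", "intro"):  ("show-step-1", "step-1"),
    ("ask-next-step", "step-1"): ("show-step-2", "step-2"),
    ("ask-repeat",    "step-1"): ("replay-animation", "step-1"),  # context unchanged
    ("ask-next-step", "step-2"): ("close-session", "done"),
}

def step(event: str, context: str) -> tuple:
    """Evaluate one expert rule: return the system action and updated context."""
    return RULES.get((event, context), ("fallback-clarify", context))
```

Note how a repeat request maps back to the same context, which is what lets the operator replay an AR animation as many times as needed without advancing the task.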
Note that the presented XR interaction architecture is flexible enough to encompass different input and output devices, as is the case in our two use cases. This makes it possible to select the most suitable devices for each task, adapting to the heterogeneous needs of the different industrial processes.

Results
In this section, results are reported for each of the use cases where the proposed dialogue-supported XR architecture was applied. These results are presented in terms of usability, based on the participants' impressions gathered from post-study usability questionnaires. Note that the goal of these studies is not to measure completion times or operational benefits of the systems, but to measure the operators' adoption of the proposed solutions. To measure the adoption of the described systems, the System Usability Scale (SUS) [48] was used as a reference. This scale intends to provide a global view of subjective usability assessments over multiple dimensions. User responses are given on a 1 to 5 agreement scale, where 1 means "strongly disagree" and 5 means "strongly agree".
The employed questionnaires contain several questions adapted to each particular task and the users' characteristics, with the aim of capturing the operators' responses as detailed and unambiguously as possible. To make the results interpretable within a global scope, each question was clustered within one of the following usability dimensions:
• Self-confidence: this dimension involves those questions related to the ability to solve the problems that arise during the industrial processes without the help of any element other than the system itself.
• Learning Curve: this dimension measures how hard it was for the operators to adopt the proposed systems and to learn the required concepts and practices to use them.
• Efficiency: this dimension involves those questions related to the efficiency perceived by the users, i.e., whether they find the system useful and helpful to improve their work processes.
• Ease-of-use: this dimension involves questions regarding the systems' difficulty. This is related to the naturalness of the system and whether it was perceived as intuitive by the operators.
• Consistency: this dimension measures how predictable the usage of the system is in terms of user experience.
As can be seen, all the usability dimension scores fall within the [3.6, 4.2] range for both cases. In both situations, the confidence perceived by the operators increased due to the multi-modal guidance, and the system was perceived as consistent, even among different operator profiles. Most importantly, both systems were perceived as easy to use, one of the main objectives when implementing these systems to overcome cognitive barriers. This score is consistent between both use cases, where the technical profile of the operators differs: note that in the first scenario the operators were non-expert, untrained newcomers to the task, whereas in the second scenario the operators were people with some cognitive impairment. The perceived efficiency fluctuates between one use case and the other, yet it still remains high. This fluctuation can be expected, as the cognitive barriers that arise may differ between the two use cases; thus, the perceived efficiency may be lower in tasks that are easier to solve. Finally, the learning curve dimension was the most poorly graded, yet it still achieved an average score of 3.7 across both use cases. This may be explained by the experimental setup, where little-to-no instructions were given to the operators on how to use the system.
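For reference, the standard SUS scoring that these questionnaires take as a basis can be computed as follows. This is Brooke's original 10-item formula; since the study adapted its questions per task, the sketch shows the reference scheme rather than the exact instrument used:

```python
# Reference SUS scoring (Brooke's 10-item formula), not the study's adapted
# questionnaire: odd items are positively worded, even items negatively.

def sus_score(responses):
    """responses: 10 answers on a 1-5 agreement scale, in questionnaire order."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses):
        # Positively worded items contribute (score - 1); negatively worded
        # items contribute (5 - score), so higher is always better.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # scale the 0-40 sum to a 0-100 score

# Uniform "agree" (4) on positive items and "disagree" (2) on negative ones:
assert sus_score([4, 2] * 5) == 75.0
```

The alternating weighting is what makes raw per-question averages (such as the 3.6-4.2 dimension scores above) and the final 0-100 SUS score different quantities.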

Usability Results
Participants' feedback corroborates the hypothesis that the combination of aural and visual technologies in an interactive dialogue system helps them keep their hands free, which they agreed was an adequate feature given the type of tasks. Apart from the hands-free feature, participants found the system intuitive and natural. According to their observations and post-session informal feedback, the interactive nature of spoken dialogue with task-oriented contextualisation provided helpful real-time insights into the industrial processes to fulfil.
Hands-free systems are appropriate not only for cases like ours, where both hands are needed to actually perform the task, but also for other industrial and non-industrial settings where people cannot use their hands for whatever reason (e.g., while driving, or because of some physical impairment), and for cases where a physical transformation and/or control of the environment has to take place, often implying manual work.

Conclusions and Future Work
This paper describes an architecture to develop natural and hands-free human-machine interaction systems for industrial environments, which can combine more classical Human Augmentation technologies (such as Virtual, Augmented, and Mixed Reality) with dialogue-based interaction for process-solving tasks. The proposed solution facilitates the capture, distribution, and communication of domain-specific knowledge to operators in training and production phases. Two systems constructed using the same architecture are presented and initially evaluated in terms of usability and acceptance, one for a maintenance task and the other for assembly scenarios. The implemented systems were hands-free, multi-modal (including visual and speech interaction) and supported multiple technologies. They were well accepted by a group of non-expert operators unfamiliar with the task and a group of operators with cognitive disabilities. The results obtained can serve as a reference starting point for further research and new advances in the industrial sector.
Given the number of participants in the groups selected for these initial validations, one of the main goals of future research is to further validate the proposed Interactive XR architecture with a wider sample of participants to ensure the significance of the evaluation. Testing additional systems in new industrial scenarios that require the combination of additional XR interfaces, like Kinesthetic-Bionic devices, Internet of Things, machine sensors, and multi-operator scenario settings, will be addressed to assess the flexibility of the proposed architecture. Future work also includes finding ways to easily adapt the language-specific modules (STT and NLU in the Interpretation Layer, and RG and TTS in the Response Generation Layer) to multiple languages to overcome possible linguistic barriers for multinational companies.