Multilingual Conversational Systems to Drive the Collection of Patient-Reported Outcomes and Integration into Clinical Workflows

Abstract: Patient-reported outcomes (PROs) and their use in the clinical workflow can improve cancer survivors’ outcomes and quality of life. However, there are several challenges regarding efficient collection of the patient-reported outcomes and their integration into the clinical workflow. Patient adherence and interoperability are recognized as main barriers. This work implements a cancer-related study which interconnects artificial intelligence (spoken language algorithms, conversational intelligence) and natural sciences (embodied conversational agents) to create an omni-comprehensive system enabling symmetric computer-mediated interaction. Its goal is to collect patient information and integrate it into clinical routine as digital patient resources (the Fast Healthcare Interoperability Resources). To further increase convenience and simplicity of the data collection, a multimodal sensing network is delivered. In this paper, we introduce the main components of the system, including the mHealth application, the Open Health Connect platform, and algorithms to deliver a speech-enabled 3D embodied conversational agent that interacts with the cancer survivors in five different languages. The system integrates cancer patients’ reported information as patient-gathered health data into their digital clinical record. The value and impact of the integration will be further evaluated in the clinical study.


Introduction
Patient-reported outcomes (PROs) are a type of patient-gathered health data (PGHD), collected from patients to help address a health concern [1]. Since they represent self-reports from every-day life, PROs are an interesting data source in healthcare due to their usage for improving patients' experience, quality of life, and participation of the patient in the clinical workflow [2]. With the technological advance, PROs have become a complementary data source to telemonitoring [3], data mining, imaging-based AI techniques [4][5][6][7] as PROs are more sensitive to treatment-related differences and give patients a voice [8]. The knowledge domains of clinical specialties are expanding rapidly and due to the sheer volume and complexity of data, clinicians often fail to exploit it [9].
The first use of patient-reported outcomes was proposed in 1988 [10]. The study highlighted the different concepts on how to collect patient data, compared different areas of possible use of PROs, and defined the directions for further development of PROs [11]. Given the overall technological advances of the era, patient outcomes were collected mostly face-to-face, using paper-written forms. Those forms were then added to paper-form health records (HRs). With the significant advance of information and communication based care, meaningful physician-patient patterns are put into the center. However, this requires new sources of data to improve shared decision making and enable personalized decision-making. Conversational intelligence is one of the main digital technologies that can significantly contribute to patient activation and engagement. Overall, the technology is based on spoken language technologies (SLT), i.e., natural language processing (NLP), automatic speech recognition (ASR), and text-to-speech synthesis (TTS), that enable machines to interact with humans over mobile or web platforms [25]. In healthcare, the first adaptation of these technologies was proposed as early as 1966, with ELIZA [26]. Since then, NLP and SLT have progressed significantly. Conversational agents have been used to solve complex tasks, such as booking tickets or fetching results from an API, thereby acting as customer service agents [27]. In the context of healthcare, conversational agents are intended to provide patients with personalized health and therapy information and relevant products and services, to connect them with health care providers, and to suggest diagnoses and recommended treatments based on patient symptoms and reports. Properties such as cost-effectiveness, multilingual communication, and 24/7 availability make conversational agents useful for patients with medical concerns outside of their doctor's operating hours.
Studies also reported that patients perceive conversational agents as safer interaction partners than human physicians and are consequently willing to disclose more medical information and report more symptoms to them [28].
In the medical domain, particularly in oncology setting, conversational intelligence focuses primarily on (speech-enabled) chatbots [29] to contribute to screening (i.e., iDecide [30]), improving mental health state through managing psychological distress [31][32][33] and lifestyle changes [34]. Overall, chatbots have been proven as an enabler for active patient engagement, adherence and satisfaction increase [35,36]. "The self-reporting aspect delivered with the mobile application provides benefits that might otherwise be difficult to obtain" [36]:6689. However, the chatbots have yet to tackle the long-term adherence with sustainable quality of the reported data [37]. In [36], active use of the technology drops after 14 days. Patients' understanding (i.e., familiarity), their ability to remember the details and perceived trustworthiness represent the main factors of patient adherence [38].
Instead of delivering merely a chatbot, in the proposed system, an ECA is presented. ECAs can further increase the long-term adherence by engaging with users in more diverse interaction significantly enriched by incorporating non-verbal communication [37]. One of the earliest definitions of ECA is "more or less autonomous and intelligent software entities with an embodiment used to communicate with the user" [39]. ECAs can deliver a system with symmetric multimodality with speech, gesture, facial expression on both the input and the output side. The main three components of ECAs are user interfaces for communication with the ECA; computer modelling structure to make the ECA react emphatically; and embodiment (visual representation) for communication with users. Embodiments can be designed as virtual human characters [40], animals [41], or robots [42]. In general, the fully symmetric interaction opens up the opportunity to introduce human-like qualities, significantly improving the believability of the interfaces [43]. The main areas of ECAs in healthcare are the treatment of mood disorders, anxiety, psychotic disorders, autism, and substance use disorders [44]. In a review of Kramer [17] about the design and evaluation of the ECAs for healthcare, ECAs proved a promising tool for persuasive communication in healthcare. And, in another review study [42], technological and clinical possibilities of less complex ECAs were investigated and shown as a solution for routine applications in the means of rapid development, testing, and application. The design features of ECAs for healthcare were investigated by Stal [45] who found that the agents' speech and/or textual output, as well as its facial and gaze expressions are the most used features for ECAs. Previous ECA studies for healthcare mainly focused on physical activity [46][47][48], nutrition [49,50], stress [34], blood glucose monitoring [41], and sun protection [51]. 
Overall, however, most of the related research on ECAs focuses on speech, facial, and gaze expressions as the main design features [45]. Most of the ECAs in healthcare are 2D-based. Their gestures (as part of the non-verbal communication channel) and appearance are most often not considered as main design features. In fact, Kramer [17] reviewed 20 ECA studies to compare their functionalities and appearances. For the appearance of the ECA, most of the studies used similar virtual human characters, such as middle-aged African American women. Moreover, only 3 studies actually addressed gestures. In contrast, we offer two fully articulated ECAs, a male and a female, capable of interacting with patients in six languages: English, French, Latvian, Slovenian, Spanish, and Russian. Both can not only display facial expressions but also exploit gestures to enhance user experience by regulating communicative relationships, supporting verbal counterparts, and maintaining a certain degree of clarity in the discourse.
Although highly sophisticated, the (embodied) conversational agents are designed as prototypes (proof-of-concept), and the actual contribution to health-related outcomes is evaluated without relevant statistical significance [52]. Further, the interoperability and integration of the collected data into the clinical workflow are not considered. However, meaningful use of collected data will significantly contribute to patient adherence. Sayeed et al. [53] describe an approach to create a patient-centered health system that is based on the FHIR standard and on patient and clinician applications that can make requests for, and reports of, HL7 FHIR resources. Following the same baseline workflow, i.e., the collection of PGHD and the forming of FHIR resources, the proposed system combines FHIR resources with the multimodal sensing network (MSN) to offer a fully connected and integrated approach to collecting PGHD and integrating it into the clinical workflow.
To sum up, the main contributions of the proposed system are:
• A multilingual, fully articulated ECA implementing symmetric interaction in 6 languages and with male and female representation
• A micro-service-based sensing network to collect patient information, further supported by patient and clinician mHealth applications
• A holistic approach towards interoperable and fully integrated PGHD

The PERSIST Sensing Network
The main building blocks of the MSN are outlined in Figure 1. The MSN consists of Apache Camel, which implements the REST API, ActiveMQ Artemis, and Apache Kafka, which together implement efficient machine-to-machine communication among the various service blocks. ActiveMQ Artemis implements the MQTT broker, whereas Apache Kafka is exploited to deliver an efficient microservice architecture. Apache Camel can act as a router, since it is able to convert synchronous messages to asynchronous ones and vice versa. Apache Camel can also run as a Spring Boot application that provides REST API endpoints for HTTP requests. The MQTT broker is an intermediary between the Open Health Connect (OHC) platform and the mHealth App. In this case, the mHealth App is delivered as an MQTT client subscribed to ActiveMQ Artemis. Microservices are interconnected using Kafka topics and HTTP APIs. Kafka services communicate asynchronously over predefined topics for each supported language, whereas the Rasa chatbot communicates synchronously, using HTTP REST requests over the Camel REST API endpoints.
Regarding the connections with the mHealth App, Figure 2 outlines two types. The first is a synchronous connection over the secured application protocol HTTPS (REST), used for questionnaire requests and responses. The second is an asynchronous connection over the MQTT protocol, which uses MQTT topics. Connections to the OHC platform also use the synchronous HTTPS REST protocol. For MSN internal connections, besides REST and MQTT, Camel Java Messaging Service (JMS) and Kafka topics are used.
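The split between the synchronous and the asynchronous path can be illustrated with a minimal in-memory sketch. It is not the PERSIST implementation: the topic naming scheme (`tts.<lang>`, `asr.<lang>`), the language codes, the function names, and the use of Python queues in place of Kafka topics and Camel REST endpoints are all assumptions made for illustration.

```python
import queue

# Assumed ISO 639-1 codes; the text only states that each language has predefined topics.
LANGUAGES = ("en", "fr", "lv", "sl", "es", "ru")

# One in-memory queue per (service, language) pair stands in for a Kafka topic.
topics = {f"{service}.{lang}": queue.Queue()
          for service in ("tts", "asr") for lang in LANGUAGES}


def publish_async(service: str, lang: str, message: dict) -> str:
    """Asynchronous path: enqueue the message on the per-language topic and return at once."""
    topic = f"{service}.{lang}"
    if topic not in topics:
        raise KeyError(f"no topic configured for {topic!r}")
    topics[topic].put(message)
    return topic


def call_sync(handler, request: dict) -> dict:
    """Synchronous path: the caller blocks until the handler (e.g., the chatbot) replies."""
    return handler(request)


def echo_chatbot(request: dict) -> dict:
    # Hypothetical stand-in for the chatbot reached over REST; it only echoes the text.
    return {"sender": request["sender"], "text": f"You said: {request['message']}"}
```

In this sketch a TTS request would be handed off with `publish_async("tts", "sl", {...})` and consumed later by a worker, while a chatbot turn would go through `call_sync(echo_chatbot, {...})` and return its reply directly.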

mHealth Application
In the study, patients and clinicians have separate mHealth applications (Figure 3). The patient mHealth application is mainly used for data gathering and trends monitoring; the clinician mHealth application is mainly used for patient monitoring and specifying the patient's care plans. Both mobile applications were developed by the company Emoda. The main functionalities of the patient application are mood selection, diary recordings, reading of articles selected by clinicians, answering the questionnaires, receiving messages from the clinician, and scheduling appointments with the clinician. In the clinician application, clinicians can see the list of all patients and their clinical details. They are able to create a new patient record and edit or delete an already existing one. Also, clinicians can create appointments, see the calendar, receive alerts for specific patients, and exchange messages with patients. The mHealth App can use both synchronous and asynchronous protocols. While the synchronous REST protocol is used for communication with the OHC and MSN REST OpenAPI (Swagger) endpoints, the asynchronous MQTT protocol is used for receiving notifications.

OHC FHIR Server
The OHC platform is the complete integration and streaming platform for large-scale distributed environments provided by Dedalus. OHC is a digital health platform that helps unlock isolated data. By consolidating data in a standardised (FHIR) format from across a broad range of systems of record, OHC enables innovation through near real time access to longitudinal patient records. Our deliberately open and modular architecture allows OHC to be adaptive to the specific business outcomes. OHC enables all the interfaces to be connected to and make decisions across disparate data sources in real-time. The OHC Digital Health Platform comprises a set of components, as depicted in the conceptual/logical architectures. The OHC solution is flexible and can be deployed on-premise (private data center) or via cloud in environments like Azure or AWS. OHC provides the latest version of HAPI FHIR R4 [54].
Symmetry 2021, 13, 1187 6 of 19

End-to-End Multilingual Text-To-Speech Synthesis
The first microservice is speech synthesis, the text-to-speech (TTS) microservice. It mainly generates audio files from given transcriptions for the ECA that communicates with the patients. In short, the sequence-to-sequence model optimized for TTS is used to 'map' a sequence of letters to a sequence of phonemes. The TTS architecture used in [55] was developed for real-time or close to real-time systems by combining two neural network models: a feature prediction NN model and the flow-based neural network vocoder WaveGlow. The model from [55] is outlined in Figure 4. It consists of an embedding layer that creates a 512-dimensional vector. The embedding vectors are directed into a series of three 1-D convolutional layers, each with 512 filters of length 5. Each convolutional layer is followed by batch normalization and a ReLU activation.
After the convolutional block, the tensors are fed to a bidirectional LSTM, and the forward and backward results are concatenated. Since the decoder is implemented with a recurrent architecture, the outputs of the previous step (i − 1) are considered in each next step (i). The soft-attention mechanism represents a crucial element in this process. To create attention, the mechanism forms a context vector at each decoding step and updates the attention weights accordingly. The context vector ($C_i$), denoted by Equation (1), is computed as the product of the encoder outputs ($h$) and the attention weights ($\alpha$):

$$C_i = \sum_{j} \alpha_{ij} h_j \tag{1}$$

where the attention weight $\alpha_{ij}$ is calculated as:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})} \tag{2}$$

where $e_{ij}$ represents the energy and is calculated by a hybrid approach considering both location-based and content-based attention:

$$e_{ij} = w^{\top} \tanh\left(W S_{i-1} + V h_j + U f_{i,j} + b\right) \tag{3}$$

where $S_{i-1}$ represents the previous state of the decoder's LSTM, $h_j$ represents the jth hidden encoder state, and $f_{i,j}$ are the location features calculated as a convolution operation $f$ over the previous attention weights ($\alpha_{i-1}$). $w$, $W$, $V$, $U$, and $b$ are trained parameters.
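The relation between energies, attention weights, and the context vector of Equation (1) can be checked on a toy example. The sketch below assumes the usual softmax normalization of the energies and takes the energies as given, so the trained parameters are not modelled; the numbers are made up for illustration.

```python
import math


def softmax(energies):
    """Normalize the energies e_ij of one decoding step into attention weights alpha_ij."""
    peak = max(energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def context_vector(weights, encoder_states):
    """Weighted sum of the encoder states h_j: C_i = sum_j alpha_ij * h_j."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]


# Toy values: three encoder states of dimension 2 and hand-picked energies.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_i = [0.0, 0.0, math.log(2.0)]

alpha_i = softmax(e_i)            # approximately [0.25, 0.25, 0.5]
c_i = context_vector(alpha_i, h)  # approximately [0.75, 0.75]
```

The third encoder state has twice the unnormalized weight of the other two, so it receives half of the attention mass and dominates the context vector.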
The output of the decoder is a predicted spectrogram. To improve the spectrogram quality, it is passed through the PostNet module, a stack of five one-dimensional convolutional layers with 512 filters each and a filter size of 5. Each layer (except the last one) is followed by batch normalization and a tanh activation. Finally, to transform the feature representation (i.e., spectrogram) into a waveform (i.e., speech), a WaveNet architecture is used [56]. It consists of 30 dilated convolutional layers segmented into 3 cycles, where the dilation rate of layer $k$ is $2^{k \bmod 10}$ for $k \in \{0, \dots, 29\}$. To compute the logistic mixture distribution, the WaveNet output is finally passed through a ReLU activation, followed by a linear projection. The loss function used is the negative log-likelihood.
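The dilation schedule of the vocoder can be written out explicitly. The following sketch enumerates the dilation rate of each of the 30 layers (k = 0, ..., 29) and estimates the resulting receptive field. The kernel size of 2 is an assumption, since the text does not state it.

```python
def dilation_rates(num_layers=30, cycle_length=10):
    """Dilation rate of layer k is 2 ** (k mod cycle_length)."""
    return [2 ** (k % cycle_length) for k in range(num_layers)]


def receptive_field(rates, kernel_size=2):
    """Receptive field, in samples, of a stack of dilated 1-D convolutions.

    kernel_size=2 is an assumed value, not taken from the text.
    """
    return 1 + sum((kernel_size - 1) * r for r in rates)


rates = dilation_rates()
# Each cycle of 10 layers runs through 1, 2, 4, ..., 512 and then restarts at 1.
```

Restarting the schedule every 10 layers keeps the largest dilation at 512 while the three cycles together still cover a few thousand samples of context.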

End-to-End Multilingual Speech Recognition
Automatic speech recognition (ASR) represents the second microservice. It is implemented to support the spoken language interface in the mHealth App and to feed the survivor's answers to the dialog management component (i.e., the Rasa chatbot), where language is determined and processed. For the project PERSIST we deliver SPREAD, an E2E ASR system based on the B × R Jasper model (Figure 5) [57], where B represents the number of blocks and R represents the number of sub-blocks. Each sub-block applies 1D convolutions, batch normalization, clipped ReLU activation, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad.
As highlighted in Figure 5, each residual connection is first projected through a 1 × 1 convolution. This enables the algorithm to account for different numbers of input and output channels. The residual connection is in this way added to the output of the last 1D-convolutional layer in the block before the clipped ReLU activation and dropout. Next, the output goes through a batch norm layer, and the output of the batch norm layer is added to the output of the batch norm layer in the last sub-block. The overall sum is then passed through the ReLU activation function and dropout to produce the output of the current block B. All Jasper-based models have four generic convolutional blocks: one pre-processing block, to reduce the time dimension of the input speech signal, and three post-processing blocks at the end. The first post-processing block performs a dilation of 2 to increase the model's receptive field, while the last two post-processing blocks are fully connected. These are used to project the final output to a distribution over characters.
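Two details of this architecture are easy to make concrete: the clipped ReLU and the way the B × R notation determines the depth of the model. The sketch below assumes a clipping ceiling of 20 and one convolutional layer per sub-block and per generic block; neither value is stated in the text.

```python
def clipped_relu(x, ceiling=20.0):
    """ReLU with an upper bound: min(max(x, 0), ceiling). The ceiling is an assumed value."""
    return min(max(x, 0.0), ceiling)


def jasper_conv_layers(blocks, sub_blocks, generic_blocks=4):
    """Count convolutional layers of a B x R model.

    Assumes one convolution per sub-block plus one per generic
    pre-/post-processing block (four such blocks, as described in the text).
    """
    return blocks * sub_blocks + generic_blocks
```

Under these assumptions, a model with B = 10 and R = 5 would have `jasper_conv_layers(10, 5)` = 54 convolutional layers.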
As outlined in Figure 5, a decoder based on Connectionist Temporal Classification (CTC) [58] is used to transform the output of the model into a sequence of letters corresponding to the speech input. In contrast to attention mechanism-based ASRs [59], the CTC decoder uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This allows the ASR to perform frame-by-frame label prediction with low computational cost by performing a greedy search, and makes it applicable even to long audio sequences. E2E inference is defined as the classification of the most probable grapheme sequence $\hat{W}$ given an audio input $X$, i.e.:

$$\hat{W} = \underset{W \in V^{*}}{\arg\max}\; p(W \mid X)$$

where $X = (x_1, \dots, x_T)$ is a T-length speech feature sequence and $W = (w_1, \dots, w_N)$ is an N-length grapheme sequence (i.e., a word sequence). Thus, at frame $t$, $x_t$ is a D-dimensional speech feature vector, and $w_n$ is a word in vocabulary $V$ at index $n$. The main problem of the ASR is therefore how to calculate the posterior distribution $p(W \mid X)$. The CTC formulation originates from Bayes decision theory and defines the posterior distribution as a sum over frame-wise letter sequences $Z$:

$$p(C \mid X) = \sum_{Z} p(C \mid Z, X)\, p(Z \mid X)$$

The CTC formulation uses L-length letter sequences, $C = (c_1, \dots, c_L)$, with a set of distinct letters $U$, and a 'blank' symbol $\langle b \rangle$ to denote the letter boundary and handle letter repetition:

$$C' = (\langle b \rangle, c_1, \langle b \rangle, c_2, \dots, \langle b \rangle, c_L, \langle b \rangle) = (c'_1, \dots, c'_{2L+1})$$

where:

$$c'_l \in \begin{cases} \{\langle b \rangle\}, & l \text{ odd} \\ U, & l \text{ even} \end{cases}$$

which means that $c'_l$ is blank if $l$ is odd and a letter from $U$ if $l$ is an even number. CTC also uses a conditional independence assumption $p(C \mid Z, X) \approx p(C \mid Z)$ to simplify the dependency between the acoustic model and the letter models used in CTC.
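The role of the blank symbol and of the greedy search can be shown with a small sketch. It builds the blank-augmented letter sequence and decodes a toy sequence of frame-wise posteriors by picking the best symbol per frame, merging repeats, and dropping blanks. The spelling of the blank symbol (`<b>`), the alphabet, and the posterior values are assumptions for illustration; this is the generic greedy CTC rule, not the SPREAD decoder itself.

```python
BLANK = "<b>"  # assumed spelling of the blank symbol


def augment(letters):
    """Interleave blanks: (c1, ..., cL) -> (<b>, c1, <b>, ..., cL, <b>)."""
    augmented = [BLANK]
    for c in letters:
        augmented.extend([c, BLANK])
    return augmented


def greedy_ctc_decode(frame_posteriors, alphabet):
    """Pick the most probable symbol per frame, merge repeats, then drop blanks."""
    best_path = [alphabet[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_posteriors]
    decoded, previous = [], None
    for symbol in best_path:
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "b"]
posteriors = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged)
    [0.90, 0.05, 0.05],  # blank separates the two a's
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
]
text = greedy_ctc_decode(posteriors, alphabet)
```

The blank frame between the two runs of `a` is what allows the decoder to emit a doubled letter; without it, the repeated frames would collapse into a single `a`.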

Embodied Conversational System and Embodied Conversational Agent
Rasa NLU [60] and the ECA Framework [61] represent the final set of microservices and constitute an Embodied Conversational System, which is capable of creating responses in natural language as well as 'visualizing' them. The Rasa chatbot is used to manage the discourse between the survivor and the system. It is implemented as an API, and Rasa NLU represents the engine of the chatbot. The chatbot runs on a Linux server and is written in Python and YAML. The first version of our API implements 18 standardized patient-reported outcomes (PROs) as storylines in six languages used in the PERSIST Clinical Study [62]. For storing the data, the Rasa API uses an SQLite database, which is made possible by the SQLTrackerStore in the Rasa chatbot. POST and GET requests are used to store information, such as patients' answers, questionnaires, and all events that are triggered in a specific conversation.
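A single question-and-answer turn with the chatbot can be sketched as a pair of JSON messages. The example assumes that the chatbot is reached through Rasa's standard REST channel (a POST to `/webhooks/rest/webhook` with `sender` and `message` fields); the text only states that HTTP REST requests are routed through Camel endpoints, so the actual PERSIST payloads may differ. The sender identifier and texts are hypothetical.

```python
import json


def build_rasa_message(sender_id, text):
    """Serialize one user turn in the shape expected by Rasa's REST channel."""
    if not text.strip():
        raise ValueError("empty user message")
    return json.dumps({"sender": sender_id, "message": text}, ensure_ascii=False)


def extract_replies(response_body):
    """Collect the text parts of the bot messages returned for one turn."""
    return [m["text"] for m in json.loads(response_body) if "text" in m]


# Hypothetical exchange; no network call is made in this sketch.
request_body = build_rasa_message("patient-001", "Dobro")
response_body = '[{"recipient_id": "patient-001", "text": "Hvala"}]'
replies = extract_replies(response_body)
```

Keeping the sender identifier stable across turns is what lets the tracker store associate all events with one conversation.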
The ECA Framework is designed to transform plain text sequences generated by the chatbot into multimodal responses incorporating gestures. The proprietary algorithm [61] is highlighted in Figure 6. It is based on the idea of segmenting a text sequence into non-verbal conversational intents and representing them as content units (CUs). As outlined in Figure 6, the algorithm performs five phases in order to generate non-verbal behavior and synchronize it with the speech signal. In phase 1, the intent classification, it tries to recognize morphosyntactic patterns in the text and assign them a communicative intent. The input of phase 1 is the POS-tagged text and the semiotic grammar, where the semiotic grammar represents a finite parametric space of the intent $M_\tau$. The output of phase 1 is a set of multiple content units representing all possible communicative intents recognized in the text sequence. The content units identified may overlap and introduce inconsistencies. Thus, in phase 2, the intent planning, the algorithm needs to resolve these inconsistencies by an elimination, merging, and selection process. The algorithm considers prosodic alignment (i.e., prosodic phrases and intonation) to define the final sequence of communicative intents to be 'visualized'.
In phase 3, the movement planning, an appropriate movement structure, i.e., the prosody of movement represented through movement phases (preparation, stroke, retraction, and hold), must be defined for each planned intent. Namely, a gesture phrase $\hat{G}$, visualizing the input text, is then defined as a sequence of gestures $G_i$ visualizing each such content unit:

$$\hat{G} = (G_1, \dots, G_N)$$

where each gesture $G_i$ represents a visualization of a specific CU with a movement model $\hat{H}$ over time $t$. The operator × represents the successive execution of the movement models. The movement model $\hat{H}$ is the 'sum' of the animated sequences performed to visualize shapes/poses belonging to one of the movement phases executed over time $t$. Thus, for each movement model $\hat{H}$, an 'end-pose' must be selected that can be viably animated given the duration of the movement phase. This is implemented in phase 4 of the algorithm, the synchronization of form. The selection of shape depends on the type of movement phase (inherently related to the supposed power of movement), the conversational intent and its possible 'semantic representation', and the utterances (words or syllables) the non-verbal behavior is intended to visualize. Phase 4 adds shape to the structure of the movement models. In order to animate the gesture $\hat{G}$, the movement models are finally transformed into a script understandable to the ECA realization engine. This is carried out in the final phase 5 of the algorithm, entitled co-verbal behavior specification. We choose the proprietary EVA-Script notation [61]. In the EVA-Script notation, each movement model $\hat{H}$ is formalized as simultaneous execution within the block <bgesture>. The poses P within the stroke phases and the preparation phases are represented as <unit> blocks within <bgesture>. Each <unit> block contains the configuration of the individual movement controllers involved in the representation of the pose. Since the hold phases and the retraction phases only represent the existing shape being withheld or retracted into the neutral state, they are added to the <unit> block in the form of the attributes DurationRetraction and DurationHold.
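The nesting of blocks described for phase 5 can be sketched by generating a small XML fragment. Only the element names `bgesture` and `unit` and the attribute names `DurationRetraction` and `DurationHold` come from the text. The `controller` child element, its attributes, the pose labels, and the choice to attach the durations to the last unit are hypothetical, since the EVA-Script notation is proprietary and not specified here.

```python
import xml.etree.ElementTree as ET


def gesture_block(poses, duration_hold, duration_retraction):
    """Build one <bgesture> element with one <unit> per pose.

    `poses` is a list of dicts mapping a hypothetical controller name
    to a hypothetical pose label.
    """
    bgesture = ET.Element("bgesture")
    for index, pose in enumerate(poses):
        unit = ET.SubElement(bgesture, "unit")
        for controller, value in pose.items():
            # <controller> is an illustrative element name, not part of the source.
            ET.SubElement(unit, "controller", name=controller, value=value)
        if index == len(poses) - 1:
            # Assumption: hold and retraction follow the final pose of the gesture.
            unit.set("DurationHold", f"{duration_hold:.2f}")
            unit.set("DurationRetraction", f"{duration_retraction:.2f}")
    return bgesture


block = gesture_block(
    [{"right_arm": "raise"}, {"right_hand": "open"}],
    duration_hold=0.4,
    duration_retraction=0.3,
)
script = ET.tostring(block, encoding="unicode")
```

Representing hold and retraction as attributes rather than as separate units mirrors the observation in the text that these phases add no new shape.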

Results
For the preliminary evaluation, The PERSIST platform was deployed on two different physical servers at the University of Maribor, FERI. The detailed functionality of the system is highlighted by the Figure 7. As highlighted by Figure 7. The main actors of PERSIST are the clinician who defines and schedules an activity as part of patient's care workflow (i.e., phase 1 in Figure 7) and the patient, who executes the activity (i.e., phase 3 in Figure 7). The MSN and OHC represent the main services of the system. The OHC is used to store data and automate the execution of the clinical workflow. The MSN is used to implement activities and make their execution more natural by delivering the symmetric model of interaction. system is highlighted by the Figure 7. As highlighted by Figure 7. The main actors of PER-SIST are the clinician who defines and schedules an activity as part of patient's care workflow (i.e., phase 1 in Figure 7) and the patient, who executes the activity (i.e., phase 3 in Figure 7). The MSN and OHC represent the main services of the system. The OHC is used to store data and automate the execution of the clinical workflow. The MSN is used to implement activities and make their execution more natural by delivering the symmetric model of interaction. As highlighted in Figure 7, the integration of a PRO starts in phase 1, when a clinician specifies a 'Care plan' FHIR resource (FHIR CarePLan: https://www.hl7.org/fhir/careplan.html, last visited 19 June 2021) for the patient. The care plans can be used for a general practitioner to schedule and keep track of when their patient is due to carry out a specific activity. In case of PERSIST, a self-report. Exploiting this resource, the OHC FHIR server is capable to automatically trigger a request for the 'todo'. This request is triggered by sending a 'notification' to the patient via the MQTT Broker hosted on the MSN (i.e., phase 2 in Figure 7). 
The notification about the triggered activity is sent through the MSN's MQTT broker to the patient mHealth App in JSON format. The patient can see five types of notifications: a request to fill in a questionnaire, a request to provide information about their mood, a request to record a diary, a notification of a received message from the clinician, and other notifications.
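As an illustration of the notification step, the following minimal sketch builds and parses a JSON notification for one of the five notification types listed above. The field names (`type`, `patient_id`, `care_plan_id`, `issued`), the type identifiers, and the topic layout are assumptions made for this example only; the actual PERSIST message schema is not specified here. Publishing to the MQTT broker is deliberately omitted so that the sketch stays self-contained.

```python
import json
from datetime import datetime, timezone

# Hypothetical identifiers for the five notification types described in the text.
NOTIFICATION_TYPES = {"questionnaire", "mood", "diary", "clinician_message", "other"}


def build_notification(kind: str, patient_id: str, care_plan_id: str) -> str:
    """Serialize a notification as a JSON string (assumed schema)."""
    if kind not in NOTIFICATION_TYPES:
        raise ValueError(f"unknown notification type: {kind}")
    payload = {
        "type": kind,
        "patient_id": patient_id,
        "care_plan_id": care_plan_id,
        "issued": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload)


def topic_for(patient_id: str) -> str:
    """Return a hypothetical per-patient MQTT topic name."""
    return f"persist/patients/{patient_id}/notifications"


def parse_notification(message: str) -> dict:
    """Decode a received notification and reject unknown types."""
    payload = json.loads(message)
    if payload.get("type") not in NOTIFICATION_TYPES:
        raise ValueError("unsupported notification type")
    return payload
```

In such a design, the mHealth App would subscribe to its own topic and dispatch on the `type` field to open the questionnaire, mood, or diary screen.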
For all notification types, after the activity or 'todo' has been fulfilled, the MSN automatically transforms the response into a FHIR resource (phase 4 in Figure 7). Three types of resources are used: 'Observation' (FHIR Observation: https://www.hl7.org/fhir/observation.html, last visited 19 June 2021) to store reports regarding well-being (i.e., mood reports) and biometric data, 'DocumentReference' (FHIR DocumentReference: https://www.hl7.org/fhir/documentreference.html, last visited 19 June 2021) to store diary recordings, and 'QuestionnaireResponse' (FHIR QuestionnaireResponse: https://www.hl7.org/fhir/questionnaireresponse.html, last visited 19 June 2021) to store the answers to questionnaires. The system also notifies the clinician that an activity was completed (phase 5 in Figure 7).
Figure 8 highlights the dataflow for the case of a request to answer a PRO (i.e., phase 3 in Figure 7). As highlighted in Figure 8, the main actor in the execution of the activity is the patient, who uses the mHealth App to carry out the activity. The MSN and ECA (Chatbot) personalize the interaction by delivering symmetric interaction on both input and output. The OHC implements JSON Web Tokens (RFC 7519) to ensure that claims between two parties (i.e., the mHealth App and the MSN) are secure. The process is initiated when a patient clicks on the notification request and thereby starts the question-and-answer (Q&A) dialog.
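The transformation of a completed activity into one of the three resource types can be sketched as follows. This is a minimal illustration that assumes FHIR R4 element names; the function names, the free-text `code` used for mood, and the way the diary recording is referenced are assumptions for this example and do not describe the actual MSN implementation.

```python
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def mood_to_observation(patient_id: str, mood_score: int) -> dict:
    """Wrap a mood report in a minimal FHIR Observation."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # Placeholder text; no specific terminology code is assumed here.
        "code": {"text": "Self-reported mood"},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": _now(),
        "valueInteger": mood_score,
    }


def diary_to_document_reference(patient_id: str, url: str, content_type: str) -> dict:
    """Point to a stored diary recording from a minimal FHIR DocumentReference."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "date": _now(),
        "content": [{"attachment": {"contentType": content_type, "url": url}}],
    }


def answers_to_questionnaire_response(
    patient_id: str, questionnaire_id: str, answers: dict
) -> dict:
    """Wrap questionnaire answers (linkId -> text) in a FHIR QuestionnaireResponse."""
    return {
        "resourceType": "QuestionnaireResponse",
        "questionnaire": f"Questionnaire/{questionnaire_id}",
        "status": "completed",
        "subject": {"reference": f"Patient/{patient_id}"},
        "authored": _now(),
        "item": [
            {"linkId": link_id, "answer": [{"valueString": text}]}
            for link_id, text in answers.items()
        ],
    }
```

Each function returns a plain dictionary that could be serialized to JSON and posted to a FHIR server; validation against the server's profiles is out of scope for this sketch.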
In this exchange, the ECA delivers questionnaires as multimodal responses and the user delivers the answer by clicking on an option or by answering with speech. As outlined in Figure 8, the system implements a symmetric interactive system in which users can answer questions from standardized questionnaires for the specific PRO. Questionnaires are available in the six different languages of the PERSIST project (Slovenian, English, Russian, Latvian, French, and Spanish). To support multimodality on the output side, the system visualizes the chatbot-generated information and presents it through the female agent 'Eva' and the male agent 'Adam' (Figure 9). Non-verbal elements are associated with speech. Unannotated text is rendered as multimodal output, which offers a spoken communication channel as well as a synchronized visual communication channel.
The BGM synchronizes the verbal and non-verbal elements so that our ECAs act more naturally and appear more human-like. On the input side, the system accepts responses in text or speech format. To properly map the user response to the answers expected by PROs, a word-to-concept mapping is delivered as part of spoken language understanding.
Figure 9. Visualizing conversational response with ECAs.
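The idea of word-to-concept mapping on the input side can be illustrated with a deliberately naive keyword lookup. The answer concepts and trigger phrases below are hypothetical and English-only, and the longest-match rule is a simplification chosen for this example; the spoken language understanding component of the system is not described at this level of detail.

```python
import re

# Hypothetical answer options for a single Likert-style PRO item.
ANSWER_CONCEPTS = {
    "not_at_all": ["not at all", "never", "no"],
    "a_little": ["a little", "slightly", "a bit"],
    "quite_a_bit": ["quite a bit", "often"],
    "very_much": ["very much", "a lot", "extremely"],
}


def map_to_concept(utterance: str):
    """Return the concept whose longest trigger phrase occurs in the utterance.

    Returns None when no phrase matches, so that the dialog can re-prompt.
    """
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", utterance.lower())
    text = " ".join(text.split())
    best = None  # (phrase length, concept)
    for concept, phrases in ANSWER_CONCEPTS.items():
        for phrase in phrases:
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                if best is None or len(phrase) > best[0]:
                    best = (len(phrase), concept)
    return best[1] if best else None
```

Preferring the longest matching phrase means that "quite a bit" is not mistaken for "a bit"; a production system would more likely rely on a trained intent or entity model per language.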
The proposed system was deployed on a server that hosts five virtual machines on Proxmox VE 6.3-2 for the needs of the PERSIST project. That server runs the Xubuntu 20.04 LTS operating system. The other server, named PERSIST_INFERENCE, hosts no virtual machines; it runs the Ubuntu Server 20.04 LTS OS and hosts the microservices for ASR and TTS. The TTS and ASR services are integrated using predefined topics and Kafka producers and consumers. Both microservices are being developed in Python using NVIDIA Triton Inference Server with TensorRT. The ECA microservice implementing the virtual agent is yet to be fully integrated and operates as a standalone service. The infrastructure specification is lightweight, with 8 GB of RAM for Apache Kafka version 2.13-2.7.0 and Apache Camel version 3.4.0 with Apache ActiveMQ Artemis version 2.17.0. The Rasa chatbot version 2.1, which requires more memory, is provided with 32 GB of RAM. Every building block is assigned 1 socket with 4 processor cores and 32 GB of SSD storage.
To evaluate the hardware performance of the system, we simulated load on the system and measured CPU usage, memory usage, and average response time for both Camel and the Chatbot. The results are outlined in Figures 10-12.
As can be seen from the data in Figure 10, as the number of active users was doubled between tests, CPU usage rose linearly from 11.65% at 25 active users to 56.04% at 1000 active users for Camel, and mostly linearly from 5.86% at 25 active users to 30.44% at 1000 active users for the Rasa chatbot. In Figure 11, memory usage remained stable for both Camel and the Rasa chatbot and proved independent of the number of users: it was near 50% on Camel and near 25% on the Rasa chatbot. Figure 12 reports the average response time.
The preliminary evaluation of the end-to-end ASR system was carried out for the English language. We trained the models on a DGX-1 (8 × V100 GPUs, 8 × 32 GB of GPU memory) and evaluated them on a workstation with 2 × RTX8000 GPUs (2 × 48 GB of GPU memory). The audio dataset comprised roughly 1308.61 h. The best Jasper model, trained with a ReLU encoder, reached a WER of 0.22.
To preliminarily evaluate the quality of end-to-end TTS, MUSHRA listening tests [63] were carried out for the English models among the PERSIST consortium partners; 21 consortium members participated. Overall, the best-rated model was the Tacotron 2 + WaveGlow combination presented in Section 4.2, with an average score of 75 on a 100-point scale. The results show that speech generated by this model is intelligible and understandable, although it may sometimes sound machine-like.
The evaluation of the multimodal conversational response was reported in [61]. Thirty individuals assigned an average score of 3.45 on a 5-level Likert scale. This suggests that the system can produce a more viable and more believable user interface. To fully evaluate the ECA responses, the same approach will be adopted in a targeted real-life environment within the PERSIST project. The patients will evaluate five dependent variables describing the quality of the presentation on a 5-level Likert scale.
In addition to gesture quality (e.g., form, dynamics, fluidity, synchronization), they will also report on how understandable the presented content is (the sixth variable). After observing both instances (text + speech, and ECA with gestures), they will be asked to identify their general perception of the observed viability and human-likeness, expressed via the final, seventh dependent variable, which will also be assessed on a 5-level Likert scale.
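For readers less familiar with the WER metric reported for the ASR model above, the standard definition is WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words in the recognized text and N is the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It illustrates the standard metric only; the evaluation scripts and text normalization used in this study are not described here, and the function name is chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.22 corresponds to roughly 22 word-level errors per 100 reference words.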

Discussion
The main challenges for the wide adoption of PGHD in clinical practice include usability (i.e., integration, interoperability with existing EHRs) and sustainable quality of results (i.e., patient motivation and adherence) [21,37]. In order to address the interoperability and suitability of the collected resources, we delivered a FHIR methodology. The system presented includes patient/clinician mobile applications, an OHC FHIR server, and the MSN. The patient/clinician mobile applications are designed according to the security concerns and clinical trial requirements. Interoperability among the components is provided by the OHC FHIR server. OHC provides the framework and set of tools for the integration, ingestion, storage, indexing, and surfacing of patient information. It is an open digital integration hub designed to deliver the speed, scale, and flexibility needed to securely gain value through the integration of health systems. By consolidating data in a standardized format (FHIR) from across a broad range of systems of record, OHC enables innovation through near-real-time access to longitudinal patient records. The APIs provide opportunities to flexibly design services that can seamlessly ingest discrete data from the source (i.e., the EHR platform) into a third-party application (i.e., the clinical or patient mHealth App). The proposed approach is further supported by recent trends in health care IT systems. Namely, FHIR has been recognized as an approach suitable for citizen developers since it also supports 'low-code/no-code' solutions [21]. This trend allows a user-centric design that creates intuitive data visualizations for transdisciplinary collaborations, including citizens, cognitive scientists, bioinformaticians, and clinicians [21]. For the data collected to be valuable, however, the whole IT system should be transformed to support the FHIR methodology. Several studies indeed report on the issue of actionability.
If the PGHD is to drive the clinical decision workflow, the clinicians should be able to seamlessly exploit other data from the EHRs. Thus, our future efforts will be directed towards the transformation and ingestion of EHRs from existing IT platforms into a FHIR-ready server. Based on the preliminary studies, the main activities will involve the definition of an ontology (i.e., a mapping model) that will correlate existing fields with specific FHIR resources. Moreover, since most of the information in existing EHRs is stored as partially structured or unstructured text, a specific focus will be directed towards extracting information using modern NLP techniques and data-to-concept mapping.
The other challenge related to the integration of PGHD relates to the patient's perspective, i.e., the long-term sustainability and quality of the collected information [36,37]. Familiarity, perceived complexity, and trustworthiness represent the main drivers of patient adherence [38]. To address this challenge, the MSN delivers a microservice infrastructure. In this infrastructure, each building block is separated into its own service, and the services are distributed among the servers and can be replicated if needed. A fully articulated ECA was deployed as the central technology to implement more natural human-machine interaction. The EVA framework is capable of capturing various contexts in the "data" and providing the basis to analytically investigate various multidimensional correlations among co-verbal behavior features. The role of the EVA realization framework is to transform the co-verbal descriptions contained in EVA events into articulated movement generated by the expressive virtual entity, e.g., to apply the EVA-Script language onto the articulated 3D model EVA in the form of animated movement [43]. The multilingual properties of our ECAs also differentiate them from other similar ECAs. Trustworthiness is a clinical value which has a significant impact on adherence, mitigating pervasive threats to health and wellbeing [64]. The applied model of symmetric multimodality for dialogue systems enables the ECAs to deliver and understand all natural input/output modes, including speech, gestures, and facial expressions. This makes the synthetic interfaces significantly more familiar and trustworthy [38]. This is significant since trustworthiness is one of the building blocks of patient compliance and responsiveness [65]. The Chatbot API uses PREMs and PROMs to capture the patients' health status and the patients' perceptions of their experience whilst receiving treatment.
All data sent and received are in JSON format, which is easy to transmit in web applications and easy to represent and understand. This is important because doctors should be able to see the history of reports for a specific patient in an understandable form in order to make more accurate decisions. For conversations between the Rasa Chatbot API and patients, it is important to configure story.yaml, which contains the flow of the conversation, i.e., the order in which the intents are executed depending on the patient's responses [66].
Overall, the existing literature implies a consensus on the effects that ECAs may have on patient adherence. Compared to chatbots and 2D agents, fully articulated embodied conversational agents have been shown to decrease the complexity of user interfaces and significantly contribute to familiarity and long-term sustainability [29]. Namely, fully articulated ECAs have a virtual body they can exploit to generate non-verbal cues with a significant impact on the understanding and cohesion of information exchange and on the believability/trustworthiness of the digital entity. However, Ciechanowski et al. [67] note that the phenomenon of the "uncanny valley" may have a significant negative impact on the overall user experience with articulated entities compared to "disembodied" agents. Therefore, in our future efforts we will focus specifically on the synchronization of non-verbal behavior with speech. We plan to deliver a comparison study and Wizard of Oz experiments to clearly define the design features of the symmetric model of interaction. Our preliminary experiments also showed that the 'quality' of synthesized behavior is closely related to hardware requirements. The model deployed for this study (Section 4.3) is not end-to-end and its computational requirements go well beyond any mobile or end-user device. As a result, our future activities will also investigate an end-to-end deployment of ECAs and non-verbal behavior generation models.

Conclusions
In this paper, we have presented a holistic approach towards the sustainable collection of PGHD and PROs and their efficient integration into the clinical workflow. PGHD may significantly contribute to personalized care and to the early identification of psychological and physiological symptoms and negative health outcomes (e.g., cancer progression, toxicity, psychological distress). The system proposed in the paper represents an opportunity to integrate the possible benefits and deliver them to the patients. The system includes patient/clinician mobile applications, an OHC FHIR server, and an MSN. One of the major limitations stems from the inflexibility of existing healthcare platforms to adapt to FHIR or any other standard. The lack of objective evaluation of the relevance and efficiency represents another limitation. Similarly to related research, this study addresses the technologies from the prototype (proof-of-concept) perspective. The technology was evaluated on a component basis, statistically, and on a short-term-use basis. However, within the project PERSIST, we plan to execute a 6-month final clinical evaluation with 160 cancer survivors and over 20 clinicians. The final limitation also arises from the feasibility nature of the study and relates to usability. Although the co-creation activity implemented to design the system addressed the issues from the perspectives of multiple stakeholders, the clinical setting is limited to oncology and survivors of breast and colon cancer. Thus, the results may not sufficiently reflect the requirements of cancer patients (i.e., during treatment) or patients suffering from other (chronic) diseases.
Data Availability Statement: The data are not publicly available; restrictions apply to the availability of these data.