Article

A Parallel Multimodal Integration Framework and Application for Cake Shopping

Beijing Engineering Research Center of Mixed Reality and Advanced Display, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 299; https://doi.org/10.3390/app14010299
Submission received: 30 November 2023 / Revised: 22 December 2023 / Accepted: 26 December 2023 / Published: 29 December 2023
(This article belongs to the Special Issue New Insights into Human-Computer Interaction)

Featured Application

This paper introduces a flexible parallel multimodal integration framework that outlines a development paradigm and encourages multimodal application across various scenarios.

Abstract

Multimodal interaction systems can provide users with natural and compelling interactive experiences. Despite the availability of various sensing devices, few commercial multimodal applications are available. One reason may be the lack of an efficient framework for fusing heterogeneous data and relieving resource pressure. This paper presents a parallel multimodal integration framework that ensures that the errors and external damage of the integrated devices remain uncorrelated. The proposed relative weighted fusion method and modality delay strategy process the heterogeneous data at the decision level. The parallel modality operation flow allows each device to operate across multiple terminals, reducing resource demands on a single computer. The universal fusion methods and independent devices further remove constraints on the number of integrated modalities, making the framework extensible. Based on the framework, we develop a multimodal virtual shopping system that integrates five input modalities and three output modalities. Objective experiments show that the system can accurately fuse heterogeneous data and understand interaction intent. User studies indicate that multimodal shopping is immersive and entertaining. Our framework provides a development paradigm for multimodal systems, fostering multimodal applications across various domains.

1. Introduction

Multimodal human–computer interaction (MMHCI) has gained increasing attention. It lies at the crossroads of several research areas, including computer vision, psychology, artificial intelligence, and many others [1]. Multimodal interaction focuses on interacting with computers in a more “human” way using speech, gestures, and other modalities [2]. Many studies indicate that multimodal interaction can offer better flexibility and reliability [2,3]. It can meet the needs of diverse users with a range of usage patterns and preferences [3]. Developing a versatile multimodal integration framework can simplify the MMHCI system implementation process, saving time and lowering the technical expertise required [4]. Furthermore, it can facilitate the application of multimodal technologies in various fields.
Considering the direction of transmission, the individual modalities can be categorized into input and output modalities. Input modalities refer to the modalities that the computer receives from users. Output modalities refer to the modalities that the computer feeds back to users. We list some modalities for multimodal human–computer interaction in Table 1 based on Turk’s research [3]. Notably, the data from each modality, like images for expressions, wave signals for speech, and pose keypoint coordinates, exhibit significant differences in dimensions, structures, and sampling rates. Processing these heterogeneous data and obtaining user intent is difficult. Multimodal fusion algorithms are designed to tackle this challenge by leveraging complementary information and learning a unified decision from diverse data. According to when the fusion occurs, fusion algorithms are generally categorized into data-level [5], feature-level [6], and decision-level fusions [7,8]. For decision-level fusion, the data are processed separately and interpreted unimodally before being integrated with information from other modalities [3]. When input modalities have very different dimensions and sample rates, it is much simpler to implement decision-level fusion [9]. Moreover, the errors generated by multiple classifiers in this approach tend to be uncorrelated [9], enhancing the system’s robustness in practical applications. Based on these advantages, we consider fusing multiple modalities at the decision level.
In terms of implementation, fusion methods can be divided into classification-based, estimation-based, and rule-based methods. Classification-based fusion uses machine learning methods (e.g., support vector machines [10], Bayesian inference [11,12], and neural networks [13]) to categorize the multimodal observation into one predefined class. Estimation-based methods have been primarily used in object localization and tracking tasks [14,15]. Rule-based fusion includes a variety of basic rules for combining multimodal information [16,17]. It is simple and computationally less expensive than the other methods [17]. The basic rules include weighted fusion [18,19], MAX, majority voting [7], and other custom-defined rules [20,21,22], with weighted fusion being the most prevalent. Lucey et al. [8] adopt a linear weighted sum strategy to fuse audio and video results. They assign fixed discrete weights to the modalities, a simple but unrealistic choice. Neti et al. [18] use training data to determine the relative reliability of audio and video and adjust their weights accordingly. Dynamic weights can match the real-time data distribution. Based on this observation, we propose a relative weighted fusion method that adjusts modality weights based on relative probability.
When working with multimodal systems, decision-level fusion allows us to pick the modalities independently. There have been significant advances in individual modality processing. Speech recognition [23,24], pose tracking [25], gesture recognition [26], and expression recognition [27] form the foundation for processing speech, pose, gesture, and expression data. Virtual reality [28,29], force feedback [30], and auditory feedback [31,32] are propelling traditional 2D human–computer interaction into more intelligent 3D interaction. More challenging modalities, such as taste [33,34] and tactile sense [35], are gaining attention. The development of these unimodal interaction technologies has laid the foundation for multimodal interaction. However, runtime complexity tends to grow with the number of modalities in multimodal systems, making real-time performance a challenge. The processes require substantial resources, such as memory and CPU, making prolonged and stable operation on a typical computer difficult. We propose a parallel input and output modality operation flow, allowing each device to operate across multiple terminals and reducing the resource burden on a single computer.
Based on the proposed integration framework, we develop a multimodal virtual shopping system. Online shopping is becoming increasingly integrated into people's daily lives. Most online shopping platforms only provide simple 2D image-based and text-based interfaces to access products [36]. Such environments neither make purchasing enjoyable for consumers nor provide real-time personalized services according to consumers' personal attributes and actual shopping behaviors [36]. Multimodal shopping [37,38] disrupts the typical online shopping pattern and enhances the user's shopping experience. Schnack et al. [39] build a virtual supermarket scenario and demonstrate that 3D interactive shopping can evoke more natural behaviors. Wasinger et al. [40] integrate three modalities (speech, handwriting with a pointing pen, and gestures) in a camera shopping system, which still relies on a 2D text-based interface. Facebook combines virtual environments with contextual intent to analyze users' intentions and show products in a situated interactive multimodal shopping system [41]. While it achieves speech-based intelligent shopping, it overlooks nonverbal cues from users. We implement a multimodal virtual shopping system that combines speech with body language.
The main contributions of this paper can be summarized as follows:
  • We propose a versatile parallel multimodal integration framework to support the development of multimodal applications in a reasonable time and with limited resources. It can simplify the MMHCI system implementation process and facilitate the application of multimodal technologies in various fields.
  • The parallel framework allows each device to operate across multiple terminals, reducing resource demands on a single computer. The proposed relative weighted fusion method is novel, application-independent, and can be easily extended to support new modalities.
  • We develop a multimodal virtual shopping system using the multimodal integration framework. The system seamlessly integrates five input modalities and three output modalities. A tree-structured task-oriented dialogue system is used to accomplish logical processing during shopping. Subjective and objective experiments show the robustness and intelligence of our shopping system.

2. Methods

2.1. Preliminaries

Before delving into our methods, we discuss two works similar to our multimodal framework. Cutugno et al. [42] propose a framework that aims to help design simple multimodal commands in the mobile environment. The proposed framework manages multimodal interaction in four steps: (1) obtain data from input interaction modalities; (2) extract semantics from the received data; (3) fuse the data in order to determine user intent; (4) send a message to the application client to perform an action based on the recognized user intention. They model each step using an XML configuration file that defines the processing logic. This modeling approach is similar to our slot-based logic processing, which is straightforward and scalable. One limitation is that users can only trigger specific multimodal requests based on predefined motion sequences. For example, the task “moving an object” is specified by the following sequences of inputs: (1) mouse-press on an object, mouse-drag, and mouse-release; (2) mouse-press on an object, mouse-release, speech input “move”, and mouse-press and mouse-release on a target position. This task definition ensures that multiple modal inputs never occur simultaneously, which is easier than our multimodal fusion task. Essentially, this framework accomplishes a flat interaction task that incorporates voice commands.
Flippo et al. [4] introduce a reusable development framework for multimodal interfaces, incorporating fusion, dialog management, and fission modules. The fusion manager features a novel fusion method that can be used across applications and modalities. It obtains possible values for a slot from resolving agents and chooses the optimal one based on the agents' probabilities and confidence values. In a collaborative situation-map application, they employ a straightforward voting algorithm in the fusion manager, in which the scores for each modality are summed and the value with the highest total is selected. Their fusion is specifically employed for contextual understanding of reference terms. When the fusion manager encounters unresolvable ambiguity, the system asks the user for clarification. This setup addresses the precision challenges inherent in the fusion module.

2.2. Parallel Multimodal Integration Framework

The proposed parallel multimodal integration framework is illustrated in Figure 1. It comprises the input modality operation flow (red part in Figure 1), the output modality operation flow (green part), the data distribution module (gray part), and the situated logic processor (blue part). The framework is robust: the unexpected disconnection of or damage to a device does not impact the overall system. Each modality in the system operates in parallel and has unrestricted resource allocation, reducing development barriers. Furthermore, our independent and asynchronous unimodal devices, the efficient and straightforward fusion method, and the flexible and rapid data distribution module collectively reduce system latency and meet the practical demands of various applications.
To obtain the final decision, we propose a relative weighted fusion method and a modality delay strategy for fusing multiple modalities at the decision level. During the fusion phase, each modality's independent decisions $L_1, \dots, L_N$ are sent to the fusion module and saved in the corresponding modality slots. For the $j$-th modality, the modality slot defaults to the zero feature $D_0$ in the absence of a classification result, i.e., $L_j = D_0$. When the classification result $D_j$ is received, $L_j$ is updated, i.e., $L_j = D_j$. This approach ensures the scalability and robustness of the system.
Unlike fusion methods designed for well-aligned datasets [43], our fusion module must handle real-time multimodal data from users. Inevitably, there is temporal misalignment in when the modality inputs occur. The misalignment comes from user-triggered time variations (e.g., waving hands after saying “Goodbye”) and computation delays. Dealing with misalignment, often called multimodal synchronization, is particularly important to multimodal systems [40]. Inspired by [40], we propose a modality delay strategy to address this problem. Specifically, we set a fusion delay threshold. When the fusion module receives the first valid classification result, the timestamp is recorded. The fusion module then waits for other possible modality results within the delay threshold. If the same modality produces another valid result within this window, the stored result is updated. Valid classification results refer to data with a clear user intent; for the expression modality, the neutral expression category is considered invalid, while the happy expression category is considered valid. The modality delay strategy ensures that the system effectively captures intent decisions across modalities, enhancing the accuracy of intent recognition. It is crucial to set a reasonable fusion delay threshold. If the threshold is too small, valid user input is likely to be disregarded. If the threshold is too large, users will perceive the system latency, negatively impacting the user experience and satisfaction [44]. We recommend adjusting the threshold based on the specific usage scenario, device latency, and users' familiarity with the system. For example, controller-based selection in virtual reality takes longer than mouse-based selection on a computer, so the threshold should be set larger. Furthermore, the longer unimodal decision making takes and the less familiar users are with the devices, the larger the delay threshold should be.
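To make the slot and delay mechanism concrete, the following is a minimal Python sketch of a fusion buffer; the class, method, and variable names are illustrative assumptions, not the system's actual implementation.

```python
import time

ZERO_FEATURE = None  # stands in for the zero feature D_0 (no valid result yet)

class FusionBuffer:
    """One slot per modality; fusion waits up to `delay` seconds after the
    first valid unimodal result before combining the slots."""

    def __init__(self, modalities, delay=4.0):
        self.slots = {m: ZERO_FEATURE for m in modalities}
        self.delay = delay
        self.first_valid_time = None

    def update(self, modality, result, is_valid):
        # Invalid results (e.g., a neutral expression) never start the timer
        if not is_valid:
            return
        if self.first_valid_time is None:
            self.first_valid_time = time.time()
        # A later valid result from the same modality overwrites the stored one
        self.slots[modality] = result

    def ready(self):
        # Fuse once the delay window after the first valid result has elapsed
        return (self.first_valid_time is not None
                and time.time() - self.first_valid_time >= self.delay)

    def collect_and_reset(self):
        # Hand the slot contents to the fusion method and reset to D_0
        decisions = dict(self.slots)
        self.slots = {m: ZERO_FEATURE for m in self.slots}
        self.first_valid_time = None
        return decisions
```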
After the waiting phase, the individual modality results are fused using the relative weighted (RW) fusion method. RW can be formulated as:

$$H = \arg\max \sum_{j=1}^{N} w_j L_j \qquad (1)$$
where $H$ is the final decision and $w_j$ is the weight of the $j$-th modality. For each modality result, the category with the highest probability is $q_j$, giving top categories $q_1, \dots, q_N$. The $j$-th modality weight is calculated by the following equation:

$$w_j = \frac{\exp P(q_j)}{\sum_{i=1}^{N} \exp P(q_i)} \qquad (2)$$
where $P(q_j)$ represents the probability of category $q_j$. This method uses the relative probability as the confidence value and modality weight. It is worth mentioning that after a fusion is completed, the modality slots are reset to $D_0$.
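As a minimal illustration of Equations (1) and (2), the following Python sketch fuses per-modality class probability vectors; the function name and example probabilities are ours, not the system's code.

```python
import numpy as np

def relative_weighted_fusion(decisions):
    """decisions: list of per-modality probability vectors over the same classes.
    Returns the index of the fused class H (Equations (1) and (2))."""
    P = np.asarray(decisions, dtype=float)                 # shape (N, C)
    top_probs = P.max(axis=1)                              # P(q_j) per modality
    weights = np.exp(top_probs) / np.exp(top_probs).sum()  # Equation (2)
    return int(np.argmax(weights @ P))                     # Equation (1)

# Example: speech is confident "positive", expression weakly "negative"
speech     = [0.05, 0.90, 0.05]   # [negative, positive, neutral]
expression = [0.50, 0.20, 0.30]
print(relative_weighted_fusion([speech, expression]))      # -> 1 (positive)
```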
Afterward, the final decision is sent to the situated logic processor. The situated logic processor defines the system processing logic based on the target scenario. It analyzes user intent and generates appropriate system responses. The control signals are then sent to the relevant output modalities, delivering multimodal feedback to the users. In our parallel multimodal framework, each modality can be deployed on a different terminal, so a cross-terminal data transmission method is essential. We craft a data transmission framework for the data distribution module, implemented with Spring Cloud over the HTTP protocol. Each modality exchanges data with the fusion module and logic processor through its assigned IP address and port.
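For illustration, a device-side client might push its unimodal decision to the fusion module as follows; the endpoint URL, port, and payload fields are hypothetical, and the server side (built with Spring Cloud) is not shown.

```python
import time
import requests

FUSION_ENDPOINT = "http://192.168.1.10:8080/fusion/decision"  # assumed address and route

def send_decision(modality, label, probability):
    """Push one unimodal decision to the fusion module over HTTP."""
    payload = {
        "modality": modality,        # e.g., "speech", "gesture"
        "label": label,              # e.g., "positive"
        "probability": probability,  # P(q_j), used for the relative weights
        "timestamp": time.time(),    # used by the modality delay strategy
    }
    resp = requests.post(FUSION_ENDPOINT, json=payload, timeout=1.0)
    resp.raise_for_status()

# send_decision("gesture", "positive", 0.87)
```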

2.3. Multimodal Virtual Shopping System Prototype

This section presents our multimodal virtual shopping system in the cake domain. We focus on the shopping domain because it often induces rich multimodal interactions. The system overview is illustrated in Figure 2. The system integrates five input modalities (expressions, gestures, poses, contact force, and speech) and three output modalities (visual, auditory, and touch), selected based on the quality and availability of algorithms and devices. We employ creative scenarios to apply these modalities in a shopping context, enhancing enjoyment. The logic processor is a task-oriented dialogue system, and we construct a situated multimodal dialogue dataset that defines user intents and system responses. It is worth noting that speech is the primary interaction modality in our shopping system, specifically for choosing cakes; the other modalities are employed to capture the user's preference regarding the cakes. Figure 3 shows an example conversation for cake shopping, with user and system utterances marked in red and blue, respectively. Figure 4 shows a scene of a user interacting with the shopping system. We recommend that readers watch Supplementary Video S1, which shows the complete system in operation, to learn more about how the shopping system works. Note that the system is built for native Chinese users; the conversation examples presented in this paper are all translated from Chinese.

2.3.1. The Integrated Modalities

Figure 2 lists the modalities, devices, raw data, and other elements used in the shopping system. Figure 5 shows the appearance and user interfaces of the devices and software. The input modalities classify three emotional states: negative, positive, and neutral. For the expression modality, we use an expression recognition algorithm [45] to classify facial images as angry, happy, or neutral, corresponding to negative, positive, and neutral. Figure 5a shows the user interface of the expression recognition algorithm. The algorithm is trained using deep learning techniques, enabling it to recognize local and global image features. It performs well under partial occlusion, showing a certain level of robustness. For the gesture modality, an arrayed electrode armlet [46] is integrated to capture the user's gestures (as shown in Figure 5b). A deep neural network is trained to recognize gestures from 8-channel electromyographic (EMG) signals. Users put on the armlet and complete a quick calibration and training process, taking less than a minute to start. We interpret hand waving as a sign of negative emotion and the ‘OK’ gesture as positive. For the pose modality, we use wearable inertial sensors [47] (Figure 5c) to collect motion data from the limbs and joints. We then estimate joint positions and body posture by integrating acceleration and angular velocity. Head shaking is recognized as conveying a negative emotion, while nodding is recognized as positive. For the contact force modality, arrayed flexible tactile sensors [48] (Figure 5d) detect the press area and force based on piezoelectric sensing principles. When the press area and force exceed predefined thresholds, the input is considered a negative emotion, and the probability of negative emotion is linearly related to the press force. The contact force modality can therefore only perform a two-class classification (negative and neutral); we set the probability of positive emotion to zero to align with the other modalities. For the speech modality, speech is first recognized as text using the Microsoft Azure Speech-to-Text tools [23]. Because of the modality delay strategy, sending the recognized text directly to the fusion module would introduce a delay into every conversation round, impacting efficiency and user experience. To avoid this, we establish an admission condition for speech fusion. We define regular expressions for recognizing positive and negative words in the text. If the text matches the positive regular expression, the speech modality meets the admission condition and is identified as a positive emotion. If the text matches the negative regular expression, the speech modality is recognized as a negative emotion and is sent to the fusion module. If the text does not meet the admission condition, it is sent directly to the logic processor for further dialogue.
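A minimal sketch of this admission check is shown below; the English keywords stand in for the system's Chinese regular expressions and are purely illustrative.

```python
import re

# Illustrative English patterns; the deployed system matches Chinese keywords
POSITIVE_PATTERN = re.compile(r"\b(like|love|great|perfect)\b", re.IGNORECASE)
NEGATIVE_PATTERN = re.compile(r"(don't like|dislike|change|another one)", re.IGNORECASE)

def speech_admission(text):
    """Return (emotion, send_to_fusion); emotion is None when not admitted."""
    # Check the negative pattern first: phrases like "don't like" contain "like"
    if NEGATIVE_PATTERN.search(text):
        return "negative", True          # admitted to the fusion module
    if POSITIVE_PATTERN.search(text):
        return "positive", True
    return None, False                   # forwarded directly to the logic processor

print(speech_admission("I don't like this flavor"))   # ('negative', True)
print(speech_admission("Show me a bigger cake"))      # (None, False)
```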
For the visual output modality, we design a virtual cake shop using Unreal Engine, as shown in Figure 5f. Artists model each cake in the cake database. Additionally, we create a human-looking virtual assistant to simulate a salesperson. The assistant can talk with users using natural expressions and body movements. She also manipulates the co-observed environment to showcase cakes from the shopping inventory. The assistant's voice is generated using the Microsoft Azure Text-to-Speech tools [23]. Her expressions are driven by an actor using ARKit [49], and her body movements are recorded using the Vicon motion capture system [50]. For the touch output modality, we use a cup-like device with temperature and vibration feedback (as shown in Figure 5e) [30]. Two Peltier units on the cup simulate the sensation of water temperature, and a vibration motor embedded inside the cup simulates the vibration of ice. The cup is driven by a 14-bit serial control code with low response delay to meet real-time application demands.

2.3.2. Multimodal Fusion

The fusion decisions are employed to determine the user's preference regarding the cakes. Therefore, multimodal fusion only operates after a cake has been recommended and shown in the environment. As mentioned in Section 2.2, the modality weights $w_1, \dots, w_5$ are calculated using Equation (2). Subsequently, we manually add a strength coefficient $\alpha_j$ for each modality based on the strength of interaction intent observed in the real world. The final modality weights and fused decision are formulated as follows:

$$w_j' = \alpha_j w_j$$

$$H = \arg\max \sum_{j=1}^{5} w_j' L_j$$

where $L_j$ is the $j$-th modality result. The strength coefficients corresponding to the speech, expression, gesture, pose, and contact force modalities are 1, 2, 2, 2, and 6, respectively. Additionally, based on our experiments, we set the modality delay threshold in the shopping system to 4 s.
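As a hypothetical worked example of how the strength coefficients tilt the fused decision (the probability values below are invented for illustration, and the code mirrors the RW sketch in Section 2.2):

```python
import numpy as np

ALPHA = {"speech": 1, "expression": 2, "gesture": 2, "pose": 2, "contact_force": 6}

def shop_fusion(decisions):
    """decisions: {modality: probability vector over [negative, positive, neutral]}."""
    names = list(decisions)
    P = np.array([decisions[m] for m in names], dtype=float)
    top = P.max(axis=1)
    w = np.exp(top) / np.exp(top).sum()                # relative weights, Equation (2)
    w_final = np.array([ALPHA[m] for m in names]) * w  # strength-scaled weights w_j' = alpha_j * w_j
    return ["negative", "positive", "neutral"][int(np.argmax(w_final @ P))]

# A firm squeeze of the tactile sensor outweighs a mildly positive expression
print(shop_fusion({"expression":    [0.2, 0.6, 0.2],
                   "contact_force": [0.7, 0.0, 0.3]}))  # -> 'negative'
```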

2.3.3. Situated Multimodal Dialogue

In our task-oriented dialogue system, the situated dialogue dataset adopts a hierarchical and recursive semantic representation for user dialog states and system dialog acts, similar to Cheng's work [51]. Every meaning is rooted at a user or system node to distinguish between the two classes. Non-terminals of the representation include user verbs, system actions, and slots. User verbs represent the predicate of the user intent, such as Ask, Update, and Book. For example, users can Update the recommended cake by specifying an attribute value of the cake (see Turn 2 of Figure 3). System actions represent the system dialog act in response to a user intent. For example, the system can Prompt for a slot value and Show the object that matches the slot values in the environment (see Turn 1 of Figure 3). Slots are categorized into attribute and intent slots, with an example in Figure 6. Attribute slots define the object properties and accept either a categorical label (e.g., “around 500 yuan” for cake price) or an open value (e.g., “450 yuan”). Intent slots specify a set of required slots and the corresponding system actions for each user intent. We employ regular expressions to identify user intents and slot values; regular expression patterns work well in limited-domain conversations. Through extensive user dialogue experiments, we collect user intent keywords and utterance formats and formulate regular expressions for the intent and attribute slots. Figure 7 illustrates an example of the intent recognition process: the regular expression recognizes price keywords and updates the attribute value. In summary, our multimodal dialogue dataset consists of 5 attribute slots, 6 intent slots, 6 system actions, 18 regular expressions, and 60 types of system utterances.
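To make the slot mechanism concrete, below is a simplified sketch of one regular-expression-driven update; the English pattern and slot names are illustrative stand-ins for the Chinese originals.

```python
import re

# Illustrative stand-in for the Chinese "Update price" regular expression
UPDATE_PRICE = re.compile(r"(?:around|about|roughly)\s*(\d+)\s*yuan", re.IGNORECASE)

def parse_utterance(text, dialog_state):
    """Match an intent pattern and update the corresponding attribute slot."""
    m = UPDATE_PRICE.search(text)
    if m:
        dialog_state["intent"] = "Update"
        dialog_state["price"] = f"around_{m.group(1)}"
    return dialog_state

state = {"size": ">11", "price": None, "flavor": "cream"}
print(parse_utterance("I'd like one around 500 yuan", state))
# {'size': '>11', 'price': 'around_500', 'flavor': 'cream', 'intent': 'Update'}
```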
In the cake database, each cake has five attributes: size, price, flavor, theme, and population. Table 2 lists the attribute values for all cakes in the database. Based on the historical attribute values and the updated ones, the system matches (system.Find) the cake in the database that best meets the specified attributes. The cake matching follows a hierarchical attribute matching method. For example, as shown in Turn 3 of Figure 3, the attribute values are [“>11”, “around_500”, “cream”, “old”, “old”], and the attribute importance order is [1, 3, 4, 5, 2]. During the matching process, the system first finds cakes in the database that meet the attribute with the highest importance, denoted as $R_1$: size = “>11”. The results are [“cake018”, “cake019”, …, “cake024”]. Next, the system applies $R_2$: population = “old”, resulting in [“cake018”, “cake022”]. The result of $R_3$: price = “around_500” is [“cake018”]. Since only one cake remains, there is no need to match further attributes, and “cake018” is the final choice. If the results have not been narrowed down to one cake after matching all attributes, the system randomly selects one from the results as the final choice. If the matching result of $R_j$ is empty, the final result is randomly selected from the results of $R_{j-1}$. In our hierarchical attribute matching, the attribute importance order is variable. The initial importance order is manually specified based on experience, that is, [1, 2, 3, 4, 5]. When users update a cake attribute value, the importance order of this attribute is set to one, and the other attributes are adjusted accordingly (see Turn 2 and Turn 3 of Figure 3). If users update multiple attribute values simultaneously, the importance order of these attributes is determined according to the initial order. Additionally, any attributes that have not been updated in the conversation are set to the attribute values of the recommended cake (see Turn 3 of Figure 3).
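A minimal sketch of this hierarchical attribute matching is given below; the database rows are abbreviated and the function is our illustration of the procedure described above, not the system's source code.

```python
import random

def match_cake(cakes, target, importance_order):
    """cakes: list of attribute dicts; target: attribute -> requested value;
    importance_order: attributes sorted from most to least important.
    Filters step by step, falling back to the previous candidate set (R_{j-1})
    whenever a filter (R_j) would leave no cake."""
    candidates = list(cakes)
    for attr in importance_order:
        narrowed = [c for c in candidates if c[attr] == target[attr]]
        if not narrowed:                  # R_j empty: pick randomly from R_{j-1}
            return random.choice(candidates)
        if len(narrowed) == 1:            # unique match: stop early
            return narrowed[0]
        candidates = narrowed
    return random.choice(candidates)      # still ambiguous after all attributes

cakes = [{"id": "cake018", "size": ">11", "population": "old", "price": "around_500"},
         {"id": "cake022", "size": ">11", "population": "old", "price": "around_300"}]
target = {"size": ">11", "population": "old", "price": "around_500"}
print(match_cake(cakes, target, ["size", "population", "price"])["id"])  # cake018
```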

3. Results

This section includes the following tasks: (1) Evaluation of shopping system robustness; (2) Evaluation of the dialogue system; (3) Evaluation of multimodal fusion methods; (4) Reporting user studies. The experimental setup operates on a computer with an NVIDIA GeForce RTX 3090 GPU, an 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50 GHz CPU, and 64.0 GB memory. The test data are obtained by manual labeling.

3.1. Evaluation of System Robustness

Our cake shopping system integrates multiple devices and algorithms, posing a challenge to robustness. We conduct experiments to assess robustness during prolonged shopping. At the beginning of each test round and every hour thereafter, the examiner completes a cake-buying conversation using the shopping system and checks the system's response. The test conversations are randomly selected from the test set. We conduct three test rounds, each lasting 4 h, for a total of 15 test conversations (3 × 5). All 15 conversations function as expected, demonstrating the system's robustness.

3.2. Evaluation of the Dialogue System

We evaluate the dialogue system by performing a corpus-based evaluation [52], using the dialogue system to predict each system action in the test set. Only the speech input modality is used. Two evaluation metrics are employed: Entity Matching Rate and Objective Task Success Rate [52]. We calculate the Entity Matching Rate by determining whether the recommended cake matches the user-specified attributes. The Task Success Rate assesses whether the system answered all the associated information requests from the user (e.g., “What kind of flavors do you have?”). The test set has 27 cake matching tasks and 66 conversation rounds. The results are shown in Table 3. Our system obtains a 100% Entity Matching Rate and a 93.94% Task Success Rate. Occasional failures in system actions occur due to errors in speech recognition. We also perform an ablative study of the cake matching method by testing two alternative variants: ‘No History Value’ and ‘Fixed Attribute Order’. The primary distinction between them lies in the attribute importance order. ‘No History Value’ treats the attribute requested by the user in the current round as the most important; the order of the other attributes remains the default. In the ‘Fixed Attribute Order’ variant, the attribute order is set to the default values and does not change throughout conversations. Table 3 shows a significant decrease in the Entity Matching Rate for both variants. Figure 8 presents some conversation examples from the ablation study, with the differences in system actions bolded. ‘Fixed Attribute Order’ struggles to provide accurate recommendations that meet the specified requirements, and as the number of matching rounds increases, ‘No History Value’ matches more wrong cakes.
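For clarity, the two metrics can be computed roughly as in the sketch below; the record structure (field names and per-round flags) is our assumption about how the test set might be organized.

```python
def evaluate_dialogue(test_records):
    """Each record is assumed to hold the cake the system recommended, the cake implied
    by the user-specified attributes, and a per-round flag for each information request."""
    entity_hits = sum(r["recommended_cake"] == r["reference_cake"] for r in test_records)
    entity_matching_rate = entity_hits / len(test_records)

    round_flags = [flag for r in test_records for flag in r["requests_answered"]]
    task_success_rate = sum(round_flags) / len(round_flags)
    return entity_matching_rate, task_success_rate
```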

3.3. Evaluation of Multimodal Fusion Method

In this section, we evaluate the accuracy of our multimodal fusion method. We first explore the performance of different fusion methods on the test set, comparing our relative weighted fusion (RW) with majority voting (MV) and linear weighted fusion (LW). For MV, the final decision is the one on which the majority of the classifiers agree [17]; when two categories receive an equal number of votes, the final decision is the category with the higher relative probability. LW is similar to RW, but each modality weight is fixed to 1. We conduct tests on a total of 1399 data samples, including 329 two-modality fusions (2-modals), 390 three-modality fusions (3-modals), 326 four-modality fusions (4-modals), and 354 five-modality fusions (5-modals). The fusion accuracy is calculated by comparing the labels with the system decisions. The results are shown in Table 4. The accuracy of RW is much higher than that of MV and LW. We also calculate the fusion accuracy separately for each modality number. As shown in Table 4, as the number of modalities increases, the RW accuracy gradually rises, suggesting the effectiveness of multimodality in enhancing decision quality. For the two-modality fusions, conflicting decisions confuse the system and reduce the fusion accuracy. MV and LW show similar accuracy, achieving their highest accuracy in the 2-modals group; they struggle to effectively integrate multiple modalities.
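For reference, minimal sketches of the two baselines under the same probability-vector convention as the RW sketch in Section 2.2; the tie-breaking detail follows the description above, and the function names are ours.

```python
import numpy as np
from collections import Counter

def majority_voting(decisions):
    """Each modality votes for its top class; a tie goes to the tied class
    backed by the highest single-modality probability."""
    P = np.asarray(decisions, dtype=float)
    votes = P.argmax(axis=1)
    counts = Counter(votes.tolist())
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    if len(tied) == 1:
        return int(tied[0])
    return int(max(tied, key=lambda c: P[votes == c].max()))

def linear_weighted(decisions):
    """Every modality weight fixed to 1: plain sum of the probability vectors."""
    return int(np.argmax(np.asarray(decisions, dtype=float).sum(axis=0)))
```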
Our shopping system is designed for practical use, so we also conduct a user-based evaluation. One examiner participates in the experiment, triggering each input modality device according to the classification results in the test set. For ease of control, the relative probability of the classification result for each modality is required to be greater than 0.8, i.e., $P(q_j) \geq 0.8$. There are 20 fusion tests, comprising 5 two-modality fusions, 5 three-modality fusions, 5 four-modality fusions, and 5 five-modality fusions. The experimental results reveal 95% accuracy, with only one error occurring during the two-modality fusions. These tests demonstrate the advantages of our relative weighted fusion method in both laboratory and practical environments.

3.4. User Study

We conduct user studies to compare our multimodal shopping system (MMS) with a traditional 2D shopping system (2DS). The 2D shopping system is built using the PyQt5 module and shows cake information through text and images (see Figure 9). We invite 10 participants and ask them to use the MMS and 2DS separately to purchase the desired cakes; the order in which the systems are used is randomized. Before the MMS session, participants complete five conversations to become familiar with the system. After each session, participants complete a system evaluation questionnaire consisting of four items: “I can quickly select the cake I want” (Efficiency), “I find the shopping process enjoyable” (Pleasure), “I feel like I am shopping at a real cake store” (Immersion), and “I would be willing to shop in this way” (Likability). Each item is rated on five levels (1 = totally disagree to 5 = totally agree). All participants are students; five of them are male. The average age is 24.4 (SD = 1.35). They are naive to the purpose of the experiment.
Figure 10 shows the mean scores of the four items for the MMS and 2DS. The error bars indicate 95% within-subject confidence intervals. The 2DS has higher scores in terms of Efficiency and Likability, while the MMS has higher scores in terms of Pleasure and Immersion. We perform a paired samples t-test on each item for the two groups. The results show statistically significant differences between the MMS and 2DS in terms of Efficiency (t(10) = 6.708, p < 0.001), Pleasure (t(10) = 6.128, p < 0.001), Immersion (t(10) = 2.400, p = 0.040), and Likability (t(10) = 2.535, p = 0.032). This analysis indicates that our multimodal system offers users immersive shopping experiences, and the multimodal interaction also enhances entertainment. However, compared with 2D text-based shopping, our system is less efficient, which affects how much participants like the multimodal system. We conduct an interview after the experiments to gather their opinions. One key observation is that participants found the integrated devices challenging to use. The need to navigate and coordinate between devices resulted in a perceived lack of efficiency, and the device calibration and donning prior to the experiment added to the sense of complexity.
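The per-item analysis can be reproduced with a standard paired t-test; in the sketch below the score arrays are placeholders rather than the collected questionnaire data.

```python
from scipy import stats

# Placeholder Likert scores for one questionnaire item (10 participants per system)
mms_scores = [3, 4, 2, 3, 3, 4, 3, 2, 3, 4]
s2d_scores = [5, 5, 4, 4, 5, 5, 4, 4, 5, 5]

t_stat, p_value = stats.ttest_rel(mms_scores, s2d_scores)  # paired samples t-test
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```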

4. Discussion

The parallel multimodal integration framework proposed in this paper provides a development paradigm for multimodal interaction systems. It simplifies the MMHCI system implementation process. We successfully applied the framework to a virtual shopping scenario in the cake domain. The experiments confirm the robustness of the shopping system during prolonged shopping sessions. A tree-structured task-oriented dialogue system is used for logic processing in shopping. The experiments show that the dialogue system achieves a high Entity Matching Rate and Task Success Rate, showcasing its effectiveness in understanding user requests and providing relevant responses. Occasional failures in system actions are attributed to speech recognition errors, suggesting that speech recognition improvements could lead to further performance enhancements. The ablative study for the hierarchical attribute matching method emphasizes the superiority of our dialogue system.
In addition, our framework displays strong performance in multimodal fusion tasks. RW achieves higher accuracy than the prevalent MV and LW methods. The gradual increase in RW accuracy with the number of modalities highlights the positive impact of multimodality on decision quality. The user-based evaluation demonstrates the practical advantages of the relative weighted fusion method in real-world scenarios. In the user studies, the comparison between the MMS and 2DS reveals differences in Efficiency, Pleasure, Immersion, and Likability: the MMS offers immersive shopping experiences and enhanced entertainment but is less efficient. The user interviews indicate that the perceived complexity of using the integrated devices in the MMS adversely affected its efficiency.
Furthermore, one limitation of our framework is that its fusion accuracy still lags behind existing fixed-modality fusion algorithms [53]. Our simple and modality-unrestricted fusion method gives the framework flexibility but sacrifices some accuracy. We plan to enhance multimodal fusion accuracy by employing machine learning methods in specific scenarios.
In the future, we will improve our multimodal shopping system. A cake database with customizable parameters would align better with practical cake-buying scenarios. The dialogue scope of the shopping system is limited; combining our dialogue dataset with open-domain dialogue systems [54] could further enhance the naturalness of the interaction. Large language models [55] have achieved remarkable feats, significantly elevating the intelligence of human–computer interactions, but their resource-intensive operation hinders widespread application. Additionally, further advances are needed in lightweight and accurate sensing devices, such as tactile and taste sensors. Constructing a multimodal system that more closely resembles natural human interaction demands ongoing effort.

5. Extension

Our multimodal shopping interface is crafted using 3D modeling technology. As a result, the system can be seamlessly extended to virtual reality (VR) environments (see Figure 11). Utilizing VR technology will take the user's immersion to the next level. In this case, the expression input modality becomes unavailable because the headset causes significant occlusion. An alternative solution is to employ a facial expression recognition algorithm designed for headset wearers [56].

6. Conclusions

We propose a versatile parallel multimodal integration framework for multimodal human–computer interaction systems. The framework employs input and output modality operation flows, allowing each modality to operate in parallel. Each modality has unrestricted resource allocation, reducing development barriers. We utilize a relative weighted fusion method to calculate the final multimodal decision, setting modality weights based on relative probability. The modality delay strategy addresses the issue of inconsistent trigger times and computation delays. Based on the integration framework, we construct a multimodal virtual shopping system in the cake domain, integrating five input modalities and three output modalities. A tree-structured task-oriented dialogue system is used for logic processing. We establish a situated multimodal dialogue dataset comprising 5 attribute slots, 6 intent slots, 6 system actions, 18 regular expressions, and 60 system utterances. We conduct objective and subjective experiments on the shopping system. The results indicate that the system is robust and intelligent and can accurately understand user intent. Tests in both controlled laboratory and practical environments demonstrate high fusion accuracy. User studies show that the multimodal shopping scenario enhances immersion and entertainment but has low shopping efficiency. The user interviews indicate that the perceived complexity of using the integrated devices adversely affected efficiency. This insight suggests that future improvements should prioritize the development of more innovative and user-friendly unimodal devices to streamline the user experience.
In conclusion, our proposed versatile parallel multimodal integration framework for multimodal human–computer interaction systems presents a significant advancement in the field. It can simplify the MMHCI system implementation process and reduce development barriers. Our framework provides several critical guiding principles for researchers developing multimodal systems. The parallel operation flow is crucial, offering flexibility in resource allocation and facilitating a more streamlined development process. Our experiments demonstrate that choosing lightweight, user-friendly multimodal devices is essential for user experience. Developers can draw inspiration from the implementation logic of our shopping systems and develop various multimodal applications based on the framework.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14010299/s1, Video S1: the complete buying process using the developed multimodal shopping system.

Author Contributions

Conceptualization, D.W.; methodology, D.W. and H.F.; software, H.F.; validation, H.F. and Z.T.; formal analysis, H.F. and Z.T.; investigation, Z.T.; resources, D.W.; data curation, H.F.; writing—original draft preparation, H.F.; writing—review and editing, H.F.; visualization, H.F.; supervision, D.W.; project administration, D.W. and H.F.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (grant number 2022YFF0902303), the 2022 major science and technology project “Yuelu·Multimodal Graph-Text-Sound-Semantic Gesture Big Model Research and Demonstration Application” in Changsha (grant number kh2301019), and the Strategic research and consulting project of the Chinese Academy of Engineering (grant number 2023-HY-14).

Institutional Review Board Statement

Ethical review and approval are not applicable to this research. Participants only use some wearable devices to shop online on a computer, which does not cause any harm to humans. Additionally, participants cannot be identified from the information collected.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The test data used in the Results section are publicly available on GitHub at: https://github.com/shanzhajuan/shopping-system-data/ (accessed on 26 November 2023). For the situated multimodal dialogue dataset, we regretfully cannot make it publicly available at this time. The primary reason is that the dataset requires further supplementation and improvement before releasing it to the public.

Acknowledgments

The authors thank the School of Optics and Photonics, Beijing Institute of Technology, for their support. The authors thank the editor and anonymous reviewers for providing helpful suggestions for improving the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jaimes, A.; Sebe, N. Multimodal human-computer interaction: A survey. Comput. Vis. Image Underst. 2007, 108, 116–134. [Google Scholar] [CrossRef]
  2. Dumas, B.; Lalanne, D.; Oviatt, S. Multimodal interfaces: A survey of principles, models and frameworks. In Human Machine Interaction: Research Results of the Mmi Program; Springer: Berlin, Germany, 2009; pp. 3–26. [Google Scholar]
  3. Turk, M. Multimodal interaction: A review. Pattern Recognit. Lett. 2014, 36, 189–195. [Google Scholar] [CrossRef]
  4. Flippo, F.; Krebs, A.; Marsic, I. A Framework for Rapid Development of Multimodal Interfaces. In Proceedings of the 5th International Conference on Multimodal Interfaces, ICMI ’03, Vancouver, BC, Canada, 5–7 November 2003; pp. 109–116. [Google Scholar]
  5. Abdallah, C.; Changyue, S.; Kaibo, L.; Abhinav, S.; Xi, Z. A data-level fusion approach for degradation modeling and prognostic analysis under multiple failure modes. J. Qual. Technol. 2018, 50, 150–165. [Google Scholar]
  6. Kamlaskar, C.; Abhyankar, A. Multimodal System Framework for Feature Level Fusion based on CCA with SVM Classifier. In Proceedings of the 2020 IEEE-HYDCON, Hyderabad, India, 11–12 September 2020; pp. 1–8. [Google Scholar]
  7. Radová, V.; Psutka, J. An approach to speaker identification using multiple classifiers. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21–24 April 1997; Volume 2, pp. 1135–1138. [Google Scholar]
  8. Lucey, S.; Sridharan, S.; Chandran, V. Improved speech recognition using adaptive audio-visual fusion via a stochastic secondary classifier. In Proceedings of the 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, ISIMP 2001, Hong Kong, China, 4 May 2001; pp. 551–554. [Google Scholar]
  9. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  10. Adams, W.; Iyengar, G.; Lin, C.Y.; Naphade, M.R.; Neti, C.; Nock, H.J.; Smith, J.R. Semantic indexing of multimedia content using visual, audio, and text cues. EURASIP J. Adv. Signal Process. 2003, 2003, 1–16. [Google Scholar] [CrossRef]
  11. Pitsikalis, V.; Katsamanis, A.; Papandreou, G.; Maragos, P. Adaptive multimodal fusion by uncertainty compensation. In Proceedings of the INTERSPEECH, Pittsburgh, PA, USA, 17–21 September 2006. [Google Scholar]
  12. Meyer, G.F.; Mulligan, J.B.; Wuerger, S.M. Continuous audio–visual digit recognition using N-best decision fusion. Inf. Fusion 2004, 5, 91–101. [Google Scholar] [CrossRef]
  13. Cutler, R.; Davis, L. Look who’s talking: Speaker detection using video and audio correlation. In Proceedings of the 2000 IEEE International Conference on Multimedia and Expo. ICME2000: Latest Advances in the Fast Changing World of Multimedia, New York, NY, USA, 30 July–2 August 2000; Volume 3, pp. 1589–1592. [Google Scholar]
  14. Strobel, N.; Spors, S.; Rabenstein, R. Joint audio-video object localization and tracking. IEEE Signal Process. Mag. 2001, 18, 22–31. [Google Scholar] [CrossRef]
  15. Zotkin, D.N.; Duraiswami, R.; Davis, L.S. Joint audio-visual tracking using particle filters. EURASIP J. Adv. Signal Process. 2002, 2002, 162620. [Google Scholar] [CrossRef]
  16. Garg, S.N.; Vig, R.; Gupta, S. Multimodal biometric system based on decision level fusion. In Proceedings of the 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), Paralakhemundi, India, 3–5 October 2016; pp. 753–758. [Google Scholar]
  17. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
  18. Neti, C.; Maison, B.; Senior, A.W.; Iyengar, G.; Decuetos, P.; Basu, S.; Verma, A. Joint processing of audio and visual information for multimedia indexing and human-computer interaction. In Proceedings of the RIAO, Paris, France, 12–14 April 2000; pp. 294–301. [Google Scholar]
  19. Donald, K.M.; Smeaton, A.F. A comparison of score, rank and probability-based fusion methods for video shot retrieval. In Proceedings of the International Conference on Image and Video Retrieval, Singapore, 20–22 July 2005; pp. 61–70. [Google Scholar]
  20. Pfleger, N. Context based multimodal fusion. In Proceedings of the 6th International Conference on Multimodal Interfaces, State College, PA, USA, 13–15 October 2004; pp. 265–272. [Google Scholar]
  21. Corradini, A.; Mehta, M.; Bernsen, N.O.; Martin, J.; Abrilian, S. Multimodal input fusion in human-computer interaction. NATO Sci. Ser. Sub Ser. III Comput. Syst. Sci. 2005, 198, 223. [Google Scholar]
  22. Holzapfel, H.; Nickel, K.; Stiefelhagen, R. Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures. In Proceedings of the 6th International Conference on Multimodal Interfaces, State College, PA, USA, 13–15 October 2004; pp. 175–182. [Google Scholar]
  23. Microsoft. Bing Speech API. Available online: https://azure.microsoft.com/en-us/products/ai-services/ai-speech/ (accessed on 18 August 2023).
  24. Tan, T.; Qian, Y.; Hu, H.; Zhou, Y.; Ding, W.; Yu, K. Adaptive very deep convolutional residual network for noise robust speech recognition. IEEE ACM Trans. Audio Speech Lang. Process. 2018, 26, 1393–1405. [Google Scholar] [CrossRef]
  25. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar]
  27. Zhang, S.; Huang, Z.; Paudel, D.P.; Van Gool, L. Facial emotion recognition with noisy multi-task annotations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 21–31. [Google Scholar]
  28. Scavarelli, A.; Arya, A.; Teather, R.J. Virtual reality and augmented reality in social learning spaces: A literature review. Virtual Real. 2021, 25, 257–277. [Google Scholar] [CrossRef]
  29. Hu, M.; Luo, X.; Chen, J.; Lee, Y.C.; Zhou, Y.; Wu, D. Virtual reality: A survey of enabling technologies and its applications in IoT. J. Netw. Comput. Appl. 2021, 178, 102970. [Google Scholar] [CrossRef]
  30. Aziz, K.A.; Luo, H.; Asma, L.; Xu, W.; Zhang, Y.; Wang, D. Haptic handshank—A handheld multimodal haptic feedback controller for virtual reality. In Proceedings of the 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Porto de Galinhas, Brazil, 9–13 November 2020; pp. 239–250. [Google Scholar]
  31. Liu, S.; Cheng, H.; Tong, Y. Physically-based statistical simulation of rain sound. ACM Trans. Graph. TOG 2019, 38, 1–14. [Google Scholar] [CrossRef]
  32. Cheng, H.; Liu, S. Haptic force guided sound synthesis in multisensory virtual reality (VR) simulation for rigid-fluid interaction. In Proceedings of the 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Osaka, Japan, 23–27 March 2019; pp. 111–119. [Google Scholar]
  33. Niijima, A.; Ogawa, T. Study on control method of virtual food texture by electrical muscle stimulation. In Proceedings of the UIST ’16: The 29th Annual ACM Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; pp. 199–200. [Google Scholar]
  34. Ranasinghe, N.; Tolley, D.; Nguyen, T.N.T.; Yan, L.; Chew, B.; Do, E.Y.L. Augmented flavours: Modulation of flavour experiences through electric taste augmentation. Food Res. Int. 2019, 117, 60–68. [Google Scholar] [CrossRef]
  35. Frediani, G.; Carpi, F. Tactile display of softness on fingertip. Sci. Rep. 2020, 10, 20491. [Google Scholar] [CrossRef]
  36. Chen, T.; Pan, Z.G.; Zheng, J.M. Easymall-an interactive virtual shopping system. In Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Jinan, China, 18–20 October 2008; Volume 4, pp. 669–673. [Google Scholar]
  37. Speicher, M.; Cucerca, S.; Krüger, A. VRShop: A mobile interactive virtual reality shopping environment combining the benefits of on-and offline shopping. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2017, 1, 1–31. [Google Scholar] [CrossRef]
  38. Ricci, M.; Evangelista, A.; Di Roma, A.; Fiorentino, M. Immersive and desktop virtual reality in virtual fashion stores: A comparison between shopping experiences. Virtual Real. 2023, 27, 2281–2296. [Google Scholar] [CrossRef]
  39. Schnack, A.; Wright, M.J.; Holdershaw, J.L. Immersive virtual reality technology in a three-dimensional virtual simulated store: Investigating telepresence and usability. Food Res. Int. 2019, 117, 40–49. [Google Scholar] [CrossRef]
  40. Wasinger, R.; Krüger, A.; Jacobs, O. Integrating intra and extra gestures into a mobile and multimodal shopping assistant. In Proceedings of the International Conference on Pervasive Computing, Munich, Germany, 8–13 May 2005; pp. 297–314. [Google Scholar]
  41. Moon, S.; Kottur, S.; Crook, P.A.; De, A.; Poddar, S.; Levin, T.; Whitney, D.; Difranco, D.; Beirami, A.; Cho, E.; et al. Situated and interactive multimodal conversations. In Proceedings of the 28th International Conference on Computational Linguistics, Virtual, 8–13 December 2020; pp. 1103–1121. [Google Scholar]
  42. Cutugno, F.; Leano, V.A.; Rinaldi, R.; Mignini, G. Multimodal Framework for Mobile Interaction. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI ’12, Capri Island, Italy, 21–25 May 2012; pp. 197–203. [Google Scholar]
  43. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  44. Attig, C.; Rauh, N.; Franke, T.; Krems, J.F. System latency guidelines then and now–is zero latency really considered necessary? In Proceedings of the Engineering Psychology and Cognitive Ergonomics: Cognition and Design: 14th International Conference, EPCE 2017, Vancouver, BC, Canada, 9–14 July 2017; pp. 3–14. [Google Scholar]
  45. Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 2018, 28, 2439–2450. [Google Scholar] [CrossRef] [PubMed]
  46. Geng, W.; Du, Y.; Jin, W.; Wei, W.; Hu, Y.; Li, J. Gesture recognition by instantaneous surface EMG images. Sci. Rep. 2016, 6, 65–71. [Google Scholar] [CrossRef] [PubMed]
  47. Wang, J.; Xu, S.Q.; Cheng, N.; You, Y.; Zhang, X.; Tang, Z.; Yang, X. Orientation Estimation Algorithm for Motion Based on Multi-Sensor. CSA 2015, 24, 134–139. [Google Scholar]
  48. Chuang, C.H.; Wang, M.S.; Yu, Y.C.; Mu, C.L.; Lu, K.F.; Lin, C.T. Flexible tactile sensor for the grasping control of robot fingers. In Proceedings of the 2013 International Conference on Advanced Robotics and Intelligent Systems, Tainan, Taiwan, 31 May–2 June 2013; pp. 141–146. [Google Scholar]
  49. Apple Inc. ARKit: Tracking and Visualizing Faces. 2017. Available online: https://developer.apple.com/documentation/arkit/arkit_in_ios/content_anchors/tracking_and_visualizing_faces (accessed on 28 August 2023).
  50. Vicon Motion Systems Ltd., UK. Vicon. 1984. Available online: https://www.vicon.com/ (accessed on 10 January 2020).
  51. Cheng, J.; Agrawal, D.; Martínez Alonso, H.; Bhargava, S.; Driesen, J.; Flego, F.; Kaplan, D.; Kartsaklis, D.; Li, L.; Piraviperumal, D.; et al. Conversational Semantic Parsing for Dialog State Tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual, 16–20 November 2020; pp. 8107–8117. [Google Scholar]
  52. Wen, T.H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L.M.; Su, P.H.; Ultes, S.; Young, S. A Network-based End-to-End Trainable Task-oriented Dialogue System. arXiv 2016, arXiv:1604.04562. [Google Scholar]
  53. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities. Knowl. Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
  54. Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H.; Wu, W.; Wu, Z.; Guo, Z.; Lu, H.; Huang, X.; et al. PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 107–118. [Google Scholar]
  55. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  56. Dey, A.; Barde, A.; Yuan, B.; Sareen, E.; Dobbins, C.; Goh, A.; Gupta, G.; Gupta, A.; Billinghurst, M. Effects of interacting with facial expressions and controllers in different virtual environments on presence, usability, affect, and neurophysiological signals. Int. J. Hum. Comput. Stud. 2022, 160, 102762. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed parallel multimodal integration framework.
Figure 1. Overview of the proposed parallel multimodal integration framework.
Applsci 14 00299 g001
Figure 2. Overview of the multimodal virtual shopping system. The modalities, devices, raw data, and other elements used in the shopping system are listed.
Figure 2. Overview of the multimodal virtual shopping system. The modalities, devices, raw data, and other elements used in the shopping system are listed.
Applsci 14 00299 g002
Figure 3. A conversation example using the cake shopping system. User and system utterances are marked in red and blue, respectively. We use dots to represent tree edges and an increased indentation level to reveal multiple children attached to the same parent node. The environment response column shows the recommended cakes in the interactive environment.
Figure 4. A scene of a user interacting with the multimodal shopping system. We recommend readers watch Supplementary Video S1 to learn more about how the shopping system works.
Figure 5. The appearance and user interface of the integrated devices and software in the shopping system: (a) The interface of expression recognition algorithms. (b) The arrayed electrode armlet for gesture input modality. (c) The inertial sensing units used in pose capture devices. (d) The appearance of the arrayed flexible tactile sensors. (e) The vibration and temperature feedback cup. (f) The virtual store interface.
Figure 6. Examples of attribute slots and intent slots.
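To make the slot representation in Figure 6 concrete, the sketch below shows one plausible way to model attribute slots and intent slots in Python. The slot names (flavor, price, size, theme, population) follow the cake attributes in Table 2, while the class names, field names, and example values are illustrative assumptions rather than the system's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AttributeSlots:
    """Attribute slots describing the cake a user is asking for (illustrative)."""
    flavor: Optional[str] = None      # e.g., "cream", "mousse", "chocolate"
    price: Optional[int] = None       # price constraint, if mentioned
    size: Optional[str] = None        # e.g., "4", "7", "over 25"
    theme: Optional[str] = None       # e.g., "Valentine's Day"
    population: Optional[str] = None  # e.g., "kids", "couple", "old people"

@dataclass
class IntentSlot:
    """Intent slot holding the recognized user intent and its arguments (illustrative)."""
    name: str                          # e.g., "Update_price", "Recommend_cake"
    arguments: dict = field(default_factory=dict)

# Example: "I want a cheaper mousse cake" could fill the slots as follows.
state = AttributeSlots(flavor="mousse")
intent = IntentSlot(name="Update_price", arguments={"direction": "lower"})
```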
Figure 7. An example of the intent recognition process. User utterances are marked in red. The regular expression for Update price is shown in the box.
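Since the exact regular expression from Figure 7 is not reproduced here, the following sketch only illustrates the general idea of regex-driven intent recognition. The pattern, intent label, and helper function are assumptions for illustration, not the system's implementation.

```python
import re

# Hypothetical pattern for an "Update price" intent; the actual regular
# expression shown in Figure 7 may differ.
UPDATE_PRICE_PATTERN = re.compile(
    r"(cheaper|less expensive|lower the price|under \d+ yuan|within \d+ yuan)",
    re.IGNORECASE,
)

def recognize_intent(utterance: str) -> str:
    """Return a coarse intent label for a user utterance (illustrative)."""
    if UPDATE_PRICE_PATTERN.search(utterance):
        return "Update_price"
    return "Other"

print(recognize_intent("Could you show me something cheaper?"))  # -> Update_price
```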
Figure 8. Conversation examples from the ablation study of the cake matching method. User utterances are marked in red, and differences in system actions are bolded.
Figure 9. Two-dimensional text-based shopping interface.
Figure 10. The mean scores of four items in the multimodal (MMS) and 2D shopping systems (2DS). The error bars indicate 95% within-subject confidence intervals.
Figure 11. The user interacts with the multimodal shopping system in the virtual reality environment.
Table 1. Modalities used in multimodal human–computer interaction systems, based on Turk’s research [3].
| Modality | Example |
| --- | --- |
| Visual | expression; gaze; face-based identity (age, sex, race, etc.); gesture; body pose; virtual environment *; virtual agent *; virtual body (induce the ownership illusion) * |
| Auditory | speech; non-speech audio (clapping sound, environment noise, etc.) |
| Touch | contact force (contact area, value, etc.); tactile sense * |
| Other sensors | temperature; smell *; taste * |

* Generally used as an output modality.
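As a minimal illustration of how the modality taxonomy in Table 1 could be encoded when registering devices with a framework of this kind, the enum and dataclass below are assumptions for exposition, not the framework's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Direction(Enum):
    INPUT = auto()
    OUTPUT = auto()

@dataclass(frozen=True)
class Modality:
    name: str
    channel: str          # "visual", "auditory", "touch", or "other"
    direction: Direction

# A few of the modalities listed in Table 1, tagged by channel and direction.
MODALITIES = [
    Modality("expression", "visual", Direction.INPUT),
    Modality("gesture", "visual", Direction.INPUT),
    Modality("speech", "auditory", Direction.INPUT),
    Modality("tactile sense", "touch", Direction.OUTPUT),
    Modality("temperature", "other", Direction.OUTPUT),
]
```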
Table 2. The attribute values for all cakes in the database.
| Number | Name | Size | Price | Flavor | Theme | Population |
| --- | --- | --- | --- | --- | --- | --- |
| cake001 | Teddy Bear | 1 | 39 | cream | normal | normal |
| cake002 | Tiramisu | 1 | 29 | mousse | normal | normal |
| cake003 | Black Forest | 1 | 29 | chocolate | normal | normal |
| cake004 | Roseberry Girl | 1 | 25 | cream | normal | normal |
| cake005 | Grape Bobo | 4 | 289 | mousse | normal | normal |
| cake006 | Star Wish | 4 | 269 | cheese | normal | normal |
| cake007 | Let’s Dance | 4 | 239 | cream | normal | normal |
| cake008 | Candy House | 3 | 269 | chocolate | Children’s day | kids |
| cake009 | Confession of Love | 4 | 269 | mousse | Valentine’s Day | couple |
| cake010 | Happy Growth | 7 | 348 | milk fat | Children’s day | kids |
| cake011 | Sweet Story | 7 | 348 | chocolate | normal | normal |
| cake012 | Happiness and Longevity | 7 | 328 | milk fat | old people birthday | old people |
| cake013 | Overjoyed | 7 | 318 | cream | Valentine’s Day | couple |
| cake014 | Mr. Charming | 7 | 348 | mousse | Valentine’s Day | couple |
| cake015 | Fruit Windmill | 11 | 428 | mousse | normal | normal |
| cake016 | Peach Holding Sun | 11 | 408 | cream | old people birthday | old people |
| cake017 | Carefree | 11 | 428 | milk fat | Children’s day | kids |
| cake018 | Elegant | 15 | 518 | cream | old people birthday | old people |
| cake019 | Cream Cheese | 15 | 568 | cheese | normal | normal |
| cake020 | Chocolate Encounter | 15 | 568 | chocolate | normal | normal |
| cake021 | Princess Party | 25 | 758 | milk fat | Children’s day | kids |
| cake022 | Warm Wishes | 25 | 699 | cream | old people birthday | old people |
| cake023 | Rose Bloom | 25 | 718 | cheese | normal | normal |
| cake024 | Wonderful Life | over 25 | 2098 | cream | wedding | couple |
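To illustrate how a cake database like Table 2 might be queried during a dialogue, the snippet below filters a few of its entries by attribute. The record format, field names, and the filter function are illustrative assumptions, not the system's implementation.

```python
# A few entries from Table 2, stored as plain dictionaries (illustrative).
CAKES = [
    {"number": "cake009", "name": "Confession of Love", "size": "4", "price": 269,
     "flavor": "mousse", "theme": "Valentine's Day", "population": "couple"},
    {"number": "cake013", "name": "Overjoyed", "size": "7", "price": 318,
     "flavor": "cream", "theme": "Valentine's Day", "population": "couple"},
    {"number": "cake021", "name": "Princess Party", "size": "25", "price": 758,
     "flavor": "milk fat", "theme": "Children's day", "population": "kids"},
]

def match_cakes(cakes, **constraints):
    """Return cakes whose attributes equal every given constraint (illustrative)."""
    return [c for c in cakes
            if all(c.get(key) == value for key, value in constraints.items())]

print([c["name"] for c in match_cakes(CAKES, theme="Valentine's Day")])
# -> ['Confession of Love', 'Overjoyed']
```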
Table 3. Ablation study for the cake matching method.
| Method | Entity Matching Rate (%) | Task Success Rate (%) |
| --- | --- | --- |
| Ours | 100 | 93.94 |
| No History Value | 74.07 | - |
| Fixed Attribute Order | 14.81 | - |
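The "No History Value" row in Table 3 suggests that matching relies on attribute values carried over from earlier turns. The sketch below shows one plausible way such carry-over could work; the function name, slot dictionaries, and merging logic are assumptions for illustration rather than the paper's method.

```python
def merge_with_history(current_slots: dict, history_slots: dict) -> dict:
    """Fill attributes the user did not repeat this turn from earlier turns (illustrative)."""
    merged = dict(history_slots)
    # Values mentioned in the current turn override the remembered ones.
    merged.update({k: v for k, v in current_slots.items() if v is not None})
    return merged

history = {"theme": "Valentine's Day", "flavor": "mousse"}
current = {"price": 300, "flavor": None}
print(merge_with_history(current, history))
# -> {'theme': "Valentine's Day", 'flavor': 'mousse', 'price': 300}
```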
Table 4. Ablation study for the multimodal fusion method. We compare our relative weighted fusion (RW) with majority voting (MV) and linear weighted fusion (LW) in terms of fusion accuracy. "2-Modals" denotes two-modality fusion experiments, and likewise for 3-Modals, 4-Modals, and 5-Modals.
| Method | Overall ACC | 2-Modals ACC | 3-Modals ACC | 4-Modals ACC | 5-Modals ACC |
| --- | --- | --- | --- | --- | --- |
| RW | 0.844 | 0.763 | 0.790 | 0.854 | 0.968 |
| MV | 0.684 | 0.721 | 0.677 | 0.641 | 0.698 |
| LW | 0.686 | 0.709 | 0.690 | 0.669 | 0.677 |
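For readers unfamiliar with the decision-level baselines in Table 4, the sketch below implements plain majority voting and linear weighted fusion over per-modality label scores. The paper's relative weighted fusion method is not reproduced here, and the variable names, example labels, and weights are illustrative assumptions.

```python
from collections import Counter

def majority_vote(labels):
    """Pick the label predicted by the most modalities (illustrative MV baseline)."""
    return Counter(labels).most_common(1)[0][0]

def linear_weighted_fusion(scores_per_modality, weights):
    """Sum per-label confidence scores weighted by fixed modality weights
    and return the best label (illustrative LW baseline)."""
    fused = {}
    for scores, w in zip(scores_per_modality, weights):
        for label, s in scores.items():
            fused[label] = fused.get(label, 0.0) + w * s
    return max(fused, key=fused.get)

# Three modalities voting on an interaction intent (hypothetical values).
print(majority_vote(["grab", "grab", "point"]))  # -> grab
print(linear_weighted_fusion(
    [{"grab": 0.7, "point": 0.3}, {"grab": 0.4, "point": 0.6}, {"grab": 0.9, "point": 0.1}],
    weights=[0.5, 0.2, 0.3]))                    # -> grab
```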