Image-Based Vocabulary Learning Through Conversational Robots: An Application Using Tapia Robot

Wu, Tsui-Hua; Kiệt, Lê Anh; Hwang, I-Shyan

doi:10.3390/engproc2026141008

Open Access Proceeding Paper

Tsui-Hua Wu ¹, Lê Anh Kiệt ² and I-Shyan Hwang ²,*

¹ Department of Foreign Languages & Applied Linguistics, Yuan Ze University, Taoyuan 320, Taiwan
² Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 320, Taiwan
* Author to whom correspondence should be addressed.

Presented at the 9th Eurasian Conference on Educational Innovation 2026 (ECEI 2026), Da Nang City, Vietnam, 30 January–2 February 2026.

Eng. Proc. 2026, 141(1), 8; https://doi.org/10.3390/engproc2026141008

Published: 9 June 2026


Abstract

We applied conversational robots to language learning, building on a previously developed Japanese repeating (shadowing) system for beginners in an applied foreign languages program at a university in northern Taiwan. The previous system, designed to support after-class practice, served as the foundation for the present project. In this study, the system is extended to create an image-based vocabulary learning tool for Japanese, Mandarin Chinese, and Vietnamese. We describe the design concepts, the integration of visual prompts, and the potential of conversational agents to enhance multilingual vocabulary acquisition. To evaluate the system, a group of student participants tested the prototype and provided feedback on usability, learning support, and overall performance.

Keywords: conversational robots; human–robot interaction; educational technology; image-based vocabulary learning; computer-assisted language learning

1. Introduction

In foreign language education, repeated practice and meaningful interaction are essential for developing learners’ listening and speaking abilities. Nevertheless, in many university-level language classrooms, particularly at the introductory level, large class sizes and limited instructional time constrain opportunities for individualized oral practice and immediate feedback. As a result, learners often rely on after-class self-study to supplement in-class instruction. However, such self-directed learning activities frequently lack interactivity and sustained motivation, which may reduce their effectiveness in supporting oral language development.

To address these challenges, various forms of technology-enhanced language learning have been proposed and developed since the 1980s, including Computer-Assisted Language Learning, Technology-Enhanced Language Learning, and Mobile-Assisted Language Learning. These approaches extend learning beyond the classroom and promote learner autonomy by providing flexible access to practice materials. More recently, advances in artificial intelligence, speech recognition, and interactive media have enabled the use of augmented reality (AR), virtual reality (VR), and conversational agents to create immersive language learning environments, though novelty effects may influence learner perceptions [1]. Empirical studies suggest that AR-based learning environments can enhance learner engagement. Such approaches have also been discussed in the context of integrated Japanese language education in global and cross-cultural learning environments [2]. Applications of AR have also been explored in printed and digital teaching materials, demonstrating potential benefits for instructional support.

Despite their pedagogical potential, AR and VR applications require specialized hardware, substantial development costs, and technical expertise, which limit their scalability and adoption in typical university classrooms. In contrast, conversational robots can enhance learner engagement through embodied social interaction, with studies showing that social rewards from agents positively influence learning outcomes [3]. By combining physical presence, speech-based interaction, and affective feedback, conversational robots can support language practice in a natural and less intimidating manner, thereby helping learners reduce anxiety and increase their willingness to engage in spoken interaction.

In previous studies, we introduced an AI-based conversational robot, Tapia, as a learning assistant for Japanese language education and implemented a shadowing (repeating) practice system on it for beginning learners. Shadowing, which involves closely repeating spoken input in real time, has been shown to enhance pronunciation accuracy, listening comprehension, and speaking fluency [4]. Subsequent studies have demonstrated the instructional effectiveness of shadowing as a structured listening and speaking practice method [1].

Classroom observations and questionnaire results from pilot implementations indicated that learners responded positively to Tapia’s immediate feedback, encouraging verbal responses, and non-threatening interaction style, which collectively contributed to reduced anxiety and increased motivation to practice oral skills. These findings align with prior research indicating that learners at different proficiency levels perceive shadowing as beneficial for listening and speaking development [5]. Similar applications of AR have been explored in printed and digital teaching materials, demonstrating potential benefits for instructional support [6].

Building on these previous results, the present study extends the Tapia-based learning system by integrating visual prompts into an image-based vocabulary learning framework. By combining images with spoken input and conversational interaction, the system supports multilingual vocabulary acquisition in Japanese, Mandarin Chinese, and Vietnamese. Visual cues facilitate meaning construction, particularly for beginner learners, while the conversational robot serves as a consistent practice partner, providing repetition, feedback, and motivational support. This paper presents the design of the image-based vocabulary learning system, describes its integration with the Tapia robot, and discusses its potential to enhance autonomous and interactive language learning in multilingual educational contexts.

2. Learning Framework

We adopted a design-based research approach to extend an existing conversational robot–assisted shadowing system into an image-based vocabulary learning framework. The developed system is implemented on the Tapia conversational robot and aims to support beginner-level learners of Japanese, Mandarin Chinese, and Vietnamese through multimodal input, repeated oral practice, and interactive feedback. This section describes the design rationale, system architecture, and interaction flow of the proposed learning framework.

2.1. Design

Vocabulary acquisition is a fundamental component of foreign language learning, particularly at the beginner level. However, novice learners often experience difficulties in establishing form–meaning connections when exposure is limited to auditory input alone. In previous studies, the Tapia robot was employed to support pronunciation and listening development through a shadowing-based practice system. While effective in promoting oral fluency, this approach provided limited semantic support for learners encountering new lexical items. To address this issue, the present system integrates visual prompts into the robot-assisted learning process.

The design adopts multimedia learning theory, which suggests that combining visual and verbal information can facilitate cognitive processing and enhance memory retention. By presenting images with spoken vocabulary items, the system supports meaning construction and reduces cognitive load for beginner learners. This multimodal approach is beneficial in multilingual learning contexts, where learners may have varying levels of prior linguistic knowledge.

2.2. System Architecture

The image-based vocabulary learning system consists of three main components: (1) a visual presentation module, (2) a speech interaction module, and (3) a feedback and control module (Figure 1).

The visual presentation module displays images on the built-in robot screen to introduce target vocabulary items. Images are selected to represent concrete objects, actions, or commonly encountered concepts, minimizing ambiguity and facilitating immediate comprehension. The speech interaction module employs text-to-speech (TTS) technology to provide model pronunciations of target vocabulary items. Learners are prompted to view the displayed image and verbally produce the corresponding word in the target language, which is then captured and processed using speech-to-text (STT) technology. This enables the system to recognize learners’ utterances and determine subsequent feedback.

The feedback and control module manages the learning sequence and delivers verbal responses based on recognition results. Rather than using numerical scores, the system provides qualitative and affective feedback, such as encouragement or repetition prompts. This design choice aims to reduce learners’ anxiety and foster a supportive learning environment conducive to repeated practice.

2.3. Interaction Flow

The learning interaction follows a structured yet flexible sequence. First, learners initiate the activity through a spoken command. The robot then presents an image corresponding to a target vocabulary item and produces the associated pronunciation. Learners repeat the item, and the system processes the utterance in real time. Based on recognition outcomes, the robot provides immediate feedback and guides learners to either proceed to the next item or repeat the current one. This interaction cycle is repeated as many times as needed, allowing learners to control the pace and frequency of practice. Such learner-controlled interaction supports autonomous learning while maintaining a conversational and non-threatening practice environment.
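The interaction cycle described above can be sketched as a simple control loop. The following Python fragment is only an illustration under stated assumptions, not the implementation running on Tapia: the function name run_session, the injected callables show_image, speak, and listen, the exact-match check, and the max_attempts cap are all hypothetical. In particular, the paper lets learners decide when to move on, whereas this sketch advances automatically after a fixed number of attempts.

```python
from typing import Callable, Iterable, List, Tuple


def run_session(
    items: Iterable[Tuple[str, str]],      # (image_path, target_word) pairs
    show_image: Callable[[str], None],     # hypothetical display hook
    speak: Callable[[str], None],          # hypothetical TTS hook
    listen: Callable[[], str],             # hypothetical STT hook, returns a transcript
    max_attempts: int = 3,                 # assumed cap, not specified in the paper
) -> List[Tuple[str, int, bool]]:
    """Run one practice session and return (word, attempts, recognized) per item."""
    log: List[Tuple[str, int, bool]] = []
    for image_path, word in items:
        show_image(image_path)   # visual prompt first
        speak(word)              # then the model pronunciation
        attempts = 0
        recognized = False
        while not recognized and attempts < max_attempts:
            attempts += 1
            # Simplistic exact match between transcript and target word.
            recognized = listen().strip() == word
            if not recognized and attempts < max_attempts:
                speak(word)      # replay the model before the next try
        log.append((word, attempts, recognized))
    return log
```

Passing the display, speech output, and speech input as callables keeps the loop independent of any particular robot API, so the same control flow could be exercised with simple stand-ins during development.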

2.4. Multilingual Design Considerations

In contrast to the earlier shadowing system, which focused solely on Japanese, the present framework is designed to accommodate multiple languages. Japanese, Mandarin Chinese, and Vietnamese were selected to reflect the linguistic diversity of the applied foreign languages curriculum. The system architecture enables the integration of language-specific vocabulary sets and speech resources without modifying the overall interaction structure. This design demonstrates the scalability and adaptability of the proposed framework for multilingual vocabulary learning.

3. Prototype Implementation

To operationalize the proposed image-based vocabulary learning framework, a functional prototype was implemented using a conversational robot platform. The prototype integrates hardware components for embodied interaction with software modules for speech processing, visual presentation, and learning control. This section describes the hardware configuration and software implementation of the system.

3.1. Hardware Platform

The prototype is implemented on the conversational robot Tapia, which is designed for home and educational use. Tapia features a compact, mascot-style physical design intended to reduce user anxiety and promote approachability. The robot is equipped with the following hardware components: (1) a built-in display screen used to present visual prompts, including images associated with target vocabulary items, (2) a microphone and speaker system that supports bidirectional speech interaction, (3) an embedded camera (not utilized in the current prototype), and (4) wireless network connectivity for cloud-based speech processing and content management. The physical embodiment of the robot enables face-to-face interaction and provides a sense of co-presence that distinguishes the system from screen-based language learning applications (Figure 2). The display screen allows visual information to be tightly synchronized with spoken input, supporting multimodal learning.

3.2. Software Architecture

The software architecture of the prototype follows a modular design, ensuring flexibility and extensibility. The system consists of four main software modules: (1) a vocabulary content module, (2) a speech processing module, (3) a dialogue control module, and (4) a user interface module.

The vocabulary content module stores language-specific learning materials, including vocabulary items, corresponding images, and textual representations. Separate content sets are prepared for Japanese, Mandarin Chinese, and Vietnamese. This modular structure enables the addition of new languages or vocabulary sets without modifying the core system logic. The speech processing module integrates TTS and STT functionalities. TTS is used to generate model pronunciations of target vocabulary items, while STT converts learners’ spoken responses into textual form for comparison and feedback generation. Cloud-based speech recognition services are utilized to enhance recognition accuracy across various languages.
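The comparison between the STT transcript and the target item is not specified in detail. One plausible approach, sketched below in Python under that assumption, is to normalize both strings before testing for equality so that differences in character width, letter case, spacing, and punctuation do not cause spurious mismatches. The function names normalize and matches_target are hypothetical and do not come from the prototype.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and case folding, then drop spaces and punctuation."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(
        ch
        for ch in folded
        # Unicode general categories starting with "P" are punctuation.
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def matches_target(transcript: str, target: str) -> bool:
    """Return True when the transcript equals the target after normalization."""
    return normalize(transcript) == normalize(target)
```

This kind of matching only removes surface differences. It would not, for example, equate a Japanese word returned in kanji with the same word stored in kana, so a practical system would likely need reading-level or phonetic comparison for some languages.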

The dialogue control module manages interaction flow and system states. It determines the sequence of actions, including image presentation, speech output, response processing, and feedback delivery. The module also handles repetition requests and transitions between vocabulary items based on learner input. The user interface module controls visual output on the robot’s screen. Images are displayed in full-screen format to maintain learner attention and minimize distraction. Textual information is kept to a minimum, particularly for beginner learners, to avoid cognitive overload.

3.3. Learning Content Preparation

Vocabulary used in the prototype is selected from beginner-level instructional materials commonly used in applied foreign languages courses. Priority is given to high-frequency, concrete vocabulary items that can be clearly represented through images, such as everyday objects, food items, and basic actions. Each vocabulary entry consists of three elements: an image, a target word or phrase, and a corresponding audio output generated through TTS. Images are curated to be culturally neutral and visually unambiguous, ensuring that learners can infer meaning without additional explanation. This design supports intuitive learning and reduces reliance on translation (Figure 3).
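The three-element entry structure and the separate content sets per language could be represented as follows. This Python sketch is an assumption about one reasonable data layout, not the prototype's actual storage format: the class VocabularyEntry, the dictionary CONTENT_SETS, the function load_entries, the file path, the locale tags, and the sample words (all meaning "apple") are hypothetical and are not taken from the study materials. Audio is modeled here as generated on demand from the text and a TTS locale rather than stored as a file.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VocabularyEntry:
    image_path: str   # picture shown on the robot screen
    text: str         # target word or phrase
    tts_locale: str   # locale passed to TTS to synthesize the model pronunciation


# Hypothetical content sets keyed by language code; one sample entry each.
CONTENT_SETS: Dict[str, List[VocabularyEntry]] = {
    "ja": [VocabularyEntry("images/apple.png", "りんご", "ja-JP")],
    "zh": [VocabularyEntry("images/apple.png", "蘋果", "zh-TW")],
    "vi": [VocabularyEntry("images/apple.png", "quả táo", "vi-VN")],
}


def load_entries(language: str) -> List[VocabularyEntry]:
    """Return the content set for a language code, or raise if none is registered."""
    try:
        return CONTENT_SETS[language]
    except KeyError:
        raise ValueError(f"no content set registered for {language!r}") from None
```

Under this layout, supporting an additional language amounts to registering another key in the dictionary, which is consistent with the stated goal of adding content without changing the core logic.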

3.4. Interaction and Feedback Implementation

During system operation, learners initiate the learning session through a predefined voice command. The robot then presents an image and pronounces the associated vocabulary item. Learners repeat the pronunciation, and their speech is captured and processed by the STT module. Based on recognition results, the system generates immediate verbal feedback. Instead of assigning numerical scores, the prototype employs qualitative feedback strategies, such as encouraging remarks or prompts to repeat the item. This approach maintains learner motivation and reduces the pressure associated with performance evaluation. Learners can repeat each vocabulary item multiple times or proceed to the next item at their own pace. This design supports self-regulated learning and accommodates individual differences in learning speed and confidence.
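The score-free feedback strategy can be illustrated with a small selection function. The Python sketch below is hypothetical: the message texts, the FEEDBACK table, and the function choose_feedback are invented for illustration, and the actual prototype may phrase and select its responses differently, including in the target languages.

```python
from typing import Dict, List

# Hypothetical message pools; none of the messages contains a numerical score.
FEEDBACK: Dict[str, List[str]] = {
    "praise": ["Great job!", "That sounded good!"],
    "retry": [
        "Nice try. Listen once more and say it again.",
        "Almost there. Let's repeat it together.",
    ],
}


def choose_feedback(recognized: bool, attempt: int) -> str:
    """Pick a qualitative message; attempt is 1-based and only rotates the wording."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    pool = FEEDBACK["praise" if recognized else "retry"]
    return pool[(attempt - 1) % len(pool)]
```

Rotating through a small pool avoids repeating the same sentence on consecutive attempts while keeping every response encouraging, whether or not the utterance was recognized.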

4. System Evaluation

To assess the usability and perceived learning support of the proposed image-based vocabulary learning system, a pilot evaluation was designed focusing on learner interaction, system usability, and subjective learning experience.

4.1. Participants

Participants were undergraduate students enrolled in introductory foreign language courses at a northern Taiwan university (Yuan Ze University). The target population consisted primarily of beginner-level learners with limited prior exposure to Japanese, Mandarin Chinese (as a second language), or Vietnamese. Participation was voluntary, and all participants were informed of the study’s purpose before the evaluation of the system. Because the system was designed for beginners, no advanced language proficiency was required. Basic familiarity with spoken interaction was considered sufficient for participation.

4.2. Evaluation Procedure

The evaluation was conducted in a controlled classroom or laboratory setting to ensure consistent interaction conditions. Each participant interacted individually with the Tapia-based prototype to minimize audio interference and facilitate focused observation of the learner–robot interaction. The evaluation procedure consisted of the following steps.

Orientation: The participants received a brief introduction explaining the system’s purpose and basic instructions on how to initiate interaction with the robot. No explicit training on vocabulary content was provided in advance;
Practice session: The participants engaged in an image-based vocabulary learning session lasting approximately 5–10 min. During the session, the robot presented images and corresponding spoken vocabulary items, and participants were asked to repeat each item aloud. Participants were allowed to control the pace of interaction and repeat items as needed;
Completion and feedback: After completing the practice session, participants were asked to reflect on their experience and complete a questionnaire evaluating the system.

The evaluation was conducted to assess natural interaction and autonomous learning rather than task performance under time pressure. No formal language proficiency test was administered during this phase, as the primary goal was to assess usability and perceived learning support.

4.3. Questionnaire

A post-session questionnaire was administered to collect participants’ subjective evaluations. The questionnaire consisted of Likert-scale items and open-ended questions. Likert-scale items were rated on a five-point scale ranging from “strongly disagree” to “strongly agree,” and open-ended questions allowed participants to provide qualitative feedback and suggestions for system improvement. The questionnaire addressed the following dimensions:

Perceived ease of use of the system;
Clarity and usefulness of image-based vocabulary presentation;
Perceived support for pronunciation and vocabulary learning;
Level of engagement and motivation during interaction;
Overall satisfaction with the learning experience.

During the evaluation sessions, researchers recorded observational notes focusing on participants’ interaction behaviors, including hesitation, repetition frequency, engagement level, and reactions to system feedback. These observations were used to supplement questionnaire data and to identify potential usability issues not captured through self-report measures.

4.4. Data Analysis

Multiple data sources were used to capture participants’ experiences with the system. Quantitative data from the questionnaires were summarized using descriptive statistics to identify general trends in user perceptions. Qualitative responses and observational notes were analyzed thematically to extract recurring patterns related to usability, learning support, and interaction experience. Given the exploratory nature of this study, the analysis was conducted to identify strengths and areas for improvement rather than establishing causal relationships. The results can be used for future system refinement and the design of larger-scale experimental studies.
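As an illustration of the descriptive summary applied to the Likert-scale items, the following Python sketch computes the sample size, mean, median, sample standard deviation, and percentage of agreement (ratings of 4 or 5) for one questionnaire item. The function summarize and the choice of statistics are assumptions, since the paper does not state which descriptive measures were reported, and no data from the study are used here.

```python
from statistics import mean, median, stdev
from typing import Dict, List


def summarize(ratings: List[int]) -> Dict[str, float]:
    """Summarize five-point Likert ratings (1 = strongly disagree, 5 = strongly agree)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return {
        "n": len(ratings),
        "mean": round(mean(ratings), 2),
        "median": median(ratings),
        # Sample standard deviation is undefined for a single rating.
        "sd": round(stdev(ratings), 2) if len(ratings) > 1 else 0.0,
        # Share of respondents choosing "agree" or "strongly agree".
        "agree_pct": round(100 * sum(r >= 4 for r in ratings) / len(ratings), 1),
    }
```

Because Likert ratings are ordinal, the median and the agreement percentage are often easier to interpret than the mean, particularly with the small samples typical of pilot evaluations.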

5. Results and Discussion

5.1. User Perceptions of System Usability

Overall, participants reported positive perceptions of the proposed system’s usability. Most learners indicated that the interaction process was intuitive and easy to follow, even during their first encounter with the robot. The use of simple voice commands and a consistent interaction sequence reduced the need for external instructions, supporting autonomous use. The embodied presence of the conversational robot Tapia was frequently mentioned in open-ended responses as a factor contributing to a relaxed learning atmosphere. Compared with screen-based applications, the participants noted that interacting with a physical robot felt more engaging and less monotonous, which aligns with previous findings in human–robot interaction research suggesting that physical embodiment can enhance user engagement.

5.2. Effectiveness of Image-Based Vocabulary Presentation

The participants agreed that integrating images with spoken vocabulary items facilitated comprehension, particularly for unfamiliar words. Visual prompts were perceived as helpful in establishing immediate form–meaning associations without relying on translation. This was especially evident for concrete nouns and action-related vocabulary, which could be easily inferred from images. From a cognitive perspective, the results support the assumption that multimodal input can reduce cognitive load for beginner learners. The combination of visual and auditory information allowed learners to focus on pronunciation practice while simultaneously constructing semantic understanding. This suggests that image-based vocabulary presentation is a suitable extension to the previously audio-focused shadowing system.

5.3. Engagement and Learning Motivation

Questionnaire responses indicated that participants felt motivated to repeat vocabulary items multiple times during the session. The absence of numerical scoring and the use of affective verbal feedback were perceived as reducing performance pressure. Several participants commented that the robot’s encouraging responses made them more willing to continue practicing, even when pronunciation errors occurred. Observational data further revealed that learners often repeated items voluntarily beyond the minimum requirement, suggesting increased engagement. This finding is consistent with earlier Tapia-based studies, in which supportive feedback and a non-threatening interaction style contributed to learners’ willingness to practice oral skills.

5.4. Multilingual Applicability

Although the pilot evaluation was exploratory in nature, the participants interacting with different language modules reported similar levels of usability and engagement. This suggests that the overall interaction design is not language-specific and can be adapted to multiple target languages with minimal modification. The ability to reuse the same interaction framework across Japanese, Mandarin Chinese, and Vietnamese highlights the scalability of the proposed system. This is particularly relevant for applied foreign languages programs, where instructional resources must often support multiple languages within a unified learning environment.

5.5. Limitations and Design Implications

Despite the generally positive feedback, several limitations were identified in this study. First, speech recognition errors occasionally disrupted the interaction flow, particularly when learners’ pronunciation deviated substantially from the target model; this challenge has also been reported in prior studies on spoken language processing systems [7]. Although qualitative feedback helped mitigate frustration, recognition errors remain a technical challenge. Second, the evaluation relied primarily on self-reported data and short-term observation. As a result, the findings reflect perceived learning support rather than measurable gains in vocabulary acquisition. Future studies should incorporate pre- and post-tests to assess learning outcomes more objectively.

From a design perspective, the results suggest that conversational robots are well-suited for low-stakes, repetitive vocabulary practice, especially at the beginner level. The integration of images appears to enhance semantic support without increasing system complexity, making it a practical design choice for further development.

6. Conclusions

We designed and evaluated an image-based vocabulary learning system implemented on a conversational robot platform. Based on previous work involving a shadowing-based pronunciation practice system, the developed system integrates visual prompts with spoken input and interactive feedback to support beginner-level vocabulary learning in Japanese, Mandarin Chinese, and Vietnamese. The system was designed to promote autonomous practice, reduce learning anxiety, and enhance engagement through multimodal interaction. The results from the pilot evaluation indicate that learners perceived the system as easy to use and engaging. The combination of images and spoken vocabulary items was reported to facilitate meaning construction, while the robot’s affective feedback contributed to sustained motivation and willingness to practice. These findings suggest that conversational robots can serve as effective supplementary tools for vocabulary learning, particularly in contexts where opportunities for individual practice and immediate feedback are limited.

Despite these promising outcomes, several limitations must be acknowledged. The evaluation relied primarily on subjective measures and short-term interaction, and long-term learning gains or vocabulary retention were not assessed. Additionally, speech recognition errors occasionally disrupted the interaction flow, underscoring the need for further technical refinement.

Therefore, it is necessary to evaluate the system through larger-scale studies incorporating objective pre- and post-tests to measure vocabulary acquisition and retention. Longitudinal studies are also required to examine sustained learning effects and learner motivation over extended periods of use. From a system perspective, future development will explore adaptive difficulty control, richer visual content, and more robust speech recognition strategies to accommodate diverse learner pronunciations. Further investigation into classroom integration scenarios and blended learning models will also be conducted to better align the system with formal language instruction.

This study demonstrates the potential of conversational robots as accessible and scalable tools for multilingual vocabulary learning. The results provide a reference for continued research into robot-assisted language learning and the design of learner-centered, technology-enhanced educational environments.

Author Contributions

Conceptualization, T.-H.W. and I.-S.H.; methodology, T.-H.W., L.A.K. and I.-S.H.; software, L.A.K.; validation, T.-H.W., L.A.K. and I.-S.H.; formal analysis, T.-H.W. and L.A.K.; investigation, T.-H.W. and L.A.K.; resources, T.-H.W. and I.-S.H.; data curation, L.A.K.; writing—original draft preparation, T.-H.W. and L.A.K.; writing—review and editing, T.-H.W., L.A.K. and I.-S.H.; visualization, L.A.K.; supervision, I.-S.H.; project administration, T.-H.W. and I.-S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Science and Technology Council (NSTC), Taiwan, under grant NSTC 114-2221-E-155-053.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Tamai, K. A Study on the Effectiveness of Shadowing as a Listening Instruction Method. Ph.D. Thesis, Kobe University, Tokyo, Japan, 2005. (In Japanese)
2. Nakazawa, K. Study on Japanese Learners’ Perceived Novelty of Augmented Reality (AR) Learning Digital Games. J. Jpn. Lang. Educ. Taiwan 2020, 82–111. (In Japanese)
3. Shiomi, M.; Okumura, S.; Kimoto, M.; Iio, T.; Shimohara, K. Two is better than one: Social rewards from two agents enhance offline improvements in motor skills more than single agent. PLoS ONE 2020, 15, e0240622.
4. Tamai, K. The Effects of Shadowing and Its Role in the Listening Comprehension Process. Jiji Eigo Kenkyu 1997, 36, 105–116.
5. Mochizuki, M. Exploring the Application of the Shadowing Method in Japanese Language Education: Focusing on the Relationship Between Learners’ Japanese Proficiency and Their Evaluation of Shadowing Effects. Kansai Univ. Audiov. Educ. 2006, 29, 37–53. (In Japanese)
6. Chung, K.H. Application of AR in Printed Teaching Materials: A Case Study of the National Academy of Civil Service Materials. T&D Feixun 2018, 244, 1–32. (In Chinese)
7. Moriyama, A. Summary of the First Public Lecture Hosted by the Comparative Japan Studies Education and Research Center and the Graduate Education Reform Support Program. Available online: https://teapot.lib.ocha.ac.jp/record/7757/files/61_365-366.pdf (accessed on 2 June 2026). (In Japanese)

Figure 1. Overview of the system architecture of the image-based vocabulary learning framework implemented on the Tapia conversational robot. The arrows indicate the interaction flow among image presentation, speech input/output, recognition processing, and feedback generation.

Figure 2. The Tapia conversational robot used in the prototype implementation, equipped with a built-in display, microphone, and speaker to support multimodal language learning interaction.

Figure 3. The image-based vocabulary learning interface displayed on the Tapia robot screen during a practice session.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
