1. Introduction
A social robot is designed to support people in achieving social or emotional goals [
1]. Recently, studies on human–robot interaction involving social robots have attracted significant attention, and researchers have explored various approaches—such as vision and voice—to realize effective HRI for human interaction [
2]. Different embodiments of social robots lead to different user experiences, as the shape and appearance of a robot not only affect how it is perceived but also influence its overall performance in human–robot interaction. The physical form plays a decisive role in shaping user perception and the overall interaction experience, as shown in a recent study comparing humanoid and telepresence robots [
3]. In addition, object properties such as size, shape, form, texture, and color are critical factors that shape the responses of the user [
4]. Human-like social robots are particularly susceptible to the uncanny valley effect, in which increasing human resemblance can provoke discomfort or unease in users [
5]. In response to this issue and with the aim of improving HRI, research has increasingly focused on pet-like or zoomorphic companion robots in various application areas [
6,
7].
In this paper, we introduce PEPE, a cat-like companion robot with an animal-inspired appearance and emotional gestures informed by feline behavior analysis. The aim is to alleviate the uncanny valley effect by fostering comfort and a sense of friendliness in human–robot interaction. The design of PEPE was informed by a previous study on domestic cat emotional behavior, which employed the Facial Action Coding System (FACS) to identify relevant Action Units (AUs) and Action Descriptors (ADs) [
8]. Analysis revealed that, across friendly and non-friendly contexts, the most active AUs and ADs were linked to the ears and eyes. These findings guided the design of PEPE’s facial features. For PEPE’s body design, we incorporated insights from online resources on feline body language, focusing on how cats express emotions through posture and movement in various situations. Based on these resources, we concluded that the body parts most relevant to emotional expression are the tail, ears, and legs [
9,
10,
11,
12,
13]. These insights guided the design of PEPE’s body, and neck movement was additionally incorporated to support interaction.
For effective HRI, the behaviors of social robots must be carefully designed; otherwise, they may be misunderstood or misinterpreted, causing users to perceive weirdness, unfriendliness, or even danger [
14,
15]. Previous studies have introduced a rule-based approach, including sentiment-driven models [
14], to address the problem; however, this method remains limited in flexibility and expressiveness.
Recent advancements in Large Language Models (LLMs) have shown potential for robot motion and task planning, particularly in generating expressive or context-aware actions [
16,
17]. Manually designing an action sequence and executing the motion onto the robot is a complex task that requires sustained effort, an area where LLMs could play a significant role. Prior work has shown that LLMs can be effective in generating gestures for human-like social robots [
18]. The application of LLMs, and evidence of their effectiveness, has the potential to simplify emotional gesture generation and provide greater flexibility and expressiveness. While the usage of LLMs increases, prompting techniques and methods have also advanced. Because LLMs are trained on large datasets, they may often posses the general capabilities required for the task given [
19]. However, to complete the task accordingly, the user must specify the necessary information and provide proper instructions for the desired answer. Previous studies have sought to address this problem by applying various prompting techniques [
20], including a more specific method such as Chain-of-Thought (CoT) [
21]. Instead of emphasizing the prompting method itself, this study focuses on how an LLM can be guided to produce emotionally coherent and physically feasible motion sequences for a companion robot.
This study presents an LLM-based hierarchical motion generation framework for companion robots. The approach enables the model to synthesize emotional gestures by integrating robot structural data, motion code formats, and emotion-specific guidelines in a stepwise manner. Through this structured process, emotional motions can be produced efficiently and aligned with the robot’s physical design. The process employs a progressive instructional sequence to provide the model with robot-specific context, motion data structures, and emotion principles. The architecture of the method consists of (1) providing the companion robot’s structural features, (2) defining the gesture-generating code format, and (3) specifying emotion-based guidelines. The first prompt outlines the mechanical design and structural details of the robot, including actuator locations, body part placement, and actuator angle thresholds. The second prompt defines the required output format, which is subsequently executed on the robot for motion evaluation. The third prompt specifies the emotions that each segment of the generated code is intended to represent. By utilizing the proposed LLM-based method, it becomes possible to generate emotional gestures for companion robots. The generated motions were implemented on the developed companion robot, PEPE, and a Likert-scale questionnaire survey is conducted to validate the implemented motions.
The key contributions of this paper are as follows:
Development of PEPE, a cat-like companion robot with multiple degrees of freedom (DoFs) inspired by feline emotional behaviors.
Proposal of a progressive instructional prompting technique for LLMs to generate emotional gestures.
Implementation on PEPE and evaluation through a Likert-scale questionnaire survey.
The remainder of this paper is organized as follows:
Section 2 presents an analysis of feline emotional expressions to identify the degrees of freedom necessary for efficient HRI in the development of companion robots.
Section 3 describes the structure of the cat-like companion robot, PEPE, designed based on this analysis.
Section 4 introduces the LLM-based motion generation framework for producing emotional gestures in PEPE.
Section 5 reports the evaluation of the proposed method through a user study. Finally,
Section 6 concludes the paper.
2. Analysis of Feline Emotional Expressions
Designing a pet-like robot to express emotional gestures through non-verbal communication and enhance HRI is a challenging task. Prior studies have highlighted how embodiment and gesture recognition play a central role in human–robot communication [
22] and how multimodal expression through motion, color, and sound can add complexity to design [
23]. Building on these insights, we propose a design of a cat-like companion robot, PEPE, which leverages multiple degrees of freedom derived from domestic cat behaviors [
8] to improve the richness of emotional gestures and enhance the overall user experience.
Our companion robot is designed to visually resemble and behaviorally mimic a domestic cat. To guide the development of its emotional motion, we drew on prior research by other researchers analyzing feline facial signals using the Cat Facial Action Coding System (catFACS) [
8,
24]. These studies demonstrated how specific facial muscle movements are active during different affiliative and non-affiliative social interactions, identifying which facial regions, such as the ears, eyes, and whiskers, are most expressive in emotional contexts.
Table 1 presents the top five most active AUs, with the ear identified as the most frequently involved body part. In addition to facial expression studies, we also referred to online resources [
9,
10,
11,
12,
13] on feline body language to capture how cats display emotions through posture and movement. These sources provide practical insights into how domestic cats use their heads, feet, bodies, tails, and overall body stance to communicate emotions. By combining this behavioral knowledge, we were able to identify key non-verbal cues, as shown in
Table 2.
For instance, relaxed ear and tail movements often indicate positive emotions such as happiness, while rigid or rapid motions may signal heightened arousal states such as anger, fear, or excitement. Conversely, lowered ears and a crouched body posture typically reflect negative affect, such as sadness or fear.
Table 2 summarizes these distinctive behavioral markers across a range of emotional situations.
To determine which emotional categories to incorporate, we referred to earlier works on basic emotions in Drosophila, which identified three fundamental affective states: happy, angry, and sad [
25]. Building on this foundation, we extended the set to include fearful, joyful, and excited, as these states are commonly associated with expressive animal behaviors and are relevant for enriching emotional diversity in companion robots. In addition, we incorporated positive and negative feedback as interaction-oriented categories, enabling PEPE to respond directly to user actions. This expanded set of eight emotions provides a balanced framework that combines biologically grounded basic emotions with socially meaningful interaction states, guiding both the behavioral design and motion generation processes.
The identified features were subsequently taken into account in the design of each body part.
3. Design and Implementation of PEPE
3.1. Mechanical and Structural Design of PEPE
The cat-like form of PEPE was chosen to enhance approachability and foster emotional engagement in HRI. Prior research [
6,
7,
26] has shown that zoomorphic or pet-like robots can improve user comfort and acceptance by evoking familiarity and empathy. In addition, we adopted a neutral sitting-cat posture to convey stability and provide users with a reassuring impression of the robot. Maintaining the original esthetic design constrained the available space for integrating mechanical features that enable expressive motions. With respect to prior research [
8,
9,
10,
11,
12,
13], we provided DoFs to ears, neck, front legs, and tail for expressing emotional gestures.
Figure 1 shows the external and internal design of PEPE.
The external covers of the head, ears, body, legs, and tail were produced through 3D printing with filament. The head cover was later fitted with artificial fur to enable future interaction scenarios, such as users touching the robot’s head or providing tactile feedback. The internal structural framework that supports the robot is constructed of aluminum alloy. The skeletal structure and degrees of freedom are shown in
Figure 2. Further details are provided in the following paragraphs.
The ear module consists of two XC330-T288-T(ROBOTIS, Seoul, South Korea) actuators (ROBOTIS), with each actuator controlling one ear independently, as shown in
Figure 2b. The ears move along the pitch axis and can operate separately. Based on
Table 2, when the ears reach the forward threshold, they tilt downward, conveying a sad or negative impression to the user. When positioned at the center, the ears are oriented forward, representing a neutral state. At the backward threshold, the ears are drawn back to express anger or stress.
The head module integrates all major electronic components, including a motor controller, a single-board computer (Raspberry Pi 5, Raspberry Pi Ltd., Cambridge, UK), a WIFI router, two LCD displays (T-Display-S3 AMOLED, AMOLED, LilyGO, Shenzhen, China), a camera (Arducam Mini 16 MP IMX519, Arducam, Chengdu, China), two microphones (I2S MEMS INMP441, InvenSense, San Jose, CA, USA), and a mini speaker (MAX 98347A, Analog Devices, Inc., San Jose, CA, USA). The motor controller, equipped with a relay module, provides low-level control of the actuators and is used to execute the emotional motion codes. The single-board computer performs high-level control, managing communication with the camera. The LCD displays function as the robot’s eyes, enabling the visualization of emotional states during gestures. The camera, microphones, and speaker were included to support multimodal interaction between the robot and the user; however, these were not utilized in this study in order to focus on the evaluation of motion-based emotional gestures. Details of the control system are presented in the next section.
The neck module consists of two XM430-W350-T (ROBOTIS, Seoul, Republic of Korea) actuators (ROBOTIS), which control the pitch and yaw axes, as presented in
Figure 2c. More powerful actuators were selected for this module because they support the head, which houses multiple electronic components, the ear module, and the external head cover. The inclusion of both pitch and yaw axes provides multiple degrees of freedom, enabling the robot to orient its head toward the user or other objects during interaction. In addition, based on
Table 2, this configuration allows the head to perform motions—such as shaking side to side or facing downward—to convey emotional states.
The body module consists of two XM430-W350-T actuators (ROBOTIS), each controlling the pitch axis of a front leg, as illustrated in
Figure 2d. These actuators were selected to accommodate the sitting design of the robot, in which the majority of the weight is supported by the front. The purpose of the front legs is not mobility, but rather the expression of specific emotions; therefore, their motion is designed to reflect the feline behaviors identified in our analysis. The internal structure houses six lithium-polymer batteries, enabling the robot to function independently when lifted or moved to different locations. For reasons of weight and space, mobility was not implemented in the back legs. Instead, a passive mechanism was introduced, allowing them to move independently of the body and giving the appearance of a four-legged creature rather than a rigid or immobile structure.
The tail module consists of two XC330-T288-T actuators (ROBOTIS), each controlling the pitch and roll axes, as detailed in
Figure 2e. At least two degrees of freedom are required for the tail to reproduce the emotional movements identified in
Table 2. The tail was designed to be lightweight, ensuring safe interaction while preserving natural motion.
3.2. Electronic and Control System Design of PEPE
As illustrated in
Figure 3, the workflow of PEPE is structured around modular subsystems interconnected through the internet and coordinated via an MQTT broker. Emotional gesture codes are generated offline using a LLM and uploaded to the robot. The ESP32 modules handle the primary control functions: one ESP32 manages the actuators for motion execution, another drives the left and right LCD eye displays to visualize emotional states, and a third supports the microphone and speaker for potential audio interaction. The specific eye displays used for each emotion are shown in
Figure 4, adding an additional layer of visual expressiveness to the gestures. The actuators, including those in the ears, neck, legs, and tail, execute the predefined gestures under ESP32 control. A Raspberry Pi 5 is dedicated exclusively to camera processing, separating vision tasks from low-level motor control. This distributed architecture ensures clear functional separation, lightweight communication through MQTT, and flexible expansion for future multimodal interaction.
Overall, the integration of structural modules, actuators, and electronic components enabled PEPE to resemble a cat-like companion robot capable of expressing emotions through non-verbal gestures. The following section presents the LLM-based framework using a progressive instructional prompting technique developed to generate and evaluate these emotional motions.
4. Progressive Instructional Prompting Technique
LLMs are capable of generating answers ranging from simple factual responses to detailed descriptions or specific tasks. However, the quality of the generated output strongly depends on the user’s ability to formulate clear and optimal prompts. The emerging paradigm of “pre-train, prompt, and predict” highlights this dependency, as effective prompt design is essential to leverage pre-trained models across diverse domains [
27,
28]. In addition, recent advances show that prompt-based frameworks are not limited to language but extend to graph-based reasoning and structured domains [
29]. Parallel to these developments, prompting techniques have also been applied in motion-related tasks, where multimodal prompts (e.g., text, image, motion) enable conversational and interactive motion generation, bridging natural language with embodied control [
30,
31,
32].
Prompting techniques have been widely studied as a means to improve the reliability and task alignment of LLMs. Prior survey studies categorize a broad range of strategies, including zero-shot and few-shot prompting, chain-of-thought reasoning, and multimodal prompting [
33,
34,
35]. These findings suggest that tailored prompting techniques are necessary when adapting LLMs to domains beyond their pretrained data. The companion robot used in this study was developed recently and would not have been included in the datasets. Therefore, we suggest a progressive instructional prompting method that informs the companion robot’s information and generates the code, aiming to minimize errors, reduce hallucinations, and increase robustness for motion generation in companion robots.
In prior research, persona prompting has been adopted to steer LLM behavior by assigning a consistent role. However, systematic evaluations show mixed outcomes: in many objective tasks, persona-based system prompts yield negligible or negative gains in task performance [
36], and model behavior can vary depending on how the persona is phrased or contextualized [
37]. These findings motivate careful persona design and intermediate checks in sequential prompt pipelines.
In this study, a progressive instructional prompting method was employed to generate emotional gestures. ChatGPT served as the LLM responsible for generating emotional motion sequences for PEPE. The prompting process followed a hierarchical and sequential structure to ensure coherent and physically feasible outputs. As illustrated in
Figure 5, the overall prompting flow consists of three main instructional prompts: robot external description, output code structure, and emotion guidelines. This structured prompting framework was designed to progressively refine the model’s understanding of the task—from general context to specific motion generation—while minimizing inconsistencies and undesired behavior in the output.
The designed persona and basic rules are as follows:
You are a helpful assistant for creating robotic motion codes.
I will provide instructions on how to develop emotional codes for a cat-like companion robot.
Three sets of instructions will be given sequentially: robot information, code information, and emotion information.
Do not generate the code until explicitly instructed to do so via a prompt.
If you have understood the assignment, respond with “Yes”.
The first instruction includes PEPE’s structural and external description. Despite being trained on a vast dataset, ChatGPT lacks essential information specific to our robot. To clearly convey this information, we primarily provided a structured description of the robot’s actuators, degrees of freedom, and angle thresholds for each body part (
Appendix A), while these thresholds and specifications form the core of the prompt, we additionally supplied an image of the robot (
Figure 1a) as a supporting feature to enhance contextual understanding. The actuators are located in the ears, neck, front feet, and tail, and these details—rather than the photo—serve as the essential basis for describing PEPE’s structure.
The second instruction outlines the fundamental structure of the code that needs to be generated. The robot’s control mechanism operates by receiving velocity, acceleration data, and actuator angular position information at each time step, which together define the temporal dynamics of motion execution. ChatGPT’s task is to infer and assign these parameters in accordance with the designated rules, ensuring that the generated code adheres to safe and feasible actuation. To guide this process, we provided a template of the expected code format in
Appendix A, which specifies the required fields and their arrangement. This template acted as a structural reference, allowing ChatGPT to align its outputs with the defined conventions while also reducing the likelihood of missing parameters or generating inconsistent sequences. An illustrative example was included within the instructions to demonstrate how the conditions should be applied in practice, serving as a few-shot reference to stabilize generation quality.
The third and final instruction contains the principle governing the emotions that the robot should express. To generate distinct emotional gestures, we deliberately provided only the essential information for each emotion, rather than prescribing detailed movement trajectories. This fundamental information (outlined in
Appendix A) consists of three components: (1) the target emotion to be conveyed, (2) the motion intensity, defined as whether the movement should be vibrant or scarce, and (3) the intended pace of the motion. By constraining the description to these elements, we encouraged ChatGPT to generalize motion features while maintaining consistency with the robot’s physical limits. Furthermore, we emphasized that outsourcing emotional features was acceptable in the generation process, thereby allowing the model to integrate both structured instructions and externally learned knowledge.
After instructing each guideline, ChatGPT answered with “Yes” and provided a summary for each given instruction. We divided this process into three instructions because, when presented as a single prompt, ChatGPT often omitted critical details-such as matching the lower bound of timestamps-produced incomplete or inconsistent code structures, and occasionally generated logically mismatched outputs (e.g., specifying 10 timesteps but providing only 9). By structuring the instructions step-by-step and requiring intermediate summaries, we aimed to make the system more robust in generating emotional gestures. During experimentation, however, we observed that including explicit parameter values in examples sometimes biased the output. For instance, if a sample prompt used a 500 ms timestamp interval, the generated code would rigidly adopt this interval rather than exploring alternative values. This behavior reflects the well-documented tendency of LLMs to overfit to demonstrated patterns. To mitigate this issue, we explicitly instructed the model not to replicate previous parameter settings and to generate values independently instead. This refinement increased the flexibility of the generated code while maintaining adherence to the structural rules of the prompt.
Before executing the final prompt to generate emotional gestures, a verification task was conducted to evaluate the effectiveness of the progressive instructional prompting technique. The objective of this task was to determine whether the model could accurately follow given instructions and generate a motion sequence as intended. The verification task required moving each body part individually to its maximum and minimum thresholds for a second each. Once the code was generated, we tested it on the robot and confirmed that it accurately managed to execute the task.
After completing the verification sequence, the final execution prompt was issued, successfully generating eight distinct emotional gestures. Three representative examples—Happy, Angry, and Joyful—are shown in
Figure 6, illustrating the diversity of expressions produced through the sequential prompting technique. The corresponding motion codes are provided in
Appendix B. Each gesture was subsequently tested on the physical robot to validate its performance and confirm that the motions operated as intended. Details of the validation procedure are presented in the following section.
5. Performance Validation
5.1. Validation Method
To evaluate the appropriateness of each emotional gesture, a questionnaire survey using a five-point Likert scale was conducted. Previous studies have also assessed the perception and acceptance of robotic gestures through similar survey-based methods [
18,
26]. Since emotional expression is inherently subjective, this approach is considered appropriate for performance validation.
A total of 15 participants (ages 25–35; 4 females and 11 males) were recruited to take part in the study. All participants observed the robot in person under controlled laboratory conditions. Prior to participation, the purpose of the experiment and evaluation guidelines were explained, and all participants provided informed consent. No personally identifiable information was collected, and all responses were analyzed anonymously.
The questionnaire consisted of nine items. The first eight questions evaluated whether each emotional gesture accurately represented its corresponding emotion (Happy, Angry, Sad, Fearful, Joyful, Excited, Positive Feedback, and Negative Feedback), while the ninth question assessed the overall impression of all gestures combined. Each question was rated on a five-point Likert scale, where 1 indicated very inappropriate, 2 indicated inappropriate, 3 indicated neutral, 4 indicated appropriate, and 5 indicated very appropriate, yielding a maximum total score of 45 points. The survey questions are presented in
Table A1 (
Appendix C).
During the evaluation, each emotional gesture was demonstrated individually, with participants observing from a fixed, safe distance of approximately two meters. Participants could request additional repetitions if needed. Importantly, participants were not informed that the gestures were generated by ChatGPT to avoid potential bias in their evaluations. They were told that PEPE is a cat-like companion robot designed to enhance human–robot interaction and were encouraged to provide honest assessments based solely on their perceptions of the robot’s movements.
All evaluations were conducted under defined and controlled conditions. Each participant experienced the exact same procedure: the same laboratory space, identical lighting and background environment, constant noise level, and the same robot motion sequences executed from the same initial pose. The gestures were presented in a fixed demonstration order, and the robot operated with the same hardware configuration and motion code for all trials. Participants observed from a fixed distance of approximately one meter, and no interaction with the robot was permitted during the evaluation. These standardized conditions ensured that all participants evaluated the gestures under equivalent and fully reproducible circumstances.
5.2. Quantitative Evaluation of Emotional Gestures
The questionnaire survey evaluated how 15 participants perceived the appropriateness of PEPE’s emotional gestures. Participants were also encouraged to provide written feedback for each gesture when possible.
As illustrated in
Figure 7a, the average Likert-scale ratings revealed varying levels of acceptance across the eight emotional gestures. The Happy gesture averaged 2.8, Angry 3.8, Sad 3.0, Fearful 3.2, Joyful 3.8, Excited 3.5, Positive Feedback 3.1, and Negative Feedback 2.6. The overall mean rating was 3.0, indicating a neutral level of acceptance for the generated gestures. Among all emotions, Angry and Joyful were rated the highest, suggesting that these gestures were perceived as the most expressive and appropriate, whereas Negative Feedback and Happy received the lowest scores.
The error bars in
Figure 7a represent the standard deviations for each emotion, showing the level of participant agreement. The Happy gesture (SD = 0.68) exhibited relatively consistent responses below the neutral threshold, while Angry (SD = 0.74) and Joyful (SD = 0.83) showed moderate agreement around the “appropriate” range. The Sad gesture (SD = 0.85) and Fearful gesture (SD = 0.94) indicated more dispersed responses, suggesting mixed perceptions among participants. The Excited gesture (SD = 0.83) also showed moderate variability, whereas Positive Feedback (SD = 0.83) remained neutral overall. Finally, the Negative Feedback gesture (SD = 0.74) demonstrated relatively consistent disagreement, indicating that most participants did not associate the motion with the intended emotion.
As shown in
Figure 7b, the gestures were further grouped into positive (Happy, Joyful, Excited, and Positive Feedback) and negative (Angry, Sad, Fearful, and Negative Feedback) emotion categories to examine broader trends in perception. The positive emotions achieved an average rating of 3.3 ± 0.87, indicating general agreement among participants that these gestures were appropriate and recognizable. In contrast, the negative emotion category averaged 3.1 ± 0.92, reflecting slightly lower scores and greater variability in participants’ evaluations. This suggests that while PEPE’s positive gestures were interpreted more consistently, the negative expressions elicited more diverse responses, possibly due to subtler or less distinguishable motion features.
To complement the descriptive results, a one-sided Wilcoxon signed-rank test—summarized in
Table 3—was performed to determine whether each gesture’s rating exceeded the neutral midpoint (3). The results showed that three high-arousal emotions—Angry, Joyful, and Excited—were rated significantly above the neutral threshold (
p < 0.05), corresponding to the Clear recognition category in the table. In contrast, Sad, Fearful, and Positive Feedback exhibited average ratings near the neutral point and did not differ significantly from the midpoint (
p ≥ 0.05), leading to their classification as Neutral recognition. The gestures Happy and Negative Feedback, which received lower mean scores and non-significant results, were categorized as Subtle recognition, reflecting their low-arousal or understated motion characteristics. A grouped analysis further supported this trend: the positive emotion set (Happy, Joyful, Excited, Positive Feedback) was rated significantly above neutral (W = 490.0,
p = 0.0037), whereas the negative emotion set (Angry, Sad, Fearful, Negative Feedback) did not reach significance (W = 435.0,
p = 0.085). These findings collectively suggest that PEPE’s high-intensity emotions were reliably recognized, while low-intensity emotions produced milder or more neutral impressions, consistent with their design intent.
In summary, the evaluation results indicate that participants generally perceived PEPE’s emotional gestures at a moderate level, with clearer recognition for high-arousal expressions. The descriptive ratings and Wilcoxon analysis consistently showed that Angry, Joyful, and Excited gestures were interpreted as appropriate and expressive, reflecting their dynamic and high-intensity motion designs. In contrast, low-arousal emotions such as Happy, Sad, Fearful, Positive Feedback, and Negative Feedback tended to cluster around the neutral midpoint and did not reach statistical significance, which aligns with their subtle and low-amplitude behaviors. When grouped, the positive emotions collectively surpassed the neutral threshold, whereas the negative set showed greater variability and did not achieve significance, suggesting that certain negative expressions require clearer motion differentiation. Overall, these findings confirm that PEPE effectively conveys high-intensity emotions while providing milder expressions for low-arousal states, with room for refinement in enhancing the clarity and distinctiveness of subtler emotional gestures.
5.3. Qualitative Feedback Analysis
In addition to the quantitative ratings, participants provided written feedback for each emotional gesture to describe their impressions of PEPE’s movements. Overall, the comments suggested that while the gestures were generally understandable, several required refinement to appear more expressive and natural. Participants also noted that a clearer distinction between positive and negative emotional gestures would improve the robot’s overall expressiveness.
Feedback on the Happy gesture, which received a relatively low acceptance score, indicated that the motion appeared “vague,” “slow,” and “unnatural.” One participant remarked that “the movements were not fluent and lacked diversity,” suggesting that the gestures should include more dynamic transitions. The Angry gesture, which received one of the highest ratings, was described as “appropriate” and “relatable.” Participants commented that the motion effectively conveyed tension and matched the expected behavior of a cat expressing anger. Feedback on the Sad gesture reflected a mix of opinions. Positive comments mentioned that “the movement showed a relatable image of a cat being sad,” while others noted that it was “vague” and “should include more sensible movement.” These responses suggest that, although the gesture conveyed the intended emotion to some extent, its expressiveness could be enhanced through more pronounced or fluid motion. The Fearful gesture received the most divided feedback. Some participants described it as “acceptable,” while others felt “it looked more like anger.” One participant pointed out that the robot’s mechanical limitations-such as the number of degrees of freedom (DoFs)-restricted its ability to fully express fear. This indicates that hardware constraints may have affected the clarity of this emotion. For the Joyful gesture, which showed moderate agreement in ratings, most participants responded positively, stating that “it gave the vibe of a joyful cat.” A few participants, however, suggested that the gesture “needed more diverse actions” to appear more vivid and energetic. The Excited gesture was also reviewed favorably, though one participant mentioned that it “should be more distinguishable from the Joyful gesture,” implying a need for clearer motion differentiation between these two similar emotions. Feedback on the Positive Feedback gesture suggested that participants found it appropriate for scenarios in which the robot communicates a positive or encouraging response to the user. In contrast, participants described the Negative Feedback gesture as “vague,” and some noted that “it might have been more expressive if the cat had shown a form of rejection or rebellion toward the user.”
Overall, the qualitative feedback indicates that, while PEPE’s emotional gestures were generally acceptable and recognizable, several require refinement to enhance distinctiveness, fluidity, and emotional depth. Participants emphasized the importance of a clearer separation between positive and negative motions and suggested that additional expressive features—such as increased degrees of freedom, sound effects, or changes in eye display—could further improve emotional realism and user engagement.
6. Discussion
This study examined the potential of large language models to generate emotionally expressive gestures for the cat-like companion robot PEPE through a progressive instructional prompting technique. The survey results and participant feedback together provide insight into both the strengths and limitations of this approach.
Overall, the LLM-generated gestures were successful in conveying the intended emotions. Joyful and Angry gestures received the highest recognition scores, indicating that movements characterized by clear dynamics and higher intensity were more easily interpreted by observers. This observation was supported by the Wilcoxon signed-rank analysis, which confirmed that Angry, Joyful, and Excited gestures were rated significantly above the neutral midpoint (
p < 0.05), corresponding to Clear recognition in the evaluation. In contrast, the remaining low-arousal gestures—such as Fearful, Sad, and Positive Feedback—displayed mean ratings near the neutral point and larger variability, indicating Neutral recognition rather than strong emotional identification. The Happy and Negative Feedback gestures showed slightly lower mean values and non-significant results (
p ≥ 0.05), reflecting Subtle recognition consistent with their understated motion profiles. Group-level analysis also revealed that positive emotions (M = 3.3 ± 0.87) were perceived more consistently than negative emotions (M = 3.1 ± 0.92), and the positive emotion set as a whole was statistically above neutral (
p = 0.0037). This asymmetry aligns with previous findings in human–robot interaction studies, where positive and energetic expressions are generally recognized with greater accuracy than restrained or defensive ones [
18,
26].
Earlier research on robot emotion expression has emphasized the importance of motion amplitude, speed, and synchrony with other modalities for clear affective communication. Consistent with these observations, participants in our study responded favorably to gestures that contained evident amplitude changes and temporal variation. Compared with rule-based or manually designed gestures, the LLM-generated motions achieved comparable levels of user acceptance while requiring substantially less design time. These findings demonstrate that prompt-driven generation can serve as a viable alternative to handcrafted emotional motion libraries.
Feedback indicated that several gestures appeared “vague” or “unnatural,” primarily due to mechanical limitations in PEPE’s degrees of freedom and the absence of supporting cues, such as facial expressions or sounds. The Fearful gesture, in particular, was often interpreted as Angry, revealing the need for clearer motion segmentation and timing control. Moreover, some gestures lacked smooth transitions between poses, implying that prompt-based generation should include temporal-continuity constraints or post-processing filters. The small number of participants (n = 15) and single-modality evaluation also limit the generalization of the results; larger and more diverse user groups may yield deeper insights into cultural or demographic differences in emotion recognition.
The results highlight promising directions for enhancing LLM-based motion generation. Future studies will refine the progressive instructional prompting framework by introducing additional physical parameters—such as velocity envelopes, phase timing, and amplitude scaling—to achieve smoother and more lifelike gestures. Integrating multimodal features, including sound cues or animated eye displays, could further strengthen emotional clarity. Reinforcement or imitation learning techniques may also be combined with LLM-generated base motions to enable adaptive, user-responsive behavior. Together, these improvements could advance the realism and social expressiveness of companion robots like PEPE.
7. Conclusions
This study presented the design and implementation of PEPE, a cat-like companion robot developed to enhance emotional expressiveness and human–robot interaction. The robot was designed based on feline behavioral analysis, providing multiple degrees of freedom in the ears, neck, legs, and tail to support diverse non-verbal emotional gestures.
Building upon this design, an LLM-based hierarchical framework employing a progressive instructional prompting technique was applied to generate motion sequences. The framework allowed ChatGPT to produce eight distinct motion sequences representing different emotional states, which were evaluated through a user study involving 15 participants. Quantitative results showed that Joyful and Angry gestures were perceived as the most appropriate and expressive, while Fearful and Negative Feedback received lower ratings. The Wilcoxon signed-rank analysis further confirmed that Angry, Joyful, and Excited gestures were rated significantly above the neutral clarity threshold (p < 0.05), corresponding to clear recognition, whereas the remaining low-arousal gestures showed neutral or subtle recognition, remaining statistically indistinguishable from the midpoint—consistent with their mild and understated design. Qualitative feedback further revealed that participants generally recognized the intended emotions, but found some gestures to be vague or lacking fluidity. These findings demonstrate that the proposed prompting approach can generate recognizable emotional motions while minimizing manual design effort.
The study also identified several challenges in improving robot expressiveness. Limited mechanical degrees of freedom, the absence of multimodal cues—such as sound or facial animation—and subtle motion transitions were key factors that affected emotional clarity. In addition, the current design is limited by the small eye-display area and the lack of multimodal outputs such as sound or mouth animation. A next-generation version of PEPE is currently in development, incorporating an expanded visual display and additional multimodal cues to enhance emotional expressiveness. Future work will focus on enhancing the prompting framework by integrating physical motion constraints, increasing robot articulation, and incorporating multimodal feedback mechanisms. Additionally, recent LLM models—such as updated ChatGPT variants, Google’s Gemini models, Meta’s LLaMA family, and other open-source alternatives—have continued to advance, offering opportunities to explore more diverse and expressive gesture-generation capabilities in future work. Combining the LLM-based generation method with reinforcement or imitation learning may further enable adaptive and user-responsive motion behavior. Overall, this research provides a foundational step toward data-efficient and scalable methods for emotional motion generation in companion robots, contributing to more natural and affect-aware human–robot interactions.