Electronics
  • Feature Paper
  • Article
  • Open Access

5 February 2020

Four-Features Evaluation of Text to Speech Systems for Three Social Robots

Department of Robotics, University Carlos III of Madrid, Avda de la Universidad, 30, 28911 Leganés (Madrid), Spain
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Applications and Trends in Social Robotics

Abstract

The success of social robots is directly linked to their ability to interact with people. Humans possess verbal and non-verbal communication skills, and, therefore, both are essential for social robots to achieve natural human–robot interaction. This work focuses on the former, since the majority of social robots implement an interaction system endowed with verbal capacities. To implement such a system, we must equip social robots with an artificial voice. In robotics, a Text to Speech (TTS) system is the most common speech synthesis technique. The performance of a speech synthesizer is mainly evaluated by its similarity to the human voice in relation to its intelligibility and expressiveness. In this paper, we present a comparative study of eight off-the-shelf TTS systems used in social robots. In order to carry out the study, 125 participants evaluated the performance of the following TTS systems: Google, Microsoft, Ivona, Loquendo, Espeak, Pico, AT&T, and Nuance. The evaluation was performed after watching videos in which a social robot communicates verbally using one TTS system. The participants completed a questionnaire to rate each TTS system in relation to four features: intelligibility, expressiveness, artificiality, and suitability. In this study, four research questions were posed to determine whether it is possible to rank the TTS systems in relation to each evaluated feature or whether, on the contrary, there are no significant differences between them. Our study shows that participants found differences between the TTS systems evaluated in terms of intelligibility, expressiveness, and artificiality. The experiments also indicated that there was a relationship between the physical appearance of the robots (embodiment) and the suitability of TTS systems.

1. Introduction

Social robots are intended to “live” around humans to help and/or entertain them. In this regard, speech is probably the richest and the preferred way for humans to communicate, which makes the software that allows the robot to generate an artificial voice a crucial element of human–robot interaction. These systems, commonly known as Text To Speech (TTS) systems, convert text to artificial voice.
There are several definitions of what a TTS system is. Van Bezooijen defines it as a system that ‘allows the generation of novel (oral) messages, either from scratch (i.e., entirely by rule) or by recombining shorter pre-stored units’ [1]. On the other hand, Handley uses the following definition: ‘Speech synthesis systems, or speech synthesizers, are computer programs which automatically generate speech, i.e., systems which enable the computer to “talk” or “speak” to the user’ [2]. Since the beginning of the 1990s, several authors have analyzed the technological foundations of text to voice conversion using a computer [3]. Klatt presented one of the first analyses of the TTS systems available in the late 1980s [4].
Nowadays, there are many proposals of commercial TTS systems with different features and performance. Therefore, it seems that a comparative study of them would be quite useful for robotics researchers, among others. In this paper, we present a comparative study of the following TTS systems: Ivona, Nuance, Google, Microsoft, AT&T, Espeak, Pico, and Loquendo.
Since this work is motivated by our own need to decide which TTS system to select for our robots, each TTS has been configured and executed in three social robots, Mbot, Mini, and Maggie, the robots of the Social Robotics group of the RoboticsLab (University Carlos III of Madrid).
This comparative study is focused on the evaluation of four features. The first one is the degree of intelligibility of the generated speech. In this sense, the user must evaluate whether the robot communicates with clarity and whether the sentence is well understood. The second one is the expressiveness of the generated voice, that is, whether the voice sounds monotonous or not: whether it is able to emphasize certain words, to make pauses, to change the speech speed, etc. The third feature is related to the artificiality of the voice, that is, whether it is perceived as more or less robotic (in the sense of less human-like). Finally, we evaluate the suitability of the generated voice for the robot, i.e., whether the voice fits the external appearance of the robot. These features translate into four research questions in Section 3.4, that is, the hypotheses that this study is intended to verify.
This paper is structured as follows. Section 2 offers a review of the TTS systems found in the literature, considering the platforms in which they are integrated. Next, Section 3 describes the procedure that was followed. Then, Section 4 presents the results obtained for each evaluated feature. Finally, in Section 5, the authors present their conclusions and a discussion of this study, as well as the limitations and lessons learned.

3. Experiment

As already stated, in this paper, we present a comparative study of the performance of several TTS systems to be used in social robots. In order to carry out this study, some of the TTS systems that are currently available, described in Section 2.1, were integrated in our social robots, introduced in this section, particularly in Section 3.2. By means of a questionnaire, the participants evaluated them by rating their features.

3.1. The Compared Text-To-Speech Systems

The social robots used to carry out this comparative study have an interaction system known as the “Robotic Dialog System”, or just RDS, presented in [33]. The RDS gives these robots the capacity to interact with humans, especially using multimodal speech dialogs. In this study, we have implemented and used the component called ‘Text-To-Speech’. This component allows our social robots to communicate with the users using different voices, languages, volumes, etc. In addition, it integrates the eight TTS systems that are analyzed in this paper:
  • AT&T
  • Google
  • Ivona
  • Microsoft
  • Nuance
  • Loquendo (v7.7)
  • Espeak (v1.48)
  • Pico (v2018).
The first five TTS systems require an Internet connection (since they use web services), while the last three do not require a persistent connection.
We have selected these eight TTS systems based on three main requirements: (i) the system should be used in different domains, paying special attention to developments integrated by the robotics research community; (ii) the software should be open source or, at least, it should offer a trial version, and (iii) it should support the Spanish language with acceptable technical support. Thus, Festival was not selected since it does not offer robust Spanish support, and Verbio was discarded since it does not offer a trial version. It should be noted that the selected TTS systems (except for Loquendo) cannot be customized, that is, they offer just one version. In the case of Loquendo, we use its default speech in order to make all results in this study comparable.

3.2. The Social Robots

For the embodiment comparison, we have used our three social robots, built at the RoboticsLab of the Universidad Carlos III of Madrid (Spain). These robots are: Maggie [34,35] (Figure 1), Mini [36] (Figure 2), and Mbot [37] (Figure 3).
Figure 1. Maggie.
Figure 2. Mini.
Figure 3. Mbot.
The robots integrate a dialog mechanism to enable natural HRI. For this reason, selecting the most adequate TTS system is crucial to enhance the user experience. Apart from the dialog system, the robots include high-quality speakers, microphones, and sound cards. The first robot, Maggie, is able to move through the environment to interact with people. The robot was originally designed as a generic research platform to test interaction mechanisms to improve the HRI experience. Maggie can communicate through sounds, gestures, and a touch-screen mounted on its chest. The robot has a rigid plastic shell and is 1.40 m tall. Mini is a desktop version of Maggie, also developed by the RoboticsLab, that acts as a companion for elderly people. In contrast to Maggie, Mini is shorter (just 55 cm) and is covered in a plush-like soft fabric. It integrates the same HRI capabilities as Maggie, with an external tablet to enhance interaction. Finally, Mbot is another mobile platform, developed in the EU project MOnarCH [38]. The robot is 1.15 m tall, that is, about the height of an 8–11 year-old child, since this social platform was designed to interact with children at the pediatric ward of the Portuguese Oncology Institute in Lisbon (Portugal). Similarly to Maggie, Mbot’s shell is made of a rigid material, carbon fibre.

3.3. Procedure

As seen in Section 2, in order to determine the performance of a TTS system, several authors have proposed different sets of characteristics to be evaluated. In the present paper, considering these references, especially the ones presented by Handley [2] and Viswanathan [30], the performance of each TTS system is determined by the evaluation, using questionnaires, of the following features:
  • Intelligibility: ‘Can you clearly understand the voice of this robot?’
  • Expressiveness: ‘How do you perceive this robot’s voice: monotonous or very expressive?’
  • Artificiality: ‘Do you think that this is a robotic voice?’
  • Suitability: ‘Do you think that this voice is suitable for this robot?’
Each of these questions has been rated using a 5-point Likert scale. In the case of expressiveness, the rating ranges from ‘very monotonous’ (1) to ‘very expressive’ (5). For the other features, the lowest score (1) corresponds to ‘Not at all’ while the highest one (5) corresponds to ‘Yes, absolutely.’
As can be observed, in addition to intelligibility (known as comprehensibility by Handley) and naturalness (also known as expressiveness by Handley), considering our target scenarios, TTS systems in Human–Robot interaction, and more specifically in social robotics, we have included two other important features: artificiality, related to the metallic/robotic sound of the voice, and suitability, related to the perception that the user has of whether the voice suits the robot considering its external appearance. The evaluation of these characteristics, as also stated in [32], is important in this kind of comparative study.
The questionnaires were created using the web tool ‘Google Forms’ [39]. The first page of the questionnaire is an introductory page where the user has to read some instructions about how to fill it in, and to answer some personal questions: age, gender, and educational level (university or non-university studies). The main part of the questionnaire is divided into eight pages, each one associated with a TTS system. The order of the pages was randomized when the forms were created. Every page shows a short video where the robot is talking using a specific TTS system. The robot says the following sentence in Spanish: ‘This is the robot X and this is a test sentence to evaluate the TTS system Y.’ After hearing this sentence, the user must score the four questions, and then the next page appears, showing the same robot using a different TTS system.
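The page structure described above can be sketched as a small data structure. This is only an illustration, not the authors' Google Forms setup: the function name `build_questionnaire`, the fixed seed, and the English rendering of the test sentence are assumptions made for the example.

```python
import random

# The eight TTS systems and the four questions come from the text above.
TTS_SYSTEMS = ["AT&T", "Google", "Ivona", "Microsoft",
               "Nuance", "Loquendo", "Espeak", "Pico"]

QUESTIONS = {
    "intelligibility": "Can you clearly understand the voice of this robot?",
    "expressiveness": "How do you perceive this robot's voice: monotonous or very expressive?",
    "artificiality": "Do you think that this is a robotic voice?",
    "suitability": "Do you think that this voice is suitable for this robot?",
}


def build_questionnaire(robot, seed=0):
    """Return one page per TTS system, in an order shuffled once at creation time.

    `seed` is a hypothetical parameter: fixing it reproduces the idea that the
    page order is decided when the form is created, not per participant.
    """
    rng = random.Random(seed)
    order = TTS_SYSTEMS[:]          # copy, so the master list stays untouched
    rng.shuffle(order)
    pages = []
    for tts in order:
        pages.append({
            "robot": robot,
            "tts": tts,
            "sentence": f"This is the robot {robot} and this is a test "
                        f"sentence to evaluate the TTS system {tts}.",
            "questions": dict(QUESTIONS),   # each answered on a 1-5 scale
        })
    return pages


pages = build_questionnaire("Mini")
print([page["tts"] for page in pages])
```

Each participant would then see the eight pages for a single robot, answering the same four questions on every page.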
These questionnaires were distributed publicly for a month over the Internet, using social networks to reach as many people as possible. Each user was only allowed to fill out one questionnaire, so the user evaluated the performance of the eight TTS systems for just one robot. This assignment was made by the researchers, who tried to balance the number of participants per robot/questionnaire type, so the user did not know about the existence of the other robots.

3.4. Research Questions

These questionnaires had two goals. The first one was to verify the following questions:
  • RQ1: are all TTS systems equally well understood?
  • RQ2: do all TTS systems have the same expressiveness?
  • RQ3: are all TTS systems equally perceived as robotic?
  • RQ4: are all TTS systems equally suitable for each robot?
If the answer to any of these questions was negative, that is, if significant differences were found between the TTS systems, then the second goal was to rank the TTS systems with respect to the corresponding feature.

3.5. Participants

For this study, we obtained 125 questionnaires in all (for the three robots). The distribution among the robots is the following: 44 questionnaires for Maggie (35.2%), 42 for Mini (33.6%), and 39 for Mbot (31.2%).
Regarding their age, participants were grouped into three categories: 17–30 years, with 33 participants (26.4%); 31–40 years, with 86 participants (68.8%); and more than 41 years, with six participants (4.8%). Most of the participants were male (94 participants, 75.2%) and just 31 participants were female (24.8%). Finally, regarding the educational level, 24 participants (19.2%) reported having completed only non-university studies (primary or secondary), while the majority of the participants (101, 80.8%) reported having completed university studies (a bachelor’s degree, master’s degree, or PhD).
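The percentages above follow directly from the counts and the total of 125 questionnaires. A short check, using only the figures given in this section (the helper name `shares` is introduced here for illustration), could look like this:

```python
TOTAL = 125  # questionnaires obtained, as stated in the text

robots = {"Maggie": 44, "Mini": 42, "Mbot": 39}
ages = {"17-30": 33, "31-40": 86, "more than 41": 6}
gender = {"male": 94, "female": 31}
education = {"non-university": 24, "university": 101}


def shares(counts, total=TOTAL):
    """Convert counts to percentages (one decimal), checking that they add up."""
    if sum(counts.values()) != total:
        raise ValueError("counts do not add up to the total")
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for label, counts in [("robot", robots), ("age", ages),
                      ("gender", gender), ("education", education)]:
    print(label, shares(counts))
```

Every grouping sums to 125, and the rounded shares match the percentages reported in the text.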

4. Results

This section introduces a thorough analysis of the questionnaires, grouping the results regarding the research questions presented in Section 3.4. The software used in the statistical analysis of the results was IBM SPSS [40].
In our analysis, the TTS system is the independent variable and the scores given for each feature (one per research question) are the dependent measures. The mean and the standard deviation of these scores were calculated and are presented in the next sections. We also tested whether the differences between the mean values of the TTS systems were significant for each dependent measure using a one-way repeated measures ANOVA. When this analysis yielded a statistically significant result, we identified which TTS systems differ from one another. This information is provided in the pairwise comparison tables presented in Appendix A.
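To make the idea of a one-way repeated measures ANOVA concrete, the following is a minimal sketch of the univariate F computation on made-up data. It is not the authors' SPSS analysis (the sections below report the multivariate Wilks' Lambda statistic instead), and the function name `rm_anova_f` and the toy scores are hypothetical.

```python
def rm_anova_f(scores):
    """Univariate one-way repeated measures ANOVA.

    `scores[i][j]` is the rating given by participant i to condition j
    (here, a condition would be a TTS system). Returns (F, df_cond, df_error).
    """
    n = len(scores)          # participants
    k = len(scores[0])       # conditions
    grand = sum(sum(row) for row in scores) / (n * k)

    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    # Partition the total variability: conditions + participants + error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (n - 1) * (k - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


# Hypothetical 1-5 ratings: 4 participants x 3 systems.
toy_scores = [
    [5, 3, 1],
    [4, 3, 2],
    [5, 4, 1],
    [4, 2, 2],
]
f_value, df_cond, df_error = rm_anova_f(toy_scores)
print(f"F({df_cond}, {df_error}) = {f_value:.2f}")
```

Because every participant rates every system, the participant-to-participant variability is removed from the error term, which is what distinguishes this design from a between-subjects ANOVA.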

4.1. Intelligibility: Are All TTS Systems Equally Well Understood?

This first feature evaluates whether the voice is clearly understood. Considering the results of the multivariate test, Wilks’ Lambda (WL), there are significant differences between the TTS systems, WL = 0.101, F(7, 118) = 149.89, p < 0.001. The pairwise comparison table is presented in Figure A1 (see Appendix A).
Therefore, we can say that the answer to RQ1 is that not all the TTS systems are equally well understood. This answer allows us to rank the TTS systems: in Figure 4, the TTS system with the highest mean value is placed first (at the left of the figure) and the one with the lowest mean value is placed last (at the right of the figure). The ranking shows that, in terms of intelligibility, the best-synthesized voice corresponds to Google. Ivona also receives a good score. In fact, there is no significant difference between Ivona and Google, p = 0.228. We can identify a second group significantly different from the previous ones, composed of Loquendo, Nuance, Microsoft, and Pico. The study shows that the intelligibility of AT&T and Espeak is noticeably worse.
Figure 4. Ranking of TTS systems for Intelligibility. The vertical axis represents the mean score obtained on the questionnaires, 5 being the maximum. The error bars represent the standard deviation.
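A ranking of this kind can be produced by computing the mean and standard deviation of the scores per system and sorting by the mean. The sketch below uses invented ratings and placeholder system names (`SystemA`, `SystemB`, `SystemC`); it does not reproduce the study's data.

```python
from statistics import mean, stdev

# Hypothetical 1-5 ratings from four participants per system.
ratings = {
    "SystemB": [3, 3, 4, 2],
    "SystemA": [5, 4, 5, 4],
    "SystemC": [1, 2, 1, 2],
}


def rank_by_mean(ratings):
    """Return (name, mean, sample standard deviation), highest mean first."""
    summary = [(name, mean(values), stdev(values))
               for name, values in ratings.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)


ranking = rank_by_mean(ratings)
for name, m, sd in ranking:
    print(f"{name}: mean = {m:.2f}, sd = {sd:.2f}")
```

Sorting by the mean only orders the systems; whether two neighbors in the ranking actually differ still has to be decided by the pairwise comparisons.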

4.2. Expressiveness: Do All TTS Systems Have the Same Expressiveness?

This feature expresses how monotonous or expressive users perceive the synthetic voice generated by the TTS system to be. Again, we analyze the results provided by the ANOVA test. In this case, WL = 0.25, F(7, 118) = 49.56, p < 0.001, which means that the TTS systems differ in expressiveness. For this reason, we can say that not all TTS systems have the same expressiveness (RQ2). The pairwise comparison table is presented in Figure A2; see Appendix A.
As in the previous feature, we can use the means and standard deviations to rank the systems evaluated regarding their expressiveness.
Again, Google TTS stands out, being perceived as the most expressive system (p < 0.05) (see Figure 5). After Google, we find Loquendo, Ivona, Microsoft, and Nuance, with no significant differences among them in terms of expressiveness (p = 1). Pico, AT&T, and Espeak are perceived as the least expressive systems.
Figure 5. Ranking of TTS systems for Expressiveness. The vertical axis represents the mean score obtained on the questionnaires, 5 being the maximum. The error bars represent the standard deviation.
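Adjusted p-values of exactly 1 are typical of a Bonferroni-style correction for multiple pairwise comparisons, where each raw p-value is multiplied by the number of comparisons and capped at 1. The text does not name the correction that was applied, so the following is only a sketch under that assumption, with invented raw p-values.

```python
from math import comb


def bonferroni(p_values, n_comparisons):
    """Bonferroni adjustment: multiply each raw p-value by the number of
    comparisons and cap the result at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]


# Eight TTS systems give C(8, 2) = 28 distinct pairs.
n_pairs = comb(8, 2)

# Hypothetical raw p-values for three of those pairs.
raw = [0.0001, 0.02, 0.3]
adjusted = bonferroni(raw, n_pairs)
print(n_pairs, adjusted)
```

With 28 comparisons, any raw p-value above roughly 0.036 is reported as 1 after adjustment, which is why groups of similar systems can all show p = 1.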

4.3. Artificiality: Are All TTS Systems Equally Perceived as Robotic?

Considering artificiality, the aim is to analyze how “robotic” the participants perceive the robot’s voice to be. By robotic, we mean how metallic or non-human-like the voice sounds. The results from the multivariate test, Wilks’ Lambda, show significant differences between the TTS systems, WL = 0.37, F(7, 118) = 28.17, p < 0.001.
Given these results, the answer to RQ3 is that not all the TTS systems are equally perceived as “robotic”. Figure A3 (Appendix A) presents the pairwise comparison table for this feature.
The results show that Espeak was perceived as the most artificial TTS system, with a significant difference with respect to the other systems evaluated. Figure 6 shows the ranking regarding Artificiality where, after Espeak, the systems are sorted as follows: AT&T, Loquendo, Pico, Microsoft, Nuance, Ivona, and Google. In contrast to the intelligibility and expressiveness features, there is no clear separation into groups among the TTS systems, as each one is similar (p > 0.05) to its neighbors in the ranking. In any case, Google is perceived as the most natural TTS system, which suggests that there is a correlation between the features analyzed in this work.
Figure 6. Ranking of TTS systems for Artificiality. The vertical axis represents the mean score obtained on the questionnaires, 5 being the maximum. The error bars represent the standard deviation.

4.4. Suitability: Are All TTS Systems Equally Suitable for Each Robot?

This feature tries to investigate which TTS system is perceived as the most suitable for each of the three different social robots presented in Section 3.2. This research question is considered for each robot separately:
  • RQ4.1: are all TTS systems equally suitable for Maggie?
  • RQ4.2: are all TTS systems equally suitable for Mbot?
  • RQ4.3: are all TTS systems equally suitable for Mini?
Therefore, a one-way repeated measures ANOVA is conducted, using the scores obtained for each robot, to determine whether there are significant differences between the TTS systems in terms of their suitability for a specific robot.
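When the tested effect has a single hypothesis degree of freedom in the multivariate sense, Wilks' Lambda converts exactly to an F statistic through F = ((1 − Λ)/Λ)·(df2/df1). The sketch below applies this standard relation to the Λ and degrees of freedom reported throughout Section 4, as a consistency check only. The function name `wilks_to_f` is introduced here, and small deviations from the reported F values are expected because Λ is printed with only two or three decimals.

```python
def wilks_to_f(wilks_lambda, df1, df2):
    """Exact F equivalent of Wilks' Lambda for a single-df hypothesis."""
    return (1.0 - wilks_lambda) / wilks_lambda * (df2 / df1)


# (Lambda, df1, df2, reported F), copied from the results in Section 4.
reported = {
    "intelligibility":      (0.101, 7, 118, 149.89),
    "expressiveness":       (0.25,  7, 118, 49.56),
    "artificiality":        (0.37,  7, 118, 28.17),
    "suitability (Maggie)": (0.52,  7, 116, 15.61),
    "suitability (Mbot)":   (0.73,  7, 116, 6.22),
    "suitability (Mini)":   (0.64,  7, 116, 9.21),
}

for name, (lam, df1, df2, f_reported) in reported.items():
    f_from_lambda = wilks_to_f(lam, df1, df2)
    print(f"{name}: F from Lambda = {f_from_lambda:.2f}, "
          f"reported F = {f_reported}")
```

In all six cases the recomputed value lies within a few percent of the reported one, which is what rounding of Λ alone would produce.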

4.4.1. Maggie

According to the results obtained for Maggie (WL = 0.52, F(7, 116) = 15.61, p < 0.001), there are significant differences between the TTS systems. For this reason, we can say that not all the TTS systems are equally suitable for Maggie. Table 1 shows the descriptive statistics and Figure A4 presents the pairwise comparison table. In this figure, it can be observed that the most suitable one is Google, although Ivona, Loquendo, and Nuance obtain similar results, p > 0.112. On the other hand, the worst evaluated TTS systems, being significantly different from Google (p < 0.05), are Espeak and Pico (see Figure 7).
Table 1. Descriptive statistics for Suitability of each TTS system for Maggie.
Figure 7. Ranking of suitability, grouped by robot—the TTS systems preferred for each robot. Five is the maximum score.

4.4.2. Mbot

In the case of Mbot, the results of the multivariate test, WL = 0.73, F(7, 116) = 6.22, p < 0.001, also confirm that not all the TTS systems are perceived as equally suitable for Mbot. Table 2 presents the values of the mean and the standard deviation, and the pairwise comparison table is shown in Figure A5. For this robot, the favourite one is Ivona, with Microsoft and Google the second and the third best evaluated TTS systems. These three systems obtained similar results, p = 1. At the other end, AT&T and Loquendo were the worst evaluated TTS systems, being rated as significantly less suitable for this robot than Ivona (p < 0.05) (see Figure 7).
Table 2. Descriptive statistics for Suitability of each TTS system for Mbot.

4.4.3. Mini

Finally, for Mini, WL = 0.64, F(7, 116) = 9.21, p < 0.001, so, again, there are significant differences between the TTS systems in terms of their suitability for this robot. The descriptive statistics are presented in Table 3. According to these results, it seems that there is no clear ‘winner’ on this occasion. In the pairwise comparison table, Figure A6, it is observed that just one TTS system, Espeak, is significantly different from the rest of the systems except for AT&T. These two systems are perceived as the least suitable for Mini, so, although we cannot affirm that all the TTS systems are equally suitable for this robot, there are no significant differences between the other ones (p = 1). This means that there are six TTS systems equally suitable for Mini.
Table 3. Descriptive statistics for Suitability of each TTS system for Mini.
In Figure 7, we can observe that, as has been said, although the preferred TTS system is Loquendo, the majority of the TTS systems obtained similar results: there are no significant differences between the TTS systems except for Espeak and AT&T, which were the worst evaluated ones.

4.5. Correlations between the Four Features Analyzed

To complete this study, we analyzed the correlations between the four features using the Pearson product–moment correlation coefficient. To do so, we performed a preliminary analysis to check the conditions of normality, linearity, and homoscedasticity. The test showed a strong positive correlation between three of the features: intelligibility, expressiveness, and suitability (r > 0.476, p < 0.01). Additionally, there is a negative correlation between the previous features and artificiality (r < 0.225, p < 0.01), as shown in Table 4.
Table 4. Pearson product-moment correlations between the four features analyzed.
This means that the questions related to intelligibility, expressiveness, and suitability are directly correlated. There are two possible explanations: (i) all questions are related to the same feature or, at least, this is what participants have perceived; or (ii) there is a real relation between the analyzed features. In our opinion, the second one is the most likely cause. Under this assumption, we can infer that, if a TTS system is perceived as intelligible, it will also be perceived as expressive and, consequently, it will tend to be preferred for a social robot.
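For reference, the Pearson product-moment coefficient is the covariance of two variables divided by the product of their standard deviations. The following sketch computes it from scratch on invented ratings; the variable names and values are hypothetical and are not taken from Table 4.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical 1-5 ratings given by five participants to one voice.
intelligibility = [5, 4, 4, 2, 1]
expressiveness = [4, 4, 3, 2, 1]
artificiality = [2, 1, 2, 4, 5]

r_int_exp = pearson_r(intelligibility, expressiveness)
r_int_art = pearson_r(intelligibility, artificiality)
print(f"intelligibility vs expressiveness: r = {r_int_exp:.3f}")
print(f"intelligibility vs artificiality:  r = {r_int_art:.3f}")
```

In this toy data, ratings that rise together yield a positive r, while the artificiality ratings move in the opposite direction and yield a negative r, mirroring the pattern described above.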

5. Discussion and Conclusions

In this work, we have presented a comparison of eight TTS systems considering four features: intelligibility, how clear the voice of the robot is; expressiveness, how monotonous the voice is; artificiality, how “robotic” the robot voice is; and suitability, how adequate the voice is for a robot. The first two features are usually included in these kinds of studies as the aspects to be optimized. Additionally, we have included the last two, since, in social robotics, it is important to analyze how natural and suitable for the robot the voice is perceived. The tests have been carried out after integrating these systems into three social robots.
In total, 125 participants evaluated all features for each TTS system, but each participant considered just one of the social robots. After that, we conducted a statistical analysis to see whether there were significant differences in the results obtained by each TTS system. The method used was a one-way repeated measures ANOVA. Regarding RQ1, RQ2, and RQ3, the statistical analysis shows that there are differences in terms of intelligibility, expressiveness, and artificiality between the TTS systems. This allows us to compare the systems, indicating which ones are the most and least intelligible, expressive, and artificial. Moreover, the analysis indicates that a direct correlation exists between intelligibility and expressiveness, and an inverse correlation between these features and artificiality. In general, the TTS system provided by Google is the best rated one with respect to intelligibility and expressiveness, and is perceived as the least artificial. Finally, Espeak is at the end of the ranking, being perceived by users as robotic, monotonous, and unclear.
In relation to RQ4, we observe that, although for each robot there are significant differences between the TTS systems, we cannot conclude that there is just one most suitable TTS system for each robot. In fact, there is a set of TTS systems preferred for each robot—for Maggie: Google, Ivona, Nuance, and Loquendo; for Mbot: Ivona, Microsoft, and Google; for Mini: Loquendo, Ivona, Pico, Google, Microsoft, and Nuance.
Considering the results obtained for this feature, we can make the following observations:
  • For our three social robots, the most suitable TTS systems overall are Google and Ivona. In fact, Ivona has been, in all cases, the second best rated (with no significant differences from the first and the third ones). Therefore, this TTS system can be a good selection for these robots.
  • In relation to the least suitable TTS systems, it is interesting to note that, for Maggie and Mini, the worst evaluated system is Espeak (it is significantly different from the most suitable one, p < 0.05). On the contrary, this TTS system is not perceived as the least suitable one for Mbot.
    One reason could be that Maggie and Mini have more physical similarities between them (Mini is a small version of Maggie) than with Mbot. Another reason could be related to gender issues. One aspect about the TTS systems that has not been considered until now is the gender of the synthesized voice. This characteristic may seem to be not very relevant at first, but, considering that we give names to the robots, which people can associate with the feminine or masculine gender, this feature must be considered in order to evaluate the suitability of a particular voice with a specific robot. All TTS systems have been tested using a feminine voice except for Espeak, which uses a masculine voice. According to our own experience, people tend to refer to Maggie and Mini as feminine, and to Mbot as masculine. Therefore, it is logical that this TTS system is perceived as less suitable for Maggie and Mini, and not so unsuitable for Mbot.
  • In general, the TTS systems that are evaluated as the most ‘robotic’ ones (Espeak, AT&T, and Pico) are also considered less suitable for the robots. This seems to be a contradiction, but it must be noted that these TTS systems are also the ones that were evaluated as the least clearly understood by the participants (intelligibility).

Limitations and Lessons Learned

The work presented in this paper has some limitations. First of all, the validity of the analysis might be influenced by the language used in the experiments: Spanish. Although this may not be a limitation per se, we limited the study to TTS systems that offered that specific language. Therefore, we have missed other interesting TTS systems.
Another limitation, also related to the selection process, is that these eight systems were also chosen because of their price. As in the previous point, this may have caused some good TTS systems (maybe better than the ones considered in this paper) to be discarded.
In relation to the suitability feature, just three social robots were used, and, moreover, they may have some resemblance to each other: all of them have a head, eyes, similar colors, etc. This fact may explain the results and conclusions obtained in RQ4: although there are some TTS systems clearly not suitable for the robots, when selecting the most suitable one, we do not have a clear winner for each robot.
It should be noted that the participants filled in the questionnaires after watching a video of the robots speaking instead of interacting with the robots directly. Using videos had an important advantage: it allowed us to reach a larger number of participants. However, we are aware that the lack of direct interaction may have introduced some bias. In addition, the voice quality perceived by the participants may have been affected by several factors: the microphone used to record the utterances; the audio codec used in the videos; the recording distance and position with respect to the robot; and the sound equipment used by the participants. To mitigate the first factors, we made sure that all recordings were made from the same position with respect to the robots, for every TTS system, and with a high-quality recording system. The sound equipment used by the participants, however, was beyond our control.
Finally, another factor that should be taken into account is that the name of the TTS system is said in the videos. Although Google objectively performs very well, participants may have been influenced by its name, since it is a well-known brand (authority bias). Likewise, the order in which the utterances were heard could have introduced a comparison bias, since all the users evaluating the TTS systems for a given robot heard the utterances in the same order.

Author Contributions

All authors have actively contributed to the elaboration of the manuscript. More particularly, F.A.M. performed the integration of the TTS systems in the robot architecture; Á.C.-G. and M.Á.S. focused on the statistical analysis; and M.M. and J.C.C. set up the test scenario and collected the necessary data. All authors have read and agreed to the published version of the manuscript.

Funding

The research leading to these results has received funding from the projects: “Development of social robots to help seniors with cognitive impairment (ROBSEN)”, funded by the Ministerio de Economia y Competitividad; “RoboCity2030-DIH-CM”, funded by Comunidad de Madrid and co-funded by Structural Funds of the EU; “Robots Sociales para estimulación física, cognitiva y afectiva de mayores (ROSES)” funded by Agencia Estatal de Investigación (AEI).

Acknowledgments

The authors thank all the entities that have financed part of this research, as well as all the users who participated and contributed to the development of this work.

Conflicts of Interest

The authors of this paper certify that they have NO affiliations with or involvement in any organization or entity with any financial interest, or non-financial interest in the subject matter or materials discussed in this manuscript.

Appendix A

Figure A1. Pairwise comparisons for Intelligibility. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A2. Pairwise comparisons for Expressiveness. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A3. Pairwise comparisons for Artificiality. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A4. Pairwise comparisons for Suitability and the robot Maggie. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A5. Pairwise comparisons for Suitability and the robot Mbot. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.
Figure A6. Pairwise comparisons for Suitability and the robot Mini. Pairs of TTS systems with significant differences are highlighted in yellow. Note that, due to the language configuration of the system, the decimal part is delimited by a comma.

References

  1. Van Bezooijen, R.; Pols, L.C. Evaluating text-to-speech systems: Some methodological aspects. Speech Commun. 1990, 9, 263–270. [Google Scholar] [CrossRef]
  2. Handley, Z. Is text-to-speech synthesis ready for use in computer-assisted language learning? Speech Commun. 2009, 51, 906–919. [Google Scholar] [CrossRef]
  3. O’Malley, M. Text-to-speech conversion technology. Computer 1990, 23, 17–23. [Google Scholar] [CrossRef]
  4. Klatt, D.H. Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 1987, 82, 737. [Google Scholar] [CrossRef] [PubMed]
  5. Pappas, C. Top 10 Text to Speech (TTS) Software for eLearning. 2019. Available online: https://elearningindustry.com/top-10-text-to-speech-tts-software-elearning (accessed on 12 December 2019).
  6. Comparison of Speech Synthesizers. 2018. Available online: https://en.wikipedia.org/wiki/Comparison_of_speech_synthesizers (accessed on 12 December 2019).
  7. Dutoit, T.; Pagel, V.; Pierret, N.; Bataille, F.; Van der Vrecken, O. The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP’96, Philadelphia, PA, USA, 3–6 October 1996; Volume 3, pp. 1393–1396. [Google Scholar]
  8. Cao, H.; de Perre, G.V.; Simut, R. Enhancing My Keepon robot: A simple and low-cost solution for robot platform in Human-Robot Interaction studies. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (ROMAN), Edinburgh, UK, 25–29 August 2014; pp. 555–560. [Google Scholar]
  9. Wada, K.; Ikeda, Y.; Inoue, K.; Uehara, R. Development and preliminary evaluation of a caregiver’s manual for robot therapy using the therapeutic seal robot Paro. In Proceedings of the 19th International Symposium in Robot and Human Interactive Communication, Viareggio, Italy, 13–15 September 2010; pp. 533–538. [Google Scholar]
  10. Fujita, M. On activating human communications with pet-type robot AIBO. Proc. IEEE 2004, 92, 1804–1813. [Google Scholar] [CrossRef]
  11. Shamsuddin, S.; Ismail, L.I.; Yussof, H.; Zahari, N.I.; Bahari, S.; Hafizan, H.; Jaffar, A. Humanoid robot NAO: Review of control and motion exploration. In Proceedings of the 2011 IEEE International Conference on Control System, Computing and Engineering, Penang, Malaysia, 25–27 November 2011; pp. 511–516. [Google Scholar]
  12. Lafaye, J.; Gouaillier, D.; Wieber, P.B. Linear model predictive control of the locomotion of Pepper, a humanoid robot with omnidirectional wheels. In Proceedings of the 2014 IEEE-RAS International Conference on Humanoid Robots, Madrid, Spain, 18–20 November 2014; pp. 336–341. [Google Scholar]
  13. Tsagarakis, N.; Metta, G.; Sandini, G. iCub: The design and realization of an open humanoid platform for cognitive and neuroscience research. Adv. Robot. 2007, 21, 1151–1175. [Google Scholar] [CrossRef]
  14. Metta, G.; Sandini, G.; Vernon, D. The iCub humanoid robot: An open platform for research in embodied cognition. In Proceedings of the PerMIS ’08, Workshop on Performance Metrics for Intelligent Systems, Gaithersburg, MD, USA, 19–21 August 2008; pp. 50–56. [Google Scholar]
  15. Group, A. Acapela. 2019. Available online: http://www.acapela-group.com (accessed on 12 December 2019).
  16. Kenmochi, H.; Ohshita, H. VOCALOID-commercial singing synthesizer based on sample concatenation. In Proceedings of the INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 27–31 August 2007; pp. 4009–4010. [Google Scholar]
  17. Kenmochi, H. VOCALOID and Hatsune Miku phenomenon in Japan. In Proceedings of the Interdisciplinary Workshop on Singing Voice, Tokyo, Japan, 1–2 October 2010. [Google Scholar]
  18. Tachibana, M.; Nakaoka, S.; Kenmochi, H. A singing robot realized by a collaboration of VOCALOID and Cybernetic Human HRP-4C. In Proceedings of the Interdisciplinary Workshop on Singing Voice (InterSinging 2010), Tokyo, Japan, 1–2 October 2010. [Google Scholar]
  19. Apple. Siri. 2019. Available online: http://www.apple.com/ios/siri (accessed on 12 December 2019).
  20. Google. Google Now. 2019. Available online: https://www.google.com/landing/now (accessed on 12 December 2019).
  21. Amazon. Kindle. 2019. Available online: https://kindle.amazon.com (accessed on 12 December 2019).
  22. Corporation, M. Cortana. 2019. Available online: http://windows.microsoft.com/es-es/windows-10/getstarted-what-is-cortana (accessed on 12 December 2019).
  23. Roehling, S.; MacDonald, B.; Watson, C. Towards expressive speech synthesis in english on a robotic platform. In Proceedings of the Australasian International Conference on Speech Science and Technology, Auckland, New Zealand, 6–8 December 2006; pp. 130–135. [Google Scholar]
  24. Bakhsh, N.K.; Alshomrani, S.; Khan, I. A comparative study of Arabic text-to-speech synthesis systems. Int. J. Inf. Eng. Electron. Bus. 2014, 6, 27. [Google Scholar] [CrossRef]
  25. Shruthi, G.; Kumar, P. Comparative study of text to speech system for indian language. Int. J. Adv. Comput. Inf. Technol. 2012, 1, 199–209. [Google Scholar]
  26. Francis, A.; Nusbaum, H. Evaluating the quality of synthetic speech. In Human Factors and Voice Interactive Systems; Springer: Boston, MA, USA, 1999; pp. 63–97. [Google Scholar]
  27. Handley, Z.; Hamel, M. Establishing a methodology for benchmarking speech synthesis for computer-assisted language learning (CALL). Lang. Learn. Technol. 2005, 9, 99–120. [Google Scholar]
  28. ITU-T. Transmission Quality Subjective Opinion Tests. A Method for Subjective Performance Assessment of the Quality of Speech Voice Output Devices. Available online: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.85-199406-I!!PDF-E&type=items (accessed on 12 December 2019).
  29. MOS Scale. 2019. Available online: https://en.wikipedia.org/wiki/Mean_opinion_score (accessed on 12 December 2019).
  30. Viswanathan, M. Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale. Comput. Speech Lang. 2005, 19, 55–83. [Google Scholar] [CrossRef]
  31. King, S. Measuring a decade of progress in text-to-speech. Loquens 2014, 1, 6. [Google Scholar] [CrossRef]
  32. Alonso-Martin, F. Sistema de Interacción Humano-Robot Basado en Diálogos Multimodales y Adaptables. Ph.D. Thesis, Universidad Carlos III de Madrid, Madrid, Spain, 2014. [Google Scholar]
  33. Alonso-Martín, F.; Castro-González, A.; Luengo, F.; Salichs, M. Augmented Robotics Dialog System for Enhancing Human–Robot Interaction. Sensors 2015, 15, 15799–15829. [Google Scholar] [CrossRef] [PubMed]
  34. Gonzalez-Pacheco, V.; Ramey, A.; Alonso-Martin, F.; Castro-Gonzalez, A.; Salichs, M.A. Maggie: A Social Robot as a Gaming Platform. Int. J. Soc. Robot. 2011, 3, 371–381. [Google Scholar] [CrossRef]
  35. Salichs, M.; Barber, R.; Khamis, A.; Malfaz, M.; Gorostiza, J.; Pacheco, R.; Rivas, R.; Corrales, A.; Delgado, E.; Garcia, D. Maggie: A Robotic Platform for Human-Robot Social Interaction. In Proceedings of the 2006 IEEE Conference on Robotics, Automation and Mechatronics, Bangkok, Thailand, 1–3 June 2006; pp. 1–7. [Google Scholar]
  36. Castro-González, Á.; Castillo, J.C.; Alonso-Martín, F.; Olortegui-Ortega, O.V.; González-Pacheco, V.; Malfaz, M.; Salichs, M.A. The Effects of an Impolite vs. a Polite Robot Playing Rock-Paper-Scissors. In Proceedings of the International Conference on Social Robotics, Kansas City, MO, USA, 1–3 November 2016; pp. 306–316. [Google Scholar]
  37. González-Pacheco, V.; Castro-González, Á.; Malfaz, M.; Salichs, M.A. Human-Robot Interaction in the MOnarCH project. In Proceedings of the 13th Robocity2030 Workshop, Madrid, Spain, 11 December 2015; pp. 1–8. [Google Scholar]
  38. Monarch European Project. 2019. Available online: http://monarch-fp7.eu (accessed on 12 December 2019).
  39. Google. Google Forms. 2019. Available online: https://www.google.es/intl/es/forms/about (accessed on 12 December 2019).
  40. IBM. SPSS. 2019. Available online: http://www-01.ibm.com/software/es/analytics/spss (accessed on 12 December 2019).
