Improving Aphasic Speech Recognition by Using Novel Semi-Supervised Learning Methods on AphasiaBank for English and Spanish

Abstract: Automatic speech recognition in patients with aphasia is a challenging task for which studies have been published in only a few languages. Understandably, the systems reported in the literature within this field show significantly lower performance than those focused on transcribing non-pathological clean speech. This is mainly due to the difficulty of recognizing less intelligible speech, as well as to the scarcity of annotated aphasic data. This work focuses on applying novel semi-supervised learning methods to the AphasiaBank dataset in order to deal with these two major issues, reporting improvements for the English language and providing the first benchmark for the Spanish language, for which less than one hour of transcribed aphasic speech was used for training. In addition, the influence of reinforcing the training and decoding processes with out-of-domain acoustic and text data is described by using different strategies and configurations to fine-tune the hyperparameters and the final recognition systems. The results obtained encourage extending this technological approach to other languages and scenarios where the scarcity of annotated data to train recognition models is a challenging reality.


Introduction
Aphasia is a language disorder that causes impairments in dimensions including speech, writing, interaction or communication. People with aphasia (PWA) mainly acquire this disorder after suffering a stroke, a traumatic brain injury, a brain tumor or any other condition affecting specific areas of the brain that are related to language. In particular, aphasia is more likely to develop when the affected areas are located in the left hemisphere [1]. Every year, millions of people worldwide acquire aphasia through one of these causes, and its prevalence in the general population ranges between 6 and 62 people per 100,000 inhabitants depending on the region and country [2][3][4]. This proportion may rise to 30-60% among people who have survived a stroke, which is the second leading cause of death globally [4][5][6].
PWA may acquire communication impairments that affect their daily life to different degrees depending on the severity of the disorder [7]. Usually, these impairments are classified with the scale proposed by the Western Aphasia Battery (WAB) [8], ranging from mild to very severe depending on the performance on several tasks that include reading, speech or writing, among others [8]. On the other hand, aphasia disorders can also be distinguished by a combination of symptoms and the affected physical areas [7]. The most widespread classification uses the Wernicke-Lichtheim model, which associates communication capabilities with different brain regions [9,10], differentiating three main types of aphasia depending on the damaged area: Broca, Wernicke and Anomic. Nevertheless, language comprehension and production are not isolated in the specific brain areas considered by this model [11], and more modern and complete theories, e.g., the dual stream model [12], consider that language capabilities are organized in a distributed system across different cortical regions, emphasizing the connections between them [13][14][15]. However, the cortical damage that causes aphasic impairment has barely been mapped using these new theories; therefore, the Wernicke-Lichtheim model is still the most widely used method in clinical assessment [11].
Intensive speech therapy conducted by interdisciplinary groups of clinical experts has a fundamental role in recovering the communication abilities of PWA [16]. In recent years, intense research in speech recognition technology has promised to support the work of these clinical experts by automating processes and improving access to therapy in isolated areas and/or less favored socioeconomic environments and groups. In this sense, some applications such as Constant Therapy [17], Lingraphica [18] and Tactus Therapy [19], whose usefulness has been recognized by the National Aphasia Association of the United States (https://www.aphasia.org/, accessed on 15 July 2021), provide exercises to practice speech, language and cognitive tasks that are customized to the progress of each PWA. These applications have been proven to reinforce the therapy, achieving the set goals in less time [20], especially in rural areas [21]. Other technological applications focus on the adaptation of standard cognitive tests [22] or on the automatic quantitative analysis of aphasia severity through speech [23]. Taken together, these new techniques and solutions promise to enhance face-to-face therapy, to extend the treatment to more patients and, therefore, to improve the quality of life of PWA.
Nonetheless, there are still challenges related to automatic speech recognition (ASR) that must be solved in order to extend these therapy applications worldwide, since they depend on engines that properly recognize aphasic speech. ASR systems are usually trained with the voices of people without any speech pathology, and their performance degrades when they are applied to aphasic speech [23][24][25][26][27]. Furthermore, ASR systems are usually language-dependent and have to be trained with hundreds or thousands of hours of transcribed speech. In many cases, this requirement prevents extending their use to the thousands of languages currently spoken in the world and, particularly, to aphasic speech recognition, since such large amounts of annotated data are not available for training recognition models with the more traditional supervised learning methods.
In this work, we explore the application of novel semi-supervised end-to-end (E2E) learning methods on ASR to perform aphasic speech recognition in English and Spanish in a very challenging scenario with little annotated data. More specifically, we make use of the wav2vec2.0 architecture [28], building models adapted to aphasic speech for English and Spanish and comparing the results with previous fully supervised technological approaches presented in the literature. In particular, we achieved a relative reduction in Word Error Rate (WER) of ∼25% for the English test set compared with previously published results. In addition, we demonstrate that this technological approach can be extended to perform aphasic speech recognition with little annotated data. To this end, we built the first Spanish E2E model adapted to aphasic speech recognition with less than one hour of data from PWA and report the first results in the literature for this language and domain.
The rest of the paper is organized as follows: Section 2 introduces previous work in aphasic speech recognition. Section 3 details the process performed over the main corpora used for the experiments in addition to the creation and compositions of the train, validation and test partitions. In Section 4, the speech recognition architectures and constructions are explained, whilst the evaluation results obtained over different configurations of the systems are presented in Section 5 for English and Spanish. Finally, Section 6 concludes the paper and presents future work.

Related Work in Aphasic Speech Recognition
ASR is a technological field that has evolved remarkably over recent years thanks to new methods and architectures based on Deep Neural Networks (DNNs), which are approaching human-like performance in controlled acoustic environments [28][29][30][31][32]. These improvements have great potential to impact new ASR clinical applications and to develop new e-health solutions [33][34][35]. In particular, ASR technology applied to disordered voices brings the opportunity to implement new assisted and personalized therapies, to generate automatic cognitive tests or to develop adapted applications for people with impairments.
The first ASR systems for aphasic speech recognition found in the literature were focused on recognizing isolated words within small vocabularies for English [36] and Portuguese [24]. More recently, thanks to the advancements in deep learning speech recognition technologies, new studies achieved up to 90% accuracy on assessing correct versus incorrect naming attempts in controlled utterance verification systems [37]. However, the biggest challenge in the field nowadays is to improve the performance of large-vocabulary continuous recognition of aphasic speech. To the best of our knowledge, the published works on large-vocabulary continuous aphasic speech recognition only consider English [23,38,39] and Cantonese [40] to date. The results of these systems vary widely depending on the severity level of aphasia, with WER ranging from 33 in the mildest cases to more than 60 in very severe cases. All these studies employ the same AphasiaBank database [41] as the main corpus for training and evaluation, but they usually differ in the train-test-validation partitions and in the evaluation metrics employed, given that some studies used the Phoneme Error Rate (PER) as their main metric and others employed the Character Error Rate (CER). This decision strongly depends on the configuration and the basic modeling unit used to train their systems (phonemes or characters). Hence, a fair and balanced comparison between systems and technological approaches cannot always be guaranteed. Nonetheless, in some cases, notable improvements can be appreciated, such as between the 52.3 PER on the moderate aphasia test group presented in [25] and the more recent 41.7 PER reported in [39]. These results seem to be in line with the 38.3 global Syllable Error Rate (SER) reported for the full test set in Cantonese [40], where more than 60% of the test set was composed of mild severity speech data.
Regarding technological approaches, previous works focused on developing ASR technology for aphasic speech considering architectures based on hybrid Acoustic Models (AMs) such as Deep Neural Networks and Hidden Markov Models (DNN-HMM) [25], Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNN) [23], and solutions based on Mixture of Experts (MoEs) [39]. More specifically, in the work presented in [38], the authors established the first large-vocabulary continuous speech recognition baseline for English built on the AphasiaBank dataset using a DNN-HMM hybrid AM trained on unseen train-validation-test partitions and by distinguishing performances depending on aphasia severity. They reached PER values between 47.41 for the mild severity test set and 75.81 for the very severe test set and reported that appending utterance-level fixed-length speaker identity vectors (i-vectors) to frame-level acoustic features resulted in PER reductions, especially for speakers with more severe levels of aphasia. These results were then improved by using an acoustic modeling method based on a BLSTM-RNN architecture enriched with a trigram language model (LM) estimated on the transcripts of the training audio [23]. In this case, the training of the AM was reinforced with transcribed data from healthy speakers, achieving an improved WER ranging from 33.68 on the mild test set to 53.17 on the very severe test set. In the work described in [39], an AM based on a MoE of DNN models was proposed, where each expert in the model was specialized in a specific aphasia severity. Additionally, a Speech Intelligibility Detector (SID) composed of two hidden layers and a final softmax function was trained to detect the Aphasia Quotient (AQ) severity level of a given speech frame by using the acoustic features and utterance-level speaker embeddings. At inference time, the contribution of each expert was decided by the SID module. Once again, the train-validation-test partitions were randomly generated, and they achieved PER values ranging from 33.37 on the mild test set to 61.41 on the severe test set.
Finally, the first ASR system for Cantonese continuous aphasic speech was described in the work presented in [40]. The authors used a Time Delay Neural Network (TDNN) combined with a BLSTM model as the main AM, which was trained with both in-domain and out-of-domain speech data, together with a syllable-based trigram LM. The performance of the system was evaluated at the syllable level by using the SER metric. No distinctions between aphasia severities were reported in this work, which yielded an overall SER of 38.77 for aphasic speech and 15.07 for the healthy speakers.
In summary, over recent years the recognition of aphasic speech has benefited from the latest improvements in ASR based on fully supervised learning methods, gradually enhancing its performance and, thus, allowing its application in real clinical and therapy tools. In this work, we show that semi-supervised learning methods have great potential in this particular domain, reporting WER improvements for English and competitive results for Spanish considering the scarcity of annotated PWA data (less than 1 h) for this language.

General Description
In this work, transcribed speech data from the AphasiaBank dataset [41] were used as the main corpus. AphasiaBank is a computerised database of interviews between PWA and clinicians. The interviews are provided as video recordings, and they were transcribed and transformed into the CHAT file format following a protocol designed by a panel of experts based on previous successful experiences [42]. This protocol mainly consisted of narrative and procedural discourse in order to maximize task comparability across participants [41].
The contents in the original AphasiaBank dataset are organized by the severity of the aphasia impairment for the English language. This measurement was performed with the standardized comprehensive assessment based on the WAB scale, yielding an AQ value ranging from 0 to 100. A lower AQ value means a higher degree of aphasia severity. The AQ score served as a threshold to classify patients into four aphasic levels: mild (AQ > 75), moderate (50 < AQ ≤ 75), severe (25 < AQ ≤ 50) and very severe (0 < AQ ≤ 25) [41].
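As an illustration, the AQ thresholds can be written as a small lookup function. This is only a sketch: the function name `severity_from_aq` is hypothetical, the mild group is read as AQ above 75 (consistent with lower AQ values meaning higher severity), and assigning AQ = 0 to the very severe group is an assumption, since the boundary case is not specified in the text.

```python
def severity_from_aq(aq: float) -> str:
    """Map a WAB Aphasia Quotient (0-100) to one of the four severity groups.

    Hypothetical helper; thresholds follow the ranges quoted in the text.
    """
    if not 0 <= aq <= 100:
        raise ValueError(f"AQ must lie in [0, 100], got {aq}")
    if aq > 75:
        return "mild"
    if aq > 50:
        return "moderate"
    if aq > 25:
        return "severe"
    # Assumption: AQ == 0 is grouped with the very severe cases.
    return "very severe"
```

Each upper bound is inclusive, so a speaker with an AQ of exactly 75 falls in the moderate group and one with exactly 25 in the very severe group.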
Regarding the amount of data, at the time the authors accessed the database, the full English subpart of the AphasiaBank dataset included 116 h and 54.9 h of transcribed speech from 435 PWA and healthy control speakers, respectively, collected at various sites across the United States and Canada [41]. The PWA speakers were organized by the severity of their aphasia impairment. By contrast, for Spanish, the available data only included chunks from four PWA collected at four different sites across the United States, amounting to a total of 1.2 h of transcribed speech [41]. In this case, with the aim of adding content from healthy people, 1 h (700 speech utterances) from the Spanish Mozilla Common Voice corpus [43] was selected in order to reinforce the training of the Spanish AM. It should be noted that no information about the aphasia severity of the Spanish PWA was reported in the original database.

Data Processing
The original data from the AphasiaBank dataset were processed at different acoustic and text levels in order to generate suitable corpora to build the E2E AMs for English and Spanish. The audio files were first extracted from the video recordings and converted to PCM WAV 16 kHz 16-bit format using the open-source FFmpeg tool [44]. Since the timecodes were provided at the sentence level, the audio was split into correctly aligned audio chunks by using the SoX [45] toolkit in order to manage shorter segments for the training of the neural models. In this respect, audio chunks shorter than 0.3 s were discarded to avoid problems when computing the Fourier transform for spectrogram generation or during the CTC layer alignment in the neural network. Furthermore, audio chunks longer than 30 s were not included in our corpus, with the aim of avoiding memory issues during training.
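The duration criteria and the audio extraction step can be sketched as follows. This is an illustrative sketch, not the scripts used to build the corpus: the `Chunk` type and both function names are hypothetical, keeping chunks of exactly 0.3 s or 30 s is an assumption, and the FFmpeg argument list is one plausible way to obtain 16 kHz 16-bit PCM audio (the exact options used are not given in the text).

```python
from dataclasses import dataclass

MIN_DURATION_S = 0.3   # shorter chunks are discarded
MAX_DURATION_S = 30.0  # longer chunks are left out to limit memory use


@dataclass(frozen=True)
class Chunk:
    """A sentence-level segment defined by its timecodes (hypothetical type)."""
    recording: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def keep_chunk(chunk: Chunk) -> bool:
    """Return True if the chunk duration lies within the accepted range."""
    return MIN_DURATION_S <= chunk.duration_s <= MAX_DURATION_S


def ffmpeg_extract_command(video_path: str, wav_path: str) -> list:
    """Build an FFmpeg argument list for 16 kHz, 16-bit PCM WAV extraction.

    Assumed options: -vn drops the video stream, pcm_s16le selects
    16-bit PCM and -ar sets the sampling rate.
    """
    return ["ffmpeg", "-i", video_path, "-vn",
            "-acodec", "pcm_s16le", "-ar", "16000", wav_path]
```

The returned list could be passed to `subprocess.run`, and `keep_chunk` applied to every segment before it is written to the training manifest.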
Concerning the text transcriptions, they originally contained enriched information including not only literally transcribed words and phenomena such as repetitions, sound fragments and phonological transcription but also artifacts such as misalignments or phoneme omissions. In the latter cases, different criteria were applied in order to maintain or definitively discard these phenomena. In the cases where some phonemes were missing but the full word was intelligible, we chose to maintain the entire word, although some of its phonemes may not have been properly pronounced. Moreover, the repetitions of words and semantic mismatches that may occur during the speech were also preserved, since replacing them would not reflect the real speech patterns of PWA. Additionally, it should be remarked that the transcriptions also included special symbols representing isolated noises, interjections or fillers, including um, uh, uhuh or huh, among others. These symbols included (FLR) to represent fillers; (SPN) for spoken noises; (BRTH) for breathing sounds; and (LAU) for laughter. These special symbols were included for training and considered as individual words and characters in the acoustic E2E model. Moreover, contents with empty or mismatched transcriptions were discarded. Table 1 illustrates this methodology with a real example that includes the original and processed transcription of an audio chunk from a female English speaker with moderate non-fluent Broca aphasia. Since standard partitions for train, validation and test are not provided in the original AphasiaBank dataset, we applied the following criteria to split the processed data.
For the English corpus, we randomly selected 25% of the PWA speakers from each severity level for the test partition, 19% of the PWA speakers for the validation set and the remaining 56% for the training set to create speaker-disjoint train/test/validation sets. This train partition was called the PWA acoustic set. In addition, we also created a second training set, which we called the Mixed acoustic set, by adding data from healthy controls. The train/test/validation partitions were designed so that speakers cannot appear in more than one subset while the data remain balanced across the aphasia severities. Moreover, both validation and test sets were composed only of data from PWA. In this manner, we could compare two different train sets to investigate the usefulness of adding healthy control data in order to improve the performance of the ASR model. Detailed information on the constructed English corpus can be found in Table 2, including the number of subjects, the number of hours per partition and the levels of aphasia considered.
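A speaker-level split of this kind, stratified by severity, could be sketched as below. The function name, the fixed random seed and the rounding of group sizes are illustrative assumptions rather than the exact procedure followed for the corpus.

```python
import random
from collections import defaultdict


def split_speakers(severity_by_speaker, test_frac=0.25, valid_frac=0.19, seed=0):
    """Split speakers into train/valid/test, stratified by severity level.

    severity_by_speaker maps a speaker id to its severity group. Every
    speaker ends up in exactly one partition, so no speaker is shared
    between train, validation and test.
    """
    by_level = defaultdict(list)
    for speaker, level in sorted(severity_by_speaker.items()):
        by_level[level].append(speaker)

    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    split = {"train": [], "valid": [], "test": []}
    for level in sorted(by_level):
        speakers = by_level[level]
        rng.shuffle(speakers)
        n_test = round(len(speakers) * test_frac)
        n_valid = round(len(speakers) * valid_frac)
        split["test"] += speakers[:n_test]
        split["valid"] += speakers[n_test:n_test + n_valid]
        split["train"] += speakers[n_test + n_valid:]  # remaining ~56%
    return split
```

Because the fractions are applied within each severity group, every partition keeps roughly the same severity balance as the full set of PWA speakers.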

Experimental Setup
Given that the original Spanish corpus from the AphasiaBank dataset was composed of only 4 PWA participants without information about their aphasia severity level, a different configuration was followed for this language while maintaining the same partition percentages. In this case, 56%, 19% and 25% of the audio chunks were randomly selected from each PWA speaker to form the train, validation and test sets, respectively. As in the case of the English language, two train sets were also created for Spanish: the PWA acoustic set, including only PWA data for training, and the Mixed acoustic set, which added data from healthy controls. The configuration of the PWA acoustic set allowed us to explore the ability to train an ASR system with an extremely small amount of data using semi-supervised learning methods. Detailed information for each Spanish partition, including the total number of speakers and hours, is summarized in Table 3.

Semi-Supervised Learning Based System
In this section, the ASR architecture based on semi-supervised learning techniques used during this research is described, providing details on the strategies employed to find the best hyperparameters and the fine-tuning techniques implemented. Finally, the two decoding strategies used to generate the recognition hypothesis are described as well.

Main Architecture
The main ASR architecture used in this work is based on the unsupervised E2E model wav2vec2.0 proposed by Facebook AI [28], which is schematically represented in Figure 1. The wav2vec2.0 model maps speech audio through a multi-layer convolutional feature encoder f : X → Z to latent speech representations z_1, ..., z_T, which are fed into a Transformer network g : Z → C to output context representations c_1, ..., c_T. The latent representations are also quantized to q_1, ..., q_T in order to represent the targets in the self-supervised learning objective [28,46]. The feature encoder contains seven blocks, and the temporal convolutions in each block include 512 channels with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 2, 2). The Transformer used had 24 blocks, a model dimension of 1024, an inner dimension of 4096 and a total of 16 attention heads. The model was pretrained by solving a contrastive task over masked feature encoder outputs. Afterwards, it was fine-tuned on the aphasia domain by adding a randomly initialized linear projection on top of the context network into C classes representing the vocabulary of the task [47] and optimized by using a Connectionist Temporal Classification (CTC) layer [28,46,48]. The pretrained model was XLSR-53 [46], which was originally trained with 56,000 h of untranscribed speech data in 53 different languages, including English and Spanish. These data were composed of audio from the CommonVoice [43], Babel [49] and Multilingual Librispeech (MLS) [50] datasets. The unsupervised task learns a set of quantized latent speech representations shared across languages that are later combined during supervised training to identify the phonemes or characters to decode. The speech audio representations are learned by solving a contrastive task, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors [28]. This strategy has been shown to be capable of learning language-independent universal quantized representations of speech that can then be combined to train the specific phonemes and sounds of each language [46].
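The temporal resolution implied by the stated kernel widths and strides can be worked out with a short calculation. The sketch below assumes unpadded convolutions and 16 kHz input audio; the function names are hypothetical.

```python
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, matching the 16 kHz audio used in this work


def receptive_field_and_stride(kernels, strides):
    """Return (receptive field, total stride) of the stack, in input samples."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= s                # distance between adjacent outputs, in samples
    return field, jump


def num_frames(num_samples, kernels, strides):
    """Number of latent frames produced for a waveform (no padding assumed)."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n


field, stride = receptive_field_and_stride(KERNELS, STRIDES)
print(field, stride)                              # 400 320
print(1000 * field / SAMPLE_RATE)                 # 25.0 (ms of audio per frame)
print(1000 * stride / SAMPLE_RATE)                # 20.0 (ms between frames)
print(num_frames(SAMPLE_RATE, KERNELS, STRIDES))  # 49 frames for 1 s of audio
```

Under these assumptions, each latent representation covers 25 ms of audio and frames are produced every 20 ms, which also explains why very short chunks leave too few frames for CTC alignment.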

Supervised Fine-Tuning Phase
The fine-tuning phase of the pre-trained XLSR-53 model corresponded to the supervised training where quantized representations of speech are mapped into the output vocabulary by using Connectionist Temporal Classification (CTC) loss [48]. The last layer corresponded to the vocabulary set, and it was composed of 35 characters for the case of English and 38 for the case of the Spanish language.
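To illustrate how a transcript could be mapped to such a character vocabulary while keeping each special symbol, (FLR), (SPN), (BRTH) and (LAU), as a single output label, a minimal sketch is given below. The word delimiter `|`, the blank token `<pad>`, the function names and the example sentence are assumptions for illustration and are not taken from the actual system.

```python
import re

SPECIAL_SYMBOLS = ("(FLR)", "(SPN)", "(BRTH)", "(LAU)")
WORD_DELIMITER = "|"  # assumption: label used for word boundaries
BLANK = "<pad>"       # assumption: label used as the CTC blank

# Special symbols are tried first so that they are never split into characters.
_TOKEN_RE = re.compile(
    "|".join(re.escape(s) for s in SPECIAL_SYMBOLS) + r"|\s+|."
)


def to_ctc_labels(transcript):
    """Split a transcript into CTC labels: characters and atomic special symbols."""
    labels = []
    for match in _TOKEN_RE.finditer(transcript.strip()):
        token = match.group(0)
        if token in SPECIAL_SYMBOLS:
            labels.append(token)
        elif token.isspace():
            labels.append(WORD_DELIMITER)
        else:
            labels.append(token)
    return labels


def build_vocab(transcripts):
    """Collect the label set of a corpus, with the blank at index 0."""
    labels = {label for text in transcripts for label in to_ctc_labels(text)}
    return [BLANK] + sorted(labels)


print(to_ctc_labels("(FLR) i go"))
# ['(FLR)', '|', 'i', '|', 'g', 'o']
```

With this scheme, a filler occupies one output class of the final layer instead of five separate characters.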
In a first step, we performed grid search hyperparameter tuning on the validation set, training models with small subsets of the train partition by using the Weights & Biases tools [51]. Using this information, we set the learning rate to 2 × 10^-5, with a warm-up during the first 10% of updates followed by a linear decay learning rate scheduler. Additionally, the feature and layer dropouts were set to 0.05 and 0.02, respectively, whilst the number of gradient accumulation steps was set to 3, the mask time to 0.057 and the activation and attention dropouts to 0.03 and 0.036, respectively.
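The learning rate schedule described above can be sketched as a function of the update step. The linear shape of the warm-up and the decay down to zero at the last update are assumptions; the text only specifies the peak value, the 10% warm-up and a linear decay.

```python
def learning_rate(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up to peak_lr, then linear decay (assumed to end at zero)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

For a run of 1000 updates, for example, the rate would grow linearly over the first 100 updates, reach 2 × 10^-5 at update 100 and then decrease linearly until the end of training.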
In addition, we also applied a masking strategy to the feature encoder outputs similar to the SpecAugment technique presented in [30], and mask embeddings were randomly applied, as explained in [28]. Previous research studies reported optimal numbers of weight updates between 16 k and 300 k during training, depending on the training corpus size, the training batch size and the number of GPU (Graphics Processing Unit) cards employed [46]. Following these recommendations and considering our hardware resources, we used a batch size of 6 during training and performed fine-tuning for 10 epochs on English (∼21 k updates on the PWA acoustic set and ∼50 k on the Mixed acoustic set). For the Spanish dataset, our best results were achieved by fine-tuning the model for 100 epochs when using the PWA acoustic set (∼2 k updates) and 200 epochs when using the Mixed acoustic set (∼13 k updates).

Decoding Strategies and External LMs
Two different decoding strategies were applied during the experiments for both languages. The first decoding strategy was based on a greedy-search approximation, which selected the most likely character at each step in the output sequence. Although this approach has the benefit of being very fast, its performance strongly depends on the robustness of the E2E AM, and the final output sequences may not be optimal.
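Greedy decoding of a CTC output can be sketched in a few lines: the most likely label is taken at each frame, consecutive repetitions are collapsed and blank labels are removed. The blank token `<pad>` and the word delimiter `|` are assumed names, and the function itself is illustrative rather than the decoder used in the experiments.

```python
def ctc_greedy_decode(frame_scores, vocab, blank="<pad>", word_delimiter="|"):
    """Greedy CTC decoding.

    frame_scores: one list of per-label scores (e.g. log-probabilities)
                  for each output frame.
    vocab:        list mapping a label index to its string.
    """
    # Most likely label index at every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    pieces = []
    previous = None
    for index in best:
        # Emit a label only when it changes and is not the blank.
        if index != previous and vocab[index] != blank:
            pieces.append(vocab[index])
        previous = index
    return "".join(pieces).replace(word_delimiter, " ").strip()
```

Because a blank between two identical labels resets the repetition check, a frame sequence such as l, blank, l is decoded as a double letter, whereas l, l collapses to a single one.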
As the second decoding strategy, a beam-search approximation was applied by using external LMs for rescoring the initial hypotheses of the E2E AM. Different external LMs were built for the experiments. For the English language, three LMs were trained and tested: (i) a model trained only with the transcriptions of the audio of the PWA acoustic set, called In-domain LM; (ii) a second LM using the transcriptions of the audio from the Mixed acoustic set, called Mixed LM, which mixed audio of the PWA acoustic set and healthy controls; and (iii) a final large LM, called Large LM, which included the transcriptions of the above two acoustic sets plus texts from the Librispeech [52] and CommonVoice [43] public datasets. These LMs were trained with 250 k words, 600 k words and 813.2 million words, respectively. With respect to the Spanish language, given the small amount of text from the PWA acoustic set and the Mixed acoustic set, only one external LM was trained, including the transcriptions of the audio from the PWA acoustic set and the Mixed acoustic set, in addition to texts extracted from the public CommonVoice dataset (1.8 million words) and generic news extracted from Spanish digital newspapers (25.2 million words). The model was identified as Large LM. In total, the Spanish text corpus contained 27.1 million words.
The LMs were built with the KenLM toolkit [53], with which modified Kneser-Ney smoothed 3-gram models were estimated. Beam-search decoding was performed with a beam width of 10 in all experiments, whilst the LM weight alpha and the insertion weight beta were tuned on the validation dataset for each language. In this manner, for English, an alpha value of 0.8 and a beta value of 0 were used with the In-domain LM and the Mixed LM, while an alpha value of 1.4 and a beta value of 0 were used with the Large LM. For Spanish, the alpha and beta parameters were set to 1.4 and 0, respectively.
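The role of the two decoding weights can be illustrated with the score combination commonly used in CTC beam-search decoders with an external LM, where the acoustic log-probability is combined with the LM log-probability weighted by alpha and a word count weighted by beta. The sketch below rescores an n-best list and is an assumption about the general scheme, not a description of the exact decoder used here; the toy LM and the hypotheses in the usage note are invented for illustration.

```python
def fused_score(am_logprob, lm_logprob, num_words, alpha, beta):
    """Combine acoustic and LM evidence: AM + alpha * LM + beta * word count."""
    return am_logprob + alpha * lm_logprob + beta * num_words


def rescore(hypotheses, lm_logprob_fn, alpha=0.8, beta=0.0):
    """Pick the best hypothesis from (text, acoustic log-probability) pairs.

    lm_logprob_fn is any callable returning the LM log-probability of a
    sentence, for example a wrapper around an n-gram model.
    """
    def score(hypothesis):
        text, am_logprob = hypothesis
        return fused_score(am_logprob, lm_logprob_fn(text),
                           len(text.split()), alpha, beta)

    return max(hypotheses, key=score)[0]
```

With a toy LM that prefers "i want to go" over "i want too go", an alpha of 0.8 lets the LM overturn a slightly better acoustic score for the misspelled variant, whereas an alpha of 0 reduces the procedure to picking the best acoustic hypothesis.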

Evaluation Results and Discussion
In this section, the evaluation results for English and Spanish are reported, together with the results obtained by the reference ASR systems of the literature, which are shown in Table 4. All the evaluations were performed following the experimental setup, neural acoustic and language models and decoding strategies detailed in Sections 3 and 4. In addition, a discussion of the results achieved is provided as well.

Semi-Supervised ASR Performance for English
The performances of the different ASR systems developed in this work for aphasic speech recognition in English are reported in Tables 5 and 6 for the CER and WER metrics, respectively. The results are organized by the AM of the ASR system, the acoustic data used to fine-tune the pre-trained XLSR-53-wav2vec2.0 model, the decoding type, the external LM used for rescoring the initial lattices and the aphasia severity level.
Table 5. CER results on the English corpus of AphasiaBank detailed by severity level of aphasia: mild, moderate, severe and very severe. The PWA acoustic set is composed only of PWA, and the Mixed acoustic set combines PWA and healthy controls. The In-domain LM was trained by using transcriptions from the PWA acoustic set, the Mixed LM was trained with the transcriptions from the audio of the Mixed acoustic set and the Large LM by using the transcriptions from the above acoustic sets and texts from the Librispeech and CommonVoice datasets.
As expected, audio content from more severe levels of aphasia is more challenging to transcribe, whilst the speech segments from mild severity cases are recognized with lower CER and WER values. The differences in performance between the severity groups are considerable, with up to 2x the error on the most severe groups compared with the mild cases. These large differences between AQ level groups are in line with previous publications [23,38,39], whose PER and WER results are summarized in Table 4.

At the acoustic level, the best performance was obtained when fine-tuning the XLSR-53 pre-trained model with data from the Mixed acoustic set, which included audio content from PWA and healthy controls. In this sense, we report CER and WER reductions of almost 5% when adding the healthy controls in comparison with using only audio from PWA for training. This implies that the impact of the scarcity of annotated aphasic speech can be partially reduced by incorporating speech from healthy speakers and other domains. This finding was later explored and applied on the Spanish dataset.
Regarding the beam-search decoding using external LMs for rescoring the initial lattices, this strategy clearly improved the performance of the speech recognition systems, showing different results depending on the level of severity of aphasia and the type of LM employed. At this point, it is worth remarking that the Large LM does not enhance the overall results compared with the other LMs, even though it includes more than 803 million extra words and the special symbols were ignored when computing the metrics. This suggests that, in this case, the texts from the Librispeech and CommonVoice datasets used for training the LM are too far from the domain sentences of the AphasiaBank dataset. In this manner, the best results are achieved using the Mixed LM model, reaching a 22.3 WER on the mild severity level group, a 35.1 WER on the moderate subset, a 34.1 WER for severe PWA and a 55.5 WER on very severe cases. Overall, this LM reported improvements of ∼2% in comparison with using the In-domain LM and ∼7% compared with greedy decoding.
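For reference, the WER and CER metrics, including the option of ignoring the special symbols, can be sketched with a standard Levenshtein distance. Counting spaces as characters in the CER and scoring a single utterance (rather than accumulating errors over the whole test set) are simplifying assumptions, and the example sentences in the usage note are invented.

```python
SPECIAL_SYMBOLS = {"(FLR)", "(SPN)", "(BRTH)", "(LAU)"}


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (r != h)))  # substitution
        previous = current
    return previous[-1]


def strip_special(text):
    """Split into words and drop the special symbols before scoring."""
    return [word for word in text.split() if word not in SPECIAL_SYMBOLS]


def wer(reference, hypothesis):
    """Word Error Rate in percent, ignoring special symbols."""
    ref, hyp = strip_special(reference), strip_special(hypothesis)
    if not ref:
        raise ValueError("reference has no words after removing special symbols")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character Error Rate in percent, ignoring special symbols."""
    ref = list(" ".join(strip_special(reference)))
    hyp = list(" ".join(strip_special(hypothesis)))
    if not ref:
        raise ValueError("reference has no characters after removing special symbols")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

As a usage example, scoring the hypothesis "i want too go (SPN) home" against the reference "(FLR) i want to go home" yields one substituted word out of five, that is, a WER of 20, once the special symbols are removed from both sides.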
The results obtained show that, despite the great differences in the quality of pronunciation in speakers from mild to very severe groups, the semi-supervised learning method applied in this work is able to generalize the learning of contextualized speech representations of a very diverse type of speech, improving the ASR performance for all cases. This strategy is again demonstrated in Section 5.2 for the Spanish language. Finally, although a fair and well-balanced comparison of these results cannot be fully established with the ones published in previous studies (see Table 4) considering the differences in the modeling units (character versus phoneme) and the possible mismatch in data partitions, the results provided in this work for the English language (Tables 5 and 6) constitute a significant improvement in the quality of aphasic speech recognition systems tested to date on the AphasiaBank dataset.

Semi-Supervised ASR Performance for Spanish
The evaluation results achieved for the Spanish language are summarized in Table 7 at CER and WER levels. Firstly, it is worth noting that, even though we used less than one hour of transcribed PWA speech, we were able to achieve a CER of 25.8 and a WER of 49.8 on the test set using the simplest greedy search decoding. These results were further improved by integrating audio from healthy control speakers and the Large LM trained with millions of words to rescore and enhance the initial recognition hypotheses. If we consider the challenge of the task and the previous benchmarks of English and Cantonese ASR systems, which were trained with up to 50x more hours of transcribed speech, these results can be considered very competitive and promising. Moreover, these results are, to the best of our knowledge, the first benchmark of aphasic speech recognition published for Spanish. The best initial results with the Spanish AMs trained with the PWA acoustic set were reached by fine-tuning the pre-trained model for 100 epochs, achieving a CER of 25.8 and a WER of 49.8. However, the previous results in English demonstrated that augmenting the training dataset with data from healthy controls improved the overall ASR performance. In this manner, the Spanish model trained with the Mixed acoustic set improved the WER by around 10% when fine-tuning the pre-trained model for 200 epochs. Once again, this approach showed that using semi-supervised methods in clinical domains with scarce data, together with non-pathological data augmentation, is a very promising strategy.
Finally, the best performance for this language was achieved through beam search decoding with the external Large LM. Once again, the special symbols FLR, SPN, BRTH and LAU were discarded during the evaluation, since these symbols were not covered in the generic texts. Following this strategy, we achieved a CER of 24.8 and a WER of 42.8 on the test set. These results differ from those obtained for the English subset, where the external Large LM did not improve the results at all. This may be because the Spanish AM, fine-tuned with far less data, did not learn the special symbols properly. As a result, they could be removed during evaluation without a negative impact on performance.
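The evaluation with discarded special symbols can be illustrated with a short sketch. It computes WER and CER as the Levenshtein distance divided by the reference length, after removing FLR, SPN, BRTH and LAU from both the reference and the hypothesis. The function names are hypothetical, and counting spaces as characters in the CER is an assumption; the scoring tool actually used in this work may differ in these details.

```python
SPECIAL_SYMBOLS = {"FLR", "SPN", "BRTH", "LAU"}


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_special(text):
    """Split on whitespace and drop the special symbols."""
    return [w for w in text.split() if w not in SPECIAL_SYMBOLS]


def wer(reference, hypothesis):
    """Word error rate in percent, ignoring special symbols."""
    ref_words = strip_special(reference)
    hyp_words = strip_special(hypothesis)
    if not ref_words:
        raise ValueError("reference is empty after removing special symbols")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces counted; an assumption)."""
    ref_chars = list(" ".join(strip_special(reference)))
    hyp_chars = list(" ".join(strip_special(hypothesis)))
    if not ref_chars:
        raise ValueError("reference is empty after removing special symbols")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Invented example sentences, not taken from AphasiaBank.
example_wer = wer("el gato FLR come pescado", "el pato come")
example_cer = cer("hola", "hols")
```

With this filtering, a hypothesis that omits a special symbol is not penalized, which is consistent with the observation that removing these symbols did not harm the Spanish results.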

Conclusions and Future Work
In this work, we show that semi-supervised learning methods applied to ASR are promising solutions for improving the performance of aphasic speech recognition. Moreover, we set new benchmarks for the English AphasiaBank dataset, and we performed the first study for the Spanish language. The acoustic data for training were augmented using a mix of data from PWA and healthy controls, demonstrating that this strategy considerably improves the performance. This benefit was greater for Spanish, for which less than one hour of aphasic speech data was available. These results open the door to improving ASR systems for people with aphasia and other clinical speech pathologies, or even simply to making speech recognition engines available for languages with little annotated data.
As future work, it would be interesting to check whether the performance of the systems could be improved by considering other learning rate schedulers, by tuning the SpecAugment parameters or by exploring other hyperparameter configurations. Moreover, it should be evaluated whether the results could be enhanced by fine-tuning specific models for each level of aphasia severity, as speakers in each group probably exhibit similar speech and acoustic patterns. Another strategy worth studying would be to train AMs after directly removing the special symbols and then rescore with an external Large LM. In any case, this point should be considered depending on the application, since special symbol information can be important for clinical practice but irrelevant for voice assistants. Furthermore, AMs may even be fine-tuned to the speech of individual patients by using Federated Learning approaches [54]. Finally, future studies should also focus on extending this semi-supervised learning method to other languages for which no benchmarks on aphasic speech recognition have been reported, probably due to the scarcity of annotated data.
In addition, this technology should be tested in clinical practice, as well as in real medical environments and applications.

Acknowledgments:
The authors would like to acknowledge AphasiaBank and, especially, the people who contributed to it.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: