Article

Balancing Specialization and Generalization Trade-Off for Speech Recognition Models

by Sebastian Cygert 1,2,*,†, Piotr Despot-Mładanowicz 2,† and Andrzej Czyżewski 2
1 NASK—National Research Institute, 01-045 Warsaw, Poland
2 Multimedia Systems Department, Gdańsk University of Technology, 80-222 Gdańsk, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(24), 4792; https://doi.org/10.3390/electronics14244792
Submission received: 24 October 2025 / Revised: 24 November 2025 / Accepted: 2 December 2025 / Published: 5 December 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Recently, finetuning foundation models pretrained on massive volumes of data for a downstream task has become standard practice in many machine learning applications, including automatic speech recognition (ASR). In some scenarios, we are interested in optimizing performance for the target domain (specialization) while preserving the general capabilities of the pretrained model (generalization). In this work, we study this effect for various finetuning strategies that aim to preserve the capabilities of the pretrained model. We identify model merging as a promising strategy that performs well across diverse scenarios. However, our findings show that leveraging even a small number of data points from the task whose accuracy we aim to preserve significantly improves the balance between specialization and generalization. In this context, we demonstrate that combining the simplest finetuning strategy with a memory buffer yields highly competitive results, surpassing other, more complicated approaches. Our analysis highlights the need for further research into methods that effectively utilize memory buffers, especially in low-resource scenarios. To encourage further exploration in this area, we have open-sourced our code.

1. Introduction

Most recent progress in deep learning is based on large foundation models [1] pretrained on large-scale data and finetuned for the target downstream tasks. These models, often based on transformer architectures [2], have revolutionized natural language processing (NLP), computer vision, and speech-related tasks by leveraging vast amounts of diverse training data. These advances were also brought to automatic speech recognition (ASR): the Whisper [3] model was trained on 680,000 h of audio collected from the internet and has demonstrated speech recognition capabilities in 96 languages. Such models possess impressive general capabilities: they are effective learners from scarce data (few-shot or zero-shot learning) and are robust to different types of distribution shift [1,3]. Their adaptability and ability to transfer knowledge across tasks and languages make them particularly useful in scenarios where labeled data is limited.
As with all foundation models, such models usually require some finetuning for the downstream task. For example, the Whisper training data is heavily English-centric, with about 65% of the labeled data being in English, which influences the model's performance: it performs best in English but still provides reasonable accuracy in the other supported languages. Nevertheless, these models remain a strong starting point for further finetuning; they generally perform better on low-resource languages than monolingual models trained from scratch, since similarities between languages can be leveraged.
Interestingly, when finetuning these models, their desirable properties (e.g., robustness, zero-shot performance) are forgotten [4,5]. A standard strategy to mitigate this issue is data replay. However, this is very costly and often not feasible, as the data used to train foundation models is typically kept private [3,6]. Yet this problem arises in many use cases. For example, when finetuning the Whisper model for a low-resource language, we would like the model to still excel in commonly used languages (e.g., English). Similarly, when improving ASR performance on a specialized domain, for example the medical domain with its challenging jargon and terminology, we would like to maintain the general recognition capabilities learned during pretraining.
The aforementioned problem of forgetting previously acquired knowledge, also known as catastrophic forgetting [7,8], has long been studied in the Continual Learning (CL) domain. The most popular approaches include some form of regularization, constraining changes either in the model weights [9], in the model features (e.g., DER [10]), or in the outputs (e.g., LwF [11]). Recently, weight interpolation between the pretrained and finetuned models, known as model merging, has also been used in this area [12].
However, this problem has not been extensively studied in the ASR domain. In this work, we investigate it in two scenarios: one where no retention task data (i.e., data from the language whose recognition capabilities we aim to preserve) is available, and another where some of this data is accessible. In the first, more restrictive scenario, we focus on understanding the problem of forgetting in isolation. We find that, among existing methods, model merging performs the most consistently; however, it does not resolve the forgetting problem in the low-resource adaptation scenario. In the second scenario, we explore potential solutions that mitigate this issue by using a small amount of data for experience replay. Here, we show that merging does not benefit much from the additional data, and as more data becomes available for replay, standard approaches, including simple finetuning, are very competitive.
Overall, our work offers numerous findings, and the contributions are as follows:
  • Analysis of the specialization–generalization trade-off in ASR—We systematically study how finetuning affects a model’s ability to retain general ASR capabilities while adapting to new tasks, particularly in low-resource and domain-specific scenarios.
  • Evaluation of model merging as a forgetting mitigation strategy—We find that model merging effectively balances adaptation and knowledge retention, especially when no prior task data is available, but struggles for tasks with low relatedness.
  • Demonstration of experience replay as a simple yet powerful baseline—Our results show that even a small memory buffer significantly improves retention, often outperforming more complex regularization methods.
  • Code release—We release our code to enable reproduction of our work and further investigation.

2. Related Work

ASR. Progress in ASR has been accelerated by the development of self-supervised learning (SSL) pretraining techniques [13,14,15] and by scaling them to larger datasets and models. Large speech models trained on large datasets have been studied previously in both monolingual [16] and multilingual contexts [3,17,18,19]. The availability of pretrained multilingual models in ASR has enabled transfer learning approaches for domains with limited data. This has been especially beneficial for improving speech recognition for non-standard speech and low-resource languages [20,21,22]. SSL has also been explored to adapt initially trained ASR models to new domains without using extra labels, leveraging unlabeled speech data to improve domain robustness [23]; in our work, however, we also focus on scenarios where some labels are available for the adaptation.
Robust finetuning. Naively finetuning the model to the downstream task causes forgetting of the pretrained model's knowledge [12,24]. Wortsman et al. [12] proposed weight averaging as an effective strategy that balances accuracy on the target distribution and robustness to distribution shifts. Weight averaging uses a linear interpolation between the pretrained and finetuned models. It works well when the two models are connected by a low-loss path (mode connectivity), which is often the case when finetuning pretrained models [25]. This strategy was then followed by others [26]. Zheng et al. [4] additionally observed the degradation of zero-shot performance during the finetuning stage. Their mitigation strategy combines weight averaging with knowledge distillation from external datasets. In [24], the authors proposed minimizing changes in feature space (instead of weight space) to preserve the knowledge of the pretrained model. Alternatively, it is possible to preserve the pretrained model's knowledge by changing only selected model parameters [27]. Other strategies include training only adapters for different tasks [28]. However, this requires knowing or predicting the task identity during inference, which is not always feasible.
Forgetting in ASR. The problem of forgetting has also been studied from the ASR perspective. To prevent this issue, Pekarek-Rosin and Wermter proposed combining data replay with selective model freezing [29]. This problem was extended to the continual learning setup [30,31], where the model is sequentially trained on disjoint tasks (i.e., ASR for different languages) and the goal is to achieve high accuracy for all languages. Combining weight averaging and knowledge distillation was shown to work well in such a setting [32]; however, as we show, the results of such a method can be surpassed by simple finetuning, even with a very small memory budget. Another strategy consists of increasing the model size for new tasks, which can limit forgetting [33]. However, this increases the model size and requires the model to predict the task identity during inference, which we want to avoid.
Continual Learning (CL). In CL, regularization-based approaches are the most common, operating either in the output space [11] or in the weight space [9]. Using a buffer of previous-task data, known as Experience Replay (ER) [34,35], is a very strong baseline in continual learning. While many works in CL focus on scenarios without any access to exemplars [36,37], we consider both scenarios and compare how existing methods perform in such settings.
Comparative Analysis of Related Work. In our work, we are interested in analyzing the generality–specificity trade-off in ASR, which occurs when learning different languages or specialized terminology. A similar trade-off was analyzed by Diwan et al. [38], who trained speaker-specific ASR systems and observed a decrease in performance for other speakers. Their strategy involves learning user- (task-) specific modules, with the aim of personalization for real-world embedded devices. Our work is complementary to theirs; here, we focus on tasks that can be in different languages. Additionally, we assume that one model should be able to handle all tasks, as we do not assume the task identity to be known. In contrast to approaches that train only adapters for different tasks [28], which require the task identity to be known or predicted during inference, our work avoids this constraint. Moreover, strategies such as increasing model capacity to accommodate new tasks, as proposed in [33] to mitigate forgetting, introduce additional complexity and still rely on task identity prediction during inference. This assumption is not feasible in our setting.

3. Method

Model. Whisper is an end-to-end (E2E) encoder–decoder model [39] designed for automatic speech recognition (ASR). The architecture consists of a Transformer encoder and decoder [2], optimized for transcribing audio input into text tokens. The encoder is a multi-layer Transformer network that processes audio features extracted from the input waveform. It employs self-attention mechanisms to capture temporal dependencies in the audio sequence. The decoder is also a multi-layer Transformer network. It utilizes cross-attention to the encoder outputs and autoregressively generates text tokens, translating the audio representation into a text sequence.
Let $\theta \in \mathbb{R}^N$ represent the model parameters, where $N$ is the total number of parameters in the model. The function $f(X; \theta)$ denotes the outputs (after applying softmax) of the model for an input utterance $X \in \mathbb{R}^{F \times f}$ consisting of $F$ frames of dimension $f$. The Whisper model is trained with a cross-entropy loss comparing the model output $f(X; \theta)$ for utterance $X$ with $y$, the ground-truth transcription:
$$\mathcal{L}_{CE}(X, y; \theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{V} y_{t,i} \log(\hat{y}_{t,i})$$
where $T$ is the number of tokens, $V$ is the vocabulary size, $y_{t,i}$ is the true label (one-hot encoded), and $\hat{y}_{t,i}$ is the predicted probability for class $i$ at position $t$.
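For concreteness, a minimal PyTorch sketch of this token-level objective is shown below. It operates on pre-softmax decoder logits (the log-softmax is applied inside F.cross_entropy), which is numerically equivalent to Equation (1); the padding index and tensor names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def whisper_ce_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = -100) -> torch.Tensor:
    """Token-level cross-entropy averaged over the T target tokens.

    logits:  (batch, T, V) raw decoder outputs (pre-softmax).
    targets: (batch, T) ground-truth token ids; positions equal to pad_id are ignored.
    """
    # F.cross_entropy expects (N, V) inputs, so flatten the batch and time dimensions.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)

# Toy usage: batch of 2 utterances, 5 target tokens each, vocabulary of 100.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(whisper_ce_loss(logits, targets).item())
```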
Problem formulation. During finetuning, we are interested in preserving the existing knowledge of the current model (general capabilities) while adapting to the new task (specialization); e.g., when improving recognition of specialized medical jargon and terminology with the pretrained Whisper model, we would like to maintain the general recognition capabilities learned during pretraining. This is an important problem, which has been considered from both the transfer learning [12] and continual learning perspectives [4,24]. The problem arises because, in many scenarios, we do not have access to the data on which the pretrained model was trained, either because the data is large and kept private [3] or because we want to limit the computational resources used. Let $\mathcal{D}_{pre}$ and $\mathcal{D}_{downstream}$ represent the labeled pretraining and downstream datasets, and let $T = 0$ denote the pretraining task. The objective of the downstream task $T = 1$ is to learn a new model $\theta_T$ that satisfies
$$\theta_T = \operatorname*{argmin}_{\theta} \sum_{t=0}^{T} \sum_{(X, y) \in \mathcal{D}_t} \mathcal{L}(X, y; \theta)$$
However, when learning downstream tasks, the previous task's data $\mathcal{D}_{pre}$ is not accessible (e.g., no access to pretraining data) or is limited. Therefore, $\theta_T$ cannot be directly computed from Equation (2). In transfer learning, we learn only one downstream task ($T = 1$), while in continual learning, the number of tasks can be larger.

4. Experimental Setup

4.1. Experiments

We focus on a transfer learning setup for downstream tasks, with the additional goal of preserving certain knowledge from the pretrained model, such as its recognition capabilities in the English (EN) language. Throughout this paper, we refer to this as the retention task, emphasizing our interest in maintaining the knowledge associated with this task. More specifically, we consider the following experiments:
  • Adaptation to Polish (PL) language while preserving the pretrained model recognition capabilities in the English (EN) language.
  • Similar to the above, but adaptation is performed to the Slovak (SK) language. Here, we are experimenting with a very low-resource language adaptation, with an order of magnitude smaller number of utterances than for the PL language.
  • Adaptation to specialized medical jargon in the PL language while investigating forgetting in the PL language. This experiment allows us to study the generalization–specialization trade-off when there is a similarity between the two tasks.
We used the PL language because it is also used in the experiments with medical data, and the SK language was chosen as an example of a very low-resource language.

4.2. Data

Datasets. As a data source for PL, we use the CommonVoice [40] (version 11.0) and LibriSpeech Multilingual [41] datasets (69 k utterances in total). For the SK language, we use the CommonVoice 11.0 dataset (3 k training utterances). To monitor forgetting during finetuning, we measure model performance on the English (EN) language using the CommonVoice and LibriSpeech datasets. A summary is given in Table 1.
MED-PL. We additionally experiment with our in-house Polish Medical corpus (MED-PL) dataset, collected for the development of a specialized ASR system for medical applications in Poland, in the context of supporting medics [42]. MED-PL consists of short speech commands (up to 30 s) recorded by Polish speakers. Overall, 181 speakers recorded 48 k utterances, for a total of 99 h. While most of the transcripts are close to the ground truth, some speakers deviated from them by repeating, skipping, or adding words.

4.3. Models

Whisper. As our model, we use the small version of the pretrained Whisper model, which contains 244 M parameters [3]. The model consists of 12 Transformer encoder layers and 12 decoder layers, with an attention dimension of 768 and a feedforward dimension of 3072. In all of the experiments, all model parameters are shared across all tasks. This is in contrast to existing works [32,33], where some parts of the network are task-specific. As a result, our inference is task-agnostic; that is, we do not need to pass the task identifier (language or dialect) to the model, which is more difficult but also more realistic.
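The pretrained checkpoint can be obtained, for example, through the Hugging Face transformers library. The snippet below is a minimal loading-and-transcription sketch under that assumption (the openai/whisper-small identifier corresponds to the 244 M-parameter variant); it is not the exact pipeline of our experiments.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One second of silence at 16 kHz stands in for a real utterance.
waveform = torch.zeros(16000).numpy()
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# Task-agnostic inference: no language or task token is forced here.
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```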
Baselines. In our work, we consider the following baselines:
(1)
Finetuning—naively adapts the model to the target task. This is a highly plastic model that performs well on new tasks but forgets much of the previous knowledge.
(2)
Frozen—we additionally report the results of the frozen pretrained model, which favors stability (knowledge preservation).
(3)
Learning without Forgetting (LwF) [11]—a popular Continual Learning method that has also been successfully applied to ASR [43,44]. During finetuning, it adds a distillation loss that minimizes the KL divergence between the outputs of the current model ($\theta$) and those of the model from the previous task ($\theta_{old}$); a code sketch of this loss is given after this list:
$$\mathcal{L}(X, y; \theta) = \mathcal{L}_{CE}(X, y; \theta) + \lambda \, KL\big(f(X; \theta), f(X; \theta_{old})\big)$$
We run a parameter search for the optimal λ value in Section 5.3.
(4)
EMA teacher. LwF is a strong Continual Learning approach; however, it can limit learning of the new task [36]. Therefore, to improve its flexibility, we also experiment with a variant in which the teacher is allowed to slightly adapt towards the target task. Specifically, we use an EMA teacher strategy, similar to LwF, but with the teacher set to an exponential moving average (EMA) of the current model (see the update sketch after this list):
$$\theta_{teacher} \leftarrow \beta \, \theta_{teacher} + (1 - \beta) \, \theta$$
The EMA was first proposed by Tarvainen and Valpola [45] in the semi-supervised setting. The same strategy was later used in self-supervised learning [46,47] and, more recently, in online continual learning [48], which inspired us to use it in our setting. Following popular approaches in this area, we set $\beta = 0.999$ [48,49].
(5)
Model Merging—we also experiment with merging approaches [12,50]. More specifically, in the first stage, a simple finetuning procedure is applied, and then the weights of the new model are computed as follows:
$$\theta_{new} = \alpha \, \theta_t + (1 - \alpha) \, \theta_{t-1},$$
where, in our case, $t = 1$ refers to the downstream task and $t = 0$ to the pretraining stage. Similarly to other works [32], we set $\alpha = 0.5$; a sketch of this weight interpolation is given after this list. We also experiment with Model Merging combined with LwF regularization, following [32].
(6)
Experience Replay (ER)—in Sections 5.2 and 5.3, we additionally (and optionally) equip all methods with a small memory budget of previous-task data, as commonly used in Continual Learning [29,31]. The specific type of data used for experience replay varies depending on the experiment and is detailed later in the paper.
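To make the three non-trivial baselines concrete, minimal PyTorch sketches are given below. They are illustrative only: tensor shapes, helper names, and default values are our assumptions rather than excerpts from the released code.

LwF (Equation (3)) combines the cross-entropy on the new task with a distillation term towards the frozen previous model:

```python
import torch
import torch.nn.functional as F

def lwf_loss(student_logits, teacher_logits, targets, lam=0.8, pad_id=-100):
    """Cross-entropy on the new task plus KL distillation towards the previous model.

    student_logits, teacher_logits: (batch, T, V) decoder logits; the teacher logits
    come from a frozen copy of the model before finetuning and carry no gradient.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         targets.reshape(-1), ignore_index=pad_id)
    # KL divergence between the previous-model and current-model output distributions.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1).reshape(-1, vocab),
                  F.softmax(teacher_logits.detach(), dim=-1).reshape(-1, vocab),
                  reduction="batchmean")
    return ce + lam * kl
```

The EMA teacher (Equation (4)) is updated in place after every optimizer step:

```python
import torch

@torch.no_grad()
def update_ema_teacher(teacher, student, beta=0.999):
    """theta_teacher <- beta * theta_teacher + (1 - beta) * theta, as in Equation (4)."""
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(beta).add_(p_student, alpha=1.0 - beta)

# The teacher is typically initialized as a frozen deep copy of the model being finetuned.
```

Model Merging (Equation (5)) reduces to a state-dict interpolation between the finetuned and the pretrained checkpoints:

```python
import torch

def merge_state_dicts(finetuned, pretrained, alpha=0.5):
    """theta_new = alpha * theta_t + (1 - alpha) * theta_{t-1}, applied key by key."""
    merged = {}
    for key, w_ft in finetuned.items():
        w_pre = pretrained[key]
        # Interpolate floating-point tensors; copy non-float buffers from the finetuned model.
        merged[key] = alpha * w_ft + (1.0 - alpha) * w_pre if torch.is_floating_point(w_ft) else w_ft
    return merged

# Usage: model.load_state_dict(merge_state_dicts(ft_model.state_dict(), pre_model.state_dict()))
```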

4.4. Methodology

Implementation details. To provide a fair comparison, all methods follow the same training procedure: number of training steps, learning rate, batch size, etc. The optimizer is AdamW, with a batch size of 64 and a weight decay of 0.1, following [29]. The learning rate and number of steps were determined empirically by running the finetuning procedure until the validation loss converged. Each experiment was repeated three times, and the mean results are reported.
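For reference, this corresponds to a standard AdamW configuration; in the sketch below, the learning rate is a placeholder, since the actual value was tuned per experiment until the validation loss converged.

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# Weight decay of 0.1 follows the setup above; the learning rate is an illustrative placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
```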
Metrics. Our main metric is the Word Error Rate (WER). For transfer learning experiments, we report WER on target tasks (PL, SK or MED-PL) and for previous knowledge (EN or PL). We are interested in finding a good balance between accuracies in downstream and pretrained tasks. We use geometric mean contour lines to visualize the trade-off between forgetting and downstream accuracy, as this metric emphasizes balanced performance and penalizes extreme imbalances more than an arithmetic mean.
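Both quantities can be computed with standard tooling; the sketch below uses the jiwer package for the WER and combines a target-task and a retention-task WER with a geometric mean (the numbers are illustrative, not results from this paper).

```python
import math
import jiwer

reference = "the patient was given ten milligrams of the drug"
hypothesis = "the patient was given ten miligrams of drug"
wer_target = jiwer.wer(reference, hypothesis)  # WER on the downstream task

wer_retention = 0.20  # hypothetical WER measured on the retention task (e.g., EN)

# The geometric mean penalizes imbalanced solutions more than the arithmetic mean.
geo_mean = math.sqrt(wer_target * wer_retention)
print(f"target={wer_target:.3f}  retention={wer_retention:.3f}  geometric mean={geo_mean:.3f}")
```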
Training Time. The experiments were conducted on an NVIDIA A100 40 GB GPU. All evaluated methods have similar computational requirements. For example, finetuning for the PL experiment takes 4 h 21 min 34 s, while LwF and EMA teacher increase the training time by approximately 10%. The additional cost of Model Merging is negligible.

5. Results

5.1. Forgetting in ASR Models

In the first part, we seek to understand how finetuning the model on the target domain affects the pretrained model knowledge in two scenarios:
  • The target and retention tasks are disjoint. We analyze finetuning on low-resource downstream tasks (PL and SK) and evaluate forgetting on the retention task (EN).
  • The target and retention tasks are similar. For this purpose, we finetune the model on a specialized dataset of Polish medical speech and evaluate forgetting in the PL language (retention task).
Results are presented in Figure 1. For the PL language, the model was finetuned for 2.3 epochs, and for both SK and MED-PL, it was finetuned for 5 epochs, corresponding to 1.5 k, 0.23 k, and 3.35 k steps, respectively, which we found to give the best results.
Effect of Downstream Tasks on Forgetting. We can see that the level of forgetting on the retention task is highly dependent on the downstream task. While the forgetting is excessive for the PL and SK downstream tasks (WER for the retention task reaches 63.8% and 75.7%, respectively), it is less severe in the case of the MED-PL experiment, where there is more similarity between downstream and retention tasks.
Generality–specificity trade-off. The finetuning strategy favors accuracy on the target task, losing the general recognition capabilities of the model compared to the frozen model. Regularization-based approaches (LwF, EMA teacher) generally struggle to find significantly better solutions. On the other hand, Merging is the strategy that works consistently well, significantly outperforming other approaches, also when combined with LwF.
However, we can see that the efficiency of this method is again highly task-dependent. In particular, for the SK language, where the pretrained model struggles (the WER on SK equals 66.3%), Merging finds some balance, but the resulting model is only mediocre in both the EN and SK languages.
Forgetting is a relevant problem. Without any memory buffer, most methods are affected by forgetting, as indicated by results on general knowledge (EN) and previous tasks, compared to the frozen model. Although Model Merging seems to perform quite well, the forgetting is far from being solved. For example, in the PL experiment, WER for EN increases from 17.6% to 26.4%, and even more for the SK language. In the medical experiment, forgetting was minimal; however, adaptation to the downstream task was more compromised in this scenario.

5.2. Impact of Experience Replay

In this section, we assume some access to the data of the task for which we aim to preserve recognition. While the data used to train pretrained models is often private, it is practical to assume that some data for the task of interest can be gathered. Here, we revisit the scenarios discussed in the previous subsection:
  • When finetuning on the PL and SK languages, we also use additional data for the EN language.
  • When finetuning on the medical dataset, we also use PL data.
In both scenarios, we use CommonVoice and LibriSpeech as sources for the memory buffer of previous-task data. Following [29], instead of including a fixed number of samples from the retention-task domain in each batch, we include a fixed percentage of samples from the retention task spread out over all batches. Results in this section are reported with a memory budget of 5%, following prior work [29].
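One way to realize this percentage-based replay is sketched below: a buffer containing the given fraction of the retention-task data is mixed into the downstream-task stream so that its samples are spread over all batches. The data structures are placeholders; this illustrates the sampling scheme, not the exact implementation used in our experiments.

```python
import random

def build_mixed_batches(target_data, retention_data, budget=0.05, batch_size=64, seed=0):
    """Spread a retention-task buffer of size budget * len(retention_data)
    evenly across the downstream-task batches."""
    rng = random.Random(seed)
    buffer = rng.sample(retention_data, k=int(len(retention_data) * budget))
    mixed = list(target_data) + buffer
    rng.shuffle(mixed)  # replay samples end up distributed over all batches
    return [mixed[i:i + batch_size] for i in range(0, len(mixed), batch_size)]

# Toy usage with integers standing in for utterances.
batches = build_mixed_batches(target_data=list(range(1000)),
                              retention_data=list(range(2000, 3000)),
                              budget=0.05)
print(len(batches), len(batches[0]))
```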
Experience Replay is a strong baseline. We can observe that adding a memory buffer to finetuning is a strong baseline in this setup (Figure 2). More advanced methods, such as LwF and the EMA teacher, do not yield substantial improvements over this baseline. This observation aligns with prior Continual Learning studies, which report that memory buffers often constitute a strong baseline in similar scenarios [51]. On the other hand, the Merging approach is still very competitive; however, it does not consistently improve over the simple finetuning approach.
Varying the size of the memory buffer. We also take a detailed look at how using replay data affects the performance of the different methods. Specifically, we examine the most promising methods (Finetuning, LwF, Merging) under memory budgets of {1, 5, 10, 20}% of the retention-task data; Figure 3 presents the results. Overall, we observe that when a very small memory budget is used (i.e., 1%), Merging still works as well as or better than the competing methods. However, as soon as reasonable amounts of replay data are available, other approaches (LwF, ER) find better solutions. Interestingly, Merging benefits very little from the added memory buffer.

5.3. Detailed Analysis

5.3.1. Corruption Robustness

Here, we evaluate how the robustness of the ASR models is affected during finetuning. More specifically, we use robustness to synthetic Gaussian noise, as recently introduced in the Speech Robust Bench [52]. We show the evaluation in Figure 4, where all the testing utterances are modified using Gaussian noise distortions. First, we can observe that adding Gaussian noise decreases the model's recognition capabilities: the WER increases from 17.5% and 17.2% to 60.1% and 88.0% for the EN and PL languages, respectively. The decrease in quality is significantly more severe for the PL language, which shows that robustness to noise is also language-dependent. As the pretrained model has seen much more data for the EN language, presumably of varying quality, it is also more robust to noise distortion in that language. Finally, we can observe that our previous findings also transfer to this robustness setting. While Merging excels when there is no memory buffer available, it is outperformed by other methods when some data for the retention task is available.
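The corruption itself is conceptually simple: additive white Gaussian noise scaled to a chosen signal-to-noise ratio, as in the sketch below (the SNR value is arbitrary and does not reproduce the exact Speech Robust Bench configuration).

```python
import numpy as np

def add_gaussian_noise(waveform: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise to a waveform at the requested SNR (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

# Toy usage: a 440 Hz tone sampled at 16 kHz, corrupted at 10 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.1 * np.sin(2 * np.pi * 440 * t)
noisy = add_gaussian_noise(clean, snr_db=10.0)
```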

5.3.2. Similarity Analysis

Throughout our experiments, we have observed that the following simple strategies work well:
  • Merging models, when there is limited access to data from the pretrained task;
  • Experience Replay, when a sufficient data buffer is available.
We are interested in understanding how these different approaches arrive at their solutions, i.e., whether the solutions found by Model Merging are similar to those found when using experience replay. To investigate this, we compute the Euclidean distance of the final solution with respect to the pretrained model, as well as the Centered Kernel Alignment (CKA [53]). CKA is a similarity metric used to compare two sets of high-dimensional representations. It evaluates the agreement between two matrices, each representing pairwise similarities of all samples in a dataset, where the matrices are derived from the representations (features) produced by a model. In our case, we compute the CKA with respect to the pretrained model on the retention dataset (EN for the PL and SK experiments and PL for the MED-PL experiment). The results can be found in Table 2. We can observe that Model Merging achieves effective solutions by remaining close to the pretrained model and maintaining high CKA alignment. Surprisingly, adding memory buffer replay has a small impact on these metrics, despite the fact that the memory buffer significantly boosts the retention-task accuracy.
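For reference, the linear variant of CKA between two feature matrices (e.g., encoder representations of the pretrained and the adapted model on the same retention-task utterances) can be computed as below; which layer's features enter the comparison is an implementation detail not fixed by this sketch.

```python
import numpy as np

def linear_cka(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two (n_samples, dim) feature matrices."""
    a = feats_a - feats_a.mean(axis=0, keepdims=True)  # center each feature dimension
    b = feats_b - feats_b.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(a.T @ b, ord="fro") ** 2
    return float(hsic / (np.linalg.norm(a.T @ a, ord="fro") * np.linalg.norm(b.T @ b, ord="fro")))

# Identical representations give CKA = 1; unrelated random features give a much lower value.
x = np.random.randn(1024, 32)
y = np.random.randn(1024, 32)
print(linear_cka(x, x))   # 1.0
print(linear_cka(x, y))   # much lower
```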

5.3.3. Hyperparameter Sensitivity

In Figure 5, we analyze the effect of the interpolation factor α in the merging experiments. The interpolation curves are smoother when the target and retention tasks are more similar. In low-resource scenarios, the optimal α can deviate significantly from the commonly used value of 0.5. For the SK language in particular, it appears that a better value could be found, though determining this would require validation sets, which can be challenging in very low-resource scenarios.
LwF uses an additional hyperparameter, λ, that balances adapting to the new task against preserving previous-task accuracy. To determine the optimal value, we run a hyperparameter sensitivity check and visualize the results in Figure 6. We set λ = 0.8, as it offers a good trade-off between downstream- and retention-task accuracies and is also a value commonly used in the community [54,55].
Further, we investigate how varying the length of the finetuning stage affects the merging performance (Table 3). Overall, we find that Model Merging is quite robust to the training length, i.e., the results did not change much as we reduced the training time. However, training too long can reduce the effectiveness of Merging; therefore, it is important to use cross-validation during finetuning.

6. Discussion

Throughout this article, we conducted a series of experiments to explore the balance between generalization and specialization in ASR models. Overall, the key findings can be summarized as follows:
  • Model Merging consistently performs well across tasks, data quantity, and quality.
  • When some memory buffer is available, simple finetuning achieves competitive results.
  • Forgetting is a key issue, influenced by the pretrained model’s initial difficulty.
  • Model Merging improves CKA alignment and reduces L2 distance; memory buffer slightly impacts these metrics but significantly boosts accuracy.
While Model Merging did not always yield optimal results (e.g., for the SK language), it outperformed competing approaches in most cases. These findings align with recent research on robust finetuning and continual learning, which also highlights the effectiveness of this technique [12,50].
However, when a reasonable memory buffer was allowed, other methods, including the straightforward Experience Replay, demonstrated superior performance. The relatively suboptimal performance of Model Merging in such scenarios appears underexplored in the literature, likely because many studies assume no access to prior task data, especially in continual learning [36,37]. While this assumption may hold in certain cases, such as those involving privacy concerns, there are many scenarios where it is feasible to gather some data, as is the case in our work. From an ASR perspective, even if access to data from the pretrained model's original tasks is restricted, it is often possible to collect data specific to the task for which knowledge preservation is required.
We have also identified the initial difficulty of the task for the pretrained model as a key factor influencing the potential to achieve a good balance in accuracy between downstream and retention tasks, particularly when using Model Merging. As illustrated in Figure 5, the interpolation curves are smooth when the two tasks are related (right image). In contrast, for SK language adaptation (center image), there is a significant loss barrier. This suggests that finetuning has substantially altered the model’s parameters to adapt to the new language, thereby increasing the difficulty of merging.
Limitations. Overall, while Model Merging performs well across various scenarios, it is important to acknowledge its limitations, particularly from a practitioner's perspective. First, the approach struggles when tasks are unrelated, as seen in the SK experiment. In such scenarios, more sophisticated merging approaches could be used (e.g., [56,57]). Additionally, although Merging is relatively robust to hyperparameter changes, careful validation of the α interpolation factor (Figure 5) and the training duration (Table 3) remains important. Scaling this approach to a larger number of tasks also presents challenges [26]. In such cases, incorporating exemplars or adopting more specialized merging techniques may be necessary.
Our study focuses on task-agnostic inference, which we consider a more realistic and challenging setting compared to the commonly used task-aware inference [28,33]. In preliminary experiments with larger models from the Whisper family, we found that reliable training in the task-agnostic setting was not feasible, likely due to the limited size of the available datasets. As a result, we restricted our investigation to the small model variant.

7. Conclusions

In this paper, we examine the challenge of balancing the acquisition of new knowledge (specialization) with the retention of prior knowledge (generality) when adapting a pretrained model to a downstream task. This issue holds significant practical importance; for instance, when adapting a pretrained model to specialized medical data, it is crucial to preserve the model’s general capabilities, particularly given that access to pretraining data is often unavailable.
We structure our analysis around two scenarios: one assuming limited access to previous task data and the other assuming no access at all. Our findings reveal that in the absence of prior task data, Model Merging emerges as a consistently effective strategy, outperforming more sophisticated regularization approaches. However, when even a small memory buffer is permitted, this advantage diminishes, and simple experience replay can achieve superior performance. These results highlight a gap in the community’s focus, as much of the current research prioritizes scenarios without memory buffers, whereas practical applications often allow for some degree of data retention.
A limitation of our paper is the use of a relatively small set of target tasks and experiments with a single pretrained model. Looking ahead, we identify significant potential in combining Model Merging with memory buffer strategies, presenting a promising avenue for future exploration.

Author Contributions

S.C. and P.D.-M. contributed equally to this work. S.C. conceived the research idea, conducted initial experiments, designed the experimental setup, wrote the first draft of the manuscript, and supervised the overall project. P.D.-M. developed the code, carried out the experiments, and provided visualizations. A.C. secured funding for the project. All authors have read and agreed to the published version of the manuscript.

Funding

The project is co-financed by the National Centre for Research and Development under the Infostrateg IV program, project number: INFOSTRATEG-IV/0003/2022.

Data Availability Statement

We provide our code (https://github.com/PiotrDespot/balancing-specialization-generalization-whisper, accessed on 1 December 2025) in the public repository. In our work, we have used both publicly available datasets (Common Voice and Libri Speech) and one in-house dataset. Due to the sensitive nature of the data, including audio recordings of hospital patients, the private dataset cannot be shared publicly. Access may be granted on a case-by-case basis to qualified researchers under confidentiality agreements [42].

Acknowledgments

We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017431. Generative AI tools were used solely to improve the clarity and conciseness of the text. No new scientific content, interpretations, or experimental results were generated by AI.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR   Automatic Speech Recognition
CKA   Centered Kernel Alignment
CL    Continual Learning
NLP   Natural Language Processing
WER   Word Error Rate

References

  1. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems NeurIPS, Virtual, 6–12 December 2020. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  3. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  4. Zheng, Z.; Ma, M.; Wang, K.; Qin, Z.; Yue, X.; You, Y. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19125–19136. [Google Scholar]
  5. Kumar, A.; Raghunathan, A.; Jones, R.M.; Ma, T.; Liang, P. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. In Proceedings of the Tenth International Conference on Learning Representations, ICLR, Virtual, 25 April 2022. [Google Scholar]
  6. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  7. French, R.M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999, 3, 128–135. [Google Scholar] [CrossRef] [PubMed]
  8. McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation; Academic Press: Cambridge, MA, USA, 1989. [Google Scholar]
  9. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
  10. Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; Calderara, S. Dark experience for general continual learning: A strong, simple baseline. Adv. Neural Inf. Process. Syst. 2020, 33, 15920–15930. [Google Scholar]
  11. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef]
  12. Wortsman, M.; Ilharco, G.; Kim, J.W.; Li, M.; Kornblith, S.; Roelofs, R.; Lopes, R.G.; Hajishirzi, H.; Farhadi, A.; Namkoong, H.; et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7959–7971. [Google Scholar]
  13. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  14. Kheddar, H.; Hemis, M.; Himeur, Y. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
  15. Gao, C.; Cheng, G.; Li, T.; Zhang, P.; Yan, Y. Self-Supervised Pre-Training for Attention-Based Encoder-Decoder ASR Model. IEEE ACM Trans. Audio Speech Lang. Process. 2022, 30, 1763–1774. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Park, D.S.; Han, W.; Qin, J.; Gulati, A.; Shor, J.; Jansen, A.; Xu, Y.; Huang, Y.; Wang, S.; et al. Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE J. Sel. Top. Signal Process. 2022, 16, 1519–1532. [Google Scholar] [CrossRef]
  17. Li, B.; Pang, R.; Sainath, T.N.; Gulati, A.; Zhang, Y.; Qin, J.; Haghani, P.; Huang, W.R.; Ma, M.; Bai, J. Scaling end-to-end models for large-scale multilingual ASR. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 1011–1018. [Google Scholar]
  18. Pratap, V.; Tjandra, A.; Shi, B.; Tomasello, P.; Babu, A.; Kundu, S.; Elkahky, A.; Ni, Z.; Vyas, A.; Fazel-Zarandi, M.; et al. Scaling Speech Technology to 1000+ Languages. J. Mach. Learn. Res. 2024, 25, 4798–4849. [Google Scholar]
  19. Fatehi, K.; Torres Torres, M.; Kucukyilmaz, A. An overview of high-resource automatic speech recognition methods and their empirical evaluation in low-resource environments. Speech Commun. 2025, 167, 103151. [Google Scholar] [CrossRef]
  20. Zhao, J.; Zhang, W. Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models. IEEE J. Sel. Top. Signal Process. 2022, 16, 1227–1241. [Google Scholar] [CrossRef]
  21. Shi, J.; Berrebbi, D.; Chen, W.; Hu, E.; Huang, W.; Chung, H.; Chang, X.; Li, S.; Mohamed, A.; Lee, H.; et al. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proceedings of the 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, 20–24 August 2023; ISCA: Singapore, 2023; pp. 884–888. [Google Scholar]
  22. Yue, X.; Gao, X.; Qian, X.; Li, H. Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters. Electronics 2024, 13, 190. [Google Scholar] [CrossRef]
  23. Naini, A.R.; Kohler, M.A.; Richerson, E.; Robinson, D.; Busso, C. Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Seoul, Republic of Korea, 14–19 April 2024; pp. 12031–12035. [Google Scholar]
  24. Mukhoti, J.; Gal, Y.; Torr, P.; Dokania, P.K. Fine-Tuning Can Cripple Your Foundation Model; Preserving Features May Be the Solution. Trans. Mach. Learn. Res. 2024, 1–26. Available online: https://openreview.net/forum?id=kfhoeZCeW7 (accessed on 1 December 2025).
  25. Wortsman, M.; Ilharco, G.; Gadre, S.Y.; Roelofs, R.; Lopes, R.G.; Morcos, A.S.; Namkoong, H.; Farhadi, A.; Carmon, Y.; Kornblith, S.; et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the International Conference on Machine Learning—ICML, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 23965–23998. [Google Scholar]
  26. Ilharco, G.; Ribeiro, M.T.; Wortsman, M.; Schmidt, L.; Hajishirzi, H.; Farhadi, A. Editing models with task arithmetic. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  27. Li, G.; Duggal, R.; Singh, A.; Kundu, K.; Shuai, B.; Wu, J. Robustness Preserving Fine-Tuning Using Neuron Importance. 2024. Available online: https://assets.amazon.science/71/58/65a2cad64759bddb8fe34525ae03/robustness-preserving-fine-tuning-using-neuron-importance.pdf (accessed on 1 December 2025).
  28. Yu, J.; Zhuge, Y.; Zhang, L.; Hu, P.; Wang, D.; Lu, H.; He, Y. Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Seattle, WA, USA, 16–22 June 2024; pp. 23219–23230. [Google Scholar]
  29. Pekarek Rosin, T.; Wermter, S. Replay to Remember: Continual Layer-Specific Fine-Tuning for German Speech Recognition. In Proceedings of the International Conference on Artificial Neural Networks, Heraklion, Greece, 26–29 September 2023; pp. 489–500. [Google Scholar]
  30. van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mac. Intell. 2022, 4, 1185–1197. [Google Scholar] [CrossRef]
  31. Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.D.; Van De Weijer, J. Class-incremental learning: Survey and performance evaluation on image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5513–5533. [Google Scholar] [CrossRef]
  32. Vander Eeckt, S.; Van Hamme, H. Weight Averaging: A Simple Yet Effective Method to Overcome Catastrophic Forgetting in Automatic Speech Recognition. In Proceedings of the ICASSP 2023, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  33. Vander Eeckt, S.; Van Hamme, H. Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  34. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.P.; Wayne, G. Experience Replay for Continual Learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 348–358. [Google Scholar]
  35. Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci. 1995, 7, 123–146. [Google Scholar] [CrossRef]
  36. Rypesc, G.; Cygert, S.; Khan, V.; Trzcinski, T.; Zielinski, B.; Twardowski, B. Divide and not forget: Ensemble of selectively trained experts in Continual Learning. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  37. Goswami, D.; Liu, Y.; Twardowski, B.; van de Weijer, J. FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  38. Diwan, A.; Yeh, C.F.; Hsu, W.N.; Tomasello, P.; Choi, E.; Harwath, D.; Mohamed, A. Continual learning for on-device speech recognition using disentangled conformers. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  39. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  40. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 11–16 May 2020. [Google Scholar]
  41. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  42. Czyżewski, A.; Cygert, S.; Marciniuk, K.; Szczodrak, M.; Harasimiuk, A.; Odya, P.; Galanina, M.; Szczuko, P.; Kostek, B.; Graff, B.; et al. A Comprehensive Polish Medical Speech Dataset for Enhancing Automatic Medical Dictation. Sci. Data 2025, 12, 1436. [Google Scholar] [CrossRef]
  43. Vander Eeckt, S.; Van Hamme, H. Continual Learning for Monolingual End-to-End Automatic Speech Recognition. In Proceedings of the 30th European Signal Processing Conference, EUSIPCO, Belgrade, Serbia, 29 August–2 September 2022; pp. 459–463. [Google Scholar]
  44. Chang, H.; Lee, H.; Lee, L. Towards Lifelong Learning of End-to-End ASR. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 2551–2555. [Google Scholar]
  45. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar]
  46. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  47. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
  48. Soutif-Cormerais, A.; Carta, A.; van de Weijer, J. Improving Online Continual Learning Performance and Stability with Temporal Ensembles. In Proceedings of the Conference on Lifelong Learning Agents, Montréal, QC, Canada, 22–25 August 2023; Volume 232, pp. 828–845. [Google Scholar]
  49. Morales-Brotons, D.; Vogels, T.; Hendrikx, H. Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits. Trans. Mach. Learn. Res. 2024, 1–27. Available online: https://hal.science/hal-04830859v1 (accessed on 1 December 2025).
  50. Marczak, D.; Twardowski, B.; Trzcinski, T.; Cygert, S. MagMax: Leveraging Model Merging for Seamless Continual Learning. 2024. Available online: https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/11489.pdf (accessed on 1 December 2025).
  51. Prabhu, A.; Torr, P.H.S.; Dokania, P.K. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In Proceedings of the Computer Vision-ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science. Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12347, pp. 524–540. [Google Scholar]
  52. Shah, M.A.; Noguero, D.S.; Heikkila, M.A.; Kourtellis, N. Speech Robust Bench: A Robustness Benchmark For Speech Recognition. arXiv 2024, arXiv:2403.07937. [Google Scholar] [CrossRef]
  53. Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G.E. Similarity of Neural Network Representations Revisited. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Volume 97, pp. 3519–3529. [Google Scholar]
  54. Gandhi, S.; von Platen, P.; Rush, A.M. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv 2023, arXiv:2311.00430. [Google Scholar]
  55. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  56. Marczak, D.; Magistri, S.; Cygert, S.; Twardowski, B.; Bagdanov, A.D.; van de Weijer, J. No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces. In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  57. Gargiulo, A.A.; Crisostomi, D.; Bucarelli, M.S.; Scardapane, S.; Silvestri, F.; Rodolà, E. Task Singular Vectors: Reducing Task Interference in Model Merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Nashville, TN, USA, 10–17 June 2025; pp. 18695–18705. [Google Scholar]
Figure 1. Transfer learning experiments for PL (left), SK (center), and MED-PL (right) downstream tasks without any additional memory buffer. The x-axis represents target task accuracy, and the y-axis shows pretrained model knowledge preservation (WER on EN for left and center plots, WER on PL for the right plot). Optimal results are located in the bottom-left corner, indicating maximal knowledge preservation and task adaptation. The Finetuning strategy prioritizes target task accuracy but sacrifices general recognition capabilities (higher WER). In contrast, the Merging approach consistently balances these goals but faces challenges in low-resource settings, particularly for the SK language task. The contour lines show the geometric mean of target and retention task accuracies.
Figure 2. Transfer learning experiments for PL (left), SK (center), and MED-PL (right) downstream tasks with an additional memory buffer for pretrained model knowledge preservation. The axes are the same as in Figure 1: target task accuracy (x-axis) and pretrained model knowledge (y-axis). Using a simple small memory buffer (Finetuning) is a very strong baseline, competitive with more sophisticated approaches (LwF, EMA teacher). Compared to Figure 1, we can also observe that the previously most competitive baseline, Merging, does not benefit much from the use of exemplars. We can also observe greatly improved results compared to Figure 1, in particular for the first column. The contour lines show the geometric mean of target and retention task accuracies.
Figure 3. Transfer learning experiments for PL (left), SK (center), and MED-PL (right) downstream tasks with an additional memory buffer for the retention task of size [1, 5, 10, 20]% of the retention task data. Larger markers represent larger memory buffers. The axes are the same as in Figure 1: target task accuracy (x-axis) and pretrained model knowledge (y-axis), with optimal results in the bottom-left corner. Under larger memory budgets, Merging is outperformed by competing methods. The contour lines show the geometric mean of target and retention task accuracies.
Figure 4. Transfer learning experiment for PL language when synthetic Gaussian noise is added to test sequences, without (left) and with (right) additional memory buffer for EN language. The pretrained model shows greater robustness for the high-resource EN language. Model Merging excels when no memory buffer is available, but is outperformed by other methods when it is. The contour lines show the geometric mean of target and retention task accuracies.
Figure 5. Transfer learning experiments for PL (left), SK (center), and MED-PL (right) downstream tasks with an additional memory buffer for the retention task. The axes match those in Figure 1: target task accuracy (x-axis) and pretrained model knowledge (y-axis). The interpolation curves appear smoother when the target and retention tasks are more similar (right). In low-resource scenarios, the optimal α can deviate significantly from the commonly used value of 0.5, as shown particularly in the center image. The contour lines show the geometric mean of target and retention task accuracies.
Figure 6. Accuracy of LwF method on PL task when using different values of λ parameter.
Table 1. Details on data used in this paper. EN data is used for memory buffer (train split) and for evaluating forgetting (test split).

Task     Speakers   Source      Train Split Size   Test Split Size
PL       2823       CV and LS   41.5 k             8.8 k
SK       137        CV          3.01 k             2.24 k
EN       55,574     CV and LS   216.0 k            19.0 k
MED-PL   181        Private     42.9 k             4.8 k
Table 2. Word error rates for the transfer learning experiments, along with the Euclidean distance and CKA alignment relative to the pretrained model (using control task data). Model Merging achieves effective solutions by remaining close to the pretrained model and maintaining high CKA alignment. Interestingly, adding experience replay has only a minimal impact on these metrics. The best results are marked with an asterisk (*).

PL experiment (target: PL, retention: EN)
Method            PL WER            EN WER            L2-Dist   CKA
Finetuning        12.71 ± 0.24 *    63.81 ± 6.78      11.25     0.77
Merging           13.09 ± 0.07      26.4 ± 0.07 *     5.63      0.84
LwF               16.26 ± 1.42      45.79 ± 5.14      11.55     0.77
Finetuning + ER   13.15 ± 0.31 *    16.26 ± 2.11      10.89     0.81
Merging + ER      14.02 ± 0.32      15.25 ± 0.09 *    5.45      0.86
LwF + ER          15.19 ± 2.56      16.26 ± 0.97      11.11     0.76

SK experiment (target: SK, retention: EN)
Method            SK WER            EN WER            L2-Dist   CKA
Finetuning        27.34 ± 0.68 *    75.71 ± 2.48      4.38      0.82
Merging           54.79 ± 4.27      56.79 ± 0.53 *    2.19      0.85
LwF               27.31 ± 0.68 *    72.21 ± 2.48      4.38      0.80
Finetuning + ER   27.08 ± 0.14 *    54.85 ± 3.87 *    4.48      0.77
Merging + ER      42.83 ± 2.85      54.85 ± 1.42 *    2.24      0.83
LwF + ER          27.09 ± 0.30 *    54.83 ± 0.19 *    4.51      0.78

MED-PL experiment (target: MED-PL, retention: PL)
Method            MED-PL WER        PL WER            L2-Dist   CKA
Finetuning        7.72 ± 0.33 *     38.46 ± 2.03      17.83     0.64
Merging           14.50 ± 0.06      18.79 ± 0.29 *    8.91      0.81
LwF               9.52 ± 0.01       37.33 ± 0.09      17.55     0.59
Finetuning + ER   8.56 ± 0.85 *     24.19 ± 0.51      18.8      0.66
Merging + ER      15.22 ± 0.08      15.63 ± 0.18 *    8.91      0.80
LwF + ER          8.93 ± 0.68       24.41 ± 0.49      18.56     0.64
Table 3. Impact of finetuning duration on Model Merging performance in the PL task. Results remain stable with shorter training, but excessive finetuning reduces accuracy, highlighting the importance of early stopping.

Train time                  50%    75%    100%   125%   150%
Downstream Task WER (PL)    13.4   12.8   13.3   13.3   18.1
Pretrained Task WER (EN)    26.1   25.4   26.4   29.9   38.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
